Detecting Web Spam using Machine Learning

Student: 
Andrejs Mironovs
Grade: 
second1

Primary goal: to develop a classification algorithm to detect Web Spam.

Web Spam refers to a set of techniques intended to artificially increase the ranking of a page in a search engine. From the point of view of search engine providers and Web users, Web Spam degrades the quality of information search on the Web [1] [2] [3]. Web Spam can be broadly classified into two types: content spam, which manipulates the text of a page, and link spam, which manipulates the link structure between pages. Detecting Web Spam is a critical and challenging task, and successful detection has high commercial value for industry.

The goal of Web Spam detection is to decide whether a given page or website is spam or not. This is a typical binary classification problem in machine learning.
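To make the classification framing concrete, the following is a minimal sketch of a spam / non-spam classifier. It assumes logistic regression trained by batch gradient descent, which is only one possible choice and is not stated in the project description. The two features (`keyword_density`, `reciprocal_link_fraction`) and the toy data are hypothetical and exist purely for illustration; they are not taken from any real dataset.

```python
import math


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def train_logistic(features, labels, lr=0.5, epochs=2000):
    """Fit weights and bias by batch gradient descent on the log loss."""
    n_samples = len(features)
    n_features = len(features[0])
    weights = [0.0] * n_features
    bias = 0.0
    for _ in range(epochs):
        grad_w = [0.0] * n_features
        grad_b = 0.0
        for x, y in zip(features, labels):
            p = sigmoid(sum(w * xi for w, xi in zip(weights, x)) + bias)
            err = p - y  # derivative of the log loss w.r.t. the linear score
            for j, xi in enumerate(x):
                grad_w[j] += err * xi
            grad_b += err
        weights = [w - lr * g / n_samples for w, g in zip(weights, grad_w)]
        bias -= lr * grad_b / n_samples
    return weights, bias


def predict_spam(weights, bias, x, threshold=0.5):
    """Return True if the page with feature vector x is classified as spam."""
    score = sum(w * xi for w, xi in zip(weights, x)) + bias
    return sigmoid(score) >= threshold


# Hypothetical toy data: [keyword_density, reciprocal_link_fraction]
TOY_FEATURES = [
    [0.05, 0.10], [0.08, 0.05], [0.10, 0.20], [0.12, 0.15],  # labelled non-spam
    [0.60, 0.80], [0.70, 0.65], [0.55, 0.90], [0.80, 0.70],  # labelled spam
]
TOY_LABELS = [0, 0, 0, 0, 1, 1, 1, 1]

weights, bias = train_logistic(TOY_FEATURES, TOY_LABELS)
```

Calling `predict_spam(weights, bias, [0.75, 0.85])` then classifies a page with a high keyword density and many reciprocal links. A real system would replace the toy features with content-based and link-based features extracted from crawled pages.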

This project will focus on developing a classification algorithm to detect Web Spam. It is expected to target one or more Web Spam types, namely content spam and/or link spam. The outcome of this project is a classification algorithm with a prototype implementation. The WEBSPAM-UK2006 and WEBSPAM-UK2007 datasets [4] will be used for training and testing.
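Since the data is used for both training and testing, a classifier's predictions on the held-out test portion need to be scored somehow. The sketch below assumes precision, recall and F1 on the spam class as the measures; the project description does not name an evaluation metric, so this choice and the function name `spam_metrics` are illustrative only.

```python
def spam_metrics(true_labels, predicted_labels):
    """Precision, recall and F1 for the spam class (label 1)."""
    pairs = list(zip(true_labels, predicted_labels))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # spam caught
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # non-spam flagged
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # spam missed
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall:
        f1 = 2 * precision * recall / (precision + recall)
    else:
        f1 = 0.0
    return {"precision": precision, "recall": recall, "f1": f1}
```

Precision and recall are often more informative than plain accuracy when one class is much rarer than the other, because a classifier that labels every page as non-spam could still reach high accuracy.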

Project status: 
Finished
Degree level: 
MSc
Background: 
Machine learning, knowledge of databases, programming in Java or other languages
Supervisors @ NeSC: 
Subject areas: 
Machine Learning/Neural Networks/Connectionist Computing
Student project type: 
References: 
* [1] Z. Gyongyi, H. Garcia-Molina and J. Pedersen. Combating Web Spam with TrustRank. In VLDB 2004.
* [2] L. Becchetti, C. Castillo, D. Donato, R. Baeza-Yates, S. Leonardi. Link Analysis for Web Spam Detection. ACM Transactions on the Web (TWEB), 2(1) (2008) 2.1-2.45
* [3] H. Najada and I. Himeidi. Web Spam Detection Using Machine Learning in Specific Domain Features. Journal of Information Assurance and Security. 3 (2008) 220-229
* [4] WEBSPAM-UK2007, http://barcelona.research.yahoo.net/webspam/datasets/uk2007/