Research on Key Technologies for Web Spam Detection Based on Genetic Programming and Ensemble Learning
[Abstract]: With the explosive growth of information on the Web, search engines have become an important everyday tool for finding information. For a given query, a search engine typically returns thousands of pages, but most users read only the first few, so a page's rank in the results matters greatly. Many people therefore try to deceive the ranking algorithms so that certain pages receive undeservedly high rankings, attract users' attention, and thereby generate profit. All such deceptive attempts to raise a page's ranking in a search engine are called Web Spam. Web Spam severely degrades the quality of search results, makes it much harder for users to obtain information, and leads to a poor user experience. For the search engine, even if spam pages do not rank high enough to disturb users, they still have to be crawled, indexed, and stored. Identifying Web Spam has therefore become one of the major challenges facing search engines.
Based on the characteristics of Web Spam data sets, this thesis studies how to construct classifiers over web page features to detect Web Spam. The main work covers the following three aspects:
(1) A Web Spam detection method is proposed in which genetic programming learns a discriminant function.
Each individual is defined as a discriminant function for detecting Web Spam. Through genetic operations, genetic programming can find an optimized discriminant function that improves detection performance. One problem arises, however: because nothing is known in advance about the optimal solution, it is hard to choose an appropriate individual length. If individuals are too short, they contain few features, discriminate poorly, and the corresponding function expressions classify poorly. Making full use of the content, link, and other features in the Web Spam data set requires a longer discriminant function, that is, a larger individual. But a population made up of large individuals takes longer to construct and to search. Based on the principle that a longer discriminant function can be composed of several shorter ones, this thesis proposes the following way of learning the discriminant function by genetic programming. The method first creates several populations of small individuals, and each population produces its best individual through genetic operations. The best individuals of the populations are then combined by genetic programming to obtain a better discriminant function, so a longer, better-performing discriminant function can be generated in less time. The influence of binary tree depth on the evolutionary process of genetic programming and the efficiency of the combination are also studied.
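The idea of building a longer discriminant function from several shorter ones can be illustrated with a small sketch. It is not the thesis's implementation: the function set, the sign-based decision rule (positive value means spam), and all names are assumptions, and random initialization followed by picking the fittest tree stands in for the full evolutionary loop (crossover and mutation are omitted for brevity). The combination step here simply tries every pair of best trees under a new root, whereas the thesis performs the combination by genetic programming.

```python
import operator
import random

def pdiv(a, b):
    """Protected division, a common GP convention: return 1.0 near zero."""
    return a / b if abs(b) > 1e-9 else 1.0

# Assumed binary function set; the source does not list the operators used.
OPS = {"+": operator.add, "-": operator.sub, "*": operator.mul, "/": pdiv}

def random_tree(n_features, depth, rng):
    """Grow a full binary tree; a leaf is the index of a page feature."""
    if depth == 0:
        return rng.randrange(n_features)
    op = rng.choice(sorted(OPS))
    return (op,
            random_tree(n_features, depth - 1, rng),
            random_tree(n_features, depth - 1, rng))

def evaluate(tree, x):
    """Compute the discriminant value of feature vector x."""
    if isinstance(tree, int):
        return x[tree]
    op, left, right = tree
    return OPS[op](evaluate(left, x), evaluate(right, x))

def predict(tree, x):
    """Assumed decision rule: positive value means spam (1), else normal (0)."""
    return 1 if evaluate(tree, x) > 0 else 0

def fitness(tree, X, y):
    """Fraction of samples classified correctly."""
    return sum(predict(tree, x) == t for x, t in zip(X, y)) / len(y)

def best_of_population(X, y, size, depth, rng):
    """Stand-in for evolving one small population: keep its fittest tree."""
    population = [random_tree(len(X[0]), depth, rng) for _ in range(size)]
    return max(population, key=lambda t: fitness(t, X, y))

def combine(best_trees, X, y):
    """Join pairs of short trees under a new root and keep the fittest result."""
    candidates = list(best_trees)
    for a in best_trees:
        for b in best_trees:
            if a is not b:
                candidates += [(op, a, b) for op in sorted(OPS)]
    return max(candidates, key=lambda t: fitness(t, X, y))
```

Because the short trees themselves stay in the candidate list, the combined function in this sketch is never worse on the training data than the best short one. The tree depth passed to `best_of_population` corresponds to the binary tree depth whose influence the thesis studies.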
Experiments on the WEBSPAM-UK2006 data set show that, compared with single-population genetic programming, recall increases by 5.6% and the F-measure by 2.25%. Compared with the two-combination genetic programming, the new method increases recall by 26%, the F-measure by 11%, and accuracy by 4%.
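For reference, the recall and F-measure figures quoted above follow the usual definitions, sketched below. The source does not say which F variant is used, so the balanced F1 is assumed here, with spam labeled 1; the function names are illustrative.

```python
def confusion_counts(y_true, y_pred):
    """Count true/false positives and negatives, treating label 1 as spam."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == 1 and p == 1 for t, p in pairs)
    fp = sum(t == 0 and p == 1 for t, p in pairs)
    fn = sum(t == 1 and p == 0 for t, p in pairs)
    tn = sum(t == 0 and p == 0 for t, p in pairs)
    return tp, fp, fn, tn

def precision_recall_f1(y_true, y_pred):
    """Return (precision, recall, F1); each is 0.0 when undefined."""
    tp, fp, fn, _ = confusion_counts(y_true, y_pred)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1
```

On an imbalanced data set, accuracy alone can look high even when most spam is missed, which is why recall and the F-measure on the spam class are reported.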
(2) An ensemble learning method based on genetic programming is proposed to detect Web Spam.
At present, most classification-based Web Spam detection methods build a classifier with a single classification algorithm, and most ignore the imbalance between normal and spam samples in the data set, namely that normal samples far outnumber spam samples. Because there are many types of Web Spam techniques and new ones keep emerging, no universal classifier can currently detect all types of Web Spam. Combining the detection results of multiple classifiers into a stronger classifier is therefore an effective approach, and ensemble learning is also one of the effective ways to handle classification on imbalanced data sets. How to generate diverse base classifiers and how to combine their classification results are the two key problems. This thesis proposes ensemble learning based on genetic programming to detect Web Spam. First, different classification algorithms are used to train different base classifiers on different sample sets and feature sets; then genetic programming learns a new classifier that produces the final detection result from the detection results of the base classifiers.
Based on the characteristics of the Web Spam data set, the method produces diverse base classifiers from different data sets and classification algorithms, and combines the results of the base classifiers by genetic programming. This makes it easy to combine the results of different types of classifiers and improve classification performance, and it also allows only some of the base classifiers to be selected for the ensemble, which reduces prediction time. The method can also combine undersampling with ensemble learning to improve classification performance on imbalanced data sets. To verify the effectiveness of the genetic programming ensemble, experiments are carried out on balanced and imbalanced data sets. On the balanced data set, the effects of the classification algorithms and feature sets on the ensemble are analyzed, and the method is compared with known ensemble learning algorithms; the results show that its precision, recall, F-measure, accuracy, error rate, and AUC are superior to those of some known ensemble learning algorithms. Experiments on imbalanced data sets show that the genetic programming ensemble improves classification performance and that heterogeneous ensembles are more effective than homogeneous ones. The genetic programming ensemble also achieves a higher F-measure than AdaBoost, Bagging, RandomForest, majority voting, the EDKC algorithm, and Prediction Spamicity based methods.
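The combination of undersampling, diverse base classifiers, and a learned combiner can be sketched as follows. This is a simplified illustration under stated assumptions, not the thesis's method: single-feature threshold stumps stand in for the different classification algorithms, the combiner search enumerates tiny trees over two Boolean operators instead of evolving them by genetic programming, and every name is hypothetical.

```python
import random
from itertools import combinations

def undersample(X, y, rng):
    """Keep every spam sample plus an equally sized random subset of normal ones."""
    spam = [i for i, t in enumerate(y) if t == 1]
    normal = [i for i, t in enumerate(y) if t == 0]
    keep = spam + rng.sample(normal, min(len(spam), len(normal)))
    return [X[i] for i in keep], [y[i] for i in keep]

def train_stump(X, y, feature):
    """Toy base classifier: the single-feature threshold with the best accuracy."""
    best_acc, best_thr = -1.0, 0.0
    for thr in sorted({x[feature] for x in X}):
        acc = sum((x[feature] >= thr) == (t == 1) for x, t in zip(X, y)) / len(y)
        if acc > best_acc:
            best_acc, best_thr = acc, thr
    return lambda x: 1 if x[feature] >= best_thr else 0

def f_measure(y_true, y_pred):
    """Balanced F1 on the spam class (label 1)."""
    tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
    fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
    fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
    return 2 * tp / (2 * tp + fp + fn) if tp else 0.0

# Assumed combiner operators over 0/1 predictions.
COMBINERS = {"and": lambda a, b: a & b, "or": lambda a, b: a | b}

def combine_outputs(tree, outputs):
    """Evaluate a combiner tree on the per-classifier prediction lists."""
    if tree[0] == "use":                 # select one base classifier as-is
        return list(outputs[tree[1]])
    op, i, j = tree
    return [COMBINERS[op](a, b) for a, b in zip(outputs[i], outputs[j])]

def best_combiner(outputs, y):
    """Pick the tree with the highest F-measure; GP would evolve larger trees."""
    trees = [("use", i) for i in range(len(outputs))]
    trees += [(op, i, j)
              for i, j in combinations(range(len(outputs)), 2)
              for op in sorted(COMBINERS)]
    return max(trees, key=lambda t: f_measure(y, combine_outputs(t, outputs)))
```

Each base classifier can be trained on its own undersampled subset and feature, for example `train_stump(*undersample(X, y, rng), feature=0)`. A tree that references only some classifier indices corresponds to selecting a subset of base classifiers, so the unused ones need not be evaluated at prediction time.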
(3) A method that uses genetic programming to generate new features for Web Spam detection is proposed.
Features play a very important role in classification. The Web Spam data set has 96 content features, 41 link features, and 138 transformed link features, where the 138 transformed link features are simple combinations or logarithmic transformations of the 41 link features. Such features have to be designed by experts, which is expensive and difficult. To combine features of different types (such as content features and link features), this thesis proposes using genetic programming to combine existing features into new, more discriminative features, and then using these new features as input for detecting Web Spam. Experiments on the WEBSPAM-UK2006 data set show that a classifier using 10 new features performs better than one using the original 41 link features and comparably to one using the 138 transformed link features.
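A rough sketch of the feature-construction idea follows. It is illustrative only: instead of evolving expressions by genetic programming it enumerates one-operator combinations of a content feature and a link feature, and it ranks them with a Fisher-style separability score, which is an assumption because the source does not name the fitness measure. All function names are hypothetical.

```python
import operator
from itertools import product
from statistics import mean, pvariance

# Assumed operator set for combining two existing features.
OPS = {"+": operator.add, "-": operator.sub, "*": operator.mul}

def fisher_score(values, y):
    """Squared gap between class means over the summed class variances."""
    spam = [v for v, t in zip(values, y) if t == 1]
    normal = [v for v, t in zip(values, y) if t == 0]
    spread = pvariance(spam) + pvariance(normal)
    return (mean(spam) - mean(normal)) ** 2 / (spread + 1e-12)

def construct_features(X, y, content_idx, link_idx, k):
    """Return the k best (op, i, j) expressions mixing content and link features."""
    scored = []
    for (i, j), name in product(product(content_idx, link_idx), sorted(OPS)):
        values = [OPS[name](x[i], x[j]) for x in X]
        scored.append((fisher_score(values, y), (name, i, j)))
    scored.sort(reverse=True)
    return [expr for _, expr in scored[:k]]

def apply_features(exprs, x):
    """Map an original feature vector to the constructed feature vector."""
    return [OPS[name](x[i], x[j]) for name, i, j in exprs]
```

The constructed vector returned by `apply_features` would then replace or extend the original features as classifier input, mirroring the thesis's use of 10 new features in place of the 41 link features.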
[Degree-granting institution]: Shandong University
[Degree level]: Doctoral
[Year degree conferred]: 2012
[Classification number]: TP18; TP391.3