Research on Key Technologies for Web Spam Detection Based on Genetic Programming and Ensemble Learning
[Abstract]: With the explosive growth of information on the Web, search engines have become an important everyday tool for finding information. For a given query, a search engine usually returns thousands of pages, but most users read only the first few, so a page's position in the ranking matters greatly. Many people therefore try to deceive search engine ranking algorithms so that certain pages receive undeservedly high ranking scores, attract users' attention, and generate profit. All deceptive practices that try to raise a page's ranking in a search engine are called Web Spam. Web Spam seriously degrades the quality of search engine results, creates a major obstacle for users trying to obtain information, and produces a poor user experience. For the search engine, even if these spam pages do not rank high enough to disturb users, they still have to be crawled, indexed, and stored. Identifying Web Spam has therefore become one of the major challenges facing search engines.
Based on the characteristics of the Web Spam data set, this paper studies how to construct classifiers from Web page features to detect Web Spam. The main work covers the following three aspects:
(1) A Web Spam detection method that learns discriminant functions through genetic programming is proposed.
An individual is defined as a discriminant function for detecting Web Spam. Through genetic operations, genetic programming can find an optimized discriminant function that improves Web Spam detection performance. One problem arises when genetic programming is used to produce a discriminant function: because there is no prior knowledge of the optimal solution, it is difficult to know the appropriate individual length. If individuals are too short, they contain very few features, discriminate poorly, and the corresponding function expressions classify poorly. Making full use of the content, link, and other features of the Web Spam data set requires a longer discriminant function and therefore larger individuals. A population made up of large individuals, however, takes longer to construct and search. Based on the principle that a longer discriminant function is composed of several shorter ones, this paper proposes learning discriminant functions through genetic programming to detect Web Spam. The method first creates several populations of small individuals, and each population produces its best individual through genetic operations. The best individuals of the populations are then combined through genetic programming to obtain a better discriminant function, so that a longer, better-performing discriminant function for detecting Web Spam can be generated in less time. The influence of binary-tree depth on the evolutionary process of genetic programming and on the efficiency of the combination is also studied.
Experiments on the WEBSPAM-UK2006 data set show that, compared with single-population genetic programming, recall is increased by 5.6% and the F-measure by 2.25%. Compared with the two-combination genetic programming, the new method increases recall by 26%, the F-measure by 11%, and accuracy by 4%.
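The multi-population idea in (1) can be sketched in a few lines of Python. This is a minimal illustration, not the thesis's implementation: individuals are binary expression trees stored as nested tuples, a page is labelled spam when the discriminant value is positive, and real genetic operators (crossover, mutation, selection over generations) are replaced by plain random search to keep the sketch short. The function set, the fitness choice (F-measure), and all names such as `best_of_population` and `combine` are assumptions made for illustration.

```python
import operator
import random

# Binary function set, so every individual is a binary expression tree.
OPS = {"+": operator.add, "-": operator.sub, "*": operator.mul}

def random_tree(n_features, depth, rng):
    """Grow a full random tree; leaves are ("x", feature_index)."""
    if depth == 0:
        return ("x", rng.randrange(n_features))
    op = rng.choice(sorted(OPS))
    return (op,
            random_tree(n_features, depth - 1, rng),
            random_tree(n_features, depth - 1, rng))

def evaluate(tree, x):
    """Compute the discriminant value of one feature vector."""
    if tree[0] == "x":
        return x[tree[1]]
    return OPS[tree[0]](evaluate(tree[1], x), evaluate(tree[2], x))

def predict(tree, x):
    """Label a page spam (1) when the discriminant value is positive."""
    return 1 if evaluate(tree, x) > 0 else 0

def f_measure(tree, X, y):
    """Fitness: F-measure of the spam class on labelled data."""
    preds = [predict(tree, xi) for xi in X]
    tp = sum(1 for p, t in zip(preds, y) if p == 1 and t == 1)
    fp = sum(1 for p, t in zip(preds, y) if p == 1 and t == 0)
    fn = sum(1 for p, t in zip(preds, y) if p == 0 and t == 1)
    if tp == 0:
        return 0.0
    precision, recall = tp / (tp + fp), tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

def best_of_population(X, y, size, depth, rng):
    """Stand-in for evolving one population: keep the fittest random tree."""
    population = [random_tree(len(X[0]), depth, rng) for _ in range(size)]
    return max(population, key=lambda t: f_measure(t, X, y))

def combine(best_trees, X, y, rng, trials=50):
    """Join short discriminant functions under a new root; keep the fittest."""
    candidates = list(best_trees)
    for _ in range(trials):
        left, right = rng.sample(best_trees, 2)
        candidates.append((rng.choice(sorted(OPS)), left, right))
    return max(candidates, key=lambda t: f_measure(t, X, y))
```

Because `combine` keeps the original best individuals among its candidates, the combined function is never worse than the best short function on the training data. Whether it generalizes better is an empirical question that this sketch does not address.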
(2) An ensemble learning method based on genetic programming is proposed to detect Web Spam.
At present, most classification-based Web Spam detection methods use only one classification algorithm to construct a classifier, and most ignore the imbalance between normal and spam samples in the data set, that is, normal samples far outnumber spam samples. Because there are many different types of Web Spam techniques and new ones keep emerging, it is currently impossible to find a universal classifier that detects all types of Web Spam. Combining the detection results of multiple classifiers into an enhanced classifier is therefore an effective way to detect Web Spam. Ensemble learning is also one of the effective methods for classifying imbalanced data sets. How to generate diverse base classifiers and how to combine their classification results are the two key problems. This paper proposes ensemble learning based on genetic programming to detect Web Spam. First, different classification algorithms are used to train different base classifiers on different sample sets and feature sets; then genetic programming is used to learn a novel classifier that gives the final detection result based on the detection results of the multiple base classifiers.
Based on the characteristics of the Web Spam data set, the method produces different base classifiers from different data sets and classification algorithms, and combines the base classifiers' results through genetic programming. This makes it easy to combine the results of different types of classifiers and improve classification performance, and it also allows only some of the base classifiers to be selected for the ensemble, which reduces the prediction time. The method can also combine undersampling with ensemble learning to improve classification performance on imbalanced data sets. To verify the effectiveness of the genetic programming ensemble, experiments are carried out on balanced and imbalanced data sets. The experiments on the balanced data set analyze how the classification algorithms and feature sets affect the ensemble and compare the method with known ensemble learning algorithms; the results show that its precision, recall, F-measure, accuracy, error rate, and AUC are superior to those of some known ensemble learning algorithms. The experiments on the imbalanced data sets show that the genetic programming ensemble improves classification performance and that heterogeneous ensembles are more effective than homogeneous ones. The genetic programming ensemble achieves a higher F-measure than AdaBoost, Bagging, RandomForest, majority voting, the EDKC algorithm, and methods based on Prediction Spamicity.
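The pipeline described in (2), undersampling, training base classifiers on different sample and feature sets, and combining their outputs with a tree, can be sketched as follows. This is a rough illustration under stated assumptions: the base classifiers are toy one-feature threshold rules rather than the algorithms used in the thesis, the combiner tree is written by hand where the thesis learns it through genetic programming, and the names `undersample`, `train_stump`, `apply_combiner`, and `ensemble_predict` are hypothetical.

```python
import random

def undersample(X, y, rng):
    """Balance the classes by randomly dropping majority-class samples."""
    pos = [i for i, label in enumerate(y) if label == 1]
    neg = [i for i, label in enumerate(y) if label == 0]
    minority, majority = (pos, neg) if len(pos) <= len(neg) else (neg, pos)
    keep = sorted(minority + rng.sample(majority, len(minority)))
    return [X[i] for i in keep], [y[i] for i in keep]

def train_stump(X, y, feature):
    """Toy base classifier: threshold one feature midway between class means."""
    pos = [x[feature] for x, label in zip(X, y) if label == 1]
    neg = [x[feature] for x, label in zip(X, y) if label == 0]
    pos_mean, neg_mean = sum(pos) / len(pos), sum(neg) / len(neg)
    threshold = (pos_mean + neg_mean) / 2
    direction = 1 if pos_mean >= neg_mean else -1
    return lambda x: 1 if direction * (x[feature] - threshold) > 0 else 0

def apply_combiner(tree, outputs):
    """Evaluate a combiner tree over the 0/1 outputs of the base classifiers."""
    if tree[0] == "c":  # leaf: output of base classifier number tree[1]
        return outputs[tree[1]]
    left = apply_combiner(tree[1], outputs)
    right = apply_combiner(tree[2], outputs)
    return min(left, right) if tree[0] == "and" else max(left, right)

def ensemble_predict(tree, classifiers, x):
    """Final label: the combiner tree applied to all base-classifier outputs."""
    return apply_combiner(tree, [clf(x) for clf in classifiers])
```

A combiner such as `("and", ("c", 0), ("c", 1))` references only the base classifiers that appear as leaves, which is how a tree-shaped combiner can select a subset of the base classifiers. The sketch evaluates every classifier regardless; a real implementation could skip the unused ones to save prediction work.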
(3) A method that uses genetic programming to generate new features for detecting Web Spam is proposed.
Features play a very important role in classification. The Web Spam data set has 96 content features, 41 link features, and 138 transformed link features, where the 138 transformed link features are simple combinations or logarithmic operations of the 41 link features. Constructing such features not only requires experts but is also expensive and difficult. To combine features of different types (such as content features and link features), this paper proposes using genetic programming to combine existing features into new, more discriminative features, which are then used as classifier input to detect Web Spam. Experiments on the WEBSPAM-UK2006 data set show that a classifier using 10 new features classifies better than one using the original 41 link features and performs comparably to one using the 138 transformed link features.
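To make the feature-construction idea in (3) concrete, the sketch below evaluates expression trees that combine original features into new ones. It is only an illustration: the operator set (addition, multiplication, protected division, protected logarithm) is an assumption loosely modelled on the "simple combinations or logarithmic operations" mentioned above, the trees are given by hand rather than evolved, and `construct` and `transform` are hypothetical names.

```python
import math

def protected_div(a, b):
    """Division that returns 1.0 instead of failing on a near-zero divisor."""
    return a / b if abs(b) > 1e-9 else 1.0

def protected_log(a):
    """Logarithm of |a| that returns 0.0 for near-zero input."""
    return math.log(abs(a)) if abs(a) > 1e-9 else 0.0

BINARY = {
    "add": lambda a, b: a + b,
    "mul": lambda a, b: a * b,
    "div": protected_div,
}
UNARY = {"log": protected_log}

def construct(tree, x):
    """Compute one constructed feature from the original feature vector x."""
    kind = tree[0]
    if kind == "f":  # leaf: original feature selected by index
        return x[tree[1]]
    if kind in UNARY:
        return UNARY[kind](construct(tree[1], x))
    return BINARY[kind](construct(tree[1], x), construct(tree[2], x))

def transform(trees, X):
    """Replace every sample by its vector of constructed features."""
    return [[construct(tree, x) for tree in trees] for x in X]
```

For example, `transform([("div", ("f", 0), ("f", 1))], X)` maps each page to a single ratio feature. In a genetic programming setting, each such tree would be an individual whose fitness is the performance of a classifier trained on the constructed features.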
[Degree-granting institution]: Shandong University
[Degree level]: Doctoral
[Year degree granted]: 2012
[Classification number]: TP18; TP391.3