Research on Key Technologies for Web Spam Detection Based on Genetic Programming and Ensemble Learning

Published: 2018-08-09 18:40

[Abstract]: With the explosive growth of information on the web, search engines have become an important everyday tool for helping people find the information they want. For a given query, a search engine can usually return thousands of pages, but most users read only the first few, so a page's ranking in the search results matters a great deal. As a result, many people use various means to deceive search engine ranking algorithms, so that some pages receive undeservedly high ranking scores, attract users' attention, and thereby yield some benefit. All deceptive practices that try to raise a page's ranking in a search engine are called Web Spam. Web Spam severely degrades the quality of search results, puts major obstacles in the way of users seeking information, and produces a poor user experience. For the search engine, even when spam pages do not rank high enough to bother users, crawling, indexing, and storing them still carries a cost. Identifying Web Spam has therefore become one of the major challenges for search engines.
Based on the characteristics of Web Spam datasets, this thesis studies how to build classifiers over web page features to detect Web Spam. The main work covers the following three aspects:
(1) A method for detecting Web Spam by learning discriminant functions with genetic programming is proposed.
Each individual is defined as a discriminant function for detecting Web Spam; through genetic operations, genetic programming can find an optimized discriminant function that improves detection performance. One problem arises, however, when discriminant functions are generated with genetic programming: because there is no prior knowledge of the optimal solution, it is hard to know the appropriate length of an individual. If an individual is too short, it contains few features, its discriminative power is low, and the corresponding function expression classifies poorly. Making full use of the content, link, and other features in a Web Spam dataset requires a longer discriminant function and hence larger individuals, and a population made up of larger individuals takes longer to construct and search. Based on the principle that a longer discriminant function is composed of several shorter ones, this thesis proposes the following procedure. The method first creates multiple populations from small individuals. Each population then produces its own best individual through genetic operations. Finally, the best individuals of all populations are combined through genetic programming to obtain a better discriminant function. In this way a longer, better-performing discriminant function for detecting Web Spam can be produced in less time. The thesis also studies how the depth of the binary trees that represent individuals affects the evolutionary process, and the efficiency of the combination.
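The multi-population idea above can be sketched roughly as follows. This is a minimal illustration, not the thesis's implementation: the function set, the accuracy-based fitness, truncation selection, the depth caps, and all names (`evolve`, `multi_population_gp`, `toy_data`) are assumptions made for the example, and the data is synthetic rather than WEBSPAM-UK2006.

```python
import operator
import random


def _pdiv(a, b):
    """Protected division: avoids ZeroDivisionError inside evolved trees."""
    return a / b if abs(b) > 1e-9 else 1.0


OPS = [operator.add, operator.sub, operator.mul, _pdiv]  # assumed function set


def random_tree(rng, n_features, depth):
    """Grow a random binary expression tree; leaves are feature indices."""
    if depth <= 0 or rng.random() < 0.3:
        return rng.randrange(n_features)
    return (rng.choice(OPS),
            random_tree(rng, n_features, depth - 1),
            random_tree(rng, n_features, depth - 1))


def tree_depth(tree):
    return 0 if isinstance(tree, int) else 1 + max(tree_depth(tree[1]), tree_depth(tree[2]))


def evaluate(tree, x):
    """Value of the discriminant function encoded by `tree` on sample x."""
    if isinstance(tree, int):
        return x[tree]
    op, left, right = tree
    return op(evaluate(left, x), evaluate(right, x))


def fitness(tree, X, y):
    """Training accuracy of the rule: spam (label 1) iff f(x) >= 0."""
    hits = sum((evaluate(tree, x) >= 0) == (label == 1) for x, label in zip(X, y))
    return hits / len(y)


def random_subtree(tree, rng):
    while not isinstance(tree, int) and rng.random() < 0.5:
        tree = tree[rng.choice((1, 2))]
    return tree


def crossover(a, b, rng):
    """Graft a random subtree of b at a random position inside a."""
    if isinstance(a, int) or rng.random() < 0.3:
        return random_subtree(b, rng)
    op, left, right = a
    if rng.random() < 0.5:
        return (op, crossover(left, b, rng), right)
    return (op, left, crossover(right, b, rng))


def evolve(rng, X, y, seeds, init_depth, max_depth, generations=15, size=30):
    """Evolve one population; `seeds` are individuals injected at the start."""
    n_features = len(X[0])
    pop = list(seeds)
    while len(pop) < size:
        pop.append(random_tree(rng, n_features, init_depth))
    for _ in range(generations):
        pop.sort(key=lambda t: fitness(t, X, y), reverse=True)
        parents = pop[: size // 2]            # truncation selection keeps the elite
        children = []
        while len(parents) + len(children) < size:
            child = crossover(rng.choice(parents), rng.choice(parents), rng)
            if rng.random() < 0.2:            # mutation: graft in a fresh random subtree
                child = crossover(child, random_tree(rng, n_features, init_depth), rng)
            if tree_depth(child) <= max_depth:  # reject individuals that grow too deep
                children.append(child)
        pop = parents + children
    return max(pop, key=lambda t: fitness(t, X, y))


def multi_population_gp(X, y, n_pops=4, seed=0):
    rng = random.Random(seed)
    # Stage 1: several populations of small (shallow) individuals.
    bests = [evolve(rng, X, y, [], init_depth=2, max_depth=3) for _ in range(n_pops)]
    # Stage 2: the per-population winners seed a combination run that allows
    # deeper trees, so short functions can be joined into a longer one.
    combined = evolve(rng, X, y, bests, init_depth=2, max_depth=6)
    return bests, combined


def toy_data(n=80, seed=1):
    """Synthetic stand-in for page features; NOT the WEBSPAM-UK2006 data."""
    rng = random.Random(seed)
    X = [[rng.uniform(-1, 1) for _ in range(4)] for _ in range(n)]
    y = [1 if x[0] + x[1] * x[2] >= 0 else 0 for x in X]
    return X, y
```

Because the parents, including the current best, are carried over unchanged each generation, the combined function in this sketch can never score below the best seed on the training data. Whether it generalizes better is an empirical question that the toy setup does not answer.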
Experiments on the WEBSPAM-UK2006 dataset show that, compared with single-population genetic programming, multi-population genetic programming with two rounds of combination raises recall by 5.6%, F-measure by 2.25%, and accuracy by 2.83%. Compared with SVM, the new method raises recall by 26%, F-measure by 11%, and accuracy by 4%.
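For reference, the metrics reported here are the standard ones derived from a confusion matrix with spam as the positive class. The helper below is a small sketch; the function name `spam_metrics` and the 0/1 label encoding are assumptions for the example. The abstract does not say whether the reported gains are absolute percentage points or relative improvements, so the sketch only shows how the base quantities are computed.

```python
def spam_metrics(y_true, y_pred):
    """Precision, recall, F-measure and accuracy, with spam (1) as the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # spam caught
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # normal flagged as spam
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # spam missed
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # normal passed
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f_measure = (2 * precision * recall / (precision + recall)
                 if precision + recall else 0.0)
    accuracy = (tp + tn) / len(pairs)
    return {"precision": precision, "recall": recall,
            "f_measure": f_measure, "accuracy": accuracy}
```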
(2) A method for detecting Web Spam with ensemble learning based on genetic programming is proposed.
At present, most classification-based methods for detecting Web Spam use a single classification algorithm to build a single classifier, and most ignore the imbalance between spam and normal samples in the dataset: normal samples far outnumber spam samples. Because many different types of Web Spam techniques exist and new ones keep appearing, it is unrealistic to expect one universal classifier to detect every type of Web Spam. Combining the detection results of multiple classifiers into a stronger classifier is therefore an effective approach, and ensemble learning is also one of the effective ways to handle classification on imbalanced datasets. In ensemble learning, two key questions are how to generate diverse base classifiers and how to combine their outputs. This thesis proposes detecting Web Spam with ensemble learning based on genetic programming: first, different classification algorithms are trained on different sample sets and feature sets to produce diverse base classifiers; then genetic programming learns a new classifier that gives the final detection result based on the detection results of the base classifiers.
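One way to picture the learned combiner is as an expression tree whose leaves are base-classifier outputs. The sketch below shows only the representation and a possible F-measure fitness; exhaustive enumeration of very shallow trees stands in for the genetic programming search, and the operator set (`min`, `max`, `avg`), the 0.5 threshold, and all function names are assumptions for the example.

```python
from itertools import product

# Assumed operator set for combining scores in [0, 1].
OPS = {
    "min": min,                         # behaves like AND on 0/1 votes
    "max": max,                         # behaves like OR on 0/1 votes
    "avg": lambda a, b: (a + b) / 2.0,  # soft vote
}


def combine(tree, scores):
    """Evaluate a combiner tree on one page's base-classifier scores."""
    if isinstance(tree, int):           # leaf: index of a base classifier
        return scores[tree]
    name, left, right = tree
    return OPS[name](combine(left, scores), combine(right, scores))


def used_base_classifiers(tree):
    """Base classifiers that actually appear in the tree."""
    if isinstance(tree, int):
        return {tree}
    return used_base_classifiers(tree[1]) | used_base_classifiers(tree[2])


def f_measure(tree, score_rows, labels, threshold=0.5):
    """F-measure of the rule: spam iff combined score >= threshold."""
    preds = [combine(tree, row) >= threshold for row in score_rows]
    tp = sum(1 for p, t in zip(preds, labels) if p and t == 1)
    fp = sum(1 for p, t in zip(preds, labels) if p and t == 0)
    fn = sum(1 for p, t in zip(preds, labels) if not p and t == 1)
    return 2 * tp / (2 * tp + fp + fn) if tp else 0.0


def best_shallow_combiner(score_rows, labels):
    """Stand-in for the GP search: try every single classifier and every
    one-operator tree over a pair of classifiers, keep the best F-measure."""
    n_base = len(score_rows[0])
    candidates = list(range(n_base))
    candidates += [(name, i, j)
                   for name, i, j in product(sorted(OPS), range(n_base), range(n_base))
                   if i < j]
    return max(candidates, key=lambda t: f_measure(t, score_rows, labels))
```

Only the base classifiers returned by `used_base_classifiers` need to be run at prediction time, which is one way to read the abstract's remark that selecting a subset of base classifiers lowers prediction cost.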
Following the characteristics of Web Spam datasets, the method uses different data subsets and classification algorithms to produce base classifiers that differ substantially, and uses genetic programming to combine their results. This makes it easy to combine the results of different types of classifiers and improve classification performance, and it also allows only a subset of the base classifiers to be selected for the ensemble, which reduces prediction time. The method can also merge undersampling with ensemble learning to improve classification performance on imbalanced datasets. To verify the effectiveness of the genetic programming ensemble, experiments were carried out on both balanced and imbalanced datasets. On the balanced dataset, the effect of classification algorithms and feature sets on the ensemble was analyzed first, and the method was then compared with known ensemble learning algorithms; the results show that it outperforms some known ensemble learning algorithms in precision, recall, F-measure, accuracy, error rate, and AUC. Experiments on the imbalanced dataset show that the genetic programming ensemble improves classification performance for both homogeneous and heterogeneous ensembles, with heterogeneous ensembles being more effective; it also achieves a higher F-measure than AdaBoost, Bagging, RandomForest, majority voting, the EDKC algorithm, and the method based on Prediction Spamicity.
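The abstract does not spell out how undersampling is merged with the ensemble. One common arrangement, shown here purely as an assumption, is to give each base classifier its own balanced training subset: every minority-class sample plus an equally sized random draw from the majority class. The function name `balanced_subsets` and the 0/1 label encoding are hypothetical.

```python
import random


def balanced_subsets(labels, n_subsets, seed=0):
    """Return `n_subsets` lists of sample indices. Each list holds all
    minority-class samples plus an equally sized random draw (without
    replacement) from the majority class."""
    rng = random.Random(seed)
    spam = [i for i, label in enumerate(labels) if label == 1]
    normal = [i for i, label in enumerate(labels) if label == 0]
    # In Web Spam data the spam class is normally the minority, but the
    # helper does not rely on that.
    minority, majority = (spam, normal) if len(spam) <= len(normal) else (normal, spam)
    return [sorted(minority + rng.sample(majority, len(minority)))
            for _ in range(n_subsets)]
```

Each index list would then be used to train one base classifier, so different classifiers see different slices of the majority class while no minority sample is discarded.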
(3) A method for detecting Web Spam with new features generated by genetic programming is proposed.
Features play a very important role in classification. The Web Spam dataset has 96 content features, 41 link features, and 138 transformed link features, where the 138 transformed link features are simple combinations or logarithmic transforms of the 41 link features. Producing such features requires experts and is labor-intensive, and it is not easy to fuse features of different types (such as content features and link features). The proposed method uses genetic programming to combine existing features into new, more discriminative features, which are then used as classifier input to detect Web Spam. Experiments on the WEBSPAM-UK2006 dataset show that a classifier using 10 new features outperforms a classifier using the original 41 link features and performs on par with a classifier using the 138 transformed link features.
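To make the idea of a constructed feature concrete, the sketch below represents a new feature as a small expression tree over existing feature indices and scores its discriminative power with AUC. The evolutionary search is omitted, and the operator set, the choice of AUC as the score, and the names `construct` and `auc` are assumptions for illustration rather than details taken from the thesis.

```python
import math
import operator


def _pdiv(a, b):
    """Protected division so constructed features never divide by zero."""
    return a / b if abs(b) > 1e-9 else 1.0


BINARY = {"add": operator.add, "sub": operator.sub, "mul": operator.mul, "div": _pdiv}
UNARY = {"log": lambda a: math.log1p(abs(a))}  # log transform, safe for any input


def construct(tree, x):
    """Compute the value of a constructed feature for one sample x.
    A tree is a feature index, ("log", child), or (binary_op, left, right)."""
    if isinstance(tree, int):
        return x[tree]
    name, *children = tree
    values = [construct(child, x) for child in children]
    if name in UNARY:
        return UNARY[name](values[0])
    return BINARY[name](*values)


def auc(values, labels):
    """Probability that a random spam sample scores above a random normal
    sample (ties count one half); 0.5 means no discriminative power."""
    pos = [v for v, label in zip(values, labels) if label == 1]
    neg = [v for v, label in zip(values, labels) if label == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))
```

As a usage example, with samples `X` and labels `y`, `auc([construct(("sub", 0, 1), x) for x in X], y)` scores the difference of the first two features; a search procedure would keep the trees with the highest scores and pass their values to the classifier as new input columns.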
[Degree-granting institution]: Shandong University
[Degree level]: Doctoral
[Year degree awarded]: 2012
[Classification number]: TP18;TP391.3

