Research on Key Technologies for Web Spam Detection Based on Genetic Programming and Ensemble Learning

Published: 2018-08-09 18:40
[Abstract]: With the explosive growth of information on the web, search engines have become an important everyday tool for helping people find the information they want. For a given query, a search engine can usually return thousands of pages, but most users read only the first few, so a page's ranking in the search engine matters a great deal. As a result, many people use various means to deceive search engine ranking algorithms so that certain pages receive undeservedly high ranking scores, attract users' attention, and thereby yield some benefit. All deceptive practices that try to raise a page's ranking in a search engine are called Web Spam. Web Spam seriously degrades the quality of search results, puts major obstacles in the way of users seeking information, and produces a poor user experience. For the search engine, even if these spam pages do not rank high enough to disturb users, crawling, indexing, and storing them still costs resources. Identifying Web Spam has therefore become one of the major challenges for search engines.
Based on the characteristics of Web Spam datasets, this thesis studies the detection of Web Spam with classifiers built on web page features. The main work covers the following three aspects:
(1) A method is proposed for detecting Web Spam with discriminant functions learned by genetic programming.
Each individual is defined as a discriminant function for detecting Web Spam; through genetic operations, genetic programming can find an optimized discriminant function that improves detection performance. One problem arises, however: because nothing is known in advance about the optimal solution, it is hard to choose an appropriate individual length. If individuals are too short, they contain few features, their discriminative power is low, and the corresponding function expressions classify poorly. Making full use of the content, link, and other features in a Web Spam dataset requires longer discriminant functions, which means larger individuals, and populations of large individuals take longer to construct and search. Based on the principle that a longer discriminant function is composed of several shorter ones, this thesis proposes the following approach. The method first creates several populations of small individuals; each population evolves through genetic operations and produces its own best individual; the best individuals from all populations are then combined by genetic programming into a better discriminant function. In this way, a longer discriminant function with better performance can be produced in less time. The thesis also studies how the depth of the binary trees that represent individuals affects the evolutionary process, as well as the efficiency of the combination.
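The representation described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the thesis's implementation: the tuple encoding of trees, the operator set, the zero threshold, the use of accuracy as fitness, and all function names are hypothetical, and the exhaustive pairwise join in `combine_best` merely stands in for the second genetic programming run over the sub-population winners.

```python
import operator

# Assumed, minimal function set for internal nodes of the binary tree.
OPS = {"+": operator.add, "-": operator.sub, "*": operator.mul}

def evaluate(tree, features):
    """Evaluate a discriminant function on one page's feature vector.

    A tree is an int (index into `features`), a float (constant),
    or a tuple (op, left, right) forming a binary expression tree.
    """
    if isinstance(tree, tuple):
        op, left, right = tree
        return OPS[op](evaluate(left, features), evaluate(right, features))
    if isinstance(tree, int):
        return features[tree]
    return tree  # numeric constant

def is_spam(tree, features, threshold=0.0):
    """Label a page as spam when the function value exceeds the threshold."""
    return evaluate(tree, features) > threshold

def depth(tree):
    """Depth of the binary tree; a lone terminal has depth 0."""
    if isinstance(tree, tuple):
        return 1 + max(depth(tree[1]), depth(tree[2]))
    return 0

def accuracy(tree, samples):
    """Fraction of (features, label) pairs the function classifies correctly."""
    hits = sum(is_spam(tree, x) == label for x, label in samples)
    return hits / len(samples)

def combine_best(best_individuals, samples):
    """Join sub-population winners into longer functions and keep the fittest.

    Simplification: every pair is joined once with every operator,
    instead of evolving the combination with genetic programming.
    """
    candidates = list(best_individuals)
    for i, a in enumerate(best_individuals):
        for b in best_individuals[i + 1:]:
            for op in OPS:
                candidates.append((op, a, b))
    return max(candidates, key=lambda t: accuracy(t, samples))

# Toy data: neither feature separates the classes alone, but their sum does.
samples = [([3, -1], True), ([-1, 3], True), ([1, -2], False), ([-2, 1], False)]
best = combine_best([0, 1], samples)
```

On this toy data the two short functions (feature 0 alone, feature 1 alone) each reach 50% accuracy, while the combined function separates the classes, which is the intuition behind building long discriminant functions from short ones.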
Experiments on the WEBSPAM-UK2006 dataset show that, compared with single-population genetic programming, multi-population genetic programming with two rounds of combination improves recall by 5.6%, F-measure by 2.25%, and accuracy by 2.83%. Compared with SVM, the new method improves recall by 26%, F-measure by 11%, and accuracy by 4%.
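For readers unfamiliar with the reported measures, the sketch below computes them from confusion-matrix counts using the standard definitions, with spam as the positive class. The function name and the example counts are made up for illustration and are not results from the thesis; the abstract also does not say whether the reported improvements are absolute or relative.

```python
def classification_metrics(tp, fp, tn, fn):
    """Standard measures from confusion counts (spam = positive class)."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall:
        f_measure = 2 * precision * recall / (precision + recall)
    else:
        f_measure = 0.0
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    return {
        "precision": precision,
        "recall": recall,
        "f_measure": f_measure,
        "accuracy": accuracy,
        "error_rate": 1.0 - accuracy,
    }

# Invented counts: 13 spam pages, 87 normal pages.
m = classification_metrics(tp=8, fp=2, tn=85, fn=5)
```

With few spam pages, accuracy can stay high even when recall is poor, which is why recall and F-measure are reported alongside it.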
(2) A method is proposed for detecting Web Spam with ensemble learning based on genetic programming.
At present, most classification-based methods for detecting Web Spam use a single classification algorithm to build a single classifier, and most ignore the imbalance between spam and normal samples in the dataset, that is, normal samples far outnumber spam samples. Because many different types of Web Spam techniques exist and new ones keep emerging, it is unrealistic to expect a single universal classifier to detect every type of Web Spam. Combining the detection results of multiple classifiers into a stronger classifier is therefore an effective approach, and ensemble learning is also one of the effective ways to handle classification on imbalanced datasets. In ensemble learning, the two key questions are how to generate diverse base classifiers and how to combine their classification results. This thesis proposes detecting Web Spam with ensemble learning based on genetic programming: different classification algorithms are first trained on different sample sets and feature sets to produce diverse base classifiers, and genetic programming is then used to learn a novel classifier that gives the final detection result based on the detection results of the base classifiers.
In keeping with the characteristics of Web Spam datasets, the method uses different data subsets and classification algorithms to produce base classifiers that differ substantially from one another, and uses genetic programming to combine their results. This makes it easy to combine the results of different types of classifiers and improve classification performance, and it also allows only a subset of the base classifiers to be selected for the ensemble, which reduces prediction time. The method can also combine undersampling with ensemble learning to improve classification performance on imbalanced datasets. To verify the effectiveness of the genetic programming ensemble, experiments were carried out on both balanced and imbalanced datasets. On the balanced dataset, the influence of the classification algorithms and feature sets on the ensemble was analyzed first, and the method was then compared with known ensemble learning algorithms; the results show that it outperforms some known ensemble learning algorithms in precision, recall, F-measure, accuracy, error rate, and AUC. Experiments on the imbalanced dataset show that the genetic programming ensemble improves classification performance for both homogeneous and heterogeneous ensembles, and that heterogeneous ensembles are more effective than homogeneous ones. The genetic programming ensemble achieves a higher F-measure than AdaBoost, Bagging, RandomForest, majority-voting ensembles, the EDKC algorithm, and the method based on Prediction Spamicity.
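Two of the ingredients mentioned above, undersampling and combining base-classifier outputs, can be sketched as follows. Everything here is an assumption made for illustration: the function names, the 0/1 encoding of base-classifier outputs, and in particular the weighted-sum combiner, which is a hand-written stand-in for the kind of expression genetic programming would search for. The abstract does not specify the actual form of the learned combiner.

```python
import random

def undersample(normal, spam, n_subsets, seed=0):
    """Build balanced training subsets for the base classifiers.

    Each subset keeps every spam sample and adds an equally sized
    random draw (without replacement) from the normal samples.
    """
    rng = random.Random(seed)
    return [rng.sample(normal, len(spam)) + list(spam) for _ in range(n_subsets)]

def majority_vote(outputs):
    """Baseline combiner: spam if more than half of the base classifiers say so."""
    return sum(outputs) > len(outputs) / 2

def weighted_combiner(weights, threshold):
    """Return a combiner over base-classifier outputs (1 = spam, 0 = normal).

    A zero weight drops a base classifier from the ensemble, which is how
    selecting only part of the base classifiers can reduce prediction time.
    """
    def combine(outputs):
        score = sum(w * o for w, o in zip(weights, outputs))
        return score > threshold
    return combine

# Toy data: 100 normal samples, 3 spam samples, 4 balanced subsets.
normal = list(range(100))
spam = ["s1", "s2", "s3"]
subsets = undersample(normal, spam, n_subsets=4)

# Hypothetical combiner that ignores the second base classifier.
combine = weighted_combiner([0.6, 0.0, 0.5], threshold=0.5)
```

Training each base classifier on a different balanced subset, and possibly on a different feature set or with a different algorithm, is one way to obtain the diversity that the ensemble relies on.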
(3) A method is proposed for detecting Web Spam with new features generated by genetic programming.
Features play a very important role in classification. The Web Spam dataset has 96 content features, 41 link features, and 138 transformed link features, where the 138 transformed link features are simple combinations or logarithmic transformations of the 41 link features. Producing such features requires experts and is labor-intensive, and it is also hard to fuse features of different types (such as content features and link features). The method uses genetic programming to combine existing features into new, more discriminative features, which are then used as classifier input to detect Web Spam. Experiments on the WEBSPAM-UK2006 dataset show that a classifier using 10 new features classifies better than one using the original 41 link features and performs comparably to one using the 138 transformed link features.
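To make the distinction concrete, the sketch below contrasts hand-crafted transformed link features with a constructed feature that mixes link and content features. The feature names (`indegree`, `outdegree`, `compression_rate`, `avg_word_length`), the `log(1 + x)` form, and the expression in `constructed_feature` are all hypothetical; they only illustrate the kind of expression genetic programming could evolve and are not taken from the thesis.

```python
import math

def transformed_link_features(link):
    """Hand-crafted style: logarithms and simple ratios of raw link features."""
    return {
        "log_indegree": math.log(1 + link["indegree"]),
        "log_outdegree": math.log(1 + link["outdegree"]),
        "in_out_ratio": link["indegree"] / (1 + link["outdegree"]),
    }

def constructed_feature(link, content):
    """One hypothetical evolved expression fusing link and content features."""
    return (math.log(1 + link["indegree"]) * content["compression_rate"]
            - content["avg_word_length"])

# Invented values for a single host.
link = {"indegree": 0, "outdegree": 9}
content = {"compression_rate": 2.5, "avg_word_length": 5.0}

hand_made = transformed_link_features(link)
new_feature = constructed_feature(link, content)
```

A small set of such constructed values would then replace or supplement the original feature vector given to the classifier.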
[Degree-granting institution]: Shandong University
[Degree level]: Doctoral
[Year degree conferred]: 2012
[Classification number]: TP18; TP391.3

