Research on the Application of Multi-View Learning to Web Spam Detection
Published: 2018-04-27 12:03
Topics: multi-view learning + web spam detection; Source: Shandong Normal University, master's thesis, 2014
[Abstract]: The web has greatly changed how people express themselves and interact with others, and has become the primary means of information retrieval. Because adding information to HTML pages and other web documents keeps getting easier, users find it harder to tell accurate information from inaccurate and trustworthy information from unreliable, so building an effective web spam detection method is a major current challenge. Work on web spam detection today focuses mainly on spam pages that rely on content cheating and link cheating. Existing detection methods usually classify a page as spam or not using the features of a single view, whereas multi-view learning, which uses both kinds of features at once, treats the detection problem more completely. This thesis centers on multi-view learning for web spam detection and studies feature extraction methods, classification methods, and the link structure of web pages. The main results are as follows:
(1) The content-based and link-based features of a web spam dataset are treated as two different views of the detection problem. Canonical correlation analysis (CCA) and several improved variants are first applied to extract features, using the transformation matrices to obtain the features along the projection directions on which the two views are maximally correlated. Different combination schemes then merge the two views' features into one, and the resulting single-view feature is used to train a classifier. Experiments show that treating web spam detection as a two-view classification problem and applying CCA improves classification accuracy.
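The pipeline in contribution (1) can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the thesis's implementation: the data are synthetic, the helper names (`make_two_view_data`, `fit_cca`, `fuse`, `nearest_centroid_accuracy`) are hypothetical, the two fusion schemes shown (concatenation and summation) are common choices rather than the ones the thesis necessarily used, and a nearest-centroid rule stands in for whatever classifier was actually trained.

```python
import numpy as np

def make_two_view_data(n=400, seed=0):
    """Synthetic stand-in for content features X and link features Y (assumption)."""
    rng = np.random.default_rng(seed)
    z = rng.normal(size=(n, 2))                      # shared latent factors
    y = (z[:, 0] > 0).astype(int)                    # 1 = spam, 0 = normal
    X = z @ rng.normal(size=(2, 6)) + 0.3 * rng.normal(size=(n, 6))
    Y = z @ rng.normal(size=(2, 4)) + 0.3 * rng.normal(size=(n, 4))
    return X, Y, y

def fit_cca(X, Y, k, reg=1e-3):
    """Top-k canonical directions via SVD of the whitened cross-covariance."""
    mx, my = X.mean(axis=0), Y.mean(axis=0)
    Xc, Yc = X - mx, Y - my
    n = X.shape[0]
    Sxx = Xc.T @ Xc / (n - 1) + reg * np.eye(X.shape[1])
    Syy = Yc.T @ Yc / (n - 1) + reg * np.eye(Y.shape[1])
    Sxy = Xc.T @ Yc / (n - 1)
    Lx, Ly = np.linalg.cholesky(Sxx), np.linalg.cholesky(Syy)
    T = np.linalg.solve(Lx, Sxy)                     # Lx^{-1} Sxy
    T = np.linalg.solve(Ly, T.T).T                   # Lx^{-1} Sxy Ly^{-T}
    U, s, Vt = np.linalg.svd(T, full_matrices=False)
    Wx = np.linalg.solve(Lx.T, U[:, :k])             # transformation matrix, view 1
    Wy = np.linalg.solve(Ly.T, Vt.T[:, :k])          # transformation matrix, view 2
    return mx, my, Wx, Wy, s[:k]                     # s holds canonical correlations

def fuse(X, Y, mx, my, Wx, Wy, how="concat"):
    """Merge the two projected views into a single feature matrix."""
    Zx, Zy = (X - mx) @ Wx, (Y - my) @ Wy
    return np.hstack([Zx, Zy]) if how == "concat" else Zx + Zy

def nearest_centroid_accuracy(F_train, y_train, F_test, y_test):
    """Stand-in classifier: assign each test row to the closer class centroid."""
    C = np.stack([F_train[y_train == c].mean(axis=0) for c in (0, 1)])
    d = ((F_test[:, None, :] - C[None, :, :]) ** 2).sum(axis=2)
    return float((d.argmin(axis=1) == y_test).mean())

X, Y, y = make_two_view_data()
mx, my, Wx, Wy, corr = fit_cca(X[:300], Y[:300], k=2)   # fit on training rows only
print("canonical correlations:", np.round(corr, 3))
for how in ("concat", "sum"):
    F = fuse(X, Y, mx, my, Wx, Wy, how=how)
    acc = nearest_centroid_accuracy(F[:300], y[:300], F[300:], y[300:])
    print(how, "accuracy:", round(acc, 3))
```

The projections are fitted on the training rows only and then applied to held-out rows, so the fused feature for a new page needs nothing beyond the stored means and transformation matrices.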
(2) Because only a small number of labeled pages are available in web spam detection, a semi-supervised co-training method can be used. Page features are split into a content view and a link view. Before classification, independent component analysis (ICA) extracts the independent components of each view's features; the classification itself is carried out by co-training. Experiments show that this combination of feature extraction and semi-supervised classification improves detection accuracy, and that applying ICA to the two views separately is also more effective.
(3) The SVM classifier is modified using the link structure of web pages. A direct-link matrix and an indirect-link matrix are first used to build a within-class scatter matrix that preserves the link structure; the link structure is then incorporated into the SVM classifier, which yields a reformulated optimization problem. The method has an advantage in exploiting link information. Experiments on a web spam dataset show that combining link structure with the SVM classifier significantly outperforms other related methods, and also show how classification accuracy varies with the step length of indirect links.
(4) The problem is addressed by carefully taking into account the different structure and statistical properties of the content and link views. The feature extraction methods PCA and LPP are reformulated for content features and link features respectively and then integrated into the proposed method, which extracts a consistent pattern from the multi-view embeddings of the multi-view representation. An iterative algorithm yields the embedding of each view together with the transformation matrix from each view to the consistent pattern. A method for computing the consistent pattern of a test sample is also provided.
Experiments on the WEBSPAM-UK2006 and WEBSPAM-UK2007 datasets show that using the consistent pattern for web spam detection outperforms other related dimensionality reduction methods.
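The semi-supervised co-training step in contribution (2) can be sketched as follows. This is a simplified illustration under several assumptions, not the thesis's procedure: the two views are synthetic, the ICA preprocessing is omitted (a per-view ICA such as scikit-learn's `FastICA` could be applied beforehand), a nearest-centroid rule stands in for the base classifier, and each round simply promotes the most confident unlabeled pages rather than a fixed number per class. All function names are hypothetical.

```python
import numpy as np

def make_views(n=300, seed=0):
    """Synthetic content view Xa and link view Xb (assumption, not real data)."""
    rng = np.random.default_rng(seed)
    y = rng.integers(0, 2, size=n)                   # 1 = spam, 0 = normal
    s = (2 * y - 1)[:, None]                         # class sign, shape (n, 1)
    Xa = rng.normal(size=(n, 3)) + 1.5 * s * np.array([1.0, 0.0, 0.0])
    Xb = rng.normal(size=(n, 4)) + 1.5 * s * np.array([0.0, 1.0, 0.0, 0.0])
    return Xa, Xb, y

def fit_centroids(X, y):
    """One centroid per class; requires both classes to be present."""
    return np.stack([X[y == c].mean(axis=0) for c in (0, 1)])

def sq_dists(X, C):
    """Squared distance from every row of X to each class centroid, shape (n, 2)."""
    return ((X[:, None, :] - C[None, :, :]) ** 2).sum(axis=2)

def co_train(Xa, Xb, y, labeled, rounds=10, per_round=5):
    """Each view's classifier pseudo-labels pages for the shared labeled pool."""
    y_work = np.full(len(y), -1)                     # -1 marks unlabeled pages
    y_work[labeled] = y[labeled]                     # only these labels are used
    for _ in range(rounds):
        for X in (Xa, Xb):                           # alternate between the views
            known = np.flatnonzero(y_work >= 0)
            unknown = np.flatnonzero(y_work < 0)
            if unknown.size == 0:
                continue
            d = sq_dists(X[unknown], fit_centroids(X[known], y_work[known]))
            margin = np.abs(d[:, 0] - d[:, 1])       # confidence of this view
            top = np.argsort(-margin)[:per_round]    # most confident pages
            y_work[unknown[top]] = d[top].argmin(axis=1)
    # Final decision: add up the distances from both views' classifiers.
    known = np.flatnonzero(y_work >= 0)
    total = sum(sq_dists(X, fit_centroids(X[known], y_work[known])) for X in (Xa, Xb))
    return total.argmin(axis=1)

Xa, Xb, y = make_views()
labeled = np.concatenate([np.flatnonzero(y == 0)[:5], np.flatnonzero(y == 1)[:5]])
pred = co_train(Xa, Xb, y, labeled)
print("accuracy with 10 labeled pages:", round(float((pred == y).mean()), 3))
```

Because the pseudo-labels chosen by one view are added to the pool that the other view trains on, each view benefits from pages the other could label confidently, which is the motivation for co-training when labeled pages are scarce.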
[Degree-granting institution]: Shandong Normal University
[Degree level]: Master's
[Year degree granted]: 2014
[Classification number]: TP393.092
Article ID: 1810658
Article link: http://sikaile.net/guanlilunwen/ydhl/1810658.html