Research on Web Structure Mining Methods Based on Link Similarity Analysis
Published: 2018-08-24 08:41
[Abstract]: Web services and applications have developed rapidly in recent years. The volume of information on the Web is growing geometrically, with millions of pages added every day. The Web has become a huge, widely distributed, global information service center covering education, government, e-commerce, news, advertising, consumer information, financial management, and many other information services. Web pages link to one another in an intricate tangle, forming a network that resembles human society. Using link similarity analysis, this thesis studies Web resource mining to help people obtain the information they need efficiently and find authoritative information in a given field.
The thesis addresses four problems in Web structure mining: Web page link prediction, spam page identification, Web structure mining algorithms, and Web page clustering.
First, a similarity-based multipath walk link prediction algorithm is proposed. (1) A new decay factor is introduced and used to define a new similarity formula. (2) The Rubin algorithm is improved and combined with the new similarity formula to compute the similarity between nodes. (3) The node similarities are sorted to judge how likely each candidate link is, which yields the prediction. (4) Finally, the algorithm is validated experimentally.
Second, the concept of mutual-link similarity between pages is introduced, a hypothesis about the link structure of spam pages is stated, and a spam page identification algorithm based on clustering by mutual-link similarity is proposed. Because the algorithm takes into account the reciprocal links that occur between pages, it is argued to be more reasonable. Experiments verify both the hypothesis and the effectiveness of the algorithm.
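The link prediction steps above (score node pairs with a decayed multipath similarity, then rank them) can be illustrated with a minimal sketch. The abstract does not give the thesis's decay factor or similarity formula, so this sketch substitutes a standard truncated Katz-style index, in which walks of length l are weighted by beta**l. The names `walk_similarity` and `predict_links` and the parameters `beta` and `max_len` are assumptions made for illustration, not the author's method.

```python
def walk_similarity(adj, beta=0.5, max_len=3):
    """Truncated Katz-style score (an assumed stand-in for the thesis formula).

    sim[i][j] = sum over l = 1..max_len of beta**l * (number of walks of
    length l from i to j), so longer paths contribute less.
    """
    n = len(adj)
    sim = [[0.0] * n for _ in range(n)]
    # walks starts as the identity matrix and becomes adj**l in iteration l.
    walks = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    for length in range(1, max_len + 1):
        walks = [[sum(walks[i][k] * adj[k][j] for k in range(n))
                  for j in range(n)] for i in range(n)]
        weight = beta ** length
        for i in range(n):
            for j in range(n):
                sim[i][j] += weight * walks[i][j]
    return sim


def predict_links(adj, top_k=3, beta=0.5, max_len=3):
    """Rank currently unlinked pairs of an undirected graph by similarity."""
    sim = walk_similarity(adj, beta, max_len)
    n = len(adj)
    candidates = [(sim[i][j], i, j)
                  for i in range(n) for j in range(i + 1, n)
                  if adj[i][j] == 0]
    # Highest score first; ties are broken by node index for determinism.
    candidates.sort(key=lambda t: (-t[0], t[1], t[2]))
    return [(i, j, score) for score, i, j in candidates[:top_k]]
```

Calling `predict_links` on a symmetric 0/1 adjacency matrix returns the `top_k` unlinked pairs with their scores. A smaller `beta` makes short paths dominate the ranking, while a larger `max_len` lets more distant structure contribute at higher cost.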
Third, to address the "topic drift" of the PageRank algorithm and its bias toward old pages, an improved PageRank algorithm based on similarity and a time feedback factor is proposed. In the first step, the vector space model (VSM) is used to compute the similarity between each link's text and the page it points to. In the second step, a time feedback factor is designed from each page's creation time; it lowers the rank value of old pages and raises the rank value of new pages. In the third step, the similarity value and the time feedback factor are incorporated into the PageRank computation, and the improved PageRank values are computed by following the algorithm's procedure. Finally, the performance of the algorithm is analyzed experimentally.
Fourth, existing domestic and international research on heuristic clustering methods based on local information is surveyed and summarized. The label propagation method based on local information is then studied in detail, with an analysis of the problems that arise when, during iteration, a random strategy is used to choose the cluster a node belongs to. To address this random-selection problem, an improved clustering method is proposed: a label propagation algorithm based on node attribute similarity. Finally, to help discover grouped information resources on the Internet efficiently, experiments verify the effectiveness and performance of the algorithm, and it is applied to real web page clustering. The thesis closes with its conclusions and an outlook on future work.
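The three PageRank steps described above (VSM similarity between link text and target page, a time factor derived from page age, and a rank computation that uses both) can be sketched as follows. The abstract does not state the formulas, so everything concrete here is an assumption: raw term-frequency cosine similarity stands in for the VSM step, `time_factor` is a hypothetical exponential half-life decay, and the two are simply multiplied into an edge weight for a weighted power-iteration PageRank.

```python
import math
from collections import Counter


def cosine_similarity(text_a, text_b):
    """Cosine similarity of raw term-frequency vectors (a simple VSM)."""
    va = Counter(text_a.lower().split())
    vb = Counter(text_b.lower().split())
    dot = sum(va[t] * vb[t] for t in va)
    norm = (math.sqrt(sum(c * c for c in va.values()))
            * math.sqrt(sum(c * c for c in vb.values())))
    return dot / norm if norm else 0.0


def time_factor(age_days, half_life_days=365.0):
    """Hypothetical freshness weight: 1.0 for a new page, 0.5 after one half-life."""
    return 0.5 ** (age_days / half_life_days)


def weighted_pagerank(links, page_text, page_age_days, damping=0.85, iterations=50):
    """links is a list of (source, target, anchor_text) tuples.

    Edge weight = anchor/target similarity * target freshness
    (an assumed way of combining the two factors).
    """
    pages = sorted(page_text)
    n = len(pages)
    weights = {p: {} for p in pages}
    for src, dst, anchor in links:
        w = cosine_similarity(anchor, page_text[dst]) * time_factor(page_age_days[dst])
        weights[src][dst] = weights[src].get(dst, 0.0) + w
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for src in pages:
            total = sum(weights[src].values())
            if total == 0.0:
                # No usable out-links: spread this page's rank uniformly.
                for p in pages:
                    new_rank[p] += damping * rank[src] / n
            else:
                # Split rank among out-links in proportion to edge weight.
                for dst, w in weights[src].items():
                    new_rank[dst] += damping * rank[src] * w / total
        rank = new_rank
    return rank
```

With these weights, a link whose anchor text matches a recent target page passes on more rank than a link to an old or off-topic page, which is the qualitative behavior the abstract describes. The rank values still sum to 1 because each page's outgoing share is normalized.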
[Degree-granting institution]: Harbin Engineering University
[Degree level]: Doctoral
[Year degree conferred]: 2012
[Classification number]: TP393.092; TP311.13
Article ID: 2200241
Article link: http://sikaile.net/wenyilunwen/guanggaoshejilunwen/2200241.html