Research on Web Data Mining Based on the PageRank Algorithm

Published: 2018-06-05 17:53

Topic: PageRank algorithm + web page similarity; Source: Tianjin University of Technology, master's thesis, 2017

[Abstract]: Given the enormous volume of data on the Internet, obtaining the information one needs has become a hard research problem, and the emergence of Web data mining as a discipline offers a way to address it. Web data mining consists of Web content mining, Web structure mining, and Web usage mining. The main Web structure mining algorithms are PageRank and HITS. Because PageRank is more widely applied than HITS and is also more efficient, this thesis studies the characteristics of the PageRank algorithm and proposes improvements to it. The main contributions are as follows.

(1) To address PageRank's even distribution of PR values among out-links, this thesis proposes an improvement based on web page similarity. The link relationships among pages are treated as link vectors, each page is represented by its link vector, and the similarity between two pages is computed from their link vectors. PR value is then passed according to the similarity between the current page and each of its in-link pages, replacing the equal split used by the original PageRank algorithm. In an experimental comparison with PageRank, the improved algorithm achieves higher precision.

(2) To address PageRank's topic drift problem, this thesis proposes an improvement based on topic relevance. The underlying idea is that, for a keyword query, a retrieval system can be considered acceptably accurate if it orders its results by how relevant each page is to the user's need. The thesis uses the mature probabilistic retrieval model BM25F to obtain the relevance between a web page and the query keywords. In an experimental comparison with PageRank and Topic-Sensitive PageRank, the improved algorithm considerably improves the quality of the returned pages.

(3) To address PageRank's bias toward old pages, this thesis proposes an improvement based on the update rate of web pages. The traditional PageRank algorithm considers only the link structure among pages and does not use time as an evaluation criterion. A new page has existed only briefly, so it is much less likely to have been linked to by other pages, which puts new pages at a disadvantage. The improvement assumes that changes to a web page follow a Poisson process and uses a Poisson model to compute each page's update rate. In an experimental comparison with PageRank, the improved algorithm raises the ranking of new pages.
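The similarity-based redistribution of PR values described in the abstract can be sketched roughly as follows. The abstract does not give the exact formula, so this is only an illustration under stated assumptions: each page's link vector is taken to be its out-link indicators followed by its in-link indicators, similarity is cosine similarity, and a small constant `eps` keeps every linked page reachable. The function names and the damping factor `d = 0.85` are hypothetical choices, not the thesis's.

```python
import math


def link_vector(page, pages, out_links):
    """Binary link vector (assumed form): out-link flags, then in-link flags."""
    outs = [1.0 if q in out_links[page] else 0.0 for q in pages]
    ins = [1.0 if page in out_links[q] else 0.0 for q in pages]
    return outs + ins


def cosine(u, v):
    """Cosine similarity; defined as 0.0 when either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0


def similarity_pagerank(out_links, d=0.85, iters=100, eps=1e-6):
    """PageRank variant: a page splits its rank among its out-links in
    proportion to link-vector similarity instead of equally.

    out_links maps every page to the set of pages it links to; every page
    must appear as a key (assumption made to keep the sketch short).
    """
    pages = sorted(out_links)
    n = len(pages)
    vec = {p: link_vector(p, pages, out_links) for p in pages}

    # share[p][q]: fraction of p's rank that is passed on to q.
    share = {}
    for p in pages:
        raw = {q: cosine(vec[p], vec[q]) + eps for q in out_links[p]}
        total = sum(raw.values())
        share[p] = {q: w / total for q, w in raw.items()}

    rank = {p: 1.0 / n for p in pages}
    for _ in range(iters):
        # Rank held by pages without out-links is spread uniformly.
        dangling = sum(rank[p] for p in pages if not out_links[p])
        new = {p: (1.0 - d) / n + d * dangling / n for p in pages}
        for p in pages:
            for q, w in share[p].items():
                new[q] += d * rank[p] * w
        rank = new
    return rank


# Tiny hypothetical graph for illustration.
graph = {"A": {"B", "C"}, "B": {"C"}, "C": {"A"}, "D": {"C"}}
for page, score in sorted(similarity_pagerank(graph).items()):
    print(page, round(score, 4))
```

Normalising the similarities per source page keeps the scores a probability distribution, so they can be compared directly with ordinary PageRank scores computed on the same graph.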
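For the relevance step, the abstract names BM25F but gives no parameters. The sketch below follows the commonly cited form of BM25F, in which per-field term frequencies are length-normalised, weighted, summed, and then saturated. The field names (`title`, `body`), the field weights, the `b` values, `k1`, and the smoothed IDF are all illustrative assumptions, and how the thesis combines this score with PageRank is not stated, so that step is left out.

```python
import math
from collections import Counter


def bm25f_scores(docs, query, weights, b, k1=1.2):
    """Score each document against the query terms with BM25F.

    docs:    list of {field name: list of tokens}
    query:   list of query tokens
    weights: {field name: boost}, hypothetical values chosen by the caller
    b:       {field name: length-normalisation strength in [0, 1]}
    """
    n = len(docs)
    fields = list(weights)
    # Average field length; fall back to 1.0 if a field is empty everywhere.
    avg_len = {f: sum(len(d.get(f, [])) for d in docs) / n or 1.0 for f in fields}

    # Document frequency: number of documents containing a term in any field.
    df = Counter()
    for d in docs:
        seen = set()
        for f in fields:
            seen.update(d.get(f, []))
        df.update(seen)

    scores = []
    for d in docs:
        counts = {f: Counter(d.get(f, [])) for f in fields}
        score = 0.0
        for t in query:
            pseudo_tf = 0.0
            for f in fields:
                if not counts[f][t]:
                    continue  # term absent from this field
                norm = 1.0 - b[f] + b[f] * len(d[f]) / avg_len[f]
                pseudo_tf += weights[f] * counts[f][t] / norm
            # Smoothed IDF (assumption) so that scores are never negative.
            idf = math.log(1.0 + (n - df[t] + 0.5) / (df[t] + 0.5))
            score += idf * pseudo_tf / (k1 + pseudo_tf)
        scores.append(score)
    return scores


# Hypothetical three-page corpus.
pages = [
    {"title": ["pagerank", "algorithm"], "body": ["link", "analysis", "pagerank"]},
    {"title": ["cooking"], "body": ["pasta", "recipe"]},
    {"title": ["web", "mining"], "body": ["pagerank", "appears", "once", "in", "body", "text"]},
]
print(bm25f_scores(pages, ["pagerank"],
                   weights={"title": 3.0, "body": 1.0},
                   b={"title": 0.75, "body": 0.75}))
```

Weighting the title above the body means a page that mentions the query term in its title outranks one that mentions it only in a long body, which is the kind of query-dependent signal that a purely link-based score lacks.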
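The Poisson assumption behind the update-rate improvement can be illustrated with a short calculation. If a page changes as a Poisson process with rate λ, the maximum-likelihood estimate of λ is the number of observed changes divided by the length of the observation window, and the probability of at least one change within t days is 1 − e^(−λt). The function names and sample data below are hypothetical, and the sketch assumes every change is observed. A crawler that only revisits periodically misses changes between visits and would underestimate the rate. The abstract does not say how the rate enters the ranking, so that step is not shown.

```python
import math


def estimate_update_rate(change_days, observed_days):
    """Maximum-likelihood Poisson rate: changes per day over the window."""
    if observed_days <= 0:
        raise ValueError("observed_days must be positive")
    return len(change_days) / observed_days


def prob_changed_within(rate, days):
    """P(at least one change within `days`) under a Poisson process."""
    return 1.0 - math.exp(-rate * days)


# Hypothetical page that changed on days 2, 9, 15 and 28 of a 30-day window.
rate = estimate_update_rate([2, 9, 15, 28], observed_days=30)
print(rate, prob_changed_within(rate, 7))
```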
[Degree-granting institution]: Tianjin University of Technology
[Degree level]: Master's
[Year degree conferred]: 2017
[Classification number]: TP393.09; TP311.13

[Citation for this thesis]: Zheng Puheng (鄭普亨); Research on Web Data Mining Based on the PageRank Algorithm [D]; Tianjin University of Technology; 2017


Article ID: 1982904


Article link: http://sikaile.net/kejilunwen/ruanjiangongchenglunwen/1982904.html
