An Improvement Strategy for the HITS Algorithm in Web Mining

Published: 2018-04-30 18:25

  Topic: Web mining + HITS algorithm; Source: Jilin University, master's thesis, 2013


[Abstract]: The 21st century is an era of steadily increasing informatization and rapid progress in network technology. More and more information is being consolidated, stored, and transmitted over the Internet. At a time when almost every activity depends on computers, obtaining the required information quickly and efficiently from this mass of data is a problem that people must face.

The emergence and development of search engine technology has made it possible to crawl and retrieve information on the network. No technology is perfect, however: because search engines are designed for general-purpose use, they show no preference when selecting web pages and therefore cannot crawl in a more precise and principled way. Web pages contain many kinds of complex information in complex data structures, which makes accurate page analysis, crawling, and retrieval particularly difficult. Web mining grew out of traditional data mining. It analyzes the relevance of a page's structure, text, or other content, and thereby extracts structured information suitable for data mining from the semi-structured documents in web pages. The question studied in this thesis is how to provide an effective and accurate information retrieval scheme.

The thesis first introduces the classic link analysis algorithms in Web mining, HITS and PageRank, and analyzes their strengths and weaknesses. HITS is chosen as the base algorithm. Experiments show that HITS is insensitive to the timeliness of information and that it cannot identify redundant, invalid links. To address this, the thesis proposes a method based on a time-decay parameter that modifies the traditional HITS algorithm, yielding the TM-HITS algorithm. Experiments were run on simulated data and on real data obtained by web crawling. The results indicate that the algorithm effectively retrieves more timely pages while largely avoiding interference from useless inbound links, whether malicious or not, such as advertising links and invalid pages.

Finally, drawing on these experiments and improvements, the thesis looks ahead to future trends in Web mining and proposes a feasible approach to an information retrieval model that combines the two link analysis algorithms. Link analysis modules integrating different algorithms can be deployed on the server side and on the client side, so that searches can run at different levels of precision depending on user needs. Machine learning can also be introduced to refine the model continuously, with the aim of supporting intelligent retrieval and letting users customize retrieval services to their own preferences.
[Degree-granting institution]: Jilin University
[Degree level]: Master's
[Year degree granted]: 2013
[Classification number]: TP311.13
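
The abstract describes adding a time-decay parameter to HITS but does not give the TM-HITS formula. The sketch below is therefore only an illustration of the general idea, not the thesis's method: it runs standard HITS by power iteration and, as an assumption, weights each link by exp(-decay * age) so that older links contribute less. The function name `hits_with_time_decay`, the `(source, target, timestamp)` edge format, and the exponential form of the decay are all hypothetical choices made for this example.

```python
import math


def hits_with_time_decay(edges, iterations=50, decay=0.0, now=None):
    """Return (authority, hub) scores for a directed link graph.

    edges: iterable of (source, target, timestamp) tuples; the timestamp
           unit (for example days) is a hypothetical choice.
    decay: assumed exponential decay rate; 0.0 gives standard HITS.
    now:   reference time; defaults to the newest timestamp in edges.
    """
    edges = list(edges)
    nodes = sorted({n for s, t, _ in edges for n in (s, t)})
    if now is None:
        now = max(ts for _, _, ts in edges)

    # Assumed weighting: a link of age (now - ts) counts exp(-decay * age).
    weights = {(s, t): math.exp(-decay * (now - ts)) for s, t, ts in edges}

    def normalize(scores):
        # Scale to unit Euclidean length so scores stay bounded.
        norm = math.sqrt(sum(v * v for v in scores.values())) or 1.0
        return {n: v / norm for n, v in scores.items()}

    auth = {n: 1.0 for n in nodes}
    hub = {n: 1.0 for n in nodes}
    for _ in range(iterations):
        # Authority update: weighted sum of hub scores of pages linking in.
        new_auth = {n: 0.0 for n in nodes}
        for (s, t), w in weights.items():
            new_auth[t] += w * hub[s]
        auth = normalize(new_auth)

        # Hub update: weighted sum of authority scores of pages linked to.
        new_hub = {n: 0.0 for n in nodes}
        for (s, t), w in weights.items():
            new_hub[s] += w * auth[t]
        hub = normalize(new_hub)
    return auth, hub


# Two pages with identical link structure; links to "new" are 100 days newer.
edges = [("a", "old", 0), ("b", "old", 0), ("a", "new", 100), ("b", "new", 100)]
plain_auth, _ = hits_with_time_decay(edges)
decayed_auth, _ = hits_with_time_decay(edges, decay=0.05)
```

With `decay=0.0` the two target pages are indistinguishable to HITS because their inbound links are structurally identical, which mirrors the insensitivity to timeliness noted in the abstract. With a positive `decay`, the page reached by newer links receives the higher authority score. How the thesis actually defines and tunes its decay parameter is not stated here.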
