An Improved Strategy for the HITS Algorithm in Web Mining

Published: 2018-04-30 18:25

Topic: Web mining + HITS algorithm; Reference: Jilin University, 2013 master's thesis

[Abstract]: The 21st century is an era of steadily increasing informatization and rapidly developing network technology. More and more information is being integrated and centralized, then stored and transmitted over the Internet. In an era when almost nothing can be done without a computer, obtaining the required information quickly and efficiently from this mass of data is a problem everyone must face. The emergence and development of search engine technology has made it possible to crawl and retrieve information on the network. No technology is perfect, however: because search engines are designed for general use, their selection of web pages carries no preference, so they cannot crawl in a more precise and principled way. Web pages contain many kinds of complex information, and the data is structured in complex ways, which makes accurate analysis of web pages, and the crawling and retrieval of information from them, particularly difficult. Web mining technology grew out of traditional data mining. It analyzes the relevance of a page's structural information, text, or other content, and can thereby extract structured information suitable for data mining from the semi-structured documents in web pages. The topic of this thesis is how to provide an effective and accurate information retrieval scheme. The thesis first introduces HITS and PageRank, the classic link analysis algorithms in Web mining, and analyzes their advantages and disadvantages. HITS is chosen as the base algorithm for the study. Experiments show that HITS is insensitive to time-sensitive information and, in addition, cannot identify redundant invalid links. To address this, the thesis proposes a method based on a time decay parameter that modifies the traditional HITS algorithm, yielding the TM-HITS algorithm. Experiments were run on simulated data and on real data obtained by web crawling. The results indicate that the algorithm effectively retrieves more timely web pages while largely avoiding interference from useless inbound links, malicious or not, such as advertising links and invalid pages. Drawing on these experiments and improvements, the thesis also looks ahead to future trends in Web mining and proposes a feasible approach to an information retrieval model that combines the two link analysis algorithms. In this approach, link analysis modules integrating different algorithms can be set up on the server side and the client side, so that searches can be run at different levels of precision according to each user's needs. The approach can also use machine learning to keep refining the model, with the aim of meeting deeper requirements such as intelligent retrieval and retrieval services that users customize to their own preferences.
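The abstract does not give the TM-HITS update rule, so the following is only a minimal sketch of how a time decay parameter could enter HITS, not the thesis's actual method. It assumes each link is weighted by `exp(-decay_rate * age)` where `age` is the age in days of the page the link points to; the names `hits`, `time_decay_weight`, `page_age_days`, and `decay_rate` are all hypothetical.

```python
import math


def hits(edges, weight=None, iterations=50):
    """Weighted HITS by power iteration.

    edges: iterable of (source, target) page pairs.
    weight: optional function (source, target) -> non-negative float;
            when omitted every link counts as 1.0 (classic HITS).
    Returns (hub, authority) dicts with L2-normalized scores.
    """
    edges = sorted(set(edges))
    nodes = sorted({page for edge in edges for page in edge})
    w = {(u, v): (weight(u, v) if weight else 1.0) for u, v in edges}
    hub = {n: 1.0 for n in nodes}
    auth = {n: 1.0 for n in nodes}

    def normalize(scores):
        norm = math.sqrt(sum(x * x for x in scores.values())) or 1.0
        return {n: x / norm for n, x in scores.items()}

    for _ in range(iterations):
        # Authority update: weighted sum of hub scores of pages linking in.
        new_auth = {n: 0.0 for n in nodes}
        for u, v in edges:
            new_auth[v] += w[(u, v)] * hub[u]
        auth = normalize(new_auth)

        # Hub update: weighted sum of authority scores of pages linked to.
        new_hub = {n: 0.0 for n in nodes}
        for u, v in edges:
            new_hub[u] += w[(u, v)] * auth[v]
        hub = normalize(new_hub)
    return hub, auth


def time_decay_weight(page_age_days, decay_rate=0.01):
    """Assumed weighting: links to older pages count exponentially less."""
    def weight(source, target):
        return math.exp(-decay_rate * page_age_days.get(target, 0.0))
    return weight


# Two hubs link to both an old and a new page (hypothetical data).
links = [("a", "old"), ("a", "new"), ("b", "old"), ("b", "new")]
ages = {"old": 365.0, "new": 1.0}

_, plain_auth = hits(links)
_, decayed_auth = hits(links, weight=time_decay_weight(ages))
```

With identical link structure, unweighted HITS cannot separate the two target pages, whereas the decayed weights rank the newer page higher. Handling redundant or advertising links, which the abstract also mentions, would need a further rule that this sketch does not attempt.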
[Degree-granting institution]: Jilin University
[Degree level]: Master's
[Year degree granted]: 2013
[Classification number]: TP311.13



Article link: http://sikaile.net/kejilunwen/sousuoyinqinglunwen/1825669.html
