Research on Link Analysis and Topic Detection in Web Mining
Topics: Web information retrieval + Web mining. Source: Dalian University of Technology, doctoral dissertation, 2012
[Abstract]: The Web has become the main platform on which people store and share information. Retrieving useful information from such a vast source is a highly challenging problem. Characteristics of the Web, such as its large volume of unstructured or semi-structured documents and multimedia content and the uneven quality of its pages, make traditional information retrieval techniques designed for structured data difficult to apply effectively. Web information retrieval has become a discipline in its own right, with a very broad research scope.
This thesis addresses several current research topics in Web information retrieval.
First, it studies web page ranking algorithms, an important component of modern search engines. To address the shortcomings of HITS, the mainstream topic-related ranking algorithm, it proposes G-HITS, a ranking algorithm based on a gravity model. The model treats web pages as particles, expresses the various factors that affect ranking as the mass of a page or the distance between pages, and uses universal gravitation to describe the relationships between pages, thereby overcoming the shortcomings of purely link-based ranking algorithms.
Second, in response to increasingly rampant manipulation of web page rankings, it studies how to counter link-based ranking spam. It first analyzes the limitation that the well-known TrustRank and Anti-TrustRank algorithms can each propagate only trust or only distrust, and then proposes a unified framework that propagates trust and distrust at the same time. The framework overcomes this limitation of TrustRank and Anti-TrustRank and improves the efficiency of combating ranking spam.
Third, it studies community identification on the Web. Communities are an important phenomenon on the Web and reflect how topics are distributed across it. Community identification can discover this topic distribution by mining dense subgraphs of the Web graph. Existing community identification algorithms all take the web page as the basic unit, but each web page contains multiple topics. The thesis proposes a community identification algorithm based on web page segmentation into blocks, which resolves the multi-topic problem and noticeably improves the precision of community identification.
Finally, it studies topic detection on the Web. To detect topics more effectively, it first studies spectral clustering, improves on existing spectral clustering algorithms, and applies the improved algorithm to topic detection. It then proposes a topic detection algorithm based on hypergraph partitioning. This algorithm performs a second round of extraction on Web features and uses a hypergraph partitioning algorithm to detect topics, noticeably improving detection precision.
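For readers unfamiliar with HITS, the following is a minimal Python sketch of the baseline hub/authority iteration that the gravity-model ranking sets out to improve. It is not the thesis's algorithm: the optional `weights` argument and the `gravity_weight` helper (Newton's form, mass product over squared distance) are assumptions added only to show how a gravity-style edge weight could enter the iteration, and the toy graph and page names are hypothetical.

```python
from math import sqrt


def gravity_weight(mass_src, mass_dst, distance):
    """Hypothetical edge weight in Newton's form m1 * m2 / r^2 (an assumption,
    not the formula used in the thesis)."""
    return mass_src * mass_dst / distance ** 2


def _normalise(scores):
    """Scale a score dict to unit Euclidean length."""
    norm = sqrt(sum(v * v for v in scores.values())) or 1.0
    return {node: v / norm for node, v in scores.items()}


def hits(edges, weights=None, iterations=50):
    """Baseline HITS on a directed edge list; returns (authority, hub) dicts.

    `weights` optionally maps (src, dst) to an edge weight; missing edges
    default to 1.0, which gives plain HITS.
    """
    weights = weights or {}
    nodes = {n for edge in edges for n in edge}
    hub = dict.fromkeys(nodes, 1.0)
    auth = dict.fromkeys(nodes, 1.0)
    for _ in range(iterations):
        # Authority update: sum the hub scores of the pages linking in.
        new_auth = dict.fromkeys(nodes, 0.0)
        for src, dst in edges:
            new_auth[dst] += weights.get((src, dst), 1.0) * hub[src]
        auth = _normalise(new_auth)
        # Hub update: sum the authority scores of the pages linked to.
        new_hub = dict.fromkeys(nodes, 0.0)
        for src, dst in edges:
            new_hub[src] += weights.get((src, dst), 1.0) * auth[dst]
        hub = _normalise(new_hub)
    return auth, hub


# Hypothetical toy graph: a and b link to c, a also links to d.
edges = [("a", "c"), ("b", "c"), ("a", "d")]
plain_auth, plain_hub = hits(edges)

# Give the a -> d link a heavier, gravity-style weight (two "massive" pages
# that are close together).
heavy = {("a", "d"): gravity_weight(2.0, 2.0, 1.0)}
weighted_auth, _ = hits(edges, weights=heavy)
```

On this toy graph, page `c` is linked from both `a` and `b`, so it is the top authority under plain HITS; weighting the `a -> d` link more heavily moves `d` ahead, which is the kind of effect a non-link factor could have once it is folded into the edge weights.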
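The trust and distrust propagation mentioned in the abstract can be illustrated with biased (personalised) PageRank, the usual way TrustRank-style scores are computed: trust flows forward along links from trusted seed pages, while Anti-TrustRank-style distrust flows backward from known spam pages. The Python sketch below is an illustration under stated assumptions, not the thesis's framework: the `trust - distrust` combination, the damping factor of 0.85, and the toy link graph are all hypothetical choices.

```python
def biased_pagerank(out_links, seeds, damping=0.85, iterations=50):
    """PageRank whose teleport mass goes only to the seed pages.

    Score that reaches a page with no out-links is simply dropped, so the
    scores need not sum to 1.
    """
    nodes = set(out_links) | {t for targets in out_links.values() for t in targets}
    teleport = {n: (1.0 / len(seeds) if n in seeds else 0.0) for n in nodes}
    score = dict(teleport)
    for _ in range(iterations):
        nxt = {n: (1.0 - damping) * teleport[n] for n in nodes}
        for src in nodes:
            targets = out_links.get(src, [])
            for dst in targets:
                nxt[dst] += damping * score[src] / len(targets)
        score = nxt
    return score


def reverse(out_links):
    """Flip every link so that scores propagate from a page to its in-linkers."""
    rev = {}
    for src, targets in out_links.items():
        for dst in targets:
            rev.setdefault(dst, []).append(src)
    return rev


# Hypothetical web graph: good -> a -> b, and y -> x -> spam.
links = {"good": ["a"], "a": ["b"], "y": ["x"], "x": ["spam"]}

trust = biased_pagerank(links, seeds={"good"})              # forward from trusted seeds
distrust = biased_pagerank(reverse(links), seeds={"spam"})  # backward from spam seeds

# Assumed combination rule: positive means mostly trusted, negative mostly distrusted.
combined = {page: trust[page] - distrust[page] for page in trust}
```

Here pages reachable from `good` end up with positive combined scores that decay with distance from the seed, and pages that lead to `spam` end up negative. Running only one of the two passes leaves the other half of the graph at zero, which is the gap a combined framework is meant to close.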
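To make "mining dense subgraphs of the Web graph" concrete, here is a short Python sketch of greedy peeling, a standard heuristic for the densest-subgraph problem. It is a swapped-in, well-known technique chosen for illustration, since the abstract does not say which dense-subgraph method the thesis uses. The node names are hypothetical; in a block-based variant the nodes would be page blocks rather than whole pages. The sketch assumes a non-empty, simple undirected graph with no self-loops.

```python
def densest_subgraph(edges):
    """Greedy peeling: repeatedly delete a minimum-degree node and remember
    the densest intermediate node set (density = number of edges / number of nodes)."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    n_edges = sum(len(nbrs) for nbrs in adj.values()) // 2
    best_nodes, best_density = set(adj), n_edges / len(adj)
    while len(adj) > 1:
        # Pick the lowest-degree node; break ties by name for determinism.
        victim = min(adj, key=lambda n: (len(adj[n]), n))
        for nbr in adj.pop(victim):
            adj[nbr].discard(victim)
            n_edges -= 1
        density = n_edges / len(adj)
        if density > best_density:
            best_nodes, best_density = set(adj), density
    return best_nodes, best_density


# Hypothetical graph: a 4-clique p1..p4 with a sparse tail p4 - q1 - q2.
edges = [
    ("p1", "p2"), ("p1", "p3"), ("p1", "p4"),
    ("p2", "p3"), ("p2", "p4"), ("p3", "p4"),
    ("p4", "q1"), ("q1", "q2"),
]
community, density = densest_subgraph(edges)
```

Peeling strips the sparse tail first, so the tightly interlinked clique is what remains as the candidate community.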
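As background for the spectral clustering step, the simplest spectral method splits a similarity graph in two using the Fiedler vector, the eigenvector for the second-smallest eigenvalue of the graph Laplacian L = D - W. The pure-Python sketch below finds that vector by power iteration and is only a textbook baseline under stated assumptions: it is not the improved algorithm proposed in the thesis, the document similarity matrix is made up, and the graph is assumed to be connected with a symmetric, non-negative similarity matrix.

```python
def fiedler_bipartition(similarity, iterations=200):
    """Two-way spectral partition of a symmetric similarity matrix.

    Runs power iteration on (shift * I - L), restricted to vectors orthogonal
    to the constant vector, which converges to the Fiedler vector of L = D - W.
    Returns a 0/1 label per item, given by the sign of its Fiedler entry.
    """
    n = len(similarity)
    degree = [sum(row) for row in similarity]
    shift = 2.0 * max(degree)          # upper bound on the largest eigenvalue of L
    x = [float(i) for i in range(n)]   # any non-constant start vector
    for _ in range(iterations):
        mean = sum(x) / n
        x = [v - mean for v in x]      # project out the trivial constant eigenvector
        # y = (shift * I - L) x, where (L x)_i = degree_i * x_i - sum_j W_ij * x_j
        y = [
            shift * x[i]
            - (degree[i] * x[i] - sum(similarity[i][j] * x[j] for j in range(n)))
            for i in range(n)
        ]
        norm = sum(v * v for v in y) ** 0.5
        x = [v / norm for v in y]
    return [1 if v > 0 else 0 for v in x]


# Hypothetical similarities among six documents: two tight groups of three,
# joined by one weak link between documents 2 and 3.
similarity = [
    [0.0, 1.0, 1.0, 0.0, 0.0, 0.0],
    [1.0, 0.0, 1.0, 0.0, 0.0, 0.0],
    [1.0, 1.0, 0.0, 0.1, 0.0, 0.0],
    [0.0, 0.0, 0.1, 0.0, 1.0, 1.0],
    [0.0, 0.0, 0.0, 1.0, 0.0, 1.0],
    [0.0, 0.0, 0.0, 1.0, 1.0, 0.0],
]
labels = fiedler_bipartition(similarity)
```

The two groups of documents land on opposite sides of the split. For more than two topics, the usual extension is to take several of the smallest eigenvectors and cluster the rows with k-means, which a numerical library would handle better than hand-written power iteration.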
[Degree-granting institution]: Dalian University of Technology
[Degree level]: Doctoral
[Year degree awarded]: 2012
[Classification number]: TP391.3; TP393.092
Article ID: 2094147
Article link: http://sikaile.net/kejilunwen/sousuoyinqinglunwen/2094147.html