A Search Engine Ranking Algorithm Based on Word Relevance

Published: 2018-11-03 20:27

[Abstract]: The main task of a search engine is to collect information from the web and return links to pages relevant to the query terms a user supplies. As the Internet grows and the volume of information increases, crawling enough pages is not the hard part; the difficulty lies in organizing those pages, choosing a suitable ranking algorithm, and delivering the ranked links to the user interface. Current search engine ranking algorithms are based mainly on link structure, such as PageRank and HITS, and are combined with other algorithms to form improved ranking models that have worked well in practice. Link-based ranking has its own weaknesses, however: it analyzes natural language only weakly and is, to some extent, detached from how people understand language. This thesis therefore proposes a ranking algorithm based on word relevance. First, over a large corpus, it statistically analyzes the co-occurrence rate of words within documents, the distance between words, and the information gain of words in the corpus, derives the words related to a keyword in the document set, and records their degree of relevance. Second, after the retrieval interface receives the user's keywords, the related words and their relevance values are weighted into the PageRank algorithm according to a specific algorithm, which influences the ranking of web pages. Because this thesis does not implement a complete search engine, it retrieves documents through the existing search engine Google, re-ranks them with the algorithm above, and compares the result with Google's ranking. The comparative experiments indicate that the proposed algorithm can mitigate the problems of ranking based on link structure. It also has shortcomings: first, the corpus covers a single topic and the experiments are small in scope; second, the time efficiency of the algorithm at retrieval time was not adequately considered. The algorithm needs further improvement across broader domains and on the basis of more experimental analysis.
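The abstract names the ingredients of the method (co-occurrence, word distance, information gain, and a weighting into a link-based score) but gives no formulas. The following is a minimal sketch of that two-stage idea under stated assumptions, not the thesis's actual algorithm: the window size, the 1/distance weighting, the use of an IDF-style term as a simple stand-in for information gain, the linear blend controlled by `alpha`, and all function names are hypothetical choices made for illustration. `base_score` stands for any link-based score such as PageRank.

```python
import math
from collections import defaultdict


def related_terms(docs, keyword, window=5):
    """Score terms related to `keyword` over tokenized documents.

    Assumption: relatedness = sum of 1/distance over co-occurrences inside
    a +/- `window` token span, scaled by an IDF-style weight (used here in
    place of information gain), then normalized to [0, 1].
    """
    n_docs = len(docs)
    doc_freq = defaultdict(int)
    for tokens in docs:
        for term in set(tokens):
            doc_freq[term] += 1

    raw = defaultdict(float)
    for tokens in docs:
        for pos, tok in enumerate(tokens):
            if tok != keyword:
                continue
            lo = max(0, pos - window)
            hi = min(len(tokens), pos + window + 1)
            for j in range(lo, hi):
                if tokens[j] != keyword:
                    # Closer words contribute more.
                    raw[tokens[j]] += 1.0 / abs(j - pos)

    scores = {t: s * math.log(1 + n_docs / doc_freq[t]) for t, s in raw.items()}
    top = max(scores.values(), default=0.0)
    return {t: s / top for t, s in scores.items()} if top > 0 else {}


def rerank(pages, keyword, relatedness, alpha=0.5):
    """Re-rank pages given as (url, base_score, tokens) tuples.

    Assumption: final score = alpha * base_score
                            + (1 - alpha) * mean per-token relatedness,
    where the keyword itself counts as 1.0.
    """
    def content_score(tokens):
        if not tokens:
            return 0.0
        total = sum(1.0 if t == keyword else relatedness.get(t, 0.0) for t in tokens)
        return total / len(tokens)

    scored = [
        (alpha * base + (1 - alpha) * content_score(tokens), url)
        for url, base, tokens in pages
    ]
    return [url for _, url in sorted(scored, key=lambda x: x[0], reverse=True)]
```

With `alpha=1.0` the ordering reduces to the link-based score alone, which gives a convenient baseline when comparing against the original ranking, as the thesis does with Google's results.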
[Degree-granting institution]: Lanzhou University
[Degree level]: Master's
[Year degree granted]: 2012
[Classification number]: TP391.3


