A Search Engine Ranking Algorithm Based on Word Relevance
[Abstract]: The main task of a search engine is to collect information from the web and, given the keywords a user supplies, return links to pages related to those keywords. As the Internet grows and the volume of information increases, crawling enough web pages is no longer the hard part. The difficulty lies in ordering those pages: choosing an appropriate ranking algorithm and presenting the resulting links to the user. Current search engine ranking algorithms are mainly based on link structure, such as the PageRank and HITS algorithms, and are combined with other algorithms to form improved ranking models; in practice these produce good search results. However, link-based ranking has its own shortcomings: its analysis of natural language is weak, and to some degree it is detached from any understanding of the language. This thesis therefore proposes a ranking algorithm based on word relevance. First, over a large corpus, the co-occurrence rate of words, the distance between words, and the information gain of words in the corpus are analyzed statistically, yielding the related words in the document set together with their degree of correlation. Second, once the user's keywords are obtained from the retrieval interface, the related words and their correlation values are weighted into the PageRank algorithm, which changes the ranking of the web pages. Because this thesis does not build a complete search engine system, documents are obtained from the existing search engine Google, re-ranked with the algorithm above, and compared against Google's own ordering. Comparative experiments indicate that the proposed algorithm mitigates the shortcomings of ranking based on link structure.
The work also has shortcomings: first, the corpus covers a single subject and the scope of the experiments is small; second, the time efficiency of the retrieval algorithm is not well considered. The proposed algorithm needs further improvement, based on a wider range of subject areas and more experimental analysis.
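The two steps described in the abstract can be illustrated with a minimal Python sketch. It is not the thesis's actual method: the abstract does not give the weighting formula, so the distance-discounted co-occurrence score, the frequency normalization, the linear mix controlled by `alpha`, and the function names `cooccurrence_relatedness` and `rerank` are all assumptions made for illustration. Information gain is omitted, and each page's link-based score is simply taken as given.

```python
from collections import Counter


def cooccurrence_relatedness(docs, window=5):
    """Score word pairs by windowed co-occurrence (illustrative assumption).

    docs: list of token lists. Each co-occurrence within `window` tokens
    contributes 1/distance, so closer words count more. The sum is then
    divided by the combined frequency of the two words.
    """
    pair_score = Counter()
    word_count = Counter()
    for tokens in docs:
        word_count.update(tokens)
        for i, w in enumerate(tokens):
            for j in range(i + 1, min(i + 1 + window, len(tokens))):
                v = tokens[j]
                if v == w:
                    continue
                pair_score[tuple(sorted((w, v)))] += 1.0 / (j - i)
    return {
        pair: score / (word_count[pair[0]] + word_count[pair[1]])
        for pair, score in pair_score.items()
    }


def rerank(query, pages, relatedness, alpha=0.5):
    """Re-rank pages by mixing a link score with a word-relevance score.

    pages: list of (url, link_score, tokens). The linear mix with `alpha`
    is a hypothetical stand-in for the thesis's unspecified weighting.
    """
    def text_score(tokens):
        if not tokens:
            return 0.0
        total = 0.0
        for t in tokens:
            if t == query:
                total += 1.0
            else:
                total += relatedness.get(tuple(sorted((query, t))), 0.0)
        return total / len(tokens)

    scored = [
        (alpha * link + (1 - alpha) * text_score(tokens), url)
        for url, link, tokens in pages
    ]
    return [url for _, url in sorted(scored, reverse=True)]


# Toy corpus and pages, invented purely for the example.
corpus = [["search", "engine", "ranking"], ["search", "engine", "index"]]
rel = cooccurrence_relatedness(corpus)
pages = [
    ("a", 0.5, ["cooking", "recipes"]),
    ("b", 0.5, ["engine", "ranking"]),
]
order = rerank("search", pages, rel)
```

With equal link scores, page `b` moves ahead of page `a` because its text contains words that co-occur with the query word in the toy corpus, which is the kind of effect the abstract attributes to weighting word relevance into a link-based ranking.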
[Degree-granting institution]: Lanzhou University
[Degree level]: Master's
[Year degree awarded]: 2012
[Classification number]: TP391.3