Keyword Weight Calculation for Web Page Ranking

Published: 2018-11-01 16:11

[Abstract]: With the development of information technology and the growing reach of the Internet, search engines have become increasingly important. Today's mainstream search engines are keyword based. In such engines, the accuracy with which the weight of each word in a user's query is computed directly affects the quality of the subsequent web page ranking, so computing the weights of the query terms correctly is essential. This thesis seeks a method for computing keyword weights in user queries that is oriented toward web page ranking, so that keyword-based search engines perform better at the ranking stage and provide a sound basis for subsequent retrieval processing. To that end, the work comprises three parts.

(1) Analysis of the characteristics of user queries. For 5,000 queries annotated with core words, the relationship between query characteristics and word weights is analyzed. The stop words that occur in queries are analyzed alongside the stop words of a modern Chinese corpus, and stop words in queries of different categories are analyzed with examples.

(2) Keyword weight calculation for web page ranking. The user query log is word-segmented and part-of-speech tagged, and keyword extraction is treated as a classification task. Drawing on the characteristics of the queries, eight contextual features of each word are selected as the features for decision-tree forest classification, and the calculation of each feature is described. An error analysis of the experimental results follows, and rules are added to post-process the classifier's output.

(3) Analysis of experimental results. The decision tree classification method is compared with traditional keyword extraction and weight calculation methods. About 1,000 queries are randomly sampled from the user query log for manual evaluation, and the model's accuracy and recall are evaluated with cross-validation. The win rate of the model is compared against the traditional weight calculation method used in web page ranking. Finally, several queries are run on Baidu to compare the ranking obtained when searching with the keyword sequence determined by the model against the ranking obtained with the unprocessed query. The results show that the keyword extraction and weight calculation method adopted in this thesis is feasible for weight calculation in web page ranking.
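The idea of treating keyword extraction as a per-word classification task, and scoring it against manual annotation, can be sketched as follows. This is a minimal illustration under stated assumptions, not the thesis's implementation: the stop-word list, the four features, and the `classify` rule are hypothetical stand-ins (the abstract does not list the eight features, and the thesis trains a decision-tree forest rather than using a fixed rule). The tokens are English and already segmented for readability, and the "accuracy" reported alongside recall is assumed to be precision in the usual information-retrieval sense.

```python
from typing import Dict, List, Tuple

# Hypothetical stop-word list, for illustration only.
STOP_WORDS = {"how", "to", "the", "a", "of", "what", "is"}


def word_features(words: List[str], index: int) -> Dict[str, object]:
    """Illustrative context features for one word (not the thesis's eight)."""
    word = words[index]
    return {
        "is_stop_word": word in STOP_WORDS,
        "length": len(word),
        "relative_position": index / max(len(words) - 1, 1),
        "is_last": index == len(words) - 1,
    }


def classify(features: Dict[str, object]) -> bool:
    """Stand-in for a trained classifier: keep non-stop words longer than 2 characters."""
    return (not features["is_stop_word"]) and features["length"] > 2


def precision_recall(predicted: List[bool], gold: List[bool]) -> Tuple[float, float]:
    """Precision and recall of predicted keyword labels against manual labels."""
    tp = sum(p and g for p, g in zip(predicted, gold))
    fp = sum(p and not g for p, g in zip(predicted, gold))
    fn = sum(g and not p for p, g in zip(predicted, gold))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


# One segmented query with hypothetical manual keyword labels.
query = ["how", "to", "install", "python", "quickly", "on", "windows"]
gold = [False, False, True, True, False, False, True]

predicted = [classify(word_features(query, i)) for i in range(len(query))]
precision, recall = precision_recall(predicted, gold)
print(predicted, precision, recall)
```

Here the stand-in rule wrongly keeps "quickly", so precision drops while recall stays at its maximum. In a cross-validated setup, the same scoring would be repeated on each held-out fold and averaged.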
[Degree-granting institution]: Graduate School of Chinese Academy of Social Sciences
[Degree level]: Master's
[Year degree granted]: 2013
[Classification number]: TP391.3


Article ID: 2304434


Article link: http://sikaile.net/kejilunwen/sousuoyinqinglunwen/2304434.html
