Search Engine Performance Improvement Based on Understanding Long-Tail Query Intent
[Abstract]: Search engines are an important tool for obtaining information. To search for a target, a user must first formulate a query. Query frequencies follow a power-law distribution, and we call the queries in the tail of this distribution long-tail queries. An analysis of real search engine data shows that long-tail queries account for about 70% of all distinct queries and that almost all users issue long-tail queries. However, user behavior data for long-tail queries is sparse, so existing search quality optimization methods are difficult to apply directly, which makes long-tail queries a hard problem for search engines. Through a sampled analysis of real search engine logs, we find that some long-tail queries fail to retrieve the right results because the query is poorly expressed, not because the Web lacks resources that satisfy the user's need. To address this, we try to understand the user's intent by analyzing query rewriting behavior, help the user find an appropriate query expression, and directly optimize the query results. The main contributions of this paper are as follows:
1. Analysis and prediction of query rewriting behavior patterns. Building on previous research, query rewriting behavior is divided into four types: New Topic, Generalization, Specification (specialization), and Parallel (parallel topic). Based on an analysis of sampled data from real search engine logs, a method for predicting and classifying query rewriting patterns is proposed. Its overall accuracy reaches 79.29%, which provides a basis for further understanding user needs.
2. Automatic evaluation of the relevance of long-tail query results. This paper analyzes how the relevance of a long-tail query's result documents relates to how they are displayed and clicked, extracts click features, highlighting ("red") features, and search engine ranking features, and trains a classifier based on ensemble learning. The classifier achieves good results in predicting result relevance.
3. A long-tail query performance improvement method based on multi-result fusion. By mining candidate rewrites of a long-tail query, we find queries with similar intent and a more appropriate expression. The results of these rewrites are then fused with the results of the original query, so the long-tail query is improved directly at the level of the result list. Our approach introduces new results rather than only reordering existing ones. During ranking, information is added that reflects whether the original query can be improved. Experiments on real search engine data show that the method significantly improves the ERR@10 metric by 3.69%. The method is also effective for non-long-tail queries.
4. A long-tail query performance improvement system based on understanding user intent. The prediction of query rewriting behavior is combined with multi-result fusion, and personalized information about the individual user is used to introduce new result documents in a targeted way, which further improves performance.
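The four rewriting types in contribution 1 can be illustrated with a simple term-overlap rule. This is only a sketch of a common way to define the categories, not the prediction method of the thesis, whose features and classifier are not given in the abstract. The function name `reformulation_type` and the whitespace tokenization are assumptions made for illustration; Chinese queries would need a proper word segmenter.

```python
def reformulation_type(prev_query: str, next_query: str) -> str:
    """Label a query rewrite using term-set overlap (illustrative heuristic only)."""
    # Assumption: whitespace tokenization; real logs would need a word segmenter.
    prev_terms = set(prev_query.lower().split())
    next_terms = set(next_query.lower().split())

    if not prev_terms & next_terms:
        return "New Topic"        # no shared terms
    if next_terms < prev_terms:
        return "Generalization"   # terms were removed, none added
    if next_terms > prev_terms:
        return "Specification"    # terms were added, none removed
    # Shared terms with changes on both sides. Identical term sets also land
    # here; a real pipeline would filter repeated queries out beforehand.
    return "Parallel"


for prev, nxt in [
    ("python list sort", "python sort"),
    ("jaguar", "jaguar car price"),
    ("hotel beijing", "hotel shanghai"),
    ("hotel beijing", "weather"),
]:
    print(f"{prev!r} -> {nxt!r}: {reformulation_type(prev, nxt)}")
```

A rule like this could at most serve as a baseline or as one input feature; the 79.29% accuracy reported above refers to the thesis's own method, not to this heuristic.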
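For readers unfamiliar with the ERR@10 metric cited in contribution 3, the following sketch computes Expected Reciprocal Rank over the top 10 results using its standard cascade-model definition. The function name `err_at_k` and the 0 to 4 relevance grade scale are assumptions; the abstract does not state which grade scale the thesis uses.

```python
def err_at_k(grades, k=10, max_grade=4):
    """Expected Reciprocal Rank at cutoff k for a ranked list of relevance grades."""
    err = 0.0
    p_reach = 1.0  # probability that the user reaches the current rank
    for rank, grade in enumerate(grades[:k], start=1):
        # Probability that the result at this rank satisfies the user.
        p_satisfy = (2 ** grade - 1) / (2 ** max_grade)
        err += p_reach * p_satisfy / rank
        p_reach *= 1.0 - p_satisfy
    return err


# Assumed grades (0 = irrelevant, 4 = perfect) for two rankings of the same documents.
print(err_at_k([0, 4, 1]))
print(err_at_k([4, 1, 0]))
```

Because each result is discounted by the probability that an earlier result already satisfied the user, moving one highly relevant document toward the top raises the score more than adding several marginal documents further down.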
[Degree-granting institution]: Tsinghua University
[Degree level]: Master's
[Year degree conferred]: 2015
[Classification number]: TP391.3
Huo Shuai (霍帥). Search Engine Performance Improvement Based on Understanding Long-Tail Query Intent [D]. Tsinghua University, 2015.
Article ID: 2319884
Article link: http://sikaile.net/kejilunwen/sousuoyinqinglunwen/2319884.html