Improving Search Engine Performance Based on Understanding the Needs Behind Long-Tail Queries

Published: 2018-11-09 08:27

【Abstract】: Search engines are an important tool for obtaining information. To find what they need, users must formulate queries, and query frequency follows a power-law distribution; we call the queries in the tail of that distribution long-tail queries. Analysis of real search engine data shows that long-tail queries account for about 70% of all distinct queries and that almost every user issues them. However, user behavior data for long-tail queries is sparse, so existing methods for optimizing retrieval quality are hard to apply directly, which makes long-tail queries a difficult problem for search engines. By sampling and analyzing real search engine logs, we find that a considerable share of long-tail queries fail to retrieve the right results because they are poorly worded, not because the web lacks resources that satisfy the user's need. To address this, we analyze how users reformulate their queries in order to understand their needs, help them find a suitable wording, and directly optimize the query results. The main contributions of this thesis are as follows:
1. Analysis and prediction of query reformulation patterns. Building on previous research, query reformulation behavior is divided into four types: New Topic, Generalization, Specification, and Parallel (parallel topic). Based on an analysis of sampled logs from a real search engine, we propose methods for predicting and classifying query reformulation patterns. Overall accuracy reaches 79.29%, which lays a foundation for further understanding user needs.
2. Automatic evaluation of result relevance for long-tail queries. We analyze how the relevance of result documents for long-tail queries relates to how they are displayed and clicked, extract click features, highlighting features (query terms marked in red), and search engine ranking features, and train a classifier based on ensemble learning. The classifier performs well at predicting result relevance.
3. A multi-result fusion method for improving long-tail query performance. By mining possible reformulations of a long-tail query, we look for queries with similar intent and more appropriate wording. The results of these reformulated queries are then fused and ranked together with the results of the original query, so that the long-tail query is improved directly at the level of the result list. Our approach introduces new results instead of only reordering existing ones. The ranking process also incorporates information that reflects whether the original query can be improved. Experiments on real search engine data show a significant improvement of 3.69% on the ERR@10 metric. Notably, the method is also effective for non-long-tail queries.
4. A system for improving long-tail query performance based on understanding user intent. The prediction of query reformulation behavior is combined with the multi-result fusion method, and personalized information about the individual user is introduced so that new result documents are brought in selectively, which further increases the performance gain.
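To make the ERR@10 figure cited in the abstract concrete, here is a minimal sketch of how Expected Reciprocal Rank at a cutoff is commonly computed. The abstract does not give the relevance grade scale or the exact variant used, so the gain mapping (2^g - 1) / 2^g_max, the function name `err_at_k`, and the toy grade lists are all assumptions for illustration only.

```python
from typing import Sequence


def err_at_k(grades: Sequence[int], k: int = 10, max_grade: int = 4) -> float:
    """Expected Reciprocal Rank over the top-k results.

    grades[i] is the relevance grade of the result at rank i + 1.
    Assumed gain mapping: R = (2**g - 1) / 2**max_grade, read as the
    probability that a user is satisfied by a result of grade g.
    """
    err = 0.0
    p_reach = 1.0  # probability that the user reaches the current rank
    for rank, grade in enumerate(grades[:k], start=1):
        p_satisfied = (2 ** grade - 1) / (2 ** max_grade)
        err += p_reach * p_satisfied / rank
        p_reach *= 1.0 - p_satisfied
    return err


if __name__ == "__main__":
    # Hypothetical binary grades (max_grade=1) for one query:
    # the original list versus a list fused with reformulation results.
    original = [0, 1, 0, 0]
    fused = [1, 0, 1, 0]
    base = err_at_k(original, max_grade=1)
    improved = err_at_k(fused, max_grade=1)
    print(f"ERR@10 original={base:.4f} fused={improved:.4f}")
    print(f"relative change={(improved - base) / base:.2%}")
```

Because ERR discounts each result by the probability that an earlier result already satisfied the user, placing a newly introduced relevant document near the top of the list raises the score more than reordering lower-ranked results does.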
【Degree-granting institution】: Tsinghua University
【Degree level】: Master's
【Year degree granted】: 2015
【Classification number】: TP391.3

Huo Shuai; Improving Search Engine Performance Based on Understanding the Needs Behind Long-Tail Queries [D]; Tsinghua University; 2015


Article ID: 2319884


Article link: http://sikaile.net/kejilunwen/sousuoyinqinglunwen/2319884.html
