Research and Design of a Personalized Retrieval System Based on Classification Technology

Published: 2018-07-14 07:44

[Abstract]: With the rapid development of the Internet and network information technology, online resources are growing exponentially. The results returned by a traditional general-purpose search engine depend only on the query keywords, yet different users who submit the same query may have different intentions and expect different results. A search tool that returns more precise results according to individual characteristics is therefore needed, and this thesis proposes a user-centered, classification-based personalized search engine. Building on a fairly comprehensive analysis of technologies related to personalized information retrieval, the thesis examines the techniques commonly used in personalized search engines and the main techniques search engines use to understand a user's search intent, and it builds a model of the retrieval system from users' browsing and query logs. It then introduces automatic text classification, presents several common text representation models, and uses WEKA and LibSVM to classify text automatically. Based on text classification, a ranking algorithm is proposed that shows as many categories as possible in the retrieval results, so that users interested in as many different categories as possible can find information in their topic category. In addition, user behavior features, namely a user's click-through rate for each topic category and the average time spent on pages in each topic category, are used to modify Lucene's scoring field and thereby change Lucene's native ranking scores for documents. Experiments show that, once these behavior features are taken into account, users with different interests who query the same terms retrieve different result pages.
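The two ranking ideas in the abstract can be illustrated with a small Python sketch. It is not the thesis's implementation and does not use Lucene's API: `Hit`, `interest_weights`, `personalize`, and `diversify` are hypothetical names, `base_score` stands in for the engine's own relevance score, and the equal 0.5/0.5 mix of click share and dwell time is an assumption, since the abstract does not give the actual formula.

```python
from collections import deque
from dataclasses import dataclass


@dataclass
class Hit:
    doc_id: int
    category: str      # topic category assigned by the text classifier
    base_score: float  # stand-in for the engine's own relevance score


def interest_weights(clicks, dwell_seconds):
    """Combine per-category click share and normalized average dwell time.

    Both arguments map category -> number. The equal 0.5/0.5 mix is an
    assumption made for this sketch.
    """
    total_clicks = sum(clicks.values()) or 1
    max_dwell = max(dwell_seconds.values(), default=0) or 1
    categories = set(clicks) | set(dwell_seconds)
    return {
        c: 0.5 * clicks.get(c, 0) / total_clicks
        + 0.5 * dwell_seconds.get(c, 0) / max_dwell
        for c in categories
    }


def personalize(hits, weights):
    """Boost each hit by the user's interest in its category, then sort."""
    def boosted(hit):
        return hit.base_score * (1.0 + weights.get(hit.category, 0.0))
    return sorted(hits, key=boosted, reverse=True)


def diversify(hits):
    """Round-robin over categories so the top results cover many categories.

    Categories are visited in the order of their best-ranked hit.
    """
    queues = {}
    for hit in hits:  # dicts preserve first-seen order
        queues.setdefault(hit.category, deque()).append(hit)
    result = []
    while queues:
        for category in list(queues):
            result.append(queues[category].popleft())
            if not queues[category]:
                del queues[category]
    return result
```

Calling `personalize` with two different users' weights over the same hit list yields different orderings, which is the effect the experiments describe. `diversify` can be applied on its own when no user profile is available.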
Because a large share of search keywords are repeated, the 80/20 rule (the "2/8 law") applies: 20% of search terms account for 80% of all searches. When a user submits a query made up of a set of keywords, the system checks whether a record for that query already exists in the cache. If it does not, the query is passed to the searcher, the sequence of document numbers in the searcher's result is written to a file, and the offset of that stored sequence within the file is saved in the cache. If the record already exists, its offset is read from the cache. The thesis then describes the design and implementation of the system prototype: it first gives the complete system architecture, then describes the main modules in detail, including the retrieval module, the result ranking module, and the query cache module, and analyzes the main data structures used in the system. The system is then tested and analyzed to verify its feasibility. Finally, the thesis summarizes its work, outlines plans for future work, points out some shortcomings of the system, and proposes improvements to its overall architecture.
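The query cache described above might be sketched as follows in Python. This is an illustration under stated assumptions, not the thesis's code: the class name `QueryCache`, the binary record layout (a count followed by 32-bit document numbers), and the normalization of a query into a sorted, lowercased keyword string are all hypothetical choices, and `searcher` is any callable that returns a list of integer document numbers.

```python
import struct


class QueryCache:
    """Maps a query to the file offset of its stored document-number sequence."""

    def __init__(self, path, searcher):
        self._file = open(path, "a+b")  # append-only store of result sequences
        self._offsets = {}              # query key -> byte offset in the file
        self._searcher = searcher       # callable: query string -> list of ints

    @staticmethod
    def _key(query):
        # Assumption: a query is a set of keywords, so order and case are ignored.
        return " ".join(sorted(query.lower().split()))

    def search(self, query):
        key = self._key(query)
        offset = self._offsets.get(key)
        if offset is not None:
            # Cache hit: read the stored sequence back from the file.
            return self._read(offset)
        # Cache miss: ask the searcher, store the result, remember the offset.
        doc_ids = list(self._searcher(query))
        self._offsets[key] = self._append(doc_ids)
        return doc_ids

    def _append(self, doc_ids):
        self._file.seek(0, 2)  # move to the end of the file
        offset = self._file.tell()
        # Record layout: count, then each document number, as little-endian uint32.
        self._file.write(struct.pack(f"<{len(doc_ids) + 1}I", len(doc_ids), *doc_ids))
        self._file.flush()
        return offset

    def _read(self, offset):
        self._file.seek(offset)
        (count,) = struct.unpack("<I", self._file.read(4))
        if count == 0:
            return []
        return list(struct.unpack(f"<{count}I", self._file.read(4 * count)))

    def close(self):
        self._file.close()
```

Keeping only offsets in memory keeps the cache small even when result lists are long, which suits a workload where a small fraction of queries accounts for most searches. A production version would also need eviction and invalidation when the index changes, which the abstract does not discuss.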
[Degree-granting institution]: Wuhan University of Technology
[Degree level]: Master's
[Year degree granted]: 2013
[Classification number]: TP391.3


Article ID: 2120952

Article link: http://sikaile.net/kejilunwen/sousuoyinqinglunwen/2120952.html
