Research and Implementation of Distributed Search Diversity Based on Vertical Domains
Published: 2018-07-17 06:55
[Abstract]: Since the start of the 21st century, information technology and computer networks have advanced considerably, and massive data volumes and information overload have made it increasingly difficult for users to retrieve the content they care about. As the pressure on information storage has grown, distributed systems have emerged, bringing a series of new challenges to traditional retrieval systems and search engines. Some of these challenges come from the diversity inherent in user search requests: a retrieval system must quickly and accurately identify which vertical domain the requested information belongs to, that is, satisfy the diversity of user queries, while also making sure the returned information is correct enough to cover the user's needs. Combining distributed search with diversity is therefore a way to address many of these challenges. Building on the architecture of current distributed search engines, this thesis proposes several feasible algorithms for vertical domain selection, resource selection, and result fusion that take the diversity of the retrieved information into account, in order to provide users with more targeted services. The main research work is as follows:
(1) Vertical domain selection. The thesis proposes two vertical domain selection algorithms, a word-vector judgment method and an expanded-vocabulary ranking method. Both expand the query terms, extract keywords for each vertical domain, and select vertical domains according to the similarity between the two. Experimental results show that the two algorithms improve precision and recall to some extent compared with earlier vertical domain selection methods.
(2) Resource selection. The thesis proposes two ways of describing a resource base, an LDA topic description and a TF-IDF resource description, and builds on them a resource base selection framework that also uses the vertical domain selection results to choose resource bases for a user's query. Experimental results show that the proposed resource base selection algorithm can be applied effectively in the distributed environment of a real, complex web search engine and performs well there.
(3) Result fusion. Based on the characteristics of vertical domains and the diversity of query terms, the thesis proposes a result fusion framework that computes features along three dimensions: document, resource base, and vertical domain. The framework uses an improved CORI algorithm and a linear fusion algorithm to compute the final fused score. The algorithm reflects both the diversity and the accuracy of query results and, compared with existing methods, improves the precision, recall, and nDCG of the search results.
Taken together, these results verify that the three proposed algorithms improve the accuracy of the system while preserving diverse results, showing that the system can meet users' needs for multi-angle queries.
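The fusion step and the nDCG metric mentioned in the abstract can be illustrated with a minimal sketch. It is not the thesis's improved CORI algorithm, which the abstract does not specify. It only assumes that each document arrives with a score already normalized to [0, 1], that each resource base and vertical domain has a selection score, and that the three are combined linearly. The names `Hit`, `fuse`, and `ndcg` and the weights `(0.5, 0.3, 0.2)` are hypothetical choices made for illustration.

```python
import math
from typing import Dict, List, NamedTuple, Optional, Sequence, Tuple


class Hit(NamedTuple):
    doc_id: str
    resource: str     # resource base that returned the document
    vertical: str     # vertical domain the resource base belongs to
    doc_score: float  # assumed to be normalized to [0, 1] already


def fuse(
    hits: Sequence[Hit],
    resource_scores: Dict[str, float],
    vertical_scores: Dict[str, float],
    weights: Tuple[float, float, float] = (0.5, 0.3, 0.2),  # hypothetical weights
) -> List[Tuple[str, float]]:
    """Rank documents by a weighted sum of document, resource, and vertical scores."""
    w_doc, w_res, w_vert = weights
    scored = [
        (
            h.doc_id,
            w_doc * h.doc_score
            + w_res * resource_scores[h.resource]
            + w_vert * vertical_scores[h.vertical],
        )
        for h in hits
    ]
    # Highest fused score first; ties are broken by doc_id for determinism.
    return sorted(scored, key=lambda pair: (-pair[1], pair[0]))


def dcg(relevances: Sequence[float]) -> float:
    """Discounted cumulative gain: the item at rank i (1-based) is divided by log2(i + 1)."""
    return sum(rel / math.log2(i + 2) for i, rel in enumerate(relevances))


def ndcg(relevances: Sequence[float], k: Optional[int] = None) -> float:
    """DCG of the given ranking divided by the DCG of the ideal ordering, cut at k."""
    rels = list(relevances)
    cutoff = len(rels) if k is None else k
    ideal = dcg(sorted(rels, reverse=True)[:cutoff])
    return dcg(rels[:cutoff]) / ideal if ideal > 0 else 0.0


# Illustrative data only: "b" has the higher document score, but "a" comes
# from a better-matching resource base and vertical domain.
hits = [Hit("a", "r1", "news", 0.90), Hit("b", "r2", "sports", 0.95)]
ranking = fuse(hits, {"r1": 0.8, "r2": 0.2}, {"news": 0.9, "sports": 0.1})
print(ranking)
print(ndcg([1, 3]))  # relevance labels listed in ranked order
```

With these made-up inputs, the resource and vertical scores lift document "a" above "b" even though "b" has the higher document-level score, which is the kind of effect a multi-dimensional fusion is meant to produce. In practice the weights would have to be tuned on held-out queries.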
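[Degree-granting institution]: South China University of Technology
[Degree level]: Master's
[Year degree granted]: 2016
[Classification number]: TP391.3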
Article ID: 2129494
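Related master's theses (top 1)
1 Xie Yifan (謝一帆); Research and Implementation of Distributed Search Diversity Based on Vertical Domains [D]; South China University of Technology; 2016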
Article link: http://sikaile.net/kejilunwen/sousuoyinqinglunwen/2129494.html