Research on Several Important Problems in Distributed Information Retrieval

Published: 2018-05-27 23:05

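Topic: distributed information retrieval + information retrieval; Source: Beijing University of Posts and Telecommunications, 2012 doctoral thesis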

[Abstract]: Distributed information retrieval is an important research area within information retrieval, and a growing number of retrieval systems draw on its theory and techniques. For example, web search must integrate the results returned by multiple vertical search engines; cross-language retrieval cannot avoid ranking documents in different languages by relevance; and professional patent search may need to query several patent databases at once. Research has also shown that, under certain conditions, distributed retrieval is more effective than traditional retrieval. Distributed information retrieval comprises the techniques and methods for querying multiple document databases at the same time. Specifically, when the system receives a user query, it first selects document databases according to their relevance, sends the query to the selected databases, collects the results they return, and finally merges them into a single list that is returned to the user. There are three main problems in distributed information retrieval: how to describe a document database (database description), how to select appropriate document databases for a given query (database selection), and how to merge the returned results (result merging).
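The select, dispatch, and merge flow described above can be sketched in a few lines of Python. This is only an illustrative toy, not the system studied in the thesis: the function names, the term-frequency description, and the raw-score merge are all assumptions made for the example. In practice, scores returned by different databases are generally not comparable, which is why result merging is a research problem in its own right.

```python
from collections import Counter


def tokenize(text):
    # Assumption: whitespace tokenization; Chinese text would need word segmentation.
    return text.lower().split()


def describe(docs):
    """Database description: term frequencies over a database's documents."""
    counts = Counter()
    for doc in docs:
        counts.update(tokenize(doc))
    return counts


def select_databases(query, descriptions, k):
    """Rank databases by the relative frequency of the query terms."""
    def score(name):
        desc = descriptions[name]
        total = sum(desc.values()) or 1
        return sum(desc[t] for t in tokenize(query)) / total

    ranked = sorted(descriptions, key=lambda n: (-score(n), n))
    return ranked[:k]


def search(query, docs):
    """Toy per-database search: score = number of query-term occurrences."""
    terms = set(tokenize(query))
    hits = []
    for doc in docs:
        s = sum(1 for t in tokenize(doc) if t in terms)
        if s > 0:
            hits.append((s, doc))
    return sorted(hits, key=lambda h: (-h[0], h[1]))


def federated_search(query, databases, k=2):
    """Select k databases, query each one, and merge by raw score (naive)."""
    descriptions = {name: describe(docs) for name, docs in databases.items()}
    selected = select_databases(query, descriptions, k)
    merged = []
    for name in selected:
        for s, doc in search(query, databases[name]):
            merged.append((s, name, doc))
    merged.sort(key=lambda r: (-r[0], r[1], r[2]))
    return selected, merged
```

Calling `federated_search("patent retrieval", databases, k=2)` with a dictionary that maps database names to lists of document strings returns the selected database names and a single merged result list.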
Following a detailed survey, this thesis studies several important problems in distributed information retrieval in depth and obtains a number of new results. The main contributions are as follows:
1. For the database description problem, this thesis verifies the reliability, stability, and necessity of the query-based sampling algorithm in a Chinese-language setting.
Query-based sampling in uncooperative environments is a focus of current research. Previous work was carried out on standard English data sets, and no study had specifically confirmed that the approach is reliable and effective for Chinese. After examining the assumptions and underlying theory of query-based sampling, the thesis takes a practical perspective and verifies its reliability and effectiveness for Chinese through a complete, logically structured set of experiments. Following the retrieval workflow, these cover tests at the database description level, the database selection level, and the retrieval level. The experiments show that query sampling is feasible and efficient in the Chinese setting; the results at the database description level in particular demonstrate the reliability, stability, and necessity of the sampling technique.
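As a rough illustration of query-based sampling, the sketch below repeatedly sends a single-term query to an uncooperative database, keeps the top few returned documents, and draws the next query term from the vocabulary learned so far. The `search_fn` callback, the parameter names and defaults, and the whitespace tokenization are assumptions for this example (Chinese text would need word segmentation first); it does not reproduce the experimental setup of the thesis.

```python
import random
from collections import Counter


def query_based_sampling(search_fn, seed_term, max_docs=300,
                         docs_per_query=4, max_queries=1000, rng=None):
    """Build a sample of a database using only its search interface.

    search_fn(term) is a hypothetical callback returning a ranked list of
    (doc_id, text) pairs for a single-term query.
    """
    rng = rng or random.Random(0)
    sampled = {}            # doc_id -> text
    vocabulary = Counter()  # term frequencies learned from sampled documents
    used_terms = set()
    term = seed_term
    for _ in range(max_queries):
        used_terms.add(term)
        for doc_id, text in search_fn(term)[:docs_per_query]:
            if doc_id not in sampled and len(sampled) < max_docs:
                sampled[doc_id] = text
                vocabulary.update(text.lower().split())
        if len(sampled) >= max_docs:
            break
        # Pick the next query term from the learned vocabulary.
        candidates = sorted(t for t in vocabulary if t not in used_terms)
        if not candidates:
            break
        term = rng.choice(candidates)
    return sampled, vocabulary
```

The returned `vocabulary` counter serves as the learned description of the database; the loop stops when the sample reaches `max_docs`, when no unused query terms remain, or after `max_queries` queries.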
2. For the database selection problem, this thesis proposes a selection algorithm based on a discriminative model and a selection algorithm based on topic clustering, and verifies their effectiveness.
A large body of work exists in this area; it can be roughly divided into term-frequency-based, document-based, and classification/clustering-based selection methods. Viewed through the distinction between discriminative and generative models, the work here has two parts. First, to take into account information across different databases, we propose a selection algorithm based on a discriminative model. Second, to address the semantics of databases, we propose, on theoretical grounds, a selection algorithm based on topic clustering. The former is explored theoretically. The latter is the focus of our work, because the topic clustering algorithm not only accounts for document-level factors but also introduces the semantic factors of the database, which makes the model clearly interpretable. We also give a unified analysis and interpretation of this class of models from the perspective of probabilistic graphical models. Experiments confirm that the topic-clustering-based selection algorithm is highly competitive on existing data sets.
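For orientation, the following sketch shows a generic term-frequency-based selection baseline of the kind mentioned above: each database is ranked by the smoothed likelihood of the query under the term statistics of its sample. It is not the discriminative or topic-clustering algorithm proposed in the thesis, whose details the abstract does not give; the function name, the smoothing weight `lam`, and the input layout are assumptions.

```python
import math
from collections import Counter


def rank_databases(query_terms, samples, lam=0.5):
    """Rank databases by smoothed query log-likelihood.

    samples maps a database name to a Counter of term frequencies taken
    from its sampled documents (hypothetical input format).
    """
    background = Counter()
    for counts in samples.values():
        background.update(counts)
    bg_total = sum(background.values())

    scores = {}
    for name, counts in samples.items():
        total = sum(counts.values())
        log_p = 0.0
        for t in query_terms:
            if background[t] == 0:
                continue  # term unseen everywhere: it cannot discriminate
            p_db = counts[t] / total if total else 0.0
            p_bg = background[t] / bg_total
            # Linear interpolation between database and background models.
            log_p += math.log(lam * p_db + (1 - lam) * p_bg)
        scores[name] = log_p
    return sorted(scores, key=lambda n: (-scores[n], n))
```

Interpolating with the background model keeps the score finite for databases whose sample does not contain a query term, so every database still receives a rank.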
3. For the result merging problem, this thesis develops a weighted curve fitting algorithm and shows that it yields a clear and stable improvement over existing algorithms.
The classical result merging algorithms are the CORI merging algorithm (CORI Merging), the SSL algorithm (Semi-Supervised Learning), and the SAFE algorithm (Sample-Agglomerate Fitting Estimate). SSL addresses the instability of CORI merging in uncooperative environments, and SAFE addresses the shortage of samples in SSL. SAFE nevertheless has two shortcomings in how it uses documents: it does not account for the different importance of documents at different ranks, and it does not account for the estimation error in document ranks. To address both, this thesis proposes the Weighted Curve Fitting (WCF) algorithm, built on SAFE. Extensive experiments show that WCF consistently and stably outperforms SAFE. For certain environments, we also give parameter combinations under which WCF is likely to perform best.
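The idea of weighting documents by rank when fitting a score mapping can be illustrated with weighted least squares. The sketch below fits a line from database-specific scores to comparable reference scores, letting higher-ranked documents count more. The abstract does not give the actual WCF weighting scheme or curve family, so the linear form, the `1 / rank` weights, and all function names here are assumptions, not the thesis's algorithm.

```python
def rank_weights(n):
    """Hypothetical weighting: the document at rank r gets weight 1 / r."""
    return [1.0 / r for r in range(1, n + 1)]


def weighted_linear_fit(xs, ys, weights):
    """Minimize sum_i w_i * (slope * x_i + intercept - y_i) ** 2."""
    if not (len(xs) == len(ys) == len(weights)) or len(xs) < 2:
        raise ValueError("need at least two points and equal-length inputs")
    w_sum = sum(weights)
    if w_sum <= 0:
        raise ValueError("weights must have a positive sum")
    mean_x = sum(w * x for w, x in zip(weights, xs)) / w_sum
    mean_y = sum(w * y for w, y in zip(weights, ys)) / w_sum
    var_x = sum(w * (x - mean_x) ** 2 for w, x in zip(weights, xs))
    if var_x == 0:
        raise ValueError("x values are degenerate under these weights")
    cov_xy = sum(w * (x - mean_x) * (y - mean_y)
                 for w, x, y in zip(weights, xs, ys))
    slope = cov_xy / var_x
    intercept = mean_y - slope * mean_x
    return slope, intercept


def normalize_scores(source_scores, slope, intercept):
    """Map database-specific scores onto the common scale for merging."""
    return [slope * s + intercept for s in source_scores]
```

Once a mapping has been fitted for each selected database, the normalized scores from all databases can be sorted together to produce the merged ranking.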
[Degree-granting institution]: Beijing University of Posts and Telecommunications
[Degree level]: Doctoral
[Year conferred]: 2012
[Classification numbers]: TP391.3; TP311.13

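Citation: He Chuan (何川). Research on Several Important Problems in Distributed Information Retrieval [D]. Beijing University of Posts and Telecommunications, 2012.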


Article ID: 1944156


Article link: http://sikaile.net/kejilunwen/sousuoyinqinglunwen/1944156.html
