Research on a DEA-Based Listwise Learning-to-Rank Method
Topic: information retrieval. Focus: learning to rank. Source: Southwest Jiaotong University, master's thesis, 2014
[Abstract]: The rapid growth of the Internet and of digital products has produced an overwhelming volume of information, and users urgently need systems that provide efficient, convenient information retrieval. Web search engines have therefore become an important tool for obtaining information. A search engine comprises several subsystems, of which the ranking system is the core: given the query terms a user submits, it quickly locates the most relevant set of documents in a massive collection and returns them in order of relevance, reducing the time users spend searching. Researchers have proposed many ranking algorithms for this purpose, mainly based on content analysis or link analysis, which use relevance and importance features of documents to assess how well a document matches the user's search intent. These algorithms have greatly improved ranking in information retrieval systems, but two important shortcomings remain: either the query-document features used to build the ranking model are limited, or, when a large number of features are used, tuning the model parameters becomes the main obstacle.
Learning to rank lies at the intersection of machine learning and information retrieval. It automatically learns a ranking model from large, manually labeled training sets and applies the model to predict rankings for unseen data. Each training instance is represented as a multidimensional feature vector that encodes various kinds of complex information about the relevance between a document and a query. Learning-to-rank methods currently fall into three broad categories: pointwise, pairwise, and listwise. Studies indicate that listwise methods perform best on most public datasets. This thesis focuses on listwise learning to rank and, by combining data envelopment analysis (DEA) with boosting, proposes a new ranking method called DEARank.
The thesis modifies the classical CCR model to construct two degenerate DEA models, CCR-I and CCR-O. The documents to be ranked are treated as decision-making units, and the optimal weights of these models are used to build a set of weak ranking functions. Each candidate weak ranking function reflects a decision-making unit's preference among the features and represents a feature subset drawn from the full feature space. Using these candidates, a better-performing ranking model is trained by boosting. Finally, the empirical results of DEARank on the LETOR datasets (HP2003, HP2004, NP2003, NP2004, TD2003, TD2004, OHSUMED, MQ2007, and MQ2008) are compared with those of twelve other classical learning-to-rank algorithms. The experiments show that DEARank performs very well, offering web information retrieval systems an important ranking algorithm.
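The abstract does not spell out how the candidate weak ranking functions are combined by boosting. As a rough illustration only, the sketch below uses an AdaRank-style update (a stand-in chosen here, not necessarily the thesis's scheme) with NDCG as the listwise measure. The weight vectors in `candidates` are hypothetical placeholders for the optimal weights that the CCR-I and CCR-O models would supply, and the two toy queries are invented for the example.

```python
import math

def ndcg_at_k(scores, labels, k=10):
    """NDCG@k for one query: sort documents by score, compare with the ideal order."""
    order = sorted(range(len(scores)), key=lambda i: -scores[i])

    def dcg(ranked_labels):
        return sum((2 ** l - 1) / math.log2(pos + 2)
                   for pos, l in enumerate(ranked_labels[:k]))

    ideal = dcg(sorted(labels, reverse=True))
    return dcg([labels[i] for i in order]) / ideal if ideal > 0 else 0.0

def per_query_ndcg(w, queries):
    """NDCG of the linear ranker w on every query."""
    return [ndcg_at_k([sum(wi * xi for wi, xi in zip(w, x)) for x in docs], labels)
            for docs, labels in queries]

def mean_ndcg(w, queries):
    perf = per_query_ndcg(w, queries)
    return sum(perf) / len(perf)

def boost_rankers(queries, candidates, rounds=2):
    """AdaRank-style boosting over a fixed pool of candidate weak rankers.

    queries:    list of (feature_vectors, relevance_labels), one pair per query
    candidates: list of weight vectors (hypothetical stand-ins for DEA weights)
    """
    n = len(queries)
    q_weights = [1.0 / n] * n                 # distribution over queries
    combined = [0.0] * len(candidates[0])     # weights of the boosted ranker
    for _ in range(rounds):
        # Choose the candidate with the best query-weighted NDCG.
        scored = [(w, per_query_ndcg(w, queries)) for w in candidates]
        best_w, best_perf = max(
            scored, key=lambda wp: sum(q * p for q, p in zip(q_weights, wp[1])))
        # AdaRank coefficient: 0.5 * ln(sum P(1+E) / sum P(1-E)).
        num = sum(q * (1 + p) for q, p in zip(q_weights, best_perf))
        den = sum(q * (1 - p) for q, p in zip(q_weights, best_perf))
        alpha = 0.5 * math.log(num / den) if den > 1e-12 else 1.0
        combined = [c + alpha * wi for c, wi in zip(combined, best_w)]
        # Put more weight on queries the combined ranker still handles poorly.
        raw = [math.exp(-p) for p in per_query_ndcg(combined, queries)]
        total = sum(raw)
        q_weights = [r / total for r in raw]
    return combined

# Toy data (invented): two queries, three documents each, two features.
queries = [
    ([[0.9, 0.1], [0.2, 0.8], [0.1, 0.3]], [2, 0, 1]),
    ([[0.3, 0.9], [0.6, 0.2], [0.5, 0.5]], [2, 0, 1]),
]
candidates = [[1.0, 0.0], [0.0, 1.0]]   # each weak ranker prefers one feature
model = boost_rankers(queries, candidates, rounds=2)
print("combined weights:", model)
print("mean NDCG:", mean_ndcg(model, queries))
```

On this toy data each single-feature ranker orders one query well and the other poorly, so the boosted combination can score higher than either candidate alone. More rounds do not necessarily help further, since this update is not guaranteed to improve NDCG monotonically.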
[Degree-granting institution]: Southwest Jiaotong University
[Degree level]: Master's
[Year degree conferred]: 2014
[Classification number]: O223
Article ID: 1727652
Article link: http://sikaile.net/kejilunwen/sousuoyinqinglunwen/1727652.html