Similarity Learning for Heterogeneous Data and Its Application to Web Search

Published: 2018-09-07 07:53

[Abstract]: This thesis studies similarity learning for heterogeneous data and its application to web search. Similarity learning plays an important role in many applications, including web search, recommender systems, image annotation, and machine translation. In essence, each of these tasks reduces to learning and using a similarity function to match two types of heterogeneous instances: queries and documents in web search, users and items in recommender systems, keywords and images in image annotation, and translations in two languages in machine translation. In web search in particular, the search engine is the medium through which queries are matched to documents, and the rapid growth of information on the web has made people increasingly dependent on search engines. A search engine's task is to retrieve the documents relevant to the queries submitted by different users and to rank them by relevance. Queries and documents are two types of heterogeneous instances whose relevance is determined by their similarity, so the quality of the similarity function directly determines the performance of the search engine.
This thesis defines the similarity function as an inner product in a Hilbert space. Specifically, a mapping function is defined for each of the two types of heterogeneous instances; the mapping functions map the instances into the same Hilbert space, and the inner product of their images is defined as the similarity. Under this definition, the thesis considers two ways of learning the similarity of heterogeneous data: (1) learn the mapping functions first and then compute the inner product of the images to obtain the similarity function; (2) learn the similarity function directly. For each way, it attempts to address three questions: (1) how to combine information from different sources (in web search, for example, the content of queries and documents as well as click-through data can all be used to learn the similarity function); (2) how to improve the efficiency and scalability of the learning algorithm so that it can handle massive data; and (3) how to analyze the generalization ability of the learning algorithm.
The thesis first considers learning the mappings and then defining the similarity function through the inner product of the images. In particular, it considers learning two linear mappings, so that the resulting similarity function is a bilinear form. Within this approach, two hypothesis spaces are defined for the linear mappings. First, the columns of the linear mappings are required to be orthonormal; under this assumption, a multi-view learning method is proposed that can effectively use information from different sources. Then, to improve the efficiency and scalability of learning, a regularized method is given: the l_1 and l_2 norms of the row vectors of the linear mappings are constrained, which guarantees sparse solutions and makes the algorithm easy to parallelize. Finally, the generalization ability of the similarity learning methods is studied systematically.
Next, the thesis considers learning the similarity function of heterogeneous data by directly defining a hypothesis space of similarity functions. In particular, it draws on kernel methods from machine learning and proposes kernel-based similarity learning. Specifically, it defines a special kind of positive semi-definite kernel, the S-kernel; an S-kernel generates a hypothesis space of similarity functions. The kernel method guarantees the optimality of the solution and its generalization ability. To improve the efficiency of the learning algorithm, an online approximation of the algorithm is proposed.
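As a rough illustration of the first approach above (two linear mappings whose images are compared by an inner product), the sketch below computes a bilinear similarity in plain Python. It is not the thesis's algorithm: the matrix layout (one row per input feature, one column per latent dimension), the names `image`, `similarity`, and `row_norms`, and the toy numbers are all assumptions made for illustration, and nothing here is learned from data.

```python
from typing import List, Tuple

Vector = List[float]
Matrix = List[List[float]]  # assumed layout: one row per input feature, one column per latent dimension


def image(mapping: Matrix, x: Vector) -> Vector:
    """Map x into the shared k-dimensional space: returns mapping^T x."""
    k = len(mapping[0])
    out = [0.0] * k
    for weight, row in zip(x, mapping):
        if weight != 0.0:  # a sparse input only touches the rows of its active features
            for j in range(k):
                out[j] += weight * row[j]
    return out


def similarity(lq: Matrix, ld: Matrix, q: Vector, d: Vector) -> float:
    """Inner product of the two images, i.e. the bilinear form q^T (Lq Ld^T) d."""
    return sum(a * b for a, b in zip(image(lq, q), image(ld, d)))


def row_norms(mapping: Matrix) -> List[Tuple[float, float]]:
    """(l_1, l_2) norm of every row: the quantities a row-wise regularizer would bound."""
    return [(sum(abs(v) for v in row), sum(v * v for v in row) ** 0.5) for row in mapping]


# Toy, hand-set mappings (hypothetical values): 3 query features, 2 document features, k = 2.
lq = [[1.0, 0.0], [0.0, 2.0], [0.5, 0.5]]
ld = [[1.0, 1.0], [0.0, 3.0]]
q = [1.0, 0.0, 2.0]
d = [0.0, 1.0]

print(similarity(lq, ld, q, d))  # 3.0
```

Because each row of a mapping belongs to a single input feature, bounding the norms row by row lets the rows be handled independently, which is one way to read the abstract's remark that the regularized method is easy to parallelize.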
Heterogeneous data similarity learning is applied to web search, and the proposed learning methods are shown to address the term mismatch problem in web search. Experiments were conducted on real, large-scale enterprise search data and web search data. The results show that the proposed methods effectively overcome term mismatch and significantly improve on traditional methods in relevance ranking and similar query discovery.
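To make the term mismatch problem concrete, the toy sketch below compares a plain term-overlap score with a similarity computed in a shared latent space. The vocabulary, the two-dimensional latent space, and the hand-set weights in `QUERY_MAP` and `DOC_MAP` are hypothetical; a real system would learn such mappings (for example from click-through data), and this is not the procedure evaluated in the thesis.

```python
from typing import Dict, List

# Hypothetical, hand-set term-to-latent weights (dimension 0 ~ "place", dimension 1 ~ "lodging").
QUERY_MAP: Dict[str, List[float]] = {"ny": [1.0, 0.0], "hotel": [0.0, 1.0]}
DOC_MAP: Dict[str, List[float]] = {
    "new": [0.5, 0.0],
    "york": [0.5, 0.0],
    "accommodation": [0.0, 1.0],
}


def term_overlap(query: List[str], doc: List[str]) -> int:
    """Baseline: inner product of binary bag-of-words vectors, i.e. the number of shared terms."""
    return len(set(query) & set(doc))


def embed(terms: List[str], mapping: Dict[str, List[float]], dim: int = 2) -> List[float]:
    """Sum the latent vectors of the terms; terms missing from the mapping contribute nothing."""
    out = [0.0] * dim
    for term in terms:
        for j, value in enumerate(mapping.get(term, [])):
            out[j] += value
    return out


def latent_similarity(query: List[str], doc: List[str]) -> float:
    """Inner product of the query image and the document image in the shared space."""
    q_img = embed(query, QUERY_MAP)
    d_img = embed(doc, DOC_MAP)
    return sum(a * b for a, b in zip(q_img, d_img))


query = ["ny", "hotel"]
doc = ["new", "york", "accommodation"]

print(term_overlap(query, doc))       # 0: the query and document share no terms
print(latent_similarity(query, doc))  # 2.0: they still match in the latent space
```

The query and the document share no surface terms, so the overlap baseline scores them as unrelated, while the latent-space inner product still registers a match.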
【Degree-granting institution】: Peking University
【Degree level】: Doctoral
【Year conferred】: 2012
【Classification number】: TP391.3

【Citation】: 武威. Similarity Learning for Heterogeneous Data and Its Application to Web Search [D]. Peking University, 2012.


Article ID: 2227661


Permalink: http://sikaile.net/kejilunwen/sousuoyinqinglunwen/2227661.html
