Design and Implementation of an Efficient LTR Research System for Large-Scale Data
Published: 2018-06-12 17:34
Topic: web page ranking + LTR research system; Source: Nanjing University, 2015 master's thesis
[Abstract]: LTR (learning to rank, i.e., using machine learning to rank web pages) plays an increasingly important role in commercial search engines, and the major commercial search engines have gradually adopted LTR as a key means of ranking search results. At the current stage of web ranking, LTR algorithms themselves yield relatively small gains in search accuracy. The results of the LTR challenge held by Yahoo in 2010 show that the most accurate algorithm improved only marginally over the baseline algorithms (GBDT and RankSVM), and a considerable share of that improvement came from processing of the training data. Meanwhile, as the number of web pages grows rapidly, training sets keep getting larger, and LTR must be able to handle them. In addition, some very important features of the training data, such as user click data, change quickly over time, so trained models need to be updated quickly. Efficiency and the ability to handle larger-scale data are therefore the main requirements for LTR algorithms today. Furthermore, LTR training uses many features (up to about 700), most of which are noisy and contribute little to the final model. Selecting an appropriate feature set for training can both improve accuracy and greatly reduce training time, so how to select features is also an important part of LTR research. The purpose of an LTR research system is to quickly select a suitable model for the search engine to use in ranking web search results. The original LTR system has three main problems: it lacks support for feature analysis and selection; it cannot handle large-scale datasets; and the training algorithm itself is inefficient. These problems make training and updating LTR models inefficient, so the system cannot keep up with growing data volumes and the need for rapid updates. This thesis designs and implements a new LTR research system that addresses these three problems, with improvements in three parts. The first is a scalable feature analysis platform that supports large-scale data; it is used to analyze features, to provide a reference for selecting the features a model needs, and to explain the final results to some extent. The second is an efficient single-machine implementation of the LTR training algorithm, which makes full use of the new hardware and software environment to improve training efficiency. The third is a training platform for tree models on large-scale data, which includes a resource scheduling module that addresses computing-resource problems and a distributed tree-model training module that supports automatic failure recovery. The final results show that the system more than doubles the efficiency of the iterative process of selecting features and model parameters, supports large-scale data processing, and improves LTR model training in both efficiency and accuracy.
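The abstract notes that most LTR features are noisy and that training on a well-chosen subset can improve accuracy and cut training time. The thesis abstract does not say which analysis method its platform uses, so the following is only a minimal sketch of one common filter-style approach, assuming features are scored by absolute Pearson correlation with the relevance label. The function names and the toy data are hypothetical.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation of two equal-length sequences; 0.0 if either is constant."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / sqrt(var_x * var_y)


def select_top_k_features(rows, labels, k):
    """Return indices of the k features most correlated (in absolute value) with the labels."""
    num_features = len(rows[0])
    scores = []
    for j in range(num_features):
        column = [row[j] for row in rows]
        scores.append((abs(pearson(column, labels)), j))
    # Highest score first; ties are broken by the lower feature index.
    scores.sort(key=lambda s: (-s[0], s[1]))
    return [j for _, j in scores[:k]]


# Toy data (hypothetical): feature 0 tracks the label,
# feature 1 is constant, and feature 2 is mostly noise.
rows = [
    [0.1, 1.0, 0.7],
    [0.4, 1.0, 0.2],
    [0.6, 1.0, 0.9],
    [0.9, 1.0, 0.3],
]
labels = [0, 1, 2, 3]
selected = select_top_k_features(rows, labels, k=1)
```

A per-feature score like this can be computed independently for each column, which is one reason filter methods are often a first step when the feature count is in the hundreds. It ignores interactions between features, so it would serve as a reference for selection rather than a final decision.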
[Degree-granting institution]: Nanjing University
[Degree level]: Master's
[Year degree granted]: 2015
[Classification number]: TP391.3
Article ID: 2010502
Article link: http://sikaile.net/kejilunwen/sousuoyinqinglunwen/2010502.html