Design and Implementation of an Efficient LTR Research System for Large-Scale Data
Published: 2018-06-12 17:34
Topics: web page ranking; LTR research system. Source: master's thesis, Nanjing University, 2015.
【Abstract】: LTR (learning to rank, the use of machine learning to rank web pages) plays an increasingly important role in commercial search engines, and the major engines have all gradually adopted it as a core ranking technique. At the current stage of web ranking, the gains from the LTR algorithm itself are relatively small: the LTR algorithm competition held by Yahoo in 2010 showed that even the most accurate entries offered only limited improvement over the baseline algorithms (GBDT and RankSVM), and a substantial part of that improvement came from how the training data was processed. Meanwhile, as the number of web pages grows rapidly, training sets keep getting larger, and LTR must be able to handle them; moreover, some very important features of the training data, such as user click data, change quickly over time, so trained models must be updated quickly. Efficiency and the ability to handle ever-larger data are therefore the main current demands on LTR algorithms. In addition, LTR training uses many features (around 700), most of which are noisy and contribute little to the final model; selecting a suitable feature subset for training can both improve accuracy and greatly reduce training time, so how to choose suitable features is also an important part of LTR research. The purpose of an LTR research system is to quickly select a suitable model for the search engine to use in ranking web search results. The original LTR system had three main problems: lack of support for feature analysis and selection; inability to handle large-scale data sets; and the low efficiency of the training algorithm itself. These problems made training and updating LTR models inefficient and unable to keep up with growing data and the demand for fast updates. This thesis designs and implements a new LTR research system targeting these three problems, with three main improvements: first, a scalable feature analysis platform that supports large-scale data, used to analyze features, guide the selection of the features a model needs, and explain the final results to some extent; second, an efficient single-machine implementation of the LTR training algorithm that makes full use of modern software and hardware to speed up training; third, a training platform for large-scale tree models over massive data, comprising a resource scheduling module that addresses the computing-resource problem and a distributed tree-model training module with automatic fault recovery. The final results show that the research system more than doubles the efficiency of the iterative feature- and parameter-selection process, supports large-scale data processing, and improves LTR model training in both efficiency and accuracy.
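The abstract's feature-selection argument (most of the ~700 features are noisy and contribute little, so training on a well-chosen subset keeps accuracy while cutting cost) can be illustrated with a minimal sketch. The sketch below is not the thesis's method; it is a common importance-based approach, assuming scikit-learn's `GradientBoostingRegressor` as a stand-in GBDT and synthetic data in which only a few features carry signal:

```python
# Illustrative sketch (not from the thesis): train a GBDT on data where most
# features are noise, rank features by GBDT importance, keep the top-k, and
# retrain on the reduced set. All names and sizes here are made up.
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

rng = np.random.default_rng(0)
n_docs, n_informative, n_noise = 2000, 10, 90  # 100 features, 90 pure noise
X_inf = rng.normal(size=(n_docs, n_informative))
X_noise = rng.normal(size=(n_docs, n_noise))
X = np.hstack([X_inf, X_noise])
# Relevance label depends only on the informative features.
y = X_inf @ rng.normal(size=n_informative) + 0.1 * rng.normal(size=n_docs)

full = GradientBoostingRegressor(n_estimators=50, random_state=0).fit(X, y)

# Keep the top-k features by GBDT importance and retrain on that subset.
k = 15
top = np.argsort(full.feature_importances_)[::-1][:k]
reduced = GradientBoostingRegressor(n_estimators=50, random_state=0).fit(X[:, top], y)

print(sorted(int(i) for i in top))            # selected feature indices
print(round(reduced.score(X[:, top], y), 3))  # fit quality on the subset
```

With only 15 of 100 columns, the retrained model sees a fraction of the original feature scans per tree split, which is the training-time saving the abstract refers to; the selected indices cluster on the informative features because the noise columns rarely win splits.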
【Degree-granting institution】: Nanjing University
【Degree level】: Master's
【Year degree granted】: 2015
【CLC number】: TP391.3