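Research and Implementation of Tightness Post-Processing Based on Hadoop and Support Vector Machines
Published: 2018-04-09 18:41
Topic: natural language processing  Focus: tightness  Source: Beijing Jiaotong University, master's thesis, 2015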
[Abstract]: Accurately retrieving and presenting the results a user is looking for has become the main goal of today's search engines. Search engines involve many technologies; natural language processing is among the most important of them and is the foundation on which improvements to the others build. Tightness is one of the key techniques applied after word segmentation and stop-word removal. It describes the relationship between the smallest units produced by segmentation (Terms) and is an important signal in relevance ranking for web search: it has a decisive effect on the ranking, and it matters for improving both the precision and the recall of search results.

Because segmentation follows a minimum-granularity strategy and cuts sentences as finely as possible, some long phrases are split into several Terms. Search then recalls pages that do not match what the user was looking for, which lowers precision and degrades the user experience. Set against a real project for the Sogou search engine, this thesis studies new-word discovery algorithms in Chinese word segmentation for search engines, designs strategy-based algorithms for extracting relationships between Terms, assembles the extracted relationships into features, classifies those features with a support vector machine (SVM), and thereby improves the practical effect of tightness. The main work of the thesis is as follows:

(1) Data preprocessing. The raw search logs are segmented and initial statistics are computed, producing the base data for the later strategies.
(2) Preliminary post-processing based on search session logs. A query difference value is computed over the search session data to select a subset of sessions, which is then used for a first post-processing pass over tightness.
(3) Second-stage post-processing based on web page body text. For tightness results at the proper-noun level, and building on new-word discovery algorithms, methods such as information entropy and mutual information are used to derive feature relationships between pairs of Terms, and the feature values are classified with an SVM.
(4) Verification and analysis of experimental results. The final offline data is validated against the training set. The tightness post-processing strategy improves relevance ranking and makes Sogou search results more accurate.
(5) Strategy effect. The post-processing strategy adjusts tightness values so that relevance ranking is more accurate, with high-quality results ranked higher and poor results ranked lower.
[Degree-granting institution]: Beijing Jiaotong University
[Degree level]: Master's
[Year degree granted]: 2015
[Classification number]: TP391.3; TP18