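Research and Implementation of Tightness Post-Processing Based on Hadoop and Support Vector Machines
Published: 2018-04-09 18:41
Topic: natural language processing  Focus: tightness  Source: Beijing Jiaotong University, 2015 master's thesis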
[Abstract]: Accurately retrieving and presenting the results a user is looking for has become the main goal of today's search engines. Search engines draw on many technologies; natural language processing is one of the most important and is also the foundation on which improvements to the other technologies are built. Tightness is one of the key techniques applied after word segmentation and stop-word removal. It describes the relationship between the smallest units (Terms) produced by segmentation and is an important signal in the relevance ranking of web search: it plays a decisive role in the ranking results and matters greatly for improving the accuracy and recall of search results.
Because the segmentation strategy is minimal cutting, sentences are split as finely as possible, so some long phrases are broken into multiple Terms. Subsequent searches then recall pages that do not match the user's needs, which lowers the accuracy of the results and degrades the user experience. Against the background of a real project on the Sogou search engine, the thesis studies new-word-discovery algorithms for Chinese word segmentation in search engines, designs a strategy-based algorithm for extracting Term relationships, turns the extracted relationships into features, classifies the features with a support vector machine (SVM), and thereby improves the practical effect of tightness. The main work of the thesis is as follows:
(1) Data preprocessing. The raw search logs are segmented and initial statistics are computed, producing the base data for the later strategies.
(2) Preliminary post-processing based on search session logs. A query difference value is computed over the search session data to select a subset of sessions, which is used for a first round of tightness post-processing.
(3) Second-step post-processing based on web page body text. For tightness results at the proper-noun level, new-word-discovery algorithms using information entropy, mutual information, and similar methods yield feature relationships between pairs of Terms, and the feature values are classified with an SVM.
(4) Verification and analysis of experimental results. The final offline data is verified against the training set. The tightness post-processing strategy improves relevance ranking and makes Sogou search results more accurate.
(5) Strategy effect. The post-processing strategy adjusts tightness values so that relevance ranking is more accurate, with high-quality results ranked higher and poor results ranked lower.
[Degree-granting institution]: Beijing Jiaotong University
[Degree level]: Master's
[Year degree awarded]: 2015
[Classification number]: TP391.3; TP18
Article ID: 1727664
Article link: http://sikaile.net/kejilunwen/sousuoyinqinglunwen/1727664.html