Research and Implementation of Tightness Post-Processing Based on Hadoop and Support Vector Machines

Published: 2018-04-09 18:41

  Topic: natural language processing. Focus: tightness. Source: Beijing Jiaotong University, master's thesis, 2015


[Abstract]: Accurately retrieving and presenting the results a user is looking for has become the main goal of modern search engines. Of the many technologies a search engine relies on, natural language processing is among the most important, and it underpins improvements in the others. Tightness is one of the key techniques applied after word segmentation and stop-word removal. It describes the relationship between the smallest units produced by segmentation (Terms) and is an important signal in the relevance ranking of web search: it has a decisive effect on the ranking and matters for both the precision and the recall of search results.

Because segmentation follows a minimum-cut strategy, sentences are split at the finest possible granularity, so some long phrases are broken into several Terms. Search then recalls pages that do not match the user's intent, which lowers precision and degrades the user experience. Taking a production project of the Sogou search engine as background, the thesis studies algorithmic strategies for new-word discovery in Chinese word segmentation, designs a strategy-based algorithm for extracting relationships between Terms, turns those relationships into features, classifies the features with a support vector machine (SVM), and thereby improves the practical effect of tightness. The main work of the thesis is as follows:

(1) Data preprocessing. The raw search logs are segmented and initial statistics are computed, producing the base data for the later strategies.
(2) Initial post-processing based on search session logs. A difference value between search queries is computed over the session data to select a subset of sessions, which is used for a first round of tightness post-processing.
(3) Second-stage post-processing based on web page body text. For tightness results at the proper-noun level, new-word discovery algorithms using information entropy, mutual information, and related measures produce feature relationships between pairs of Terms, and the feature values are classified with an SVM.
(4) Experimental validation and analysis. The final offline data is validated against the training set. The tightness post-processing strategy improves relevance ranking and makes Sogou search results more accurate.
(5) Strategy effect. Adjusting tightness values through the post-processing strategy makes relevance ranking more accurate, moving high-quality results up and poor results down.
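The abstract names information entropy and mutual information as the measures behind the pairwise Term features but does not give formulas. As a rough illustration only, the sketch below assumes pointwise mutual information (PMI) and left/right context entropy, two measures commonly used in new-word discovery. The function name `pair_features` and the input layout (a list of already-segmented Term lists) are hypothetical and not taken from the thesis.

```python
import math
from collections import Counter, defaultdict


def pair_features(sentences):
    """Return {(a, b): (pmi, left_entropy, right_entropy)} for adjacent Term pairs.

    `sentences` is a list of already-segmented Term lists (assumed input format).
    """
    unigrams = Counter()
    bigrams = Counter()
    left_ctx = defaultdict(Counter)   # Terms seen immediately before the pair
    right_ctx = defaultdict(Counter)  # Terms seen immediately after the pair

    for terms in sentences:
        unigrams.update(terms)
        for i in range(len(terms) - 1):
            pair = (terms[i], terms[i + 1])
            bigrams[pair] += 1
            if i > 0:
                left_ctx[pair][terms[i - 1]] += 1
            if i + 2 < len(terms):
                right_ctx[pair][terms[i + 2]] += 1

    n_uni = sum(unigrams.values())
    n_bi = sum(bigrams.values())

    def entropy(counter):
        # Shannon entropy (in bits) of the context distribution.
        total = sum(counter.values())
        if total == 0:
            return 0.0
        return sum(c / total * math.log2(total / c) for c in counter.values())

    features = {}
    for (a, b), count in bigrams.items():
        p_ab = count / n_bi
        p_a = unigrams[a] / n_uni
        p_b = unigrams[b] / n_uni
        # PMI: how much more often the pair co-occurs than independence predicts.
        pmi = math.log2(p_ab / (p_a * p_b))
        features[(a, b)] = (pmi, entropy(left_ctx[(a, b)]), entropy(right_ctx[(a, b)]))
    return features
```

Under these assumptions, a pair with high PMI and high context entropy on both sides behaves like a single unit that appears in varied surroundings, which is the kind of evidence a classifier such as an SVM could use to decide whether two Terms are tightly bound. The actual feature set and thresholds used in the thesis are not stated in the abstract.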
[Degree-granting institution]: Beijing Jiaotong University
[Degree level]: Master's
[Year degree awarded]: 2015
[Classification number]: TP391.3;TP18



Article link: http://sikaile.net/kejilunwen/sousuoyinqinglunwen/1727664.html

