Research on Search Engine Input Error Correction Based on Statistical Language Models

Published: 2018-07-21 15:04

[Abstract]: As information technology develops rapidly, search engines play an increasingly important role on the Internet, and a growing population of users expects more from them. Input error correction is an important auxiliary feature of search engines that is already widely deployed, so studying it matters for their further development. Error correction is also one of the important research topics in natural language processing, and work on Chinese text started later than work on English. Current approaches fall into two main families: dictionary-based and statistical-model-based. Dictionary-based correction is limited by the size and content of the dictionary, whereas statistical methods analyze the relationships within the language from a large number of examples and need no dedicated dictionary. Statistical models used for correction include mutual-information probability, N-gram models, and Chinese decision methods based on combination degree. This thesis proposes a method that relies entirely on analyzing contextual statistical information. To demonstrate its feasibility, a distributed search engine platform was built on Nutch and Hadoop for experimental validation. The main work is as follows. To build a sound search engine platform, the thesis first introduces the mainstream indexing mechanism, the inverted index. It presents the performance model and compression techniques of the inverted index, compares its performance with that of a conventional index, and calculates the time and space complexity of building it. This leads to Lucene, a toolkit that applies the inverted index well for building search engines, and to Nutch, the search engine built on Lucene.
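As a minimal sketch of the inverted index idea mentioned above (not the thesis's implementation and not Lucene's on-disk format), the following Python maps each term to the sorted list of documents that contain it and answers a boolean AND query. The function names, the sample documents, and whitespace tokenization are assumptions made for illustration; Chinese text would need a word segmenter first, and a real index would also compress its posting lists.

```python
from collections import defaultdict


def build_inverted_index(docs):
    """Map each term to a sorted list of ids of the documents containing it."""
    postings = defaultdict(set)
    for doc_id, text in docs.items():
        # Assumption: whitespace tokenization, lowercased terms.
        for term in text.lower().split():
            postings[term].add(doc_id)
    return {term: sorted(ids) for term, ids in postings.items()}


def search(index, query):
    """Return ids of documents that contain every query term (boolean AND)."""
    terms = query.lower().split()
    if not terms:
        return []
    result = set(index.get(terms[0], []))
    for term in terms[1:]:
        result &= set(index.get(term, []))
    return sorted(result)


# Hypothetical toy collection, used only to exercise the functions.
docs = {
    1: "search engine error correction",
    2: "inverted index for search",
    3: "language model smoothing",
}
index = build_inverted_index(docs)
```

Here `search(index, "search index")` intersects the posting lists of both terms, so lookup cost depends on the length of those lists rather than on the size of the whole collection.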
Because the experiments require large-scale data, a distributed platform is adopted, and the distributed search engine built from Nutch + Hadoop is described in detail. Given the limitations of theoretical research on Chinese, correcting what users type into the search engine requires building an N-gram language model over a Chinese corpus, analyzing it in detail, determining the parameters the model needs, and applying smoothing to address data sparsity. With a large corpus, the keywords produced by N-gram correction may yield identical results, so TF-IDF is used to weight the preliminary results and filter them to obtain the best result set.
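The abstract does not give the model order, the smoothing method, or how TF-IDF weights enter the final filtering, so the following is only a sketch under stated assumptions: a bigram model with add-one (Laplace) smoothing ranks candidate corrections, and a textbook TF-IDF weight is computed separately. All function names and the toy corpus are hypothetical.

```python
import math
from collections import Counter


def train_bigrams(sentences):
    """Count unigrams and bigrams over tokenized sentences with boundary markers."""
    unigrams, bigrams = Counter(), Counter()
    for tokens in sentences:
        padded = ["<s>"] + tokens + ["</s>"]
        unigrams.update(padded)
        bigrams.update(zip(padded, padded[1:]))
    return unigrams, bigrams


def log_prob(tokens, unigrams, bigrams):
    """Bigram log-probability with add-one smoothing (an assumed choice)."""
    vocab_size = len(unigrams)
    padded = ["<s>"] + tokens + ["</s>"]
    total = 0.0
    for prev, cur in zip(padded, padded[1:]):
        # Unseen bigrams get a small nonzero probability instead of zero.
        total += math.log((bigrams[(prev, cur)] + 1) / (unigrams[prev] + vocab_size))
    return total


def best_candidate(candidates, unigrams, bigrams):
    """Pick the candidate token sequence the smoothed model scores highest."""
    return max(candidates, key=lambda c: log_prob(c, unigrams, bigrams))


def tf_idf(term, doc, docs):
    """Textbook TF-IDF: relative term frequency times log(N / document frequency)."""
    df = sum(1 for d in docs if term in d)
    if df == 0 or not doc:
        return 0.0
    return (doc.count(term) / len(doc)) * math.log(len(docs) / df)


# Hypothetical toy corpus of pre-tokenized queries.
sentences = [
    ["search", "engine", "index"],
    ["search", "engine", "query"],
    ["language", "model"],
]
unigrams, bigrams = train_bigrams(sentences)
candidates = [["search", "engine"], ["search", "engin"]]
chosen = best_candidate(candidates, unigrams, bigrams)
```

Smoothing is what lets the model score the unseen token `engin` at all; without it the corresponding bigram probability would be zero and the logarithm undefined. In this sketch, terms that occur in fewer documents receive a larger TF-IDF weight, which is the property a tie-breaking step between equally scored corrections could draw on.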
Degree-granting institution: Jiangsu University of Science and Technology
Degree level: Master's
Year conferred: 2017
Classification number: TP391.3

Article link: http://sikaile.net/kejilunwen/sousuoyinqinglunwen/2135910.html
