Research on Search Engine Input Error Correction Technology Based on Statistical Language Models (基于統計語言模型的搜索引擎輸入糾錯技術研究)
[Abstract]: With the rapid development of information technology, search engines play an increasingly important role on the Internet, and users' demands on them keep growing. Input error correction is an important auxiliary capability of a search engine and has been widely applied, so research on error-correction technology is of lasting significance for search engine development. Error correction is an important research topic in natural language processing. Work on correcting Chinese text started later than work on English, and the two main approaches at present are dictionary-based and statistical. Dictionary-based correction is constrained by the size and coverage of the dictionary, whereas statistical approaches analyze relationships within the language from large numbers of examples and require no special dictionary. Statistical models used for error correction include methods based on mutual information, N-gram models, and cohesion-based judgments for Chinese. This thesis presents a correction method based on analyzing contextual statistics and, to demonstrate its feasibility, builds a distributed search engine platform on Nutch and Hadoop.

The main work is as follows. To construct a solid search engine platform, the thesis first introduces the mainstream indexing mechanism, the inverted index: its performance model and compression techniques are analyzed, its performance is compared with that of a conventional index, and its time and space complexity are derived. This leads to Lucene, the search engine toolkit built around the inverted index, and then to Nutch, the search engine built on Lucene. Because the experiments require large volumes of data, the distributed search engine built with Nutch and Hadoop is described in detail. Since theoretical research on Chinese error correction is still limited, implementing correction of the queries entered into the search engine requires building an N-gram language model over a Chinese corpus and analyzing it in detail: the necessary model parameters are determined, and the data-sparsity problem is handled with smoothing techniques. Because correction with the N-gram model over a large corpus may return several equally plausible candidates for a keyword, TF-IDF is used to weight the preliminarily processed results and filter them down to the best result set.
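To make the indexing discussion above concrete, the sketch below shows the core idea of an inverted index in plain Java: each term maps to a postings list of (document id, term frequency) entries. This is only an illustrative toy under assumed names (TinyInvertedIndex and its methods are invented for the example), not Lucene's actual data structures, and the Chinese tokens are assumed to be pre-segmented.

```java
import java.util.*;

// A toy inverted index: each term maps to a postings list of
// (document id -> term frequency) entries. Illustrative only; class and
// method names are invented and do not come from Lucene.
public class TinyInvertedIndex {
    // term -> (docId -> frequency of the term in that document)
    private final Map<String, Map<Integer, Integer>> postings = new HashMap<>();
    private int docCount = 0;

    // Index one document; tokens are assumed to be pre-segmented
    // (Chinese text would first go through a word segmenter).
    public int addDocument(List<String> tokens) {
        int docId = docCount++;
        for (String term : tokens) {
            postings.computeIfAbsent(term, t -> new HashMap<>())
                    .merge(docId, 1, Integer::sum);
        }
        return docId;
    }

    // Return the postings list for a term (empty if the term is unknown).
    public Map<Integer, Integer> lookup(String term) {
        return postings.getOrDefault(term, Collections.emptyMap());
    }

    public static void main(String[] args) {
        TinyInvertedIndex index = new TinyInvertedIndex();
        index.addDocument(Arrays.asList("搜索", "引擎", "纠错"));
        index.addDocument(Arrays.asList("搜索", "引擎", "索引"));
        // Prints the postings for "搜索": documents 0 and 1, frequency 1 each.
        System.out.println(index.lookup("搜索"));
    }
}
```

Looking up a term costs a single hash probe, and each postings list grows only with the number of documents containing that term, which is why the inverted index scales better than scanning documents directly.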
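The correction step itself relies on an N-gram language model over a Chinese corpus, with smoothing to handle data sparsity. The sketch below is a minimal bigram (N = 2) scorer with add-one (Laplace) smoothing; the abstract does not state which order of N-gram or which smoothing method the thesis finally adopts, so both choices, along with the toy training sentences and candidate queries, are assumptions made for illustration.

```java
import java.util.*;

// A minimal bigram (N = 2) language model with add-one (Laplace) smoothing,
// used to score candidate corrections of a query. Training data and
// candidates below are toy examples; the thesis trains on a large corpus.
public class BigramScorer {
    private final Map<String, Integer> unigramCounts = new HashMap<>();
    private final Map<String, Integer> bigramCounts = new HashMap<>();
    private int vocabularySize;

    public void train(List<List<String>> corpus) {
        for (List<String> sentence : corpus) {
            for (int i = 0; i < sentence.size(); i++) {
                unigramCounts.merge(sentence.get(i), 1, Integer::sum);
                if (i > 0) {
                    bigramCounts.merge(sentence.get(i - 1) + "\t" + sentence.get(i),
                            1, Integer::sum);
                }
            }
        }
        vocabularySize = unigramCounts.size();
    }

    // Smoothed log-probability of a token sequence:
    // P(w_i | w_{i-1}) = (c(w_{i-1} w_i) + 1) / (c(w_{i-1}) + |V|)
    public double logProb(List<String> tokens) {
        double logP = 0.0;
        for (int i = 1; i < tokens.size(); i++) {
            int pairCount = bigramCounts.getOrDefault(
                    tokens.get(i - 1) + "\t" + tokens.get(i), 0);
            int prevCount = unigramCounts.getOrDefault(tokens.get(i - 1), 0);
            logP += Math.log((pairCount + 1.0) / (prevCount + vocabularySize));
        }
        return logP;
    }

    public static void main(String[] args) {
        BigramScorer model = new BigramScorer();
        model.train(Arrays.asList(
                Arrays.asList("搜索", "引擎", "输入", "纠错"),
                Arrays.asList("搜索", "引擎", "索引", "机制")));
        // "索引" (index) vs. a homophone-style typo "缩影": every bigram of the
        // first candidate was observed in training, so it gets the higher score.
        System.out.println(model.logProb(Arrays.asList("搜索", "引擎", "索引")));
        System.out.println(model.logProb(Arrays.asList("搜索", "引擎", "缩影")));
    }
}
```

Add-one smoothing is the simplest way to keep unseen bigrams from receiving zero probability; heavier-duty schemes such as Good-Turing or Katz backoff follow the same scoring pattern.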
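Finally, when several corrected keywords score comparably under the language model, the abstract says TF-IDF weights are used to screen the preliminary results. A minimal sketch of that weighting follows, using the textbook formula tf-idf(t, d) = tf(t, d) * log(N / df(t)) over hypothetical toy documents; the thesis computes these weights over its full corpus.

```java
import java.util.*;

// A minimal TF-IDF weighting step for screening correction candidates:
// tf-idf(t, d) = tf(t, d) * log(N / df(t)). Documents and candidate terms
// below are hypothetical toy data.
public class TfIdfRanker {
    public static double tfIdf(String term, List<String> doc, List<List<String>> corpus) {
        long tf = doc.stream().filter(term::equals).count();             // term frequency in doc
        long df = corpus.stream().filter(d -> d.contains(term)).count(); // document frequency
        if (tf == 0 || df == 0) {
            return 0.0;
        }
        return tf * Math.log((double) corpus.size() / df);
    }

    public static void main(String[] args) {
        List<List<String>> corpus = Arrays.asList(
                Arrays.asList("搜索", "引擎", "纠错", "技术"),
                Arrays.asList("搜索", "引擎", "索引"),
                Arrays.asList("分布式", "平台", "搜索"));
        // "纠错" occurs in only one document, so it carries a higher weight in
        // document 0 than the ubiquitous "搜索", whose IDF is log(3/3) = 0.
        for (String candidate : Arrays.asList("纠错", "搜索")) {
            System.out.println(candidate + " -> " + tfIdf(candidate, corpus.get(0), corpus));
        }
    }
}
```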
[Degree-granting institution]: 江蘇科技大學 (Jiangsu University of Science and Technology)
[Degree level]: Master's
[Year conferred]: 2017
[CLC classification number]: TP391.3