Research and Implementation of a Hadoop-Based Chinese Word Collocation Extraction System
[Abstract]: A collocation is a recurrent, syntactically bound, yet arbitrary and non-analogical combination of words. Collocation extraction is the automatic extraction of collocations from a corpus by means of computing power and programs. With the rapid development of computer technology, automatic collocation extraction has become increasingly important. On the one hand, it plays an important role in many natural language processing applications, such as machine translation, word sense disambiguation, language generation, and information retrieval. On the other hand, collocations matter in language teaching and second language acquisition. Data and large-scale corpora are important sources of knowledge for collocation research in computational linguistics, and the explosive growth of Internet data and the continuing expansion of corpora make effective methods for automatic collocation extraction particularly important. To extract typical collocations of Chinese content words, this thesis studies a distributed word collocation retrieval system based on Java Web and Hadoop. The system is built mainly on the key technologies of the Hadoop distributed computing platform, integrates knowledge of Chinese linguistics, and draws on statistical methods; it gives users a new, intelligent, and convenient way to obtain collocation information. The research proceeds as follows. First, the existing statistical methods for word collocation extraction and the key technologies of the Hadoop distributed platform are described, the advantages and disadvantages of these methods are compared and analyzed, and the evaluation measures for collocation extraction are introduced: precision, recall, and F value.
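The three evaluation measures can be illustrated with a minimal sketch. It assumes that the F value is the balanced F-measure (the harmonic mean of precision and recall) and that evaluation compares the extracted collocations with a hand-checked reference set. The function name `precision_recall_f` and the toy word pairs are hypothetical and are not taken from the thesis.

```python
def precision_recall_f(extracted, gold):
    """Return (precision, recall, F) for extracted collocations against a gold set.

    Assumption: F is the balanced F-measure, 2PR / (P + R).
    """
    extracted, gold = set(extracted), set(gold)
    correct = len(extracted & gold)  # extracted pairs that are true collocations
    precision = correct / len(extracted) if extracted else 0.0
    recall = correct / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f_value = 2 * precision * recall / (precision + recall)
    return precision, recall, f_value


# Hypothetical toy data, for illustration only.
extracted = {("提高", "水平"), ("解决", "问题"), ("美丽", "问题")}
gold = {("提高", "水平"), ("解决", "问题"), ("发挥", "作用"), ("采取", "措施")}

p, r, f = precision_recall_f(extracted, gold)
print(f"precision={p:.2%} recall={r:.2%} F={f:.2%}")
```

Under this definition a system can reach high precision with low recall, or the reverse, and the F value rewards a balance between the two.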
Next, the thesis analyzes the part-of-speech patterns of collocating words, selects the typical collocation types of Chinese content words, and describes their part-of-speech composition. Finally, the experimental part gives the concrete method for extracting Chinese content-word collocations from an n-gram corpus. (1) Within the MapReduce model, sparse data and non-Chinese data are removed, and the NLPIR Chinese word segmentation system is called for word segmentation and part-of-speech tagging, which completes corpus preprocessing. The candidate collocation set is then selected by extraction within a span, the matching rules are used to filter content-word collocations, and statistics are computed with three statistical methods: co-occurrence frequency, mutual information, and the chi-square test. The intermediate and final results are stored in the HBase distributed database, and a user dictionary of Chinese word collocations is constructed. (2) A Hadoop-based Chinese word collocation dictionary is developed. The front-end pages of the collocation extraction system are designed with the Bootstrap framework, and they implement the setting of conditions in the word retrieval area and the display of results. (3) A method for extracting typical collocations of content words is summarized. This comprehensive method, which combines data technology, linguistic knowledge, and statistical methods, is applied in extraction experiments on four types of collocation: noun, verb, adjective, and adverb. Quantitative comparison shows that extraction based on co-occurrence frequency performs best. For noun collocations, precision is 86%, recall is 59.72%, and the F value is 70.49%; for verb collocations, 80%, 65.57%, and 72.07%; for adjective collocations, 82%, 78.85%, and 80.39%; and for adverb collocations, 88%, 43.56%, and 58.28%. The precision of adjective and noun extraction is 2% to 4% higher than that of existing collocation extraction software, so the method has a certain value.
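The three statistics named in the abstract can be sketched for a single candidate pair as follows. This is an illustration under stated assumptions, not the thesis's implementation: mutual information is taken to be pointwise mutual information with base-2 logarithm, the chi-square test is the Pearson statistic on a 2x2 contingency table, and the counts are hypothetical. The function name `association_scores` and its parameters are introduced here for illustration only.

```python
import math


def association_scores(pair_count, w1_count, w2_count, total):
    """Score a candidate pair (w1, w2) from co-occurrence counts.

    pair_count: number of windows containing both w1 and w2
    w1_count, w2_count: number of windows containing w1, respectively w2
    total: total number of windows observed
    """
    # Cells of the 2x2 contingency table.
    a = pair_count                 # w1 with w2
    b = w1_count - pair_count      # w1 without w2
    c = w2_count - pair_count      # w2 without w1
    d = total - a - b - c          # neither word

    # Pointwise mutual information: log2( P(w1, w2) / (P(w1) * P(w2)) ).
    if a > 0:
        pmi = math.log2(a * total / (w1_count * w2_count))
    else:
        pmi = float("-inf")

    # Pearson chi-square statistic for a 2x2 table.
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    chi_square = total * (a * d - b * c) ** 2 / denom if denom else 0.0

    return {"frequency": a, "pmi": pmi, "chi_square": chi_square}


# Hypothetical counts: a strongly associated pair and an independent pair.
strong = association_scores(pair_count=30, w1_count=100, w2_count=60, total=10000)
independent = association_scores(pair_count=6, w1_count=100, w2_count=600, total=10000)
print(strong)
print(independent)
```

In the second call the observed count equals the count expected under independence (100 × 600 / 10000 = 6), so both the mutual information and the chi-square statistic are zero, while raw co-occurrence frequency alone would still rank the pair above any rarer pair.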
[Degree-granting institution]: Yangtze University
[Degree level]: Master's
[Year degree conferred]: 2017
[Classification number]: TP391.1