Research on the Refinement of Text Corpora
[Abstract]: A text corpus is the foundation of text data mining. Many text corpora come from day-to-day operational work, and their categories are usually defined by industry experts. The data set in this paper comes from the mayor's public hotline office. Because industry categories change over time, the corpus inevitably accumulates many incorrectly labeled records, and because the corpus is large, experts cannot proofread it record by record. Data mining methods must therefore be used to find the records that are likely misclassified, which industry experts then proofread one by one. The purpose of this paper is to screen out the misclassified data in the corpus so that industry experts can correct its category labels. The paper first discusses the techniques and workflow of text classification, then the properties of the naive Bayes method, and finally the refinement of the text corpus and methods for selecting records with category discrimination errors; an empirical analysis is given. With big data, having industry experts correct every record manually is unrealistic, because it consumes a great deal of manpower, material, and money. Labeling categories in batches according to fixed rules is another approach; it avoids the cost of direct expert classification, but its labeling accuracy is low. A third method combines the two: the text data are first labeled in batches, the records whose category labels are found to be wrong are handed to industry experts for manual labeling, and the corpus is then corrected using the expert-labeled records. The study of text corpus refinement in this paper is based on this third method.
Different methods are used to extract the records in the corpus that have category discrimination errors. In every method, the records with a category discrimination error are treated as those most likely to carry a category labeling error. The purpose of text corpus refinement is to extract the records most likely to be mislabeled, hand them to industry experts for manual labeling, and finally correct the category labels in the corpus based on the expert-labeled data. This paper first introduces the general process of text classification and then the naive Bayes classification algorithm; finally, it studies the purpose and methods of text corpus preprocessing, feature extraction, text corpus refinement, and the extraction of records with category discrimination errors. The emphasis of the paper is on methods for extracting such records.
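The screening idea described in the abstract can be illustrated with a minimal sketch. The abstract does not give the thesis's actual procedure, so everything below is an assumption made for illustration: a "category discrimination error" is taken to mean that a naive Bayes classifier's predicted category disagrees with the recorded label, the model is a multinomial naive Bayes with Laplace smoothing trained on the whole corpus and then re-applied to it, documents are assumed to be already segmented into tokens, and the toy records, the category names, and the function names `train_nb`, `predict`, and `flag_suspects` are hypothetical.

```python
import math
from collections import Counter, defaultdict


def train_nb(docs, labels, alpha=1.0):
    """Fit a multinomial naive Bayes model (Laplace smoothing) on tokenized docs."""
    class_docs = Counter(labels)          # number of documents per category
    word_counts = defaultdict(Counter)    # token counts per category
    vocab = set()
    for tokens, label in zip(docs, labels):
        word_counts[label].update(tokens)
        vocab.update(tokens)
    model = {}
    for label, n_docs in class_docs.items():
        denom = sum(word_counts[label].values()) + alpha * len(vocab)
        model[label] = {
            "log_prior": math.log(n_docs / len(docs)),
            "log_like": {
                w: math.log((word_counts[label][w] + alpha) / denom) for w in vocab
            },
        }
    return model


def predict(model, tokens):
    """Return the most probable category; tokens outside the vocabulary are ignored."""
    def score(label):
        params = model[label]
        return params["log_prior"] + sum(
            params["log_like"][t] for t in tokens if t in params["log_like"]
        )
    return max(model, key=score)


def flag_suspects(docs, labels):
    """Indices of records whose recorded label disagrees with the model's prediction."""
    model = train_nb(docs, labels)
    return [i for i, (tokens, label) in enumerate(zip(docs, labels))
            if predict(model, tokens) != label]


# Hypothetical toy corpus; the last record is deliberately mislabeled.
docs = [
    ["water", "pipe", "leak"],
    ["water", "supply", "cut"],
    ["pipe", "burst", "water"],
    ["bus", "route", "delay"],
    ["traffic", "light", "broken"],
    ["bus", "stop", "traffic"],
    ["water", "pipe", "leak", "street"],
]
labels = ["water", "water", "water", "traffic", "traffic", "traffic", "traffic"]

suspects = flag_suspects(docs, labels)
print(suspects)  # [6]
```

Only the flagged records would go to industry experts for manual labeling. Predicting on held-out folds rather than on the training data itself would be a stricter variant, since a model trained on a mislabeled record tends to partly confirm that label.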
[Degree-granting institution]: Northeast Normal University
[Degree level]: Master's
[Year degree conferred]: 2017
[Classification number]: H08