A Study on Refining Text Corpora

Published: 2018-10-11 09:07

[Abstract]: Text corpora are the foundation of text data mining. Many corpora originate in day-to-day operational work, and their categories are usually defined by domain experts. The dataset in this thesis comes from the Mayor's Public Hotline Office. Because the category scheme changed from one period to the next, the corpus inevitably contains many mislabeled records, and because the corpus is large, experts cannot proofread it record by record. Data mining methods must therefore be used to find the records that are likely misclassified, so that domain experts only need to check those records one by one. The subject of this thesis is the screening of misclassified records in the corpus so that domain experts can correct their categories.
The thesis addresses the discriminant classification of text data. It first describes text classification techniques and the classification workflow, then discusses the properties of the naive Bayes method, and finally studies the refinement of a text corpus, covering how to select records whose category is judged to be wrong, together with an empirical analysis.
With large volumes of data, having domain experts label every record by hand is unrealistic, because it consumes a great deal of manpower, materials, and money. Labeling records in batches according to fixed rules is an alternative that avoids the cost of direct expert labeling, but its labeling accuracy is comparatively low. A third method combines the two: label the records in batches first, hand the records whose labels appear to be wrong to domain experts for manual labeling, and then use the expert-labeled records to correct the corpus. Corpus refinement as studied here is based on this third method.
Several different methods are used to extract the records whose category is judged to be wrong. The records that every method judges to be wrong are the ones most likely to carry an incorrect label. The goal of corpus refinement is to extract these most-likely-mislabeled records, hand them to domain experts for manual labeling, and finally correct the categories in the corpus on the basis of the expert labels.
The thesis first outlines the general workflow of text classification, then introduces the naive Bayes classification algorithm, and finally studies corpus preprocessing, feature-word extraction, the purpose and methods of corpus refinement, and the extraction of records whose category is judged to be wrong. Its main focus is the method for extracting those records.
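
The selection rule described in the abstract (keep only the records that every method classifies differently from their assigned label) can be sketched roughly as follows. This is a minimal illustration, not the thesis's implementation: the abstract names naive Bayes but does not say which other methods were used, so the nearest-centroid classifier, the leave-one-out evaluation, the whitespace-tokenized toy corpus, and all function names here are assumptions made for the example.

```python
import math
from collections import Counter, defaultdict


def nb_predict(train, tokens):
    """Multinomial naive Bayes with Laplace smoothing; train is [(tokens, label)]."""
    class_docs = Counter(label for _, label in train)
    word_counts = defaultdict(Counter)
    for doc, label in train:
        word_counts[label].update(doc)
    vocab = {w for doc, _ in train for w in doc}
    best_label, best_score = None, -math.inf
    for label, n_docs in class_docs.items():
        total = sum(word_counts[label].values())
        score = math.log(n_docs / len(train))  # log prior
        for w in tokens:
            if w in vocab:  # words unseen in training are ignored
                score += math.log((word_counts[label][w] + 1) / (total + len(vocab)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label


def centroid_predict(train, tokens):
    """Assumed second method: nearest class centroid by cosine similarity of term counts."""
    centroids = defaultdict(Counter)
    for doc, label in train:
        centroids[label].update(doc)
    query = Counter(tokens)

    def cosine(a, b):
        dot = sum(a[w] * b[w] for w in a)
        norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
        return dot / norm if norm else 0.0

    return max(centroids, key=lambda label: cosine(query, centroids[label]))


def likely_mislabeled(corpus, methods):
    """Indices of records that every method classifies differently from their label."""
    suspects = []
    for i, (tokens, label) in enumerate(corpus):
        rest = corpus[:i] + corpus[i + 1:]  # leave-one-out: hide the record's own label
        if all(predict(rest, tokens) != label for predict in methods):
            suspects.append(i)
    return suspects


# Hypothetical hotline records; the last one is deliberately mislabeled.
corpus = [
    ("water pipe leak street".split(), "water"),
    ("no water supply building".split(), "water"),
    ("water pressure low pipe".split(), "water"),
    ("bus route delay traffic".split(), "traffic"),
    ("traffic light broken road".split(), "traffic"),
    ("road bus stop traffic jam".split(), "traffic"),
    ("pipe leak water supply".split(), "traffic"),
]

suspects = likely_mislabeled(corpus, [nb_predict, centroid_predict])
print(suspects)  # [6]
```

Requiring agreement among all methods keeps the list handed to the experts short; only the records in `suspects` would be sent for manual relabeling, and the expert labels would then replace the stored ones.
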
[Degree-granting institution]: Northeast Normal University
[Degree level]: Master's
[Year degree conferred]: 2017
[Classification number]: H08

Article ID: 2263630

Article link: http://sikaile.net/wenyilunwen/yuyanyishu/2263630.html
