A Chinese Text Plagiarism Detection Method Based on Secondary Feature Extraction
Topic: plagiarism detection  Entry point: text preprocessing  Source: Southwest University, 2013 master's thesis
【Abstract】: In recent years, with the rapid development of information technology and communication networks, the way people obtain information has shifted from physical media to online documents. This brings convenience, but it also has negative effects on daily life and on the development of the technology itself. Compared with traditional documents, electronic documents are easier to copy illegally, and text plagiarism has become serious in many fields, including academia and business. Research on text plagiarism detection is therefore important for maintaining normal teaching order in universities, protecting intellectual property, and curbing the spread of plagiarism. Relatively effective detection systems in this field currently include Siff, COPS, and the CNKI detection system, but they generally suffer from low detection accuracy.

The main idea of Chinese text plagiarism detection is as follows. First, the text is preprocessed, which includes removing information irrelevant to detection and segmenting the text into words. Second, text features are extracted. Finally, the similarity between the suspect text and the source text is computed; if the similarity exceeds a preset threshold, the suspect text is flagged as possibly plagiarized. Text preprocessing and feature extraction are the key and most difficult parts of plagiarism detection. This thesis focuses on these two aspects, and its main work is as follows:

1. Text preprocessing: at present, most plagiarism detection methods for Chinese text apply only simple processing and ignore the distinction between single-character and multi-character words in Chinese. This leads to incomplete feature extraction and therefore to low detection accuracy. To address this, we propose a preprocessing method that merges whole words: after segmentation, words that together form a single unit of meaning are merged according to the semantic relationships between neighboring words, and the result serves as the preprocessed text. Experiments show that text with merged whole words reduces the number of later computations and gives feature extraction a better basis, thereby improving detection accuracy.

2. Text feature extraction: feature extraction selects text blocks that can represent the text. The selected blocks should carry information representative of the text, including semantic information and some structural information, so that detection accuracy is as high as possible. Existing extraction methods, however, yield incomplete features or too many features, and the algorithms require many computations and have high time complexity. To address these problems, we propose performing a secondary feature extraction on the preprocessed text, which improves the precision of the features and reduces their length. Text information is represented mainly by digital fingerprints: every text is converted into a set of digital fingerprints, the frequency of each fingerprint is counted, and the similarity between fingerprint sets is computed with a similarity measure based on matching statistics. Experiments show that the features extracted by this method represent the text accurately and are of moderate length.

3. Chinese text plagiarism detection based on secondary feature extraction: the proposed whole-word-merging preprocessing method and the secondary feature extraction method are used to process the text and to extract text features, respectively, which together implement a Chinese text plagiarism detection method based on secondary feature extraction. Experiments show that both the detection accuracy and the recall of this method improve markedly.
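The pipeline described in the abstract (merge whole words, fingerprint the text, count fingerprint frequencies, compare by matching statistics, apply a threshold) can be sketched roughly as below. This is only an illustrative sketch under several assumptions, not the thesis's implementation. The abstract does not say how whole words are identified, so `merge_whole_words` assumes a hypothetical lexicon of known compounds. It does not say what the second extraction pass selects, so `second_pass` uses a winnowing-style window minimum as a stand-in. The similarity formula, the parameters `k`, `window`, and `threshold`, and all function names are likewise assumptions. Input is assumed to be already segmented into tokens.

```python
import hashlib
from collections import Counter


def merge_whole_words(tokens, lexicon):
    """Greedily merge adjacent tokens whose concatenation is in `lexicon`.

    Assumption: a lexicon lookup stands in for the semantic test that the
    abstract mentions but does not specify.
    """
    merged, i = [], 0
    while i < len(tokens):
        if i + 1 < len(tokens) and tokens[i] + tokens[i + 1] in lexicon:
            merged.append(tokens[i] + tokens[i + 1])
            i += 2
        else:
            merged.append(tokens[i])
            i += 1
    return merged


def fingerprint(chunk):
    """Map a tuple of tokens to a stable integer fingerprint (MD5 digest)."""
    data = "\x1f".join(chunk).encode("utf-8")
    return int(hashlib.md5(data).hexdigest(), 16)


def first_pass(tokens, k=2):
    """First pass: fingerprint every run of k consecutive tokens."""
    return [fingerprint(tuple(tokens[i:i + k]))
            for i in range(len(tokens) - k + 1)]


def second_pass(hashes, window=3):
    """Second pass (stand-in): keep the minimum fingerprint of each sliding
    window, recording each selected position once, and count frequencies."""
    if not hashes:
        return Counter()
    picked = {}
    for start in range(max(1, len(hashes) - window + 1)):
        span = range(start, min(start + window, len(hashes)))
        # Smallest hash wins; ties go to the rightmost position.
        pos = min(span, key=lambda p: (hashes[p], -p))
        picked[pos] = hashes[pos]
    return Counter(picked.values())


def extract_features(tokens, lexicon, k=2, window=3):
    """Preprocess, then run both extraction passes."""
    return second_pass(first_pass(merge_whole_words(tokens, lexicon), k), window)


def similarity(suspect, source):
    """Share of the suspect's fingerprint occurrences also found in the source."""
    total = sum(suspect.values())
    if total == 0:
        return 0.0
    matched = sum((suspect & source).values())  # per-fingerprint minimum count
    return matched / total


def detect(suspect_tokens, source_tokens, lexicon, threshold=0.5):
    """Return (score, flagged); flagged means the score exceeds the threshold."""
    score = similarity(extract_features(suspect_tokens, lexicon),
                       extract_features(source_tokens, lexicon))
    return score, score > threshold
```

A call such as `detect(suspect_tokens, source_tokens, {"抄袭检测"})` returns the similarity score and whether it exceeds the threshold. Merging before fingerprinting shortens the token sequence, which is one way to read the abstract's claim that whole-word merging reduces later computation.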
【Degree-granting institution】: Southwest University
【Degree level】: Master's
【Year degree granted】: 2013
【Classification number】: TP391.1
Article ID: 1660143
Article link: http://sikaile.net/falvlunwen/zhishichanquanfa/1660143.html