Research on Tibetan Text Copy Detection Technology

Published: 2018-08-09 07:32

[Abstract]: Text is one of the main forms of information on the Internet. As the Internet has grown and online digital resources have multiplied, it has become a convenient platform for sharing resources and exchanging information, an important source of information for the public, and a convenient channel of academic exchange for researchers, teachers, and students. A new text can be produced from an existing one by adding, deleting, or altering words, or by restating its content; this practice is called text copying, or plagiarism. Text copy detection is an important means of deterring it, protecting the intellectual property in texts, upholding academic integrity, and improving the efficiency of information retrieval. Copy detection for Chinese and English text is by now relatively mature. Because Tibetan differs inherently from both languages, however, many of the techniques developed for Chinese and English do not carry over fully to Tibetan and cannot be used to measure the copy rate of Tibetan text. This gap has led to low paper quality, a poor academic climate, and difficulty in raising the level of academic innovation at many ethnic universities and among Tibetology researchers. How to design and implement a copy-rate detection system for Tibetan text is therefore the focus of this research. An analysis of Chinese and English copy detection results shows that the smallest unit a plagiarist typically copies is no smaller than a sentence, because the sentence is the basic unit of text that carries complete meaning. This paper therefore adopts a sentence-level copy detection method for Tibetan and computes the similarity of Tibetan sentences with the cosine similarity of vectors in a vector space model. The key steps of the algorithm are to select feature vectors, build the vector space model from them, and finally compute the cosine similarity.
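The sentence-level cosine similarity described above can be sketched as follows. This is a minimal illustration, not the thesis's implementation: the function name `cosine_similarity` is hypothetical, each sentence is assumed to be already segmented into a list of tokens, and raw token counts stand in for whatever feature weights the thesis selects.

```python
import math
from collections import Counter


def cosine_similarity(tokens_a, tokens_b):
    """Cosine of the angle between two bag-of-words count vectors.

    tokens_a, tokens_b: lists of tokens for two sentences (assumed to be
    pre-segmented; segmentation itself is outside this sketch).
    """
    vec_a, vec_b = Counter(tokens_a), Counter(tokens_b)
    # Only tokens present in both sentences contribute to the dot product.
    dot = sum(vec_a[t] * vec_b[t] for t in vec_a.keys() & vec_b.keys())
    norm_a = math.sqrt(sum(c * c for c in vec_a.values()))
    norm_b = math.sqrt(sum(c * c for c in vec_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty sentence is treated as similar to nothing
    return dot / (norm_a * norm_b)
```

Identical token lists score close to 1, sentences with no shared tokens score 0, and partial overlap falls in between, which is what makes the value usable as a per-sentence copy signal.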
This paper studies text copy detection technology. Following the basic steps of text copy detection, the Tibetan text is preprocessed, divided into blocks, and reduced to features; sentence similarity is then computed and used to measure the plagiarism rate of the whole Tibetan text. In preprocessing, for the sake of uniform and storable encoding, both the encoding of Tibetan text files and the encoding of Tibetan characters are examined, and all input is converted to Unicode. In text blocking, a Tibetan sentence boundary recognition method is used to split the text into sentence-sized blocks. An inverted index from sentences to documents is built at the same time, which reduces repeated pairwise comparison of duplicate sentences and records where each sentence occurs. In feature extraction, Tibetan automatic word segmentation is applied, TF-IDF is used to compute a weight for each word, and a set of word-frequency vectors is constructed. Next, the similarity between each block of the text under test and the blocks of the library texts is computed to measure the copy rate of the whole text. Finally, the method is tested on texts to be checked, the results are compared and analyzed, and two performance measures, precision and recall, are used to evaluate the Tibetan text copy detection technique.
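The steps listed in the abstract (sentence blocking, inverted index, TF-IDF features, sentence similarity, copy rate, precision and recall) can be sketched end to end as follows. Everything here is an assumption made for illustration: all function names are hypothetical, splitting on the shad (U+0F0D) is a naive stand-in for the thesis's sentence boundary recognition, splitting on the tsheg (U+0F0B) yields syllables rather than the words a real Tibetan segmenter would produce, the smoothed IDF formula and the 0.8 threshold are arbitrary choices, and the copy rate is taken to be the fraction of query sentences that match a library sentence.

```python
import math
from collections import Counter, defaultdict

SHAD = "\u0f0d"   # assumption: shad marks a sentence boundary
TSHEG = "\u0f0b"  # assumption: tsheg-separated syllables stand in for words


def split_sentences(text):
    return [s.strip() for s in text.split(SHAD) if s.strip()]


def tokenize(sentence):
    return [t.strip() for t in sentence.split(TSHEG) if t.strip()]


def build_inverted_index(library):
    """Map each distinct sentence to the (doc_id, position) pairs where it occurs."""
    index = defaultdict(list)
    for doc_id, text in library.items():
        for pos, sentence in enumerate(split_sentences(text)):
            index[sentence].append((doc_id, pos))
    return index


def tfidf_vectors(sentences):
    """One sparse TF-IDF vector (dict) per sentence, with smoothed IDF."""
    tokenized = [tokenize(s) for s in sentences]
    df = Counter(t for toks in tokenized for t in set(toks))
    n = len(tokenized)
    return [
        {t: (c / len(toks)) * (math.log((1 + n) / (1 + df[t])) + 1)
         for t, c in Counter(toks).items()}
        for toks in tokenized
    ]


def cosine(u, v):
    dot = sum(w * v[t] for t, w in u.items() if t in v)
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0


def copy_rate(query_text, library, threshold=0.8):
    """Return (fraction of query sentences matched, list of matches)."""
    index = build_inverted_index(library)
    lib_sentences = list(index)  # each distinct sentence is compared only once
    query_sentences = split_sentences(query_text)
    if not query_sentences:
        return 0.0, []
    vectors = tfidf_vectors(lib_sentences + query_sentences)
    lib_vecs = vectors[:len(lib_sentences)]
    query_vecs = vectors[len(lib_sentences):]
    matches = []
    for qi, (q, qv) in enumerate(zip(query_sentences, query_vecs)):
        if q in index:  # exact duplicate: located through the index alone
            matches.append((qi, index[q][0], 1.0))
            continue
        score, best = max(
            ((cosine(qv, lv), s) for lv, s in zip(lib_vecs, lib_sentences)),
            default=(0.0, None),
        )
        if best is not None and score >= threshold:
            matches.append((qi, index[best][0], score))
    return len(matches) / len(query_sentences), matches


def precision_recall(flagged, truly_copied):
    """Precision and recall of flagged sentence ids against a labeled set."""
    flagged, truly_copied = set(flagged), set(truly_copied)
    tp = len(flagged & truly_copied)
    precision = tp / len(flagged) if flagged else 0.0
    recall = tp / len(truly_copied) if truly_copied else 0.0
    return precision, recall
```

In this sketch the inverted index serves both purposes the abstract names: duplicate library sentences collapse into one key, so they are compared once, and each key carries the document and position needed to report where a copied sentence came from.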
[Degree-granting institution]: Qinghai Minzu University (青海民族大學)
[Degree level]: Master's
[Year degree conferred]: 2015
[Classification number]: TP391.1

Article ID: 2173367

Article link: http://sikaile.net/falvlunwen/zhishichanquanfa/2173367.html
