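Research on Document Copy Detection Methods and System Implementation
Keywords: text copy detection; online copy detection; keyword extraction; similarity calculation; inverted index. Source: Harbin Institute of Technology, master's thesis, 2012. Document type: degree thesis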
[Abstract]: With the rapid development of the Internet, online information resources have grown increasingly rich and exchanging information has become increasingly convenient. However, because electronic resources such as text, images, and video are so easy to copy, online content has become highly redundant, which lowers the retrieval efficiency of web search engines and makes information extraction harder. In recent years, plagiarism of homework and of papers has also appeared frequently at some universities. To improve the efficiency of web information retrieval, protect intellectual property, and uphold academic integrity, document copy detection has become a research hotspot in natural language processing, and its study is of great significance.
This thesis studies document copy detection in detail. Building on previous work, it improves the document copy detection method based on sentence similarity calculation, substantially raising both detection efficiency and detection accuracy.
First, the thesis introduces the background, significance, state of research in China and abroad, and related technologies of document copy detection, and analyzes the strengths and weaknesses of commonly used text copy detection algorithms.
Second, based on the traditional BSP copy detection algorithm, it proposes a sentence similarity algorithm based on the ordered longest common keyword sequence and a local (partial) sentence copy detection algorithm based on keyword distance. It also designs word-sentence and sentence-document inverted index structures, which effectively improve detection accuracy and efficiency.
Third, based on the proposed method, a text copy detection system is designed and implemented. In line with practical application requirements, its main functions include document registration, document retrieval, synonym maintenance, local copy detection, distributed copy detection, online copy detection, network settings, system settings, and document library management.
Finally, experimental results demonstrate the practicality and effectiveness of the document copy detection method studied in this thesis.
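The abstract names a sentence similarity measure based on the ordered longest common keyword sequence, together with word-sentence and sentence-document inverted indexes, but gives no formulas. The following is a minimal sketch of how such pieces might fit together, not the thesis's actual implementation. It assumes that keywords have already been extracted into ordered lists, that similarity is the longest common subsequence (LCS) length divided by the length of the longer keyword list, and that a threshold of 0.6 marks a suspected copy. All names (`lcs_length`, `sentence_similarity`, `InvertedIndex`, `find_similar`) are hypothetical, and the keyword-distance local detection step is not covered.

```python
from collections import defaultdict


def lcs_length(a, b):
    """Length of the longest common subsequence of two keyword lists."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def sentence_similarity(a, b):
    """Assumed normalization: LCS length over the longer keyword list."""
    if not a or not b:
        return 0.0
    return lcs_length(a, b) / max(len(a), len(b))


class InvertedIndex:
    """Two-level index: keyword -> sentence ids, sentence id -> document id."""

    def __init__(self):
        self.word_to_sentences = defaultdict(set)
        self.sentence_to_doc = {}
        self.sentences = []  # position in this list is the sentence id

    def add_document(self, doc_id, sentences):
        """Register a document given as a list of keyword lists."""
        for keywords in sentences:
            sid = len(self.sentences)
            self.sentences.append(keywords)
            self.sentence_to_doc[sid] = doc_id
            for word in keywords:
                self.word_to_sentences[word].add(sid)

    def find_similar(self, keywords, threshold=0.6):
        """Score only sentences sharing at least one keyword with the query."""
        candidates = set()
        for word in keywords:
            candidates |= self.word_to_sentences.get(word, set())
        hits = []
        for sid in sorted(candidates):
            score = sentence_similarity(keywords, self.sentences[sid])
            if score >= threshold:
                hits.append((self.sentence_to_doc[sid], sid, score))
        return hits


index = InvertedIndex()
index.add_document("doc1", [
    ["document", "copy", "detection", "method"],
    ["inverted", "index", "structure"],
])
hits = index.find_similar(["document", "copy", "detection", "system"])
print(hits)
```

The point of the index in this sketch is that a query sentence is compared only against registered sentences sharing at least one keyword with it, rather than against every sentence in the library. This is one plausible reading of how an inverted index could improve detection efficiency.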
[Degree-granting institution]: Harbin Institute of Technology
[Degree level]: Master's
[Year degree conferred]: 2012
[Classification number]: TP391.1
Article ID: 1511286
Article link: http://sikaile.net/kejilunwen/sousuoyinqinglunwen/1511286.html