Design and Implementation of a Document Copy Detection System Based on Similarity Estimation
Published: 2019-03-16 15:31
[Abstract]: With the development of computer network applications, the volume of similar information on the Internet is growing geometrically. The growing number of highly similar documents consumes large amounts of network storage and also degrades the user experience. The openness of information platforms and the easy availability of digital text have made plagiarism and other academic misconduct increasingly prevalent, with self-evidently serious consequences. To improve information retrieval efficiency and protect intellectual property, designing and implementing a document copy detection system based on similarity estimation has significant technical and practical value. To detect similar documents quickly and accurately in massive data sets, this thesis studies the theory and methods of document similarity estimation and designs and implements a document copy detection system based on similarity estimation. The main work is as follows. Based on the minwise similarity estimator, the thesis designs and implements a document similarity detection system comprising three functional subsystems: document preprocessing, similarity computation, and presentation and export of similarity results. It focuses on project document clustering, the similarity estimation algorithm, highlighting (coloring) of similarity evidence, similarity report generation, and statistical analysis of the data. Following the waterfall model of software engineering, the thesis describes in detail the system's business requirements, architecture design, functional design, and main business process design, and for the main functions presents the implementation environment, the interface design, and the implementation of the key functional modules.
After development and testing, the resulting system is more user-friendly to operate, and both the extraction rate for text in various formats (pdf, word) and the computational efficiency of similarity comparison are significantly improved.
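The abstract names the minwise similarity estimator but does not show how it works. The sketch below illustrates the general technique only and is not the thesis's implementation: a document is reduced to a set of word shingles, each of several seeded hash functions keeps the minimum hash value over that set, and the fraction of positions where two signatures agree estimates the Jaccard similarity of the two sets. All names here (`shingles`, `minhash_signature`, `estimate_jaccard`), the shingle length `k=3`, the 128 hash functions, and the use of seeded BLAKE2b in place of random permutations are assumptions made for illustration.

```python
import hashlib
import re


def shingles(text, k=3):
    """Return the set of k-word shingles of a text (k=3 is an assumed choice)."""
    words = re.findall(r"\w+", text.lower())
    if len(words) < k:
        return {" ".join(words)} if words else set()
    return {" ".join(words[i:i + k]) for i in range(len(words) - k + 1)}


def _seeded_hash(seed, item):
    """64-bit hash of a string under a seed; stands in for a random permutation."""
    digest = hashlib.blake2b(
        item.encode("utf-8"),
        digest_size=8,
        salt=seed.to_bytes(8, "little"),
    ).digest()
    return int.from_bytes(digest, "little")


def minhash_signature(items, num_hashes=128):
    """Keep the minimum hash value of the set under each seeded hash function."""
    if not items:
        raise ValueError("cannot build a signature for an empty set")
    return [
        min(_seeded_hash(seed, item) for item in items)
        for seed in range(num_hashes)
    ]


def estimate_jaccard(sig_a, sig_b):
    """Fraction of matching signature positions, an estimate of Jaccard similarity."""
    if len(sig_a) != len(sig_b):
        raise ValueError("signatures must have the same length")
    matches = sum(a == b for a, b in zip(sig_a, sig_b))
    return matches / len(sig_a)


def exact_jaccard(set_a, set_b):
    """Exact Jaccard similarity |A ∩ B| / |A ∪ B|, for comparison with the estimate."""
    return len(set_a & set_b) / len(set_a | set_b)


if __name__ == "__main__":
    doc_a = "the system detects copied documents by estimating set similarity"
    doc_b = "the system detects copied documents by comparing hash signatures"
    set_a, set_b = shingles(doc_a), shingles(doc_b)
    sig_a, sig_b = minhash_signature(set_a), minhash_signature(set_b)
    print("exact:   ", exact_jaccard(set_a, set_b))
    print("estimate:", estimate_jaccard(sig_a, sig_b))
```

Signatures have a fixed length regardless of document size, which is why this kind of estimator suits large collections: candidate pairs can be compared on their signatures without revisiting the full text. More hash functions reduce the variance of the estimate at the cost of more computation per document.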
[Degree-granting institution]: University of Electronic Science and Technology of China
[Degree level]: Master's
[Year degree awarded]: 2014
[Classification number]: TP391.1
Article ID: 2441642
Article link: http://sikaile.net/falvlunwen/zhishichanquanfa/2441642.html