Research on Efficient Text Copy Detection Based on Hash Learning

Published: 2018-03-20 00:31

Topic: hash learning　Focus: copy detection　Source: Fudan University, 2013 master's thesis　Document type: degree thesis

[Abstract]: As informatization deepens and the Internet becomes ever more widespread, text data in all its forms is growing at an astonishing rate, and the resulting problem of data copying is becoming increasingly serious. For enterprises and organizations, large volumes of duplicate data reduce the efficiency of storage and retrieval. For Internet websites, widespread plagiarism seriously harms the rights and motivation of content producers and hinders the healthy development of the Internet as a whole. Data copying also degrades the quality of search engine results to some extent. Research on text copy detection falls into two main directions: 1) text representation, and 2) efficiency and scalability. The former studies how to extract relevant features from text and use them to represent it so that copies can be detected more accurately; the latter studies how to detect copied text efficiently on large-scale data. In many studies, however, the two directions are not independent: the former serves the latter, and an efficient copy detection scheme may require a special text representation. In addition, the granularity of copy detection varies with the application scenario; finer-grained detection, such as sentence-level copy detection, places higher demands on efficiency and scalability. The research in this thesis likewise covers both aspects, as follows:
1) First, the thesis proposes a complete copy detection framework, including the main workflow and the copy detection algorithm.
2) Second, it discusses in detail the feasibility of representing text with hash codes and finds that existing hash coding schemes still leave considerable room for improvement in accuracy. Because the hash code space is limited and must be fully utilized, the thesis proposes a hash code learning scheme, and experiments show that the codes it produces are indeed better and substantially improve detection accuracy.
3) Finally, by implementing the most time-consuming key algorithm on a GPU, it achieves a speedup of more than 1500 times while retaining good scalability.
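The abstract does not describe the learned hashing scheme itself, so the following is only a rough sketch of the general idea of representing text with binary hash codes and detecting copies by Hamming distance. It uses SimHash as a stand-in, not the thesis's method. The 64-bit code length, the whitespace tokenizer, the distance threshold, and all function names are assumptions made for illustration.

```python
import hashlib
from itertools import combinations

BITS = 64  # code length; an assumption for this sketch


def token_hash(token: str) -> int:
    """Stable 64-bit hash of a token (first 8 bytes of its MD5 digest)."""
    digest = hashlib.md5(token.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


def simhash(text: str, bits: int = BITS) -> int:
    """SimHash fingerprint: every token votes +1 or -1 on each bit position."""
    votes = [0] * bits
    for token in text.lower().split():
        h = token_hash(token)
        for i in range(bits):
            votes[i] += 1 if (h >> i) & 1 else -1
    code = 0
    for i, v in enumerate(votes):
        if v > 0:  # a bit is set only when the positive votes win
            code |= 1 << i
    return code


def hamming(a: int, b: int) -> int:
    """Number of bit positions in which two codes differ."""
    return bin(a ^ b).count("1")


def find_copies(sentences, max_distance=3):
    """Brute-force pairwise comparison; returns index pairs of likely copies."""
    codes = [simhash(s) for s in sentences]
    return [
        (i, j)
        for i, j in combinations(range(len(codes)), 2)
        if hamming(codes[i], codes[j]) <= max_distance
    ]
```

For example, `find_copies(["the quick brown fox jumps", "the quick brown fox jumps", "gpu kernels compute hamming distances fast"])` reports the pair `(0, 1)`, because identical sentences map to identical codes. The pairwise loop is quadratic in the number of sentences, which hints at why sentence-level detection on large collections benefits from parallel hardware; whether this is the step the thesis accelerates on the GPU is not stated in the abstract.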
[Degree-granting institution]: Fudan University
[Degree level]: Master's
[Year degree granted]: 2013
[Classification number]: TP391.3

Article link: http://sikaile.net/kejilunwen/sousuoyinqinglunwen/1636736.html
