Research on String Similarity Join Methods Based on the Hadoop Platform
Topic: string similarity join. Angle: Hadoop. Source: Donghua University, 2017 master's thesis. Author: Xia Longlei. Document type: degree thesis.
[Abstract]: With the widespread use and rapid development of Internet technologies such as e-commerce, social networks, and cloud computing, data volumes have grown sharply, and processing large-scale data has become a major research topic. String similarity join is a basic data-processing operation with wide application in text retrieval, biological monitoring, information processing, pattern recognition, and data integration and cleaning. Many string similarity measures exist, including edit distance, Jaccard similarity, and cosine similarity; this thesis focuses on the Jaccard measure. String similarity join methods fall into two classes: traditional methods and methods built on distributed frameworks. Traditional methods include ALL-pairs, Ed-join, and Trie-tree; distributed methods include MRSimJoin, MR_DSJ, and Fuzzy-Join. After studying and analyzing the traditional methods, the thesis finds that they are limited by the machine's memory, external storage, and CPU resources and are therefore unsuitable for similarity joins over large-scale data, whereas the Hadoop distributed framework is currently one of the main ways of processing such data. The thesis therefore studies how to process string similarity joins efficiently and in parallel on the Hadoop distributed framework. Its main contributions are: (1) It proposes a string similarity join model, SSJ-Model, which applies several filtering strategies and can perform similarity joins on strings incrementally. (2) It studies how the Hadoop distributed framework operates and, using SSJ-Model, proposes Hmrdp-join, a parallel string similarity join algorithm based on Hadoop. (3) It optimizes Hmrdp-join in four ways: saving some temporary results of the MapReduce stages to avoid the time cost of copying data from disk; partitioning the data more effectively to balance the load between the map and reduce phases and avoid data skew; reusing already available information to avoid some repeated computation during the join; and adopting a grouping strategy to reduce multiple replication of strings. (4) Experiments on real datasets show that the optimized Hmrdp-join algorithm is more efficient.
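To make the Jaccard measure and the idea of a filtering strategy concrete, here is a minimal single-machine sketch in Python. It is not the thesis's SSJ-Model or Hmrdp-join: the abstract does not name its filters, so the length filter used here is only one commonly used example, and the function names (tokens, jaccard, similarity_self_join) and the whitespace tokenization are assumptions made for illustration.

```python
from itertools import combinations


def tokens(s: str) -> frozenset:
    """Split a string into a set of lowercase whitespace-delimited tokens.

    Whitespace tokenization is an assumption; q-grams would work equally well.
    """
    return frozenset(s.lower().split())


def jaccard(a: frozenset, b: frozenset) -> float:
    """Jaccard similarity |a & b| / |a | b|; two empty sets count as identical."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def similarity_self_join(strings, threshold):
    """Return (i, j, similarity) for every pair with Jaccard >= threshold.

    A length filter runs before the exact check: if J(a, b) >= t and
    |a| <= |b|, then |a| >= t * |b|, so pairs violating this bound
    cannot qualify and are skipped without computing the similarity.
    """
    sets = [tokens(s) for s in strings]
    results = []
    for i, j in combinations(range(len(sets)), 2):
        small, large = sorted((len(sets[i]), len(sets[j])))
        if small < threshold * large:  # length filter: pair cannot reach threshold
            continue
        sim = jaccard(sets[i], sets[j])
        if sim >= threshold:
            results.append((i, j, sim))
    return results


if __name__ == "__main__":
    records = [
        "hadoop string similarity join",
        "string similarity join on hadoop",
        "social network analysis",
    ]
    for i, j, sim in similarity_self_join(records, 0.7):
        print(records[i], "|", records[j], "|", sim)
```

This sketch compares all pairs, so its cost grows quadratically with the number of records. That is the kind of single-machine limit the abstract gives as the reason for moving the join onto a distributed framework.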
[Degree-granting institution]: Donghua University
[Degree level]: Master's
[Year degree granted]: 2017
[Classification number]: TP301.6
Article ID: 1578207
Article link: http://sikaile.net/jingjilunwen/dianzishangwulunwen/1578207.html