Design and Implementation of a Distributed Hierarchical Clustering Algorithm for Massive Commodity Data

Published: 2019-03-16 15:14

[Abstract]: Thanks to advances in computer science and information technology, enterprises can easily collect and store large amounts of data. Data that is merely collected, however, only takes up storage space and adds little value to the enterprise, so enterprises have begun to mine it for information. In the past, experts analyzed and interpreted the data by hand, an approach that becomes increasingly difficult as data volume and the number of attributes grow rapidly. How to automatically discover knowledge in huge databases and turn it into the business intelligence that enterprises depend on has therefore become an important issue for enterprises and institutions in the 21st century. In production practice, the conflict between the rate at which data grows and the time that data analysis takes has become increasingly prominent. Data mining emerged to address the shortcomings of traditional analysis methods when analyzing and processing large-scale data. By applying self-learning algorithms to large-scale data sets, it extracts knowledge and information that is hidden in the data and hard to obtain otherwise. As the main regulator of national commodity imports and exports, customs is the producer and owner of massive import and export data. As the informatization of its business processes has deepened and matured, customs has largely achieved fairly complete data-based supervision and digital operation capabilities. At the same time, the gap between relatively limited means of data analysis and the growing volume of data and business complexity has become increasingly prominent. How to classify and manage the massive volume of declared goods effectively has become an urgent problem in customs supervision.
Centered on a customs commodity data analysis project, this thesis implements a series of commodity data processing modules on top of the MapReduce framework, which together form a distributed clustering system for commodity data. The main work includes commodity data preprocessing, TF-IDF calculation, inverted index construction, similarity matrix calculation, and single-linkage hierarchical clustering. Finally, the hierarchical clustering results are used to organize the customs commodity data, providing an accurate basis for grouped statistics to the customs intelligence analysis and assessment module, and the system has produced results in practical application.
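The processing steps named in the abstract (TF-IDF, inverted index, similarity matrix, single-linkage hierarchical clustering) can be illustrated with a small single-machine sketch. This is not the thesis's MapReduce implementation, which the abstract does not describe in detail. All function names, the toy commodity descriptions, and the similarity threshold are assumptions made for illustration, and the single-linkage step uses a Kruskal-style union-find pass rather than any method confirmed by the source.

```python
import math
from collections import Counter, defaultdict


def tf_idf(docs):
    """Return one L2-normalized {term: weight} dict per tokenized document."""
    n = len(docs)
    df = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        vec = {t: (c / len(doc)) * math.log(n / df[t]) for t, c in tf.items()}
        norm = math.sqrt(sum(w * w for w in vec.values()))
        # After normalization, a dot product of two vectors is their cosine similarity.
        vectors.append({t: w / norm for t, w in vec.items()} if norm else {})
    return vectors


def inverted_index(vectors):
    """Map each term to the (doc_id, weight) pairs of documents containing it."""
    index = defaultdict(list)
    for doc_id, vec in enumerate(vectors):
        for term, w in vec.items():
            if w > 0:
                index[term].append((doc_id, w))
    return index


def pairwise_similarity(index):
    """Accumulate cosine similarity only for document pairs that share a term."""
    sims = defaultdict(float)
    for postings in index.values():
        for a in range(len(postings)):
            for b in range(a + 1, len(postings)):
                (i, wi), (j, wj) = postings[a], postings[b]
                sims[(i, j)] += wi * wj
    return sims


def single_linkage(n, sims, threshold):
    """Merge clusters in order of decreasing similarity until below threshold."""
    parent = list(range(n))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    merges = []
    for (i, j), s in sorted(sims.items(), key=lambda kv: -kv[1]):
        if s < threshold:
            break
        ri, rj = find(i), find(j)
        if ri != rj:
            parent[rj] = ri
            merges.append((i, j, s))
    clusters = defaultdict(list)
    for x in range(n):
        clusters[find(x)].append(x)
    return sorted(clusters.values()), merges


# Hypothetical, already-tokenized commodity descriptions.
docs = [
    ["cotton", "shirt", "men"],
    ["cotton", "shirt", "women"],
    ["steel", "pipe", "seamless"],
    ["steel", "pipe", "welded"],
]
vectors = tf_idf(docs)
sims = pairwise_similarity(inverted_index(vectors))
clusters, merges = single_linkage(len(docs), sims, threshold=0.2)
print(clusters)  # [[0, 1], [2, 3]]
```

Going through the inverted index means that only document pairs sharing at least one term are ever compared, which keeps the similarity matrix sparse. In a MapReduce setting each function would roughly correspond to one job: posting lists are grouped by term in a reduce step, and the per-term pair contributions are summed by document pair in a second reduce step. That mapping is a plausible reading of the abstract, not a description of the thesis's actual design.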
[Degree-granting institution]: Zhejiang University
[Degree level]: Master's
[Year degree granted]: 2017
[Classification number]: TP311.13

[Author]: Zhou Junlin (周俊林)

Article ID: 2441622


Article link: http://sikaile.net/kejilunwen/ruanjiangongchenglunwen/2441622.html
