Design and Implementation of a Distributed Hierarchical Clustering Algorithm for Massive Commodity Data
[Abstract]: Advances in computer science and information technology have made it easy for businesses to collect and store large amounts of data. Stored data alone, however, only occupies space and does not by itself create value for the enterprise, so enterprises have begun to mine it for information. In the past, experts analyzed and interpreted the data by hand, which has become increasingly difficult as data volume and the number of attributes grow rapidly. How to discover knowledge automatically from huge databases and turn it into business intelligence has therefore become an important problem that enterprises and organizations must face in the 21st century. In production practice, the tension between the rate of data growth and the time consumed by data analysis is increasingly prominent. Data mining addresses this limitation of traditional analysis methods: it is a set of techniques for analyzing large-scale data, and by applying self-learning algorithms to large data sets it extracts the knowledge and information hidden in the data. As the main regulator of national commodity imports and exports, customs is the producer and owner of massive import and export data. As the informatization of its business processes has deepened and matured, customs has largely achieved data-based supervision and digital operation. At the same time, the gap between relatively limited means of data analysis and growing data volume and business complexity is becoming more and more prominent. How to effectively classify and manage the vast number of goods declared to customs has become an urgent problem in customs supervision.
Following the main line of a customs commodity data analysis project, this thesis implements a series of commodity data processing modules on the MapReduce framework and combines them into a distributed clustering system for commodity data. The main work includes commodity data preprocessing, TF-IDF calculation, inverted index construction, similarity matrix calculation, and single-linkage hierarchical clustering. Finally, the hierarchical clustering results are used to organize customs commodity data, which provides an accurate statistical basis for the customs information analysis and judgment module and has proven effective in practical application.
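The pipeline named above (TF-IDF, inverted index, similarity matrix, single-linkage clustering) can be illustrated with a minimal single-process sketch in Python. It is not the thesis's MapReduce implementation, whose details the abstract does not give. The function names, the whitespace tokenization, the smoothed IDF formula, and the idea of cutting the single-linkage hierarchy at a fixed similarity threshold are all assumptions made for illustration.

```python
import math
from collections import Counter, defaultdict
from itertools import combinations


def tfidf_vectors(docs):
    """Return one L2-normalized {term: weight} dict per tokenized document."""
    n = len(docs)
    df = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        # Smoothed IDF (an assumption): ubiquitous terms keep a nonzero weight.
        vec = {t: (c / len(doc)) * (math.log((1 + n) / (1 + df[t])) + 1.0)
               for t, c in tf.items()}
        norm = math.sqrt(sum(w * w for w in vec.values()))
        vectors.append({t: w / norm for t, w in vec.items()})
    return vectors


def inverted_index(vectors):
    """Map each term to its postings list of (doc_id, weight)."""
    index = defaultdict(list)
    for doc_id, vec in enumerate(vectors):
        for term, weight in vec.items():
            index[term].append((doc_id, weight))
    return index


def similarity_pairs(index):
    """Sum weight products per term to get cosine similarity of co-occurring pairs."""
    sims = defaultdict(float)
    for postings in index.values():
        # Postings are in ascending doc_id order, so every key has i < j.
        for (i, wi), (j, wj) in combinations(postings, 2):
            sims[(i, j)] += wi * wj
    return sims


def single_linkage(n, sims, threshold):
    """Clusters obtained by cutting the single-linkage hierarchy at `threshold`.

    Under single linkage these are the connected components of the graph
    whose edges are the pairs with similarity >= threshold (union-find).
    """
    parent = list(range(n))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for (i, j), s in sims.items():
        if s >= threshold:
            parent[find(i)] = find(j)
    clusters = defaultdict(list)
    for doc_id in range(n):
        clusters[find(doc_id)].append(doc_id)
    return sorted(clusters.values())


# Hypothetical commodity descriptions, tokenized on whitespace.
docs = [d.split() for d in (
    "cotton t-shirt men",
    "men cotton shirt",
    "stainless steel pipe",
    "steel pipe seamless",
)]
sims = similarity_pairs(inverted_index(tfidf_vectors(docs)))
print(single_linkage(len(docs), sims, threshold=0.3))
```

Only document pairs that share at least one term ever appear in `sims`, which is why the inverted index comes before the similarity step: it avoids comparing every pair in a sparse collection. In a MapReduce setting, each of these functions would roughly correspond to one job keyed by term or by document pair, though that mapping is also an assumption here.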
[Degree-granting institution]: Zhejiang University
[Degree level]: Master's
[Year degree granted]: 2017
[Classification number]: TP311.13
周俊林 (Zhou Junlin); Design and Implementation of a Distributed Hierarchical Clustering Algorithm for Massive Commodity Data [D]; Zhejiang University; 2017