A Distributed Multi-Source E-Commerce Data Fusion and Analysis System

Published: 2018-07-15 09:56

[Abstract]: With the spread of the Internet and mobile smart devices and the rapid growth of the logistics industry, e-commerce has become an increasingly important part of daily life and of the national economy. E-commerce platforms, as the venue where people shop, carry large amounts of valuable data. From this data one can reconstruct the environment in which users shop online and analyze how that environment influences their behavior, analyze the behavioral patterns of product markets and offer recommendations to merchants, and assess the state of the national economy. The data therefore has high research value.

E-commerce data analysis and mining is the process of analyzing and mining e-commerce data to obtain valuable information. It is a branch of data mining, but it has its own particularities. Several difficulties must be resolved. Data collection and preprocessing are hard. Multiple data sources lack direct links to one another, data from any single e-commerce source has low credibility and completeness, and fusion analysis across multiple sources is lacking. Stand-alone data mining systems cannot meet the processing demands of massive e-commerce data, so distributed data mining systems are needed, yet some commonly used data mining algorithms are inefficient when implemented in a distributed setting.

The main work of this thesis falls into three parts:

(1) Targeted, concrete analysis and mining of e-commerce data, based on the characteristics of that data. Starting from the definition and collection of e-commerce data, the thesis analyzes the data types found on e-commerce websites, collects the data required by the analysis, and designs a data storage format. E-commerce data contains a large share of semi-structured and unstructured data, is poorly standardized, and is noisy. The thesis addresses these characteristics at the preprocessing stage and devises solutions to ensure good data quality. It then applies several data mining methods to the data, including association analysis, clustering, linear regression, and artificial neural networks.

(2) Design and implementation of a method for fusing e-commerce data from multiple sources, which fuses data from different e-commerce websites and uses the fused data in data mining. The thesis analyzes the structure of product information on e-commerce websites and designs a multi-source fusion method around it. Through preprocessing and text analysis, it extracts hierarchical features: product name, product attribute name, and product attribute content. It designs an unsupervised learning algorithm that, when the correspondence between product parameters in different data sources is unknown, learns from and matches the data starting from seed features, and uses multiple product parameters to find matching products and matching parameters step by step. This reduces the computation required for data fusion. Compared with fusion based on a single parameter, it improves the accuracy of product entity unification, and the criterion for "the same product" can be set flexibly to obtain matching results under different criteria. The fused data is also used for prediction, where it yields more accurate results than data from a single source.

(3) Design of a Hadoop-based distributed e-commerce data mining system, and an improved implementation of hierarchical clustering on Hadoop. The thesis analyzes the characteristics of distributed computing architectures and designs a distributed data analysis and mining system based on Hadoop. Hadoop handles iteration poorly, while hierarchical clustering needs many iterations, so traditional hierarchical clustering is inefficient on Hadoop. To address this, the thesis designs an improved hierarchical clustering based on the principles of the algorithm and the structure of Hadoop. When the inter-cluster distance is monotonically increasing, the improved algorithm does not change the clustering result, and it can merge multiple clusters in a single clustering pass. This reduces the number of iterations and substantially improves the computational efficiency of hierarchical clustering on Hadoop. The thesis also explores the case where multi-dimensional feature information about products is missing: the similarity between products is computed indirectly from users' usage logs for those products, and hierarchical clustering is then applied to obtain product clusters. Experiments verify the feasibility of this approach.
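The abstract does not spell out how several clusters are merged in one pass. The sketch below shows one well-known way this can be done, and it is an assumption rather than the thesis's actual algorithm: in each pass, every pair of clusters that are each other's nearest neighbor is merged at once. For single linkage, which is monotonic, the clusters obtained at a given distance threshold match those of the one-merge-per-iteration algorithm. The function names, the 1-D points, and the threshold-based stopping rule are all hypothetical, and the code runs on one machine rather than on Hadoop.

```python
def single_link(a, b, points):
    """Single-linkage distance between clusters a and b (tuples of point indices)."""
    return min(abs(points[i] - points[j]) for i in a for j in b)


def multi_merge_pass(clusters, points, threshold):
    """One pass: merge every mutual-nearest-neighbor pair within the threshold.

    Returns the new sorted cluster list and the number of merges performed.
    """
    # Find each cluster's nearest neighbor; ties are broken by cluster order
    # so that at least one mutual pair always exists.
    nearest = {}
    for c in clusters:
        others = [o for o in clusters if o != c]
        nearest[c] = min(others, key=lambda o: (single_link(c, o, points), o))

    merged, used = [], set()
    for c in clusters:
        n = nearest[c]
        if c in used or n in used:
            continue
        # Merge only reciprocal pairs that are close enough.
        if nearest[n] == c and single_link(c, n, points) <= threshold:
            merged.append(tuple(sorted(c + n)))
            used.update((c, n))

    untouched = [c for c in clusters if c not in used]
    return sorted(merged + untouched), len(merged)


def cluster(points, threshold):
    """Repeat multi-merge passes until no pair can be merged.

    Returns the final clusters and the number of passes that merged something.
    """
    clusters = [(i,) for i in range(len(points))]
    passes = 0
    while len(clusters) > 1:
        clusters, n_merged = multi_merge_pass(clusters, points, threshold)
        if n_merged == 0:
            break
        passes += 1
    return clusters, passes


if __name__ == "__main__":
    pts = [0.0, 1.0, 10.0, 11.0, 20.0, 21.0, 50.0]
    result, passes = cluster(pts, threshold=2.0)
    print(result, passes)
```

On this toy input the three close pairs are merged in a single pass, where a one-merge-per-iteration algorithm would need three iterations. In a MapReduce setting each pass would correspond to roughly one job, which is why fewer passes matter.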
[Degree-granting institution]: Beijing University of Posts and Telecommunications
[Degree level]: Master's
[Year degree granted]: 2016
[Classification number]: TP311.13


Article ID: 2123692


Article link: http://sikaile.net/kejilunwen/ruanjiangongchenglunwen/2123692.html
