Distributed Graph Clustering and Its Application in E-Commerce Data Mining
Topic: distributed clustering + graph clustering; Source: Donghua University, master's thesis, 2013
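【Abstract】: A graph is a commonly used data structure composed of nodes and the edges connecting them, and it has become a modeling tool for all kinds of complex objects and the relationships among them. On an e-commerce website, whenever customers log in and trade goods, the corresponding transaction data is generated in the site's back-end database. These transaction data can be used to construct many kinds of customer relationship network graphs. Taking the relationship between customers who buy the same item as an example, the nodes of the graph represent different customers, and an edge indicates that two customers have purchased the same item on the site. Like other types of data, such customer relationship graphs contain rich information and knowledge and have practical value in the customer relationship management of e-commerce websites.
Graph clustering uses clustering techniques to identify clusters in a graph that are tightly connected internally and loosely connected externally. It has already been applied in practice to community detection in social networks, protein complex detection, and similar tasks. In the customer relationship graph of an e-commerce website described above, graph clustering can be used to mine distinct customer clusters. A mined cluster may indicate that its customers share similar interests and preferences, or that they have a similar family structure, age group, and so on. Such information can guide an e-commerce website in making personalized product recommendations, drawing up more targeted marketing strategies, and improving its operations.
Some mainstream e-commerce websites, such as Taobao and Yihaodian, have a very large number of customers, and the relationship graphs formed by these customers are correspondingly huge. Faced with this volume of data, a single workstation cannot meet the demand in either CPU computing power or memory, so the clustering analysis cannot run properly. How to mine customer clusters effectively from large-scale customer relationship graphs has become a shared concern in the industry.
MapReduce is a parallel programming model that can interconnect hundreds or even thousands of computers, joining a huge pool of system resources into a large machine cluster, and it is particularly well suited to the parallel processing of large-scale data. Given the advantages of MapReduce in big data processing, this thesis attempts to combine MapReduce with traditional graph clustering methods, proposes a distributed graph clustering method, and applies it to the practical task of customer relationship discovery.
The thesis takes the real project "Steel Trade Website Transaction Data Analysis", in which the author took part, as its application example. Using the 5 years of transaction data accumulated by a steel trading company from 2006 to 2011, it applies graph clustering to identify steel trade customer groups, providing decision support for the company in formulating effective steel sales strategies. Specifically, the main research content of the thesis is as follows:
1) The thesis first introduces the related technologies, including data mining, graph clustering, the MapReduce parallel framework, and its open-source implementation Hadoop.
2) Next, taking a steel trade e-commerce website as a concrete example and considering the actual characteristics of steel trade transaction data, it describes the construction of the steel trade transaction data warehouse and discusses the modeling of the steel trade customer relationship graph in detail.
3) Building on the MapReduce framework, the thesis proposes a MapReduce-based distributed graph clustering algorithm, the MR-LSH algorithm, to address how LSH can be used in a distributed environment to achieve scalable parallel clustering of large-scale graph data. The algorithm combines the MapReduce parallel framework with locality-sensitive hashing (Locality Sensitive Hashing, LSH for short), thereby implementing a distributed graph clustering algorithm based on locality-sensitive hashing within the MapReduce framework. The thesis discusses the idea behind the MR-LSH algorithm and its implementation framework in detail, and describes how each step of the framework is implemented.
On this basis, the thesis uses the customer relationship graph generated from the steel trading company's transaction data from 2006 to 2011 to demonstrate, through a worked example, the feasibility and practicality of the proposed distributed graph clustering in the field of e-commerce data mining. The experimental results show that the system is safe and reliable, easy to maintain, and has good scalability.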
[Abstract]:As a common data structure, a graph is made up of nodes and the edges connecting them, and it has become a modeling tool for various complex objects and the relationships among them. On an e-commerce website, customers who log in and trade goods generate transaction data in the site's back-end database, and from these data a variety of customer relationship network graphs can be built. Taking the relationship between customers who purchase the same item as an example, the nodes of the graph represent different customers, while an edge of the graph indicates that two customers have purchased the same item on the site. Like other types of data, such graphs hold rich information and knowledge and have practical application value in the customer relationship management of e-commerce websites.
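The co-purchase graph described above can be sketched in a few lines. This is only an illustration under stated assumptions: the function name `build_copurchase_graph`, the `(customer, item)` record layout, and the sample data are hypothetical and are not taken from the thesis or its data warehouse.

```python
from collections import defaultdict
from itertools import combinations


def build_copurchase_graph(transactions):
    """Return the undirected edges linking customers who bought the same item."""
    # Group customers by the item they purchased.
    buyers_by_item = defaultdict(set)
    for customer, item in transactions:
        buyers_by_item[item].add(customer)

    # Every pair of customers sharing an item becomes one edge.
    edges = set()
    for buyers in buyers_by_item.values():
        for a, b in combinations(sorted(buyers), 2):
            edges.add((a, b))
    return edges


# Hypothetical (customer, item) records, for illustration only.
transactions = [
    ("alice", "rebar"),
    ("bob", "rebar"),
    ("bob", "coil"),
    ("carol", "coil"),
    ("dave", "plate"),
]
edges = build_copurchase_graph(transactions)
```

Here `edges` contains the pairs `("alice", "bob")` and `("bob", "carol")`, while `dave` stays isolated because no one else bought the same item. Enumerating all pairs per item is quadratic in the number of buyers, which is one reason a single machine struggles at large scale.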
Graph clustering uses clustering techniques to find clusters in a graph that have tight internal connections and loose external connections. It has been applied in practice to community detection in social networks and to protein complex detection. In the customer relationship graph of an e-commerce website described above, graph clustering can be used to mine distinct customer clusters. A mined cluster may indicate that its customers share similar interests and preferences, or that they have a similar family structure, age group, and so on. This kind of information can guide an e-commerce website in making personalized product recommendations, drawing up more targeted marketing strategies, and improving its operations.
Some mainstream e-commerce sites, such as Taobao and Yihaodian, have a very large number of customers, and the relationship graphs formed by these customers are correspondingly huge. Faced with this volume of data, a single workstation cannot meet the demand in either CPU computing power or memory, so the clustering analysis cannot run properly. How to mine customer clusters effectively from large-scale customer relationship graphs has become a shared concern in the industry.
As a parallel programming model, MapReduce can interconnect hundreds or even thousands of computers, joining a huge pool of system resources into a large machine cluster, and it is especially suitable for the parallel processing of large-scale data. Considering the advantages of MapReduce in big data processing, this thesis attempts to combine MapReduce with traditional graph clustering methods, proposes a distributed graph clustering method, and applies it to the practical task of customer relationship discovery.
Taking the real project "Steel Trade Website Transaction Data Analysis", in which the author took part, as its application example, this thesis uses the 5 years of transaction data accumulated by a steel trading company from 2006 to 2011 and applies graph clustering to identify steel trade customer groups, which provides decision support for the company in formulating effective steel sales strategies. The main contents of the thesis are as follows:
1) The thesis first introduces the related technologies, including data mining, graph clustering, the MapReduce parallel framework, and its open-source implementation Hadoop.
2) Then, taking a steel trade e-commerce website as a concrete example and considering the actual characteristics of steel trade transaction data, it describes the construction of the steel trade transaction data warehouse and discusses the modeling of the steel trade customer relationship graph in detail.
3) Based on the MapReduce framework, the thesis proposes a MapReduce-based distributed graph clustering algorithm, the MR-LSH algorithm, to address how LSH can be used in a distributed environment to achieve scalable parallel clustering of large-scale graph data. The algorithm combines the MapReduce parallel framework with locality-sensitive hashing (Locality Sensitive Hashing, LSH), implementing a distributed graph clustering algorithm based on locality-sensitive hashing within the MapReduce framework. The thesis discusses the idea and implementation framework of the MR-LSH algorithm in detail and describes how each step of the framework is implemented.
On this basis, the thesis uses the customer relationship graph generated from the transaction data of a steel trading company from 2006 to 2011 to demonstrate, through a worked example, the feasibility and practicality of the proposed distributed graph clustering in the field of e-commerce data mining. The experimental results show that the system is safe and reliable, easy to maintain, and has good scalability.
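The abstract does not spell out the steps of MR-LSH, so the following is not the thesis's algorithm. It is a minimal single-process sketch of the general idea of pairing locality-sensitive hashing with a map step and a reduce step, assuming MinHash signatures over each node's closed neighborhood and banding to form bucket keys. All function names, parameters, and the toy graph are hypothetical.

```python
import hashlib
from collections import defaultdict


def stable_hash(seed, value):
    """Deterministic hash of (seed, value); md5 is used only for reproducibility."""
    digest = hashlib.md5(f"{seed}:{value}".encode("utf-8")).hexdigest()
    return int(digest, 16)


def minhash_signature(items, num_hashes):
    """MinHash signature: the minimum hash of the set under each seeded hash."""
    return tuple(min(stable_hash(seed, x) for x in items) for seed in range(num_hashes))


def map_phase(adjacency, num_hashes=8, bands=4):
    """Map step: emit one ((band index, band slice), node) pair per band."""
    rows = num_hashes // bands
    for node, neighbors in adjacency.items():
        # Assumption: a node is described by its closed neighborhood.
        signature = minhash_signature(set(neighbors) | {node}, num_hashes)
        for band in range(bands):
            key = (band, signature[band * rows:(band + 1) * rows])
            yield key, node


def reduce_phase(pairs):
    """Reduce step: group nodes by key and keep buckets holding two or more nodes."""
    buckets = defaultdict(set)
    for key, node in pairs:
        buckets[key].add(node)
    return {frozenset(nodes) for nodes in buckets.values() if len(nodes) > 1}


# Toy graph: two disconnected triangles of customers.
adjacency = {
    "a": {"b", "c"}, "b": {"a", "c"}, "c": {"a", "b"},
    "x": {"y", "z"}, "y": {"x", "z"}, "z": {"x", "y"},
}
candidate_clusters = reduce_phase(map_phase(adjacency))
```

In this toy graph every node in a triangle has the same closed neighborhood, so the three nodes always share a bucket and the two triangles come out as separate candidate clusters. Nodes whose neighborhoods overlap only partly would collide with some probability that depends on the number of bands and rows, so a real pipeline would tune those values and merge or refine the candidate buckets afterwards. On Hadoop, the two functions would correspond to a mapper and a reducer keyed by the band value.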
【Degree-granting institution】:Donghua University
【Degree level】:Master's
【Year degree conferred】:2013
【Classification number】:TP311.13;F713.36
Article ID:1907332
Article link:http://sikaile.net/jingjilunwen/dianzishangwulunwen/1907332.html