Distributed Graph Clustering and Its Application in E-Commerce Data Mining
Topic: distributed clustering + graph clustering; Source: Donghua University, 2013 master's thesis
[Abstract]: A graph is a common data structure made up of nodes and the edges connecting them, and it has become a modeling tool for all kinds of complex objects and the relationships among them. On an e-commerce website, every customer login and every purchase generates transaction records in the site's back-end database. These transaction data can be used to build many kinds of customer relationship graphs. Take the relationship among customers who buy the same item as an example: each node of the graph represents a customer, and an edge indicates that two customers have purchased the same item on the site. Like other types of data, such customer relationship graphs hold rich information and knowledge, and they have practical value for customer relationship management on e-commerce websites.
Graph clustering applies clustering techniques to find clusters in a graph whose internal connections are dense and whose external connections are sparse. It has already been applied in practice to community detection in social networks and to protein complex detection. In the customer relationship graph described above, graph clustering can be used to mine different customer clusters. A mined cluster may indicate that its customers share similar interests and preferences, or that they have similar family structures, age ranges, and so on. Such information can guide an e-commerce website in making personalized product recommendations, formulating more targeted marketing strategies, and improving its operations.
Some mainstream e-commerce websites, such as Taobao and Yihaodian (No. 1 Store), have a very large number of customers, and the relationship graphs formed by these customers are correspondingly huge. Faced with such data volumes, a single workstation cannot meet the demand in either CPU computing power or memory, so the clustering analysis cannot run normally. How to mine customer clusters effectively from large-scale customer relationship graphs has become a shared concern in the industry.
MapReduce is a parallel programming model that can link hundreds or even thousands of computers, pooling their system resources into a large machine cluster, and it is particularly well suited to parallel processing of large-scale data. Considering the advantages of MapReduce in big data processing, this thesis attempts to combine MapReduce with traditional graph clustering methods, proposes a distributed graph clustering method, and applies it to customer relationship discovery.
The thesis uses a real project in which the author took part, "Steel Trade Website Transaction Data Analysis", as its application example. Using the five years of transaction data that a steel trading company accumulated from 2006 to 2011, it applies graph clustering to identify steel trade customer groups, providing decision support for the company in formulating effective steel sales strategies. Specifically, the main contents of the thesis are:
1) The thesis first introduces the related technologies, including data mining, graph clustering, the MapReduce parallel framework, and its open-source implementation, Hadoop.
2) Taking the steel trade e-commerce website as a concrete example and considering the actual characteristics of steel trade transaction data, it then describes the construction of the steel trade transaction data warehouse and discusses the modeling of the steel trade customer relationship graph in detail.
3) Building on the MapReduce framework, the thesis proposes a MapReduce-based distributed graph clustering algorithm, MR-LSH, to address how LSH can be used in a distributed environment to achieve scalable parallel clustering of large-scale graph data. The algorithm combines the MapReduce parallel framework with Locality Sensitive Hashing (LSH), yielding an LSH-based distributed graph clustering algorithm that runs within MapReduce. The thesis presents the idea behind MR-LSH and its implementation framework in detail, and describes how each step of the framework is implemented.
On this basis, the thesis uses the customer relationship graph generated from the steel trading company's 2006 to 2011 transaction data to demonstrate, through a worked example, the feasibility and practicality of the proposed distributed graph clustering in e-commerce data mining. The experimental results show that the system is secure, reliable, easy to maintain, and has good scalability.
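The abstract does not give the internals of MR-LSH, so the following is only a rough sketch of the general idea rather than the thesis's algorithm. It assumes MinHash as the LSH family, applies it to each customer's closed neighbourhood in the co-purchase graph, and simulates the map and reduce phases in memory instead of running them on Hadoop. All function names and parameters (`build_copurchase_graph`, `map_phase`, `reduce_phase`, `num_hashes`, `bands`) are hypothetical.

```python
import hashlib
from collections import defaultdict
from itertools import combinations


def build_copurchase_graph(purchases):
    """Build the customer graph: an edge joins two customers who bought the same item."""
    buyers = defaultdict(set)
    for customer, item in purchases:
        buyers[item].add(customer)
    graph = defaultdict(set)
    for customers in buyers.values():
        for a, b in combinations(sorted(customers), 2):
            graph[a].add(b)
            graph[b].add(a)
    return graph


def _hash(seed, value):
    """One member of a seeded hash family (assumption: MD5 used only for mixing)."""
    return int(hashlib.md5(f"{seed}:{value}".encode()).hexdigest(), 16)


def minhash_signature(node, neighbours, num_hashes):
    """MinHash signature of the closed neighbourhood (the node plus its neighbours)."""
    members = neighbours | {node}
    return tuple(min(_hash(seed, m) for m in members) for seed in range(num_hashes))


def map_phase(graph, num_hashes=8, bands=4):
    """Map step: emit ((band index, band of signature), node) for every node."""
    rows = num_hashes // bands
    for node, neighbours in graph.items():
        sig = minhash_signature(node, neighbours, num_hashes)
        for b in range(bands):
            yield (b, sig[b * rows:(b + 1) * rows]), node


def reduce_phase(pairs):
    """Reduce step: group nodes by bucket key, then merge overlapping buckets."""
    buckets = defaultdict(set)
    for key, node in pairs:
        buckets[key].add(node)

    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for nodes in buckets.values():
        ordered = sorted(nodes)
        root = find(ordered[0])
        for other in ordered[1:]:
            parent[find(other)] = root

    clusters = defaultdict(set)
    for node in list(parent):
        clusters[find(node)].add(node)
    return sorted(sorted(c) for c in clusters.values())


if __name__ == "__main__":
    # Toy data: three customers buy one product, two others buy a different one.
    demo = [("c1", "plate"), ("c2", "plate"), ("c3", "plate"),
            ("c4", "rebar"), ("c5", "rebar")]
    print(reduce_phase(map_phase(build_copurchase_graph(demo))))
```

Hashing the closed neighbourhood means that customers with identical neighbourhoods always share every bucket, while customers with similar neighbourhoods share at least one bucket with high probability. In a real MapReduce job the bucket key would serve as the shuffle key, and merging overlapping buckets would need a further pass.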
[Degree-granting institution]: Donghua University
[Degree level]: Master's
[Year degree conferred]: 2013
[Classification number]: TP311.13; F713.36
Article number: 1907332
Article link: http://sikaile.net/jingjilunwen/dianzishangwulunwen/1907332.html