An Improved Random Forest Parallel Classification Method Applied to Telecom Operator Big Data

Published: 2019-03-25 07:04

【Abstract】: Telecom operators provide network services to consumers and, in doing so, accumulate rich data resources. To exploit the value of this data, this thesis designs and implements a customer classification system for second-hand real estate agencies based on operator big data. The system uses an improved random forest classification method, the MapReduce parallel computing framework, cluster analysis, and other big data processing techniques, combined with data analysis methods from mathematical statistics and complex networks and with web crawler technology. From each day's operator call records it extracts the potential customers of real estate agents and divides them into renters, landlords, home buyers, home sellers, and others, so that agencies can carry out precision marketing.

The classification algorithm is the core of the whole system. This thesis proposes an improved random forest classification algorithm with three improvements: (1) it is shown, mathematically and experimentally, that for balanced data increasing the sample size of the sampling with replacement effectively improves accuracy; (2) the original sampling with replacement is replaced by an equivalent simple random sampling, which reduces the algorithm's running time and improves system efficiency; (3) regression analysis gives the quantitative relationship between the degree of imbalance and the with-replacement sample size as .., and the sample size suitable for this system is then obtained from the degree of imbalance of the operator big data.

The system is divided into a data collection subsystem, a data preprocessing subsystem, a data analysis subsystem, and a feedback adjustment subsystem. The data collection subsystem is mainly responsible for collecting data on real estate agents. The data preprocessing subsystem uses parallel processing to filter out call records unrelated to real estate agents and, again in parallel, to extract the potential customers together with all of their call behavior information. The data analysis subsystem classifies the potential customers with the improved random forest algorithm. In the cold-start stage, when no training samples exist yet, the system builds visual dimension plots with the statistical language R, builds visual interaction networks with the complex network analysis software Cytoscape, and analyzes the initial sample set with machine learning cluster analysis, which helps to obtain training samples quickly and to sort out combinations of feature dimensions. The feedback adjustment subsystem adds qualifying labeled samples obtained during later operation to the training sample library, continually adjusting the classification system and refining the classification boundaries so that subsequent classification becomes more accurate.

When the improved random forest classification algorithm was applied to the second-hand real estate agency customer classification system and tested with the initial training samples as test samples, the classification error rate was about 21.1379%, which is 0.3895 percentage points lower than the error rate without the improvements (21.5274%). The classification system using the improved random forest algorithm reaches an accuracy of about 79%, which helps real estate agencies improve their sales performance.
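The abstract contrasts two ways of drawing each tree's training set: sampling with replacement, whose sample size the thesis proposes to enlarge, and simple random sampling. The sketch below only illustrates that contrast, together with the majority vote a random forest uses to combine its trees. It is not the thesis's algorithm: the function names, the sample sizes, and the class labels are assumptions made for illustration, and the abstract does not state how the two sampling schemes are made equivalent.

```python
import random
from collections import Counter


def draw_tree_sample(n_rows, sample_size, with_replacement, rng):
    """Return the row indices used to train one tree (hypothetical helper)."""
    if with_replacement:
        # Sampling with replacement: a row may be drawn several times,
        # so sample_size is allowed to exceed n_rows.
        return [rng.randrange(n_rows) for _ in range(sample_size)]
    # Simple random sampling: every row is drawn at most once.
    if sample_size > n_rows:
        raise ValueError("cannot draw more distinct rows than exist")
    return rng.sample(range(n_rows), sample_size)


def majority_vote(predictions):
    """Combine the per-tree class labels into the forest's prediction."""
    return Counter(predictions).most_common(1)[0][0]


rng = random.Random(0)
n_rows = 1000

# Assumed sizes, chosen only to make the two schemes visible.
enlarged = draw_tree_sample(n_rows, 2 * n_rows, True, rng)
simple = draw_tree_sample(n_rows, 600, False, rng)

# Total draws versus distinct rows for each scheme.
print(len(enlarged), len(set(enlarged)))
print(len(simple), len(set(simple)))
print(majority_vote(["renter", "buyer", "renter"]))
```

With replacement, the number of distinct rows stays below the number of draws, while simple random sampling yields exactly as many distinct rows as draws, so it needs no duplicate handling when a tree is built.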
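【Degree-granting institution】: University of Electronic Science and Technology of China
【Degree level】: Master's
【Year degree granted】: 2015
【Classification number】: TP311.13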

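李慧 (Li Hui); An Improved Random Forest Parallel Classification Method Applied to Telecom Operator Big Data [D]; University of Electronic Science and Technology of China; 2015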

Article ID: 2446746

Article link: http://sikaile.net/guanlilunwen/yingxiaoguanlilunwen/2446746.html
