Research on a MapReduce-Based Distributed Improved Random Forest Model for Classifying Student Employment Data

Published: 2018-04-08 07:27

  Topic: machine learning. Focus: data classification model. Source: Systems Engineering - Theory & Practice (《系統工程理論與實踐》), 2017, Issue 05


[Abstract]: Educational data mining (EDM) is a frontier research area in contemporary educational informatization and is attracting growing attention from educators and data scientists. In the "big data" era, as the scale of data processing keeps surging, existing data mining models hit a computational bottleneck on a single processing node, and a range of distributed computing frameworks for big data processing have emerged in response. With these frameworks, machine learning models aimed at mining university employment data can meet the demands of future large-scale data processing and can support data mining and decision-making in information integration systems that hold very large datasets. Against this background, this study compares the classification performance of existing data models on the target data and proposes an improved random forest model. The model introduces weighting coefficients for the input features when computing each feature's information gain, and uses the result as the criterion for choosing the optimal split feature, with the aim of improving classification performance. Simulation tests are used to assess how much the improved model raises classification performance relative to existing models. In addition, to address the single-node performance bottleneck in massive data classification tasks, the study proposes a large-scale student employment data classification and prediction model based on a distributed improved random forest algorithm. The MapReduce distributed computing framework is used to implement serialized writing and deserialized loading of the trained model between the local disk and the distributed file system, which in turn yields a distributed extension of the large-scale data classification model based on the improved random forest.
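The abstract does not give the formula for the feature-weighted information gain, so the following is only a minimal sketch under stated assumptions: each categorical feature's ordinary information gain is multiplied by a per-feature weight, and the feature with the highest weighted score is chosen as the split. The function names, the multiplicative form of the weighting, and the toy records are all hypothetical and are not taken from the paper.

```python
import math
from collections import Counter, defaultdict


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    counts = Counter(labels)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def information_gain(rows, labels, feature):
    """Information gain from splitting on one categorical feature."""
    groups = defaultdict(list)
    for row, label in zip(rows, labels):
        groups[row[feature]].append(label)
    # Weighted average entropy of the child nodes after the split.
    remainder = sum(len(g) / len(labels) * entropy(g) for g in groups.values())
    return entropy(labels) - remainder


def best_split(rows, labels, weights):
    """Return (feature, score) maximising weight * information gain.

    ASSUMPTION: the weighting is a plain multiplication; the paper's
    actual weighting scheme is not described in the abstract.
    """
    scores = {
        feature: weight * information_gain(rows, labels, feature)
        for feature, weight in weights.items()
    }
    best = max(scores, key=scores.get)
    return best, scores[best]


# Hypothetical toy records, for illustration only.
rows = [
    {"internship": "yes", "major": "cs"},
    {"internship": "yes", "major": "math"},
    {"internship": "no", "major": "cs"},
    {"internship": "no", "major": "math"},
]
labels = ["employed", "employed", "unemployed", "unemployed"]
weights = {"internship": 0.5, "major": 1.0}

print(best_split(rows, labels, weights))
```

In this toy data the `internship` feature separates the two classes perfectly while `major` carries no information, so `internship` is selected even though its weight is lower. In a random forest, the same selection would be applied at each node over a random subset of features.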
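[Affiliation]: CIMS Center, College of Electronics and Information Engineering, Tongji University
[Funding]: National Natural Science Foundation of China (71690234)
[Classification number]: G647.38; TP311.13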


Article ID: 1720601


Article link: http://sikaile.net/jiaoyulunwen/gaodengjiaoyulunwen/1720601.html

