

Research on a Feature-Kernel Data Extraction Algorithm Based on Hadoop

Published: 2018-08-25 13:22
[Abstract]: Society has entered, and will long remain in, the era of big data. Massive data is commonly characterized by the "4 Vs": Volume, Variety, Velocity, and Veracity. Although data volumes today are very large, the data often carries redundant information; what actually matters is the effective features it carries. Viewing the data as a large matrix, that matrix is in most cases sparse and can be mapped into a lower-dimensional space, which we call the data feature space. Projecting the original data into this space yields the feature-kernel data, which typically carries the main information of the original data. After defining the ε-feature-kernel data and the ε-feature space, whose information loss rate is below a threshold ε, our goal is to find the optimal feature-kernel data and the optimal feature space. To this end, based on the characteristics of high-dimensional big data, this thesis proposes methods for mining the principal components of data on the Hadoop distributed computing framework, together with techniques that address shortcomings encountered when using Hadoop, effectively reducing memory usage and improving file-access efficiency. The thesis first presents the preliminaries and mathematical definitions, providing theoretical support and evaluation criteria for the algorithms that follow. It then introduces a new vector data structure suited to Hadoop in a distributed environment, and on that basis defines the workflow and data formats for the senders and receivers that exchange data between nodes. Next, a data-preprocessing module converts the input into a form the system can recognize, after which a tridiagonal matrix is obtained and eigendecomposed with the QR algorithm to extract the eigen-information. Finally, a slight transformation of the eigenvectors yields a new projection space, and projecting the original data into it produces the kernel data set. The implementation frequently operates on vectors; although their dimensionality is high, after the matrix is split by rows each block occupies only kilobytes, while the Hadoop Distributed File System allocates a fixed block size to every file it stores. This leads to excessive NameNode memory usage and poor file-access efficiency. To address Hadoop's weakness in handling massive numbers of small files, we propose an HDFS optimization whose basic idea is to merge small files into a large file that fits one block and then build an index; furthermore, a name-based index can effectively improve file-access efficiency. Experimental results show that the proposed strategy can effectively mine the kernel data set of the original data.
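One natural way to formalize the information loss rate used above, written in standard PCA notation (a sketch only; the thesis's exact definition may differ):

```latex
% Eigenvalues of the data covariance matrix, sorted in descending order:
%   \lambda_1 \ge \lambda_2 \ge \dots \ge \lambda_d \ge 0
% Information loss rate after keeping the top k components:
\mathrm{loss}(k) \;=\; \frac{\sum_{i=k+1}^{d} \lambda_i}{\sum_{i=1}^{d} \lambda_i}
% The \varepsilon-feature space is spanned by the top k eigenvectors
% for the smallest k with \mathrm{loss}(k) < \varepsilon, and the
% \varepsilon-feature-kernel data is the projection K = X W_k.
```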
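The projection step can be sketched on a single machine as follows. This is not the thesis's distributed implementation: the function name is made up, and NumPy's `eigh` stands in for the tridiagonalization-plus-QR pipeline described above.

```python
import numpy as np

def feature_kernel(X, eps):
    """Project X onto the smallest feature space whose
    information loss rate stays below eps (PCA-style sketch)."""
    Xc = X - X.mean(axis=0)                  # center the data
    C = Xc.T @ Xc / (Xc.shape[0] - 1)        # sample covariance matrix
    vals, vecs = np.linalg.eigh(C)           # eigenpairs, ascending order
    vals, vecs = vals[::-1], vecs[:, ::-1]   # sort descending
    cum = np.cumsum(vals)
    # smallest k retaining at least (1 - eps) of the total variance
    k = int(np.searchsorted(cum, (1.0 - eps) * vals.sum()) + 1)
    W = vecs[:, :k]                          # basis of the feature space
    return Xc @ W, W                         # kernel data and projection
```

By construction the squared reconstruction error of `Xc @ W @ W.T` is at most an `eps` fraction of the centered data's total energy, which is exactly the loss-rate guarantee.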
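The eigendecomposition step can be illustrated with the unshifted QR iteration on a symmetric tridiagonal matrix (a minimal teaching sketch; a production implementation, like the one the thesis builds on, would add shifts and deflation for speed):

```python
import numpy as np

def qr_eigenvalues(T, iters=200):
    """Approximate the eigenvalues of a symmetric tridiagonal
    matrix T by repeated QR steps: A <- R @ Q."""
    A = np.array(T, dtype=float)
    for _ in range(iters):
        Q, R = np.linalg.qr(A)
        A = R @ Q                 # similarity transform: eigenvalues preserved
    return np.sort(np.diag(A))    # A converges toward a diagonal matrix
```

Each step replaces `A` with `Q.T @ A @ Q`, so the spectrum never changes while the off-diagonal entries decay toward zero.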
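The small-file optimization can be illustrated on a local filesystem. This only sketches the idea of merging files and indexing them by name; real HDFS block handling and the thesis's actual index format are not reproduced, and all names here are illustrative.

```python
import os

def merge_small_files(paths, container):
    """Concatenate many small files into one container file and
    build a name -> (offset, length) index for direct access."""
    index = {}
    offset = 0
    with open(container, "wb") as out:
        for p in paths:
            with open(p, "rb") as f:
                data = f.read()
            out.write(data)
            index[os.path.basename(p)] = (offset, len(data))
            offset += len(data)
    return index

def read_merged(container, index, name):
    """Fetch one original file's bytes via the name-based index."""
    offset, length = index[name]
    with open(container, "rb") as f:
        f.seek(offset)            # one seek replaces one file-open per small file
        return f.read(length)
```

On HDFS the payoff is that the NameNode tracks one block-sized file instead of thousands of tiny ones, and the name-based index turns each small-file read into a single seek.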
【Degree-granting institution】: Harbin Institute of Technology
【Degree level】: Master's
【Year conferred】: 2017
【CLC number】: TP311.13



Article ID: 2203004


Link to this article: http://sikaile.net/kejilunwen/ruanjiangongchenglunwen/2203004.html


