Research on a Hadoop-Based Algorithm for Extracting Feature Kernel Data
[Abstract]: We live in the era of big data. Massive data sets are commonly characterized by four V's: volume, variety, velocity (the need for fast processing), and veracity. Although the volume of data is very large, much of it is redundant. If the data are viewed as a large matrix, that matrix is sparse in most cases and can be mapped to a lower-dimensional space, called the data feature space. Projecting the original data into this space yields the feature kernel data, which usually carry the main information of the original data. After defining the feature kernel data and the feature space whose information loss rate stays below a given loss rate, our aim is to find the best feature kernel data and the optimal feature space. To that end, and in view of the characteristics of high-dimensional big data, this thesis proposes methods for mining the principal components of data with the Hadoop distributed computing framework, together with techniques that address shortcomings encountered when using Hadoop; these techniques effectively reduce memory usage and improve file access efficiency. The thesis first presents the preliminaries and mathematical definitions, which provide the theoretical basis and the evaluation criteria for the algorithms that follow. It then provides a new vector data structure adapted to Hadoop for the distributed environment and defines the workflow and data format used by data senders and receivers on different nodes. Next, the data preprocessing module converts the input into a form the system can recognize; the system then derives a tridiagonal matrix and performs an eigendecomposition of it with the QR algorithm to obtain the feature information.
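As a rough illustration of the eigendecomposition step, the Python sketch below applies the basic (unshifted) QR iteration to a small symmetric tridiagonal matrix. It is not the thesis's distributed implementation: the function names, the fixed iteration count, and the use of Gram-Schmidt for the QR factorization are assumptions made for illustration, and practical code would normally add shifts and deflation.

```python
import math


def matmul(a, b):
    """Multiply two square matrices stored as lists of rows."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def qr_decompose(a):
    """QR factorization by modified Gram-Schmidt.

    Assumes `a` is square and nonsingular (a zero column norm would
    cause a division by zero in this simplified sketch).
    """
    n = len(a)
    cols = [[a[i][j] for i in range(n)] for j in range(n)]
    q_cols = []
    r = [[0.0] * n for _ in range(n)]
    for j in range(n):
        v = cols[j][:]
        # Remove the components along the orthonormal columns found so far.
        for i, q in enumerate(q_cols):
            r[i][j] = sum(q[k] * v[k] for k in range(n))
            v = [v[k] - r[i][j] * q[k] for k in range(n)]
        norm = math.sqrt(sum(x * x for x in v))
        r[j][j] = norm
        q_cols.append([x / norm for x in v])
    q = [[q_cols[j][i] for j in range(n)] for i in range(n)]
    return q, r


def qr_eigen(t, iterations=200):
    """Unshifted QR iteration on a symmetric matrix `t`.

    Returns (eigenvalues, vectors); column j of `vectors` approximates
    the eigenvector belonging to eigenvalues[j].
    """
    n = len(t)
    a = [row[:] for row in t]
    vectors = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    for _ in range(iterations):
        q, r = qr_decompose(a)
        a = matmul(r, q)              # similarity transform: R Q = Q^T A Q
        vectors = matmul(vectors, q)  # accumulate the orthogonal factors
    return [a[i][i] for i in range(n)], vectors
```

For the 3x3 matrix with 2 on the diagonal and 1 on the two off-diagonals, the returned eigenvalues approach 2 and 2 ± √2. The unshifted iteration converges only linearly, which is one reason this should be read as a sketch rather than a recommended implementation.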
Finally, the new projection space is obtained by transforming the eigenvectors, and the kernel data set is obtained by projecting the original data into this space. The implementation operates on vectors throughout. Although each vector has a very high dimension, once the matrix is split by rows each vector occupies only kilobytes of storage. The distributed file system, however, stores every file in data blocks of a fixed size, so with this many small files the NameNode memory was excessively occupied and file access efficiency was too low during implementation. To address Hadoop's weakness in handling large numbers of small files, we propose a technique for optimizing HDFS. The basic idea is to merge small files into large files sized to fit a block and to build indexes over them; a name-based index further improves file access efficiency. Experimental results show that the proposed strategy can effectively mine the core data set of the raw data.
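The small-file merging idea can be sketched in a few lines of Python. This is an in-memory stand-in only: it does not call any HDFS API, and the function names, the index layout (block number, offset, length), and the tiny block size are all hypothetical choices for illustration rather than the thesis's actual design.

```python
def merge_small_files(files, block_size):
    """Pack small files into containers of at most `block_size` bytes.

    `files` maps a file name to its bytes. Returns (blocks, index), where
    index[name] = (block number, offset inside the block, length).
    """
    blocks = [bytearray()]
    index = {}
    for name, data in files.items():
        if len(data) > block_size:
            raise ValueError(f"{name} does not fit into a single block")
        # Start a new container when the current one cannot hold the file.
        if len(blocks[-1]) + len(data) > block_size:
            blocks.append(bytearray())
        index[name] = (len(blocks) - 1, len(blocks[-1]), len(data))
        blocks[-1].extend(data)
    return [bytes(b) for b in blocks], index


def read_file(blocks, index, name):
    """Fetch one small file by name with a single index lookup."""
    block_no, offset, length = index[name]
    return blocks[block_no][offset:offset + length]
```

With three 40-byte row vectors and a 100-byte block size, the first two rows share one container and the third starts a second one, so the file system would track two large objects instead of three small ones while each row stays addressable by name.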
[Degree-granting institution]: Harbin Institute of Technology
[Degree level]: Master's
[Year degree granted]: 2017
[Classification number]: TP311.13