Research on Semi-Structured Data Cleaning Technology Based on Active Learning
[Abstract]: With the rapid development of the Internet, the data it carries can be divided into three kinds: highly structured data, semi-structured data, and raw text. Structured data is widely used because it has a complete logical structure and descriptive information. Semi-structured data lies between the other two and is extremely common on the Internet. It has some structure, but that structure varies widely: the delimiters that separate fields differ from one data set to another and are complex and changeable, so the data cannot be described in a single fixed form. How to analyze semi-structured data has therefore attracted considerable attention. This thesis studies the cleaning of massive semi-structured data: it identifies the valuable information, normalizes the data, and analyzes the attribute of each field, finally producing two-dimensional structured data with attribute annotations, which greatly facilitates subsequent analysis and use. To this end, the thesis proposes three methods:
(1) A double-buffer-based method for parsing files of multiple types in parallel. Using a double-buffered message queue and a thread pool, it is faster than serial parsing and resolves the task backlog that arises in parallel parsing when different formats parse at different speeds.
(2) A regular-expression-based method for recognizing attribute sets. Regular expressions recognize the attribute of each field, and the complete attribute set is identified from the positions of the attributes and the overall structure of the data. On this basis, a data normalization algorithm based on column statistics is proposed: it counts how often and at which positions each attribute occurs, then compares these statistics with the complete attribute set to determine the column to which each field belongs, yielding structured data with attribute annotations.
(3) An active-learning-based method for improving the accuracy of attribute recognition. With the attribute-annotated structured data as the training set, the C4.5 algorithm builds a classification model, and an active-learning-based classifier optimization method further improves the model's attribute recognition accuracy. The thesis proposes an uncertainty sampling algorithm based on a voting mechanism, which selects the samples that most affect classifier accuracy and uses them to update the classification model.
Together these form an efficient, accurate, and highly available data cleaning method that raises the data cleaning success rate to more than 95%.
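The abstract does not spell out the voting rule behind its uncertainty sampling. One common choice, assumed here purely for illustration, is vote entropy over a committee of classifiers: the samples on which the committee disagrees most are sent for labeling. The function names, attribute labels, and sample votes below are hypothetical and are not taken from the thesis.

```python
import math
from collections import Counter
from typing import Hashable, List, Sequence


def vote_entropy(votes: Sequence[Hashable]) -> float:
    """Entropy of the label distribution among committee votes for one sample."""
    total = len(votes)
    counts = Counter(votes)
    return -sum((c / total) * math.log(c / total) for c in counts.values())


def select_uncertain(committee_votes: Sequence[Sequence[Hashable]], k: int) -> List[int]:
    """Return the indices of the k samples the committee disagrees on most."""
    scored = [(vote_entropy(v), i) for i, v in enumerate(committee_votes)]
    # Highest entropy first; ties are broken by original index for determinism.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [i for _, i in scored[:k]]


# Hypothetical votes from three classifiers on four unlabeled fields.
votes = [
    ["date", "date", "date"],   # full agreement, entropy 0
    ["date", "phone", "id"],    # full disagreement, highest entropy
    ["phone", "phone", "id"],   # partial disagreement
    ["id", "id", "id"],         # full agreement, entropy 0
]
print(select_uncertain(votes, k=2))  # prints [1, 2]
```

In an active learning loop, the selected samples would be labeled by a person, added to the training set, and the classifiers retrained. How the thesis builds its committee and how many samples it queries per round are not stated in the abstract.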
[Degree-granting institution]: Harbin Institute of Technology
[Degree level]: Master's
[Year degree granted]: 2017
[Classification number]: TP311.13