Research on Active-Learning-Based Cleaning Techniques for Semi-Structured Data

Published: 2018-09-11 17:04

[Abstract]: The rapid development of the Internet has produced massive amounts of data, which can be divided by structure into highly structured data, semi-structured data, and raw text. Structured data has a complete logical structure and descriptive information, so it can be used widely. Raw text contains little directly usable information and requires complex computation before it can be exploited. Semi-structured data lies between the two and is extremely common on the Internet: it has some structure, but that structure varies greatly, because the delimiters that separate data items are complex and changeable, so it usually cannot be described in a fixed form. How to parse semi-structured data has therefore attracted attention. This thesis studies the cleaning of massive semi-structured data: it identifies the valuable information in the data, normalizes the data, and parses the attribute of each field, finally producing two-dimensional structured data with attribute annotations that greatly facilitates subsequent analysis. To this end, the thesis proposes three methods:

(1) A double-buffer-based parallel parsing method for multiple file types. Using a double-buffered message queue and a thread pool, it improves on the speed of serial parsing and also resolves the task backlog caused by different formats being parsed at different speeds during parallel parsing.

(2) A regular-expression-based method for identifying the attribute set. Regular expressions identify the attribute of each field, and the full attribute set is identified from attribute positions and the overall structure of the data. On this basis, a data normalization algorithm based on row and column statistics counts the number and position of attributes and compares the statistics with the full attribute set to determine the column to which each field belongs, yielding structured data with attribute annotations.

(3) An active-learning-based method for improving attribute identification accuracy. Structured data with labeled attributes serves as the training set, the C4.5 algorithm builds the classification model, and an active-learning-based classifier optimization method further improves the model's attribute identification accuracy. The thesis proposes an uncertainty sampling algorithm based on a voting mechanism, which selects the samples that most affect classifier accuracy, hands them to an expert for labeling, and updates the classification model.

The result is an efficient, accurate, and highly available data cleaning method that raises the cleaning success rate on known data to above 95%.
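The abstract does not spell out how the voting-based uncertainty sampling is computed. One common way to realize the idea is query-by-committee with vote entropy: several classifiers vote on each unlabeled sample, and the samples whose votes are most evenly split are sent for labeling first. The sketch below assumes that formulation. The names `vote_entropy` and `select_queries`, the toy rule-based committee, and the sample pool are all hypothetical and are not taken from the thesis, which trains C4.5 models rather than hand-written rules.

```python
import math
import re
from collections import Counter
from typing import Callable, Hashable, List, Sequence

# Hypothetical type: a committee member maps a raw field to an attribute label.
Classifier = Callable[[str], Hashable]


def vote_entropy(votes: Sequence[Hashable]) -> float:
    """Shannon entropy (in bits) of the label distribution among the votes."""
    total = len(votes)
    counts = Counter(votes)
    return sum(-(c / total) * math.log2(c / total) for c in counts.values())


def select_queries(pool: Sequence[str],
                   committee: Sequence[Classifier],
                   k: int) -> List[str]:
    """Return the k pool samples on which the committee disagrees most."""
    scored = [(vote_entropy([clf(x) for clf in committee]), x) for x in pool]
    # Highest entropy first; the sort is stable, so ties keep pool order.
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [x for _, x in scored[:k]]


# Toy committee (assumption): three crude rules for spotting a phone field.
committee: List[Classifier] = [
    lambda s: "phone" if re.fullmatch(r"\d{11}", s) else "other",
    lambda s: "phone" if s.isdigit() else "other",
    lambda s: "phone" if len(s) >= 8 and s[0] == "1" else "other",
]

pool = ["13800138000", "12345", "hello", "1abcdefgh"]

# The committee agrees on "13800138000" and "hello", so those are skipped;
# the two fields with split votes are the ones chosen for expert labeling.
queries = select_queries(pool, committee, k=2)
print(queries)
```

In an active learning loop, the selected samples would be labeled by the expert, added to the training set, and the committee retrained before the next round of selection.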
[Degree-granting institution]: Harbin Institute of Technology
[Degree level]: Master's
[Year degree granted]: 2017
[Classification number]: TP311.13

Article link: http://sikaile.net/kejilunwen/ruanjiangongchenglunwen/2237303.html
