Research on Semi-Structured Data Cleaning Techniques Based on Active Learning

Published: 2018-09-11 17:04
[Abstract]: The rapid growth of the Internet has produced massive amounts of data. By structure, this data can be divided into highly structured data, semi-structured data, and raw text. Structured data has a complete logical structure and descriptive information, so it can be used widely. Raw text contains less directly usable information and requires complex computation before it can be exploited. Semi-structured data lies between the two and is extremely common on the Internet. It can be regarded as data with some structure, but the structure varies greatly: the delimiters that separate individual data items are complex and changeable, so the data usually cannot be described in a fixed form. How to parse semi-structured data has therefore attracted attention. This thesis studies the cleaning of massive semi-structured data: it identifies the valuable information in the data, normalizes the data, and parses the attribute of each field, producing two-dimensional structured data with attribute annotations. Such structured data greatly simplifies subsequent analysis. To this end, the thesis proposes three methods for cleaning massive semi-structured data:

(1) A double-buffer-based parallel parsing method for multiple file types. It uses a double-buffered message queue and a thread pool, improving on the speed of serial parsing and resolving the task backlog that arises in parallel parsing when different formats are parsed at different speeds.

(2) A regular-expression-based method for recognizing the attribute set. Regular expressions identify the attribute of each field in the data, and the complete attribute set is identified from the attribute positions and the overall structure of the data. Building on this, a data normalization algorithm based on row and column statistics counts the number and positions of attributes, compares the statistics with the complete attribute set, and determines the column to which each field belongs, producing structured data with attribute annotations.

(3) An active-learning-based method for improving the accuracy of attribute recognition. Structured data that already carries attribute annotations serves as the training set, the C4.5 algorithm builds the classification model, and an active-learning-based classifier optimization method further improves the model's attribute recognition accuracy. The thesis proposes an uncertainty sampling algorithm based on a voting mechanism, which selects the samples that most affect classifier accuracy, hands them to an annotator for labeling, and updates the classification model. The result is an efficient, accurate, and highly available data cleaning method that raises the cleaning success rate on known data to above 95%.
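The abstract does not spell out how the voting-based uncertainty sampling works. As a rough illustration only, the sketch below uses vote entropy, a common query-by-committee measure: several classifiers vote on each unlabeled sample, and the samples with the most evenly split votes are chosen for labeling. The function names, the attribute labels, and the three-member committee are all hypothetical and are not taken from the thesis.

```python
import math
from collections import Counter


def vote_entropy(votes):
    """Entropy of the label distribution among committee votes for one sample.

    A unanimous vote gives 0; the more evenly the votes are split,
    the higher the value.
    """
    total = len(votes)
    counts = Counter(votes)
    return -sum((c / total) * math.log(c / total) for c in counts.values())


def select_most_uncertain(committee_votes, k=1):
    """Return the indices of the k samples with the highest vote entropy."""
    scored = [(vote_entropy(v), i) for i, v in enumerate(committee_votes)]
    # Highest entropy first; ties are broken by the lower sample index.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [i for _, i in scored[:k]]


# Hypothetical votes from three classifiers on four unlabeled fields.
votes = [
    ["date", "date", "date"],    # unanimous, no disagreement
    ["phone", "id", "phone"],    # 2-to-1 split
    ["email", "url", "name"],    # three-way split, maximal disagreement
    ["name", "name", "name"],    # unanimous, no disagreement
]

# Indices of the two samples that would be sent for labeling.
queries = select_most_uncertain(votes, k=2)
print(queries)
```

In an active learning loop, the selected samples would be labeled, added to the training set, and the classifier retrained before the next round of selection.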
[Degree-granting institution]: Harbin Institute of Technology
[Degree level]: Master's
[Year conferred]: 2017
[Classification number]: TP311.13

Article ID: 2237303

Article link: http://sikaile.net/kejilunwen/ruanjiangongchenglunwen/2237303.html

