Research on Learning-Based Entity Resolution Methods in Data Warehouses
Topic: data warehouse + data quality. Source: Kunming University of Science and Technology, master's thesis, 2017
[Abstract]: Entity resolution is a redundancy-identification technique for data quality management in data warehouses. As data volumes grow, the low efficiency and insufficient accuracy of traditional entity resolution methods have become increasingly apparent. This thesis reviews the theory of data warehouses and data quality, the related research in China and abroad, and the main approaches to entity resolution. It then studies in depth the algorithmic principles, basic models, module design, and evaluation criteria for entity resolution over massive data. With the goals of improving recognition accuracy and reducing computing time, a learning-based parallel entity resolution algorithm is developed for the data sources of a tobacco group's data center and verified by simulation. The main contributions are as follows:
(1) The Canopy thresholds are determined from the similarity of key attributes in the tuples, and Canopy clustering is used to perform a preliminary blocking of the massive entity set. Because the resulting subsets of tuples may overlap, the algorithm becomes more tolerant of errors.
(2) For the set of similar entity pairs produced by blocking, a word-feature similarity measure is introduced that combines position coding with TF-IDF. Position coding handles cases such as word abbreviations well, while TF-IDF is insensitive to the order of character positions and assigns weights to the words in an attribute string that help distinguish categories. The strengths of the two algorithms are combined to extract the feature vector of each tuple pair.
(3) The mapping from attribute similarities to tuple similarity is nonlinear. Because a neural network can approximate a nonlinear function to arbitrary precision, the network learns the intrinsic relationships among attributes and dynamically adjusts its weights, thresholds, and other parameters in order to decide whether two entities match. Ant colony optimization is applied to address the slow convergence of neural network training and its tendency to fall into local optima. This overcomes a weakness of traditional entity matching, which decides whether a tuple pair refers to the same entity by checking whether a weighted sum of attribute similarities exceeds a manually chosen threshold.
(4) Parallel processing of massive entity resolution is implemented on the Hadoop infrastructure. The method and framework are evaluated in simulation experiments on the data center's supplier data and compared with traditional entity resolution methods in terms of precision, recall, and F1 score. The results indicate that the learning-based entity resolution algorithm achieves higher recognition accuracy, and that recognition efficiency improves substantially as the number of nodes increases.
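The Canopy blocking step in item (1) can be illustrated with a minimal Python sketch. It is not the thesis's implementation: the abstract does not specify the key-attribute similarity, so token-level Jaccard similarity is used here as a stand-in, and the function names (`jaccard`, `canopy_blocks`, `candidate_pairs`), the thresholds 0.3 and 0.7, and the sample supplier names are all hypothetical.

```python
from itertools import combinations


def jaccard(a: set, b: set) -> float:
    """Jaccard similarity of two token sets (1.0 when both are empty)."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def canopy_blocks(records, key, loose=0.3, tight=0.7):
    """Group record indices into canopies that are allowed to overlap.

    loose and tight are similarity thresholds (assumed values): a record joins
    the current canopy at similarity >= loose, and stops being considered for
    later canopies at similarity >= tight.
    """
    if loose > tight:
        raise ValueError("loose threshold must not exceed tight threshold")
    tokens = [set(str(r[key]).lower().split()) for r in records]
    remaining = list(range(len(records)))
    canopies = []
    while remaining:
        center = remaining[0]
        sims = {i: jaccard(tokens[center], tokens[i]) for i in remaining}
        canopies.append([i for i in remaining if sims[i] >= loose])
        # Records between the two thresholds stay available, so they can also
        # fall into a later canopy; this is what makes the blocks overlap.
        remaining = [i for i in remaining if i != center and sims[i] < tight]
    return canopies


def candidate_pairs(canopies):
    """Only records sharing a canopy are compared in the matching stage."""
    pairs = set()
    for block in canopies:
        pairs.update(combinations(sorted(block), 2))
    return pairs


# Hypothetical supplier records, for illustration only.
records = [
    {"name": "Yunnan Tobacco Leaf Company"},
    {"name": "Yunnan Tobacco Leaf Co"},
    {"name": "Kunming Packaging Materials Ltd"},
    {"name": "Kunming Packaging Ltd"},
    {"name": "Yunnan Packaging Materials"},
]
blocks = canopy_blocks(records, key="name")
print(blocks)
print(sorted(candidate_pairs(blocks)))
```

On this toy input, only pairs that share a canopy go on to the more expensive similarity and matching stages, rather than all ten possible pairs. A record that lies between the two thresholds can appear in more than one canopy, which is the overlap the abstract credits with improving fault tolerance.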
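[Degree-granting institution]: Kunming University of Science and Technology
[Degree level]: Master's
[Year conferred]: 2017
[Classification number]: TP311.13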
Article ID: 1959872
Article link: http://sikaile.net/kejilunwen/ruanjiangongchenglunwen/1959872.html