Research on Cleaning and Integration Techniques for Biomedical Linked Data

Published: 2018-12-08 10:59

[Abstract]: In recent years, the rapid development of Semantic Web technology has made it easier to integrate and present massive amounts of data. Because the biomedical field has large data volumes and many subfields, the need to clean and integrate the RDF datasets published by different organizations has become increasingly prominent. Much previous work has applied Semantic Web standards and technologies to build linked data networks for large-scale biomedical data. Problems remain, however. Biomedical datasets published with Semantic Web technology usually provide cross-references to other datasets, but these references often contain errors or fail to fully express the link relationships between datasets. Integrated data must be retrieved through SPARQL queries, which hinders its use by users outside the Semantic Web field (such as biomedical professionals). The use of different ontologies by different datasets also makes the results of cross-dataset queries hard to integrate. This thesis analyzes the linked data of biomedical datasets and studies data cleaning and data integration techniques to address these problems. Data cleaning analyzes and validates the data and corrects duplicate, erroneous, and missing data. Semantic Web data integration involves techniques such as ontology matching and entity linking: ontology matching unifies the classes and properties of the ontologies used by different datasets, and entity linking connects data in different datasets that refer to the same entity. The main contributions of this thesis are as follows:
1. Building on the Bio2RDF project, the mainstream biomedical linked data was surveyed and analyzed. Three kinds of link graphs were constructed (dataset links, entity links, and term links), and the correlations among them were analyzed. The analysis found that dataset links exhibit the small-world phenomenon, that the degree distribution of entity links does not strictly follow a power law, and that terms overlap considerably across datasets. By studying the properties used for entity links, the thesis also built a benchmark test set for evaluating entity linking methods. The link analysis method is generally applicable to the analysis of biomedical datasets.
2. The selected datasets were cleaned. String checking, machine learning, and other methods were applied to errors introduced by automatic conversion and manual input, in order to fill in missing data, correct erroneous data, and eliminate duplicate data. In addition, missing links between datasets were completed and erroneous links were corrected on the basis of the symmetry and transitivity of entity links, which improved both data quality and link quality.
3. The cleaned datasets were integrated into BioSearch, an ontology-based federated search engine over datasets, which uses ontology matching to support federated queries across datasets. The system gives users a simple and efficient interface for querying and retrieving data. Experimental results show that federated querying combined with the semantic query interface defined in this thesis is more efficient than two existing linked data search engines, and that the faceted filtering and entity browsing features implemented in BioSearch improve the user experience.
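The abstract states that missing links were completed using the symmetry and transitivity of entity links, but it does not give the procedure. The following is a minimal sketch of one way such a completion could work, assuming entity links behave as an equivalence relation; it is not the thesis's actual method. The function name `complete_links` and the entity identifiers are hypothetical, chosen only for illustration.

```python
from itertools import permutations


def complete_links(links):
    """Return directed links implied by symmetry and transitivity
    that are absent from `links` (an iterable of (source, target) pairs)."""
    links = list(links)
    parent = {}

    def find(x):
        # Union-find lookup with path halving.
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    # Merge the two endpoints of every known link into one cluster.
    for a, b in links:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_a] = root_b

    # Group entities by cluster representative.
    clusters = {}
    for node in list(parent):
        clusters.setdefault(find(node), set()).add(node)

    # Every ordered pair inside a cluster should be linked; report the gaps.
    known = set(links)
    return {
        (a, b)
        for members in clusters.values()
        for a, b in permutations(members, 2)
        if (a, b) not in known
    }


# Hypothetical identifiers from three datasets (not real records).
example_links = [("dsA:e1", "dsB:e7"), ("dsB:e7", "dsC:e3")]
inferred = complete_links(example_links)
for source, target in sorted(inferred):
    print(source, "->", target)
```

A closure of this kind propagates any incorrect link to the whole cluster, so in practice it would need to run after, or together with, the error correction step the abstract mentions.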
[Degree-granting institution]: Nanjing University
[Degree level]: Master's
[Year degree granted]: 2017
[Classification number]: TP311.13


Article ID: 2368234


Article link: http://sikaile.net/kejilunwen/sousuoyinqinglunwen/2368234.html
