Research on Cleaning and Integration Techniques for Biomedical Linked Data
[Abstract]: In recent years, the rapid development of Semantic Web technology has facilitated the integration and presentation of massive data. Because the biomedical field produces large volumes of data across many sub-fields, the need to clean and integrate the RDF data sets published by various organizations is increasingly prominent. Many previous efforts have used Semantic Web standards and technologies to build linked data networks for massive biomedical data, but several problems remain. Biomedical data sets published with Semantic Web technology usually provide cross-references to other data sets, yet these references often contain errors or fail to fully express the link relationships between data sets. The integrated data must be retrieved with SPARQL queries, which hinders its use by users outside the Semantic Web field (such as biomedical professionals). The different ontologies used by different data sets also make it difficult to integrate the results of cross-dataset queries. This paper analyzes the linked data of biomedical data sets and studies data cleaning and data integration techniques to address these problems. Data cleaning analyzes and validates the data and corrects duplicate, erroneous, and missing data. Semantic Web data integration involves ontology matching, entity linking, and related techniques. Ontology matching unifies the classes and properties of different data sets, and entity linking connects records in different data sets that refer to the same entity. The main contributions of this paper are as follows:
1. Based on the Bio2RDF project, the mainstream biomedical linked data are surveyed and analyzed. Three kinds of link graphs are constructed (data set links, entity links, and terminology links), and the relationships between them are analyzed. The data set link graph is found to exhibit the small-world phenomenon, and the degree distribution of entity links does not strictly follow a power law. There is considerable overlap between different data sets. In addition, a standard test set is constructed to evaluate entity linking methods. The link analysis method can be applied to data set analysis in the biomedical domain.
2. The selected data sets are cleaned. String detection, machine learning, and other methods are used to fill in missing data, correct erroneous data, and eliminate duplicate data caused by automatic conversion and manual entry. In addition, based on the symmetry and transitivity of entity links, missing links between data sets are analyzed and completed, and erroneous links are corrected, improving both data quality and link quality.
3. The cleaned data sets are integrated in BioSearch, an ontology-based federated search engine over data sets, and ontology matching is used to support joint queries across data sets. The system provides users with a simple and efficient interface for querying and retrieving data. The experimental results show that the joint query and semantic query interfaces defined in this paper are more efficient than two existing linked data search engines. The facet filtering and entity browsing functions implemented in BioSearch are also shown to improve the user experience.
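The link completion described in contribution 2 relies on entity links being symmetric and transitive. The abstract does not give the thesis's actual algorithm, so the following is only a minimal sketch of the general idea, assuming links are plain pairs of entity identifiers. The function name `complete_links` and the identifiers in the usage example are hypothetical. It groups linked entities with a union-find structure and reports the pairs implied by the closure but absent from the input.

```python
from itertools import combinations


def complete_links(links):
    """Return entity pairs implied by symmetry and transitivity but missing from `links`.

    `links` is an iterable of (entity_a, entity_b) pairs, treated as undirected.
    This is an illustrative sketch, not the method used in the thesis.
    """
    links = list(links)
    parent = {}

    def find(x):
        # Union-find lookup with path halving.
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    # Merge the two clusters joined by each link.
    for a, b in links:
        parent[find(a)] = find(b)

    # Collect the members of each cluster of co-referring entities.
    clusters = {}
    for node in list(parent):
        clusters.setdefault(find(node), []).append(node)

    # Any pair inside a cluster without an explicit link is a missing link.
    known = {frozenset(pair) for pair in links}
    missing = set()
    for members in clusters.values():
        for a, b in combinations(sorted(members), 2):
            if frozenset((a, b)) not in known:
                missing.add((a, b))
    return missing


if __name__ == "__main__":
    # Hypothetical identifiers from three data sets A, B, and C.
    example = [("a:1", "b:1"), ("b:1", "c:1")]
    print(complete_links(example))
```

With the example input, the only implied pair not already present is the link between `a:1` and `c:1`. Because a single erroneous link would merge two unrelated clusters, a closure like this would presumably be applied together with the error-link correction that the abstract mentions.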
[Degree-granting institution]: Nanjing University
[Degree level]: Master's
[Year degree awarded]: 2017
[Classification number]: TP311.13
Article ID: 2368234
Article link: http://sikaile.net/kejilunwen/sousuoyinqinglunwen/2368234.html