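Research on and Application of Chinese RDF Knowledge Base Construction
Topic: knowledge base. Focus: Resource Description Framework. Source: Southwest Jiaotong University, 2016 master's thesis. Document type: degree thesis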
[Abstract]: Big data on the Internet has brought a wealth of information into daily life: a keyword search is enough to retrieve links to relevant news and reference material. However, following links one by one is a very inefficient way to acquire knowledge and information as the volume of data keeps growing. Most information on the Internet is currently stored and published as web pages, with documents connected by hyperlinks. Humans can understand the information in these documents, but computers have difficulty doing so. To make better use of the big data generated by the Internet, research institutions abroad have built knowledge bases from the English Wikipedia, such as FreeBase and DBPedia. Knowledge bases in China include Baidu Zhixin, Sogou Zhilifang, and Tsinghua XLore. Knowledge bases have important application value in research areas such as knowledge graphs, information fusion, and AI question answering. Foreign knowledge bases such as FreeBase provide public Resource Description Framework (RDF) data sources, but they contain relatively little Chinese entity data, so how to build a high-quality Chinese RDF knowledge base has become a current research focus. Against this background, this thesis studies methods for building a Chinese RDF knowledge base from online encyclopedias and carries out work in the following areas:
1. It studies large-scale online encyclopedia data collection in depth and analyzes the specific problems and challenges that arise during collection. It builds an online encyclopedia data collection system that combines the Spring MVC framework with the Scrapy framework; the system crawls stably and has a good human-computer interface. It also proposes an algorithm for automatically extracting proxy IP information, which extracts that information effectively and works around the anti-crawling measures of websites.
2. It studies entity information extraction for online encyclopedia data and proposes a method that uses RDFS semantic information to annotate the extracted data semantically and to normalize the RDF data. It studies graph database storage of RDF data and develops an RDF graph storage system based on NEO4J. A comparison with traditional relational database storage shows that the storage system implemented in this thesis can meet the storage and query requirements of large-scale RDF data.
3. It studies in depth the entity alignment problem that arises when building a knowledge base from the heterogeneous data sources Baidu Baike and Hudong Baike, and proposes an entity alignment method that combines entity attribute information with contextual topic features. A comparison with traditional entity alignment methods shows that the proposed method outperforms existing ones.
4. It combines large-scale online encyclopedia data collection, RDF conversion of entity information, storage and SPARQL query techniques, and entity alignment across heterogeneous data sources to design and implement a system that automatically builds an RDF knowledge base from Chinese online encyclopedias. Through configurable collection tasks, the system downloads online encyclopedia data, extracts entity data, converts it to RDF, and stores it, thereby providing entity query and SPARQL query functions to external applications.
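The abstract does not give the formulas behind the entity alignment method, so the following is only a minimal sketch of the general idea of combining attribute information with context features. It assumes Jaccard overlap of (attribute, value) pairs for the attribute part and cosine similarity of bag-of-words context vectors as a stand-in for the topic features. The function names, the weight `alpha`, and the sample entities are all hypothetical and are not taken from the thesis.

```python
import math
from collections import Counter


def attribute_similarity(attrs_a, attrs_b):
    """Jaccard overlap of (attribute, value) pairs from two infoboxes."""
    pairs_a, pairs_b = set(attrs_a.items()), set(attrs_b.items())
    if not pairs_a and not pairs_b:
        return 0.0
    return len(pairs_a & pairs_b) / len(pairs_a | pairs_b)


def context_similarity(words_a, words_b):
    """Cosine similarity of term-frequency vectors built from context words."""
    tf_a, tf_b = Counter(words_a), Counter(words_b)
    dot = sum(tf_a[w] * tf_b[w] for w in tf_a.keys() & tf_b.keys())
    norm_a = math.sqrt(sum(v * v for v in tf_a.values()))
    norm_b = math.sqrt(sum(v * v for v in tf_b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def alignment_score(entity_a, entity_b, alpha=0.5):
    """Weighted mix of both similarities; alpha is an assumed weight."""
    attr = attribute_similarity(entity_a["attrs"], entity_b["attrs"])
    ctx = context_similarity(entity_a["context"], entity_b["context"])
    return alpha * attr + (1 - alpha) * ctx


# Hypothetical entries for the same name in two encyclopedias.
baidu = {
    "attrs": {"name": "Li Bai", "born": "701", "occupation": "poet"},
    "context": ["tang", "dynasty", "poet", "poetry"],
}
hudong = {
    "attrs": {"name": "Li Bai", "born": "701", "era": "Tang"},
    "context": ["tang", "poet", "poetry", "wine"],
}
other = {
    "attrs": {"name": "Li Bai", "occupation": "singer"},
    "context": ["pop", "music", "album"],
}

same_score = alignment_score(baidu, hudong)
diff_score = alignment_score(baidu, other)
print(same_score, diff_score)
```

In this sketch, two entries would be treated as the same entity when the score exceeds a chosen threshold. The point of mixing the two signals is that entries sharing a name but describing different people tend to disagree on both attributes and context.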
[Degree-granting institution]: Southwest Jiaotong University
[Degree level]: Master's
[Year degree granted]: 2016
[Classification number]: TP391.1
Article ID: 1583708
Article link: http://sikaile.net/kejilunwen/ruanjiangongchenglunwen/1583708.html