Research on Bootstrapping-Based Automatic Extraction of Domain Knowledge

Published: 2018-03-19 00:01

Topic: domain knowledge extraction　Focus: semi-structured websites　Source: Shandong University, master's thesis, 2012　Type: degree thesis

[Abstract]: With the rapid development of the Internet and the fast growth of Web applications, the volume of information online has expanded dramatically. The Web has become an important knowledge base in everyday life, and the need to access information efficiently is especially pressing. The Web's massive data contains a large amount of semi-structured domain knowledge, such as movies, books, and hotels, that is closely tied to our daily lives. Search engines can retrieve information from this data, but their results are not very reliable. This domain knowledge often comes from providers' back-end databases, and keyword-matching search engines, because of their inherent limitations, cannot index domain knowledge embedded in semi-structured HTML pages. How to automatically extract and organize this domain knowledge from large-scale websites has therefore become a hot topic in information extraction research. Web information extraction can extract data from semi-structured web pages and store it in a database in structured form.

Building on an analysis of current Web information extraction techniques, this thesis uses the tag path technique instead of the DOM tree to represent HTML documents. This representation greatly reduces the number of tags and improves the algorithm's performance. For semi-structured websites, the thesis proposes a new Bootstrapping-based algorithm for automatically extracting domain knowledge: Domain-specific Knowledge Extraction from Websites (DKEW). DKEW uses an ontology to uniformly annotate the semi-structured data extracted from the same domain, which makes the data easier to store and query. DKEW first applies a clustering algorithm based on the tag path technique to cluster the target pages and filter out noise pages, so that it extracts only from semi-structured pages that contain detailed information. Based on the tag path technique, a new pattern definition is proposed.
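The tag path representation and the page clustering step described above can be illustrated with a minimal Python sketch. The abstract does not give the thesis's actual definitions, so everything here is an assumption: a tag path is taken to be the root-to-node sequence of tag names for each text node, pages are compared by the Jaccard similarity of their tag path sets, and clustering is a simple greedy pass. The names `TagPathCollector`, `tag_paths`, `jaccard`, and `cluster_pages` and the threshold of 0.5 are hypothetical.

```python
from html.parser import HTMLParser

# Void elements never get a closing tag, so they must not stay on the stack.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}


class TagPathCollector(HTMLParser):
    """Collects the root-to-node tag path of every non-empty text node."""

    def __init__(self):
        super().__init__()
        self.stack = []  # tags that are currently open
        self.paths = []  # (tag_path, text) pairs in document order

    def handle_starttag(self, tag, attrs):
        if tag not in VOID_TAGS:
            self.stack.append(tag)

    def handle_endtag(self, tag):
        # Pop back to the matching open tag; ignore stray closing tags.
        if tag in self.stack:
            while self.stack.pop() != tag:
                pass

    def handle_data(self, data):
        text = data.strip()
        if text:
            self.paths.append(("/".join(self.stack), text))


def tag_paths(html):
    """Return the (tag_path, text) pairs of one HTML document."""
    collector = TagPathCollector()
    collector.feed(html)
    collector.close()
    return collector.paths


def jaccard(a, b):
    """Jaccard similarity of two sets (1.0 when both are empty)."""
    return len(a & b) / len(a | b) if a | b else 1.0


def cluster_pages(pages, threshold=0.5):
    """Greedy single-pass clustering on each page's set of tag paths."""
    clusters = []  # each cluster: {"signature": set of paths, "members": [ids]}
    for page_id, html in pages.items():
        signature = {path for path, _ in tag_paths(html)}
        for cluster in clusters:
            if jaccard(signature, cluster["signature"]) >= threshold:
                cluster["members"].append(page_id)
                break
        else:
            clusters.append({"signature": signature, "members": [page_id]})
    return [cluster["members"] for cluster in clusters]


if __name__ == "__main__":
    pages = {
        "movie1": "<html><body><h1>Inception</h1><table><tr><td>Director</td>"
                  "<td>Christopher Nolan</td></tr></table></body></html>",
        "movie2": "<html><body><h1>Alien</h1><table><tr><td>Director</td>"
                  "<td>Ridley Scott</td></tr></table></body></html>",
        "about": "<html><body><div><p>About us</p><p>Contact</p></div></body></html>",
    }
    print(cluster_pages(pages))  # [['movie1', 'movie2'], ['about']]
```

In this sketch the two detail pages share the same two tag paths and fall into one cluster, while the "about" page ends up alone and could be discarded as noise. Repeated nodes with the same ancestry collapse into a single path, which is one plausible reading of how the representation reduces the number of tags to compare.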
For pages in the same cluster, patterns are learned automatically with the help of machine learning methods and domain seeds. The learned patterns are then used to automatically extract domain knowledge and match it to a predefined domain ontology, and the matched knowledge is stored in structured, easily queried knowledge base tables. During extraction, newly extracted domain knowledge with high confidence is used to expand the domain seeds and the ontology for use in the next iteration. Finally, Bootstrapping ties these extraction steps together into an automatic extraction tool that requires no manual supervision. DKEW needs only a small amount of manual effort to initialize the domain seeds.

To validate DKEW, this thesis crawls web pages from multiple domains with a custom web crawler. Experiments show that DKEW not only outperforms the existing Web information extraction method RoadRunner in performance but is also far more efficient. Whereas RoadRunner requires the extracted data to be matched manually, DKEW performs ontology matching automatically, saving considerable labor and time. Experiments across multiple domains show that DKEW can be applied to large-scale Web information extraction.
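The bootstrapping cycle described above (learn patterns from seeds, extract with the patterns, grow the seeds, repeat) can be sketched in Python as follows. This is an illustration under stated assumptions, not the DKEW algorithm itself: each page is assumed to be already reduced to a list of (tag path, text) pairs, a pattern is assumed to be a tag path plus the text of the preceding node, `min_support` is a crude stand-in for the confidence filter, and plain attribute names stand in for ontology concepts. All function and parameter names are hypothetical.

```python
from collections import Counter


def learn_patterns(pages, seeds, min_support=1):
    """Find (attribute, tag_path, preceding_label) contexts where seed values occur."""
    support = Counter()
    for nodes in pages:
        # Pair every node with the node that precedes it in document order.
        for (_, label), (path, text) in zip(nodes, nodes[1:]):
            for attribute, values in seeds.items():
                if text in values:
                    support[(attribute, path, label)] += 1
    return {pattern for pattern, count in support.items() if count >= min_support}


def extract(pages, patterns):
    """Apply the learned patterns to every page and collect attribute values."""
    found = {}
    for nodes in pages:
        for (_, label), (path, text) in zip(nodes, nodes[1:]):
            for attribute, p_path, p_label in patterns:
                if (path, label) == (p_path, p_label):
                    found.setdefault(attribute, set()).add(text)
    return found


def bootstrap(pages, seeds, min_support=1, max_iterations=5):
    """Alternate pattern learning and extraction until the seeds stop growing."""
    seeds = {attr: set(values) for attr, values in seeds.items()}  # keep caller's data intact
    for _ in range(max_iterations):
        patterns = learn_patterns(pages, seeds, min_support)
        grew = False
        for attribute, values in extract(pages, patterns).items():
            new_values = values - seeds[attribute]
            if new_values:
                seeds[attribute] |= new_values  # extracted values become seeds
                grew = True
        if not grew:
            break
    return seeds


if __name__ == "__main__":
    pages = [
        [("html/body/h1", "Inception"),
         ("html/body/table/tr/td", "Director"),
         ("html/body/table/tr/td", "Christopher Nolan")],
        [("html/body/h1", "Alien"),
         ("html/body/table/tr/td", "Director"),
         ("html/body/table/tr/td", "Ridley Scott")],
    ]
    result = bootstrap(pages, {"director": {"Christopher Nolan"}})
    print(sorted(result["director"]))  # ['Christopher Nolan', 'Ridley Scott']
```

Starting from a single seed value, the first iteration learns that a table cell following the label "Director" holds a director, and the second page then yields a new value that joins the seed set. A real system would need a stricter confidence measure before accepting new seeds, since one wrong seed can propagate through later iterations.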
[Degree-granting institution]: Shandong University
[Degree level]: Master's
[Year degree granted]: 2012
[Classification number]: TP391.1


Article link: http://sikaile.net/kejilunwen/sousuoyinqinglunwen/1631894.html
