Research on Bootstrapping-Based Automatic Extraction of Domain Knowledge
Published: 2018-03-19 00:01
Topic: domain knowledge extraction  Focus: semi-structured websites  Source: Shandong University, 2012 master's thesis  Document type: degree thesis
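[Abstract (translated from Chinese)]:With the rapid development of the Internet and the fast growth of Web applications, the volume of information online has expanded dramatically. The Web has become an important knowledge base in daily life, and the need to obtain information efficiently is especially pressing. The Web's massive data contains a large amount of semi-structured domain knowledge, for example about movies, books, and hotels, which is closely related to our daily lives. Although search engines can retrieve information from this data, their results are not very reliable. This domain knowledge usually comes from vendors' back-end databases, and keyword-matching search engines, because of their inherent limitations, cannot index domain knowledge embedded in semi-structured HTML pages. How to automatically extract and organize this domain knowledge from large-scale websites has therefore become a hot topic in information extraction research. Web Information Extraction can extract data from semi-structured web pages and store it in a database in structured form.
Building on an analysis of current Web information extraction techniques, this thesis uses the Tag Path Technique instead of a DOM tree to represent HTML documents. This representation greatly reduces the number of tags and improves the algorithm's performance. For semi-structured websites, the thesis proposes a new Bootstrapping-based algorithm for automatically extracting domain knowledge: Domain-specific Knowledge Extraction from Websites (DKEW).
DKEW uses an ontology to uniformly annotate the semi-structured data extracted within one domain, which makes storage and querying easier. DKEW first clusters the target pages with a clustering algorithm based on tag paths and filters out noise pages; it extracts only from semi-structured pages that contain detailed information. A new pattern definition is proposed on the basis of tag paths. For pages in the same cluster, patterns are learned automatically with the help of machine learning methods and domain seeds. The learned patterns are then used to extract domain knowledge automatically and match it to a predefined domain ontology, and the matched knowledge is stored in structured, easily queried knowledge base tables. During extraction, newly extracted domain knowledge with high confidence is used to expand the domain seeds and the ontology for use in the next iteration. Finally, Bootstrapping ties these extraction steps together into an automatic extraction tool that needs no human supervision; DKEW requires only a small amount of manual effort to initialize the domain seeds. To evaluate DKEW, the thesis uses a custom web crawler to crawl web pages from several domains. Experiments show that DKEW not only outperforms the existing Web information extraction method RoadRunner but is also far more efficient. Whereas RoadRunner requires the extracted data to be matched manually, DKEW matches it to the ontology automatically, saving considerable labor and time. Experiments across multiple domains show that DKEW can be applied to large-scale Web information extraction.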
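The abstract does not define the tag path representation or the clustering algorithm in detail. As a minimal sketch, assume a tag path is the sequence of tag names from the document root down to an element, and that two pages are structurally similar when their sets of tag paths overlap heavily. The names `TagPathCollector`, `tag_paths`, and `jaccard`, the Jaccard measure itself, and the sample pages are all illustrative assumptions, not DKEW's actual definitions.

```python
from html.parser import HTMLParser

# Void elements never get a closing tag, so they must not stay on the stack.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}


class TagPathCollector(HTMLParser):
    """Collects the root-to-node tag path of every element in a page."""

    def __init__(self):
        super().__init__()
        self.stack = []     # tags that are currently open
        self.paths = set()  # distinct tag paths seen so far

    def handle_starttag(self, tag, attrs):
        self.paths.add("/".join(self.stack + [tag]))
        if tag not in VOID_TAGS:
            self.stack.append(tag)

    def handle_endtag(self, tag):
        # Pop back to the matching open tag; tolerate unclosed inner tags.
        if tag in self.stack:
            while self.stack and self.stack.pop() != tag:
                pass


def tag_paths(html):
    """Return the set of distinct tag paths in an HTML string."""
    collector = TagPathCollector()
    collector.feed(html)
    collector.close()
    return collector.paths


def jaccard(a, b):
    """Jaccard similarity of two sets (assumed here as the page similarity)."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


# Hypothetical pages: two detail pages from one template and one listing page.
detail_a = ("<html><body><div><h1>Inception</h1><table><tr>"
            "<td>Director</td><td>Christopher Nolan</td>"
            "</tr></table></div></body></html>")
detail_b = ("<html><body><div><h1>Life of Pi</h1><table><tr>"
            "<td>Director</td><td>Ang Lee</td>"
            "</tr></table></div></body></html>")
listing = ("<html><body><ul><li><a href='#'>Inception</a></li>"
           "<li><a href='#'>Life of Pi</a></li></ul></body></html>")

print(jaccard(tag_paths(detail_a), tag_paths(detail_b)))  # 1.0
print(jaccard(tag_paths(detail_a), tag_paths(listing)))   # 0.2
```

Because repeated elements collapse into one path, a page is described by far fewer items than its DOM tree has nodes, which is consistent with the abstract's claim that the representation reduces the number of tags. A clustering step could then group pages whose similarity exceeds a chosen threshold and discard small or dissimilar groups as noise.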
[Abstract]:With the rapid development of the Internet and the rapid growth of Web applications, the scale of information on the network has expanded dramatically. The network has become an important knowledge base in people's lives, and the need for efficient access to information is particularly urgent. The vast amount of data on the Internet contains a large amount of semi-structured domain knowledge, such as movies, books, and hotels, that is closely related to our lives. Although information can be retrieved from this data through a search engine, the search results are not very reliable. Such domain knowledge often comes from suppliers' back-end databases, and search engines based on keyword matching, owing to their own limitations, cannot index the domain knowledge embedded in semi-structured HTML web pages. How to automatically extract and organize this domain knowledge from large-scale websites has become a hot topic in information extraction. Web Information Extraction can extract data from semi-structured web pages and store it in a database in a structured way. Based on an analysis of current Web information extraction technology, the Tag Path Technique is used to represent HTML documents instead of the DOM tree. This method greatly reduces the number of tags and improves the performance of the algorithm. For semi-structured websites, a new Bootstrapping-based algorithm for automatic domain knowledge extraction is proposed: Domain-specific Knowledge Extraction from Websites (DKEW). DKEW uses an ontology to uniformly annotate the semi-structured data extracted from the same domain, which is convenient for storage and querying. DKEW first uses a clustering algorithm based on the tag path technique to cluster the target web pages and filter out noisy pages, so that it extracts only semi-structured web pages with detailed information. Based on the tag path technique, a new pattern definition is proposed. For web pages of the same class, pattern learning is carried out automatically by means of machine learning methods and domain seeds. The learned patterns are then used to automatically extract domain knowledge and match it to the predefined domain ontology, and the matched domain knowledge is stored in structured, query-friendly knowledge base tables. At the same time, newly extracted domain knowledge with high credibility is used to expand the domain seeds and the ontology for the next iteration. Finally, the related knowledge extraction processes are combined by the Bootstrapping method into an automatic extraction tool that needs no manual supervision. DKEW requires only a small amount of manpower to initialize the domain seeds. To verify DKEW, this thesis uses a self-defined web crawler to crawl web data from multiple domains. Experiments show that DKEW is not only better than the existing Web information extraction method RoadRunner in performance, but also far more efficient. Whereas RoadRunner requires the extracted data to be matched manually, DKEW matches it to the ontology automatically, which saves a great deal of manpower and time. Experiments in multiple domains show that DKEW can be used in large-scale Web information extraction.
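The iterative loop described in the abstract (learn patterns from seeds, extract with the patterns, feed new knowledge back into the seeds) can be sketched as follows. This is only an illustration under strong assumptions: a page is modeled as a list of hypothetical `(tag_path, text)` pairs, a "pattern" is simply a tag path at which a known seed value occurs, and the `min_support` count stands in for the confidence check that the abstract mentions but does not specify. Ontology matching is omitted, and the function names and sample data are invented for the example.

```python
from collections import Counter


def learn_patterns(pages, seeds, min_support=1):
    """Return tag paths at which at least `min_support` seed values occur."""
    support = Counter()
    for page in pages:
        for path, text in page:
            if text in seeds:
                support[path] += 1
    return {path for path, count in support.items() if count >= min_support}


def extract(pages, patterns):
    """Return every text value found at one of the learned tag paths."""
    return {text for page in pages for path, text in page if path in patterns}


def bootstrap(pages, seeds, max_iterations=10):
    """Alternate pattern learning and extraction until nothing new appears."""
    known = set(seeds)
    for _ in range(max_iterations):
        patterns = learn_patterns(pages, known)
        new_values = extract(pages, patterns) - known
        if not new_values:
            break
        known |= new_values  # extracted values become seeds for the next round
    return known


# Hypothetical pages from two sites that use different templates.
pages = [
    [("html/body/div/h1", "Inception"),
     ("html/body/div/span", "Christopher Nolan")],
    [("html/body/div/h1", "Life of Pi"),
     ("html/body/div/span", "Ang Lee")],
    [("html/body/table/tr/th", "Life of Pi"),
     ("html/body/table/tr/td", "Ang Lee")],
    [("html/body/table/tr/th", "Spirited Away"),
     ("html/body/table/tr/td", "Hayao Miyazaki")],
]

directors = bootstrap(pages, {"Christopher Nolan"})
print(sorted(directors))
# ['Ang Lee', 'Christopher Nolan', 'Hayao Miyazaki']
```

Starting from a single seed, the first round learns the pattern of the first site and extracts "Ang Lee"; that value then exposes the second site's pattern in the next round, which yields "Hayao Miyazaki". This is the sense in which only the initial seeds need manual effort. A real system would need a stricter acceptance test than `min_support=1`, since one wrong value accepted as a seed can propagate through later iterations.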
[Degree-granting institution]:Shandong University
[Degree level]:Master's
[Year degree conferred]:2012
[Classification number]:TP391.1
Article ID: 1631894
Article link: http://sikaile.net/kejilunwen/sousuoyinqinglunwen/1631894.html