Research on Bootstrapping-Based Automatic Extraction of Domain Knowledge

Published: 2018-03-19 00:01

  Topic: domain knowledge extraction. Focus: semi-structured websites. Source: Shandong University, master's thesis, 2012. Document type: degree thesis


[Abstract]: With the rapid development of the Internet and the fast growth of Web applications, the volume of information online has expanded dramatically. The Web has become an important knowledge base in daily life, and the need to obtain information efficiently is especially pressing. The Web's massive data contains a large amount of semi-structured domain knowledge, such as movies, books, and hotels, that is closely related to our lives. Although search engines can retrieve information from this data, their results are not very reliable. This domain knowledge usually comes from vendors' back-end databases, and keyword-matching search engines, because of their inherent limitations, cannot index domain knowledge embedded in semi-structured HTML pages. How to automatically extract and organize this domain knowledge from large-scale websites has become a hot topic in information extraction research. Web Information Extraction can extract data from semi-structured web pages and store it in a database in structured form.

Building on an analysis of current Web information extraction techniques, this thesis uses the Tag Path Technique instead of the DOM tree to represent HTML documents. This representation greatly reduces the number of tags and improves the performance of the algorithm. For semi-structured websites, the thesis proposes a new Bootstrapping-based algorithm for automatically extracting domain knowledge: Domain-specific Knowledge Extraction from Websites (DKEW).

DKEW uses an ontology to uniformly annotate semi-structured data extracted from the same domain, which simplifies storage and querying. DKEW first clusters the target pages with a clustering algorithm based on the tag path technique and filters out noise pages, so that it extracts only from semi-structured pages that contain detailed information. Based on the tag path technique, a new pattern definition is proposed. For pages in the same cluster, patterns are learned automatically with the help of machine learning methods and domain seeds. The learned patterns are then used to extract domain knowledge automatically and match it to a predefined domain ontology, and the matched knowledge is stored in structured, easily queried knowledge base tables. During extraction, newly extracted high-confidence domain knowledge is used to expand the domain seeds and the ontology for the next iteration. Finally, the Bootstrapping method ties these extraction steps together into an automatic extraction tool that requires no human supervision. DKEW needs only a small amount of manual effort to initialize the domain seeds.

To validate DKEW, the thesis uses a custom web crawler to collect page data from multiple domains. Experiments show that DKEW not only outperforms the existing Web information extraction method RoadRunner, but is also far more efficient. Whereas RoadRunner requires the extracted data to be matched manually, DKEW performs ontology matching automatically, which saves considerable manual effort and time. Experiments on multiple domains show that DKEW can be applied to large-scale Web information extraction.
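The abstract does not give the thesis's exact tag path definition or its clustering algorithm, so the following is only a minimal sketch of the general idea, under assumptions: each text node is represented by the sequence of open tags leading to it, and two pages are compared by the Jaccard similarity of their sets of tag paths. The names `TagPathCollector`, `tag_paths`, and `path_similarity` are hypothetical and do not come from DKEW.

```python
from html.parser import HTMLParser

# Void elements never get a closing tag, so they must not stay on the stack.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}


class TagPathCollector(HTMLParser):
    """Collects (tag path, text) pairs for every non-empty text node."""

    def __init__(self):
        super().__init__()
        self.stack = []   # currently open tags, root first
        self.items = []   # (tag_path, text) pairs in document order

    def handle_starttag(self, tag, attrs):
        if tag not in VOID_TAGS:
            self.stack.append(tag)

    def handle_endtag(self, tag):
        # Pop back to the innermost matching open tag; ignore stray closers.
        if tag in self.stack:
            while self.stack.pop() != tag:
                pass

    def handle_data(self, data):
        text = data.strip()
        if text:
            self.items.append(("/".join(self.stack), text))


def tag_paths(html):
    """Return the (tag path, text) pairs of an HTML string."""
    parser = TagPathCollector()
    parser.feed(html)
    parser.close()
    return parser.items


def path_similarity(html_a, html_b):
    """Jaccard similarity of the two pages' sets of distinct tag paths.

    This measure is an assumption chosen for illustration, not the
    clustering criterion used in the thesis.
    """
    a = {path for path, _ in tag_paths(html_a)}
    b = {path for path, _ in tag_paths(html_b)}
    return len(a & b) / len(a | b) if a | b else 1.0


detail_a = ("<html><body><div><h1>Inception</h1>"
            "<span>Director</span><span>Nolan</span></div></body></html>")
detail_b = ("<html><body><div><h1>Heat</h1>"
            "<span>Director</span><span>Mann</span></div></body></html>")
list_page = "<html><body><ul><li>Inception</li><li>Heat</li></ul></body></html>"

print(tag_paths(detail_a))
print(path_similarity(detail_a, detail_b))   # same template
print(path_similarity(detail_a, list_page))  # different template
```

In this toy input the two detail pages share every tag path while the list page shares none, which is the kind of signal a tag-path-based clustering step could use to separate detail pages from noise pages.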
[Degree-granting institution]: Shandong University
[Degree level]: Master's
[Year degree granted]: 2012
[Classification number]: TP391.1

Article ID: 1631894

Link to this article: http://sikaile.net/kejilunwen/sousuoyinqinglunwen/1631894.html

