Research on Deep Web Data Source Discovery and Query Result Extraction
Topic: Deep Web + data source discovery. Source: Southwest Jiaotong University, master's thesis, 2013
[Abstract]: With the rapid development of Internet technology, the Web holds more and more valuable information. However, the information offered by different sites varies enormously in both quantity and quality, which makes it difficult for people to select high-quality information. Search engines classify, organize, and index Web resources, greatly improving the efficiency with which people obtain valuable resources. Some data, however, resides in back-end databases and cannot be retrieved by traditional search engines; this portion of the Web is called the Deep Web. Deep Web data is highly structured, large in volume, and of high quality, so studying it is of considerable significance.
This thesis studies how to discover and extract Deep Web data. To make use of information in the Deep Web, the first problem is to discover Deep Web data sources. Second, for the result data regions returned after a query is submitted to a Deep Web source, automatically locating these regions is a prerequisite for extracting their information. To address these problems, the thesis makes three main contributions: it studies and improves a data source discovery method; it adopts a new algorithm for comparing the structural similarity of web pages and, based on it, implements the identification of data regions in web pages; and it designs a framework for a Deep Web information integration system and implements the functional modules for data source discovery and result page information extraction.
The first part concerns Deep Web data source discovery and the improvement of the method. The thesis designs a data source discovery framework. For the problem of identifying query interfaces, it analyzes how query interfaces differ from other forms and applies a set of rules to make the decision. A data source is generally limited to one particular domain, so to find data sources accurately it is necessary to determine whether a source is relevant to the topic category. The thesis analyzes the shortcomings of traditional data source classification methods with respect to feature selection and improves the feature selection strategy. Experiments show that the improved method can effectively discover topic-relevant data source sites.
The second part concerns web information extraction and the application of the new algorithm. By analyzing the characteristics of result pages returned by online databases, the thesis finds that the tag trees corresponding to the individual data regions are very similar in structure. It adopts a new web page structure similarity algorithm to identify where the data regions are located. The algorithm represents the tags of a web page as a tree, defines a special kind of subtree, and reduces the comparison of whole trees to the comparison of these special subtrees. Experiments show that the algorithm effectively reflects the degree of structural similarity between web pages. Once the data regions have been located, the thesis uses web page structural features and keywords to identify the relevant records and extracts their information.
The final part concerns the design of the Deep Web data integration framework and the implementation of its main modules. The thesis designs a Deep Web information integration framework and, building on the data source discovery method of Chapter 3 and the Deep Web result page information extraction method of Chapter 4, implements the main modules of that framework.
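
The abstract says that page structure is compared by turning the tags into a tree and comparing "special" subtrees, but it does not define those subtrees. The Python sketch below is therefore only an illustration under an assumption of ours: each special subtree is taken to be a node together with its direct child tags, and two trees are scored by the Dice overlap of these local subtrees. The names `parse_tag_tree`, `local_subtrees`, and `structure_similarity` are hypothetical and are not taken from the thesis.

```python
from collections import Counter
from html.parser import HTMLParser

# Elements that never have a closing tag, so they are always leaves.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}


class Node:
    def __init__(self, tag):
        self.tag = tag
        self.children = []


class TagTreeBuilder(HTMLParser):
    """Builds a tree of tag names, ignoring text and attributes."""

    def __init__(self):
        super().__init__()
        self.root = Node("#root")
        self.stack = [self.root]

    def handle_starttag(self, tag, attrs):
        node = Node(tag)
        self.stack[-1].children.append(node)
        if tag not in VOID_TAGS:
            self.stack.append(node)

    def handle_endtag(self, tag):
        # Pop back to the nearest open element with this name, if any.
        for i in range(len(self.stack) - 1, 0, -1):
            if self.stack[i].tag == tag:
                del self.stack[i:]
                break


def parse_tag_tree(html):
    builder = TagTreeBuilder()
    builder.feed(html)
    builder.close()
    return builder.root


def local_subtrees(node):
    """Multiset of (tag, child tags) pairs, one per non-leaf node.

    ASSUMPTION: this depth-1 shape stands in for the thesis's
    undefined "special subtree".
    """
    bag = Counter()
    pending = [node]
    while pending:
        current = pending.pop()
        if current.children:
            bag[(current.tag, tuple(c.tag for c in current.children))] += 1
            pending.extend(current.children)
    return bag


def structure_similarity(a, b):
    """Dice coefficient over the two multisets of local subtrees (0.0 to 1.0)."""
    bag_a, bag_b = local_subtrees(a), local_subtrees(b)
    total = sum(bag_a.values()) + sum(bag_b.values())
    if total == 0:
        # Both trees are single leaves: compare tag names directly.
        return 1.0 if a.tag == b.tag else 0.0
    shared = sum((bag_a & bag_b).values())
    return 2.0 * shared / total


record_1 = parse_tag_tree("<li><a>Title A</a><span>2013</span></li>")
record_2 = parse_tag_tree("<li><a>Title B</a><span>2012</span></li>")
banner = parse_tag_tree("<div><img><p>Ad</p></div>")
print(structure_similarity(record_1, record_2))  # 1.0
print(structure_similarity(record_1, banner))    # 0.0
```

In this sketch, two result records with the same markup score 1.0 and an unrelated banner scores 0.0. A data region could then be approximated as a run of sibling subtrees whose pairwise scores exceed a chosen threshold, though the thesis may define both the subtrees and the threshold differently.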
[Degree-granting institution]: Southwest Jiaotong University
[Degree level]: Master's
[Year degree conferred]: 2013
[Classification number]: TP391.3