Design and Implementation of a Web Information Integration Platform

Published: 2018-09-17 08:11

[Abstract]: With the rapid development of Internet technology and the rapid growth of online information resources, the Web has become an important source of data. Faced with this volume of resources, search engines provide an important means of retrieval. Traditional search engines, however, retrieve by keyword, which has limitations: results contain many irrelevant pages, and reposting produces many pages with near-identical content. It is therefore necessary to integrate Web information resources, extracting the specific information users care about from massive online resources and presenting it in a reorganized, unified form. The main work of this thesis is to integrate Web resource information so that Internet users can find the information they need quickly and accurately. First, the thesis studies the theory and technology of Web information integration, including two approaches to information integration, three component modules, and four key technologies, and surveys the knowledge involved in each module during design, including ontology concepts, web crawlers, information extraction, and the Resource Description Framework (RDF). Second, the thesis designs and implements an ontology-guided prototype of a Web information integration platform. It defines the overall system architecture, which consists of four modules: data acquisition, information extraction, storage model, and front-end presentation. It proposes a series of designs: a focused web crawler based on an ontology and search engines, an ontology-based page analysis and filtering algorithm, information extraction rules based on the ontology and DOM tree paths, an RDF-based data storage model, and B/S (browser/server) front-end presentation of results. Through the platform, users specify the domain to be integrated; the system retrieves and integrates relevant domain resources from the Internet and presents the results in a unified, structured, and visual form. The system does not require a separate wrapper for each data source; instead it operates across the whole Internet and can integrate many kinds of heterogeneous resources. Finally, the thesis reports comprehensive tests of the platform, including crawler efficiency and crawl volume tests and data extraction rate tests. The tests show that the system can integrate some heterogeneous data sources on the Internet, but that it still has some shortcomings.
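The pipeline named in the abstract (ontology-based page filtering, extraction rules over DOM tree paths, RDF-style storage) can be illustrated with a minimal Python sketch. It is not the thesis's algorithm, which this page does not give. The ontology terms and weights, the DOM paths, the `ex:` property names, the 0.5 threshold, and the sample page are all hypothetical, and the page is assumed to be well-formed XHTML so that the standard library can parse it.

```python
import xml.etree.ElementTree as ET

# Hypothetical mini-ontology for a job-posting domain: term -> weight.
ONTOLOGY_TERMS = {"job": 1.0, "salary": 2.0, "employer": 2.0}

# Hypothetical extraction rules: ontology property -> DOM path
# (written in the XPath subset that ElementTree supports).
EXTRACTION_RULES = {
    "ex:title": "./body/div[@class='post']/h1",
    "ex:employer": "./body/div[@class='post']/span[@class='employer']",
    "ex:salary": "./body/div[@class='post']/span[@class='salary']",
}


def relevance(text, terms=ONTOLOGY_TERMS):
    """Weighted share of ontology terms that occur as words in the text."""
    words = set(text.lower().split())
    hit = sum(weight for term, weight in terms.items() if term in words)
    return hit / sum(terms.values())


def extract_triples(page_url, root, rules=EXTRACTION_RULES):
    """Apply each DOM-path rule and emit (subject, predicate, object) triples."""
    triples = []
    for prop, path in rules.items():
        node = root.find(path)
        if node is not None and node.text:
            triples.append((page_url, prop, node.text.strip()))
    return triples


def integrate(page_url, xhtml, threshold=0.5):
    """Filter a page by ontology relevance, then extract RDF-style triples."""
    root = ET.fromstring(xhtml)  # assumes well-formed XHTML
    page_text = "".join(root.itertext())
    if relevance(page_text) < threshold:
        return []  # page judged off-topic and skipped
    return extract_triples(page_url, root)


URL = "http://example.org/jobs/1"  # placeholder URL
PAGE = """<html><body><div class="post">
<h1>Python developer job</h1>
<span class="employer">Example Corp</span>
<span class="salary">Salary 15k per month</span>
</div></body></html>"""

for triple in integrate(URL, PAGE):
    print(triple)
```

Real Web pages are rarely well-formed, so a practical system would need a tolerant HTML parser and a proper RDF store; the tuples here only stand in for subject-predicate-object statements.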
[Degree-granting institution]: University of Electronic Science and Technology of China
[Degree level]: Master's
[Year degree granted]: 2012
[Classification number]: TP391.1

Article ID: 2245262


Article link: http://sikaile.net/kejilunwen/sousuoyinqinglunwen/2245262.html
