Research on Policy-Based Web Information Extraction Technology
Published: 2019-05-18 10:43
【Abstract】: Research on information collection predates the information age, but since its arrival, information resources have received unprecedented attention, and in some application domains their collection is especially important. The rapid development of the Internet and the explosive growth of online information resources provide convenient conditions for exploiting information. However, as online information resources grow ever richer, the workload of collecting them increases daily, and their disorder and dispersion pose obstacles to collection. Information extraction makes it possible to gather such information, format it, and store it for convenient querying.

Addressing the problem of web information extraction, this thesis takes web information acquisition and text information extraction as its main research objects. On the basis of an in-depth analysis of web search principles and information extraction techniques, it discusses, designs, and implements a piece of web information extraction software. The main contributions are:

1. Web search principles and information extraction techniques are studied, and a method of web information extraction targeting web page content is proposed. The method first retrieves pages from the Internet using the web crawler techniques of web search, then analyzes the page content and, according to a user-configured extraction policy based on information format, obtains the information the user expects.

2. Web crawler technology is studied, and the working principle of the key URL de-duplication technique is analyzed. The presentation of web pages, their transfer protocol (the Hypertext Transfer Protocol), and their authoring language (the Hypertext Markup Language) are examined; combined with mature regular-expression text processing, analysis and extraction of information marked up in HTML are implemented. The operation of commercial search engines is analyzed, and a method of invoking search engines is proposed.

3. A piece of policy-based web information extraction software is designed and implemented. The software builds its extraction policies on regular expressions and extracts the portions of page content that match a policy; it provides a policy-configuration interface through which policies can be set as needed; it implements web crawler functionality, starting page crawling from a user-supplied start URL; and it can invoke search engines, querying them with user-supplied keywords, automatically retrieving and analyzing the results, and then starting page crawling and information extraction from those results. Finally, functional and performance experiments verify whether the software meets the expected requirements; the problems found are discussed and improvements are proposed.
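The regex-based extraction policy described above can be illustrated with a short sketch. This is not the thesis's implementation; the `policies` mapping and field names are hypothetical, standing in for the user-configured policy settings the abstract describes: each policy pairs a field name with a regular expression, and extraction applies every pattern to the fetched page text.

```python
import re

# Hypothetical policy set: field name -> compiled regular expression.
# In the thesis's software these would come from the policy-setting UI.
policies = {
    "title": re.compile(r"<title>(.*?)</title>", re.IGNORECASE | re.DOTALL),
    "link":  re.compile(r'href="(https?://[^"]+)"'),
}

def extract(html: str, policies: dict) -> dict:
    """Return, for each policy, every substring of the page that matches it."""
    return {name: pattern.findall(html) for name, pattern in policies.items()}

page = ('<html><head><title>Demo</title></head>'
        '<body><a href="http://example.com/a">a</a></body></html>')
result = extract(page, policies)
print(result["title"])  # ['Demo']
print(result["link"])   # ['http://example.com/a']
```

Because a policy is just data (a name and a pattern), adding a new kind of information to extract requires no code change, which matches the abstract's claim that policies "can be set as needed."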
【Degree-granting institution】: University of Electronic Science and Technology of China
【Degree level】: Master's
【Year conferred】: 2013
【CLC number】: TP391.1