

Research on Policy-Based Web Information Extraction Technology

Published: 2019-05-18 10:43
【Abstract】: Research on information collection was already developing before the information age; since then, information resources have received unprecedented attention, and in some application domains their collection is especially important. The rapid development of the Internet and the explosive growth of online information resources have made information much easier to exploit. However, as network information resources become richer, the workload of collecting them grows accordingly, and the disorder and dispersion of online information create obstacles to collection. Information extraction makes it possible to gather this information, format it, and store it for convenient querying and use.

This thesis addresses the problem of web information extraction, taking web information acquisition and text information extraction technology as its main research objects. Based on an in-depth analysis of web search principles and information extraction technology, it discusses, designs, and implements a web information extraction software system. The main work is as follows:

1. Web search principles and information extraction technology are studied, and a method for extracting information from web pages is proposed. The method first uses web crawler technology to fetch pages from the Internet, then analyzes the page content and, following user-configured extraction policies based on information formats, obtains the information the user expects.

2. Web crawler technology is studied, and the working principles of URL de-duplication are discussed and analyzed. The representation of web pages, their transfer protocol (HTTP), and their authoring language (HTML) are examined, and mature regular-expression text processing is applied to analyze and extract information marked up in HTML. The operation of commercial search engines is also analyzed, and a method for invoking them is proposed.

3. A policy-based web information extraction software system is designed and implemented. The software builds its extraction policies on regular expressions and extracts the parts of a page that match a policy; it provides a policy-configuration interface so that policies can be set as needed; it implements a web crawler that starts crawling from a user-supplied starting URL; and it can invoke a search engine with user-supplied keywords, automatically fetching and parsing the results and then crawling and extracting from the result pages. Finally, functional and performance experiments verify whether the software meets the expected requirements, and the problems found are discussed together with improvement measures.
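As an illustration only, the following Python sketch shows how the pipeline described above might be wired together: a breadth-first crawler with set-based URL de-duplication that applies user-defined regular-expression extraction policies to each fetched page. The policy names, patterns, and example URL are hypothetical and are not taken from the thesis.

```python
# A minimal sketch (not the thesis's actual implementation) of policy-based
# extraction: fetch pages, de-duplicate URLs with a set, apply user-defined
# regular-expression "policies", and follow discovered links.
import re
import urllib.request
from collections import deque

# Hypothetical extraction policies: name -> compiled regular expression.
POLICIES = {
    "email": re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+"),
    "title": re.compile(r"<title>(.*?)</title>", re.IGNORECASE | re.DOTALL),
}

# Pattern used to discover outgoing links on a page.
HREF_RE = re.compile(r'href=["\'](https?://[^"\']+)["\']', re.IGNORECASE)


def crawl(start_url, max_pages=10):
    """Breadth-first crawl that records matches for every policy."""
    seen = set()                 # URL de-duplication
    queue = deque([start_url])
    results = []

    while queue and len(seen) < max_pages:
        url = queue.popleft()
        if url in seen:
            continue
        seen.add(url)

        try:
            with urllib.request.urlopen(url, timeout=10) as resp:
                html = resp.read().decode("utf-8", errors="replace")
        except OSError:
            continue             # skip unreachable pages

        # Apply every extraction policy to the page text.
        for name, pattern in POLICIES.items():
            for match in pattern.findall(html):
                results.append((url, name, match.strip()))

        # Enqueue outgoing links for further crawling.
        for link in HREF_RE.findall(html):
            if link not in seen:
                queue.append(link)

    return results


if __name__ == "__main__":
    for url, policy, value in crawl("https://example.com"):
        print(policy, value, "<-", url)
```

The sketch keeps the extraction policies as plain named regular expressions, mirroring the thesis's idea of configurable, format-based extraction rules; a real crawler would additionally need robots.txt handling, character-encoding detection, and rate limiting.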
【Degree-granting institution】: University of Electronic Science and Technology of China
【Degree level】: Master's
【Year conferred】: 2013
【CLC number】: TP391.1


Article No.: 2479928



Article link: http://sikaile.net/kejilunwen/sousuoyinqinglunwen/2479928.html

