
Research on Website Information Collection Technology Based on Web Crawlers

Published: 2018-05-19 07:55

Topics: information collection + information extraction; Source: Dalian Maritime University, master's thesis, 2014

[Abstract]: As the Internet has spread and developed rapidly, it has become part of nearly every aspect of daily life. The Web is an important channel through which people communicate with each other and obtain outside information. As a valuable information source, the Web is intuitive and convenient to use and rich in expressive power, so it can offer users information in many forms, such as text, audio, and video. Over time, both the volume of information on the Internet and the size of its user base have grown quickly. User needs are becoming increasingly diverse, and delivering the information a user is interested in quickly remains a major challenge. Self-media (we-media) has been emerging on the Internet and continues to grow in scale; it includes outstanding representatives from many industries and has therefore been attracting more and more attention. This thesis accordingly proposes to collect the article content of the self-media platform Baidu Baijia by technical means, and then to reorganize the collected content so that it can be reused. To this end, the thesis presents the design and implementation of an integrated scheme for website information collection based on web crawlers. The scheme consists of three parts: information collection, information extraction, and information retrieval. Information collection is implemented as an extension of the Heritrix crawler (combined with HtmlUnit) and is responsible for collecting web pages from the target site. Information extraction is implemented with Jsoup and DOM techniques and is responsible for extracting article information from the web pages and saving it to a database, converting unstructured information into structured information. Information retrieval is implemented with the Lucene indexing tool and the SSH2 architecture and is responsible for presenting the collected article information to users so that it is easy to browse.
[Degree-granting institution]: Dalian Maritime University
[Degree level]: Master's
[Year degree granted]: 2014
[Classification number]: TP393.092
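
The three-part flow described in the abstract (collect pages, extract article fields, index them for retrieval) can be illustrated with a minimal sketch. This is not the thesis's implementation and does not use Heritrix, HtmlUnit, Jsoup, Lucene, or SSH2; it swaps in the Python standard library (`html.parser` for extraction, a plain in-memory inverted index for retrieval). The names `ArticleExtractor`, `extract_article`, `build_index`, `search`, and the sample `pages` dictionary are all hypothetical, and the crawling step is assumed to have already produced the HTML strings.

```python
from collections import defaultdict
from html.parser import HTMLParser
import re


class ArticleExtractor(HTMLParser):
    """Collect the <title> text and the text of every <p> element."""

    def __init__(self):
        super().__init__()
        self._stack = []       # open <title>/<p> tags we are currently inside
        self.title = ""
        self.paragraphs = []

    def handle_starttag(self, tag, attrs):
        if tag in ("title", "p"):
            self._stack.append(tag)
            if tag == "p":
                self.paragraphs.append("")

    def handle_endtag(self, tag):
        if self._stack and self._stack[-1] == tag:
            self._stack.pop()

    def handle_data(self, data):
        if not self._stack:
            return             # text outside <title>/<p> is ignored
        if self._stack[-1] == "title":
            self.title += data
        else:
            self.paragraphs[-1] += data


def extract_article(url, html):
    """Extraction step: turn unstructured HTML into a structured record."""
    parser = ArticleExtractor()
    parser.feed(html)
    parser.close()
    body = " ".join(p.strip() for p in parser.paragraphs if p.strip())
    return {"url": url, "title": parser.title.strip(), "body": body}


def build_index(records):
    """Retrieval step, part 1: map each lowercase token to record ids."""
    index = defaultdict(set)
    for doc_id, rec in enumerate(records):
        text = (rec["title"] + " " + rec["body"]).lower()
        for token in re.findall(r"\w+", text):
            index[token].add(doc_id)
    return index


def search(index, records, query):
    """Retrieval step, part 2: titles of records containing every query token."""
    tokens = re.findall(r"\w+", query.lower())
    if not tokens:
        return []
    hits = set.intersection(*(index.get(t, set()) for t in tokens))
    return [records[i]["title"] for i in sorted(hits)]


# Hypothetical crawler output: URL -> raw HTML (stands in for the collection step).
pages = {
    "http://example.invalid/a1": "<html><head><title>Crawler basics</title></head>"
                                 "<body><p>A crawler fetches pages.</p>"
                                 "<p>Links are followed.</p></body></html>",
    "http://example.invalid/a2": "<html><head><title>Index basics</title></head>"
                                 "<body><p>An index maps terms to pages.</p></body></html>",
}

records = [extract_article(url, html) for url, html in pages.items()]
index = build_index(records)
print(search(index, records, "pages"))   # → ['Crawler basics', 'Index basics']
```

In a real system the record list would be written to a database and the index would be handled by a dedicated search library; the sketch only shows how the three stages hand data to one another.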
