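Research on Website Information Collection Technology Based on Web Crawlers
Topics: information collection + information extraction. Source: Dalian Maritime University, master's thesis, 2014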
[Abstract]: With its rapid spread and development, the Internet has gradually become part of every aspect of daily life. The Web is an important channel through which people communicate with one another and obtain information from the outside world. As a valuable information source, the Web is intuitive and convenient to use and rich in how it can express content, so it can provide users with information in many forms, such as text, audio, and video. Over time, both the volume of information on the Internet and the size of its user base have grown rapidly. User needs are becoming increasingly diverse, and how to quickly provide users with the information they are interested in is currently a major challenge.
Self-media has begun to emerge on the Internet and keeps growing in scale. It includes outstanding representatives from many industries and is therefore attracting more and more attention. This thesis therefore proposes using technical means to collect the article content of one self-media platform, Baidu Baijia (百度百家), and then reorganizing the collected content to facilitate its secondary use. To this end, the thesis presents the design and implementation of an integrated scheme for website information collection based on a web crawler.
The integrated scheme consists of three parts: information collection, information extraction, and information retrieval. Information collection is implemented as an extension of the Heritrix crawler (combined with HtmlUnit) and is responsible for collecting web pages from the target site. Information extraction is implemented with Jsoup and DOM techniques; it extracts article information from the web pages and saves it to a database, converting unstructured information into structured information. Information retrieval is implemented with the Lucene indexing tool and the SSH2 architecture and is responsible for presenting the collected article information to users so that it is easy to browse.
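The extraction and retrieval stages described in the abstract can be illustrated with a minimal sketch. This is not the thesis's implementation: the thesis builds on Java tools (Heritrix, HtmlUnit, Jsoup, Lucene), whereas the sketch below uses only the Python standard library to stay self-contained. The page markup (`<h1>` for the title, `<p>` for the body), the example URLs, and every function and class name are assumptions made for illustration, and a plain word-to-URL dictionary stands in for a real full-text index.

```python
from collections import defaultdict
from html.parser import HTMLParser


class ArticleExtractor(HTMLParser):
    """Captures the text of <h1> (title) and <p> (body) elements.

    The tag choice is an assumption; a real site needs its own selectors.
    """

    def __init__(self):
        super().__init__()
        self.title = ""
        self.paragraphs = []
        self._tag = None  # tag whose text is currently being captured

    def handle_starttag(self, tag, attrs):
        if tag in ("h1", "p"):
            self._tag = tag
            if tag == "p":
                self.paragraphs.append("")

    def handle_endtag(self, tag):
        if tag == self._tag:
            self._tag = None

    def handle_data(self, data):
        if self._tag == "h1":
            self.title += data
        elif self._tag == "p":
            self.paragraphs[-1] += data


def extract_article(url, html):
    """Extraction stage: turn unstructured HTML into a structured record."""
    parser = ArticleExtractor()
    parser.feed(html)
    parser.close()
    body = " ".join(p.strip() for p in parser.paragraphs)
    return {"url": url, "title": parser.title.strip(), "body": body}


def build_index(records):
    """Retrieval stage: map each lowercase word to the URLs that contain it."""
    index = defaultdict(set)
    for rec in records:
        for word in (rec["title"] + " " + rec["body"]).lower().split():
            index[word].add(rec["url"])
    return index


def search(index, query):
    """Return the URLs whose text contains every word of the query."""
    words = query.lower().split()
    if not words:
        return set()
    return set.intersection(*(index.get(w, set()) for w in words))


# Stand-in for the collection stage: already-fetched pages keyed by
# hypothetical URLs (a real crawler would supply these).
pages = {
    "http://example.com/a": "<html><body><h1>Crawler design</h1>"
                            "<p>Heritrix fetches pages.</p></body></html>",
    "http://example.com/b": "<html><body><h1>Index design</h1>"
                            "<p>Lucene indexes articles.</p></body></html>",
}
records = [extract_article(url, html) for url, html in pages.items()]
index = build_index(records)
print(search(index, "crawler design"))
```

In a full system the structured records would be written to a database before indexing, as the abstract describes; the sketch keeps them in memory to show only the data flow between the stages.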
[Degree-granting institution]: Dalian Maritime University
[Degree level]: Master's
[Year degree granted]: 2014
[Classification number]: TP393.092