Research on an RSS-Based Focused Web Crawler for University Website Groups

Published: 2018-11-13 12:32

[Abstract]: As the web grows rapidly and the number of pages keeps increasing, people often have to browse many pages to find the information they need. This wastes time and effort and still does not guarantee that they obtain the latest and most complete information. Publishers of online information, for their part, want more users to read their content as soon as it appears. Much research has arisen to address this need, for example search engines backed by web crawlers and RSS information push. Each approach has its limitations, however. Suppose we want the latest notices from all websites of a university, organized by category, such as all the latest research-related notices of that university. Searching with a search engine gives unsatisfactory results. RSS can push the latest information by category, but only for sites that provide an RSS feed. It is of no help for targets such as university website groups that were built without RSS push functionality.

This thesis therefore studies an RSS-based focused web crawler to solve these problems and applies it to a university website group, with good results. The principle is to use a focused web crawler to fetch, analyze, and process the data of the target website group and then provide RSS push. In this way, users can subscribe by category, through an RSS reader, to the latest information even from sites that provide no RSS feed. This spares them from browsing large numbers of pages and from missing information through oversight.

The main contributions of this thesis are as follows:
(1) An RSS-based focused web crawler is proposed that lets users subscribe to and read, in an RSS reader, the latest information from websites that do not provide an RSS feed. Useless advertisements and other junk information are filtered out, sparing users the trouble of searching for information.
(2) Crawled web page text is classified using the TF-IDF algorithm, and the step that uses TF-IDF to extract the feature vector of each category is improved to account for the characteristics of web pages, so that the extracted feature vectors represent their categories better and the classification results are more accurate.
(3) Incremental crawling is improved. Building on the traditional incremental crawling algorithm, a new algorithm for computing the predicted update time is proposed, which brings the predicted time closer to the actual update time, reduces system overhead, and improves efficiency.
(4) The RSS-based focused web crawler is applied to a university website group, and the PageRank algorithm is improved according to the characteristics of university website groups to raise the crawler's recall.
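The pipeline outlined in the abstract (classify crawled pages, then republish them as an RSS feed) can be sketched roughly as below. This is only an illustration under stated assumptions, not the thesis's method: it uses plain textbook TF-IDF with cosine similarity rather than the improved variant the thesis proposes (which the abstract does not specify), it skips the crawling step entirely, and the training snippets, page records, `example.edu` URLs, and all function names are hypothetical.

```python
import math
import xml.etree.ElementTree as ET
from collections import Counter


def tokenize(text):
    # Naive whitespace tokenizer; Chinese pages would need a real word segmenter.
    return text.lower().split()


def build_idf(documents):
    # Smoothed inverse document frequency: log((1 + N) / (1 + df)) + 1.
    n = len(documents)
    df = Counter()
    for doc in documents:
        df.update(set(tokenize(doc)))
    return {term: math.log((1 + n) / (1 + count)) + 1 for term, count in df.items()}


def tfidf_vector(text, idf):
    # Term frequency times IDF; terms unseen during training are ignored.
    counts = Counter(tokenize(text))
    total = sum(counts.values()) or 1
    return {t: (c / total) * idf[t] for t, c in counts.items() if t in idf}


def cosine(a, b):
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    na = math.sqrt(sum(w * w for w in a.values()))
    nb = math.sqrt(sum(w * w for w in b.values()))
    return dot / (na * nb) if na and nb else 0.0


def classify(text, category_vectors, idf):
    # Pick the category whose feature vector is most similar to the page.
    vec = tfidf_vector(text, idf)
    return max(category_vectors, key=lambda c: cosine(vec, category_vectors[c]))


def build_rss(channel_title, channel_link, items):
    # Minimal RSS 2.0 document: one channel with one <item> per page.
    rss = ET.Element("rss", version="2.0")
    channel = ET.SubElement(rss, "channel")
    ET.SubElement(channel, "title").text = channel_title
    ET.SubElement(channel, "link").text = channel_link
    ET.SubElement(channel, "description").text = "Generated feed for " + channel_title
    for item in items:
        node = ET.SubElement(channel, "item")
        ET.SubElement(node, "title").text = item["title"]
        ET.SubElement(node, "link").text = item["link"]
        ET.SubElement(node, "category").text = item["category"]
    return ET.tostring(rss, encoding="unicode")


# Hypothetical labeled snippets standing in for a training corpus.
TRAINING = {
    "research": ["research grant project funding application",
                 "science research project award"],
    "teaching": ["course exam schedule teaching",
                 "student course registration exam"],
}
IDF = build_idf([doc for docs in TRAINING.values() for doc in docs])
CATEGORY_VECTORS = {c: tfidf_vector(" ".join(docs), IDF) for c, docs in TRAINING.items()}

# Hypothetical pages, as if already fetched by the crawler.
pages = [
    {"title": "Call for research project applications",
     "link": "http://example.edu/notice/1",
     "text": "research project funding application deadline"},
    {"title": "Final exam schedule",
     "link": "http://example.edu/notice/2",
     "text": "exam schedule for all course sections"},
]
for page in pages:
    page["category"] = classify(page["text"], CATEGORY_VECTORS, IDF)

feed = build_rss("University notices", "http://example.edu/", pages)
print(feed)
```

An RSS reader pointed at the generated document could then filter on each item's `category` element, which is one plausible way to offer the per-category subscription the abstract describes.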
[Degree-granting institution]: Nanchang University
[Degree level]: Master's
[Year degree granted]: 2012
[Classification number]: TP393.092



Article ID: 2329116


Article link: http://sikaile.net/kejilunwen/sousuoyinqinglunwen/2329116.html
