Research on an RSS-Based Focused Web Crawler for University Website Groups
[Abstract]: As the web grows and the number of pages increases, people often have to read many pages to find the information they need. This wastes time and effort, and it still does not guarantee that the information found is the latest or the most complete. Publishers, for their part, want more users to see their information as soon as it is released. Much research addresses this need, including search engines backed by web crawlers and RSS information push, but each approach has its limitations. Suppose, for example, that we want the latest notices from all websites of a university, grouped by category, such as all the latest research-related notices of that university. A search engine gives unsatisfactory results for this task. RSS can push the latest information by category, but only for sites that offer an RSS feed, so it does not help with targets such as university website groups that were built early without RSS support.

This thesis therefore studies an RSS-based focused web crawler to solve these problems and applies it to a university website group, with good results. The principle is to use a focused web crawler to capture, analyze, and process the data of the target website group, and then to provide RSS push for it. Users can then subscribe to the latest information through an RSS reader even when the sites themselves provide no RSS feed, which avoids both the trouble of browsing many pages and the risk of missing information.

The main contributions of this thesis are as follows:
(1) A new RSS-based focused web crawler is proposed. It lets users subscribe to and read, in an RSS reader, the latest information from websites that do not provide an RSS feed, and it filters out advertisements and other spam so that users do not have to search for the information themselves.
(2) Text is classified on the basis of the TF-IDF algorithm. The feature vectors of the different categories are extracted with TF-IDF, which is improved according to the characteristics of web pages, so that the extracted feature vectors represent their categories better and the classification results are more accurate.
(3) The incremental crawling of the web crawler is improved. Building on the traditional incremental crawling algorithm, a new algorithm for calculating the predicted update time is proposed. It brings the predicted time closer to the actual update time, which reduces system overhead and improves efficiency.
(4) The RSS-based focused web crawler is applied to a university website group, and the PageRank algorithm is improved according to the characteristics of the group in order to raise the crawler's recall.
[Degree-granting institution]: Nanchang University
[Degree level]: Master's
[Year degree granted]: 2012
[Classification number]: TP393.092