Research on a Topic-Based Weibo Web Crawler
Published: 2018-04-29 01:33
Topics: web page analysis + Weibo crawler; source: Wuhan University of Technology, master's thesis, 2014
[Abstract]: Following the popularity of Twitter in the United States, microblogging (Weibo) sites have sprung up across China, and Weibo has grown steadily more popular among Chinese internet users. Buzzwords that originate on Weibo spread quickly across the web, a "Weibo effect" is taking shape, and microblogging has become one of the main online activities of Chinese internet users. Because of this effect, Weibo topics propagate rapidly among users, and acquiring and analyzing Weibo information has become an important research subject. To make Weibo data easier to obtain, the major Weibo sites provide APIs for fetching posts, but these APIs limit the number of calls, so they cannot meet the need to collect Weibo data at scale, and the data they return is often noisy. To address these problems, this thesis introduces web page analysis and topic relevance analysis, and studies and designs a topic-based Weibo web crawler.

The main work of the thesis is to study web page analysis techniques and choose a method for extracting information from Weibo pages based on their characteristics; to describe in detail the reasoning behind and the design of a breadth-first search strategy based on "pruning", with emphasis on URL deduplication and on handling a dynamically changing URL set; to study short-text topic extraction and multi-keyword matching and settle on a design for Weibo topic relevance analysis; and finally to design and implement a prototype topic-based Weibo web crawler that fetches and stores Weibo data in real time. The core problem is to design a pruning-based breadth-first search strategy suited to the characteristics of Weibo data and apply it to a Weibo crawler, while using page analysis so that the crawler is not bound by the API limits of the Weibo platform, allowing users to collect topic-relevant Weibo data as accurately as possible.

Results for the prototype system were obtained through repeated experiments and compared with the crawl results of an API-based Weibo crawler and a web-page-based Weibo crawler. The comparison leads to the conclusion that the proposed crawling strategy can collect topic-relevant Weibo data: although efficiency is somewhat lower, the collected data shows better topical relevance. These results indicate that the approach studied in the thesis is feasible.
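The abstract names three ingredients without giving their details: breadth-first search with pruning, URL deduplication, and multi-keyword matching for topic relevance. The following is only a minimal sketch of how they might fit together, not the thesis's actual design. It assumes that "pruning" means not expanding the links of a page judged off-topic, and that relevance is a simple substring count. The names `is_on_topic`, `pruned_bfs`, `fetch`, and the in-memory `PAGES` graph are all hypothetical.

```python
from collections import deque
from typing import Callable, Dict, Iterable, List, Set, Tuple

Page = Tuple[str, List[str]]  # (page text, outgoing links)


def is_on_topic(text: str, keywords: Iterable[str], min_hits: int = 1) -> bool:
    """Naive multi-keyword match: count distinct keywords found in the text."""
    lowered = text.lower()
    hits = sum(1 for kw in keywords if kw.lower() in lowered)
    return hits >= min_hits


def pruned_bfs(
    seeds: Iterable[str],
    fetch: Callable[[str], Page],
    keywords: Iterable[str],
    max_pages: int = 100,
) -> List[str]:
    """Breadth-first crawl that does not expand off-topic pages (assumed pruning rule)."""
    keywords = list(keywords)
    frontier = deque(dict.fromkeys(seeds))  # drop duplicate seeds, keep order
    seen: Set[str] = set(frontier)          # URL deduplication
    relevant: List[str] = []
    visited = 0
    while frontier and visited < max_pages:
        url = frontier.popleft()
        visited += 1
        text, links = fetch(url)
        if not is_on_topic(text, keywords):
            continue  # prune: links of an off-topic page are never queued
        relevant.append(url)
        for link in links:
            if link not in seen:  # the URL set grows as the crawl proceeds
                seen.add(link)
                frontier.append(link)
    return relevant


# Hypothetical in-memory "site" standing in for real Weibo pages.
PAGES: Dict[str, Page] = {
    "u/a": ("World Cup final tonight", ["u/b", "u/c"]),
    "u/b": ("Cooking recipes", ["u/d"]),
    "u/c": ("world cup highlights", ["u/a", "u/e"]),
    "u/d": ("World Cup tickets", []),
    "u/e": ("Weather report", []),
}

result = pruned_bfs(["u/a"], lambda u: PAGES[u], ["world cup"])
print(result)
```

In this toy graph, `u/d` is on topic but reachable only through the off-topic page `u/b`, so it is never visited. That loss of coverage is the price of pruning under this assumed rule; a real crawler would also need URL normalization before deduplication and a real `fetch` that parses Weibo pages.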
[Degree-granting institution]: Wuhan University of Technology
[Degree level]: Master's
[Year degree granted]: 2014
[Classification number]: TP393.092
Article ID: 1817814
Article link: http://sikaile.net/guanlilunwen/ydhl/1817814.html