Research and Application of an Online Weibo Collection Method Supporting Cloud Computing

Published: 2018-01-01 06:18

Keywords: Research and Application of an Online Weibo Collection Method Supporting Cloud Computing. Source: Yanshan University, 2014 master's thesis. Document type: degree thesis.

[Abstract]: The arrival of the Web 2.0 era has not only changed how we use the traditional Internet but also set off a new transformation of the Web. Sina Weibo, a typical representative of social networks and the mobile Internet, has more than 500 million registered users; its huge user base and the massive data sets generated every day have given shape to an era of genuinely two-way communication and new media. Focusing on the online collection of Weibo data, this thesis analyzes the limitations of traditional web crawlers and the strengths and weaknesses of existing research and designs in China and abroad, and then proposes a Weibo crawler design that supports cloud computing extension. The design is based on the technical principles of HTTP communication packet analysis, distributed computing, and the Hadoop Distributed File System (HDFS). The specific research problems are as follows. First, the thesis analyzes the state of research on online data collection from Web 2.0 applications and its limitations, and proposes logging in to Weibo by simulating a browser, which solves the problem of information being uncollectable because of login requirements. It also analyzes the shortcomings of the existing approach of obtaining information by calling the Weibo API under OAuth authorization, and proposes a web crawler method that accesses Weibo by simulating a browser to collect Weibo data online. Second, given the enormous volume of data Weibo produces, and after evaluating the risk of refactoring the traditional crawling and parsing functions of the Nutch search engine framework, the thesis proposes a distributed Weibo crawler architecture based on the principles of distributed computing and describes in detail the core business logic among its modules. Third, the functionality of the distributed Weibo crawler is extended with two working modes: normal mode and cloud computing extended mode. In normal mode, web information extraction relies on regular expressions and the XML retrieval interface provided by the BeautifulSoup framework. Cloud computing extended mode adds support for HDFS: it produces the collected data as key-value pairs and outputs copies of the resources to HDFS, in effect providing file input for the MapReduce computing framework. Finally, the functional modules described above are implemented and verified.
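The two working modes described in the abstract can be illustrated with a small sketch. It is not the thesis's implementation: the page markup, class names, field layout, and function names below are all hypothetical, and the standard-library `html.parser` module stands in for BeautifulSoup so that the example runs without extra packages. The first half mimics normal-mode extraction (a markup parser plus a regular expression); the second half turns each record into a tab-separated key-value line, the kind of text record that could be written to a file on HDFS and read by a MapReduce job. Writing to HDFS itself is left out.

```python
import re
from html.parser import HTMLParser

# Hypothetical markup: these class names and attributes are invented for
# illustration and do not reflect Weibo's real page structure.
SAMPLE_HTML = """
<div class="post" data-id="1001"><span class="author">alice</span>
<p class="text">Hello #cloud# world</p></div>
<div class="post" data-id="1002"><span class="author">bob</span>
<p class="text">Crawling with #hadoop# and #cloud#</p></div>
"""

# Topics are assumed to be written as #topic# inside the post text.
TOPIC_RE = re.compile(r"#([^#\s]+)#")


class PostParser(HTMLParser):
    """Collects one dict per <div class="post"> element."""

    def __init__(self):
        super().__init__()
        self.posts = []
        self._field = None  # name of the field whose text is being read

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        cls = attrs.get("class")
        if tag == "div" and cls == "post":
            self.posts.append({"id": attrs.get("data-id"), "author": "", "text": ""})
        elif cls in ("author", "text") and self.posts:
            self._field = cls

    def handle_endtag(self, tag):
        if tag in ("span", "p"):
            self._field = None

    def handle_data(self, data):
        if self._field and self.posts:
            self.posts[-1][self._field] += data


def extract_posts(html):
    """Normal-mode style extraction: parse the markup into post records."""
    parser = PostParser()
    parser.feed(html)
    parser.close()
    return parser.posts


def to_key_value_lines(posts):
    """Extended-mode style output: one 'key<TAB>value' line per post."""
    lines = []
    for post in posts:
        topics = ",".join(TOPIC_RE.findall(post["text"]))
        # Tabs and newlines would break the one-record-per-line format.
        text = post["text"].replace("\t", " ").replace("\n", " ")
        lines.append(f"{post['id']}\t{post['author']}|{topics}|{text}")
    return lines


if __name__ == "__main__":
    for line in to_key_value_lines(extract_posts(SAMPLE_HTML)):
        print(line)
```

Keeping the post ID as the key and everything else in the value is only one possible layout; a real job would choose the key according to how the downstream MapReduce step groups records.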
[Degree-granting institution]: Yanshan University
[Degree level]: Master's
[Year degree granted]: 2014
[Classification number]: TP393.092

Article ID: 1363294

Article link: http://sikaile.net/kejilunwen/sousuoyinqinglunwen/1363294.html
