
Research on Data Collection Technology in Weibo Public Opinion Systems

Published: 2018-04-18 22:10

Topics: Weibo data + simulated login; Source: Xiangtan University, master's thesis, 2014


[Abstract]: As the Internet has matured and the mobile Internet has grown rapidly, more and more information is published online, and the public has gradually accepted this way of sharing it. Online information reflects public sentiment to some extent, but inflammatory content can also incite Internet users, so online public opinion draws increasing attention in today's society. To foster a healthy online environment, the relevant government departments need to predict, detect, and guide online public opinion effectively. Within this field, public opinion on Weibo receives particular attention, because a growing number of public opinion events are first exposed on Weibo and then spread and discussed there until they become full-fledged events. The fact that governments at all levels, enterprises, and public institutions have opened Weibo accounts shows Weibo's standing on the network.

This thesis analyzes and studies several problems in data collection for Weibo public opinion systems. It proposes collecting web pages through simulated login, supplemented by a priority queue that directs collection toward the more influential posts on Weibo. The main work is as follows:

(1) Three commonly used collection methods are analyzed: Weibo push, the Weibo API, and web crawlers. The first two can hardly meet a public opinion system's requirements for the scale and timeliness of Weibo data, while the third does not easily collect useful information. The thesis therefore proposes simulating a browser login to Weibo and fetching page data, which makes it easy to obtain the data on any Weibo user's pages and avoids the collection-speed limits of the first two methods.

(2) Because Weibo has a huge number of users, many users are missed during collection. The thesis proposes building a Weibo user network to address this. Each Weibo user is abstracted as a node; the follower, following, repost, and comment relationships between users are abstracted as edges; and the quantified value of each relationship serves as the weight of that relationship on the edge. Adding nodes and edges builds up a huge Weibo user network through which new users are continually discovered, which in turn helps ensure the completeness of the data.

(3) To collect Weibo data efficiently, the thesis adopts a priority queue algorithm. Efficient collection here means that, faced with a large volume of data, collection proceeds in tiers: posts by highly influential users are collected first, followed by those of less influential users. To this end, the thesis designs a model for computing priority. Combining Sina Weibo's definition of influential users with various practical considerations, it selects five factors: follower count, following count, activity, propagation power, and timestamp. The priority queue is built with influence as the main factor, so that the more influential a user is, the more frequently that user's data is collected, while a computed time interval ensures that data from inactive users is still collected. Once a page has been fetched, because Weibo pages have a uniform structure, the thesis designs corresponding denoising and parsing methods that locate the useful information directly through fixed feature values, which makes parsing efficient. A simple analysis of the collected data yields some simple but interesting findings.

The experimental results show that the method is highly general, requires no manual intervention at all, and obtains high-quality information quickly.
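
The influence-driven priority queue described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the thesis's actual model: the abstract names the factors but gives no formula, so the `UserStats` fields, the weights in `influence`, the `revisit_interval` rule, and the `CrawlScheduler` class are all assumptions made for this example. The queue is keyed on the timestamp at which each user is next due, so influential users come up more often while low-influence users are still reached once their longer interval elapses.

```python
import heapq
import math
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass
class UserStats:
    followers: int       # follower count
    following: int       # following count
    activity: float      # assumed measure, e.g. posts per day
    propagation: float   # assumed measure, e.g. mean reposts + comments per post


def influence(stats: UserStats) -> float:
    """Hypothetical influence score; the weights are illustrative only."""
    return (0.5 * math.log1p(stats.followers)
            + 0.1 * math.log1p(stats.following)
            + 0.2 * math.log1p(stats.activity)
            + 0.2 * math.log1p(stats.propagation))


def revisit_interval(score: float, base: float = 3600.0) -> float:
    """Seconds between visits: shorter for higher influence, at most `base`."""
    return base / (1.0 + score)


class CrawlScheduler:
    """Min-heap keyed on the timestamp at which each user is next due."""

    def __init__(self) -> None:
        self._heap: List[Tuple[float, str]] = []
        self._stats: Dict[str, UserStats] = {}

    def add_user(self, user_id: str, stats: UserStats, now: float = 0.0) -> None:
        if user_id in self._stats:
            return  # already scheduled
        self._stats[user_id] = stats
        due = now + revisit_interval(influence(stats))
        heapq.heappush(self._heap, (due, user_id))

    def next_user(self) -> Tuple[float, str]:
        """Pop the user due soonest and reschedule that user one interval later."""
        due, user_id = heapq.heappop(self._heap)
        interval = revisit_interval(influence(self._stats[user_id]))
        heapq.heappush(self._heap, (due + interval, user_id))
        return due, user_id


if __name__ == "__main__":
    scheduler = CrawlScheduler()
    scheduler.add_user("celebrity", UserStats(1_000_000, 100, 5.0, 200.0))
    scheduler.add_user("quiet_user", UserStats(10, 50, 0.1, 0.0))
    for _ in range(10):
        due, user_id = scheduler.next_user()
        print(f"t={due:8.1f}s  crawl {user_id}")
```

Keying the heap on the next due time rather than on the raw influence score is one way to combine the influence and timestamp factors: a pure influence ordering would starve low-influence users, whereas a due-time ordering only visits them less often.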
[Degree-granting institution]: Xiangtan University
[Degree level]: Master's
[Year degree awarded]: 2014
[Classification number]: TP393.092


Article ID: 1770288


Article link: http://sikaile.net/guanlilunwen/ydhl/1770288.html

