Design and Implementation of a Big-Data-Based Hot-Topic Public Opinion Discovery and Analysis System
Topics: big data + Hadoop; Source: Harbin Institute of Technology, master's thesis, 2017
[Abstract]: Real-world news production faces many bottlenecks that constrain content output: short-lived hot topics go uncaptured, editorial staff is limited, related material is hard to gather, and published reports lack a sound feedback mechanism. Media organizations need a tool that finds hot topics promptly, supplies supporting material, and tracks those topics over time. The Internet has become a distribution hub for ideological and cultural information and an amplifier of public opinion, which makes public opinion monitoring very important for enterprises, organizations, and institutions.

The hot-topic public opinion discovery and analysis system uses a Hadoop computing platform to analyze big data. The computing platform mainly performs hot-topic mining and public opinion analysis. Hot-topic mining mines news data from a given period to find hot topics. Public opinion analysis takes the topics already mined, associates comment data and social data with them, and analyzes opinion through sentiment analysis, viewpoint computation, and user profiling. All data is stored on a Hadoop storage platform; news data is indexed, and a retrieval system provides material search. The whole system is presented as a web application that offers hot-topic discovery and lead management for media writing, and public opinion analysis and alerting for enterprises, organizations, and institutions.

The system collects news, comment, and Sina Weibo data from the external network through a download platform, and collects social and search data through an internal push process on the intranet. It then preprocesses the news and comment data; preprocessing mainly comprises region classification, domain classification, low-quality filtering, sentiment analysis, site identification, and authoritative-media verification. Next, one copy of the data is stored in the Hadoop cluster for use by the hot-topic mining process, and another copy is indexed and stored; the indexed data can serve both public opinion analysis and material retrieval. Algorithm components then carry out hot-topic mining and public opinion analysis to produce hot topics and related opinion data. These components mainly comprise hot-topic mining, hot-word discovery, sentiment analysis, viewpoint computation, and user profiling. The front end and back end exchange data through Hadoop files and a MySQL database. Finally, the data is presented as web pages according to different business requirements. Version 1.0 of the system has passed acceptance by People's Daily and was received favorably. The system still has areas that need improvement.
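The abstract names hot-word discovery as one of the algorithm components but does not say how it is computed. The following is only a minimal sketch of one common heuristic: rank terms by how much more often they appear in a recent window of documents than in a baseline window. The function name `hot_words`, its parameters, and the smoothed-ratio score are all assumptions made for illustration, not the thesis's method, and the input is assumed to be already tokenized.

```python
from collections import Counter
from typing import Iterable


def hot_words(
    recent_docs: Iterable[list[str]],
    baseline_docs: Iterable[list[str]],
    min_count: int = 2,
    top_k: int = 5,
) -> list[tuple[str, float]]:
    """Rank terms by how much more often they appear in recent documents.

    Each document is a list of already-tokenized terms. The score is the ratio
    of smoothed document frequencies (recent vs. baseline). This is an
    illustrative heuristic, not the method used in the thesis.
    """
    # Use sets so a term counts at most once per document.
    recent = [set(doc) for doc in recent_docs]
    baseline = [set(doc) for doc in baseline_docs]

    recent_df = Counter(term for doc in recent for term in doc)
    baseline_df = Counter(term for doc in baseline for term in doc)

    scores = {}
    for term, count in recent_df.items():
        if count < min_count:
            continue  # skip terms too rare in the recent window to be a trend
        # Add-one smoothing keeps the ratio finite for terms unseen in the baseline.
        recent_rate = (count + 1) / (len(recent) + 1)
        baseline_rate = (baseline_df[term] + 1) / (len(baseline) + 1)
        scores[term] = recent_rate / baseline_rate

    # Sort by score (descending), then alphabetically for deterministic output.
    ranked = sorted(scores.items(), key=lambda item: (-item[1], item[0]))
    return ranked[:top_k]


if __name__ == "__main__":
    baseline = [["weather", "rain"], ["market", "stocks"], ["weather", "sun"]]
    recent = [
        ["earthquake", "rescue"],
        ["earthquake", "weather"],
        ["rescue", "earthquake"],
    ]
    for term, score in hot_words(recent, baseline):
        print(f"{term}: {score:.2f}")
```

In a deployment like the one described, the two windows would presumably be drawn from the news data stored in the Hadoop cluster, and the counting would run as a distributed job rather than in memory; the sketch only shows the scoring logic on a toy input.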
[Degree-granting institution]: Harbin Institute of Technology
[Degree level]: Master's
[Year degree conferred]: 2017
[Classification number]: G252.7; TP311.13