Design and Implementation of a Network Data Analysis System Based on the Spark Platform

Published: 2018-11-17 06:59

[Abstract]: With the rapid development of Internet technology, content delivery networks (CDNs) play an important role in the Internet architecture, and users' browsing records are captured in the network logs of CDN service providers. CDN vendors share a set of common needs for analyzing massive volumes of network data, and their non-technical staff, such as product managers and operations personnel, need to run routine analyses on this data. The market currently lacks a general-purpose network data analysis service platform aimed at CDN service providers, so there is a pressing need for one that does not require its users to have big data platform expertise. To build a general-purpose, easy-to-use, and easily extensible service platform for analyzing massive network data, this thesis uses existing distributed frameworks to design and implement a network data analysis service platform based on Spark. The main work of the thesis is as follows:
(1) Preprocessing and analysis of massive network data using Spark. Based on the characteristics of network data, a network data analysis service tool is designed and implemented.
(2) Research on exposing the big data platform through the Web. The thesis studies how to browse network data held in the distributed storage engine from a Web platform, and how to launch massive network data analysis jobs from that platform.
(3) Starting from the way Yarn manages the whole big data platform, the thesis analyzes the relationship between the resource manager Yarn and the compute engine Spark, and studies how to monitor the platform's Spark jobs by monitoring Yarn, so as to ensure the availability of the whole system.
(4) Research on visualizing big data analysis results. After a survey of third-party visualization plug-ins, the thesis proposes using ECharts to render the analysis results on the page.
Building on the solutions obtained from this research, the thesis implements the Spark-based data analysis functions and Web access to the big data platform, and verifies the effectiveness of these functions and of the platform through experiments. With these key techniques in place, the thesis completes the development of the network data analysis service platform. The platform provides users with network data analysis, network data preview, result visualization, and system monitoring. It offers a basis for understanding users' online behavior and creates conditions for website operators and CDN vendors to optimize their own services.
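Contribution (3) of the abstract describes monitoring Spark jobs by monitoring Yarn, but the abstract does not say how this is done. The following is only a minimal sketch of one possible approach, assuming the platform polls the Yarn ResourceManager REST endpoint `/ws/v1/cluster/apps`. The host name, the sample payload, and the helper names (`build_apps_url`, `fetch_apps`, `summarize_spark_apps`, `needs_attention`) are all hypothetical and are not taken from the thesis.

```python
import json
from urllib.parse import urlencode
from urllib.request import Request, urlopen

# Hypothetical ResourceManager address; replace with a real host and port.
RM_BASE_URL = "http://resourcemanager.example:8088"


def build_apps_url(base_url, states=("RUNNING", "FAILED")):
    """Build a Cluster Applications API URL restricted to Spark applications."""
    query = urlencode({
        "applicationTypes": "SPARK",
        "states": ",".join(states),
    })
    return f"{base_url}/ws/v1/cluster/apps?{query}"


def fetch_apps(url, timeout=10):
    """Fetch the raw JSON body from the ResourceManager (not called below)."""
    request = Request(url, headers={"Accept": "application/json"})
    with urlopen(request, timeout=timeout) as response:
        return response.read().decode("utf-8")


def summarize_spark_apps(payload):
    """Return (id, name, state, finalStatus) for each Spark app in a response body."""
    # Guard against a missing or null "apps" field when nothing matches.
    apps = (json.loads(payload).get("apps") or {}).get("app", [])
    return [
        (a["id"], a["name"], a["state"], a["finalStatus"])
        for a in apps
        if a.get("applicationType") == "SPARK"
    ]


def needs_attention(summary):
    """Return the ids of applications that failed or were killed."""
    return [app_id for app_id, _, state, _ in summary if state in ("FAILED", "KILLED")]


# Made-up sample response, so the sketch runs without a cluster.
SAMPLE_RESPONSE = """
{"apps": {"app": [
  {"id": "application_1500000000000_0001", "name": "log-preprocess",
   "applicationType": "SPARK", "state": "RUNNING", "finalStatus": "UNDEFINED"},
  {"id": "application_1500000000000_0002", "name": "traffic-report",
   "applicationType": "SPARK", "state": "FAILED", "finalStatus": "FAILED"},
  {"id": "application_1500000000000_0003", "name": "distcp",
   "applicationType": "MAPREDUCE", "state": "FINISHED", "finalStatus": "SUCCEEDED"}
]}}
"""

if __name__ == "__main__":
    print(build_apps_url(RM_BASE_URL))
    summary = summarize_spark_apps(SAMPLE_RESPONSE)
    for row in summary:
        print(row)
    print("needs attention:", needs_attention(summary))
```

A Web front end could call `fetch_apps(build_apps_url(RM_BASE_URL))` on a timer and pass the body to `summarize_spark_apps`. Whether the thesis platform polls in this way or uses another Yarn interface is not stated in the abstract.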
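[Degree-granting institution]: Beijing University of Posts and Telecommunications
[Degree level]: Master's
[Year degree granted]: 2017
[Classification number]: TP393.09; TP311.13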

Article ID: 2336875
