Research on Real-Time Processing of Log Data Based on Storm and Hadoop
Topic: real-time log data processing + Hadoop; Source: Master's thesis, Southwest University, 2017
【Abstract】: Log data records rich information about system and network-user behavior and has high practical value in fields such as network management and user-behavior analysis. With the advent of the big-data era, the volume of log data generated per unit time grows geometrically, and the diversity, heterogeneity, and dynamic change of log data challenge its collection, storage, and in-depth analysis. Traditional log processing relies mainly on single-node servers, which do not scale and whose CPU, I/O, and storage capacity are all very limited. Response-time requirements for log analysis in practical applications keep rising, so real-time processing, together with high-throughput parallel computing over large data volumes, has become a basic requirement of log processing. In real-time scenarios, stream computing can process a log stream as it arrives and extract knowledge from the modest data sets that accumulate within a time window, but the limited data volume restricts both the applicable algorithms and the reliability of the results; the knowledge that real-time computation extracts and depends on therefore urgently needs to be combined with the results that offline batch processing obtains from large-scale historical data.

To address the main problems of collecting, storing, and analyzing rapidly growing log data in the context of informatization and big data, and of extracting and integrating knowledge from offline data and real-time streams, this thesis studies the theory and practice of big-data technology and builds a real-time log processing platform on the Hadoop distributed infrastructure, using Storm On YARN to integrate the MapReduce and Storm computing frameworks at the resource-scheduling level. Flume and HBase provide distributed log collection and storage; high-throughput MapReduce extracts global knowledge from large-scale offline data; and Storm extracts bursty knowledge from the small-scale data in the Kafka buffer and performs continuous real-time computation over the stream with the help of that knowledge, improving accuracy while preserving real-time performance. The main work and results are as follows.

(1) Real-time log processing platform. A three-layer architecture is designed, comprising a data service layer for collection and storage, a business logic layer for analysis, and a Web presentation layer for visualization. A shared knowledge base links offline and real-time analysis, and Hadoop, Storm, Flume, HBase, and Kafka are integrated to build the platform's distributed cluster environment.

(2) Distributed collection and storage of log data. Flume stores logs collected from multiple front-end servers into the distributed database HBase in near real time, with HBase optimized by pre-splitting and RowKey salting (a key-design sketch follows this abstract). Experiments show that the platform collects and stores front-end logs in near real time and that the optimized HBase uses the cluster's I/O and CPU resources more fully during storage, balances load better, and effectively resolves HBase "hot spot" regions.

(3) Offline deep analysis based on MapReduce. Traditional data-mining algorithms are parallelized under the MapReduce model and run on the platform to extract global knowledge from the historical logs in HBase into an offline knowledge base; for the target application, K-means and Apriori are parallelized to perform clustering and association-rule mining in the MapReduce environment (a one-iteration K-means sketch appears below). Experiments show that the platform extracts highly reliable knowledge from historical logs and that MapReduce parallelism gives the deep analysis higher efficiency and scalability, fully meeting the needs of large-scale log knowledge extraction.

(4) Real-time analysis of the log stream based on Storm. Storm and Kafka are integrated for stable ingestion of the real-time log stream. Traditional mining algorithms are combined with the Storm model to extract bursty knowledge from the small-scale real-time data within a time window into a real-time knowledge base, while the shared knowledge base supplies decision support for Storm's continuous stream computation, combining offline and real-time computation; for the target application, K-means, KNN, and other algorithms are combined to detect network anomalies (a topology sketch appears below). Experiments show that the platform extracts bursty knowledge from real-time data and performs high-accuracy continuous computation against the shared knowledge base, and that Storm gives the analysis stronger real-time behavior and a clear advantage in stream processing.

In summary, the real-time log processing platform built in this research effectively solves the problems of log collection, storage, and knowledge extraction. It fuses the strengths of Hadoop and Storm: MapReduce extracts the global knowledge hidden in historical logs, Storm extracts the bursty knowledge in small-scale real-time logs, and Storm's stream processing applies both kinds of knowledge in continuous real-time computation over the log stream. The work offers a new technical reference for log collection, storage, and analysis and has practical and promotional value.
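As a minimal illustration of the pre-splitting and RowKey salting named in result (2): the abstract does not give the thesis's actual key layout, so the sketch below assumes a hypothetical salt|source|reversedTimestamp key and derives matching split points. The class name, region count, and key format are illustrative assumptions, not the thesis design.

    import java.nio.charset.StandardCharsets;

    /** Illustrative RowKey salting for HBase log storage (hypothetical key layout). */
    public final class SaltedRowKey {
        private static final int REGION_COUNT = 16; // assumed number of pre-split regions

        /** Build a key of the form "salt|source|reversedTs" so writes spread across regions. */
        public static byte[] build(String sourceHost, long epochMillis) {
            // A stable hash of the source picks the salt bucket (00..15), so all rows
            // from one host stay in one region while different hosts spread out.
            int salt = Math.floorMod(sourceHost.hashCode(), REGION_COUNT);
            // Reversed timestamp keeps the newest entries first within a bucket.
            long reversedTs = Long.MAX_VALUE - epochMillis;
            String key = String.format("%02d|%s|%019d", salt, sourceHost, reversedTs);
            return key.getBytes(StandardCharsets.UTF_8);
        }

        /** Split points matching the salt prefixes, for pre-creating the table. */
        public static byte[][] splitPoints() {
            byte[][] splits = new byte[REGION_COUNT - 1][];
            for (int i = 1; i < REGION_COUNT; i++) {
                splits[i - 1] = String.format("%02d", i).getBytes(StandardCharsets.UTF_8);
            }
            return splits;
        }

        public static void main(String[] args) {
            byte[] key = build("web-03.example.com", System.currentTimeMillis());
            System.out.println(new String(key, StandardCharsets.UTF_8));
        }
    }

The split points returned by splitPoints() would be handed to HBase's Admin.createTable(descriptor, splits) when the log table is created, so that each salt bucket starts in its own region and sequential writes never pile onto a single "hot" region.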
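To make result (3) concrete, here is a minimal sketch of one K-means iteration as a Hadoop MapReduce job, assuming input lines are comma-separated numeric feature vectors and that the current centroids are broadcast through the job Configuration under a made-up key kmeans.centroids; the abstract does not specify these details, so every name here is illustrative.

    import java.io.IOException;
    import java.util.ArrayList;
    import java.util.List;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;

    /** One K-means iteration: mappers assign points to the nearest centroid, reducers re-average. */
    public class KMeansIteration {

        /** Parse "1.0,2.5,..." into a double vector. */
        static double[] parse(String line) {
            String[] parts = line.split(",");
            double[] v = new double[parts.length];
            for (int i = 0; i < parts.length; i++) v[i] = Double.parseDouble(parts[i].trim());
            return v;
        }

        static double squaredDistance(double[] a, double[] b) {
            double s = 0;
            for (int i = 0; i < a.length; i++) { double d = a[i] - b[i]; s += d * d; }
            return s;
        }

        public static class AssignMapper extends Mapper<LongWritable, Text, IntWritable, Text> {
            private final List<double[]> centroids = new ArrayList<>();

            @Override
            protected void setup(Context context) {
                // Centroids broadcast as "x,y;x,y;..." in the job configuration (an assumption).
                Configuration conf = context.getConfiguration();
                for (String c : conf.get("kmeans.centroids").split(";")) centroids.add(parse(c));
            }

            @Override
            protected void map(LongWritable key, Text value, Context context)
                    throws IOException, InterruptedException {
                double[] point = parse(value.toString());
                int best = 0;
                double bestDist = Double.MAX_VALUE;
                for (int i = 0; i < centroids.size(); i++) {
                    double d = squaredDistance(point, centroids.get(i));
                    if (d < bestDist) { bestDist = d; best = i; }
                }
                context.write(new IntWritable(best), value); // emit (clusterId, point)
            }
        }

        public static class RecomputeReducer extends Reducer<IntWritable, Text, IntWritable, Text> {
            @Override
            protected void reduce(IntWritable clusterId, Iterable<Text> points, Context context)
                    throws IOException, InterruptedException {
                double[] sum = null;
                long n = 0;
                for (Text t : points) {
                    double[] p = parse(t.toString());
                    if (sum == null) sum = new double[p.length];
                    for (int i = 0; i < p.length; i++) sum[i] += p[i];
                    n++;
                }
                StringBuilder out = new StringBuilder();
                for (int i = 0; i < sum.length; i++) {
                    if (i > 0) out.append(',');
                    out.append(sum[i] / n); // new centroid coordinate
                }
                context.write(clusterId, new Text(out.toString()));
            }
        }
    }

A driver would run this job repeatedly, feeding each iteration's reducer output back in as the next centroid set until the centroids stabilize. The parallel Apriori mentioned in the abstract follows the same map/reduce pattern, with mappers counting candidate itemsets and reducers aggregating the counts.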
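Result (4) combines a Kafka-fed Storm topology with knowledge-base-assisted anomaly detection. The sketch below wires the classic ZooKeeper-based storm-kafka spout of that era (Storm 1.x) to a bolt that flags a log event when it is far from every known-normal centroid; the topic name, threshold, feature extraction, and hard-coded centroids are all placeholder assumptions, since the abstract does not give the real topology.

    import java.util.Arrays;
    import java.util.List;
    import java.util.Map;

    import org.apache.storm.Config;
    import org.apache.storm.LocalCluster;
    import org.apache.storm.kafka.KafkaSpout;
    import org.apache.storm.kafka.SpoutConfig;
    import org.apache.storm.kafka.StringScheme;
    import org.apache.storm.kafka.ZkHosts;
    import org.apache.storm.spout.SchemeAsMultiScheme;
    import org.apache.storm.task.TopologyContext;
    import org.apache.storm.topology.BasicOutputCollector;
    import org.apache.storm.topology.OutputFieldsDeclarer;
    import org.apache.storm.topology.TopologyBuilder;
    import org.apache.storm.topology.base.BaseBasicBolt;
    import org.apache.storm.tuple.Fields;
    import org.apache.storm.tuple.Tuple;
    import org.apache.storm.tuple.Values;

    /** Minimal Storm topology: Kafka log stream -> nearest-centroid anomaly check. */
    public class LogAnomalyTopology {

        public static class AnomalyBolt extends BaseBasicBolt {
            private transient List<double[]> centroids;
            private static final double THRESHOLD = 4.0; // assumed distance cutoff

            @Override
            public void prepare(Map stormConf, TopologyContext context) {
                // Hypothetical stand-in: in the thesis design the centroids would come from
                // the shared knowledge base written by the offline MapReduce K-means job.
                centroids = Arrays.asList(new double[]{0, 0}, new double[]{1, 1});
            }

            @Override
            public void execute(Tuple input, BasicOutputCollector collector) {
                // StringScheme emits one field named "str" holding the raw Kafka message.
                String line = input.getStringByField("str");
                double[] features = extractFeatures(line);
                double best = Double.MAX_VALUE;
                for (double[] c : centroids) {
                    double s = 0;
                    for (int i = 0; i < c.length; i++) { double d = features[i] - c[i]; s += d * d; }
                    best = Math.min(best, s);
                }
                collector.emit(new Values(line, best > THRESHOLD));
            }

            @Override
            public void declareOutputFields(OutputFieldsDeclarer declarer) {
                declarer.declare(new Fields("log", "anomalous"));
            }

            /** Hypothetical feature extraction; a real parser would derive these from log fields. */
            private double[] extractFeatures(String logLine) {
                return new double[]{logLine.length() % 10, logLine.split(" ").length % 10};
            }
        }

        public static void main(String[] args) throws Exception {
            // Classic ZooKeeper-based storm-kafka spout; hosts and topic are placeholders.
            SpoutConfig spoutConf = new SpoutConfig(
                    new ZkHosts("zk1:2181"), "log-topic", "/log-topic", "log-consumer");
            spoutConf.scheme = new SchemeAsMultiScheme(new StringScheme());

            TopologyBuilder builder = new TopologyBuilder();
            builder.setSpout("kafka-spout", new KafkaSpout(spoutConf), 2);
            builder.setBolt("anomaly", new AnomalyBolt(), 4).shuffleGrouping("kafka-spout");

            new LocalCluster().submitTopology("log-anomaly", new Config(), builder.createTopology());
        }
    }

Loading the centroids in prepare() from a store shared with the batch layer is what couples the offline and real-time analyses: the slow MapReduce job refreshes the "normal" model, and every Storm worker scores each arriving event against it at stream speed.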
【Degree-granting institution】: Southwest University
【Degree level】: Master's
【Year conferred】: 2017
【CLC number】: TP311.13