Research on Storage and Management Technologies for IoT Big Data
Published: 2018-02-15 22:51
Keywords: IoT big data; distributed file system; data retrieval; data cube; energy-saving task scheduling. Source: University of Science and Technology of China, 2017 doctoral dissertation. Document type: degree thesis.
[Abstract]: The Internet of Things (IoT) is a vast network formed by connecting massive numbers of sensing devices to the Internet. In the IoT, these devices continuously collect data and send it to data centers; as sensing and network technologies advance, the data volume becomes massive, giving rise to IoT big data. Persistently storing IoT big data makes both the historical and the current readings of any sensor available, and retrieval and statistical analysis over that data enable the perception of complex patterns and regularities as well as trend analysis. Data storage and management run as stream tasks in the data center, and energy-saving task scheduling can lower the cost of IoT applications. Together these bring new opportunities to fields such as urban safety, smart cities, target recognition and tracking, and location services. Storing and managing IoT big data requires persisting data, retrieving it in real time, analyzing and processing it promptly, and providing an efficient computing framework, so that the data can ultimately be perceived and controlled effectively.

The massive scale of IoT big data, however, poses major challenges for storage and management. First, "persistent storage": massive numbers of sensors frequently produce new readings and send them to the data center, forming a write stream of several GB per second, which severely strains traditional persistent storage systems such as HDFS. Large-scale distributed file systems represented by HDFS do support big data storage, but they were not designed for real-time, high-performance data storage and therefore cannot meet the growing demand for online big data storage. For example, when HDFS faces a data stream of massive small files, single-machine performance often drops to a few MB/s, far below practical requirements. Second, "data retrieval": data held on persistent devices must be located quickly through a data retrieval system, but current database systems, mainly relational and NoSQL databases, cannot effectively meet the retrieval needs of IoT big data. For example, NoSQL databases base their read/write paths, index structures, query execution, query optimization, and recovery strategies on disk storage, and the inherently poor read/write performance of disks limits big data storage performance and, in particular, big data analysis performance. Third, "statistical analysis of data": this requires building data cubes to support efficient statistical analysis. Traditional data cubes such as HIVE, however, can only analyze deterministic data; when faced with the probabilistic data found in the IoT, statistical analysis takes on the order of hours, which does not meet the needs of practical applications. Finally, storage, retrieval, and analysis all run in the data center as stream tasks, and energy accounts for 40% of data center operation and maintenance costs, so energy-saving task scheduling becomes the key to reducing data center cost; yet current task scheduling platforms, represented by Hadoop YARN, do not support energy-saving task scheduling. In summary, many existing data storage and management technologies show limitations when applied to IoT big data.

To address these problems, this thesis proposes Sensor Storage, a data storage and management system framework for IoT big data. Sensor Storage is a distributed platform for data storage, retrieval, and analysis, and comprises the following key technologies.

(1) A distributed file system for massive small files. This work builds SensorFS, a distributed storage system that extends HDFS. Its architecture supports fast storage and query optimization for massive small files and provides high scalability and data security guarantees. The work proposes a write-throughput optimization mechanism and algorithms for massive small files, analyzes and models the small-file write bottleneck theoretically, designs a small-file write optimization strategy, and proposes a read-performance optimization mechanism for massive small files in HDFS.

(2) A space-efficient key-value data retrieval system. This work builds RadixKV, a key-value data retrieval system based on the Radix Tree, which provides fast keyword-based retrieval over the massive content stored in the distributed file system. The work analyzes the strengths and weaknesses of the Radix Tree, analyzes its online update performance, and designs an adaptive parallel index update strategy. It also proposes Radix Array, a representation of the Radix Tree optimized for space overhead, designs the Radix Array data structure, and analyzes its space overhead.

(3) A data cube system for probabilistic data. The work analyzes the "uncertainty" inherent in IoT big data and designs ProbabilisticCube, a data cube system targeted at probabilistic data that provides fast aggregate queries over such data. It defines a probabilistic data model for IoT big data and, on that basis, defines and designs the probabilistic data cube; designs high-performance aggregation operations for probabilistic data; designs a cube materialization strategy based on a materialization cost estimation model; and designs slice and dice queries for probabilistic data.

(4) An energy-efficient task scheduling framework. The work builds Green Yarn, a distributed task scheduling framework that extends Hadoop YARN. The new framework schedules IoT stream tasks sensibly and, without sacrificing performance, exploits the dynamic voltage and frequency scaling (DVFS) capability of servers to match tasks to servers appropriately. It designs a task-based energy-efficiency model and separate scheduling algorithms for offline batch tasks and for online tasks.

Through this systematic study, the thesis aims to establish a new storage architecture for IoT big data, to propose innovative optimized designs for the file system and for big data retrieval and analysis, and to solve the fundamental problems involved. The research provides initial relief for the storage and management pressure of IoT big data, and the prototype system it goes on to implement supports further validation, experimentation, and application of efficient big data storage and management, offering new ideas for the theory and systematic methods of big data management.
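As an illustration of the Radix Tree on which the abstract says RadixKV is built, the following is a minimal sketch of a generic radix tree (compressed trie) with key-value insert and lookup. It is not the thesis's implementation and does not model the Radix Array representation or the adaptive parallel update strategy; the class names, method names, and sensor-style keys are hypothetical and chosen only for this example.

```python
from typing import Dict, Optional


class RadixNode:
    """One node of a radix tree; each outgoing edge carries a string label."""

    def __init__(self) -> None:
        self.children: Dict[str, "RadixNode"] = {}  # edge label -> child node
        self.value: Optional[str] = None
        self.has_value = False  # distinguishes "no entry" from a stored None


def common_prefix_len(a: str, b: str) -> int:
    """Length of the longest common prefix of a and b."""
    n = 0
    while n < len(a) and n < len(b) and a[n] == b[n]:
        n += 1
    return n


class RadixTree:
    """Hypothetical, single-threaded key-value index (illustration only)."""

    def __init__(self) -> None:
        self.root = RadixNode()

    def put(self, key: str, value: str) -> None:
        node = self.root
        while True:
            if not key:
                # The whole key has been consumed: store the value here.
                node.value, node.has_value = value, True
                return
            for label in list(node.children):
                n = common_prefix_len(label, key)
                if n == 0:
                    continue
                child = node.children[label]
                if n < len(label):
                    # Key diverges inside this edge: split the edge at n.
                    mid = RadixNode()
                    mid.children[label[n:]] = child
                    del node.children[label]
                    node.children[label[:n]] = mid
                    child = mid
                node, key = child, key[n:]
                break
            else:
                # No edge shares a prefix with the key: add a new leaf.
                leaf = RadixNode()
                leaf.value, leaf.has_value = value, True
                node.children[key] = leaf
                return

    def get(self, key: str) -> Optional[str]:
        node = self.root
        while key:
            for label, child in node.children.items():
                if key.startswith(label):
                    node, key = child, key[len(label):]
                    break
            else:
                return None  # no edge matches the remaining key
        return node.value if node.has_value else None


tree = RadixTree()
tree.put("sensor/001/temp", "block-17")
tree.put("sensor/001/hum", "block-18")
tree.put("sensor/002/temp", "block-42")
print(tree.get("sensor/001/hum"))
```

Because shared prefixes such as "sensor/00" are stored once on a single edge, keys with common structure cost less space than in a plain per-character trie, which is the general property that makes radix trees attractive for keyword indexes.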
[Degree-granting institution]: University of Science and Technology of China
[Degree level]: Doctoral
[Year degree conferred]: 2017
[Classification number]: TP391.44;TN929.5
Article ID: 1514067
Article link: http://sikaile.net/shoufeilunwen/xxkjbs/1514067.html