Design and Implementation of a Data Center Cluster Monitoring System
Topic of this thesis: cluster + monitoring. Source: master's thesis, China University of Geosciences (Beijing), 2012
【Abstract】: As Ethernet bandwidth keeps rising and the price of commodity computers keeps falling, cluster systems have come to offer a high performance-to-price ratio and good scalability. Such a system uses ordinary PCs as nodes, each forming a basic computing unit, connects them over a high-speed local area network, and relies on software to make them cooperate. Clusters have replaced traditional mainframes and supercomputers and are widely used in many industrial fields, such as information retrieval, text analysis, large-scale data mining, machine learning, and the currently popular cloud computing. As clusters become more widely used, the number of nodes is continually increased to raise computing performance. Because a cluster is built from ordinary PCs whose reliability is limited, the probability that some single node fails is high, and monitoring becomes increasingly important as the cluster grows. Monitoring reveals which nodes have failed or stopped working, reports the utilization of each node, and supports analysis of the cluster's operating trends, performance limits, and job bottlenecks, providing a basis for system administration and for cluster task scheduling.

This project originates from the Meridian Project Data Center and aims to monitor the operation of the cluster responsible for numerical space weather computation there. Following the center's specific requirements, this thesis designs and implements a cluster monitoring system with the following functions: collecting metrics from each node in the cluster, including system load, the breakdown of processor time, memory usage, hard disk usage, network traffic, and other system-related metrics; aggregating the metrics from all nodes, storing them in a database, and presenting them to end users as web pages so that users can query and use them; and evaluating the metrics against user-defined value ranges, so that when an abnormal metric is found it is sent to the relevant personnel according to predefined notification rules for further handling, reducing unnecessary losses. The system uses a C/S (client/server) architecture consisting of agent programs distributed on every node, a number of aggregation programs, and a front-end display interface. It obtains monitoring data from /proc, uses XML for data transfer, and uses RRDTool to draw trend graphs for numeric metrics; the back end includes two kinds of databases, RRD and MySQL.

The cluster monitoring system designed in this thesis monitors the Meridian Project Data Center stably and effectively, with low system resource usage and fast response.
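The agent-side flow described in the abstract (read metrics from /proc, report them as XML, check them against user-defined ranges) can be illustrated with a minimal Python sketch. It is not the thesis's implementation: the function names, the metric names, and the XML element layout (`node`, `metric`) are all hypothetical, and only two /proc files are covered.

```python
import xml.etree.ElementTree as ET


def parse_loadavg(text):
    """Parse the contents of /proc/loadavg into 1-, 5- and 15-minute load averages."""
    one, five, fifteen = (float(x) for x in text.split()[:3])
    return {"load_one": one, "load_five": five, "load_fifteen": fifteen}


def parse_meminfo(text):
    """Parse /proc/meminfo lines such as 'MemTotal: 16384 kB' into kB values."""
    # Only two fields are extracted here; the metric names are hypothetical.
    wanted = {"MemTotal": "mem_total_kb", "MemFree": "mem_free_kb"}
    metrics = {}
    for line in text.splitlines():
        key, _, rest = line.partition(":")
        if key in wanted:
            metrics[wanted[key]] = float(rest.split()[0])
    return metrics


def to_xml(host, metrics):
    """Serialize one node's metrics as an XML report (hypothetical schema)."""
    root = ET.Element("node", name=host)
    for name, value in sorted(metrics.items()):
        ET.SubElement(root, "metric", name=name, value=str(value))
    return ET.tostring(root, encoding="unicode")


def find_anomalies(metrics, ranges):
    """Return the metrics whose value lies outside its user-set (low, high) range."""
    return {
        name: value
        for name, value in metrics.items()
        if name in ranges and not (ranges[name][0] <= value <= ranges[name][1])
    }


def collect(host):
    """Read the live /proc files (Linux only) and build the XML report."""
    with open("/proc/loadavg") as handle:
        metrics = parse_loadavg(handle.read())
    with open("/proc/meminfo") as handle:
        metrics.update(parse_meminfo(handle.read()))
    return to_xml(host, metrics)
```

On a Linux node, `collect("node01")` would return the XML string that an agent could send to an aggregation program, and `find_anomalies(metrics, {"load_one": (0.0, 8.0)})` would flag a one-minute load average above 8. Keeping the parsers as functions of text rather than of file paths makes them easy to exercise without a live /proc.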
【Degree-granting institution】: China University of Geosciences (Beijing)
【Degree level】: Master's
【Year degree awarded】: 2012
【Classification number】: TP308; TP277
Article ID: 2063355
Article link: http://sikaile.net/kejilunwen/jisuanjikexuelunwen/2063355.html