Research and Implementation of a Big-Data-Based Algorithm for Mining EMU Fault Association Rules
Topics: association rules + big data; Source: Beijing Jiaotong University, master's thesis, 2017
[Abstract]: As the most important rolling stock for high-speed rail transport, the electric multiple unit (EMU) integrates many advanced technologies. Its vehicle structure differs substantially from that of conventional locomotives and rolling stock, and its operating speed is far beyond their reach. During operation, fault management and maintenance are an important part of the integrated support engineering of the high-speed railway system and a necessary guarantee of safe and efficient EMU operation. Within maintenance, the maintenance regime plays a guiding and decisive role, and a sound, complete regime is the basic precondition for fast, safe, comfortable and efficient operation of high-speed EMUs. The emphasis on safety, however, inevitably makes the maintenance process complex, which significantly hinders efficiency. Maintenance efficiency can be improved in two ways: through deeper theoretical study of EMU construction, and through the large volume of EMU data accumulated in the past, which contains valuable information that has not yet been exploited. As big data technologies mature, the value of this data is becoming increasingly apparent. To make good use of it, the fault association information hidden in the massive fault data must be extracted so that faults can be detected earlier.

There are three main maintenance strategies: periodic maintenance, condition-based maintenance and corrective (after-failure) maintenance. Periodic maintenance is currently the dominant approach. Maintenance is divided into five levels, and after a train has been in service for a given time or mileage it undergoes the corresponding level of maintenance and has the corresponding components replaced. The maintenance interval is set from expert experience and includes a safety margin. This guarantees safety but leads to over-maintenance: a component in good health is still replaced, which raises operation and maintenance costs. Corrective maintenance is the opposite extreme, in which a component is replaced only after it has failed completely, and is clearly unacceptable. Condition-based maintenance was therefore proposed as a compromise: the degree of damage is judged from the component's current working state, and the component is replaced when it is about to fail, which both ensures transport safety and reduces cost.

Big data analysis has already been applied in several areas of China's railway industry. A scheme for analyzing and processing EMU vibration data has been designed and implemented on the Hadoop platform; it removes linear drift from high-speed rail vibration data, finds outliers, and determines the type of component fault from the data distribution. Also on Hadoop, historical traffic flow data has been analyzed to estimate traffic flow efficiently and accurately. An approach to building an EMU data warehouse has been proposed as well, and it covers EMU fault data. Big data analysis can thus be regarded as the future direction for the vast railway system, and it is already in use in parts of EMU operation management. As demand in the EMU maintenance field grows, EMU fault maintenance will also need the support of big data analysis.

A data mining process generally consists of data cleaning, data integration, data transformation, data mining, pattern evaluation and knowledge representation, and in practice these stages are executed repeatedly. Data mining tasks mainly include association pattern mining, clustering and decision tree mining. The main work of this thesis, association rule mining, has two steps: mining frequent patterns, and generating association rules from the frequent patterns. Rule generation is comparatively simple, so frequent pattern mining is the step that determines the efficiency of an association rule algorithm and the core issue that distinguishes the many algorithms from one another. Any progress in frequent pattern mining therefore has an important effect on the efficiency of association rule mining and of other data mining tasks. In summary, this thesis implements an association rule algorithm on a distributed computing platform and uses it to analyze EMU fault data, addressing a current gap in EMU operation and maintenance in China.

The earliest association rule algorithm, AIS, dates back to 1993. Because AIS was too inefficient, Agrawal et al. improved on it and proposed the Apriori algorithm, which uses an iterative, level-wise search to find the frequent itemsets in a transaction database and is far more efficient than AIS. Apriori is a classic algorithm, and many later algorithms, such as AprioriHybrid, are derived from it. Its efficiency rests on two important properties of frequent itemsets: if an itemset R is frequent, then all of its subsets are frequent; if R is not frequent, then none of its supersets is frequent. These two properties effectively reduce the number of candidate itemsets that have to be generated. In the level-wise search, the k-itemsets are used to explore the (k+1)-itemsets. First, the database is scanned, the count of each individual item is accumulated, and the items that satisfy the minimum support are recorded; this yields the set of frequent 1-itemsets, denoted L1. L1 is then used to find L2, the set of frequent 2-itemsets, and so on, until no further frequent k-itemsets can be found. Finding each Lk requires one full scan of the database. Besides its value in fault diagnosis, Apriori is widely used in fields such as business and price analysis. The algorithm is intuitive, simple and easy to implement, but it generates many candidate itemsets and scans the database many times, so its strengths and weaknesses are equally obvious.

This thesis improves on these weaknesses. It considers two approaches to optimizing the algorithm's performance, ant colony optimization and Bloom filters, chiefly to eliminate redundancy in the intermediate steps of generating associations so that the algorithm runs faster. The performance of the resulting algorithms is compared, and the better one is selected for further work. In addition, analyzing the data well requires big data tools so that computation is efficient and sound. The thesis studies the Hadoop big data platform in depth, including the Hadoop Distributed File System (HDFS) and the Spark framework. As a mainstream distributed storage system, HDFS has the following advantages: (1) scalability: it can reliably store and process petabyte-scale data; (2) low cost: data can be distributed and processed on clusters of commodity machines, which can total several thousand nodes; (3) high efficiency: by distributing and replicating data, Hadoop can process the data in parallel on the nodes where it resides; (4) high fault tolerance: rather than relying on better hardware to prevent errors, it provides a mechanism that handles data corruption and errors on commodity nodes gracefully. HDFS is in effect a solution designed for environments in which data errors are frequent. This fault tolerance keeps data safe and reliable and allows HDFS to be deployed on ordinary commodity machines.

Spark is an open-source cluster computing system based on in-memory computation, designed to make data analysis faster. Spark is compact and was developed by a small team led by Matei Zaharia at the AMPLab of the University of California, Berkeley. It is an open-source cluster computing environment similar to Hadoop, but the differences between the two make Spark perform better on certain workloads: Spark provides in-memory distributed datasets, which support interactive queries and also speed up iterative workloads. Spark is implemented in Scala and uses Scala as its application framework. Unlike Hadoop, Spark is tightly integrated with Scala, so Scala can manipulate distributed datasets as easily as local collection objects. Although Spark was created to support iterative jobs on distributed datasets, it is in practice a complement to Hadoop and can run in parallel on the Hadoop file system.

Finally, building on the association rule algorithm and the big data platform, the thesis combines this theoretical groundwork with EMU fault data to determine a scheme for mining fault association rules. The goal is to mine EMU fault association rules quickly and accurately and to provide optimization suggestions that help management departments establish a more complete and reasonable EMU maintenance process. With the large-scale deployment of EMUs, maintenance management regulations have been supplemented, revised and improved, and maintenance plans and work processes have been adjusted and optimized. Because this work is still at an early stage, maintenance plans continue to change with railway construction, component service life and other factors, so in many respects China is still at the research stage. The main problem facing big data analysis in China is a low return on investment: resource consumption is high, but the expected benefit has not materialized. In the long run, however, as the relevant industries become standardized and raw data accumulates across industries, the prospects for big data analysis are broad. This thesis, "Research and Implementation of a Big-Data-Based Algorithm for Mining EMU Fault Association Rules", uses EMU operation and maintenance data to acquire and refine EMU fault knowledge. The study mines frequent itemsets and association rules of faults from massive EMU fault data with an improved Apriori algorithm, improves the algorithm in light of its weaknesses, and ports the improved algorithm to Spark so that this work completes faster.
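The level-wise Apriori search and the rule-generation step described in the abstract can be sketched as follows. This is a minimal single-machine illustration in Python, not the thesis's improved or Spark-based implementation, which the abstract does not detail. The function names, the fault labels (`door`, `brake`, `hvac`), and the thresholds are all hypothetical, and the minimum support is assumed to be given as an absolute count rather than a fraction.

```python
from itertools import combinations


def apriori(transactions, min_count):
    """Return {frozenset(itemset): support_count} for all frequent itemsets.

    min_count is the minimum support expressed as an absolute count
    (an assumption made here to avoid floating-point thresholds).
    """
    transactions = [frozenset(t) for t in transactions]

    # First scan: count single items to obtain L1.
    counts = {}
    for t in transactions:
        for item in t:
            key = frozenset([item])
            counts[key] = counts.get(key, 0) + 1
    current = {s: c for s, c in counts.items() if c >= min_count}
    frequent = dict(current)

    k = 1
    while current:
        # Join step: merge pairs of frequent k-itemsets into (k+1)-candidates.
        candidates = set()
        for a, b in combinations(list(current), 2):
            union = a | b
            if len(union) == k + 1:
                # Prune step: a candidate survives only if every one of its
                # k-subsets is frequent (supersets of infrequent sets are dropped).
                if all(frozenset(s) in current for s in combinations(union, k)):
                    candidates.add(union)
        # One full scan of the data per level to count the candidates.
        counts = {c: 0 for c in candidates}
        for t in transactions:
            for c in candidates:
                if c <= t:
                    counts[c] += 1
        current = {s: c for s, c in counts.items() if c >= min_count}
        frequent.update(current)
        k += 1
    return frequent


def generate_rules(frequent, min_confidence):
    """Derive (antecedent, consequent, confidence) rules from frequent itemsets."""
    rules = []
    for itemset, count in frequent.items():
        if len(itemset) < 2:
            continue
        for r in range(1, len(itemset)):
            for antecedent in combinations(itemset, r):
                antecedent = frozenset(antecedent)
                # Every subset of a frequent itemset is frequent, so its
                # support count is already available in `frequent`.
                confidence = count / frequent[antecedent]
                if confidence >= min_confidence:
                    rules.append((antecedent, itemset - antecedent, confidence))
    return rules


# Hypothetical fault records: each set lists faults logged together.
records = [
    {"door", "brake"},
    {"door", "brake", "hvac"},
    {"door", "hvac"},
    {"brake", "hvac"},
    {"door", "brake"},
]
frequent_itemsets = apriori(records, min_count=3)
fault_rules = generate_rules(frequent_itemsets, min_confidence=0.7)
```

The sketch rescans all records once per level, which is exactly the cost the abstract names as Apriori's main weakness; a distributed version would parallelize that counting pass across partitions of the data.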
[Degree-granting institution]: Beijing Jiaotong University
[Degree level]: Master's
[Year degree granted]: 2017
[Classification number]: U269; TP311.13
Syed Kamran Hussain; Research and Implementation of a Big-Data-Based Algorithm for Mining EMU Fault Association Rules [D]; Beijing Jiaotong University; 2017
Article ID: 1922572
Article link: http://sikaile.net/kejilunwen/ruanjiangongchenglunwen/1922572.html