Hadoop-Based Quality Analysis and Processing of Taxi Data
Keywords: Hadoop, data quality, data cleaning, stopping points  Source: Wuhan University of Technology, 2015 master's thesis  Document type: degree thesis
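【Abstract】: Through the construction of its Intelligent Transportation System (ITS), Shenzhen has established a public intelligent-transportation information platform. The platform collects massive volumes of traffic data every day, and these data contain rich traffic information. High-quality traffic data is what allows ITS to make correct decisions. In practice, however, equipment failures, environmental interference, human operating errors, and other factors mean that the raw data inevitably suffers from quality problems such as missing and redundant records. Driven by project requirements, this thesis uses a cloud computing platform built on Hadoop to analyze the quality of Shenzhen's massive taxi data and to process the data with quality as the goal. The main work is as follows. (1) It reviews the achievements and shortcomings of domestic and international research on data quality assessment and data cleaning, and derives the research content of this thesis from that review. (2) According to the project requirements, it designs an evaluation system that combines the analytic hierarchy process (AHP) from decision science with historical data. AHP is used to compute the weights of the evaluation indicators, and the expectation of the historical data serves as the baseline for deriving a data quality score, which quantifies data quality problems and reflects the state of the data intuitively. (3) Based on the characteristics of Shenzhen taxi data, it proposes quality evaluation schemes for GPS data and for operation data. It first identifies the main factors that affect data quality and determines the evaluation indicators for each data set; then, for the redundant, incomplete, and erroneous data present in the data sets, it proposes corresponding evaluation rule algorithms that judge whether records meet the conditions. (4) Building on the quality analysis results, it improves the quality of the Shenzhen taxi data. It focuses on duplicate-data cleaning and proposes a MapReduce-based blocked deduplication algorithm to remove duplicate records. It then proposes Hadoop-based cleaning schemes for GPS data and for operation data; the schemes target incomplete, redundant, and erroneous data and migrate traditional cleaning techniques to the cloud platform. (5) It applies the cleaned, high-quality GPS data to the study of taxi stopping points and proposes a DBSCAN-based stopping-point detection algorithm that finds taxi stopping points in trajectory data recorded while no passenger is on board. The detection algorithm has three steps: candidate point acquisition, candidate point filtering, and clustering of stopping-point candidates. Candidate points are obtained with a candidate point detection algorithm and then filtered by their temporal and spatial attributes; finally, after the strengths and weaknesses of various clustering algorithms are compared, DBSCAN is selected to cluster the stopping points. Using the established evaluation system, the quality of the taxi GPS data and operation data is assessed, yielding a data quality score for each of the two data sets. The scores reflect data quality intuitively and provide a basis for the subsequent cleaning tasks. The cleaning schemes developed from the evaluation results effectively improve data quality and support ITS in making correct decisions. Studying taxi stopping points on the cleaned data helps city managers better understand the situation of taxi drivers and also offers guidance to drivers looking for passengers.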
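The AHP weighting described in item (2) can be illustrated with a minimal Python sketch. It is not the thesis's implementation: the three indicators (completeness, uniqueness, correctness), the pairwise judgment matrix, and the per-indicator scores are hypothetical, and the abstract does not say how the historical-data expectation is turned into a per-indicator score, so those scores are simply taken as inputs here. Weights are approximated with the row geometric-mean method, and the consistency ratio uses Saaty's random index.

```python
import math

# Saaty's random consistency index (RI) for matrix orders 3 to 9.
RANDOM_INDEX = {3: 0.58, 4: 0.90, 5: 1.12, 6: 1.24, 7: 1.32, 8: 1.41, 9: 1.45}


def ahp_weights(matrix):
    """Approximate the AHP priority vector with the row geometric-mean method."""
    n = len(matrix)
    geo_means = [math.prod(row) ** (1.0 / n) for row in matrix]
    total = sum(geo_means)
    return [g / total for g in geo_means]


def consistency_ratio(matrix, weights):
    """Return CR = CI / RI; judgments are conventionally accepted when CR < 0.1."""
    n = len(matrix)
    if n <= 2:
        return 0.0  # 1x1 and 2x2 reciprocal matrices are always consistent
    # Estimate the principal eigenvalue as the mean of (A w)_i / w_i.
    aw = [sum(matrix[i][j] * weights[j] for j in range(n)) for i in range(n)]
    lambda_max = sum(aw[i] / weights[i] for i in range(n)) / n
    ci = (lambda_max - n) / (n - 1)
    return ci / RANDOM_INDEX[n]


def quality_score(weights, indicator_scores):
    """Weighted sum of per-indicator scores (assumed to be on a 0-100 scale)."""
    return sum(w * s for w, s in zip(weights, indicator_scores))


# Hypothetical judgments: completeness vs. uniqueness vs. correctness.
judgments = [
    [1.0, 2.0, 4.0],
    [0.5, 1.0, 2.0],
    [0.25, 0.5, 1.0],
]
weights = ahp_weights(judgments)
cr = consistency_ratio(judgments, weights)
score = quality_score(weights, [90.0, 80.0, 70.0])
print(weights, cr, score)
```

The geometric-mean method is used here only because it needs no linear-algebra library; an eigenvector solver would serve equally well for larger indicator sets.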
[Abstract]: Through the construction of its Intelligent Transportation System (ITS), Shenzhen has established a public information platform for intelligent transportation. The platform collects massive traffic data every day, and the data contains abundant traffic information. High-quality traffic data is the guarantee that ITS makes correct decisions. In the actual collection process, however, equipment failure, external interference, human error, and other factors inevitably leave the original data with quality problems such as missing and redundant records. A cloud computing platform based on Hadoop is used to analyze the quality of the massive taxi data in Shenzhen, and the data processing is oriented toward data quality. The main work includes the following aspects. (1) The achievements and shortcomings of domestic and foreign scholars in data quality assessment and data cleaning are studied, and the research content of this thesis is derived on that basis. (2) According to the project requirements, an evaluation system based on AHP from decision science combined with historical data is designed. The weights of the evaluation indicators are calculated by AHP, and the data quality score is obtained with the expectation of the historical data as the baseline, so that data quality problems are quantified and the state of data quality is shown intuitively. (3) According to the characteristics of the taxi data in Shenzhen, evaluation schemes for GPS data quality and operation data quality are put forward. The main factors that affect data quality are identified first and the respective evaluation indicators are determined; then, for the redundant, incomplete, and erroneous data in the data sets, corresponding evaluation rule algorithms are proposed to judge whether records meet the conditions. (4) Based on the results of the quality analysis, the quality of the Shenzhen taxi data is improved. Duplicate-data cleaning is studied in particular, and a blocked deduplication algorithm based on MapReduce is proposed to delete duplicate data. Cleaning schemes based on the Hadoop platform are then proposed for GPS data and operation data respectively. The schemes mainly address incomplete, redundant, and erroneous data, and traditional cleaning techniques are migrated to the cloud platform. (5) The high-quality GPS data obtained after cleaning is applied to the study of taxi stopping points, and a DBSCAN-based stopping-point detection algorithm is proposed that finds taxi stopping points in trajectory data recorded without passengers on board. The algorithm is divided into three steps: obtaining candidate points, filtering candidate points, and clustering stopping-point candidates. Candidate points are obtained with a candidate point detection algorithm and then filtered by time and space attributes; finally, the advantages and disadvantages of various clustering algorithms are analyzed and DBSCAN is selected to cluster the stopping points. With the established evaluation system, the quality of the taxi GPS data and operation data is evaluated, and the data quality scores of the two data sets are obtained. The scores directly reflect the quality of the data and provide the basis for the later cleaning tasks. The cleaning schemes studied according to the evaluation results effectively improve data quality and support ITS in making correct decisions. The study of taxi stopping points on the cleaned data helps city managers better understand the situation of taxi drivers and is also instructive for drivers looking for passengers.
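The abstract names a MapReduce-based blocked deduplication algorithm but gives no details, so the following Python sketch only illustrates the general idea under stated assumptions. It simulates the map, shuffle, and reduce phases in a single process rather than on a Hadoop cluster. The record layout (`plate,timestamp,lon,lat,speed`), the block key (plate plus the date-and-hour prefix of the timestamp), and all function names are hypothetical.

```python
from collections import defaultdict


def map_record(line):
    """Map step: emit (block_key, record) for one CSV line.

    The block key (plate, 'YYYY-MM-DD HH') is an assumption; any key that
    sends identical records to the same block would work.
    """
    fields = tuple(line.strip().split(","))
    plate, timestamp = fields[0], fields[1]
    return (plate, timestamp[:13]), fields


def shuffle(pairs):
    """Shuffle step: group mapped records by block key."""
    blocks = defaultdict(list)
    for key, record in pairs:
        blocks[key].append(record)
    return blocks


def reduce_block(records):
    """Reduce step: keep the first occurrence of each distinct record."""
    seen = set()
    unique = []
    for record in records:
        if record not in seen:
            seen.add(record)
            unique.append(record)
    return unique


def deduplicate(lines):
    """Run map, shuffle, and reduce over all non-empty lines."""
    blocks = shuffle(map_record(line) for line in lines if line.strip())
    result = []
    for key in sorted(blocks):
        result.extend(reduce_block(blocks[key]))
    return result


# Made-up sample records; the second line duplicates the first.
sample = [
    "B12345,2014-06-01 08:00:05,114.05,22.54,30",
    "B12345,2014-06-01 08:00:05,114.05,22.54,30",
    "B12345,2014-06-01 09:10:00,114.06,22.55,0",
    "B67890,2014-06-01 08:00:05,114.10,22.60,12",
]
cleaned = deduplicate(sample)
print(len(sample), len(cleaned))
```

Blocking keeps each reducer's working set small: duplicates can only collide inside the same block, so no reducer has to hold the full data set in memory.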
【Degree-granting institution】: Wuhan University of Technology
【Degree level】: Master's
【Year degree granted】: 2015
【Classification number】: U495
Article ID: 1454849
Article link: http://sikaile.net/kejilunwen/daoluqiaoliang/1454849.html