Research on Clustering Algorithms Based on Data Mass and Potential Entropy
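Keywords for this article: Research on Clustering Algorithms Based on Data Mass and Potential Entropy. Source: Wuhan University, 2016 doctoral dissertation. Document type: degree thesis.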
More related articles: vector data field; data mass; mass clustering; potential entropy; automatic face clustering
[Abstract]: With the development of computer science, human society has entered the era of big data. In this era, data analysis has become the key tool for exploiting big data resources: whoever can discover the value in data gains the initiative. Data mining, a core data analysis technology, has broad application prospects. It uncovers knowledge hidden in data, makes full use of data resources, and to some extent resolves the problem of abundant data but scarce knowledge. Data mining has three main modes of analysis: classification, association, and clustering. Classification and association are treated as supervised learning algorithms in machine learning, whereas clustering is unsupervised. Big data work tends to emphasize mining and learning over the full data set, and a suitable training set for an algorithm is often hard to obtain. Unsupervised learning therefore fits the big data setting better, and cluster analysis has become a research focus in data mining. Addressing the clustering problem in data mining, this thesis proposes the theory of the vector data field, the new concept of data mass in a data field (the Chinese term 质量 means both "mass" and "quality", so this is also rendered "data quality"), a data mass clustering algorithm, and a density peak clustering algorithm based on potential entropy. Two case studies, facial expression recognition and automatic face clustering, are used to test the theory and methods. First, the data field is a model for analyzing data; classical data field theory uses potential energy to describe how data are distributed within a data set. Building on it, the thesis proposes the vector data field, so that a data field can describe not only the distribution of data but also its tendency of motion, and it unifies the vector data field and scalar data field models through the Hamilton (nabla) operator.
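The relation between a scalar data field and a vector data field can be illustrated with a small sketch. It assumes the Gaussian-like potential commonly used in the data field literature, phi(x) = sum_j m_j * exp(-(||x - x_j|| / sigma)^2), and takes the vector field to be the gradient of that potential. The abstract does not state the thesis's exact potential function or sign convention, so the function names `potential` and `field_vector`, the parameter `sigma`, and the formula itself are assumptions made for illustration.

```python
import numpy as np

def potential(points, data, mass, sigma):
    """Scalar data field (assumed form): phi(x) = sum_j m_j * exp(-(|x - x_j| / sigma)^2)."""
    diff = points[:, None, :] - data[None, :, :]   # shape (P, N, D)
    dist2 = np.sum(diff ** 2, axis=-1)             # squared distances, shape (P, N)
    return np.exp(-dist2 / sigma ** 2) @ mass      # shape (P,)

def field_vector(points, data, mass, sigma):
    """Vector data field, taken here as the gradient of the assumed potential."""
    diff = points[:, None, :] - data[None, :, :]
    dist2 = np.sum(diff ** 2, axis=-1)
    weights = mass * np.exp(-dist2 / sigma ** 2)   # shape (P, N)
    # d/dx exp(-|x - x_j|^2 / sigma^2) = -2 (x - x_j) / sigma^2 * exp(...)
    return -2.0 / sigma ** 2 * np.einsum("pn,pnd->pd", weights, diff)

if __name__ == "__main__":
    data = np.array([[0.0, 0.0], [1.0, 1.0]])
    mass = np.array([1.0, 0.5])
    probes = np.array([[0.5, 0.0]])
    print(potential(probes, data, mass, sigma=1.0))
    print(field_vector(probes, data, mass, sigma=1.0))
```

Under these assumptions the gradient at a probe location points toward regions of higher potential, that is, toward denser or heavier data, which is one way to read the "tendency of motion" mentioned above.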
Second, the concept of the data field derives from physical fields, and objects in a physical field have mass; data in a data field should therefore have mass as well. The thesis proposes the new concept of data mass: an inherent attribute of a data object within its data set that changes with the mining perspective and, in essence, measures the weight of the data object under a specific mining perspective. For the properties of a data field that do not change with the mining perspective, the thesis introduces the basic matrix of the data field and establishes a system of linear equations relating the basic matrix, data mass, and data potential. The basic matrix casts data field computation in matrix form. On this basis the thesis proposes an interior convex point method for solving the optimal data mass, which removes the dependence on the choice of initial point that affects the optimal mass computed in classical data field theory. Building on the equations linking potential and mass, and borrowing the idea of the "learning machine", it further proposes a method for solving the optimal data mass from a non-homogeneous system of linear equations, which improves the efficiency of the computation. Next, a data mass clustering algorithm is built on data mass. Data mass represents how densely the data are packed, which identifies the cluster centers, and clustering is then completed in a single iteration. The method addresses two weaknesses of traditional partitioning clustering algorithms: inaccurate cluster centers and the need to supply the number of clusters in advance. For the density peak clustering algorithm published in Science, which requires a manually set threshold, the thesis proposes a density peak clustering algorithm based on potential entropy.
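The linear relation among the basic matrix, data mass, and potential described above can be sketched as follows. The sketch assumes that the basic matrix K holds Gaussian-like pairwise influences, K[i, j] = exp(-(||x_i - x_j|| / sigma)^2), so that the potentials satisfy phi = K m, and it recovers the masses from given potentials by solving that non-homogeneous linear system directly. The kernel, the names `basic_matrix`, `potentials_from_mass`, and `mass_from_potentials`, and the use of a plain linear solve are illustrative assumptions; the abstract does not give the thesis's actual optimality criterion or solver.

```python
import numpy as np

def basic_matrix(data, sigma):
    """Assumed basic matrix: K[i, j] = exp(-(|x_i - x_j| / sigma)^2).

    It depends only on the data and sigma, not on the masses.
    """
    diff = data[:, None, :] - data[None, :, :]
    return np.exp(-np.sum(diff ** 2, axis=-1) / sigma ** 2)

def potentials_from_mass(K, mass):
    """Potential at every data point as a linear function of the masses: phi = K m."""
    return K @ mass

def mass_from_potentials(K, phi):
    """Solve the non-homogeneous system K m = phi for the masses."""
    return np.linalg.solve(K, phi)

if __name__ == "__main__":
    data = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [2.0, 2.0]])
    K = basic_matrix(data, sigma=1.0)
    mass = np.array([0.4, 0.3, 0.2, 0.1])
    phi = potentials_from_mass(K, mass)
    print(mass_from_potentials(K, phi))
```

For distinct data points a Gaussian kernel matrix is symmetric positive definite, so the system has a unique solution, although it can become ill-conditioned when sigma is large relative to the spacing of the data.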
Based on the relationship between Shannon entropy and clustering uncertainty, this method establishes a function relating Shannon entropy to the threshold and uses it to determine the best threshold for each data set, which makes the clustering algorithm more generally applicable. Finally, the new theory, concept, and methods are tested on facial expression recognition and automatic face clustering. The results show that data mass reflects the weight of each pixel in a facial expression well, yields good expression feature faces, and gives satisfactory recognition results. In automatic face clustering, the data mass clustering algorithm and the potential-entropy-based density peak clustering algorithm outperform classical clustering algorithms such as the original density peak clustering algorithm and DBSCAN.
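The idea of choosing a threshold from Shannon entropy can be sketched in a few lines. The sketch assumes that the threshold plays the role of the impact factor sigma of a Gaussian-like data field with unit masses, that the potentials are normalized into a probability distribution, and that the preferred sigma is the candidate with the lowest entropy. These are illustrative assumptions: the abstract does not give the thesis's actual relation function, and `potential_entropy` and `select_sigma` are hypothetical names.

```python
import numpy as np

def potential_entropy(data, sigma):
    """Shannon entropy of the normalized potentials at the data points (unit masses)."""
    diff = data[:, None, :] - data[None, :, :]
    phi = np.exp(-np.sum(diff ** 2, axis=-1) / sigma ** 2).sum(axis=1)
    p = phi / phi.sum()            # every phi >= 1 because of the self term, so p > 0
    return float(-np.sum(p * np.log(p)))

def select_sigma(data, candidates):
    """Return the candidate sigma with the lowest potential entropy, plus all entropies."""
    entropies = [potential_entropy(data, s) for s in candidates]
    return candidates[int(np.argmin(entropies))], entropies

if __name__ == "__main__":
    # One tight group of five points and one distant point.
    data = np.array([[0.0, 0.0], [0.1, 0.0], [0.0, 0.1],
                     [0.1, 0.1], [0.05, 0.05], [3.0, 3.0]])
    best, entropies = select_sigma(data, [0.001, 0.5, 1000.0])
    print(best, entropies)
```

With a very small or very large sigma, all points receive almost the same potential and the entropy approaches its maximum, log(n). An intermediate sigma separates dense points from sparse ones and lowers the entropy, which is why the minimum is used as the selection rule in this sketch.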
【Degree-granting institution】: Wuhan University
【Degree level】: Doctoral
【Year degree conferred】: 2016
【Classification number】: TP311.13
Related doctoral dissertations (top 1)
1 Wang Dakui (王大魁); Research on Clustering Algorithms Based on Data Mass and Potential Entropy [D]; Wuhan University; 2016
Article ID: 1339573
Article link: http://sikaile.net/shoufeilunwen/xxkjbs/1339573.html