Research on Clustering Algorithms for Incomplete Data

Published: 2018-08-15 15:53
[Abstract]: Since the beginning of the 21st century, the connections among people, and between people and the physical world, have grown ever closer, and data is now generated everywhere. Yet while the scale of data has grown almost explosively, data quality has not improved to match and cannot be adequately guaranteed: during initial acquisition, and during exchange and propagation, many things can go wrong and leave the final data with quality problems. Commonly used clustering algorithms usually need high-quality data to work properly, and they tend to perform poorly when big data has quality problems. The usual remedy is to clean the low-quality data first and only then run data mining operations such as clustering. Cleaning large-scale data, however, is often very expensive in time, and the result may still be unsatisfactory: after a great deal of time spent on cleaning, the data may still contain quality problems that cannot be removed, so the cleaning does not significantly improve the quality of the mining results. Clustering directly on weakly available data therefore offers a new approach to this problem: cluster without cleaning the data, or cluster on data that has not been fully cleaned.

This thesis studies how to perform cluster analysis on incomplete data sets. It first analyzes the spatial structure of incomplete data in order to understand how incompleteness affects clustering, and on that basis designs three algorithms. The fuzzy-clustering-based algorithm for incomplete data treats the missing values as optimization variables in the clustering iteration and updates them at every iteration until the clustering is complete. The density-analysis-based algorithm for incomplete data formalizes two core requirements of clustering: a cluster center must be a point surrounded by a high density of points, and it must lie as far as possible from other high-density points. Once the cluster centers are determined, the remaining points are assigned to clusters according to a chosen strategy. The information-theory-based algorithm for incomplete data views clustering as a process in which a record's uncertainty about its cluster keeps changing: as attributes are added, the record's uncertainty about its class decreases, until the record can be assigned to the cluster with the least uncertainty. For incomplete data, the basic information-theoretic parameters and the information parameters of the clusters must first be estimated, and the clustering is completed by combining the two.

At the end of the design of each algorithm, the thesis evaluates the algorithm experimentally.
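The abstract gives no formulas for the fuzzy-clustering-based algorithm, so the following is only a minimal sketch of the general idea it describes: missing entries are treated as extra optimization variables and re-estimated in every iteration of fuzzy c-means. The update rule for missing values used here (a membership-weighted average of the cluster centers) follows the optimal completion strategy known from the fuzzy clustering literature and is an assumption, not necessarily the thesis's method. The function names, the `None` encoding of missing values, the deterministic farthest-point initialization, and the sample data are all hypothetical.

```python
def _sq_dist(p, q):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def _memberships(x, centers, m):
    """Standard fuzzy c-means membership update for fuzzifier m > 1."""
    u = []
    for p in x:
        # Clamp distances so a point sitting exactly on a center does not divide by zero.
        inv = [max(_sq_dist(p, c), 1e-12) ** (-1.0 / (m - 1.0)) for c in centers]
        total = sum(inv)
        u.append([v / total for v in inv])
    return u


def fcm_incomplete(data, n_clusters, m=2.0, n_iter=100):
    """Fuzzy c-means on records whose missing entries are None.

    Assumes every column has at least one observed value.
    Returns (hard labels, cluster centers, completed data).
    """
    n, d = len(data), len(data[0])
    # Initial guess for each missing entry: mean of the observed values in its column.
    col_means = []
    for j in range(d):
        observed = [row[j] for row in data if row[j] is not None]
        col_means.append(sum(observed) / len(observed))
    x = [[row[j] if row[j] is not None else col_means[j] for j in range(d)]
         for row in data]

    # Deterministic farthest-point initialization (an illustrative choice).
    centers = [list(x[0])]
    while len(centers) < n_clusters:
        far = max(x, key=lambda p: min(_sq_dist(p, c) for c in centers))
        centers.append(list(far))

    for _ in range(n_iter):
        u = _memberships(x, centers, m)
        # Center update: mean of the completed records weighted by u^m.
        for k in range(n_clusters):
            w = [u[i][k] ** m for i in range(n)]
            total = sum(w)
            centers[k] = [sum(w[i] * x[i][j] for i in range(n)) / total
                          for j in range(d)]
        # Missing-value update: each missing entry becomes the u^m-weighted
        # average of the corresponding coordinate of the cluster centers.
        for i in range(n):
            w = [u[i][k] ** m for k in range(n_clusters)]
            total = sum(w)
            for j in range(d):
                if data[i][j] is None:
                    x[i][j] = sum(w[k] * centers[k][j]
                                  for k in range(n_clusters)) / total

    u = _memberships(x, centers, m)
    labels = [max(range(n_clusters), key=lambda k: u[i][k]) for i in range(n)]
    return labels, centers, x


if __name__ == "__main__":
    sample = [[0.0, 0.1], [0.2, None], [0.1, 0.0],
              [10.0, 10.1], [None, 9.9], [10.2, 10.0]]
    labels, centers, completed = fcm_incomplete(sample, n_clusters=2)
    print(labels)
    print(completed)
```

Because the missing entries are updated inside the same loop as the memberships and centers, no separate cleaning or imputation pass is needed before clustering, which is the point the abstract makes about working directly on incomplete data.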
[Degree-granting institution]: Harbin Institute of Technology
[Degree level]: Master's
[Year degree conferred]: 2017
[Classification number]: TP311.13




Article ID: 2184685


Article link: http://sikaile.net/kejilunwen/ruanjiangongchenglunwen/2184685.html

