面向不確定數(shù)據(jù)的頻繁模式挖掘方法研究
本文選題:反單調(diào)性 + 頻繁模式 ; 參考:《山東師范大學(xué)》2016年博士論文
【摘要】:大數(shù)據(jù)時(shí)代悄然到來(lái),數(shù)據(jù)挖掘技術(shù)正面臨著前所未有的機(jī)遇和挑戰(zhàn)。作為數(shù)據(jù)挖掘領(lǐng)域的重要研究課題,頻繁模式挖掘和關(guān)聯(lián)規(guī)則發(fā)現(xiàn)受到了持續(xù)而廣泛的關(guān)注,并且涌現(xiàn)了大量經(jīng)典理論、高效算法和新興應(yīng)用領(lǐng)域。挖掘頻繁項(xiàng)集,是關(guān)聯(lián)規(guī)則發(fā)現(xiàn)中的關(guān)鍵技術(shù)和步驟,并決定了關(guān)聯(lián)規(guī)則的總體性能,目前已廣泛應(yīng)用于市場(chǎng)銷售、文本挖掘、公眾健康等各個(gè)領(lǐng)域。在實(shí)際應(yīng)用中,由于技術(shù)手段有限、測(cè)量設(shè)備誤差、通訊開(kāi)銷限制和用戶隱私保護(hù)等諸多因素的影響,獲得的原始數(shù)據(jù)往往存在不確定性。同時(shí),受到主客觀條件的限制,頻繁模式挖掘過(guò)程中也會(huì)帶來(lái)一系列的不確定性,這些不確定性在挖掘過(guò)程中不斷傳播和積累,可能導(dǎo)致挖掘出的知識(shí)與真實(shí)結(jié)果之間存在較大差距甚至毫無(wú)意義。而傳統(tǒng)的挖掘方法卻未將這些因素考慮進(jìn)去,只簡(jiǎn)單地認(rèn)為挖掘出的知識(shí)一般都是有用的和確定的,致使傳統(tǒng)的頻繁模式挖掘方法在處理不確定數(shù)據(jù)時(shí)面臨著得到的挖掘結(jié)果異常卻難以解釋的窘態(tài)。這顯然是不科學(xué)和不妥當(dāng)?shù)。因?針對(duì)不確定頻繁模式挖掘的研究顯得尤為重要,并日益受到廣大研究人員的關(guān)注。本文主要針對(duì)兩類典型的不確定性數(shù)據(jù),即概率數(shù)據(jù)和容錯(cuò)數(shù)據(jù),進(jìn)行概率頻繁模式挖掘和近似頻繁模式挖掘的研究,并應(yīng)用在中醫(yī)藥診療數(shù)據(jù)環(huán)境下,實(shí)現(xiàn)基于不確定數(shù)據(jù)的高效頻繁模式挖掘。本文的主要工作和成果總結(jié)如下:1.針對(duì)概率數(shù)據(jù)中垂直格式的數(shù)據(jù)表示形式,提出了一種基于Eclat框架的概率頻繁項(xiàng)集精確挖掘算法(UBEclat)。首先,對(duì)于采用垂直數(shù)據(jù)格式的概率數(shù)據(jù),本文設(shè)計(jì)了一種適用于Eclat框架,旨在提高算法執(zhí)行效率的雙向排序策略,然后基于概率頻度的定義,提出了采用分而治之方法的概率頻繁項(xiàng)集精確挖掘算法。在基準(zhǔn)數(shù)據(jù)集和真實(shí)數(shù)據(jù)集上的對(duì)比實(shí)驗(yàn)表明,UBEclat算法能夠依據(jù)支持度的概率分布,準(zhǔn)確挖掘出所有概率頻繁項(xiàng)集。這為有效解決概率頻繁項(xiàng)集的精確挖掘問(wèn)題提供了新的思路。2.針對(duì)概率頻繁項(xiàng)集精確挖掘算法執(zhí)行效率較低,運(yùn)行時(shí)間過(guò)長(zhǎng)的問(wèn)題,基于概率數(shù)據(jù)的可能性理論,提出了一種高效的概率頻繁項(xiàng)集近似挖掘算法(NDUEclat)。結(jié)合Eclat框架和近似方法的優(yōu)勢(shì),NDUEclat算法采用分而治之的方法,應(yīng)用大數(shù)定律優(yōu)化挖掘過(guò)程,改進(jìn)了頻繁項(xiàng)集挖掘的效率。在基準(zhǔn)數(shù)據(jù)集和真實(shí)數(shù)據(jù)集上的多組對(duì)比實(shí)驗(yàn)也驗(yàn)證了該算法具有良好的挖掘性能。目前,這也是第一個(gè)基于支持度的概率分布,在垂直數(shù)據(jù)格式的概率數(shù)據(jù)中高效挖掘不確定頻繁項(xiàng)集的近似算法。3.針對(duì)NP-hard類的容錯(cuò)頻繁模式挖掘問(wèn)題,提出了一種將容錯(cuò)數(shù)據(jù)庫(kù)映射為事務(wù)信息系統(tǒng),基于粗糙集理論挖掘近似頻繁模式的新方法。依據(jù)挖掘出的頻繁項(xiàng)目確定決策表中的決策屬性;基于粗糙集理論中上近似和下近似概念,確定近似頻繁模式的匹配程度。在基準(zhǔn)數(shù)據(jù)集和真實(shí)數(shù)據(jù)集上進(jìn)行的對(duì)比實(shí)驗(yàn)證實(shí)了該方法在挖掘的準(zhǔn)確率指標(biāo)上,比以往方法有更好的性能表現(xiàn)。顯然,基于粗糙集理論的近似挖掘方法為有效提高近似頻繁模式挖掘的準(zhǔn)確性和適用性提供了新的思路。4.以減少敏感參數(shù)設(shè)置的影響、提高挖掘效率的同時(shí)保證實(shí)際挖掘結(jié)果的可用性為目的,研究了基于容錯(cuò)數(shù)據(jù)的粗糙集理論,提出了一種挖掘近似頻繁閉模式的新模型。新模型主要由三部分組成:用聚類算法完成數(shù)據(jù)預(yù)處理;對(duì)同一類中的事務(wù)依據(jù)粗糙集理論進(jìn)行屬性約簡(jiǎn)生成核模式;將核模式作為初始種子構(gòu)建等價(jià)類,用分而治之的方法挖掘近似頻繁閉模式。在傳統(tǒng)中醫(yī)藥數(shù)據(jù)集上的實(shí)驗(yàn)結(jié)果表明,該模型可以更精準(zhǔn)地表達(dá)近似頻繁模式,有利于實(shí)現(xiàn)基于中醫(yī)診療應(yīng)用的知識(shí)發(fā)現(xiàn)。綜上所述,本文針對(duì)概率數(shù)據(jù)中如何提高頻繁模式挖掘的效率、如何屏蔽容錯(cuò)數(shù)據(jù)中因數(shù)據(jù)表達(dá)不準(zhǔn)確而對(duì)挖掘結(jié)果造成的影響以及如何確定容錯(cuò)率以獲得有意義的挖掘結(jié)果等問(wèn)題,從數(shù)據(jù)庫(kù)的特點(diǎn)和數(shù)據(jù)的表示方式、模式挖掘的類型、具體挖掘技術(shù)的選擇等幾個(gè)不同的角度提出了相應(yīng)的解決方案,并通過(guò)實(shí)驗(yàn)驗(yàn)證了它們的有效性。本文工作可以為今后面向不確定數(shù)據(jù)的頻繁模式挖掘研究提供幫助。
[Abstract]:As the big data era is coming, data mining technology is facing unprecedented opportunities and challenges. As an important research topic in the field of data mining, frequent pattern mining and association rules discovery have been paid more and more attention, and a large number of classical theories, high efficiency algorithms and new application fields are emerging, and frequent itemsets are excavated. The key technologies and steps of association rules are found, and the overall performance of association rules is determined. It has been widely used in various fields, such as market sales, text mining, public health and so on. In practical applications, the influence of many factors, such as limited technical means, measurement equipment error, communication overhead limit and user privacy protection, has been obtained in practical application. The original data often have uncertainty. At the same time, limited by the subjective and objective conditions, the frequent pattern mining also brings a series of uncertainties. These uncertainties are constantly spread and accumulated during the mining process, which may lead to a large gap or even meaningless between the knowledge and the real results. The mining method does not take these factors into consideration, only simply thinks that the mining knowledge is generally useful and definite, which causes the traditional frequent pattern mining method to face the abnormal but difficult to explain when dealing with the uncertain data. This is obviously unscientific and inappropriate. The study of frequent pattern mining is particularly important and is increasingly concerned by the researchers. This paper mainly focuses on two types of typical uncertain data, probability data and fault-tolerant data, the research of probability frequent pattern mining and approximate frequent pattern mining, and is applied to the diagnosis and treatment data environment of traditional Chinese medicine. The main work and achievements of this paper are summarized as follows: 1. in view of the data representation of vertical format in probabilistic data, an accurate mining algorithm for probability frequent itemsets based on Eclat framework (UBEclat) is proposed. First, this paper designs the probability data for using the vertical data format. The Eclat framework is suitable for improving the efficiency of the algorithm. Then, based on the definition of probability frequency, a precise mining algorithm for probability frequent itemsets using a divide and conquer method is proposed. The comparison experiment on the datum data set and the real data set shows that the UBEclat algorithm can be based on the probability distribution of the support degree. In order to effectively solve the problem of accurate mining of probability frequent itemsets, it provides a new idea.2. for the problem of low execution efficiency and long running time for the accurate mining algorithm of probability frequent itemsets. Based on probability data possibility theory, an efficient approximate mining of probability frequent itemsets is proposed. Algorithm (NDUEclat). Combining the advantages of Eclat framework and approximate method, NDUEclat algorithm adopts a divide and conquer method, uses large number law to optimize the mining process and improves the efficiency of frequent itemsets mining. The multi group comparison experiments on datum dataset and real data set also verify that the algorithm has good mining performance. It is the first probability distribution based on the support degree. An approximate algorithm for mining the uncertain frequent itemsets in the probability data of the vertical data format.3. is used to solve the fault tolerant frequent pattern mining problem of the NP-hard class. A new method of mapping the fault tolerant database into a transaction information system based on the rough set theory is proposed. Method. The decision attributes in the decision table are determined according to the frequent items excavated, and the matching degree of the approximate frequent patterns is determined based on the upper and lower approximation concepts in the rough set theory. The comparison experiments on the datum data set and the real dataset confirm that the method has a better performance than the previous method on the accuracy index of the mining. It is obvious that the approximate mining method based on rough set theory provides a new idea to effectively improve the accuracy and applicability of the approximate frequent pattern mining,.4. to reduce the influence of the setting of sensitive parameters, improve the efficiency of mining and ensure the availability of the actual mining results, and study the rough set based on fault-tolerant data. In this paper, a new model for mining approximately frequent closed patterns is proposed. The new model is composed of three parts: data preprocessing by clustering algorithm, and attribute reduction in the same class of transactions based on rough set theory to generate the kernel model; the kernel model is used as the initial seed to construct the equivalent class, and a divide and conquer method is used to excavate the approximate frequency. The experimental results on traditional Chinese medicine data set show that the model can express approximate frequent patterns more accurately and be helpful to realize knowledge discovery based on the application of diagnosis and treatment of traditional Chinese medicine. In summary, this paper aims at how to improve the efficiency of frequent pattern mining in the probability data and how to shield the data from fault tolerance data to express inaccurate data. In order to determine the impact of the mining results and how to determine the fault tolerance rate to obtain meaningful mining results, the corresponding solutions are proposed from the characteristics of the database, the representation of data, the type of pattern mining, the selection of specific mining techniques and so on, and their effectiveness is verified by experiments. This work can help future research on frequent pattern mining for uncertain data.
【學(xué)位授予單位】:山東師范大學(xué)
【學(xué)位級(jí)別】:博士
【學(xué)位授予年份】:2016
【分類號(hào)】:TP311.13
【參考文獻(xiàn)】
相關(guān)期刊論文 前10條
1 劉文遠(yuǎn);李承芳;陳子軍;;面向不確定數(shù)據(jù)的概率閾值可見(jiàn)最近鄰查詢算法[J];小型微型計(jì)算機(jī)系統(tǒng);2013年08期
2 郭鑫;顏一鳴;徐洪智;董堅(jiān)峰;;不確定樹數(shù)據(jù)庫(kù)中的動(dòng)態(tài)聚類算法[J];小型微型計(jì)算機(jī)系統(tǒng);2013年06期
3 蔣濤;高云君;張彬;周傲英;樂(lè)光學(xué);;不確定數(shù)據(jù)查詢處理[J];電子學(xué)報(bào);2013年05期
4 王爽;王國(guó)仁;;面向不確定感知數(shù)據(jù)的頻繁項(xiàng)查詢算法[J];計(jì)算機(jī)學(xué)報(bào);2013年03期
5 馮培恩;劉嶼;邱清盈;李立新;;提高Eclat算法效率的策略[J];浙江大學(xué)學(xué)報(bào)(工學(xué)版);2013年02期
6 傅向華;陳冬劍;王志強(qiáng);;基于倒排索引位運(yùn)算的深度優(yōu)先頻繁項(xiàng)集挖掘[J];小型微型計(jì)算機(jī)系統(tǒng);2012年08期
7 張李一;張守志;施伯樂(lè);;一種不確定性數(shù)據(jù)頻繁模式的垂直挖掘算法[J];小型微型計(jì)算機(jī)系統(tǒng);2012年02期
8 張玉芳;熊忠陽(yáng);耿曉斐;陳劍敏;;Eclat算法的分析及改進(jìn)[J];計(jì)算機(jī)工程;2010年23期
9 周傲英;金澈清;王國(guó)仁;李建中;;不確定性數(shù)據(jù)管理技術(shù)研究綜述[J];計(jì)算機(jī)學(xué)報(bào);2009年01期
10 周海巖;;采用頻繁項(xiàng)目鏈表變換的頻繁項(xiàng)目集挖掘算法[J];小型微型計(jì)算機(jī)系統(tǒng);2008年07期
,本文編號(hào):1884473
本文鏈接:http://sikaile.net/shoufeilunwen/xxkjbs/1884473.html