代價敏感屬性約簡的歸并算法研究

發(fā)布時間：2018-09-05 17:04

【摘要】：數(shù)據(jù)挖掘又稱數(shù)據(jù)庫中的知識發(fā)現(xiàn)(Knowledge Discover in Database, KDD),是目前人工智能和數(shù)據(jù)庫領(lǐng)域研究的熱點問題。在現(xiàn)實世界中,數(shù)據(jù)集中存儲的屬性有幾十、幾百、甚至上千種。這些屬性中有很多是冗余的,它們會干擾數(shù)據(jù)挖掘的過程,也很大程度上會影響算法的效率。因此,人們提出了屬性約簡這一預(yù)處理技術(shù)。另一方面,現(xiàn)實世界中的行為或者事物都有各種代價,如測試代價、誤分類代價、延遲代價等,涉及金錢、時間、人工等方面的開銷。代價敏感學(xué)習(xí)致力于涉及各類代價的挖掘問題。當(dāng)前被研究的代價敏感屬性約簡問題包括：最小測試代價屬性約簡問題、簡單公共測試代價屬性約簡問題、最小測試時間代價問題等。人們將不同算法框架應(yīng)用于這些具體問題。啟發(fā)式算法的速度很快,但是由于它們常常會收斂于局部最優(yōu)解,因此正確率不高�；厮菟惴m然能夠保證找到最優(yōu)解,但是運行時間往往不能被接受。仿生算法也常常能找到最優(yōu)解,不過其耗費的時間代價過大。最近還有學(xué)者提出半貪心算法,能夠在一定時間內(nèi)得到較好的結(jié)果。本文將分治算法與回溯算法相結(jié)合,提出一種歸并算法,以改善回溯算法的不足。本文的歸并算法包含三個關(guān)鍵技術(shù)：分組與合并、回溯算法、以及競爭機制。該算法先將屬性隨機分組,得到一些屬性子集,組的大小g對算法的性能有很大的影響。在極端情況下,組的大小與屬性數(shù)目相同的情況下,歸并算法將退回為回溯算法。屬性子集通過回溯算法得到屬性子集的約簡,然后將每對相鄰的約簡合并成一個新的屬性子集。重復(fù)以上過程直到只剩一個屬性子集,這個屬性子集的約簡就是原問題的約簡。屬性分組后,全局重要的屬性可能在局部約簡時被刪除,導(dǎo)致歸并算法得到的解不是全局最優(yōu)解。因此我們采用競爭機制,運行歸并算法p次,得到p個解,再從這p個解中選取最優(yōu)解。本文將該算法運用于最小測試代價屬性約簡問題、簡單公共測試代價屬性約簡問題以及最小測試時間代價問題這三個問題。并使用來自UCI(University of California-Irvine)數(shù)據(jù)庫中的四個真實數(shù)據(jù)集對提出的歸并算法進行實驗。其中每種數(shù)據(jù)集使用了三種不同分布的測試代價。通過實驗我們得知：競爭機制能有效提高結(jié)果的質(zhì)量；對于不同問題,p值大于6之后算法結(jié)果趨于穩(wěn)定；最優(yōu)g值對于不同情況略有不同,在最小測試代價問題中為6,簡單公共測試代價、最小測試時間代價問題中均為7。與現(xiàn)有的啟發(fā)式算法、蟻群算法和回溯算法相比,歸并算法在保持較高的正確率的情況下,能夠大大縮短運行時間。因此,本文提出的歸并算法是針對該類問題的一種有效并且高效的算法。
[Abstract]:Data mining, also known as knowledge discovery (Knowledge Discover in Database, KDD),) in database, is a hot issue in the field of artificial intelligence and database. In the real world, data sets store dozens, hundreds, or even thousands of attributes. Many of these attributes are redundant, which interfere with the process of data mining and greatly affect the efficiency of the algorithm. Therefore, the preprocessing technique of attribute reduction is proposed. On the other hand, behaviors or things in the real world have various costs, such as the cost of testing, the cost of misclassification, the cost of delay, the cost of money, time, labor, and so on. Cost-sensitive learning focuses on mining problems involving various costs. The cost sensitive attribute reduction problem which has been studied at present includes: the minimum test cost attribute reduction problem, the simple common test cost attribute reduction problem, the minimum test time cost problem and so on. Different algorithms are applied to these specific problems. The speed of heuristic algorithms is very fast, but the accuracy is not high because they often converge to local optimal solutions. Although the backtracking algorithm can guarantee to find the optimal solution, the running time is often unacceptable. Bionic algorithms can often find the optimal solution, but the cost of time is too large. Recently, some scholars have proposed a half-greedy algorithm, which can get better results in a certain time. In this paper, a merging algorithm is proposed to improve the deficiency of the backtracking algorithm by combining the partition-and-conquer algorithm with the backtracking algorithm. The merging algorithm consists of three key technologies: grouping and merging, backtracking algorithm, and competition mechanism. In this algorithm, the attributes are grouped randomly and some subsets of attributes are obtained. The size of the group g has a great influence on the performance of the algorithm. In extreme cases, when the group size is the same as the number of attributes, the merging algorithm is returned to the backtracking algorithm. The reduction of attribute subset is obtained by backtracking algorithm, and then each pair of adjacent reduction is merged into a new attribute subset. Repeat the above process until there is only a subset of attributes whose reduction is the reduction of the original problem. After the attributes are grouped, the globally important attributes may be deleted in the local reduction, resulting in the solution obtained by the merging algorithm is not the global optimal solution. So we use the competition mechanism, run the merging algorithm p times, obtain p solutions, and then select the optimal solution from the p solution. In this paper, the algorithm is applied to three problems: the minimum test cost attribute reduction problem, the simple common test cost attribute reduction problem and the minimum test time cost problem. Four real data sets from UCI (University of California-Irvine) database are used to test the proposed merging algorithm. Each data set uses three different distribution of test costs. Through experiments, we know that the competition mechanism can effectively improve the quality of the results, the algorithm results tend to be stable when the value of p is greater than 6 for different problems, and the optimal g value is slightly different for different cases. In the minimum test cost problem is 6, the simple common test cost and the minimum test time cost problem are all 7. 5%. Compared with the existing heuristic algorithm, ant colony algorithm and backtracking algorithm, the merging algorithm can greatly shorten the running time while maintaining a higher accuracy. Therefore, the merging algorithm proposed in this paper is an effective and efficient algorithm for this kind of problems.
【學(xué)位授予單位】：西南石油大學(xué)
【學(xué)位級別】：碩士
【學(xué)位授予年份】：2016
【分類號】：TP18;TP311.13

【參考文獻】

相關(guān)期刊論文前10條

1 孟蕓;王U，

本文編號：2224887

資料下載

論文發(fā)表

支付寶下載

Download by Alipay
微信下載

Download by Wechat
會員下載

Download by Member

本文鏈接：http://sikaile.net/kejilunwen/zidonghuakongzhilunwen/2224887.html

上一篇：基于遙感技術(shù)的湖泊歷史變遷檢測及展示
下一篇：基于粒子群優(yōu)化的網(wǎng)絡(luò)視頻流特征選擇與識別

論文發(fā)表

·知網(wǎng)|萬方|維普|龍源|省級|國家級|科技核心|北大核心|南大核心CSSCI|EI|SCI|SSCI|

天堂国产午夜亚洲专区-少妇人妻综合久久蜜臀-国产成人户外露出视频在线-国产91传媒一区二区三区

代價敏感屬性約簡的歸并算法研究