Research on Key Learning Techniques for Computer-Aided Medical Image Diagnosis
Published: 2018-09-17 16:42
[Abstract]: Computer-aided diagnosis (CAD), in which computer techniques assist radiologists in case diagnosis, plays an increasingly important role in early breast cancer screening and can help reduce mortality among breast cancer patients. Clinically, labeled case samples are difficult to collect, and negative case samples far outnumber positive ones, so CAD applications face small-sample, imbalanced-data learning problems. Imbalanced and small-sample learning concerns learning performance on data sets with severe class asymmetry and insufficiently expressed information, and it matters in many real-world applications: although classical machine learning and data mining have succeeded in many practical settings, learning from small-sample and imbalanced data remains a major challenge. This dissertation systematically analyzes the main causes of the performance degradation of machine learning under small-sample and imbalanced conditions, and surveys current effective methods for these problems. Building on the observation that common undersampling methods tend to discard class information when handling imbalanced data, the dissertation focuses on how to process imbalanced data reasonably and effectively. Two new undersampling methods are proposed that extract the samples richest in class information, thereby alleviating the information loss caused by undersampling. For the small-sample problem, a new class-labeling algorithm is proposed that enlarges the training set by automatically labeling unlabeled samples while effectively reducing the labeling errors that commonly occur in this process.

This dissertation focuses on learning techniques for small-sample, imbalanced data, centered on resampling of imbalanced data sets and class labeling of unlabeled samples. The main contributions are:

(1) To address the small-sample learning problem caused by the difficulty of collecting labeled cases in CAD applications, the dissertation exploits the abundant unlabeled samples to enlarge the training set. Labeling errors often occur in this process, however, and mislabeled samples act like noise and significantly degrade learning performance. For this mislabeling problem in semi-supervised learning, the dissertation proposes a Hybrid Class Labeling algorithm that labels samples from three different perspectives: geometric distance, probability distribution, and semantic concepts. The three labelers rest on different principles and differ markedly; only unlabeled samples on which all three agree are added to the training set. To further reduce the adverse influence of any remaining mislabeled samples, a pseudo-label membership degree is introduced into SVM (Support Vector Machine) learning, controlling how much each sample contributes to training. Experiments on the UCI Breast-cancer data set show that the algorithm effectively addresses the small-sample problem; compared with any single labeling technique, it produces fewer mislabeled samples and achieves significantly better learning performance.

(2) To address the loss of useful class information common in undersampling, the dissertation proposes a new undersampling method based on the convex hull (CH) structure. The convex hull of a data set is the smallest convex set containing all its samples; every sample lies inside the polygon or polytope formed by the hull vertices. Inspired by this geometric property, the algorithm computes the convex hull of the majority-class set and replaces the majority training samples with the compact set of hull vertices to balance the classes. In practice the two classes often overlap, so their hulls overlap as well; using hulls to characterize the majority-class boundary then becomes challenging and tends to cause overfitting and reduced generalization. Considering that both the reduced convex hull (RCH) and the scaled convex hull (SCH) lose boundary information during hull contraction, we propose the hierarchy reduced convex hull (HRCH). Motivated by the marked structural differences and complementarity of RCH and SCH, HRCH fuses the two; compared with other contracted hull structures, it carries more diverse, complementary class information and loses less of it during contraction. The algorithm samples the majority class with different values of the reduction and scaling factors; each resulting HRCH is combined with the rare-class samples to form a training set, multiple learners are trained, and the final classifier is produced by ensemble learning. Experimental comparison with four reference algorithms shows better classification performance and robustness.
(3) Also targeting the class-information loss of undersampling, the dissertation further proposes RKNN, a new undersampling method based on reverse k-nearest neighbors. Unlike the widely used k-nearest neighbors, reverse k-nearest neighbors examine a neighborhood from a global perspective: a point's reverse k-nearest neighbors depend not only on its nearby points but on all other points in the data set. Any change in the distribution of the sample set changes every point's reverse-nearest-neighbor relations, so these relations reflect the complete distribution structure of the set. By propagating neighbor relations, reverse nearest neighbors overcome the limitation of nearest-neighbor queries, which consider only the local distribution around the query point. For the majority class, the algorithm uses reverse k-nearest neighbors to remove noise, unstable boundary samples, and redundant samples, keeping the reliable samples richest in class information as training data. It balances the training set while effectively mitigating the class-information loss caused by undersampling. Experiments on the UCI Breast-cancer data set confirm its effectiveness for imbalanced learning; compared with undersampling based on k-nearest neighbors, RKNN achieves better performance.
[Abstract]: Computer-aided diagnosis (CAD) plays an increasingly important role in early breast cancer screening and can effectively help reduce the mortality of breast cancer patients. Because labeled case samples are difficult to collect in clinical practice and negative case samples far outnumber positive ones, CAD applications face small-sample, imbalanced-data learning problems. Imbalanced and small-sample learning concerns the performance of learning on data sets with severe class asymmetry and insufficient information representation. Machine learning and data mining have achieved great success in many practical applications, but learning from small samples and imbalanced data remains a great challenge. This dissertation systematically expounds the main reasons for the performance degradation of machine learning in small-sample and imbalanced learning environments, and surveys current effective methods for these problems. Based on a thorough understanding of how undersampling methods tend to lose class information when dealing with imbalanced data, we focus on how to handle imbalanced data reasonably and effectively: two new undersampling methods are proposed to extract the samples richest in class information, alleviating the information loss caused by undersampling. In addition, for the small-sample problem, a new class-labeling algorithm is proposed; it enlarges the training sample set by automatically labeling unlabeled samples while effectively reducing the errors that arise in the labeling process.
This dissertation focuses on learning techniques for small-sample, imbalanced data, centered on the resampling of imbalanced data sets and the class labeling of unlabeled samples.
(1) To solve the small-sample learning problem caused by the difficulty of collecting labeled case samples in CAD applications, this dissertation uses the large number of available unlabeled samples to expand the training sample set. To address the resulting mislabeling problem in semi-supervised learning, a Hybrid Class Labeling algorithm is proposed, which labels samples from three different perspectives: geometric distance, probability distribution, and semantic concepts. Unlabeled samples on which all three labelers agree are added to the training sample set. To further reduce the possible adverse effects of mislabeled samples on the learning process, a pseudo-label membership degree is introduced into SVM (Support Vector Machine) learning, with the membership degree controlling each sample's contribution to training. Experiments on the UCI Breast-cancer data set show that the algorithm effectively solves the small-sample learning problem: compared with any single class-labeling technique, it produces fewer mislabeled samples, and its learning performance is significantly better than that of the other algorithms.
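The agreement-based labeling and membership-weighted SVM can be sketched as follows. This is a minimal illustration, not the dissertation's implementation: the concept-based third labeler is replaced by a k-NN stand-in, and a single fixed membership degree is assumed for all pseudo-labeled points.

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier, NearestCentroid
from sklearn.svm import SVC

def hybrid_label_and_train(X_l, y_l, X_u, membership=0.5):
    """Train an SVM on labeled data plus unanimously pseudo-labeled data.

    Three labelers built on different principles vote on each unlabeled
    sample; only unanimous votes are kept (a stand-in for the paper's
    distance / probability / concept labelers)."""
    voters = [NearestCentroid(),            # geometric-distance view
              GaussianNB(),                 # probability-distribution view
              KNeighborsClassifier(3)]      # stand-in for the concept view
    preds = np.array([v.fit(X_l, y_l).predict(X_u) for v in voters])
    agree = (preds == preds[0]).all(axis=0)          # unanimous votes only

    X_train = np.vstack([X_l, X_u[agree]])
    y_train = np.concatenate([y_l, preds[0][agree]])
    # Pseudo-labeled points get a reduced membership weight so they
    # contribute less to the SVM objective than truly labeled points.
    w = np.concatenate([np.ones(len(y_l)),
                        np.full(int(agree.sum()), membership)])
    clf = SVC(kernel="rbf", gamma="scale")
    clf.fit(X_train, y_train, sample_weight=w)
    return clf
```

Lowering `membership` toward 0 makes the learner trust pseudo-labels less; at 1.0 they count as much as true labels.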
(2) To solve the problem that undersampling often loses useful class information, a new undersampling method based on the convex hull (CH) structure is proposed. Inspired by the geometric properties of convex hulls, the algorithm computes the convex hull of the majority-class sample set and replaces the majority training samples with the compact set of hull vertices, balancing the sample set. Considering the loss of boundary information caused by both the reduced convex hull (RCH) and the scaled convex hull (SCH) during hull contraction, we propose the hierarchy reduced convex hull (HRCH). Inspired by the significant structural differences and complementarity between RCH and SCH, we fuse the two to generate the HRCH structure. Compared with other contracted hull structures, HRCH contains more diverse and complementary class information and effectively reduces the loss of class information during hull contraction. The algorithm samples the majority class with different values of the reduction and scaling factors; each resulting HRCH structure is combined with the rare-class samples to form a training set, multiple learners are trained from these sets, and the final classifier is produced by ensemble learning.
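The basic hull-replacement step can be illustrated with `scipy.spatial.ConvexHull`. This sketches only the plain CH idea; the dissertation's HRCH additionally fuses reduced and scaled hulls over several factor values and ensembles the resulting learners.

```python
import numpy as np
from scipy.spatial import ConvexHull

def convex_hull_undersample(X_maj, X_min, y_maj=0, y_min=1):
    """Balance a two-class set by replacing the majority class with the
    vertices of its convex hull (the plain CH variant only)."""
    hull = ConvexHull(X_maj)           # smallest convex set containing X_maj
    X_keep = X_maj[hull.vertices]      # compact boundary representatives
    X = np.vstack([X_keep, X_min])
    y = np.concatenate([np.full(len(X_keep), y_maj),
                        np.full(len(X_min), y_min)])
    return X, y
```

Because every majority sample lies inside the polytope spanned by the kept vertices, the class boundary region is preserved even though the interior points are discarded.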
(3) Also aiming at the loss of class information in undersampling, this dissertation proposes RKNN, a new undersampling method based on reverse k-nearest neighbors. Unlike the widely used k-nearest neighbors, reverse k-nearest neighbors examine a neighborhood from a global perspective: any change in the data distribution of the sample set changes the reverse-nearest-neighbor relations of every sample point, so these relations reflect the complete distribution structure of the set. For the majority class, the algorithm uses reverse k-nearest neighbors to remove noise, unstable boundary samples, and redundant samples, keeping the reliable samples richest in class information. It balances the training samples while effectively mitigating the class-information loss caused by undersampling. Experiments on the UCI Breast-cancer data set confirm the algorithm's effectiveness for imbalanced learning; compared with the k-nearest-neighbor-based undersampling method, RKNN achieves better performance.
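One way to realize the reverse-k-NN filtering is to count, for each majority sample, how many other samples include it among their k nearest neighbors, and keep the best-supported points. This is a sketch under the assumption that a high reverse-k-NN count marks a stable, information-rich sample and a count near zero marks noise; the dissertation's RKNN also filters boundary and redundant samples explicitly.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def rknn_undersample(X_maj, n_keep, k=5):
    """Keep the n_keep majority samples with the most reverse k-nearest
    neighbors, i.e. the points most often referenced by other points'
    k-NN lists (isolated/noisy points receive few or no references)."""
    nn = NearestNeighbors(n_neighbors=k + 1).fit(X_maj)
    _, idx = nn.kneighbors(X_maj)              # idx[:, 0] is the point itself
    counts = np.bincount(idx[:, 1:].ravel(),   # reverse-kNN count per point
                         minlength=len(X_maj))
    keep = np.argsort(counts)[::-1][:n_keep]   # highest counts first
    return X_maj[keep]
```

Note the global character of the criterion: a point's count depends on every other point's k-NN list, not just on its own local neighborhood, which is the property the RKNN method exploits.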
[Degree-granting institution]: Zhejiang University
[Degree level]: Doctoral
[Year conferred]: 2014
[Classification number]: R81-39
Document ID: 2246516
Link: http://sikaile.net/yixuelunwen/yundongyixue/2246516.html