Research on Sparse Model Construction Algorithms for Imbalanced Binary Classification Criteria
Published: 2018-03-05 14:28
Topic: imbalanced binary classification | Focus: QM | Source: Anhui University, master's thesis, 2017 | Type: degree thesis
【Abstract】: Social progress and scientific development have brought rapid change to people's lives. At the same time, the continual accumulation of data, while convenient, has also created new challenges: discovering useful information in such large volumes of data has become a pressing problem. Machine learning offers an effective means of meeting this challenge, and classification learning, binary classification in particular, has become a research hotspot because of its wide application in many fields. In practice, however, many applications (such as web search engines and personalized recommender systems) are imbalanced binary classification problems with high-dimensional data, and traditional binary classification algorithms designed for small data are difficult to apply to them directly. In recent years, researchers have therefore proposed sparse binary classification model construction algorithms that directly optimize imbalance criteria, with good results. These studies, however, consider only simple, easily decomposable criteria such as AUC or F1; how to obtain sparse models for other, more complex imbalance criteria has received little attention. Against this background, this thesis studies sparse model construction algorithms for complex imbalance criteria. The main contributions are as follows. (1) Starting from binary classification learning, the thesis first describes how evaluation criteria differ between traditional and imbalanced binary classification, surveys the state of research on imbalanced binary classification algorithms with emphasis on progress in sparse model construction, and on that basis proposes to study L1-norm-based sparse model construction for complex imbalance criteria. (2) Unlike existing imbalanced sparse model construction algorithms, which mostly target simple criteria such as AUC or F1, this thesis studies a sparse model construction algorithm for the complex, non-decomposable QM criterion. The algorithm first defines a new QM-based objective function; because this objective is non-smooth and hard to optimize directly, a cutting-plane algorithm is used to solve it, which both resolves that difficulty and bounds the number of outer iterations by O(1/ε). Experiments on imbalanced benchmark datasets show that, with QM as the evaluation criterion, the proposed algorithm achieves both good accuracy and high sparsity. (3) Because existing imbalanced sparse model construction algorithms all rely on batch learning and are computationally inefficient on large-scale datasets, the thesis proposes a sparse model construction algorithm based on stochastic learning. More specifically, it targets not one particular imbalance criterion but the class of evaluation criteria sharing a common property such as pseudo-linearity. The problem of directly optimizing a pseudo-linear criterion is first transformed into a cost-sensitive problem. Because applying stochastic gradient descent to the new problem directly makes it hard to obtain satisfactory sparsity, the COMID algorithm is adopted as the optimizer, which guarantees sparse solutions. Moreover, since the existing COMID algorithm attains only an O(log T / T) convergence rate even for strongly convex objectives, an improved method based on polynomial decay is given and is proved to achieve the optimal O(1/T) convergence rate. Experiments on imbalanced benchmark datasets demonstrate the efficiency and effectiveness of the proposed algorithms.
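For contribution (2), the abstract does not spell out the cutting-plane formulation used for the QM objective, so the following Python code is only a minimal sketch of the general pattern it refers to: an L1-regularized, non-smooth convex risk is minimized by repeatedly adding subgradient cuts and solving the resulting piecewise-linear reduced problem as a linear program. An average hinge loss stands in for the unspecified QM-based risk, a box bound `B` keeps the reduced LP bounded, and all identifiers (`hinge_risk_and_subgrad`, `cutting_plane_l1`, `eps`) are hypothetical rather than taken from the thesis.

```python
# Minimal cutting-plane sketch (hypothetical, not the thesis's algorithm):
#   minimize  lam*||w||_1 + R(w),  with R convex and non-smooth.
# An average hinge loss stands in for the QM-based risk used in the thesis.
import numpy as np
from scipy.optimize import linprog

def hinge_risk_and_subgrad(w, X, y):
    """Return R(w) = mean hinge loss and one subgradient of R at w."""
    margins = 1.0 - y * (X @ w)
    active = margins > 0
    risk = np.maximum(margins, 0.0).mean()
    subgrad = -(y[active, None] * X[active]).sum(axis=0) / len(y)
    return risk, subgrad

def cutting_plane_l1(X, y, lam=0.1, B=10.0, eps=1e-3, max_iter=100):
    """Cutting-plane loop; the gap (best objective minus LP lower bound) certifies eps-optimality."""
    d = X.shape[1]
    w = np.zeros(d)
    best_w, best_obj = w.copy(), np.inf
    cuts_a, cuts_b = [], []                    # each stored cut:  R(w) >= a @ w + b
    for _ in range(max_iter):
        risk, a = hinge_risk_and_subgrad(w, X, y)
        obj = lam * np.abs(w).sum() + risk     # true objective at the current iterate
        if obj < best_obj:
            best_obj, best_w = obj, w.copy()
        cuts_a.append(a)
        cuts_b.append(risk - a @ w)
        # Reduced problem as an LP over z = [u, v, xi] with w = u - v and 0 <= u, v <= B:
        #   min  lam*sum(u+v) + xi   s.t.  xi >= a_i @ (u - v) + b_i  for every cut i.
        c = np.concatenate([np.full(2 * d, lam), [1.0]])
        A_ub = np.array([np.concatenate([ai, -ai, [-1.0]]) for ai in cuts_a])
        b_ub = -np.array(cuts_b)
        bounds = [(0.0, B)] * (2 * d) + [(None, None)]
        sol = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=bounds, method="highs")
        if best_obj - sol.fun <= eps:          # lower bound close enough: stop
            break
        w = sol.x[:d] - sol.x[d:2 * d]
    return best_w

# Toy usage on synthetic data.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 20))
y = np.sign(X[:, 0] + 0.1 * rng.normal(size=200))
w_hat = cutting_plane_l1(X, y)
print("non-zero coordinates:", np.count_nonzero(np.round(w_hat, 6)))
```

The point of the sketch is only the loop structure: each subgradient call adds one cut, and the LP lower bound rises toward the true objective, which is the mechanism behind O(1/ε)-type outer-iteration bounds for cutting-plane methods.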
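For contribution (3), the abstract contrasts a plain stochastic gradient step, which rarely produces exactly sparse iterates, with a COMID-style composite update, which does. The sketch below, again a hedged illustration rather than the thesis's algorithm, shows the standard composite update with a Euclidean prox, where the L1 term becomes a soft-thresholding step, combined with a polynomially decaying step size eta_t = eta0/(1+t)^alpha. The specific decay schedule, any averaging, and the O(1/T) analysis from the thesis are not reproduced; the logistic loss, the synthetic data, and all parameter values are assumptions.

```python
# COMID-style sketch for  min_w  mean_i f_i(w) + lam*||w||_1  with logistic loss f_i.
# The L1 proximal step (soft-thresholding) is what keeps the iterates exactly sparse.
import numpy as np

def soft_threshold(z, tau):
    """Proximal operator of tau*||.||_1; sets small coordinates exactly to zero."""
    return np.sign(z) * np.maximum(np.abs(z) - tau, 0.0)

def comid_l1_logistic(X, y, lam=0.01, eta0=1.0, alpha=1.0, epochs=5, seed=0):
    rng = np.random.default_rng(seed)
    n, d = X.shape
    w = np.zeros(d)
    t = 0
    for _ in range(epochs):
        for i in rng.permutation(n):
            t += 1
            eta = eta0 / (1.0 + t) ** alpha                 # polynomially decaying step size
            margin = y[i] * (X[i] @ w)
            grad = -y[i] * X[i] / (1.0 + np.exp(margin))    # gradient of log(1 + exp(-y x.w))
            w = soft_threshold(w - eta * grad, eta * lam)   # composite (prox) update
    return w

# Toy usage: most coordinates of w_hat come out exactly zero.
rng = np.random.default_rng(1)
X = rng.normal(size=(500, 50))
y = np.sign(X[:, 0] - X[:, 1] + 0.1 * rng.normal(size=500))
w_hat = comid_l1_logistic(X, y)
print(np.count_nonzero(w_hat), "non-zero coordinates out of", w_hat.size)
```

Replacing the prox step with a plain subgradient step on the full objective would drive coordinates toward zero without ever reaching it, which is the sparsity issue the abstract attributes to direct stochastic gradient methods.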
【Degree-granting institution】: Anhui University
【Degree level】: Master's
【Year degree conferred】: 2017
【CLC classification number】: TP181; TP311.13