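Design and Implementation of a Spark-Based Online Fraud Detection Algorithm
Published: 2018-05-26 02:55
Topics: fraud detection + imbalanced learning; Source: Zhejiang University, master's thesis, 2017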
[Abstract]: In the era of big data, online businesses such as e-commerce and third-party payment have grown explosively, and online fraud has become increasingly rampant along with them. Online fraud detection is the cornerstone of an enterprise's risk-control capability: by modeling business behavior, it identifies fraud cases more accurately and efficiently, and it plays a major role in recovering losses and avoiding risk for users and online platforms. Because fraud cases are extremely rare relative to normal transactions, online fraud detection must above all address the imbalanced learning problem. In addition, as online business volume keeps growing, the online fraud detection system, as a core component of the business system, faces increasingly strict performance requirements; combining big data technology with online fraud detection will greatly strengthen an enterprise's risk-control defenses. This thesis begins with an introduction to related technologies, discussing in detail big data technologies including the distributed computing framework Spark and its real-time stream computing component Spark Streaming, and reviews the progress of research on online fraud detection. Against this background, it proposes a clustering-based self-balancing dataset construction algorithm and a distributed capital-loss-sensitive Lasso algorithm, implements the two together on the Spark distributed computing framework, and evaluates the relevant metrics on a real online fraud detection dataset. The main contributions of this thesis are: (1) a clustering-based self-balancing incremental dataset construction algorithm, which uses an incremental clustering algorithm to measure the similarity of samples within a class and selects several representative sample points per class to form the training set, effectively addressing within-class and between-class imbalance in online fraud detection datasets while preserving the information in the time-series data; (2) a distributed capital-loss-sensitive Lasso algorithm, designed for the online payment fraud detection scenario, which trains models efficiently at big data scale and effectively improves the capital-loss rate metric of the online fraud detection model; (3) a seamless integration of the clustering-based self-balancing incremental dataset construction algorithm and the distributed capital-loss-sensitive Lasso algorithm on the Spark distributed computing framework and the Spark Streaming real-time stream processing component, which verifies the effectiveness of these methods for online fraud detection at big data scale.
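The abstract does not give the formulation of the capital-loss-sensitive Lasso, so the following is only a minimal single-machine sketch of one plausible reading: an ordinary Lasso in which each sample's squared error is multiplied by a weight reflecting the money at stake. The objective, the function names `soft_threshold` and `weighted_lasso`, the coordinate-descent solver, and the toy data are all assumptions made for illustration; they are not taken from the thesis, and the sketch is not distributed.

```python
from typing import List, Sequence


def soft_threshold(value: float, threshold: float) -> float:
    """Proximal operator of the L1 penalty: shrink value toward zero."""
    if value > threshold:
        return value - threshold
    if value < -threshold:
        return value + threshold
    return 0.0


def weighted_lasso(
    X: Sequence[Sequence[float]],
    y: Sequence[float],
    weights: Sequence[float],
    lam: float,
    n_iter: int = 200,
) -> List[float]:
    """Minimise (1/2n) * sum_i w_i * (y_i - x_i . beta)^2 + lam * ||beta||_1
    by cyclic coordinate descent.

    ASSUMPTION: `weights` stands in for a per-transaction loss weight
    (for example, the transaction amount); the thesis may define it differently.
    """
    n, d = len(X), len(X[0])
    beta = [0.0] * d
    residual = list(y)  # equals y - X @ beta while beta is all zeros
    for _ in range(n_iter):
        for j in range(d):
            # Weighted correlation of feature j with the residual that
            # excludes feature j's own current contribution.
            rho = sum(
                weights[i] * X[i][j] * (residual[i] + X[i][j] * beta[j])
                for i in range(n)
            ) / n
            # Weighted squared norm of feature j.
            z = sum(weights[i] * X[i][j] ** 2 for i in range(n)) / n
            new_beta = soft_threshold(rho, lam) / z if z > 0 else 0.0
            delta = new_beta - beta[j]
            if delta != 0.0:
                # Keep the residual in sync with the updated coefficient.
                for i in range(n):
                    residual[i] -= X[i][j] * delta
                beta[j] = new_beta
    return beta


if __name__ == "__main__":
    # Hypothetical toy data: two features, four transactions.
    X = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [2.0, 1.0]]
    y = [2.0, -1.0, 1.0, 3.0]
    loss_weights = [1.0, 1.0, 5.0, 1.0]  # the third sample is the costliest
    print(weighted_lasso(X, y, loss_weights, lam=0.1))
```

Raising `lam` drives more coefficients to exactly zero, while raising a sample's weight makes errors on that sample more expensive. A distributed version would need the per-feature sums to be aggregated across partitions, which this sketch does not attempt.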
[Degree-Granting Institution]: Zhejiang University
[Degree Level]: Master's
[Year Degree Awarded]: 2017
[Classification Number]: TP311.13
Article ID: 1935671
Article link: http://sikaile.net/kejilunwen/ruanjiangongchenglunwen/1935671.html