Research and Application of Medical Insurance Fraud Detection Based on the Hadoop Platform
Topic keywords: clustering + classification; Source: University of Electronic Science and Technology of China, master's thesis, 2017
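[Abstract (Chinese original, translated)]: As China's healthcare and economy have continued to develop, medical insurance coverage has become very broad, and the public has seen real benefits from medical insurance policy. At the same time, abuse of medical insurance funds is growing: more and more money is being fraudulently claimed, so cracking down on fraud is imperative. At present, medical insurance agencies mainly audit settlement records with rule systems. The rules rely on a small number of indicators, and because they are incomplete and slow to be updated, relatively static rules are easily deceived by carefully forged data, so computer-assisted review is urgently needed. This thesis analyzes the characteristics of medical insurance data, uses data mining techniques to build a fraud detection workflow, and, by integrating it with the business system, implements fraud detection and auditing for large-scale medical insurance data. The main contents are as follows: 1. Feature engineering on the raw data. For historical reasons the existing data set has many defects, so the raw data is first processed through feature engineering, including removing noisy data, filling in missing values, and extracting features based on the actual business process. 2. Coarse-grained fraud screening based on DBSCAN. Given the extreme class imbalance of the data, the thesis studies unsupervised algorithms for fraud detection, mainly comparing the results of various clustering algorithms on the data set, and, with the help of label information, settles on the DBSCAN algorithm to identify abnormal clusters. 3. Precise fraud detection based on density sampling and random forest. Building on the abnormal groups found by clustering, a density-based sampling method is proposed to rebalance the data, and the sampling information is used within the random forest algorithm to select and ensemble the sub-classifiers. Combining classification with clustering greatly improves accuracy and yields a complete fraud detection framework. 4. Parallel implementation on the Hadoop platform. For large-scale data, parallel versions of DBSCAN and random forest are proposed and implemented with Map-Reduce on Hadoop, completing a fraud detection and auditing system. The thesis applies data mining to medical insurance anomaly detection. Its contributions are as follows: it is no longer limited to modeling specific fraud scenarios, so it can identify relatively rare data and generalizes better; using local density as the link, it proposes a density-based sampling method that combines DBSCAN with random forest, maintaining high accuracy while effectively controlling overfitting; and, alongside the parallel algorithms, it proposes a partitioning method for high-dimensional data that reflects the idea of load balancing.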
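The coarse-grained screening step (item 2) can be illustrated with a minimal sketch of DBSCAN in plain Python. Everything specific here is an assumption made for illustration and is not taken from the thesis: the two-feature `claims` records, the `eps` and `min_pts` values, and the choice to treat DBSCAN noise points as the records worth a closer look (the abstract only says that abnormal clusters are identified with the help of label information).

```python
from math import dist

NOISE = -1


def region_query(points, i, eps):
    """Indices of all points within eps of points[i] (i itself included)."""
    return [j for j, q in enumerate(points) if dist(points[i], q) <= eps]


def dbscan(points, eps, min_pts):
    """Return one label per point: 0, 1, ... for clusters, NOISE for outliers."""
    labels = [None] * len(points)
    cluster = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        neighbors = region_query(points, i, eps)
        if len(neighbors) < min_pts:
            labels[i] = NOISE  # may later be claimed as a border point
            continue
        cluster += 1
        labels[i] = cluster
        queue = list(neighbors)
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster  # border point: joins but is not expanded
            if labels[j] is not None:
                continue
            labels[j] = cluster
            j_neighbors = region_query(points, j, eps)
            if len(j_neighbors) >= min_pts:  # core point: keep growing
                queue.extend(j_neighbors)
    return labels


# Hypothetical, already-scaled features per claim: (cost, visit count).
claims = [
    (1.0, 1.0), (1.1, 0.9), (0.9, 1.1), (1.0, 1.2), (1.2, 1.0),
    (5.0, 5.0), (5.1, 4.9), (4.9, 5.1), (5.0, 5.2),
    (9.0, 0.5),
]
labels = dbscan(claims, eps=0.5, min_pts=3)
suspects = [c for c, lab in zip(claims, labels) if lab == NOISE]
print(labels)    # [0, 0, 0, 0, 0, 1, 1, 1, 1, -1]
print(suspects)  # [(9.0, 0.5)]
```

The neighbor search in this sketch is quadratic in the number of records, which hints at why a parallel version is attractive for large data sets.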
[Abstract]: With the continued improvement of healthcare and the economy in China, medical insurance coverage has become very wide, and ordinary people enjoy real benefits from medical insurance policy. On the other hand, abuse of medical insurance funds is becoming more and more serious and more funds are being fraudulently claimed, so it is imperative to crack down on fraud. At present, medical insurance agencies mainly use rule systems to audit settlement information, and the rules depend on a few indicators. Because the rules are imperfect and lag behind in updates, relatively static rules are easily deceived by carefully forged data, and computer-assisted review is urgently needed. This thesis analyzes the characteristics of medical insurance data, establishes a fraud detection process using data mining technology, and implements fraud detection and auditing of medical insurance big data in combination with the business system. The main contents are as follows: 1. Feature engineering on the raw data. For historical reasons, the existing data sets have many defects. The raw data is first processed with feature engineering, including removing noisy data, filling in missing values, and extracting features according to the actual business process. 2. Coarse-grained fraud screening based on DBSCAN. Given the extremely imbalanced data, the application of unsupervised algorithms to fraud detection is studied. The results of various clustering algorithms on the data set are compared and, with the help of label information, the DBSCAN algorithm is chosen to identify abnormal clusters. 3. Precise fraud detection based on density sampling and random forest. On the basis of the abnormal groups found by clustering, a density-based sampling method is proposed to rebalance the data, and the sampling information is used to select and ensemble the sub-classifiers in the random forest algorithm. Combining classification with clustering greatly improves accuracy and results in a complete fraud detection framework. 4. Parallel implementation on the Hadoop platform. Parallel algorithms for DBSCAN and random forest are proposed for large-scale data and implemented on the Hadoop platform using Map-Reduce, completing a fraud detection and auditing system. In this thesis, data mining technology is applied to medical insurance anomaly detection. Its innovation is that it is no longer limited to modeling specific fraud scenarios, so it can identify relatively rare data and is more generally applicable. With local density as the link, a density-based sampling method is proposed that combines the DBSCAN algorithm with the random forest algorithm, ensuring high accuracy while effectively controlling overfitting. Alongside the parallel algorithms, a partitioning method for high-dimensional data is proposed, which embodies the idea of load balancing.
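The rebalancing idea in item 3 might look roughly like the following sketch. The abstract does not give the actual sampling rule, so this is an assumed variant: majority-class (normal) records are undersampled without replacement with weight 1 / local density, so crowded regions are thinned the most. The function names, the `eps` radius, and the toy data are hypothetical, and the selection of sub-classifiers inside the random forest is not covered.

```python
import random
from math import dist


def local_density(points, eps):
    """Count, for each point, the points within eps of it (itself included)."""
    return [sum(1 for q in points if dist(p, q) <= eps) for p in points]


def density_undersample(majority, k, eps, seed=0):
    """Pick k majority records without replacement, favouring sparse regions.

    Assumed rule (not from the thesis): weight w = 1 / local density.
    Uses Efraimidis-Spirakis keys u ** (1 / w) and keeps the k largest.
    """
    rng = random.Random(seed)
    density = local_density(majority, eps)
    # With w = 1 / d the key u ** (1 / w) simplifies to u ** d.
    keyed = [(rng.random() ** d, p) for p, d in zip(majority, density)]
    keyed.sort(reverse=True)
    return [p for _, p in keyed[:k]]


# Hypothetical toy data: many normal claims, few fraudulent ones.
normal = [(1.0, 1.0), (1.1, 0.9), (0.9, 1.1), (1.0, 1.2), (1.2, 1.0),
          (1.1, 1.1), (3.0, 3.0), (4.0, 1.0)]
fraud = [(6.0, 6.0), (6.2, 5.9), (9.0, 0.5)]

sampled_normal = density_undersample(normal, k=len(fraud), eps=0.5)
# Label 0 = normal, 1 = fraud; both classes now have the same size.
balanced = [(p, 0) for p in sampled_normal] + [(p, 1) for p in fraud]
```

The resulting `balanced` list of (features, label) pairs could then be passed to any random forest implementation for training.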
[Degree-granting institution]: University of Electronic Science and Technology of China
[Degree level]: Master's
[Year degree conferred]: 2017
[Classification number]: F842.684;TP311.13
Article ID: 1999822
Article link: http://sikaile.net/jingjilunwen/bxjjlw/1999822.html