Research on Fine-Grained Classification and Serial Case Analysis Techniques for Criminal Cases

Published: 2018-09-08 17:00

[Abstract]: With the rapid development of information technology, intelligence information systems in the public security field face massive volumes of data, mostly text, and traditional manual processing can no longer meet operational needs. More automated and intelligent text mining techniques are required to improve case-handling efficiency. Working with criminal case texts, this thesis studies two problems of broad concern to criminal investigators: fine-grained case classification and serial case analysis (linking and merging related cases). A two-level classification method, TLC-NBK, based on naive Bayes and keyword co-occurrence graphs is proposed. The method is designed around the characteristics of case texts: short length, low term frequency, and a hierarchical, imbalanced category distribution. First, part-of-speech features are added to the document frequency (DF) method, giving a two-factor evaluation algorithm for feature selection. Naive Bayes classification is then performed with a multivariate Bernoulli model adapted to imbalanced categories, which assigns first-level case categories quickly and accurately. Building on the first-level classifier, keyword co-occurrence vectors are constructed for each second-level case category under a first-level category, with the document set as the basic unit. Co-occurrence relations between keywords replace term frequency in the weight calculation, and an inverse category frequency factor is proposed to correct the co-occurrence weights. A simple vector distance algorithm then performs the fine-grained second-level classification. In addition, a synonym network is used to remove the interference of domain synonyms with the classification results. A density clustering method based on case features is also proposed to carry out serial case analysis.
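The first-level stage described above (feature selection from document frequency plus part of speech, followed by a multivariate Bernoulli naive Bayes classifier) can be illustrated with a minimal sketch. The abstract does not give the thesis's formulas, so everything specific here is an assumption: the `POS_WEIGHT` values, the choice to multiply document frequency by a part-of-speech weight, and the use of plain Laplace smoothing. The thesis's adaptation to imbalanced categories is not reproduced.

```python
import math
from collections import Counter, defaultdict

# Hypothetical part-of-speech weights (assumption, not from the thesis):
# nouns and verbs are assumed to carry more category information.
POS_WEIGHT = {"n": 1.0, "v": 0.8}
DEFAULT_POS_WEIGHT = 0.3


def select_features(docs, pos_tags, top_k):
    """Rank terms by document frequency times a POS weight (assumed combination)."""
    df = Counter()
    for tokens in docs:
        df.update(set(tokens))  # count each term once per document
    score = {
        t: df[t] * POS_WEIGHT.get(pos_tags.get(t), DEFAULT_POS_WEIGHT) for t in df
    }
    # Sort by descending score; break ties alphabetically for determinism.
    ranked = sorted(score, key=lambda t: (-score[t], t))
    return ranked[:top_k]


class BernoulliNB:
    """Standard multivariate Bernoulli naive Bayes with Laplace smoothing."""

    def fit(self, docs, labels, vocab):
        self.vocab = list(vocab)
        self.classes = sorted(set(labels))
        vocab_set = set(self.vocab)
        n_docs = Counter(labels)
        present = defaultdict(Counter)  # class -> term -> docs containing the term
        for tokens, y in zip(docs, labels):
            present[y].update(set(tokens) & vocab_set)
        self.log_prior = {c: math.log(n_docs[c] / len(docs)) for c in self.classes}
        # P(term present | class), smoothed so it is never 0 or 1.
        self.p = {
            c: {t: (present[c][t] + 1) / (n_docs[c] + 2) for t in self.vocab}
            for c in self.classes
        }
        return self

    def predict(self, tokens):
        seen = set(tokens)
        best, best_score = None, -math.inf
        for c in self.classes:
            s = self.log_prior[c]
            for t in self.vocab:
                p = self.p[c][t]
                # The Bernoulli model scores absent terms as well as present ones.
                s += math.log(p if t in seen else 1.0 - p)
            if s > best_score:
                best, best_score = c, s
        return best


# Toy data, invented for illustration only.
docs = [
    ["steal", "wallet", "bus"],
    ["steal", "phone", "bus"],
    ["rob", "knife", "street"],
    ["rob", "knife", "phone"],
]
labels = ["theft", "theft", "robbery", "robbery"]
pos_tags = {"steal": "v", "rob": "v", "wallet": "n", "phone": "n",
            "bus": "n", "knife": "n", "street": "n"}

vocab = select_features(docs, pos_tags, top_k=5)
model = BernoulliNB().fit(docs, labels, vocab)
```

The Bernoulli model suits short case texts because it uses only the presence or absence of a term, which matches the low term frequencies the abstract mentions.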
The clustering method first combines rules and dictionaries to extract structured case features from unstructured case descriptions. It then defines a feature similarity formula for case texts that jointly accounts for the fine-grained case category, the time of the offense, and the location of the offense, with the weight of each dimension determined by the analytic hierarchy process (AHP). Finally, drawing on the classical density clustering algorithm OPTICS, a feature density clustering algorithm, OPTICS-FD, is proposed; it identifies dense clusters of serial cases and helps investigators solve them. Experiments tested the two-factor evaluation algorithm, the two-level classifier, case feature extraction, and serial case clustering. In criminal case text mining, TLC-NBK improved accuracy and recall over traditional methods by 7.53% and 12.99%, respectively, and OPTICS-FD reached a reduction rate of 66.52% and a recall of 91.25%, giving investigators better support for decision-making.
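The weighted feature similarity can be sketched as follows, assuming a simple weighted sum over the three dimensions named in the abstract. The abstract does not state the actual formula, so the per-dimension similarity functions, the decay scales, the planar kilometre coordinates, and the pairwise comparison matrix are all hypothetical. The AHP weights are approximated with the row geometric-mean method, a common substitute for the principal eigenvector.

```python
import math
from datetime import datetime


def ahp_weights(matrix):
    """Approximate AHP priority weights with the row geometric-mean method."""
    n = len(matrix)
    gm = [math.prod(row) ** (1.0 / n) for row in matrix]
    total = sum(gm)
    return [g / total for g in gm]


def category_sim(a, b):
    # Assumed: exact match on the fine-grained category, otherwise no similarity.
    return 1.0 if a == b else 0.0


def time_sim(t1, t2, scale_days=30.0):
    # Assumed: exponential decay with the gap between offense times.
    gap_days = abs((t1 - t2).total_seconds()) / 86400.0
    return math.exp(-gap_days / scale_days)


def location_sim(p1, p2, scale_km=5.0):
    # Assumed: locations are planar (x, y) coordinates in kilometres.
    dist = math.hypot(p1[0] - p2[0], p1[1] - p2[1])
    return math.exp(-dist / scale_km)


def case_similarity(c1, c2, weights):
    """Weighted sum of category, time, and location similarity."""
    w_cat, w_time, w_loc = weights
    return (
        w_cat * category_sim(c1["category"], c2["category"])
        + w_time * time_sim(c1["time"], c2["time"])
        + w_loc * location_sim(c1["xy"], c2["xy"])
    )


# Hypothetical pairwise comparisons: category judged twice as important
# as time and as location; time and location judged equally important.
pairwise = [
    [1.0, 2.0, 2.0],
    [0.5, 1.0, 1.0],
    [0.5, 1.0, 1.0],
]
weights = ahp_weights(pairwise)

# Invented example cases.
case_a = {"category": "bicycle theft", "time": datetime(2015, 3, 1), "xy": (0.0, 0.0)}
case_b = {"category": "burglary", "time": datetime(2015, 3, 1), "xy": (0.0, 0.0)}
```

A density clustering step would need a distance rather than a similarity; `1 - case_similarity(...)` is one simple option, though whether OPTICS-FD works this way is not stated in the abstract.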
[Degree-granting institution]: Huazhong University of Science and Technology
[Degree level]: Master's
[Year degree granted]: 2016
[Classification number]: TP391.1;D918.2


Article ID: 2231130


Link to this article: http://sikaile.net/kejilunwen/ruanjiangongchenglunwen/2231130.html
