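Research on Spam Image Filtering Methods Based on Image Features and OCR
Keywords: spam image; feature extraction; KNN; short text classification. Source: Nanjing University of Science and Technology, 2017 master's thesis. Document type: degree thesis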
[Abstract]: With the rapid growth of the Internet, email has become an important tool for everyday communication. Alongside a large amount of useful information, users also receive advertising, pornography, fraud, Trojans, and even reactionary content. Such unwanted content consumes substantial network resources, increases user risk, and degrades the user experience; it is classed as spam. Spam has gradually shifted from text-only messages to image-based and mixed image-text messages. Text-oriented spam filtering has been studied extensively, whereas filtering methods for image spam remain unsatisfactory. This thesis studies techniques for filtering the spam images carried in spam email. It designs a two-layer spam image filtering method that screens spam images in stages through two routes, low-level image features and OCR, raising the detection rate while lowering the false detection rate. According to the type of feature used, the method consists of a feature-based filtering layer and a content-based filtering layer. The former is the first layer and performs coarse classification, using low-level image features to make an initial selection of spam images. The latter is the second layer and performs fine classification, extracting keywords from the text recognized in spam images and assigning each image a spam category. In the feature-based layer, the thesis proposes a KNN filtering method based on confidence analysis. It first analyzes low-level features of spam and normal images, including color, gradient, and HOG; it then analyzes the KNN classification results and confidence distribution for each feature, and uses the confidence values to fuse the multi-feature classification results, which reduces the false recognition rate. In the content-based layer, the thesis first designs methods for detecting, segmenting, and recognizing text in spam images, including a single-character segmentation method based on the Fourier transform and projection that addresses skewed text. It then proposes a chi-square test method that incorporates relative term frequency for extracting keyword features from the text, which lowers the probability that low-frequency words are selected as features. Finally, it designs a short-text classification method based on SVM and a prior corpus that further classifies spam images into categories such as crime, education, insurance, and product promotion. Experiments and comparisons were carried out on the public SPAM image set and on an image set collected and organized for this work; the results show that the two-layer method achieves satisfactory accuracy and false recognition rates.
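The confidence-based fusion of per-feature KNN results described in the abstract can be illustrated with a minimal sketch. The abstract does not say how confidence is computed or how results are combined, so everything below is an assumption made for illustration: confidence is taken to be the fraction of the k nearest neighbours that agree with the winning label, fusion sums confidence per label, and the names `knn_vote` and `fuse` and the toy feature vectors are hypothetical, not taken from the thesis.

```python
import math
from collections import Counter


def knn_vote(train, labels, query, k=3):
    """Classify `query` with plain KNN on one feature type.

    Returns (label, confidence). Confidence is ASSUMED here to be the
    share of the k nearest neighbours that voted for the winning label.
    """
    # Indices of training samples sorted by Euclidean distance to the query.
    order = sorted(range(len(train)), key=lambda i: math.dist(train[i], query))
    votes = Counter(labels[i] for i in order[:k])
    label, count = votes.most_common(1)[0]
    return label, count / k


def fuse(predictions):
    """Fuse per-feature (label, confidence) pairs.

    ASSUMED rule: add up confidence per label and pick the largest total.
    """
    scores = Counter()
    for label, confidence in predictions:
        scores[label] += confidence
    return max(scores, key=scores.get)


# Toy data: one label list shared by two hypothetical feature types.
labels = ["spam", "spam", "spam", "ham", "ham", "ham"]
color_train = [(0.9, 0.1), (0.8, 0.2), (0.85, 0.15),
               (0.2, 0.8), (0.1, 0.9), (0.15, 0.85)]
gradient_train = [(0.7,), (0.6,), (0.3,), (0.4,), (0.2,), (0.1,)]

# One query image, described once per feature type.
color_result = knn_vote(color_train, labels, (0.82, 0.18))
gradient_result = knn_vote(gradient_train, labels, (0.35,))
final_label = fuse([color_result, gradient_result])
```

In this toy case the color feature votes unanimously for one class while the gradient feature is split, so the low-confidence gradient vote is outweighed. That is the general idea behind letting confidence arbitrate between features, although the thesis may use a different confidence measure or fusion rule.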
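[Degree-granting institution]: Nanjing University of Science and Technology
[Degree level]: Master's
[Year degree granted]: 2017
[Classification number]: TP391.41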
Article ID: 1508293
Link to this article: http://sikaile.net/wenyilunwen/guanggaoshejilunwen/1508293.html