Research and Implementation of a Method for Filtering Harmful Content in Mobile Phone Short Messages
Topic: short message + word segmentation; Source: Shanghai Jiao Tong University, master's thesis, 2008
[Abstract]: Mobile phone short messages (SMS) have grown explosively in recent years. While they bring great convenience to users, they have also become a major information-security risk. Pornographic and violent content, political rumors, reactionary speech, fraud, and illegal advertising spread through this new medium and have become a significant factor affecting social stability. Illegal short messages test society's ability to respond to unlawful harm. Preventing and cracking down on this new type of crime, which exploits modern information technology, is a new challenge for public security organs, procuratorates, and courts, as well as for banks and the industry and information technology authorities.

This thesis proposes a short message classification and filtering mechanism based on text content classification, designs an improved short message filtering model based on the Bayesian algorithm, and develops a platform for intercepting and filtering text short messages. It describes the implementation of the model's key functional modules, which together recognize short message content and filter messages automatically. The main work is as follows. First, based on the characteristics of short message classification, we analyze the asymmetry of the classification weights. Users are least willing to have a normal message misjudged as harmful and filtered out. To minimize the expected loss, classification must be accurate, and misjudging a normal message as harmful must carry a higher weight than misjudging a harmful message as normal. Second, we design the main modules for classification and filtering: short message collection, Chinese word segmentation, feature selection, and short message classification and filtering. Finally, we test the model and evaluate the platform's experimental results using metrics borrowed from text classification and information retrieval.

The design and implementation have three distinguishing features. First, filtering is designed and implemented on the short message server. Unlike filtering on the handset, the server receives large numbers of identical messages sent in bulk by SMS modems. Once one copy is judged to be spam, the other copies can be judged spam and discarded as well, which saves network traffic and avoids the limited processing power and low filtering efficiency of ordinary handsets. Second, the Chinese word segmentation module uses a multi-level hash table to look up dictionary entries quickly. This is much faster than querying a database-backed word list and improves segmentation efficiency. Segmentation uses the maximum matching method, which improves segmentation accuracy. Third, feature selection combines document frequency with term frequency. This reflects both how widespread a term is across documents of the same class and how well the term expresses the meaning of an individual document. The method is closer to real conditions than document frequency alone and purifies the classification feature vector more effectively.

Text classification and information filtering techniques are applied to the short message filtering platform, and the experimental results show that the automatic filtering platform has good application prospects.
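The segmentation step described in the abstract can be sketched as follows. This is a minimal illustration, not the thesis's implementation: the abstract does not specify how the multi-level hash table is laid out, so the two-level layout here (first character, then word length), the function names `build_dictionary` and `segment`, and the sample word list are all assumptions.

```python
# Sketch: forward maximum matching over a two-level hash dictionary.
# Layout, names, and sample words are illustrative assumptions.


def build_dictionary(words):
    """Index words by first character, then by word length."""
    index = {}
    for word in words:
        by_length = index.setdefault(word[0], {})
        by_length.setdefault(len(word), set()).add(word)
    return index


def segment(text, index):
    """Split text greedily, always taking the longest dictionary word."""
    tokens = []
    pos = 0
    while pos < len(text):
        by_length = index.get(text[pos], {})
        match = text[pos]  # fall back to a single character
        # Try the candidate lengths from longest to shortest.
        for length in sorted(by_length, reverse=True):
            candidate = text[pos:pos + length]
            if candidate in by_length[length]:
                match = candidate
                break
        tokens.append(match)
        pos += len(match)
    return tokens


# Hypothetical dictionary: win a prize, congratulations, phone,
# phone number (short form), number.
words = ["中奖", "恭喜", "手机", "手机号", "号码"]
index = build_dictionary(words)
tokens = segment("恭喜您的手机号码中奖", index)
print(tokens)
```

Each lookup touches only the words that share the current first character, which is where the speed advantage over a database query would come from. The match is greedy: in this sample it takes the three-character entry 手机号 and leaves 码 as a single character, even though 号码 is also in the dictionary.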
In keeping with the joint directive issued by the Ministry of Public Security, the Ministry of Industry and Information Technology, the Ministry of State Security, and the State Council Information Office, we believe the methods studied in this thesis can be used to investigate and solve illegal short message cases and to monitor and block harmful public short messages involving major sensitive events.
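The asymmetric misclassification weights described in the abstract can be illustrated with a cost-sensitive decision rule on top of a naive Bayes classifier. The sketch below is an assumption-laden illustration, not the thesis's improved model, which the abstract does not detail: the class name `CostSensitiveBayes`, the parameter `ham_cost`, its value of 9, and the toy training messages are all hypothetical.

```python
import math
from collections import Counter


class CostSensitiveBayes:
    """Multinomial naive Bayes with an asymmetric blocking threshold."""

    def __init__(self, ham_cost=9.0):
        # ham_cost (assumed value): cost of blocking a normal message,
        # relative to a cost of 1 for letting a harmful one through.
        self.ham_cost = ham_cost
        self.word_counts = {"ham": Counter(), "spam": Counter()}
        self.doc_counts = Counter()

    def train(self, tokens, label):
        self.word_counts[label].update(tokens)
        self.doc_counts[label] += 1

    def _log_score(self, tokens, label, vocab):
        counts = self.word_counts[label]
        total = sum(counts.values())
        # Log prior plus Laplace-smoothed log likelihoods.
        score = math.log(self.doc_counts[label] / sum(self.doc_counts.values()))
        for token in tokens:
            if token in vocab:  # ignore words never seen in training
                score += math.log((counts[token] + 1) / (total + len(vocab)))
        return score

    def is_spam(self, tokens):
        vocab = set(self.word_counts["ham"]) | set(self.word_counts["spam"])
        log_odds = (self._log_score(tokens, "spam", vocab)
                    - self._log_score(tokens, "ham", vocab))
        # Block only when P(spam|x) * 1 > P(ham|x) * ham_cost.
        return log_odds > math.log(self.ham_cost)


# Toy, already segmented training messages (hypothetical).
training = [
    (["中奖", "汇款", "账号"], "spam"),  # prize, remit money, account
    (["中奖", "领取", "奖金"], "spam"),  # prize, claim, bonus
    (["今晚", "一起", "吃饭"], "ham"),   # tonight, together, dinner
    (["明天", "开会", "记得"], "ham"),   # tomorrow, meeting, remember
]

strict = CostSensitiveBayes(ham_cost=9.0)
neutral = CostSensitiveBayes(ham_cost=1.0)
for tokens, label in training:
    strict.train(tokens, label)
    neutral.train(tokens, label)

print(strict.is_spam(["中奖", "汇款", "账号"]))
print(strict.is_spam(["中奖", "汇款"]))
print(neutral.is_spam(["中奖", "汇款"]))
```

With equal costs the borderline message 中奖 汇款 is blocked, while with `ham_cost=9` it is delivered, because the evidence for spam does not outweigh the higher cost of wrongly blocking a normal message. Raising `ham_cost` trades more missed spam for fewer blocked normal messages.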
[Degree-granting institution]: Shanghai Jiao Tong University
[Degree level]: Master's
[Year degree conferred]: 2008
[Classification number]: TN929.53
Article ID: 1909284
Article link: http://sikaile.net/wenyilunwen/guanggaoshejilunwen/1909284.html