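Research and Implementation of Website Rating Based on Machine Learning
Published: 2018-06-20 05:58
Topics: content classification + deep learning; Source: University of Electronic Science and Technology of China, 2017 master's thesis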
[Abstract]: Harmful content has long existed on the Internet and its volume keeps growing. Most of it is pornographic, but it also includes illegal content such as gambling and pyramid schemes. Many parts of society have contributed to efforts to clean up the Internet, and the state has issued laws and regulations to govern the online environment, yet harmful content persists despite repeated bans. Many filtering systems, implemented in software or hardware, already exist, but most of them work in isolation and each rebuilds its own blacklist database. The goal of this system is to actively inspect website content and build a shared database of harmful content that provides common data support for filtering systems. The system studies deep learning algorithms for image and text classification and applies them to the task of classifying harmful content. Unlike traditional knowledge-engineering or statistical methods, which require manual feature extraction, deep learning learns feature extraction automatically and achieves higher classification accuracy in image recognition. For text classification, a new method is proposed: the long text of a web page is cut into short texts, each short text is classified, and the results are aggregated into the proportion of pornographic text on the page. The threshold on this proportion is adjusted for the audience being served, so that the filtering needs of different groups can be met. For image classification, deep convolutional models are the most effective; they have advanced considerably in recent years and have developed into several types, such as straight-line (sequential), local two-branch, and local multi-branch models.
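The segment-and-aggregate text method described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the thesis's implementation: the segment length, the audience names, the threshold values, and the `keyword_stub` classifier are all assumptions made for the example, since the abstract gives none of them. In the thesis the per-segment decision would come from a trained deep learning text classifier.

```python
from typing import Callable, Dict, List

# Hypothetical per-audience thresholds; the abstract does not state actual values.
AUDIENCE_THRESHOLDS: Dict[str, float] = {"child": 0.05, "teen": 0.15, "adult": 0.50}


def split_into_segments(text: str, segment_len: int = 50) -> List[str]:
    """Cut a long page text into fixed-length short texts (the length is an assumption)."""
    if segment_len <= 0:
        raise ValueError("segment_len must be positive")
    return [text[i:i + segment_len] for i in range(0, len(text), segment_len)]


def flagged_ratio(text: str, is_flagged: Callable[[str], bool],
                  segment_len: int = 50) -> float:
    """Classify every short text and return the fraction that was flagged."""
    segments = split_into_segments(text, segment_len)
    if not segments:
        return 0.0  # an empty page has nothing to flag
    flagged = sum(1 for segment in segments if is_flagged(segment))
    return flagged / len(segments)


def should_block(text: str, audience: str, is_flagged: Callable[[str], bool],
                 segment_len: int = 50) -> bool:
    """Block the page when its flagged ratio reaches the audience's threshold."""
    threshold = AUDIENCE_THRESHOLDS[audience]
    return flagged_ratio(text, is_flagged, segment_len) >= threshold


def keyword_stub(segment: str) -> bool:
    """Stand-in for a trained short-text classifier; NOT the thesis's model."""
    return "xxx" in segment
```

Because the threshold is looked up per audience, the same page can be blocked for a stricter audience and allowed for a more permissive one without reclassifying its text.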
This thesis studies how the different model types perform on harmful-image classification and trains several deep convolutional models by fine-tuning. The most suitable image classification algorithm is then selected according to each model's computational cost and accuracy. The system design takes extensibility and portability fully into account, and old or idle machines can serve as worker nodes, which saves project funds. The system consists of five parts: a web crawler module, a text classification module, an image classification module, a data storage module, and a data display module. The web crawler, text classification, and image classification modules are the main research focus of this thesis.
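One way to picture selecting a model by computational cost and accuracy is the sketch below. The abstract does not say how the two criteria are combined, so the rule used here (the most accurate model within a compute budget, with ties going to the cheaper model) is an assumption, and `Candidate`, `select_model`, and `max_gflops` are hypothetical names.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass(frozen=True)
class Candidate:
    name: str
    accuracy: float  # validation accuracy in [0, 1]
    gflops: float    # forward-pass cost per image


def select_model(candidates: Iterable[Candidate],
                 max_gflops: float) -> Optional[Candidate]:
    """Return the most accurate candidate within the compute budget.

    Ties on accuracy go to the cheaper model. Returns None when no
    candidate fits the budget. This rule is an assumed one, not the thesis's.
    """
    affordable = [c for c in candidates if c.gflops <= max_gflops]
    if not affordable:
        return None
    return max(affordable, key=lambda c: (c.accuracy, -c.gflops))


if __name__ == "__main__":
    # Placeholder figures for illustration only; not results from the thesis.
    pool = [
        Candidate("straight_line", 0.90, 15.0),
        Candidate("two_branch", 0.93, 4.0),
        Candidate("multi_branch", 0.93, 6.0),
    ]
    print(select_model(pool, max_gflops=10.0))
```

A budget-based rule fits the stated aim of running on old or idle machines, where the per-image cost a worker node can afford is the binding constraint.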
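[Degree-granting institution]: University of Electronic Science and Technology of China
[Degree level]: Master's
[Year degree granted]: 2017
[Classification number]: TP393.092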
Article ID: 2043227
Article link: http://sikaile.net/guanlilunwen/ydhl/2043227.html