Research and Implementation of Website Rating Based on Machine Learning
Published: 2018-06-20 05:58
Topics: content classification + deep learning; Source: University of Electronic Science and Technology of China, master's thesis, 2017
[Abstract]: Harmful content has long existed on the Internet and its volume keeps growing. Most of it is pornographic, but it also includes illegal content such as gambling and pyramid schemes. Many parts of society have contributed to cleaning up the Internet, and the state has issued laws and regulations to govern the online environment, yet harmful content persists despite repeated bans and remains rampant. Many interception systems, implemented in software or hardware, already exist, but most of them work in isolation and each builds its own blacklist database, duplicating effort. The goal of this system is to actively inspect website content and build a shared database of harmful content that gives interception systems a common data source. The work studies deep learning algorithms for image classification and text classification and applies them to the task of classifying harmful content. Unlike traditional knowledge-engineering or statistical methods, which require manual feature extraction, deep learning learns feature extraction automatically and achieves higher classification accuracy in image recognition. For text classification, a new method is proposed: the long text of a web page is cut into short texts, each short text is classified, and the results are aggregated into the proportion of pornographic text on the page. The threshold on that proportion can be adjusted for the audience being served, so that the filtering needs of different groups are met. For image classification, deep convolutional models are the most effective; they have advanced considerably in recent years, and several architecture types have emerged, such as linear (sequential), local two-branch, and local multi-branch models.
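The text-classification idea in the abstract (cut the page text into short texts, classify each one, aggregate into a proportion, compare against an audience-specific threshold) can be illustrated with a minimal sketch. Everything named here is an assumption made for illustration: the fixed-length splitting rule, the function names, the threshold values, and especially `keyword_stub`, which is only a stand-in for the trained deep-learning short-text classifier that the thesis uses but the abstract does not describe.

```python
from typing import Callable, List

def split_into_segments(text: str, max_len: int = 50) -> List[str]:
    """Cut a long text into consecutive chunks of at most max_len characters.

    Fixed-length cutting is an assumption; the abstract does not say how
    the thesis segments a page.
    """
    text = text.strip()
    return [text[i:i + max_len] for i in range(0, len(text), max_len)]

def flagged_ratio(text: str,
                  classify: Callable[[str], bool],
                  max_len: int = 50) -> float:
    """Return the fraction of segments that the classifier flags."""
    segments = split_into_segments(text, max_len)
    if not segments:
        return 0.0
    flagged = sum(1 for segment in segments if classify(segment))
    return flagged / len(segments)

# Hypothetical per-audience thresholds; the real values are not in the abstract.
AUDIENCE_THRESHOLDS = {"children": 0.05, "general": 0.30}

def should_block(text: str,
                 classify: Callable[[str], bool],
                 audience: str,
                 max_len: int = 50) -> bool:
    """Block the page when the flagged ratio reaches the audience threshold."""
    ratio = flagged_ratio(text, classify, max_len)
    return ratio >= AUDIENCE_THRESHOLDS[audience]

def keyword_stub(segment: str) -> bool:
    """Placeholder classifier: flags a segment containing the word 'casino'.

    A real system would call a trained short-text model here.
    """
    return "casino" in segment.lower()
```

With a 200-character page in which one of four segments is flagged, the ratio is 0.25, so under these made-up thresholds the page would be blocked for the "children" audience but not for the "general" one. Keeping the classifier as a plain callable means the aggregation logic does not depend on which model produces the per-segment decision.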
This thesis studies how the different model types perform on harmful-image classification, trains several deep convolutional models by fine-tuning, and selects the most suitable image classification algorithm based on each model's computational cost and accuracy. The system is designed with extensibility and portability in mind and can use old or idle machines as worker nodes, which saves project funds. The system consists of five parts: a web crawler module, a text classification module, an image classification module, a data storage module, and a data presentation module. Of these, the web crawler, text classification, and image classification modules are the main research focus of this thesis.
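Choosing an image model by weighing accuracy against computational cost can be sketched as follows. The abstract does not state the actual selection rule, so this example assumes one plausible rule: take the most accurate model whose per-image cost fits a compute budget, breaking ties in favor of the cheaper model. The `Candidate` type, the `pick_model` function, and all numbers are made-up placeholders, not results from the thesis.

```python
from dataclasses import dataclass
from typing import List, Optional

@dataclass(frozen=True)
class Candidate:
    name: str
    accuracy: float  # validation accuracy in [0, 1]
    gflops: float    # assumed forward-pass cost per image

def pick_model(candidates: List[Candidate],
               max_gflops: float) -> Optional[Candidate]:
    """Return the most accurate candidate within the compute budget.

    Ties on accuracy go to the cheaper model. Returns None when no
    candidate fits the budget.
    """
    affordable = [c for c in candidates if c.gflops <= max_gflops]
    if not affordable:
        return None
    return max(affordable, key=lambda c: (c.accuracy, -c.gflops))

# Placeholder values for illustration only; they are NOT measurements.
CANDIDATES = [
    Candidate("linear", accuracy=0.90, gflops=15.0),
    Candidate("two_branch", accuracy=0.93, gflops=4.0),
    Candidate("multi_branch", accuracy=0.94, gflops=6.0),
]
```

A tight budget, as on the old or idle worker nodes the abstract mentions, would steer the choice toward a cheaper model even when a slightly more accurate one exists.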
[Degree-granting institution]: University of Electronic Science and Technology of China
[Degree level]: Master's
[Year degree conferred]: 2017
[Classification number]: TP393.092