Design and Implementation of a Network Content Filtering System
Published: 2018-11-03 17:23
[Abstract]: Campus networks bring convenience to teachers and students, but they also bring harm: the network is flooded with unhealthy and useless information, which poses a serious challenge to the management and maintenance of university campus networks. Network content filtering is an effective countermeasure that automatically filters specific information out of network traffic. This thesis first reviews the state of development of network filtering in China and abroad, its open problems, and the common filtering methods. The system implements two key functional modules: a module for capturing and reassembling network packets, and a module for processing network text data. Together they provide the two key functions of the content filtering system: filtering of specific URLs and filtering of web page body content, where the body is text only and excludes multimedia such as images and video. The packet capture module centers on parsing network protocols, covering Ethernet frames, IP packets, TCP segments and HTTP messages. Building on this protocol analysis, it captures and analyzes packets on Windows using the WinPcap packet capture library. The module implements URL filtering and reassembly of HTML pages, and supplies the text data consumed by the text processing module. To suit the characteristics of a campus network, the URL filter database can be composed of multiple user-defined rule libraries, with different rule libraries applied in different time periods. The text processing module studies web page text classification. Because web page text is semi-structured, the thesis first studies and implements extraction of text data from web pages. It then focuses on text classification, whose two main technical difficulties are text preprocessing and classifier training; preprocessing also involves Chinese word segmentation, feature selection and weight calculation. The mainstream text classifiers are analyzed and compared theoretically, and a class-center vector (centroid) classifier is chosen to suit the characteristics of the campus network. The classifier is trained on a training set and evaluated by cross-validation, with fairly satisfactory classification results. The thesis closes with a summary and outlook for the system: future work aims at a more comprehensive content filtering system that covers not only text but also multimedia such as images, audio and video.
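As an illustration of the class-center vector classifier named in the abstract, the following is a minimal sketch and not the thesis's implementation. It assumes documents have already been segmented into tokens (the Chinese word segmentation step is not shown), uses a smoothed TF-IDF weight as one possible weighting scheme, and compares documents to class centers by cosine similarity. The function names and the labels "allowed" and "blocked" are hypothetical.

```python
import math
from collections import Counter, defaultdict


def normalize(vec):
    """Scale a sparse vector (dict term -> weight) to unit length."""
    norm = math.sqrt(sum(w * w for w in vec.values()))
    return {t: w / norm for t, w in vec.items()} if norm else {}


def train_centroids(docs, labels):
    """Build one unit-length center vector per class from tokenized documents."""
    n = len(docs)
    # Document frequency: number of documents containing each term.
    df = Counter(term for doc in docs for term in set(doc))
    # Smoothed inverse document frequency (an assumed weighting choice).
    idf = {t: math.log((1 + n) / (1 + c)) + 1.0 for t, c in df.items()}
    sums = defaultdict(lambda: defaultdict(float))
    for doc, label in zip(docs, labels):
        vec = normalize({t: tf * idf[t] for t, tf in Counter(doc).items()})
        for t, w in vec.items():
            sums[label][t] += w
    # Normalizing the sum gives the same direction as the class mean.
    centroids = {label: normalize(s) for label, s in sums.items()}
    return centroids, idf


def classify(tokens, centroids, idf):
    """Return the label whose center vector is most similar to the document."""
    counts = Counter(t for t in tokens if t in idf)  # ignore unseen terms
    vec = normalize({t: tf * idf[t] for t, tf in counts.items()})

    def cosine(center):
        return sum(w * center.get(t, 0.0) for t, w in vec.items())

    return max(centroids, key=lambda label: cosine(centroids[label]))


# Toy, made-up training data standing in for segmented web page text.
train_docs = [
    ["lecture", "exam", "course"],
    ["course", "library", "exam"],
    ["casino", "bet", "jackpot"],
    ["bet", "casino", "poker"],
]
train_labels = ["allowed", "allowed", "blocked", "blocked"]

centroids, idf = train_centroids(train_docs, train_labels)
print(classify(["casino", "poker", "bet"], centroids, idf))
```

Training reduces each class to a single vector, so classifying a page costs one similarity computation per class. A document with no known terms has zero similarity to every center, and this sketch then falls back to the first class, which a real system would need to handle explicitly.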
[Degree-granting institution]: University of Electronic Science and Technology of China (電子科技大學)
[Degree level]: Master's
[Year degree conferred]: 2014
[Classification number]: TP393.08
Article ID: 2308442
Article link: http://sikaile.net/guanlilunwen/ydhl/2308442.html