Internet Service Reassembly and Content Extraction
Published: 2019-01-15 07:25
[Abstract]: The rapid development of the Internet has driven rapid growth in network applications. The Internet offers users a wide variety of network services and continually meets their diverse needs. Massive amounts of data are generated every day, so filtering out harmful information and selecting useful information has significant research value and engineering significance. This thesis is devoted to the study and implementation of service reassembly and content extraction for network applications. The work comprises three parts: the design and implementation of network service reassembly; content extraction and security auditing of forum and community applications based on regular expressions; and web content extraction and analysis based on the DOM tree. The thesis first introduces the HTML language, the DOM model, and the key technologies involved, such as packet capture and packet reassembly. Second, it designs and implements the network service reassembly process: it describes packet reassembly, uses the open-source libnids library to reassemble TCP sessions, and applies decompression and chunked decoding to the HTTP data to recover the web pages. Third, it collects communication data from dozens of popular forums, identifies through analysis several commonly used general-purpose forum systems, extracts forum identification features, proposes the concept of a forum fingerprint, and optimizes the traditional forum auditing method. Finally, combining the characteristics of web pages with the features of the information to be extracted, it proposes a DOM-based web page extraction method: the page is preprocessed, tags are selected as extraction features, and a DOM tree is built to extract page content quickly. This method was used to implement the software version tracking module of a network office management service system, and the thesis analyzes the web page feature extraction method and the characteristics of web pages.
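The abstract mentions recovering web pages from reassembled HTTP data by chunked decoding and decompression. The following is a minimal Python sketch of that step only, not the thesis's implementation (which is built on libnids). The function names `decode_chunked`, `decode_content`, and `recover_page` are hypothetical, and the sketch assumes the response body and its headers (with lower-case keys) have already been separated from a reassembled TCP stream.

```python
import gzip
import zlib


def decode_chunked(body: bytes) -> bytes:
    """Decode an HTTP/1.1 chunked body; trailer fields are ignored."""
    out = bytearray()
    pos = 0
    while True:
        line_end = body.find(b"\r\n", pos)
        if line_end == -1:
            raise ValueError("missing chunk-size line")
        # The size line is hexadecimal, optionally followed by ";extension".
        size_field = body[pos:line_end].split(b";", 1)[0].strip()
        size = int(size_field.decode("ascii"), 16)
        pos = line_end + 2
        if size == 0:
            break  # last chunk; anything after it is trailer data
        chunk = body[pos:pos + size]
        if len(chunk) != size:
            raise ValueError("truncated chunk")
        out += chunk
        pos += size
        if body[pos:pos + 2] != b"\r\n":
            raise ValueError("chunk data not terminated by CRLF")
        pos += 2
    return bytes(out)


def decode_content(body: bytes, content_encoding: str) -> bytes:
    """Undo the Content-Encoding applied to an already de-chunked body."""
    encoding = content_encoding.strip().lower()
    if encoding in ("gzip", "x-gzip"):
        return gzip.decompress(body)
    if encoding == "deflate":
        try:
            return zlib.decompress(body)  # zlib-wrapped deflate
        except zlib.error:
            # Some servers send raw deflate without the zlib header.
            return zlib.decompress(body, -zlib.MAX_WBITS)
    return body  # identity or unsupported encoding: return unchanged


def recover_page(body: bytes, headers: dict) -> bytes:
    """Apply chunked decoding first, then decompression (hypothetical helper)."""
    if headers.get("transfer-encoding", "").strip().lower() == "chunked":
        body = decode_chunked(body)
    return decode_content(body, headers.get("content-encoding", "identity"))
```

The order matters: the transfer encoding is removed first because chunking is applied to the already compressed payload. A caller would pass the raw body bytes plus a header dictionary such as `{"transfer-encoding": "chunked", "content-encoding": "gzip"}`.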
[Degree-granting institution]: Beijing University of Posts and Telecommunications
[Degree level]: Master's
[Year degree awarded]: 2014
[Classification number]: TP393.092
Document ID: 2408982
Article link: http://sikaile.net/guanlilunwen/ydhl/2408982.html