Design and Implementation of a Business Insight System Based on Web Content
Topics: URL analysis + web page classification; Source: Beijing University of Posts and Telecommunications, master's thesis, 2017
[Abstract]: The Internet era is an era of information explosion: people can browse a wide variety of online resources and develop their own browsing habits. For an individual user, the set of network resources they access reflects, to some extent, their browsing habits and interests. At present, user access logs are usually processed with DPI (deep packet inspection) techniques that compute routine field statistics without analyzing the actual content of the packets. Where content is analyzed, the analysis is limited to the target text of the page a URL points to and ignores factors such as the structure of the URL itself, which lowers the accuracy of content analysis. Methods that also treat background knowledge about a URL as input, and that combine the multi-level structure of URLs with the characteristics of different page types to extract and analyze information from Web content (web pages and URLs), have therefore become a research focus.

This thesis addresses how network operators can gain business insight into their users. It studies the technical solutions needed for Web-content-based business insight, and designs and builds a business insight system based on Web content. The main research topics are:
1. Content extraction for different types of web pages: news, video, and e-commerce. The thesis analyzes the structural characteristics of each page type, designs and implements an extraction method for each, and applies these methods in the URL analysis and Web content analysis modules.
2. Acquisition of URL tag information. The thesis analyzes the structure of URLs and the background knowledge associated with them, and derives a method that identifies URL information and manages it automatically in a unified way.
3. Platform architecture of the system. Starting from the requirements, the thesis integrates the individual techniques as functional modules and turns them into a complete system.

Based on the solutions obtained from this research and survey, the thesis implements a method for obtaining multi-level tags for web page information, a method that splits a URL into multiple fields and classifies and parses the content of each field, and a method that identifies information by searching and matching against online resources. Testing verifies the effectiveness of these methods. Building on these key techniques, the thesis completes the development of a business insight system based on Web content. Taking the set of request URL fields in users' network access logs as input, the system implements URL analysis, web page classification, Web content analysis, and rule management. It converts the URL field set into behavioral feature information about the user, which provides a basis for user feature extraction and a prerequisite for service providers such as network operators to gain business insight into their users.
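The abstract describes splitting a URL into multiple fields and classifying each field, but gives no implementation details. The following is only a minimal sketch of that general idea, not the thesis's method: the rule table `RULES`, the function names `split_url`, `label_fields`, and `page_labels`, and the example URL are all hypothetical, and simple keyword lookup stands in for whatever classification rules the actual system uses.

```python
from urllib.parse import urlsplit, parse_qsl

# Hypothetical keyword -> label rule table (an assumption, not from the thesis).
RULES = {
    "news": "news",
    "video": "video",
    "v": "video",
    "shop": "e-commerce",
    "item": "e-commerce",
}

def split_url(url):
    """Split a URL into host labels, path segments, and query parameters."""
    parts = urlsplit(url)
    host_labels = parts.hostname.split(".") if parts.hostname else []
    path_segments = [seg for seg in parts.path.split("/") if seg]
    query = parse_qsl(parts.query, keep_blank_values=True)
    return {"host": host_labels, "path": path_segments, "query": query}

def label_fields(url, rules=RULES):
    """Return (field_kind, value, label) triples; label is None if no rule matches."""
    fields = split_url(url)
    out = []
    for kind in ("host", "path"):
        for value in fields[kind]:
            out.append((kind, value, rules.get(value.lower())))
    for key, value in fields["query"]:
        # Query parameters are matched on the parameter name only.
        out.append(("query", f"{key}={value}", rules.get(key.lower())))
    return out

def page_labels(url, rules=RULES):
    """Collect distinct labels in order of appearance (host, then path, then query)."""
    seen = []
    for _, _, label in label_fields(url, rules):
        if label and label not in seen:
            seen.append(label)
    return seen
```

For a made-up URL such as `http://news.example.com/video/2017/clip.html?item=42`, `page_labels` collects one label from each of the host, path, and query fields. A real system would presumably need far richer rules (for example site-specific patterns) than exact keyword matches.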
[Degree-granting institution]: Beijing University of Posts and Telecommunications
[Degree level]: Master's
[Year degree awarded]: 2017
[Classification number]: TP393.09
Article ID: 1820872
Article link: http://sikaile.net/guanlilunwen/ydhl/1820872.html