Research on Topic Web Crawlers for Web Text Mining
[Abstract]: With the advent of the Web 3.0 era, the number and complexity of Web pages on the Internet are growing explosively, and the information they contain is growing geometrically. Most of that information is carried by the text of the page, so Web text data holds a wealth of knowledge and patterns that are valuable to users. However, because Web text data is semi-structured, real-time, and discrete, users find it difficult to obtain the knowledge they need directly from such a complex data set. How to effectively mine the information and knowledge that users actually care about from massive Web data, and present it in a form they can understand, is therefore an active research topic. This thesis addresses two aspects, acquiring Web text data and analyzing it, and studies how to obtain the Web text users need accurately and efficiently and how to mine valuable knowledge from it. The specific work is as follows. First, on the topic web crawler: the principles and architecture of topic web crawlers are analyzed, the categories of topic web crawlers are introduced, and the functional topic web crawler is selected as the focus of this study; the implementation languages for web crawlers are then compared, and Node.js is chosen as a new implementation platform for the topic web crawler. Second, on the Web text representation model for topic network communities: existing text representation models are analyzed; then, because the Web text data in this thesis is mainly short text, a text representation model based on keyword vectors is proposed by combining keyword extraction and word vector representation techniques from natural language processing. Third, on the Web text clustering algorithm: Web text mining is defined and its clustering techniques are introduced in detail; after the categories of Web text clustering algorithms are analyzed, BIRCH is selected as the clustering algorithm for this thesis; the shortcomings of BIRCH are then analyzed and a new Web text clustering algorithm is proposed. Finally, building on this work, the thesis designs and implements an information acquisition and analysis system for topic network communities that combines the results on Web text mining and topic web crawling.
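As background for the clustering step named in the abstract: standard BIRCH summarizes each cluster with a clustering feature, the triple (N, LS, SS) of point count, linear sum, and sum of squared norms, so that clusters can be merged and their centroid and radius computed without revisiting the original points. The following is a minimal sketch of that summary only, written in Python rather than the thesis's Node.js. The class and method names are illustrative assumptions, the input vectors stand in for text vectors such as keyword vectors, and the sketch does not reproduce the improved algorithm the thesis proposes.

```python
import math
from dataclasses import dataclass
from typing import Sequence, Tuple


@dataclass(frozen=True)
class ClusteringFeature:
    """Standard BIRCH clustering feature (N, LS, SS). Names are illustrative."""
    n: int                          # number of points summarized
    linear_sum: Tuple[float, ...]   # component-wise sum of the points
    square_sum: float               # sum of squared Euclidean norms

    @classmethod
    def from_vector(cls, vector: Sequence[float]) -> "ClusteringFeature":
        # A single point is a cluster of size one.
        return cls(1, tuple(vector), sum(x * x for x in vector))

    def merge(self, other: "ClusteringFeature") -> "ClusteringFeature":
        # Clustering features are additive: merging adds each component.
        return ClusteringFeature(
            self.n + other.n,
            tuple(a + b for a, b in zip(self.linear_sum, other.linear_sum)),
            self.square_sum + other.square_sum,
        )

    def centroid(self) -> Tuple[float, ...]:
        return tuple(s / self.n for s in self.linear_sum)

    def radius(self) -> float:
        # Root-mean-square distance of the points from the centroid.
        centroid_norm_sq = sum(c * c for c in self.centroid())
        variance = self.square_sum / self.n - centroid_norm_sq
        return math.sqrt(max(variance, 0.0))  # guard against rounding below zero


# Two hypothetical 2-dimensional text vectors merged into one cluster summary.
a = ClusteringFeature.from_vector([1.0, 0.0])
b = ClusteringFeature.from_vector([3.0, 0.0])
merged = a.merge(b)
```

In BIRCH, a new point is absorbed into the closest leaf entry only if the merged radius (or diameter) stays within a threshold; otherwise a new entry is started.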
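[Degree-granting institution]: University of Electronic Science and Technology of China
[Degree level]: Master's
[Year degree awarded]: 2017
[Classification number]: TP391.1; TP393.09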
Article ID: 2355288
Article link: http://sikaile.net/guanlilunwen/ydhl/2355288.html