Design and Implementation of a Topic- and Category-Based Online News Collection System

Published: 2019-01-06 07:01

[Abstract]: With the development of the Internet, online news has become one of the main sources from which people obtain information. Online news spreads quickly, has a wide reach, and serves a broad audience, but some of it is false or of low quality, and this uneven quality degrades the reading experience. Online news has also become, to some extent, both a source and a channel of online public opinion, so collecting authentic, accurate, and structured news data from the mass of online news has become a focus of online public-opinion research.

This thesis targets topic-specific and category-specific online news. It addresses the problems of topic-based and category-based collection and, once the basic functions are in place, considers how to improve system performance. It introduces the concepts of the topic crawler and the SVM classifier, together with XPath and multithreading, and on that basis designs and implements a topic- and category-based online news collection system that collects and stores both kinds of news. For topic-based collection, the system computes page similarity to build a priority queue for crawling, uses XPath to extract the title, URL, publication time, source, and body text of each topic-related article, and stores the collected data in the system database. For category-based collection, the thesis uses the LIBSVM package to train and construct the classifier, uses XPath to extract the same fields from each categorized article, and stores the collected data in the system database. The categories are society, entertainment, finance, and sports.

The thesis first presents the research background and significance of online news collection, with emphasis on domestic and international work on topic crawlers and classifiers. It then introduces the theory and techniques involved in news collection, including the Robots protocol, general-purpose web crawlers, support vector machines, topic-crawler search strategies, and XPath. Next, it analyzes the system requirements, designs the overall architecture, and designs the modules in detail: news-site seed injection, page source acquisition, page parsing, classification, topic filtering, URL scheduling, URL deduplication, page information extraction, and database storage. Building on this design, the modules are implemented by calling the ICTCLAS and LIBSVM packages, which in turn realizes topic-based and category-based news collection. Finally, the thesis lists the hardware and software environment the system requires and tests its functionality and performance separately. The test results meet the expected requirements, although much room for improvement remains. The system is implemented in C# on 32-bit Windows 7. Its robustness, efficiency, sustainability, and stability meet expectations, and it collects and stores topic-based and category-based online news data accurately, promptly, and effectively.
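The similarity-ordered crawl queue and the URL deduplication step described in the abstract can be illustrated with a minimal sketch. Several things here are assumptions, because the abstract does not specify them: cosine similarity over raw term counts stands in for the unnamed page-similarity measure, a regular-expression tokenizer stands in for ICTCLAS word segmentation, and the names `TopicFrontier`, `term_vector`, and `cosine_similarity` are hypothetical. The thesis system was written in C#; Python is used here only for brevity.

```python
import heapq
import math
import re
from collections import Counter


def term_vector(text):
    """Lowercased word counts; a stand-in for real word segmentation."""
    return Counter(re.findall(r"\w+", text.lower()))


def cosine_similarity(a, b):
    """Cosine of the angle between two sparse term-count vectors."""
    dot = sum(count * b[term] for term, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


class TopicFrontier:
    """Hypothetical crawl frontier: most topic-similar URL is popped first."""

    def __init__(self, topic_text):
        self._topic = term_vector(topic_text)
        self._heap = []       # entries are (-score, insertion_order, url)
        self._seen = set()    # every URL ever queued, for deduplication
        self._order = 0       # tie-breaker so equal scores pop in FIFO order

    def push(self, url, context_text):
        """Queue a URL scored by its surrounding text; skip duplicates."""
        if url in self._seen:
            return False
        self._seen.add(url)
        score = cosine_similarity(self._topic, term_vector(context_text))
        # heapq is a min-heap, so the score is negated to pop the best first.
        heapq.heappush(self._heap, (-score, self._order, url))
        self._order += 1
        return True

    def pop(self):
        """Return (url, score) for the most topic-similar queued URL."""
        neg_score, _, url = heapq.heappop(self._heap)
        return url, -neg_score

    def __len__(self):
        return len(self._heap)
```

A crawler built this way would call `push` for every link found on a fetched page, passing the anchor text or page text as context, and call `pop` to choose the next page to fetch. An in-memory set is the simplest deduplication structure; a very large crawl might replace it with a Bloom filter, at the cost of occasional false positives.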
[Degree-granting institution]: Shandong Normal University
[Degree level]: Master's
[Year conferred]: 2017
[Classification number]: TP311.52

Article ID: 2402489

Article link: http://sikaile.net/kejilunwen/ruanjiangongchenglunwen/2402489.html
