
Design and Implementation of a Spark-Based News Web Page Classification System

Published: 2018-05-03 10:49

  Topic: web page classification + web page structure information; Source: Beijing University of Posts and Telecommunications, master's thesis, 2017


[Abstract]: The Internet is developing rapidly. It has grown into a mature, very large system whose information is both vast in volume and highly up to date, and people rely on it more and more to learn about the outside world. Because the Internet is open and heterogeneous, however, online information is disordered and complex, and it is hard to locate the needed information accurately in such a large and irregular body of content. In addition, users often want to filter out certain categories of web pages. Web page classification is an effective way to address these problems: it organizes and processes web pages in a unified manner so that users can access them conveniently and resources are used efficiently. This thesis studies the full pipeline of traditional web page classification in some depth, analyzing web page information extraction, feature selection, feature weight calculation, and classification methods. On this basis, the main contributions are as follows. (1) Previous web page classification methods ignore the semantic-level information of the text. To address this, a topic model is introduced and a classification method that combines the vector space model with the topic model is proposed. The improved method and the traditional method are compared on the same data set, and the results show that introducing the LDA model improves classification performance in every category. (2) Previous methods ignore the structural information of web pages. To address this, TF-IDF is improved using web page structure information. The same data set is vectorized with the traditional TF-IDF and with the improved TF-IDF, and the same SVM classifier is applied to both in a comparison experiment. The results show that taking web page structure into account improves classification performance. (3) Previous methods treat web pages as isolated objects and ignore the relationships between them. To address this, the random forest method is improved using information about the relationships between web pages, and experiments show that the improved random forest classifies better than the original random forest. (4) Building on this theoretical work, a Spark-based web page classification system is implemented. Its main modules are a web crawling module, a web page preprocessing module, and a web page classification module.
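The abstract does not give the formula used to fold web page structure into TF-IDF. The following is only a minimal sketch of one common way to do it, under stated assumptions: each page is represented as a mapping from a structural field to its tokens, terms in the title or headings count more than body terms, and a smoothed IDF is used. The field names, the weights in `FIELD_WEIGHTS`, and the function names are all hypothetical and are not taken from the thesis.

```python
import math
from collections import Counter

# Hypothetical per-field weights; the abstract does not state the actual values.
FIELD_WEIGHTS = {"title": 3.0, "heading": 2.0, "body": 1.0}


def weighted_term_freq(page):
    """Sum term counts over the page's fields, scaling each field by its weight."""
    tf = Counter()
    for field, tokens in page.items():
        weight = FIELD_WEIGHTS.get(field, 1.0)  # unknown fields count as body text
        for token in tokens:
            tf[token] += weight
    return tf


def structure_tfidf(pages):
    """Return one {term: weight} dict per page.

    TF is the structure-weighted count normalized by the page total;
    IDF is the smoothed form log((1 + n) / (1 + df)) + 1 (an assumption).
    """
    tfs = [weighted_term_freq(page) for page in pages]
    df = Counter()
    for tf in tfs:
        df.update(tf.keys())  # each page contributes at most 1 per term
    n = len(pages)
    vectors = []
    for tf in tfs:
        total = sum(tf.values())
        vectors.append({
            term: (count / total) * (math.log((1 + n) / (1 + df[term])) + 1.0)
            for term, count in tf.items()
        })
    return vectors


# Toy input: two already-tokenized pages.
pages = [
    {"title": ["spark", "release"], "body": ["spark", "cluster", "news"]},
    {"title": ["football", "final"], "body": ["match", "news", "football"]},
]
vectors = structure_tfidf(pages)
```

In this toy input, "spark" appears in both the title and the body of the first page, so it receives a larger weight than "cluster", which appears only in the body. The resulting sparse vectors could then be passed to any classifier, such as the SVM mentioned in the abstract.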
[Degree-granting institution]: Beijing University of Posts and Telecommunications
[Degree level]: Master's
[Year degree awarded]: 2017
[Classification number]: TP393.092




Article ID: 1838202


Article link: http://sikaile.net/guanlilunwen/ydhl/1838202.html

