Domain-Oriented Web Text Collection and Classification

Published: 2018-03-21 18:49

  Topic: focused crawler. Angle: feature extraction. Source: Xi'an University of Architecture and Technology, master's thesis, 2011. Document type: degree thesis.


[Abstract]: With the widespread adoption of the Internet and the growing informatization of industry, Web text related to specific industry domains is accumulating rapidly. How to extract, in a targeted way, the knowledge that meets a given requirement from this mass of information is a current research focus in information processing.

This thesis is set against the background of a special research project of the Shaanxi Provincial Department of Education, "Research on Automatic Generation Methods for Conceptual Design Schemes Oriented to Domain-Specific Requirements". Using Web information collection and classification techniques, it studies two problems: discovering and collecting domain-related topical Web resources, and preprocessing and classifying the text of the collected pages. The main work is as follows:

(1) Topic description. The professional lexicon is combined with feature selection. Starting from a limited professional lexicon supplied by experts, feature extraction and feature selection are applied to existing representative domain texts and to topic-related texts collected from the Web. Topic feature words are selected, the lexicon is expanded, and the topic is represented explicitly as a vector of topic feature words.

(2) Page denoising. Because the pages a focused crawler collects are unpredictable, the structural characteristics of typical Web pages are analyzed and the main text is extracted with a method based on the line-block distribution function. This removes advertisements, navigation, and other text that would interfere with topic-relevance judgment and text classification. The method gave good denoising results and is general-purpose.

(3) Search strategy. The focused crawler uses a search strategy based on comprehensive value evaluation, which considers both page content analysis and link analysis. Combined with the PageRank algorithm, it computes a comprehensive link value for each page and selects the URLs relevant to the topic.

(4) Classification. The title and main text are extracted from each collected page, saved as text documents, and preprocessed. Based on existing mechanical topic category information, a KNN-based mechanical-topic text classification algorithm classifies the document set into multiple subclasses, and the algorithm is analyzed experimentally.

Finally, building on the work above and taking excavators in the mechanical domain as the topic, a prototype system for Web text collection and mining in the mechanical domain was implemented.
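The abstract gives no implementation details, so the following is only a minimal sketch of how a topic-feature-word vector and a KNN text classifier of the kind described could fit together. Everything specific here is an assumption made for illustration: the feature words, the two category labels, the use of raw term frequency as the weight, cosine similarity as the distance measure, and the function names `vectorize`, `cosine`, and `knn_classify`. Word segmentation and the other preprocessing steps are assumed to have already produced token lists.

```python
import math
from collections import Counter

# Hypothetical topic feature words; the thesis derives these from an
# expert lexicon plus feature selection, which is not reproduced here.
FEATURE_WORDS = {"hydraulic", "pump", "valve", "bucket", "boom", "arm"}


def vectorize(tokens, feature_words=FEATURE_WORDS):
    """Return a sparse term-frequency vector restricted to the feature words."""
    return Counter(t for t in tokens if t in feature_words)


def cosine(a, b):
    """Cosine similarity of two sparse vectors; 0.0 if either is all zeros."""
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def knn_classify(query, training, k=3):
    """Majority vote over the k training vectors most similar to the query.

    training is a list of (vector, label) pairs.
    """
    ranked = sorted(training, key=lambda item: cosine(query, item[0]), reverse=True)
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]


# Toy labelled documents (assumed data, not from the thesis).
TRAINING = [
    (vectorize(["hydraulic", "pump", "pressure", "hydraulic"]), "hydraulics"),
    (vectorize(["pump", "valve", "hydraulic"]), "hydraulics"),
    (vectorize(["bucket", "boom", "arm"]), "structure"),
    (vectorize(["boom", "bucket", "bucket", "steel"]), "structure"),
]

if __name__ == "__main__":
    doc = ["the", "hydraulic", "pump", "failed"]
    print(knn_classify(vectorize(doc), TRAINING, k=3))
```

Restricting the vector to the feature words keeps out-of-topic tokens from affecting similarity, which is the role the abstract assigns to the topic feature vector. A real system would more likely use a weighting such as TF-IDF and a tuned value of k.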
[Degree-granting institution]: Xi'an University of Architecture and Technology
[Degree level]: Master's
[Year degree awarded]: 2011
[Classification number]: TP393.09

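[Citing literature]

Related master's theses (first 2)

1 Wei Shenghui; Research and Design of Text Collection and Classification in the Mechanical Domain [D]; Xi'an University of Architecture and Technology; 2012

2 Dai Hong; Design and Implementation of a Streaming-Media-Based Distance Education System for Rural Grassroots Party Members and Cadres [D]; University of Electronic Science and Technology of China; 2013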



Article ID: 1645104


Article link: http://sikaile.net/wenyilunwen/guanggaoshejilunwen/1645104.html

