Domain-Oriented Web Text Collection and Classification

Published: 2018-03-21 18:49

Topic: focused crawler　Entry point: feature extraction　Source: Xi'an University of Architecture and Technology, 2011 master's thesis　Type: degree thesis

[Abstract]: With the widespread adoption of the Internet and the growing informatization of industry, Web text related to specific industry domains is accumulating rapidly. How to extract, in a targeted way, knowledge that meets given requirements from this mass of information is an active research topic in information processing. Set against the background of a special research project of the Shaanxi Provincial Education Department, "Research on Automatic Generation Methods for Conceptual Design Schemes Oriented to Domain-Specific Requirements", this thesis applies Web information collection and classification techniques to two problems: discovering and collecting domain-related topical Web resources, and preprocessing and classifying the text of the collected Web pages. The main work is as follows:
(1) Topic description methods are studied by combining a professional lexicon with feature selection. Starting from a limited professional lexicon supplied by experts, feature extraction and feature selection are applied to existing representative domain texts and to topic-related texts collected from the Web. Topic feature words are screened and the lexicon is expanded, so that the topic is expressed explicitly as a vector of topic feature words.
(2) Given the uncertainty in what a focused crawler collects, the structural characteristics of general Web pages are analyzed, and the main text of each page is extracted with a method based on the line-block distribution function. This removes advertisements, navigation and other useless text that interferes with topic relevance judgment and text classification. The method achieved good denoising results and is general-purpose.
(3) A focused-crawler search strategy based on comprehensive value evaluation is adopted. It considers both page content analysis and link analysis, incorporates the PageRank algorithm to compute a comprehensive link value for each page, and selects the URLs relevant to the topic.
(4) The title and main text are extracted from each collected page, saved as text documents and preprocessed. Based on existing mechanical topic category information, a KNN-based text classification algorithm for mechanical topics classifies the document set into multiple subclasses, and the algorithm is analyzed experimentally.
Finally, drawing on the work above, a prototype system for Web text collection and mining in the mechanical domain is implemented, with excavators as the topic.
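The abstract names the line-block distribution approach to main-text extraction but does not give its details. As a rough illustration only, the Python sketch below shows one simplified formulation of the idea: strip the markup, measure how much visible text falls in each sliding block of lines, and keep the longest dense run. The function names, the block thickness `k` and the `threshold` value are assumptions made for this example and are not taken from the thesis.

```python
import re
from html import unescape
from typing import List


def visible_lines(html: str) -> List[str]:
    """Strip scripts, styles, comments and tags; keep one string per source line."""
    html = re.sub(r"(?is)<(script|style)\b.*?</\1\s*>", "", html)
    html = re.sub(r"(?s)<!--.*?-->", "", html)
    html = re.sub(r"(?s)<[^>]+>", "", html)  # tags may span several lines
    # Decode entities and collapse runs of whitespace on each line.
    return [" ".join(unescape(line).split()) for line in html.split("\n")]


def block_lengths(lines: List[str], k: int = 3) -> List[int]:
    """Block i holds the non-space character count of lines i .. i+k-1."""
    sizes = [len(line.replace(" ", "")) for line in lines]
    return [sum(sizes[i:i + k]) for i in range(len(lines))]


def extract_main_text(html: str, k: int = 3, threshold: int = 40) -> str:
    """Return the longest run of lines whose block length stays above zero,
    starting where a block first exceeds `threshold` (illustrative values)."""
    lines = visible_lines(html)
    blocks = block_lengths(lines, k)
    best_start, best_end, best_size = 0, 0, 0
    i, n = 0, len(lines)
    while i < n:
        if blocks[i] > threshold:
            j = i
            while j < n and blocks[j] > 0:  # run ends where the text dries up
                j += 1
            size = sum(len(line) for line in lines[i:j])
            if size > best_size:
                best_start, best_end, best_size = i, j, size
            i = j
        else:
            i += 1
    return "\n".join(line for line in lines[best_start:best_end] if line)
```

Short navigation or advertisement lines surrounded by markup-only lines never push a block over the threshold, so they fall outside the selected run. In this simplified version a short line directly adjacent to the body text can still be swept into the run, which is one reason real implementations tune `k` and `threshold` per corpus.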
[Degree-granting institution]: Xi'an University of Architecture and Technology
[Degree level]: Master's
[Year conferred]: 2011
[Classification number]: TP393.09

