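Domain-Oriented Web Text Collection and Classification
Published: 2018-03-21 18:49
Topic: focused crawler  Focus: feature extraction  Source: Xi'an University of Architecture and Technology, master's thesis, 2011  Document type: degree thesis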
[Abstract]: With the widespread adoption of the Internet and the growing informatization of every industry, Web text related to particular industry domains is accumulating rapidly. How to extract, in a targeted way, the knowledge that meets a given requirement from this mass of information is a current research focus in information processing. This thesis takes as its background the Shaanxi Provincial Education Department special research project "Research on Automatic Generation Methods for Conceptual Design Schemes Oriented to Specific Domain Requirements". Using Web information collection and classification techniques, it studies two problems: discovering and collecting Web resources on domain-related topics, and preprocessing and classifying the text of the collected pages. The main work is as follows:
(1) Topic description. A professional lexicon is combined with feature selection. Starting from a limited professional lexicon supplied by experts, feature extraction and feature selection are applied to existing representative domain texts and to topic-related texts collected from the Web. Topic feature words are selected, the lexicon is expanded, and the topic is represented explicitly as a vector of topic feature words.
(2) Web page denoising. Because the pages a focused crawler collects are unpredictable, the structural characteristics of typical Web pages are analyzed and the main text is extracted with a method based on the line-block distribution function. Advertisements, navigation, and other text that interferes with topic-relevance judgment and text classification are removed. The method denoises pages well and is general-purpose.
(3) Crawl strategy. A focused-crawler search strategy based on comprehensive value evaluation is adopted. It considers both page content analysis and link analysis and, combined with the PageRank algorithm, computes a comprehensive link value for each page in order to select the URLs relevant to the topic.
(4) Classification. The title and main text of each collected page are extracted, saved as a text document, and preprocessed. Using the existing mechanical topic categories, a KNN-based text classification algorithm for mechanical topics classifies the document collection into multiple subclasses, and the algorithm is evaluated experimentally.
Finally, building on the work above and taking excavators in the mechanical domain as the topic, a prototype system for Web text collection and mining in the mechanical domain was implemented.
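As an illustration of the line-block idea in item (2) of the abstract, the following is a minimal sketch and not the thesis's implementation. It assumes the common simplified form of the technique: strip the markup while keeping line breaks, sum the visible characters over a sliding window of lines, and keep the contiguous region with the most text. The names `visible_lines` and `extract_main_text` and the parameters `block_size` and `min_block_chars` are hypothetical.

```python
import re
from html import unescape


def visible_lines(page: str) -> list[str]:
    """Remove scripts, styles, comments and tags, keeping one entry per source line."""
    page = re.sub(r"(?is)<(script|style)\b.*?</\1\s*>", "", page)
    page = re.sub(r"(?s)<!--.*?-->", "", page)
    page = re.sub(r"<[^>]+>", "", page)
    return [re.sub(r"\s+", " ", unescape(line)).strip() for line in page.split("\n")]


def extract_main_text(page: str, block_size: int = 3, min_block_chars: int = 40) -> str:
    """Return the densest run of text lines (assumed thresholds, not from the thesis)."""
    lines = visible_lines(page)
    lengths = [len(re.sub(r"\s", "", line)) for line in lines]
    n = len(lines)
    # Line-block distribution: visible characters in the window starting at line i.
    blocks = [sum(lengths[i:i + block_size]) for i in range(n)]

    best_start, best_end, best_total = 0, 0, 0
    i = 0
    while i < n:
        if blocks[i] < min_block_chars:
            i += 1
            continue
        # A dense block opens a candidate region; it ends at the first empty block.
        j = i
        while j < n and blocks[j] > 0:
            j += 1
        total = sum(lengths[i:j])
        if total > best_total:
            best_start, best_end, best_total = i, j, total
        i = j
    return "\n".join(line for line in lines[best_start:best_end] if line)
```

Short navigation or footer lines never form a block above `min_block_chars`, so they fall outside the selected region as long as a few empty lines separate them from the article body. Both thresholds would need tuning on real pages.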
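Item (3) of the abstract combines content analysis with link analysis but does not give a formula here. The sketch below is therefore only one plausible reading, with every detail an assumption: content relevance is the cosine similarity between a page's term weights and the topic vector, link importance is a PageRank score from plain power iteration, and the two are mixed linearly with a hypothetical weight `alpha`. The function names are also hypothetical.

```python
import math


def cosine(a: dict, b: dict) -> float:
    """Cosine similarity between two sparse term-weight vectors."""
    dot = sum(weight * b.get(term, 0.0) for term, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def pagerank(links: dict, damping: float = 0.85, iterations: int = 50) -> dict:
    """PageRank by power iteration; links maps each URL to the URLs it points to."""
    nodes = sorted(set(links) | {v for outs in links.values() for v in outs})
    n = len(nodes)
    rank = {u: 1.0 / n for u in nodes}
    for _ in range(iterations):
        new_rank = {u: (1.0 - damping) / n for u in nodes}
        for u in nodes:
            outs = links.get(u, [])
            if outs:
                share = damping * rank[u] / len(outs)
                for v in outs:
                    new_rank[v] += share
            else:
                # Dangling page: spread its rank evenly over all pages.
                for v in nodes:
                    new_rank[v] += damping * rank[u] / n
        rank = new_rank
    return rank


def link_value(page_terms: dict, topic: dict, page_rank: float,
               max_rank: float, alpha: float = 0.7) -> float:
    """Assumed linear mix of content relevance and normalized PageRank."""
    return alpha * cosine(page_terms, topic) + (1.0 - alpha) * page_rank / max_rank
```

A crawler built this way would enqueue only the URLs whose `link_value` exceeds some cutoff. Both the cutoff and `alpha` are tuning choices that the abstract does not specify.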
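The KNN classification in item (4) of the abstract can be sketched as follows. This is a generic k-nearest-neighbour text classifier, not the thesis's algorithm: it assumes raw term-frequency vectors, cosine similarity, and a similarity-weighted vote. The tokenizer handles only lowercase ASCII words, so Chinese text would first need a word segmenter, and the category labels in the usage note are hypothetical.

```python
import math
import re
from collections import Counter


def vectorize(text: str) -> Counter:
    """Bag-of-words term frequencies (ASCII words only, an assumption for brevity)."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two term-frequency vectors."""
    dot = sum(count * b[term] for term, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def knn_classify(text: str, training: list, k: int = 3) -> str:
    """Label text by a similarity-weighted vote among its k most similar training docs."""
    query = vectorize(text)
    scored = sorted(
        ((cosine(query, vectorize(doc)), label) for doc, label in training),
        reverse=True,
    )[:k]
    votes = Counter()
    for similarity, label in scored:
        votes[label] += similarity
    return votes.most_common(1)[0][0]
```

With a small labelled set such as `[("hydraulic pump pressure cylinder valve", "hydraulics"), ("diesel engine fuel injection turbo", "engine")]`, a call like `knn_classify("oil pressure in the hydraulic cylinder", training, k=1)` picks the label of the most similar document. Weighting votes by similarity keeps distant neighbours from outvoting a close match when `k` is large relative to the training set.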
[Degree-granting institution]: Xi'an University of Architecture and Technology
[Degree level]: Master's
[Year degree awarded]: 2011
[Classification number]: TP393.09
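[Citing literature]
Related master's theses (first 2)
1 Wei Shenghui (魏勝輝); Research and Design of Text Collection and Classification in the Mechanical Domain [D]; Xi'an University of Architecture and Technology; 2012
2 Dai Hong (代宏); Design and Implementation of a Distance Education System for Rural Grassroots Party Members and Cadres Based on Streaming Media Technology [D]; University of Electronic Science and Technology of China; 2013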
Article ID: 1645104
Article link: http://sikaile.net/wenyilunwen/guanggaoshejilunwen/1645104.html