Research and Application of Label Identification Technologies for Web Sites
[Abstract]: In recent years, with the explosive growth in the number of Internet sites and the resulting information overload for Internet users, finding sites of a specific type in the vast ocean of the Internet has become a great challenge. Classifying Internet sites as whole units is therefore particularly important. Existing research on website classification is based on single-label classification, whether binary or multi-class. In view of this, and because websites often cover multiple topics, this paper presents a multi-label identification system for websites, which automatically identifies the multiple topics of existing websites. The introduction briefly covers the background, significance, research status and main research content of website labeling. The paper then introduces web crawler technology, algorithms for web page information extraction and text classification, multi-label algorithms, and evaluation metrics. Multi-label identification of websites is then discussed from three aspects, focusing on the following problems: first, how to analyze the structure of a website and extract structural information; second, how to locate the content of a web page and extract its text; third, how to label a website according to its structural and text information. The work is divided into the following parts.
1. Backtracking of website topology and structural feature extraction. Website structure can be divided into two types: the physical structure, determined by the location of files on the server, and the link structure of the website. However, neither structure clearly reflects the hierarchical relationships of the website. This paper therefore proposes a method for backtracking the hierarchical relationships of a website. Experiments show that the algorithm recovers the hierarchical structure of websites well.
2. Web page text extraction. Most of a website's information comes from the text content of its web pages, so the main text of a page must be separated from the noise. The improved DSE algorithm proposed in this paper combines the DSE algorithm with statistical rules on text and punctuation to extract the main text. Compared with the DSE algorithm, the improved DSE algorithm gives satisfactory text extraction results.
3. Website label identification. To address uneven numbers of feature samples across classes, this paper proposes an attribute weighting method that weights the feature samples in the ML-KNN algorithm, so that classes with many feature samples receive low weights and classes with few feature samples receive high weights. This mitigates the loss of classification accuracy caused by sample imbalance between classes. Experimental results show that the attribute-weighted algorithm, S-ML-KNN, does improve classification accuracy.
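The weighting idea in part 3 can be illustrated with a small sketch. The abstract does not give the S-ML-KNN formulas, so everything below is an assumption made for illustration: the function names, the inverse-frequency weight N / (L * count), the toy data, and the voting rule are hypothetical. It is a simplified k-nearest-neighbor vote, not ML-KNN's Bayesian decision rule and not the thesis's S-ML-KNN.

```python
import math
from collections import Counter


def balanced_label_weights(label_sets):
    """Weight each label by N / (L * count), so rare labels get larger weights."""
    counts = Counter(label for labels in label_sets for label in labels)
    n_samples, n_labels = len(label_sets), len(counts)
    return {label: n_samples / (n_labels * c) for label, c in counts.items()}


def knn_multilabel(query, samples, label_sets, weights, k=3):
    """Predict a label set from the k nearest samples using weighted votes."""
    # Rank training samples by Euclidean distance to the query.
    order = sorted(range(len(samples)), key=lambda i: math.dist(query, samples[i]))
    neighbours = order[:k]
    predicted = []
    for label, w in weights.items():
        votes_for = sum(1 for i in neighbours if label in label_sets[i])
        votes_against = k - votes_for
        # Hypothetical rule: weighted positive votes must outweigh negatives.
        if w * votes_for > votes_against:
            predicted.append(label)
    return sorted(predicted)


# Toy data (made up): "news" is frequent, "sports" appears only once.
samples = [[0.0, 0.0], [0.1, 0.0], [0.0, 0.1], [0.2, 0.1],
           [0.5, 0.0], [0.0, 0.5], [1.0, 1.0], [0.9, 1.0]]
label_sets = [{"news"}, {"news"}, {"news"}, {"news", "sports"},
              {"news"}, {"news"}, {"shopping"}, {"shopping"}]

weights = balanced_label_weights(label_sets)
uniform = {label: 1.0 for label in weights}

print(knn_multilabel([0.15, 0.1], samples, label_sets, weights))  # ['news', 'sports']
print(knn_multilabel([0.15, 0.1], samples, label_sets, uniform))  # ['news']
```

With uniform weights, the single "sports" neighbor is outvoted by the two neighbors that lack the label. With inverse-frequency weights, the rare label is kept. This is the kind of effect the abstract attributes to weighting classes that have few samples.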
[Degree-granting institution]: Beijing University of Posts and Telecommunications
[Degree level]: Master's
[Year degree conferred]: 2014
[Classification number]: TP393.092