Research and Application of Label Identification Technologies for Web Sites
Published: 2018-12-18 03:11
[Abstract]: In recent years, the explosive growth in the number of Internet sites has left users overloaded with information, and finding sites of a particular type in the vast ocean of the Internet has become a major challenge. Classifying sites effectively, treating each site as a whole, is therefore particularly important. Existing research on website classification is based on single-label classification, either binary or multi-class. In view of this and of the multi-topic nature of websites, this thesis proposes a system for multi-label identification of websites, which automatically assigns multiple topics to existing websites, taking the site as the unit. The introduction briefly presents the background and significance of website label identification, the state of research, and the main research content of the thesis. The thesis then introduces web crawler technology, web page information extraction and text classification algorithms, and multi-label algorithms and their evaluation metrics. It then discusses website multi-labeling from three aspects, focusing on the following questions: first, how to analyze the structure of a website and extract structural information; second, how to locate the content information of a web page and extract the main text; third, how to label a website based on the structural information and the main-text information.

The work of the thesis is divided into the following parts.

1. Backtracking of website topology and extraction of structural features
Website structure takes two forms: the physical structure, determined by where files are stored on the server, and the link structure of the website. Neither of these reflects the hierarchical relationships of a website clearly. The thesis therefore proposes a website topology backtracking method to recover the hierarchical relationships of a website. Experiments show that the algorithm performs well at recovering the site hierarchy.

2. Locating and extracting the main-text content of web pages
Most of a website's information comes from the main-text content of its pages, so separating page content into main text and noise is necessary. The improved DSE algorithm proposed in the thesis achieves main-text extraction by combining the DSE algorithm with statistical rules on the text and punctuation of main-text content. A comparison with the original DSE algorithm shows that the improved algorithm gives satisfactory extraction results.

3. Website label identification system
To address uneven numbers of feature samples across categories, the thesis proposes attribute weighting: the ML-KNN algorithm is weighted by feature sample so that categories with many feature samples receive low weights and categories with few feature samples receive high weights, which mitigates the low classification accuracy caused by sample imbalance between categories. Experiments show that the attribute-weighted algorithm, S-ML-KNN, does improve classification accuracy.
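The weighting idea in part 3 can be illustrated with a small sketch. This is not the thesis's S-ML-KNN, whose exact formulation the abstract does not give; it is a simplified k-nearest-neighbour label vote in which each label's vote is scaled by an inverse-frequency weight, so that labels with few training samples count for more. The function names, the `threshold` parameter, the weight formula, and the toy data are all assumptions made for illustration.

```python
from collections import Counter
import math


def label_weights(label_sets):
    """Inverse-frequency weights (assumed formula): the most frequent
    label gets weight 1.0, rarer labels get proportionally larger weights."""
    counts = Counter(label for labels in label_sets for label in labels)
    max_count = max(counts.values())
    return {label: max_count / count for label, count in counts.items()}


def predict(query, samples, label_sets, k=3, threshold=0.5):
    """Return the set of labels whose weighted neighbour vote reaches
    the threshold. A simplified stand-in, not the Bayesian ML-KNN rule."""
    weights = label_weights(label_sets)
    # Indices of the k training samples closest to the query (Euclidean).
    nearest = sorted(range(len(samples)),
                     key=lambda i: math.dist(query, samples[i]))[:k]
    # Count how many of the k neighbours carry each label.
    votes = Counter(label for i in nearest for label in label_sets[i])
    # Score = label weight * fraction of neighbours carrying the label.
    return {label for label, n in votes.items()
            if weights[label] * n / k >= threshold}


# Hypothetical toy data: 2-D feature vectors with multi-label annotations.
samples = [(0, 0), (0, 1), (1, 0), (1, 1), (5, 5), (5, 6)]
labels = [{"news"}, {"news"}, {"news"}, {"news", "sports"},
          {"shopping"}, {"shopping"}]

print(predict((1, 1), samples, labels, k=3))
```

In this toy data the rare label "sports" appears on only one of the three nearest neighbours of `(1, 1)`, so an unweighted vote (1/3) would fall below the threshold; with the inverse-frequency weight it is kept. The same boost can also admit rare labels too eagerly, which is why any real weighting scheme would need tuning against the evaluation metrics.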
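[Degree-granting institution]: Beijing University of Posts and Telecommunications
[Degree level]: Master's
[Year degree conferred]: 2014
[Classification number]: TP393.092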
Article ID: 2385230
Article link: http://sikaile.net/guanlilunwen/ydhl/2385230.html