Research and Application of Tag Identification Techniques for Web Sites

Published: 2018-12-18 03:11
[Abstract]: In recent years, with the explosive growth in the number of Internet sites, the information available online has come to exceed what users can absorb, and finding sites of a particular type in the vast Internet has become a major challenge. Classifying Internet sites effectively, treating each site as a whole, is therefore particularly important. Existing research on website classification is based on single-label classification, either binary or multi-class. In view of this and of the multi-topic nature of websites, this thesis proposes a system for multi-label identification of websites: a system that automatically assigns multiple topics to existing websites, taking the site as the unit. The introduction briefly presents the background and significance of website label identification, the state of research on website identification, and the main research content of the thesis. The thesis then introduces web crawler technology, web page information extraction and text classification algorithms, and multi-label algorithms with their evaluation metrics. Next, it discusses multi-labelling of websites from three aspects, focusing on the following questions: first, how to analyze the structure of a website and extract structural information; second, how to locate the content-bearing information in a web page and extract the main text; third, how to label a website based on the structural information and the main text.

The work is divided into the following parts.

1. Backtracking of website topology and extraction of structural features. Website structure takes two forms: the physical structure, determined by where files are stored on the server, and the link structure of the site. Neither of these reflects the hierarchical relationships of the website clearly. The thesis therefore proposes a website topology backtracking method to recover the hierarchy of a site. Experiments show that the algorithm performs well in backtracking the hierarchical structure of websites.

2. Locating and extracting the main text of web pages. Most of the information on a website comes from the main text of its pages, so it is necessary to separate page content into main text and noise. The improved DSE algorithm proposed in the thesis achieves main-text extraction by combining the DSE algorithm with statistical rules on the characters and punctuation marks of the main text. A comparison with the original DSE algorithm shows that the improved algorithm gives satisfactory extraction results.

3. The website label identification system. To address the uneven number of feature samples across categories, the thesis proposes an attribute-weighting approach that weights the feature samples in the ML-KNN algorithm, so that categories with many feature samples receive a low weight and categories with few feature samples receive a high weight. This mitigates the low classification accuracy caused by sample imbalance between categories. Experiments show that the attribute-weighted algorithm, S-ML-KNN, does improve classification accuracy.
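The weighting idea in part 3 can be illustrated with a minimal sketch. It is not the thesis's S-ML-KNN, whose exact weighting formula is not given in this abstract. It only shows one common inverse-frequency scheme in which labels with many training samples receive a low weight and labels with few samples receive a high weight. The function name, the formula and the toy label sets are assumptions made for illustration.

```python
from collections import Counter


def inverse_frequency_weights(label_sets):
    """Return one weight per label, inversely proportional to its frequency.

    Assumed formula (not taken from the thesis):
        weight(label) = total_assignments / (num_labels * count(label))
    so frequent labels get weights below 1 and rare labels above 1.
    """
    # Count how many training samples carry each label.
    counts = Counter(label for labels in label_sets for label in labels)
    total = sum(counts.values())
    num_labels = len(counts)
    return {label: total / (num_labels * n) for label, n in counts.items()}


# Hypothetical multi-label training data: each site has a set of topic labels.
train_labels = [{"news"}, {"news", "sports"}, {"news"}, {"shopping"}]
weights = inverse_frequency_weights(train_labels)

for label in sorted(weights):
    print(label, round(weights[label], 3))
```

In a k-nearest-neighbour setting, such weights could be multiplied into each neighbour's vote for a label, so that rare labels are not drowned out by frequent ones. How the thesis integrates its weights into ML-KNN is not stated here.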
[Degree-granting institution]: Beijing University of Posts and Telecommunications
[Degree level]: Master's
[Year degree granted]: 2014
[Classification number]: TP393.092

