

Text Classification Based on Natural Annotation

Published: 2018-03-09 08:10

  Topic: text classification. Approach: link analysis. Source: Harbin Institute of Technology, master's thesis, 2013. Document type: degree thesis.


[Abstract]: In text classification research and in search engines, classification corpora have always been built through manual annotation and similar means. This process requires a great deal of labor and is therefore costly. A taxonomy built this way is also inflexible: any change to it has to be reworked by hand and needs dedicated maintenance. On the Internet, however, websites usually organize their structure according to a classification scheme, and use navigation bars at several levels to sort the information they provide into a hierarchy of categories. These categories give a coarse, noisy classification, from which misclassified pages can then be removed by cluster analysis. Following this idea, this thesis proposes an automatic text classification system based on the natural annotation information of websites, implemented in the following steps.

First, the structure of each crawled page is analyzed to obtain its structural blocks, that is, the regions of the page that serve the same function. The navigation bar falls into one of these blocks. Graph-based link analysis is used to obtain the relationships between pages and to extract the navigation bar of each page in the site.

Second, the anchor text in the extracted navigation bar is analyzed and used as category keywords. The page's own information is analyzed to decide whether the page has reached a leaf node of the site structure, which fixes its position in the site hierarchy. The site's category structure is then compared with the specified taxonomy to determine the category of the page. To find the main text of the page, the ratio of body text to non-body formatting information is computed for every line of the page; these values are smoothed and then clustered.

Third, results obtained in this way alone tend to contain a certain amount of noise, caused by differences between the classification criteria of individual websites and by deceptive links, so they need further cleaning. The data within each category is clustered to reveal its distribution; clusters that lie close together in the feature space are kept and outlying clusters are discarded, which improves classification accuracy.

The generated corpus is used as the training set for an SVM classifier, and classification of the test set reaches a relatively high accuracy. Experiments on both English and Chinese corpora give good results. This shows that the system achieves relatively high accuracy under a user-supplied taxonomy and is well suited to use in text classification and information retrieval.
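
The main-text extraction step described in the abstract (a per-line ratio of body text to formatting information, smoothed and then clustered) can be illustrated with a minimal sketch. The abstract does not give the thesis's actual formula, smoothing method, or clustering algorithm, so everything specific here is an assumption made for illustration: markup is approximated by a regular expression over tags, smoothing is a simple moving average, and clustering is a one-dimensional 2-means. The names `text_to_markup_ratios`, `smooth`, `two_means_high`, and `extract_body` are hypothetical.

```python
import re

# Assumption: "formatting information" is approximated by HTML tags.
TAG_RE = re.compile(r"<[^>]*>")


def text_to_markup_ratios(lines):
    """Ratio of visible text length to markup length for each line."""
    ratios = []
    for line in lines:
        markup_len = sum(len(tag) for tag in TAG_RE.findall(line))
        text_len = len(TAG_RE.sub("", line).strip())
        # max(..., 1) avoids division by zero on lines without tags.
        ratios.append(text_len / max(markup_len, 1))
    return ratios


def smooth(values, radius=1):
    """Moving average over a window of up to 2 * radius + 1 lines."""
    smoothed = []
    for i in range(len(values)):
        window = values[max(0, i - radius): i + radius + 1]
        smoothed.append(sum(window) / len(window))
    return smoothed


def two_means_high(values, max_iters=50):
    """1-D k-means with k = 2; True marks values in the higher cluster."""
    if not values:
        return []
    lo, hi = min(values), max(values)
    if lo == hi:
        return [False] * len(values)
    for _ in range(max_iters):
        labels = [abs(v - hi) < abs(v - lo) for v in values]
        high = [v for v, is_high in zip(values, labels) if is_high]
        low = [v for v, is_high in zip(values, labels) if not is_high]
        new_lo, new_hi = sum(low) / len(low), sum(high) / len(high)
        if (new_lo, new_hi) == (lo, hi):
            break
        lo, hi = new_lo, new_hi
    return [abs(v - hi) < abs(v - lo) for v in values]


def extract_body(lines, radius=1):
    """Return the tag-stripped lines whose smoothed ratio is in the high cluster."""
    ratios = smooth(text_to_markup_ratios(lines), radius)
    keep = two_means_high(ratios)
    return [TAG_RE.sub("", line).strip()
            for line, is_body in zip(lines, keep) if is_body]
```

Calling `extract_body` on a list of HTML source lines returns the lines judged to be main text, with tags removed. Smoothing is what lets a short body line surrounded by long ones survive, but it also blurs the boundary with adjacent navigation lines, so the window radius would need tuning on real pages.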
[Degree-granting institution]: Harbin Institute of Technology
[Degree level]: Master's
[Year degree conferred]: 2013
[Classification number]: TP391.1





