

A Web Page Classification Approach Based on the Wikipedia Category Network and a URL Pattern Tree

Published: 2018-05-09 14:38

  Topics: web page classification + wiki category network; Source: Shanghai Jiao Tong University, master's thesis, 2013


【Abstract】: Classification is an important problem in information retrieval, and web page classification is particularly significant for improving the quality of Internet services. Many key Internet applications, including site directories, search engines, web crawlers, recommender systems, user behavior analysis systems, and advertising systems, all depend on efficient and accurate page classification. Many classification methods have been proposed for these applications, among them text classification based on page content. Content-based methods depend on the quality of the body text: if the text quality is too poor, or the text too short, classification performance degrades. With the construction of large-scale dictionaries and category systems, classification methods based on third-party lexicons have attracted wide attention. A third-party lexicon provides ready-made semantic categories: on the one hand, these can serve as auxiliary information to strengthen semantic recognition and improve classification accuracy; on the other hand, they can be used directly for classification, which alleviates the short-text problem to some extent and, since no training set is required, classifies efficiently.

This thesis performs classification in a whole-web environment, where data structures are complex, noise is abundant, and interference is strong. Under traditional methods, poor text quality greatly reduces classification accuracy; moreover, because the volume of whole-web data is huge, traditional methods must introduce large training sets to train a classification model and may be unable to classify efficiently. This thesis proposes a topic classification model based on the Wikipedia category network. The network is extremely rich in both vocabulary and semantics, and because Wikipedia is edited online in real time, many entries even "keep pace with the times," giving good coverage of web-scale vocabulary. Furthermore, the method needs no training set: once the category associations of the Wikipedia network have been built, it can be used directly for classification prediction. At the same time, although the category vocabulary changes in real time, the category system as a whole is relatively stable, so the method remains effective over long periods. Experiments comparing it against traditional content-based classification demonstrate the feasibility of the approach.

In addition, this thesis proposes a novel site function classification method based on a URL pattern tree. Borrowing the syntax-tree kernel (Tree Kernel) technique from natural language processing, it constructs URL syntax rules and URL syntax trees, and classifies site functions with an improved Tree Kernel.
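The lexicon-based idea described in the abstract — using ready-made semantic categories directly for classification, with no training set — can be sketched as a simple category vote. This is a minimal illustration, not the thesis's implementation; the tiny `LEXICON` dictionary is a hypothetical stand-in for a large third-party category system such as the Wikipedia category network.

```python
from collections import Counter

# Hypothetical toy lexicon: term -> semantic category. A real system
# would derive these mappings from a large third-party category
# resource rather than hand-coding them.
LEXICON = {
    "goalkeeper": "Sports",
    "midfielder": "Sports",
    "transistor": "Electronics",
    "capacitor": "Electronics",
    "sonnet": "Literature",
}

def classify(tokens):
    """Vote each recognised term into its lexicon category and return
    the majority category, or None when no term matched the lexicon."""
    votes = Counter(LEXICON[t] for t in tokens if t in LEXICON)
    if not votes:
        return None
    return votes.most_common(1)[0][0]

print(classify("the goalkeeper passed to the midfielder".split()))  # Sports
```

Because classification is a pure lookup-and-vote, it requires no model training, which is why such methods scale to whole-web data; the trade-off is full dependence on the coverage and granularity of the third-party lexicon.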
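The URL-pattern-tree idea can likewise be sketched in miniature. The following is an assumed toy version, not the thesis's improved Tree Kernel: each URL path is turned into a branch of pattern nodes (all-digit segments generalised to a `<NUM>` wildcard), and two URLs are compared with a simple kernel that counts the rooted pattern nodes they share.

```python
from urllib.parse import urlparse

def pattern_path(url):
    """Split the URL path into segments, replacing all-digit segments
    with the wildcard token <NUM> to form a pattern branch."""
    segments = [s for s in urlparse(url).path.split("/") if s]
    return ["<NUM>" if s.isdigit() else s for s in segments]

def kernel(url_a, url_b):
    """Toy tree kernel: the number of matching ancestor pattern nodes
    two URL branches share, counted from the root of the path."""
    a, b = pattern_path(url_a), pattern_path(url_b)
    shared = 0
    for x, y in zip(a, b):
        if x != y:
            break
        shared += 1
    return shared

# Two thread pages of the same (hypothetical) forum collapse onto the
# same pattern branch, so their kernel value is high.
print(kernel("http://ex.com/forum/thread/123",
             "http://ex.com/forum/thread/456"))  # 3
```

A kernel of this kind lets structurally similar URLs score high even when their concrete segments (thread IDs, dates) differ, which is the intuition behind classifying site function from URL structure rather than page content.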
【Degree-granting institution】: Shanghai Jiao Tong University
【Degree level】: Master's
【Year conferred】: 2013
【CLC number】: TP391.3




Article ID: 1866404


Link to this article: http://sikaile.net/kejilunwen/sousuoyinqinglunwen/1866404.html


