Research on Large-Scale Hierarchical Classification Techniques for Internet Text

Published: 2018-09-19 12:36
[Abstract]: With the development of information technology, Internet data and electronic data are growing rapidly. To organize and manage the massive volume of text on the Internet effectively, Internet text is usually classified according to a topic category hierarchy structured as a tree or a directed acyclic graph, and organized into a directory containing thousands or even tens of thousands of categories. A comprehensive and accurate Internet directory enables fast, fine-grained network access control. In this process, the large-scale hierarchical classification problem studies how to assign Internet text accurately to the categories in the hierarchy. Large-scale hierarchical classification of Internet text is the foundation for building Internet directories and an important technical means of building a healthy and harmonious Internet environment; it also underlies network applications such as information retrieval, green Internet access, network reputation management, and security filtering. Unlike traditional text classification, large-scale hierarchical classification has a very large taxonomy and lacks sufficient effective training corpora; its objects are mainly web text, and they are evolving toward social text. These characteristics set it apart from traditional text classification and pose greater technical challenges. Building on an analysis of related work, this thesis studies four characteristics of large-scale hierarchical classification: the very large taxonomy, the prevalence of rare categories, the lack of labeled samples for learning, and the evolution of classification objects toward social text. The main research contents and results are as follows.

1) A survey of the large-scale hierarchical classification problem. The thesis defines the problem and analyzes the strategies for solving it; it groups the solution methods into classes, introduces and compares typical methods in each class, and finally summarizes the methods and points out where each is applicable.

2) Targeting the very large category hierarchy, the thesis studies a two-stage classification method based on candidate category search. By searching the hierarchy for candidate categories related to the document to be classified, the large-scale problem is reduced to a smaller one; a classifier is then trained on the samples of the candidate categories and used to classify the document. The thesis first defines the concepts related to candidate search and proposes quantitative evaluation metrics for it. It then analyzes the computational complexity of the candidate search problem and proves it NP-hard by reducing the set cover problem to it. It further proposes a heuristic candidate search algorithm based on a greedy strategy, proving that the greedy strategy is a locally optimal choice and that the algorithm runs in polynomial time. In the classification stage, the context of the candidate categories in the category tree is used: ancestor categories help distinguish between candidates. Finally, the candidate search method and the ancestor-assisted strategy are combined into a two-stage classification method that determines the document category. Experiments on web page data from the ODP Simplified Chinese directory show that the proposed candidate category search algorithm improves candidate search accuracy by about 7.5% over existing algorithms, and that, on this basis, the two-stage method that incorporates the category hierarchy achieves better classification results.

3) Targeting the scarcity of instances in rare categories, the thesis uses the LDA topic model to mine the topic features of documents and studies a hierarchical classification method based on LDA feature extraction. In a topic category hierarchy, a topic category usually contains a series of subtopic categories, and the topic features of a document reflect well the category it belongs to. LDA is therefore used to extract topic features, mapping documents from the word feature space to the topic feature space; this dimensionality reduction alleviates the high-dimensional sparsity of text data. In addition, sample data are grouped according to the category hierarchy to enlarge the training sets of rare categories. Because LDA topic extraction is time-consuming, a hierarchical classification model is adopted to reduce the cost of training and prediction. Finally, in view of the characteristics of web page data, binary classifiers are trained with support vector machines, which suit small-sample, high-dimensional pattern problems, and a top-down classification framework is proposed for training and prediction. Experiments on the ODP Simplified Chinese directory show that, compared with a top-down method based on feature words, the proposed method effectively improves classification performance on rare categories in web topic directories.

4) Targeting the lack of corpora for expert-compiled taxonomies, the thesis studies classification with unlabeled data. Traditional text classification needs a labeled corpus to train classifiers, but manual labeling is expensive. The thesis therefore combines category knowledge and topic hierarchy information to construct web queries, searches multiple kinds of web data for relevant documents, and extracts learning samples from them, providing a basis for supervised learning; classifiers are then learned with hierarchical support vector machines. To cope with noise in web search results, three measures are taken to improve learning. First, web queries are built from category knowledge and category hierarchy information, with query keywords generated from the label path of each node. Second, samples are drawn from multiple data sources, searching both the Google search engine and Wikipedia for relevant pages and documents to obtain comprehensive sample data. Third, sample data are grouped according to the category hierarchy, giving each category a more complete source of features, and the classification model is learned using the topic category hierarchy. The result is a hierarchical text classification method based on unlabeled web data. In experiments on the ODP Simplified Chinese directory dataset, the proposed method comes close in classification accuracy to supervised classification with labeled training samples, while avoiding manual labeling.

5) Targeting social text as the classification object, the thesis proposes a user topic model, UTM. According to the different ways microblog posts are generated, it divides user interest into original interest and reposting interest; the model is inferred with Gibbs sampling to discover each user's topic preferences for original posts and for reposts, from which the user's interest words are computed. The interest words discovered by UTM support keyword tagging and tag recommendation for microblog users. The performance of UTM is validated on a Sina Weibo dataset, and the results show that its accuracy in tagging microblog users with interest words is higher than that of existing methods. Because interest words are too fine-grained to classify users effectively, the thesis then proposes a supervised generative model, uLTM, which represents user preferences as tags and topics and builds a topic model over user tags. uLTM introduces the user tag category into the generative model as an observed variable. It uses the unsupervised learning mechanism of topic models to discover latent topic patterns in microblogs and supervised learning to discover the topic feature distribution of user tags; it then infers the topic category of each microblog user, ultimately achieving accurate classification of microblog users. The performance of uLTM on microblog user classification is validated on a Twitter dataset, and the results show that the model is suited to modeling and classifying category labels whose topical meaning is clear.

In summary, addressing four characteristics of large-scale hierarchical classification (the very large taxonomy, the prevalence of rare categories, the lack of labeled samples, and the evolution of classification objects toward social text), this thesis studies key techniques including candidate category search, rare category classification, learning from unlabeled data, and social text modeling. The work has theoretical significance and practical value for the classification and topic mining of Internet text.
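The abstract relates candidate category search to the set cover problem and describes a greedy heuristic that makes a locally optimal choice at each step and runs in polynomial time. The thesis's own algorithm is not given on this page, so the sketch below is only the classic greedy heuristic for set cover, shown to illustrate that kind of strategy. The function name `greedy_cover`, the idea of covering a document's terms with category vocabularies, and the sample data are all assumptions made for illustration.

```python
from typing import Dict, List, Set


def greedy_cover(targets: Set[str], candidates: Dict[str, Set[str]]) -> List[str]:
    """Classic greedy set-cover heuristic (illustrative, not the thesis algorithm).

    Repeatedly picks the candidate that covers the most still-uncovered
    targets. The loop runs at most len(targets) times and each pass scans
    every candidate once, so the running time is polynomial.
    """
    uncovered = set(targets)
    chosen: List[str] = []
    while uncovered:
        # Locally optimal step: largest gain on the uncovered targets.
        # Ties go to the candidate that appears first in the dict.
        best = max(candidates, key=lambda name: len(candidates[name] & uncovered))
        gain = candidates[best] & uncovered
        if not gain:
            raise ValueError("targets cannot be covered by the given candidates")
        chosen.append(best)
        uncovered -= gain
    return chosen


# Hypothetical data: terms of one document and per-category vocabularies.
doc_terms = {"compiler", "debugger", "browser", "theorem", "proof"}
category_terms = {
    "Computers/Software": {"compiler", "debugger", "browser"},
    "Computers/Internet": {"browser", "protocol"},
    "Science/Math": {"theorem", "proof"},
    "Arts/Music": {"melody"},
}

selected = greedy_cover(doc_terms, category_terms)
print(selected)  # ['Computers/Software', 'Science/Math']
```

The greedy choice does not guarantee a minimum cover in general, which is consistent with the problem being NP-hard; it trades optimality for a polynomial running time.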
[Degree-granting institution]: National University of Defense Technology
[Degree level]: Doctoral
[Year degree granted]: 2014
[Classification number]: TP391.1

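[Similar literature]

Related journal articles (top 10)

1 王義章; Construction and Implementation of a Hierarchical Classification Model [J]; Application Research of Computers; 1994, No. 04

2 陸彥婷; 陸建峰; 楊靜宇; A Survey of Hierarchical Classification Methods [J]; Pattern Recognition and Artificial Intelligence; 2013, No. 12

3 古平; 羅志恒; 歐陽源怞; Research on Hierarchical Document Classification Based on an Incremental Mode [J]; Computer Engineering; 2014, No. 01

4 何力; 丁兆云; 賈焰; 韓偉紅; Candidate Category Search in Large-Scale Hierarchical Classification [J]; Chinese Journal of Computers; 2014, No. 01

5 譚金波; An Improved Hierarchical Document Classification Method [J]; New Technology of Library and Information Service; 2007, No. 02

6 古平; 朱慶生; 張程; 莊致; An Adaptive Hierarchical Classification Model Integrating Ontology and Context [J]; Transactions of Beijing Institute of Technology; 2009, No. 10

7 史鐵林; 王雪; 何濤; 楊叔子; A Hierarchical Classification Diagnosis Model [J]; Journal of Huazhong University of Science and Technology; 1993, No. 01

8 張金; 王橋; 陳卓寧; A Hierarchical Classification Tree Control Based on Dynamic Rule Parsing [J]; Mechanical Engineer; 2007, No. 01

9 李文; 苗奪謙; 衛志華; 王煒立; A Hierarchical Text Classification Model Based on Blocking Prior Knowledge [J]; Pattern Recognition and Artificial Intelligence; 2010, No. 04

10 高波; 趙政; Research on a Hierarchical Text Classification System [J]; Computer Engineering and Applications; 2006, No. 11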

Related conference papers (top 1)

1 周毅; 江云亮; 張銘; 熊宇紅; 馮是聰; Topical Crawling Based on Hierarchical Classification of "Links" [A]; Proceedings of the 24th China National Database Conference (Technical Reports) [C]; 2007

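Related doctoral dissertations (top 2)

1 何力; Research on Large-Scale Hierarchical Classification Techniques for Internet Text [D]; National University of Defense Technology; 2014

2 祝翠玲; Research on Hierarchical Text Classification Methods Based on Category Structure [D]; Shandong University; 2011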

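Related master's theses (top 10)

1 朱麗; Analysis of Disease Nature Based on Hierarchical Classification [D]; Nanjing University of Science and Technology; 2015

2 張薇娟; Research on Stepwise Hierarchical Text Classification Based on Fuzzy Cognitive Maps [D]; Tianjin Normal University; 2008

3 肖雪; Research on Hierarchical Classification of Chinese Text and Its Application to Tang Poetry Classification [D]; Chongqing University; 2006

4 孔照昆; Research and Application of Hierarchical Classification Methods for Chinese Text [D]; Yangzhou University; 2013

5 王棟; Application of SVM-Based Classification Methods in Content Management [D]; Northwest University; 2006

6 谷峰; Research on Hierarchical Classification of Chinese Web Pages [D]; Huaqiao University; 2007

7 李慧; Research on Hierarchical Classification Methods for Protein Function Prediction [D]; Jilin University; 2010

8 白振田; Research on a Hierarchical Text Classification System Combining the Vector Space Model with Rule Matching [D]; Nanjing Agricultural University; 2006

9 藺燕; A Study of Level-Based and Type-Based Teaching at 西藏民族學院 [D]; 西藏民族學院; 2014

10 章張; Design and Implementation of String Matching Algorithms in a Network Content Supervision System Based on Hierarchical Classification [D]; Nanjing University of Science and Technology; 2004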


Article ID: 2250136


Article link: http://sikaile.net/kejilunwen/sousuoyinqinglunwen/2250136.html

