Research on Large-Scale Hierarchical Classification Techniques for Internet Text
[Abstract]: With the development of information technology, Internet data and electronic data are growing rapidly. To organize and manage the mass of text on the Internet effectively, Internet text is usually classified according to a topic category hierarchy structured as a tree or a directed acyclic graph, and organized into a categorized catalog containing thousands or even tens of thousands of categories. A comprehensive and accurate categorized Internet catalog makes fast, fine-grained network access control possible. Within this process, large-scale hierarchical classification studies how to assign Internet text accurately to the categories of the category hierarchy. The categorized catalog is a foundation for building a healthy and harmonious Internet environment, and it underlies information retrieval, green Internet access, network reputation management, security filtering, and other network applications. The characteristics of large-scale hierarchical classification make it very different from traditional text classification and pose greater technical challenges. Based on an analysis of related work, this thesis addresses characteristics such as the huge category hierarchy, the prevalence of rare categories, and the scarcity of labeled samples for classification learning. The main research contents and achievements are as follows:
1) The large-scale hierarchical classification problem is surveyed. A definition of the problem is given, solution strategies are analyzed, and solution methods are categorized; on that basis, typical methods are introduced and compared. Finally, the solution methods are summarized and the applicability of each class of methods is pointed out.
2) To address the huge scale of the category hierarchy, a two-stage classification method based on candidate category search is studied. By searching the category hierarchy for candidate categories related to the document to be classified, the large-scale classification problem is reduced to a small-scale one; a classifier is then trained on the samples of the candidate categories to classify the document. The computational complexity of the candidate search problem is analyzed: by reducing the set cover problem to the candidate search problem, the candidate search problem is proved to be NP-hard. A heuristic candidate search algorithm based on a greedy strategy is then proposed, and it is proved that the greedy strategy used in the algorithm is a locally optimal choice and that the algorithm is [...]. In the classification stage, the context of the candidate categories in the category tree is used: different candidate categories are distinguished with the help of their ancestor categories. Finally, the two-stage classification method is implemented by combining the candidate search method with the ancestor-assisted strategy to jointly determine the document's category. Experiments use web page data from the ODP Simplified Chinese directory. The results show that the proposed candidate category search algorithm improves candidate search accuracy by about 7.5% over existing algorithms and that, on this basis, the two-stage classification method combined with the category hierarchy achieves better classification results.
3) A hierarchical classification method based on LDA feature extraction, which uses a topic model to mine the topic features of documents, is studied. In a topic category hierarchy, a topic category usually contains a series of sub-topic categories, and the topic features of a document reflect well the category it belongs to, so the LDA model is used to extract topic features from text.
To alleviate the high-dimensional sparsity of text data, each document is transformed from the word feature space into the topic feature space. In addition, the sample data is grouped according to the category hierarchy to increase the number of training samples for rare categories. Finally, a top-down classification framework is proposed in which binary classifiers are trained and used for prediction, based on the support vector machine (SVM) model, which is well suited to small-sample, high-dimensional pattern problems. Compared with traditional text classification methods, the proposed method effectively improves classification performance on rare categories in the Web subject catalog.
4) To address the lack of a corpus for expert-compiled classification systems, a classification method that learns from unlabeled data is studied. Category knowledge and topic hierarchy information are combined to construct web queries; relevant documents are retrieved from several web data sources and learning samples are extracted from them, providing a basis for supervised learning; and classifiers are then learned with a hierarchical support vector machine. To cope with noisy data in web search results, three measures are adopted to improve classification learning: 1) web queries are constructed from category knowledge and category hierarchy information, with the node label path used to generate query keywords; 2) samples are generated from multiple data sources, retrieving relevant pages and documents from sources such as the Google search engine and Wikipedia to obtain comprehensive sample data; 3) [...]. Finally, a hierarchical text classification method based on unlabeled web data is implemented. Experimental results on the ODP Simplified Chinese directory dataset show that the proposed method obtains more complete feature sources for each category, and that its performance is close to that of supervised classification with labeled training samples while avoiding manual labeling.
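The idea in item 4) of generating query keywords from a node's label path can be illustrated with a minimal sketch. Everything here is an assumption made for illustration, not the thesis's implementation: the helper names `label_path` and `build_query`, the child-to-parent dictionary, the choice to drop a root label called "Top", and the choice to quote each label as a phrase.

```python
def label_path(node, parent):
    """Return the labels from the root down to `node`.

    `parent` maps each category label to its parent label (None for the root).
    """
    path = []
    while node is not None:
        path.append(node)
        node = parent.get(node)
    return path[::-1]


def build_query(node, parent, skip=("Top",)):
    """Join the label path into one query string (hypothetical format).

    Labels listed in `skip` (assumed here to be an uninformative root label)
    are dropped; every remaining label is quoted as a phrase.
    """
    labels = [label for label in label_path(node, parent) if label not in skip]
    return " ".join(f'"{label}"' for label in labels)


# Toy hierarchy, invented for this example.
toy_parent = {"Top": None, "Sports": "Top", "Football": "Sports"}
print(build_query("Football", toy_parent))
```

Using the whole path rather than the leaf label alone is one plausible way to disambiguate labels that occur in several branches of the hierarchy, which may help reduce noisy search results.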
5) For social text as the classification object, a user topic model, UTM, is proposed. According to how a microblog post is generated, UTM divides user interest into original interest and forwarding interest; it discovers the user's original-topic preferences and forwarding-topic preferences separately and then computes the user's interest words. The interest words discovered by the UTM model support keyword tagging and tag recommendation. The UTM model is validated on a Sina microblog dataset, and the experimental results show that the performance of the UTM model in microblog [...]. To overcome the overly fine granularity of user interest words, a supervised generative model, uLTM, is proposed. The model expresses user preferences as tags and topics and builds a topic model over user tags. uLTM treats user tags as categories and introduces them into the generative model as observed variables. The hidden topic patterns in microblogs are discovered through the unsupervised learning mechanism of the topic model, and the topic feature distributions of user tags are discovered through supervised learning; from these, the topic categories of microblog users are inferred, and microblog users are ultimately classified accurately. The experimental results show that the model is suitable for modeling and classifying category labels with explicit topic meanings.
In summary, this thesis addresses four characteristics of large-scale hierarchical classification: the classification system is huge, rare categories are common, classification learning lacks annotated samples, and the classification objects are evolving toward social text. It studies key techniques for large-scale hierarchical classification, including candidate category search, rare category classification, learning from unlabeled data, and social text modeling.
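To make the candidate search stage of item 2) more concrete, the following is a minimal sketch of a generic greedy, set-cover-style selection. It is not the algorithm proposed in the thesis, whose details the abstract does not give. The function name `greedy_candidates`, the representation of each category as a set of terms, and the cap `k` on the number of candidates are all assumptions made for illustration.

```python
def greedy_candidates(doc_terms, category_terms, k):
    """Pick up to k candidate categories for one document.

    At each step, take the category that covers the most document terms
    not yet covered (the usual greedy step for set cover). Stop when every
    term is covered, k categories are chosen, or no category adds coverage.
    """
    uncovered = set(doc_terms)
    chosen = []
    while uncovered and len(chosen) < k:
        best = max(
            (c for c in category_terms if c not in chosen),
            key=lambda c: len(uncovered & category_terms[c]),
            default=None,
        )
        if best is None or not (uncovered & category_terms[best]):
            break  # nothing left that covers a new term
        chosen.append(best)
        uncovered -= category_terms[best]
    return chosen


# Toy data, invented for this example.
toy_categories = {
    "Sports/Football": {"goal", "league", "match"},
    "Sports/Tennis": {"match", "serve", "racket"},
    "Computers/Programming": {"python", "code", "compiler"},
    "News/Finance": {"stock", "market"},
}
toy_doc = ["league", "goal", "match", "python", "code"]
print(greedy_candidates(toy_doc, toy_categories, k=3))
```

In a two-stage setup, a classifier would then be trained only on the samples of the returned candidate categories, so the second stage faces a handful of classes rather than the full hierarchy.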
[Degree-granting institution]: National University of Defense Technology
[Degree level]: Doctoral
[Year degree granted]: 2014
[Classification number]: TP391.1