Research on Search Result Clustering Based on Label Word Extraction
[Abstract]: In the current era of "information explosion", search engines of many kinds have emerged to meet demand. Because information on the Internet is semi-structured or unstructured, retrieval results still contain pages unrelated to the user's query, despite the many methods used to improve retrieval accuracy. Relevance ranking and similar techniques help, but a ranked list remains an inconvenient way to present results. To make it easier for users to find the pages they are interested in, a search engine can return clustered results, so that users browse by topic category, which reduces the burden of browsing. Building on a survey of the current state of Chinese text clustering, this thesis summarizes its key technologies, including text preprocessing, text representation models, feature extraction, and feature dimensionality reduction, and it analyzes and compares text similarity calculation methods and existing clustering algorithms. It then studies text similarity calculation in detail, covering document similarity and dissimilarity measures as well as proximity measures between clusters, and analyzes the theory of support vector regression and its implementation. The thesis proposes a text clustering method based on label word extraction, whose goal is to cluster the results returned by a search engine, and implements a text clustering system based on it. First, the web pages returned in the search results are preprocessed, including noise removal, word segmentation, and stop-word removal. Then trigram-model words are extracted as label words, a scoring method based on a supervised model is proposed, and the label words undergo similar-word substitution and string merging. Finally, documents are grouped according to their label words, and hierarchical clustering completes the clustering. The thesis designs a clustering system and evaluates it experimentally; the experiments cover label word extraction, support vector regression statistics, and label word clustering. The results show that the algorithm is effective for clustering search results and groups similar documents into the same category.
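The pipeline outlined in the abstract can be illustrated with a minimal sketch. This is not the thesis's implementation. It assumes that documents are already segmented into tokens, that candidate label words are word n-grams of up to three words (one loose reading of the abstract's label-word candidates), and that document frequency can stand in for the supervised, support-vector-regression-based scoring, which the abstract does not specify. All function names, the `k` and `threshold` parameters, and the sample data are hypothetical.

```python
from collections import Counter
from itertools import combinations


def candidate_labels(tokens, max_n=3):
    """Return the set of word n-grams (n = 1..max_n) found in a token list."""
    grams = set()
    for n in range(1, max_n + 1):
        for i in range(len(tokens) - n + 1):
            grams.add(" ".join(tokens[i:i + n]))
    return grams


def top_labels(docs, k=2, max_n=3):
    """Pick up to k label words per document.

    ASSUMPTION: candidates are ranked by document frequency, with ties
    broken in favour of longer n-grams. This is a stand-in for the
    supervised scoring model mentioned in the abstract.
    """
    per_doc = [candidate_labels(tokens, max_n) for tokens in docs]
    df = Counter(gram for grams in per_doc for gram in grams)
    labels = []
    for grams in per_doc:
        shared = [g for g in grams if df[g] > 1]  # must occur in 2+ documents
        ranked = sorted(shared, key=lambda g: (-df[g], -len(g.split()), g))
        labels.append(set(ranked[:k]))
    return labels


def jaccard(a, b):
    """Jaccard similarity of two sets; 0.0 when both are empty."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def cluster(labels, threshold=0.5):
    """Average-link agglomerative clustering over per-document label sets.

    Repeatedly merges the two most similar clusters and stops once the
    best average similarity falls below the threshold.
    """
    clusters = [[i] for i in range(len(labels))]
    while len(clusters) > 1:
        best, pair = -1.0, None
        for x, y in combinations(range(len(clusters)), 2):
            sims = [jaccard(labels[i], labels[j])
                    for i in clusters[x] for j in clusters[y]]
            avg = sum(sims) / len(sims)
            if avg > best:
                best, pair = avg, (x, y)
        if best < threshold:
            break
        x, y = pair
        clusters[x] = clusters[x] + clusters[y]
        del clusters[y]  # y > x, so index x is unaffected
    return [sorted(c) for c in clusters]


# Hypothetical, already-segmented search-result snippets.
docs = [
    ["apple", "iphone", "review"],
    ["apple", "iphone", "price"],
    ["apple", "pie", "recipe"],
    ["apple", "pie", "baking"],
]
result = cluster(top_labels(docs))
print(result)  # [[0, 1], [2, 3]]
```

In this toy example the two phone-related snippets share the label "apple iphone" and the two baking-related snippets share "apple pie", so the clustering separates them even though all four contain "apple". A real system would replace the frequency-based ranking with a trained scoring model and run on segmented Chinese text.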
[Degree-granting institution]: Beijing University of Posts and Telecommunications
[Degree level]: Master's
[Year degree granted]: 2012
[Classification number]: TP391.1