Research on Search Result Clustering Based on Tag Word Extraction

Published: 2018-09-03 12:29

[Abstract]: We live in an era of "information explosion", and a wide range of search engines has emerged in response. Because information on the web is semi-structured or unstructured, retrieval results still contain pages unrelated to the user's query, despite the many methods used to improve retrieval precision. Relevance ranking and similar techniques help, but they still do not present results to users conveniently. To make it easier for users to find the pages they are interested in, the results returned by a search engine can be clustered so that users browse pages by topic category, which reduces the browsing burden. Building on a survey of the state of Chinese text clustering, this thesis summarizes its key technologies, including text preprocessing, text representation models, feature extraction, feature dimensionality reduction, text similarity calculation, and existing clustering algorithms, and it analyzes and compares those algorithms. It then studies text similarity calculation, covering document similarity and dissimilarity as well as proximity measures between clusters, and examines support vector regression theory and its technical implementation. The thesis proposes a text clustering method based on tag word extraction, whose goal is to cluster the search results returned by a search engine, and implements a text clustering system. First, the web documents returned as search results are preprocessed, including denoising, word segmentation, and stop-word removal. Next, 3-gram terms are extracted as tag words, a scoring method based on a supervised model is proposed, and the tag words are post-processed with steps such as similar-word substitution and word-string merging. Finally, the corpus is clustered according to the tag words using hierarchical clustering. The thesis designs the clustering system and runs experiments on it, covering tag word extraction, support vector regression statistics, and tag word clustering. The experiments show that the algorithm performs well when clustering search results and groups documents of similar categories into the same cluster.
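The pipeline described in the abstract (n-gram tag candidates, scoring, hierarchical clustering) can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration and not the thesis's implementation: the function names, the toy snippets, the document-frequency score (a stand-in for the supervised support vector regression scorer, whose features the abstract does not specify), and the Jaccard single-link merge rule with its threshold. Input snippets are assumed to be already segmented and stop-word filtered.

```python
from collections import Counter
from itertools import combinations


def ngram_candidates(tokens, max_n=3):
    """Return the set of contiguous n-grams (n = 1..max_n) in a token list."""
    return {
        tuple(tokens[i:i + n])
        for n in range(1, max_n + 1)
        for i in range(len(tokens) - n + 1)
    }


def score_tags(docs, max_n=3, min_df=2):
    """Score candidates by document frequency (placeholder for a learned scorer)."""
    df = Counter()
    for tokens in docs:
        df.update(ngram_candidates(tokens, max_n))
    # Keep candidates shared by at least min_df documents but not by all of
    # them: a term present in every result (e.g. the query) separates nothing.
    return {tag: n for tag, n in df.items() if min_df <= n < len(docs)}


def jaccard(a, b):
    """Jaccard similarity of two sets; 0.0 when both are empty."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def cluster_by_tags(docs, threshold=0.2, max_n=3):
    """Single-link agglomerative clustering of documents over their tag sets."""
    tags = score_tags(docs, max_n)
    doc_tags = [{c for c in ngram_candidates(t, max_n) if c in tags} for t in docs]
    clusters = [[i] for i in range(len(docs))]
    while len(clusters) > 1:
        # Find the most similar pair of clusters (single link = best member pair).
        best, pair = 0.0, None
        for x, y in combinations(range(len(clusters)), 2):
            sim = max(jaccard(doc_tags[i], doc_tags[j])
                      for i in clusters[x] for j in clusters[y])
            if sim > best:
                best, pair = sim, (x, y)
        if pair is None or best < threshold:
            break  # no remaining pair is similar enough to merge
        x, y = pair
        clusters[x] = sorted(clusters[x] + clusters[y])
        del clusters[y]
    return clusters


# Hypothetical search snippets for the ambiguous query "apple".
docs = [
    ["apple", "iphone", "price"],
    ["apple", "iphone", "review"],
    ["apple", "fruit", "nutrition"],
    ["apple", "fruit", "juice"],
]
print(cluster_by_tags(docs))  # [[0, 1], [2, 3]]
```

In this toy run the query term "apple" is dropped as a tag because it appears in every snippet, so the two senses separate on the remaining shared n-grams. A learned scorer and the post-processing steps the abstract mentions (similar-word substitution, word-string merging) would replace `score_tags` in a fuller version.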
[Degree-granting institution]: Beijing University of Posts and Telecommunications
[Degree level]: Master's
[Year degree granted]: 2012
[Classification number]: TP391.1



Article ID: 2219983


Article link: http://sikaile.net/kejilunwen/sousuoyinqinglunwen/2219983.html
