Research on Semantics-Based Search Result Clustering Methods

Published: 2018-05-16 09:45

Topic: search results + clustering; Source: Beijing University of Posts and Telecommunications, master's thesis, 2014

[Abstract]: With the growth of the Internet, more and more people obtain their information online. Search engines, as the intermediary between users and the Internet, are responsible for acquiring and retrieving information and have brought great convenience. However, as the volume of information on the Internet grows, the results returned by search engines have become increasingly cluttered, containing many irrelevant, duplicated, and mixed results. Users often have to spend considerable time and effort browsing them before finding a satisfactory result. Some researchers have therefore applied clustering techniques from information retrieval to the categorization of search results, presenting the cluttered results to users in groups; this approach is called search result clustering. Search result clustering uses clustering, an unsupervised machine learning technique, to group search results into clusters according to the principle of "maximizing intra-cluster similarity and minimizing inter-cluster similarity", and extracts cluster labels to give users category-based navigation. Moreover, the objects being clustered are not traditional long texts but the short snippets of search results. At present, most search result clustering techniques represent snippets with independent words, ignoring semantic information such as the semantic associations between words, which results in serious semantic loss.
To address this semantic loss, this thesis studies semantics-based search result clustering methods in depth. The main research topics are: preprocessing and modeling methods for search results, classic search result clustering methods, and semantics-based search result clustering methods. Building on this research, the thesis proposes an OPTICS-based search result clustering algorithm and a WordNet-based suffix tree clustering algorithm. Both algorithms offer improvements aimed at semantic loss, focusing on mining and exploiting the semantic information in search result snippets in order to improve clustering accuracy. Finally, the thesis carries out clustering experiments on a search result dataset and compares the clustering performance of the two new algorithms. The experimental results show that both improved algorithms achieve clearly higher clustering accuracy than the original algorithms and shorter running times, improving the browsability and real-time performance of search result clustering.
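The semantic loss described in the abstract can be illustrated with a small sketch. This is not the thesis's algorithm: the snippets, the `SYNONYMS` table (a hypothetical stand-in for a lexical resource such as WordNet), and the function names are all assumptions made for illustration. It shows that two snippets with the same meaning but no shared words have zero cosine similarity under an independent-word representation, and that mapping synonyms to a shared concept recovers some similarity.

```python
import math
from collections import Counter

# Hypothetical synonym table standing in for a lexical resource such as WordNet.
# It is an illustrative assumption, not data from the thesis.
SYNONYMS = {"automobile": "car", "auto": "car", "cost": "price"}


def bag_of_words(snippet, normalize=False):
    """Count the words in a snippet, optionally mapping synonyms to one concept."""
    words = snippet.lower().split()
    if normalize:
        words = [SYNONYMS.get(w, w) for w in words]
    return Counter(words)


def cosine(a, b):
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# Two made-up snippets with similar meaning but no words in common.
s1 = "cheap car price"
s2 = "inexpensive automobile cost"

# Independent-word representation: no overlap, so similarity is zero.
plain = cosine(bag_of_words(s1), bag_of_words(s2))

# After mapping synonyms to shared concepts, two of three terms match.
semantic = cosine(bag_of_words(s1, True), bag_of_words(s2, True))

print(plain, semantic)
```

A clustering algorithm that relies on the first similarity would never place these two snippets in the same cluster, which is the kind of failure the abstract attributes to independent-word representations.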

[Degree-granting institution]: Beijing University of Posts and Telecommunications
[Degree level]: Master's
[Year degree conferred]: 2014
[Classification number]: TP391.1



Article link: http://sikaile.net/kejilunwen/sousuoyinqinglunwen/1896368.html
