

Research on Search Engine Classified Display Technology

Published: 2018-05-13 15:49

  Topic: search engine + classified index; Source: Harbin Institute of Technology, 2012 master's thesis


[Abstract]: With advances in science and technology, Internet and communication technologies have flourished, and the volume of information on the network is growing explosively. People increasingly rely on the network to obtain the information they need. However, while this expansion of information brings convenience to network users, it also adds to their difficulties to some extent: quickly locating a target in the vast ocean of information is becoming harder and harder. To address this problem, this thesis studies search engine classified display technology, aiming to guide users through a suitable category system and help them avoid unnecessary waste of time.
The thesis divides the implementation of search engine classified display into two parts: a classification module, which labels the category of each web page, and a search engine module, which builds the classified index, performs classified retrieval, and delivers the final classified display to the user.
In the classification module, the web page collection is first preprocessed to convert pages from text form into vector space form. The thesis proposes a main-text extraction algorithm based on web page segmentation, which locates a page's main text by examining nodes in the tag tree. It then applies a document-frequency-based feature extraction algorithm to filter out words with too little discriminative power, completing the conversion of pages into vectors. Next, the text classifier is trained. A decision-tree-based method is used to extend the binary support vector machine classifier to multi-class classification, and a multiple feature selection technique better suited to hierarchical classification is proposed: a text is represented by different feature vectors at different category levels, and the same text feature is assigned different weights in the classifiers at different levels, which improves classification accuracy within the hierarchy.
In the search engine module, the open-source search engine Lucene serves as the underlying framework of the system. The classified index is built on the concept of fields in Lucene index files, with each page's category information stored in the index. When a user wants to see the search results for a particular category, searching the field that corresponds to that category level yields the classified display results.
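The document-frequency filtering and vectorization step described above might look like the following minimal Python sketch. It is not the thesis's implementation: the function names, the `min_df` and `max_df_ratio` thresholds, and the sample tokens are all assumptions made for illustration, and pages are assumed to have already been reduced to tokenized main text.

```python
from collections import Counter


def select_features_by_df(docs, min_df=2, max_df_ratio=0.9):
    """Return a sorted vocabulary of terms whose document frequency is in range.

    docs is a list of token lists. The thresholds are hypothetical: terms
    seen in fewer than min_df documents, or in more than max_df_ratio of
    all documents, are treated as having too little discriminative power.
    """
    n_docs = len(docs)
    df = Counter()
    for tokens in docs:
        df.update(set(tokens))  # count each term at most once per document
    return sorted(
        term
        for term, count in df.items()
        if count >= min_df and count / n_docs <= max_df_ratio
    )


def to_vector(tokens, vocabulary):
    """Map one tokenized document to a term-frequency vector."""
    counts = Counter(tokens)
    return [counts[term] for term in vocabulary]


# Toy corpus (invented for this example).
docs = [
    ["search", "engine", "index", "the"],
    ["search", "ranking", "the"],
    ["football", "match", "the"],
    ["football", "league", "index", "the"],
]
vocabulary = select_features_by_df(docs, min_df=2, max_df_ratio=0.9)
vectors = [to_vector(tokens, vocabulary) for tokens in docs]
print(vocabulary)
print(vectors)
```

In this toy corpus, "the" appears in every document and the one-off terms appear in only one, so both kinds are dropped and only "football", "index", and "search" remain as vector dimensions. A real system would likely apply term weighting on top of the raw counts.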
Finally, the thesis implements the methods above. Classification accuracy and sample recall serve as the evaluation criteria for the classification module, while the retrieval time of classified display and the accuracy of the search results serve as the criteria for the search engine module. Analysis of the experimental results confirms the feasibility of implementing search engine classified display in practical applications.
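As a rough illustration of the classification-module criteria, the sketch below computes per-category precision and recall in Python. It assumes that "classification accuracy" here means per-category precision, which the abstract does not state explicitly. The function name, labels, and predictions are invented for the example and are not results from the thesis.

```python
def precision_recall(true_labels, predicted_labels, category):
    """Return (precision, recall) for one category.

    precision = correctly predicted as category / all predicted as category
    recall    = correctly predicted as category / all truly in category
    """
    pairs = list(zip(true_labels, predicted_labels))
    true_positive = sum(1 for t, p in pairs if t == category and p == category)
    predicted = sum(1 for _, p in pairs if p == category)
    actual = sum(1 for t, _ in pairs if t == category)
    precision = true_positive / predicted if predicted else 0.0
    recall = true_positive / actual if actual else 0.0
    return precision, recall


# Hypothetical labels for five pages.
true_labels = ["sports", "sports", "tech", "tech", "tech"]
predicted_labels = ["sports", "tech", "tech", "tech", "sports"]

for category in ("sports", "tech"):
    p, r = precision_recall(true_labels, predicted_labels, category)
    print(category, round(p, 3), round(r, 3))
```

Reporting the two values per category, rather than a single overall figure, shows whether a classifier is over-assigning pages to a category (low precision) or missing pages that belong to it (low recall).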

[Degree-granting institution]: Harbin Institute of Technology
[Degree level]: Master's
[Year degree granted]: 2012
[Classification number]: TP391.1; TP393.092



