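Research and Implementation of Chinese Vertical Search Technology
Published: 2018-01-19 05:03
Keywords: search engine; vertical search; Nutch; Chinese word segmentation; text clustering. Source: Hebei University of Science and Technology, master's thesis, 2012. Document type: degree thesis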
[Abstract]: With the rapid development of the Internet, the number of Internet users in China keeps growing, online services have become highly varied, the number of websites has risen sharply, and the information they hold expands day by day. Faced with this vast body of information, how to retrieve satisfactory results accurately and efficiently, without wavering among countless choices and drowning in a sea of information, has become a primary concern for users. Vertical search engines emerged to meet this need: they aim to provide faster, higher-quality, and more specialized retrieval services.
This thesis presents exploratory research on current hot topics in search engine technology. The main contents are:
1) The process by which a crawler fetches web pages, the selection of the initial seed set, and the relationship between the number of threads opened at run time and network resource overhead.
2) Chinese word segmentation methods, and the segmentation results of several currently popular segmenters, including ICTCLAS, JE, and paoding, after being integrated into a vertical search engine.
3) The application of online web page clustering algorithms in Nutch, mainly a comparison of how the Lingo and STC clustering algorithms in the open-source carrot2 perform.
4) Search engine personalization, mainly implementing voice input, synonym conversion of queries, and the handling of heterogeneous documents.
Vertical search is search over a focused set of resources related to a particular topic. Building on this study of the key technologies of vertical search, the thesis designs a Nutch-based campus-life vertical search engine system covering universities nationwide. Tests of the system show that it achieves good precision.
[Degree-granting institution]: Hebei University of Science and Technology
[Degree level]: Master's
[Year degree granted]: 2012
[Classification number]: TP391.3
Article ID: 1442724
Article link: http://sikaile.net/kejilunwen/sousuoyinqinglunwen/1442724.html