Research on a DHT-Based Distributed Price Search Engine
Published: 2019-01-26 21:01
[Abstract]: In recent years, as network resources have diversified and demand for domain-specific information has grown, vertical search engines have attracted increasing research attention. Price-oriented search is one kind of vertical search. However, almost all existing price search engines are centralized: when a large number of users issue requests at the same time, the central server becomes a bottleneck and is prone to a single point of failure. As networks continue to grow, research on distributed vertical search becomes increasingly important. This thesis combines P2P technology with vertical search and designs a distributed price search engine based on a distributed hash table (DHT). It discusses the crawling strategy of the topic crawler, the use of URL rules to judge the topical relevance of web pages, and the use of XPath to extract information from web pages. It then discusses how to apply the DHT idea to build the index and store it in a distributed fashion, which effectively avoids the problems a centralized index may encounter. Finally, to address the unclear and disorganized presentation of search results in existing price search engines, this thesis proposes clustering the search results. Based on a study and analysis of existing clustering algorithms, it improves the k-means algorithm and uses the improved algorithm to cluster the search results, so that document similarity is high within each cluster and low between clusters. Each cluster is then described by a class label, so users need only consult the labels they are interested in rather than browse every returned result one by one, which greatly reduces browsing and search time.
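As a rough illustration of the DHT-based index storage mentioned in the abstract, the sketch below places index terms on a hash ring and stores each term's price postings on the node responsible for it. It is not the thesis's implementation: the class name `ToyDHTIndex`, the node names, the 32-bit ring, the SHA-1 hash, and the Chord-style "first node at or after the key" rule are all assumptions made for illustration, and the "nodes" are plain in-memory dictionaries rather than networked peers.

```python
import bisect
import hashlib


def ring_id(key: str, bits: int = 32) -> int:
    """Hash a string onto a 2**bits identifier ring using SHA-1."""
    digest = hashlib.sha1(key.encode("utf-8")).digest()
    return int.from_bytes(digest, "big") % (1 << bits)


class ToyDHTIndex:
    """In-process stand-in for a DHT (hypothetical, for illustration only).

    Each node stores the postings of the terms whose ring ID it succeeds.
    """

    def __init__(self, node_names):
        # Sort nodes by their position on the ring.
        self._ring = sorted((ring_id(name), name) for name in node_names)
        self._ids = [node_id for node_id, _ in self._ring]
        self._store = {name: {} for name in node_names}

    def responsible_node(self, term: str) -> str:
        """Return the first node at or after the term's ID, wrapping around."""
        pos = bisect.bisect_left(self._ids, ring_id(term))
        return self._ring[pos % len(self._ring)][1]

    def publish(self, term: str, url: str, price: float) -> None:
        """Add a (url, price) posting for a term on its responsible node."""
        node = self.responsible_node(term)
        self._store[node].setdefault(term, []).append((url, price))

    def lookup(self, term: str):
        """Fetch a term's postings from its responsible node, cheapest first."""
        node = self.responsible_node(term)
        return sorted(self._store[node].get(term, []), key=lambda p: p[1])


if __name__ == "__main__":
    index = ToyDHTIndex(["node-a", "node-b", "node-c"])
    index.publish("laptop", "http://example.com/1", 799.0)
    index.publish("laptop", "http://example.com/2", 649.0)
    print(index.responsible_node("laptop"))
    print(index.lookup("laptop"))
```

Because every peer can compute `responsible_node` locally, no central server has to hold the whole index, which is the property the abstract contrasts with centralized price search engines.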
[Degree-granting institution]: Xihua University (西華大學)
[Degree level]: Master's
[Year degree conferred]: 2013
[Classification number]: TP391.3
Article ID: 2415907
Article link: http://sikaile.net/kejilunwen/sousuoyinqinglunwen/2415907.html