Research and Implementation of the Parallelization of Information Retrieval Algorithms Based on MapReduce
Published: 2019-01-05 07:54
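[Abstract]: With the growing popularity and rapid development of the Internet, the amount of information online is growing geometrically, and the information explosion has become one of the defining features of today's network era. As an important gateway to the Internet, search engines play an increasingly important role in helping users quickly and accurately find the information they need in the vast expanse of the Internet, and people's work and daily lives depend on them more and more. A search engine retrieves over all the data on the entire Internet, including web pages, images, music, videos, and FTP resources. This massive volume of data poses new challenges to the efficient operation of information retrieval systems. On the one hand, the processing power of a single computer is limited by factors such as CPU clock frequency, memory capacity, disk read/write speed, and network bandwidth, so it cannot process all of the data on its own within an acceptable time. On the other hand, the data is not stored on a single computer or in a single database but is distributed across the whole Internet, which requires thousands of computers to process it in "mutual cooperation". Designing parallel algorithms that let search engines process massive Internet data efficiently has therefore become a shared research direction and goal for both academia and industry. Over the past several decades, research in parallel computing has made considerable progress, and a number of classic parallel computing platforms have appeared in succession, such as MPI, OpenMP, OpenCL, and CUDA. In particular, the MapReduce parallel computing model proposed by Google in 2004, with its good scalability, reliability, and ease of use, provides a simple and efficient computing model and runtime environment for parallel computing, lowers the difficulty of moving parallel computing from theory to application, and offers an easy-to-use platform for its practical use.
The traditional algorithms of information retrieval have matured over time. However, some of them were not designed for parallel environments: they either cannot directly handle large-scale data or cannot finish computing over it within a useful time. If these algorithms can be adapted to run in parallel across multiple computers, the efficiency of processing massive data can be greatly improved, users' search requests can be answered more quickly, and the search experience can be improved. Within information retrieval, query suggestion (Query Suggestion) and web page ranking (Page Rank) are two important research topics: query suggestion helps users query more precisely and effectively and saves search time, while web page ranking improves search quality and helps users find the pages they need more easily. If some of the serial algorithms in these two areas can be parallelized so that they run in parallel on a computer cluster, the search engine's capacity to process large-scale data can be effectively increased, its updates to query suggestions and page rankings can be accelerated, and users' satisfaction with retrieval can be improved.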
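This thesis studies the QUBIC algorithm from the query suggestion field and a web page ranking algorithm based on frequent itemset mining. With parallel processing of massive Internet data as the research background, it parallelizes both algorithms on the MapReduce parallel computing model so that they can run within the MapReduce framework, and it implements a prototype system using the Hadoop parallel computing software framework. Specifically, the main research work covers the following aspects:
(1) The QUBIC algorithm is parallelized on the MapReduce model, and concrete methods for data distribution and parallel computation are proposed, including: distributed storage of search engine log files, construction of the Query-URL bipartite graph, computation of the Jaccard similarity coefficient, generation of the QAG, computation of the connected components of the QAG, and ranking of queries.
(2) The traditional SON frequent itemset mining algorithm is parallelized on the MapReduce model, yielding PSON, a parallel frequent itemset mining algorithm, which is applied to mining frequent URLs. After a set of strongly associated URLs has been computed from the results returned by the search engine, the URLs are presented to the user in descending order of importance.
The parallelization approach is implemented on the Hadoop parallel computing platform and evaluated experimentally. The experiments show that the proposed methods for parallelizing these algorithms are effective and achieve good scalability and speedup. Finally, a prototype system is implemented that demonstrates the overall behavior of the parallel QUBIC algorithm and the parallel frequent-URL mining algorithm and verifies the correctness and effectiveness of both.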
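The QUBIC steps listed in item (1) can be illustrated with a minimal single-machine sketch in Python. It is not the thesis's Hadoop implementation: it only simulates the map, shuffle, and reduce stages in memory. The function names, the sample click log, and the similarity threshold are all assumptions made for illustration. The abstract does not expand the acronym QAG, so the sketch treats it simply as a query-to-query graph whose edges join queries with a high enough Jaccard coefficient. The final query-ranking step is omitted because the abstract gives no details about it.

```python
from collections import defaultdict
from itertools import combinations


def group_by_key(pairs):
    """Simulate shuffle + reduce: collect the set of values seen for each key."""
    grouped = defaultdict(set)
    for key, value in pairs:
        grouped[key].add(value)
    return dict(grouped)


def jaccard(a, b):
    """Jaccard similarity coefficient: intersection size over union size."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def build_query_graph(log, threshold):
    """Return (queries, edges); edges maps a query pair to its Jaccard score."""
    # Job 1: the Query-URL bipartite graph, stored as adjacency sets per side.
    query_urls = group_by_key(log)
    url_queries = group_by_key((url, query) for query, url in log)
    # Job 2: only queries that share a clicked URL can have a nonzero score,
    # so candidate pairs are generated per URL instead of over all pairs.
    candidates = set()
    for queries in url_queries.values():
        candidates.update(combinations(sorted(queries), 2))
    edges = {}
    for q1, q2 in candidates:
        score = jaccard(query_urls[q1], query_urls[q2])
        if score >= threshold:  # threshold is an assumed tuning parameter
            edges[(q1, q2)] = score
    return sorted(query_urls), edges


def connected_components(nodes, edges):
    """Union-find over the graph; returns components as sorted lists."""
    parent = {node: node for node in nodes}

    def find(node):
        while parent[node] != node:
            parent[node] = parent[parent[node]]  # path halving
            node = parent[node]
        return node

    for a, b in edges:
        parent[find(a)] = find(b)
    groups = defaultdict(list)
    for node in nodes:
        groups[find(node)].append(node)
    return sorted(sorted(group) for group in groups.values())


# Made-up click log of (query, clicked URL) records.
log = [
    ("jaguar car", "a.example"), ("jaguar car", "b.example"),
    ("jaguar price", "a.example"), ("jaguar price", "b.example"),
    ("jaguar price", "c.example"),
    ("big cats", "d.example"),
    ("jaguar animal", "d.example"), ("jaguar animal", "e.example"),
]
queries, edges = build_query_graph(log, threshold=0.5)
components = connected_components(queries, edges)
```

With this sample log, the two car-related queries share two of three URLs and the two animal-related queries share one of two, so each pair ends up in its own connected component. Generating candidate pairs per URL mirrors how a reducer keyed by URL could emit query pairs, which avoids comparing every query with every other query.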
[Abstract]: With the increasing popularity and rapid development of the Internet, the amount of information on the Internet keeps growing, and the information explosion has become one of the features of the current network era. As an important gateway to the Internet, the search engine plays an increasingly important role in helping users quickly and accurately obtain the information they need from the vast Internet, and people's work and daily lives have become more and more dependent on it. The object retrieved by the search engine is all the data on the whole Internet, including web pages, pictures, music, videos, FTP resources, and so on. This massive data poses a new challenge to the efficient operation of information retrieval systems. On the one hand, the processing power of a single computer is restricted by factors such as CPU clock frequency, memory capacity, disk read/write speed, and network bandwidth, so it cannot process all of the data on its own within an ideal time; on the other hand, the data is not stored on a single computer or in a single database but distributed across the Internet. This requires thousands of computers to process the data in "mutual cooperation".
Article ID: 2401495
Article link: http://sikaile.net/kejilunwen/sousuoyinqinglunwen/2401495.html