Research on a Distributed Search Engine Based on MapReduce
Published: 2018-06-29 00:05
Topic: Search Engine + MapReduce; Source: Master's thesis, Lanzhou University of Technology, 2013
【Abstract】: With the explosive growth of online resources, search engines have become an essential tool for Internet users to obtain information. Traditional search engines mostly adopt a centralized architecture in which the whole search system is deployed on a single server; this places heavy demands on server performance and leads to poor stability and scalability. In addition, their keyword-matching model keeps users from retrieving information quickly and accurately from massive data, and falls short of users' growing expectations for coverage, result relevance, and accuracy. In recent years distributed computing has been studied extensively, and search engines built on it have emerged. They overcome the shortcomings of centralized engines by scaling out across more servers to handle large data volumes, and, combined with personalized search models and semantic analysis, they have become a research focus in data mining and intelligent information processing.

Building on a study of search engine principles, architecture, and distributed computing techniques, this thesis investigates the model framework, data processing flow, ranking algorithm optimization, and topic crawler of a MapReduce-based distributed search engine. The main research work covers the following aspects:

(1) Through a study of the Hadoop Distributed File System (HDFS), the working principle of the MapReduce programming model is analyzed. To address the load imbalance and performance bottleneck of the single-NameNode control structure in the original architecture, a control structure based on multiple NameNodes is proposed. During MapReduce processing, intermediate keys that are too scattered or too concentrated cause data skew, which makes Reduce-side jobs run too long or fail; this thesis introduces a data balancing mechanism after the Map stage, which improves system performance and lowers the failure rate (a partitioner sketch illustrating the idea follows this list).

(2) The classic PageRank algorithm splits a page's weight evenly among its out-links and ignores topical relevance between pages. By introducing topic relevance and timeliness mechanisms, the improved algorithm accounts for both the topical relatedness of links and the freshness of pages (a simplified sketch also follows the list). Because PageRank produces a large amount of intermediate data during its iterative weight computation, which degrades performance, a block-structure partitioning of the web graph is adopted to cut the intermediate data volume and improve the algorithm's efficiency.

(3) The Context Graph crawler search strategy is improved by using a feature selection method based on word frequency differences together with a modified TF-IDF formula, which jointly considers the text in different parts of a web page for feature selection as well as each feature term's inter-class and intra-class weights, raising the crawling efficiency of the topic crawler.
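The thesis does not include its balancing code here; as a minimal sketch of the idea in point (1), the following Hadoop partitioner (assumptions: a hot-key set obtained beforehand, e.g. from a sampling pass, and the standard `org.apache.hadoop.mapreduce.Partitioner` API) scatters known hot keys across all reducers so that no single Reduce task is overloaded. It illustrates the data-balancing idea, not the thesis's exact mechanism.

```java
import java.util.HashSet;
import java.util.Random;
import java.util.Set;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

/**
 * Illustrative skew-aware partitioner: "hot" keys (identified beforehand,
 * e.g. by sampling the Map output) are spread over all reducers, while
 * ordinary keys keep the usual hash partitioning.
 */
public class SkewAwarePartitioner extends Partitioner<Text, IntWritable> {

    // Hypothetical hot-key set; in practice it would be loaded from a
    // sampling job or the distributed cache.
    private static final Set<String> HOT_KEYS = new HashSet<>();
    static {
        HOT_KEYS.add("the");     // placeholder: extremely frequent terms
        HOT_KEYS.add("search");
    }

    private final Random random = new Random();

    @Override
    public int getPartition(Text key, IntWritable value, int numPartitions) {
        if (HOT_KEYS.contains(key.toString())) {
            // Scatter a hot key's records across all reducers; a later
            // aggregation step must then merge its partial results.
            return random.nextInt(numPartitions);
        }
        // Default behaviour: stable hash partitioning for ordinary keys.
        return (key.hashCode() & Integer.MAX_VALUE) % numPartitions;
    }
}
```

A job would register it with `job.setPartitionerClass(SkewAwarePartitioner.class)`. The trade-off is that scattering a key gives up the guarantee that all of its values meet in one reducer, so a combiner or a second pass has to merge the partial results.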
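The thesis does not reproduce its topic-relevance and timeliness formulas here; the sketch below shows, under assumed inputs, one common way to bias PageRank as described in point (2): each out-link's share of a page's rank is made proportional to a topic-similarity score rather than split evenly, and the teleport term is weighted toward fresher pages. The `topicSim` and `freshness` maps are hypothetical inputs, not the thesis's definitions.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Illustrative topic- and time-biased PageRank on a small in-memory graph. */
public class BiasedPageRank {

    /**
     * @param outLinks  page -> list of pages it links to (every page appears as a key)
     * @param topicSim  "from->to" edge key -> topic similarity score in (0, 1]
     * @param freshness page -> freshness weight, normalized to sum to 1
     */
    public static Map<String, Double> compute(Map<String, List<String>> outLinks,
                                              Map<String, Double> topicSim,
                                              Map<String, Double> freshness,
                                              double damping, int iterations) {
        int n = outLinks.size();
        Map<String, Double> rank = new HashMap<>();
        for (String page : outLinks.keySet()) {
            rank.put(page, 1.0 / n);                     // uniform starting rank
        }
        for (int it = 0; it < iterations; it++) {
            Map<String, Double> next = new HashMap<>();
            // Teleport term biased toward fresh pages instead of uniform jumps.
            for (String page : outLinks.keySet()) {
                next.put(page, (1.0 - damping) * freshness.getOrDefault(page, 1.0 / n));
            }
            for (Map.Entry<String, List<String>> e : outLinks.entrySet()) {
                String from = e.getKey();
                List<String> targets = e.getValue();
                double total = 0.0;                      // normalizer over this page's edges
                for (String to : targets) {
                    total += topicSim.getOrDefault(from + "->" + to, 1.0);
                }
                if (total == 0.0) {
                    continue;                            // page with no out-links; kept simple here
                }
                for (String to : targets) {
                    double share = topicSim.getOrDefault(from + "->" + to, 1.0) / total;
                    next.merge(to, damping * rank.get(from) * share, Double::sum);
                }
            }
            rank = next;
        }
        return rank;
    }
}
```

In the thesis's setting each iteration would run as a MapReduce job, and the block-structure partitioning mentioned in point (2) is what keeps the intermediate data shuffled between iterations manageable.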
【Degree-granting Institution】: Lanzhou University of Technology
【Degree Level】: Master's
【Year of Degree Conferral】: 2013
【CLC Number】: TP391.3
Article No.: 2079945
Link: http://sikaile.net/kejilunwen/sousuoyinqinglunwen/2079945.html