
Research on a Distributed Search Engine Based on MapReduce

Published: 2018-06-29 00:05

  Topics: search engine + MapReduce; Source: Lanzhou University of Technology, master's thesis, 2013


[Abstract]: With the explosive growth of online resources, search engines have become an important tool for Internet users to obtain information. Traditional search engines mostly adopt a centralized architecture in which the search system is deployed on a single server. This places high demands on server performance and limits system stability and scalability. They also rely on keyword matching, so users cannot quickly and accurately retrieve information from massive data sets, and the engines fall short of users' rising expectations for information coverage, result relevance, and accuracy. In recent years distributed computing theory has been widely studied, and search engines based on distributed computing have emerged. They overcome the shortcomings of centralized search engines by adding servers to handle large data volumes, and they incorporate personalized user search models and techniques such as semantic analysis, which has made them a research hotspot in data mining and intelligent information processing.

Building on a study of the working principles and structure of search engines and of related technologies such as distributed computing, this thesis examines the model framework, data processing flow, ranking algorithm optimization, and topical crawler of a MapReduce-based distributed search engine. The main work covers the following aspects:

(1) By studying the Hadoop Distributed File System (HDFS), the thesis analyzes the working principle of the MapReduce programming model. To address the load imbalance and performance bottleneck of the single-NameNode control structure in the original architecture, it proposes a structure controlled by multiple NameNodes. During MapReduce data processing, intermediate keys that are too scattered or too concentrated cause data imbalance, which makes Reduce-side jobs run too long or fail. The thesis introduces a data balancing mechanism after the Map stage, improving system performance and reducing the failure rate.

(2) The PageRank algorithm distributes a page's weight equally among its links and does not consider topical relevance between pages. The thesis introduces topic relevance and timeliness mechanisms so that the algorithm accounts for both the topical relevance between linked pages and the timeliness of each page. PageRank also generates a large amount of intermediate iteration data when computing page weights, which degrades its performance. The thesis partitions the network with a block-structure-based method, which effectively reduces the data produced by the intermediate iterations and improves the algorithm's performance.

(3) By adopting a feature selection method based on term-frequency differences and an improved TF-IDF formula, the thesis improves the Context Graph crawler search strategy. The approach takes into account how text in different parts of a web page affects feature selection, as well as the between-class and within-class weights of each feature term, and it improves the crawling efficiency of the topical crawler.
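The contrast drawn in item (2), between PageRank's equal split of a page's weight and a split that favors topically relevant links, can be illustrated with a small sketch. This is not the thesis's algorithm: the abstract gives no formula, so the per-link `weights` dictionary below is a hypothetical stand-in for a topic-relevance score, and timeliness and the block-structure partitioning are omitted. The damping factor of 0.85 and the fixed iteration count are also assumptions.

```python
from typing import Dict, List, Optional, Tuple

Graph = Dict[str, List[str]]            # page -> pages it links to
Weights = Dict[Tuple[str, str], float]  # (source, target) -> hypothetical relevance score


def pagerank(graph: Graph, weights: Optional[Weights] = None,
             damping: float = 0.85, iterations: int = 50) -> Dict[str, float]:
    """Power-iteration PageRank.

    With weights=None each page splits its rank equally among its outlinks
    (the baseline behaviour). With weights, the split is proportional to an
    assumed per-link relevance score; links without a score default to 1.0.
    """
    nodes = sorted(set(graph) | {t for targets in graph.values() for t in targets})
    n = len(nodes)
    rank = {node: 1.0 / n for node in nodes}
    for _ in range(iterations):
        new_rank = {node: (1.0 - damping) / n for node in nodes}
        for src in nodes:
            targets = graph.get(src, [])
            if not targets:
                # Dangling page: spread its rank uniformly over all pages.
                for node in nodes:
                    new_rank[node] += damping * rank[src] / n
                continue
            if weights is None:
                shares = [1.0] * len(targets)
            else:
                shares = [weights.get((src, t), 1.0) for t in targets]
            total = sum(shares)
            if total <= 0:
                # Degenerate scores: fall back to the equal split.
                shares = [1.0] * len(targets)
                total = float(len(targets))
            for target, share in zip(targets, shares):
                new_rank[target] += damping * rank[src] * share / total
        rank = new_rank
    return rank


if __name__ == "__main__":
    web = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}
    print(pagerank(web))
    # Assume the link a -> b is three times as relevant as a -> c.
    print(pagerank(web, weights={("a", "b"): 3.0, ("a", "c"): 1.0}))
```

In the weighted run, page `b` ends up with a larger share of the total rank than in the baseline run, because `a` passes three quarters of its weight to `b` instead of half. In a MapReduce setting each iteration would typically be one job, with the inner loop over `targets` as the map step and the summation into `new_rank` as the reduce step.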
[Degree-granting institution]: Lanzhou University of Technology
[Degree level]: Master's
[Year degree granted]: 2013
[Classification number]: TP391.3


Article ID: 2079945



Article link: http://sikaile.net/kejilunwen/sousuoyinqinglunwen/2079945.html

