A Fast Summary Generation Method Based on GPU
Published: 2018-11-11 11:07
[Abstract]: Query-based summaries are an important part of how a search engine presents its final results, and they are the approach most commonly used by modern search engines. A query-based summary shows the user the fragments of a result document that are most relevant to the query terms, which makes the results more intuitive and better targeted. Computing the summary of a single document for a given query is a lightweight task. However, today's search engines face a massive volume of query requests, and every result document on every result page needs its own summary generated from the query terms. Query-based summary computation is therefore one of the larger consumers of computing resources in a modern search engine system. To improve the performance and cost-effectiveness of summary generation under heavy load, this thesis proposes a high-performance parallel processing method based on a hybrid CPU-GPU (Graphics Processing Unit) system.
A summary generation algorithm suited to GPU processing is proposed. The algorithm segments documents with a sliding window, to avoid the problem in traditional truncation-based segmentation where a highly relevant fragment is cut apart at a segment boundary. The algorithm also uses a new quantitative formula to evaluate the relevance of a fragment to the query terms.
Based on an analysis of the runtime characteristics of the hybrid CPU-GPU system, the summary generation algorithm is then improved. In addition to parallelizing the work inside a single summary generation task, parallelism across tasks is implemented, and a three-stage pipeline system is designed to support this parallel processing. To implement the three-stage pipeline, an asynchronous execution framework called JobFlow is designed. The framework adopts a service-based programming model and supports highly modular and parallel program design.
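The difference between truncation-based and sliding-window segmentation described in the abstract can be illustrated with a minimal Python sketch. The abstract does not give the thesis's relevance formula, so the `score` function below is a placeholder assumption (it counts distinct query terms in a fragment), and the function names, window size, and sample text are all hypothetical.

```python
def fixed_chunks(tokens, size):
    """Truncation-style segmentation: non-overlapping chunks of `size` tokens."""
    return [tokens[i:i + size] for i in range(0, len(tokens), size)]


def sliding_windows(tokens, size, step=1):
    """Overlapping windows of `size` tokens, advanced `step` tokens at a time."""
    last_start = max(len(tokens) - size, 0)
    return [tokens[i:i + size] for i in range(0, last_start + 1, step)]


def score(fragment, query_terms):
    """Placeholder relevance (an assumption, not the thesis's formula):
    the number of distinct query terms that occur in the fragment."""
    return len(set(fragment) & set(query_terms))


def best_fragment(fragments, query_terms):
    """Return the first fragment with the highest placeholder score."""
    return max(fragments, key=lambda f: score(f, query_terms))


doc = "the quick brown fox jumps over the lazy dog near the river".split()
query = ["fox", "jumps"]

# With chunks of 4 tokens, "fox" and "jumps" fall into different chunks,
# so no chunk contains both query terms.
chunk_best = best_fragment(fixed_chunks(doc, 4), query)

# A sliding window of the same size finds a fragment containing both.
window_best = best_fragment(sliding_windows(doc, 4), query)
```

Each window is scored independently of the others, which is the property that makes this step a natural candidate for data-parallel execution.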
Several experiments were carried out to tune the system's performance metrics and to evaluate its performance and cost-effectiveness. The results show that, compared with the baseline summary generation algorithm (the Highlighter component of Lucene), the GPU pipeline system achieves a relatively high speedup while also reducing system cost.
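The idea of overlapping several summary generation tasks in a three-stage pipeline can be sketched with Python threads and queues. This is only an illustration under stated assumptions: the abstract does not say what the three stages are, so the split into `prepare`, `score_windows`, and `select` is hypothetical, the scoring stage is a CPU stand-in for work that would run on the GPU, and none of the names reflect the actual JobFlow API.

```python
import queue
import threading

_DONE = object()  # sentinel marking the end of the job stream


def stage(work, inbox, outbox):
    """Apply `work` to every item from `inbox` and push results to `outbox`."""
    while True:
        item = inbox.get()
        if item is _DONE:
            outbox.put(_DONE)  # propagate shutdown to the next stage
            return
        outbox.put(work(item))


def run_pipeline(jobs, stages):
    """Run one thread per stage, connected by FIFO queues.
    Results come back in the same order as the submitted jobs."""
    queues = [queue.Queue() for _ in range(len(stages) + 1)]
    threads = [
        threading.Thread(target=stage, args=(fn, queues[i], queues[i + 1]))
        for i, fn in enumerate(stages)
    ]
    for t in threads:
        t.start()
    for job in jobs:
        queues[0].put(job)
    queues[0].put(_DONE)
    results = []
    while True:
        item = queues[-1].get()
        if item is _DONE:
            break
        results.append(item)
    for t in threads:
        t.join()
    return results


def prepare(job):
    """Hypothetical stage 1: tokenize the document and the query."""
    doc, query = job
    return doc.split(), set(query.split())


def score_windows(prepared, size=3):
    """Hypothetical stage 2 (stand-in for the GPU step): score every window
    by the number of distinct query terms it contains."""
    tokens, terms = prepared
    last_start = max(len(tokens) - size, 0)
    windows = [tokens[i:i + size] for i in range(last_start + 1)]
    return [(len(terms & set(w)), w) for w in windows]


def select(scored):
    """Hypothetical stage 3: keep the first best-scoring window as the summary."""
    _, best_window = max(scored, key=lambda pair: pair[0])
    return " ".join(best_window)


jobs = [
    ("gpu pipelines overlap copy and compute", "copy compute"),
    ("search engines show query based snippets", "query snippets"),
]
summaries = run_pipeline(jobs, [prepare, score_windows, select])
```

Because each stage runs in its own thread, the first stage can already be preparing the next job while a later stage is still working on the previous one, which is the kind of inter-task overlap a pipeline is meant to provide.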
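[Degree-granting institution]: Huazhong University of Science and Technology
[Degree level]: Master's
[Year degree granted]: 2012
[Classification number]: TP391.3
Article ID: 2324651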
Article link: http://sikaile.net/kejilunwen/sousuoyinqinglunwen/2324651.html