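Research on Distributed Indexing Technology Based on the Compressed Full-Text Self-Index
Published: 2018-05-02 10:29
Topics: distributed full-text index + compressed full-text self-index; Source: Hangzhou Dianzi University, master's thesis, 2015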
[Abstract]: Distributed full-text retrieval is one of the core technologies of information processing and is widely used in competitive intelligence, information retrieval, search engines, and information filtering. In-depth study of efficient distributed full-text indexing therefore has both theoretical value and considerable commercial value. As the Internet becomes more widespread, data of all kinds is produced at an ever faster rate and the total volume grows exponentially, so the index files built over that data keep growing as well. A traditional single-machine indexing system can no longer meet the indexing needs of such massive data, whereas a distributed indexing system can. The core techniques of a distributed indexing system cover distributed index creation, index querying, data allocation across the distributed index, and load balancing. This thesis applies the compressed full-text self-index, a text processing technique that has become popular in recent years, to distributed indexing and discusses query strategies under this index structure.

The research on distributed full-text indexing in this thesis covers the following:

(1) Mainstream distributed indexing systems mainly use an inverted index structure; running on a high-performance cluster, an inverted index can answer queries within milliseconds. However, beyond its own data an inverted index must store additional information so that the search engine can support snippet extraction, ranking and position information, query caching, and similar functions, which lowers the efficiency of storage use. This thesis applies the compressed full-text self-index, a current focus of text indexing research, to a distributed indexing system. It proposes a wavelet tree compression algorithm based on improved Huffman coding and combines it with a suffix array, yielding a compressed full-text self-index structure suited to distributed environments together with an efficient construction algorithm.

(2) The indexing system plays two main roles in a search engine: first, it builds an index of web documents according to certain rules to facilitate later queries; second, it searches the index files according to the user's query, sorts them according to certain rules, and returns the results to the client. Based on the improved compressed full-text self-index structure, the thesis proposes a query processing strategy for distributed environments.

(3) Building on the above and on related research results, the thesis proposes a distributed full-text indexing system architecture that supports distributed indexing of many kinds of unstructured data, and in turn the querying and indexing of massive unstructured data. It describes in detail the design of the index cluster, the query cluster, and the distributed file system, and finally tests the efficiency of query processing in the distributed indexing system.
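The abstract does not give the thesis's actual data structures or algorithms, so the following is only a minimal sketch of the general idea: a self-index derived from a suffix array, queried by backward search, with documents spread over several nodes and results merged. It uses a toy FM-index, a well-known kind of compressed full-text self-index, as a stand-in. The plain occurrence table takes the place of the Huffman-shaped wavelet tree mentioned above, and the names `build_fm_index`, `locate`, `Shard`, and `distributed_search`, as well as sharding by `doc_id` modulo the node count, are assumptions made for illustration.

```python
def build_fm_index(text):
    """Build a toy FM-index for `text` (illustrative, not space-efficient)."""
    text += "\0"  # sentinel that sorts before every other character
    # Naive suffix array construction; acceptable for a demo only.
    sa = sorted(range(len(text)), key=lambda i: text[i:])
    # Burrows-Wheeler transform: the character preceding each sorted suffix.
    bwt = "".join(text[i - 1] for i in sa)
    alphabet = sorted(set(text))
    # C[c] = number of characters in text that are strictly smaller than c.
    C, total = {}, 0
    for c in alphabet:
        C[c] = total
        total += text.count(c)
    # occ[c][i] = occurrences of c in bwt[:i].
    # A wavelet tree would answer these rank queries in compressed space.
    occ = {c: [0] * (len(bwt) + 1) for c in alphabet}
    for i, ch in enumerate(bwt):
        for c in alphabet:
            occ[c][i + 1] = occ[c][i] + (1 if ch == c else 0)
    return sa, C, occ, len(bwt)


def locate(index, pattern):
    """Return the sorted start positions of `pattern` via backward search."""
    sa, C, occ, n = index
    lo, hi = 0, n  # half-open range of suffix-array rows
    for c in reversed(pattern):
        if c not in C:
            return []
        lo = C[c] + occ[c][lo]
        hi = C[c] + occ[c][hi]
        if lo >= hi:
            return []
    return sorted(sa[lo:hi])


class Shard:
    """One hypothetical index node holding a subset of the documents."""

    def __init__(self):
        self.indexes = {}  # doc_id -> FM-index of that document

    def add(self, doc_id, text):
        self.indexes[doc_id] = build_fm_index(text)

    def search(self, pattern):
        hits = []
        for doc_id, index in self.indexes.items():
            hits.extend((doc_id, pos) for pos in locate(index, pattern))
        return hits


def distributed_search(shards, pattern):
    """Send the query to every shard and merge the partial results."""
    merged = []
    for shard in shards:
        merged.extend(shard.search(pattern))
    return sorted(merged)


docs = {0: "banana", 1: "bandana", 2: "cabana"}
shards = [Shard() for _ in range(2)]
for doc_id, text in docs.items():
    # Assumed allocation rule: document id modulo the number of shards.
    shards[doc_id % len(shards)].add(doc_id, text)

results = distributed_search(shards, "ana")
print(results)  # list of (doc_id, position) pairs
```

In a real deployment each shard would run on a separate machine and the merge step would also handle ranking and load balancing, which this sketch leaves out.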
[Degree-granting institution]: Hangzhou Dianzi University
[Degree level]: Master's
[Year degree awarded]: 2015
[Classification number]: TP391.3
Article ID: 1833511
Article link: http://sikaile.net/kejilunwen/sousuoyinqinglunwen/1833511.html