Design and Implementation of a Heterogeneous Data Joint Retrieval System
Published: 2018-10-26 13:06
[Abstract]: As computers and networks have become widespread, more and more enterprises, government agencies, and schools use computers to process documents, and managing these organizations inevitably produces large numbers of electronic documents. Retrieving the information a user needs from such a collection quickly and accurately has become a major challenge. The enterprise studied here faces this problem: it currently manages its documents in a directory structure and has no retrieval system covering all of them, so employees spend a great deal of time looking for a given piece of information, and what they find is incomplete. The enterprise therefore urgently needs a search engine over all of its documents that meets the needs of different users. Based on this enterprise's requirements, this project studies index construction and search mechanisms in a heterogeneous data joint retrieval system. The system offers several retrieval modes, including retrieval by document type, by publisher, and by publication date. Because the enterprise's data volume is large and retrieval results must be accurate, substantial optimizations were made to index construction, to the retrieval process, and to the Paoding (庖丁解牛, Pao Ding Jie Niu) Chinese word segmenter. The system is developed in Java and is built mainly on Lucene, a Java-based full-text indexing toolkit. In view of the large data volume and future system upgrades, the database is GreenPlum, which is designed for large-scale data processing. The project uses the SSH (Struts, Spring, Hibernate) framework, parses documents with the POI and PDFBox toolkits, and segments Chinese text with the Paoding segmenter. MyEclipse 10 was the development tool. The system runs well and, in terms of retrieval efficiency and effectiveness, basically meets the original design requirements.
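The retrieval modes named in the abstract (by document type, by publisher, by publication date) can be pictured as metadata filters applied to the hits from a term index. The following is only a minimal sketch under stated assumptions: it uses a toy in-memory inverted index in plain Java rather than Lucene, it is not the thesis's implementation, and every name in it (`FilteredIndexSketch`, `Doc`, `search`, the sample documents) is hypothetical.

```java
import java.time.LocalDate;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

/** Toy in-memory index; all names are hypothetical and not taken from the thesis. */
public class FilteredIndexSketch {

    /** Metadata kept for each indexed document. */
    static final class Doc {
        final int id;
        final String type;        // e.g. "pdf" or "doc"
        final String publisher;
        final LocalDate published;

        Doc(int id, String type, String publisher, LocalDate published) {
            this.id = id;
            this.type = type;
            this.publisher = publisher;
            this.published = published;
        }
    }

    private final Map<Integer, Doc> docs = new HashMap<>();
    // term -> ids of the documents that contain it (the inverted index)
    private final Map<String, Set<Integer>> postings = new HashMap<>();

    /** Splits on non-letter, non-digit characters; a real system would call a Chinese segmenter here. */
    static List<String> tokenize(String text) {
        List<String> terms = new ArrayList<>();
        for (String t : text.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}]+")) {
            if (!t.isEmpty()) {
                terms.add(t);
            }
        }
        return terms;
    }

    /** Stores the metadata and records every term of the body in the inverted index. */
    void add(Doc doc, String body) {
        docs.put(doc.id, doc);
        for (String term : tokenize(body)) {
            postings.computeIfAbsent(term, k -> new TreeSet<>()).add(doc.id);
        }
    }

    /** Returns ascending ids of documents that contain the term and pass every non-null filter. */
    List<Integer> search(String term, String type, String publisher, LocalDate from, LocalDate to) {
        List<Integer> hits = new ArrayList<>();
        Set<Integer> ids = postings.getOrDefault(term.toLowerCase(Locale.ROOT), new TreeSet<>());
        for (int id : ids) {
            Doc d = docs.get(id);
            if (type != null && !type.equals(d.type)) continue;
            if (publisher != null && !publisher.equals(d.publisher)) continue;
            if (from != null && d.published.isBefore(from)) continue;
            if (to != null && d.published.isAfter(to)) continue;
            hits.add(id);
        }
        return hits;
    }

    public static void main(String[] args) {
        FilteredIndexSketch index = new FilteredIndexSketch();
        index.add(new Doc(1, "pdf", "hr", LocalDate.of(2012, 3, 1)), "Annual leave policy");
        index.add(new Doc(2, "doc", "hr", LocalDate.of(2012, 9, 15)), "Travel policy update");
        index.add(new Doc(3, "pdf", "finance", LocalDate.of(2013, 1, 10)), "Expense policy");

        System.out.println(index.search("policy", null, null, null, null));   // no filters
        System.out.println(index.search("policy", "pdf", null, null, null));  // by document type
        System.out.println(index.search("policy", null, "hr", LocalDate.of(2012, 6, 1), null)); // by publisher and date
    }
}
```

In a Lucene-based system such as the one described, the same idea would presumably be expressed by indexing type, publisher, and date as separate fields and combining a text query with field restrictions; the abstract does not give the actual field layout, so that mapping is an assumption.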
[Degree-granting institution]: Northeastern University (東北大學)
[Degree level]: Master's
[Year degree conferred]: 2013
[Classification number]: TP391.3
Article ID: 2295815
Article link: http://sikaile.net/kejilunwen/sousuoyinqinglunwen/2295815.html