Design and Implementation of a Joint Retrieval System for Heterogeneous Data

Published: 2018-10-26 13:06

[Abstract]: As computers and networks have become widespread, more and more enterprises, government agencies, and schools use computers to process documents, and managing these organizations inevitably produces large numbers of electronic documents. Retrieving the information a user needs from so many documents quickly and accurately has become a major challenge. The enterprise studied here faces this problem: it currently manages documents in a directory structure and has no retrieval system covering all of them, so employees spend a great deal of time looking for a piece of information, and what they find is often incomplete. The enterprise therefore urgently needs a search engine over all of its documents that meets the needs of different users. Based on these requirements, this project studies the index-building and search mechanisms of a heterogeneous data joint retrieval system. For ease of use, the system offers several retrieval modes, including retrieval by document type, by publisher, and by publication date. Because the enterprise's data volume is large and retrieval results must be accurate, the index-building and retrieval processes and the Pao Ding Jie Niu Chinese word segmenter were all extensively optimized. The system is developed in Java and is implemented mainly with Lucene, a Java-based full-text indexing toolkit. Considering the large data volume and future system upgrades, the database is GreenPlum, which is designed for large-scale data processing. The project uses the SSH framework (Struts, Spring, Hibernate), parses documents with the POI and PDFBox toolkits, and segments Chinese text with the Pao Ding Jie Niu segmenter. The development tool is MyEclipse 10. The system runs well, and in terms of retrieval efficiency and quality it basically meets the initial design requirements.
[Degree-granting institution]: Northeastern University
[Degree level]: Master's
[Year degree conferred]: 2013
[Classification number]: TP391.3

Article ID: 2295815

Article link: http://sikaile.net/kejilunwen/sousuoyinqinglunwen/2295815.html
