

Research on a Lucene Search Engine Based on a PSO-BP Neural Network

Published: 2019-02-23 09:37
[Abstract]: Lucene is a full-text search framework. Its strengths include an excellent index structure, a sound system architecture, and a high-performance, scalable information retrieval library, but its support for Chinese word segmentation and for multiple document formats is weak. Many Chinese analyzers are currently used with Lucene, including StandardAnalyzer and CJKAnalyzer, which ship with Lucene, and third-party analyzers such as ChineseAnalyzer and IK_CAnalyzer. StandardAnalyzer performs single-character segmentation: it splits Chinese text one character at a time, and its drawbacks are that it requires a complex single-character matching algorithm and a large amount of CPU computation. CJKAnalyzer and ChineseAnalyzer both use bigram segmentation, in which every two characters are treated as one word. IK_CAnalyzer is dictionary-based and uses its own forward-iteration, finest-granularity segmentation algorithm together with a multi-sub-processor analysis mode. Lucene does not yet implement understanding-based Chinese word segmentation: because a computer cannot recognize the meaning of each word in different contexts, understanding-based methods have not yet shown practical results.

To address these shortcomings, in particular the lack of understanding-based segmentation, this thesis studies the application of the BP (Back Propagation) neural network algorithm to Chinese word segmentation. Because a BP network applied to segmentation converges slowly, easily gets stuck in local minima, and has low speed and efficiency, the thesis proposes optimizing the BP network with an improved particle swarm optimization (PSO, Particle Swarm Optimization) algorithm, yielding the PSO-BP neural network, and applies it to Chinese word segmentation. Compared with the traditional BP network, the PSO-BP network not only overcomes the slow convergence but also improves segmentation accuracy.

Next, the thesis systematically studies and analyzes the APIs of the third-party Chinese word segmentation components available for Lucene, integrates the PSO-BP-optimized segmentation technique into Lucene, and compares it with Lucene's built-in Chinese word segmentation. The comparison shows that the new technique clearly outperforms the built-in segmentation.

Finally, the thesis designs and implements a search engine using Lucene with the PSO-BP neural network segmentation component, exploring intelligent Chinese word segmentation for search engines and providing a good platform for subsequent work and research.
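The abstract does not spell out the improved PSO variant or the network layout, so the following is only a rough sketch of the general idea: a standard global-best PSO searching the weight vector of a small sigmoid feedforward network, in place of gradient-based BP updates. Everything in it is an assumption made for illustration: the XOR toy data, the hidden-layer size `HIDDEN`, the PSO parameters (`inertia`, `c1`, `c2`), and the function names. It is not the thesis's algorithm, and it omits any BP fine-tuning step that a full PSO-BP scheme might run after the swarm search.

```python
import math
import random

# Toy training set (XOR); purely illustrative, not data from the thesis.
DATA = [((0.0, 0.0), 0.0), ((0.0, 1.0), 1.0), ((1.0, 0.0), 1.0), ((1.0, 1.0), 0.0)]
HIDDEN = 3                      # hypothetical hidden-layer size
DIM = HIDDEN * 3 + HIDDEN + 1   # (2 weights + bias) per hidden unit, then output weights + bias


def sigmoid(x):
    # Clamp the input so math.exp cannot overflow for extreme weights.
    x = max(-60.0, min(60.0, x))
    return 1.0 / (1.0 + math.exp(-x))


def forward(weights, inputs):
    """Evaluate a 2-HIDDEN-1 sigmoid network stored as the flat list `weights`."""
    hidden = []
    for h in range(HIDDEN):
        w1, w2, b = weights[3 * h: 3 * h + 3]
        hidden.append(sigmoid(w1 * inputs[0] + w2 * inputs[1] + b))
    out_w = weights[3 * HIDDEN: 3 * HIDDEN + HIDDEN]
    out_b = weights[-1]
    return sigmoid(sum(w * h for w, h in zip(out_w, hidden)) + out_b)


def mse(weights):
    """Mean squared error over DATA; this is the fitness the swarm minimizes."""
    return sum((forward(weights, x) - y) ** 2 for x, y in DATA) / len(DATA)


def pso_train(particles=20, iterations=200, inertia=0.7, c1=1.5, c2=1.5, seed=0):
    """Global-best PSO over the weight vector; returns (weights, initial error, final error)."""
    rng = random.Random(seed)
    pos = [[rng.uniform(-1, 1) for _ in range(DIM)] for _ in range(particles)]
    vel = [[0.0] * DIM for _ in range(particles)]
    best_pos = [p[:] for p in pos]          # each particle's personal best position
    best_err = [mse(p) for p in pos]
    g = min(range(particles), key=lambda i: best_err[i])
    gbest, gbest_err = best_pos[g][:], best_err[g]
    initial_err = gbest_err
    for _ in range(iterations):
        for i in range(particles):
            for d in range(DIM):
                r1, r2 = rng.random(), rng.random()
                # Velocity = inertia term + pull toward personal best + pull toward global best.
                vel[i][d] = (inertia * vel[i][d]
                             + c1 * r1 * (best_pos[i][d] - pos[i][d])
                             + c2 * r2 * (gbest[d] - pos[i][d]))
                pos[i][d] += vel[i][d]
            err = mse(pos[i])
            if err < best_err[i]:
                best_err[i], best_pos[i] = err, pos[i][:]
                if err < gbest_err:
                    gbest_err, gbest = err, pos[i][:]
    return gbest, initial_err, gbest_err
```

Calling `pso_train()` returns the best weight vector found together with the best error before and after the search. Because the global best is only replaced by a strictly better candidate, the final error can never exceed the initial one. How far it drops depends on the seed and parameters, so no particular accuracy should be read into this sketch.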
[Degree-granting institution]: China University of Petroleum (East China)
[Degree level]: Master's
[Year degree awarded]: 2013
[Classification number]: TP391.3; TP183

Article ID: 2428689


Article link: http://sikaile.net/kejilunwen/sousuoyinqinglunwen/2428689.html

