Design and Implementation of a Paper Feature Extraction System Based on Maximum Matching

Published: 2018-12-14 03:40

[Abstract]: Chinese word segmentation plays a central role in Chinese search engines, and its results directly affect search performance. Current segmentation techniques fall into three groups: string matching, understanding-based methods that use artificial intelligence to segment on the basis of semantics, and statistical methods. Chinese word segmentation is the task of dividing a modern Chinese sentence into its words. English separates words with spaces, so the problem does not arise there; written Chinese places no delimiters between the words of a sentence, so the words must be separated automatically. Automatic Chinese word segmentation has been an active research topic in computer science for decades, and because of the complexity of the language and the limits of computing technology it is still under development. This thesis first analyzes and summarizes existing segmentation algorithms and discusses the two problems that have remained hardest to solve: ambiguity resolution and unregistered (new) words. Future work on Chinese word segmentation must address these problems to reach higher segmentation accuracy, and must also develop domain-specific segmentation to broaden its range of application. By counting the occurrences of each term, a feature set for that term is obtained, and a feature extraction method based on the term-frequency space is designed. The maximum matching algorithm first segments the document into words; the words are then loaded into a term-frequency matrix, the frequency of each entry in the matrix is counted, and finally the text features are extracted. The thesis focuses on the design and development of a paper feature extraction system for libraries. It combines Chinese word segmentation with feature extraction to build such a system and describes the design process and experimental results in detail. With the system in use, the school library manages papers more efficiently and retrieves them faster.
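The abstract names the pipeline (maximum matching, then term-frequency counting, then feature selection) without giving implementation details. The following is only a minimal sketch of that idea, not the thesis's actual system: it assumes forward maximum matching, a toy dictionary, a maximum word length of 4, and "top-k most frequent terms" as the feature rule. The names `forward_max_match`, `top_terms`, and `DICTIONARY` are hypothetical.

```python
from collections import Counter


def forward_max_match(text, dictionary, max_len=4):
    """Segment text greedily, taking the longest dictionary word at each position."""
    words = []
    i = 0
    while i < len(text):
        # Try the longest candidate first and shrink until the dictionary matches.
        for size in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + size]
            # A single character is always accepted as a fallback,
            # so the loop is guaranteed to advance.
            if size == 1 or candidate in dictionary:
                words.append(candidate)
                i += size
                break
    return words


def top_terms(words, k=3, min_len=2):
    """Return the k most frequent terms that have at least min_len characters."""
    counts = Counter(w for w in words if len(w) >= min_len)
    return counts.most_common(k)


# Hypothetical toy dictionary; a real system would load a full lexicon.
DICTIONARY = {"中文", "分词", "搜索", "引擎", "搜索引擎", "影响", "性能"}

tokens = forward_max_match("中文分词影响搜索引擎性能", DICTIONARY)
print(tokens)
print(top_terms(tokens))
```

In this sketch the longer entry "搜索引擎" wins over "搜索" plus "引擎" because longer candidates are tried first. Characters that match no dictionary entry come out as single-character tokens, which is one way the unregistered-word problem mentioned in the abstract shows up in a purely dictionary-based segmenter.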
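[Degree-granting institution]: University of Electronic Science and Technology of China
[Degree level]: Master's
[Year degree granted]: 2012
[Classification number]: TP391.1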

Article link: http://sikaile.net/kejilunwen/sousuoyinqinglunwen/2377849.html
