
A Method for Updating Personalized User Dictionaries Based on Web Information

Published: 2018-01-23 04:35

Keywords: web information extraction; new word discovery; new word classification; personalized loading; pinyin input method
Source: Harbin Institute of Technology, master's thesis, 2013
Type: degree thesis

[Abstract]: Chinese character input is one of the most important problems in Chinese information processing and an important component of intelligent human-computer interfaces. Among Chinese input methods, pinyin input best matches how people are used to typing, and it has now entered its third generation, the cloud input method. Current mainstream input methods emphasize personalization, which mainly takes the form of word-frequency adjustment and automatic lexicon expansion. Word-frequency adjustment means continually updating the word frequencies in the lexicon according to segmentation statistics of the user's input, so that candidate words are presented in the most reasonable order. Automatic lexicon expansion means collecting very large training corpora (on the terabyte scale) through search engines or by crawling the Internet, so that words of all kinds can be added to the dictionary without restriction. This thesis improves the input method mainly through lexicon expansion. The most important aspect of lexicon expansion is new word discovery, which is the core of this thesis. The main work is as follows:

(1) Extraction and processing of web information. A web crawler crawls Sina web pages and extracts their content. Because that content contains a large amount of junk information (such as advertisements and copyright notices), it must be purified. Web page purification is the process of parsing and filtering every page in the raw page repository: extracting the useful information, marking the important information, and removing low-value material such as advertisements and copyright notices. After purification, a raw page becomes a page with clear structure, compact content, and unambiguous information.

(2) Design and implementation of new word extraction. New words are extracted from the purified pages with a method based on ordinary repeated-string statistics. The Chinese text is split at punctuation marks and at words in a stop-word list. Occurrences of every two-, three-, and four-character string are then counted, and strings whose count exceeds a preset threshold become candidate new words. Repeated substrings are deleted using a repeated-string search algorithm, and junk strings are deleted using word-formation rules. Finally, the candidates are compared with the input method's own lexicon to form a new-word lexicon.

(3) New word classification and personalized loading of the lexicon. A study of the raw pages showed that the title field in the purified page data also carries the category of the body text, so the category is extracted by matching, and the new words are classified accordingly. According to the user's habits, one or more categories of the new-word lexicon are selectively loaded or removed, reflecting the user's individual preferences.

Finally, to obtain a realistic and objective evaluation of the system, new word extraction is evaluated by precision, recall, and F-measure, and the input method is compared before and after the new-word lexicon is added, using character accuracy and line accuracy as metrics. The evaluation shows that new word extraction scores well on each measure and that adding the new-word lexicon further improves the input method's performance.
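The new word extraction step described in the abstract (split at punctuation and stop words, count two- to four-character strings, apply a threshold, prune repeated substrings, compare with the existing lexicon) can be sketched roughly as below. This is a minimal illustration, not the thesis's implementation: the punctuation set, the stop-word list, the names `split_segments`, `count_ngrams`, `extract_new_words`, and `precision_recall_f`, and the substring-pruning rule (drop a string when a longer candidate containing it has the same count) are all assumptions. The word-formation rules for deleting junk strings are not modelled, and the F-measure is assumed to be the balanced F1.

```python
import re
from collections import Counter

# Hypothetical punctuation set and stop-word list; the thesis does not list its own.
PUNCTUATION = r"[，。！？；：、,.!?;:\s]+"
STOP_WORDS = ("的", "了", "是", "在", "和")


def split_segments(text):
    """Cut text at punctuation and at stop words, returning non-empty character runs."""
    pattern = PUNCTUATION + "|" + "|".join(re.escape(w) for w in STOP_WORDS)
    return [seg for seg in re.split(pattern, text) if seg]


def count_ngrams(segments, sizes=(2, 3, 4)):
    """Count every 2-, 3-, and 4-character string inside each segment."""
    counts = Counter()
    for seg in segments:
        for n in sizes:
            for i in range(len(seg) - n + 1):
                counts[seg[i:i + n]] += 1
    return counts


def extract_new_words(text, lexicon, threshold=2):
    """Return candidate strings whose count exceeds threshold and that are not in lexicon."""
    counts = count_ngrams(split_segments(text))
    candidates = {w: c for w, c in counts.items() if c > threshold}
    # Assumed pruning rule: a string is redundant if a longer candidate
    # contains it and occurs exactly as often.
    kept = {
        w for w, c in candidates.items()
        if not any(w != longer and w in longer and candidates[longer] == c
                   for longer in candidates)
    }
    return sorted(kept - set(lexicon))


def precision_recall_f(extracted, gold):
    """Precision, recall, and balanced F1 of extracted words against a gold list."""
    extracted, gold = set(extracted), set(gold)
    correct = len(extracted & gold)
    p = correct / len(extracted) if extracted else 0.0
    r = correct / len(gold) if gold else 0.0
    f = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f
```

For example, `extract_new_words("人工智能 人工智能 人工智能", {"人工", "智能"})` keeps only the four-character string, because every shorter substring occurs exactly as often as the string that contains it. A real system would tune the threshold to the corpus size and add the rule-based junk filter the abstract mentions.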
[Degree-granting institution]: Harbin Institute of Technology
[Degree]: Master's
[Year awarded]: 2013
[Classification number]: TP391.14

