A Personalized User Dictionary Update Method Based on Web Information

Published: 2018-01-23 04:35

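  Keywords: web information extraction; new word discovery; new word classification; personalized loading; Pinyin input method. Source: Harbin Institute of Technology, 2013 master's thesis. Document type: degree thesis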


[Abstract]: Chinese character input is one of the most important problems in Chinese information processing and an important component of intelligent human-machine interfaces. Among input approaches, Pinyin input best matches users' habits, and it has now entered its third generation, the cloud input method. Mainstream input methods emphasize personalization, which mainly takes two forms: word-frequency adjustment and automatic lexicon expansion. Word-frequency adjustment means continually updating the word frequencies in the lexicon from segmentation statistics of the user's own input, so that candidate words are ranked in the order that best suits that user. Automatic lexicon expansion means collecting very large training corpora (terabyte scale) through search engines or web crawling, so that words of every kind can be added to the dictionary without restriction. This thesis improves the input method mainly through lexicon expansion. The most important part of lexicon expansion is new word discovery, which is the core of this thesis. The work covers the following:

(1) Extraction and processing of web information. A web crawler fetches pages from Sina and extracts their content. Because that content still contains a large amount of junk (advertisements, copyright notices, and similar), it has to be purified. Page purification is the process of parsing and filtering every page in the raw page store, extracting the valid information, marking the important information, and removing low-value material such as advertisements and copyright notices. After purification, a raw page becomes a page with clear structure, compact content, and explicit information.

(2) Design and implementation of new word extraction. New words are extracted from the purified pages with a method based on ordinary repeated-string statistics. The Chinese text is split at punctuation marks and at entries of a stop-word list; the number of occurrences of every two-, three-, and four-character string is then counted, and strings whose count exceeds a preset threshold become candidate new words. A repeated-string search algorithm removes redundant substrings, and word-formation rules remove junk strings. Finally, the candidates are compared against the input method's own lexicon, and those not already present form the new-word lexicon.

(3) New word classification and personalized lexicon loading. Examination of the raw pages shows that the title field of a purified page also carries the category of the body text, so the category is extracted by matching and used to classify the new words. According to the user's habits, one or more categories of the new-word lexicon can then be selectively loaded or removed, which reflects the user's individual characteristics.

Finally, to obtain a realistic and objective evaluation of the system, the thesis measures new word extraction with precision, recall, and F-measure, and uses character accuracy and line accuracy to compare the input method's performance before and after the new-word lexicon is added. The evaluation shows that new word extraction scores well on each measure and that the input method's performance improves further once the new-word lexicon is added.
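The extraction step in (2) and the precision, recall, and F-measure evaluation can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the thesis's implementation: the stop-word set, the punctuation pattern, the function names, and the default threshold are all hypothetical, and the substring rule used here (drop a candidate when a longer candidate contains it and occurs equally often) is one plausible reading of "remove redundant substrings". The word-formation rules are not described in the abstract and are left out.

```python
import re
from collections import Counter

# Hypothetical stand-ins: the abstract does not list the actual resources.
STOP_WORDS = {"的", "了", "是", "在"}
PUNCTUATION = re.compile(r"[，。！？；：、,.!?;:\s]+")


def split_segments(text, stop_words=STOP_WORDS):
    """Split text at punctuation, then at stop words; drop empty pieces."""
    chunks = [c for c in PUNCTUATION.split(text) if c]
    if not stop_words:
        return chunks
    stop_pattern = re.compile("|".join(re.escape(w) for w in stop_words))
    segments = []
    for chunk in chunks:
        segments.extend(p for p in stop_pattern.split(chunk) if p)
    return segments


def count_ngrams(segments, sizes=(2, 3, 4)):
    """Count every 2-, 3-, and 4-character string inside each segment."""
    counts = Counter()
    for seg in segments:
        for n in sizes:
            for i in range(len(seg) - n + 1):
                counts[seg[i:i + n]] += 1
    return counts


def extract_new_words(text, lexicon, threshold=2):
    """Return candidate strings that are frequent and not already in the lexicon."""
    counts = count_ngrams(split_segments(text))
    # Keep strings whose count exceeds the preset threshold.
    candidates = {s: c for s, c in counts.items() if c > threshold}
    # Assumed substring rule: drop s if a longer candidate contains it
    # and occurs exactly as often.
    kept = {
        s for s, c in candidates.items()
        if not any(s != t and s in t and candidates[t] == c for t in candidates)
    }
    # Compare with the input method's own lexicon.
    return kept - set(lexicon)


def precision_recall_f(predicted, gold):
    """Precision, recall, and balanced F-measure over two sets of words."""
    predicted, gold = set(predicted), set(gold)
    hits = len(predicted & gold)
    precision = hits / len(predicted) if predicted else 0.0
    recall = hits / len(gold) if gold else 0.0
    total = precision + recall
    f_measure = 2 * precision * recall / total if total else 0.0
    return precision, recall, f_measure
```

For example, calling `extract_new_words("云输入法很好用，云输入法的词库更新快。我喜欢云输入法。", lexicon={"输入法", "词库"})` returns a set containing only `云输入法`: the shorter strings such as `输入法` occur equally often inside it and are dropped by the substring rule. A real system would need a much larger corpus and a tuned threshold.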
[Degree-granting institution]: Harbin Institute of Technology
[Degree level]: Master's
[Year degree conferred]: 2013
[Classification number]: TP391.14
