A Personalized User Dictionary Update Method Based on Network Information
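Keywords: network information extraction; new word discovery; new word classification; personalized loading; Pinyin input method. Source: Harbin Institute of Technology, master's thesis, 2013. Document type: degree thesis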
[Abstract]: Chinese character input is one of the most important problems in Chinese information processing and an important component of intelligent human-computer interfaces. Among Chinese input methods, Pinyin input matches users' habits comparatively well, and it has now entered the stage of third-generation cloud input methods. Mainstream input methods currently emphasize personalization, which shows mainly in word-frequency adjustment and automatic lexicon expansion. Word-frequency adjustment means adjusting the word frequencies in the lexicon at any time, based on word-segmentation statistics of the user's input, so that candidate entries are presented in the most reasonable order. Automatic lexicon expansion means using search engines or Internet crawling to collect very large training corpora (terabyte scale), so that all kinds of words can be added to the dictionary without restriction. This thesis improves the input method mainly through lexicon expansion. The most important aspect of lexicon expansion is new word discovery, which is the core of this thesis. To address it, the thesis carries out the following work: (1) Extraction and processing of network information: a web crawler crawls Sina web pages and extracts their content. Because that content contains a large amount of junk information (such as advertisements and copyright notices), the extracted content must be purified so that effective information is extracted and important information is marked. Web page purification is the process of parsing and filtering every page in the original page repository, extracting effective information, marking important information, and removing low-value information such as advertisements and copyright notices. After purification, an original page becomes a page with clear structure, compact content, and unambiguous information.
(2) Design and implementation of new word extraction: new words are extracted from the purified pages with an ordinary repeated-string statistics method. The Chinese text is split at punctuation marks and at entries of a stop-word list, and the occurrences of every two-character, three-character, and four-character string are then counted. Strings whose count exceeds a preset threshold become candidate new words. Repeated substrings are then deleted with a repeated-string search algorithm, and garbage strings are deleted with word-formation rules. Finally, the candidates are compared against the input method's own lexicon to form a new-word lexicon. (3) New word classification and personalized lexicon loading: a study of the original pages shows that the title field of the purified page information also carries the category of the body text, so the category is extracted by matching, and the new words are classified accordingly. According to the user's habits, one or more categories of the new-word lexicon can be selectively loaded or removed, which reflects the user's individual preferences. Finally, to obtain a realistic and objective evaluation of the system, the thesis measures new word extraction by precision, recall, and F-value, and uses character accuracy and line accuracy to compare the input method's performance before and after the new-word lexicon is added. The evaluation shows that new word extraction performs well on each of these measures and that the input method's performance improves further once the new-word lexicon is added.
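The repeated-string statistics step in item (2) can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the thesis's implementation: the function names, the stop-word set, the sample text, and the threshold value are all hypothetical. The substring-deletion rule used here (drop a candidate that occurs inside a longer candidate with the same count) is one common heuristic standing in for the repeated-string search algorithm that the abstract mentions but does not specify. The word-formation rules for garbage strings are omitted because the abstract gives no detail about them.

```python
import re
from collections import Counter

# Any run of characters outside the basic CJK block (punctuation, Latin
# letters, digits, whitespace) is treated as a boundary. This is an assumption.
NON_CJK = r"[^\u4e00-\u9fff]+"


def split_fragments(text, stop_words):
    """Split text at punctuation (non-CJK runs) and at stop words."""
    # Longer stop words first, so the alternation prefers the longest match.
    alternatives = sorted((re.escape(w) for w in stop_words), key=len, reverse=True)
    pattern = re.compile("|".join([NON_CJK] + alternatives))
    return [frag for frag in pattern.split(text) if frag]


def count_ngrams(fragments, sizes=(2, 3, 4)):
    """Count every 2-, 3- and 4-character string inside each fragment."""
    counts = Counter()
    for frag in fragments:
        for n in sizes:
            for i in range(len(frag) - n + 1):
                counts[frag[i:i + n]] += 1
    return counts


def extract_new_words(text, lexicon, stop_words, threshold):
    """Return candidate new words that are not already in the lexicon."""
    counts = count_ngrams(split_fragments(text, stop_words))
    # Keep strings whose count exceeds the preset threshold.
    candidates = {w: c for w, c in counts.items() if c > threshold}
    # Hypothetical substring rule: drop a candidate that is contained in a
    # longer candidate with the same count, since it never occurs on its own.
    kept = {
        w for w, c in candidates.items()
        if not any(w != other and w in other and candidates[other] == c
                   for other in candidates)
    }
    # Compare against the input method's existing lexicon.
    return sorted(kept - set(lexicon))


if __name__ == "__main__":
    sample = "云输入法的词库。云输入法的更新。云输入法的用户。"
    print(extract_new_words(sample, lexicon={"输入"}, stop_words={"的"}, threshold=2))
    # prints ['云输入法']
```

In the sample, every shorter string such as "输入法" occurs exactly as often as "云输入法", so only the longest string survives. A real system would need the much larger corpus and tuned threshold that the abstract describes.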
[Degree-granting institution]: Harbin Institute of Technology
[Degree level]: Master's
[Year degree conferred]: 2013
[Classification number]: TP391.14