A Quantitative Study of Chinese Hot Words and Dialect Vocabulary in the Internet Environment
Published: 2018-07-09 09:59
Topics: query logs + pinyin input method; Source: Tsinghua University, 2014 doctoral dissertation
[Abstract]: With the progress of science and technology, and in particular the continuous development of information technology and the spread of the Internet, the Chinese language has changed enormously. Vocabulary, the most active part of a language, has changed most visibly. Lexical change in the Internet environment shows up in two main ways: hot words and new words keep emerging, and dialect words are used heavily online. Studying lexical change helps improve the performance of Chinese information processing, and identifying hot words and dialect words helps supplement dictionaries and supports quantitative language research. This thesis studies lexical change using data from its main sources: search engine query logs and Chinese pinyin input methods. The work includes the following:
(1) A method for identifying hot words and new words from search engine queries. Analysis of the temporal dynamics of hot queries shows that hot words follow specific temporal patterns. The main burst period of a hot word is detected, and an algorithm based on the frequency ratio within the burst period is designed to discover hot words automatically.
(2) Expansion of the hot-word set using both semantic similarity and time-series similarity. This mines low-frequency queries related to hot words and addresses the difficulty of identifying low-frequency hot words and new words. By analyzing the temporal patterns of query frequency series, the thesis also focuses on identifying the predictable portion of hot words.
(3) A method for automatically identifying dialect words from the user records of a Chinese pinyin input method. Regional features of input-method entries are extracted from users' geographic information, and colloquial features are extracted by analyzing the categories of programs from which users invoke the input method. Regional and colloquial features are then analyzed together, and a ranking method based on feature combination is proposed to identify dialect words. Experimental results show that combining colloquial and regional features greatly improves dialect word recognition performance.
(4) A quantitative study of dialect partitioning. Different word lists are designed from the vocabulary and frequency information in the pinyin input method data, the correlations between the frequency-rank sequences of these lists across regions are examined to compare the relationships among dialects, and hierarchical clustering is applied to partition the dialects. Features such as the discrimination and coverage of entries between a dialect region and its neighboring regions are also analyzed to compile characteristic dialect words for each region. Finally, the geographic distribution of dialect vocabulary is visualized to support research on lexical relationships among dialects.
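The abstract does not spell out how the burst-period frequency ratio in contribution (1) is computed, so the following is only a minimal sketch of one plausible reading: find the fixed-width window with the highest query volume and measure what share of the word's total queries falls inside it. The function names, the 7-day window, and the 0.5 threshold are all illustrative assumptions and are not taken from the thesis.

```python
from typing import Sequence, Tuple


def burst_frequency_ratio(counts: Sequence[int], window: int) -> Tuple[int, float]:
    """Return (start index of the densest window, share of all queries inside it).

    Hypothetical helper: the thesis may define the burst period differently.
    """
    if window <= 0 or window > len(counts):
        raise ValueError("window must be between 1 and len(counts)")
    total = sum(counts)
    if total == 0:
        return 0, 0.0
    # Slide a fixed-width window and keep the first one with the largest volume.
    current = sum(counts[:window])
    best_sum, best_start = current, 0
    for start in range(1, len(counts) - window + 1):
        current += counts[start + window - 1] - counts[start - 1]
        if current > best_sum:
            best_sum, best_start = current, start
    return best_start, best_sum / total


def is_hot_word(counts: Sequence[int], window: int = 7, threshold: float = 0.5) -> bool:
    """Flag a query as hot when most of its volume sits in one burst window.

    The default window and threshold are assumed values for illustration.
    """
    _, ratio = burst_frequency_ratio(counts, window)
    return ratio >= threshold
```

Under these assumptions, a query whose daily counts are flat over a month would score a low ratio, while a query that spikes for a few days and is otherwise rare would score close to 1. A real system would likely need a variable-width burst period and a threshold tuned on labeled data.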
[Degree-granting institution]: Tsinghua University
[Degree level]: Doctoral
[Year degree conferred]: 2014
[Classification number]: TP391.1
Article ID: 2108992
Article link: http://sikaile.net/kejilunwen/sousuoyinqinglunwen/2108992.html