A Quantitative Study of Chinese Hot Words and Dialect Vocabulary in the Internet Environment

Published: 2018-07-09 09:59

Topic: query logs + pinyin input method. Reference: Tsinghua University, 2014 doctoral dissertation

[Abstract]: With advances in science and technology, especially the continued development of information technology and the spread of the Internet, the Chinese language has changed greatly. Vocabulary, the most active part of a language, has changed the most. Lexical change in the Internet environment shows up in two main ways: hot words and new words keep emerging, and dialect words are widely used online. Studying these changes helps improve the performance of Chinese information processing; identifying hot words and dialect words helps supplement dictionaries and supports quantitative language research. This thesis studies lexical change using data from two of its main sources: search engine query logs and Chinese pinyin input methods. The work comprises four parts:

(1) A method for identifying hot words and new words from search engine queries. Analysis of the temporal dynamics of hot queries shows that hot words follow specific temporal patterns. The method detects the main burst period of a candidate word and uses an algorithm based on the frequency ratio within the burst period to discover hot words automatically.

(2) Expansion of the hot-word set by combining semantic similarity with time-series similarity. This mines low-frequency queries related to known hot words and addresses the difficulty of identifying low-frequency hot words and new words. By analyzing the temporal patterns of query frequency series, the work also focuses on identifying the predictable subset of hot words.

(3) A method for automatically identifying dialect words from Chinese pinyin input method user records. Regional features of input method entries are extracted from users' geographic information, and colloquial features are extracted by analyzing the categories of programs from which users invoke the input method. Combining the two feature types, a ranking method based on feature combination is proposed to identify dialect words. Experiments show that combining colloquial and regional features greatly improves dialect word identification performance.

(4) A quantitative study of dialect partitioning. Different word lists are designed from the vocabulary and frequency information in the pinyin input method data; the correlations between the frequency-rank sequences of each word list across regions are examined to compare regional dialects; and hierarchical clustering is applied to dialect partitioning. In addition, features such as the discriminability and coverage of entries across a dialect region and its neighboring regions are analyzed to compile characteristic dialect words for each region. Finally, the geographic distribution of dialect vocabulary is visualized to support the study of lexical relationships among dialects.
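As a rough illustration of part (4), the sketch below compares regions by the rank correlation of word frequencies over a shared word list and then groups them by hierarchical clustering. It is not the thesis's implementation: the abstract does not name the correlation coefficient or the linkage rule, so Spearman correlation and average linkage are assumptions here, and the region names, the function names, and the frequency counts are all hypothetical toy values.

```python
from itertools import combinations


def average_ranks(values):
    """Return 1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman correlation, computed as the Pearson correlation of the ranks.

    Assumes neither input is constant (otherwise the denominator is zero).
    """
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(rx)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = sum((a - mx) ** 2 for a in rx) ** 0.5
    sy = sum((b - my) ** 2 for b in ry) ** 0.5
    return cov / (sx * sy)


def cluster_regions(freq_by_region):
    """Average-linkage agglomerative clustering with distance 1 - Spearman rho.

    Returns the merge history as (cluster_a, cluster_b, distance) tuples.
    """
    names = list(freq_by_region)
    dist = {
        frozenset(pair): 1.0 - spearman(freq_by_region[pair[0]], freq_by_region[pair[1]])
        for pair in combinations(names, 2)
    }

    def linkage(c1, c2):
        # Mean pairwise distance between the members of two clusters.
        return sum(dist[frozenset((a, b))] for a in c1 for b in c2) / (len(c1) * len(c2))

    clusters = [(name,) for name in names]
    merges = []
    while len(clusters) > 1:
        c1, c2 = min(combinations(clusters, 2), key=lambda p: linkage(*p))
        merges.append((c1, c2, linkage(c1, c2)))
        clusters = [c for c in clusters if c not in (c1, c2)] + [c1 + c2]
    return merges


# Hypothetical counts of five word-list entries in four made-up regions.
toy_frequencies = {
    "region_a": [50, 40, 30, 20, 10],
    "region_b": [48, 41, 25, 22, 9],
    "region_c": [10, 20, 30, 40, 50],
    "region_d": [12, 18, 35, 33, 60],
}

merges = cluster_regions(toy_frequencies)
for left, right, distance in merges:
    print(left, right, round(distance, 3))
```

On this toy data, region_a and region_b merge first because they rank the five entries identically, region_c and region_d merge next, and the two pairs join last. Using ranks rather than raw counts keeps the comparison insensitive to differences in user volume between regions, which is one plausible reason to work with frequency-rank sequences.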
[Degree-granting institution]: Tsinghua University
[Degree level]: Doctoral
[Year degree conferred]: 2014
[Classification number]: TP391.1


Article ID: 2108992

Article link: http://sikaile.net/kejilunwen/sousuoyinqinglunwen/2108992.html
