
Research on Multi-Label Chinese Question Classification

Published: 2018-09-13 14:56
[Abstract]: Community-based Question Answering (CQA) is a relatively new kind of web application that has gradually been accepted and widely adopted by the public. Well-known examples include Sina iAsk, Baidu Zhidao, Yahoo! Knowledge, and Zhihu. What these systems have in common is that users can answer questions posed by others, submit their own questions for others to answer, and rate the answers they receive. A CQA system is a vast store of knowledge: its content, the questions themselves, has accumulated over many years, covers every aspect of users' lives, and is both large in volume and broad in scope. Given a user's question, a CQA system collects similar questions online, answers them, and pushes related questions to the user, with the ultimate goal of returning the question-answer pairs most directly relevant to what the user asked. In other words, this interactive mode of question answering does not return only the answer to a single question; it returns a whole series of information related to it.

Question analysis, information retrieval, and answer extraction are the three main components of a CQA system. The key problems are therefore: first, in question analysis, how to understand the real meaning of the question the user submitted; second, in information retrieval, how to find the information related to the question; and third, in answer extraction, how to extract the answer accurately from that information.

Chinese questions have characteristics of their own. They are short, usually no more than 160 characters, so their feature information is relatively sparse, which leaves them with a weak signal and a high level of noise. Questions in CQA also often contain irregular words or sentences, such as colloquialisms, habitual abbreviations, and altered word forms used online, which degrades the effect of traditional text preprocessing and the performance of text representation methods. Chinese questions are also polysemous, meaning that a single question can belong to several categories at once. For example, the question "What should I pay attention to from buying a house through to decorating it?" belongs to both the "home purchase" category and the "home decoration" category.

This thesis therefore studies the feature sparsity and polysemy of Chinese questions. The work on multi-label Chinese question classification yields the following results:

(1) The Wikipedia knowledge base, which is rich in concepts, links, and other associative information, is used to build a set of related concepts for each word in a Chinese question. The link relationships between pages are then used to quantify the semantic relationships between concepts. The set of related concepts obtained for a word serves as that word's set of expansion feature words. The features of a question are expanded through the semantic relationships between words, and disambiguation is then applied to select the appropriate concepts, which completes the feature expansion. This makes the question's description of its concepts more precise, enriches its semantic expression, and to some extent reduces the impact of feature sparsity on classification.

(2) The traditional ML-kNN algorithm does not adequately account for correlations between labels in multi-label Chinese question classification. Building on ML-kNN, this thesis proposes ML-CQC, a multi-label Chinese question classification algorithm that brings the correlation between category labels into the classification process. When ML-CQC uses the maximum a posteriori (MAP) rule to infer the categories of an unlabeled question, it also takes into account the statistics of the other categories among that question's neighbors. On that basis, ML-CQC is then iterated using the correlations among the category labels already predicted. Unlike ML-kNN, ML-CQC can exploit label correlation to improve classification performance. Experiments show that applying ML-CQC to feature-expanded Chinese questions is feasible and effective.

(3) Building on ML-CQC, the thesis further proposes the SML-CQC algorithm. Its core idea is to compute the ratio s of positive to negative examples for a category label and to raise the corresponding prior probability to the power s, in order to reduce the misclassifications caused by a label having too few positive examples.
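The abstract does not give the update rules of ML-CQC or SML-CQC, so they are not reproduced here. As a point of reference only, the following is a minimal sketch of the standard ML-kNN baseline that both algorithms are described as extending: for each label, a prior and a neighbor-count likelihood are estimated from the training set, and a label is assigned when the MAP comparison favors it. The function names (`fit_mlknn`, `predict_mlknn`), the Euclidean distance, the smoothing constant, and the toy two-dimensional feature vectors are illustrative assumptions, not the thesis's implementation.

```python
from math import dist


def _neighbors(X, query, k, skip=None):
    """Indices of the k training points closest to query (Euclidean)."""
    order = sorted((i for i in range(len(X)) if i != skip),
                   key=lambda i: dist(X[i], query))
    return order[:k]


def fit_mlknn(X, Y, k=3, smooth=1.0):
    """Estimate per-label priors and neighbor-count likelihoods."""
    m, n_labels = len(X), len(Y[0])
    model = {"X": X, "Y": Y, "k": k, "prior": [], "like": []}
    # Neighbors of each training point, excluding the point itself.
    neigh = [_neighbors(X, X[i], k, skip=i) for i in range(m)]
    for l in range(n_labels):
        positives = sum(y[l] for y in Y)
        model["prior"].append((smooth + positives) / (2 * smooth + m))
        c1 = [0] * (k + 1)  # count histogram for points that have label l
        c0 = [0] * (k + 1)  # count histogram for points that lack label l
        for i in range(m):
            count = sum(Y[j][l] for j in neigh[i])
            (c1 if Y[i][l] else c0)[count] += 1
        like1 = [(smooth + c) / (smooth * (k + 1) + sum(c1)) for c in c1]
        like0 = [(smooth + c) / (smooth * (k + 1) + sum(c0)) for c in c0]
        model["like"].append((like1, like0))
    return model


def predict_mlknn(model, query):
    """Return a 0/1 vector: label l is set when the MAP rule favors it."""
    X, Y, k = model["X"], model["Y"], model["k"]
    neigh = _neighbors(X, query, k)
    labels = []
    for l, (like1, like0) in enumerate(model["like"]):
        count = sum(Y[j][l] for j in neigh)
        p1 = model["prior"][l] * like1[count]
        p0 = (1 - model["prior"][l]) * like0[count]
        labels.append(1 if p1 > p0 else 0)
    return labels


# Toy data (hypothetical): label 0 = "home purchase", label 1 = "home decoration".
X = [(0, 0), (0, 1), (1, 0), (1, 1),          # purchase only
     (5, 5), (5, 6), (6, 5), (6, 6),          # both categories
     (10, 10), (10, 11), (11, 10), (11, 11)]  # decoration only
Y = [[1, 0]] * 4 + [[1, 1]] * 4 + [[0, 1]] * 4

model = fit_mlknn(X, Y, k=3)
print(predict_mlknn(model, (5.5, 5.5)))  # [1, 1]
print(predict_mlknn(model, (0.5, 0.5)))  # [1, 0]
```

Each label is decided independently here, which is exactly the limitation the abstract attributes to ML-kNN; a correlation-aware variant would have to let the neighbor statistics of other labels influence each decision.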
[Degree-granting institution]: Kunming University of Science and Technology
[Degree level]: Master's
[Year degree awarded]: 2016
[Classification number]: TP391.1

Article ID: 2241502


Article link: http://sikaile.net/wenyilunwen/shinazhuanghuangshejilunwen/2241502.html

