Design and Development of a Corpus Phrase Sequence Extraction System
Published: 2018-05-08 12:32
Topic of this article: corpus-driven + phrase sequences; Reference: 《外語電化教學》, 2017, Issue 04
[Abstract]: Extracting phrase sequences from corpora has long been a key technical step in phraseology research. Because of computational and operational complexity, previous studies mostly relied on a relatively narrow range of statistical methods to measure and extract phrase sequences, so the extracted data contained a large amount of noise. This paper uses state-of-the-art big-data processing and computing techniques to implement phrase sequence extraction based on multiple statistical measures, including frequency, mutual information, and boundary entropy, and develops a corresponding system. Experimental results show that the system can process large corpora on the scale of ten million words on an ordinary computer and significantly improves the quality of the extracted phrase sequences.
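The abstract names three statistical measures (frequency, mutual information, boundary entropy) without giving formulas. As a rough illustration only, and not the authors' system, the sketch below scores n-grams with textbook versions of these measures: raw count, pointwise mutual information generalized to n words, and the Shannon entropy of the words seen immediately to the left and right of a candidate. The function names (`score_ngrams`, `extract`) and the threshold values are hypothetical choices made for this example.

```python
import math
from collections import Counter, defaultdict


def entropy(neighbours):
    """Shannon entropy (in bits) of a Counter of neighbouring words."""
    total = sum(neighbours.values())
    if total == 0:
        return 0.0
    return -sum(c / total * math.log2(c / total) for c in neighbours.values())


def score_ngrams(tokens, n):
    """Score every n-gram by frequency, PMI, and left/right boundary entropy."""
    unigram_freq = Counter(tokens)
    total_tokens = len(tokens)
    gram_freq = Counter()
    left = defaultdict(Counter)   # words seen just before each n-gram
    right = defaultdict(Counter)  # words seen just after each n-gram

    for i in range(len(tokens) - n + 1):
        gram = tuple(tokens[i:i + n])
        gram_freq[gram] += 1
        if i > 0:
            left[gram][tokens[i - 1]] += 1
        if i + n < len(tokens):
            right[gram][tokens[i + n]] += 1

    total_grams = sum(gram_freq.values())
    scores = {}
    for gram, freq in gram_freq.items():
        p_gram = freq / total_grams
        # Probability of the words co-occurring if they were independent.
        p_independent = math.prod(unigram_freq[w] / total_tokens for w in gram)
        scores[gram] = {
            "freq": freq,
            "pmi": math.log2(p_gram / p_independent),
            "left_entropy": entropy(left[gram]),
            "right_entropy": entropy(right[gram]),
        }
    return scores


def extract(tokens, n, min_freq=2, min_pmi=1.0, min_entropy=0.5):
    """Keep n-grams that pass all three (hypothetical) thresholds."""
    kept = []
    for gram, s in score_ngrams(tokens, n).items():
        boundary = min(s["left_entropy"], s["right_entropy"])
        if s["freq"] >= min_freq and s["pmi"] >= min_pmi and boundary >= min_entropy:
            kept.append(gram)
    return sorted(kept)


tokens = "in terms of a in terms of b in terms of c x y".split()
trigrams = extract(tokens, 3)
bigrams = extract(tokens, 2)
```

On this toy input, `extract(tokens, 3)` keeps the trigram `in terms of`, while `extract(tokens, 2)` keeps nothing: the fragments `in terms` and `terms of` are frequent and strongly associated, but each always has the same word on one side, so its boundary entropy there is zero. This is the kind of noise that frequency alone would let through.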
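[Author affiliations]: Beihang University; Logistics Science Research Institute of the Chinese People's Liberation Army; Donghua University
[Funding]: Part of the research results of National Social Science Fund of China projects (Project Nos. 13BYY074; 14CYY049) and a Beijing Social Science Fund project (Project No. 16JDYYA001)
[Classification number]: H314.3; TP311.52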
Article ID: 1861420
Article link: http://sikaile.net/waiyulunwen/yingyulunwen/1861420.html