Research on Entity Linking of Scientific and Technical Report Terms Based on Multiple Knowledge Bases

Published: 2018-11-27 18:37

[Abstract]: Scientific and technical reports are an important document resource, and mining and analyzing them in depth has considerable value. However, current research on these reports remains focused on defining their basic concepts and attributes and on building the report system; very little work mines and analyzes the content of the reports themselves. Scientific and technical reports contain a large number of specialized term entities. Most of these entities are the research subjects of the reports and reflect the current state and future trends of science and technology in China. Mining the content of the reports and recognizing the specialized term entities in them is therefore important for promoting scientific and technological innovation. Entity recognition, a key technology in natural language processing, can automatically recognize entities such as person names, place names, and organization names in text, and extending it makes automatic recognition of specialized term entities possible. Taking scientific and technical reports as the research object, this thesis first uses new-word discovery to find potential new terms that are out of vocabulary, then builds a specialized terminology knowledge base as corpus support for term entity recognition and linking, and finally uses the Stanford NER entity recognition framework to recognize term entities in the reports automatically and to link and disambiguate them against multiple knowledge bases. The main work is as follows:
(1) To address the problems of current Chinese word segmentation and the characteristics of terms in scientific and technical reports, a new-word discovery method based on part-of-speech combinations is proposed. Part-of-speech combination rules for specialized terms are defined and used to extract matching word strings. New words are then determined from each string's support together with internal and external features such as word length and mutual information. The method finds new specialized terms effectively, improves the accuracy of Chinese word segmentation to some extent, and lays the foundation for term entity recognition.
(2) A specialized terminology knowledge base is constructed. Entity recognition needs a large corpus as support: entity features are extracted from the training corpus to enable automatic recognition. Because no public corpus of scientific and technical report terms is currently available, this thesis takes the specialized terminology provided by the China Standard Terminology Network as the data source and uses web crawling, database, and other information technologies to design and build the terminology knowledge base.
(3) The mainstream entity recognition methods are reviewed in detail, and the mature open-source Stanford NER framework, which is based on the conditional random field model, is selected. A term entity model is trained to recognize term entities in scientific and technical reports automatically, and multiple knowledge bases are combined with semantic similarity computation to link and disambiguate the term entities.
(4) Scientific and technical reports published by the National Science and Technology Report Service System are selected as experimental data, and a prototype system for multi-knowledge-base entity linking of scientific and technical report terms is designed and developed. The system integrates report data preprocessing, new-word discovery, entity recognition, and entity linking. It recognizes and disambiguates term entities in the reports automatically and verifies the correctness and effectiveness of the proposed method.
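The new-word discovery step described in (1) can be illustrated with a minimal Python sketch. It is not the thesis's implementation: the part-of-speech patterns, the thresholds, and all function names below are hypothetical, the input is assumed to be already segmented and POS-tagged, and mutual information is approximated by pointwise mutual information over word frequencies. Candidates are word strings whose tag sequence matches a pattern, filtered by support, length, and PMI.

```python
import math
from collections import Counter

# Hypothetical POS patterns for candidate terms (n = noun, a = adjective,
# v = verb). Illustrative assumptions only, not the rules from the thesis.
TERM_PATTERNS = {("n", "n"), ("a", "n"), ("v", "n"), ("n", "n", "n")}


def extract_candidates(sentences, patterns=TERM_PATTERNS):
    """Count word n-grams whose POS sequence matches one of the patterns.

    Each sentence is a list of (word, pos) pairs from a segmenter/tagger.
    """
    max_len = max(len(p) for p in patterns)
    counts = Counter()
    for sent in sentences:
        for i in range(len(sent)):
            for n in range(2, max_len + 1):
                window = sent[i:i + n]
                if len(window) < n:  # ran past the end of the sentence
                    break
                if tuple(pos for _, pos in window) in patterns:
                    counts[tuple(word for word, _ in window)] += 1
    return counts


def pmi(candidate, cand_count, word_counts, total_words):
    """Pointwise mutual information: log(p(w1..wn) / (p(w1) * ... * p(wn)))."""
    p_joint = cand_count / total_words
    p_indep = 1.0
    for word in candidate:
        p_indep *= word_counts[word] / total_words
    return math.log(p_joint / p_indep)


def discover_new_words(sentences, min_support=2, min_pmi=1.0, max_chars=8):
    """Return (text, support, pmi) for candidates passing all three filters."""
    word_counts = Counter(word for sent in sentences for word, _ in sent)
    total_words = sum(word_counts.values())
    results = []
    for cand, count in extract_candidates(sentences).items():
        text = "".join(cand)
        # Support and word-length filters (thresholds are assumptions).
        if count < min_support or len(text) > max_chars:
            continue
        score = pmi(cand, count, word_counts, total_words)
        if score >= min_pmi:
            results.append((text, count, score))
    # Highest-scoring candidates first.
    return sorted(results, key=lambda r: r[2], reverse=True)


if __name__ == "__main__":
    sample = [
        [("实体", "n"), ("链接", "n"), ("方法", "n")],
        [("实体", "n"), ("链接", "n"), ("系统", "n")],
        [("研究", "v"), ("方法", "n")],
    ]
    for text, support, score in discover_new_words(sample):
        print(text, support, round(score, 3))
```

In a real pipeline the accepted strings would be added to the segmenter's user dictionary, so that later segmentation and entity recognition treat each term as a single token.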
[Degree-granting institution]: Central China Normal University
[Degree level]: Master's
[Year degree granted]: 2017
[Classification number]: G353.1

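[Citation]: Chen Guiqiang (陳桂強); Research on Entity Linking of Scientific and Technical Report Terms Based on Multiple Knowledge Bases [D]; Central China Normal University; 2017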

Article ID: 2361695

Article link: http://sikaile.net/tushudanganlunwen/2361695.html
