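Research on Similarity Calculation and Clustering Algorithms for Science and Technology Projects

Published: 2018-05-29 02:49

Topic: VSM + semantic understanding; Source: Hangzhou Dianzi University, 2015 master's thesis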

[Abstract]: As China's investment in science and technology funding grows, research institutions are submitting more and more project applications, and preventing the same project from being approved twice has become an important part of modern science and technology project management. Manual duplicate checking is clearly impractical, and existing duplicate-checking systems fall short in both accuracy and speed, so research on the key technologies of a project duplicate-checking system is needed. This thesis focuses on the representation model, similarity calculation, and clustering of science and technology projects. The main work is as follows:

1. Because science and technology projects have complex content and carry a large amount of information, a project knowledge representation model and a project relation model are proposed that combine the matter-element knowledge representation model with the vector space model (VSM), which eases the later representation and processing of projects.

2. To meet the duplicate-checking requirements, similarity calculation methods based on the vector space model and on semantic understanding are analyzed and summarized, and on that basis a VSM similarity calculation method based on semantic understanding is proposed. Because project names are short yet carry much useful information and many technical terms, an improved sentence similarity calculation method based on edit distance is also proposed. The two methods are applied to the main content and the name of a project respectively, and their results are weighted and combined into an overall project similarity.

3. Duplicate checking requires comparing the project under review against every existing project, which is inefficient, so projects are clustered first and checked afterwards. Existing clustering algorithms require parameters to be supplied in advance and have a time complexity too high for large project libraries. The thesis therefore proposes a nearest-neighbor project clustering algorithm based on double thresholds and applies it to the duplicate-checking system, improving checking speed without affecting checking accuracy.

The similarity calculation methods and the clustering algorithm have been applied in the Zhejiang Province science and technology project similarity detection system, where they implement duplicate checking with good accuracy and running speed, which verifies the feasibility of the research.
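The abstract does not give the thesis's formulas, so the following is only a minimal sketch of the general idea in item 2: a content similarity and a name similarity are computed separately and then combined with a weight. It assumes plain term-frequency cosine similarity in place of the semantic VSM method, the classic Levenshtein distance in place of the improved edit-distance method, and a hypothetical `name_weight` of 0.4. All function names here are illustrative and do not come from the thesis.

```python
import math
from collections import Counter


def cosine_similarity(tokens_a, tokens_b):
    """Cosine of the angle between two term-frequency vectors (plain VSM)."""
    vec_a, vec_b = Counter(tokens_a), Counter(tokens_b)
    # Only terms present in both documents contribute to the dot product.
    dot = sum(vec_a[t] * vec_b[t] for t in vec_a.keys() & vec_b.keys())
    norm_a = math.sqrt(sum(c * c for c in vec_a.values()))
    norm_b = math.sqrt(sum(c * c for c in vec_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def edit_distance(a, b):
    """Classic Levenshtein distance, keeping only the previous row."""
    prev = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        curr = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            curr.append(min(prev[j] + 1,          # delete ch_a
                            curr[j - 1] + 1,      # insert ch_b
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def name_similarity(a, b):
    """Edit distance normalized into a similarity in [0, 1]."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    return 1.0 - edit_distance(a, b) / longest


def project_similarity(name_a, content_a, name_b, content_b, name_weight=0.4):
    """Weighted mix of name and content similarity.

    name_weight is an assumed, illustrative value, not one from the thesis.
    """
    name_sim = name_similarity(name_a, name_b)
    content_sim = cosine_similarity(content_a, content_b)
    return name_weight * name_sim + (1.0 - name_weight) * content_sim
```

In this sketch `content_a` and `content_b` are lists of tokens, so word segmentation of the project text is assumed to happen beforehand, while names are compared character by character.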
[Degree-granting institution]: Hangzhou Dianzi University
[Degree level]: Master's
[Year degree conferred]: 2015
[Classification number]: TP391.1
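The abstract names a double-threshold nearest-neighbor clustering algorithm but does not describe its steps. The sketch below is therefore one possible reading, not the thesis's algorithm: each item is compared with the seed of every existing cluster, joins the nearest cluster as a core member when the similarity reaches an assumed upper threshold `t_high`, joins it as a border member when the similarity lies between `t_low` and `t_high`, and otherwise starts a new cluster. The thresholds, the seed-based comparison, and the `jaccard` helper are all illustrative assumptions.

```python
def jaccard(a, b):
    """Jaccard similarity of two token sets (stand-in similarity measure)."""
    union = a | b
    if not union:
        return 1.0
    return len(a & b) / len(union)


def cluster_projects(items, similarity, t_low=0.3, t_high=0.6):
    """Single-pass nearest-neighbor clustering with two thresholds.

    No cluster count is supplied in advance; clusters are created on demand.
    t_low and t_high are assumed values chosen only for illustration.
    """
    if not t_low <= t_high:
        raise ValueError("t_low must not exceed t_high")
    clusters = []
    for item in items:
        # Find the existing cluster whose seed is most similar to the item.
        best, best_sim = None, -1.0
        for cluster in clusters:
            sim = similarity(item, cluster["seed"])
            if sim > best_sim:
                best, best_sim = cluster, sim
        if best is not None and best_sim >= t_high:
            best["core"].append(item)      # clearly belongs to the cluster
        elif best is not None and best_sim >= t_low:
            best["border"].append(item)    # related, but only loosely
        else:
            # Nothing is close enough: the item seeds a new cluster.
            clusters.append({"seed": item, "core": [item], "border": []})
    return clusters


projects = [
    {"a", "b", "c"},
    {"a", "b", "c", "d"},
    {"a", "b", "z"},
    {"p", "q"},
]
clusters = cluster_projects(projects, jaccard)
```

With clusters built this way, a new project would only need to be compared in detail against the members of its nearest cluster rather than the whole project library, which is the speed-up the abstract describes.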

Article ID: 1949205

Article link: http://sikaile.net/guanlilunwen/xiangmuguanli/1949205.html
