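Research on Similarity Calculation for Science and Technology Projects Based on Hadoop
Published: 2018-02-28 18:31
Keywords: science and technology projects; similarity calculation; graph model; maximum clique; Hadoop. Source: Hebei University of Technology, master's thesis, 2015. Document type: degree thesis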
[Abstract]: Since the Outline of the National Medium- and Long-Term Program for Science and Technology Development (2006-2020) took effect, China's public spending on science and technology has grown rapidly, and the management of science and technology projects and funds has steadily improved, providing strong support for scientific and technological development. This growth has also brought new challenges for project management. First, as the number of project applications increases, duplicate applications and duplicate approvals have become prominent problems. Second, as disciplines become more specialized and increasingly intersect and merge, broad exchange and cooperation in project research is an important driver of scientific and technological development, and consolidating projects sensibly according to their similarity is the trend for the future. Strengthening project similarity analysis is the key to solving these problems. Such analysis generally finds similar projects by computing the similarity of their application documents, which provides a basis for approval decisions. The thesis covers the following main topics.

First, it analyzes the key techniques of similarity calculation for science and technology projects and, to handle the large number of technical terms in project applications, proposes an improved out-of-vocabulary word recognition method based on a word sequence frequency directed network. The method filters the segmented words of an application by part of speech and then filters the extracted out-of-vocabulary words against a stop word list. The extracted out-of-vocabulary words are used as part of the feature words and combined with the remaining feature words to build a representation of the application based on the vector space model and a graph model. Application similarity is then computed on this representation.

Second, it proposes a maximum clique method for computing the similarity of graph models. The similarity of two graph models can be obtained from their maximum common subgraph, and the maximum common subgraph problem can in turn be transformed into the maximum clique problem.

Finally, as the number of projects grows, the application preprocessing, feature word extraction, and similarity calculation involved become computationally heavy and slow. To address this, the thesis uses the Hadoop distributed computing platform and the MapReduce parallel computing framework to decompose each stage of application similarity calculation into Map and Reduce tasks.
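The reduction from maximum common subgraph to maximum clique mentioned in the abstract can be illustrated with a minimal sketch. This is not the thesis's implementation: the function names, the toy term graphs, the use of Bron-Kerbosch with pivoting as the clique solver, and the normalization by the size of the larger graph are all assumptions made for illustration. The sketch builds the association (modular product) graph of two labeled graphs, where each vertex pairs two nodes carrying the same feature word, and reads the maximum common induced subgraph off a maximum clique.

```python
from itertools import combinations, product


def modular_product(nodes1, edges1, nodes2, edges2):
    """Build the association graph of two undirected labeled graphs.

    nodes1/nodes2 map node id -> label (here, a feature word);
    edges1/edges2 are sets of frozenset({u, v}) pairs.
    """
    # A product vertex pairs one node from each graph with the same label.
    pairs = [(u, v) for u, v in product(nodes1, nodes2) if nodes1[u] == nodes2[v]]
    adj = {p: set() for p in pairs}
    for (u1, v1), (u2, v2) in combinations(pairs, 2):
        if u1 == u2 or v1 == v2:
            continue  # the node mapping must be one-to-one
        # Two pairs are compatible when both graphs agree on edge vs. non-edge.
        if (frozenset((u1, u2)) in edges1) == (frozenset((v1, v2)) in edges2):
            adj[(u1, v1)].add((u2, v2))
            adj[(u2, v2)].add((u1, v1))
    return adj


def max_clique(adj):
    """Return one maximum clique (Bron-Kerbosch with pivoting, exponential worst case)."""
    best = []

    def expand(r, p, x):
        nonlocal best
        if not p and not x:
            if len(r) > len(best):
                best = list(r)
            return
        # Pivot on the vertex with the most neighbours among the candidates.
        pivot = max(p | x, key=lambda n: len(adj[n] & p))
        for n in list(p - adj[pivot]):
            expand(r + [n], p & adj[n], x & adj[n])
            p.remove(n)
            x.add(n)

    expand([], set(adj), set())
    return best


def graph_similarity(nodes1, edges1, nodes2, edges2):
    """Assumed normalization: |common subgraph| / size of the larger graph."""
    mcs = max_clique(modular_product(nodes1, edges1, nodes2, edges2))
    return len(mcs) / max(len(nodes1), len(nodes2))


# Hypothetical term graphs for two applications (edges = co-occurring terms).
nodes_a = {1: "hadoop", 2: "similarity", 3: "graph", 4: "clique"}
edges_a = {frozenset(e) for e in [(1, 2), (2, 3), (3, 4)]}
nodes_b = {"a": "similarity", "b": "graph", "c": "clique", "d": "ontology"}
edges_b = {frozenset(e) for e in [("a", "b"), ("b", "c"), ("c", "d")]}

print(graph_similarity(nodes_a, edges_a, nodes_b, edges_b))  # 0.75
```

In this toy case the two graphs share the path similarity, graph, clique, so the maximum clique has three vertices and the score is 3/4. Because maximum clique is NP-hard, an exact solver like this one is only practical for small feature-word graphs.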
[Degree-granting institution]: Hebei University of Technology
[Degree level]: Master's
[Year degree awarded]: 2015
[Classification number]: TP391.1