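Automatic Extraction and Fusion of Scientific and Technological Achievements
Published: 2018-04-15 23:05
Topics: information fusion + Web information extraction. Source: Central South University, master's thesis, 2014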
[Abstract]: Extracting academic achievement information from Web pages and fusing it supports the systematic management of academic achievements and provides an important base resource for deeper mining of experts' academic trajectories. Existing information extraction systems adapt poorly to frequent changes in Web page structure. In addition, because the resources are enormous in scale, the information suffers from high redundancy, low credibility, and inconsistent descriptions, so the accuracy of the results is hard to guarantee. This thesis therefore targets experts' scientific and technological achievement information and focuses on two key technologies in Web information fusion: extraction and deduplication.
Many Web information extraction methods exist, but they either depend heavily on extraction templates or tolerate little change in page structure. To address this, the thesis proposes a Web information extraction algorithm that combines spatial relations with the DOM (Spatial Relation Based DOM, SRB-DOM) to extract achievement information from Web pages. The method maps each element node in the DOM tree to an object in two-dimensional space, uses rectangle algebra to describe the spatial relations between these objects, and uses the spatial relations between element nodes to extract the metadata of the achievement information. It then builds complete achievement records from the maximum non-connected boundary tuples, which completes the extraction. Analysis and simulation experiments show that the method adapts to page structure changes far better than existing path-based extraction algorithms.
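The spatial-relation idea above can be illustrated with a minimal sketch. It is not the thesis's SRB-DOM implementation: it only shows the standard rectangle-algebra construction, in which the relation between two axis-aligned rectangles is a pair of Allen interval relations, one per axis. The `Box` type, the function names, and the sample coordinates are assumptions made for illustration; in practice the coordinates would come from a rendering engine.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Box:
    """Bounding box of a rendered DOM element (hypothetical layout data).

    Screen coordinates are assumed: left < right and top < bottom.
    """
    left: float
    top: float
    right: float
    bottom: float


def allen(a_start, a_end, b_start, b_end):
    """Return the Allen interval relation of interval a relative to interval b."""
    if a_end < b_start:
        return "before"
    if a_end == b_start:
        return "meets"
    if b_end < a_start:
        return "after"
    if b_end == a_start:
        return "met-by"
    if a_start == b_start and a_end == b_end:
        return "equals"
    if a_start == b_start:
        return "starts" if a_end < b_end else "started-by"
    if a_end == b_end:
        return "finishes" if a_start > b_start else "finished-by"
    if b_start < a_start and a_end < b_end:
        return "during"
    if a_start < b_start and b_end < a_end:
        return "contains"
    # Only the two partial-overlap cases remain.
    return "overlaps" if a_start < b_start else "overlapped-by"


def rectangle_relation(a: Box, b: Box):
    """Rectangle-algebra relation: (x-axis relation, y-axis relation)."""
    return (
        allen(a.left, a.right, b.left, b.right),
        allen(a.top, a.bottom, b.top, b.bottom),
    )


# Illustrative boxes: a field label and the value rendered to its right.
label = Box(left=0, top=0, right=50, bottom=20)
value = Box(left=60, top=0, right=300, bottom=20)
next_row = Box(left=0, top=20, right=50, bottom=40)

print(rectangle_relation(label, value))     # label lies left of value, same row
print(rectangle_relation(label, next_row))  # same column, label directly above
```

Because the relation depends only on where elements are rendered, a rule such as "the value is the box that the label is `before` on the x-axis and `equals` on the y-axis" keeps working when the page's tag nesting changes, which is the kind of path independence the abstract describes.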
The diversity of information sources and differences in description style produce many similar or duplicate extraction results, so the achievement information must be cleaned before further fusion and mining. The thesis uses entropy increase to measure the importance of each data item in an achievement record, assigns a weight to each item accordingly, computes the similarity between achievement records, and thereby classifies the achievements. It then proposes a data-standardization-based algorithm for completing achievement records (Data Standardization Based Record Combine, DSBRC). The algorithm first standardizes the description of each achievement record based on its features, then labels the data state of each record to obtain a data state matrix, and from this matrix derives the complete description of the achievement record. Analysis and experiments show that the algorithm outperforms other algorithms of its kind in the accuracy and completeness of its results.
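The weighting step can be sketched as follows. The abstract does not give the thesis's exact entropy-increase formula, so this sketch substitutes the plain Shannon entropy of each field's value distribution as the importance measure, and token-level Jaccard overlap as the per-field similarity. Both choices, along with all function names and the sample records, are assumptions for illustration only.

```python
import math
from collections import Counter


def field_entropy(records, field):
    """Shannon entropy (bits) of the non-empty values observed for one field."""
    values = [r.get(field) for r in records if r.get(field)]
    if not values:
        return 0.0
    total = len(values)
    counts = Counter(values)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def entropy_weights(records, fields):
    """Normalize per-field entropies into weights that sum to 1."""
    entropies = {f: field_entropy(records, f) for f in fields}
    total = sum(entropies.values())
    if total == 0:
        # No field discriminates between records: fall back to equal weights.
        return {f: 1.0 / len(fields) for f in fields}
    return {f: e / total for f, e in entropies.items()}


def token_jaccard(a, b):
    """Jaccard overlap of lowercase whitespace tokens; 0.0 if either is empty."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    if not ta or not tb:
        return 0.0
    return len(ta & tb) / len(ta | tb)


def record_similarity(r1, r2, weights):
    """Weighted sum of per-field similarities between two records."""
    return sum(
        w * token_jaccard(r1.get(f) or "", r2.get(f) or "")
        for f, w in weights.items()
    )


# Illustrative records (made up for this sketch).
records = [
    {"title": "Web information extraction survey", "year": "2003", "venue": "J1"},
    {"title": "Duplicate record detection", "year": "2003", "venue": "J2"},
    {"title": "web information extraction survey", "year": "2003", "venue": "J1"},
]
fields = ["title", "year", "venue"]
weights = entropy_weights(records, fields)

same = record_similarity(records[0], records[2], weights)
different = record_similarity(records[0], records[1], weights)
print(weights)
print(same, different)
```

In this sample the year is identical in every record, so its entropy and weight are zero and it contributes nothing to the similarity score, while the title, which varies the most, dominates. Records whose weighted similarity exceeds a chosen threshold would then be grouped as candidate duplicates before merging.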
The ability of Web information extraction to adapt to page structure changes strongly affects a system's practicality, so this adaptability should be improved as far as possible. Extracting information with the proposed SRB-DOM algorithm completely removes the dependence on paths, and adaptability improves greatly compared with traditional path-based extraction. The proposed entropy-increase-based classification improves the accuracy of achievement record classification, and the DSBRC algorithm effectively improves the completeness and accuracy of merged achievement records, which is of significant value for subsequent in-depth data mining and knowledge discovery.
[Degree-granting institution]: Central South University
[Degree level]: Master's
[Year degree granted]: 2014
[Classification number]: TP393.092
Article ID: 1756196
Article link: http://sikaile.net/guanlilunwen/ydhl/1756196.html