A Literature Record Analysis Algorithm for HTML Pages
Published: 2019-04-26 00:39
[Abstract]: To help publishers promptly find the literature they need among large numbers of web pages, an algorithm is needed that automatically extracts literature information from HyperText Markup Language (HTML) pages. To this end, a literature record analysis algorithm based on conditional random fields (CRF) is designed. First, a segmentation algorithm for the document object tree is designed: segmentation tags divide the page data into independent blocks, each consisting of a sequence of tags and text. Next, this sequence is used as the feature vector of a conditional random field model to build a literature information labeling model. Finally, a heuristic algorithm is designed to extract the literature information from the labeling model, and experiments verify its effectiveness.
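The segmentation step mentioned in the abstract can be illustrated with a small sketch. This is not the authors' algorithm: the abstract does not say which tags act as segmentation tags, so the `SPLIT_TAGS` set, the `BlockSplitter` class, and the `split_blocks` helper below are all hypothetical names and choices, built only on Python's standard `html.parser` module. The sketch turns a page into blocks, each a sequence of tag and text tokens of the kind that could then be passed to a sequence labeler such as a CRF.

```python
from html.parser import HTMLParser

# Assumption: these tags separate one record from the next.
# The set actually used in the paper is not given in the abstract.
SPLIT_TAGS = {"li", "p", "tr", "div", "br"}


class BlockSplitter(HTMLParser):
    """Collects (kind, value) tokens and cuts them into blocks at SPLIT_TAGS."""

    def __init__(self):
        super().__init__()
        self.blocks = []   # finished blocks
        self.current = []  # tokens of the block being built

    def flush(self):
        # Keep a block only if it holds at least one text token.
        if any(kind == "text" for kind, _ in self.current):
            self.blocks.append(self.current)
        self.current = []

    def handle_starttag(self, tag, attrs):
        if tag in SPLIT_TAGS:
            self.flush()                        # a segmentation tag starts a new block
        else:
            self.current.append(("tag", tag))   # inline tags stay in the sequence

    def handle_endtag(self, tag):
        if tag in SPLIT_TAGS:
            self.flush()

    def handle_data(self, data):
        text = data.strip()
        if text:
            self.current.append(("text", text))


def split_blocks(html):
    """Return a list of blocks, each a list of ("tag" | "text", value) tokens."""
    parser = BlockSplitter()
    parser.feed(html)
    parser.close()
    parser.flush()  # emit whatever follows the last segmentation tag
    return parser.blocks


if __name__ == "__main__":
    page = ("<ul><li><b>Smith J.</b> <i>A title</i>. 2019.</li>"
            "<li>Lee K. Another title. 2018.</li></ul>")
    for block in split_blocks(page):
        print(block)
```

Each block keeps the inline tags alongside the text, on the assumption that markup such as bold or italics is a useful cue when labeling fields like author or title.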
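[Author affiliations]: School of Information Engineering, Beijing Institute of Graphic Communication; Postdoctoral Research Station in Computer Science and Technology, Tsinghua University; Radio and Television Satellite Direct Broadcast Management Center, State Administration of Press, Publication, Radio, Film and Television
[Funding]: Beijing Municipal Education Commission Science and Technology Innovation Service Capacity Building Project (PXM2016_014223_000025); Beijing Institute of Graphic Communication University-Level Key Project (ea201507); Beijing Institute of Graphic Communication Faculty Development Doctoral Start-up Fund Project (27170116005/062); Beijing Institute of Graphic Communication Research Project, Publication Data Asset Evaluation Laboratory Construction Project (20190116005/006)
[Classification number]: TP393.092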
Article ID: 2465603
Article link: http://sikaile.net/guanlilunwen/ydhl/2465603.html