Research on Web News Content Extraction Based on Tag Path Features
Published: 2018-12-15 16:46
[Abstract]: Web news content extraction is a key step in intelligent Web information processing. It underpins research and applications such as intelligence gathering and security, online public-opinion monitoring, personalized recommendation services for mobile devices, heterogeneous Web data integration, information retrieval, and search engines. Research on the problems in this field therefore has significant research and application value.

Case analysis and further study show that many news websites share similar layout structures and styles, and that the content layout of a web page is implicitly related to the tag paths of its parse tree. Traditional path expressions are too rigid to accommodate small changes in HTML document structure during Web information extraction, which lowers extraction accuracy. In addition, Web news pages are massive in number and heterogeneous, which challenges the generality of both hand-crafted wrappers and wrappers built by rule learning. This thesis therefore studies Web news content extraction based on tag path features along two lines: for specific websites, a high-precision Web news content extraction model and method based on path pattern knowledge; for open environments, a general Web news content extraction model and method based on tag path features.

The main contributions are as follows:

(1) Building on the implicit relationship between web page content layout and the path patterns of its parse tree, a novel Web information extraction system model is proposed: PP-WNE, a Web news content extraction model based on distinguishing path patterns. On this basis, a special kind of path pattern suited to Web news content extraction, the distinguishing path pattern, is defined, and a method for mining distinguishing path patterns is proposed, which solves the problem of constructing the extraction pattern knowledge base. Experiments on web pages randomly selected from Chinese and English websites show that, with a suitably chosen noise-tolerance threshold, the extraction method based on path pattern mining achieves an F-measure above 98%. The results also confirm that path patterns are feasible and effective for Web news content extraction.

(2) To optimize the size of the knowledge base in PP-WNE, the distinguishing path pattern cover problem is formulated and proved to be NP-complete. To obtain a near-optimal solution, a special kind of distinguishing path pattern, the minimal distinguishing path pattern, is defined. On this basis, a polynomial-time (ln|n|+1)-approximation algorithm, MPM, is designed for the cover problem, where n is the number of positive examples in the training sample. Experiments on the test data set show that MPM effectively optimizes the set of distinguishing path patterns while achieving extraction precision, recall, and F-measure above 98% under both node-level and text-level evaluation criteria.

(3) To meet the need for Web news content extraction in open environments, a text-to-tag path ratio feature is designed, and the procedure for computing it by traversing the nodes of the page's parse tree is described. CEPR, a threshold method that separates content from non-content using the histogram of text-to-tag path ratios, is proposed and effectively solves the problem of online Web news content extraction. A weighted Gaussian smoothing method based on path edit distance is also proposed; it improves CEPR's ability to extract short text and solves the problem of filtering non-news material embedded in news content. CEPR is a fast, general, training-free web content extraction algorithm that can handle Web pages from many sources, in many styles, and in many languages. Experiments on the CleanEval test data set show that, in most cases, CEPR outperforms CETR and other extraction methods.

(4) An HTML news page filtering and summarization system, NFaS, is designed and implemented. Within it, a method that combines URL features, page structure features, and content attribute features is proposed and implemented to identify Web news pages automatically. Web news content extraction is used to filter Web news pages. A keyword extraction method based on semantic relations between words, which builds a word semantic relation graph from lexical chains, extracts high-quality keywords to complete the Web news summarization task. Evaluation on the test data set confirms the effectiveness of NFaS.
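For contribution (2), the abstract gives the approximation bound but not the steps of MPM. The classical greedy heuristic for set cover is the textbook polynomial-time method with a bound of this (ln n + 1) form, so the Python sketch below uses it purely as an illustration and should not be read as the thesis's algorithm. The function name, the pattern strings, and the toy node ids are all hypothetical.

```python
from typing import Dict, Hashable, List, Set


def greedy_pattern_cover(
    universe: Set[Hashable], patterns: Dict[str, Set[Hashable]]
) -> List[str]:
    """Classical greedy set cover (illustrative, not the thesis's MPM).

    Repeatedly picks the pattern that covers the most positive examples
    that are still uncovered.
    """
    uncovered = set(universe)
    chosen: List[str] = []
    while uncovered:
        # Sort the names so that ties are broken deterministically.
        best = max(sorted(patterns), key=lambda name: len(patterns[name] & uncovered))
        gained = patterns[best] & uncovered
        if not gained:
            raise ValueError("some positive examples are covered by no pattern")
        chosen.append(best)
        uncovered -= gained
    return chosen


# Hypothetical toy data: positive content nodes 1..6 and, for each candidate
# path pattern, the set of positive nodes it matches.
positives = {1, 2, 3, 4, 5, 6}
candidates = {
    "html/body/div/p": {1, 2, 3, 4},
    "html/body/div/span": {4, 5},
    "html/body/td/p": {5, 6},
    "html/body/div/*": {1, 2},
}
selected = greedy_pattern_cover(positives, candidates)
print(selected)
```

On this toy input the greedy choice keeps two of the four candidate patterns, which is the kind of knowledge-base reduction the cover problem aims for.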
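For contribution (3), a minimal Python sketch of a text-to-tag path ratio, under a stated assumption: the ratio of a tag path is taken to be the number of text characters found directly under elements with that path, divided by the number of elements with that path. The abstract does not give the exact definition, the histogram-based threshold, or the smoothing step, so the fixed THRESHOLD, the class name TagPathRatio, and the sample page are hypothetical, and smoothing is omitted.

```python
from collections import Counter
from html.parser import HTMLParser
from typing import Dict, List

# Elements that never have an end tag and never hold text.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}
SKIP_TEXT_IN = {"script", "style"}


class TagPathRatio(HTMLParser):
    """Collects, per tag path, the text length and the number of elements."""

    def __init__(self) -> None:
        super().__init__()
        self.stack: List[str] = []
        self.path_count: Counter = Counter()
        self.text_chars: Counter = Counter()

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        self.stack.append(tag)
        self.path_count["/".join(self.stack)] += 1

    def handle_endtag(self, tag):
        if tag in self.stack:
            # Pop back to the matching open tag; tolerates unclosed children.
            while self.stack.pop() != tag:
                pass

    def handle_data(self, data):
        text = data.strip()
        if text and self.stack and self.stack[-1] not in SKIP_TEXT_IN:
            self.text_chars["/".join(self.stack)] += len(text)

    def ratios(self) -> Dict[str, float]:
        # Assumed definition: characters of text per occurrence of the path.
        return {p: self.text_chars[p] / n for p, n in self.path_count.items()}


# Hypothetical sample page: a navigation bar followed by two paragraphs.
page = (
    "<html><body>"
    "<div><a>Home</a><a>World</a><a>Sport</a></div>"
    "<div><p>The council approved the new transit plan on Monday.</p>"
    "<p>Construction is expected to begin next spring.</p></div>"
    "</body></html>"
)
parser = TagPathRatio()
parser.feed(page)
parser.close()
ratios = parser.ratios()

THRESHOLD = 20.0  # hypothetical fixed cut-off, not the histogram-based one
content_paths = sorted(p for p, r in ratios.items() if r > THRESHOLD)
print(content_paths)
```

Here the navigation links share one path with little text per element, while the paragraph path carries long text per element, so only html/body/div/p passes the cut-off.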
[Degree-granting institution]: Hefei University of Technology
[Degree level]: Doctoral
[Year degree granted]: 2012
[Classification number]: TP391.1; TP393.092
Article No.: 2380981
Article link: http://sikaile.net/kejilunwen/sousuoyinqinglunwen/2380981.html