Research on Web News Content Extraction Based on Tag Path Features
[Abstract]: Web news content extraction is an important step in intelligent Web information processing. It underpins research and applications such as information acquisition and security, online public opinion monitoring, personalized recommendation on mobile terminals, heterogeneous Web data integration, information retrieval, and search engines, so research on Web news content extraction has both theoretical and practical value. Case analysis and further study show that many news websites share similar layout structures and styles, and that there is an implicit association between content layout and the tag paths of the parse tree. Traditional path expressions are too rigid to adapt to small changes in the structure of HTML documents during Web information extraction, which hurts extraction accuracy. In addition, Web news pages are massive and heterogeneous, which challenges the generality of hand-built wrappers and of wrappers based on rule learning. This thesis therefore studies Web news content extraction based on tag path features. The research covers two aspects: a high-precision Web news content extraction model and method based on path pattern knowledge for specific websites, and a general Web news content extraction model and method based on tag path features. The main work is as follows. (1) Based on a study of the implicit relationship between content layout and the path patterns of the parse tree, a novel Web information extraction system model based on distinguishing path patterns is proposed.
The model is named PP-WNE. On this basis, a special path pattern for Web news content extraction, the distinguishing path pattern, is defined, and a method for obtaining distinguishing path patterns is proposed, which solves the construction of the extraction pattern knowledge base. The experimental results show the F-measure that path-pattern-based extraction of news page content can achieve when the noise threshold is set reasonably, and they also verify that path patterns are applicable to extracting Web news content. (2) To optimize the size of the knowledge base in PP-WNE, the path-pattern-based Web information extraction model, the path pattern coverage problem is posed, and the distinguishing path pattern coverage problem is proved to be NP-complete. To find an approximately optimal solution, a special kind of distinguishing path pattern is defined, and on this basis a polynomial-time (ln|n| + 1) approximation method, MPM, is designed for the path pattern coverage problem, where n refers to the training samples. Experimental results on the test data set show that MPM effectively optimizes the path pattern set and reaches more than 98% extraction accuracy under both the node-level evaluation criterion and the current-level evaluation criterion. (3) For Web news content extraction in an open environment, a text-to-tag path ratio feature is designed, and its computation by traversing the nodes of the page's parse tree is described.
A threshold method, CEPR, is proposed that uses the histogram of text-to-tag path ratios to separate content from non-content, which effectively addresses online extraction of Web news content. A weighted Gaussian smoothing method based on path edit distance is also proposed; it improves CEPR's ability to extract short text and addresses the filtering problem within news content. CEPR is a fast, general-purpose, training-free Web page content extraction algorithm that can extract from Web news pages of many sources, styles, and languages. Experimental results on the CleanEval test data set show that, in most cases, CEPR outperforms CETR and other extraction methods. (4) A design and implementation for HTML news pages is presented. A new method for automatically identifying Web news pages from URL features, page structure features, and content attribute features is proposed and implemented, which effectively solves the automatic identification of Web news pages. Web news content extraction technology is used to address filtering of Web news pages, and a keyword extraction method based on the semantic relations between words is adopted for a Web news summarization task. Evaluation on the test data set is used for validation.
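The text-to-tag path ratio idea in item (3) can be illustrated with a minimal sketch. The abstract does not give the formula, so everything below is an assumption: the ratio of a tag path is taken to be the total text length found under that path divided by the number of text nodes sharing it, and the cut-off is simply the mean ratio over all text nodes. This is a stand-in for, not a reproduction of, the histogram-based threshold and the edit-distance-weighted Gaussian smoothing that CEPR uses. The names `TagPathCollector`, `path_ratios`, `extract_content`, and `SAMPLE` are hypothetical.

```python
from collections import defaultdict
from html.parser import HTMLParser

# Elements that never have a closing tag, so they must not be pushed on the stack.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}
# Text inside these elements is never page content.
SKIP_TAGS = {"script", "style"}


class TagPathCollector(HTMLParser):
    """Records (tag_path, text) for every non-empty text node, in document order."""

    def __init__(self):
        super().__init__()
        self.stack = []
        self.nodes = []

    def handle_starttag(self, tag, attrs):
        if tag not in VOID_TAGS:
            self.stack.append(tag)

    def handle_endtag(self, tag):
        # Pop back to the nearest matching open tag; ignore stray closers.
        if tag in self.stack:
            while self.stack.pop() != tag:
                pass

    def handle_data(self, data):
        text = data.strip()
        if text and not (set(self.stack) & SKIP_TAGS):
            self.nodes.append((".".join(self.stack), text))


def path_ratios(html):
    """Return the text nodes and an assumed ratio per tag path:
    total characters under the path / number of text nodes with that path."""
    collector = TagPathCollector()
    collector.feed(html)
    collector.close()
    chars, count = defaultdict(int), defaultdict(int)
    for path, text in collector.nodes:
        chars[path] += len(text)
        count[path] += 1
    return collector.nodes, {p: chars[p] / count[p] for p in chars}


def extract_content(html, threshold=None):
    """Keep text nodes whose tag path ratio exceeds the threshold.
    Default threshold (an illustrative assumption): mean ratio over text nodes."""
    nodes, ratios = path_ratios(html)
    if not nodes:
        return []
    if threshold is None:
        threshold = sum(ratios[p] for p, _ in nodes) / len(nodes)
    return [text for path, text in nodes if ratios[path] > threshold]


SAMPLE = """<html><body>
<div><ul><li><a href="/">Home</a></li><li><a href="/w">World</a></li></ul></div>
<div><p>The city council approved the new transit budget on Tuesday after a long debate.</p>
<p>Officials said construction of the first line would begin early next year.</p></div>
<div><a href="/about">About</a></div>
</body></html>"""

if __name__ == "__main__":
    for line in extract_content(SAMPLE):
        print(line)
```

In this toy page the navigation links share short-text tag paths while the two paragraphs share a long-text path, so a single cut-off separates them. Real pages have short content lines such as bylines and captions, which is where a smoothing step of the kind the abstract mentions would matter.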
[Degree-granting institution]: Hefei University of Technology
[Degree level]: Doctoral
[Year degree awarded]: 2012
[Classification number]: TP391.1; TP393.092