
Research on Web News Content Extraction Based on Tag Path Features

Published: 2018-12-15 16:46
[Abstract]: Web news content extraction is a key step in intelligent Web information processing. It underpins research and applications in intelligence acquisition and security, online public opinion monitoring, personalized recommendation on mobile devices, heterogeneous Web data integration, information retrieval, and search engines. Research on the open problems in this field therefore has both scientific and practical value.

Case analysis and further study show that many news websites share similar layout structures and styles, and that the content layout of a page is implicitly correlated with the tag paths of its parse tree. Traditional path expressions are too rigid: they adapt poorly to small changes in HTML document structure, which lowers extraction accuracy. In addition, Web news pages are massive in number and heterogeneous, which challenges the generality of both hand-built wrappers and wrappers based on rule learning. This thesis therefore studies Web news content extraction based on tag path features, in two settings. For specific websites, it studies high-precision extraction models and methods based on path pattern knowledge. For the open environment, it studies general-purpose extraction models and methods based on tag path features.

The main contributions are as follows.

(1) Building on the implicit correlation between page content layout and the path patterns of the parse tree, the thesis proposes a new Web information extraction model, PP-WNE, a Web news content extraction model based on distinguishing path patterns. It defines the distinguishing path pattern, a special kind of path pattern suited to Web news content extraction, and proposes a method for mining such patterns, which solves the problem of building the extraction pattern knowledge base. On a data set of pages randomly selected from Chinese and English websites, the experiments show that, with a reasonably chosen noise-tolerance threshold, the path pattern mining method reaches an F-measure above 98%. The results also confirm that path patterns are feasible and effective for Web news content extraction.

(2) To optimize the size of the knowledge base in PP-WNE, the thesis formulates the distinguishing path pattern cover problem and proves that it is NP-complete. To find a near-optimal solution, it defines the minimal distinguishing path pattern and, on that basis, designs MPM, a polynomial-time (ln|n| + 1)-approximation algorithm for the cover problem, where n is the number of positive examples in the training sample. Experiments on the test data set show that MPM effectively reduces the distinguishing path pattern set while achieving precision, recall, and F-measure above 98% under both node-level and text-level evaluation.

(3) For Web news content extraction in the open environment, the thesis designs the text-to-tag path ratio feature and describes how to compute it by traversing the nodes of the page parse tree. It proposes CEPR, a threshold method that separates content from non-content using the histogram of text-to-tag path ratios, which addresses online Web news content extraction. It also proposes a weighted Gaussian smoothing method based on path edit distance, which improves the ability of CEPR to extract short text and filters non-news material that appears inside the news content. CEPR is a fast, general-purpose, training-free page content extraction algorithm that handles Web pages from many sources, in many styles, and in many languages. Experiments on the CleanEval data set show that, in most cases, CEPR outperforms CETR and other extraction methods.

(4) The thesis designs and implements NFaS, a filtering and summarization system for HTML news pages. Within NFaS, an automatic Web news page identification method that combines URL features, page structure features, and content attribute features addresses news page identification. Web news content extraction addresses news page filtering. A keyword extraction method based on semantic relations between words builds a word semantic relation graph from lexical chains and extracts high-quality keywords, which completes the news summarization task. Evaluation on the test data set confirms the effectiveness of NFaS.
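Contribution (2) frames knowledge base reduction as a cover problem with a logarithmic approximation bound. The abstract does not describe how MPM works, so the sketch below is not MPM. It is the classic greedy set cover heuristic, which is known to achieve a bound of the same (ln n + 1) form, where n is the number of items to cover. The pattern names and node identifiers are hypothetical, chosen only for illustration.

```python
from typing import Dict, Hashable, List, Set


def greedy_cover(
    positives: Set[Hashable],
    patterns: Dict[str, Set[Hashable]],
) -> List[str]:
    """Pick a small set of patterns that together cover all positive examples.

    Classic greedy set cover: repeatedly choose the pattern that covers the
    most still-uncovered positives. This is an illustrative stand-in, not the
    MPM algorithm from the thesis.
    """
    uncovered = set(positives)
    chosen: List[str] = []
    while uncovered:
        # Iterating over sorted names makes ties resolve deterministically.
        best = max(sorted(patterns), key=lambda name: len(patterns[name] & uncovered))
        gained = patterns[best] & uncovered
        if not gained:
            raise ValueError("the given patterns cannot cover every positive example")
        chosen.append(best)
        uncovered -= gained
    return chosen


# Hypothetical data: each pattern maps to the positive nodes it matches.
example_patterns = {
    "html/body/div/p": {1, 2, 3},
    "html/body/div/span": {3, 4},
    "html/body/td/p": {4, 5},
    "html/body/h1": {1},
}
selected = greedy_cover({1, 2, 3, 4, 5}, example_patterns)
```

In this toy input the greedy choice keeps two of the four patterns, which is the kind of reduction the cover formulation aims for.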
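The text-to-tag path ratio in contribution (3) can be illustrated with a minimal sketch. The abstract does not give the formula, so the code assumes the ratio of a tag path is the total number of characters in the text nodes under that path divided by the number of text nodes sharing the path. The threshold is a caller-supplied parameter here, whereas the thesis derives it from a histogram, and the Gaussian smoothing step is omitted. The class and function names are hypothetical.

```python
from collections import defaultdict
from html.parser import HTMLParser
from typing import Dict, List, Tuple

VOID_TAGS = {"br", "img", "hr", "meta", "link", "input"}  # tags with no closing tag
SKIP_TAGS = {"script", "style"}  # text inside these is never page content


class TagPathCollector(HTMLParser):
    """Record every non-empty text node together with its tag path."""

    def __init__(self) -> None:
        super().__init__()
        self.stack: List[str] = []
        self.text_nodes: List[Tuple[str, str]] = []

    def handle_starttag(self, tag, attrs):
        if tag not in VOID_TAGS:
            self.stack.append(tag)

    def handle_endtag(self, tag):
        # Pop back to the matching open tag, tolerating unclosed inner tags.
        if tag in self.stack:
            while self.stack.pop() != tag:
                pass

    def handle_data(self, data):
        text = data.strip()
        if text and not (set(self.stack) & SKIP_TAGS):
            self.text_nodes.append(("/".join(self.stack), text))


def collect_text_nodes(html: str) -> List[Tuple[str, str]]:
    collector = TagPathCollector()
    collector.feed(html)
    collector.close()
    return collector.text_nodes


def path_ratios(nodes: List[Tuple[str, str]]) -> Dict[str, float]:
    """Assumed definition: characters under a path / text nodes with that path."""
    chars: Dict[str, int] = defaultdict(int)
    count: Dict[str, int] = defaultdict(int)
    for path, text in nodes:
        chars[path] += len(text)
        count[path] += 1
    return {path: chars[path] / count[path] for path in chars}


def extract_content(html: str, threshold: float) -> List[str]:
    """Keep text nodes whose tag path ratio reaches the given threshold."""
    nodes = collect_text_nodes(html)
    ratios = path_ratios(nodes)
    return [text for path, text in nodes if ratios[path] >= threshold]


sample = (
    "<html><body><div><a>Home</a><a>News</a></div>"
    "<div><p>First long paragraph of the story text.</p>"
    "<p>Second long paragraph of the story.</p></div></body></html>"
)
ratios = path_ratios(collect_text_nodes(sample))
content = extract_content(sample, threshold=10)
```

Short navigation links share a path with a low ratio, while body paragraphs share a path with a high ratio, so a single threshold separates them in this toy page.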
[Degree-granting institution]: Hefei University of Technology
[Degree level]: Doctoral
[Year degree granted]: 2012
[Classification number]: TP391.1;TP393.092


Article ID: 2380981


Article link: http://sikaile.net/kejilunwen/sousuoyinqinglunwen/2380981.html

