Research on Web-Based Dynamic Comment Extraction Technology
Keywords: information extraction; dynamic pages; Chrome; LFSU; DOM. Source: Shenyang Aerospace University, 2014 master's thesis. Document type: degree thesis.
[Abstract]: The arrival of the Web 2.0 era has transformed the Internet from an information publishing platform into an information exchange platform on which people can express opinions on topics that interest them, take part in discussions, and shape public opinion. Some participants exploit online public opinion maliciously, so public opinion analysis is receiving increasing attention, and Web information extraction is the groundwork for that analysis.
Web information extraction derives a structured description of specific information from unstructured or semi-structured web pages. This thesis surveys the current state of Web information extraction technology. To address the sensitivity of existing techniques to page structure and the scarcity of research on extracting dynamic multi-level comments, it designs a semi-automatic information extraction system with two main modules: information source acquisition and comment extraction. The information source acquisition module is a page capture tool built on Chrome plug-in technology, using the browser API and the message passing mechanism; it automatically acquires the complete content of dynamic pages. The comment extraction module proposes the concept of the LFSU, based on the visual, structural, and semantic features of dynamic pages, uses its locating property to identify comment regions under different comment organization models, and gives extraction methods for single-level and multi-level comments. The method uses only a small amount of DOM tree information and involves no complex structure comparison or cluster analysis, so the algorithm is efficient.
Analysis of coverage experiments in a real environment shows that the method meets the practical needs of blog public opinion data analysis and extracts well from pages with more than one comment. Its recall, precision, and F-value all reach 92% or higher.
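For readers unfamiliar with the three evaluation measures named in the abstract, the following is a minimal sketch of how recall, precision, and F-value are conventionally computed for an extraction task. It assumes the standard definitions and a balanced F1 score; the abstract does not state the thesis's exact evaluation procedure, and the function name and the counts in the usage note are hypothetical.

```python
def extraction_metrics(extracted: int, correct: int, relevant: int):
    """Return (recall, precision, f_value) for one extraction run.

    extracted: number of comments the extractor returned (hypothetical count)
    correct:   number of returned comments that are true comments
    relevant:  number of true comments actually present on the pages
    """
    if min(extracted, correct, relevant) < 0:
        raise ValueError("counts must be non-negative")
    if correct > extracted or correct > relevant:
        raise ValueError("correct cannot exceed extracted or relevant")

    # Precision: share of returned items that are right.
    precision = correct / extracted if extracted else 0.0
    # Recall: share of true items that were found.
    recall = correct / relevant if relevant else 0.0
    # Balanced F-value (F1): harmonic mean of precision and recall.
    denom = precision + recall
    f_value = 2 * precision * recall / denom if denom else 0.0
    return recall, precision, f_value
```

Calling `extraction_metrics(extracted=50, correct=45, relevant=60)` returns the three measures as fractions between 0 and 1; multiply by 100 to compare with percentage figures such as the one reported in the abstract.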
[Degree-granting institution]: Shenyang Aerospace University
[Degree level]: Master's
[Year degree granted]: 2014
[Classification number]: TP393.09; TP391.1