Automatic Data Extraction from Template-Generated Web Pages: Method and Application

Published: 2018-03-28 14:02

Topic: Web information extraction technology; Focus: web page templates; Source: East China Normal University, master's thesis, 2009

[Abstract]: With the rapid growth of the Internet, the Web has become a vast information repository, and a range of Web information extraction techniques have emerged to make effective use of it. Many web pages today are generated dynamically: in response to a user request, a site selects data from a back-end database and embeds it into a common template, adapted to the site's particular needs. Product description pages on e-commerce sites are a typical example. For the problem of automatically extracting the data from such template-generated pages, the classic methods in common use include RoadRunner and EXALG. The time complexity of RoadRunner's algorithm grows exponentially, which limits its practicality. EXALG improves substantially on RoadRunner, but it still does not take into account important features such as the visual layout of a page, tag attributes, and string similarity. To address these problems, this thesis gives a formal description of the template detection problem and, drawing on the structural characteristics of such pages, explores a new template detection method. It then uses the detected template to extract data automatically from the corresponding instance pages. Finally, the extraction algorithm based on template detection is applied to an e-commerce site, where it automatically extracts product list information, product detail information, and other important data. Compared with other methods, this approach handles both "list pages" and "detail pages", and it shows a considerable improvement in recall and precision on such pages.
The main content and structure of the thesis are as follows:
First, the thesis reviews the state of the art in data extraction from template-generated web pages and the related techniques, and sets out its research objectives and scope.
Second, it introduces the mainstream web data extraction techniques and systematically analyzes the strengths and weaknesses of the classic techniques in wide use. On that basis, it develops a data extraction method for template-generated web pages, together with an implementing algorithm, that automatically extracts the data from such pages.
Third, it describes in detail the design and implementation of the automatic extraction algorithm. The algorithm first parses the cleaned HTML page into two data structures, a tag tree and a tag queue. Next, because most pages contain navigation bars, advertisements, version information, and other content unrelated to the extraction target, it filters out this irrelevant or redundant information with the tag tree matching algorithm proposed in the thesis. Then the core sub-algorithm, which computes Ctokens, classifies the tags of the HTML pages. From the resulting Ctokens, the algorithm automatically extracts both the template structure of the pages and the field-level data that was used to generate them.
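The abstract does not specify how the tag tree and tag queue are represented, so the following is only a minimal sketch of that first parsing step, using Python's standard `html.parser`. The names `TagNode`, `TagTreeBuilder`, `parse_page`, and `VOID_TAGS`, as well as the token spelling in the queue (`<li>`, `</li>`, `#text`), are assumptions made for illustration and are not taken from the thesis.

```python
from dataclasses import dataclass, field
from html.parser import HTMLParser

# Void elements never take a closing tag, so they are not pushed on the stack.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}


@dataclass
class TagNode:
    """One node of the tag tree (hypothetical structure)."""
    tag: str
    attrs: dict
    children: list = field(default_factory=list)
    text: str = ""


class TagTreeBuilder(HTMLParser):
    """Builds a tag tree and a flat tag queue in a single pass."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.root = TagNode("#root", {})
        self.stack = [self.root]   # path from the root to the open element
        self.queue = []            # tokens in document order

    def handle_starttag(self, tag, attrs):
        node = TagNode(tag, dict(attrs))
        self.stack[-1].children.append(node)
        self.queue.append(f"<{tag}>")
        if tag not in VOID_TAGS:
            self.stack.append(node)

    def handle_endtag(self, tag):
        # Close back to the nearest matching open tag; ignore stray closers.
        for i in range(len(self.stack) - 1, 0, -1):
            if self.stack[i].tag == tag:
                del self.stack[i:]
                self.queue.append(f"</{tag}>")
                break

    def handle_data(self, data):
        text = data.strip()
        if text:
            self.stack[-1].text += text
            self.queue.append("#text")


def parse_page(html):
    """Return (tag tree root, tag queue) for one cleaned HTML page."""
    builder = TagTreeBuilder()
    builder.feed(html)
    builder.close()
    return builder.root, builder.queue


if __name__ == "__main__":
    page = "<html><body><ul><li>Pen</li><li>Ink</li></ul><br></body></html>"
    root, queue = parse_page(page)
    print(queue)
```

The tree keeps the nesting that a tree matching step would compare across pages, while the queue keeps the linear token order that a token classification step could scan. Whether the thesis uses the two structures in exactly this way is not stated in the abstract.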
Finally, based on the method and algorithm studied, the thesis builds an experimental prototype system for automatic data extraction from template-generated web pages. The system extracts the data from such pages on e-commerce sites (for example, product "list pages" and "detail pages"), with considerably improved recall and precision. The work addresses a broad practical need and has value for wider application.
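The abstract reports improved recall and precision but gives no figures and does not describe the evaluation protocol. As a reminder of what the two measures mean for field extraction, here is a small sketch under the assumption that each extracted item is a (field, value) pair compared exactly against a hand-labeled reference set. The function name `precision_recall` and all sample values are hypothetical.

```python
def precision_recall(extracted, expected):
    """Precision = correct / extracted; recall = correct / expected."""
    extracted, expected = set(extracted), set(expected)
    correct = len(extracted & expected)
    precision = correct / len(extracted) if extracted else 0.0
    recall = correct / len(expected) if expected else 0.0
    return precision, recall


if __name__ == "__main__":
    # Hypothetical reference labels for one product detail page.
    expected = {("name", "Pen"), ("price", "3.50"),
                ("brand", "Acme"), ("stock", "12")}
    # Hypothetical extractor output: two fields right, one wrong, one missed.
    extracted = {("name", "Pen"), ("price", "3.50"), ("brand", "ACME Inc.")}
    p, r = precision_recall(extracted, expected)
    print(f"precision={p:.2f} recall={r:.2f}")
```

In this made-up case, two of the three extracted pairs are correct and two of the four reference pairs are found, so a wrong value lowers precision while a missed field lowers recall.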

[Degree-granting institution]: East China Normal University
[Degree level]: Master's
[Year degree conferred]: 2009
[Classification number]: TP393.092



Article link: http://sikaile.net/wenyilunwen/guanggaoshejilunwen/1676627.html
