Web Forum Data Extraction

Published: 2018-05-26 16:41

Topics: forum data extraction + user-generated content; source: East China Normal University, doctoral dissertation, 2012

[Abstract]: Web 2.0 offers users a rich set of applications, and the deep participation of large numbers of users is turning the Web into an ecosystem. While presenting information to users, Web 2.0 also attracts them to contribute large amounts of content, and this user-generated content holds great value.
As a typical Web 2.0 application, the forum gives users a platform for obtaining and exchanging information. Users publish posts and comments on forums, for example sharing experience with products, exchanging reflections on life, discussing school education, and posting social news; this content genuinely reflects users' needs, opinions, and social phenomena. Extracting forum data from Web pages to support applications such as product recommendation, expert finding, and public opinion monitoring therefore has considerable research and practical significance.
Forum data is relatively complex. It contains not only user-generated content but also noise such as recommendations and advertisements; in addition, forum sites differ considerably in style. Both factors make forum data extraction challenging. Traditional Web data extraction techniques usually target relatively regular structured data and are not well suited to forum data, so efficient extraction techniques designed for forum data need to be studied. The main contributions of this dissertation are as follows:
A forum data extraction method that integrates inductive logic programming (ILP) with XPath pattern learning is proposed; the method achieves high precision and recall. It takes full account of the structural features of forum pages, introduces new predicates to integrate logic-program expressions with XPath patterns, and uses a divide-and-conquer approach to learn XPath patterns that describe the structural features of the target data. Finally, the learned XPath pattern rules are converted into XSLT files so that the extracted forum data is stored according to a predefined model, achieving automatic extraction of forum data.
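As a rough illustration of what applying XPath pattern rules to a forum page looks like, the sketch below uses a hand-written rule in place of a learned one. The page markup, the RULE dictionary, and the extract function are all assumptions made for this example; it does not reproduce the ILP learning step or the XSLT conversion described above, and it relies only on the limited XPath subset supported by Python's standard xml.etree.ElementTree.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed forum page: two posts and one advertisement.
PAGE = """
<html><body>
  <div class="ad">Buy now!</div>
  <div class="post"><span class="author">alice</span><p class="content">Battery lasts two days.</p></div>
  <div class="post"><span class="author">bob</span><p class="content">Mine lasts one day.</p></div>
</body></html>
"""

# Hypothetical rule: one pattern that locates each post record,
# plus one relative pattern per field of the predefined model.
RULE = {
    "record": ".//div[@class='post']",
    "fields": {
        "author": "./span[@class='author']",
        "content": "./p[@class='content']",
    },
}


def extract(page, rule):
    """Apply the record pattern, then each field pattern, and return a list of dicts."""
    root = ET.fromstring(page)
    records = []
    for node in root.findall(rule["record"]):
        record = {}
        for name, path in rule["fields"].items():
            field = node.find(path)
            has_text = field is not None and field.text
            record[name] = field.text.strip() if has_text else None
        records.append(record)
    return records


posts = extract(PAGE, RULE)
for post in posts:
    print(post)
```

The advertisement block is skipped simply because it does not match the record pattern, which is the kind of structural discrimination an extraction rule is meant to provide.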
An unsupervised forum data extraction method is proposed. It takes full account of the structural features of Web pages and the relationships between pages, and significantly raises the degree of automation of extraction. Because pages from the same forum site have similar structure, the method compares multiple pages jointly to divide each Web page into stable regions and unstable regions, and removes most of the noise data through page-level filtering and template-level filtering. It then exploits the relationship between paths in stable regions and paths in unstable regions, introducing a path accompanying distance and a similarity measure to compute the dependency between paths, and thereby decides whether a path belongs to the extraction target, achieving automatic extraction of forum post content.
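The idea of comparing several pages from the same site to separate stable from unstable regions can be sketched in a few lines. The version below is a deliberate simplification under stated assumptions: a path counts as stable when its text is identical on every page and as unstable otherwise. The sample pages and the names path_texts and split_regions are hypothetical, and the path accompanying distance and similarity measure mentioned above are not implemented because the abstract does not define them.

```python
import xml.etree.ElementTree as ET

# Hypothetical pages from one forum site: same template, different post text.
PAGES = [
    "<html><body><div id='nav'>Home | Boards</div>"
    "<div id='post'>My phone overheats.</div>"
    "<div id='foot'>Powered by ExampleBB</div></body></html>",
    "<html><body><div id='nav'>Home | Boards</div>"
    "<div id='post'>Try closing background apps.</div>"
    "<div id='foot'>Powered by ExampleBB</div></body></html>",
    "<html><body><div id='nav'>Home | Boards</div>"
    "<div id='post'>That fixed it, thanks.</div>"
    "<div id='foot'>Powered by ExampleBB</div></body></html>",
]


def path_texts(page):
    """Map each root-to-node path to the list of non-empty texts found at it."""
    texts = {}

    def walk(node, prefix):
        ident = node.get("id")
        path = f"{prefix}/{node.tag}" + (f"[@id='{ident}']" if ident else "")
        text = (node.text or "").strip()
        if text:
            texts.setdefault(path, []).append(text)
        for child in node:
            walk(child, path)

    walk(ET.fromstring(page), "")
    return texts


def split_regions(pages):
    """Return (stable, unstable) path sets by comparing text across all pages."""
    per_page = [path_texts(p) for p in pages]
    all_paths = set().union(*per_page)
    stable, unstable = set(), set()
    for path in all_paths:
        versions = {tuple(t.get(path, ())) for t in per_page}
        (stable if len(versions) == 1 else unstable).add(path)
    return stable, unstable


stable, unstable = split_regions(PAGES)
print("stable:", sorted(stable))
print("unstable:", sorted(unstable))
```

In this toy input the navigation bar and footer land in the stable set, while the post body lands in the unstable set, which is where user-generated content would be expected.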
An unsupervised method for generating forum data extraction rules is proposed. It takes full account of both the structure and the content features of Web pages, improves adaptability to different forums, and ensures that posts are extracted completely. The method generates extraction rules in two stages and exploits features of three kinds: Web page structure, user-published posts, and the regular redundant information of forums. In the user information processing stage, user regions are obtained from the regular redundant information in Web pages, and the maximal substructure within the user regions is found, which yields the user information. In the post content processing stage, user regions are converted into records of a relational table, and functional dependencies between attributes are used to distinguish post content from noise data. Finally, the paths corresponding to the content obtained in the two stages are generalized into extraction rules expressed as regular tree structures.
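To make the functional-dependency step more concrete, the sketch below checks whether one attribute determines another in a small relational table. The table, the attribute names, and the interpretation are assumptions for illustration only: an attribute fully determined by the author (such as a signature) is treated here as non-content, while an attribute that varies for the same author is a candidate for post content. The abstract does not say exactly how the dissertation applies the dependencies, so this should not be read as its actual procedure.

```python
# Hypothetical records obtained from user regions, one row per post.
ROWS = [
    {"author": "alice", "signature": "Sent from my phone", "body": "Battery lasts two days."},
    {"author": "bob", "signature": "-- Bob", "body": "Mine lasts one day."},
    {"author": "alice", "signature": "Sent from my phone", "body": "Update: still fine."},
]


def holds(rows, lhs, rhs):
    """Return True if the functional dependency lhs -> rhs holds in rows."""
    seen = {}
    for row in rows:
        key, value = row[lhs], row[rhs]
        # A second, different rhs value for the same lhs value violates the dependency.
        if seen.setdefault(key, value) != value:
            return False
    return True


# Attributes determined by the author versus attributes that vary per post.
determined = [c for c in ("signature", "body") if holds(ROWS, "author", c)]
varying = [c for c in ("signature", "body") if not holds(ROWS, "author", c)]

print("determined by author:", determined)
print("varies per post:", varying)
```

With these rows the signature is determined by the author and the body is not, so under the stated assumption only the body would be kept as post content.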
In summary, this dissertation proposes three forum data extraction methods motivated by different requirements. The first uses supervised learning of extraction rules, achieves good precision and recall, and is best suited to small-scale forum data sets. The second is an unsupervised extraction method that extracts data directly from Web pages without explicitly outputting extraction rules, and is suited to larger-scale forum data sets. The third is also unsupervised; it first learns extraction rules and then extracts data based on those rules, balancing automated rule generation with extraction performance, and can handle even larger data sets. Experiments on real forum data show that these methods can effectively extract data from a variety of forums.
[Degree-granting institution]: East China Normal University
[Degree level]: Doctoral
[Year degree conferred]: 2012
[Classification number]: TP393.09

