
Research on Valuable Event Identification in Web Data Integration

Published: 2018-05-06 06:14

  Topics: duplicate event representations + event representation unification; Source: Shandong University, doctoral dissertation, 2014


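【Abstract】: With the rapid development of Internet technology, the Web has become a huge information source holding massive amounts of data. Because it is open, interactive, and convenient, it has become an important platform for obtaining information. Retrieving the required information from the Web accurately and effectively, and then analyzing and mining it further, is especially important for analytical applications such as market intelligence analysis and business intelligence.

Compared with the structured data of traditional data integration, Web pages contain a large amount of unstructured data. Within that data, a statement describing an activity that takes place at a particular time and place and involves particular participants is called an event. Identifying valuable events in Web pages means identifying the event information scattered across many pages and associating it with the events' value data, thereby providing data support for applications such as market intelligence analysis.

News reports on Web pages contain a large number of events and give users timely, wide-ranging information, but the sentences that describe these events are written from different angles and phrased freely, so it is hard to tell whether they refer to the same event. The different sentences that describe the same event across reports are called event representations. Across the many pages of the Web, aggregating event representations reveals the event they jointly refer to, and the mutually corroborating and complementary information among co-referring representations gives a fuller and more accurate picture of that event. In addition, analyzing event topics and integrating topic heat information allows valuable events to be identified at different levels. The valuable events identified in this way carry richer and more accurate data and integrate value information from several levels, such as the event topic. They can support applications such as market intelligence analysis and form the basis for further data analysis and mining.

Valuable event identification on the Web has become a hot research problem. Because Web events are massive in number, unstructured, freely described, and richly interconnected, valuable event identification must not only discover Web events but also integrate their value information, and the following problems remain to be solved.
(1) The same event is covered by different news reports on the Web, and the event representation sentences in those reports differ considerably because they describe the event from different angles. These representations are spread over a large number of pages. How to find duplicate event representations quickly and accurately, and aggregate those that point to the same event, is a problem to be studied.
(2) Event representations describe an event from different angles. How to make full use of the mutually corroborating and complementary information among them, unify co-referring representations of varied form into a single representation, and ensure that the merged representation carries accurate and rich data is a problem to be solved.
(3) Different Web events can share a common topic. How to discover the topics of different events accurately, analyze the heat of topic words, and identify valuable events at the topic level is a problem to be solved.

With Web data integration as its goal, this thesis studies the problems above in valuable event identification on the Web. Its contributions fall into three areas:

(1) A duplicate event representation discovery method based on dimension matching and co-occurrence constraints. Events are represented in eight dimensions, and a co-occurrence constraint on the event representations within Web pages reduces the number of matches between representations, so that duplicate event representations in Web pages can be found accurately and efficiently.

In this method an event is represented by the eight dimensions {agent, activity, object, time, location, cause, purpose, manner}, which gives events a degree of structure. The content of each dimension is matched with a matcher suited to that dimension, and an extended evidence theory model combines the per-dimension matching results. For duplicate discovery over large numbers of pages, a co-occurrence constraint on the event representations within Web pages reduces the number of matches between representations on different pages. Experimental results show that the method accurately aggregates large numbers of duplicate representations that refer to the same event and reduces the number of matches between representations, which effectively shortens the time needed to discover duplicate event representations in Web pages and improves the efficiency of discovery.

(2) An event representation unification method based on dimension content recombination, motivated by the varied forms of the Web event representations that point to the same event. The more accurate and detailed dimension contents among a large number of duplicate representations are selected and combined into a single event representation that reflects the real-world event.

The method uses a Markov logic network that combines several first-order logic rules to judge and select the more complete and accurate dimension contents among the event representations. The more accurate and detailed dimension contents scattered across multiple representations are then combined into a single representation. Experimental results show that the method effectively selects the more complete and accurate dimension contents and that event representation unification achieves high accuracy.

(3) An event topic analysis method based on topic feature clustering and an extended LDA model, motivated by the fact that different events can share a common topic. The method analyzes the topic words of events and the heat of those topic words, and identifies valuable events at the topic level.

The thesis proposes DLDA, an extended LDA model that integrates the dimension information of events into LDA, avoids assigning topic probability to topic-independent event dimensions (such as time and location), and selects the topic feature dimensions. Clustering on the content of the selected topic feature dimensions identifies event topics accurately. A topic word synthesis rule is proposed to synthesize the topic words of an event and analyze their heat. Experimental results show that the proposed method accurately extracts event topic words, analyzes their heat, and effectively identifies valuable events at the topic level.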
[Abstract]: With the rapid development of Internet technology, the Web has become a huge information source holding massive amounts of data. Its openness, interactivity, and convenience have made it an important platform for obtaining information. Retrieving the required information from the Web accurately and effectively, and then analyzing and mining it further, is particularly important for analytical applications such as market intelligence analysis and business intelligence.

In contrast to the structured data of traditional data integration, Web pages contain a large amount of unstructured data. Within that data, a statement describing an activity that occurs at a particular time and place and involves particular participants is called an event. Identifying valuable events in Web pages means identifying the event information dispersed across a large number of pages and associating it with the events' value data, which provides data support for applications such as market intelligence analysis.

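News reports on Web pages contain a large number of events and provide users with timely and extensive information. However, the sentences that describe these events take different angles and are phrased freely, so it is difficult to tell whether they point to the same event. The different sentences that describe the same event are called event representations.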

Valuable event identification on the Web has become a hot research problem. Because Web events are massive in number, unstructured, freely described, and richly interconnected, the following problems remain to be solved.
(1) The same event is covered by different news reports, and the event representations in those reports differ considerably because they describe the event from different angles. How to find duplicate event representations quickly and accurately across a large number of pages, and aggregate those that point to the same event, is a problem to be studied.
(2) Event representations describe an event from different angles. How to make full use of the mutually corroborating and complementary information among them, unify co-referring representations of varied form into a single representation, and ensure that the merged representation carries accurate and rich data is a problem to be solved.
(3) Different Web events can share a common topic. How to find the topics of different events accurately, analyze the heat of topic words, and identify valuable events at the topic level is a problem to be solved.

With Web data integration as its goal, this thesis studies the problems above in valuable event identification on the Web. Its contributions fall into the following three areas:

(1) A duplicate event representation discovery method based on dimension matching and co-occurrence constraints is proposed. Events are represented in eight dimensions, and a co-occurrence constraint on the event representations within Web pages is used to reduce the number of matches between representations, so that duplicate event representations in Web pages can be found accurately and efficiently.

In this method an event is represented by the eight dimensions {agent, activity, object, time, location, cause, purpose, manner}, which gives events a degree of structure. The content of each dimension is matched with a matcher suited to that dimension, and an extended evidence theory model combines the per-dimension results. The experimental results show that the method accurately aggregates a large number of duplicate event representations that refer to the same event and reduces the number of matches between event representations on different Web pages.
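As a rough illustration of the dimension-matching idea, the Python sketch below compares two event representations dimension by dimension. It is not the thesis's method: the matchers, the plain averaging (used here in place of the extended evidence theory model), the 0.8 threshold, and the `page` field are all assumptions. The co-occurrence constraint is interpreted, hypothetically, as "representations that appear on the same page describe different events", which the abstract does not spell out.

```python
from difflib import SequenceMatcher
from itertools import combinations

DIMENSIONS = ("agent", "activity", "object", "time",
              "location", "cause", "purpose", "manner")


def text_similarity(a: str, b: str) -> float:
    """Generic matcher: character-level similarity ratio in [0, 1]."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def exact_match(a: str, b: str) -> float:
    """Strict matcher: 1.0 if the two values are equal, else 0.0."""
    return 1.0 if a == b else 0.0


# Hypothetical assignment of matchers to dimensions.
MATCHERS = {dim: text_similarity for dim in DIMENSIONS}
MATCHERS["time"] = exact_match


def match_score(rep_a: dict, rep_b: dict) -> float:
    """Average the per-dimension scores over dimensions filled in both."""
    scores = [
        MATCHERS[dim](rep_a[dim], rep_b[dim])
        for dim in DIMENSIONS
        if rep_a.get(dim) and rep_b.get(dim)
    ]
    return sum(scores) / len(scores) if scores else 0.0


def candidate_pairs(reps: list) -> list:
    """Assumed constraint: never compare representations from the same page."""
    return [
        (i, j)
        for i, j in combinations(range(len(reps)), 2)
        if reps[i]["page"] != reps[j]["page"]
    ]


def find_duplicates(reps: list, threshold: float = 0.8) -> list:
    """Return index pairs whose combined score reaches the threshold."""
    return [(i, j) for i, j in candidate_pairs(reps)
            if match_score(reps[i], reps[j]) >= threshold]


# Made-up example data: two pages, three event representations.
reps = [
    {"page": "p1", "agent": "Acme Corp", "activity": "acquires",
     "object": "Beta Ltd", "time": "2014-03-01"},
    {"page": "p1", "agent": "Acme Corp", "activity": "opens",
     "object": "new factory", "time": "2014-03-01"},
    {"page": "p2", "agent": "Acme Corp", "activity": "acquires",
     "object": "Beta Ltd", "time": "2014-03-01", "location": "Jinan"},
]
duplicates = find_duplicates(reps)
```

Under the assumed constraint, the pair of representations on page `p1` is never compared, so only two of the three possible pairs are scored.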

(2) Because the Web event representations that point to the same event take many forms, an event representation unification method based on dimension content recombination is proposed. The more accurate and detailed dimension contents among a large number of duplicate event representations are selected and combined into a single event representation that reflects the real-world event.

The method uses a Markov logic network that combines several first-order logic rules to judge and select the more complete and accurate dimension contents among the event representations. The more accurate and detailed dimension contents scattered across multiple representations are then combined into a single representation. The experimental results show that the method effectively selects the more complete and accurate dimension contents and that event representation unification achieves high accuracy.
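To make the recombination step concrete, here is a minimal Python sketch that merges a group of duplicate event representations into one. It deliberately replaces the Markov logic network with a much simpler stand-in: for each dimension it keeps the value that most representations agree on and breaks ties in favor of the longer, presumably more detailed, value. That selection rule and the sample data are assumptions made for illustration only.

```python
from collections import Counter

DIMENSIONS = ("agent", "activity", "object", "time",
              "location", "cause", "purpose", "manner")


def unify(representations: list) -> dict:
    """Build one representation from the best value found per dimension."""
    unified = {}
    for dim in DIMENSIONS:
        values = [rep[dim] for rep in representations if rep.get(dim)]
        if not values:
            continue  # no representation fills this dimension
        counts = Counter(values)
        # Stand-in rule: most frequent value first, longer value on ties.
        unified[dim] = max(counts, key=lambda v: (counts[v], len(v)))
    return unified


# Made-up duplicate representations of one event.
group = [
    {"agent": "Acme", "activity": "acquires",
     "object": "Beta Ltd", "time": "2014-03-01"},
    {"agent": "Acme Corp", "activity": "acquires",
     "object": "Beta Ltd", "location": "Jinan"},
    {"agent": "Acme Corp", "activity": "buys",
     "object": "Beta Ltd", "time": "2014-03-01"},
]
merged = unify(group)
```

The merged representation picks up the location that only one report mentions, which is the kind of complementary information the unification step is meant to preserve.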

(3) Because different events can share a common topic, an event topic analysis method based on topic feature clustering and an extended LDA model is proposed. The topic words of events and the heat of those topic words are analyzed, and valuable events are identified at the topic level.

The thesis proposes DLDA, an extended LDA model that integrates the dimension information of events into LDA, avoids assigning topic probability to topic-independent event dimensions (such as time and location), and selects the topic feature dimensions. Clustering on the content of the selected topic feature dimensions identifies event topics, and a topic word synthesis rule is used to synthesize the topic words of an event and analyze their heat. The experimental results show that the proposed method accurately extracts event topic words, analyzes their heat, and effectively identifies valuable events at the topic level.
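The idea of ignoring topic-independent dimensions can be sketched without any topic model at all. The Python snippet below is not DLDA and does not implement LDA. It only drops the time and location dimensions (the two examples the abstract names) before collecting terms, and it treats "heat" as the number of events in which a term appears. That definition of heat, the whitespace tokenization, and the sample events are assumptions.

```python
from collections import Counter

# Dimensions treated as topic-independent, following the abstract's examples.
TOPIC_INDEPENDENT = {"time", "location"}


def topic_terms(event: dict) -> set:
    """Collect lowercase terms from the topic-bearing dimensions only."""
    terms = set()
    for dim, content in event.items():
        if dim in TOPIC_INDEPENDENT:
            continue
        terms.update(content.lower().split())
    return terms


def term_heat(events: list) -> Counter:
    """Assumed heat measure: number of events in which each term occurs."""
    heat = Counter()
    for event in events:
        heat.update(topic_terms(event))
    return heat


# Made-up events in the eight-dimension style (unused dimensions omitted).
events = [
    {"agent": "Acme", "activity": "acquires", "object": "Beta",
     "time": "2014", "location": "Jinan"},
    {"agent": "Gamma", "activity": "acquires", "object": "Delta",
     "location": "Jinan"},
    {"agent": "Acme", "activity": "launches", "object": "phone",
     "time": "2014"},
]
heat = term_heat(events)
```

Terms that occur only in the time or location dimensions never enter the count, so they cannot be mistaken for topic words however often they occur.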

【Degree-granting institution】: Shandong University
【Degree level】: Doctorate
【Year degree conferred】: 2014
【Classification number】: TP393.092; TP391.1

