
Research on Valuable Event Identification in Web Data Integration

Published: 2018-05-06 06:14

  Topics: duplicate event representations + event representation unification; Source: Shandong University, doctoral dissertation, 2014


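【Abstract】: With the rapid development of Internet technology, the Web has become an enormous information source holding massive amounts of data. Because it is open, interactive and convenient, it has become an important platform for obtaining information. Obtaining the required information from the Web accurately and effectively, and then analyzing and mining it further, is especially important for analytical applications such as market intelligence analysis and business intelligence.

Compared with the structured data of traditional data integration, Web pages contain large amounts of unstructured data. Within that data, a statement describing an activity that takes place at a particular time and place and involves particular participants is called an event. Identifying valuable events in web pages means identifying event information scattered across a large number of pages and associating it with the events' value data, which provides data support for applications such as market intelligence analysis.

News reports on Web pages contain a large number of events and give users timely, wide-ranging information, but the statements describing these events are written from different angles and phrased freely, so it is hard to tell whether they refer to the same event. The different statements that describe the same event in web reports are called event representations. Across a large number of pages, aggregating event representations reveals the event they jointly refer to, and the mutually corroborating and complementary information among coreferent representations gives a fuller and more accurate picture of that event. In addition, analyzing event topics and integrating topic heat information allows valuable events to be identified at different levels. The valuable events identified in this way carry richer and more accurate data and integrate value information at different levels, such as the event topic. They can support applications such as market intelligence analysis and form the basis for further data analysis and mining.

Web valuable event identification has become one of the current hot research problems. Because Web events are massive, unstructured, freely described and richly interconnected, valuable event identification must not only discover Web events but also integrate event value information, and the following problems remain to be solved. (1) The same event is covered by different news reports on the Web, and the event representation statements in those reports differ considerably because they describe the event from different angles. These representations are spread across a large number of pages. How to find duplicate event representations in web pages quickly and accurately, and aggregate the representations that refer to the same event, is a problem that needs study. (2) Event representations describe an event from different angles. How to make full use of the mutually corroborating and complementary information among them, unify coreferent representations of varying form into a single representation, and ensure that the merged representation has accurate and rich data, is a problem that needs to be solved. (3) Different Web events can share a common topic. How to find the topics of different events accurately, analyze topic word heat, and identify valuable events at the topic level, is a problem that needs to be solved.

With Web data integration as its goal, this dissertation studies the above problems in Web valuable event identification. Its contributions fall into three areas:

(1) A duplicate event representation discovery method based on dimension matching and co-occurrence constraints is proposed. Using an 8-dimension representation of events, it applies a co-occurrence constraint on event representations in web pages to reduce the number of matches between event representations, so that duplicate event representations in web pages are found accurately and efficiently.

Events are represented with the eight dimensions {agent, activity, object, time, location, cause, purpose, manner}, which gives them some structure. Different matchers are applied to the content of different dimensions, and an extended evidence theory model combines the per-dimension match results. For discovering duplicate event representations across large numbers of pages, a web-page event representation co-occurrence constraint is proposed to reduce the number of event representation matches between pages. Experimental results show that the method accurately aggregates large numbers of duplicate event representations that refer to the same event, reduces the number of matches between event representations, shortens the time needed to find duplicate event representations in web pages, and improves the efficiency of duplicate discovery.

(2) Because Web event representations that refer to the same event take many forms, an event representation unification method based on recombining dimension content is proposed. It selects the more accurate and detailed dimension content from a large number of duplicate event representations and combines it into a single event representation that reflects the real-world event.

The method uses a Markov logic network together with multiple first-order logic rules to judge and select the more complete and accurate dimension content among event representations. The accurate and detailed dimension content scattered across multiple event representations is then combined into one event representation. Experimental results show that the method effectively selects the more complete and accurate dimension content and that event representation unification achieves high accuracy.

(3) Because different events can share a common topic, an event topic analysis method based on topic feature clustering and an extended LDA model is proposed. It analyzes the topic words of events and their heat, and identifies valuable events at the topic level.

An extended LDA model, DLDA, is proposed. It integrates the dimension information of events into the LDA model, avoids assigning topic probability to event dimensions that are unrelated to the topic (such as time and location content), and selects topic feature dimensions. Events are clustered on the content of the selected topic feature dimensions to identify event topics accurately. A topic word composition rule is proposed to compose the topic words of an event and analyze their heat. Experimental results show that the proposed method accurately extracts event topic words, analyzes topic word heat, and effectively identifies valuable events at the topic level.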
[Abstract]: With the rapid development of Internet technology, the Web has become a huge source of information. Because it is open, interactive and convenient, it has become an important platform for obtaining information. Obtaining the required information from the Web accurately and effectively, and then analyzing and mining it further, is particularly important for analytical applications such as market intelligence analysis and business intelligence.

In contrast to the structured data of traditional data integration, Web pages contain a large amount of unstructured data. Within that data, a statement describing an activity that occurs at a particular time and place and involves particular participants is called an event. Identifying valuable events in web pages means identifying event information dispersed across a large number of pages and associating it with the events' value data, which provides data support for applications such as market intelligence analysis.

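News reports on Web pages contain a large number of events and provide users with timely and extensive information. However, the statements describing these events differ in angle and wording, so it is difficult to tell whether they refer to the same event. The different statements that describe the same event are called event representations.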

Web valuable event identification has become one of the current hot research problems. Because Web events are massive, unstructured, freely described and richly interconnected, several problems remain to be solved.
(1) The same event is covered by different news reports, and the event representations in those reports differ considerably. How to find duplicate event representations across a large number of web pages quickly and accurately, and aggregate the representations that refer to the same event, is a problem that needs study;
(2) Event representations describe an event from different angles. How to make full use of the mutually corroborating and complementary information among them, unify coreferent representations of varying form into a single representation, and ensure that the merged representation has accurate and rich data, is a problem that needs to be solved;
(3) Different Web events can share a common topic. How to find the topics of different events accurately, analyze the heat of topic words, and identify valuable events at the topic level, is a problem that needs to be solved.

With Web data integration as its goal, this dissertation studies the above problems in Web valuable event identification. Its contributions fall into the following three areas:

(1) A method for finding duplicate event representations based on dimension matching and co-occurrence constraints is proposed. Using an 8-dimension representation of events, it applies a co-occurrence constraint on event representations in web pages to reduce the number of matches between event representations, so that duplicate event representations in web pages are found accurately and efficiently.

Events are represented with the eight dimensions {agent, activity, object, time, location, cause, purpose, manner}, which gives them some structure. Different matchers are applied to different dimensions, and an extended evidence theory model combines the per-dimension match results. The experimental results show that the method accurately aggregates a large number of duplicate event representations of the same event and reduces the number of event representation matches between web pages.
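As a rough illustration of per-dimension matching, the sketch below stores an event representation in the eight dimensions listed above, scores each dimension with its own matcher, and fuses the scores with the classic Dempster rule of combination. The abstract refers to an extended evidence theory model whose details it does not give, so the fusion rule, the matchers, the `reliability` discount and all names here are assumptions made for illustration only.

```python
from dataclasses import dataclass
from difflib import SequenceMatcher

DIMENSIONS = ("agent", "activity", "object", "time",
              "location", "cause", "purpose", "manner")


@dataclass
class EventRepresentation:
    """One statement describing an event, split into the eight dimensions."""
    agent: str = ""
    activity: str = ""
    object: str = ""
    time: str = ""
    location: str = ""
    cause: str = ""
    purpose: str = ""
    manner: str = ""


def string_matcher(a: str, b: str) -> float:
    """Fuzzy similarity in [0, 1] for free-text dimensions."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def exact_matcher(a: str, b: str) -> float:
    """Strict equality, used here for the time dimension (an assumption)."""
    return 1.0 if a.strip().lower() == b.strip().lower() else 0.0


# Hypothetical matcher assignment; unlisted dimensions use string_matcher.
MATCHERS = {"time": exact_matcher}


def to_mass(similarity: float, reliability: float = 0.8) -> dict:
    """Turn a similarity into masses on {match}, {nonmatch} and the whole frame."""
    return {"match": reliability * similarity,
            "nonmatch": reliability * (1.0 - similarity),
            "unknown": 1.0 - reliability}


def dempster_combine(m1: dict, m2: dict) -> dict:
    """Classic Dempster rule on the two-element frame {match, nonmatch}."""
    conflict = m1["match"] * m2["nonmatch"] + m1["nonmatch"] * m2["match"]
    if conflict >= 1.0:
        raise ValueError("sources are in total conflict")
    norm = 1.0 - conflict
    return {
        "match": (m1["match"] * m2["match"] + m1["match"] * m2["unknown"]
                  + m1["unknown"] * m2["match"]) / norm,
        "nonmatch": (m1["nonmatch"] * m2["nonmatch"] + m1["nonmatch"] * m2["unknown"]
                     + m1["unknown"] * m2["nonmatch"]) / norm,
        "unknown": m1["unknown"] * m2["unknown"] / norm,
    }


def match_belief(e1: EventRepresentation, e2: EventRepresentation) -> float:
    """Belief that two representations refer to the same event."""
    combined = {"match": 0.0, "nonmatch": 0.0, "unknown": 1.0}  # no evidence yet
    for dim in DIMENSIONS:
        a, b = getattr(e1, dim), getattr(e2, dim)
        if not a or not b:
            continue  # a missing dimension contributes no evidence
        matcher = MATCHERS.get(dim, string_matcher)
        combined = dempster_combine(combined, to_mass(matcher(a, b)))
    return combined["match"]
```

Two representations would then be treated as duplicates when `match_belief` exceeds a chosen threshold. The threshold, like the reliability discount, would need tuning on labeled data.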

(2) Because Web event representations that refer to the same event take many forms, an event representation unification method based on recombining dimension content is proposed. It selects the more accurate and detailed dimension content from a large number of duplicate event representations and combines it into a single event representation that reflects the real-world event.

The unification method uses a Markov logic network together with multiple first-order logic rules to judge and select the more complete and accurate dimension content among event representations. The experimental results show that the method effectively selects the more complete and accurate dimension content and that event representation unification achieves high accuracy.
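The recombination step can be pictured with a much simpler stand-in for the Markov logic network. In the sketch below, each dimension of the unified representation takes the value that the most coreferent representations are compatible with, and ties go to the longer, more detailed value. The containment test, the tie-break and the function name `unify` are assumptions for illustration and are not the dissertation's rules.

```python
DIMENSIONS = ("agent", "activity", "object", "time",
              "location", "cause", "purpose", "manner")


def unify(representations: list) -> dict:
    """Merge coreferent representations (dicts keyed by dimension) into one."""
    unified = {}
    for dim in DIMENSIONS:
        values = [r[dim].strip() for r in representations
                  if r.get(dim, "").strip()]
        if not values:
            continue  # no representation fills this dimension

        def support(candidate: str) -> int:
            # Count values compatible with the candidate: one contains the other.
            c = candidate.lower()
            return sum(1 for v in values
                       if v.lower() in c or c in v.lower())

        # Most supported value first; among ties, prefer the more detailed one.
        unified[dim] = max(values, key=lambda v: (support(v), len(v)))
    return unified


reports = [
    {"agent": "Acme", "time": "March 2014", "location": "Beijing"},
    {"agent": "Acme Corp", "time": "2014", "location": "Haidian District, Beijing"},
    {"agent": "Acme Corp", "location": "Shanghai"},
]
merged = unify(reports)
```

With these three made-up reports, the outlier location is outvoted and the more specific of the two compatible locations is kept, which is the behaviour the abstract describes in general terms.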

(3) Because different events can share a common topic, an event topic analysis method based on topic feature clustering and an extended LDA model is proposed. It analyzes the topic words of events and their heat, and identifies valuable events at the topic level.

An extended LDA model, DLDA, is proposed. It integrates the dimension information of events into the LDA model, avoids assigning topic probability to event dimensions that are unrelated to the topic (such as time and location), and selects topic feature dimensions. Events are clustered on the content of the selected topic feature dimensions, the topic words of each event are identified, and the heat of those topic words is analyzed. The experimental results show that the proposed method accurately extracts event topic words, analyzes their heat, and effectively identifies valuable events at the topic level.
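The idea of keeping topic-unrelated dimensions out of topic analysis can be shown without LDA at all. The sketch below is not DLDA: it only drops the time and location dimensions (the two examples the abstract names) before collecting words, and uses the number of events mentioning a word as a crude stand-in for topic word heat. That heat definition, the whitespace tokenizer and the function names are assumptions.

```python
from collections import Counter

# Dimensions the abstract names as unrelated to the topic.
TOPIC_IRRELEVANT = {"time", "location"}


def topic_tokens(event: dict) -> list:
    """Lower-cased words from every dimension except the topic-unrelated ones."""
    tokens = []
    for dim, text in event.items():
        if dim in TOPIC_IRRELEVANT:
            continue
        tokens.extend(text.lower().split())
    return tokens


def word_heat(events: list) -> Counter:
    """Number of events mentioning each word (a simple proxy for heat)."""
    heat = Counter()
    for event in events:
        heat.update(set(topic_tokens(event)))  # count each word once per event
    return heat


events = [
    {"agent": "Acme", "activity": "acquires", "object": "Beta",
     "time": "2014", "location": "Beijing"},
    {"agent": "Gamma", "activity": "acquires", "object": "Delta",
     "location": "Beijing"},
]
heat = word_heat(events)
```

In a fuller pipeline the filtered tokens would feed a topic model or a clustering step. Here the point is only that words from the time and location dimensions never reach the counts.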

【Degree-granting institution】: Shandong University
【Degree level】: Doctoral
【Year degree granted】: 2014
【Classification number】: TP393.092; TP391.1



Article ID: 1851132


Article link: http://sikaile.net/guanlilunwen/ydhl/1851132.html

