Constraint-Based Frequent Pattern Mining: Methods and Applications

Published: 2018-07-28 11:58

[Abstract]: Constraint-based frequent pattern mining is one of the most fundamental problems in data mining research and has a wide range of practical applications. Three challenges nevertheless remain in this area. (1) How can new applications be opened up? Specifically, beyond the "support" of a pattern, how can new pattern measures be designed that better capture the interestingness of a pattern and so meet the needs of new applications? (2) Unlike support, which is anti-monotone, the proposed measures usually have more complex properties; for example, a measure may satisfy none of monotonicity, anti-monotonicity, convertibility, or succinctness. For a given pattern, how can an upper/lower bound on the measure over all of its parent patterns be computed quickly, and how can the properties of the new measure be exploited to design an efficient algorithm? (3) Different applications typically lead to different new measures, each with its own method for computing upper/lower bounds. Is there a general method that can compute the upper/lower bound of any pattern measure? To address these problems and challenges, this thesis studies methods for constraint-based frequent pattern mining and their applications. The main results and contributions are as follows.
First, a web content recommendation method based on pattern mining is proposed. Web content recommendation finds important combinations of content blocks in a web page and recommends them to the user; it has many applications, such as smart web-page printing and electronic reading on mobile devices. Many methods have attempted to solve this problem, but they are either restricted to specific kinds of pages (such as news or blog pages) or semi-automatic (the user must perform extra operations to select the content blocks of a page). How to extract the useful content of an arbitrary type of web page fully automatically has not yet been solved well. To this end, the thesis uses the way previous users made selections on similar pages to formalize the task as a pattern mining recommendation problem, and proposes a pattern-mining-based web content recommendation method that can provide more accurate recommendations for any type of web page. Specifically, a combination of content blocks (a pattern) recommended to the user should not only be frequently selected by other users but also be as complete as possible. In view of this, the thesis proposes a new pattern interestingness measure, occupancy, which measures how complete a pattern is with respect to its supporting database. Combining the support and the occupancy of a pattern yields more accurate and more satisfactory recommendations. Finally, experiments on real data sets show that, compared with baseline methods, the proposed method achieves more satisfactory recommendation results and running efficiency.
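The interplay of support and occupancy described above can be illustrated with a minimal sketch. The abstract does not give a formula for occupancy, so the definition used here (the average, over the transactions that contain a pattern, of the fraction of each transaction that the pattern covers) is an assumption, and the function names and the sample selection log are hypothetical.

```python
from typing import FrozenSet, List

Transaction = FrozenSet[str]


def supporting_transactions(pattern: Transaction,
                            db: List[Transaction]) -> List[Transaction]:
    """Return the transactions that contain every item of the pattern."""
    return [t for t in db if pattern <= t]


def support(pattern: Transaction, db: List[Transaction]) -> float:
    """Fraction of all transactions that contain the pattern."""
    return len(supporting_transactions(pattern, db)) / len(db)


def occupancy(pattern: Transaction, db: List[Transaction]) -> float:
    """ASSUMED definition: mean of |pattern| / |t| over supporting transactions."""
    sup = supporting_transactions(pattern, db)
    if not sup:
        return 0.0
    return sum(len(pattern) / len(t) for t in sup) / len(sup)


# Hypothetical log: each transaction is the set of content blocks
# that one earlier user selected on a similar page.
selections = [
    frozenset({"title", "body", "image"}),
    frozenset({"title", "body"}),
    frozenset({"title", "body", "image", "comments"}),
    frozenset({"image", "comments"}),
]

candidate = frozenset({"title", "body"})
sup_value = support(candidate, selections)
occ_value = occupancy(candidate, selections)
print(f"support={sup_value:.3f} occupancy={occ_value:.3f}")
```

Under this assumed definition, a pattern with high support but low occupancy is selected often yet covers only a small part of what users actually chose, which is why the two measures are combined.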
Second, a general and efficient algorithm for occupancy-based frequent pattern mining is proposed. This chapter extends occupancy in depth at three levels: its definition, its bound estimation methods, and its applications. Specifically, based on different weighted averages (the arithmetic mean and the harmonic mean), two definitions of occupancy are proposed: arithmetic occupancy and harmonic occupancy. Unlike support, which is anti-monotone, occupancy satisfies neither monotonicity nor anti-monotonicity, and neither convertibility nor succinctness. For a given pattern, how can an upper bound on the occupancy of all of its parent patterns be computed quickly? For each definition of occupancy, the thesis proposes three upper bounds: an efficient bound, a tightest bound, and a trade-off bound. The efficient bound is cheap to compute for a single node but loose, so many nodes must be searched; the tightest bound is tight, so few nodes are searched, but computing it for a single node is time-consuming. The trade-off bound balances tightness against computational cost so that the overall performance of the algorithm is best. Occupancy matters not only for applications on transaction databases (such as web content print recommendation) but also for applications on sequence databases (such as tourist attraction recommendation). The thesis therefore proposes a general algorithm, DOFRA, that can handle applications on different types of databases. Finally, the effectiveness of DOFRA is verified in two real applications, and its running efficiency is verified on a large amount of synthetic data.
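A small sketch can make concrete why occupancy is harder to prune on than support. The formulas are assumptions, since the abstract names the two variants without defining them: arithmetic occupancy is taken as the arithmetic mean of |X| / |t| over the supporting transactions of a pattern X, and harmonic occupancy as the harmonic mean of the same ratios, which simplifies to n * |X| / sum of |t|. The toy database is hypothetical, and the sketch does not reproduce DOFRA or any of the three upper bounds.

```python
from typing import FrozenSet, List

Transaction = FrozenSet[str]


def _supporting(pattern: Transaction, db: List[Transaction]) -> List[Transaction]:
    """Transactions that contain every item of the pattern."""
    return [t for t in db if pattern <= t]


def arithmetic_occupancy(pattern: Transaction, db: List[Transaction]) -> float:
    """ASSUMED: arithmetic mean of |pattern| / |t| over supporting transactions."""
    sup = _supporting(pattern, db)
    if not sup:
        return 0.0
    return sum(len(pattern) / len(t) for t in sup) / len(sup)


def harmonic_occupancy(pattern: Transaction, db: List[Transaction]) -> float:
    """ASSUMED: harmonic mean of |pattern| / |t|, i.e. n * |pattern| / sum(|t|)."""
    sup = _supporting(pattern, db)
    if not sup or not pattern:
        return 0.0
    return len(sup) * len(pattern) / sum(len(t) for t in sup)


# Hypothetical database: two short transactions and one long one.
db = [frozenset("a"), frozenset("a"), frozenset("abcde")]

# A chain of growing patterns: {a} is contained in {a,b}, which is in {a,b,c,d,e}.
chain = [frozenset("a"), frozenset("ab"), frozenset("abcde")]
arith = [arithmetic_occupancy(p, db) for p in chain]
harm = [harmonic_occupancy(p, db) for p in chain]

for p, a, h in zip(chain, arith, harm):
    print(sorted(p), f"arithmetic={a:.3f}", f"harmonic={h:.3f}")
```

Along this chain both values first fall and then rise, so a pattern cannot be discarded merely because its own occupancy is below the threshold. A search therefore needs an upper bound over the patterns that extend it, which is the role the abstract assigns to the efficient, tightest, and trade-off bounds.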
Finally, a general model is proposed that can efficiently estimate the upper/lower bound of any pattern measure. Constraint-based pattern mining not only helps capture more of the semantic information in patterns but can also exploit the properties of the constraints to further improve mining efficiency. Driven by practical applications, new pattern measures are usually proposed to measure the interestingness of patterns, and upper/lower bounds are then estimated separately for each proposed measure; a unified framework suitable for any pattern measure is lacking. To this end, the thesis formalizes the bound estimation problem that considers only item labels and proposes a general model that solves it efficiently. To show the effectiveness of the general framework more intuitively, the thesis presents two typical pattern measures as case studies: pattern utility and pattern occupancy. In addition, to meet different application requirements, the thesis extends traditional SQL-based pattern measures, such as min, max, avg, and var, into relative pattern measures. Finally, experimental analysis on real and synthetic data verifies the generality and effectiveness of the proposed approach.
【Degree-granting institution】: University of Science and Technology of China
【Degree level】: Doctoral
【Year degree granted】: 2014
【Classification number】: TP311.13;TP393.092
