Research on the Automatic Identification and Classification of Deep Web Data Sources

Published: 2018-08-25 09:20
[Abstract]: The Deep Web, also called the Invisible Web or Hidden Web, is often described as the web content that Google cannot find: information that the standard search engines we are familiar with cannot reach. It is commonly held that such content accounts for about 90% of all information on the web. According to a technical white paper from Bright Planet, the Deep Web holds roughly 500 times as much content as the Surface Web and contains more valuable resources. More than half of Deep Web content is stored in domain-specific databases. Although the vast amount of surface information can be retrieved with ordinary search engines, a considerable amount remains hidden and out of their reach. Deep Web data sources also change constantly, and most hidden content exists only as pages generated in response to dynamic requests, which standard search engines cannot find. Because these dynamically generated pages can only be obtained through Deep Web query interfaces, acquiring Deep Web information is considerably harder. To obtain it effectively, Deep Web data sources must be identified and classified automatically.

This thesis studies these two problems: the automatic identification and the classification of Deep Web data sources. The main research contents are:

(1) Form feature analysis. The form features of ordinary web pages and of Deep Web pages are analyzed. By merging, adding, and filtering candidate features, we obtain the form feature extraction scheme used in this thesis, which takes control values, control counts, and terms carrying semantic information as classification attributes.

(2) Identification and classification of query interfaces, a key problem in Deep Web data integration. To address the limitations of the naive Bayes method, a rough set algorithm is used for optimization and reduction. The method builds a group of naive Bayes classifiers through two rounds of random sampling, prunes the group with the attribute reduction method of rough sets, classifies with the optimized group, and takes a weighted average of the individual outputs as the final result. The experiments show that the optimized naive Bayes classifier group clearly improves both precision and recall for identifying Deep Web query interfaces and for classifying them.

(3) Performance comparison for Deep Web data source identification and classification. Several data-mining classification methods, such as C4.5 decision trees and ID3, are compared with the proposed algorithm; the precision and recall results confirm that the method is feasible.

The approach taken is to analyze existing work, study Deep Web data sources, and, building on published results, validate the improved algorithm with experimental data. The experimental results are satisfactory. The experiments inevitably have shortcomings, and in future work we will further revise the related problems and algorithms. Deep Web research still has a long way to go, and the remaining problems will need to be solved one by one by researchers.
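The pipeline outlined in the abstract (form features, a group of naive Bayes classifiers built by random sampling, and a weighted average of their outputs) can be illustrated with a minimal Python sketch. It is not the thesis's implementation. The three boolean features, the labels "query" and "other", the reading of the two rounds of sampling as instances first and attributes second, and the use of training accuracy as the weight are all assumptions made for illustration. The rough-set reduction of the classifier group is not shown.

```python
import math
import random
from collections import Counter, defaultdict
from html.parser import HTMLParser

ATTRS = ["has_password", "has_select", "many_text"]  # hypothetical attributes


class FormFeatures(HTMLParser):
    """Counts form controls by kind (illustrative feature set)."""

    def __init__(self):
        super().__init__()
        self.counts = Counter()

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "input":
            self.counts["input_" + (attrs.get("type") or "text").lower()] += 1
        elif tag in ("select", "textarea"):
            self.counts[tag] += 1


def extract_features(html):
    parser = FormFeatures()
    parser.feed(html)
    c = parser.counts
    # Discretise control counts into booleans so naive Bayes can count them.
    return {"has_password": c["input_password"] > 0,
            "has_select": c["select"] > 0,
            "many_text": c["input_text"] >= 2}


class NaiveBayes:
    """Naive Bayes over boolean attributes with Laplace smoothing."""

    def __init__(self, attrs):
        self.attrs = attrs

    def fit(self, rows, labels):
        self.classes = sorted(set(labels))
        self.prior = Counter(labels)
        self.cond = defaultdict(Counter)  # (class, attribute) -> value counts
        for row, y in zip(rows, labels):
            for a in self.attrs:
                self.cond[(y, a)][row[a]] += 1
        return self

    def predict_proba(self, row):
        total = sum(self.prior.values())
        logp = {}
        for c in self.classes:
            lp = math.log(self.prior[c] / total)
            for a in self.attrs:
                lp += math.log((self.cond[(c, a)][row[a]] + 1) / (self.prior[c] + 2))
            logp[c] = lp
        top = max(logp.values())
        exp = {c: math.exp(v - top) for c, v in logp.items()}
        z = sum(exp.values())
        return {c: v / z for c, v in exp.items()}


def accuracy(nb, rows, labels):
    hits = 0
    for row, y in zip(rows, labels):
        proba = nb.predict_proba(row)
        hits += max(proba, key=proba.get) == y
    return hits / len(rows)


def train_ensemble(rows, labels, attrs, n_models=5, seed=0):
    rng = random.Random(seed)
    idx = range(len(rows))
    models = []
    for _ in range(n_models):
        sample = [rng.choice(idx) for _ in idx]              # sampling 1: instances
        sub = rng.sample(attrs, k=max(1, len(attrs) - 1))    # sampling 2: attributes
        nb = NaiveBayes(sub).fit([rows[i] for i in sample],
                                 [labels[i] for i in sample])
        models.append((nb, accuracy(nb, rows, labels)))      # weight: an assumption
    return models


def ensemble_predict(models, row):
    total_w = sum(w for _, w in models) or 1.0
    scores = Counter()
    for nb, w in models:
        for c, p in nb.predict_proba(row).items():
            scores[c] += w * p / total_w                     # weighted average
    return max(scores, key=scores.get), dict(scores)


# Toy usage with two hand-written forms (not data from the thesis).
SEARCH_FORM = ('<form><input type="text" name="title"><input name="author">'
               '<select name="field"></select></form>')
LOGIN_FORM = '<form><input type="text" name="user"><input type="password" name="pw"></form>'
rows = [extract_features(SEARCH_FORM)] * 6 + [extract_features(LOGIN_FORM)] * 6
labels = ["query"] * 6 + ["other"] * 6
models = train_ensemble(rows, labels, ATTRS)
print(ensemble_predict(models, extract_features(SEARCH_FORM)))
```

Averaging posterior probabilities rather than hard votes lets a confident classifier outweigh several uncertain ones. A pruning step such as the rough-set reduction described above would drop members from `models` before `ensemble_predict` is called.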
[Degree-granting institution]: Southwest University
[Degree level]: Master's
[Year degree awarded]: 2013
[Classification number]: TP391.3



Article link: http://sikaile.net/kejilunwen/sousuoyinqinglunwen/2202436.html
