Research on the Automatic Identification and Classification of Deep Web Data Sources

Published: 2018-08-25 09:20

[Abstract]: The Deep Web, also known as the Invisible Web or Hidden Web, is often described as the web information that Google cannot find: content beyond the reach of the standard search engines we are familiar with. Information that search engines cannot find is commonly thought to account for 90% of all information on the web. According to Bright Planet's technical white paper, the Deep Web is roughly 500 times the size of the Surface Web and contains more valuable resources, and more than half of Deep Web content is stored in domain-specific databases. Although a vast amount of surface information can be retrieved through ordinary search engines, a considerable amount remains hidden and out of their reach. Deep Web data sources also change constantly, and most hidden information exists only as pages generated in response to dynamic requests, which standard search engines cannot retrieve. Because these dynamically generated pages can only be obtained through Deep Web query interfaces, acquiring Deep Web information is more difficult. To obtain it effectively, Deep Web data sources must be automatically identified and classified. This thesis studies these two key problems in depth. The main research contents are as follows:

(1) Form features of ordinary web pages and Deep Web pages are analyzed. After merging, adding, and filtering candidate features, the thesis arrives at its form feature extraction scheme, which uses control values, the number of controls, and terms carrying semantic information, among other feature values, as classification attributes.

(2) A key problem in Deep Web data integration is studied: identifying query interfaces and determining their class. To address the limitations of the naive Bayes method, a rough set algorithm is used for optimization and reduction. The method builds a group of naive Bayes classifiers through two rounds of random sampling, prunes the group using the attribute reduction method of rough set theory, classifies with the optimized group, and takes a weighted average of the individual results to obtain the final classification. Experiments show that the optimized Bayesian classifier group clearly improves both precision and recall for Deep Web query interface identification and classification.

(3) Identification and classification performance on Deep Web data sources is compared. Several classification methods from data mining, such as C4.5 decision trees and ID3, are compared with the proposed algorithm; the precision and recall results confirm that the method is feasible.

The approach taken in this thesis is to analyze existing related research, study and analyze Deep Web data sources, and, building on existing results, validate the improved algorithm with experimental data. Judging from the experimental results, the method performs satisfactorily. The experiments inevitably have shortcomings, and future work will further refine the related problems and algorithms. Deep Web research still has a long way to go, and the remaining problems need to be solved one by one by researchers.
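
The form feature scheme described in item (1) of the abstract (control values, control counts, and semantically meaningful terms) could be sketched roughly as follows. This is only an illustration under stated assumptions, not the thesis's implementation: the names `FormFeatureParser` and `extract_form_features`, the set of tags treated as controls, and the simple lowercase-word tokenization are all hypothetical choices made for the example.

```python
from collections import Counter
from html.parser import HTMLParser
import re


class FormFeatureParser(HTMLParser):
    """Collects simple per-form statistics from an HTML page (illustrative only)."""

    # Assumption: these tags are treated as form controls.
    CONTROL_TAGS = {"input", "select", "textarea", "button"}

    def __init__(self):
        super().__init__()
        self.forms = []        # one feature dict per <form> element
        self._current = None   # feature dict of the form currently being parsed

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "form":
            self._current = {
                "method": (attrs.get("method") or "get").lower(),
                "controls": Counter(),  # control kind -> count
                "terms": Counter(),     # word -> count
            }
        elif self._current is not None and tag in self.CONTROL_TAGS:
            # <input> controls are distinguished by their type attribute.
            if tag == "input":
                kind = (attrs.get("type") or "text").lower()
            else:
                kind = tag
            self._current["controls"][kind] += 1
            # Control names and values often carry semantic hints.
            for key in ("name", "value", "placeholder"):
                self._add_terms(attrs.get(key))

    def handle_endtag(self, tag):
        if tag == "form" and self._current is not None:
            self.forms.append(self._current)
            self._current = None

    def handle_data(self, data):
        # Visible text inside the form (labels, option text) is tokenized too.
        if self._current is not None:
            self._add_terms(data)

    def _add_terms(self, text):
        if text:
            self._current["terms"].update(re.findall(r"[a-z]+", text.lower()))


def extract_form_features(html):
    """Return a list of feature dicts, one for each form found in the page."""
    parser = FormFeatureParser()
    parser.feed(html)
    parser.close()
    return parser.forms


sample = """
<form method="POST" action="/search">
  Title <input type="text" name="book_title">
  <select name="category"><option>Fiction</option></select>
  <input type="submit" value="Search">
</form>
"""
for form in extract_form_features(sample):
    print(form["method"], dict(form["controls"]), dict(form["terms"]))
```

Feature dicts of this kind could then be turned into attribute vectors for a classifier such as the naive Bayes group the abstract describes; how the thesis actually encodes them is not stated here.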
[Degree-granting institution]: Southwest University
[Degree level]: Master's
[Year degree awarded]: 2013
[Classification number]: TP391.3



Article link: http://sikaile.net/kejilunwen/sousuoyinqinglunwen/2202436.html
