Research on Automatic Identification and Classification of Deep Web Data Sources
[Abstract]: The Deep Web, also known as the invisible or hidden web, is often described as the web information that Google cannot search, that is, content that the standard search engines familiar to us cannot reach. It is generally believed that the information search engines cannot find accounts for about 90% of all information on the network. According to BrightPlanet's technical white paper, the resource capacity of the Deep Web is about 500 times that of the Surface Web, and it contains more valuable resources. More than half of Deep Web content is stored in specialized databases. Ordinary search engines can query the massive amount of surface information, but a large amount of information remains out of their reach because it is hidden deeper, and Deep Web data sources are also constantly changing. Most hidden information is generated only in response to dynamic requests, so standard search engines cannot find it. Because the pages generated by these dynamic requests must be obtained through Deep Web query interfaces, Deep Web information is harder to acquire. To obtain it effectively, Deep Web data sources must be identified and classified automatically.
This thesis studies the automatic identification and classification of Deep Web data sources. The main research contents are as follows:
(1) Form feature analysis. The form features of ordinary web pages and Deep Web pages are analyzed. After merging, adding to, and screening candidate features, the thesis adopts a form feature extraction scheme in which a series of feature values, including the values of each control, the number of controls, and entries carrying semantic information, are used as classification attributes.
(2) Identification and classification of query interfaces, a key problem in Deep Web data integration. To address the limitations of the naive Bayes method, rough set attribute reduction is used for optimization. A group of naive Bayes classifiers is built by two rounds of random sampling, the rough set attribute reduction method is used to reduce the classifier group, and the optimized group is then used for classification. The individual results are combined by weighted averaging to obtain the final classification. Experimental results show that the optimized Bayesian classifier group clearly improves precision and recall for identifying and classifying Deep Web query interfaces.
(3) Performance comparison for Deep Web data source identification and classification. Several data mining classification methods, such as the C4.5 and ID3 decision trees, are analyzed and compared with the proposed algorithm. The results show that the proposed method is feasible in terms of recall and precision.
The approach taken in this thesis is to review existing related research, study and analyze Deep Web data sources, improve the algorithm on the basis of existing results, and verify its effectiveness with experimental data. The experimental results show that the method is satisfactory. Future work will further study the related problems and algorithms. Deep Web research still has a long way to go, and the open problems remain to be solved one by one by researchers.
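As an illustration only, the classifier group described in the abstract can be sketched in Python. This is not the thesis's implementation, and it rests on several assumptions: the "two rounds of random sampling" are read here as sampling training instances and sampling attributes, each member's weight is its accuracy on the labelled data, and the rough set reduction step is omitted entirely. The feature names (password field, number of text inputs, search keyword) and the toy data are hypothetical.

```python
import math
import random
from collections import Counter, defaultdict


class NaiveBayes:
    """Categorical naive Bayes with add-one smoothing over a subset of attributes."""

    def __init__(self, attrs):
        self.attrs = attrs  # indices of the attributes this member looks at

    def fit(self, rows, labels):
        self.class_counts = Counter(labels)
        self.total = len(labels)
        self.value_counts = defaultdict(Counter)  # (attr, label) -> value -> count
        self.values = defaultdict(set)            # attr -> values seen in training
        for row, label in zip(rows, labels):
            for a in self.attrs:
                self.value_counts[(a, label)][row[a]] += 1
                self.values[a].add(row[a])
        return self

    def predict_proba(self, row):
        log_scores = {}
        for c, n_c in self.class_counts.items():
            score = math.log(n_c / self.total)
            for a in self.attrs:
                count = self.value_counts[(a, c)][row[a]]
                # "+ 1" in the denominator reserves mass for unseen values
                score += math.log((count + 1) / (n_c + len(self.values[a]) + 1))
            log_scores[c] = score
        top = max(log_scores.values())
        exp = {c: math.exp(s - top) for c, s in log_scores.items()}
        norm = sum(exp.values())
        return {c: v / norm for c, v in exp.items()}

    def predict(self, row):
        proba = self.predict_proba(row)
        return max(proba, key=proba.get)


def train_group(rows, labels, n_members=15, n_attrs=2, seed=0):
    """Build a weighted group of naive Bayes members (assumed sampling scheme)."""
    rng = random.Random(seed)
    n = len(rows)
    group = []
    for _ in range(n_members):
        idx = [rng.randrange(n) for _ in range(n)]        # round 1: sample instances
        attrs = rng.sample(range(len(rows[0])), n_attrs)  # round 2: sample attributes
        member = NaiveBayes(attrs).fit([rows[i] for i in idx],
                                       [labels[i] for i in idx])
        # Assumption: weight each member by its accuracy on the labelled data.
        hits = sum(member.predict(r) == y for r, y in zip(rows, labels))
        group.append((member, hits / n))
    return group


def classify(group, row):
    """Combine the members' class probabilities by weighted averaging."""
    totals = Counter()
    for member, weight in group:
        for c, p in member.predict_proba(row).items():
            totals[c] += weight * p
    return max(totals, key=totals.get)


# Hypothetical form features: (has_password_field, text_input_count, has_search_keyword)
base = [
    (("no", "few", "yes"), "query"), (("no", "many", "yes"), "query"),
    (("no", "few", "yes"), "query"), (("no", "many", "yes"), "query"),
    (("yes", "few", "no"), "other"), (("yes", "few", "no"), "other"),
    (("no", "few", "no"), "other"), (("yes", "many", "no"), "other"),
]
rows = [features for features, _ in base] * 3
labels = [label for _, label in base] * 3

group = train_group(rows, labels)
print(classify(group, ("no", "many", "yes")))
```

In a fuller version, the rough set reduction the abstract mentions would sit between `train_group` and `classify`, pruning the group before the weighted vote.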
[Degree-granting institution]: Southwest University
[Degree level]: Master's
[Year degree granted]: 2013
[Classification number]: TP391.3