Research on Domain-Oriented Deep Web Query Interface Discovery
Topic: Deep Web. Source: Jinan University, master's thesis, 2014
[Abstract]: The Deep Web is the data hidden beneath the Surface Web. It can be retrieved only when a user fills out a form and submits a query, it holds far more data than the Surface Web, and that information is highly valuable. Mining the massive data in the Deep Web has therefore become a research hotspot, and Deep Web information integration is particularly important. The first step in Deep Web data integration is discovering Web databases, that is, discovering their query interfaces. Deep Web data, however, is spread across a large number of Web databases that change constantly, and their interfaces may change with them, which makes acquisition harder. Two technical difficulties stand out. First, Web databases are numerous and widely distributed, so the efficiency of retrieving pages that contain query interfaces needs to improve. Second, query interfaces always appear as forms, but not every form is a query interface, so correctly picking out Deep Web query interfaces and improving classification accuracy is also an urgent problem.
To address these two difficulties, this thesis does the following work.
First, it studies the Deep Web, including its concept, scale, form of existence, and means of access, together with the key problems in query interface discovery, and sets out the research direction and content of the thesis.
Second, it analyzes the techniques used in query interface discovery, including the commonly used DOM parsing and heuristic rules, and then analyzes and compares the main query interface discovery algorithms.
Third, to address the efficiency of domain-oriented Deep Web query interface acquisition, it proposes a query interface discovery algorithm with single-threaded and multithreaded variants and compares them experimentally. The results show that the multithreaded variant is markedly more efficient.
Finally, to correctly pick out Deep Web query interfaces from the collected web forms, it builds on earlier work and proposes a K-nearest-neighbor algorithm based on heuristic rules. For experimental validation, query interfaces and non-query interfaces were gathered from multiple sources and multiple domains and tested separately. The results show that the algorithm clearly improves the ability to identify Deep Web query interfaces. In the book-domain case in particular, both precision and recall improve noticeably.
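The abstract does not give the actual rules, features, or training data behind the heuristic-rule-based K-nearest-neighbor classifier, so the following is only a minimal sketch of the general idea under stated assumptions. The feature vector (counts of text boxes, dropdowns, checkboxes/radio buttons, and text areas), the two filtering rules, and the tiny labeled set are all hypothetical and chosen for illustration. Forms are extracted with Python's standard `html.parser`, standing in for the DOM parsing step.

```python
import math
from collections import Counter
from html.parser import HTMLParser


class FormFeatureParser(HTMLParser):
    """Count input controls inside each <form> element of a page."""

    def __init__(self):
        super().__init__()
        self.forms = []       # one Counter of control types per form
        self._current = None  # Counter for the form being parsed, if any

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "form":
            self._current = Counter()
        elif self._current is not None:
            if tag == "input":
                # An <input> without a type attribute defaults to "text".
                self._current[(attrs.get("type") or "text").lower()] += 1
            elif tag in ("select", "textarea"):
                self._current[tag] += 1

    def handle_endtag(self, tag):
        if tag == "form" and self._current is not None:
            self.forms.append(self._current)
            self._current = None


def passes_rules(form):
    """Hypothetical heuristic pre-filter (not the thesis's actual rules)."""
    if form["password"] > 0:  # assume password fields mean login/registration
        return False
    # Assume a query interface needs at least one text box or dropdown.
    return form["text"] + form["search"] + form["select"] > 0


def to_vector(form):
    """Hypothetical feature vector: (text boxes, dropdowns, choices, textareas)."""
    return (form["text"] + form["search"], form["select"],
            form["checkbox"] + form["radio"], form["textarea"])


def knn_is_query_interface(vec, training, k=3):
    """Majority vote among the k nearest labeled examples (Euclidean distance)."""
    nearest = sorted(training, key=lambda item: math.dist(vec, item[0]))[:k]
    votes = sum(1 for _, is_query in nearest if is_query)
    return votes * 2 > len(nearest)


# Made-up labeled examples: (feature vector, is_query_interface).
TRAINING = [
    ((1, 2, 0, 0), True), ((2, 2, 1, 0), True), ((2, 3, 0, 0), True),
    ((1, 0, 0, 0), False), ((1, 0, 1, 0), False), ((2, 0, 0, 1), False),
]


def classify_forms(html, training=TRAINING, k=3):
    """Return one boolean per form: True if it looks like a query interface."""
    parser = FormFeatureParser()
    parser.feed(html)
    return [passes_rules(f) and knn_is_query_interface(to_vector(f), training, k)
            for f in parser.forms]


PAGE = """
<form action="/login"><input type="text" name="user">
  <input type="password" name="pw"></form>
<form action="/search"><input type="text" name="title">
  <select name="category"><option>All</option></select>
  <select name="language"></select><input type="submit"></form>
"""
print(classify_forms(PAGE))  # [False, True]
```

In this sketch the rules act as a cheap filter that discards obvious non-query forms before the distance-based vote, which is one plausible way to combine heuristic rules with K-nearest-neighbor. A real system would need a much larger labeled set, ideally collected per domain as the abstract describes.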
[Degree-granting institution]: Jinan University
[Degree level]: Master's
[Year degree granted]: 2014
[Classification number]: TP393.09
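[Similar literature]
Related journal articles (top 10)
1 Zheng Dongdong; Cui Zhiming. Deep Web Query Interface Selection [J]. Journal of Computer Applications, 2006, No. 09
2 Zhou Aiwu; Li Yumei; Zhou Shanshan; Wang Baotong. Deep Web Query Interface Identification Based on Returned Results [J]. Computer Technology and Development, 2009, No. 07
3 Wang Caixia; Gao Ming. Deep Web Query Interfaces and Their Identification Algorithms [J]. Computer Knowledge and Technology, 2011, No. 22
4 Li Qihui. Research on Techniques for Determining Deep Web Query Interfaces [J]. Computer & Digital Engineering, 2009, No. 03
5 Yang Lihua. Rule-Based Extraction of Deep Web Query Interfaces [J]. Computer Knowledge and Technology, 2010, No. 01
6 Qian Cheng; Yang Xiaolan. Research on Deep Web Query Interfaces [J]. Computer and Modernization, 2012, No. 06
7 Li Xueling; Shi Huaji; Lan Jun; Li Xingyi. Deep Web Query Interface Determination Based on Decision Trees and Link Similarity [J]. Application Research of Computers, 2011, No. 11
8 Xu Hexiang; Wang Shuyun; Hu Yunfa. Ontology-Based Classification of Deep Web Query Interfaces [J]. Journal of Chinese Computer Systems, 2008, No. 10
9 Dong Yongquan; Li Qingzhong; Ding Yanhui; Zhang Yongxin. A Deep Web Query Interface Matching Method Based on Evidence Theory and Task Assignment [J]. Pattern Recognition and Artificial Intelligence, 2011, No. 02
10 Cui Xiaojun; Peng Zhiyong; Zeng Cheng. Automatic Annotation of Deep Web Query Results Based on Multiple Annotation Sources [J]. Journal of Computer Applications, 2009, No. 01
Related conference papers (top 1)
1 Wang Ying; Zuo Wanli; Peng Tao; He Fengling; Peng Zhao. Integration of Domain-Specific Deep Web Query Interfaces [A]. Proceedings of the 25th National Database Conference of China (II) [C], 2008
Related doctoral dissertations (top 1)
1 Zhang Huibin. Research on Deep Web Query Interfaces and Query Result Extraction [D]. Nankai University, 2010
Related master's theses (top 5)
1 Tang Bo. Research and Design of a Concept-Lattice-Based Deep Web Query Interface Modeling System [D]. Xidian University, 2013
2 Li Zhenxing. Research on Domain-Oriented Deep Web Query Interface Discovery [D]. Jinan University, 2014
3 Chen Yabing. Domain-Based Deep Web Query Interface Extraction [D]. South China University of Technology, 2011
4 Zhang [given name illegible]. Domain-Specific Deep Web Query Integration and Result Extraction [D]. Fudan University, 2008
5 Cao Qinghuang. Research on Deep Web Query Interface Matching Techniques [D]. Jiangsu University, 2009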
Article ID: 1897994
Article link: http://sikaile.net/guanlilunwen/ydhl/1897994.html