

Research on a Deep Web Data Source Classification Method Based on Form Features

Published: 2018-03-19 04:00

  Topic: Deep Web. Focus: automatic data source classification. Source: Harbin Engineering University, master's thesis, 2012. Document type: degree thesis.


[Abstract]: The Deep Web holds a vast and still-growing amount of high-quality information. Because Deep Web sources are distributed, heterogeneous, and autonomous, it is very hard for users to find the information they want efficiently and quickly. Classifying Deep Web data sources by domain is the foundation for meeting this challenge, so studying how to organize Deep Web data sources is of great significance.

This thesis collects a large number of Deep Web data sources through Web dictionaries, an automatic Deep Web data source extraction tool developed by the author's research group, and search engines. The sources come from four domains: airline ticket booking, book sales, automobiles, and real estate. Statistics and analysis of more than 200 of these sources yield four observations. First, "topic words" distinguish Deep Web data sources well: in the source code of query interfaces, the vast majority of title tags are non-empty, and some of the words in them tend to appear in only one domain and to some extent reflect the topic of the query interface, that is, the domain it belongs to. Second, query interfaces in the same domain tend to share many similar attributes, whereas interfaces in different domains share few or almost none. Third, within each domain, as the number of Deep Web data sources grows, the total vocabulary of attributes appearing in query interfaces tends to settle at a small size, about 60 on average. Fourth, most Deep Web data sources are structured.

Inspired by these observations, and based on form features (the topic and the form attribute information), this thesis proposes a new Deep Web data source classification method and an improved similarity measure for query interfaces, achieving the goal of automatically organizing large-scale Deep Web data sources by real-world domain. The method consists of four main modules: a preprocessing module, a labeling strategy module, a semi-supervised K-Means clustering module, and a post-classification module. The thesis also proposes a query interface labeling strategy to reduce the effect of randomly selected initial centers. Experimental results show that the method solves the Deep Web data source classification problem effectively and generally, with high accuracy and recall.
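The abstract names a semi-supervised K-Means module seeded by labeled query interfaces, but it does not give the algorithm's details. The following is only a rough sketch of the general idea under stated assumptions: each query interface is reduced to a bag of words taken from its title and attribute labels, similarity is plain cosine similarity (not the thesis's improved measure), and a few hand-labeled interfaces fix the initial centers instead of random selection. All function names and sample forms are hypothetical, and the preprocessing and post-classification modules are not reproduced.

```python
from collections import Counter
import math

def vectorize(title, attributes):
    """Bag of words over the title words and the form attribute labels."""
    words = title.lower().split()
    for attr in attributes:
        words.extend(attr.lower().split())
    return Counter(words)

def cosine(u, v):
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(u[k] * v[k] for k in u if k in v)
    nu = math.sqrt(sum(x * x for x in u.values()))
    nv = math.sqrt(sum(x * x for x in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

def centroid(vectors):
    """Summed term counts; cosine ignores scale, so the sum stands in for the mean."""
    total = Counter()
    for v in vectors:
        total.update(v)
    return total

def seeded_kmeans(vectors, seeds, max_iter=20):
    """seeds maps a domain name to indices of interfaces with a known label.

    Seed interfaces keep their label; every other interface is assigned to
    the most similar center, and centers are recomputed until labels settle.
    """
    domains = sorted(seeds)
    fixed = {i: d for d, idxs in seeds.items() for i in idxs}
    centers = {d: centroid([vectors[i] for i in seeds[d]]) for d in domains}
    labels = [None] * len(vectors)
    for _ in range(max_iter):
        new_labels = [
            fixed.get(i) or max(domains, key=lambda d: cosine(v, centers[d]))
            for i, v in enumerate(vectors)
        ]
        if new_labels == labels:
            break
        labels = new_labels
        centers = {
            d: centroid([v for v, lab in zip(vectors, labels) if lab == d])
            for d in domains
        }
    return labels

# Hypothetical sample interfaces: (title text, attribute labels).
forms = [
    ("Cheap flights booking", ["departure city", "arrival city", "departure date"]),
    ("Online bookstore search", ["title", "author", "isbn"]),
    ("Find flights", ["from city", "to city", "return date"]),
    ("Book search", ["author", "publisher", "keyword"]),
]
vectors = [vectorize(t, a) for t, a in forms]
labels = seeded_kmeans(vectors, {"airfare": [0], "book": [1]})
print(labels)
```

Seeding the centers from labeled interfaces is one common way to avoid the sensitivity to random initial centers that the abstract mentions; whether the thesis's labeling strategy works this way is not stated in the abstract.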
[Degree-granting institution]: Harbin Engineering University
[Degree level]: Master's
[Year degree conferred]: 2012
[Classification number]: TP393.09




Article ID: 1632699


Article link: http://sikaile.net/kejilunwen/sousuoyinqinglunwen/1632699.html

