
Research on Object Extraction in Domain-Specific Object-Level Vertical Search

Published: 2018-04-19 13:47

  Topics: object-level search engine + Web information extraction; Source: University of Electronic Science and Technology of China, master's thesis, 2015


[Abstract]: With the arrival of the information age, information sites of every kind have sprung up on the Internet and offer people a large amount of useful information. This creates a new challenge: how to let users quickly locate the information they need. Search engines arose against this background, and users rely on them to find information quickly. Search engines have evolved from early directory search, which was partly mechanical and partly manual, to today's mainstream full-text and vertical search engines. Even the most mature full-text search technology, however, is still limited in its ability to collect web pages within a single domain, so precision and recall fall short of the ideal. Vertical search collects information within a single domain more effectively, but like full-text search it still serves results at the web-page level, and users have to filter them again. Object-level vertical search emerged as a new search mode in response: it provides object-level search within a specific domain, and the results returned to the user are object entities that the search system has formed through a series of extraction and integration steps. The object information extraction modules of existing object-level search engines are all semi-automatic: a large amount of manual effort is needed up front to annotate a portion of the web pages in order to obtain prior knowledge for object extraction. To address this, the thesis studies and improves the fully automatic Road Runner extraction algorithm, and designs and implements an automatic information extraction module for an object-level vertical search engine. The improvements fall into two areas. (1) The simple tree matching algorithm is improved, raising the accuracy of similarity judgments. The original simple tree matching algorithm treats all tag nodes in a page's DOM tree uniformly and does not take the special nature of iterative (repeated) tags into account; the improved version processes iterative tags before performing the matching comparison. (2) The attribute labeling module of the Road Runner algorithm is improved: associations between the objects extracted by different wrappers are used for cross-labeling, which raises the attribute labeling rate of the extracted data. The attribute labeling technique used by Road Runner itself assumes that attribute values and attribute names appear in pairs in the page, whereas in most web pages some attribute names are missing. Finally, the thesis implements an object information extraction system using the improved algorithms and runs extraction tests in the book domain.
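For readers unfamiliar with the baseline that the abstract refers to, the following is a minimal sketch of the classic simple tree matching procedure, which scores two DOM trees by the size of their largest order-preserving node matching. It does not reproduce the thesis's improved variant: the abstract does not say how iterative tags are processed, so that step is left out. The `Node` class, the function names, and the normalized `similarity` score are illustrative assumptions, not taken from the thesis.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Node:
    """Hypothetical minimal DOM node: a tag name and ordered children."""
    tag: str
    children: List["Node"] = field(default_factory=list)


def simple_tree_matching(a: Node, b: Node) -> int:
    """Number of node pairs in a maximum order-preserving matching of a and b."""
    # Roots with different tags cannot be matched, so nothing below them counts.
    if a.tag != b.tag:
        return 0
    m, n = len(a.children), len(b.children)
    # dp[i][j] = best matching between the first i subtrees of a
    # and the first j subtrees of b (children keep their order).
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            pair = simple_tree_matching(a.children[i - 1], b.children[j - 1])
            dp[i][j] = max(dp[i - 1][j], dp[i][j - 1], dp[i - 1][j - 1] + pair)
    # +1 accounts for the matched roots themselves.
    return dp[m][n] + 1


def tree_size(node: Node) -> int:
    """Total number of nodes in the tree."""
    return 1 + sum(tree_size(c) for c in node.children)


def similarity(a: Node, b: Node) -> float:
    """Assumed normalization: matched nodes relative to the average tree size."""
    return 2 * simple_tree_matching(a, b) / (tree_size(a) + tree_size(b))


# Two list fragments that differ only in how often the <li> tag repeats.
page_a = Node("ul", [Node("li"), Node("li"), Node("li")])
page_b = Node("ul", [Node("li"), Node("li")])
print(simple_tree_matching(page_a, page_b), similarity(page_a, page_b))
```

The example pair shows the weakness that the abstract points to: two lists built from the same template score below 1.0 only because the repeated `li` tag occurs a different number of times, which is the kind of case that handling iterative tags before matching is meant to address.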
[Degree-granting institution]: University of Electronic Science and Technology of China
[Degree level]: Master's
[Year degree granted]: 2015
[Classification number]: TP391.3


Article ID: 1773341


Article link: http://sikaile.net/kejilunwen/sousuoyinqinglunwen/1773341.html

