
Research on Automatic Indexing of Web Information

Published: 2018-06-27 01:35

  Topic: Web information + automatic indexing; Source: Zhejiang University, 2014 doctoral dissertation


[Abstract]: The growth of the Internet and the advance of informatization projects have turned Web information into a vast resource space that supports information exchange and sharing and touches every aspect of human life. Web information is massive, unordered, heterogeneous, continuously updated, and diverse. To retrieve the resources they need from it quickly and accurately, people have come to recognize the importance of organizing and managing Web information and have begun to explore a range of Web information processing methods, of which automatic indexing is one. This study takes the automatic extraction of index terms from Web information as its starting point and examines the Web coordinate system, the organizational structure of Web pages, and the reading habits of page visitors in order to identify the specific factors that influence automatic indexing of Web information. Building on previous work, it proposes the following hypothesis: using the page coordinate system, divide a Web page into several regions with split ratios that depend on the site type; determine which region each Web information block belongs to; and, for each site type, determine the weight that the information blocks of each region carry in automatic indexing. Programs were then written to test this hypothesis and to carry out each stage of automatic indexing. The specific steps are as follows. (1) Web page collection. Both batch and manual collection of Web pages were implemented as the research required, solving problems that arise during collection such as page encoding conversion and HTML-to-XML conversion. (2) Using the Web page coordinate system together with the reading habits of page visitors, each Web page is divided into nine regions.
Each region occupies a certain proportion of the page, and the information blocks within a region are treated as one cluster: they share the same indexing weight in later computation and are processed together. (3) Finding suitable page split ratios for different types of websites. Different types of websites publish page information in their own ways. News sites, for example, tend to have few images and consist mainly of text reports; most of them also let visitors comment on a news item, which makes page height vary widely. Pages from news, sports, and science sites were tested with different split ratios to find a suitable ratio for each site type. (4) Determining the weights of information blocks in different regions during automatic indexing. Visitors bring visual focal points and reading habits to a Web page, and page designers accordingly place information with deliberate emphasis. Whether the relative importance of information in different page regions can be identified therefore directly affects the accuracy of the indexing results. Through sample experiments, the importance of information blocks in different regions of news and science site pages was examined, and region weights for automatic indexing were derived for each of these site types. (5) Automatic indexing of Web pages. Taking Web page noise and region characteristics into account, and drawing on text-based methods, a method for automatic indexing of Web information is proposed, then implemented and verified in a program.
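The region-based weighting described in steps (2) to (5) can be illustrated with a minimal Python sketch. It is not the thesis's implementation: the `Block` type, the function names, the split ratios, and the region weights are all hypothetical placeholders, and whitespace tokenization stands in for whatever term extraction the author used.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass
class Block:
    """A text block located by the pixel coordinates of its top-left corner."""
    x: float
    y: float
    text: str


def region_of(block, page_w, page_h, col_ratios, row_ratios):
    """Return the row-major index (0..8 for a 3x3 grid) of the block's region."""
    def band(pos, size, ratios):
        # Walk the cumulative split points; the last band absorbs any overflow.
        edge = 0.0
        for i, ratio in enumerate(ratios):
            edge += ratio * size
            if pos < edge:
                return i
        return len(ratios) - 1

    col = band(block.x, page_w, col_ratios)
    row = band(block.y, page_h, row_ratios)
    return row * len(col_ratios) + col


def index_terms(blocks, page_w, page_h, col_ratios, row_ratios, weights, top_n=3):
    """Score each term by summing the weight of every region it occurs in."""
    scores = Counter()
    for block in blocks:
        w = weights[region_of(block, page_w, page_h, col_ratios, row_ratios)]
        for term in block.text.lower().split():
            scores[term] += w
    return [term for term, _ in scores.most_common(top_n)]


# Hypothetical split ratios and weights for one site type (not from the thesis).
RATIOS = (0.2, 0.6, 0.2)
WEIGHTS = [0.5, 2.0, 0.5, 0.1, 3.0, 0.1, 0.1, 0.2, 0.1]

blocks = [
    Block(500, 100, "indexing"),                # top-center region
    Block(50, 500, "ads ads ads"),              # left sidebar, low weight
    Block(500, 500, "automatic indexing web"),  # center region, highest weight
]
terms = index_terms(blocks, 1000, 1000, RATIOS, RATIOS, WEIGHTS)
```

Because every block in a region shares one weight, changing the split ratios or the weight table for a site type changes the ranking without touching the scoring logic, which matches the abstract's idea of treating each region as one cluster.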
In addition, this paper examines how page width and page height correlate with the recall and precision of information extraction under different page split ratios, in the hope of assisting future research in this field. In summary, this paper explores the key technologies at each stage of automatic indexing of Web information, discusses suitable split ratios for pages of different site types, and studies the relationship between the page coordinate system and the automatic indexing process; it may serve as a reference for related research.
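To make the correlation analysis mentioned above concrete, the following Python sketch computes precision and recall for one page's extracted terms and a Pearson correlation coefficient between two series, such as page heights and per-page recall. The abstract does not say which correlation measure was used, so Pearson's r is an assumption here, and the function names are hypothetical.

```python
import math


def precision_recall(extracted, relevant):
    """Return (precision, recall) of extracted items against a reference set."""
    extracted, relevant = set(extracted), set(relevant)
    hits = len(extracted & relevant)
    precision = hits / len(extracted) if extracted else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric series."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two series of equal length with at least 2 points")
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    if spread_x == 0 or spread_y == 0:
        raise ValueError("correlation is undefined for a constant series")
    return cov / (spread_x * spread_y)
```

In use, one would compute `precision_recall` per page for a given split ratio, then pass the list of page heights (or widths) and the list of recall values to `pearson`.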
[Degree-granting institution]: Zhejiang University
[Degree level]: Doctoral
[Year conferred]: 2014
[Classification number]: TP393.09


Article ID: 2072167


Article link: http://sikaile.net/guanlilunwen/ydhl/2072167.html

