Research on Automatic Indexing of Web Information
Published: 2018-06-27 01:35
Topic of this article: Web information + automatic indexing; Reference: Zhejiang University, 2014 doctoral dissertation
[Abstract]: The growth of the Internet and the advance of informatization projects have gradually turned Web information into a vast resource space that supports information exchange and sharing and touches every aspect of human life. Web information is massive, unordered, heterogeneous, diverse, and updated in real time. To retrieve needed resources from it quickly and accurately, people have come to recognize the importance of organizing and managing Web information and have begun to explore a range of Web information processing methods, automatic indexing among them. This study starts from the automatic extraction of index terms for Web information and takes the Web coordinate system, the organizational structure of Web pages, and the reading habits of page viewers as its objects of study, in order to identify the specific factors that influence the automatic indexing of Web information. Building on a review of earlier work, it proposes the following idea: using the web page coordinate system, divide each page into several regions with split ratios that depend on the type of site; determine which region each Web information block belongs to and, for each site type, explore the weight that the information blocks in each region carry during automatic indexing; finally, write programs to verify the idea and complete every stage of automatic indexing. The specific steps are as follows:

(1) Implement Web page collection. As the research requires, both batch collection and manual collection of Web pages are implemented, and problems that arise during collection, such as page encoding conversion and HTML-to-XML conversion, are solved.

(2) Using the Web page coordinate system together with the reading habits of page viewers, divide each Web page into 9 regions. Each region occupies a certain proportion of the page. The information blocks within a region are treated as one cluster of information blocks, which share the same indexing weight in later computation and are processed uniformly.

(3) Find suitable page split ratios for different types of websites. Each type of website publishes page information in its own way. News sites, for example, tend to have few pictures and consist mainly of text reports, and most news sites let readers comment on a news item, so page height varies widely. This study selects pages from news, sports, and science sites, tests them with different page split ratios, and identifies a suitable split ratio for each type of site.

(4) Explore the weights of information blocks in different regions during automatic indexing. Viewers of a Web page have a visual focus and particular reading habits, so page designers arrange page information with corresponding emphasis. Whether the importance of the information in each page region can be identified therefore directly affects the accuracy of the later indexing results. Through sample experiments, this study explores the importance of information blocks in different regions of pages from news and science sites and derives, for each type of site, the weights of the regional information blocks in automatic indexing.

(5) Implement automatic indexing of Web pages. Taking the noise and regional characteristics of Web page information into account, and drawing on the strengths of text-based methods, the study presents a method for automatic indexing of Web information and implements and verifies it in a program.

In addition, the study examines how page width and page height correlate with the recall and precision of information extraction under different page split ratios, in the hope of assisting future research in this field. In summary, this study explores the key technologies at each stage of automatic indexing of Web information, discusses suitable split ratios for pages from different types of sites, and examines the relationship between the web page coordinate system and the automatic indexing process. It offers a reference for related research.
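Steps (2), (4) and (5) of the abstract can be illustrated with a minimal Python sketch. It is not the thesis's program: the abstract does not state the actual split ratios, the region weights, or the term extraction method, so the function names `region_of` and `score_terms`, the row-major 3x3 layout, the example ratios and weights, and the whitespace tokenizer are all assumptions made for illustration. Each information block is assumed to be given by the page coordinates of its centre plus its text.

```python
from bisect import bisect_right
from collections import Counter
from itertools import accumulate

THIRDS = (1 / 3, 1 / 3, 1 / 3)  # assumed default: equal split in each direction


def region_of(x, y, page_w, page_h, col_ratios=THIRDS, row_ratios=THIRDS):
    """Return the region index 0..8 (row-major, top-left = 0) containing (x, y).

    col_ratios and row_ratios are hypothetical split ratios; each should
    sum to 1. The origin is assumed to be the top-left corner of the page.
    """
    # Interior boundaries of the grid, in page coordinates.
    col_edges = [page_w * r for r in accumulate(col_ratios)][:-1]
    row_edges = [page_h * r for r in accumulate(row_ratios)][:-1]
    col = bisect_right(col_edges, x)  # 0, 1 or 2
    row = bisect_right(row_edges, y)  # 0, 1 or 2
    return row * 3 + col


def score_terms(blocks, page_w, page_h, region_weights,
                col_ratios=THIRDS, row_ratios=THIRDS):
    """Sum region weights per term over all blocks.

    blocks: iterable of (x, y, text), where (x, y) is the block centre.
    region_weights: nine hypothetical weights, one per region.
    All blocks in a region share that region's weight, as in step (2).
    """
    scores = Counter()
    for x, y, text in blocks:
        weight = region_weights[
            region_of(x, y, page_w, page_h, col_ratios, row_ratios)
        ]
        # Whitespace splitting is a simplification; Chinese text would
        # need a word segmenter instead.
        for term in text.lower().split():
            scores[term] += weight
    return scores


if __name__ == "__main__":
    # Illustrative values only: a 1000 x 1200 page whose top-centre
    # region is weighted highest and whose bottom row is treated as noise.
    weights = [1, 3, 1, 1, 2, 1, 0, 0, 0]
    blocks = [(500, 100, "web indexing"), (900, 900, "advertisement web")]
    print(score_terms(blocks, 1000, 1200, weights).most_common(2))
```

Giving the bottom row a weight of zero is one simple way to express the noise handling mentioned in step (5). Passing different `col_ratios` and `row_ratios` per site type corresponds to the site-specific split ratios of step (3).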
[Degree-granting institution]: Zhejiang University
[Degree level]: Doctoral
[Year degree awarded]: 2014
[Classification number]: TP393.09
Article ID: 2072167
Article link: http://sikaile.net/guanlilunwen/ydhl/2072167.html