Research on Denoising Techniques for Tibetan Web Pages
Published: 2018-12-07 17:24
[Abstract]: With the rapid development of network information technology and the steady improvement of computer application technology in Tibetan areas, more and more Tibetan web pages are appearing on the Internet. They tell us more about the cultural life and folk customs of the Tibetan people, strengthen exchanges between us, and promote the development of Tibetan areas. However, the useful information in a Tibetan web page is often surrounded by noise such as pop-up advertisements, superfluous images, and irrelevant links. This noise seriously reduces the efficiency of obtaining useful information from Tibetan web pages, and removing it effectively has become an urgent problem in Tibetan information processing. This thesis surveys a large number of existing web page denoising techniques and the content types of Tibetan web pages, and studies the characteristics of DOM technology and its main operating specifications. On this basis, it proposes a denoising technique for Tibetan web pages that combines the DOM with display attributes. By analyzing the latent behavior of people as they read and browse web content, the technique derives how page elements fall into blocks according to their display attributes, adopts a display-attribute block model, and demonstrates the model on an example page. A Tibetan web page is parsed into a DOM tree, its content is analyzed with the display attributes and the block model, and the page is denoised through a sequence of display-block partitioning, merging and deletion of DOM nodes, and simplification of the DOM tree. The core step of the technique is extracting the display attributes of the nodes in the page's DOM tree, so DOM parsing of Tibetan web pages must be implemented first.
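As a rough illustration of parsing a page into a DOM tree and reading display attributes from its nodes, the Java sketch below walks a tree and lists each element with a few presentation-related attributes. It is not the thesis's parser. It assumes well-formed XHTML input, because it relies on the JDK's built-in XML parser, which does not repair broken tags. The class name `DisplayAttributeWalker` and the attribute set in `DISPLAY_ATTRS` are hypothetical choices made for this example.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

final class DisplayAttributeWalker {
    // Hypothetical set of "display attributes"; the thesis may use a different set.
    static final String[] DISPLAY_ATTRS = {"width", "height", "align", "bgcolor", "style"};

    // Parses well-formed XHTML into a DOM tree (assumption: no tag repair is needed).
    static Document parse(String xhtml) {
        try {
            return DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xhtml)));
        } catch (Exception e) {
            throw new IllegalArgumentException("input is not well-formed XHTML", e);
        }
    }

    // Depth-first walk: one output line per element, indented by depth,
    // followed by whichever display attributes the element carries.
    static List<String> describe(Node node, int depth, List<String> out) {
        if (node.getNodeType() == Node.ELEMENT_NODE) {
            Element el = (Element) node;
            StringBuilder line = new StringBuilder();
            for (int i = 0; i < depth; i++) {
                line.append("  ");
            }
            line.append(el.getTagName());
            for (String name : DISPLAY_ATTRS) {
                if (el.hasAttribute(name)) {
                    line.append(' ').append(name).append('=').append(el.getAttribute(name));
                }
            }
            out.add(line.toString());
            depth++;
        }
        for (Node child = node.getFirstChild(); child != null; child = child.getNextSibling()) {
            describe(child, depth, out);
        }
        return out;
    }

    public static void main(String[] args) {
        String page = "<html><body><table width=\"760\" bgcolor=\"#ffffff\">"
                + "<tr><td align=\"left\">text</td></tr></table></body></html>";
        for (String line : describe(parse(page), 0, new ArrayList<String>())) {
            System.out.println(line);
        }
    }
}
```

Real pages are rarely well-formed, which is why a tolerant, tag-repairing parser is needed in practice. This sketch only shows the tree walk and attribute lookup.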
After an in-depth study of many web page parsing techniques, a DOM parser for Tibetan web pages was developed in Java on the Eclipse platform. It parses a Tibetan HTML page into a tree of DOM nodes, each of which carries the complete tag attributes of the HTML document, so the display attributes of any information block on the page can be extracted on demand. The parser also provides simple browser functionality: it can parse a Tibetan web page directly from an entered URL, or parse page source that has been downloaded to the local computer. It has strong tag recognition and repair capabilities and is suitable for most Tibetan web pages. In addition, by analyzing the characteristics of Tibetan web page information, the thesis proposes identifying noise blocks by the frequency of the Tibetan syllable delimiter (tsheg) and the hyperlink ratio of the page, which effectively identifies the noise blocks in most Tibetan web pages. Finally, filtering the DOM nodes of the retained useful blocks completes the denoising of the page. Extensive testing shows that the technique effectively removes most of the noise from Tibetan web pages and has good practical value and application prospects.
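The noise-block test based on syllable-point frequency and hyperlink rate can be sketched as follows. This is a minimal illustration under stated assumptions, not the thesis's actual rule. The syllable point is taken to be the Tibetan tsheg (U+0F0B). The hyperlink rate is assumed to be the share of a block's characters that sit inside links. Both thresholds and the class name `NoiseBlockHeuristic` are hypothetical.

```java
final class NoiseBlockHeuristic {
    // Tibetan intersyllabic mark (tsheg), code point U+0F0B.
    static final char TSHEG = '\u0F0B';

    // Hypothetical thresholds; the thesis does not state its values here.
    static final double MIN_TSHEG_RATIO = 0.10;
    static final double MAX_LINK_RATIO = 0.50;

    // Fraction of the block's characters that are tsheg marks.
    static double tshegRatio(String blockText) {
        if (blockText.isEmpty()) {
            return 0.0;
        }
        int count = 0;
        for (int i = 0; i < blockText.length(); i++) {
            if (blockText.charAt(i) == TSHEG) {
                count++;
            }
        }
        return (double) count / blockText.length();
    }

    // Assumed definition: characters inside links divided by all characters in the block.
    static double linkRatio(String blockText, String linkText) {
        if (blockText.isEmpty()) {
            return 0.0;
        }
        return (double) linkText.length() / blockText.length();
    }

    // A block is flagged as noise if it has too few tsheg marks or is mostly link text.
    static boolean isNoise(String blockText, String linkText) {
        return tshegRatio(blockText) < MIN_TSHEG_RATIO
                || linkRatio(blockText, linkText) > MAX_LINK_RATIO;
    }

    public static void main(String[] args) {
        // Two Tibetan syllables, each followed by a tsheg (8 characters, 2 tsheg marks).
        String tibetan = "\u0F56\u0F7C\u0F51\u0F0B\u0F61\u0F72\u0F42\u0F0B";
        System.out.println(isNoise(tibetan, ""));           // body text, no links
        System.out.println(isNoise(tibetan, tibetan));      // block that is entirely link text
        System.out.println(isNoise("Click here", "Click")); // non-Tibetan block
    }
}
```

The intuition is that running Tibetan text has a tsheg after nearly every syllable, so body text shows a high tsheg density, while navigation bars and advertisement blocks tend to be link-heavy or contain little Tibetan text.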
[Degree-granting institution]: Northwest Minzu University (西北民族大學)
[Degree level]: Master's
[Year degree granted]: 2010
[Classification number]: TP393.092
Article ID: 2367555
Article link: http://sikaile.net/wenyilunwen/guanggaoshejilunwen/2367555.html