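Research on an Intelligent Web Advertisement Crawler System
Published: 2018-02-04 00:33
Keywords: Web advertising; crawling strategy; information extraction; page segmentation; clustering. Source: Harbin Institute of Technology, master's thesis, 2013. Document type: degree thesis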
[Abstract]: In recent years, as the Internet has come to shape people's daily lives ever more deeply, it has also become a very important advertising medium alongside television and newspapers. Because of its wide coverage and strong interactivity, Web advertising has attracted many advertisers to market on the Internet. A very large volume of advertising data is published on the Internet, and collecting it is worthwhile work, yet no collector currently targets Web advertising data. This thesis proposes and designs a Web advertisement crawler system dedicated to collecting advertising data from the Internet. The work covers three aspects:
(1) A crawling strategy for Web advertising information is designed. The strategy schedules the download order of URL seeds by computing a weight for each seed. Taking into account the types of advertising objects the crawler targets and the ways Web advertisements are delivered, the thesis proposes a weight calculation method for downloaded pages and a weight calculation method for seed links: the weight of each downloaded page is computed first, and the weights of seed links are then derived from it together with some global statistical knowledge.
(2) Based on the observation and analysis of advertising data in a large number of web pages of different types, an extraction method for Web advertising information is designed. Exploiting the locality and clustering that advertising data exhibits within a page, the method uses a clustering algorithm to group all hyperlinks in the page into hyperlink blocks, then applies heuristic rules to determine the category of each block. If a block is judged to be an advertisement block, the advertising data in it is extracted.
(3) Building on these results, an intelligent Web advertisement crawler system is designed and implemented. Starting from preset URL seeds, the system automatically downloads web pages from the Internet and then extracts the advertising data they contain.
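As a rough illustration of the weighted crawling strategy in item (1), the following Python sketch keeps pending URLs in a priority queue ordered by weight. The abstract does not give the actual weight formulas, so everything specific here is an assumption made for illustration: the class name `WeightedFrontier`, the use of the number of ads found on a page as its page weight, the per-host average ad count standing in for "global statistical knowledge", and the blend factor `alpha`.

```python
import heapq
from collections import defaultdict
from urllib.parse import urlparse


class WeightedFrontier:
    """URL frontier that yields the highest-weight pending URL first."""

    def __init__(self, alpha=0.5):
        self.alpha = alpha                   # hypothetical blend factor
        self._heap = []                      # entries: (-weight, order, url)
        self._seen = set()                   # URLs already scheduled
        self._counter = 0                    # tie-breaker: insertion order
        self._host_ads = defaultdict(int)    # global stats: ads per host
        self._host_pages = defaultdict(int)  # global stats: pages per host

    def record_page(self, url, ad_count):
        """Update global statistics and return the page weight.

        Assumption: the page weight is simply the number of ads found.
        """
        host = urlparse(url).netloc
        self._host_ads[host] += ad_count
        self._host_pages[host] += 1
        return float(ad_count)

    def link_weight(self, link, parent_weight):
        """Blend the parent page weight with the target host's average."""
        host = urlparse(link).netloc
        pages = self._host_pages.get(host, 0)
        host_avg = self._host_ads.get(host, 0) / pages if pages else 0.0
        return self.alpha * parent_weight + (1 - self.alpha) * host_avg

    def push(self, url, weight):
        """Schedule a URL once; duplicates are ignored."""
        if url in self._seen:
            return
        self._seen.add(url)
        # heapq is a min-heap, so the weight is negated.
        heapq.heappush(self._heap, (-weight, self._counter, url))
        self._counter += 1

    def pop(self):
        """Return the pending URL with the largest weight."""
        return heapq.heappop(self._heap)[2]

    def __len__(self):
        return len(self._heap)
```

In a crawl loop, each downloaded page would be passed to `record_page`, and every outgoing link pushed with the value returned by `link_weight`. Breadth-first order is the special case in which every weight is equal, since ties fall back to insertion order.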
The experimental results show that, compared with breadth-first and depth-first strategies, the crawling strategy of the intelligent Web advertisement crawler system collects advertising data from the Internet more efficiently, and that the advertising information extraction algorithm extracts the advertising data in web pages accurately.
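The extraction step described in the abstract (group a page's hyperlinks into blocks, then classify each block with heuristic rules) can be sketched as follows. The abstract names neither the clustering algorithm nor the rules, so this sketch swaps in a simple greedy grouping by DOM proximity and a single hypothetical rule (a block counts as an advertisement block when most of its links point to other hosts). The names `LinkCollector`, `cluster_links`, `is_ad_block`, `extract_ads` and the thresholds `max_up`, `min_links`, `min_external` are assumptions, not the thesis's method.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse

# Elements without end tags; they never enter the open-element stack.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img",
             "input", "link", "meta", "source", "track", "wbr"}


class LinkCollector(HTMLParser):
    """Collects (href, ancestor_path) for each <a href> in document order."""

    def __init__(self):
        super().__init__()
        self._stack = []   # (tag, unique id) of currently open elements
        self._next_id = 0
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                path = tuple(node_id for _, node_id in self._stack)
                self.links.append((href, path))
        self._stack.append((tag, self._next_id))
        self._next_id += 1

    def handle_endtag(self, tag):
        # Pop back to the matching open tag; stray end tags are ignored.
        for i in range(len(self._stack) - 1, -1, -1):
            if self._stack[i][0] == tag:
                del self._stack[i:]
                break


def common_prefix_len(a, b):
    """Length of the shared leading part of two ancestor paths."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


def cluster_links(links, max_up=2):
    """Greedy grouping: consecutive links stay in one block when their
    lowest common ancestor is at most `max_up` levels above both."""
    blocks, prev_path = [], None
    for href, path in links:
        if prev_path is not None:
            shared = common_prefix_len(path, prev_path)
            near = (len(path) - shared <= max_up
                    and len(prev_path) - shared <= max_up)
        else:
            near = False
        if near:
            blocks[-1].append(href)
        else:
            blocks.append([href])
        prev_path = path
    return blocks


def is_ad_block(block, page_url, min_links=2, min_external=0.8):
    """Hypothetical rule: mostly off-site links suggests an ad block."""
    page_host = urlparse(page_url).netloc
    external = sum(1 for href in block
                   if urlparse(urljoin(page_url, href)).netloc != page_host)
    return len(block) >= min_links and external / len(block) >= min_external


def extract_ads(html, page_url):
    """Return the absolute link targets of every block judged to be ads."""
    collector = LinkCollector()
    collector.feed(html)
    return [[urljoin(page_url, href) for href in block]
            for block in cluster_links(collector.links)
            if is_ad_block(block, page_url)]
```

Calling `extract_ads(html, page_url)` on a downloaded page returns one list of absolute URLs per detected advertisement block. A real system would likely combine several rules (image sizes, known ad-network hosts, link text), which the abstract does not enumerate.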
[Degree-granting institution]: Harbin Institute of Technology
[Degree level]: Master's
[Year degree conferred]: 2013
[Classification number]: TP393.09;TP391.1
Article ID: 1488791
Article link: http://sikaile.net/wenyilunwen/guanggaoshejilunwen/1488791.html