Design and Implementation of a Web Crawler for a Price-Comparison Shopping Platform
Published: 2018-12-31 08:10
[Abstract]: With the spread and development of information technology, the Internet has reached every corner of people's lives and work. Search engines have become the fastest way to obtain information, and online shopping has become a way of life accepted by more and more people. However, online goods come in a huge variety, prices differ widely, and sellers vary in quality, so consumers have to spend a great deal of time browsing the major shopping sites, comparing prices, and weighing value for money. Users therefore want a system that helps them choose products: one that holds information on the best-selling products of the major mainstream shopping sites, so that a simple search shows which site sells an item most cheaply and with the best value for money. A price-comparison shopping platform is a good solution. For such a platform, how to obtain so large a volume of product data and price information is a crucial problem. Against this background, this thesis proposes a solution for the platform's data source: the design and implementation of a web crawler. The thesis studies how to design and implement the crawler's functions, extending and customizing certain features of the Heritrix web crawler. It discusses the following issues in depth: (1) determining the seed links, which give the crawler an entry point for crawling; (2) the page-fetching method, which saves pages that meet the requirements to a local folder; (3) analyzing and extracting page content, which pulls out the information related to product attributes; (4) structuring and storing the data, which extracts the product attributes item by item and stores them in a database; (5) presenting the product data for price comparison.
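The five steps listed in the abstract can be illustrated with a small sketch. This is not the thesis's Heritrix-based implementation (Heritrix is a Java crawler); it is a minimal Python stand-in under several assumptions: the page layout (`<h1 class="name">` and `<span class="price">`), the `product` table, the function names `crawl` and `cheapest`, and the example URLs are all hypothetical, and the network is replaced by an in-memory dictionary so the example runs offline. Saving fetched pages to a local folder, which the abstract mentions in step (2), is omitted for brevity.

```python
import re
import sqlite3
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkParser(HTMLParser):
    """Collects the href value of every <a> tag on a page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)


# Assumed (hypothetical) product-page markup; real sites need per-site rules.
NAME_RE = re.compile(r'<h1 class="name">(.*?)</h1>', re.S)
PRICE_RE = re.compile(r'<span class="price">(\d+(?:\.\d+)?)</span>')


def crawl(seeds, fetch, db, max_pages=100):
    """Breadth-first crawl; fetch(url) returns HTML text or None."""
    db.execute(
        "CREATE TABLE IF NOT EXISTS product "
        "(url TEXT PRIMARY KEY, name TEXT, price REAL)"
    )
    queue, seen = deque(seeds), set(seeds)      # step 1: seed links
    pages = 0
    while queue and pages < max_pages:
        url = queue.popleft()
        html = fetch(url)                       # step 2: fetch the page
        if html is None:
            continue
        pages += 1
        name = NAME_RE.search(html)             # step 3: extract attributes
        price = PRICE_RE.search(html)
        if name and price:                      # step 4: structure and store
            db.execute(
                "INSERT OR REPLACE INTO product VALUES (?, ?, ?)",
                (url, name.group(1).strip(), float(price.group(1))),
            )
        parser = LinkParser()
        parser.feed(html)
        for href in parser.links:               # enqueue newly found links
            link = urljoin(url, href)
            if link not in seen:
                seen.add(link)
                queue.append(link)
    db.commit()


def cheapest(db, name):
    """Step 5: return (url, price) of the lowest-priced listing, or None."""
    return db.execute(
        "SELECT url, price FROM product WHERE name = ? ORDER BY price LIMIT 1",
        (name,),
    ).fetchone()


# In-memory stand-in for two shopping sites (hypothetical URLs and content).
PAGES = {
    "http://shop-a.example/": '<a href="/item1">item</a>',
    "http://shop-a.example/item1":
        '<h1 class="name">Kettle</h1><span class="price">25.50</span>',
    "http://shop-b.example/": '<a href="item9">item</a>',
    "http://shop-b.example/item9":
        '<h1 class="name">Kettle</h1><span class="price">19.90</span>',
}

db = sqlite3.connect(":memory:")
crawl(["http://shop-a.example/", "http://shop-b.example/"], PAGES.get, db)
print(cheapest(db, "Kettle"))
```

Passing `fetch` as a parameter keeps the crawl loop independent of how pages are retrieved, so the dictionary lookup used here could be swapped for a real HTTP client without changing the rest of the sketch.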
[Degree-granting institution]: East China University of Science and Technology
[Degree level]: Master's
[Year degree awarded]: 2013
[Classification number]: TP391.3
Article ID: 2396304
[Citing literature]
Related journal articles (top 1)
1 董浩然; 謝歡; 陳鵬; 洪中華; 童小華. Online real estate valuation system based on a GIS topical crawler and its optimization [J]. 地理信息世界 (Geomatics World), 2016, No. 02.
Article link: http://sikaile.net/kejilunwen/sousuoyinqinglunwen/2396304.html