
Design and Implementation of a Student Apartment Listing Data Collection Platform

Published: 2018-09-08 20:23

[Abstract]: 留彼工坊 Technology Co., Ltd. is an O2O (online-to-offline) Internet startup that provides student apartment rental information to international students in the United Kingdom. In this business model, the company has to offer users a good service experience while obtaining the apartment information it needs quickly and accurately. At present, its listing data comes from partner institutions such as Unite-Students and from competing platforms; apartment details and rental information are exchanged by email and updated by hand. Email is inefficient and costly to manage, and during the peak rental season the remaining availability and tenancy periods change frequently. The business therefore needs a more automated way to synchronize listing information between platforms so that its apartment data stays current and accurate. Web scraping is an effective means to that end.

Although the information structure of an apartment is broadly the same across platforms, the details of the pages that display it differ from site to site. Given these site-specific collection requirements, the key problems of this project are how to reduce the effort of writing crawlers and lower production cost: how to design the overall system architecture, control the complexity of the crawler modules, decouple module functions, and clean, structure, and import the data.

During my internship at the company, I took part in developing the back-end data center for apartments. With reference to the company's earlier, unfinished crawler application based on Pyspider, I redeveloped the system on top of Scrapy. To distinguish it from the main site back end, Livety, the data center is called Sharingan. Livety selects the listing data to display on the front-end pages and manages users. Sharingan serves mainly as the listing database, storing and managing the structured listing data collected from the different platforms; it is also the scheduling and deployment platform for the web crawlers and carries out a series of data processing tasks. The two back ends communicate through a message system so that they remain loosely coupled.

My specific contributions to the project were:
(1) I took part in modeling the relational schema of the listing database. After studying the business requirements and the student apartment rental information on each platform in depth, I defined a structured data storage model. This work provides the basis and the specification for the structured extraction, import, and storage of listing data.
(2) I took part in designing the data center system architecture. Based on the overall requirements and on the practical experience gained from the legacy crawler system, I established a general pattern for collecting and extracting web page data and settled the architecture, frameworks, technologies, and module integration scheme of the new system. This clarified the development requirements, the system architecture design, and the outline design of the internal modules.
(3) I was responsible for implementing specific modules and for developing and integrating the subsystems, including the Fragment, Processor, and Validator modules of the Scrapy crawler, the Spider scheduling and monitoring modules, the database import module, and the data center message system. The result is a complete system that is ready for initial use.
(4) I was responsible for writing the related tests to ensure that the system runs correctly. Testing uncovered program errors in the system and its modules, which I then fixed.

Since its initial launch the system has been running well. It currently collects data from each platform on a schedule and supplies apartment data to the internal display system. Its extensibility lays the foundation for a more general collection platform that covers more kinds of data.
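The abstract names Fragment, Processor, and Validator modules and a database import module, but it does not describe their interfaces. The following is only a minimal sketch of how such a decoupled extraction pipeline might fit together. It uses the Python standard library rather than Scrapy, and every name in it (the `Fragment` class, `clean_text`, `parse_price`, `validate`, `import_item`, the regular expressions, and the `listing` table) is a hypothetical stand-in, not the thesis's actual code.

```python
import re
import sqlite3
from dataclasses import dataclass
from typing import Optional


@dataclass
class Fragment:
    """Hypothetical fragment: one named field located by a regex with one group."""
    name: str
    pattern: str

    def extract(self, html: str) -> Optional[str]:
        match = re.search(self.pattern, html, re.DOTALL)
        return match.group(1) if match else None


def clean_text(value: str) -> str:
    """Hypothetical processor: collapse runs of whitespace."""
    return " ".join(value.split())


def parse_price(value: str) -> float:
    """Hypothetical processor: pull the number out of text such as '£165 per week'."""
    match = re.search(r"\d+(?:\.\d+)?", value.replace(",", ""))
    if match is None:
        raise ValueError(f"no number found in {value!r}")
    return float(match.group())


# Per-site configuration: only this table changes when a page layout differs.
FIELDS = [
    (Fragment("name", r"<h1[^>]*>(.*?)</h1>"), clean_text),
    (Fragment("weekly_price", r'class="price"[^>]*>(.*?)<'), parse_price),
]


def scrape(html: str) -> dict:
    """Run every fragment, then its processor, to build one structured item."""
    item = {}
    for fragment, processor in FIELDS:
        raw = fragment.extract(html)
        item[fragment.name] = processor(raw) if raw is not None else None
    return item


def validate(item: dict) -> list:
    """Hypothetical validator: return a list of problems (empty means valid)."""
    errors = []
    if not item.get("name"):
        errors.append("name is missing")
    price = item.get("weekly_price")
    if price is None or price <= 0:
        errors.append("weekly_price must be positive")
    return errors


def import_item(conn: sqlite3.Connection, item: dict) -> None:
    """Hypothetical import step: upsert a validated item into the listing table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS listing (name TEXT PRIMARY KEY, weekly_price REAL)"
    )
    conn.execute("INSERT OR REPLACE INTO listing VALUES (:name, :weekly_price)", item)
```

A caller would run `scrape` on a downloaded page, import the item only if `validate` returns an empty list, and otherwise log the errors. Keeping the site-specific parts in the `FIELDS` table is one way to limit the per-site work to configuration, which is the kind of decoupling the abstract describes as a goal.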
[Degree-granting institution]: Beijing Jiaotong University
[Degree level]: Master's
[Year degree conferred]: 2017
[Classification number]: TP311.52; TP274.2


Article number: 2231599


Article link: http://sikaile.net/kejilunwen/zidonghuakongzhilunwen/2231599.html
