Design and Implementation of a Web-Crawler-Based Information Collection and Classification System
Published: 2018-07-29 10:45
[Abstract]: Now that the Internet reaches every corner of the world, the volume of online information keeps growing. Every day the Internet generates a large amount of data that records all kinds of unfolding events and touches nearly every aspect of production and daily life. This data contains much that is valuable, but most of it is of no interest to us, so how to extract the valuable data from such a mass of information is a pressing question. The system uses web spider (crawler) technology to build an Internet collection system driven by practical requirements. Following a targeted collection approach, it quickly locates and collects Internet data that meets business needs, then applies text clustering to the collected results to group them into data sets that satisfy given characteristics, so that they can support other downstream business uses. The system is written in Java in an object-oriented style, uses Lucene search engine technology for the underlying data retrieval, uses the open-source Chinese word segmenter IK, and builds the application layer on the classic SSH Web development framework, presenting a simple Internet information collection and classification system. The system can give individuals, enterprises, or government bodies that need to analyze Internet data a preliminary filtering and clustering of the data they require, and it supplies first-stage standardized data for the data analysis of various complex business scenarios, which makes it useful in today's data-driven era.
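The two stages the abstract describes, targeted collection followed by text clustering, can be illustrated with a minimal Java sketch. This is not the thesis's implementation: the class name `CollectAndClusterSketch`, the keyword filter, the whitespace tokenizer, the single-pass clustering rule, and the 0.5 similarity threshold are all assumptions chosen for illustration. A real pipeline for Chinese text would fetch pages with a crawler and tokenize with a segmenter such as IK rather than splitting on whitespace.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch of "filter by topic, then cluster"; not the thesis's code.
class CollectAndClusterSketch {

    // Term-frequency vector over lowercase, whitespace-separated tokens.
    // Assumption: a Chinese pipeline would use a word segmenter here instead.
    static Map<String, Integer> termFrequencies(String text) {
        Map<String, Integer> tf = new HashMap<>();
        for (String token : text.toLowerCase().split("\\s+")) {
            if (!token.isEmpty()) {
                tf.merge(token, 1, Integer::sum);
            }
        }
        return tf;
    }

    // Euclidean length of a term-frequency vector.
    static double norm(Map<String, Integer> v) {
        double sum = 0.0;
        for (int count : v.values()) {
            sum += (double) count * count;
        }
        return Math.sqrt(sum);
    }

    // Cosine similarity between two term-frequency vectors (0.0 if either is empty).
    static double cosine(Map<String, Integer> a, Map<String, Integer> b) {
        double dot = 0.0;
        for (Map.Entry<String, Integer> e : a.entrySet()) {
            dot += e.getValue() * b.getOrDefault(e.getKey(), 0);
        }
        double na = norm(a);
        double nb = norm(b);
        return (na == 0.0 || nb == 0.0) ? 0.0 : dot / (na * nb);
    }

    // Targeted-collection stand-in: keep documents containing at least one topic keyword.
    static List<String> filterByTopic(List<String> docs, List<String> keywords) {
        List<String> kept = new ArrayList<>();
        for (String doc : docs) {
            Map<String, Integer> tf = termFrequencies(doc);
            for (String keyword : keywords) {
                if (tf.containsKey(keyword.toLowerCase())) {
                    kept.add(doc);
                    break;
                }
            }
        }
        return kept;
    }

    // Single-pass clustering: a document joins the first cluster whose seed
    // (its first member) is at least `threshold` similar; otherwise it starts a new cluster.
    static List<List<String>> cluster(List<String> docs, double threshold) {
        List<List<String>> clusters = new ArrayList<>();
        List<Map<String, Integer>> seeds = new ArrayList<>();
        for (String doc : docs) {
            Map<String, Integer> tf = termFrequencies(doc);
            int target = -1;
            for (int i = 0; i < seeds.size(); i++) {
                if (cosine(tf, seeds.get(i)) >= threshold) {
                    target = i;
                    break;
                }
            }
            if (target < 0) {
                seeds.add(tf);
                clusters.add(new ArrayList<>(Arrays.asList(doc)));
            } else {
                clusters.get(target).add(doc);
            }
        }
        return clusters;
    }

    public static void main(String[] args) {
        List<String> pages = Arrays.asList(
                "crawler fetches web pages",
                "web crawler fetches pages quickly",
                "stock prices fell today",
                "recipe for noodle soup");
        List<String> relevant = filterByTopic(pages, Arrays.asList("crawler", "stock"));
        System.out.println(cluster(relevant, 0.5));
    }
}
```

Single-pass clustering is used here only because it is short and needs no preset number of clusters; the abstract does not say which clustering algorithm the system actually uses.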
[Degree-granting institution]: Xiamen University
[Degree level]: Master's
[Year degree conferred]: 2013
[Classification number]: TP311.52
Article ID: 2152432
Article link: http://sikaile.net/kejilunwen/sousuoyinqinglunwen/2152432.html