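Research on Forum Information Extraction and Storage Based on a MongoDB Cloud Storage Platform
Published: 2018-04-26 01:25
Topics: cloud computing + non-relational databases; Source: Shanghai Jiao Tong University, master's thesis, 2012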
[Abstract]: The rapid development of Internet technology and the spread of input devices such as mobile phones, tablets, and smart TVs have caused Internet data to grow explosively. Faced with data at this scale, storing it more stably and quickly, and mining valuable information from it, has become a new challenge for many enterprises. The emergence of cloud storage brings new opportunities for the rapid development of data mining. Giants such as Amazon, Microsoft, Google, and IBM have launched their own cloud storage platforms, and domestic companies such as Baidu, Huawei, Tencent, and 360 have stepped up their presence in the cloud storage field. This thesis uses a large volume of forum data as its storage sample and builds an experimental system that supports horizontal scaling. It designs and implements several methods for extracting forum data, and finally verifies the performance advantage that cloud storage brings. The main work is as follows:
1) The thesis introduces in detail the NoSQL databases that grew out of the development of cloud storage and describes the characteristics of each type. Based on the characteristics of forum data, it selects MongoDB as the data store, compares it with MySQL, the popular traditional relational database, and summarizes some of MongoDB's advantages. It then describes how MongoDB is used and how forum data is stored in it.
2) The thesis briefly surveys methods for extracting information from forums, then analyzes the characteristics of domestic forums and the structure of forum pages, and divides forums into two categories: general forums and specialized forums. For general forums, regular expressions are used to extract information precisely; for specialized forums, a heuristic extraction method is proposed and designed. Applying a different extraction method to each category of forum improves extraction accuracy.
3) To verify the feasibility of the newly designed storage approach and of the forum information extraction algorithms, the thesis combines several forum data mining methods to design an experimental forum extraction system based on MongoDB distributed storage. The system supports horizontal scaling, stores massive forum data stably, and accurately mines the useful data in forums. Once the stored data reached a certain scale, the storage capacity for large volumes of forum data was tested and storage performance was compared across several kinds of queries. The thesis concludes that, for processing big data, cloud storage in a distributed environment has an overwhelming advantage over a single-server MongoDB deployment.
4) Finally, the thesis summarizes the work and discusses the remaining problems and the outlook for further work.
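The abstract does not give the thesis's regular expressions or its MongoDB schema, so the following is only a minimal sketch of the general idea: one page template, a regular expression written for that template, and a thread stored as a single document with its posts embedded. The HTML markup, the field names, the function `extract_thread`, and the URL are all hypothetical and chosen for illustration.

```python
import re
from typing import Dict, List

# Hypothetical markup for one forum template; real forum software differs,
# and each template would need its own pattern.
POST_PATTERN = re.compile(
    r'<div class="post" id="post-(?P<post_id>\d+)">\s*'
    r'<span class="author">(?P<author>[^<]+)</span>\s*'
    r'<span class="time">(?P<posted_at>[^<]+)</span>\s*'
    r'<div class="content">(?P<content>.*?)</div>\s*'
    r'</div>',
    re.DOTALL,
)
TITLE_PATTERN = re.compile(r"<title>(?P<title>[^<]+)</title>")


def extract_thread(html: str, url: str) -> Dict:
    """Build one thread document with its posts embedded as a list.

    The document shape (url, title, posts) is an assumed schema,
    not the one used in the thesis.
    """
    title_match = TITLE_PATTERN.search(html)
    posts: List[Dict] = [
        {
            "post_id": int(m.group("post_id")),
            "author": m.group("author").strip(),
            "posted_at": m.group("posted_at").strip(),
            "content": m.group("content").strip(),
        }
        for m in POST_PATTERN.finditer(html)
    ]
    return {
        "url": url,
        "title": title_match.group("title").strip() if title_match else None,
        "posts": posts,
    }


# Made-up sample page that follows the hypothetical template above.
SAMPLE_HTML = """
<html><head><title>How do I shard a collection?</title></head><body>
<div class="post" id="post-1">
  <span class="author">alice</span>
  <span class="time">2012-01-05 10:00</span>
  <div class="content">Which shard key should I pick?</div>
</div>
<div class="post" id="post-2">
  <span class="author">bob</span>
  <span class="time">2012-01-05 10:30</span>
  <div class="content">It depends on your query pattern.</div>
</div>
</body></html>
"""

if __name__ == "__main__":
    thread = extract_thread(SAMPLE_HTML, "http://forum.example.com/thread/1")
    print(thread["title"], len(thread["posts"]))
```

With a running MongoDB instance and the PyMongo driver, a document of this shape could be passed to `collection.insert_one(thread)`; that step is left out here so the sketch runs without a database. A pattern this strict only works when every page follows the same template, which is presumably why the thesis turns to a heuristic method for the other category of forums.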
[Degree-granting institution]: Shanghai Jiao Tong University
[Degree level]: Master's
[Year degree conferred]: 2012
[Classification numbers]: TP333; TP311.13
Article ID: 1803870
Article link: http://sikaile.net/kejilunwen/jisuanjikexuelunwen/1803870.html