Design and Implementation of a Spark-Based Classification Method for Very Large Text Collections
Topic: big data + text classification; Source: Beijing Jiaotong University, master's thesis, 2017
[Abstract]: The rapid development of Internet technology has produced massive amounts of online text data. Most of this data has never been processed or classified, which has given rise to spam, unsolicited advertising, and similar nuisances. People find it hard to extract useful information from the mass of data and waste a great deal of time and effort dealing with junk. How to classify massive text data efficiently is therefore of both theoretical and practical importance.

The thesis first analyzes two problems with traditional text classification algorithms. (1) Feature vector extraction is slow and inefficient. The feature space of massive data is effectively open-ended, yet traditional text representation algorithms extract features offline in batch mode. This is computationally inefficient and memory-hungry, and can even cause serious failures such as out-of-memory errors. (2) Traditional classifiers are poorly suited to big data computing frameworks. Massive data is usually processed by distributed parallel computing, whereas traditional classification algorithms such as SVM and the naive Bayes classifier are not well suited to it. Deep learning algorithms, although widely used in semantic recognition, bring little gain when applied to text classification systems: model training takes a long time and the benefit is not obvious.

To address these problems, the thesis studies text representation and classifier design. The main work is as follows. (1) For text representation, it proposes an online, domain-partitioned feature selection algorithm for streaming data (the OFFS algorithm). The algorithm improves on the vector space model, extracts features from streaming data in real time, and generates text vectors quickly, which addresses the low efficiency and high memory consumption of traditional feature extraction. (2) For classifier design, it combines a BP (back-propagation) neural network with the OFFS algorithm to form the OFFS-BP neural network text classifier. The classifier is adapted to a distributed parallel computing environment, reduces model training time, and balances computational efficiency against classification accuracy. (3) It implements the OFFS-BP neural network classifier on the Spark platform. The OFFS algorithm is implemented with the Spark Streaming sub-framework, the BP neural network classifier is implemented with the Spark MLlib sub-framework, and the two are connected through RDDs, Spark's programming model. Experiments on several data sets show that the proposed OFFS-BP neural network classifier is better suited to big data, takes less computation time, and classifies more efficiently.
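The abstract does not describe the internals of OFFS, so the following is not the thesis's algorithm. It is only a minimal, generic sketch of the idea the abstract contrasts with batch extraction: term statistics are updated one document at a time, and a text vector can be produced immediately without rescanning the corpus. The class name `OnlineVectorizer`, the `max_features` parameter, and the TF-IDF weighting are all assumptions made for illustration.

```python
from collections import Counter
import math


class OnlineVectorizer:
    """Hypothetical helper: incrementally maintained term statistics."""

    def __init__(self, max_features=1000):
        self.max_features = max_features
        self.doc_freq = Counter()  # term -> number of documents containing it
        self.n_docs = 0

    def update(self, tokens):
        """Fold one tokenized document into the running statistics."""
        self.n_docs += 1
        self.doc_freq.update(set(tokens))

    def vocabulary(self):
        """Current feature set: most frequent terms, ties broken alphabetically."""
        ranked = sorted(self.doc_freq.items(), key=lambda kv: (-kv[1], kv[0]))
        return [term for term, _ in ranked[: self.max_features]]

    def transform(self, tokens):
        """Smoothed TF-IDF vector over the current vocabulary."""
        tf = Counter(tokens)
        vec = []
        for term in self.vocabulary():
            idf = math.log((1 + self.n_docs) / (1 + self.doc_freq[term])) + 1.0
            vec.append(tf[term] * idf)
        return vec


# Simulated stream of already tokenized documents (toy data, not from the thesis).
stream = [
    "cheap pills buy now".split(),
    "meeting agenda for monday".split(),
    "buy cheap watches now".split(),
]

vectorizer = OnlineVectorizer(max_features=4)
vectors = []
for doc in stream:
    vectorizer.update(doc)                      # statistics grow with the stream
    vectors.append(vectorizer.transform(doc))   # vector is available immediately
```

Because the vocabulary can change as new documents arrive, vectors produced at different times are not guaranteed to share the same coordinates. A real system would need to pin or version the feature set before handing vectors to a classifier.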
[Degree-granting institution]: Beijing Jiaotong University
[Degree level]: Master's
[Year degree conferred]: 2017
[Classification number]: TP391.1; TP18