Research on Web Page Noise Removal and Classification Algorithms
Published: 2018-06-10 20:25
Topics: web page classification + web page noise. Source: Huaqiao University, master's thesis, 2008
[Abstract]: With the rapid development of the Internet, the amount of information on the network is growing sharply. The Internet provides people with a large amount of information, but it also makes obtaining that information quickly and accurately a challenge. To make effective use of web resources, web pages need to be classified.
This thesis studies the key technologies of web page classification and examines web page noise removal and classification algorithms in depth.
In web page preprocessing, the most critical problem is removing noise data: advertisements, navigation bars, copyright notices, and other information unrelated to the page content should be removed as far as possible, so that the topical information of the page is obtained. Based on an analysis of existing methods and of how web pages are built, and taking into account both page structure and block size, we design and implement a block-analysis-based web page noise removal algorithm that adjusts its threshold automatically.
The feature aggregation algorithm takes the relationships between words into account. It aggregates feature words into distribution patterns according to their contribution to classification, and uses these distribution patterns in place of the traditional scheme in which each word corresponds to one dimension of the vector. We tested the feature aggregation algorithm in the classification system of this thesis. The results show that it handles the dataset skew problem well and improves the overall performance of the classifier.
Many classification algorithms have been proposed in the field of text classification, among which KNN and SVM are considered two of the more effective. We propose the SVM-KNN algorithm, which combines the KNN and SVM classifiers and improves classifier performance through feedback and correction of the predicted class probabilities.
Finally, in the experimental Chinese web page classification system that we implemented, we tested the practical effect of the block-based web page noise removal algorithm and the SVM-KNN algorithm. The experimental results demonstrate that the algorithms are effective.
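The idea of block-based noise removal with a page-dependent threshold can be illustrated with a minimal Python sketch. The abstract does not give the details of the thesis's algorithm, so everything below is an assumption made for illustration: the `Block` structure, the link-density test, the `size_ratio` and `max_link_density` parameters, and the rule that the size threshold is a fraction of the page's mean block length. It is not the author's method.

```python
from dataclasses import dataclass
from statistics import mean
from typing import List


@dataclass
class Block:
    """One page block, e.g. the content of a <div> or <td> (hypothetical structure)."""
    text: str       # all visible text in the block
    link_text: str  # the part of that text that sits inside <a> elements


def link_density(block: Block) -> float:
    """Fraction of the block's characters that are link text (empty blocks count as 1.0)."""
    return len(block.link_text) / len(block.text) if block.text else 1.0


def denoise(blocks: List[Block],
            size_ratio: float = 0.5,
            max_link_density: float = 0.5) -> List[Block]:
    """Keep blocks that are large relative to this page and not dominated by links."""
    if not blocks:
        return []
    # Assumed threshold rule: the cut-off scales with the mean block size of
    # the current page, so it adjusts itself from page to page.
    threshold = size_ratio * mean(len(b.text) for b in blocks)
    return [b for b in blocks
            if len(b.text) >= threshold and link_density(b) <= max_link_density]


# A toy page: a navigation bar, an article body, and a copyright line.
page = [
    Block("Home News Sports Contact", "Home News Sports Contact"),
    Block("With the rapid development of the Internet, " * 10, ""),
    Block("Copyright 2008", ""),
]
kept = denoise(page)
```

On this toy page only the article body survives: the navigation bar consists entirely of link text, and the copyright line is far shorter than the page's average block.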
[Degree-granting institution]: Huaqiao University
[Degree level]: Master's
[Year degree granted]: 2008
[Classification number]: TP393.092
[Citing literature]
Related journal papers (1)
1 Liu Wenjing; Xu Zhiwei; He Conghui. Research on page noise removal in the WEB-to-WAP conversion process [J]. Computer Applications and Software, 2012, No. 04
Article ID: 2004485
Article link: http://sikaile.net/wenyilunwen/guanggaoshejilunwen/2004485.html