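Implementation of a Medical Advertisement Monitoring System Based on Web Content Mining
Published: 2018-04-27 00:04
Topics: Web content mining + web crawler; Source: Harbin University of Science and Technology, master's thesis, 2011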
[Abstract]: With the rapid growth of the Internet, the sheer number of Internet users has drawn more and more advertisers to the online advertising market, and the volume of online advertising has risen sharply. Illegal advertisements have multiplied along with it, and illegal medical advertisements do the most harm. Because the volume of information on the network is so large, manual review alone cannot keep up with collecting and processing it. The relevant information technologies therefore need further study so that an automated monitoring system for online medical advertisements can be built.
This thesis studies web crawling, web page information extraction, and web page classification in depth and proposes a solution for each. Building on these techniques, it implements a monitoring system for online medical advertisements that addresses the problem of monitoring medical advertisements on the Internet. The main work is as follows:
1. Existing web crawler technology is studied in depth, and the working principle of a crawler is described in detail. Based on the structure of web pages, and with the help of open-source web page extraction tools, the thesis proposes its own web page information extraction method. Test results show that the proposed method performs well in both efficiency and accuracy.
2. The current state and processing flow of web page classification are introduced, and the theory behind each module involved in web page classification is explained in detail. On this basis, the thesis makes full use of relevant open-source tools and proposes an improvement that addresses shortcomings of the χ² statistic in text classification. It then builds a classification module that decides whether the information fetched by the crawler is medical information. Experimental results show that the proposed classification module performs well.
3. A monitoring system for online medical advertisements is designed and implemented. It automatically tracks and processes medical advertisements on the network, supports distributed computing, and offers good usability and a clear display interface.
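The χ² statistic mentioned in item 2 is commonly used to score how strongly a term is associated with a class when selecting features for text classification. The following is a minimal sketch of the standard (unmodified) χ² score computed from document counts. It is not the thesis's improved variant, which the abstract does not describe. The function names `chi_square` and `rank_terms` and the input layout (token lists with one label per document) are assumptions made for illustration.

```python
from collections import Counter


def chi_square(n11, n10, n01, n00):
    """Standard chi-square score for a 2x2 term/class contingency table.

    n11: documents in the class that contain the term
    n10: documents outside the class that contain the term
    n01: documents in the class that lack the term
    n00: documents outside the class that lack the term
    """
    n = n11 + n10 + n01 + n00
    denom = (n11 + n01) * (n10 + n00) * (n11 + n10) * (n01 + n00)
    if denom == 0:
        # A degenerate table (an empty row or column) carries no signal.
        return 0.0
    return n * (n11 * n00 - n10 * n01) ** 2 / denom


def rank_terms(docs, labels, target):
    """Rank terms by chi-square association with the class `target`.

    docs:   list of token lists (hypothetical, already segmented)
    labels: list of class labels, one per document
    """
    n_docs = len(docs)
    n_target = sum(1 for y in labels if y == target)

    df_all = Counter()     # document frequency over all documents
    df_target = Counter()  # document frequency inside the target class
    for tokens, y in zip(docs, labels):
        for term in set(tokens):  # count each term once per document
            df_all[term] += 1
            if y == target:
                df_target[term] += 1

    scores = {}
    for term, df in df_all.items():
        n11 = df_target[term]
        n10 = df - n11
        n01 = n_target - n11
        n00 = n_docs - n_target - n10
        scores[term] = chi_square(n11, n10, n01, n00)
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
```

One commonly cited limitation of the plain score is that the squared term hides the direction of the association: a term that appears only outside the target class can score as high as a term that appears only inside it. Whether this is the shortcoming the thesis addresses is not stated in the abstract.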
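[Degree-granting institution]: Harbin University of Science and Technology
[Degree level]: Master's
[Year degree granted]: 2011
[Classification number]: TP277;TP393.09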
Article ID: 1808264
Article link: http://sikaile.net/wenyilunwen/guanggaoshejilunwen/1808264.html