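Research and Implementation of Real-Time Information Retrieval for Public Forums
Published: 2018-01-19 00:13
Keywords: public forum, vertical search, web crawler, real-time retrieval, Lucene. Source: Nanjing University of Science and Technology, 2012 master's thesis. Document type: degree thesis.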
[Abstract]: The Internet, one of the fastest-growing developments in contemporary society, plays an increasingly important role. Public forums are one product of that growth: they are open platforms that differ from ordinary websites in that users can publish information as well as read it, which makes communication among users far more convenient. Over time, however, a negative and dangerous side has emerged: some offenders exploit the convenience of forums to spread illegal information. Because information in forums spreads quickly and is refreshed frequently, illegal information can easily cause serious consequences within a short time, so it must be detected promptly. This thesis designs a vertical search engine for the forum domain that can perform deep data mining on specified forums and monitor newly appearing information around the clock. The engine consists of three modules: information acquisition, information analysis, and information indexing and retrieval. The information acquisition module is implemented by building a metasearch engine on the interfaces of existing general-purpose search engines and by writing a web crawler. The information analysis module uses templates and web page noise removal to extract structured text from HTML pages and from common file formats such as Word, Excel, and PDF. The information indexing and retrieval module is built with the open-source tool Lucene and provides users with a convenient and efficient query interface. User feedback indicates that the engine performs well in both deep data mining and real-time monitoring.
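The analysis and indexing stages described in the abstract can be illustrated with a minimal Python sketch. It is not the thesis's implementation and does not use Lucene's API: the in-memory `InvertedIndex` is a simplified stand-in for a Lucene index, and the template rule (keep only text inside `<div class="post">`), along with the names `PostExtractor`, `extract_posts`, and `InvertedIndex`, are assumptions made for illustration. Fetching pages is left out, so the sketch takes HTML strings as input.

```python
import re
from collections import defaultdict
from html.parser import HTMLParser


class PostExtractor(HTMLParser):
    """Keeps text inside <div class="post"> and drops everything else.

    The "post" class is a hypothetical template rule, not one from the thesis.
    """

    def __init__(self):
        super().__init__()
        self.depth = 0    # nesting depth of <div> inside a matching post block
        self.chunks = []  # text fragments collected from post blocks

    def handle_starttag(self, tag, attrs):
        if tag != "div":
            return
        if self.depth > 0:
            self.depth += 1            # nested div inside a post
        elif ("class", "post") in attrs:
            self.depth = 1             # entering a post block

    def handle_endtag(self, tag):
        if tag == "div" and self.depth > 0:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth > 0 and data.strip():
            self.chunks.append(data.strip())


def extract_posts(html):
    """Return the post text of a page with navigation and other noise removed."""
    parser = PostExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)


class InvertedIndex:
    """Toy in-memory index; new pages are searchable as soon as add() returns."""

    def __init__(self):
        self.postings = defaultdict(set)  # term -> set of URLs containing it
        self.seen_urls = set()            # URL de-duplication for the crawler

    def add(self, url, text):
        """Index a page once; return False if the URL was already indexed."""
        if url in self.seen_urls:
            return False
        self.seen_urls.add(url)
        for term in re.findall(r"\w+", text.lower()):
            self.postings[term].add(url)
        return True

    def search(self, query):
        """Return URLs that contain every term of the query (AND semantics)."""
        terms = re.findall(r"\w+", query.lower())
        if not terms:
            return set()
        result = set(self.postings.get(terms[0], set()))
        for term in terms[1:]:
            result &= self.postings.get(term, set())
        return result
```

In use, each newly crawled page would be passed through `extract_posts` and then `InvertedIndex.add`, after which `search` reflects it immediately. A production system would replace the toy index with a persistent one such as Lucene and would need a tokenizer suited to Chinese text, which the `\w+` pattern here does not provide.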
[Degree-granting institution]: Nanjing University of Science and Technology
[Degree level]: Master's
[Year degree granted]: 2012
[Classification number]: TP391.3
Article ID: 1441783
Article link: http://sikaile.net/kejilunwen/sousuoyinqinglunwen/1441783.html