Design and Implementation of a News Retrieval System for Specific Websites
Published: 2018-06-10 12:34
Topic: news search + RSS; Source: South China University of Technology, master's thesis, 2013
[Abstract]: With the rapid development of the Internet, daily life has become increasingly dependent on it, and the explosive growth of online information poses a major challenge for search engines. People spend time every day browsing news websites to follow current events at home and abroad, yet as the number of news portals keeps growing, finding the news one actually cares about becomes harder. In many scenarios, such as public opinion monitoring, users are interested only in news from a few specific websites, and general-purpose search engines do not offer this choice. What is needed is a news search system oriented toward specific websites that collects, organizes, and serves the news its users are interested in.
This thesis designs and implements a timely, accurate, user-configurable, customizable, and extensible news search system that collects news from designated websites in real time and provides users with a personalized news search service. It surveys the state of research on search engines and news search in China and abroad and, based on the main working principles of search engines, proposes a design for a news retrieval system for specific websites. The system is implemented following the MVC layering idea and is divided into a data acquisition layer, a business logic layer, and a presentation layer. The latest news reports are discovered through the RSS feeds of news websites, the body text of each page is extracted with the open-source Boilerpipe library, the text is segmented with the IK tokenizer and an inverted index is built over the pages, and finally a personalized news search service is offered to users. In addition, based on the characteristics of news, the thesis proposes a ranking algorithm for news search results that uses four factors: relevance, freshness, news category, and source site.
The thesis also tests the system: it reports statistics on news collection, tests body-text extraction from news pages, and carries out functional tests of the Web service component of the news search system.
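The abstract names two mechanisms without giving details: an inverted index over segmented page text, and a ranking that combines relevance, freshness, news category, and source site. The sketch below is one possible reading of that idea, not the thesis's actual algorithm. The TF-IDF relevance measure, the exponential freshness decay, the 24-hour half-life, the linear weighting, the default weights, and all names (`NewsItem`, `build_inverted_index`, `search`, and so on) are assumptions made for illustration. Documents are assumed to be already tokenized, standing in for the output of a segmenter such as IK.

```python
import math
from collections import defaultdict
from dataclasses import dataclass
from datetime import datetime
from typing import Dict, List


@dataclass
class NewsItem:
    doc_id: int
    tokens: List[str]      # pre-segmented body text (stand-in for a tokenizer's output)
    published: datetime
    category: str
    site: str


def build_inverted_index(items: List[NewsItem]) -> Dict[str, Dict[int, int]]:
    """Map each term to a postings dict {doc_id: term frequency}."""
    index: Dict[str, Dict[int, int]] = defaultdict(dict)
    for item in items:
        for token in item.tokens:
            index[token][item.doc_id] = index[token].get(item.doc_id, 0) + 1
    return index


def relevance(index, query_tokens, doc_id, n_docs) -> float:
    """Sum of TF-IDF weights over query terms (assumed relevance measure)."""
    score = 0.0
    for term in query_tokens:
        postings = index.get(term, {})
        tf = postings.get(doc_id, 0)
        if tf:
            score += tf * math.log(1 + n_docs / len(postings))
    return score


def freshness(published: datetime, now: datetime, half_life_hours: float = 24.0) -> float:
    """Exponential decay: 1.0 for a brand-new item, 0.5 after one half-life."""
    age_hours = max((now - published).total_seconds() / 3600.0, 0.0)
    return 0.5 ** (age_hours / half_life_hours)


def search(items, index, query_tokens, now,
           preferred_categories=(), site_weights=None,
           weights=(0.5, 0.3, 0.1, 0.1)) -> List[int]:
    """Return doc_ids ordered by a weighted sum of the four factors."""
    site_weights = site_weights or {}
    w_rel, w_fresh, w_cat, w_site = weights
    by_id = {item.doc_id: item for item in items}

    # Candidate set: every document containing at least one query term.
    candidates = set()
    for term in query_tokens:
        candidates.update(index.get(term, {}))
    if not candidates:
        return []

    # Normalize relevance to [0, 1] so it is comparable with the other factors.
    rel = {d: relevance(index, query_tokens, d, len(items)) for d in candidates}
    max_rel = max(rel.values())

    scored = []
    for doc_id in candidates:
        item = by_id[doc_id]
        score = (w_rel * rel[doc_id] / max_rel
                 + w_fresh * freshness(item.published, now)
                 + w_cat * (1.0 if item.category in preferred_categories else 0.0)
                 + w_site * site_weights.get(item.site, 0.0))
        scored.append((score, doc_id))
    scored.sort(reverse=True)
    return [doc_id for _, doc_id in scored]
```

With equal relevance, a newer item outranks an older one under the default weights, while raising the site weight for a preferred source can reverse that order. This kind of per-user tuning is one way the "personalized" and "user-configurable" behavior mentioned in the abstract could be expressed.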
[Degree-granting institution]: South China University of Technology
[Degree level]: Master's
[Year degree granted]: 2013
[Classification number]: TP391.3
Article ID: 2003240
Article link: http://sikaile.net/kejilunwen/sousuoyinqinglunwen/2003240.html