Design and Implementation of a MySQL-Based News Search Engine
Published: 2018-03-09 14:47
Topic: information retrieval  Focus: web crawler  Source: Fudan University, master's thesis, 2013  Type: degree thesis
[Abstract]: With the rapid development of modern information technology, the volume and variety of information on the Internet are growing explosively. This has brought great convenience to people's daily life, work, and study, but it has also raised new problems: how to manage such massive amounts of information in a unified way, how to index these scattered resources, and how to retrieve exactly the information needed from them. Search engines are the key technology for solving these problems. Traditional general-purpose search engines, however, collect every kind of information on the Web and serve users at every level, and this attempt to cover everything is yielding fewer and fewer breakthroughs as the volume of information grows. An ordinary user's interests are comparatively narrow and focused, which gave rise to the concept of specialized search engines aimed at particular domains and needs. Unlike a general-purpose engine, a specialized search engine collects only Web information related to a given topic: rather than accepting everything it encounters, it analyzes whether a page's content is relevant to the topic and processes only the relevant pages further. As a result, specialized search engines improve markedly on general-purpose ones in both resource consumption and query accuracy.

This thesis studies specialized search engines, with news as the search topic. Through in-depth theoretical study and practice of key search engine technologies, the work aims to deepen understanding of the search engine field. The news search engine takes the Sina news website as the crawler's entry address and collects news pages from it in a targeted way. Pages are collected by a dedicated news crawler, which starts from the news home page, extracts the news links it contains, stores them in a queue of pages to be crawled, and traverses the site with a depth-first search limited to three levels. The crawler then cleans the collected pages and extracts the useful information, and finally the indexer builds the core data structure of the search engine, the inverted index. Because a search engine ultimately serves ordinary users, designing a query interface with a good user experience for the news search service is also an essential task.

The thesis describes in detail the design and implementation of the web crawler, the cleaning of web pages and extraction of information, and the construction of the index database. These techniques are current research topics in natural language processing and artificial intelligence, and studying them strengthened the author's professional skills. Starting from the simplest techniques, the news search engine implements, step by step, the key modules of what is a large and complex system. Experimental results show that the system achieves a certain level of accuracy and performs well.
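The crawl and indexing steps described in the abstract can be illustrated with a minimal sketch. It is not the thesis's implementation: the in-memory `SITE` dictionary stands in for pages fetched from the news site, the function names (`crawl`, `build_inverted_index`, `search`) are hypothetical, the start page is assumed to count as level 1 of the three-level limit, and the tokenizer only handles ASCII words (Chinese news text would need a word segmenter). The thesis stores its data in MySQL, whereas this sketch keeps everything in Python dictionaries.

```python
import re
from collections import defaultdict

# Hypothetical in-memory "site": URL -> (page text, outgoing links).
# A real crawler would fetch and parse these pages over HTTP instead.
SITE = {
    "/index": ("news home", ["/a", "/b"]),
    "/a": ("stock market rises", ["/c"]),
    "/b": ("football match tonight", ["/a"]),
    "/c": ("market analysis", ["/d"]),
    "/d": ("too deep to reach", []),
}

MAX_DEPTH = 3  # assumption: the start page counts as level 1


def crawl(start, site=SITE, max_depth=MAX_DEPTH):
    """Depth-first traversal from `start`, at most `max_depth` levels deep."""
    visited = {}            # URL -> page text, in visiting order
    pending = [(start, 1)]  # pending links, used as a stack for depth-first order
    while pending:
        url, depth = pending.pop()
        if url in visited or url not in site:
            continue
        text, links = site[url]
        visited[url] = text
        if depth < max_depth:
            # Push in reverse so the first link on the page is explored first.
            for link in reversed(links):
                pending.append((link, depth + 1))
    return visited


def tokenize(text):
    """Lowercase ASCII word tokenizer (a stand-in for real word segmentation)."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_inverted_index(pages):
    """Map each term to the set of URLs whose text contains it."""
    index = defaultdict(set)
    for url, text in pages.items():
        for term in tokenize(text):
            index[term].add(url)
    return index


def search(index, query):
    """Return the URLs that contain every term of the query."""
    terms = tokenize(query)
    if not terms:
        return set()
    result = set(index.get(terms[0], set()))
    for term in terms[1:]:
        result &= index.get(term, set())
    return result


pages = crawl("/index")
index = build_inverted_index(pages)
print(sorted(pages))                    # ['/a', '/b', '/c', '/index']
print(sorted(search(index, "market")))  # ['/a', '/c']
```

Page `/d` is never visited because it lies four levels below the start page. A query looks up each term's posting set and intersects them, which is what makes the inverted index the central data structure of the engine.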
[Degree-granting institution]: Fudan University
[Degree level]: Master's
[Year degree conferred]: 2013
[Classification number]: TP391.3
Article ID: 1588987
Article link: http://sikaile.net/kejilunwen/sousuoyinqinglunwen/1588987.html