Design and Implementation of a Text Crawler for a Personalized Recommendation System
Published: 2018-11-16 14:26
[Abstract]: Internet technology has developed rapidly in recent years, and retrieving information from the Internet has become an important way for people to find what they need. Information on the Internet is highly varied, spreads quickly, and exists in enormous volume. How to crawl relevant information promptly and accurately under these conditions, in order to build the subject resource library for a personalized recommendation system in an education cloud, is a key problem in building that library. To address it, this thesis applies information extraction and web page processing techniques to design and implement the web crawler component of the personalized recommendation system, with the aim of providing a crawling service with finer and more precise classification, more comprehensive and in-depth data, and more timely updates.

The specific work is as follows:

1. The thesis reviews the current state of web crawler development, analyzes crawler architecture and implementation principles, and examines in depth how topic pages are distributed on the Web.
2. Search strategy. The thesis computes and predicts the topical relevance of each URL from factors including URL (Uniform Resource Locator) string features, anchor text, the parent page, and sibling URLs. URLs are crawled in descending order of predicted relevance, so that the pages downloaded are, as far as possible, those most relevant to the topic.
3. Web page parsing. This covers encoding conversion, HTML (HyperText Markup Language) parsing, URL extraction, page denoising, and main-text extraction. The page encoding is obtained from the http-equiv attribute of the meta tag in the HTML header, and that encoding is specified when reading the data downloaded from the Internet. The main text is then extracted with a method that combines link analysis and statistics, which further removes noise and improves the completeness of the extracted text; the main text is extracted correctly for most content-type pages.
4. Finally, a web crawler system is implemented on the basis of this design, and its running results are analyzed.

The crawler presented in this thesis can be used in the personalized recommendation system of an education cloud. By acquiring, storing, analyzing, and recommending articles in a subject area, it quickly recommends literature and related materials of interest to users, thereby improving research efficiency.
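The search strategy and the encoding detection described in the abstract can be illustrated with a minimal Python sketch. It is not the thesis's implementation: the abstract does not give the scoring formula, so `predict_relevance`, its keyword-matching rule, and its weights are hypothetical, as are the names `Frontier`, `detect_charset`, and `decode_page`. The sketch reads the charset declared in a meta tag, decodes the page with it, and keeps candidate URLs in a priority queue so that the URL with the highest predicted relevance is crawled first.

```python
import heapq
import re

# Finds charset=... inside a <meta ...> tag. This covers the http-equiv form
# <meta http-equiv="Content-Type" content="text/html; charset=gbk">
# as well as the shorter <meta charset="utf-8"> form.
META_CHARSET = re.compile(rb"""<meta[^>]*?charset\s*=\s*["']?\s*([\w.:-]+)""", re.IGNORECASE)


def detect_charset(raw: bytes, default: str = "utf-8") -> str:
    """Return the charset declared in a meta tag near the top of the page."""
    match = META_CHARSET.search(raw[:4096])  # the declaration normally sits in <head>
    return match.group(1).decode("ascii").lower() if match else default


def decode_page(raw: bytes) -> str:
    """Decode downloaded bytes with the declared charset, falling back to UTF-8."""
    try:
        return raw.decode(detect_charset(raw), errors="replace")
    except LookupError:  # the page declared a charset Python does not know
        return raw.decode("utf-8", errors="replace")


def predict_relevance(url, anchor_text, parent_score, sibling_scores, keywords,
                      weights=(0.3, 0.4, 0.2, 0.1)):
    """Hypothetical linear blend of the four factors named in the abstract.

    The weights and the keyword hit ratio are assumptions for illustration.
    """
    def hit_ratio(text):
        text = text.lower()
        return sum(k.lower() in text for k in keywords) / len(keywords)

    sibling_avg = sum(sibling_scores) / len(sibling_scores) if sibling_scores else 0.0
    w_url, w_anchor, w_parent, w_sibling = weights
    return (w_url * hit_ratio(url) + w_anchor * hit_ratio(anchor_text)
            + w_parent * parent_score + w_sibling * sibling_avg)


class Frontier:
    """URL queue that always pops the highest predicted relevance first."""

    def __init__(self):
        self._heap = []
        self._seen = set()

    def push(self, url, score):
        if url not in self._seen:  # never enqueue the same URL twice
            self._seen.add(url)
            heapq.heappush(self._heap, (-score, url))  # negate: heapq is a min-heap

    def pop(self):
        neg_score, url = heapq.heappop(self._heap)
        return url, -neg_score

    def __len__(self):
        return len(self._heap)
```

A crawl loop built on this would pop a URL, download and decode the page, score each outgoing link with `predict_relevance`, and push the links back onto the `Frontier`. How the four factors are actually combined would have to be taken from the thesis itself.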
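[Degree-granting institution]: University of Chinese Academy of Sciences (School of Engineering Management and Information Technology)
[Degree level]: Master's
[Year degree granted]: 2014
[Classification number]: TP393.09; TP391.3
Article ID: 2335764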
Article link: http://sikaile.net/guanlilunwen/ydhl/2335764.html