Research on Word-Vector Semantic Models and Their Application in a Topic Crawler System

Published: 2018-08-21 09:59

[Abstract]: Web crawling, that is, using a program to automatically retrieve the content of web pages, is now widespread. It is an important component of search engines and one of the main ways of obtaining corpora for training supervised machine learning models. For research in certain specific domains, however, a general-purpose crawler can no longer meet the need for domain-specific corpora, so vertical-domain crawlers focused on a particular topic are increasingly in demand. When a topic crawler obtains a new web page or link, it must decide whether to crawl that page by judging whether it is semantically related to the topic. This thesis uses word vectors for semantic representation, combined with pointwise mutual information, to evaluate each new link and decide whether to crawl the page or discard it. The work consists of the following parts.

The thesis introduces natural language processing, deep learning, and language models, and describes in detail two kinds of word-vector representation: matrix-based and vector-based. Models are then trained on the Chinese Wikipedia corpus with different parameters, experimental conclusions are drawn, and one set of parameters is selected for the research in the later chapters.

To address polysemy, the thesis introduces pointwise mutual information (PMI), which uses context to determine what a word means where it occurs. Based on the conclusions of the previous part, the best-performing word-vector model is selected and combined with PMI in experiments. The PMI word-pair table is very large and does not fit in the memory of an ordinary computer; the thesis gives a solution to this problem.

These two parts are then applied to a vertical-domain crawler system. Pages are fetched by breadth-first search; when the crawler encounters a new link, the model obtained in the previous part judges how closely the word associated with the link relates to the topic word. Using three topics, "programmer", "furniture", and "skin care", a number of pages are crawled from Baidu Baike for each topic, and the links discarded along the way are retained. Each page is manually judged as relevant or not relevant to the topic, which yields precision, recall, and related metrics. The results are compared with those of an ordinary crawler that does not use the related-word technique, giving a more objective assessment of the vertical-domain crawler.

In summary, the thesis proposes combining a semantic model representation with pointwise mutual information to determine whether a web link is related to the topic word, thereby filtering out the links that are relevant, and it reports objective experimental results.
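The link-relevance check described in the abstract can be illustrated with a minimal Python sketch. The abstract does not say how the word-vector score and the PMI score are combined, so the rule used here (cosine similarity above a threshold and PMI above a threshold) is an assumption, as are the function names `cosine`, `build_pmi`, and `should_crawl`, the document-level co-occurrence window, and the toy data. It is not the thesis's implementation.

```python
import math
from collections import Counter
from itertools import combinations


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def build_pmi(docs):
    """Return a pmi(a, b) function estimated from tokenized documents.

    Assumption: two words co-occur when they appear in the same document.
    PMI(a, b) = log2( p(a, b) / (p(a) * p(b)) ).
    """
    word_df = Counter()   # number of documents containing each word
    pair_df = Counter()   # number of documents containing each word pair
    for tokens in docs:
        vocab = sorted(set(tokens))
        word_df.update(vocab)
        pair_df.update(combinations(vocab, 2))
    n = len(docs)

    def pmi(a, b):
        joint = pair_df[tuple(sorted((a, b)))]
        if joint == 0:
            return float("-inf")  # the pair was never observed together
        return math.log2(joint * n / (word_df[a] * word_df[b]))

    return pmi


def should_crawl(link_word, topic_word, vectors, pmi,
                 min_similarity=0.5, min_pmi=0.0):
    """Hypothetical decision rule: keep the link only if both scores pass."""
    similarity = cosine(vectors[link_word], vectors[topic_word])
    return similarity >= min_similarity and pmi(link_word, topic_word) > min_pmi


# Toy data for illustration only; real vectors would come from a trained model.
docs = [
    ["programmer", "code", "python"],
    ["programmer", "code"],
    ["furniture", "sofa"],
    ["furniture", "table", "sofa"],
]
vectors = {
    "programmer": [1.0, 0.0],
    "code": [0.9, 0.1],
    "sofa": [0.0, 1.0],
}
pmi = build_pmi(docs)

print(pmi("programmer", "code"))                          # 1.0
print(should_crawl("code", "programmer", vectors, pmi))   # True
print(should_crawl("sofa", "programmer", vectors, pmi))   # False
```

The in-memory `Counter` is only suitable at toy scale. The abstract notes that a full PMI word-pair table does not fit in ordinary memory, and the thesis's own solution to that problem is not described here.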
[Degree-granting institution]: China University of Geosciences (Beijing)
[Degree level]: Master's
[Year conferred]: 2017
[Classification number]: TP391.3

孟竹 (Meng Zhu); Research on Word-Vector Semantic Models and Their Application in a Topic Crawler System [D]; China University of Geosciences (Beijing); 2017

Article ID: 2195356

Article link: http://sikaile.net/kejilunwen/sousuoyinqinglunwen/2195356.html
