Design of a Hot-Event Discovery System for the Tibetan-Language Web

Published: 2018-07-21 19:13

[Abstract]: Since the Internet emerged as a medium in the 1970s, we have entered an era of unprecedented information abundance. The way information spreads has changed as well: more and more people use online media to share their views, ideas, and attitudes. Because this information is not organized or managed in any unified way, finding and managing the information we need has become very difficult, and people urgently need a tool that can retrieve it from the web quickly. Search engines let people find information, but because they rely on keyword matching and do not filter their results, they return many pages, many of them irrelevant, and users must spend considerable time locating what they need. For hot events, search engines are even less helpful. News organizations do select the hot events of a given field each year, but because the cycle is annual and the selection is made by people, neither the timeliness nor the objectivity of the results can be guaranteed.

This thesis takes the corpus of the Tibetan-language website of People's Daily Online (人民网) as its research object. It uses TDT (Topic Detection and Tracking) technology to detect and track news events and clusters them, thereby designing a hot-event discovery system. The system lets users learn about the hot events on the Tibetan-language web in any given period, and its results are relatively objective.

The thesis first introduces the theory and key technologies of TDT, which are used to detect and track events in the online news stream. It then describes how a web crawler fetches pages within a specified scope, how the body text is extracted and noise removed, and how weight vectors are generated through word segmentation. Next, based on a study of hot-event discovery algorithms, it proposes a method for calculating event heat that improves the system's sensitivity to new hot events, and it applies an improved two-layer clustering strategy to cluster the texts and obtain an event list. Finally, experiments on a 2011 news corpus verify these algorithms and ideas, and the accompanying evaluation shows that the system achieves good results.
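The abstract names the pipeline stages (segmentation, weight vectors, clustering, heat ranking) but does not give the thesis's formulas. The following is only a minimal sketch under stated assumptions: TF-IDF is used as the term weighting, a one-layer single-pass clustering with cosine similarity stands in for the thesis's improved two-layer strategy, and cluster size stands in for the event-heat score. The function names, the threshold value, and the English placeholder tokens are all hypothetical; real input would be Tibetan text that has already been segmented.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Turn segmented documents (lists of tokens) into sparse TF-IDF dicts.

    Assumption: TF-IDF weighting with a smoothed IDF; the thesis's actual
    weighting scheme is not stated in the abstract.
    """
    n = len(docs)
    df = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        length = len(doc) or 1  # avoid division by zero for empty documents
        vectors.append({
            term: (count / length) * (math.log((n + 1) / (df[term] + 1)) + 1.0)
            for term, count in tf.items()
        })
    return vectors


def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(weight * b.get(term, 0.0) for term, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def single_pass_cluster(vectors, threshold=0.3):
    """Assign each document, in arrival order, to the most similar cluster.

    A document opens a new cluster when no existing centroid reaches the
    (hypothetical) similarity threshold. Returns lists of document indices.
    """
    centroids = []  # running sum of member vectors per cluster
    clusters = []   # document indices per cluster
    for index, vec in enumerate(vectors):
        best, best_sim = None, 0.0
        for cid, centroid in enumerate(centroids):
            sim = cosine(vec, centroid)
            if sim > best_sim:
                best, best_sim = cid, sim
        if best is not None and best_sim >= threshold:
            clusters[best].append(index)
            for term, weight in vec.items():
                centroids[best][term] = centroids[best].get(term, 0.0) + weight
        else:
            clusters.append([index])
            centroids.append(dict(vec))
    return clusters


def rank_by_size(clusters):
    """Placeholder heat ranking: more reports means a hotter event."""
    return sorted(clusters, key=len, reverse=True)


if __name__ == "__main__":
    # Placeholder tokens standing in for segmented Tibetan news text.
    docs = [
        ["earthquake", "rescue", "team"],
        ["earthquake", "rescue", "relief"],
        ["festival", "dance", "music"],
        ["earthquake", "team", "relief"],
    ]
    print(rank_by_size(single_pass_cluster(tfidf_vectors(docs))))
```

Single-pass clustering is shown here because it processes reports in arrival order, which suits a news stream; a heat score that also weighs recency would be needed to reflect the sensitivity to new events that the abstract describes.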
[Degree-granting institution]: Northwest Minzu University (西北民族大学)
[Degree level]: Master's
[Year degree awarded]: 2012
[Classification number]: TP391.3


Link to this article: http://sikaile.net/kejilunwen/sousuoyinqinglunwen/2136561.html
