Design of a Hot Event Discovery System for the Tibetan Web
[Abstract]: Since the Internet emerged as a medium in the 1970s, we have entered an era of unprecedented information abundance, and the way information spreads has changed greatly. More and more people share their views, ideas and attitudes through Internet media. Because this information lacks unified organization and management, it is difficult to find and manage what we need, and people urgently need a tool for quickly obtaining relevant information from the network. Search engines can supply such information, but they rely on keyword matching and do not filter their results, so they return many pages containing a great deal of irrelevant material, and users spend a lot of time locating what they need. For hot issues, search engines are even less helpful. News organizations do select the hot events in a given field every year, but because the cycle is annual and the selection is made by people, neither the timeliness nor the objectivity of the results can be guaranteed. This paper takes the corpus of the Tibetan-language website of People's Daily Online as its research object and uses Topic Detection and Tracking (TDT) technology to identify, track and cluster news events, in order to design a hot event discovery system. The system lets users see the hot events on the Tibetan-language web for any period of time, with more objective results.
The paper first introduces the relevant theory and key technologies of TDT, which are used to identify and track events in the network news stream. It then describes how a crawler fetches web pages within a specified range and how the body text is extracted and noise removed. Word segmentation is applied to the text to generate weight vectors. Based on a study of hot event discovery algorithms, a method for calculating the heat of an event is proposed, which improves the system's sensitivity to new hot events. An improved two-layer clustering strategy is then used to cluster the texts and obtain the list of events. Finally, the algorithm and approach are verified and evaluated in an experiment on a 2011 news corpus. The results show that the system performs well.
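The pipeline described in the abstract (segmented text, weight vectors, clustering into events, heat ranking) can be illustrated with a minimal Python sketch. It is not the thesis's implementation: whitespace splitting stands in for a Tibetan word segmenter, smoothed TF-IDF is assumed as the weighting scheme, plain single-pass clustering is substituted for the improved two-layer strategy, and the `heat` function with its `half_life` parameter is a hypothetical recency-weighted story count rather than the formula the paper proposes.

```python
import math
from collections import Counter


def tokenize(text):
    # Assumption: tokens are already separated by whitespace; a real system
    # would call a Tibetan word segmenter here.
    return text.split()


def tfidf_vectors(docs):
    """Build one sparse TF-IDF weight vector (term -> weight) per document."""
    tokenized = [tokenize(d) for d in docs]
    n = len(tokenized)
    df = Counter(term for toks in tokenized for term in set(toks))
    vectors = []
    for toks in tokenized:
        tf = Counter(toks)
        # Smoothed IDF keeps every weight strictly positive.
        vectors.append({
            t: (c / len(toks)) * (math.log((1 + n) / (1 + df[t])) + 1.0)
            for t, c in tf.items()
        })
    return vectors


def cosine(a, b):
    """Cosine similarity of two sparse vectors; 0.0 if either is empty."""
    dot = sum(w * b[t] for t, w in a.items() if t in b)
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def single_pass_cluster(vectors, threshold=0.3):
    """Assign each story, in arrival order, to the most similar event
    cluster, or open a new cluster if none reaches the threshold."""
    clusters = []   # lists of story indices
    centroids = []  # summed member vectors (cosine ignores the scale)
    for i, vec in enumerate(vectors):
        best, best_sim = None, threshold
        for k, centroid in enumerate(centroids):
            sim = cosine(vec, centroid)
            if sim >= best_sim:
                best, best_sim = k, sim
        if best is None:
            clusters.append([i])
            centroids.append(dict(vec))
        else:
            clusters[best].append(i)
            for t, w in vec.items():
                centroids[best][t] = centroids[best].get(t, 0.0) + w
    return clusters


def heat(story_days, today, half_life=3.0):
    # Hypothetical heat score: each story counts fully on its publication
    # day and its contribution halves every `half_life` days afterwards.
    return sum(0.5 ** ((today - day) / half_life) for day in story_days)
```

To use the sketch, pass the segmented story texts to `tfidf_vectors`, cluster the result with `single_pass_cluster`, and sort the clusters by `heat` computed from each cluster's publication days. Weighting recent stories more heavily is one simple way to make a ranking react quickly to a newly emerging event, which is the property the abstract attributes to its own heat measure.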
[Degree-granting institution]: Northwest Minzu University (西北民族大学)
[Degree level]: Master's
[Year degree granted]: 2012
[Classification number]: TP391.3