
Research on Optimizing Microblog Retrieval Results

Published: 2018-10-18 14:38
[Abstract]: With the rapid development of the Internet, information is now produced and disseminated far faster than before. In this era of information explosion, quickly and effectively finding information of interest in large volumes of data poses a major challenge for search engines. Microblogging, a form of social networking that has risen in recent years, has gradually become part of everyday life. Microblog content includes authoritative news events and trending topics as well as everyday and entertainment posts published by hundreds of millions of ordinary users, and microblog retrieval has long been an active research topic. This thesis first introduces the relevant information retrieval techniques and analyzes the advantages of learning to rank (LTR) models and the evaluation measures for information retrieval systems. Based on this survey, it optimizes microblog retrieval results in two respects: relevance and diversity. For relevance, the thesis designs and implements a network structure in which a GBDT model is trained on non-semantic features and then combined with an LTR model, with word vectors trained by a neural network introduced as additional features. On a Twitter dataset, this improves both MAP and P@30. For diversity, the thesis implements k-means clustering that uses sentence vectors trained by a neural network as features, and verifies the effectiveness of the sentence vector training. In addition, the Simhash de-duplication algorithm is used to remove near-duplicate tweets, which achieves a better F1 score than clustering. The topic is based on the TREC 2014 microblog retrieval evaluation task, for which the thesis proposes new ideas and solutions. Finally, the thesis describes the design and implementation process used to complete the task and analyzes the evaluation results.
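To make the near-duplicate removal step concrete, the following is a minimal sketch of Simhash-based de-duplication. It is not the thesis's implementation: the 64-bit fingerprint width, the MD5-based token hash, whitespace tokenization, term-frequency weights, the Hamming-distance threshold of 3, and all function names are assumptions chosen for illustration.

```python
import hashlib
from collections import Counter

HASH_BITS = 64  # assumed fingerprint width; the abstract does not state one


def token_hash(token: str) -> int:
    """Stable 64-bit hash of a token (first 8 bytes of its MD5 digest)."""
    digest = hashlib.md5(token.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


def simhash(tokens: list[str]) -> int:
    """Build a fingerprint by weighted per-bit voting over token hashes."""
    votes = [0] * HASH_BITS
    # Term frequency serves as the token weight (an illustrative choice).
    for token, weight in Counter(tokens).items():
        h = token_hash(token)
        for bit in range(HASH_BITS):
            votes[bit] += weight if (h >> bit) & 1 else -weight
    fingerprint = 0
    for bit, vote in enumerate(votes):
        if vote > 0:
            fingerprint |= 1 << bit
    return fingerprint


def hamming_distance(a: int, b: int) -> int:
    """Number of bit positions in which two fingerprints differ."""
    return bin(a ^ b).count("1")


def deduplicate(texts: list[str], max_distance: int = 3) -> list[str]:
    """Keep a text only if its fingerprint is farther than max_distance
    from the fingerprint of every text kept so far."""
    kept: list[str] = []
    fingerprints: list[int] = []
    for text in texts:
        fp = simhash(text.lower().split())
        if all(hamming_distance(fp, seen) > max_distance for seen in fingerprints):
            kept.append(text)
            fingerprints.append(fp)
    return kept
```

Calling `deduplicate(tweets)` on a ranked list keeps the first occurrence of each group of near-identical tweets, so higher-ranked results survive. The pairwise comparison is quadratic in the number of kept texts, which is acceptable for a short result list but would need an index over fingerprint segments for large collections.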
[Degree-granting institution]: Beijing University of Posts and Telecommunications
[Degree level]: Master's
[Year degree granted]: 2016
[Classification number]: TP393.092; TP391.1


Article ID: 2279431


Article link: http://sikaile.net/kejilunwen/ruanjiangongchenglunwen/2279431.html

