Design and Application of a Rainbow Retrieval System Based on Academic Networks
Topics: academic network + literature retrieval; Source: Shandong University, master's thesis, 2017
[Abstract]: With the rapid development of the mobile Internet and cloud computing, the volume of data generated, acquired, processed, and stored across industries is growing exponentially. Big data, a hallmark of the new era, shapes social production and daily life through its diverse, polymorphic, and interconnected forms. In academia, the accumulated literature now numbers in the hundreds of millions of items, which poses a major challenge to traditional retrieval methods. Those methods rank results mainly by a single piece of document information, such as the relevance between the query terms and the retrieved content, or the citation count of a document. They consider neither the relationships among nodes in the academic network nor the attributes of the nodes themselves, so the results can suffer from poor relevance, topic drift, and low retrieval quality. In addition, traditional academic retrieval systems mainly provide literature retrieval, whereas recommendations of authoritative domain experts can better guide researchers in their work and choice of direction. How to mine deeper link-structure semantics from massive academic data and build an expert retrieval system is therefore also an important research topic. Advances in data mining and distributed computing provide effective means of addressing these problems. This thesis targets two scenarios, literature retrieval and expert retrieval. By constructing an academic information network, it optimizes the retrieval methods and designs the application of the retrieval system.

First, in the literature retrieval system, document nodes are ranked by importance using the link-analysis algorithm PageRank, and the shortcomings of PageRank are addressed in two ways. (1) The different attributes of nodes in the academic information network are used to compute the authority of each document node. The weight allocation strategy of PageRank is then modified according to document authority, yielding the SQT-Rank algorithm, which improves ranking performance. (2) Because the volume of literature data is huge in the big data setting, SQT-Rank is parallelized with the MapReduce programming model, which improves computational performance.

Second, compared with a homogeneous information network, a heterogeneous information network carries richer link-structure semantics. In the expert retrieval system, to support deeper data mining and analysis, an academic heterogeneous information network is first constructed, and six relation matrices involving documents, experts, and journals are extracted from it. Then, based on a unified framework in which documents, experts, and journals mutually reinforce one another, the MR-Rank algorithm for ranking expert importance is proposed, which yields fairer and more reasonable expert rankings.

Finally, building on these methods, the architecture and functions of the Rainbow retrieval system based on academic networks are designed and implemented. The architecture comprises data acquisition, data storage, data indexing, data analysis, and visualization of results. The data analysis stage extracts, cleans, and transforms the academic data and performs the importance analysis of document and expert nodes. The ranked results are then presented visually to users in a specified form.

In summary, this thesis addresses precise retrieval of massive literature collections and recommendation of domain experts in the big data setting. It constructs homogeneous and heterogeneous academic network models, mines node importance with the optimized document-ranking algorithm SQT-Rank and the expert-ranking algorithm MR-Rank, and applies the Rainbow retrieval system to recommend high-quality documents and experts, improving the user's retrieval experience.
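The abstract does not state the SQT-Rank formula, so the following is only a minimal sketch of the general idea it describes: a PageRank variant in which a citing document distributes its rank to the documents it cites in proportion to their authority rather than equally. The function name `weighted_pagerank`, the `authority` scores, the damping factor of 0.85, and the uniform handling of documents without outgoing citations are all assumptions made for illustration, not definitions taken from the thesis.

```python
def weighted_pagerank(citations, authority, damping=0.85, iterations=50):
    """Authority-weighted PageRank sketch (illustrative, not SQT-Rank itself).

    citations: dict mapping a document to the list of documents it cites.
    authority: dict mapping a document to a positive authority score
               (hypothetical input; missing entries default to 1.0).
    """
    nodes = sorted(set(citations) | {t for ts in citations.values() for t in ts})
    n = len(nodes)
    rank = {p: 1.0 / n for p in nodes}

    for _ in range(iterations):
        # Every document starts each round with the teleportation share.
        new_rank = {p: (1.0 - damping) / n for p in nodes}
        dangling_mass = 0.0

        for p in nodes:
            targets = citations.get(p, [])
            if not targets:
                # Documents that cite nothing: collect their rank and
                # spread it uniformly below (an assumption of this sketch).
                dangling_mass += rank[p]
                continue
            total = sum(authority.get(t, 1.0) for t in targets)
            for t in targets:
                # Standard PageRank would use 1 / len(targets) here.
                share = authority.get(t, 1.0) / total
                new_rank[t] += damping * rank[p] * share

        for p in nodes:
            new_rank[p] += damping * dangling_mass / n
        rank = new_rank

    return rank


if __name__ == "__main__":
    # Toy citation graph: A cites B and C, B cites C, C cites nothing.
    graph = {"A": ["B", "C"], "B": ["C"], "C": []}
    scores = {"A": 1.0, "B": 1.0, "C": 3.0}
    for doc, value in sorted(weighted_pagerank(graph, scores).items()):
        print(doc, round(value, 4))
```

Because each iteration only sums per-document contributions keyed by the cited document, the inner loop maps naturally onto a map step (emit a share for each citation) and a reduce step (sum the shares per document), which is the kind of decomposition a MapReduce parallelization would rely on.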
[Degree-granting institution]: Shandong University
[Degree level]: Master's
[Year degree granted]: 2017
[Classification number]: TP391.3