Research on Search Need Recognition Based on Map-Reduce and Trie Trees
Published: 2018-10-13 16:32
[Abstract]: In an Internet era of explosive data growth, people face opportunities and challenges at once. On the one hand, they keep mining useful information from the "gold mine" of big data; on the other, they can be left helpless in the face of large amounts of redundant Web information. Search engines, the most commonly used information retrieval tool, help people find what they need on the Internet, but they also bear a heavy burden from this data growth. As search engine index data grows ever larger, the query workload becomes increasingly heavy, and the vast majority of the information a search engine retrieves is irrelevant to the user's needs. If a search engine could predict a user's search need before launching the search, it could provide a better search experience. Real-time analysis of user search needs not only gives users more personalized results but also avoids a great deal of unnecessary computation. User search needs have therefore become a key research area for scholars in China and abroad. Predicting a user's need requires recognizing the user's search terms, and this recognition usually relies on log mining. Search logs, however, are now at the TB scale, which makes such mining difficult on a single machine. Addressing the characteristics of large-scale data computing, this thesis proposes the Paratemp strategy for constructing need-recognition templates. Using Map-Reduce, the strategy trains on search logs across a distributed cluster to mine representative classification templates, yielding patterns that can identify users' search needs. Drawing on the confidence and support measures from association rule mining, the thesis also proposes selection criteria for the templates.
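The abstract does not spell out how Paratemp represents templates or computes its thresholds, so the following is only a minimal sketch of the general idea, run in a single process rather than on a cluster. It assumes that a template is a query whose known entity words are replaced by slots, that each log record is a `(query, category)` pair, and that support and confidence follow their usual association-rule definitions. The names `ENTITY_SLOTS`, `to_template`, `mine_templates`, and the sample log are all hypothetical.

```python
from collections import defaultdict

# Hypothetical entity dictionary: surface word -> slot name.
ENTITY_SLOTS = {"beijing": "[city]", "shanghai": "[city]", "titanic": "[movie]"}

# Hypothetical labelled search log: (query, category) records.
LOG = [
    ("Beijing weather", "weather"),
    ("Shanghai weather", "weather"),
    ("Beijing weather", "weather"),
    ("Titanic download", "video"),
    ("Titanic download", "software"),
    ("Shanghai hotels", "travel"),
]


def to_template(query):
    """Replace known entity words with their slot names."""
    return " ".join(ENTITY_SLOTS.get(w, w) for w in query.lower().split())


def map_phase(record):
    """Map: one (query, category) record -> ((template, category), 1)."""
    query, category = record
    yield (to_template(query), category), 1


def shuffle(pairs):
    """Group mapper output by key, as a Map-Reduce framework does between phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: sum the counts for one (template, category) key."""
    return key, sum(values)


def mine_templates(log, min_support, min_confidence):
    """Return {(template, category): (support, confidence)} for templates that pass both thresholds."""
    mapped = [pair for record in log for pair in map_phase(record)]
    counts = dict(reduce_phase(k, v) for k, v in shuffle(mapped).items())

    # Total occurrences of each template across all categories.
    template_totals = defaultdict(int)
    for (template, _category), n in counts.items():
        template_totals[template] += n

    kept = {}
    for (template, category), n in counts.items():
        support = n / len(log)                      # share of all log records
        confidence = n / template_totals[template]  # share of this template's records
        if support >= min_support and confidence >= min_confidence:
            kept[(template, category)] = (support, confidence)
    return kept


if __name__ == "__main__":
    print(mine_templates(LOG, min_support=0.3, min_confidence=0.8))
```

On the sample log, only `("[city] weather", "weather")` survives thresholds of 0.3 and 0.8, with support 0.5 and confidence 1.0. The `[movie] download` template is dropped because its records split evenly between two categories, which is the kind of ambiguity a confidence threshold is meant to filter out.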
Templates that pass the selection criteria serve as the supporting basis for classifying search needs. Once the user search templates have been extracted, an efficient natural language algorithm is needed to apply them to recognizing search needs. This thesis designs the Tempaser recognition algorithm, which parses search terms using a Trie tree, following the idea of trading space for time, and thereby recognizes search needs. Experiments show that search need recognition based on Map-Reduce and Trie trees is both correct and efficient. The thesis closes with a summary of the research and an outlook on future work.
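The abstract gives no details of Tempaser beyond its use of a Trie, so the sketch below is not the thesis's algorithm. It only illustrates how a token-level Trie of templates can trade memory for lookup speed: each query token is resolved by a dictionary lookup instead of a scan over every template. The class names, the `[city]` slot notation, and the entity dictionary are assumptions made for illustration, and only single-word entities are handled.

```python
class TrieNode:
    def __init__(self):
        self.children = {}    # token or slot name -> TrieNode
        self.category = None  # set on the node that ends a template


class TemplateTrie:
    def __init__(self, entity_slots):
        self.root = TrieNode()
        self.entity_slots = entity_slots  # hypothetical map: word -> slot name

    def insert(self, template, category):
        """Store one template, token by token, and label its final node."""
        node = self.root
        for token in template.split():
            node = node.children.setdefault(token, TrieNode())
        node.category = category

    def recognize(self, query):
        """Return the category of a template matching the whole query, else None."""
        return self._walk(self.root, query.lower().split(), 0)

    def _walk(self, node, tokens, i):
        if i == len(tokens):
            return node.category
        token = tokens[i]
        # Try the literal token first, then the slot the token belongs to (if any).
        for key in (token, self.entity_slots.get(token)):
            child = node.children.get(key) if key is not None else None
            if child is not None:
                found = self._walk(child, tokens, i + 1)
                if found is not None:
                    return found
        return None


if __name__ == "__main__":
    trie = TemplateTrie({"beijing": "[city]", "shanghai": "[city]"})
    trie.insert("[city] weather", "weather")
    trie.insert("[city] hotels", "travel")
    print(trie.recognize("Shanghai weather"))
    print(trie.recognize("Shanghai restaurants"))
```

In this sketch the first query resolves to `weather` and the second returns `None` because no stored template ends in `restaurants`. Templates that share a prefix such as `[city]` share Trie nodes, so adding more templates deepens or widens the tree without adding a per-template comparison at lookup time.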
[Degree-granting institution]: Jiangxi Normal University
[Degree level]: Master's
[Year degree conferred]: 2015
[Classification number]: TP391.3
Article ID: 2269258
Article link: http://sikaile.net/kejilunwen/sousuoyinqinglunwen/2269258.html