Personalized Book Recommendation Based on Text Semantics
Published: 2018-07-28 17:07
[Abstract]: The large volume of book tags and abstracts accumulated on the Internet offers a new data source for analyzing reading interests and building personalized book recommendation systems. This thesis therefore studies how to integrate textual data such as tags and abstracts to build a personalized book recommendation system and improve its performance. The work has three parts: a semantics-based interest preference model, the design of the recommendation algorithm, and a parallel implementation on the Spark platform. First, an algorithm is proposed that computes the semantic similarity of tags from word vectors and co-occurrence frequency, with optimizations designed for the specific scenario. Next, the PIC algorithm and the LDA algorithm are used to build semantic preference models based on tags and on book abstracts respectively, and an extended collaborative filtering algorithm based on semantic preference generates the book recommendation list. Finally, the recommendation system is implemented in parallel on the Spark distributed computing platform.
The thesis first introduces the research background and significance, summarizes from the related literature the key problems that affect the performance of personalized recommendation systems, and defines the specific research content. Second, it reviews the key techniques involved, including semantic analysis, clustering, and recommendation algorithms, and notes the strengths and weaknesses of each as the theoretical basis for the later work. Third, it builds the interest preference model based on text semantics: a decay function is introduced as a weight to handle the temporal effect of tag preferences; an algorithm that computes tag similarity from word vectors and co-occurrence frequency is proposed and optimized for this scenario to improve the accuracy of the relatedness computation; tags are clustered with the PIC algorithm to build an interest preference model based on tag semantics, which addresses tag sparsity; and the LDA algorithm is used to analyze the latent topic distribution of book abstracts and build a preference model based on abstract semantics, which addresses the cold-start problem caused by too few tags. The extended collaborative filtering algorithm based on semantic preference is then used to generate recommendations, and experiments are designed to test system performance. The results show that (1) reading-interest preference features based on text semantics correctly reflect users' interests, and (2) the recommendation algorithm performs well on metrics such as accuracy and diversity. Finally, a recommendation system is designed and implemented on the Spark distributed computing platform. Its main modules are word vector training, LDA topic analysis, tag clustering, and the extended collaborative filtering algorithm. The first three are implemented on interfaces provided by MLlib, Spark's machine learning library. The extended collaborative filtering algorithm has an item-based mode and a user-based mode, and an implementation flow is designed for each module. Measurements show a significant speedup for each algorithm.
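The abstract names two ingredients of the tag preference model without giving formulas: a decay function that weights tag use by age, and a tag similarity computed from word vectors and co-occurrence frequency. The sketch below is only an illustration of that idea, not the thesis's method. It assumes an exponential decay with a hypothetical `half_life_days` parameter, Jaccard overlap as the co-occurrence measure, and a linear blend controlled by a hypothetical `alpha`; all function and parameter names are invented for the example.

```python
import math
from collections import defaultdict


def decayed_tag_weights(tag_events, now, half_life_days=180.0):
    """Sum each tag's uses, discounting older uses exponentially.

    tag_events: iterable of (tag, day) pairs for one user, with day as a number.
    half_life_days is an assumed parameter: a use that is one half-life old
    counts half as much as a use made today.
    """
    weights = defaultdict(float)
    for tag, day in tag_events:
        age = max(0.0, now - day)
        weights[tag] += 0.5 ** (age / half_life_days)
    return dict(weights)


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def cooccurrence(tag_a, tag_b, book_tags):
    """Jaccard overlap of the sets of books that carry each tag (an assumption)."""
    books_a = {book for book, tags in book_tags.items() if tag_a in tags}
    books_b = {book for book, tags in book_tags.items() if tag_b in tags}
    union = books_a | books_b
    return len(books_a & books_b) / len(union) if union else 0.0


def tag_similarity(tag_a, tag_b, vectors, book_tags, alpha=0.5):
    """Blend word-vector similarity with co-occurrence; alpha is assumed."""
    semantic = cosine(vectors[tag_a], vectors[tag_b])
    shared = cooccurrence(tag_a, tag_b, book_tags)
    return alpha * semantic + (1.0 - alpha) * shared
```

With toy vectors `{"sci-fi": [1.0, 0.0], "space": [1.0, 0.0]}` and books `{"b1": {"sci-fi", "space"}, "b2": {"sci-fi"}}`, `tag_similarity("sci-fi", "space", ...)` blends a cosine of 1.0 with a co-occurrence of 0.5. A pairwise similarity of this kind could serve as the affinity input to a clustering step such as PIC, and the decayed weights as a user's per-tag preference strengths.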
[Degree-granting institution]: Southeast University
[Degree level]: Master's
[Year degree granted]: 2017
[Classification number]: TP391.3
Document ID: 2150977
Article link: http://sikaile.net/kejilunwen/ruanjiangongchenglunwen/2150977.html