Research on User Behavior of "Science and Technology Paper Online" Based on Historical Context Mining
Topic: context  Entry point: web logs  Source: Wuhan University of Technology, master's thesis, 2013
[Abstract]: "China Science and Technology Paper Online" (中國科技論文在線) is hosted by the Science and Technology Development Center of the Ministry of Education. Its stated aims are "expounding academic viewpoints, protecting intellectual property rights, exchanging ideas for innovation, and sharing papers quickly." It gives researchers a convenient, fast platform for academic exchange, on which new results can be promoted promptly and innovative research ideas exchanged promptly. As an information-acquisition website, it delivers large amounts of information quickly and conveniently, but it also raises several problems: how to let users obtain the research information they need quickly and accurately, and how to interpret existing user history data and use it to predict users' future behavior. Research on the behavior of "Science and Technology Paper Online" users can address these problems effectively.

After analyzing the respective strengths and weaknesses of historical context information and web information, this thesis fuses historical context information with web logs. The fused data draws on a broader range of sources, reflects more completely the environment in which a user visited a page, and reflects more accurately the user's mood, psychological state, and behavioral characteristics at that time. Mining and analyzing these two kinds of data together yields users' access patterns and access characteristics more accurately.

The thesis studies the algorithms used at each stage of historical context mining (data acquisition, fusion, and preprocessing) and makes some improvements and innovations to them. It then applies DICA, an improved clustering algorithm, to the session set produced by preprocessing, and derives recommendation sets from the clustering results in order to improve the site structure and provide recommendation services to users.

The work is concentrated in four areas:

(1) Data preprocessing. After a fairly comprehensive analysis of the data characteristics of historical context information and web logs, several kinds of historical context information are denoised and fused with the server-side web logs. A session partition algorithm then organizes the fused information into a session set. On that basis, a user access trajectory reconstruction algorithm simulates the path each user followed at the time, and the session set is refined again accordingly. Finally, the terminal environment context information contained in the historical context is used to correct the browsing time recorded for each page.

(2) Page interest calculation. For the resulting session set, a multi-feature page interest calculation method assigns a weight to each page. Earlier weighting algorithms cannot reflect the order in which a user browsed pages, so the thesis proposes adding the page's sequence number within the session as a feature in the weight calculation. This effectively distinguishes cases in which several users visit the same pages in different orders.

(3) Clustering analysis of user behavior. After weights are assigned to the pages in the session set, the thesis proposes DICA, an improved k-means algorithm. DICA determines the optimal number of clusters and the initial cluster centers automatically, which effectively avoids two drawbacks of k-means: the number of clusters has to be set from experience, and the initial cluster centers are chosen at random.

(4) Generating recommendation sets. Clustering the weighted session set with DICA yields a recommendation set based on group users and a recommendation set based on individual users. The two sets are fused and used to improve the site structure and provide recommendation services to users.

This research was supported by the Ministry of Education project "Research on User Behavior of 'China Science and Technology Paper Online' Based on Context Awareness" (Project No. 20121140004).
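The abstract does not describe the thesis's session partition algorithm in detail. As an illustration of the general idea only, the sketch below uses the common inactivity-timeout heuristic: a user's visits are sorted by time, and a new session starts whenever the gap between two consecutive visits exceeds a threshold. The record layout `(user_id, timestamp, url)`, the function name `split_sessions`, and the 30-minute threshold are all assumptions made for this example and are not taken from the thesis.

```python
from collections import defaultdict
from datetime import datetime, timedelta

# Assumed inactivity threshold; the thesis abstract does not state one.
SESSION_TIMEOUT = timedelta(minutes=30)


def split_sessions(records, timeout=SESSION_TIMEOUT):
    """Group (user_id, timestamp, url) records into per-user sessions.

    A new session begins when the gap between two consecutive visits
    by the same user is larger than `timeout`.
    Returns a list of (user_id, [(timestamp, url), ...]) tuples.
    """
    # Collect each user's visits (hypothetical record layout).
    by_user = defaultdict(list)
    for user_id, ts, url in records:
        by_user[user_id].append((ts, url))

    sessions = []
    for user_id, visits in by_user.items():
        visits.sort(key=lambda v: v[0])  # chronological order
        current = [visits[0]]
        for prev, cur in zip(visits, visits[1:]):
            if cur[0] - prev[0] > timeout:
                # Gap too large: close the current session, start a new one.
                sessions.append((user_id, current))
                current = []
            current.append(cur)
        sessions.append((user_id, current))
    return sessions


if __name__ == "__main__":
    t0 = datetime(2013, 1, 1, 9, 0)
    demo = [
        ("u1", t0, "/a"),
        ("u1", t0 + timedelta(minutes=5), "/b"),
        ("u1", t0 + timedelta(minutes=50), "/c"),  # 45-minute gap
        ("u2", t0, "/a"),
    ]
    for user, visits in split_sessions(demo):
        print(user, [url for _, url in visits])
```

In this sketch the position of a page inside each returned session list is its sequence number, which is the kind of ordering information item (2) proposes to use as a weighting feature. A trajectory reconstruction step such as the one described in item (1) would refine these sessions further and is not attempted here.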
[Degree-granting institution]: Wuhan University of Technology
[Degree level]: Master's
[Year conferred]: 2013
[Classification number]: TP311.13