Research on Web Mining Technology and Its Applications on the Internet

Published: 2018-10-26 19:57

[Abstract]: As information technology continues to develop, computer and communication technologies not only drive the informatization of modern society but also influence and change how people live. Information technology has also brought explosive growth in data, and there is an urgent need for ways to process and use massive data effectively. Data mining emerged against this big-data background. Web mining, a branch of data mining, targets the effective organization and use of the massive data on the World Wide Web. Because Internet technology changes rapidly while Web mining developed relatively late, this thesis takes Web mining as its research core and analyzes its applications on the Internet in depth. The thesis first introduces the research background, current state, technical difficulties, and future directions of Web technology, and explains related concepts such as data mining and machine learning. It then turns to the implementation process and application scenarios of Web mining, covering the core steps of text preprocessing and the technical background of two applications: topic detection and tracking, and user behavior analysis.
As an important application of Web content mining, topic detection and dynamic tracking aims to detect previously unknown topics and to track the subsequent development of existing ones. For topic detection and dynamic tracking over news event reports on online media, this thesis implements a hybrid clustering solution. The solution adjusts the topic model hierarchically based on "contribution degree", which makes it better suited to building Internet news topics and greatly improves efficiency. Experiments on real Internet news data show that, compared with the K-Means algorithm, the solution significantly improves accuracy and recall, and the resulting topic tree model is clearly hierarchical.
For topic detection and dynamic tracking over Chinese Weibo posts, this thesis proposes an incremental fuzzy clustering solution based on theme words. The solution first proposes a spam filtering scheme tailored to the textual characteristics of Weibo. It then uses two factors, timeliness and word frequency, to assign theme-word weights suited to Weibo. Finally, it applies incremental fuzzy clustering to detect burst topics. Experiments on real Weibo data show that the solution effectively detects breaking events and hot topics with satisfactory time efficiency.
As an important application of Web usage mining, user behavior analysis aims to understand user habits and interests and to evaluate users' satisfaction with a product, so that the product and the user experience can be improved. For evaluating user satisfaction with search engines, this thesis describes an automated solution based on user behavior. The solution first preprocesses the raw Web logs, obtaining concrete user operation data from the logs and extracting features. It then proposes a recommendation technique based on the CURE algorithm to select samples, which are labeled manually. Finally, it uses dynamic modeling to build the user satisfaction model. Experiments on real search engine data show that the machine-learning-based automated evaluation approaches the level of manual evaluation and meets the requirements of practical use, and that the dynamic model, through mechanisms such as multi-model construction, automatic updating, and feedback correction, effectively extends the model's life cycle and improves the continuity of learning.
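The Weibo pipeline summarized in the abstract (weight theme words by timeliness and word frequency, then cluster posts incrementally with fuzzy membership) can be illustrated with a minimal sketch. The abstract does not give the thesis's actual formulas, so everything here is an assumption made for illustration: the exponential half-life decay, the cosine similarity measure, the threshold value, and the function names `term_weights` and `incremental_fuzzy_cluster` are all hypothetical.

```python
import math
from collections import Counter


def term_weights(tokens, age_hours, half_life_hours=24.0):
    """Weight each term by relative frequency times a time decay.

    Assumption: timeliness is modeled as an exponential half-life decay.
    """
    decay = 0.5 ** (age_hours / half_life_hours)
    counts = Counter(tokens)
    total = sum(counts.values())
    return {term: decay * n / total for term, n in counts.items()}


def cosine(a, b):
    """Cosine similarity between two sparse term-weight dictionaries."""
    dot = sum(w * b.get(term, 0.0) for term, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def incremental_fuzzy_cluster(posts, threshold=0.3):
    """Single-pass clustering with soft (fuzzy) membership.

    posts: iterable of (tokens, age_hours) pairs, processed one at a time.
    Returns (centroids, memberships), where memberships[i] maps a cluster
    index to the membership degree of post i in that cluster.
    """
    centroids = []    # one accumulated term-weight dict per cluster
    memberships = []  # one {cluster_index: degree} dict per post
    for tokens, age_hours in posts:
        vec = term_weights(tokens, age_hours)
        sims = {i: cosine(vec, c) for i, c in enumerate(centroids)}
        close = {i: s for i, s in sims.items() if s >= threshold}
        if not close:
            # No existing topic is similar enough: start a new one.
            centroids.append(dict(vec))
            memberships.append({len(centroids) - 1: 1.0})
            continue
        # Split membership across all sufficiently similar topics.
        total = sum(close.values())
        degrees = {i: s / total for i, s in close.items()}
        for i, degree in degrees.items():
            for term, w in vec.items():
                centroids[i][term] = centroids[i].get(term, 0.0) + degree * w
        memberships.append(degrees)
    return centroids, memberships
```

Because each post is compared only against the current centroids, new posts can be absorbed without re-clustering earlier ones, which is the property that makes an incremental scheme attractive for a continuous stream of short posts. The time decay has no effect on a single post's cosine similarity; it only reduces how much older posts contribute to the accumulated centroids.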
[Degree-granting institution]: Shandong University
[Degree level]: Master's
[Year degree awarded]: 2013
[Classification number]: TP311.13;TP391.1

Article No.: 2296793
