Research on Data Mining of User Posts on Comment-Oriented Websites
Published: 2018-10-13 16:07
[Abstract]: With the rapid growth of the Internet, many comment-oriented websites have emerged that interact closely with their users. Their most prominent characteristics are real-time activity and rapid turnover of information. Because of these characteristics, such sites contain a great deal of latent, valuable knowledge, and mining it can offer important guidance for social development.

This thesis takes BBS sites, the most typical representatives of this class of website, as its research object and uses a search engine to mine their comment content and extract potentially valuable information. It adopts a new page-ranking algorithm (the P-OPIC algorithm) that improves the depth of web content mining and lets users locate target pages more quickly.

The thesis studies the components and architecture of search engines and analyzes the operating mechanism of the open-source search engine Nutch. The main work falls into the following parts:

(1) The crawler and indexing frameworks of Nutch are studied in detail, and Nutch's execution flow is analyzed in depth. The PageRank, HITS, and OPIC algorithms are studied, and an optimized algorithm based on OPIC is proposed. The optimized algorithm incorporates each page's PageRank value and a BBS-site adjustment factor; the adjustment factor improves the stability of BBS page rankings.

(2) Nutch's data structures are studied, a new data structure is added to Nutch, and Chinese word segmentation is implemented. Modifying the data in the Nutch source code reduces the algorithm's impact on search engine system performance.

(3) An experimental method is proposed to evaluate the algorithm's performance, comparing the OPIC algorithm with the improved OPIC-based algorithm. Tested on BBS data, the improved algorithm interprets user-entered keywords well, ranks pages considerably better than OPIC, and markedly improves ranking accuracy. The experimental results are analyzed and compared, and the algorithm's strengths and weaknesses are summarized.
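
To make the OPIC algorithm named in the abstract concrete, the following is a minimal Python sketch of the basic OPIC "cash and history" loop with a greedy page-selection policy. It is an illustration under stated assumptions, not the thesis's implementation: the abstract does not give the P-OPIC formula, so the `adjusted_score` helper, its `weight` parameter, and the way it blends a PageRank value with a BBS adjustment factor are all hypothetical.

```python
from typing import Dict, List


def opic(links: Dict[str, List[str]], steps: int) -> Dict[str, float]:
    """Estimate page importance with a basic OPIC loop.

    `links` maps every page to its out-links; every link target must
    also appear as a key. Each page starts with an equal share of cash.
    """
    pages = sorted(links)
    cash = {p: 1.0 / len(pages) for p in pages}
    history = {p: 0.0 for p in pages}
    for _ in range(steps):
        # Greedy policy: visit the page currently holding the most cash.
        page = max(pages, key=lambda p: cash[p])
        amount = cash[page]
        history[page] += amount
        cash[page] = 0.0
        # A page without out-links spreads its cash over all pages.
        targets = links[page] or pages
        for target in targets:
            cash[target] += amount / len(targets)
    # Importance is the share of all cash a page has held so far.
    total = sum(history.values()) + sum(cash.values())
    return {p: (history[p] + cash[p]) / total for p in pages}


def adjusted_score(opic_score: float, pagerank: float,
                   bbs_factor: float = 1.0, weight: float = 0.5) -> float:
    """Hypothetical blend of OPIC, PageRank, and a BBS factor.

    Assumption for illustration only: this is NOT the P-OPIC formula,
    which the abstract does not state.
    """
    return bbs_factor * ((1.0 - weight) * opic_score + weight * pagerank)


if __name__ == "__main__":
    graph = {"a": ["b"], "b": ["a"], "c": ["a"]}
    scores = opic(graph, steps=100)
    for page, score in scores.items():
        print(page, round(score, 4))
```

In this toy graph, pages `a` and `b` link to each other and so keep exchanging cash, while `c` has no in-links and its estimated importance shrinks as the total history grows. A linear blend is only one of many ways a PageRank value and a site-level factor could be folded into an OPIC score.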
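[Degree-granting institution]: Beijing University of Posts and Telecommunications (北京郵電大學)
[Degree level]: Master's
[Year degree conferred]: 2013
[Classification number]: TP311.13; TP391.3
Article ID: 2269206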
Article link: http://sikaile.net/kejilunwen/sousuoyinqinglunwen/2269206.html