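Research on a Real-Time Search Engine for Community Networks
Published: 2018-02-21 04:43
Keywords: search engine, community network, web crawler, full-text search  Source: Harbin Institute of Technology, master's thesis, 2012  Document type: degree thesis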
[Abstract]: With the continuing development of Internet technology, websites offering a wide range of rich features have emerged. People are no longer satisfied with merely reading news and looking up information online. More and more of them like to record their daily lives on the web and to use short status updates to express their mood or their views on some matter. The web is no longer only a platform for presenting data; it has become a window for presenting its users.

This user-generated data differs from earlier data created by professional editors: it is freer in form, more flexible in style, richer in content, more comprehensive in perspective, and faster in response, so studying it is of considerable value. However, current search engines have difficulty acquiring such data effectively because of certain technical limitations.

This thesis divides a search engine into four modules: data crawling, index construction, query processing, and data presentation. It analyzes the difficulties each module faces when dealing with this kind of data and proposes new theory and solutions to address them.

For data crawling, earlier academic work assumed that web page changes follow a Poisson process. This thesis analyzes how different time periods affect the pattern of page changes, uses the mutual closeness between users to correct that pattern, and proposes a new page change model. For index construction, it proposes using several kinds of index, which improves the timeliness of results and also supports queries for statistics over a time range. For ranking, it improves the original page-based PageRank by taking into account attributes that are new to community data, namely comments and replies, and by adding the importance of the user as a ranking signal. For data presentation, it proposes classifying results by sentiment so that the data shown to users is more intuitive.

Building on these solutions, the thesis then designs and implements a new search engine for community networks. It closes with experimental results that verify the system performs well.
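The abstract names the Poisson assumption and says the thesis refines it by time period, but it gives no formulas. As a rough sketch of that baseline only: under a homogeneous Poisson process with rate λ, the probability that a page has changed at least once within time t of the last crawl is 1 - e^(-λt). The second function below lets the rate vary by hour of day, which is one simple way to express a time-period effect. It is an illustrative assumption, not the model proposed in the thesis, and the correction based on user closeness is not modeled. All function and parameter names are hypothetical.

```python
import math


def change_probability(rate_per_hour: float, hours_since_crawl: float) -> float:
    """Probability of at least one change under a homogeneous Poisson process."""
    if rate_per_hour < 0 or hours_since_crawl < 0:
        raise ValueError("rate and elapsed time must be non-negative")
    return 1.0 - math.exp(-rate_per_hour * hours_since_crawl)


def change_probability_by_hour(hourly_rates, start_hour: int, whole_hours: int) -> float:
    """Same probability when the change rate depends on the hour of day.

    hourly_rates holds 24 hypothetical rates (changes per hour), one per clock hour.
    The expected number of changes is the sum of the rates over the elapsed hours,
    and the page has changed with probability 1 - exp(-expected_changes).
    """
    if len(hourly_rates) != 24:
        raise ValueError("expected exactly 24 hourly rates")
    if whole_hours < 0:
        raise ValueError("elapsed hours must be non-negative")
    expected_changes = sum(
        hourly_rates[(start_hour + offset) % 24] for offset in range(whole_hours)
    )
    return 1.0 - math.exp(-expected_changes)


if __name__ == "__main__":
    # Hypothetical rates: pages change more often in the evening (18:00 to 23:00).
    rates = [0.05] * 18 + [0.5] * 6
    print(change_probability(0.05, 6))
    print(change_probability_by_hour(rates, 18, 6))
```

A crawler could use such a probability to decide which pages to revisit first. With a constant rate in every hour, the second function reduces to the first.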
[Degree-granting institution]: Harbin Institute of Technology
[Degree level]: Master's
[Year degree awarded]: 2012
[Classification number]: TP391.3
Article ID: 1521029
Article link: http://sikaile.net/kejilunwen/sousuoyinqinglunwen/1521029.html