Research on Internet Video Search Engine Technology Based on Text Analysis

Published: 2018-05-11 01:39

Topic: video search engine + Chinese word segmentation; Source: Hangzhou Dianzi University, master's thesis, 2013

[Abstract]: With the rapid development of network technology, information on the Internet is not only growing geometrically in volume but also becoming more diverse in form. Multimedia information is gradually replacing traditional text as people's first choice for finding information online. Traditional search engines focus on text search, and their support for searching multimedia such as video and images falls far short of what users need. To address this, this thesis designs a search engine dedicated to Internet video. By analyzing and mining the text associated with a video, such as its title and comments, the engine can retrieve videos fairly accurately, and by analyzing user logs it provides personalized search. The thesis first introduces the implementation principle and operation of the web crawler. The crawler collects video-related text from video websites and saves it locally. Because it collects quickly and covers a wide range, it meets users' expectations of fast lookup and broad coverage reasonably well. Next, the thesis describes how video content information is obtained indirectly, by analyzing and mining existing text about a video rather than the video itself. It surveys the mainstream Chinese word segmentation algorithms, compares their strengths and weaknesses, and implements the forward maximum matching algorithm in detail, which gives the later sentence similarity matching algorithm good segmentation results. It then presents a method for filtering the crawled video comments, removing comments that are irrelevant to analyzing the video content, such as purely emotional comments and spam. Relative word frequency is computed over the text to judge what a video is about.
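As an illustration of the forward maximum matching idea named above, here is a minimal Python sketch. It is not the thesis's implementation: the function name `forward_max_match`, the toy dictionary `TOY_DICT`, and the rule that an unknown single character becomes its own token are all assumptions made for this example.

```python
def forward_max_match(text, dictionary, max_len=None):
    """Segment text left to right, always taking the longest dictionary
    word that starts at the current position."""
    if max_len is None:
        # Longest dictionary entry bounds the window; never below 1.
        max_len = max(1, max((len(w) for w in dictionary), default=1))
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest candidate first, then shrink one character at a time.
        for size in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + size]
            # Assumption: a single character is always accepted,
            # so the scan cannot stall on out-of-dictionary text.
            if size == 1 or candidate in dictionary:
                tokens.append(candidate)
                i += size
                break
    return tokens


# Toy dictionary invented for this example (hypothetical data).
TOY_DICT = {"视频", "搜索", "搜索引擎", "引擎", "中文", "分词"}

print(forward_max_match("中文视频搜索引擎", TOY_DICT))
# → ['中文', '视频', '搜索引擎']
```

Because the scan is greedy, "搜索引擎" is kept whole rather than split into "搜索" and "引擎". The same greediness is why this family of methods can mis-segment ambiguous strings.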
The thesis then describes in detail a method for judging a user's query intent from user logs. It first introduces the log mining process and explains how the logs are processed, using the Sogou user logs as an example to obtain data suitable for the subsequent analysis. It proposes a method for judging query intent based on sentence similarity: using the user logs, the method determines which category of video the query terms are most strongly related to and takes that category as the user's intent. Finally, experiments verify the crawler's crawling performance, the spam comment filtering, and the correctness and feasibility of the sentence similarity matching algorithm, and these functions are combined into a personalized search engine system for Internet video.
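The abstract does not say which sentence similarity measure the thesis uses. The sketch below therefore substitutes a common stand-in, cosine similarity over term-frequency vectors, to show how a query could be assigned to the video category whose logged queries it most resembles. The names `cosine_similarity`, `guess_intent`, and `LOGGED`, the two categories, and the sample queries are hypothetical. Queries are assumed to be already segmented into word lists.

```python
import math
from collections import Counter


def cosine_similarity(tokens_a, tokens_b):
    """Cosine similarity between two token lists, using raw term counts."""
    a, b = Counter(tokens_a), Counter(tokens_b)
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0


def guess_intent(query_tokens, category_queries):
    """Return (category, score) for the category whose logged queries
    contain the closest match; (None, 0.0) if nothing overlaps."""
    best_category, best_score = None, 0.0
    for category, logged in category_queries.items():
        score = max((cosine_similarity(query_tokens, q) for q in logged),
                    default=0.0)
        if score > best_score:
            best_category, best_score = category, score
    return best_category, best_score


# Hypothetical, pre-segmented log queries grouped by video category.
LOGGED = {
    "sports": [["足球", "比赛", "集锦"], ["篮球", "决赛"]],
    "music": [["演唱会", "现场"], ["流行", "歌曲", "MV"]],
}

category, score = guess_intent(["足球", "集锦"], LOGGED)
print(category, round(score, 3))
```

Taking the maximum over a category's logged queries is one design choice among several; averaging the scores would favor categories that match the query consistently rather than once.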
[Degree-granting institution]: Hangzhou Dianzi University
[Degree level]: Master's
[Year degree awarded]: 2013
[Classification number]: TP391.3

Article ID: 1871860

Article link: http://sikaile.net/kejilunwen/sousuoyinqinglunwen/1871860.html
