Research on Key Technologies for Internet Search Term Classification
Topics: search keywords + pseudo-relevance feedback; Source: Zhejiang University, master's thesis, 2011
[Abstract]: With the rapid growth of the Internet, the volume of digital information online is increasing exponentially, and it is becoming ever harder for people to find the specific information they need. Search engines, which help people locate what they actually need within this mass of information, serve as the information-access platform for network users and have become an indispensable Internet application. Users rely on search engines more and more heavily, and search behavior has become an important part of their online activity. The most important element of that behavior is the search term the user supplies: search terms reflect, directly or indirectly, the user's latent interests and needs, and so provide a sound basis for network services such as personalized applications and targeted online advertising.
This thesis therefore proposes classifying search terms. It analyzes in detail the background against which Internet search terms arise, summarizes a definition of the search term, describes the characteristics of search terms, and analyzes the difficulties of search term classification in light of existing techniques. It then proposes a two-stage solution: search term preprocessing based on pseudo-relevance feedback, followed by search term classification based on text classification techniques. Through pseudo-relevance feedback, the unfamiliar problem of search term classification is transformed into one that existing text classification techniques can solve.
In solving the search term classification problem, the thesis studies and compares several text classification techniques and proposes a reconstruction-based feature refinement method that further reduces the feature set after an initial round of feature selection. The method draws on column selection to define an objective function for choosing a subset of the initially selected features, solves that objective using a greedy strategy and ideas from transductive experimental design, and obtains a locally optimal reduced feature subset; experiments confirm that the method is usable. Through detailed and comprehensive experiments, the thesis also compares the classification results of many combinations of feature selection methods and classifiers, and selects the combination best suited to this classification problem. Finally, it outlines directions in which search term classification can be further improved and applied.
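The two-stage scheme described in the abstract can be illustrated with a minimal Python sketch. It is not the thesis's implementation: the toy corpus, the labeled training texts, the term-overlap retrieval, the multinomial Naive Bayes classifier, and every function name below are assumptions chosen only to show the idea. Stage 1 expands a short search term with the text of its top-ranked retrieved documents (pseudo-relevance feedback); stage 2 classifies the expanded text with an ordinary text classifier.

```python
import math
from collections import Counter

# Hypothetical stand-in for documents returned by a search engine.
CORPUS = [
    "django web framework programming code tutorial",
    "django tutorial software language code",
    "corolla car price dealer",
    "corolla engine review speed",
]

# Hypothetical labeled training texts for the stage-2 text classifier.
LABELED = [
    ("autos", "car dealer price engine review speed"),
    ("animals", "cat jungle habitat snake reptile"),
    ("computing", "programming language tutorial code software"),
]


def expand_query(query, corpus, k=2):
    """Stage 1: pseudo-relevance feedback.

    Rank documents by how many query terms they contain, assume the
    top k matches are relevant, and append their text to the query.
    """
    terms = set(query.split())
    scored = [(len(terms & set(doc.split())), doc) for doc in corpus]
    hits = [doc for score, doc in sorted(scored, key=lambda p: -p[0]) if score > 0]
    return " ".join([query] + hits[:k])


def train_nb(labeled):
    """Collect per-class term counts for multinomial Naive Bayes."""
    counts, docs_per_label, vocab = {}, Counter(), set()
    for label, text in labeled:
        tokens = text.split()
        counts.setdefault(label, Counter()).update(tokens)
        docs_per_label[label] += 1
        vocab.update(tokens)
    return counts, docs_per_label, vocab


def classify(text, model):
    """Stage 2: return the most probable label, or None when the text
    shares no term with the training vocabulary."""
    counts, docs_per_label, vocab = model
    tokens = [t for t in text.split() if t in vocab]
    if not tokens:
        return None
    total_docs = sum(docs_per_label.values())
    best_label, best_score = None, -math.inf
    for label, term_counts in counts.items():
        denom = sum(term_counts.values()) + len(vocab)  # Laplace smoothing
        score = math.log(docs_per_label[label] / total_docs)
        score += sum(math.log((term_counts[t] + 1) / denom) for t in tokens)
        if score > best_score:
            best_label, best_score = label, score
    return best_label


if __name__ == "__main__":
    model = train_nb(LABELED)
    for q in ("django", "corolla"):
        print(q, "->", classify(expand_query(q, CORPUS), model))
```

In this sketch the bare term "django" has no overlap with the classifier's vocabulary and cannot be classified on its own; after expansion it carries enough ordinary text features for a standard classifier to work with, which is the transformation the abstract describes.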
[Degree-granting institution]: Zhejiang University
[Degree level]: Master's
[Year conferred]: 2011
[Classification number]: TP391.1