
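Research on Web Text Mining Methods for Information Retrieval

Published: 2018-03-31 01:28

  Topic: Web text mining; Focus: semi-supervised learning; Source: South China University of Technology, 2012 doctoral dissertation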


[Abstract]: Today, the Internet has become a popular, interactive medium for publishing information. As a huge, open, heterogeneous, and dynamic information container, the Web generates and holds an enormous volume of text, data, multimedia, transient data, and other kinds of information. Because these resources are scattered and lack unified management and structure, obtaining relevant information is not easy, and the content people actually care about is often buried among large amounts of irrelevant information.
Applying new Web text mining methods and techniques to information retrieval can improve the accuracy and efficiency of page content classification and clustering, improve the organization of retrieval results, and make Web information easier to find and use. This directly or indirectly addresses well-known shortcomings of search engines: low precision, low recall, information overload, limited ways of organizing returned results, and a single form of service. It also provides technical support for advancing information retrieval systems to a new level. Research on Web text mining methods for information retrieval therefore has considerable theoretical significance and commercial value.
From the perspective of information retrieval, Web text content mining is currently a very active research direction. Many researchers have studied the field broadly and in depth, and some encouraging results and applications have been achieved, but the field is far from mature and still faces many important open problems: no "best" dimensionality reduction method for feature selection has been found; text data is high-dimensional and sparse, which makes it hard to improve the accuracy and efficiency of traditional classification and clustering algorithms; semi-supervised learning must work from small labeled training samples; and massive data is hard to search, which raises the question of how to organize and present retrieval results so that they are easy to query and browse.
Building on existing Web text content mining methods and results, this dissertation studies key problems and methods in Web text mining. It gives separate solutions to the feature dimensionality reduction problem for class-imbalanced data and for sentiment-bearing data such as online reviews. Taking semi-supervised learning as the main subject of study, it proposes several new semi-supervised learning algorithms and applies them to Web text mining. It also proposes a method for clustering retrieval results in order to improve the organization of search results. Comparative experiments on several commonly used standard datasets verify the effectiveness of the improved methods.
The results and contributions of this dissertation are mainly the following:
1. For classification on imbalanced text collections, it proposes an enhanced Expectation Maximization (EM) semi-supervised classification algorithm based on naive Bayes. First, an effective feature selection function is constructed to filter out a large number of uninformative feature words while retaining the features that carry the most category information, so that the feature space dimension of a class-imbalanced dataset is effectively reduced. In addition, the combination of the EM algorithm with naive Bayes classification is modified: in each iteration, the unlabeled documents with the highest posterior class probability are moved from the unlabeled training set to the labeled set, so that they do not interfere with determining the classes of the remaining unlabeled samples.
2. For Web text classification problems with a clear sentiment tendency, such as online product reviews, it proposes a semi-supervised classification algorithm based on feature distribution. The class distribution of feature terms is used to compensate for the shortcomings of the information gain method: the joint probability of feature terms and classes in the original information gain method is corrected so as to amplify the differences in how a feature term occurs across classes. The adjusted information gain method retains the features with genuinely high class-discriminating power and thereby effectively reduces the feature space dimension. The feature-distribution-based selection method is then combined with the enhanced EM algorithm for semi-supervised text classification, achieving good classification results and performance.
3. To address the unsatisfactory accuracy and efficiency of traditional Web text clustering methods, it proposes a semi-supervised clustering algorithm based on strong class features and affinity propagation. Building on the efficient and fast affinity propagation algorithm, it incorporates the idea of semi-supervised clustering and makes full use of the prior information latent in a small amount of labeled data. Feature terms with strong class-discriminating power are extracted and used to adjust the cosine similarity matrix of the training samples, giving a similarity measure that combines strong class features with cosine similarity. After each round of iteration, the unlabeled samples whose class is most certain are moved to the labeled set. These measures substantially improve the performance and accuracy of the algorithm.
4. To make better use of a small number of labeled samples, it proposes a semi-supervised clustering algorithm that fuses seed diffusion with affinity propagation. In the initial stage of clustering, the few labeled samples serve as initial seeds. The seed set is then enlarged by diffusion and further purified to remove mislabeled and noisy data, gradually growing the initial seeds into a larger, high-quality seed set that improves clustering initialization. The class structure information contained in the seed set is also used to build a more reasonable similarity measure, which drives the algorithm to converge quickly toward the correct clustering. This provides an effective solution for analyzing large-scale, asymmetric, high-dimensional, and sparse Web text.
5. To improve the organization and presentation of Web search results and make information easier to find and browse, it proposes a Web search result clustering algorithm based on latent semantic information and suffix trees. The algorithm first combines the advantages of the vector space model and the suffix tree model to cluster Web page snippets, grouping page documents that share many phrases into a base cluster. It then uses latent semantic indexing to extract the latent semantic associations between feature terms and documents, and selects candidate phrases that fit the topic as directory labels for the base clusters. The resulting clusters make Web search results easy to browse and help users quickly find the Web pages or sites they are interested in.
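The iteration described in item 1 (move the most confidently classified unlabeled documents into the labeled set, then retrain) can be illustrated with a small sketch. The abstract does not give the dissertation's exact update rule or its feature selection function, so the code below is only a hard-label self-training variant of naive Bayes under stated assumptions: documents are token lists, smoothing is add-one, and the function names and the `per_round` parameter are hypothetical.

```python
import math
from collections import Counter, defaultdict


def train_nb(docs, labels):
    """Fit a multinomial naive Bayes model with add-one smoothing."""
    vocab = {w for d in docs for w in d}
    class_sizes = Counter(labels)
    word_counts = defaultdict(Counter)
    for doc, y in zip(docs, labels):
        word_counts[y].update(doc)
    model = {}
    for y, n_docs in class_sizes.items():
        total = sum(word_counts[y].values()) + len(vocab)
        log_prior = math.log(n_docs / len(docs))
        log_like = {w: math.log((word_counts[y][w] + 1) / total) for w in vocab}
        model[y] = (log_prior, log_like)
    return model


def posterior(model, doc):
    """Return {class: P(class | doc)}; words outside the vocabulary are ignored."""
    scores = {
        y: log_prior + sum(log_like[w] for w in doc if w in log_like)
        for y, (log_prior, log_like) in model.items()
    }
    top = max(scores.values())
    unnorm = {y: math.exp(s - top) for y, s in scores.items()}
    z = sum(unnorm.values())
    return {y: v / z for y, v in unnorm.items()}


def self_training_nb(labeled, unlabeled, per_round=1):
    """Assumed variant: each round, label the per_round most confident documents.

    labeled: list of (tokens, label); unlabeled: list of tokens.
    Returns the final model and the enlarged labeled set.
    """
    labeled = list(labeled)
    pool = list(unlabeled)
    while pool:
        model = train_nb([d for d, _ in labeled], [y for _, y in labeled])
        scored = []
        for i, doc in enumerate(pool):
            post = posterior(model, doc)
            best = max(post, key=post.get)
            scored.append((post[best], i, best))
        scored.sort(reverse=True)  # highest posterior first
        chosen = scored[:per_round]
        for _, i, label in chosen:
            labeled.append((pool[i], label))
        moved = {i for _, i, _ in chosen}
        pool = [d for i, d in enumerate(pool) if i not in moved]
    final = train_nb([d for d, _ in labeled], [y for _, y in labeled])
    return final, labeled


seed = [(["goal", "match", "team"], "sport"),
        (["stock", "market", "bank"], "finance")]
pool = [["team", "goal", "win"], ["bank", "loan", "market"], ["win", "match"]]
model, enlarged = self_training_nb(seed, pool)
```

Moving only the top-ranked documents each round keeps low-confidence guesses out of the training set until the model has seen more evidence, which is the motivation the abstract gives for the transfer step.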
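Item 2 builds on information gain as a feature selection criterion. The abstract does not specify how the dissertation corrects the joint probability of terms and classes, so the sketch below shows only the standard, unadjusted information gain that the method starts from. Documents are assumed to be sets of tokens, and `information_gain` and `select_features` are illustrative names.

```python
import math
from collections import Counter


def entropy(counts):
    """Shannon entropy in bits of a distribution given as raw counts."""
    counts = [c for c in counts if c > 0]
    total = sum(counts)
    if total == 0:
        return 0.0
    return -sum(c / total * math.log2(c / total) for c in counts)


def information_gain(docs, labels, term):
    """IG of the binary event 'term occurs in the document' for the class label."""
    with_term = Counter(y for d, y in zip(docs, labels) if term in d)
    without_term = Counter(y for d, y in zip(docs, labels) if term not in d)
    n = len(docs)
    n_with = sum(with_term.values())
    n_without = n - n_with
    h_class = entropy(Counter(labels).values())
    h_conditional = (n_with / n) * entropy(with_term.values()) \
        + (n_without / n) * entropy(without_term.values())
    return h_class - h_conditional


def select_features(docs, labels, k):
    """Keep the k terms with the highest information gain."""
    vocab = sorted({w for d in docs for w in d})
    ranked = sorted(vocab, key=lambda w: information_gain(docs, labels, w),
                    reverse=True)
    return ranked[:k]


reviews = [{"great", "phone"}, {"great", "battery"},
           {"poor", "phone"}, {"poor", "battery"}]
polarity = ["pos", "pos", "neg", "neg"]
top_terms = select_features(reviews, polarity, k=2)
```

In this toy review set the sentiment words separate the classes perfectly, while the product words occur equally in both classes and carry no class information; an adjusted criterion of the kind the abstract describes would aim to widen such gaps further.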

[Degree-granting institution]: South China University of Technology
[Degree level]: Doctoral
[Year degree awarded]: 2012
[Classification number]: TP311.13



