Research on Structured Information Processing Technology Based on Vertical Search Engines

Published: 2018-06-10 13:23

Topic: search + index. Source: Zhejiang Sci-Tech University, master's thesis, 2013

[Abstract]: As the Internet has grown, general-purpose search engines have kept expanding to cover an enormous volume of information resources, but they cannot at the same time guarantee the accuracy and timeliness of search results. Vertical search engines emerged to meet this user need. This thesis studies vertical search engines in depth and, in response to the existing model and its problems, proposes an improved vertical search engine model. Based on the characteristics of the module that the model adds, it improves the deduplication and classification algorithms for structured data. Experiments that apply the two improved algorithms within the improved model show that the model further improves the real-time performance and accuracy of a vertical search engine. The core of the new design is an additional secondary data processing module, which converts the extracted unstructured and semi-structured data into structured data. The module's main research content is the deduplication and classification of web page information. The main research content and contributions of the thesis are therefore the following three points.

(1) Drawing on the vertical search engines that are widely used in e-commerce, an improved vertical search engine application model is proposed. Together with the improved deduplication and classification algorithms, the practicality and feasibility of the model are evaluated with two metrics, recall and accuracy.

(2) A new web page deduplication algorithm for information processing is proposed. Four metrics (time complexity, space complexity, recall, and accuracy) are used to analyze the algorithm's feasibility and robustness within the improved vertical search engine model, as well as its effect on information retrieval efficiency.

(3) An existing classification algorithm is improved so that it suits the structured-data processing of the proposed vertical search engine. The algorithm's data structure consists of a term array and a text linked list for each term. The term array is built by segmenting all training texts into words and collecting every feature item that remains after feature extraction; the array stores the ID of each feature item (term). Each term t_i in the term array has a pointer to a linked list of all texts that contain t_i. Each node of the text list has two parts: the ID of the text and the weight of t_i in that text. After the text list of t_i has been generated, it is sorted by decreasing weight of t_i in the text and then further optimized to reduce the search range of the original algorithm.
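The term array and per-term text lists described in the abstract resemble a weighted inverted index. The following is a minimal Python sketch of that structure, not the thesis's implementation. It rests on several assumptions that the abstract does not state: the weight of a term in a text is taken to be its relative frequency, plain Python lists stand in for linked lists, and the names `build_term_index` and `top_k_texts` are hypothetical. The abstract also does not say how the sorted lists are "further optimized"; reading only the first k entries of a list is shown here as one plausible way a weight-sorted list can narrow the search range.

```python
from collections import Counter


def build_term_index(docs):
    """Build a term array and weight-sorted text lists.

    docs maps a text ID to its list of feature terms (assumed to be
    already segmented and feature-selected).
    Returns (term_ids, postings): term_ids maps each term to its ID,
    and postings[term_id] is a list of (text_id, weight) pairs sorted
    by decreasing weight.
    """
    vocabulary = sorted({term for terms in docs.values() for term in terms})
    term_ids = {term: i for i, term in enumerate(vocabulary)}
    postings = [[] for _ in vocabulary]

    for text_id, terms in docs.items():
        counts = Counter(terms)
        total = len(terms)
        for term, count in counts.items():
            # Assumed weight: relative frequency of the term in the text.
            postings[term_ids[term]].append((text_id, count / total))

    # Sort each text list by decreasing weight, as the abstract describes.
    for text_list in postings:
        text_list.sort(key=lambda entry: entry[1], reverse=True)
    return term_ids, postings


def top_k_texts(term, term_ids, postings, k):
    """Return the k heaviest (text_id, weight) pairs for a term.

    Because each list is sorted by weight, only its first k entries
    need to be read. Unknown terms yield an empty list.
    """
    term_id = term_ids.get(term)
    if term_id is None:
        return []
    return postings[term_id][:k]


# Toy training texts (made-up data for illustration only).
docs = {
    1: ["phone", "phone", "case"],
    2: ["phone", "laptop"],
    3: ["laptop", "laptop", "laptop", "case"],
}
term_ids, postings = build_term_index(docs)
```

With this toy data, `top_k_texts("laptop", term_ids, postings, 1)` reads only the head of the list for "laptop" and returns the text in which that term carries the most weight.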
[Degree-granting institution]: Zhejiang Sci-Tech University
[Degree level]: Master's
[Year degree granted]: 2013
[Classification number]: TP391.3


Article ID: 2003367

Article link: http://sikaile.net/kejilunwen/sousuoyinqinglunwen/2003367.html
