Research and Implementation of Web Information Extraction Applied to a Bookmark System

Published: 2018-12-15 19:39

[Abstract]: A social bookmarking system is an effective tool for collecting, managing, and sharing Web information resources, but its social features depend on the number of users and the volume of resources. This thesis studies how to apply Web information extraction and related natural language processing research to a bookmark system, in order to address the system's cold-start problem and improve the user experience.

The thesis first studies and implements a Web information extraction algorithm. Building on the Goose project, the algorithm improves web page fetching, adds automatic detection of page encoding, optimizes page preprocessing based on observing and summarizing the HTML structural features of a large number of websites, adds support for extracting information from Chinese web pages, and finally formats the extracted body text to improve readability. The result is an ElementTree-based Web information extraction module that can be used in production systems and is highly practical. Using the extraction results together with web page metadata, the thesis also implements a resource-based tag recommendation algorithm and a simple web page summarization feature.

The thesis then designs and implements the bookmark system. The infrastructure uses Tornado as both the Web server and the Web development framework, and MongoDB as the database server. The client uses the AngularJS and jQuery frameworks with Bootstrap 3 styling to deliver a client application with a responsive layout and a flat grid, and a Chrome browser extension is also implemented. The system integrates the Web information extraction module to provide features such as reading and editing bookmark content, which effectively improves the user experience.

Because it works on the extraction results, the system's search can be implemented as full-text search. This avoids the limitation of traditional bookmark systems, which usually search only tags or titles, and also avoids the noise introduced by running full-text search over entire Web pages. Unlike currently popular recommended-reading systems, the system emphasizes bookmark management over reading. Combining the bookmark system with a note-taking system could enable effective secondary filtering of information.
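The abstract names the extraction pipeline (preprocess the page, then pull out the body text with ElementTree) without giving details. The following is only a minimal sketch of that general idea, not the thesis's algorithm and not Goose's scoring. It assumes the page has already been fetched, decoded, and cleaned into well-formed XHTML, since the standard library's `xml.etree.ElementTree` cannot parse arbitrary HTML. The names `NOISE_TAGS`, `strip_noise`, `score`, and `extract_main_text`, the candidate tag set, and the link penalty factor are all hypothetical choices made for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical list of tags that rarely contain article text.
NOISE_TAGS = {"script", "style", "nav", "footer", "aside"}


def strip_noise(root):
    """Preprocessing: remove subtrees whose tag is in NOISE_TAGS."""
    for parent in list(root.iter()):
        for child in list(parent):
            if child.tag in NOISE_TAGS:
                parent.remove(child)


def text_len(elem):
    """Length of all text inside elem, ignoring outer whitespace."""
    return len("".join(elem.itertext()).strip())


def score(elem):
    """Text length of direct <p> children, penalized by link text.

    The factor 2 is an arbitrary illustrative weight.
    """
    paragraphs = elem.findall("p")  # direct children only
    total = sum(text_len(p) for p in paragraphs)
    links = sum(text_len(a) for p in paragraphs for a in p.iter("a"))
    return total - 2 * links


def extract_main_text(xhtml):
    """Return the paragraphs of the highest-scoring container element."""
    root = ET.fromstring(xhtml)
    strip_noise(root)
    candidates = [e for e in root.iter()
                  if e.tag in {"div", "td", "article", "body"}]
    best = max(candidates, key=score)
    return "\n".join("".join(p.itertext()).strip()
                     for p in best.findall("p"))


SAMPLE = """<html><body>
<div id="nav"><p><a href="/">Home</a> <a href="/a">Archive</a></p></div>
<div id="content"><p>First paragraph of the article body.</p><p>Second paragraph with a <a href="/x">link</a>.</p></div>
<script>var x = 1;</script>
</body></html>"""

if __name__ == "__main__":
    print(extract_main_text(SAMPLE))
```

On `SAMPLE`, the navigation block scores below zero because almost all of its text sits inside links, so the content block is selected. A production module would need a tolerant HTML parser and the encoding detection the abstract mentions, neither of which is shown here.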
[Degree-granting institution]: Nanjing University of Science and Technology
[Degree level]: Master's
[Year degree awarded]: 2014
[Classification number]: TP393.092; TP391.3


Article link: http://sikaile.net/guanlilunwen/ydhl/2381182.html
