Research on Methods for Extracting Mathematical Formulas from the Web
Published: 2019-01-10 19:05
[Abstract]: With the development of information technology, Web support for mathematical communication has become increasingly mature. Users who obtain and manage mathematical formulas on the Web need the support of a mathematical formula search engine. Mathematical formula search is one of the research topics of third-generation intelligent search engines, and a formula-oriented crawler is an extremely important part of it: the crawler's quality directly affects the functionality and performance of the search engine. This thesis focuses on formula-oriented crawling, mainly the recognition and extraction of mathematical formulas on the Web and the design of the system. Research on mathematical formula recognition has made considerable progress, but it cannot yet be applied to formula communication and search. This thesis therefore studies the recognition of user-programmable mathematical formulas, concentrating on formulas described in XML, LaTeX, and infix formats in Web documents, as well as formulas in Microsoft Office and OpenOffice documents. It summarizes and analyzes the forms in which these formula descriptions appear on the Web and their external pattern features, then recognizes and extracts them by pattern matching. On this basis, the mathematical crawler system MathCrawler was designed and implemented on top of the open-source software Nutch. MathCrawler has a sound system architecture and can crawl Internet documents that contain formula-related content and extract the mathematical formulas from them. Experiments show that the system performs well and extracts mathematical formulas fairly accurately.
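The pattern-matching idea described in the abstract can be illustrated with a minimal sketch. This is not MathCrawler's actual rule set, which the abstract does not give. The function name `extract_formulas` and both regular expressions are assumptions made for illustration, and the sketch covers only MathML (`<math>` elements) and delimited LaTeX. Single-dollar inline math is deliberately left out because it is easily confused with currency amounts.

```python
import re

# Hypothetical patterns for illustration; the thesis's real rules are not published here.
MATHML_RE = re.compile(r"<math\b[^>]*>.*?</math>", re.DOTALL | re.IGNORECASE)
LATEX_RE = re.compile(
    r"\$\$(.+?)\$\$"      # display math: $$ ... $$
    r"|\\\[(.+?)\\\]"     # display math: \[ ... \]
    r"|\\\((.+?)\\\)",    # inline math:  \( ... \)
    re.DOTALL,
)


def extract_formulas(html: str) -> list[tuple[str, str]]:
    """Return (format, formula) pairs found in an HTML string."""
    # MathML: keep the whole <math>...</math> element.
    found = [("mathml", m.group(0)) for m in MATHML_RE.finditer(html)]
    # LaTeX: keep only the text between the delimiters.
    for m in LATEX_RE.finditer(html):
        # Exactly one of the three alternative groups matched.
        body = next(g for g in m.groups() if g is not None)
        found.append(("latex", body.strip()))
    return found


if __name__ == "__main__":
    page = (
        '<p><math xmlns="http://www.w3.org/1998/Math/MathML"><mi>x</mi></math> '
        "and \\(a^2+b^2\\) and $$E=mc^2$$</p>"
    )
    for kind, formula in extract_formulas(page):
        print(kind, formula)
```

In a Nutch-based crawler, a routine of this kind would most plausibly run in a parsing step after each page is fetched. That placement is an assumption as well, since the abstract does not say where extraction happens.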
[Degree-granting institution]: Lanzhou University
[Degree level]: Master's
[Year degree awarded]: 2012
[Classification number]: TP393.09;TP391.3
Article ID: 2406685
Article link: http://sikaile.net/kejilunwen/sousuoyinqinglunwen/2406685.html