Web Page Main-Text Extraction Based on Multi-Feature Fusion and Bilingual Website Detection
Published: 2019-04-10 12:57
[Abstract]: With the rapid development of the Internet, the volume of online information is growing exponentially, while its quality is increasingly uneven. Obtaining information accurately, quickly, and comprehensively is becoming ever more difficult, so strong information extraction capabilities are drawing growing attention, and the massive accumulation of information presents both new opportunities and new challenges for information extraction technology. Meanwhile, with the rapid progress of natural language processing, machine translation has become increasingly practical in everyday life: products such as Youdao Translation, Google Translate, and Baidu Translate have become important tools for non-professionals who study or work with foreign-language material.
Bilingual corpora are the foundation of machine translation: they are essential data for training, testing, and analyzing machine translation models. The quantity and quality of a bilingual corpus directly affect the trained model parameters and, to a large extent, the performance of the resulting machine translation products. Building a large, high-quality bilingual corpus therefore has great practical value and academic significance for machine translation and natural language processing.
This thesis designs and implements a high-performance, efficient bilingual text extraction system. (The system is a subsystem of an Internet bilingual corpus harvesting system and does not include the crawler or sentence alignment.) The research covers two topics: web page main-text extraction and bilingual web page detection.
Main text is extracted with a multi-feature fusion technique. Unlike traditional approaches that build a DOM tree, this thesis processes pages with a linearized reconstruction method based on container tags, so that algorithms that would otherwise require tree operations reduce to processing a linear list. Several features, including length, word segmentation results, and sentence count, are combined to identify the main thread of the text, and the main text is then obtained by clustering based on information gain. For bilingual web page detection, the thesis judges whether the bilingual text found in the main text is mutually translated by computing a mutual translation rate based on local sentence anchor search. On top of this, the thesis adds an auxiliary main-text judgment algorithm based on features such as named-entity overlap and pronoun ratio, as well as an algorithm that automatically generates templates from large numbers of pages on the same website, to improve accuracy.
According to the thesis, the main-text extraction and bilingual web page detection system reaches the top level in its field at the time of writing. The system, together with its downstream processing, produced a 30-million-scale Chinese-English bilingual corpus, which passed rigorous testing by the Software Evaluation Center of the Heilongjiang Electronic Information Products Supervision and Inspection Institute with an accuracy above 95%. The experimental results also confirm the effectiveness of the proposed multi-feature fusion method for bilingual corpus mining.
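The container-tag linearization and multi-feature scoring described in the abstract can be illustrated with a minimal Python sketch. It is not the thesis implementation: the tag set `CONTAINER_TAGS`, the feature weights, the `threshold` parameter, and the regex tokenizer (a stand-in for a real word segmenter) are all assumptions made for illustration, and a simple score threshold replaces the information-gain clustering, which the abstract does not specify.

```python
import re
from html.parser import HTMLParser

# Assumed container tags; the abstract does not list the tags actually used.
CONTAINER_TAGS = {"div", "p", "td", "li", "article", "section"}
# Tags whose content is never visible page text.
SKIP_TAGS = {"script", "style"}


class Linearizer(HTMLParser):
    """Flatten a page into a linear list of text blocks, cut at container tags."""

    def __init__(self):
        super().__init__()
        self.blocks = []  # finished text blocks, in document order
        self._buf = []    # text pieces of the block being built
        self._skip = 0    # nesting depth inside script/style

    def flush(self):
        # Close the current block, normalizing whitespace; drop empty blocks.
        text = " ".join("".join(self._buf).split())
        if text:
            self.blocks.append(text)
        self._buf = []

    def handle_starttag(self, tag, attrs):
        if tag in SKIP_TAGS:
            self._skip += 1
        elif tag in CONTAINER_TAGS:
            self.flush()

    def handle_endtag(self, tag):
        if tag in SKIP_TAGS:
            self._skip = max(0, self._skip - 1)
        elif tag in CONTAINER_TAGS:
            self.flush()

    def handle_data(self, data):
        if not self._skip:
            self._buf.append(data)


def linearize(html):
    """Return the page's text blocks as a flat list (no DOM tree is built)."""
    parser = Linearizer()
    parser.feed(html)
    parser.close()
    parser.flush()  # keep any trailing text after the last container tag
    return parser.blocks


def features(text):
    """Length, token count, and sentence count of one block."""
    tokens = re.findall(r"\w+", text)  # stand-in for a real word segmenter
    sentences = [s for s in re.split(r"[.!?。！？]+", text) if s.strip()]
    return {"length": len(text), "tokens": len(tokens), "sentences": len(sentences)}


def score(f):
    # Hypothetical weights, chosen only for this illustration.
    return 0.01 * f["length"] + 0.05 * f["tokens"] + 1.0 * f["sentences"]


def extract_main_text(html, threshold=2.0):
    """Keep blocks whose fused score passes a threshold (not the thesis's clustering)."""
    return [b for b in linearize(html) if score(features(b)) >= threshold]


SAMPLE_HTML = """<html><head><style>p{color:red}</style></head><body>
<div id="nav"><a href="/">Home</a> <a href="/news">News</a></div>
<div><p>Bilingual corpora are the basis of machine translation. They are used to train and test models. Quality matters as much as size.</p></div>
<div>Copyright 2014</div>
</body></html>"""

if __name__ == "__main__":
    for block in extract_main_text(SAMPLE_HTML):
        print(block)
```

Because the page is reduced to a flat list, each later step (feature extraction, scoring, grouping) is a single pass over `blocks` rather than a tree traversal, which is the simplification the abstract attributes to the linearized representation.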
[Degree-granting institution]: Harbin Institute of Technology
[Degree level]: Master's
[Year degree awarded]: 2014
[Classification number]: TP393.092
Document ID: 2455815
Permalink: http://sikaile.net/guanlilunwen/ydhl/2455815.html