Optimized Text Matching Algorithms Supporting Regular Expressions

Published: 2018-03-17 07:00

Topic: regular expressions　Focus: filtering　Source: Northeastern University, 2012 master's thesis　Document type: degree thesis

[Abstract]: Regular expressions can describe complex queries, capturing the shared features of a class of texts through a specific syntax. Because of their expressive power and concise syntax, they are widely used in programming languages and related fields, which makes efficient text matching with regular-expression support especially important. Many search engines currently support regular-expression text matching, but nearly all regular-expression matching algorithms rely on automata theory: the regular expression is first converted into a deterministic or nondeterministic finite automaton, which is then used to search the text for matches. Because the regular expression must be processed and the automaton built online, nearly all such matching algorithms are online.

By match type, text matching with regular expressions divides into global matching and local matching. Global matching decides whether a whole string belongs to the language described by the regular expression. Local matching decides whether any substring of a string belongs to that language, and it must also return the positions of the matching substrings. This thesis designs one algorithm for each match type. Both algorithms support offline processing, and a corresponding index structure is designed for each.

For global matching, the thesis defines the concept of a prefix-suffix, proposes a filtering strategy based on it, and selects the trie, an index structure suited to this strategy, to index the string collection. It further proposes a method for simplifying regular expressions. First, the query's regular expression is simplified and its filter factor, the prefix-suffix set, is computed. Then the collection of strings to be searched is filtered to obtain a candidate set. Finally, the candidate set is verified with a conventional finite automaton.

For local matching, the thesis designs an index structure based on the BWT (Burrows-Wheeler transform) and defines evaluation rules for the different regular-expression operators, turning regular-expression text matching into operations on lists of positions. First, the index is used to obtain, for each fragment of the regular expression, the list of positions at which it occurs in the text. Then these position lists are combined according to the defined operator rules; the resulting list gives the positions of the strings in the text that the regular expression matches locally. Finally, extensive experiments show that the two proposed algorithms achieve high query efficiency.
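The filter-then-verify flow described for global matching can be sketched as follows. This is a minimal illustration in Python, not the thesis's implementation: the `Trie` class, the `global_match` function, the sample strings, and the hand-supplied prefix and suffix sets are all assumptions made for the example. The thesis derives its prefix-suffix set from the simplified regular expression, a step not shown here, and Python's `re` module stands in for the finite-automaton verification step.

```python
import re


class Trie:
    """Minimal prefix tree over a collection of strings (illustrative only)."""

    def __init__(self):
        self.children = {}
        self.terminal = False  # True if a stored string ends at this node

    def insert(self, word):
        node = self
        for ch in word:
            node = node.children.setdefault(ch, Trie())
        node.terminal = True

    def with_prefix(self, prefix):
        """Return every stored string that starts with `prefix`."""
        node = self
        for ch in prefix:
            node = node.children.get(ch)
            if node is None:
                return []
        found = []
        stack = [(node, prefix)]
        while stack:
            cur, text = stack.pop()
            if cur.terminal:
                found.append(text)
            for ch, child in cur.children.items():
                stack.append((child, text + ch))
        return found


def global_match(trie, pattern, prefixes, suffixes):
    """Filter by required prefix and suffix, then verify the survivors.

    `prefixes` and `suffixes` are supplied by hand here (an assumption);
    they play the role of the filter factor described in the abstract.
    """
    candidates = set()
    for p in prefixes:
        for s in trie.with_prefix(p):
            if any(s.endswith(suf) for suf in suffixes):
                candidates.add(s)
    compiled = re.compile(pattern)
    # Verification: the whole string must belong to the pattern's language.
    return sorted(s for s in candidates if compiled.fullmatch(s))


# Hypothetical string collection, indexed offline before any query arrives.
strings = ["ab12z", "ac7z", "abz", "abxz", "ad3z", "ab12", "zzz"]
index = Trie()
for s in strings:
    index.insert(s)

matches = global_match(index, r"(ab|ac)[0-9]*z", ["ab", "ac"], ["z"])
print(matches)  # ['ab12z', 'abz', 'ac7z']
```

For the pattern `(ab|ac)[0-9]*z`, every match must start with `ab` or `ac` and end with `z`, so strings such as `ad3z`, `ab12`, and `zzz` are discarded before the regex engine runs. Of the four candidates that survive filtering, only `abxz` fails verification.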
[Degree-granting institution]: Northeastern University
[Degree level]: Master's
[Year degree granted]: 2012
[Classification number]: TP391.1


Article ID: 1623640


Article link: http://sikaile.net/kejilunwen/sousuoyinqinglunwen/1623640.html
