支持正則表達(dá)式的文本匹配優(yōu)化算法
發(fā)布時(shí)間:2018-03-17 07:00
本文選題:正則表達(dá)式 切入點(diǎn):過濾 出處:《東北大學(xué)》2012年碩士論文 論文類型:學(xué)位論文
【摘要】:正則表達(dá)式本身具備描述復(fù)雜查詢的能力,能夠通過特定的語法描述一類文本的共同特征。正則表達(dá)式因其強(qiáng)大的表達(dá)能力和簡(jiǎn)潔的語法,使得其計(jì)算機(jī)語言以及相關(guān)領(lǐng)域中的應(yīng)用十分廣泛。因此,支持高效的正則表達(dá)式的文本匹配技術(shù)也就顯得尤為重要。 目前,支持正則表達(dá)式的文本匹配的搜索引擎種類很多。但是,基本上所有的正則表達(dá)式匹配算法,都采用了自動(dòng)機(jī)理論。也就是說,正則表達(dá)式通常先轉(zhuǎn)換成確定(或非確定)有限狀態(tài)自動(dòng)機(jī),然后利用自動(dòng)機(jī)在文本中進(jìn)行搜索匹配。由于需要在線地處理正則表達(dá)式并構(gòu)建自動(dòng)機(jī),因此,基本上支持正則表達(dá)式的文本匹配算法都是在線的。按照匹配類型不同,支持正則表達(dá)式的文本匹配可以分為正則表達(dá)式的全局匹配與正則表達(dá)式的局部匹配。全局匹配是判斷字符串是否屬于正則表達(dá)式表達(dá)語言。而局部匹配在判斷字符串中的任何子串是否屬于正則表達(dá)式所表達(dá)的語言的同時(shí),并同時(shí)需要返回子串的位置信息。 根據(jù)匹配類型的不同,本文設(shè)計(jì)了正則表達(dá)式的局部匹配和全局匹配的算法。這兩個(gè)算法都支持離線的處理,并設(shè)計(jì)了相應(yīng)的索引結(jié)構(gòu)。對(duì)于正則表達(dá)式的全局匹配,定義了前-后綴的概念,提出了基于前-后綴的過濾策略,并同時(shí)選擇了適用于這種過濾策略的索引結(jié)構(gòu)trie樹對(duì)字符串集合構(gòu)建了索引。進(jìn)而,提出了對(duì)正則表達(dá)表達(dá)式進(jìn)行化簡(jiǎn)方法。首先對(duì)查詢的字符串進(jìn)行化簡(jiǎn),通過計(jì)算得到其過濾因子:前-后綴集合。然后對(duì)待檢索的字符串集合進(jìn)行過濾,得到候選字符串集合。最后采用傳統(tǒng)的有限自動(dòng)機(jī)進(jìn)行驗(yàn)證候選集合。而對(duì)于正則表達(dá)式的局部匹配,設(shè)計(jì)了基于BWT的索引結(jié)構(gòu),定義了正則表達(dá)式中不同操作符的運(yùn)算規(guī)則,將正則表達(dá)式的文本匹配轉(zhuǎn)換成位置信息列表的處理操作。首先,根據(jù)將正則表達(dá)式片段在索引中其在文本中出現(xiàn)的位置信息的列表;然后,根據(jù)定義的正則表達(dá)式中操作符的運(yùn)算規(guī)則對(duì)位置信息列表進(jìn)行相應(yīng)的操作,得到的位置信息的列表也就是正則表達(dá)式在文本中進(jìn)行局部匹配成功的字符串的位置。最后,進(jìn)行了大量的實(shí)驗(yàn)測(cè)試,結(jié)果表明本文提出的兩種支持正則表達(dá)式的文本匹配算法具備較高的查詢效率。
[Abstract]:Regular expressions are capable of describing complex queries and can describe common features of a class of text through specific syntax. Because of its wide application in computer language and related fields, it is very important to support efficient text matching technology for regular expressions. At present, there are many kinds of search engines that support regular expression text matching. However, almost all regular expression matching algorithms adopt automata theory. Regular expressions are usually converted to deterministic (or non-deterministic) finite-state automata, then used to search for matching in text. Because regular expressions need to be processed online and automata are constructed, Basically, text matching algorithms that support regular expressions are online, depending on the type of match, Text matching that supports regular expressions can be divided into global matching of regular expressions and local matching of regular expressions. Global matching is to determine whether a string belongs to a regular expression expression language, and local matching is used to judge whether a string belongs to a regular expression language. While any substrings in a string belong to the language of a regular expression, It also needs to return the position information of the substring. According to the different matching types, the local matching and global matching algorithms of regular expressions are designed in this paper. Both of these algorithms support off-line processing, and the corresponding index structure is designed. In this paper, the concept of front suffix is defined, and a filtering strategy based on front suffix is proposed. At the same time, trie tree, an index structure suitable for this filtering strategy, is selected to index the collection of strings. In this paper, a method of simplifying regular expression expression is proposed. Firstly, the query string is simplified, and its filter factor is obtained by calculating the pre-suffix set. Then, the search string set is filtered. The candidate string set is obtained. Finally, the candidate set is verified by the traditional finite automata. For the local matching of regular expressions, the index structure based on BWT is designed, and the operation rules of different operators in regular expressions are defined. Converts the text match of a regular expression into a list of location information. First, based on the list of location information that appears in the text of the regular expression fragment in the index; then, According to the operation rules of the operators in the defined regular expression, the list of position information is the position of the string that the regular expression matches successfully in the text. Finally, A large number of experiments have been carried out and the results show that the two text matching algorithms which support regular expressions have high query efficiency.
【學(xué)位授予單位】:東北大學(xué)
【學(xué)位級(jí)別】:碩士
【學(xué)位授予年份】:2012
【分類號(hào)】:TP391.1
【參考文獻(xiàn)】
相關(guān)期刊論文 前2條
1 張樹壯;羅浩;方濱興;云曉春;;一種面向網(wǎng)絡(luò)安全檢測(cè)的高性能正則表達(dá)式匹配算法[J];計(jì)算機(jī)學(xué)報(bào);2010年10期
2 姚遠(yuǎn);劉鵬;單征;田雙鵬;;面向存儲(chǔ)的正則表達(dá)式匹配算法綜述[J];計(jì)算機(jī)應(yīng)用;2009年12期
,本文編號(hào):1623640
本文鏈接:http://sikaile.net/kejilunwen/sousuoyinqinglunwen/1623640.html
最近更新
教材專著