
The GRETA Regular Expression Template Library


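This article summarizes and translates material from several other articles. It briefly introduces regular expression libraries such as ATL CAtlRegExp, GRETA, and Boost::regex, which make the power of regular expressions easy to use in everyday work.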

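Regular expression syntax

Metacharacter    Meaning

.    Matches any single character.

[ ]    Specifies a character class and matches any character inside the brackets. Example: [abc] matches "a", "b", or "c".

^    At the start of a character class, it negates the class: the negated class matches any character except those inside the brackets. Example: [^abc] matches any character other than "a", "b", and "c". At the start of the regular expression, it matches the beginning of the input. Example: ^[abc] matches input that begins with "a", "b", or "c".

-    Inside a character class, specifies a range of characters. Example: [0-9] matches the digits "0" through "9".

?    Marks the preceding expression as optional: it matches once or not at all. Example: [0-9][0-9]? matches "2" or "12".

+    Indicates that the preceding expression matches one or more times. Example: [0-9]+ matches "1", "13", "666", and so on.

*    Indicates that the preceding expression matches zero or more times.

??, +?, *?    Non-greedy versions of ?, +, and *. They match as few characters as possible, whereas ?, + and * are greedy and match as many characters as possible. Example: given the input "<abc><def>", <.*?> matches "<abc>", while <.*> matches "<abc><def>".

( )    Grouping operator. Example: (\d+,)*\d+ matches a list of numbers separated by commas, such as "1" or "1,23,456".

{ }    Marks a match group (a captured sub-match), as used in the back-reference example below.

\    Escape character: escapes the character that follows it. Example: [0-9]+ matches one or more digits, whereas [0-9]\+ matches a digit followed by a plus sign. The backslash also introduces abbreviations; \a, for instance, stands for any alphanumeric character. If \ is followed by a number n, it matches the n-th match group (counting from 0). Example: <{.*?}>.*?</\0> matches "<head>Contents</head>". Note that in C++ string literals a backslash must be written as a double backslash: "\\+", "\\a", "<{.*?}>.*?</\\0>".

$    At the end of the regular expression, matches the end of the input. Example: [0-9]$ matches a digit at the very end of the input.

|    Alternation operator: separates two expressions, exactly one of which is matched. Example: (T|t)he matches "The" or "the".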
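The greedy and non-greedy quantifiers are easiest to understand by running them. The sketch below uses the C++11 standard library (std::regex, ECMAScript grammar) as a stand-in, because its quantifiers and parenthesised groups behave like the ones described above; it is not ATL code, and the helper name first_match is made up for this illustration.

```cpp
#include <regex>
#include <string>

// Returns the first substring of `input` matched by `pattern`,
// or an empty string when nothing matches.
// Note: this uses std::regex, not CAtlRegExp.
std::string first_match(const std::string& input, const std::string& pattern)
{
    const std::regex re(pattern);   // ECMAScript grammar by default
    std::smatch m;
    return std::regex_search(input, m, re) ? m.str(0) : std::string();
}
```

Calling first_match("<abc><def>", "<.*?>") should give "<abc>", while the greedy pattern "<.*>" should consume the whole input.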

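Abbreviations

Abbreviation    Matches

\a    Any alphanumeric character: ([a-zA-Z0-9])

\b    White space (blank): ([ \\t])

\c    Any alphabetic character: ([a-zA-Z])

\d    Any decimal digit: ([0-9])

\h    Any hexadecimal digit: ([0-9a-fA-F])

\n    Newline: (\r|(\r?\n))

\q    A quoted string: (\"[^\"]*\")|(\'[^\']*\')

\w    A simple word: ([a-zA-Z]+)

\z    An integer: ([0-9]+)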
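These abbreviations are specific to this syntax. In other engines some of the same escapes mean something different: in the ECMAScript grammar used by std::regex, for example, \w also accepts digits and the underscore, and \b is a word boundary. A minimal sketch, assuming std::regex as the engine, writes the expansions from the table out explicitly so the behaviour does not depend on the abbreviation. The constant names and the helper matches_fully are invented for this example.

```cpp
#include <regex>
#include <string>

// Explicit expansions of a few abbreviations, copied from the table above.
// The constant names are hypothetical, chosen only for this sketch.
const char* const kAtlAlnum   = "[a-zA-Z0-9]";   // expansion of \a
const char* const kAtlHex     = "[0-9a-fA-F]";   // expansion of \h
const char* const kAtlWord    = "[a-zA-Z]+";     // expansion of \w (letters only)
const char* const kAtlInteger = "[0-9]+";        // expansion of \z

// True when the whole of `text` matches `pattern` (std::regex, ECMAScript).
bool matches_fully(const std::string& text, const std::string& pattern)
{
    return std::regex_match(text, std::regex(pattern));
}
```

With these definitions, "hello_1" is rejected by kAtlWord but accepted by the ECMAScript pattern "\\w+", which is the kind of difference to watch for when porting patterns between libraries.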

ATL CAtlRegExp

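ATL Server often needs to decode complex text fields such as addresses and commands. Regular expressions are a powerful text-parsing tool, so ATL provides a regular expression parser.

Example: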

#include "stdafx.h"
#include <atlrx.h>

int main(int argc, char* argv[])
{
    CAtlRegExp<> reUrl;
    // Five match groups: scheme, authority, path, query, fragment
    REParseError status = reUrl.Parse(
        "({[^:/?#]+}:)?(//{[^/?#]*})?{[^?#]*}(?{[^#]*})?(#{.*})?" );
    if (REPARSE_ERROR_OK != status)
    {
        // Unexpected error.
        return 0;
    }

    CAtlREMatchContext<> mcUrl;
    if (!reUrl.Match(
        "http://search.microsoft.com/us/Search.asp?qu=atl&boolean=ALL#results",
        &mcUrl))
    {
        // Unexpected error.
        return 0;
    }

    // Print every match group found in the URL.
    for (UINT nGroupIndex = 0; nGroupIndex < mcUrl.m_uNumGroups;
         ++nGroupIndex)
    {
        const CAtlREMatchContext<>::RECHAR* szStart = 0;
        const CAtlREMatchContext<>::RECHAR* szEnd = 0;
        mcUrl.GetMatch(nGroupIndex, &szStart, &szEnd);
        ptrdiff_t nLength = szEnd - szStart;
        // "%.*s" expects an int precision, so cast the length explicitly.
        printf("%u: \"%.*s\"\n", nGroupIndex, (int)nLength, szStart);
    }
    return 0;
}

Output:

0: "http"
1: "search.microsoft.com"
2: "/us/Search.asp"
3: "qu=atl&boolean=ALL"
4: "results"

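Match() returns its results through the CAtlREMatchContext object pointed to by its second parameter, pContext. The match and everything related to it are stored in that object, so the results can be read through the methods and members of CAtlREMatchContext. The two that matter are the m_uNumGroups member and the GetMatch() method. m_uNumGroups is the number of match groups. GetMatch() takes a group index and returns the pStart and pEnd pointers that delimit the matched text; with those two pointers the caller can easily extract the result.

For more information, see: CAtlRegExp Class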
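For readers without ATL at hand, the same group-by-group extraction can be sketched with std::regex from the C++11 standard library. This is a stand-in, not the ATL API: std::smatch plays the role of CAtlREMatchContext, and the pattern is rewritten in ECMAScript syntax, where every pair of parentheses captures, so the five groups of interest end up at indices 2, 4, 5, 7 and 9. The function name split_url is made up for this example.

```cpp
#include <regex>
#include <string>
#include <vector>

// Splits a URL into scheme, authority, path, query and fragment.
// Returns an empty vector if the pattern does not match.
// Note: std::regex is used here in place of CAtlRegExp.
std::vector<std::string> split_url(const std::string& url)
{
    // Same structure as the ATL pattern, in ECMAScript syntax.
    static const std::regex re(
        R"(^(([^:/?#]+):)?(//([^/?#]*))?([^?#]*)(\?([^#]*))?(#(.*))?)");
    std::smatch m;
    std::vector<std::string> parts;
    if (std::regex_match(url, m, re)) {
        // Keep only the inner groups; the outer ones include the delimiters.
        for (int group : {2, 4, 5, 7, 9})
            parts.push_back(m.str(group));
    }
    return parts;
}
```

Applied to the URL from the ATL example, the returned vector should hold the same five strings that the ATL program prints.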

GRETA

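GRETA is a regular expression template library released by Microsoft Research. It provides C++ objects and functions that make pattern matching and substitution on strings easy. The main ones are:

rpattern: the pattern to search for

match_results / subst_results: containers that receive the results of a match or a substitution

To search or replace, first explicitly initialize an rpattern object with a string that describes the matching rule, then call one of its member functions, such as match() or substitute(), passing the string to be matched. If the call to match() or substitute() fails, the function returns false. If it succeeds, the function returns true and the match_results object holds the result. Here is an example: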

#include <iostream>
#include <string>
#include "regexpr2.h"
using namespace std;
using namespace regex;

int main()
{
    match_results results;
    string str( "The book cost $12.34" );
    // Match a dollar sign followed by one or more digits,
    // optionally followed by a period and two more digits.
    // The double-escapes are necessary to satisfy the compiler.
    rpattern pat( "\\$(\\d+)(\\.(\\d\\d))?" );
    match_results::backref_type br = pat.match( str, results );
    if( br.matched ) {
        cout << "match success!" << endl;
        cout << "price: " << br << endl;
    } else {
        cout << "match failed!" << endl;
    }
    return 0;
}

The program prints:

match success!
price: $12.34
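If GRETA is not available, the same match can be approximated with std::regex from the C++11 standard library. This is only a sketch of the equivalent idea, not GRETA's API: std::regex stands in for rpattern and std::smatch for match_results, and the function name find_price is invented here.

```cpp
#include <regex>
#include <string>

// Returns the first price such as "$12.34" found in `text`,
// or an empty string if there is none.
// Note: std::regex is used here in place of GRETA's rpattern.
std::string find_price(const std::string& text)
{
    // A dollar sign, one or more digits, optionally a period and two digits.
    // A raw string literal avoids the doubled backslashes.
    static const std::regex pat(R"(\$(\d+)(\.(\d\d))?)");
    std::smatch results;
    if (!std::regex_search(text, results, pat))
        return std::string();
    return results.str(0);   // group 0 is the whole match
}
```

For the input "The book cost $12.34" this should return "$12.34", matching the output of the GRETA program.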

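The GRETA documentation describes the rpattern object in detail and explains how to customize the search strategy for better performance.

Note: every declaration in the header regexpr2.h lives in the namespace regex. To use its objects and functions, either prefix them with "regex::" or write "using namespace regex;" beforehand. For simplicity, the sample code in the rest of this article omits the "regex::" prefix. The author has built greta.lib and regexpr2.h; these two files are all that is needed to parse regular expressions with GRETA.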

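A note on matching speed

Different regular expression engines are good at different kinds of patterns. As one benchmark, when the pattern "^([0-9]+)(\-| |$)(.*)$" is matched against the string "100- this is a line of ftp response which contains a message string", GRETA is roughly 7 times faster than the Boost (http://www.boost.org) regex library and more than 10 times faster than CAtlRegExp in ATL 7. The Boost.Regex documentation includes performance results for a large set of test patterns. Comparing against those results, I found that GRETA performs about the same as Boost.Regex in most cases, and comes out slightly ahead when compiled with Visual Studio .NET 2003.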
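Claims like these are easy to check on your own machine. The sketch below is a minimal timing harness, assuming std::regex as the engine under test; it does not reproduce the GRETA, Boost, or ATL measurements quoted above, and the names TimingResult and time_matches are hypothetical. The pattern is compiled once outside the timed loop so that only matching is measured.

```cpp
#include <chrono>
#include <regex>
#include <string>

// Result of one timing run (hypothetical type for this sketch).
struct TimingResult {
    long long microseconds;  // elapsed wall-clock time
    int hits;                // number of successful matches
};

// Matches `pattern` against `input` `iterations` times and reports how long
// the loop took. Counting the hits keeps the loop from being optimized away.
TimingResult time_matches(const std::string& pattern, const std::string& input,
                          int iterations)
{
    const std::regex re(pattern);   // compile once, outside the timed loop
    int hits = 0;
    const auto start = std::chrono::steady_clock::now();
    for (int i = 0; i < iterations; ++i) {
        std::smatch m;
        if (std::regex_match(input, m, re))
            ++hits;
    }
    const auto stop = std::chrono::steady_clock::now();
    const long long us =
        std::chrono::duration_cast<std::chrono::microseconds>(stop - start).count();
    return TimingResult{us, hits};
}
```

A call such as time_matches("^([0-9]+)(-| |$)(.*)$", line, 100000) on the FTP response line gives a baseline for one engine; the same loop can be wrapped around another library's match call to compare. Absolute numbers depend heavily on compiler, optimization level, and library version.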

Boost.Regex

Boost provides boost::basic_regex to support regular expressions. Its design closely resembles that of std::basic_string:

namespace boost{

template <class charT,
          class traits = regex_traits<charT>,
          class Allocator = std::allocator<charT> >
class basic_regex;

typedef basic_regex<char> regex;
typedef basic_regex<wchar_t> wregex;

}

The documentation that ships with the Boost.Regex library is extensive and its examples are excellent. Two of the sample programs, for instance, need only a small amount of code to convert a C++ file to syntax-highlighted HTML. The following example splits a string into tokens.

#include <iterator>   // std::back_inserter
#include <list>
#include <string>
#include <boost/regex.hpp>
unsigned tokenise(std::list<std::string>& l, std::string& s)
{
   return boost::regex_split(std::back_inserter(l), s);
}
#include <iostream>
using namespace std;
#if defined(BOOST_MSVC) || (defined(__BORLANDC__) && (__BORLANDC__ == 0x550))
// problem with std::getline under MSVC6sp3
istream& getline(istream& is, std::string& s)
{
   s.erase();
   char c = is.get();
   while(c != '\n')
   {
      s.append(1, c);
      c = is.get();
   }
   return is;
}
#endif
int main(int argc, char* [])
{
   string s;
   list<string> l;
   do{
      if(argc == 1)
      {
         cout << "Enter text to split (or \"quit\" to exit): ";
         getline(cin, s);
         if(s == "quit") break;
      }
      else
         s = "This is a string of tokens";
      unsigned result = tokenise(l, s);
      cout << result << " tokens found" << endl;
      cout << "The remaining text is: \"" << s << "\"" << endl;
      while(l.size())
      {
         s = *(l.begin());
         l.pop_front();
         cout << s << endl;
      }
   }while(argc == 1);
   return 0;
}
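boost::regex_split is deprecated in later Boost releases, so the same whitespace tokenization is sketched below with std::sregex_token_iterator from the C++11 standard library. This is an alternative technique rather than the Boost example's own API. Unlike the function above, this tokenise returns the tokens and leaves the input string unmodified.

```cpp
#include <list>
#include <regex>
#include <string>

// Splits `s` on runs of whitespace and returns the tokens in order.
// Note: uses std::regex token iteration instead of boost::regex_split.
std::list<std::string> tokenise(const std::string& s)
{
    static const std::regex separator(R"(\s+)");
    std::list<std::string> tokens;
    // Index -1 selects the text between separator matches.
    std::sregex_token_iterator it(s.begin(), s.end(), separator, -1), end;
    for (; it != end; ++it) {
        if (it->length() > 0)   // skip empty pieces, e.g. before leading whitespace
            tokens.push_back(it->str());
    }
    return tokens;
}
```

For the fixed input used above, "This is a string of tokens", the list should contain six tokens, from "This" to "tokens".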