Cplusplus: The sregex_token_iterator Class

Introduction
It is described as follows:

std::regex_token_iterator is a read-only ForwardIterator that accesses the individual sub-matches of every match of a regular expression within the underlying character sequence. It can also be used to access the parts of the sequence that were not matched by the given regular expression (e.g. as a tokenizer).

With this class it is possible to capture every match inside a string. Normally a regex search stops once it finds a match. If the string still contains further matches, the iterator lets us walk through all of them.

Relationship to regex_iterator
The class uses regex_iterator under the hood. It is described as follows:

A typical implementation of std::regex_token_iterator holds the underlying std::regex_iterator, a container (e.g. std::vector) of the requested submatch indexes, the internal counter equal to the index of the submatch, a pointer to std::sub_match, pointing at the current submatch of the current match, and a std::match_results object containing the last non-matched character sequence (used in tokenizer mode).
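To make the relationship concrete, here is a minimal sketch that collects the same data with both iterators. The helper names, the key=value pattern, and the sample input are my own assumptions for illustration, not part of the quoted description.

```cpp
#include <regex>
#include <string>
#include <vector>

// Hypothetical helper: collect capture group 1 of every match using
// std::sregex_iterator, which yields a full std::smatch per match.
std::vector<std::string> keys_with_regex_iterator(const std::string& text) {
    static const std::regex re(R"((\w+)=(\w+))");
    std::vector<std::string> keys;
    for (std::sregex_iterator it(text.begin(), text.end(), re), end; it != end; ++it) {
        keys.push_back((*it)[1].str());  // pick sub-match 1 by hand
    }
    return keys;
}

// Same result with std::sregex_token_iterator: the sub-match index is passed
// once to the constructor and each dereference yields a std::ssub_match.
std::vector<std::string> keys_with_token_iterator(const std::string& text) {
    static const std::regex re(R"((\w+)=(\w+))");
    std::vector<std::string> keys;
    for (std::sregex_token_iterator it(text.begin(), text.end(), re, 1), end; it != end; ++it) {
        keys.push_back(it->str());
    }
    return keys;
}
```

For the input "a=1 b=2 c=3" both helpers return a, b, c. The token iterator simply saves us from indexing into each match result ourselves.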

Constructor - iterator + iterator + regex
We do it like this.

const string str = "...";
regex re ("..");
sregex_token_iterator rit (str.begin(), str.end(), re);

For the end iterator, we do this.

sregex_token_iterator rend;

To fill a vector with all the matches, we do this.

std::vector<std::string> result;
std::string str = "...";
std::regex re ("[A-Za-z0-9]+");

std::copy(
    std::sregex_token_iterator(str.begin(), str.end(), re),
    std::sregex_token_iterator(),
    std::back_inserter(result));
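The same idea as a self-contained sketch with concrete input. The function name extract_words and the sample sentence are assumptions made for illustration.

```cpp
#include <algorithm>
#include <iterator>
#include <regex>
#include <string>
#include <vector>

// Hypothetical helper: returns every maximal run of ASCII letters and digits.
std::vector<std::string> extract_words(const std::string& text) {
    static const std::regex re("[A-Za-z0-9]+");
    std::vector<std::string> result;
    // With no sub-match argument the iterator yields the whole match (index 0).
    std::copy(std::sregex_token_iterator(text.begin(), text.end(), re),
              std::sregex_token_iterator(),
              std::back_inserter(result));
    return result;
}
```

Calling extract_words("Hello, world! 42 times.") gives Hello, world, 42, times. Note that the string must outlive the iterators, which is why the helper takes it by const reference instead of building it inline.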

Constructor - iterator + iterator + regex + Capture Group Number
It is described as follows:

A regex iterator helps to iterate over matched subsequences. However, sometimes you also want to process all the contents between matched expressions. [...] In addition, you can specify a list of integral values, which represent elements of a “tokenization”:

-1 means that you are interested in all the subsequences between matched regular expressions (token separators).

0 means that you are interested in all the matched regular expressions (token separators).

Any other value n means that you are interested in the matched nth subexpression inside the regular expressions.
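The three kinds of value listed above can be tried side by side with a small sketch. The helper name tokens, the pattern, and the sample text are assumptions chosen for illustration.

```cpp
#include <regex>
#include <string>
#include <vector>

// Hypothetical helper: run `re` over `text` and collect the tokens selected
// by `submatch` (-1, 0, or a capture group number).
std::vector<std::string> tokens(const std::string& text, const std::regex& re, int submatch) {
    std::vector<std::string> out;
    for (std::sregex_token_iterator it(text.begin(), text.end(), re, submatch), end;
         it != end; ++it) {
        out.push_back(it->str());
    }
    return out;
}
```

With the pattern (\w+)=(\d+) and the text "size: width=10, height=20 px":

-1 gives "size: ", ", ", " px" (the text between matches)
0 gives "width=10", "height=20" (the whole matches)
1 gives "width", "height" (first capture group)
2 gives "10", "20" (second capture group)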

Example
We do it like this. -1 selects the unmatched text between matches. 0 selects the whole match, not a capture group (capture groups start at 1).

string data = "...";
regex re("...");
sregex_token_iterator rit(data.begin(), data.end(), re, { -1, 0 });

The first dereference with the * operator gives us the unmatched text.

*rit++; //unmatched content (-1)

The second dereference gives us the whole match.

*rit++; //matched content (0)
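Here is a small sketch of the { -1, 0 } ordering with concrete data. The helper name split_keep_numbers and the digit pattern are assumptions made for illustration.

```cpp
#include <regex>
#include <string>
#include <vector>

// Hypothetical helper: splits `text` on runs of digits but keeps the digits,
// returning unmatched text and matches interleaved.
std::vector<std::string> split_keep_numbers(const std::string& text) {
    static const std::regex re("[0-9]+");
    std::vector<std::string> parts;
    // For every match the iterator first yields the text before it (-1),
    // then the match itself (0).
    std::sregex_token_iterator rit(text.begin(), text.end(), re, {-1, 0});
    const std::sregex_token_iterator rend;
    for (; rit != rend; ++rit) {
        parts.push_back(rit->str());
    }
    return parts;
}
```

For "abc123def45ghi" this returns abc, 123, def, 45, ghi. The trailing "ghi" is the leftover text after the last match, which is reported because -1 is in the list.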

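Example
With UTF-8 we do it like this. This time the regex contains only the delimiters, and with -1 we get the text that did not match. In other words, we use it as a kind of split. Each delimiter is written as a plain alternative, so it is matched as a fixed sequence of bytes.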

std::regex regex("，|。|！|？");
std::string src = "使用boost split失败了，不知道什么原因。有人可以告诉我吗？谢谢！";

std::sregex_token_iterator iterator(src.begin(), src.end(), regex, -1);
std::sregex_token_iterator end;

for ( ; iterator != end; ++iterator) {
  std::string res = *iterator;
  std::cout << res << std::endl;
}

The output is as follows.

results:
使用boost split失败了
不知道什么原因
有人可以告诉我吗
谢谢
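When using -1 as a split, empty tokens are worth knowing about. The sketch below uses a hypothetical split helper with a comma delimiter (both are my assumptions) to show what happens at the edges.

```cpp
#include <regex>
#include <string>
#include <vector>

// Hypothetical helper: split `text` on every match of `delimiter`.
std::vector<std::string> split(const std::string& text, const std::regex& delimiter) {
    std::vector<std::string> out;
    for (std::sregex_token_iterator it(text.begin(), text.end(), delimiter, -1), end;
         it != end; ++it) {
        out.push_back(it->str());
    }
    return out;
}
```

With the delimiter "," the input ",a,,b" yields an empty string, a, an empty string, b, because the text before each delimiter is reported even when it is empty. The input "a," yields only a, because an empty remainder after the last delimiter is not reported. This matches the output above, where the final ！ does not produce an empty last line.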

* operator
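It returns the current std::sub_match, which converts implicitly to a string, so the result is the same text that the str() method returns.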
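A short sketch of the different ways to read the current token. The helper name first_token_views_agree and the lowercase-word pattern are assumptions for illustration.

```cpp
#include <regex>
#include <string>

// Hypothetical helper: returns true when the different ways of reading the
// first token produce the same text, false when there is no match.
bool first_token_views_agree(const std::string& text) {
    static const std::regex re("[a-z]+");
    std::sregex_token_iterator it(text.begin(), text.end(), re);
    if (it == std::sregex_token_iterator()) {
        return false;  // no match at all
    }
    const std::ssub_match& sub = *it;               // operator* yields a sub_match
    std::string implicit = *it;                     // implicit conversion to std::string
    std::string explicit_str = it->str();           // explicit copy via str()
    std::string from_range(sub.first, sub.second);  // the underlying iterator pair
    return implicit == explicit_str && explicit_str == from_range;
}
```

The sub_match itself only stores a pair of iterators into the original string. A std::string copy is made only when we convert it or call str().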

str method
We do it like this.

while(rit != rend)
{
  rit->str();
  ++rit;
}


Friday, 7 April 2017
