The re2 package provides pattern matching, extraction, replacement and other string-processing operations using Google’s RE2 (C++) regular-expression library. The interface is consistent and similar to stringr.
Why re2?
Regular expression matching can be done in two ways: using recursive backtracking or using finite automata-based techniques.
Perl, PCRE, Python, Ruby, Java, and many other languages rely on recursive backtracking for their regular expression implementations. The problem with this approach is that performance can degrade very quickly: in the worst case, matching time grows exponentially with the length of the input. In contrast, re2 uses finite automata-based techniques for regular expression matching, guaranteeing execution time that is linear in the length of the input and a fixed stack footprint. See links to Russ Cox’s excellent articles below.
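A small sketch of the difference, assuming the re2 package is installed. The pattern and input are illustrative choices, not taken from the package documentation: nested quantifiers such as (a+)+ are a classic trigger for exponential behaviour in backtracking engines when the overall match fails, whereas an automata-based engine processes the same input in time proportional to its length.

```r
library(re2)

# A run of 30 "a" characters followed by "!", so the anchored
# pattern below cannot match the whole string.
x <- paste0(strrep("a", 30), "!")

# Nested quantifiers: a backtracking engine may try a very large number
# of ways to split the run of "a"s before giving up.
pattern <- "^(a+)+$"

# re2 reports the failed match without exploring those alternatives.
no_match <- re2_detect(x, pattern)
no_match
#> [1] FALSE
```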
re2 provides three types of regular-expression functions:
All functions take a vector of strings as argument. Regular-expression patterns can be compiled, and reused for performance.
Here are the primary verbs of re2:
re2_detect(x, pattern) finds if a pattern is present in string
re2_count(x, pattern) counts the number of matches in string
re2_subset(x, pattern) selects strings that match
re2_match(x, pattern, simplify = FALSE) extracts first matched substring
re2_match("ruby:1234 68 red:92 blue:", "(\\w+):(\\d+)")
#> .0 .1 .2
#> [1,] "ruby:1234" "ruby" "1234"
# Groups can be named:
re2_match(c("barbazbla", "foobar"), "(foo)|(?P<TestGroup>bar)baz")
#> .0 .1 TestGroup
#> [1,] "barbaz" NA "bar"
#> [2,] "foo" "foo" NA
# Use pre-compiled regular expression:
re <- re2_regexp("(foo)|(bar)baz", case_sensitive = FALSE)
re2_match(c("BaRbazbla", "Foobar"), re)
#> .0 .1 .2
#> [1,] "BaRbaz" NA "BaR"
#> [2,] "Foo" "Foo" NA
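The first three verbs, re2_detect(), re2_count() and re2_subset(), can be tried in the same way. The following is a minimal sketch, assuming the re2 package is installed; the input vector `colors` is made up for illustration.

```r
library(re2)

colors <- c("yellowgreen", "steelblue", "maroon")

# Does each string contain two consecutive "e" characters?
has_ee <- re2_detect(colors, "ee")
has_ee
#> [1]  TRUE  TRUE FALSE

# How many times does "e" occur in each string?
n_e <- re2_count(colors, "e")
n_e
#> [1] 3 3 0

# Keep only the strings that contain "ee".
with_ee <- re2_subset(colors, "ee")
with_ee
#> [1] "yellowgreen" "steelblue"
```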
re2_match_all(x, pattern) extracts all matched substrings
re2_match_all("ruby:1234 68 red:92 blue:", "(\\w+):(\\d+)")
#> [[1]]
#> .0 .1 .2
#> [1,] "ruby:1234" "ruby" "1234"
#> [2,] "red:92" "red" "92"
re2_replace(x, pattern, rewrite) replaces first matched pattern in string
# Use groups in rewrite:
re2_replace("bunny@wunnies.pl", "(.*)@([^.]*)", "\\2!\\1")
#> [1] "wunnies!bunny.pl"
re2_replace_all(x, pattern, rewrite) replaces all matched patterns in string
re2_replace_all("yabba dabba doo", "b+", "d")
#> [1] "yada dada doo"
# Multiple replacements
re2_replace_all(c("one", "two"), c("one" = "1", "1" = "2", "two" = "2"))
#> [1] "2" "2"
re2_extract_replace(x, pattern, rewrite) extracts and substitutes (ignores non-matching portions of x)
re2_split(x, pattern, simplify = FALSE, n = Inf) splits string based on pattern
re2_split("How vexingly quick daft zebras jump!", " quick | zebras")
#> [[1]]
#> [1] "How vexingly" "daft" " jump!"
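To see how re2_extract_replace() differs from re2_replace(), it helps to run both on the same input. This sketch reuses the e-mail example from the re2_replace() entry and assumes the re2 package is installed; the variable names are illustrative.

```r
library(re2)

email <- "bunny@wunnies.pl"

# re2_replace() rewrites the match and keeps the unmatched tail (".pl").
replaced <- re2_replace(email, "(.*)@([^.]*)", "\\2!\\1")
replaced
#> [1] "wunnies!bunny.pl"

# re2_extract_replace() returns only the rewritten match and drops
# the portions of the input that the pattern did not match.
extracted <- re2_extract_replace(email, "(.*)@([^.]*)", "\\2!\\1")
extracted
#> [1] "wunnies!bunny"
```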
re2_locate(x, pattern) seeks the start and end of pattern in string
re2_locate_all(x, pattern) locates start and end of all occurrences of pattern in string
re2_locate_all(c("yellowgreen", "steelblue"), "l")
#> [[1]]
#> begin end
#> [1,] 3 3
#> [2,] 4 4
#>
#> [[2]]
#> begin end
#> [1,] 5 5
#> [2,] 7 7
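For comparison, re2_locate() reports only the first occurrence in each string. The sketch below assumes the re2 package is installed and that re2_locate() returns one row per input string, with begin and end columns laid out like the elements shown for re2_locate_all(); that layout is an assumption here, so the columns are accessed by position.

```r
library(re2)

colors <- c("yellowgreen", "steelblue")

# Position of the first "l" in each string.
first_l <- re2_locate(colors, "l")
first_l

# The positions can be passed to substring() to recover the matched text.
matched <- substring(colors, first_l[, 1], first_l[, 2])
matched
```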
In all the above functions, the regular-expression pattern is vectorized.
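As an illustration of a vectorized pattern, the sketch below passes one pattern per input string. It assumes the re2 package is installed and that strings and patterns of equal length are paired element by element; the inputs are made up for this example.

```r
library(re2)

x <- c("apple", "banana")
patterns <- c("e$", "^b")

# "e$" is applied to "apple" and "^b" is applied to "banana".
paired <- re2_detect(x, patterns)
paired
```

With this pairing both elements should match, whereas applying either single pattern to both strings would match only one of them.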
A regular-expression pattern can be compiled using re2_regexp(pattern, ...). Here are some of the options:
case_sensitive: Match is case-sensitive
encoding: UTF8 or Latin1
literal: Interpret pattern as literal, not regexp
longest_match: Search for longest match, not first match
posix_syntax: Restrict regexps to POSIX egrep syntax
help(re2_regexp) lists available options.
re2_get_options(regexp_ptr) returns a list of options stored in the compiled regular-expression object.
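A short sketch of compiling patterns with options and reusing them, assuming the re2 package is installed. The option names come from the list above; the patterns, inputs and variable names are illustrative.

```r
library(re2)

# Compile once with case-insensitive matching, then reuse the object.
re_ci <- re2_regexp("hello", case_sensitive = FALSE)
ci_hits <- re2_detect(c("Hello world", "goodbye"), re_ci)
ci_hits
#> [1]  TRUE FALSE

# literal = TRUE treats metacharacters such as "." and "+" as plain text,
# so only the exact character sequence "a.b+" matches.
re_lit <- re2_regexp("a.b+", literal = TRUE)
lit_hits <- re2_detect(c("a.b+", "axbb"), re_lit)
lit_hits
#> [1]  TRUE FALSE

# Inspect the options stored in the compiled pattern.
opts <- re2_get_options(re_ci)
str(opts)
```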
re2 supports Perl-style regular expressions (with extensions like \d, \w, \s, …) and provides most of the functionality of PCRE, eschewing only backreferences and look-around assertions.
See RE2 Syntax for the syntax supported by RE2, and a comparison with PCRE and Perl regexps.
For those not familiar with Perl’s regular expressions, here are some examples of the most commonly used extensions:
"hello (\\w+) world" | \w matches a “word” character |
"version (\\d+)" | \d matches a digit |
"hello\\s+world" | \s matches any whitespace character |
"\\b(\\w+)\\b" | \b matches the empty string at a word boundary |
"(?i)hello" | (?i) turns on case-insensitive matching |
"/\\*(.*?)\\*/" | .*? matches . the minimum number of times possible |
The double backslashes are needed when writing R string literals. However, they should not be used when writing raw string literals (available since R 4.0.0):
r"(hello (\w+) world)" | \w matches a “word” character |
r"(version (\d+))" | \d matches a digit |
r"(hello\s+world)" | \s matches any whitespace character |
r"(\b(\w+)\b)" | \b matches the empty string at a word boundary |
r"((?i)hello)" | (?i) turns on case-insensitive matching |
r"(/\*(.*?)\*/)" | .*? matches . the minimum number of times possible |
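The two spellings describe the same pattern text, which can be checked directly. This is a minimal sketch, assuming R 4.0.0 or later (for raw string literals) and an installed re2 package; the input string is made up for illustration.

```r
library(re2)

# An escaped string literal and a raw string literal for the same pattern.
escaped <- "version (\\d+)"
raw <- r"(version (\d+))"
identical(escaped, raw)
#> [1] TRUE

# Either form can be passed as the pattern; the second column of the
# result holds the first capture group.
m <- re2_match("version 42 released", raw)
m[1, 2]
```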