The STREGEX function performs regular expression matching against the strings contained in StringExpression. STREGEX can perform either a simple boolean True/False evaluation of whether a match occurred, or it can return the position and length within the strings for each match. The regular expressions accepted by this routine, which correspond to “Posix Extended Regular Expressions”, are similar to those used by such UNIX tools as egrep, lex, awk, and Perl.
For more information about regular expressions, see Learning About Regular Expressions.
STREGEX is based on the regex package written by Henry Spencer, modified by Exelis VIS only to the extent required to integrate it into IDL. This package is freely available at: www.arglist.com/regex
To match a string starting with an “a”, followed by a “b”, followed by 1 or more “c”:
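A minimal sketch of such a match, assuming the sample input 'aaabccc' (an illustrative string, not taken from the text above) and using the LENGTH keyword with STRMID to recover the matched text:

```idl
; Sample input is an assumption chosen for illustration.
str = 'aaabccc'
; 'abc+' matches one "a", one "b", then one or more "c".
pos = STREGEX(str, 'abc+', LENGTH=len)
; pos is the zero-based start of the match and len is its length.
PRINT, STRMID(str, pos, len)
```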
To perform the same match, and also find the locations of the three parts:
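One way to write this, assuming the illustrative input 'aaabccc': wrapping each part in parentheses and setting SUBEXPR makes STREGEX report the overall match followed by each subexpression.

```idl
str = 'aaabccc'   ; illustrative input (an assumption)
; Parentheses mark the three parts as subexpressions.
pos = STREGEX(str, '(a)(b)(c+)', LENGTH=len, /SUBEXPR)
; pos and len each hold 4 values: the overall match, then the 3 parts.
PRINT, STRMID(str, pos, len)
```

With this input, the PRINT statement produces: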
abccc a b ccc
Or more simply:
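A sketch of the shorter form, assuming the same illustrative input 'aaabccc', lets EXTRACT return the substrings directly instead of going through LENGTH and STRMID:

```idl
; EXTRACT returns the matched substrings themselves,
; so no separate STRMID call is needed.
PRINT, STREGEX('aaabccc', '(a)(b)(c+)', /SUBEXPR, /EXTRACT)
```

This produces: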
abccc a b ccc
By default, STREGEX returns the position of the matched string within each element of StringExpression. If no match is found, -1 is returned. Optionally, STREGEX can return a boolean True/False result of the match or the matched strings.
StringExpression is a string or string array in which to search for matches of RegularExpression.
RegularExpression is a scalar string containing the regular expression to match. See Learning About Regular Expressions for a description of the meta characters that can be used in a regular expression.
Normally, STREGEX returns the position of the first character in each element of StringExpression that matches RegularExpression. Setting BOOLEAN modifies this behavior to simply return a True/False value indicating whether a match occurred.
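The difference between the default result and the BOOLEAN result can be sketched as follows. The input strings and the pattern are illustrative assumptions.

```idl
; Illustrative inputs (assumptions, not from the text above).
str = ['abc', 'xyz']
; Default: position of the first match in each element, or -1 if none.
; Here the values are 1 and -1.
PRINT, STREGEX(str, 'b')
; With BOOLEAN: 1 where a match occurred, 0 otherwise.
; Here the values are 1 and 0.
PRINT, STREGEX(str, 'b', /BOOLEAN)
```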
Normally, STREGEX returns the position of the first character in each element of StringExpression that matches RegularExpression. Setting EXTRACT modifies this behavior to simply return the matched substrings. The EXTRACT keyword cannot be used with either BOOLEAN or LENGTH.
Regular expression matching is normally a case-sensitive operation. Set FOLD_CASE to perform case-insensitive matching instead.
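A small sketch of the effect of FOLD_CASE, using an illustrative input and pattern that differ only in case:

```idl
; Case-sensitive by default: 'idl' does not match 'IDL', so the result is 0.
PRINT, STREGEX('IDL', 'idl', /BOOLEAN)
; With FOLD_CASE the same pattern matches, so the result is 1.
PRINT, STREGEX('IDL', 'idl', /BOOLEAN, /FOLD_CASE)
```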
Set the LENGTH keyword equal to a named variable that will contain the length of each matching string found. If no match is found in an element of StringExpression, the returned variable will contain -1 for that element. Together with the result of this function, which contains the starting points of the matches in StringExpression, LENGTH can be used with the STRMID function to extract the matched substrings. The LENGTH keyword cannot be used with either BOOLEAN or EXTRACT.
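The following sketch shows LENGTH for a scalar input, including the no-match case. The strings and the pattern '[0-9]+' are illustrative assumptions. A scalar is used deliberately, because passing position and length arrays to STRMID for an array input requires extra care with their dimensions.

```idl
; Illustrative input: a run of digits starts at position 3.
str = 'foo123bar'
pos = STREGEX(str, '[0-9]+', LENGTH=len)
; Here pos is 3 and len is 3; guard against -1 before extracting.
IF pos GE 0 THEN PRINT, STRMID(str, pos, len)

; No digits in this string, so both the result and LENGTH are -1.
pos = STREGEX('foobar', '[0-9]+', LENGTH=len)
PRINT, pos, len
```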
By default, STREGEX only reports the overall match. Setting SUBEXPR causes it to report the overall match as well as any subexpression matches. A subexpression is any part of a regular expression written within parentheses. For example, the regular expression ‘(a)(b)(c+)’ has 3 subexpressions, whereas the functionally equivalent 'abc+' has none. The SUBEXPR keyword cannot be used with BOOLEAN.
If a subexpression participated in the match several times, the reported substring is the last one it matched. Note in particular that when the regular expression ‘(b*)+’ matches ‘bbb’, the parenthesized subexpression matches the three ‘b’s and then an infinite number of empty strings following the last ‘b’, so the reported substring is one of those empty strings. This occurs because the ‘*’ matches zero or more instances of the item that precedes it.
In order to return multiple positions and lengths for each input, the result from SUBEXPR has a new first dimension added compared to StringExpression.
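The added dimension can be inspected with SIZE. This is a sketch with illustrative inputs; the variable names are assumptions.

```idl
; Two illustrative input strings.
str = ['aaabccc', 'xabcx']
pos = STREGEX(str, '(a)(b)(c+)', /SUBEXPR)
; 3 subexpressions plus the overall match give 4 entries per string,
; so a 2-element input yields a 4 x 2 result.
PRINT, SIZE(pos, /DIMENSIONS)
; pos[*, i] holds the overall and subexpression positions for str[i].
PRINT, pos[*, 1]
```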
This example searches a string array for words of any length beginning with “f” and ending with “t” without the letter “o” in between:
str = ['foot', 'Feet', 'fate', 'FAST', 'ferret', 'affluent']
PRINT, STREGEX(str, '^f[^o]*t$', /EXTRACT, /FOLD_CASE)
This statement results in:
Feet FAST ferret
Note the following about this example:
- Unlike the * wildcard character used by STRMATCH, the * meta character used by STREGEX applies to the item directly on its left, which in this case is [^o], meaning “any character except the letter ‘o’ ”. Therefore, [^o]* means “zero or more characters that are not ‘o’ ”, whereas [^o] on its own, without the *, would constrain only the single character in that position (here, the second character of the word).
- The anchors (^ and $) tell STREGEX to find only words that begin with “f” and end with “t”. If we left out the ^ anchor in the above example, STREGEX would also return “ffluent” (a substring of “affluent”). Similarly, if we left out the $ anchor, STREGEX would also return “fat” (a substring of “fate”).
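The two points above can be checked with a short sketch. The array words and the pattern '^f[^o]t$' are illustrative assumptions, not the statements from the original example.

```idl
; Illustrative inputs (assumptions).
words = ['fat', 'fast', 'fot']
; With *: any run (possibly empty) of non-"o" characters between f and t.
; Values are 1, 1, 0.
PRINT, STREGEX(words, '^f[^o]*t$', /BOOLEAN)
; Without *: exactly one non-"o" character between f and t.
; Values are 1, 0, 0.
PRINT, STREGEX(words, '^f[^o]t$', /BOOLEAN)

; Dropping the ^ anchor lets the match start inside the word: ffluent
PRINT, STREGEX('affluent', 'f[^o]*t$', /EXTRACT)
; Dropping the $ anchor lets the match end inside the word: fat
PRINT, STREGEX('fate', '^f[^o]*t', /EXTRACT)
```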