This topic describes the syntax and semantics of the regular
expressions supported by G-Lock Email Processor.
A regular expression is a pattern that is matched against a
subject string from left to right. Most characters stand for themselves in a
pattern, and match the corresponding characters in the subject. As a trivial
example, the pattern
The quick brown fox
matches a portion of a subject string that is identical to
itself. The power of regular expressions comes from the ability to include
alternatives and repetitions in the pattern. These are encoded in the pattern by
the use of meta-characters, which do not stand for themselves but instead are
interpreted in some special way.
There are two different sets of meta-characters: those that
are recognized anywhere in the pattern except within square brackets, and those
that are recognized in square brackets.
Outside square brackets, the meta-characters are as follows:
\     (backslash) - general escape character with several uses
^     (circumflex) - asserts start of string (or line, in multiline mode)
$     (dollar) - asserts end of string (or line, in multiline mode)
.     (full stop (period, dot)) - matches any character except newline (by default)
[ ]   (square brackets) - start and end character class definition
|     (vertical bar) - start of alternative branch
( )   (round brackets) - start and end subpattern
?     - extends the meaning of (, also 0 or 1 quantifier, also quantifier minimizer
*     - 0 or more quantifier
+     - 1 or more quantifier, also "possessive quantifier"
{     - start min/max quantifier
Part of a pattern that is in square brackets is called a
"character class". In a character class the only meta-characters are:
\     general escape character
^     negates the class, but only if the first character
-     indicates character range
[     opens a character class
]     terminates the character class
Let's describe each of these meta-characters in detail.
Backslash (\)
The backslash character has several uses:
1) if it is followed by a non-alphanumeric character, it takes
away any special meaning that character may have. This use of backslash as an
escape character applies both inside and outside character classes.
For example, if you want to match a * character, you write \*
in the pattern. This escaping action applies whether or not the following
character would otherwise be interpreted as a meta-character, so it is always
safe to precede a non-alphanumeric character with backslash to specify that it
stands for itself. In particular, if you want to match a backslash, you write \\.
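The escaping rule can be tried out in a few lines. The sketch below uses Python's re module as a stand-in engine, which is an assumption: the engine inside G-Lock Email Processor is not shown in this topic, and the sample strings are made up for illustration.

```python
import re

# \* matches a literal asterisk; an unescaped * would be a quantifier.
star = re.search(r"\*", "2 * 3")

# \\ in the pattern matches one literal backslash in the subject.
backslash = re.search(r"\\", "C:\\Temp")

# Escaping a non-alphanumeric character that is not special is harmless.
at_sign = re.search(r"\@", "user@example.com")

print(star.group(), backslash.group(), at_sign.group())
```

Raw strings (the r prefix) keep Python itself from interpreting the backslashes before the regex engine sees them.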
2) the second use of backslash provides a way of encoding
non-printing characters in patterns in a visible manner. There is no restriction
on the appearance of non-printing characters, apart from the binary zero that
terminates a pattern, but when a pattern is being prepared by text editing, it
is usually easier to use one of the following escape sequences than the binary
character it represents:
\a         alarm, that is, the BEL character (hex 07)
\cx        "control-x", where x is any character
\e         escape (hex 1B)
\f         formfeed (hex 0C)
\n         newline (hex 0A)
\r         carriage return (hex 0D)
\t         tab (hex 09)
\ddd       character with octal code ddd, or backreference
\xhh       character with hex code hh
\x{hhh..}  character with hex code hhh... (UTF-8 mode only)
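A few of these escape sequences can be checked with a short sketch. It assumes Python's re module as a stand-in engine and sticks to sequences that Python also accepts (\t, \r, \n, \xhh and three-digit octal); the subject strings are hypothetical.

```python
import re

subject = "name\tvalue\r\n"

# \t and \x09 both denote the tab character (hex 09).
tab_by_name = re.search(r"\t", subject)
tab_by_hex = re.search(r"\x09", subject)

# \r\n matches a carriage return followed by a newline.
crlf = re.search(r"\r\n", subject)

# \101 is the octal code of "A" (hex 41).
octal_a = re.search(r"\101", "xAy")

print(tab_by_name.start(), tab_by_hex.start(), crlf.start(), octal_a.group())
```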
3) the third use of backslash is for specifying generic character types:
\d   any decimal digit
\D   any character that is not a decimal digit
\s   any whitespace character
\S   any character that is not a whitespace character
\w   any "word" character
\W   any "non-word" character
Each pair of escape sequences partitions the complete set
of characters into two disjoint sets. Any given character matches one, and only
one, of each pair.
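The generic character types and the partition property can be illustrated as follows. This is a sketch that assumes Python's re module behaves like the engine described here for these escapes; in Python, "word" characters include letters, digits and the underscore. The sample text is invented.

```python
import re

text = "Order 66, room_7!"

digits = re.findall(r"\d", text)       # decimal digits
word_runs = re.findall(r"\w+", text)   # runs of "word" characters
spaces = re.findall(r"\s", text)       # whitespace characters

# Each character matches exactly one of \d and \D, never both or neither.
partition_ok = all(
    bool(re.fullmatch(r"\d", ch)) != bool(re.fullmatch(r"\D", ch))
    for ch in text
)

print(digits, word_runs, len(spaces), partition_ok)
```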
4) the fourth use of backslash is for certain simple
assertions. An assertion specifies a condition that has to be met at a
particular point in a match, without consuming any characters from the subject
string. The backslashed assertions are:
\b   matches at a word boundary
\B   matches when not at a word boundary
\A   matches at start of subject
\Z   matches at end of subject or before newline at end
\z   matches at end of subject
\G   matches at first matching position in subject
These assertions may not appear in character classes (but
note that \b has a different meaning, namely the backspace character, inside a
character class).
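The word-boundary and start-of-subject assertions can be sketched like this, again assuming Python's re module as a stand-in engine. \Z, \z and \G are left out because Python's re does not implement all of them in the way described above. The sample text is hypothetical.

```python
import re

text = "cat concat cat"

# \b requires a word boundary on each side, so "concat" is skipped.
whole_words = [m.start() for m in re.finditer(r"\bcat\b", text)]

# \B requires that there is no boundary: only the "cat" inside "concat".
inside_words = [m.start() for m in re.finditer(r"\Bcat", text)]

# \A anchors at the very start of the subject.
at_start = re.search(r"\Acat", text)

# Inside a character class, \b is the backspace character (hex 08).
backspace = re.search(r"[\b]", "a\x08b")

print(whole_words, inside_words, at_start.start(), backspace.start())
```

None of these assertions consumes a character, which is why the reported positions are those of the "cat" text itself.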
Circumflex (^)
Outside a character class, in the default matching mode, the
circumflex character is an assertion which is true only if the current matching
point is at the start of the subject string.
Inside a character class, circumflex has an entirely
different meaning: it negates the class when it is the first character.
Circumflex need not be the first character of the pattern if
a number of alternatives are involved, but it should be the first thing in each
alternative in which it appears if the pattern is ever to match that branch. If
all possible alternatives start with a circumflex, that is, if the pattern is
constrained to match only at the start of the subject, it is said to be an
"anchored" pattern.
Dollar ($)
A dollar character is an assertion which is true only if the
current matching point is at the end of the subject string, or immediately
before a newline character that is the last character in the string (by
default). Dollar need not be the last character of the pattern if a number of
alternatives are involved, but it should be the last item in any branch in which
it appears. Dollar has no special meaning in a character class.
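The anchoring behaviour of circumflex and dollar in the default mode can be sketched together. The example assumes Python's re module as a stand-in engine, whose default handling of ^ and $ matches the description here; the subject strings are made up.

```python
import re

# ^ succeeds only at the start of the subject (default mode).
at_start = re.search(r"^From:", "From: alice")
not_at_start = re.search(r"^From:", "Re: From: alice")

# Every alternative begins with ^, so this pattern is anchored.
anchored = re.compile(r"^Re:|^Fwd:")
hit = anchored.search("Fwd: report")
miss = anchored.search("My Fwd: report")

# $ matches at the very end of the subject, and also just before a
# newline that is the last character.
end_plain = re.search(r"done$", "all done")
end_newline = re.search(r"done$", "all done\n")
not_end = re.search(r"done$", "all done\nmore")

# Inside a character class, $ is an ordinary character.
literal_dollar = re.search(r"[$]\d+", "cost: $25")

print(hit.group(), miss, end_newline.start(), not_end, literal_dollar.group())
```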
Full stop (period, dot) (.)
Outside a character class, a dot in the pattern matches any
one character in the subject, including a non-printing character, but not (by
default) newline. The handling of dot is entirely independent of the handling of
circumflex and dollar, the only relationship being that they both involve
newline characters. Dot has no special meaning in a character class.
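A short sketch of the dot rules, assuming Python's re module as a stand-in engine with its default settings; the sample strings are hypothetical.

```python
import re

# Dot matches any single character, including a non-printing one such as tab...
tab_match = re.search(r"a.c", "a\tc")

# ...but by default it does not match a newline.
newline_match = re.search(r"a.c", "a\nc")

# Inside a character class, dot is a literal full stop.
literal_dots = re.findall(r"[.]", "v1.2.3")

print(tab_match is not None, newline_match, literal_dots)
```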
Square brackets []
An opening square bracket introduces a character class, terminated by a closing
square bracket. A closing square bracket on its own is not special. If a closing
square bracket is required as a member of the class, it should be the first data
character in the class (after an initial circumflex, if present) or escaped with
a backslash.
A matched character must be in
the set of characters defined by the class, unless the first character in the
class definition is a circumflex, in which case the subject character must not
be in the set defined by the class. If a circumflex is actually required as a
member of the class, ensure it is not the first character, or escape it with a
backslash.
For example, the character class [aeiou] matches any lower case vowel, while [^aeiou]
matches any character that is not a lower case vowel. Note that a circumflex is
just a convenient notation for specifying the characters that are in the class
by enumerating those that are not. It is not an assertion: it still consumes a
character from the subject string, and fails if the current pointer is at the
end of the string.
When caseless matching is set, any letters in a class represent both their upper
case and lower case versions, so for example, a caseless [aeiou] matches "A" as
well as "a", and a caseless [^aeiou] does not match "A", whereas a caseful
version would.
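The class rules above, including negation and caseless matching, can be sketched as follows. Python's re module is assumed as a stand-in engine, with re.IGNORECASE playing the role of the caseless option; the word "Education" is just a sample subject.

```python
import re

vowels = re.findall(r"[aeiou]", "Education")
non_vowels = re.findall(r"[^aeiou]", "Education")

# A negated class still consumes a character, so it cannot match at
# the end of the string.
at_end = re.search(r"x[^aeiou]", "box")

# Caseless matching: letters in the class stand for both cases.
caseless_pos = re.findall(r"[aeiou]", "Education", re.IGNORECASE)
caseless_neg = re.findall(r"[^aeiou]", "Education", re.IGNORECASE)

print(vowels, non_vowels, at_end, caseless_pos, caseless_neg)
```

Note how "E" is matched by the caseful [^aeiou] but not by the caseless version.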
Minus (-)
The minus (hyphen) character can be used to specify a range of characters in a
character class. For example, [d-m] matches any letter between d and m,
inclusive. If a minus character is required in a class, it must be escaped with
a backslash or appear in a position where it cannot be interpreted as indicating
a range, typically as the first or last character in the class.
All non-alphanumeric characters other than \, -, ^ (at the start) and the
terminating ] are non-special in character classes, but it does no harm if they
are escaped.
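Ranges and literal hyphens can be illustrated with a small sketch. It assumes Python's re module as a stand-in engine; the subject strings are invented.

```python
import re

# [d-m] covers every letter from d to m inclusive.
in_range = re.findall(r"[d-m]", "abcdefmnop")

# A hyphen placed last (or first) in the class is literal.
hyphen_last = re.findall(r"[a-c-]", "a-b-z")

# An escaped hyphen is literal in any position.
hyphen_escaped = re.findall(r"[a\-c]", "a-b-c")

print(in_range, hyphen_last, hyphen_escaped)
```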
Vertical bar (|)
Vertical bar characters are used
to separate alternative patterns. For example, the pattern
gilbert|sullivan
matches either "gilbert" or "sullivan". Any number of alternatives may appear,
and an empty alternative is permitted (matching the empty string). The matching
process tries each alternative in turn, from left to right, and the first one
that succeeds is used. If the alternatives are within a subpattern (defined
below), "succeeds" means matching the rest of the main pattern as well as the
alternative in the subpattern.
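The left-to-right rule for alternatives can be sketched as follows, assuming Python's re module as a stand-in engine. The extra patterns (sul|sullivan and the trailing "!") are hypothetical variations added only to show the ordering effect.

```python
import re

names = re.findall(r"gilbert|sullivan", "gilbert and sullivan")

# Alternatives are tried left to right; the first that succeeds wins,
# even when a later alternative would match more text.
first_wins = re.search(r"sul|sullivan", "sullivan").group()

# Inside a subpattern, "succeeds" includes the rest of the pattern:
# "sul" cannot be followed by "!", so the engine moves on to "sullivan".
with_rest = re.search(r"(sul|sullivan)!", "sullivan!").group(1)

# An empty alternative matches the empty string.
empty_alt = re.fullmatch(r"gilbert|", "") is not None

print(names, first_wins, with_rest, empty_alt)
```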
Round brackets ()
Round brackets are used as
subpattern delimiters. Marking part of a pattern as a subpattern does two
things:
1. It localizes a set of
alternatives. For example, the pattern
cat(aract|erpillar|)
matches one of the words "cat",
"cataract", or "caterpillar". Without the parentheses, it would match
"cataract", "erpillar" or the empty string.
2. It sets up the subpattern as a
capturing subpattern: when the whole pattern matches, the portion of the
subject string that matched the subpattern is captured as well. Opening
parentheses are counted from left to right (starting from 1) to obtain the
numbers of the capturing subpatterns.
For example, if the string "the
red king" is matched against the pattern
the ((red|white) (king|queen))
the captured substrings are "red
king", "red", and "king", and are numbered 1, 2, and 3, respectively.
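Both functions of round brackets can be checked with a short sketch. It assumes Python's re module as a stand-in engine, where groups() returns the captured substrings in numbering order.

```python
import re

# Grouping: the alternatives apply only to the part after "cat".
words = [w for w in ("cat", "cataract", "caterpillar")
         if re.fullmatch(r"cat(aract|erpillar|)", w)]

# Capturing: opening parentheses are numbered from left to right.
m = re.search(r"the ((red|white) (king|queen))", "the red king")
captures = m.groups()

print(words, captures)
```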
The fact that plain parentheses
fulfil two functions is not always helpful. There are often times when a
grouping subpattern is required without a capturing requirement. If an opening
parenthesis is followed by a question mark and a colon, the subpattern does not
do any capturing, and is not counted when computing the number of any subsequent
capturing subpatterns. For example, if the string "the white queen" is matched
against the pattern
the ((?:red|white) (king|queen))
the captured substrings are "white queen" and "queen", and are numbered 1 and 2.
The maximum number of capturing subpatterns is 65535, and the maximum depth of
nesting of all subpatterns, both capturing and non-capturing, is 200.
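The effect of (?: on numbering can be sketched as follows, assuming Python's re module as a stand-in engine. The limits on group count and nesting depth quoted above belong to the engine described in this topic and are not checked here.

```python
import re

pattern = re.compile(r"the ((?:red|white) (king|queen))")

m = pattern.search("the white queen")
captures = m.groups()          # the (?:...) group is not counted
group_count = pattern.groups   # number of capturing subpatterns

print(captures, group_count)
```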
Quantifiers
A quantifier can follow any of the following items:
a literal data character
the . metacharacter
the \C escape sequence
escapes such as \d that match single characters
a character class
a back reference (see next section)
a parenthesized subpattern (unless it is an assertion)
The general repetition quantifier specifies a
minimum and maximum number of permitted matches, by giving the two numbers in
curly brackets (braces), separated by a comma. The numbers must be less than
65536, and the first must be less than or equal to the second. For example:
z{2,4}
matches "zz", "zzz", or "zzzz". A
closing brace on its own is not a special character. If the second number is
omitted, but the comma is present, there is no upper limit; if both the second
number and the comma are omitted, the quantifier specifies an exact number of
required matches. Thus
[aeiou]{3,}
matches at least 3 successive
vowels, but may match many more, while
\d{8}
matches exactly 8 digits. An
opening curly bracket that appears in a position where a quantifier is not
allowed, or one that does not match the syntax of a quantifier, is taken as a
literal character. For example, {,6} is not a quantifier, but a literal string
of four characters.
The quantifier {0} is permitted, causing the
expression to behave as if the previous item and the quantifier were not
present.
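The general repetition quantifier can be tried out with the sketch below. It assumes Python's re module as a stand-in engine; the {,6} case is deliberately omitted because Python reads it as a quantifier (0 to 6), unlike the behaviour described here. The subject strings are made up.

```python
import re

# z{2,4}: between two and four z characters, as many as possible.
z_runs = re.findall(r"z{2,4}", "z zz zzz zzzzz")

# [aeiou]{3,}: three or more successive vowels, no upper limit.
vowel_runs = re.findall(r"[aeiou]{3,}", "queueing beautiful")

# \d{8}: exactly eight digits.
eight_digits = re.fullmatch(r"\d{8}", "20240131") is not None
seven_digits = re.fullmatch(r"\d{8}", "2024013") is not None

# b{0}: behaves as if "b" and the quantifier were not present.
zero_times = re.fullmatch(r"ab{0}c", "ac") is not None

print(z_runs, vowel_runs, eight_digits, seven_digits, zero_times)
```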
For convenience the three most
common quantifiers have single-character abbreviations:
*    is equivalent to {0,}
+    is equivalent to {1,}
?    is equivalent to {0,1}
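The equivalences can be spot-checked on a handful of sample strings. This is only a sketch under the assumption that Python's re module stands in for the engine; the helper matched and the sample list are hypothetical.

```python
import re

samples = ["", "a", "aa", "aaa", "b"]

def matched(pattern):
    """Return the samples that the pattern matches in full."""
    return [s for s in samples if re.fullmatch(pattern, s)]

star_same = matched(r"a*") == matched(r"a{0,}")
plus_same = matched(r"a+") == matched(r"a{1,}")
opt_same = matched(r"a?") == matched(r"a{0,1}")

print(star_same, plus_same, opt_same)
```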
By default, the quantifiers are
"greedy", that is, they match as much as possible (up to the maximum number of
permitted times), without causing the rest of the pattern to fail. The classic
example of where this gives problems is in trying to match comments in C
programs. These appear between the sequences /* and */ and within the sequence,
individual * and / characters may appear. An attempt to match C comments by
applying the pattern
/\*.*\*/
to the string
/* first comment */ not comment /* second comment */
fails, because it matches the
entire string owing to the greediness of the .* item.
However, if a quantifier is followed by a
question mark, it ceases to be greedy, and instead matches the minimum number of
times possible, so the pattern
/\*.*?\*/
does the right thing with the C
comments. The meaning of the various quantifiers is not otherwise changed, just
the preferred number of matches. Do not confuse this use of question mark with
its use as a quantifier in its own right. Because it has two uses, it can
sometimes appear doubled, as in
\d??\d
which matches one digit by
preference, but can match two if that is the only way the rest of the pattern
matches.
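Greedy and minimizing quantifiers can be compared side by side. The sketch assumes Python's re module as a stand-in engine and uses a single-line subject with two C comments.

```python
import re

source = "/* first comment */ not comment /* second comment */"

# Greedy: .* runs to the last */ in the subject.
greedy = re.findall(r"/\*.*\*/", source)

# Minimizing: .*? stops at the first */ that lets the pattern match.
lazy = re.findall(r"/\*.*?\*/", source)

# \d??\d prefers one digit, but takes two when the rest of the
# pattern (here, reaching the end of the subject) requires it.
one_digit = re.search(r"\d??\d", "42").group()
two_digits = re.fullmatch(r"\d??\d", "42").group()

print(greedy, lazy, one_digit, two_digits)
```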