G-Lock Software

Using Regular Expressions

This topic describes the syntax and semantics of the regular expressions supported by G-Lock Email Processor.

A regular expression is a pattern that is matched against a subject string from left to right. Most characters stand for themselves in a pattern, and match the corresponding characters in the subject. As a trivial example, the pattern

  The quick brown fox

matches a portion of a subject string that is identical to itself. The power of regular expressions comes from the ability to include alternatives and repetitions in the pattern. These are encoded in the pattern by the use of meta-characters, which do not stand for themselves but instead are interpreted in some special way.

There are two different sets of meta-characters: those that are recognized anywhere in the pattern except within square brackets, and those that are recognized in square brackets.

Outside square brackets, the meta-characters are as follows:

  \  (backslash)   - general escape character with several uses

  ^  (circumflex) - asserts start of string (or line, in multiline mode)

  $  (dollar)  -  asserts end of string (or line, in multiline mode)

  .   (full stop (period, dot)) - matches any character except newline (by default)

  [ ] (square brackets)  -  start and end character class definition

  |  (vertical bar) -  starts an alternative branch

  ( ) (round brackets) -  start and end subpattern

  ?      extends the meaning of (, also 0 or 1 quantifier, also quantifier minimizer

  *      0 or more quantifier

  +      1 or more quantifier, also "possessive quantifier"

  {     start min/max quantifier

Part of a pattern that is in square brackets is called a "character class". In a character class the only meta-characters are:

  \      general escape character

  ^      negates the class, but only if the first character

  -      indicates character range

  [      POSIX character class (only if followed by POSIX syntax)

  ]      terminates the character class  

Let's describe each of these meta-characters in detail.

Backslash (\)

The backslash character has several uses:

1) if it is followed by a non-alphameric character, it takes away any special meaning that character may have. This use of backslash as an escape character applies both inside and outside character classes.

For example, if you want to match a * character, you write \* in the pattern. This escaping action applies whether or not the following character would otherwise be interpreted as a meta-character, so it is always safe to precede a non-alphameric with backslash to specify that it stands for itself. In particular, if you want to match a backslash, you write \\.
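The escaping rule above can be tried out in a few lines. This is only a sketch that uses Python's built-in re module as a stand-in; G-Lock Email Processor's own engine may differ in details, and the sample strings are made up for illustration.

```python
import re

# A bare * is a quantifier; \* matches a literal asterisk.
literal_star = re.compile(r"2\*3")
print(literal_star.search("2*3 = 6") is not None)   # True
print(literal_star.search("223") is None)           # True

# \\ in the pattern matches a single backslash in the subject.
backslash = re.compile(r"C:\\temp")
print(backslash.search(r"C:\temp\mail.txt") is not None)  # True

# re.escape() backslash-escapes the special characters in a literal string,
# which is handy when the text to match comes from user input.
escaped = re.escape("price (USD)*")
print(re.fullmatch(escaped, "price (USD)*") is not None)  # True
```

re.escape is a Python convenience, not a feature of the pattern syntax itself; it simply applies the "precede it with a backslash" rule for you.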

2) the second use of backslash provides a way of encoding non-printing characters in patterns in a visible manner. There is no restriction on the appearance of non-printing characters, apart from the binary zero that terminates a pattern, but when a pattern is being prepared by text editing, it is usually easier to use one of the following escape sequences than the binary character it represents:

  \a        alarm, that is, the BEL character (hex 07)

  \cx       "control-x", where x is any character

  \e        escape (hex 1B)

  \f        formfeed (hex 0C)

  \n        newline (hex 0A)

  \r        carriage return (hex 0D)

  \t        tab (hex 09)

  \ddd      character with octal code ddd, or backreference

  \xhh      character with hex code hh

  \x{hhh..} character with hex code hhh... (UTF-8 mode only)
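As an illustration of these escape sequences, the sketch below uses Python's re module as a stand-in for the program's engine. It covers only \t, \r, \n, \xhh and the octal form; Python's re does not accept \e or \cx, so those are left out. The sample subject is invented.

```python
import re

subject = "name\tvalue\r\n"

# \t, \x09 and octal \011 are three spellings of the same tab character.
tab_patterns = (r"\t", r"\x09", r"\011")
for pattern in tab_patterns:
    print(pattern, re.search(pattern, subject) is not None)

# \r\n matches a carriage return followed by a newline.
crlf = re.search(r"\r\n", subject)
print(crlf.span())  # (10, 12)
```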

3) the third use of backslash is for specifying generic character types:

  \d     any decimal digit

  \D     any character that is not a decimal digit

  \s     any whitespace character

  \S     any character that is not a whitespace character

  \w     any "word" character

  \W     any "non-word" character

Each pair of escape sequences partitions the complete set of characters into two disjoint sets. Any given character matches one, and only one, of each pair.
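The partitioning can be checked directly. The following is a minimal sketch with Python's re module standing in for the program's engine; the re.ASCII flag is an assumption made here so that \d, \s and \w cover only ASCII characters, and the subject string is invented.

```python
import re

subject = "Order 42 shipped"

digits = re.findall(r"\d", subject, flags=re.ASCII)
non_digits = re.findall(r"\D", subject, flags=re.ASCII)
print(digits)  # ['4', '2']

# Every character matches exactly one of \d and \D.
print(len(digits) + len(non_digits) == len(subject))  # True

# \w+ collects runs of "word" characters; \s matches the separators.
words = re.findall(r"\w+", subject, flags=re.ASCII)
spaces = re.findall(r"\s", subject, flags=re.ASCII)
print(words)  # ['Order', '42', 'shipped']
```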

4) the fourth use of backslash is for certain simple assertions. An assertion specifies a condition that has to be met at a particular point in a match, without consuming any characters from the subject string. The backslashed assertions are:

  \b     matches at a word boundary

  \B     matches when not at a word boundary

  \A     matches at start of subject

  \Z     matches at end of subject or before newline at end

  \z     matches at end of subject

  \G     matches at first matching position in subject

These assertions may not appear in character classes (but note that \b has a different meaning, namely the backspace character, inside a character class).
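A short sketch of the simple assertions, using Python's re module as a stand-in for the program's engine. Only \b, \B and \A are shown, because Python treats \Z differently from the table above and lacks \G. The subject string is invented.

```python
import re

subject = "cat concat cat"

# \bcat\b matches "cat" only as a whole word (offsets 0 and 11).
whole_words = [m.start() for m in re.finditer(r"\bcat\b", subject)]
print(whole_words)  # [0, 11]

# \Bcat matches "cat" only when it does not start at a word boundary.
inside_words = [m.start() for m in re.finditer(r"\Bcat", subject)]
print(inside_words)  # [7]

# \A is true only at the very start of the subject.
print(re.search(r"\Acat", subject) is not None)  # True
print(re.search(r"\Acon", subject) is None)      # True

# Inside a character class, \b means the backspace character instead.
print(re.search(r"[\b]", "a\bb") is not None)    # True
```

Note that the assertions consume nothing: the reported offsets are those of "cat" itself.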

Circumflex (^)  

Outside a character class, in the default matching mode, the circumflex character is an assertion which is true only if the current matching point is at the start of the subject string.

Inside a character class, circumflex has an entirely different meaning (see "Square brackets" below).

Outside a character class, circumflex need not be the first character of the pattern if a number of alternatives are involved, but it should be the first thing in each alternative in which it appears if the pattern is ever to match that branch. If all possible alternatives start with a circumflex, that is, if the pattern is constrained to match only at the start of the subject, it is said to be an "anchored" pattern.

Dollar ($)  

A dollar character is an assertion which is true only if the current matching point is at the end of the subject string, or immediately before a newline character that is the last character in the string (by default). Dollar need not be the last character of the pattern if a number of alternatives are involved, but it should be the last item in any branch in which it appears. Dollar has no special meaning in a character class.
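The default behaviour of circumflex and dollar can be sketched as follows. Python's re module is used as a stand-in for the program's engine, and the subject strings are invented for illustration.

```python
import re

subject = "remove\n"

# ^ is true only at the start of the subject in the default mode.
starts_ok = re.search(r"^remove", subject)
starts_bad = re.search(r"^move", subject)
print(starts_ok is not None, starts_bad is None)  # True True

# $ is true at the very end, or just before a newline that ends the subject.
ends = re.search(r"remove$", subject)
print(ends.span())  # (0, 6)

# Every alternative begins with ^, so this pattern is "anchored".
anchored = re.compile(r"^remove|^unsubscribe")
print(anchored.search("unsubscribe please") is not None)  # True
print(anchored.search("please unsubscribe") is None)      # True
```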

Full stop (period, dot) (.)

Outside a character class, a dot in the pattern matches any one character in the subject, including a non-printing character, but not (by default) newline. The handling of dot is entirely independent of the handling of circumflex and dollar, the only relationship being that they both involve newline characters. Dot has no special meaning in a character class.
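A small sketch of the dot's default behaviour, with Python's re module as a stand-in for the program's engine. re.DOTALL is Python's name for the option that lets dot match newline; how G-Lock Email Processor exposes such an option is not covered here.

```python
import re

subject = "line one\nline two"

# By default the dot does not match the newline between the two lines.
default_dot = re.search(r"one.line", subject)
print(default_dot is None)  # True

# With the "dot matches newline" option (re.DOTALL in Python) it does.
dotall_dot = re.search(r"one.line", subject, flags=re.DOTALL)
print(dotall_dot is not None)  # True

# A non-printing character such as a tab is matched like any other.
tab_dot = re.search(r"a.c", "a\tc")
print(tab_dot is not None)  # True
```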

Square brackets []

An opening square bracket introduces a character class, terminated by a closing square bracket. A closing square bracket on its own is not special. If a closing square bracket is required as a member of the class, it should be the first data character in the class (after an initial circumflex, if present) or escaped with a backslash.

A matched character must be in the set of characters defined by the class, unless the first character in the class definition is a circumflex, in which case the subject character must not be in the set defined by the class. If a circumflex is actually required as a member of the class, ensure it is not the first character, or escape it with a backslash.

For example, the character class [aeiou] matches any lower case vowel, while [^aeiou] matches any character that is not a lower case vowel. Note that a circumflex is just a convenient notation for specifying the characters that are in the class by enumerating those that are not. It is not an assertion: it still consumes a character from the subject string, and fails if the current pointer is at the end of the string.

When caseless matching is set, any letters in a class represent both their upper case and lower case versions, so for example, a caseless [aeiou] matches "A" as well as "a", and a caseless [^aeiou] does not match "A", whereas a caseful version would.
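The character-class rules above can be illustrated with a short sketch. Python's re module stands in for the program's engine, re.IGNORECASE is Python's name for caseless matching, and the sample words are invented.

```python
import re

word = "Email"

vowels = re.findall(r"[aeiou]", word)
non_vowels = re.findall(r"[^aeiou]", word)
print(vowels, non_vowels)  # ['a', 'i'] ['E', 'm', 'l']

# With caseless matching, "E" counts as a vowel in both classes.
vowels_ci = re.findall(r"[aeiou]", word, flags=re.IGNORECASE)
non_vowels_ci = re.findall(r"[^aeiou]", word, flags=re.IGNORECASE)
print(vowels_ci, non_vowels_ci)  # ['E', 'a', 'i'] ['m', 'l']

# A ] placed first in the class is an ordinary member of the class.
brackets = re.findall(r"[]x]", "a]x")
print(brackets)  # [']', 'x']

# A negated class still consumes a character, so it fails at end of string.
print(re.search(r"a[^b]", "a") is None)  # True
```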

Minus (-)

The minus (hyphen) character can be used to specify a range of characters in a character class. For example, [d-m] matches any letter between d and m, inclusive. If a minus character is required in a class, it must be escaped with a backslash or appear in a position where it cannot be interpreted as indicating a range, typically as the first or last character in the class.

All non-alphameric characters other than \, -, ^ (at the start) and the terminating ] are non-special in character classes, but it does no harm if they are escaped.
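A brief sketch of ranges and literal hyphens, again using Python's re module as a stand-in for the program's engine, with invented sample strings.

```python
import re

# [d-m] matches any letter from d to m, inclusive.
in_range = re.findall(r"[d-m]", "admin")
print(in_range)  # ['d', 'm', 'i']

# A hyphen placed last in the class cannot form a range, so it is literal.
trailing_hyphen = re.findall(r"[a-c-]", "a-z")
print(trailing_hyphen)  # ['a', '-']

# An escaped hyphen is literal anywhere in the class.
escaped_hyphen = re.findall(r"[a\-z]", "a-z b")
print(escaped_hyphen)  # ['a', '-', 'z']
```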

Vertical bar (|)

Vertical bar characters are used to separate alternative patterns. For example, the pattern

  gilbert|sullivan

matches either "gilbert" or "sullivan". Any number of alternatives may appear, and an empty alternative is permitted (matching the empty string). The matching process tries each alternative in turn, from left to right, and the first one that succeeds is used. If the alternatives are within a subpattern (defined below), "succeeds" means matching the rest of the main pattern as well as the alternative in the subpattern.
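The left-to-right trial of alternatives can be observed in a short sketch. Python's re module is a stand-in for the program's engine, and the subject strings other than "gilbert" and "sullivan" are invented.

```python
import re

either = re.search(r"gilbert|sullivan", "arthur sullivan")
print(either.group())  # sullivan

# The first alternative that succeeds is used, even if a later one is longer.
first_wins = re.search(r"sub|subscribe", "subscribe")
print(first_wins.group())  # sub

# Inside a subpattern, "succeeds" includes the rest of the pattern: "sub"
# cannot be followed by the end of the subject, so "subscribe" is chosen.
rest_matters = re.search(r"(sub|subscribe)$", "subscribe")
print(rest_matters.group(1))  # subscribe

# An empty alternative matches the empty string.
empty_alt = re.fullmatch(r"un|", "")
print(empty_alt is not None)  # True
```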

Round brackets ()

Round brackets are used as subpattern delimiters. Marking part of a pattern as a subpattern does two things:

1. It localizes a set of alternatives. For example, the pattern

  cat(aract|erpillar|)

matches one of the words "cat", "cataract", or "caterpillar". Without the parentheses, it would match "cataract", "erpillar" or the empty string.

2. It sets up the subpattern as a capturing subpattern: when the whole pattern matches, the portion of the subject string that matched the subpattern is captured. Opening parentheses are counted from left to right (starting from 1) to obtain the numbers of the capturing subpatterns.

For example, if the string "the red king" is matched against the pattern

  the ((red|white) (king|queen))

the captured substrings are "red king", "red", and "king", and are numbered 1, 2, and 3, respectively.

The fact that plain parentheses fulfil two functions is not always helpful. There are often times when a grouping subpattern is required without a capturing requirement. If an opening parenthesis is followed by a question mark and a colon, the subpattern does not do any capturing, and is not counted when computing the number of any subsequent capturing subpatterns. For example, if the string "the white queen" is matched against the pattern

  the ((?:red|white) (king|queen))

the captured substrings are "white queen" and "queen", and are numbered 1 and 2. The maximum number of capturing subpatterns is 65535, and the maximum depth of nesting of all subpatterns, both capturing and non-capturing, is 200.
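The numbering of capturing and non-capturing subpatterns described above can be reproduced in a few lines. This sketch uses Python's re module as a stand-in for the program's engine; the patterns and subjects are the ones from the text.

```python
import re

# Plain parentheses capture, numbered by their opening bracket.
capturing = re.search(r"the ((red|white) (king|queen))", "the red king")
print(capturing.groups())  # ('red king', 'red', 'king')

# (?: ... ) groups without capturing and is skipped in the numbering.
non_capturing = re.search(r"the ((?:red|white) (king|queen))", "the white queen")
print(non_capturing.groups())  # ('white queen', 'queen')

# Parentheses also localize the alternatives to the part after "cat".
cat_words = re.compile(r"cat(aract|erpillar|)")
matched = [w for w in ("cat", "cataract", "caterpillar", "erpillar")
           if cat_words.fullmatch(w)]
print(matched)  # ['cat', 'cataract', 'caterpillar']
```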

Quantifiers

The quantifiers can follow any of the following items:

  a literal data character

  the . metacharacter

  the \C escape sequence

  escapes such as \d that match single characters

  a character class

  a back reference

  a parenthesized subpattern (unless it is an assertion)

The general repetition quantifier specifies a minimum and maximum number of permitted matches, by giving the two numbers in curly brackets (braces), separated by a comma. The numbers must be less than 65536, and the first must be less than or equal to the second. For example:

  z{2,4}

matches "zz", "zzz", or "zzzz". A closing brace on its own is not a special character. If the second number is omitted, but the comma is present, there is no upper limit; if both the second number and the comma are omitted, the quantifier specifies an exact number of required matches. Thus

  [aeiou]{3,}

matches at least 3 successive vowels, but may match many more, while

  \d{8}

matches exactly 8 digits. An opening curly bracket that appears in a position where a quantifier is not allowed, or one that does not match the syntax of a quantifier, is taken as a literal character. For example, {,6} is not a quantifier, but a literal string of four characters.

The quantifier {0} is permitted, causing the expression to behave as if the previous item and the quantifier were not present.
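The three quantifier examples above can be checked with a short sketch. Python's re module is used as a stand-in for the program's engine; the {,6} case is deliberately left out because Python reads it as a quantifier, unlike the behaviour described here. The subject strings are invented.

```python
import re

# z{2,4} accepts two to four z characters and nothing else.
z_results = {s: re.fullmatch(r"z{2,4}", s) is not None
             for s in ("z", "zz", "zzzz", "zzzzz")}
print(z_results)  # {'z': False, 'zz': True, 'zzzz': True, 'zzzzz': False}

# [aeiou]{3,} needs at least three vowels but takes as many as it can.
vowel_run = re.search(r"[aeiou]{3,}", "queueing")
print(vowel_run.group())  # ueuei

# \d{8} requires exactly eight digits.
eight = re.fullmatch(r"\d{8}", "20240131")
seven = re.fullmatch(r"\d{8}", "2024013")
print(eight is not None, seven is None)  # True True
```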

For convenience the three most common quantifiers have single-character abbreviations:

  *    is equivalent to {0,}

  +    is equivalent to {1,}

  ?    is equivalent to {0,1}

By default, the quantifiers are "greedy", that is, they match as much as possible (up to the maximum number of permitted times), without causing the rest of the pattern to fail. The classic example of where this gives problems is in trying to match comments in C programs. These appear between the sequences /* and */ and within the sequence, individual * and / characters may appear. An attempt to match C comments by applying the pattern

  /\*.*\*/

to the string

  /* first comment */  not comment  /* second comment */

fails, because it matches the entire string owing to the greediness of the .* item.

However, if a quantifier is followed by a question mark, it ceases to be greedy, and instead matches the minimum number of times possible, so the pattern

  /\*.*?\*/

does the right thing with the C comments. The meaning of the various quantifiers is not otherwise changed, just the preferred number of matches. Do not confuse this use of question mark with its use as a quantifier in its own right. Because it has two uses, it can sometimes appear doubled, as in

  \d??\d

which matches one digit by preference, but can match two if that is the only way the rest of the pattern matches.
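The difference between greedy and minimizing quantifiers can be seen side by side in the sketch below. Python's re module stands in for the program's engine, and the subject is the C comment example from the text.

```python
import re

subject = "/* first comment */  not comment  /* second comment */"

# Greedy .* runs to the last */ and swallows the whole subject.
greedy = re.findall(r"/\*.*\*/", subject)
print(greedy == [subject])  # True

# Minimizing .*? stops at the first */ it can, so each comment is separate.
lazy = re.findall(r"/\*.*?\*/", subject)
print(lazy)  # ['/* first comment */', '/* second comment */']

# \d??\d prefers one digit, but takes two when the rest of the match needs it.
one_digit = re.match(r"\d??\d", "12").group()
two_digits = re.fullmatch(r"\d??\d", "12").group()
print(one_digit, two_digits)  # 1 12
```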
