package gnu.regexp;
Syntax and Usage Notes
This page was last updated on 22 June 2001

Brief Background
A regular expression is a character string in which some characters are given special meaning for pattern matching. Regular expressions have been in use from the early days of computing, and provide a powerful and efficient way to parse, interpret, search and replace text within an application.

Supported Syntax
Within a regular expression, the following characters have special meaning:

Unsupported Syntax
Some flavors of regular expression utilities support additional syntax that gnu.regexp does not. The list below is not meant to be exhaustive. In the future, gnu.regexp may support some or all of the following:

(?mods) inlined compilation/execution modifiers (Perl5)
\G end of previous match (Perl5)
[.symbol.] collating symbol in class expression (POSIX)
[=class=] equivalence class in class expression (POSIX)
s/foo/bar/ style expressions as in sed and awk (note: these can be accomplished through other means in the API)

Java Integration
In a Java environment, a regular expression operates on a string of Unicode characters, represented either as an instance of java.lang.String or as an array of the primitive char type. This means that the unit of matching is a Unicode character, not a single byte. Generally this will not present problems in a Java program, because Java takes pains to ensure that all textual data uses the Unicode standard.
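
The difference between characters and bytes can be seen with a small sketch that uses only the standard library. The class and method names here are hypothetical, chosen for illustration, and are not part of gnu.regexp. It compares the number of char units in a string with the number of bytes the same string occupies in UTF-8.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical demo class; not part of gnu.regexp.
class CharUnitDemo {
    // Number of char units, which is what a Java regular expression walks over.
    static int charCount(String s) {
        return s.length();
    }

    // Number of bytes the same text occupies when encoded as UTF-8.
    static int utf8ByteCount(String s) {
        return s.getBytes(StandardCharsets.UTF_8).length;
    }

    public static void main(String[] args) {
        // The German word "fur" spelled with u-umlaut (code point 00FC).
        String word = "f\u00fcr";
        System.out.println(charCount(word));     // prints 3
        System.out.println(utf8ByteCount(word)); // prints 4
    }
}
```

A pattern that matches one character would therefore consume the whole umlaut letter in one step, even though a byte-oriented tool reading UTF-8 would see two bytes there.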

Because Java string processing takes care of certain escape sequences, they are not implemented in gnu.regexp. You should be aware that the following escape sequences are handled by the Java compiler if found in the Java source:

\b backspace
\f form feed
\n newline
\r carriage return
\t horizontal tab
\" double quote
\' single quote
\\ backslash
\xxx character, in octal (000-377)
\uxxxx Unicode character, in hexadecimal (0000-FFFF)
In addition, note that the \u escape sequences are meaningful anywhere in a Java program, not merely within a singly- or doubly-quoted character string, and are converted prior to any of the other escape sequences. For example, the line
gnu.regexp.RE exp = new gnu.regexp.RE("\u005cn");
would be converted by first replacing \u005c with a backslash, then converting \n to a newline. By the time the RE constructor is called, it will be passed a String object containing only the Unicode newline character.
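
The conversion order described above can be checked without gnu.regexp at all, since it happens in the compiler. The following is a minimal sketch with hypothetical class and method names. It dumps the characters of two string literals: the one from the example, and one in which the backslash is doubled so that a backslash followed by "n" survives compilation and would reach a regular expression constructor intact.

```java
// Hypothetical demo class; not part of gnu.regexp.
class EscapeOrderDemo {
    // Lists each char of s as a hexadecimal value, separated by spaces.
    static String hexDump(String s) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < s.length(); i++) {
            if (i > 0) {
                sb.append(' ');
            }
            sb.append(Integer.toHexString(s.charAt(i)));
        }
        return sb.toString();
    }

    // The literal from the text: the Unicode escape becomes a backslash
    // first, and the resulting backslash-n pair then becomes a newline.
    static String compilerEscaped() {
        return "\u005cn";
    }

    // A doubled backslash yields two chars: a backslash and the letter n.
    static String literalBackslashN() {
        return "\\n";
    }

    public static void main(String[] args) {
        System.out.println(hexDump(compilerEscaped()));   // prints "a"
        System.out.println(hexDump(literalBackslashN())); // prints "5c 6e"
    }
}
```

In practice this means that any backslash intended for the regular expression itself, rather than for the Java compiler, has to be written twice in Java source.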

The POSIX character classes (above), and the equivalent shorthand escapes (\d, \w and the like) are implemented to use the java.lang.Character static functions whenever possible. For example, \w and [:alnum:] (the latter only from within a class expression) will invoke the Java function Character.isLetterOrDigit() when executing. It is always better to use the POSIX expressions than a range such as [a-zA-Z0-9], because the latter will not match any letters outside the ASCII range (for example, the letter "ü").
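
To illustrate why the range falls short, here is a small sketch that calls Character.isLetterOrDigit() directly and compares it with a hand-written test equivalent to [a-zA-Z0-9]. The class and the inAsciiRange helper are hypothetical and exist only for this comparison. The sketch does not call gnu.regexp itself.

```java
// Hypothetical demo class; not part of gnu.regexp.
class WordCharDemo {
    // Hand-written equivalent of the range [a-zA-Z0-9].
    static boolean inAsciiRange(char c) {
        return (c >= 'a' && c <= 'z')
            || (c >= 'A' && c <= 'Z')
            || (c >= '0' && c <= '9');
    }

    public static void main(String[] args) {
        char uUmlaut = '\u00fc'; // the letter u with an umlaut

        // The Character function recognizes the letter.
        System.out.println(Character.isLetterOrDigit(uUmlaut)); // prints true

        // The explicit ASCII range does not.
        System.out.println(inAsciiRange(uUmlaut)); // prints false

        // Both agree on plain ASCII input.
        System.out.println(Character.isLetterOrDigit('q') == inAsciiRange('q')); // prints true
    }
}
```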

Reference Material

Notes
1 but see the REG_NOTBOL and REG_MULTILINE flags
2 but see the REG_NOTEOL and REG_MULTILINE flags
3 but see the REG_MULTILINE flag
