org.apache.xerces.impl.xpath.regex
Class RegularExpression

java.lang.Object
  extended by org.apache.xerces.impl.xpath.regex.RegularExpression
All Implemented Interfaces:
java.io.Serializable

public class RegularExpression
extends java.lang.Object
implements java.io.Serializable

A regular expression matching engine based on a non-deterministic finite automaton (NFA). This engine does not conform to POSIX regular expressions.


How to use

A. Standard way
 RegularExpression re = new RegularExpression(regex);
 if (re.matches(text)) { ... }
 
B. Capturing groups
 RegularExpression re = new RegularExpression(regex);
 Match match = new Match();
 if (re.matches(text, match)) {
     ... // You can refer to the captured text with the methods of the Match class.
 }
 

Case-insensitive matching

 RegularExpression re = new RegularExpression(regex, "i");
 if (re.matches(text)) { ... }
 

Options

You can pass options to RegularExpression(regex, options) or setPattern(regex, options). The options parameter is a string made up of the following characters.

"i"
This option indicates case-insensitive matching.
"m"
^ and $ also match at the EOL characters within the text (multi-line mode).
"s"
. matches any one character, including EOL characters (single-line mode).
"u"
Redefines \d \D \w \W \s \S \b \B \< \> according to Unicode character properties.
"w"
With this option, \b \B \< \> are processed by the method of 'Unicode Regular Expression Guidelines' Revision 4. When "w" and "u" are specified at the same time, \b \B \< \> follow the "w" option.
","
The parser treats a comma in a character class as a range separator. Without this option, [a,b] matches a, a comma, or b. With this option, [a,b] matches a or b.
"X"
With this option, the engine conforms to XML Schema: Regular Expression. The matches() method does not perform substring matching but matches against the entire string.
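
The difference between the default substring ("contains") matching and the entire-string matching selected by "X" can be sketched with the JDK's own java.util.regex package. This is only a stand-in: it is not the Xerces engine, its syntax differs in places, and the class and method names below are hypothetical.

```java
import java.util.regex.Pattern;

// Stand-in sketch using java.util.regex, NOT the Xerces RegularExpression class.
final class ContainsVsEntire {
    /** Substring matching: true if any part of text matches regex. */
    static boolean contains(String regex, String text) {
        return Pattern.compile(regex).matcher(text).find();
    }

    /** Entire-string matching: true only if the whole text matches regex. */
    static boolean entire(String regex, String text) {
        return Pattern.compile(regex).matcher(text).matches();
    }

    public static void main(String[] args) {
        System.out.println(contains("b+", "abbc")); // substring match
        System.out.println(entire("b+", "abbc"));   // whole string must match
    }
}
```

Under this analogy, a RegularExpression built without "X" behaves like contains(), and one built with "X" behaves like entire().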

Syntax

Differences from the Perl 5 regular expression

  • There is a 6-digit hexadecimal character representation (\vHHHHHH).
  • Supports subtraction, union, and intersection operations for character classes.
  • Not supported: \ooo (octal character representations), \G, \C, \lc, \uc, \L, \U, \E, \Q, \N{name}, (?{code}), (??{code})

Meta characters are `. * + ? { [ ( ) | \ ^ $'.
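
Because \Q and \E are not supported, literal text has to be escaped by hand before it is embedded in a pattern. The following is a minimal sketch that prefixes each meta character listed above with a backslash. The class name MetaQuoter is hypothetical, and the sketch assumes the parser accepts a backslash before any meta character as a literal; it does not handle characters that are special only inside a character class.

```java
// Hypothetical helper, not part of Xerces.
final class MetaQuoter {
    // The meta characters listed in the text: . * + ? { [ ( ) | \ ^ $
    private static final String META = ".*+?{[()|\\^$";

    /** Returns literal with a backslash inserted before every meta character. */
    static String quote(String literal) {
        StringBuilder sb = new StringBuilder(literal.length() + 8);
        for (int i = 0; i < literal.length(); i++) {
            char c = literal.charAt(i);
            if (META.indexOf(c) >= 0) {
                sb.append('\\'); // escape the meta character
            }
            sb.append(c);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(quote("1+1=2?"));
    }
}
```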


BNF for the regular expression

 regex ::= ('(?' options ')')? term ('|' term)*
 term ::= factor+
 factor ::= anchors | atom (('*' | '+' | '?' | minmax ) '?'? )?
            | '(?#' [^)]* ')'
 minmax ::= '{' ([0-9]+ | [0-9]+ ',' | ',' [0-9]+ | [0-9]+ ',' [0-9]+) '}'
 atom ::= char | '.' | char-class | '(' regex ')' | '(?:' regex ')' | '\' [0-9]
          | '\w' | '\W' | '\d' | '\D' | '\s' | '\S' | category-block | '\X'
          | '(?>' regex ')' | '(?' options ':' regex ')'
          | '(?' ('(' [0-9] ')' | '(' anchors ')' | looks) term ('|' term)? ')'
 options ::= [imsw]* ('-' [imsw]+)?
 anchors ::= '^' | '$' | '\A' | '\Z' | '\z' | '\b' | '\B' | '\<' | '\>'
 looks ::= '(?=' regex ')'  | '(?!' regex ')'
           | '(?<=' regex ')' | '(?<!' regex ')'
 char ::= '\\' | '\' [efnrtv] | '\c' [@-_] | code-point | character-1
 category-block ::= '\' [pP] category-symbol-1
                    | ('\p{' | '\P{') (category-symbol | block-name
                                       | other-properties) '}'
 category-symbol-1 ::= 'L' | 'M' | 'N' | 'Z' | 'C' | 'P' | 'S'
 category-symbol ::= category-symbol-1 | 'Lu' | 'Ll' | 'Lt' | 'Lm' | 'Lo'
                     | 'Mn' | 'Me' | 'Mc' | 'Nd' | 'Nl' | 'No'
                     | 'Zs' | 'Zl' | 'Zp' | 'Cc' | 'Cf' | 'Cn' | 'Co' | 'Cs'
                     | 'Pd' | 'Ps' | 'Pe' | 'Pc' | 'Po'
                     | 'Sm' | 'Sc' | 'Sk' | 'So'
 block-name ::= (See above)
 other-properties ::= 'ALL' | 'ASSIGNED' | 'UNASSIGNED'
 character-1 ::= (any character except meta-characters)

 char-class ::= '[' ranges ']'
                | '(?[' ranges ']' ([-+&] '[' ranges ']')? ')'
 ranges ::= '^'? (range ','?)+
 range ::= '\d' | '\w' | '\s' | '\D' | '\W' | '\S' | category-block
           | range-char | range-char '-' range-char
 range-char ::= '\[' | '\]' | '\\' | '\' [,-efnrtv] | code-point | character-2
 code-point ::= '\x' hex-char hex-char
                | '\x{' hex-char+ '}'
                | '\v' hex-char hex-char hex-char hex-char hex-char hex-char
 hex-char ::= [0-9a-fA-F]
 character-2 ::= (any character except \[]-,)
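
As an illustration of how one production in the grammar reads, the three code-point forms ('\x' with two hex digits, '\x{...}', and '\v' with six hex digits) can be decoded as sketched below. This is not the parser used by the engine. The class name, the 8-digit cap, and the rejection of values above U+10FFFF are assumptions of the sketch.

```java
// Hypothetical decoder for the code-point production; not the Xerces parser.
final class CodePointEscape {
    /**
     * Decodes a single code-point escape. Returns the code point, or -1 if
     * the input is not exactly one of the three forms in the grammar.
     */
    static int decode(String s) {
        String hex;
        if (s.startsWith("\\x{") && s.endsWith("}")) {
            hex = s.substring(3, s.length() - 1);   // '\x{' hex-char+ '}'
        } else if (s.startsWith("\\x") && s.length() == 4) {
            hex = s.substring(2);                   // '\x' hex-char hex-char
        } else if (s.startsWith("\\v") && s.length() == 8) {
            hex = s.substring(2);                   // '\v' followed by six hex-chars
        } else {
            return -1;
        }
        // hex-char ::= [0-9a-fA-F]; the 8-digit cap is a limit of this sketch only.
        if (!hex.matches("[0-9a-fA-F]{1,8}")) {
            return -1;
        }
        long value = Long.parseLong(hex, 16);
        // Rejecting values above U+10FFFF is an assumption of this sketch.
        return value <= Character.MAX_CODE_POINT ? (int) value : -1;
    }

    public static void main(String[] args) {
        System.out.println(Integer.toHexString(decode("\\x41")));
        System.out.println(Integer.toHexString(decode("\\x{1F600}")));
        System.out.println(Integer.toHexString(decode("\\v000041")));
    }
}
```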
 


Version:
$Id: RegularExpression.java 446721 2006-09-15 20:35:34Z mrglavas $
Author:
TAMURA Kent <kent@trl.ibm.co.jp>
See Also:
Serialized Form

Constructor Summary
RegularExpression(java.lang.String regex)
          Creates a new RegularExpression instance.
RegularExpression(java.lang.String regex, java.lang.String options)
          Creates a new RegularExpression instance with options.
 
Method Summary
 boolean equals(java.lang.Object obj)
          Returns true if the patterns are the same and the options are equivalent.
 int getNumberOfGroups()
          Returns the number of regular expression groups.
 java.lang.String getOptions()
          Returns the option string.
 java.lang.String getPattern()
           
 int hashCode()
           
 boolean matches(char[] target)
          Checks whether the target text contains this pattern or not.
 boolean matches(char[] target, int start, int end)
          Checks whether the target text contains this pattern in specified range or not.
 boolean matches(char[] target, int start, int end, Match match)
          Checks whether the target text contains this pattern in specified range or not.
 boolean matches(char[] target, Match match)
          Checks whether the target text contains this pattern or not.
 boolean matches(java.text.CharacterIterator target)
          Checks whether the target text contains this pattern or not.
 boolean matches(java.text.CharacterIterator target, Match match)
          Checks whether the target text contains this pattern or not.
 boolean matches(java.lang.String target)
          Checks whether the target text contains this pattern or not.
 boolean matches(java.lang.String target, int start, int end)
          Checks whether the target text contains this pattern in specified range or not.
 boolean matches(java.lang.String target, int start, int end, Match match)
          Checks whether the target text contains this pattern in specified range or not.
 boolean matches(java.lang.String target, Match match)
          Checks whether the target text contains this pattern or not.
 void setPattern(java.lang.String newPattern)
           
 void setPattern(java.lang.String newPattern, java.lang.String options)
           
 java.lang.String toString()
          Returns a String representation of this instance.
 
Methods inherited from class java.lang.Object
getClass, notify, notifyAll, wait, wait, wait
 

Constructor Detail

RegularExpression

public RegularExpression(java.lang.String regex)
                  throws ParseException
Creates a new RegularExpression instance.

Parameters:
regex - A regular expression
Throws:
ParseException - if regex does not conform to the syntax.

RegularExpression

public RegularExpression(java.lang.String regex,
                         java.lang.String options)
                  throws ParseException
Creates a new RegularExpression instance with options.

Parameters:
regex - A regular expression
options - A String consisting of the characters "i" "m" "s" "u" "w" "," "X"
Throws:
ParseException - if regex does not conform to the syntax.
Method Detail

matches

public boolean matches(char[] target)
Checks whether the target text contains this pattern or not.

Returns:
true if the target matches this regular expression.

matches

public boolean matches(char[] target,
                       int start,
                       int end)
Checks whether the target text contains this pattern in specified range or not.

Parameters:
start - Start offset of the range.
end - End offset of the range, exclusive (the index of the last character in the range plus one).
Returns:
true if the target matches this regular expression.
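
The start and end parameters describe a half-open range: start is included and end is not, so end - start is the number of characters examined. The sketch below shows the same convention with java.util.regex, whose Matcher.region(start, end) also takes an exclusive end. It is a stand-in, not the Xerces engine, and it makes no claim about how Xerces treats anchors or lookbehind at the range boundaries. The class and method names are hypothetical.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Stand-in sketch using java.util.regex, NOT the Xerces RegularExpression class.
final class RangeSearch {
    /** Searches the characters text[start] .. text[end - 1] for regex. */
    static boolean containsInRange(String regex, String text, int start, int end) {
        Matcher m = Pattern.compile(regex).matcher(text);
        m.region(start, end); // end is exclusive, as in matches(target, start, end)
        return m.find();
    }

    public static void main(String[] args) {
        String text = "hello world";
        System.out.println(containsInRange("world", text, 6, 11)); // range covers "world"
        System.out.println(containsInRange("world", text, 6, 10)); // range covers "worl"
    }
}
```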

matches

public boolean matches(char[] target,
                       Match match)
Checks whether the target text contains this pattern or not.

Parameters:
match - A Match instance for storing matching result.
Returns:
true if the target matches this regular expression; otherwise false. On success, the match positions are stored in match.

matches

public boolean matches(char[] target,
                       int start,
                       int end,
                       Match match)
Checks whether the target text contains this pattern in specified range or not.

Parameters:
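start - Start offset of the range.
end - End offset of the range, exclusive (the index of the last character in the range plus one).
match - A Match instance for storing the matching result.
Returns:
true if the target matches this regular expression; otherwise false. On success, the match positions are stored in match.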

matches

public boolean matches(java.lang.String target)
Checks whether the target text contains this pattern or not.

Returns:
true if the target matches this regular expression.

matches

public boolean matches(java.lang.String target,
                       int start,
                       int end)
Checks whether the target text contains this pattern in specified range or not.

Parameters:
start - Start offset of the range.
end - End offset of the range, exclusive (the index of the last character in the range plus one).
Returns:
true if the target matches this regular expression.

matches

public boolean matches(java.lang.String target,
                       Match match)
Checks whether the target text contains this pattern or not.

Parameters:
match - A Match instance for storing matching result.
Returns:
true if the target matches this regular expression; otherwise false. On success, the match positions are stored in match.

matches

public boolean matches(java.lang.String target,
                       int start,
                       int end,
                       Match match)
Checks whether the target text contains this pattern in specified range or not.

Parameters:
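start - Start offset of the range.
end - End offset of the range, exclusive (the index of the last character in the range plus one).
match - A Match instance for storing the matching result.
Returns:
true if the target matches this regular expression; otherwise false. On success, the match positions are stored in match.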

matches

public boolean matches(java.text.CharacterIterator target)
Checks whether the target text contains this pattern or not.

Returns:
true if the target matches this regular expression.

matches

public boolean matches(java.text.CharacterIterator target,
                       Match match)
Checks whether the target text contains this pattern or not.

Parameters:
match - A Match instance for storing matching result.
Returns:
true if the target matches this regular expression; otherwise false. On success, the match positions are stored in match.

setPattern

public void setPattern(java.lang.String newPattern)
                throws ParseException
Throws:
ParseException

setPattern

public void setPattern(java.lang.String newPattern,
                       java.lang.String options)
                throws ParseException
Throws:
ParseException

getPattern

public java.lang.String getPattern()

toString

public java.lang.String toString()
Returns a String representation of this instance.

Overrides:
toString in class java.lang.Object

getOptions

public java.lang.String getOptions()
Returns the option string. The order of the letters in it may differ from the string specified in the constructor or setPattern().

See Also:
RegularExpression(java.lang.String,java.lang.String), setPattern(java.lang.String,java.lang.String)
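
Since the letter order of the returned string is not guaranteed, comparing it to the original options with String.equals() may fail even when the options are the same. One way to compare two option strings is to normalize both first, as in this minimal sketch. The class OptionStrings is hypothetical, and treating null as "no options" and ignoring repeated letters are assumptions of the sketch.

```java
import java.util.TreeSet;

// Hypothetical helper, not part of Xerces.
final class OptionStrings {
    /** Returns the option letters sorted and de-duplicated; null is treated as "". */
    static String canonical(String options) {
        TreeSet<Character> letters = new TreeSet<>();
        if (options != null) {
            for (char c : options.toCharArray()) {
                letters.add(c);
            }
        }
        StringBuilder sb = new StringBuilder();
        for (char c : letters) {
            sb.append(c);
        }
        return sb.toString();
    }

    /** True if both strings contain the same set of option letters. */
    static boolean sameOptions(String a, String b) {
        return canonical(a).equals(canonical(b));
    }

    public static void main(String[] args) {
        System.out.println(sameOptions("im", "mi"));
    }
}
```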

equals

public boolean equals(java.lang.Object obj)
Returns true if the patterns are the same and the options are equivalent.

Overrides:
equals in class java.lang.Object

hashCode

public int hashCode()
Overrides:
hashCode in class java.lang.Object

getNumberOfGroups

public int getNumberOfGroups()
Returns the number of regular expression groups. This method returns 1 when the regular expression has no capturing parentheses, because the entire match counts as group 0.
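
The counting convention can be illustrated with java.util.regex, where Matcher.groupCount() excludes group 0, so adding one gives a count that includes the entire match. This is a stand-in sketch only: it uses the JDK engine, not Xerces, and assumes the two engines number plain capturing groups the same way. The class and method names are hypothetical.

```java
import java.util.regex.Pattern;

// Stand-in sketch using java.util.regex, NOT the Xerces RegularExpression class.
final class GroupCounter {
    /** Number of groups including group 0 (the entire match). */
    static int numberOfGroups(String regex) {
        // groupCount() counts capturing groups only, so add one for group 0.
        return Pattern.compile(regex).matcher("").groupCount() + 1;
    }

    public static void main(String[] args) {
        System.out.println(numberOfGroups("abc"));
        System.out.println(numberOfGroups("(a)(b)"));
    }
}
```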