1.1 Regex Syntax Summary
- Character: All characters, except those having special meaning in regex, match themselves. E.g., the regex `x` matches the substring "x"; regex `9` matches "9"; regex `=` matches "="; and regex `@` matches "@".
- Special Regex Characters: These characters have special meaning in regex (to be discussed below): `.`, `+`, `*`, `?`, `^`, `$`, `(`, `)`, `[`, `]`, `{`, `}`, `|`, `\`.
- Escape Sequences (`\char`):
  - To match a character having special meaning in regex, you need to use an escape sequence, i.e., prefix the character with a backslash (`\`). E.g., `\.` matches "."; regex `\+` matches "+"; and regex `\(` matches "(".
  - You also need to use regex `\\` to match "\" (backslash).
  - Regex recognizes common escape sequences such as `\n` for newline, `\t` for tab, `\r` for carriage return, `\nnn` for an octal number of up to 3 digits, `\xhh` for a two-digit hex code, `\uhhhh` for a 4-digit Unicode code point, and `\Uhhhhhhhh` for an 8-digit Unicode code point (availability varies by regex flavor).
- A Sequence of Characters (or String): Strings can be matched by combining a sequence of characters (called sub-expressions). E.g., the regex `Saturday` matches "Saturday". Matching is case-sensitive by default, but can be made case-insensitive via a modifier.
- OR Operator (`|`): E.g., the regex `four|4` accepts the strings "four" or "4".
- Character Class (or Bracket List):
  - `[…]`: Accept ANY ONE of the characters within the square brackets, e.g., `[aeiou]` matches "a", "e", "i", "o" or "u".
  - `[.-.]` (Range Expression): Accept ANY ONE of the characters in the range, e.g., `[0-9]` matches any digit; `[A-Za-z]` matches any uppercase or lowercase letter.
  - Only these four characters require an escape sequence inside the bracket list: `^`, `-`, `]`, `\`.
- Occurrence Indicators (or Repetition Operators):
  - `+`: one or more (`1+`), e.g., `[0-9]+` matches one or more digits such as '123', '000'.
  - `*`: zero or more (`0+`), e.g., `[0-9]*` matches zero or more digits. It accepts everything that `[0-9]+` accepts, plus the empty string.
  - `?`: zero or one (optional), e.g., `[+-]?` matches an optional "+", "-", or an empty string.
  - `{m,n}`: `m` to `n` times (both inclusive)
  - `{m}`: exactly `m` times
  - `{m,}`: `m` or more (`m+`)
- Metacharacters: each matches a single character
  - `.` (dot): ANY ONE character except newline. Same as `[^\n]`
  - `\d`, `\D`: ANY ONE digit/non-digit character. Digits are `[0-9]`
  - `\w`, `\W`: ANY ONE word/non-word character. For ASCII, word characters are `[a-zA-Z0-9_]`
  - `\s`, `\S`: ANY ONE space/non-space character. For ASCII, whitespace characters are `[ \n\r\t\f]`
- Position Anchors: these do not match a character, but a position such as start-of-line, end-of-line, start-of-word and end-of-word.
  - `^`, `$`: start-of-line and end-of-line respectively. E.g., `^[0-9]+$` matches a numeric string.
  - `\b`: boundary of word, i.e., start-of-word or end-of-word. E.g., `\bcat\b` matches the word "cat" in the input string.
  - `\B`: inverse of `\b`, i.e., non-start-of-word or non-end-of-word.
  - `\<`, `\>`: start-of-word and end-of-word respectively, similar to `\b`. E.g., `\<cat\>` matches the word "cat" in the input string.
  - `\A`, `\Z`: start-of-input and end-of-input respectively.
- Parenthesized Back References:
  - Use parentheses `( )` to create a back reference.
  - Use `$1`, `$2`, … (Java, Perl, JavaScript) or `\1`, `\2`, … (Python) to retrieve the back references in sequential order.
- Laziness (Curb Greediness for Repetition Operators): `*?`, `+?`, `??`, `{m,n}?`, `{m,}?`
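
Several of the constructs summarized above can be tried out with a short script. The following is a minimal sketch using Python's standard `re` module; the sample input strings and variable names are made up for illustration. Other regex flavors may differ in places (for instance, Python's `re` does not support `\<` and `\>`, so `\b` is used here).

```python
import re

# Character class: ANY ONE of the listed characters.
vowels = re.findall(r"[aeiou]", "regex")

# Optional sign followed by one or more digits.
signed = re.findall(r"[+-]?[0-9]+", "-12 +7 30")

# OR operator: accept either alternative.
alternatives = re.findall(r"four|4", "four score and 4 more")

# Escape sequence: \. matches a literal dot, not "any character".
literal_dots = re.findall(r"\.", "a.b.c")

# Word boundaries: match "cat" only as a whole word.
whole_words = re.findall(r"\bcat\b", "cat concat cat's category")

# Greedy vs lazy repetition on the same input.
greedy = re.search(r"<.+>", "<b>bold</b>").group()
lazy = re.search(r"<.+?>", "<b>bold</b>").group()

# Back references: \1 and \2 refer to the first and second groups.
swapped = re.sub(r"(\w+) (\w+)", r"\2 \1", "hello world")

print(vowels, signed, alternatives, literal_dots, whole_words)
print(greedy, lazy, swapped)
```

The greedy pattern `<.+>` consumes as much as it can and returns the whole `<b>bold</b>`, while the lazy `<.+?>` stops at the first `>` and returns only `<b>`.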
1.2 Example: Numbers [0-9]+ or \d+
- A regex (regular expression) consists of a sequence of sub-expressions. In this example, `[0-9]` and `+`.
- The `[...]`, known as a character class (or bracket list), encloses a list of characters. It matches any SINGLE character in the list. In this example, `[0-9]` matches any SINGLE character between 0 and 9 (i.e., a digit), where the dash (`-`) denotes the range.
- The `+`, known as an occurrence indicator (or repetition operator), indicates one or more occurrences (`1+`) of the previous sub-expression. In this case, `[0-9]+` matches one or more digits.
- A regex may match a portion of the input (i.e., a substring) or the entire input. In fact, it could match zero or more substrings of the input (with the global modifier).
- This regex matches any numeric substring (of digits 0 to 9) of the input. For example,
  - If the input is "abc123xyz", it matches the substring "123".
  - If the input is "abcxyz", it matches nothing.
  - If the input is "abc00123xyz456_0", it matches the substrings "00123", "456" and "0" (three matches).

  Take note that this regex matches numbers with leading zeros, such as "000", "0123" and "0001", which may not be desirable.
- You can also write `\d+`, where `\d` is known as a metacharacter that matches any digit (same as `[0-9]`). There is more than one way to write a regex! Take note that many programming languages (C, Java, JavaScript, Python) use the backslash `\` as the prefix for escape sequences in string literals (e.g., `\n` for newline), so you need to write `"\\d+"` instead.
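
The behaviour described in this example can be checked with a small script. This is a sketch in Python using the standard `re` module; the variable names are illustrative only, and `findall()` plays the role of the global modifier mentioned above.

```python
import re

# The pattern from this section: one or more characters in the range 0-9.
number = re.compile(r"[0-9]+")

# findall() returns every non-overlapping match in the input.
one_match = number.findall("abc123xyz")
no_match = number.findall("abcxyz")
three_matches = number.findall("abc00123xyz456_0")

# search() accepts a match anywhere in the input;
# fullmatch() succeeds only if the entire input matches.
matches_substring = number.search("abc123xyz") is not None
matches_entire_input = number.fullmatch("abc123xyz") is not None

# In an ordinary string literal the backslash must be doubled;
# a raw string (r"...") lets you write the regex as-is.
same_pattern = "\\d+" == r"\d+"
with_metachar = re.findall(r"\d+", "abc00123xyz456_0")

print(one_match, no_match, three_matches)
print(matches_substring, matches_entire_input, same_pattern)
```

In Python 3, `\d` applied to a `str` also matches non-ASCII decimal digits unless the `re.ASCII` flag is set, so `\d+` and `[0-9]+` are interchangeable only for ASCII input.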