(r2) RegularExpression < TWiki

Regular expressions (REs), unlike simple queries, allow you to search for text which matches a particular pattern. 

REs are similar to (but more powerful than) the "wildcards" used in the command-line interfaces found in operating systems such as Unix and MS-DOS. REs are used by sophisticated search engines, as well as by many Unix-based languages and tools ( e.g., =awk=, =grep=, =lex=, =perl=, and =sed= ).
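
The difference from wildcards can be seen in a short sketch. It uses Python's standard =fnmatch= and =re= modules purely for illustration (an assumption of this example; this topic does not prescribe a language). As a wildcard, =*= stands for any run of characters, whereas in an RE it repeats the preceding item.

```python
import fnmatch
import re

# Shell-style wildcard: "*" stands for any run of characters.
wildcard_hit = fnmatch.fnmatchcase("bugfix", "bug*")

# Regular expression: "*" means "zero or more of the preceding item",
# so bug* is "bu" followed by any number of "g" characters.
regex_full = re.fullmatch(r"bug*", "bugfix")            # whole string must match
regex_bu = re.fullmatch(r"bug*", "bu")                  # zero "g" characters
regex_like_wildcard = re.fullmatch(r"bug.*", "bugfix")  # ".*" = any run of characters

print(wildcard_hit)                      # True
print(regex_full is not None)            # False
print(regex_bu is not None)              # True
print(regex_like_wildcard is not None)   # True
```

The RE counterpart of the wildcard =*= is therefore =.*= (any character, repeated), not a bare =*=.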

*Examples*

<TABLE>
  <TR>
	 <TD>
		compan(y|ies)
	 </TD><TD>
		Search for _company_ , _companies_
	 </TD>
  </TR><TR>
	 <TD>
		(peter|paul)
	 </TD><TD>
		Search for _peter_ , _paul_
	 </TD>
  </TR><TR>
	 <TD>
		bug*
	 </TD><TD>
		Matches _bu_ followed by zero or more _g_ characters; a search therefore finds _bug_ , _bugs_ , _bugfix_ (and also _bus_ )
	 </TD>
  </TR><TR>
	 <TD>
		[Bb]ag
	 </TD><TD>
		Search for _Bag_ , _bag_
	 </TD>
  </TR><TR>
	 <TD>
		b[aiueo]g
	 </TD><TD>
		Second letter is a vowel. Matches _bag_ , _bug_ , _big_
	 </TD>
  </TR><TR>
	 <TD>
		b.g
	 </TD><TD>
		Second character is any character. Also matches _b&g_
	 </TD>
  </TR><TR>
	 <TD>
		[a-zA-Z]
	 </TD><TD>
		Matches any one letter (not a digit or a symbol)
	 </TD>
  </TR><TR>
	 <TD>
		[^0-9a-zA-Z]
	 </TD><TD>
		Matches any one character that is neither a digit nor a letter (a symbol or whitespace)
	 </TD>
  </TR><TR>
	 <TD>
		[A-Z][A-Z]*
	 </TD><TD>
		Matches one or more uppercase letters
	 </TD>
  </TR><TR>
	 <TD>
		[0-9][0-9][0-9]-[0-9][0-9]- <br> [0-9][0-9][0-9][0-9]
	 </TD><TD VALIGN="top">
		US social security number, e.g. 123-45-6789
	 </TD>
  </TR>
</TABLE>
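
The patterns in the table can be tried out with a few lines of code. The following is a minimal sketch using Python's standard =re= module; the helper name =found= and the sample strings are assumptions made for this illustration, and =re.search= is used to mimic a search that matches anywhere in the text.

```python
import re

def found(pattern, text):
    """Return True if pattern matches anywhere in text (search semantics)."""
    return re.search(pattern, text) is not None

# Alternation and grouping
assert found(r"compan(y|ies)", "two companies merged")
assert found(r"(peter|paul)", "ask paul")

# Character classes
assert found(r"[Bb]ag", "Bag") and found(r"[Bb]ag", "bag")
assert found(r"b[aiueo]g", "big") and not found(r"b[aiueo]g", "b&g")
assert found(r"b.g", "b&g")                 # "." accepts any character
assert not found(r"[a-zA-Z]", "1234")       # no letter present
assert found(r"[^0-9a-zA-Z]", "a+b")        # "+" is neither digit nor letter

# One or more uppercase letters, checked against the whole string
assert re.fullmatch(r"[A-Z][A-Z]*", "TWIKI")
assert not re.fullmatch(r"[A-Z][A-Z]*", "TWiki")

# Social security number layout
ssn = r"[0-9][0-9][0-9]-[0-9][0-9]-[0-9][0-9][0-9][0-9]"
print(re.findall(ssn, "IDs: 123-45-6789 and 12-345-6789"))  # ['123-45-6789']
```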

Here is stuff for our UNIX freaks: <BR>
(copied from 'man grep')

<pre>
	  \c	A backslash (\) followed by any special character is  a
			 one-character  regular expression that matches the spe-
			 cial character itself.  The special characters are:

					+	 `.', `*', `[',  and  `\'  (period,  asterisk,
						  left  square  bracket, and backslash, respec-
						  tively), which  are  always  special,  except
						  when they appear within square brackets ([]).

					+	 `^' (caret or circumflex), which  is  special
						  at the beginning of an entire regular expres-
						  sion, or when it immediately follows the left
						  of a pair of square brackets ([]).

					+	 `$' (currency symbol), which is special at the
						  end of an entire regular expression.

	  .	 A `.' (period) is a  one-character  regular  expression
			 that matches any character except NEWLINE.
 
	  [string]
			 A non-empty string of  characters  enclosed  in  square
			 brackets  is  a  one-character  regular expression that
			 matches any one character in that string.  If, however,
			 the  first  character of the string is a `^' (a circum-
			 flex or caret), the  one-character  regular  expression
			 matches  any character except NEWLINE and the remaining
			 characters in the string.  The  `^'  has  this  special
			 meaning only if it occurs first in the string.  The `-'
			 (minus) may be used to indicate a range of  consecutive
			 ASCII  characters;  for example, [0-9] is equivalent to
			 [0123456789].  The `-' loses this special meaning if it
			 occurs  first (after an initial `^', if any) or last in
			 the string.  The `]' (right square  bracket)  does  not
			 terminate  such a string when it is the first character
			 within it (after an initial  `^',  if  any);  that  is,
			 []a-f]  matches either `]' (a right square bracket) or
			 one of the letters a through  f  inclusive.  The  four
			 characters  `.', `*', `[', and `\' stand for themselves
			 within such a string of characters.

	  The following rules may be used to construct regular expres-
	  sions:

	  *	 A one-character regular expression followed by `*'  (an
			 asterisk)  is a regular expression that matches zero or
			 more occurrences of the one-character regular expression.
			 If there is any choice, the longest leftmost string that
			 permits a match is chosen.

	  ^	 A circumflex or caret (^) at the beginning of an entire
			 regular  expression  constrains that regular expression
			 to match an initial segment of a line.

	  $	 A currency symbol ($) at the end of an  entire  regular
			 expression  constrains that regular expression to match
			 a final segment of a line.

	  *	 A  regular  expression  (not  just  a  one-character
			 regular expression) followed by `*' (an asterisk)
			 is a regular expression that matches zero or more
			 occurrences of that regular expression.  If there
			 is any choice, the longest leftmost string that
			 permits a match is chosen.

	  +	 A regular expression followed by `+' (a  plus
			 sign)  is  a  regular expression that matches
			 one or more occurrences of that regular
			 expression.  If there is any choice, the
			 longest leftmost string that permits a match
			 is chosen.

	  ?	 A regular expression followed by `?' (a question
			 mark) is a regular expression that matches zero
			 or one occurrence of that regular expression.
			 If there is any choice, the longest leftmost
			 string that permits a match is chosen.

	  |	 Alternation:  two  regular  expressions
			 separated  by  `|'  or NEWLINE match either a
			 match for  the  first  or  a  match  for  the
			 second.

	  ()	A regular expression enclosed in  parentheses
			 matches a match for the regular expression.

	  The order of precedence of operators at the same parenthesis
	  level  is  `[ ]'  (character  classes),  then  `*'  `+'  `?'
	  (closures), then concatenation, then `|' (alternation) and
	  NEWLINE.
</pre>
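
The rules quoted above can be checked with a short sketch. It assumes Python's standard =re= module, which is a backtracking engine rather than the =grep= implementation the excerpt describes; for the simple patterns shown here the two agree, but that is an assumption of this example and not a general guarantee. The sample strings are invented for illustration.

```python
import re

# A backslash makes a special character literal: "\." matches only a period.
assert re.search(r"3\.14", "pi is 3.14")
assert not re.search(r"3\.14", "3514")
assert re.search(r"3.14", "3514")            # unescaped "." matches any character

# "]" placed first inside brackets is an ordinary member of the class.
print(re.findall(r"[]a-f]", "x]bz"))         # [']', 'b']

# "^" and "$" anchor the match to the start and end of the text
# (each string here is a single line).
assert re.search(r"^bug", "bugfix")
assert not re.search(r"^bug", "debug")
assert re.search(r"bug$", "debug")

# Closures take the longest match at the leftmost position.
print(re.search(r"a*", "aaab").group())        # aaa
print(re.search(r"ab+", "xabbbc").group())     # abbb
print(re.search(r"colou?r", "color").group())  # color

# Precedence: concatenation binds tighter than "|", so parentheses are
# needed to limit the alternation to part of the pattern.
assert re.fullmatch(r"ab|cd", "ab") and not re.fullmatch(r"ab|cd", "abd")
assert re.fullmatch(r"a(b|c)d", "abd")
```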
Topic revision: r2 - 2000-08-23 - PeterThoeny
Copyright © 1999-2025 by the contributing authors. All material on this collaboration platform is the property of the contributing authors.
Note: Please contribute updates to this topic on TWiki.org at TWiki:TWiki.RegularExpression.