

Regular Expressions

 

Regular expressions are a very powerful way of matching patterns in text.  They are also quite hard to understand at first.

You can use regular expressions as a way of picking words for a Fast Concordance. To enter a regular expression, go to the Text menu and choose Configure Regular expression search, or open the Make Fast Concordance dialog and press the Edit button next to Regex.

To make your concordance, open the Make Fast Concordance dialog if it is not already open (go to the File menu and choose Make Fast Concordance from Files or Make Fast Concordance from Clipboard.)  In the Fast Concordance dialog, look at Word Selection Method and ensure Regex is selected.  Then press the Make Fast Concordance button.

This is a long help topic. You might prefer to start with the section Some simple examples, just after the Introduction.
___________________________________________________


Introduction

Regular expressions are a widely-used method of specifying patterns of text to search for. Special metacharacters allow you to specify, for instance, that a particular string you are looking for occurs at the beginning or end of a line, or contains n recurrences of a certain character.

To enter a Regular Expression, go to the Text menu and choose Configure Regular Expression search, or open the Make Fast Concordance dialog and press the Edit button next to Regex.

Concordance allows you to choose whether to match your regular expression against single words which have already been identified by the program, or (more normally) against whole lines of input. 

When Concordance displays the results of a regular expression search, the entire matched expression appears as a 'headword'.  

When Concordance displays the results of a regular expression search matched against lines, the displayed contexts are the actual line even if you have chosen contexts of variable length (for more on these, see Context Styles).

When using regular expressions, the option to analyse characters instead of words (Text -> Special) is ignored.

___________________________________________________



Some simple examples

These examples, except where indicated otherwise, assume you have chosen to match against lines, not words.

Example:

[A-Z][a-z]+  matches a capital letter followed by one or more lower-case letters, that is, words beginning with a capital letter.

So, if you define the regular expression

   the [A-Z][a-z]+ of [A-Z][a-z]+

and make a Fast Concordance to the installed sample file Demo4.txt, you will get these results:

the Carriage of Trees
the Country of Lilliput
the Demands of Nature
the Help of Ladders
the Hogsheads of Wine
the Laws of Hospitality
the Promise of Honour

- which, you may agree, is rather pleasing.  

(Note that this result could also have been achieved using a Phrase search.)
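
If you would like to try this pattern outside Concordance, the following is a minimal sketch in Python. It assumes Python's built-in re module, which is not the engine Concordance uses but handles this particular pattern in the same way. The sample lines are invented for illustration and are not the contents of Demo4.txt.

```python
import re

# Hypothetical sample lines (not taken from Demo4.txt).
lines = [
    "He granted me the Promise of Honour that day.",
    "they sent me the best of their wine",
    "We reached the Country of Lilliput at dawn.",
]

pattern = re.compile(r"the [A-Z][a-z]+ of [A-Z][a-z]+")

# Collect every match on every line, much as a line-based search would.
headwords = [m.group(0) for line in lines for m in pattern.finditer(line)]
print(headwords)
```

The second line produces no match because neither word after "the" and "of" begins with a capital letter.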

More simple examples:

[a-zA-Z]+ matches words of any length containing only lower- and upper-case alphabetic (English) letters.
 
[a-zA-Z]+$ matches the final word in lines which have no punctuation at the end.  (If you are analysing verse, these would be possible enjambed lines.)

[a-zA-Z\s]+ matches any number of words bounded by punctuation or a line end.
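
The last two patterns can be sketched in Python as follows. This assumes Python's re module rather than Concordance's own engine, and the two sample lines are invented.

```python
import re

line_plain = "the river runs on to the sea"
line_punct = "Stop here, and rest a while."

# [a-zA-Z]+$ : the final word, but only when nothing follows it on the line.
last_plain = re.search(r"[a-zA-Z]+$", line_plain)
last_punct = re.search(r"[a-zA-Z]+$", line_punct)  # None: the line ends in '.'

# [a-zA-Z\s]+ : runs of letters and spaces, broken wherever punctuation occurs.
runs = re.findall(r"[a-zA-Z\s]+", line_punct)

print(last_plain.group(0), last_punct, runs)
```

Note that the second run begins with the space that follows the comma, since \s is part of the class.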
___________________________________________________


Regular Expressions: the full story

1. Simple matches

Any single character matches itself, unless it is a metacharacter with a special meaning described below.

A series of characters matches that series of characters in the target string, so the pattern "bluh" would match "bluh" in the target string.

You can cause characters that normally function as metacharacters or escape sequences to be interpreted literally by 'escaping' them, that is, preceding them with a backslash "\".  For instance, the metacharacter "^" matches the beginning of a string, but "\^" matches the character "^", "\\" matches "\", and so on.

Examples:
  foobar          matches string 'foobar'
  \^FooBarPtr     matches '^FooBarPtr'
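
A short Python sketch of escaping, assuming Python's re module (not Concordance's engine) and invented sample strings. The re.escape helper is a Python convenience and has no counterpart in Concordance.

```python
import re

# Unescaped, ^ anchors the match to the start of the string,
# so this finds nothing: the string starts with 'a', not 'Foo'.
anchored = re.search(r"^Foo", "a ^Foo")

# Escaped, \^ matches a literal caret character.
literal = re.search(r"\^FooBarPtr", "x = ^FooBarPtr;")

# \\ matches a single literal backslash.
backslash = re.search(r"\\", r"C:\temp")

# Python can build the escaped form of arbitrary text for you.
escaped = re.escape("^FooBarPtr")

print(anchored, literal.group(0), backslash.group(0), escaped)
```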


2. Character classes

You can specify a character class by enclosing a list of characters in []. The class matches any one character from the list.

If the first character after the "[" is "^", the class matches any character not in the list.

Examples:
  foob[aeiou]r   finds strings 'foobar', 'foober' etc. but not 'foobbr', 'foobcr' etc.
  foob[^aeiou]r  finds strings 'foobbr', 'foobcr' etc. but not 'foobar', 'foober' etc.

Within a list, the "-" character is used to specify a range, so that a-z represents all characters between "a" and "z", inclusive.

If you want "-" itself to be a member of a class, put it at the start or end of the list, or escape it with a backslash. If you want "]" to be a member, place it at the start of the list or escape it with a backslash.

Examples:
[-az]      matches 'a', 'z' and '-'
[az-]      matches 'a', 'z' and '-'
[a\-z]     matches 'a', 'z' and '-'
[a-z]      matches all twenty-six lower-case English characters from 'a' to 'z'
[a-zàáèé]        matches the 26 English characters, as above, plus the four accented characters shown
[\d-t]     matches any digit, '-' or 't'. 
[]-a]      matches any character from ']'..'a'.
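
The behaviour of lists, negated lists and the hyphen can be checked with a small Python sketch. It assumes Python's re module, which agrees with the rules above for these simple cases. The test strings are invented.

```python
import re

words = ["foobar", "foober", "foobbr", "foobcr"]

# [aeiou] accepts exactly one listed vowel in that position.
vowel = [w for w in words if re.fullmatch(r"foob[aeiou]r", w)]
# [^aeiou] accepts exactly one character that is not a listed vowel.
non_vowel = [w for w in words if re.fullmatch(r"foob[^aeiou]r", w)]

# A hyphen at the start of the list, or escaped, is a literal member.
hyphen_first = re.findall(r"[-az]", "a-m-z")
hyphen_escaped = re.findall(r"[a\-z]", "a-m-z")
# Between two characters, the hyphen denotes a range.
hyphen_range = re.findall(r"[a-z]", "a-m-z")

print(vowel, non_vowel, hyphen_first, hyphen_escaped, hyphen_range)
```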


3.  Metacharacters

Metacharacters are special characters which are the essence of Regular Expressions. There are different types of metacharacters, described below.

Metacharacters - line separators

  ^      start of line
  $      end of line
  .      any character in line

Examples:
  ^foobar     matches string 'foobar' only if it's at the beginning of line
  foobar$     matches string 'foobar' only if it's at the end of line
  ^foobar$    matches string 'foobar' only if it's the only string in line
  foob.r      matches strings like 'foobar', 'foobbr', 'foob1r' and so on
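
As a rough illustration, the four examples can be run against a few invented lines in Python. This assumes Python's re module, with each string treated as one line of input.

```python
import re

lines = ["foobar", "foobar baz", "baz foobar", "foob1r"]

at_start = [s for s in lines if re.search(r"^foobar", s)]
at_end = [s for s in lines if re.search(r"foobar$", s)]
whole_line = [s for s in lines if re.search(r"^foobar$", s)]
# The dot stands for any single character, here 'a' or '1'.
any_char = [s for s in lines if re.search(r"foob.r", s)]

print(at_start, at_end, whole_line, any_char)
```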


Metacharacters - predefined classes

  \w     an alphanumeric character (including "_")
  \W     a non-alphanumeric
  \d     a numeric character
  \D     a non-numeric
  \s     any space 
  \S     a non-space

You may use \w, \d and \s within custom character classes.

Note that the ability to search for a space is quite alien to what Concordance does in all other modes. With regular expressions, you can have spaces in headwords, or even spaces as 'headwords'.

Examples:
  foob\dr matches strings like 'foob1r', 'foob6r' and so on but not 'foobar', 'foobbr' and so on
  foob[\w\s]r matches strings like 'foobar', 'foob r', 'foob1r' and so on but not 'foob=r', 'foob-r' and so on
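
The predefined classes can be compared side by side in a Python sketch. It assumes Python's re module, where \w likewise covers letters, digits and "_" (Python also accepts non-English letters for \w, which may differ from Concordance). The candidate strings are invented.

```python
import re

candidates = ["foob1r", "foob6r", "foobar", "foob r", "foob_r", "foob=r"]

# \d : digits only.
digit = [s for s in candidates if re.fullmatch(r"foob\dr", s)]
# [\w\s] : a letter, digit or underscore, or a space character.
word_or_space = [s for s in candidates if re.fullmatch(r"foob[\w\s]r", s)]
# \S : anything except a space character.
non_space = [s for s in candidates if re.fullmatch(r"foob\Sr", s)]

print(digit, word_or_space, non_space)
```

Because \w includes digits, 'foob1r' is accepted by the second pattern; only 'foob=r' is rejected.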


Metacharacters - iterators

Any item of a regular expression may be followed by another type of metacharacter - an iterator. Using these metacharacters you can specify the number of occurrences of the previous character, metacharacter or subexpression.

  * zero or more ("greedy"), similar to {0,}
  + one or more ("greedy"), similar to {1,}
  ? zero or one ("greedy"), similar to {0,1}
  {n} exactly n times ("greedy")
  {n,} at least n times ("greedy")
  {n,m} at least n but not more than m times ("greedy")
  *? zero or more ("non-greedy"), similar to {0,}?
  +? one or more ("non-greedy"), similar to {1,}?
  ?? zero or one ("non-greedy"), similar to {0,1}?
  {n}? exactly n times ("non-greedy")
  {n,}? at least n times ("non-greedy")
  {n,m}? at least n but not more than m times ("non-greedy")

In the form {n,m}, n is the minimum number of times the item must match and m is the maximum. The form {n} is equivalent to {n,n} and matches exactly n times. The form {n,} matches n or more times. There is no limit to the size of n or m, but large numbers use more memory and slow down execution.

If a curly bracket occurs in any other context, it is treated as a regular character.

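"Greedy" operators take as many as possible, "non-greedy" take as few as possible. For example, applied to the string 'abbbbc', 'b+' returns 'bbbb', 'b+?' returns 'b', 'b{2,3}?' returns 'bb' and 'b{2,3}' returns 'bbb'. 'b*?' returns an empty string. 'b*' takes 'bbbb' when matching starts at the first 'b', but because it can also match zero characters, a search from the start of the string first finds an empty match before the 'a'.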

Examples:
foob.*r     matches strings like 'foobar',  'foobalkjdflkj9r' and 'foobr'
foob.+r     matches strings like 'foobar', 'foobalkjdflkj9r' but not 'foobr'
foob.?r     matches strings like 'foobar', 'foobbr' and 'foobr' but not 'foobalkj9r'
fooba{2}r   matches the string 'foobaar'
fooba{2,}r  matches strings like 'foobaar', 'foobaaar', 'foobaaaar' etc.
fooba{2,3}r matches strings like 'foobaar', or 'foobaaar'  but not 'foobaaaar'
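
The difference between greedy and non-greedy iterators, and the effect of bounded repetition, can be sketched in Python. This assumes Python's re module, which uses the same iterator syntax. The word list is invented.

```python
import re

text = "abbbbc"

greedy_plus = re.search(r"b+", text).group(0)       # takes all four b's
lazy_plus = re.search(r"b+?", text).group(0)        # stops after one b
greedy_range = re.search(r"b{2,3}", text).group(0)  # takes three b's
lazy_range = re.search(r"b{2,3}?", text).group(0)   # stops after two b's

# Bounded repetition applies to the single item before the braces.
words = ["foobr", "foobar", "foobaar", "foobaaar", "foobaaaar"]
two_or_three = [w for w in words if re.fullmatch(r"fooba{2,3}r", w)]
optional_char = [w for w in words if re.fullmatch(r"foob.?r", w)]

print(greedy_plus, lazy_plus, greedy_range, lazy_range)
print(two_or_three, optional_char)
```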


Metacharacters - alternatives

You can specify a series of alternatives for a pattern using "|" to separate them, so that fee|fie|foe will match any of "fee", "fie", or "foe" in the target string (as would f(e|i|o)e). The first alternative includes everything from the last pattern delimiter ("(", "[", or the beginning of the pattern) up to the first "|", and the last alternative contains everything from the last "|" to the next pattern delimiter. For this reason, it's common practice to include alternatives in parentheses, to minimize confusion about where they start and end.

Alternatives are tried from left to right, so the first alternative found for which the entire expression matches is the one that is chosen. This means that alternatives are not necessarily greedy. For example: when matching foo|foot against "barefoot", only the "foo" part will match, as that is the first alternative tried, and it successfully matches the target string. (This might not seem important, but it is important when you are capturing matched text using parentheses.)

Also remember that "|" is interpreted as a literal within square brackets, so if you write [fee|fie|foe] you're really only matching [feio|].

Examples:
  foo(bar|foo)  matches strings 'foobar' or 'foofoo'.
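
The left-to-right rule and the treatment of "|" inside square brackets can be observed in a short Python sketch. It assumes Python's re module, which follows the same rules for alternatives. The test strings other than "barefoot" are invented.

```python
import re

# The leftmost alternative that works wins: 'foo' is tried before 'foot'.
first = re.search(r"foo|foot", "barefoot").group(0)
# Reordering the alternatives changes the result.
longer = re.search(r"foot|foo", "barefoot").group(0)

# Parentheses limit how far each alternative extends.
grouped = [m.group(0) for m in re.finditer(r"f(e|i|o)e", "fee fie foe fum")]

# Inside square brackets, | is just another literal character.
in_class = re.findall(r"[fee|fie|foe]", "f|x")

print(first, longer, grouped, in_class)
```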

Metacharacters - subexpressions

The bracketing construct ( ... ) may be used to define subexpressions.

Subexpressions are numbered based on the left to right order of their opening parenthesis. The first subexpression is numbered '1'.

Examples:
(foobar){8,10}   matches strings which contain 8, 9 or 10 consecutive instances of 'foobar'
foob([0-9]|a+)r  matches 'foob0r', 'foob1r', 'foobar', 'foobaar', 'foobaaar' etc.
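
Repetition of a whole subexpression and the numbering of subexpressions can be sketched in Python. This assumes Python's re module. The pattern (foo){2,3} and the sample strings are invented, using shorter counts than the example above to keep the strings readable.

```python
import re

# An iterator after a subexpression repeats the whole subexpression.
repeated = re.fullmatch(r"(foo){2,3}", "foofoofoo") is not None
too_few = re.fullmatch(r"(foo){2,3}", "foo") is not None

# Subexpressions are numbered by the position of their opening parenthesis.
m = re.search(r"((\d+)-(\d+))", "pages 12-34")
outer, first, second = m.group(1), m.group(2), m.group(3)

# Alternatives inside a subexpression: one digit, or one or more 'a'.
words = ["foob0r", "foobar", "foobaaar", "foob00r"]
matched = [w for w in words if re.fullmatch(r"foob([0-9]|a+)r", w)]

print(repeated, too_few, outer, first, second, matched)
```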


Metacharacters - back-references

Metacharacters \1 through \9 are interpreted as back-references. \<n> matches the previously-matched subexpression #<n>.

Examples:
(.)\1+ matches 'aaaa' and 'cc'. 
(.+)\1+ also matches 'abab' and '123123'
(['"]?)(\d+)\1 matches "13" (in double quotes), '4' (in single quotes), 77 (without quotes) etc.
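
The three back-reference examples can be tried in Python as follows. This assumes Python's re module, which uses the same \1 notation. The surrounding test strings are invented.

```python
import re

# (.)\1+ : a character followed by one or more copies of itself.
runs = [m.group(0) for m in re.finditer(r"(.)\1+", "aaaa b cc")]

# (.+)\1+ : a sequence followed by one or more repeats of itself.
repeats = [m.group(0) for m in re.finditer(r"(.+)\1+", "abab 123123")]

# The closing quote must be the same as the opening quote, or both absent.
quoted = re.compile(r"""(['"]?)(\d+)\1""")
numbers = [quoted.fullmatch(s).group(2) for s in ['"13"', "'4'", "77"]]
mismatched = quoted.fullmatch("\"13'")  # opens with " but closes with '

print(runs, repeats, numbers, mismatched)
```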


4. Escape sequences

Characters may be specified using an escape sequence syntax:  \xnn  where nn is a string of hexadecimal digits.

  \xnn       matches the character with the ANSI value nn. 

Example:
  foo\x20bar   matches 'foo bar' (note space in the middle)
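
A minimal Python check of the same escape, assuming Python's re module. Python expects exactly two hexadecimal digits after \x, which may differ from Concordance.

```python
import re

# \x20 is the space character, so the pattern requires 'foo bar'.
with_space = re.fullmatch(r"foo\x20bar", "foo bar") is not None
without_space = re.fullmatch(r"foo\x20bar", "foobar") is not None

# chr(0x20) shows which character the hexadecimal value refers to.
code_point = chr(0x20)

print(with_space, without_space, repr(code_point))
```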


___________________________________________________


Credits for Regular Expressions:

Delphi implementation of regular expressions by Andrey V. Sorokin, St. Petersburg, Russia, http://anso.da.ru.  Additional examples and documentation by Kit Eason.  Based on C source written by Henry Spencer, University of Toronto, 1986, and donated to the public domain.

See also:  Make Fast Concordance