## CS 200: Regular Expressions in Python

This notebook mirrors the <a target=ee href="https://developer.google.com/edu/python/regular-expressions">Google Python Course: Regular Expressions</a>


Regular expressions are a pattern-matching language.  Formally, they describe the
regular languages, which are a proper subset of the context-free languages.  In addition,
regular expressions are provably equivalent to deterministic finite-state automata, also known as
deterministic finite-state acceptors or DFAs.
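
The equivalence with DFAs can be made concrete for one small pattern. The sketch below hand-builds a DFA for `b*a*c` (the state numbering and the names `TRANSITIONS`, `ACCEPTING`, and `dfa_accepts` are illustrative assumptions, not part of `retest.py`) and checks that it agrees with `re.fullmatch` on a few strings.

```python
import re

# Hypothetical hand-built DFA for the pattern b*a*c.
# States: 0 = reading b's, 1 = reading a's, 2 = saw the final c (accepting).
TRANSITIONS = {
    (0, 'b'): 0, (0, 'a'): 1, (0, 'c'): 2,
    (1, 'a'): 1, (1, 'c'): 2,
}
ACCEPTING = {2}

def dfa_accepts(text):
    """Run the DFA over text and report whether it ends in an accepting state."""
    state = 0
    for ch in text:
        state = TRANSITIONS.get((state, ch))
        if state is None:          # no transition defined: reject
            return False
    return state in ACCEPTING

for s in ['c', 'bbac', 'abc', 'bac', 'bca', '']:
    # The DFA and the regular expression should agree on every string.
    assert dfa_accepts(s) == bool(re.fullmatch(r'b*a*c', s))
    print(repr(s), dfa_accepts(s))
```

This only illustrates the idea for a single pattern; it is not how the re module is implemented.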

The functions defined in this notebook are found in <a target=dd href="retest.py">retest.py</a>.

Python implements regular expression pattern matching in the re module.

In [1]:
import re

In [2]:
dir(re)

['A',
 'ASCII',
 'DEBUG',
 'DOTALL',
 'I',
 'IGNORECASE',
 'L',
 'LOCALE',
 'M',
 'MULTILINE',
 'Match',
 'Pattern',
 'RegexFlag',
 'S',
 'Scanner',
 'T',
 'TEMPLATE',
 'U',
 'UNICODE',
 'VERBOSE',
 'X',
 '_MAXCACHE',
 '__all__',
 '__builtins__',
 '__cached__',
 '__doc__',
 '__file__',
 '__loader__',
 '__name__',
 '__package__',
 '__spec__',
 '__version__',
 '_cache',
 '_compile',
 '_compile_repl',
 '_expand',
 '_locale',
 '_pickle',
 '_special_chars_map',
 '_subx',
 'compile',
 'copyreg',
 'enum',
 'error',
 'escape',
 'findall',
 'finditer',
 'fullmatch',
 'functools',
 'match',
 'purge',
 'search',
 'split',
 'sre_compile',
 'sre_parse',
 'sub',
 'subn',
 'template']

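A pattern is a string made up of ordinary characters and meta-characters.

The re function search(pattern, string) looks for the first place in the string where the pattern matches. It returns a match object on success and None otherwise.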

In [4]:
pat = 'xxx'

In [5]:
r = re.search(pat, ' x ')

In [6]:
r ### there is no match

In [7]:
r2 = re.search(pat, 'xxxyyyxxx')

In [9]:
r2  ### there is a match!

<re.Match object; span=(0, 3), match='xxx'>
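
Because `re.search` returns `None` when nothing matches and a match object otherwise, the result is usually tested before it is used. A minimal sketch (the helper name `describe` is hypothetical and not one of the functions in `retest.py`):

```python
import re

def describe(pat, text):
    """Return a short description of the first match of pat in text."""
    m = re.search(pat, text)
    if m is None:                      # search() returns None on failure
        return 'no match'
    # group() is the matched text; span() is its (start, end) position
    return '{} at {}'.format(m.group(), m.span())

print(describe('xxx', ' x '))
print(describe('xxx', 'xxxyyyxxx'))
```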

The power of regular expressions is that they can specify patterns, not just fixed characters. Here are the most basic patterns which match single characters:
<ul>
<li> a, X, 9, < -- ordinary characters just match themselves exactly. The meta-characters which do not match themselves because they have special meanings are: . ^ $ * + ? { } [ ] \ | ( ) (details below)

<li> . (a period) -- matches any single character except newline '\n'

<li> \w -- (lowercase w) matches a "word" character: a letter or digit or underbar [a-zA-Z0-9_]. For Python 3 str patterns, \w also matches non-ASCII letters and digits unless the re.ASCII flag is used. Note that although "word" is the mnemonic for this, it only matches a single word character, not a whole word. \W (upper case W) matches any non-word character.

<li> \b -- boundary between word and non-word

<li> \s -- (lowercase s) matches a single whitespace character -- space, newline, return, tab, formfeed, vertical tab [ \n\r\t\f\v]. \S (upper case S) matches any non-whitespace character.

<li> \t, \n, \r, \f -- tab, newline, return, formfeed. A formfeed is sometimes called a <a target=qq href="https://en.wikipedia.org/wiki/Page_break">page break</a>. <img src="https://www.computerhope.com/jargon/f/formfeed.jpg">

<li> \d -- decimal digit [0-9] (some older regex utilities do not support \d, but they all support \w and \s)

<li> ^ = start, $ = end -- match the start or end of the string. ($ also matches just before a newline at the very end of the string.)

<li> \ -- inhibit the "specialness" of a character. So, for example, use \. to match a period or \\ to match a backslash. If you are unsure if a character has special meaning, such as '@', you can put a backslash in front of it, \@, to make sure it is treated just as a character.
    
<li> () - use parentheses to group patterns.
    
<li> | - the vertical bar is used to specify alternates. For example (a|e|i|o|u) matches any single vowel.
        
<li> [] - square brackets are used to specify a class of characters.  For example [aeiou] matches any single vowel.
    
<li> - - a dash or hyphen can be used to specify a range of characters.    For example, [a-zA-Z] matches any alphabetic character, either lower case or upper case.
    
<li> ^ - the up arrow or caret has a different meaning when it is the first character inside square brackets.  It creates the <b>complement</b> of the character class.  For example, [^aeiou] matches any single character that is not a lowercase vowel.
</ul>
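
A few of the single-character patterns above can be tried on one sample string. This is a small sketch: the sample text is made up, and it uses `re.findall` (covered at the end of this notebook) and `re.escape` only as convenient ways to show what each class matches.

```python
import re

text = 'Call 555-1234 (ext. 9) now!'

digits = re.findall(r'\d', text)          # every decimal digit
vowels = re.findall(r'[aeiou]', text)     # lowercase vowels only
non_word = re.findall(r'[^\w\s]', text)   # neither word chars nor whitespace

# re.escape() backslash-escapes meta-characters so they match literally.
literal_dot = re.search(re.escape('ext.'), text)

print(digits)
print(vowels)
print(non_word)
print(literal_dot.group())
```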

#### Examples

In [10]:
re.search('ab.','abxddd')

<re.Match object; span=(0, 3), match='abx'>

In [11]:
re.search('ab..', 'abc')  ## no match

In [12]:
re.search(r'\w\w\w','ab345')

<re.Match object; span=(0, 3), match='ab3'>

In [13]:
re.search(r'\w\w\w','ab dc')  ## no match

In [14]:
re.search(r'\w\w\W\w','ab dc')

<re.Match object; span=(0, 4), match='ab d'>

In [96]:
re.search(r'\w\w\b\w\w','ab dc')  ## no match: \b cannot occur between two word characters

In [15]:
re.search(r'\b\w\w\b','ab dc')

<re.Match object; span=(0, 2), match='ab'>

In [16]:
re.search(r'\bfoo\b', 'foo')

<re.Match object; span=(0, 3), match='foo'>

In [18]:
re.search(r'\bfoo\b', 'foo.')

<re.Match object; span=(0, 3), match='foo'>

In [19]:
re.search(r'\bfoo\b', '(foo)')

<re.Match object; span=(1, 4), match='foo'>

In [20]:
re.search(r'\bfoo\b', 'bar foo baz')

<re.Match object; span=(4, 7), match='foo'>

In [21]:
re.search(r'\bfoo\b', 'foobar')  ## no match: no boundary between 'foo' and 'bar'

In [23]:
re.search(r'\bfoo\b', 'foo3')  ## no match: a digit is a word character
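
The \b examples are written as raw strings (the `r` prefix) for a reason: in an ordinary Python string literal, `\b` is the backspace character, so the pattern would never reach the re module as a word boundary. A short sketch of the difference (the variable names `plain` and `raw` are just for illustration):

```python
import re

plain = '\bfoo\b'    # Python turns each \b into a backspace character (\x08)
raw = r'\bfoo\b'     # the backslashes reach the re module unchanged

print(len(plain), len(raw))
print(re.search(plain, 'bar foo baz'))   # looks for backspace-foo-backspace -> None
print(re.search(raw, 'bar foo baz'))
```

Writing every pattern as a raw string avoids having to remember which escapes Python itself interprets.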

In [24]:
re.search(r'\w\w\s\w\w','ab dc')

<re.Match object; span=(0, 5), match='ab dc'>

In [63]:
re.search(r'\S\S\s\S\S', 'ab dc')

<re.Match object; span=(0, 5), match='ab dc'>

In [25]:
re.search(r'\d\d\d','12345')

<re.Match object; span=(0, 3), match='123'>

In [26]:
re.search(r'\d\d\d\D\d\d\d','123 456')

<re.Match object; span=(0, 7), match='123 456'>

In [66]:
re.search(r'\d\d\d','x123456')

<re.Match object; span=(1, 4), match='123'>

In [27]:
re.search(r'^\d\d\d','x123456')  ## no match

In [28]:
re.search(r'^\d\d\d','123456')

<re.Match object; span=(0, 3), match='123'>

In [69]:
re.search(r'\d\d\d$','123456')

<re.Match object; span=(3, 6), match='456'>

In [29]:
re.search(r'\d\d\d$','123456 ')  ## no match: the trailing space comes before the end

In [30]:
re.search(r'\d\d\d$','123456 '.strip())

<re.Match object; span=(3, 6), match='456'>

In [72]:
re.search(r'ab\.','abc') ## no match

In [31]:
re.search(r'ab\.', 'ab...')

<re.Match object; span=(0, 3), match='ab.'>

In [32]:
re.search('(cat|dog)','my cat')

<re.Match object; span=(3, 6), match='cat'>

In [33]:
re.search('(cat|dog)', 'your dog')

<re.Match object; span=(5, 8), match='dog'>

In [34]:
re.search('^(a|e|i|o|u)123', 'e123')

<re.Match object; span=(0, 4), match='e123'>

In [35]:
re.search('^[aeiou]123', 'o123456')

<re.Match object; span=(0, 4), match='o123'>

In [36]:
re.search('^[^aeiou]123','x123456')

<re.Match object; span=(0, 4), match='x123'>

In [37]:
re.search('^[a-z]123','x123456')

<re.Match object; span=(0, 4), match='x123'>

In [38]:
re.search('^[^a-z]123','X123456')

<re.Match object; span=(0, 4), match='X123'>

### Repetition

Things get more interesting when you use +, *, and ? to specify repetition in the pattern:
<ul>
<li> + -- 1 or more occurrences of the pattern to its left, e.g. 'i+' = one or more i's

<li> * -- 0 or more occurrences of the pattern to its left

<li> ?  -- match 0 or 1 occurrences of the pattern to its left
</ul>   
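
The three repetition operators can be compared side by side. A minimal sketch, using made-up sample strings:

```python
import re

# ? makes the preceding 'u' optional
print(re.search(r'colou?r', 'color').group())
print(re.search(r'colou?r', 'colour').group())

# * allows zero occurrences, + requires at least one
print(re.search(r'ab*c', 'ac').group())
print(re.search(r'ab+c', 'ac'))             # None: there is no b at all
print(re.search(r'ab+c', 'abbbc').group())
```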

### Leftmost & Largest

First the search finds the leftmost match for the pattern, and second it tries to use up as much of the string as possible -- i.e. + and * go as far as possible (the + and \* are said to be "greedy").
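
Greediness is easiest to see when a pattern could stop early but does not. In this sketch (the sample text is made up), `.*` runs to the last `>` in the string rather than the first; a negated character class is one way to keep the match inside a single tag.

```python
import re

text = '<b>foo</b> and <i>so on</i>'

# Leftmost: the match starts at the first '<'.
# Largest: .* runs to the end, then backs up only as far as the last '>'.
m = re.search(r'<.*>', text)
print(m.group())

# A negated character class cannot run past the first '>'.
m2 = re.search(r'<[^>]*>', text)
print(m2.group())
```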

### Examples

In [39]:
re.match('pi+g', 'piiiig')  # one or more i's, as many as possible

<re.Match object; span=(0, 6), match='piiiig'>

The search finds the first (leftmost) match and, within it, drives the +
as far as possible (aka 'leftmost and largest').

In this example, note that the match does not reach the second run of i's.

In [40]:
match = re.search(r'i+', 'piigiiii')

In [41]:
match

<re.Match object; span=(1, 3), match='ii'>

\s* matches zero or more whitespace characters.

Here we look for three digits, possibly separated by whitespace.

In [42]:
re.search(r'\d\s*\d\s*\d', 'xx1 2 3xx')

<re.Match object; span=(2, 7), match='1 2 3'>

In [44]:
re.search(r'\d\s*\d\s*\d', 'xx12 3xx')

<re.Match object; span=(2, 6), match='12 3'>

In [45]:
re.search(r'\d\s*\d\s*\d', 'xx12        3xx')

<re.Match object; span=(2, 13), match='12        3'>

^ matches the start of the string, so the first search fails:

In [46]:
re.search(r'^b\w+', 'foobar') 

In [47]:
re.search(r'b\w+', 'foobar') 

<re.Match object; span=(3, 6), match='bar'>

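Square brackets indicate a character class, e.g. [aeiou] matches any single vowel. Combined with ^, + and $, the pattern below accepts only strings made up entirely of vowels.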

In [48]:
re.match('^[aeiou]+$','aaaaeee')

<re.Match object; span=(0, 7), match='aaaaeee'>

In [50]:
re.match('^[aeiou]+$','aaaaxxxeee')  ## no match: x is not a vowel
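
Some of the examples above call `re.match` rather than `re.search`. The difference is where a match may begin: `search` scans the whole string, `match` only tries the start of the string, and `fullmatch` requires the pattern to cover the entire string. A short sketch:

```python
import re

text = 'foobar'

print(re.search(r'bar', text))     # scans the whole string
print(re.match(r'bar', text))      # only tries position 0 -> None
print(re.match(r'foo', text))      # matches the prefix
print(re.fullmatch(r'foo', text))  # must cover the entire string -> None
print(re.fullmatch(r'\w+', text))
```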

### Testing regular expressions

In [51]:
def retest(text = 'an example word:cat!!'):
    pat = r'word:\w\w\w'
    match = re.search(pat, text)
    # search() returns None on failure, so the result can be tested directly
    if match:
        print('found: {}'.format(match.group()))
    else:
        print('Did not find {} in {}'.format(pat, text))

In [52]:
retest()

found: word:cat


In [53]:
retest('hello world')

Did not find word:\w\w\w in hello world


In [54]:
retest('this is word:123456')

found: word:123


In [55]:
patterns = [r'aaa',         # contains aaa
            r'abc',         # contains abc
            r'...',         # contains three consecutive characters
            r'^...$',       # consists of exactly three characters
            r'\.\.\.',      # contains three consecutive periods
            r'abd',         # contains abd
            r'^[aeiou]*$',  # contains only lowercase vowels (or is empty)
            r'^[^aeiou]*$', # contains no lowercase vowels (or is empty)
            r'\w\W\w',      # two word characters separated by a non-word char
            r'\w\w\w',      # three consecutive word characters
            r'^\d+$',       # contains only decimal digits
            r'^[0-7]+$',    # contains only octal digits
            r'^[0-9A-Fa-f]+$', # contains only hexadecimal digits
            r'^[a-z]*$',    # contains only lower case letters (or is empty)
            r'^\s+',        # starts with one or more whitespace chars
            r'^\d\s?',      # starts with a digit followed by zero or one whitespace char
            r'\w+@\w+',     # rough email-like shape: word chars, @, word chars
            r'(b)*(a)*(c)', # optional b's, optional a's, then c
            r'^b*a*c$',     # the whole string is b's, then a's, then one c
            r'^(b|c)*(a|b)*$',
            r'^bb*(ab|ba)*|(bbc|cbc)*$',  # | binds loosest: ^ applies only to the left branch, $ only to the right
            r'^(ab|ba|cb|bc|ca|ac)*$',
            r'(bc|bcc)(bac|cba)(cba|aa)',
            r'\AThe',       # \A is beginning of string
]

In [56]:
def retest2(patlist = patterns):
    while True:
        text = input("\nEnter string: ")
        if text == "quit":
            break
        for pat in patlist:
            match = re.search(pat, text)
            # match = re.match(pat, text)
            if match:
                print('Matched: {} with: {}'.format(pat, match.group(0)))
            else:
                print('Did not find {} in {}'.format(pat, text))

In [None]:
retest2()


Enter string: abc
Did not find aaa in abc
Matched: abc with: abc
Matched: ... with: abc
Matched: ^...$ with: abc
Did not find \.\.\. in abc
Did not find abd in abc
Did not find ^[aeiou]*$ in abc
Did not find ^[^aeiou]*$ in abc
Did not find \w\W\w in abc
Matched: \w\w\w with: abc
Did not find ^\d+$ in abc
Did not find ^[0-7]+$ in abc
Matched: ^[0-9A-Fa-f]+$ with: abc
Matched: ^[a-z]*$ with: abc
Did not find ^\s+ in abc
Did not find ^\d\s? in abc
Did not find \w+@\w+ in abc
Matched: (b)*(a)*(c) with: bc
Did not find ^b*a*c$ in abc
Did not find ^(b|c)*(a|b)*$ in abc
Matched: ^bb*(ab|ba)*|(bbc|cbc)*$ with: 
Did not find ^(ab|ba|cb|bc|ca|ac)*$ in abc
Did not find (bc|bcc)(bac|cba)(cba|aa) in abc
Did not find \AThe in abc

Enter string: joe@yale
Did not find aaa in joe@yale
Did not find abc in joe@yale
Matched: ... with: joe
Did not find ^...$ in joe@yale
Did not find \.\.\. in joe@yale
Did not find abd in joe@yale
Did not find ^[aeiou]*$ in joe@yale
Did not find ^[^aeiou]*$ in joe@yale
Matched: \w\W\w with: e@y
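
The interactive loop above needs keyboard input, which is awkward when the notebook is run top to bottom. A non-interactive variant can be sketched as follows; the helper name `first_matches` and the short pattern list `sample` are assumptions made for illustration and are not part of `retest.py`.

```python
import re

def first_matches(text, patlist):
    """Map each pattern to the text of its first match in text, or None."""
    results = {}
    for pat in patlist:
        m = re.search(pat, text)
        results[pat] = m.group(0) if m else None
    return results

sample = [r'abc', r'^...$', r'^\d+$', r'(b)*(a)*(c)']
for pat, found in first_matches('abc', sample).items():
    print('{!r:>16} -> {!r}'.format(pat, found))
```

Returning a dictionary instead of printing makes the results easy to check in code.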

<h3 id="group">Group Matching</h3>

In [None]:
def retest3(patlist = [r'(\w+)@(\w+)', r'^(\d\d\d).(\d\d\d)']):
    while True:
        text = input("\nEnter string: ")
        if text == "quit":
            break
        for pat in patlist:
            match = re.search(pat, text)
            if match:
                # group(1) and group(2) are the texts captured by the two pairs of parentheses
                print('Matched: {} group 1: {} group 2: {}'.format(pat, match.group(1), match.group(2)))
            else:
                print('Did not find {} in {}'.format(pat, text))

In [None]:
retest3()
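
Since `retest3` waits for input, here is a small non-interactive sketch of the same idea. Parentheses in the pattern capture pieces of the match, which the match object exposes through `group(n)` and `groups()`. The sample string is made up.

```python
import re

m = re.search(r'(\w+)@(\w+)', 'mail joe@yale today')
print(m.group(0))   # the whole match
print(m.group(1))   # text captured by the first pair of parentheses
print(m.group(2))   # text captured by the second pair
print(m.groups())   # all captured groups as a tuple
```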

### Findall

In [27]:
def retest4(patlist = patterns):
    while True:
        text = input("\nEnter string: ")
        if text == "quit":
            break
        for pat in patlist:
            # findall() returns a list of all non-overlapping matches
            matches = re.findall(pat, text)
            if matches:
                print('Total matches for {}: {}'.format(pat, len(matches)))
            else:
                print('Did not find {} in {}'.format(pat, text))

In [None]:
retest4()


Enter string: aaaaaa
Total matches for aaa: 2
Did not find abc in aaaaaa
Total matches for ...: 2
Did not find ^...$ in aaaaaa
Did not find \.\.\. in aaaaaa
Did not find abd in aaaaaa
Total matches for ^[aeiou]*$: 1
Did not find ^[^aeiou]*$ in aaaaaa
Did not find \w\W\w in aaaaaa
Total matches for \w\w\w: 2
Did not find ^\d+$ in aaaaaa
Did not find ^[0-7]+$ in aaaaaa
Total matches for ^[0-9A-Fa-f]+$: 1
Total matches for ^[a-z]*$: 1
Did not find ^\s+ in aaaaaa
Did not find ^\d\s? in aaaaaa
Did not find \w+@\w+ in aaaaaa
Did not find (b)*(a)*(c) in aaaaaa
Did not find ^b*a*c$ in aaaaaa
Total matches for ^(b|c)*(a|b)*$: 1
Total matches for ^bb*(ab|ba)*|(bbc|cbc)*$: 1
Did not find ^(ab|ba|cb|bc|ca|ac)*$ in aaaaaa
Did not find (bc|bcc)(bac|cba)(cba|aa) in aaaaaa
Did not find \AThe in aaaaaa
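
The counts above follow from two properties of `re.findall`: matches do not overlap, and when the pattern contains groups the list holds the captured groups rather than the whole match. A short sketch (the `emails` string is made up):

```python
import re

text = 'aaaaaa'
print(re.findall(r'aaa', text))   # non-overlapping: positions 0-2 and 3-5
print(re.findall(r'a', text))

# Without groups, findall returns the matched strings;
# with several groups, it returns one tuple per match.
emails = 'joe@yale ann@mit'
print(re.findall(r'\w+@\w+', emails))
print(re.findall(r'(\w+)@(\w+)', emails))
```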


End of regular expressions notebook.