

									  06-05934 Models of Computation                                                           The University of Birmingham
  Spring Semester 2011                                                                       School of Computer Science
  notes by Achim Jung 2007                                                                              January 17, 2011
  adapted by Volker Sorge, Steve Vickers

                                                Handout 1
                               Finite state machines and regular expressions
1. What are we up to? In the 1930s, mathematicians asked the question, “What is computation?”, although at the time the
   computer had not even been invented. Nonetheless, in pursuing an answer they came across some truly amazing results which
   are still important for a proper understanding of computation today. We will get to this in a few weeks’ time.
   In this handout we start with the simpler question, “What is a computer?” One answer is to say that it is some kind of
   machine. So we begin with a way of describing machines and how they react to user commands.
   This “reaction to user commands” sounds innocent, but in fact takes us to another important strand of ideas, that of language
   recognition. When can a string of symbols be recognized as valid according to some definition? For example, when is it a
   valid Java source file? We shall see that different complexities of language correspond to different complexities of machines
   to recognize them. This week’s topic covers the simplest: finite state machines to recognize regular expressions. This is an
   important practical technique used by software such as grep.
2. Some simple examples. The simplest imaginable “machine” (if you can call it that) is a light with a switch. The light
   can be either on or off and we can move from one state to the other by operating the light switch. Here are two versions,
   depending on the type of switch employed:
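   [Figure: two state transition diagrams, each with the states “off” and “on”.
   Left: “press-on” leads from off to on and loops on on; “press-off” leads from on to off and loops on off.
   Right: “press” leads from off to on and from on to off.]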
  The switch on the left is like the one that you are familiar with from your home: the switch can be pressed either at the top
  (“press-off”) or at the bottom (“press-on”). The one on the right is a “toggle switch”: whenever it is pressed, the system
  changes state. My computer monitor has such a switch, for example.
  For something a bit more complex, consider a vending machine for fizzy drinks which cost one pound a bottle. The machine
  accepts 50p and £1 coins. It is the machine’s job to accept your money, count how much has been inserted already, and
  (if you are lucky) vend a bottle of Coke when the right amount of money has been inserted and the appropriate button is
  pressed. In detail, our vending machine has the following states:
  q0 : Start state: the machine waits for coins to be inserted; if we press the Coke button nothing happens.
  q1 : 50p has been inserted. The machine waits for more coins; nothing happens if the button is pressed.
  q2 : £1 has been inserted. The machine waits for the Coke button to be pressed.
  Once the button has been pressed and a drink has been returned the machine goes back to the start state. Here is the state
  transition diagram:
   [Figure: state transition diagram with states q0, q1 and q2; 50p leads from q0 to q1 and from q1 to q2.]

   Note that we have used a different symbol for state q0 , as this is the machine’s normal “waiting state.” More about this in a
3. Transition tables. Graphs are easy to draw but difficult to implement and to reason about. Alternatively, we can describe
   their state transitions as tables. For the above example we get the following transition table:
                                                             50p       £1    B
                                                       q0     q1       q2    q0
                                                       q1     q2             q1
                                                       q2                    q0
  The table defines a transition function that, when given a state and an operation, returns the resulting state. We observe
  that the transition function is only partially defined since some of the table entries are left blank. In particular, there is no
  indication of what happens when we insert more than £1 into the machine.

4. Filling the transition table. We can easily fill the gaps by introducing an error state to deal with overpayment:
   [Figure: the state transition diagram extended with an error state E, which loops on 50p, £1 and B.]

                                                             50p       £1    B
                                                       q0     q1       q2    q0
                                                       q1     q2       E     q1
                                                       q2     E        E     q0
                                                       E      E        E     E

  We can observe that the transition table is now slightly larger and fully filled. In other words the transition function is
  now total, meaning that for every possible state and for every command a valid transition is defined.1 Let’s summarise our
  understanding so far:
  Definition 1 (Finite state machine) A finite state machine consists of the following:
      • A finite non-empty set Q of states.
      • A finite set Σ (read “sigma”) of commands.

      • A transition function δ (read “delta”) which returns for every state q ∈ Q and every command x ∈ Σ the next state
        δ(q, x). In other words, δ is a function from Q × Σ to Q.
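Definition 1 can be written down almost literally as data. The following is a minimal Python sketch of the vending machine with the error state E; the names STATES, COMMANDS, DELTA and run are chosen for illustration and are not part of the handout.

```python
# States Q and commands Sigma of the vending machine (with error state E).
STATES = {"q0", "q1", "q2", "E"}
COMMANDS = {"50p", "£1", "B"}

# The transition function delta as a table: (state, command) -> next state.
DELTA = {
    ("q0", "50p"): "q1", ("q0", "£1"): "q2", ("q0", "B"): "q0",
    ("q1", "50p"): "q2", ("q1", "£1"): "E",  ("q1", "B"): "q1",
    ("q2", "50p"): "E",  ("q2", "£1"): "E",  ("q2", "B"): "q0",
    ("E", "50p"): "E",   ("E", "£1"): "E",   ("E", "B"): "E",
}

def run(commands, state="q0"):
    """Apply the commands one after the other and return the state reached."""
    for x in commands:
        state = DELTA[(state, x)]
    return state

# delta is total if every pair (state, command) has an entry in the table.
is_total = all((q, x) in DELTA for q in STATES for x in COMMANDS)
```

For example, run(["50p", "50p", "B"]) ends in q0 again, whereas run(["£1", "50p"]) ends in the error state E.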

5. Bells and whistles. The finite state machine is a very useful model for interactive systems, and it is indeed being used in
   real system design. It is great for making initial design sketches, discussing them with other engineers and even end-users,
   for discovering design flaws (such as missing transitions), and proving formally that certain properties hold. What we have
   presented here is only the bare minimum of the idea; obviously, more features could be added, such as allowing the machine
   to generate some output of its own, or to add a notion of “time-out” or probability. We will not go down this route but if
   you are interested then you should consider choosing the module Automatic Verification in the third year.
   Here we want to explore how far finite state machines can be seen as computing devices. As it turns out, they are
   especially well-suited for one particular kind of computational task: string matching.
6. String matching. In the simplest case we are trying to test whether a given string (perhaps some user input) matches a
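   fixed value, in other words, we are trying to compare two strings. Here is a state transition diagram for testing equality with
   “Tolkien”:
   [Figure: states 0 to 7 in a chain, with transitions labelled T, o, l, k, i, e, n leading from state 0 to state 7.
   A mismatching letter (¬o, ¬l, ¬k, ¬i, ¬e, ¬n) leads to state 8, as does any letter read in state 7; state 8 loops
   on any letter.]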

  Here state 0 is a start state for the comparison, indicated by a diamond shape rather than a circle in the diagram, and state 7
  is a success state, indicated by a double boundary. Once we are in state 8, we can be certain that the string we are testing is
  not equal to “Tolkien”, just because it is too long already. However, we are not exploiting the possibilities of finite state
  diagrams at all. Here is a diagram that accepts both “colour” and “color”:

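   [Figure: states 0 to 4 in a chain with transitions labelled c, o, l, o; from state 4, either r leads directly to
   state 6, or u followed by r leads there via an intermediate state.]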



     1 We may not be totally happy with the resulting system, though, as there seems to be no way to get our money back when we overpay. In the first exercise below you are asked to improve the design.

  Here is one that accepts all strings starting with “{”, ending in “}” and having any string of lower-case characters in-between
  (but no braces):2

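   [Figure: a start state with a transition labelled { to a middle state, which loops on a, b, c, ..., z and has a
   transition labelled } to an accepting state.]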

   A finite state diagram that is used for string matching always needs to have a start state (usually denoted by q0 ) and (one
   or several) “success states” (usually denoted F ; it is a subset of Q). The latter ones are usually called accepting states.3
   This is a good point to introduce the more traditional name finite state automaton4 for these diagrams, and sometimes
   deterministic finite state automaton to stress the fact that the transition function has precisely one successor state in each
   situation. Because of this latter terminology, the abbreviation is usually DFA. Don’t let this confuse you! “Finite state
   diagram”, “finite state machine”, “(deterministic) finite state automaton”, and “DFA” are all the same concept.
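As a small illustration of start state and accepting states, here is a Python sketch of the brace automaton above. The state numbers 0, 1, 2 and the names START, ACCEPTING, DELTA and accepts are assumptions made for this sketch; as in the diagram, the error state is left out, so a missing table entry is treated as a rejection.

```python
import string

START = 0          # the start state q0
ACCEPTING = {2}    # the set F of accepting states

# Partial transition function; the error state is omitted as in the diagram.
DELTA = {(0, "{"): 1, (1, "}"): 2}
for ch in string.ascii_lowercase:
    DELTA[(1, ch)] = 1   # loop on lower-case letters between the braces

def accepts(s):
    """Read s from left to right; accept if we end in an accepting state."""
    state = START
    for ch in s:
        if (state, ch) not in DELTA:
            return False             # missing entry: implicit error state
        state = DELTA[(state, ch)]
    return state in ACCEPTING
```

Each character is looked at exactly once, so the running time is proportional to the length of the string.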
7. A pattern language. We see that finite state diagrams can express the process of string comparison, but surely we would
   not want to build the diagram ourselves whenever we come across a string comparison problem. Instead, what we need is a
   language which allows us to specify “strings with variations,” also known as patterns. The language that we will introduce
   here is that of regular expressions. The first example, as a regular expression, is just ’Tolkien’; we use quote signs to
   limit the expression, as is done in many UNIX programs. The second is written as ’col(o|ou)r’, and we see that “|”
   is used to express alternatives. The final one is written as ’{[a-z]*}’; this example contains two new constructs: the
   character range “[a-z]”, which is really just an abbreviation for a long alternative, ’(a|b|c|...|z)’,
   and “*”, which indicates an arbitrary number of repetitions of the subexpression just preceding it. In this case, it means
   that an arbitrary number of lower case characters may be enclosed between the braces. Note that they do not all have
   to be the same, that is, the expression says “choose a character repeatedly”, not “choose a character and then repeat
   that arbitrarily often.” The symbol “*” is called Kleene star to honour the American logician Stephen Cole Kleene
   (1909–1994). Let’s define the language of regular expressions formally (as one would define the language of arithmetic
   expressions, for example):

  Definition 2 (Pattern, Regular Expression) Let Σ (read “sigma”) be an alphabet, that is, a finite set of characters. A
  pattern or regular expression over Σ is any string of symbols generated by the following inductive definition:
  Base case 1. The empty string is a pattern (the “empty pattern”).

  Base case 2. Every letter from Σ is a pattern.
  Concatenation. If p1 and p2 are patterns then so is (p1 p2 ).
  Alternative. If p1 and p2 are patterns then so is (p1 | p2 ).

  Kleene star. If p is a pattern then so is (p∗).

  We can save some brackets by defining precedence rules:
  Kleene Star “*” binds most tightly. It only applies to the immediately preceding pattern.

  Concatenation comes next. It does not have an explicit operator but is expressed by writing patterns next to each other.
  Alternative “|” has the lowest precedence. Unless explicitly reined in by brackets, it places everything to its left in
        alternative to everything on its right.
  To put this into use straight away, let us define precisely what it means for a string to match a given pattern:
  Definition 3 Let p be a pattern over an alphabet Σ. A string s consisting of letters from Σ matches p if one of the following holds:
  Base case 1. Both pattern and string are empty.
  Base case 2. The pattern is a single letter from Σ and s is that same letter.
  Concatenation. The pattern is a concatenation of patterns p1 and p2 , and s is a concatenation of substrings s1 and s2
       such that s1 matches p1 , and s2 matches p2 .
  Alternative. The pattern is of the form (p1 |p2 ) and s matches either p1 or p2 (or both).
  Kleene star. The pattern is of the form (q∗) and s is either empty or consists of finitely many substrings s1 , . . . , sn , each
       of which matches q.

     2 For brevity, I left out the error state.
     3 Unfortunately, they are also sometimes called final states although there is nothing “final” about them.
     4 Plural: automata
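Definition 3 can be transcribed into a recursive function almost clause by clause. The sketch below is written in Python and assumes a hypothetical encoding of patterns as nested tuples ("empty", "letter", "cat", "alt", "star"); neither the encoding nor the helper names come from the handout. It tries every way of splitting the string for concatenation and star, so it is meant as an executable reading of the definition, not as an efficient matcher.

```python
from functools import reduce

def letter(x):
    return ("letter", x)

def cat(*patterns):
    """Concatenate one or more patterns, nesting to the left."""
    return reduce(lambda p1, p2: ("cat", p1, p2), patterns)

def matches(p, s):
    """Does the string s match the pattern p in the sense of Definition 3?"""
    kind = p[0]
    if kind == "empty":                      # base case 1
        return s == ""
    if kind == "letter":                     # base case 2
        return s == p[1]
    if kind == "cat":                        # try every split s = s1 s2
        return any(matches(p[1], s[:i]) and matches(p[2], s[i:])
                   for i in range(len(s) + 1))
    if kind == "alt":
        return matches(p[1], s) or matches(p[2], s)
    if kind == "star":
        if s == "":
            return True
        # Split off a non-empty first piece; empty pieces can always be
        # dropped, and insisting on a non-empty one guarantees termination.
        return any(matches(p[1], s[:i]) and matches(p, s[i:])
                   for i in range(1, len(s) + 1))
    raise ValueError("unknown pattern: %r" % (p,))

# The pattern 'col(o|ou)r' in this encoding.
COLOUR = cat(letter("c"), letter("o"), letter("l"),
             ("alt", letter("o"), cat(letter("o"), letter("u"))),
             letter("r"))
```

With this encoding, matches(COLOUR, "color") and matches(COLOUR, "colour") both hold, while matches(COLOUR, "colr") does not.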
 8. Practical issues. Regular expressions are ubiquitous in Computer Science, and there are many tools that use them. For
    example, grep is a UNIX command that allows you to search for strings that match a given regular expression in a
    file. The emacs editor supports regular expressions in its search command “M-x isearch-forward-regexp”.
    Perl is an operating system scripting language of great importance and utility; pattern matching for regular expressions
    is one of its key features. Since JDK 1.4, Java provides a package for dealing with regular expressions: import
    java.util.regex.*; There are many abbreviations for lengthy regular expressions; the range “[a-z]” mentioned
    above is just one of them. Here are a few others:

   dot The symbol “.” matches every (ASCII) character.
   non-empty repetition “+” is like the Kleene star except that there has to be at least one occurrence of the pattern.
   escaping In order to escape the special meaning of *, |, etc, one uses “\”. For example, the pattern ’s\.j\.vickers’
         matches my email address only, whereas ’s.j.vickers’ would also match the string “sXjYvickers” (and
         many others).
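To try these abbreviations out, one can use any tool with regular expression support. The sketch below uses Python's standard re module, which the handout itself does not mention; whole_match is a helper name invented here. It uses re.fullmatch because Definition 3 speaks about the whole string matching, whereas a search tool looks for a match anywhere in a line.

```python
import re

def whole_match(pattern, s):
    """True if the entire string s matches the regular expression."""
    return re.fullmatch(pattern, s) is not None

# (pattern, string) pairs taken from the examples in the text.
EXAMPLES = [
    (r"col(o|ou)r", "colour"),
    (r"\{[a-z]*\}", "{abc}"),
    (r"s\.j\.vickers", "sXjYvickers"),
    (r"s.j.vickers", "sXjYvickers"),
    (r"a+", ""),
]

for pattern, s in EXAMPLES:
    print(repr(pattern), repr(s), whole_match(pattern, s))
```

The braces are escaped in the second pattern because in Python's syntax unescaped braces can also express repetition counts.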
 9. Implementing a pattern matcher. We have motivated regular expressions with the capabilities of finite state diagrams, and
    so you may expect that every regular expression can be implemented by a diagram. This is indeed so but the construction
    is surprisingly difficult. The rest of the handout is devoted to this issue. The problem arises from concatenation and star,
    as they require us to break up the given string into smaller parts. How is a program to know where to make those breaks?
    The only solution in these cases seems to be to examine all possible subdivisions of the given string. While there are only
    finitely many of them, because strings have finite length, you can probably guess that the resulting procedure would be of
    exponential complexity.
    It will be our first task to develop an efficient method for pattern matching. Indeed, as it will turn out, it can be done in
    linear time, by reading the characters of the given string only once and from left to right.
10. Every pattern matching problem is solved by a finite automaton.

      (a) The problem. At first glance, this task seems to be easy: Since patterns are constructed inductively, we can construct
          the recogniser inductively as well. For example, if the pattern is a concatenation p1 p2 , and if we have constructed
          recognisers for p1 and p2 already, then apparently all that we need to do is to append the recogniser for p2 to every
          accepting state of p1 , schematically:


          [Figure: left, the recognisers F1 and F2 drawn separately; right, F2 appended to the accepting states of F1.]


          Unfortunately, this does not always work because finite automata may have arrows leading back to the start state.
          Look at what can happen: F1 is a DFA for the regular expression ’a(aa)*’ (strings consisting of an odd number
          of a’s) and F2 is a DFA for ’b(bb)*’.

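          [Figure: the automata F1 (with transitions labelled a) and F2 (with transitions labelled b), and the result of
          glueing them together.]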

          Glueing the two together as suggested above does not lead to a correct automaton for ’a(aa)*b(bb)*’: the
          regular expression wants an odd number of a’s followed by an odd number of b’s, but the automaton accepts abbaab,
          for example.
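The failure can be checked mechanically. The following Python sketch assumes that F1 and F2 are the obvious two-state automata for ’a(aa)*’ and ’b(bb)*’, and that glueing identifies the accepting state of F1 with the start state of F2; the state names s, m, f are invented for this sketch. Python's re module is used only as an independent reference for the pattern.

```python
import re

# "s": start state of F1
# "m": accepting state of F1, merged with the start state of F2
# "f": accepting state of F2
GLUED = {
    ("s", "a"): "m", ("m", "a"): "s",   # F1: odd number of a's
    ("m", "b"): "f", ("f", "b"): "m",   # F2: odd number of b's
}

def glued_accepts(s):
    """Run the glued automaton; a missing transition means rejection."""
    state = "s"
    for ch in s:
        if (state, ch) not in GLUED:
            return False
        state = GLUED[(state, ch)]
    return state == "f"

def pattern_matches(s):
    """Reference answer: does s match a(aa)*b(bb)* as a whole?"""
    return re.fullmatch(r"a(aa)*b(bb)*", s) is not None
```

On the string abbaab the two disagree: glued_accepts returns True although pattern_matches returns False. The second b leads from f back to the merged state m, from which the automaton can read a's again.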
          A similar problem arises for alternative: one would like to just glue the two automata together at their
          start states, but in general this destroys determinacy, as in this example (where I tried to construct an automaton for
          ’a | ab’ by joining the respective automata for ’a’ and ’ab’ together at their start state; in the result on the right
          there are two different transitions labelled a leading away from the start state):

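          [Figure: left, the automata for ’a’ and ’ab’ drawn separately; right, the two joined at their start state, which
          now has two outgoing transitions labelled a.]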

    Kleene star is worse still but let’s not look for an example. It is clear that, when building the automaton inductively,
    we have to find a way to keep the constituent automata from interfering with each other.
(b) Nondeterministic automata. The way to proceed is to define a much more liberal class of automata, for which
    it will be easy to show that they can recognise every regular language, and then to show that they can, in fact, be
    simulated by deterministic finite automata.

    Definition 4 (Nondeterministic finite automaton) The definition is as for deterministic automata except that δ is
    allowed to be a relation between pairs (state, input letter) and states. In other words, at every stage there may be
    many (or none at all) choices of moving to another state.
    The definition of a nondeterministic finite automaton accepting a string s is as before except that we are now
    stipulating that the triples (q0 , s1 , p1 ), (p1 , s2 , p2 ), . . . , (pk−1 , sk , pk ) belong to the relation δ. In other words, we
    require that some execution path leads to an accepting state.

    Note that no one in their right mind would want to write a program which implements a nondeterministic automaton.
    From a practical point of view there is nothing appealing about them. They are a purely abstract concept, introduced
    to discuss our translation problem at the right level of abstraction.
    In a further abstraction step we now allow automata to make transitions without reading an input symbol. This
    can be achieved in the formal definition by extending Σ with a new letter ε (read “epsilon”) which annotates all
    transitions which don’t read an input symbol. Let us call automata of this kind nondeterministic finite automata
    with ε-moves.
(c) Translating a regular expression into a nondeterministic automaton. The construction follows the inductive
    definition of patterns:
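    Base case 1. The empty pattern is recognised by
          [Figure: an automaton consisting of a single state]
          where the start state is also accepting.
    Base case 2. If the pattern consists of just a single letter x, then the following automaton recognises it:
          [Figure: a start state with a transition labelled x to an accepting state]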
    Concatenation. If we have automata F1 and F2 for patterns p1 and p2 then we combine them with an ε-move:


          [Figure: F1, followed by an ε-move, followed by F2]


    Alternative. If we have automata F1 and F2 for patterns p1 and p2 then we combine them with ε-moves as follows

          [Figure: left, F1 and F2 drawn separately; right, F1 and F2 side by side, connected by ε-moves]

    Kleene star. If we have an automaton F for the pattern p then we construct an automaton for p∗ as follows:


          [Figure: left, F on its own; right, F with ε-moves added]


      Note that because the ε-moves are introduced one-way only, the sub-automata have to be traversed in the same way
      in the larger network as if they were on their own. I do not believe that a formal proof at this stage could be more
      convincing than the pictures, so we leave it at that.
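The inductive construction can also be sketched as a program. The Python code below follows a standard textbook variant (often called Thompson's construction) in which every sub-automaton has exactly one start state and one accepting state; this may differ in detail from the pictures. Patterns are encoded as nested tuples ("empty", "letter", "cat", "alt", "star"), transitions are triples (source, label, target), and the label None stands for ε; all of these choices are assumptions made for the sketch.

```python
import itertools

_fresh = itertools.count()   # supplies new state numbers

def build(p, trans):
    """Add an automaton for pattern p to the set trans of triples and
    return its (start state, accepting state)."""
    kind = p[0]
    if kind == "empty":
        q = next(_fresh)
        return q, q                          # start state is also accepting
    if kind == "letter":
        a, b = next(_fresh), next(_fresh)
        trans.add((a, p[1], b))
        return a, b
    if kind == "cat":
        s1, f1 = build(p[1], trans)
        s2, f2 = build(p[2], trans)
        trans.add((f1, None, s2))            # one ε-move from F1 into F2
        return s1, f2
    if kind == "alt":
        s1, f1 = build(p[1], trans)
        s2, f2 = build(p[2], trans)
        s, f = next(_fresh), next(_fresh)
        trans.update({(s, None, s1), (s, None, s2),
                      (f1, None, f), (f2, None, f)})
        return s, f
    if kind == "star":
        s1, f1 = build(p[1], trans)
        s, f = next(_fresh), next(_fresh)
        trans.update({(s, None, s1), (s, None, f),     # enter F or skip it
                      (f1, None, s1), (f1, None, f)})  # repeat F or leave
        return s, f
    raise ValueError("unknown pattern: %r" % (p,))

def closure(states, trans):
    """All states reachable from the given states by ε-moves alone."""
    seen, todo = set(states), list(states)
    while todo:
        p = todo.pop()
        for (a, x, b) in trans:
            if a == p and x is None and b not in seen:
                seen.add(b)
                todo.append(b)
    return seen

def accepts(p, s):
    """Build the automaton for p and follow all execution paths on s."""
    trans = set()
    start, accept = build(p, trans)
    current = closure({start}, trans)
    for ch in s:
        step = {b for (a, x, b) in trans if a in current and x == ch}
        current = closure(step, trans)
    return accept in current

# The pattern 'a(a|b)*' in the tuple encoding.
EXAMPLE = ("cat", ("letter", "a"),
           ("star", ("alt", ("letter", "a"), ("letter", "b"))))
```

Here accepts(EXAMPLE, "abba") holds while accepts(EXAMPLE, "ba") does not. The function accepts keeps track of the set of all states the automaton could be in, which is one way of making precise the idea that some execution path leads to an accepting state.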
(d) Removing ε-moves. We now continue in our construction by showing that ε-moves are not really necessary in an
    automaton, that is, we show that every nondeterministic automaton with ε-moves can be simulated by one without.
      The proof is quite simple. For a given nondeterministic automaton with ε-moves (Q, q0 , F, δ) we construct another
      one (Q, q0 , F′, δ′) which is almost the same except for the transition relation and the set of accepting states. What
      we do is to add to δ′ all triples (p, x, q) where there is a path in the original automaton which starts in p and ends in q,
      and which consists of an arbitrary number of ε-transitions, followed by a transition labelled with x. In addition, we
      add all those states to F′ (the set of accepting states) for which there is a path from the state to some accepting state
      consisting entirely of ε-moves. After these adjustments, we remove all ε-transitions from the relation δ′. Here is an
      example:

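      [Figure: left, a nondeterministic automaton with ε-moves and transitions labelled a and b; right, the automaton
      obtained by removing the ε-moves.]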

      Here we get a nice little automaton after removing unreachable states. In general, the result of removing ε-moves is
      a highly nondeterministic automaton.
      Let us check that this transformation is correct. If the original automaton accepts a string abc, say, then this means
      that there is a sequence of transitions all labelled with ε’s except one that is labelled with “a”, one labelled with “b”,
      and one labelled with “c”, in other words, the labels on the transitions must form a string that looks like

                                                      εε . . . εaεε . . . εbεε . . . εcεε . . . ε

      We can split this up into four chunks of transitions, each starting with a sequence of epsilons and ending with a real
      character, except for the last one which is just epsilons; we also insert some names for intermediary states:

                                   q0   εε . . . εa   q1    εε . . . εb   q2     εε . . . εc   q3   εε . . . ε q4

      By definition of the revised automaton, there will be a transition labelled a from q0 to q1 , one labelled b from q1 to q2 ,
      and one labelled c from q2 to q3 . Again by construction, the state q3 will have been added to the set of accepting
      states because q4 is accepting. So indeed, abc will be accepted by the new automaton.
      The reverse implication is shown in the same way.
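The construction is short enough to write out. In the Python sketch below, transitions are triples (source, label, target) with the label None standing for ε, and the small example automaton at the end is a hypothetical one made up for this sketch, not the one from the figure.

```python
EPS = None   # label used for ε-transitions

def eps_reachable(p, delta):
    """States reachable from p by ε-transitions alone (including p itself)."""
    seen, todo = {p}, [p]
    while todo:
        a = todo.pop()
        for (src, x, dst) in delta:
            if src == a and x is EPS and dst not in seen:
                seen.add(dst)
                todo.append(dst)
    return seen

def remove_eps(states, accepting, delta):
    """Return the new transition relation and the new accepting states."""
    new_delta, new_accepting = set(), set()
    for p in states:
        reach = eps_reachable(p, delta)
        # A run of ε-moves followed by an x-transition becomes a single
        # x-transition starting in p; ε-transitions themselves are dropped.
        for (src, x, dst) in delta:
            if src in reach and x is not EPS:
                new_delta.add((p, x, dst))
        # p is accepting if ε-moves alone lead from p to an accepting state.
        if reach & accepting:
            new_accepting.add(p)
    return new_delta, new_accepting

# Hypothetical example: start state 0, accepting state 2.
STATES = {0, 1, 2}
ACCEPTING = {2}
DELTA = {(0, EPS, 1), (1, "a", 1), (1, EPS, 2), (2, "b", 2)}

NEW_DELTA, NEW_ACCEPTING = remove_eps(STATES, ACCEPTING, DELTA)
```

In this example all three states end up accepting, and state 0 acquires an a-transition to 1 and a b-transition to 2 in place of its ε-move.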
(e) Removing nondeterminacy. Let (Q, q0 , F, δ) be a nondeterministic finite automaton (without ε-moves). We construct
    a deterministic automaton which accepts the same strings as follows. The set of states will be the powerset of
    Q, that is, the set of all subsets of Q. We denote it by PQ. This is still a finite set.
      The start state is the one-element set {q0 }. Given the input character x, the new transition function δ′ maps a set of
      states A to the set of states B that contains exactly those states that can be reached by an x-transition from some state
      in A. As a formula:
                                           δ′(A, x) = {q ∈ Q | ∃p ∈ A. (p, x, q) ∈ δ} .
      In other words, we explore all possible paths in the nondeterministic automaton in parallel (and this is exactly how
      implementations of pattern matchers work).
      For this to be correct, we have to say that a set is an accepting state of the new automaton if it contains at least one
      node from the original F .

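      [Figure: left, a nondeterministic automaton with states 0, 1, 2 and transitions labelled a and b; right, the
      resulting deterministic automaton on the subsets 0, 1, 2, 01, 02, 12 and 012.]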


We get a pretty complicated automaton even in this very simple situation, but this is unavoidable. Note, however,
that the automaton can be simplified a little bit by erasing all nodes which cannot be reached from the start state:

      [Figure: the simplified automaton, consisting of the start state and the states 12 and 012.]


In practice, you would not first create the diagram with all nodes (for all subsets) and then delete the inaccessible
ones. Much more efficient is to create nodes only as they arise.
In the example above, start with state {0}. From there, transition a can go to either 1 or 2, so we also need the state
{1, 2}. From 1 or 2 a goes nowhere, but b goes to 0, 1 or 2, so we need {0, 1, 2}. With these three subsets we find
we can complete the diagram as above so they are enough. Note also that we would usually omit the empty set {},
since that represents an error state.
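Creating subsets only as they arise is easy to program. The Python sketch below implements the formula for the new transition function with a worklist. The nondeterministic automaton used as input is a hypothetical one, chosen so that the subsets {0}, {1, 2} and {0, 1, 2} arise as in the discussion above; its exact transitions and the choice of 2 as accepting state are assumptions. Unlike the simplified diagram, the sketch keeps the empty set as an explicit error state so that the resulting transition function is total.

```python
def determinise(start, accepting, delta, alphabet):
    """Subset construction, creating only the subsets that are reachable."""
    start_set = frozenset({start})
    dfa = {}                         # (subset, letter) -> subset
    seen, todo = {start_set}, [start_set]
    while todo:
        A = todo.pop()
        for x in alphabet:
            # All states reachable by an x-transition from some state in A.
            B = frozenset(q for (p, y, q) in delta if p in A and y == x)
            dfa[(A, x)] = B
            if B not in seen:
                seen.add(B)
                todo.append(B)
    # A subset is accepting if it contains at least one accepting state.
    dfa_accepting = {A for A in seen if A & accepting}
    return start_set, dfa, dfa_accepting

# Hypothetical nondeterministic automaton: start state 0, accepting state 2.
DELTA = {(0, "a", 1), (0, "a", 2), (1, "b", 0), (1, "b", 1), (2, "b", 2)}
START, DFA, DFA_ACCEPTING = determinise(0, {2}, DELTA, "ab")

def dfa_accepts(s):
    """Run the deterministic automaton on a string over the letters a, b."""
    state = START
    for ch in s:
        state = DFA[(state, ch)]
    return state in DFA_ACCEPTING
```

For this input only four of the eight subsets are ever created: {0}, {1, 2}, {0, 1, 2} and the empty set.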

                                                    Exercise Sheet 1
Quickies (I suggest that you try to do these before the Exercise Class.)
   1. Alter the finite state diagram of the Coke machine so that it doesn’t get trapped in an error state if too much money
      has been inserted. Also, extend it with a “cancel” operation (which returns the money inserted so far). Change and
      extend the transition table accordingly.

   2. Build a finite state diagram that checks whether a string is equal to “Goo....gle” with arbitrarily many o’s
      following the initial two.
   3. Design deterministic finite automata for the following patterns:
       (a) (a|b)c
       (b) ab|bc
       (c) ab|ac (Careful! Remember that from any state there must be at most one transition labelled with a particular
           letter.)
       (d) c(a|b)*c
Classroom exercises (Hand in to your tutor at the end of the exercise class.)
   4. Design a deterministic finite automaton for the pattern c(a|b)*ac.
      Which of the following strings are matched by it? “caaaac”, “cbbbbc”, “cababc”, “cbabac”. (Check them
      against both the pattern and the automaton.)                                                     2+2

Homework exercises (Submit via the correct pigeon hole before next Monday, 2pm.)
   5. Write a regular expression matched by strings of as and bs with at least three characters, in which the last but
      two is an a. (Hence it ends in a?? and not b??.) Design, first, a non-deterministic finite automaton for it, and then
      a deterministic finite automaton.                                                                               2+2
   6. Write out a regular expression that matches UK postcodes. For simplicity, take it that a postcode has the following
      format. It has two parts, separated by a space. The first part is either a single letter (B, G, L, M or S for Birmingham,
      Glasgow, Liverpool, Manchester or Sheffield, or E, N or W for the East, North or West parts of London) or two
      letters, followed by one or two digits. The second part is a digit followed by two letters. You should use the UNIX
      abbreviations “[0-9]” and “[A-Z]”, which match all digits and all uppercase letters, respectively.                    2
Stretchers (Problems in this section go beyond what we expect of you in the May exam. Please submit your solution
through the special pigeon hole dedicated for these exercises. The deadline is the same as for the other homework.)
   7. Consider the regular expressions
      ’(a|b)*a’
      ’(a|b)*a(a|b)’
      ’(a|b)*a(a|b)(a|b)’
      ’(a|b)*a(a|b)(a|b)(a|b)’

       (a) For each expression, use everyday language to characterise the matching strings.
       (b) Design deterministic finite automata for them.
       (c) Conclude that the size of the smallest deterministic finite automaton matching a given pattern can grow
           exponentially with the length of the pattern.                                                 4 bonus points

