06-05934 Models of Computation
The University of Birmingham, School of Computer Science, Spring Semester 2011
Notes by Achim Jung (2007), adapted by Volker Sorge and Steve Vickers
January 17, 2011

Handout 1: Finite state machines and regular expressions

1. What are we up to? In the 1930s, mathematicians asked the question, "What is computation?" although at the time the computer had not even been invented. Nonetheless, in pursuing an answer they came across some truly amazing results which are still important for a proper understanding of computation today. We will get to this in a few weeks' time.

In this handout we start with the simpler question, "What is a computer?" One answer is to say that it is some kind of machine. So we begin with a way of describing machines and how they react to user commands. This "reaction to user commands" sounds innocent, but in fact takes us to another important strand of ideas, that of language recognition. When can a string of symbols be recognized as valid according to some definition? For example, when is it a valid Java source file? We shall see that different complexities of language correspond to different complexities of machines to recognize them. This week's topic covers the simplest: finite state machines to recognize regular expressions. This is an important practical technique used by software such as grep.

2. Some simple examples. The simplest imaginable "machine" (if you can call it that) is a light with a switch. The light can be either on or off and we can move from one state to the other by operating the light switch. Here are two versions, depending on the type of switch employed:

[Diagram: two two-state machines with states "off" and "on". Left: "press-on" moves from off to on, and "press-off" moves back. Right: the single command "press" moves between off and on in both directions.]

The switch on the left is like the one that you are familiar with from your home: the switch can be pressed either at the top ("press-off") or at the bottom ("press-on"). The one on the right is a "toggle switch": whenever it is pressed, the system changes state.
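The toggle switch can be tried out directly in code. Here is a minimal sketch (in Python, a language chosen purely for illustration; the handout itself prescribes none), with the diagram stored as a transition table:

```python
# The toggle switch as a transition table: two states, one command.
toggle = {("off", "press"): "on", ("on", "press"): "off"}

state = "off"
for command in ["press", "press", "press"]:
    state = toggle[(state, command)]
# After an odd number of presses the light is on.
```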
My computer monitor has such a switch, for example.

For something a bit more complex, consider a vending machine for fizzy drinks which cost one pound a bottle. The machine accepts 50p and £1 coins. It is the machine's job to accept your money, count how much has been inserted already, and (if you are lucky) vend a bottle of Coke when the right amount of money has been inserted and the appropriate button is pressed. In detail, our vending machine has the following states:

q0: Start state: the machine waits for coins to be inserted; if we press the Coke button nothing happens.
q1: 50p has been inserted. The machine waits for more coins; nothing happens if the button is pressed.
q2: £1 has been inserted. The machine waits for the Coke button B to be pressed. Once the button has been pressed and a drink has been returned, the machine goes back to the start state.

Here is the state transition diagram:

[Diagram: states q0, q1, q2; 50p moves q0 to q1 and q1 to q2; £1 moves q0 to q2; the button B loops on q0 and on q1, and moves q2 back to q0.]

Note that we have used a different symbol for state q0, as this is the machine's normal "waiting state." More about this in a moment.

3. Transition tables. Graphs are easy to draw but difficult to implement and to reason about. Alternatively, we can describe their state transitions as tables. For the above example we get the following transition table:

        50p   £1    B
  q0    q1    q2    q0
  q1    q2          q1
  q2                q0

The table defines a transition function that, when given a state and an operation, returns the resulting state. We observe that the transition function is only partially defined since some of the table entries are left blank. In particular, there is no indication of what happens when we insert more than £1 into the machine.

4. Filling the transition table. We can easily fill the gaps by introducing an error state E to deal with overpayment:

        50p   £1    B
  q0    q1    q2    q0
  q1    q2    E     q1
  q2    E     E     q0
  E     E     E     E

[Diagram: as before, but with a new state E; £1 from q1 and 50p or £1 from q2 lead to E, and 50p, £1 and B all loop on E.]

We can observe that the transition table is now slightly larger and fully filled.
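The filled table transcribes directly into code. A minimal sketch (Python, chosen for illustration; the state and command names follow the table):

```python
# Transition table of the Coke machine, including the error state E.
delta = {
    ("q0", "50p"): "q1", ("q0", "£1"): "q2", ("q0", "B"): "q0",
    ("q1", "50p"): "q2", ("q1", "£1"): "E",  ("q1", "B"): "q1",
    ("q2", "50p"): "E",  ("q2", "£1"): "E",  ("q2", "B"): "q0",
    ("E",  "50p"): "E",  ("E",  "£1"): "E",  ("E",  "B"): "E",
}

def run(state, commands):
    """Follow the transition function for a sequence of commands."""
    for cmd in commands:
        state = delta[(state, cmd)]
    return state
```

For example, run("q0", ["50p", "50p", "B"]) ends back in "q0" after vending, while any overpayment, such as run("q0", ["£1", "50p"]), lands in "E".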
In other words the transition function is now total, meaning that for every possible state and for every command a valid transition is defined.¹ Let's summarise our understanding so far:

Definition 1 (Finite state machine) A finite state machine consists of the following:

• A finite non-empty set Q of states.
• A finite set Σ (read "sigma") of commands.
• A transition function δ (read "delta") which returns for every state q ∈ Q and every command x ∈ Σ the next state δ(q, x). In other words, δ is a function from Q × Σ to Q.

5. Bells and whistles. The finite state machine is a very useful model for interactive systems, and it is indeed used in real system design. It is great for making initial design sketches, for discussing them with other engineers and even end-users, for discovering design flaws (such as missing transitions), and for proving formally that certain properties hold. What we have presented here is only the bare minimum of the idea; obviously, more features could be added, such as allowing the machine to generate some output of its own, or adding a notion of "time-out" or probability. We will not go down this route, but if you are interested then you should consider choosing the module Automatic Verification in the third year. Here we want to explore to what extent finite state machines can be seen as computing devices. As it turns out, they are especially well-suited for one particular kind of computational task: string matching.

6. String matching. In the simplest case we are trying to test whether a given string (perhaps some user input) matches a fixed value; in other words, we are trying to compare two strings. Here is a state transition diagram for testing equality with "Tolkien":

[Diagram: states 0 to 7 in a chain; the transitions T, o, l, k, i, e, n read "Tolkien" letter by letter. Any wrong letter (¬T, ¬o, ¬l, ¬k, ¬i, ¬e, ¬n) leads to a failure state 8; any letter read in state 7 also leads to 8, and any letter loops on 8.]

Here state 0 is a start state for the comparison, indicated by a diamond shape rather than a circle in the diagram, and state 7 is a success state, indicated by a double boundary.
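The diagram can be realised by giving the transition function δ directly as code. A sketch (Python; the state numbering follows the diagram, everything else is my own choice):

```python
TARGET = "Tolkien"

def delta(q, x):
    """Transition function of the equality tester: states 0-7 record how
    much of "Tolkien" has been read so far; 8 is the failure state."""
    if q < len(TARGET) and x == TARGET[q]:
        return q + 1          # read the next expected letter
    return 8                  # wrong letter, or input already too long

def accepts(s):
    q = 0
    for x in s:
        q = delta(q, x)
    return q == len(TARGET)   # state 7 is the only success state
```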
Once we are in state 8, we can be certain that the string we are testing is not equal to "Tolkien", just because it is too long already. However, we are not exploiting the possibilities of finite state diagrams at all. Here is a diagram that accepts both "colour" and "color":

[Diagram: states 0 to 4 read the letters c, o, l, o in sequence; from state 4, r leads directly to the success state 6, while u leads to state 5 and then r leads from 5 to 6.]

Here is one that accepts all strings starting with "{", ending in "}", and having any string of lower-case characters in-between (but no braces):²

[Diagram: three states; "{" moves from the start state to a middle state, each of a, b, c, . . . , z loops on the middle state, and "}" moves to the success state.]

A finite state diagram that is used for string matching always needs to have a start state (usually denoted by q0) and one or several "success states" (usually denoted by F; this is a subset of Q). The latter are usually called accepting states.³ This is a good point to introduce the more traditional name finite state automaton⁴ for these diagrams, and sometimes deterministic finite state automaton to stress the fact that the transition function has precisely one successor state in each situation. Because of this latter terminology, the abbreviation is usually DFA. Don't let this confuse you! "Finite state diagram", "finite state machine", "(deterministic) finite state automaton", and "DFA" are all the same concept.

7. A pattern language. We see that finite state diagrams can express the process of string comparison, but surely we would not want to build the diagram ourselves whenever we come across a string comparison problem. Instead, what we need is a language which allows us to specify "strings with variations," also known as patterns. The language that we will introduce here is that of regular expressions. The first example, as a regular expression, is just 'Tolkien'; we use quote signs to delimit the expression, as is done in many UNIX programs.

¹ We may not be totally happy with the resulting system, though, as there seems to be no way to get our money back when we overpay. In the first exercise below you are asked to improve the design.
The second is written as 'col(o|ou)r', and we see that "|" is used to express alternatives. The final one is written as '{[a-z]*}'; this example contains two new constructs: the character range "[a-z]", which is really just an abbreviation for a long alternative:

'a|b|c|d|e|f|g|h|i|j|k|l|m|n|o|p|q|r|s|t|u|v|w|x|y|z'

and "*", which indicates an arbitrary number of repetitions of the subexpression just preceding it. In this case, it means that an arbitrary number of lower-case characters may be enclosed between the braces. Note that they do not all have to be the same; that is, the expression says "choose a character repeatedly", not "choose a character and then repeat that arbitrarily often." The symbol "*" is called Kleene star to honour the American logician Stephen Cole Kleene (1909–1994).

Let's define the language of regular expressions formally (as one would define the language of arithmetic expressions, for example):

Definition 2 (Pattern, Regular Expression) Let Σ (read "sigma") be an alphabet, that is, a finite set of characters. A pattern or regular expression over Σ is any string of symbols generated by the following inductive definition:

Base case 1. The empty string is a pattern (the "empty pattern").
Base case 2. Every letter from Σ is a pattern.
Concatenation. If p1 and p2 are patterns then so is (p1 p2).
Alternative. If p1 and p2 are patterns then so is (p1 | p2).
Kleene star. If p is a pattern then so is (p*).

We can save some brackets by defining precedence rules:

Kleene star. "*" binds most tightly. It only applies to the immediately preceding pattern.
Concatenation comes next. It does not have an explicit operator but is expressed by writing patterns next to each other.
Alternative. "|" has the lowest precedence. Unless explicitly reined in by brackets, it places everything to its left in alternative to everything on its right.
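Most programming languages ship a regular-expression library, and the three example patterns behave as described in Python's re module (a quick check; escaping the braces is my own precaution, since "{" has a special meaning in some dialects):

```python
import re

# Plain string equality: 'Tolkien'.
assert re.fullmatch(r"Tolkien", "Tolkien")
assert not re.fullmatch(r"Tolkien", "Tolkienn")

# Alternative: 'col(o|ou)r' accepts both spellings.
assert re.fullmatch(r"col(o|ou)r", "color")
assert re.fullmatch(r"col(o|ou)r", "colour")

# Character range and Kleene star: lower-case letters between braces.
assert re.fullmatch(r"\{[a-z]*\}", "{abc}")
assert re.fullmatch(r"\{[a-z]*\}", "{}")        # zero repetitions are allowed
assert not re.fullmatch(r"\{[a-z]*\}", "{aBc}")
```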
To put this to use straight away, let us define precisely what it means for a string to match a given pattern:

Definition 3 Let p be a pattern over an alphabet Σ. A string s consisting of letters from Σ matches p if one of the following holds:

Base case 1. Both pattern and string are empty.
Base case 2. The pattern is a single letter from Σ and s is that same letter.
Concatenation. The pattern is a concatenation of patterns p1 and p2, and s is a concatenation of substrings s1 and s2 such that s1 matches p1 and s2 matches p2.
Alternative. The pattern is of the form (p1|p2) and s matches either p1 or p2 (or both).
Kleene star. The pattern is of the form (q*) and s is either empty or consists of finitely many substrings s1, . . . , sn, each of which matches q.

² For brevity, I left out the error state.
³ Unfortunately, they are also sometimes called final states although there is nothing "final" about them.
⁴ Plural: automata.

8. Practical issues. Regular expressions are ubiquitous in Computer Science, and there are many tools that use them. For example, grep is a UNIX command that allows you to search a file for strings that match a given regular expression. The emacs editor supports regular expressions in its search command "M-x isearch-forward-regexp". Perl is a scripting language of great importance and utility; pattern matching with regular expressions is one of its key features. Since JDK 1.4, Java provides a package for dealing with regular expressions: import java.util.regex.*;

There are many abbreviations for lengthy regular expressions; the range "[a-z]" mentioned above is just one of them. Here are a few others:

dot. The symbol "." matches every (ASCII) character.
non-empty repetition. "+" is like the Kleene star except that there has to be at least one occurrence of the pattern.
escaping. In order to escape the special meaning of *, |, etc., one uses "\".
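These abbreviations can again be checked against Python's re module (the toy patterns here are my own):

```python
import re

# "." matches any single (ASCII) character.
assert re.fullmatch(r"s.j", "sXj")
assert not re.fullmatch(r"s.j", "sj")       # "." needs exactly one character

# "+" demands at least one occurrence; "*" allows zero.
assert re.fullmatch(r"ab+", "abbb")
assert not re.fullmatch(r"ab+", "a")
assert re.fullmatch(r"ab*", "a")

# "\" escapes a special symbol: "\*" matches a literal star.
assert re.fullmatch(r"a\*b", "a*b")
assert not re.fullmatch(r"a\*b", "aab")
```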
For example, the pattern 's\.j\.vickers' matches my email address only, whereas 's.j.vickers' would also match the string "sXjYvickers" (and many others).

9. Implementing a pattern matcher. We have motivated regular expressions with the capabilities of finite state diagrams, and so you may expect that every regular expression can be implemented by a diagram. This is indeed so, but the construction is surprisingly difficult. The rest of the handout is devoted to this issue.

The problem arises from concatenation and star, as they require us to break up the given string into smaller parts. How is a program to know where to make those breaks? The only solution in these cases seems to be to examine all possible subdivisions of the given string. While there are only finitely many of them, because strings have finite length, you can probably guess that the resulting procedure would be of exponential complexity. It will be our first task to develop an efficient method for pattern matching. Indeed, as it will turn out, it can be done in linear time, by reading the characters of the given string only once and from left to right.

10. Every pattern matching problem is solved by a finite automaton.

(a) The problem. At first glance, this task seems to be easy: since patterns are constructed inductively, we can construct the recogniser inductively as well. For example, if the pattern is a concatenation p1 p2, and if we have constructed recognisers for p1 and p2 already, then apparently all that we need to do is to append the recogniser for p2 to every accepting state of p1, schematically:

[Diagram: a copy of the automaton F2 is attached to each accepting state of F1.]

Unfortunately, this does not always work because finite automata may have arrows leading back to the start state. Look at what can happen: F1 is a DFA for the regular expression 'a(aa)*' (strings consisting of an odd number of a's) and F2 is a DFA for 'b(bb)*'.
[Diagram: F1 is a two-state automaton in which a moves back and forth between the start state and the accepting state, so that it accepts after an odd number of a's; F2 is the analogous automaton for b's; on the right, the two are glued together as suggested above.]

Glueing the two together as suggested above does not lead to a correct automaton for 'a(aa)*b(bb)*': the regular expression wants an odd number of a's followed by an odd number of b's, but the automaton accepts abbaab, for example.

A similar problem arises for alternative (where one would like to just glue the two automata together at their start states), but in general this destroys determinacy, as in this example, where I tried to construct an automaton for 'a | ab' by joining the respective automata for 'a' and 'ab' together at their start state; in the result on the right there are two different transitions labelled a leading away from the start state:

[Diagram: on the left, separate automata for 'a' and 'ab'; on the right, the automaton obtained by identifying their start states, which has two outgoing transitions labelled a.]

Kleene star is worse still, but let's not look for an example. It is clear that, when building the automaton inductively, we have to find a way to keep the constituent automata from interfering with each other.

(b) Nondeterministic automata. The way to proceed is to define a much more liberal class of automata, for which it will be easy to show that they can recognise every regular language, and then to show that they can, in fact, be simulated by deterministic finite automata.

Definition 4 (Nondeterministic finite automaton) The definition is as for deterministic automata except that δ is allowed to be a relation between pairs (state, input letter) and states. In other words, at every stage there may be many (or none at all) choices of moving to another state.

The definition of a nondeterministic finite automaton accepting a string s = s1 s2 . . . sk is as before except that we are now stipulating that the triples (q0, s1, p1), (p1, s2, p2), . . . , (pk−1, sk, pk) belong to the relation δ, with pk an accepting state. In other words, we want that some execution path will lead to an accepting state.

Note that no one in their right mind would want to write a program which implements a nondeterministic automaton. From a practical point of view there is nothing appealing about them.
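Still, acceptance in the sense of Definition 4 is easy to check without any guessing, by tracking the set of all states that some execution path could have reached so far. A sketch (Python; the encoding of δ as a set of triples is mine):

```python
def nfa_accepts(delta, start, accepting, string):
    """delta is a set of (state, letter, state) triples, i.e. a relation.
    Accept if some choice of transitions spells out the whole string
    and ends in an accepting state."""
    current = {start}
    for letter in string:
        current = {q for (p, x, q) in delta
                   if x == letter and p in current}
    return bool(current & accepting)
```

For instance, one possible encoding of the 'a | ab' automaton above is delta = {(0, "a", 1), (0, "a", 2), (2, "b", 3)} with accepting states {1, 3}; it accepts exactly "a" and "ab".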
They are a purely abstract concept, introduced to discuss our translation problem at the right level of abstraction.

In a further abstraction step we now allow automata to make transitions without reading an input symbol. This can be achieved in the formal definition by extending Σ with a new letter ε (read "epsilon") which annotates all transitions which don't read an input symbol. Let us call automata of this kind nondeterministic finite automata with ε-moves.

(c) Translating a regular expression into a nondeterministic automaton. The construction follows the inductive definition of patterns:

Base case 1. The empty pattern is recognised by a one-state automaton (with no transitions) whose start state is also accepting.

Base case 2. If the pattern consists of just a single letter x, then it is recognised by a two-state automaton in which x leads from the start state to the only accepting state.

Concatenation. If we have automata F1 and F2 for patterns p1 and p2 then we combine them with ε-moves: every accepting state of F1 gets an ε-transition to the start state of F2, and the accepting states of the combined automaton are those of F2.

Alternative. If we have automata F1 and F2 for patterns p1 and p2 then we combine them with ε-moves as follows: a new start state gets ε-transitions to the start states of F1 and F2, and the accepting states are those of F1 and F2 together.

Kleene star. If we have an automaton F for the pattern p then we construct an automaton for (p*) as follows: a new start state, which is also accepting, gets an ε-transition to the start state of F, and every accepting state of F gets an ε-transition back to the new start state.

Note that because the ε-moves are introduced one-way only, the sub-automata have to be traversed in the same way in the larger network as if they were on their own. I do not believe that a formal proof at this stage could be more convincing than the pictures, so we leave it at that.

(d) Removing ε-moves. We now continue our construction by showing that ε-moves are not really necessary in an automaton; that is, we show that every nondeterministic automaton with ε-moves can be simulated by one without. The proof is quite simple. For a given nondeterministic automaton with ε-moves (Q, q0, F, δ) we construct another one (Q, q0, F′, δ′) which is almost the same except for the transition relation and the set of accepting states.
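Concretely, the adjustment can be sketched as follows (Python; the encoding is mine: an automaton is given by its start state, its set of accepting states, and a set of (state, letter, state) triples, with None standing for ε):

```python
def remove_epsilons(start, accepting, delta):
    """Return an equivalent automaton without ε-moves.  delta is a set
    of (state, letter, state) triples; the letter None is an ε-move."""
    states = {start} | {p for (p, _, _) in delta} | {q for (_, _, q) in delta}

    def closure(p):
        # All states reachable from p by ε-moves alone (including p).
        seen, frontier = {p}, {p}
        while frontier:
            frontier = {r for (u, x, r) in delta
                        if x is None and u in frontier} - seen
            seen |= frontier
        return seen

    # New transitions: an ε-path followed by one real x-transition.
    new_delta = {(p, x, q)
                 for p in states
                 for u in closure(p)
                 for (v, x, q) in delta
                 if v == u and x is not None}
    # New accepting states: some ε-path leads to an old accepting state.
    new_accepting = {p for p in states if closure(p) & set(accepting)}
    return start, new_accepting, new_delta
```

As a made-up example, the ε-automaton with transitions {(0, None, 1), (1, "a", 2), (2, None, 0)} and accepting state 2 turns into one with transitions (0, "a", 2), (1, "a", 2) and (2, "a", 2); both accept exactly the non-empty strings of a's.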
What we do is to add to δ′ all triples (p, x, q) for which there is a path in the original automaton which starts in p and ends in q, and which consists of an arbitrary number of ε-transitions followed by a single transition labelled with x. In addition, we add to F′ (besides the original accepting states) all those states from which there is a path to some accepting state consisting entirely of ε-moves. After these adjustments, we remove all ε-transitions from the relation. Here is an example:

[Diagram: a small automaton with ε-moves and transitions labelled a and b, and next to it the automaton obtained by removing the ε-moves as just described.]

Here we get a nice little automaton after removing unreachable states. In general, the result of removing ε-moves is a highly nondeterministic automaton.

Let us check that this transformation is correct. If the original automaton accepts a string abc, say, then this means that there is a sequence of transitions all labelled with ε's except one that is labelled with "a", one labelled with "b", and one labelled with "c"; in other words, the labels on the transitions must form a string that looks like

εε . . . ε a εε . . . ε b εε . . . ε c εε . . . ε

We can split this up into four chunks of transitions, each starting with a sequence of epsilons and ending with a real character, except for the last one, which is just epsilons; we also insert some names for intermediary states:

q0 --εε . . . ε a--> q1 --εε . . . ε b--> q2 --εε . . . ε c--> q3 --εε . . . ε--> q4

By definition of the revised automaton, there will be a transition labelled a from q0 to q1, one labelled b from q1 to q2, and one labelled c from q2 to q3. Again by construction, the state q3 will have been added to the set of accepting states because q4 is accepting. So indeed, abc will be accepted by the new automaton. The reverse implication is shown in the same way.

(e) Removing nondeterminacy. Let (Q, q0, F, δ) be a nondeterministic finite automaton (without ε-moves). We construct a deterministic automaton which accepts the same strings as follows.
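In code, the construction looks like this (a Python sketch, my own encoding; it already builds subset states only as they are reached from the start, rather than all of the powerset up front):

```python
def determinize(delta, start, accepting):
    """Subset construction for an NFA without ε-moves.  delta is a set
    of (state, letter, state) triples.  Subset states (frozensets) are
    created only as they are reached from {start}."""
    alphabet = {x for (_, x, _) in delta}
    start_set = frozenset({start})
    dfa_delta, todo, seen = {}, [start_set], {start_set}
    while todo:
        A = todo.pop()
        for x in alphabet:
            # All states reachable by an x-transition from some state in A.
            B = frozenset(q for (p, y, q) in delta if y == x and p in A)
            dfa_delta[(A, x)] = B
            if B not in seen:
                seen.add(B)
                todo.append(B)
    # A subset is accepting if it contains at least one accepting state.
    dfa_accepting = {A for A in seen if A & set(accepting)}
    return dfa_delta, start_set, dfa_accepting

def dfa_accepts(dfa_delta, start_set, dfa_accepting, string):
    A = start_set
    for x in string:
        A = dfa_delta[(A, x)]
    return A in dfa_accepting
```

As a made-up test case, the NFA {(0, "a", 0), (0, "b", 0), (0, "a", 1), (1, "b", 2)} with accepting state 2 (strings over a, b ending in "ab") determinizes to a DFA that agrees with it.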
The set of states will be the powerset of Q, that is, the set of all subsets of Q. We denote it by PQ. This is still a finite set. The start state is the one-element set {q0}. Given the input character x, the new transition function δ′ maps a set of states A to the set B of exactly those states that can be reached by an x-transition from some state in A. As a formula:

δ′(A, x) = {q ∈ Q | ∃p ∈ A. (p, x, q) ∈ δ}.

In other words, we explore all possible paths in the nondeterministic automaton in parallel (and this is exactly how implementations of pattern matchers work). For this to be correct, we have to say that a set is an accepting state of the new automaton if it contains at least one state from the original F.

[Diagram: on the left, a nondeterministic automaton with states 0, 1, 2 and transitions labelled a and b; on the right, the deterministic automaton obtained from it, with one node for each subset of {0, 1, 2}.]

We get a pretty complicated automaton even in this very simple situation, but this is unavoidable. Note, however, that the automaton can be simplified a little bit by erasing all nodes which cannot be reached from the start state:

[Diagram: the same deterministic automaton restricted to the reachable subsets {0}, {1, 2} and {0, 1, 2}.]

In practice, you would not first create the diagram with all nodes (for all subsets) and then delete the inaccessible ones. Much more efficient is to create nodes only as they arise. In the example above, start with state {0}. From there, transition a can go to either 1 or 2, so we also need the state {1, 2}. From 1 or 2, a goes nowhere, but b goes to 0, 1 or 2, so we need {0, 1, 2}. With these three subsets we find we can complete the diagram as above, so they are enough. Note also that we would usually omit the empty set {}, since that represents an error state.

Exercise Sheet 1

Quickies (I suggest that you try to do these before the Exercise Class.)

1. Alter the finite state diagram of the Coke machine so that it doesn't get trapped in an error state if too much money has been inserted. Also, extend it with a "cancel" operation (which returns the money inserted so far). Change and extend the transition table accordingly.

2.
Build a finite state diagram that checks whether a string is equal to "Goo....gle" with arbitrarily many o's following the initial two.

3. Design deterministic finite automata for the following patterns:
(a) (a|b)c
(b) ab|bc
(c) ab|ac (Careful! Remember that from any state there must be at most one transition labelled with a particular letter.)
(d) c(a|b)*c

Classroom exercises (Hand in to your tutor at the end of the exercise class.)

4. Design a deterministic finite automaton for the pattern c(a|b)*ac. Which of the following strings are matched by it? "caaaac", "cbbbbc", "cababc", "cbabac". (Check them against both the pattern and the automaton.) (2+2 marks)

Homework exercises (Submit via the correct pigeon hole before next Monday, 2pm.)

5. Write a regular expression matched by strings of a's and b's with at least three characters, in which the last but two is an a. (Hence it ends in a?? and not b??.) Design, first, a non-deterministic finite automaton for it, and then a deterministic finite automaton. (2+2 marks)

6. Write out a regular expression that matches UK postcodes. For simplicity, take it that a postcode has the following format. It has two parts, separated by a space. The first part is either a single letter (B, G, L, M or S for Birmingham, Glasgow, Liverpool, Manchester or Sheffield, or E, N or W for the East, North or West parts of London) or two letters, followed by one or two digits. The second part is a digit followed by two letters. You should use the UNIX abbreviations "[0-9]" and "[A-Z]", which match all digits and all uppercase letters, respectively. (2 marks)

Stretchers (Problems in this section go beyond what we expect of you in the May exam. Please submit your solution through the special pigeon hole dedicated to these exercises. The deadline is the same as for the other homework.)

7. Consider the regular expressions
'(a|b)*a'
'(a|b)*a(a|b)'
'(a|b)*a(a|b)(a|b)'
'(a|b)*a(a|b)(a|b)(a|b)'
etc.
(a) For each expression, use everyday language to characterise the matching strings.
(b) Design deterministic finite automata for them.
(c) Conclude that the size of the smallest deterministic finite automaton matching a given pattern can grow exponentially with the length of the pattern. (4 bonus points)