# Lecture 3 Regular Expressions - by fjwuxn

```

Lecture 3
Regular Expressions
Motivation: To search for strings using partially
specified patterns.

Examples:
• To validate data fields (dates, email, address)
• To filter text (disallowed web sites)
• To identify particular strings in a text
• To do replacement in a text (color -> colour)
Formal Definition of Regular Expressions
Regular expressions can be defined over a finite alphabet Σ:
1.       ε is a regular expression and denotes the set {ε}.
2.       For each a in Σ, a is a regular expression and denotes the
set {a}.
3.       If r and s are regular expressions denoting the sets R and
S respectively, then
–      (r | s) is a regular expression denoting R ∪ S.
–      (r.s) is a regular expression denoting the concatenation RS.
–      (r*) is a regular expression denoting R*.
• The language can be stated as a formal
algebra.
• Regular expressions form a language for
expressing patterns.
• Recognizers for regular expressions can be
implemented efficiently.
Recognizers
• A recognizer for a language is a program
that takes as input a string x and answers
“yes” if x is a sentence of the language and
“no” otherwise.

• This recognizer is a machine which only
emits two possible responses to its input.
Finite State Automaton
• A Finite State Automaton (FSA) is an abstract
finite machine.
• Regular expressions can be viewed as a way to
describe a Finite State Automaton (FSA).
• Kleene's Theorem (1956): FSAs and REs describe
the same languages:
– Any regular expression can be implemented as an FSA.
– Any FSA can be described by a regular expression.
• Regular languages are those that can be recognized
by FSAs (or characterized by a regular
expression).
Basic Metacharacters
Wild card: .
Optionality: ?
Repetition: * and +
Choice: [Mm][0123456789]
Ranges: [a-z][0-9]
Negation: [^Mm] (only when '^' occurs
immediately after '[')
Disjunction: |
Special Backslashes
\d: digit (i.e. numeral)
\D: non-digit
\s: 'whitespace'
\S: non-whitespace
\w: 'word' character (alphanumeric or underscore: [a-zA-Z0-9_])
\W: non-word character
Standard escape sequences
\t: tab
\n: newline
\ is a general escape character.
Anchors
• Anchors are zero-width: they consume no characters.
• Anchors do not match strings in the text;
instead they match positions in the text.
^: matches beginning of line (or text)
$: matches end of line (or text)
\b: matches word boundary (i.e. a location
with \w on one side but not the other)
Introduction to Python
• Development started in 1990 at CWI
(National Research Institute for
Mathematics and Computer Science) in
Amsterdam.
• Owned by Python Software Foundation.
• Open Source Language
– Extensive Documentation and tutorials
• Available for Unix, Linux, Windows, Mac, etc.
• Easy to Learn, User friendly.
• Clear Syntax.
• Object Oriented Paradigm (encourages good
programming practices).
• A small number of Powerful high-level data types.
• New built-in functions/modules and data types can
be added by implementing them in a compiled
language like C/C++.
• Variables
– Name that refers to a certain value
– Limitations:
•   Cannot be a keyword (e.g. print, and, or, if, etc.)
•   Case sensitive.
•   Cannot include illegal characters (e.g. $, %, +, =,
etc.)
• Numbers
– Integers:
•   Whole numbers with no decimal places.
•   Size = 4 bytes (32 bits).
•   Dividing two integers gives a whole-number result.
•   Long integers are marked by an L at the end of the number
(454321354534L). These numbers are larger than about 2 billion.
– Floating point numbers are numbers with a decimal
point.
– If you want a decimal result, make at least one operand
a decimal number (e.g: 10/4.0 = 2.5 and 10/4 = 2).
• Strings
– A string is a sequence of characters and must be inside single or
double quotation marks.
e.g: course = "Introduction to AI techniques"
– Use a backslash to escape characters that have a special
meaning.
e.g: var1 = "He said \"I play cricket\" "
var2 = 'It\'s amazing'
Others: Include Backslash = \\
New Line = \n
Tab = \t etc.
• Concatenation: + Operator Overloaded
e.g: Str = str1 + "XYZ" + str2

• Repetition: Repeating a string
e.g: str = "superman"
print str*3
>>>supermansupermansuperman
• Math Operations
– Basic operations: Add + , Subtract -, Multiply
*, Divide /, Exponent **, Modulus %
– Order of precedence
• Parentheses, Exponents, Multiply/Divide, Add/Subtract
• Multiply/Divide and Add/Subtract are evaluated left to right
e.g: 6 * ( 3+2 ) = 30
6 * 3 + 2 = 20
• Input
– For strings, use raw_input()
e.g.: email = raw_input("What is your email?")
– For numbers, use input()
e.g.: age = input("What is your age?")
• Output
– For strings, use print
e.g.: print "What is your email?"
>>> What is your email?
e.g.: email = "What is your email?"
print email
>>> What is your email?
– For numbers, use the same print
e.g.: print pi
>>> 3.14159
• Comments
– Use the # symbol to specify comments
– Everything after # will be ignored by the interpreter
e.g.: age = 32 # age must be greater than 32
– There are no start and end symbols for commenting multiple lines;
put # at the start of each line
• Indentation
– Whitespace-sensitive language (Danger: Be careful)
e.g.: if x != y:                if x != y:
          x = y                 x = y
Python Resources
• How to Think Like a Computer Scientist: Learning with Python,
by Allen B. Downey, Jeffrey Elkner and Chris Meyers.
This text has been released under the Open Book Project.
• Learning Python, by Mark Lutz and David Ascher.
This is a good book for beginners to Python. Look here for corrections,
source code, etc.
• Programming Python, by Mark Lutz.
A programmer's reference. 1255pp, and not intended for beginners.
• Dive into Python, by Mark Pilgrim.
Advertised as a free book for "experienced programmers". The
Homepage also has a number of useful links to Python resources.
• Main site for Python documentation
Note the reference page for regular expression syntax.
• Python HOWTO Page (including the Regular Expression HOWTO)

```
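The metacharacters, special backslashes and anchors listed in the slides can be tried out with Python's standard `re` module. The following is a minimal sketch, not part of the original lecture: the sample text and the helper name `is_phone` are made up for illustration, and the code is written for Python 3 (the slides themselves use Python 2 syntax such as `print x`).

```python
import re

# Hypothetical sample text, chosen only to exercise the patterns below.
text = "Room M4 is red; room m12 has color charts. Call 555-0199."

# Choice and repetition: 'M' or 'm' followed by one or more digits.
rooms = re.findall(r"[Mm][0-9]+", text)

# Special backslash: \d matches a digit; + repeats it one or more times.
numbers = re.findall(r"\d+", text)

# Word boundary anchor: \b matches a position, so 'red' is found only
# as a whole word and not inside 'hundred'.
whole_word = re.search(r"\bred\b", text) is not None
inside_word = re.search(r"\bred\b", "a hundred") is not None

# Optionality: 'u?' makes the 'u' optional, so both spellings match.
spellings = re.findall(r"colou?r", "color and colour")

# Replacement in a text (color -> colour).
replaced = re.sub(r"\bcolor\b", "colour", text)


def is_phone(s):
    """Validate a field: ^ and $ force the whole string to be digits-digits."""
    return re.search(r"^\d+-\d+$", s) is not None


print(rooms)
print(numbers)
print(spellings)
print(replaced)
print(is_phone("555-0199"), is_phone("Call 555-0199"))
```

The patterns are written as raw strings (`r"..."`) so that Python does not interpret the backslashes in `\d` and `\b` before the regular expression engine sees them.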