Lecture 3 Regular Expressions - by fjwuxn



      Lecture 3
          Regular Expressions
Motivation: To search for strings using partially
 specified patterns.

Examples:
• To validate data fields (dates, email, address)
• To filter text (disallowed web sites)
• To identify particular strings in a text
• To do replacement in a text (color -> colour)
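
The four uses above can be sketched with Python's re module. This is only an illustration in Python 3 syntax; the date pattern, the blocked host names and the sample strings are hypothetical, and real validation would need stricter patterns.

```python
import re

# Validate a data field: a deliberately loose DD/MM/YYYY date pattern.
DATE = re.compile(r"\d\d/\d\d/\d\d\d\d")
is_date = DATE.fullmatch("03/11/2009") is not None

# Filter text: flag URLs whose host is on a hypothetical block list.
BLOCKED = re.compile(r"https?://(www\.)?(badsite|spamhub)\.com\b")
is_blocked = BLOCKED.search("visit http://www.badsite.com/page") is not None

# Identify particular strings: find every capitalised word.
names = re.findall(r"\b[A-Z][a-z]+\b", "Alice met Bob in Paris")

# Replace in a text: American to British spelling.
british = re.sub(r"\bcolor\b", "colour", "The color red is a warm color.")
```

Here fullmatch checks the whole field, while search and findall scan inside a longer text.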
          Formal Definition of Regular
                 Expressions
         Regular expressions can be defined over a finite alphabet ∑:
1.       ε is a regular expression and denotes the set {ε}.
2.       For each a in ∑, a is a regular expression and denotes the
         set {a}.
3.       If r and s are regular expressions denoting the sets R and
         S respectively, then
     –      (r | s) is a regular expression denoting R ∪ S.
     –      (r.s) is a regular expression denoting the concatenation RS.
     –      (r*) is a regular expression denoting R*.
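
The three operations in rule 3 can be tried out by representing each language as a Python set of strings. This is a minimal sketch in Python 3 syntax; the function names union, concat and star are chosen here for illustration, and star takes a hypothetical max_len cut-off because R* is infinite whenever R contains a non-empty string.

```python
def union(R, S):
    """Language denoted by (r | s)."""
    return R | S

def concat(R, S):
    """Language denoted by (r.s): each x in R followed by each y in S."""
    return {x + y for x in R for y in S}

def star(R, max_len):
    """The strings of R* that have at most max_len characters."""
    result = {""}        # the empty string is always in R*
    frontier = {""}
    while frontier:
        # Extend the newest strings by one more element of R.
        frontier = {w for w in concat(frontier, R)
                    if len(w) <= max_len and w not in result}
        result |= frontier
    return result

A, B = {"a"}, {"b"}
# Strings denoted by ((a | b).(a*)), with a* cut off at two characters.
lang = concat(union(A, B), star(A, 2))
```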
         Advantages of REs
• The language can be stated as a formal
  algebra.
• Regular expressions form a language for
  expressing patterns.
• Recognizers for regular expressions can be
  implemented efficiently.
               Recognizers
• A recognizer for a language is a program
  that takes as input a string x and answers
  “yes” if x is a sentence of the language and
  “no” otherwise.

• This recognizer is a machine which only
  emits two possible responses to its input.
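
A recognizer of this kind can be sketched with Python's re module (Python 3 syntax). The helper name make_recognizer and the example language of binary numerals are assumptions made for illustration.

```python
import re

def make_recognizer(pattern):
    """Build a recognizer for the language described by pattern."""
    compiled = re.compile(pattern)

    def recognize(x):
        # fullmatch succeeds only if the whole of x is in the language.
        return "yes" if compiled.fullmatch(x) else "no"

    return recognize

# Hypothetical example language: non-empty strings of 0s and 1s.
is_binary = make_recognizer(r"[01]+")
answers = [is_binary("10110"), is_binary("10210"), is_binary("")]
```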
        Finite State Automaton
• A Finite State Automaton (FSA) is an abstract
  finite machine.
• Regular expressions can be viewed as a way to
  describe a Finite State Automaton (FSA).
• Kleene's Theorem (1956): FSAs and REs describe
  the same languages:
   – Any regular expression can be implemented as an FSA.
   – Any FSA can be described by a regular expression.
• Regular languages are those that can be recognized
  by FSAs (or characterized by regular
  expressions).
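
One direction of Kleene's theorem can be illustrated by hand-building an automaton for a single regular expression and checking it against a regex engine. The sketch below (Python 3 syntax) uses the example expression (a|b)*abb, which is chosen here and does not come from the slides; the transition table is written by hand rather than derived by a general construction.

```python
import itertools
import re

# Hand-built deterministic FSA for the regular expression (a|b)*abb.
# States 0-3 record how much of the suffix "abb" has been seen so far.
TRANSITIONS = {
    (0, "a"): 1, (0, "b"): 0,
    (1, "a"): 1, (1, "b"): 2,
    (2, "a"): 1, (2, "b"): 3,
    (3, "a"): 1, (3, "b"): 0,
}
START, ACCEPTING = 0, {3}

def fsa_accepts(x):
    """Run the automaton over x; accept if it stops in an accepting state."""
    state = START
    for ch in x:
        if (state, ch) not in TRANSITIONS:   # symbol outside the alphabet
            return False
        state = TRANSITIONS[(state, ch)]
    return state in ACCEPTING

def re_accepts(x):
    """Recognize the same language with the regular expression itself."""
    return re.fullmatch(r"(a|b)*abb", x) is not None

# Compare both recognizers on every string over {a, b} up to length 6.
strings = ("".join(p) for n in range(7)
           for p in itertools.product("ab", repeat=n))
agree = all(fsa_accepts(s) == re_accepts(s) for s in strings)
```

Agreement on a finite sample is evidence rather than proof that the two descriptions define the same language.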
       Basic Metacharacters
Wild card: .
Optional: ?
Repetition: * and +
Choice: [Mm][0123456789]
Ranges: [a-z][0-9]
Negation: [^Mm] (only when '^' occurs
 immediately after '[')
Disjunction: |
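
Each metacharacter above can be checked against a small example. The sketch below uses Python 3's re module; the helper whole_match and all sample patterns and strings are made up for illustration.

```python
import re

def whole_match(pattern, text):
    """True if the pattern matches the entire text."""
    return re.fullmatch(pattern, text) is not None

# (pattern, text, expected answer)
checks = [
    (r"c.t",              "cut",   True),   # . matches any one character
    (r"c.t",              "ct",    False),  # ... but not zero characters
    (r"colou?r",          "color", True),   # ? makes the u optional
    (r"ab*",              "a",     True),   # * allows zero or more b
    (r"ab+",              "a",     False),  # + requires at least one b
    (r"[Mm][0123456789]", "m7",    True),   # choice from a set
    (r"[a-z][0-9]",       "q7",    True),   # ranges
    (r"[^Mm]7",           "M7",    False),  # negated set excludes M and m
    (r"[M^]7",            "^7",    True),   # ^ is literal when not first
    (r"cat|dog",          "dog",   True),   # disjunction
]
results = [whole_match(p, t) == expected for p, t, expected in checks]
```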
             Special Backslashes
\d: digit (i.e. numeral)
\D: non-digit
\s: 'whitespace'
\S: non-whitespace
\w: 'alphanumeric' ([a-zA-Z0-9_], underscore included)
\W: non-alphanumeric ([^a-zA-Z0-9_])
Standard escape sequences
\t: tab
\n: newline
\ is a general escape character.
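
The backslash classes can be tried on a short sample text. This sketch uses Python 3's re module with raw strings (r"...") so that the backslashes reach the regex engine unchanged; the sample text is invented.

```python
import re

text = "Room 42\tis on\nfloor 3"

digits = re.findall(r"\d+", text)    # runs of digits
words = re.findall(r"\w+", text)     # runs of word characters
spaces = re.findall(r"\s", text)     # single whitespace characters
non_digits = re.findall(r"\D+", "a1b22c")

# \w also accepts the underscore.
underscore_ok = re.fullmatch(r"\w+", "snake_case") is not None

# The backslash as a general escape: \. matches a literal dot only.
literal_dot = re.fullmatch(r"3\.14", "3.14") is not None
any_char = re.fullmatch(r"3\.14", "3x14") is not None
```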
                 Anchors
• Anchors are zero-width: they consume no characters.
• Anchors do not match strings in the text;
  instead they match positions in the text.
^: matches beginning of line (or text)
$: matches end of line (or text)
\b: matches word boundary (i.e. a location
  with \w on one side but not the other)
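
The difference between matching text and matching positions shows up in a small experiment. The sketch below uses Python 3's re module on an invented two-line text; the re.MULTILINE flag is what makes ^ and $ apply to each line rather than to the whole text.

```python
import re

text = "cat concat\ncat category"

everywhere = re.findall(r"cat", text)                   # all four occurrences
text_start = re.findall(r"^cat", text)                  # start of text only
line_starts = re.findall(r"^cat", text, re.MULTILINE)   # start of each line
text_end = re.findall(r"cat$", text)                    # text ends in "gory"
line_ends = re.findall(r"cat$", text, re.MULTILINE)     # line 1 ends in "cat"

# \b keeps "cat" only where it stands as a whole word.
word_positions = [m.start() for m in re.finditer(r"\bcat\b", text)]

# An anchor on its own matches a position: the match is an empty string.
boundary = re.search(r"\b", "hi")
zero_width = (boundary.group(), boundary.start(), boundary.end())
```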
       Introduction to Python
• Development started in 1990 at CWI
  (National Research Institute for
  Mathematics and Computer Science) in
  Amsterdam.
• Owned by Python Software Foundation.
• Open Source Language
  – Download from www.python.org
  – Extensive Documentation and tutorials
        Introduction to Python
• Available for Unix, Linux, Windows, Mac, etc.
• Easy to Learn, User friendly.
• Clear Syntax.
• Object Oriented Paradigm (encourages good
  programming practices).
• A small number of Powerful high-level data types.
• New built-in functions/modules and data types can
  be added by implementing them in a compiled
  language like C/C++.
          Introduction to Python
• Variables
  – Name that refers to a certain value
  – Limitations:
     •   Cannot be a keyword (e.g. print, and, or, if etc.)
     •   Cannot start with a number.
     •   Case sensitive.
     •   Cannot include illegal characters (e.g. $, %, +, =,
         etc.)
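
These naming rules can be checked programmatically. The sketch below assumes Python 3, where str.isidentifier and the keyword module are available; note that print is a keyword only in Python 2, so Python 3 would accept it as a name. The helper is_valid_name and the candidate names are invented for illustration.

```python
import keyword

def is_valid_name(name):
    """True if name can be used as a variable name."""
    return name.isidentifier() and not keyword.iskeyword(name)

candidates = ["total", "Total", "_count2", "2fast", "if", "pay$", "a+b"]
valid = [name for name in candidates if is_valid_name(name)]

# Case sensitivity: these are two different variables.
age = 32
Age = 33
```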
          Introduction to Python
• Numbers
  – Integers:
     •   Whole numbers with no decimal places.
     •   Size = 4 bytes (32 bits).
     •   Dividing two integers gives a whole-number result (Python 2 behaviour).
     •   Long integers are written with an L at the end of the number
         (454321354534L). These numbers are larger than 2 billion.
  – Floating point numbers are numbers with a decimal
    point.
  – If you want a decimal result, make at least one operand a
    decimal number (e.g: 10/4.0 = 2.5 and 10/4 = 2).
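
The division rule above describes Python 2. As a hedged comparison, the sketch below shows the same arithmetic in Python 3 syntax, where / always gives a decimal result, // gives the whole-number result, and integers grow as needed without an L suffix.

```python
whole = 10 // 4        # floor division: the whole-number result
exact = 10 / 4.0       # at least one float operand: decimal result
also_exact = 10 / 4    # in Python 3, / is true division even for two ints

# Python 3 integers are not limited to 32 bits and take no L suffix.
big = 454321354534 * 1000
```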
        Introduction to Python
• Strings
  – A string is a sequence of characters and must be enclosed in
    single or double quotation marks.
    e.g: course = "Introduction to AI techniques"
  – Use a backslash to escape characters that have a special
    meaning.
    e.g: var1 = "He said \"I play cricket\" "
         var2 = "It\'s amazing"
  Others: Backslash = \\
          New Line = \n
          Tab = \t etc.
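
The escape sequences can be verified by looking at the resulting strings. This is a small sketch in Python 3 syntax; the variable names path, table and raw and their contents are invented. The raw string at the end is an addition to what the slide lists: it is the usual way to write regular expressions, since backslashes in it are kept literally.

```python
course = "Introduction to AI techniques"
var1 = "He said \"I play cricket\""
var2 = 'It\'s amazing'            # the escape is required inside single quotes

path = "C:\\temp"                 # \\ stores one backslash
table = "name\tage\nAli\t32"      # \t is a tab, \n starts a new line
rows = table.splitlines()

raw = r"\d+\n"                    # raw string: 5 characters, nothing escaped
```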
       Introduction to Python
• Concatenation: the + operator is overloaded for strings
  e.g: Str = str1 + "XYZ" + str2


• Repetition: Repeating a string
  e.g: str = "superman"
       print str*3
       >>> supermansupermansuperman
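
Both operators can be checked directly. The sketch below uses Python 3 syntax and invented values; it also shows one way to get the copies separated by spaces, since repetition itself inserts no separator.

```python
str1, str2 = "abc", "def"
joined = str1 + "XYZ" + str2       # concatenation

hero = "superman"
repeated = hero * 3                # copies are joined with nothing between
spaced = " ".join([hero] * 3)      # copies separated by single spaces
```

The name hero is used instead of str so that the built-in str type is not shadowed.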
       Introduction to Python
• Math Operations
  – Basic operations: Add + , Subtract -, Multiply
    *, Divide /, Exponent **, Modulus %
  – Order of precedence
     • Parentheses, Exponents, Multiply/Divide,
       Add/Subtract.
     • Operators of equal precedence are evaluated left to right
       (except **, which groups right to left).
     e.g: 6 * ( 3+2 ) = 30
          6 * 3 + 2 = 20
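
The precedence rules can be confirmed by evaluating a few expressions. The first two come from the slide; the others are added here as illustration.

```python
a = 6 * (3 + 2)     # parentheses first: 6 * 5
b = 6 * 3 + 2       # multiplication before addition: 18 + 2
c = 10 - 4 - 3      # equal precedence, left to right: (10 - 4) - 3
d = 2 ** 3 ** 2     # exponents group right to left: 2 ** (3 ** 2)
e = 17 % 5          # modulus: remainder of 17 divided by 5
```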
       Introduction to Python
• Input
  – For strings, use raw_input()
    e.g.: email = raw_input("What is your email?")
  – For numbers, use input()
    e.g.: age = input("What is your age?")
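
The slide describes Python 2. In Python 3, raw_input() no longer exists and input() always returns a string, so numbers have to be converted explicitly. The sketch below assumes Python 3; the function ask_age and its read parameter are hypothetical, introduced so the example can run without a keyboard.

```python
def ask_age(read=input):
    """Ask for an age and return it as an int.

    read is a hypothetical hook (the built-in input by default) that
    receives the prompt and returns the text the user typed.
    """
    reply = read("What is your age? ")
    return int(reply.strip())      # convert the typed text to a number

# Run without a keyboard: pretend the user typed " 32 ".
age = ask_age(lambda prompt: " 32 ")
```

Calling ask_age() with no argument would prompt on the console as usual.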
        Introduction to Python
• Output
  – For strings, use print
    e.g.: print "What is your email?"
    >>> What is your email?
    e.g.: email = "What is your email?"
          print email
    >>> What is your email?
  – For numbers, use the same print
    e.g.: print pi
    >>> 3.14159
        Introduction to Python
• Comments
  – Use the # symbol to specify comments
  – Everything after # on a line is ignored by the interpreter
  e.g.: age = 32 # age must be greater than 32
  – Python has no start and end symbols for multi-line comments;
    begin each commented line with #
• Indentation
  – Whitespace-sensitive language (Danger: Be careful)
  e.g.: Indented body (correct)    Unindented body (error)
        if x != y:                 if x != y:
            x = y                  x = y
                 Python Resources
• How to Think Like a Computer Scientist: Learning with Python,
  by Allen B. Downey, Jeffrey Elkner and Chris Meyers.
  This text has been released under the Open Book Project.
• Learning Python, by Mark Lutz and David Ascher.
  This is a good book for beginners to Python. Look here for corrections,
  source code, etc.
• Programming Python, by Mark Lutz.
  A programmer's reference. 1255pp, and not intended for beginners.
• Dive into Python, by Mark Pilgrim.
  Advertised as a free book for "experienced programmers". The
  Homepage also has a number of useful links to Python resources.
• Main site for Python documentation
  Note the reference page for regular expression syntax.
• Python HOWTO Page (including the Regular Expression HOWTO)

								