									  LIS651 lecture 4
regular expressions


    Thomas Krichel
      2006-12-03
              remember DOS?
• DOS had the * character as a wildcard. If
  you said
  DIR *.EXE
• It would list all the files ending with .EXE
• Thus the * wildcard would mean “all
  characters except the dot”
• Similarly, you could say
  DEL *.*
• to delete all your files
             regular expression
• Is nothing but a fancy wildcard.
• There are various flavours of regular
  expressions.
  – We will be using POSIX regular expressions
    here. They themselves come in two flavors
     • old-style
     • extended
    We study extended here aka POSIX 1003.2.
  – Perl regular expressions are more powerful and
    more widely used.
• POSIX regular expressions are accepted by
  both PHP and mySQL. Details are to follow.
                   pattern
• The regular expression describes a pattern
  of characters.
• Patterns are common in other
  circumstances.
  – Query: 'Krichel Thomas' in Google
  – Query: '"Thomas Krichel"' in Google
  – Dates are of the form yyyy-mm-dd.
            pattern matching
• We say that a regular expression matches
  the string if an instance of the pattern
  described by the regular expression can be
  found in the string.
• Saying “matches in the string” may make
  this a little clearer.
• Sometimes people also say that the string
  matches the regular expression.
• I am confused.
             metacharacters
• Instead of just giving the star * special
  meaning, in a regular expression all the
  following have special meaning
  \^$.|()*+{}?[]
• Collectively, these characters are known as
  metacharacters. They don't stand for
  themselves but they mean something else.
• For example DEL *.EXE does not mean:
  delete the file "*.EXE". It means delete
  anything ending with .EXE.
             metacharacters
• We are somehow already familiar with
  metacharacters.
  – In XML < means start of an element. To use <
    literally, you have to use &lt;
  – In PHP the "\n" does not mean backslash and
    then n. It means the newline character.
      simple regular expressions
• Characters that are not metacharacters just
  simply mean themselves
  'good'     does not match in   'Good Beer'
  'd B'      matches in          'Good Beer'
  'dB'       does not match in   'Good Beer'
  'Beer '    does not match in   'Good Beer'
• If there are several matches, the pattern will
  match at the first occurrence.
  'o' matches in 'Good Beer'
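The examples on this slide can be tried out with a minimal sketch like the one below. It assumes a current PHP, where the ereg family no longer exists (it was removed in PHP 7.0), so it uses preg_match() with ~ as the pattern delimiter. matches_in() is a hypothetical helper written for this illustration, not a PHP built-in.

```php
<?php
// Hypothetical helper: true if $regex matches somewhere in $subject.
// Assumes $regex does not contain the delimiter character '~'.
function matches_in(string $regex, string $subject): bool
{
    return preg_match('~' . $regex . '~', $subject) === 1;
}

$r1 = matches_in('good', 'Good Beer');   // false: the case differs
$r2 = matches_in('d B', 'Good Beer');    // true
$r3 = matches_in('dB', 'Good Beer');     // false: the space is missing
$r4 = matches_in('Beer ', 'Good Beer');  // false: no trailing space in the string

// With several candidates, the match is reported at the first occurrence.
preg_match('~o~', 'Good Beer', $m, PREG_OFFSET_CAPTURE);
$first_offset = $m[0][1];                // 1, the first of the two o's
echo $first_offset, "\n";
```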
         the backslash \ quote
• If you want to match a metacharacter in the
  string, you have to quote it with the
  backslash
  'a 6+ pack' does not match in   'a 6+ pack'
  'a 6\+ pack' does match in      'a 6+ pack'
  '\' does not match in 'a \ against boozing'
  '\\' does match in 'a \ against boozing'
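A short sketch of the quoting rule, assuming preg_match() in place of the older ereg(). It also shows preg_quote(), a real PHP function that adds the backslashes for you when the text to match is a literal. The variable names are illustrative only.

```php
<?php
$subject = 'a 6+ pack';

// Unquoted, + is a repetition operator ("one or more 6"), so there is no match.
$unquoted = preg_match('~a 6+ pack~', $subject);     // 0
// Quoted with a backslash, + stands for itself.
$quoted = preg_match('~a 6\+ pack~', $subject);      // 1

// preg_quote() puts a backslash in front of every metacharacter.
$escaped = preg_quote('a 6+ pack');                  // a 6\+ pack
$auto = preg_match('~' . $escaped . '~', $subject);  // 1

// In a single-quoted PHP string, '\\\\' is two backslashes. The regex
// engine reads them as \\, which matches one literal backslash.
$backslash = preg_match('~\\\\~', 'a \\ against boozing'); // 1
```

Note the two layers of quoting in the last line: PHP first reduces the string literal, then the regular expression engine interprets what is left.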
    other characters to be quoted
• Certain non-metacharacters also need to
  be quoted. These include some of the usual
  suspects
  – \n the newline
  – \r the carriage return
  – \t the tabulation character
• But this quoting occurs by virtue of PHP, it
  is not part of the regular expression.
• Remember Sandford's law.
   anchor metacharacters ^ and $
• ^ matches at the beginning of the string.
• $ matches at the end of the string.
  'keeper'    matches in       'beerkeeper'
  'keeper$'   matches in       'beerkeeper'
  '^keeper'   does not match in 'beerkeeper'
  '^$'        matches in       ''
• Note that in a double-quoted PHP string an
  expression starting with $ will be replaced
  by the variable's string value (or nothing if
  the variable has not been set).
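The following sketch shows the anchors and the double-quote pitfall side by side. It assumes preg_match() rather than ereg(), and the variable $keeper is made up for the demonstration.

```php
<?php
$end   = preg_match('~keeper$~', 'beerkeeper');  // 1: found at the end
$start = preg_match('~^keeper~', 'beerkeeper');  // 0: the string starts with "beer"
$empty = preg_match('~^$~', '');                 // 1: the empty string

// A variable that happens to share its name with part of the pattern.
$keeper = 'bar';

// Single quotes: the $ reaches the regex engine untouched.
$single = '^beer$';
// Double quotes: PHP replaces $keeper before the regex is ever compiled.
$double = "^beer$keeper";

echo $single, "\n";  // ^beer$
echo $double, "\n";  // ^beerbar
```

Writing patterns in single quotes is one simple way to sidestep the problem.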
                 character classes
• We can define a character class by
  grouping a list of characters between [ and ]
  'b[ie]er'        matches in 'beer'
  'b[ie]er'        matches in 'bier'
  '[Bb][ie]er'     matches in 'Bier'
• Within a class, metacharacters need not be
  escaped. In the class only -, ] and ^ are
  metacharacters.
           - in the character class
• Within a character class, the dash - becomes
  a metacharacter.
• You can use it to give a range, according to the
  sequence of characters in the character set
  you are using. It's usually alphabetic.
  'be[a-e]r'   matches in           'beer'
  'be[a-e]r'   matches in           'becr'
  'be[a-e]r'   does not match in    'befr'
• If the dash - is the last character in the class,
  it is treated like an ordinary character.
         ] in the character class
• ] gives you the end of the class. But if you
  put it first, it is treated like an ordinary
  character, because having it there
  otherwise would create an empty class, and
  that would make no sense.
  'be[],]r' matches in        'be]r'
         ^ in the character class
• If the caret ^ appears as the first element in
  the class, it negates the characters
  mentioned.
  'be[^i]r'     matches in          'beer'
  'b[^ie]er'    does not match in   'bier'
  'be[^a-e]r'   does match in       'befr'
  'be[e^]r'     matches in          'beer'
  'beer[^6-9]'  matches in          'beer0' to 'beer5'
• Otherwise, it is an ordinary character.
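The class rules from the last few slides (lists, ranges, a leading ], a leading ^) can be checked with a sketch like this one. It assumes preg_match(), whose bracket expressions behave the same way for these cases; matches_in() is a hypothetical helper, not part of PHP.

```php
<?php
// Hypothetical helper: true if $regex matches somewhere in $subject.
// Assumes $regex does not contain the delimiter character '~'.
function matches_in(string $regex, string $subject): bool
{
    return preg_match('~' . $regex . '~', $subject) === 1;
}

$list      = matches_in('b[ie]er', 'bier');     // true:  i is in the list
$in_range  = matches_in('be[a-e]r', 'becr');    // true:  c lies between a and e
$off_range = matches_in('be[a-e]r', 'befr');    // false: f is outside the range
$last_dash = matches_in('be[e-]r', 'be-r');     // true:  a final - is ordinary
$bracket   = matches_in('be[],]r', 'be]r');     // true:  a leading ] is ordinary
$negated   = matches_in('b[^ie]er', 'bier');    // false: i is excluded
$caret     = matches_in('be[e^]r', 'beer');     // true:  ^ not first, so ordinary
$low       = matches_in('beer[^6-9]', 'beer5'); // true
$high      = matches_in('beer[^6-9]', 'beer7'); // false
```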
      standard character classes
• The following predefined classes exist. Use
  them inside a class, e.g. '[[:digit:]]'
  [:alnum:]   any alphanumeric characters
  [:digit:]   any digits
  [:punct:]   any punctuation characters
  [:alpha:]   any alphabetic characters (letters)
  [:graph:]   any graphic characters
  [:space:]   any whitespace character (space, tab, \n, \r, \v, \f)
  [:blank:]   any blank character (space and tab)
  [:lower:]   any lowercase character
      standard character classes
  [:upper:]    any uppercase character
  [:cntrl:]    any control character
  [:print:]    any printable character
  [:xdigit:]   any character for a hex number
• They are locale and operating system
  dependent.
• With this discussion we leave character
  classes.
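A small sketch of the predefined classes in use. It assumes preg_match(), which accepts the same [:name:] classes inside a bracket expression. The test strings are made up.

```php
<?php
// The class name goes inside a bracket expression: [[:digit:]], not [:digit:].
$digits   = preg_match('~^[[:digit:]]+$~', '1003');      // 1
$not_only = preg_match('~^[[:digit:]]+$~', '1003.2');    // 0: the dot is not a digit

// Two classes in sequence: one uppercase letter, then lowercase letters.
$word = preg_match('~^[[:upper:]][[:lower:]]+$~', 'Beer');  // 1

// A predefined class can share the brackets with ordinary members.
$code = preg_match('~^[[:alnum:]_-]+$~', 'LIS651_lecture-4'); // 1

// [:space:] covers the tab as well as the blank.
$tab = preg_match('~[[:space:]]~', "Good\tBeer");           // 1
```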
     The period . metacharacter
• The period matches any character except
  the newline \n.
• The reason why the \n is not counted is
  historic. In olden days matching was done
  line by line, because the computer could
  not hold as much memory.
  '.'   does not match in   ''
  '^.$' does not match in   "\n"
  '^.$' matches in          'a'
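The three cases above, as a sketch using preg_match() with its default settings (an assumption: the slide describes ereg(), and the default treatment of \n is the behaviour of preg_match() specifically).

```php
<?php
$nothing = preg_match('~.~', '');        // 0: there is no character to match
$newline = preg_match('~^.$~', "\n");    // 0: by default the period skips \n
$single  = preg_match('~^.$~', 'a');     // 1: exactly one character
$gaps    = preg_match('~^b..r$~', 'beer'); // 1: each period takes one character
```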
          alternative operator |
• This acts like an or
  'beer|wine' matches in 'beer'
  'beer|wine' matches in 'wine'
• Alternatives are performed last, i.e. each
  alternative extends as far as it can.
  'beer|wine glass' means 'beer' or 'wine glass'
              grouping with ( )
• You can use ( ) to group
  '(beer|wine) (glass|)' matches in        'beer glass'
  '(beer|wine) (glass|)' matches in        'wine glass'
  '(beer|wine) (glass|)' matches in        'beer '
  '(beer|wine) (glass|)' matches in        'wine '
  '(beer|wine) (glass(es|)|)' matches in
    'beer glasses'
• Yes, groups can be nested.
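To see how grouping limits the reach of |, here is a sketch using preg_match() as a stand-in for ereg(). The patterns are anchored with ^ and $ so that the whole string has to fit; that anchoring is an addition for the demonstration.

```php
<?php
// One big group: the alternatives are 'beer' and 'wine glass'.
$wide = preg_match('~^(beer|wine glass)$~', 'beer glass');    // 0
// A tighter group: 'beer' or 'wine', each followed by ' glass'.
$tight = preg_match('~^(beer|wine) glass$~', 'beer glass');   // 1

// Nested groups with empty alternatives, as on the slide.
$pattern = '~^(beer|wine) (glass(es|)|)$~';
$results = [];
foreach (['beer glass', 'wine glasses', 'beer ', 'beer mug'] as $s) {
    $results[$s] = preg_match($pattern, $s);
}
// 'beer glass' => 1, 'wine glasses' => 1, 'beer ' => 1, 'beer mug' => 0
```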
               repetition operators
•   * means zero or more times what precedes it.
•   + means one or more times what precedes it.
•   ? means zero or one time what precedes it.
•   The shortest preceding expression is used, i.e.
    either a single character, a class or a group.
    '(beer )*'   matches in          ''
    '(beer )?'   matches in          ''
    '(beer )+'   matches in          'beer beer beer'
    'be+r'       matches in          'beer'
    'be+r'       does not match in   'bebe'
                  enumeration
• We can use {min,max} to give a minimum min
  and a maximum max. min and max are
  non-negative integers; max may be omitted.
  'be{1,3}r'   matches in            'ber'
  'be{1,3}r'   matches in            'beer'
  'be{1,3}r'   matches in            'beeer'
  'be{1,3}r'   does not match in     'beeeer'
• ? is just a shorthand for {0,1}
• + is just a shorthand for {1,}
• * is just a shorthand for {0,}
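The repetition operators and their {min,max} spellings can be compared with a sketch like this. It assumes preg_match() instead of ereg(), and the loop at the end is only an illustration of the shorthand rule.

```php
<?php
$zero  = preg_match('~^(beer )*$~', '');           // 1: zero repetitions are fine
$many  = preg_match('~(beer )+~', 'beer beer beer'); // 1
$no_r  = preg_match('~be+r~', 'bebe');             // 0: no r after the e
$three = preg_match('~be{1,3}r~', 'beeer');        // 1: three e's are allowed
$four  = preg_match('~be{1,3}r~', 'beeeer');       // 0: four e's are too many

// ?, + and * agree with {0,1}, {1,} and {0,} on every test string.
$same = true;
foreach (['br', 'ber', 'beer', 'beeeer'] as $s) {
    $same = $same
        && preg_match('~^be?r$~', $s) === preg_match('~^be{0,1}r$~', $s)
        && preg_match('~^be+r$~', $s) === preg_match('~^be{1,}r$~', $s)
        && preg_match('~^be*r$~', $s) === preg_match('~^be{0,}r$~', $s);
}
```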
                 examples
• US zip code ^[0-9]{5}(-[0-9]{4})?$
• something like a current date in ISO form
 ^(20[0-9]{2})-(0[1-9]|1[0-2])-(0[1-9]|[12][0-9]|3[01])$
• Something like a Palmer School course code
  ((DIS[89])|(LIS[5-9]))[0-9]{2}
• Something like an XML tag </*[[:alpha:]]+ */*>
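A sketch that runs three of these patterns through preg_match(), used here in place of ereg(). Two things are assumptions of the sketch: the day part of the date is written as (0[1-9]|[12][0-9]|3[01]) so that days 01 to 09 are accepted, and the course code is given balanced parentheses and anchors. The sample values are made up.

```php
<?php
$zip    = '~^[0-9]{5}(-[0-9]{4})?$~';
$date   = '~^(20[0-9]{2})-(0[1-9]|1[0-2])-(0[1-9]|[12][0-9]|3[01])$~';
$course = '~^((DIS[89])|(LIS[5-9]))[0-9]{2}$~';

$zip5      = preg_match($zip, '11548');         // 1
$zip9      = preg_match($zip, '11548-1300');    // 1
$zip_short = preg_match($zip, '1154');          // 0: only four digits

$date_ok    = preg_match($date, '2006-12-03');  // 1
$date_month = preg_match($date, '2006-13-03');  // 0: there is no month 13

$course_ok  = preg_match($course, 'LIS651');    // 1
$course_low = preg_match($course, 'LIS451');    // 0: 4 is outside [5-9]
```

The date pattern only checks the shape of the string; it would still accept 2006-02-31.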
not using posix regular expressions
• Do not use regular expressions when you
  want to accomplish a simple task for which
  a special PHP function is already available.
• A special PHP function will usually do the
  specialized task more easily and faster.
  Parsing and interpreting the regular
  expression costs the machine time.
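A few such special functions, as a minimal sketch. strpos(), str_replace() and ctype_digit() are standard PHP functions; the sample strings are invented.

```php
<?php
$order = 'one glass of Karlsberg';

// strpos() looks for a fixed substring; no pattern has to be compiled.
$has_glass = strpos($order, 'glass') !== false;      // true

// str_replace() swaps one fixed string for another.
$pitcher = str_replace('glass', 'pitcher', $order);  // one pitcher of Karlsberg

// ctype_digit() answers "digits only?" without writing '^[0-9]+$'.
$is_number = ctype_digit('11548');                   // true
```

The comparison with !== false matters for strpos(), because a match at position 0 would otherwise look like a failure.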
                    ereg()
• ereg(regex, string) searches for the pattern
  described in regex within the string string.
• It returns false if no match was found.
• If you call the function as ereg(regex, string,
  matches) the matches will be stored in the
  array matches. Thus matches will be a
  numeric array: $matches[0] holds the whole
  matched text, followed by the grouped parts
  (something in ()) in order.
  The first group match will be $matches[1].
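ereg() was removed in PHP 7.0, so the sketch below uses preg_match(), which fills its third argument in the same way: the whole match at index 0 and the groups from index 1. The date value is just an example.

```php
<?php
$date = '2006-12-03';
$matches = [];

// Three groups: year, month, day.
$found = preg_match('~^([0-9]{4})-([0-9]{2})-([0-9]{2})$~', $date, $matches);

if ($found === 1) {
    // $matches[0] is the whole match; the groups start at index 1.
    list($whole, $year, $month, $day) = $matches;
    echo "$year / $month / $day\n";   // 2006 / 12 / 03
}
```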
                ereg_replace
• ereg_replace ( regex, replacement, string )
  searches for the pattern described in regex
  within the string string and replaces
  occurrences with replacement. It returns
  the replaced string.
• If replacement contains expressions of the
  form \\number, where number is an integer
  between 1 and 9, the text matched by that
  sub-expression (group) is inserted.
  $better_order = ereg_replace('glass of (Karlsberg|Bruch)',
                               'pitcher of \\1', $order);
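The same replacement, sketched with preg_replace() because ereg_replace() is no longer available from PHP 7.0 on. preg_replace() accepts the \\1 form as well; the value of $order is invented for the example.

```php
<?php
$order = 'one glass of Karlsberg and one glass of Bruch';

// \\1 in the replacement stands for whatever the first group matched.
$better_order = preg_replace(
    '~glass of (Karlsberg|Bruch)~',
    'pitcher of \\1',
    $order
);

echo $better_order, "\n";
// one pitcher of Karlsberg and one pitcher of Bruch
```

Every occurrence is replaced, not just the first one.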
                     split()
• split(regex, string, [max]) splits the string
  string at the occurrences of the pattern
  described by the regular expression regex. It
  returns an array. The matched pattern is not
  included.
• If the optional argument max is given, it
  means the maximum number of elements in
  the returned array. The last element then
  contains the unsplit rest of the string string.
• Use explode() if you are not splitting at a
  regular expression pattern. It is faster.
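A sketch of the three cases, assuming preg_split() as the current replacement for split() (removed in PHP 7.0). explode() is unchanged. The input line is made up.

```php
<?php
$line = 'beer,  wine ;cider,mead';

// Split at a comma or semicolon, swallowing any blanks around it.
$parts = preg_split('~ *[,;] *~', $line);
// ['beer', 'wine', 'cider', 'mead']

// With a limit of 2, the last element keeps the unsplit rest.
$limited = preg_split('~ *[,;] *~', $line, 2);
// ['beer', 'wine ;cider,mead']

// A fixed separator needs no regular expression.
$simple = explode(',', 'beer,wine,cider');
// ['beer', 'wine', 'cider']
```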
       case-insensitive function
• eregi() does the same as ereg() but works
  case-insensitively.
• eregi_replace() does the same as
  ereg_replace() but works case-insensitively.
• spliti() does the same as split() but works
  case-insensitively.
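In the preg family, which this sketch assumes because the eregi functions were removed in PHP 7.0, there are no separate case-insensitive functions. The letter i after the closing delimiter does the job instead.

```php
<?php
$strict = preg_match('~good beer~', 'Good Beer');    // 0: case matters
$loose  = preg_match('~good beer~i', 'Good Beer');   // 1: i ignores case

// The same modifier works for replacing and splitting.
$wine   = preg_replace('~beer~i', 'wine', 'Beer, BEER, beer');
// wine, wine, wine
$pieces = preg_split('~x~i', '2X3x4');
// ['2', '3', '4']
```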
   regular expressions in mySQL
• You can use POSIX regular expressions in
  mySQL in the SELECT command
  SELECT … WHERE column REGEXP 'regex'
• where column is a column name and regex is a
  regular expression.
http://openlib.org/home/krichel

    Thank you for your attention!

Please switch off machines b4 leaving!

								