TextPad using regular expressions for searching and replacing by gve10368


What are regular expressions?

Formally, a regular expression defines a set of strings.

    /[Dd]at(a|um)/

    Defines the set: Data, data, Datum, datum
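
As a rough illustration outside TextPad, Python's re module (an assumption; the tutorial itself targets TextPad) can confirm which strings belong to that set. The candidate list below is made up for the demonstration.

```python
import re

# The pattern from the slide, without the Perl-style slashes.
pattern = re.compile(r"[Dd]at(a|um)")

# Hypothetical candidate strings, chosen only for this demonstration.
candidates = ["Data", "data", "Datum", "datum", "date", "DATA"]

# fullmatch() succeeds only when the whole string is in the set.
members = [word for word in candidates if pattern.fullmatch(word)]
print(members)
```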

Used mainly for parsing text: search or search-and-replace.

    • But this is not your momma's search-and-replace.

Available in a variety of environments:

    • Text editors, including TextPad.
    • Unix commands such as grep.
    • Most programming languages.



Preliminaries

Before getting started, configure TextPad:

    • Check "Regular expression" in the search or search-and-replace dialog box.

    • Edit the TextPad preferences:

        • Configure -> Preferences -> Editor

        • Check "Use POSIX regular expression syntax"




Terminology and notation

A regular expression defines a pattern.

It is said to find a match when it succeeds.

This tutorial uses the Perl idiom of enclosing regular expression patterns in
forward slashes:

    /PATTERN/

    s/SEARCH/REPLACE/




Literals

The simplest regular expressions are no different than garden-variety searching:

    • Ordinary characters match themselves.

"Ordinary characters" are those that have no special meaning within regular
expression syntax.

    /in/

    LAZINESS: The quality that makes you go to great effort to
    reduce overall energy expenditure. It makes you write labor-
    saving programs that other people will find useful, and
    document what you wrote so you don't have to answer so many
    questions about it. Hence, the first great virtue of a
    programmer.




Escaped character literals

The following characters have special meaning in regular expressions:
    + ? . * ^ $ () [] {} | \

To search for such characters, precede them with a backslash, which is known as
escaping the character:

    /\./

    Mr. Green, with the revolver, in the billiard room.


    /\\/

    C:\MPC\regular_expressions_intro.doc
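
A small sketch of the same two searches in Python's re module (used here only for illustration; it is not TextPad's engine). The re.escape() call is a Python convenience that adds the backslashes for you.

```python
import re

sentence = "Mr. Green, with the revolver, in the billiard room."
path = r"C:\MPC\regular_expressions_intro.doc"

# An escaped period matches only literal periods.
dots = re.findall(r"\.", sentence)

# An escaped backslash matches only literal backslashes.
slashes = re.findall(r"\\", path)

# re.escape() escapes special characters in text meant to be literal.
escaped = re.escape("a.b")

print(len(dots), len(slashes), escaped)
```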




Other special characters

Sometimes you need to match special whitespace characters. Here are the most
common:

   \t      tab
   \n      newline (end of line marker)




The wildcard

A period matches any character (with one exception, to be covered later):

    /a..e/


    IMPATIENCE: The anger you feel when the computer is being lazy.
    This makes you write programs that don't just react to your
    needs, but actually anticipate them. Or at least pretend to.
    Hence, the second great virtue of a programmer. See also
    laziness and hubris.
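
To see what the wildcard pattern picks out, here is a minimal Python sketch (Python's re module is an assumption, not part of the tutorial) applied to the first line of the passage. The word 'cake' is a made-up counterexample.

```python
import re

line = "IMPATIENCE: The anger you feel when the computer is being lazy."

# 'a', then any two characters, then 'e'.
hits = re.findall(r"a..e", line)
print(hits)

# Exactly two characters are required between 'a' and 'e',
# so 'cake' (only one character between) does not match.
miss = re.search(r"a..e", "cake")
print(miss)
```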




Quantifiers

Regular expression syntax provides several ways to specify the number of times
that a particular element should occur:

   +        1   or more times
   ?        0   or 1 time; item is optional
   *        0   or more times; item is optional or repeatable
   {N}      N   times
   {N,}     N   or more times
   {N,M}    N   to M times




Quantifiers: basic examples

To use quantifiers in a regular expression you place them after the element that
you want to quantify:

    /IPUMSI?/      Matches   'IPUMS' and 'IPUMSI'.
    /40+9/         Matches   409, 4009, 40009, etc.
    /ab{2,3}a/     Matches   'abba' and 'abbba'.
    /.+/           Matches   any line with at least one character.
    /.*/           Matches   any line, even blank ones.
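
The examples above can be checked with a short Python sketch (Python's re module is assumed here purely for illustration). The rejected strings and the check() helper are invented for this demonstration to show where each pattern stops matching.

```python
import re

# Each pattern from the slide, with strings it should accept and reject.
# The rejected strings are hypothetical, added only for contrast.
cases = {
    r"IPUMSI?":  (["IPUMS", "IPUMSI"], ["IPUMSII"]),
    r"40+9":     (["409", "4009", "40009"], ["49"]),
    r"ab{2,3}a": (["abba", "abbba"], ["aba", "abbbba"]),
    r".+":       (["x"], [""]),
    r".*":       (["x", ""], []),
}

def check(pattern, accepted, rejected):
    """True if pattern fully matches every accepted and no rejected string."""
    ok = all(re.fullmatch(pattern, s) for s in accepted)
    return ok and not any(re.fullmatch(pattern, s) for s in rejected)

results = {p: check(p, acc, rej) for p, (acc, rej) in cases.items()}
print(results)
```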




Quantifiers: a greedy example

   /john.+\.edu/


   john.levin@yale.edu history
   marcus_g_peterson@hotmail.com geography
   john@nationwidecash.com computers
   john_modell@brown.edu history
   waijohnchang@yahoo.com geography
   carol.johnson@normandale.edu sociology
   john.johnson@ibm.com commerce
   holly_johnston@hotmail.com biology
   marie.johnson@argonmedical.com other
   john.aim@sa.edu education
   chewie1974@gmail.test science




Regular expressions are greedy by default

Quantifiers match as many characters as possible, consistent with the overall goal
of finding a successful match.

/john.+\.edu/

john.aim@sa.edu    education     #   The quantifier consumes the line,
john.aim@sa.edu    education     #   but the match fails...
john.aim@sa.edu    education     #   ...so the matching engine...
john.aim@sa.edu    education     #   ...backs off the quantifier...
john.aim@sa.edu    education     #   ...step...
john.aim@sa.edu    education     #   ...by step...
john.aim@sa.edu    education     #   ...until the pattern succeeds.
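
The end result of that backing-off can be observed in Python's re module, which is also greedy by default (a sketch for illustration; the second input line is hypothetical and was added to show greed when two '.edu' endings are present).

```python
import re

pattern = r"john.+\.edu"

# Line from the slide: only one '.edu' exists, so the engine backs off to it.
m1 = re.search(pattern, "john.aim@sa.edu    education")
print(m1.group())

# Hypothetical line with two '.edu' endings: greedy '.+' runs to the last one.
m2 = re.search(pattern, "john@a.edu, john@b.edu")
print(m2.group())
```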




Generosity?

Some implementations of regular expressions allow for non-greedy quantifiers.

In Perl, a question mark following a quantifier causes it to match as few
characters as possible.

    /john.+?\.edu/

    john.aim@sa.edu education        # Matches this
    john.aim@sa.edu education        # rather than this.

    /IPUMSI?/

    IPUMSI is good, never evil. # Matches this.
    IPUMSI is good, never evil. # Not this.

TextPad lacks this feature.
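
Python's re module follows the Perl convention here, so it can be used to sketch the difference. The input line with two '.edu' endings is hypothetical; it was chosen because the greedy and non-greedy results differ only when more than one ending is available.

```python
import re

line = "john@a.edu, john@b.edu"   # hypothetical line with two '.edu' endings

greedy = re.search(r"john.+\.edu", line).group()
lazy = re.search(r"john.+?\.edu", line).group()
print(greedy)
print(lazy)

# A non-greedy '?' prefers zero occurrences of the optional character.
long_form = re.search(r"IPUMSI?", "IPUMSI is good").group()
short_form = re.search(r"IPUMSI??", "IPUMSI is good").group()
print(long_form, short_form)
```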



Anchoring

Positional requirements can be placed on patterns. This is called anchoring.

/^PATTERN/      Anchor to start of line.
/PATTERN$/      Anchor to end of line.

/ department$/

john.levin@yale.edu history department
marcus_peterson@hotmail.com BS department
john@nation.com department of knowledge   # Not a match
john_modell@brown.edu history department

/^P.{23}4/    # Finds 4 in column 25 on person record.

H00003101110011000000000310400000001
P00003111110011000000000301110881088
H00004101110000000000000410400000001
P00004111110000000000000408420011001
H00005101110000001100000510400000001
P00005111110000001110000507410031003
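
The column arithmetic works because ^P occupies column 1 and .{23} skips columns 2 through 24. A minimal Python sketch (Python's re module is assumed for illustration) applies the same pattern to the six records above.

```python
import re

records = [
    "H00003101110011000000000310400000001",
    "P00003111110011000000000301110881088",
    "H00004101110000000000000410400000001",
    "P00004111110000000000000408420011001",
    "H00005101110000001100000510400000001",
    "P00005111110000001110000507410031003",
]

# 'P' in column 1, any 23 characters, then '4' in column 25.
matches = [r for r in records if re.search(r"^P.{23}4", r)]
print(matches)
```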


Word anchors

Most regular expression syntaxes have word anchors, which force a pattern to be
located at the word boundaries.

Syntax for word anchors in TextPad:

    \>     End of word.
    \<     Beginning of word.

    IPUMSI is good, never evil.        # No match in 'evil'.

Perl has a different and more robust syntax for word anchors.
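
For comparison, Perl and Python's re module write the word anchor as \b, which matches at either edge of a word. The sketch below uses Python; the two patterns are illustrative guesses at the kind of search the slide had in mind, not the slide's own pattern.

```python
import re

sentence = "IPUMSI is good, never evil."

# \b marks a word boundary, so \bi finds words that begin with 'i'.
# The 'i' inside 'evil' is not at a word boundary and is skipped.
starts_with_i = re.findall(r"\bi\w*", sentence)

# l\b finds words that end with 'l'.
ends_with_l = re.findall(r"\w*l\b", sentence)

print(starts_with_i, ends_with_l)
```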




Regular expressions are line-based

By default, regular expressions are applied one line at a time.

Implication 1: The wildcard does not match the newline character.

    /.+/   # Matches full line, not entire document.

Implication 2: The end-of-line anchor assumes the existence of the newline
character, so you do not need to specify it.

These two patterns both find 'department' only if it exists at the end of the
line, but they differ in the text matched.

    /department$/

    marcus_peterson@hotmail.com BS department¶

    /department\n/

    marcus_peterson@hotmail.com BS department¶
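
The difference in matched text can be made visible with a Python sketch. Python's re module is an assumption here, and unlike TextPad it needs the MULTILINE flag for $ to anchor at every line end. The second line of the sample text is made up.

```python
import re

text = "marcus_peterson@hotmail.com BS department\nnext line\n"

# '$' anchors at the line end but does not consume the newline.
end_anchor = re.search(r"department$", text, flags=re.MULTILINE)

# '\n' matches the newline itself, so it becomes part of the match.
with_newline = re.search(r"department\n", text)

# The wildcard stops at the newline: one line, not the whole text.
first_line = re.search(r".+", text)

print(repr(end_anchor.group()), repr(with_newline.group()))
```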
Character classes

Character classes provide a way to define sets within a pattern.

    • Enclose one or more characters within square brackets.

    • The characters can be typed directly or using intuitive ranges.

    [brc]at               Matches   'bat', 'rat', or 'cat'.
    unit[0-9]             Matches   'unit' followed by any digit.
    unit[7-9]             Matches   'unit7', 'unit8', or 'unit9'.
    [a-z]{2}[0-9]{4}      Matches   MPC sample IDs, such as ih1970.

Character classes can also be defined in a negative fashion by placing the caret
symbol as the first item in the brackets.

    unit[^0-9]      Matches 'unit' followed by any non-digit.
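
A quick way to try these classes is a Python sketch like the following (Python's re module is assumed for illustration; the sample text is invented so that each class has something to accept and something to reject).

```python
import re

# Hypothetical sample text for the demonstration.
text = "bat rat cat hat unit5 unit8 unitA ih1970"

animals = re.findall(r"[brc]at", text)
any_digit = re.findall(r"unit[0-9]", text)
high_digit = re.findall(r"unit[7-9]", text)
non_digit = re.findall(r"unit[^0-9]", text)
sample_ids = re.findall(r"[a-z]{2}[0-9]{4}", text)

print(animals, any_digit, high_digit, non_digit, sample_ids)
```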




Grouping or sub-patterns

A regular expression can be divided into parts, often called sub-patterns.

Enclosing a portion of a regular expression in parentheses defines a sub-pattern.

    /STUFF(SUB_PATTERN)MORE_STUFF(SUB_PATTERN)/




How sub-patterns are used

1. To apply quantifiers to a subset of a regular expression, rather than just to a
single character.

2. To search for text with repeated elements.

3. To use portions of a pattern when defining the replacement string in search-
and-replace operations.

In the latter two situations, the sub-patterns can be referred to using \N
notation. These are known as back-references.

    \1      first sub-pattern
    \2      second sub-pattern
    ...
    \9      ninth sub-pattern

In Perl, back-references are written \N inside the pattern and $N in the
replacement string.

Sub-pattern examples: with quantifiers

   house(cat)?   Matches 'house' or 'housecat'.
   (ha)+         Matches 'ha', 'haha', 'hahaha', etc.
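
The effect of the parentheses is easiest to see by comparison with the ungrouped form. This is a small Python sketch (the re module is assumed for illustration); the ungrouped pattern ha+ is added here only for contrast.

```python
import re

# Without parentheses, '+' applies to the single character 'a'.
ungrouped = bool(re.fullmatch(r"ha+", "haha"))

# With parentheses, '+' applies to the whole sub-pattern 'ha'.
grouped = bool(re.fullmatch(r"(ha)+", "haha"))

# The optional sub-pattern accepts both forms.
house = bool(re.fullmatch(r"house(cat)?", "house"))
housecat = bool(re.fullmatch(r"house(cat)?", "housecat"))

print(ungrouped, grouped, house, housecat)
```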




Sub-pattern examples: text with repeated elements

   /([0-9\-]+) +\1/

   232-456-789    763-456-7890
   123-456-789    123-456-789
   232-456-789    612-612-3245
   123-456-789    763-456-7890
   232-456-789    232-456-789
   123-456-789    612-612-3245
   612-612-3245    612-612-3245


   ^([0-9])([0-9])([0-9]).+\3\2\1

   232   456
   123   321
   232   456
   123   456
   237   732
   123   456
   612   216
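
Both patterns can be run against the sample lines with a Python sketch (Python's re module is assumed for illustration; it accepts the same \N back-reference notation inside a pattern). The second list keeps only the distinct lines from the slide.

```python
import re

pairs = [
    "232-456-789    763-456-7890",
    "123-456-789    123-456-789",
    "232-456-789    612-612-3245",
    "123-456-789    763-456-7890",
    "232-456-789    232-456-789",
    "123-456-789    612-612-3245",
    "612-612-3245    612-612-3245",
]

# \1 must repeat exactly what the first sub-pattern matched.
repeated = [p for p in pairs if re.search(r"([0-9\-]+) +\1", p)]

triples = ["232   456", "123   321", "237   732", "123   456", "612   216"]

# \3\2\1 requires the first three digits again, in reverse order.
mirrored = [t for t in triples
            if re.search(r"^([0-9])([0-9])([0-9]).+\3\2\1", t)]

print(repeated)
print(mirrored)
```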


Using sub-patterns in the replacement string -- preliminaries

In a search-and-replace operation, the replacement string is not a regular
expression.

It is mainly a literal string with a few bits of added functionality, which vary
considerably from one environment to another.

The primary TextPad features are the following:

    \N         Use the Nth sub-pattern in the replacement.
    &          Use the entire match in the replacement.
    \0         Ditto.

    \p         Use the clipboard contents.
    \i         Generate a sequence number.
    \i(N,M)    Ditto, starting at N and incrementing by M.




Sub-pattern examples: using sub-patterns in the replacement string

s/([0-9]{2})-([0-9]{2})/19\2\t\1/

   • Matches a date in the mm-yy format.
   • Stores the month and year portions as sub-matches.
   • Converts match to a tab-delimited string -- year then month.

   Before                      After
   06-60                       1960     06
   02-60                       1960     02
   10-70                       1970     10
   03-70                       1970     03
   06-80                       1980     06
   03-80                       1980     03
   03-80                       1980     03
   11-80                       1980     11
   12-90                       1990     12
   03-90                       1990     03
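
The same conversion can be sketched with Python's re.sub (an assumption for illustration; TextPad's replacement syntax differs in detail). Python's \g<N> form is used so the group numbers cannot run into the literal digits '19'.

```python
import re

dates = ["06-60", "10-70", "12-90"]

# Group 1 is the month, group 2 the two-digit year.
converted = [
    re.sub(r"([0-9]{2})-([0-9]{2})", r"19\g<2>\t\g<1>", d)
    for d in dates
]
print(converted)
```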



Sub-pattern examples: using sub-patterns in the replacement string

s/([0-9]+)(\.([0-9]+))?/\0\t\1\t\2\t\3/

   • Parses numbers into their integer and decimal components.

   • Matches one or more digits, optionally followed by a decimal point and
   some more digits.

   • Preserves every component in the replacement: full match, integer, entire
   decimal portion, and just the digits following the decimal.

   Before                  After

   460.914                 460.914     460      .914    914
   336.591                 336.591     336      .591    591
   60                      60          60       \2      \3
   108.767                 108.767     108      .767    767
   148.368                 148.368     148      .368    368
   24.911                  24.911      24       .911    911
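
Note the '60' row: the optional sub-pattern did not take part in the match, and the table shows \2 and \3 left as literal text. Other engines handle this case differently. As a sketch for comparison, Python's re.sub (version 3.5 or later, an assumption about your environment) substitutes an empty string for a group that did not participate.

```python
import re

pattern = r"([0-9]+)(\.([0-9]+))?"
template = r"\g<0>\t\1\t\2\t\3"   # \g<0> is Python's name for the full match

with_decimal = re.sub(pattern, template, "460.914")
integer_only = re.sub(pattern, template, "60")

print(repr(with_decimal))
print(repr(integer_only))
```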

Alternation

The pipe symbol can be used to specify alternatives within a regular expression.

Whereas a character class provides alternatives at the level of individual
characters, this syntax can be applied to entire sub-patterns.

    /\.(edu|gov)$/

    john.levin@yale.edu
    marie.johnson@argonmedical.com
    samantha.johnson@bateswhite.gov
    peterson@pop.com
    marcus_g_peterson@alum.mit.edu


    /^(A|An|The) .+/      # Matches entire lines that
                          # start with 'A', 'An', or 'The'.
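
Both alternation patterns can be tried in a Python sketch (Python's re module is assumed for illustration). The list of titles is hypothetical; 'Another Day' is included to show that the space after the alternatives matters.

```python
import re

addresses = [
    "john.levin@yale.edu",
    "marie.johnson@argonmedical.com",
    "samantha.johnson@bateswhite.gov",
    "peterson@pop.com",
    "marcus_g_peterson@alum.mit.edu",
]

# Keep addresses that end in '.edu' or '.gov'.
edu_or_gov = [a for a in addresses if re.search(r"\.(edu|gov)$", a)]

# Hypothetical titles for the article pattern.
titles = ["The Hobbit", "A Tale of Two Cities", "Another Day", "Emma"]
with_article = [t for t in titles if re.search(r"^(A|An|The) .+", t)]

print(edu_or_gov)
print(with_article)
```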




Text parsing -- a typical MPC example

Example: Czechoslovakia 1991 codebook.

General points:

    • Identify regularities -- without them, the task is not amenable to automated
    solutions.

    • Identify irregularities -- these are the challenges.

    • Control white space.

    • Be practical: use strategic "manual" editing. Do not force automation unless
    the magnitude of the job demands it.

    • Know the strengths of your tools, and use them in combination: Excel,
    TextPad, even Word.


Quick reference

Characters with special meaning:
    . \ + ? * ^ $ () [] {} |

Basic special characters:
    \          Treat the next character as literal text.
    .          Match any character except newline.
    \t         Tab.
    \n         Newline.

Quantifiers:
    +          1 or more times
    ?          0 or 1 time; item is optional
    *          0 or more times; item is optional or repeatable
    {N}        N times
    {N,}       N or more times
    {N,M}      N to M times
    ?          Makes the preceding quantifier non-greedy (Perl, not TextPad)




Anchors:
    ^        Start of line.
    $        End of line.
    \>       End of word (TextPad).
    \<       Beginning of word (TextPad).

Character classes:
    []       Define character class.
    [^]      Define character class in negative fashion.

Sub-patterns and back-references:
    ()       Define a sub-pattern.
    |        Define alternative sub-patterns.
    \N       Use the Nth back-reference.
    &        Use the entire match in the replacement.
    \0       Ditto.

Other TextPad options in the replacement string:

    \p      Use the clipboard contents in the replacement.
    \i      Generate a sequence number (TextPad, not Perl).
    \i(N,M) Ditto, starting at N and incrementing by M.



								