					     Lab 1: Regular Expressions in Java and Perl
                                  Massimo Poesio
                                October 19th, 2004


    The goal of this lab is to get some practice with the basic NLE tasks we
discussed in the first lectures, such as tokenization, and with using regular
expressions in Java and Perl. We will go through a quick introduction to Perl
for those of you who aren’t familiar with it yet.


1     Where to find code and data
The example code discussed in these labs can be found in the subdirectory code
of the cc437 web page:

                 http://cscourse.essex.ac.uk/course/cc437

The .java and .pl files used in this lab can be found in the subdirectory Lab RE java perl:

    http://cscourse.essex.ac.uk/course/cc437/code/Lab RE java perl/


2     Tokenization in Java
You should already be acquainted with basic string processing in Java, i.e., with
the methods of the String class. (If you aren’t, have a look at the online Java
documentation, or at Chapter 5 of Oliver Mason’s book, which also contains a
gentle introduction to Java if you don’t know it.) In this lab we’ll be mainly
concerned with the StringTokenizer class.
    As discussed in class, tokenization is one of the fundamental tasks in NLE:
extracting tokens from the input text. The definition of ‘token’ depends on
the application, but in most cases complete words count as tokens; sometimes,
punctuation markers do as well. Finite state methods are typically used for
tokenization, because of their efficiency. In Java, the methods of the class
StringTokenizer can be used for a very basic form of tokenization. For example,
the following code (borrowed, like the later example of the use of split, from the
Java documentation for the StringTokenizer class at java.sun.com):

      // Requires: import java.util.StringTokenizer;
      StringTokenizer st = new StringTokenizer("this is a test");
      while (st.hasMoreTokens()) {
          System.out.println(st.nextToken());
      }


prints the following output:
      this
      is
      a
      test
More sophisticated types of tokenization, allowing for different types of
delimiting characters, can be specified using the split method of String or
the java.util.regex package. The argument of String.split is a regular
expression specifying a delimiter. The following example illustrates how the
String.split method can be used to break up a string into its basic tokens:

      String[] result = "this is a test".split("\\s");
      for (int x=0; x<result.length; x++)
          System.out.println(result[x]);


This code prints the following output:
      this
      is
      a
      test
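
The choice of delimiter expression matters as soon as the input contains more than one white-space character in a row. The following is a minimal sketch, not part of the lab code: the class name SplitDemo and its two helper methods are made up for illustration. It compares splitting on a single white-space character with splitting on a run of white space.

```java
import java.util.Arrays;

// Hypothetical demo class: compares two delimiter expressions for String.split.
class SplitDemo {
    // Split on every single white-space character.
    static String[] splitSingle(String text) {
        return text.split("\\s");
    }

    // Split on runs of one or more white-space characters.
    static String[] splitRuns(String text) {
        return text.split("\\s+");
    }

    public static void main(String[] args) {
        // Two spaces after "this" and a tab before "test".
        String text = "this  is a\ttest";
        System.out.println(Arrays.toString(splitSingle(text)));
        System.out.println(Arrays.toString(splitRuns(text)));
    }
}
```

With the delimiter \s, the two adjacent spaces produce an empty token between them; with \s+ the whole run counts as one delimiter, which is usually what a tokenizer wants.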



3    Regular Expressions in Java
There are several regular expression libraries for Java, but in this lab we will
use the default package that comes with Java 1.4, java.util.regex. The lab
is based on the regular expressions tutorial at the Java doc site,

        http://java.sun.com/docs/books/tutorial/extra/regex/

The two main classes of the java.util.regex API are Pattern and Matcher.
In the tutorial, you start by creating a java file, RegexTestHarness.java, that
can be used to read in different regular expressions from the file regex.txt.
The regular expression read from regex.txt is compiled into a pattern using
the compile method of the Pattern class; the matcher method of the same class
then creates a Matcher object for a given input, and the find method of that
Matcher locates the instances that match the regular expression.
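
The compile / matcher / find sequence can be sketched as follows. This is an illustration only, not the tutorial's RegexTestHarness.java: the class name FindDemo and the method findAll are invented here, the regular expression and input are passed as strings rather than read from regex.txt, and the code assumes a current JDK rather than Java 1.4.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class: lists every match of a regular expression in an input string.
class FindDemo {
    static List<String> findAll(String regex, String input) {
        Pattern pattern = Pattern.compile(regex);   // compile the RE once
        Matcher matcher = pattern.matcher(input);   // bind the pattern to this input
        List<String> found = new ArrayList<>();
        while (matcher.find()) {                    // move on to the next match, if any
            found.add("\"" + matcher.group() + "\" at "
                    + matcher.start() + "-" + matcher.end());
        }
        return found;
    }

    public static void main(String[] args) {
        for (String line : findAll("ab*", "abb cab a")) {
            System.out.println(line);
        }
    }
}
```

Note that end() returns the index just past the last matched character, so the match "abb" at the start of the input is reported as 0-3.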



Exercise: Go to the Regex tutorial page, download RegexTestHarness.java
into your folders, and make sure you can compile it.
    The tutorial then covers increasingly complex types of regular expressions,
as done in the lecture: from the simplest form of RE (a string of characters), to
metacharacters, disjunction, ranges, negation, predefined character classes, and
quantifiers.

Exercise: Go through the tutorial, reading at least the sections up to and
including the section on ’Capturing Groups’, and doing the exercises.

Exercise: Using the java.util.regex package, write a simple tokenizer that, given
an input text, outputs one word per line by replacing strings of white space with
newlines. The simplest way to do this is to modify RegexTestHarness.java
for this purpose: the key idea is to replace the while loop calling matcher.find
with a call to the replaceAll method of the Matcher class. The more ambitious
may want to change the program so that it reads the regular expression from one
file and tokenizes a second file.
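
As a hint at how replaceAll is called, here is a small sketch on a different task, so that the exercise itself is left open: it replaces every run of digits with a single '#'. The class name ReplaceDemo and the method maskNumbers are invented for this illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class: shows Matcher.replaceAll on a task unrelated to tokenization.
class ReplaceDemo {
    // Replaces every run of digits in the input with a single '#'.
    static String maskNumbers(String input) {
        Matcher matcher = Pattern.compile("[0-9]+").matcher(input);
        return matcher.replaceAll("#");   // returns a new string; input is unchanged
    }

    public static void main(String[] args) {
        System.out.println(maskNumbers("room 101, floor 3"));
    }
}
```

The tokenizer needs the same call with a different pattern and a different replacement string.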


4    A Quick Introduction to Perl
Perl is an extremely popular programming language. It is best known as one of
the main languages used to write CGI scripts for Web pages, but in this module
we are only interested in using it for search and text transformation, which was
always one of the main reasons for its development. If you want to learn more,
check out its excellent online manuals:

http://www.perldoc.com/perl5.6/

(It would be a good idea to open the page now, click on ’Manual page’, and
keep it open for the rest of the tutorial.)
    As always, we will begin by showing how to do a “Hello, world!” program
in Perl. Printing is done using the print operator. The following script prints
out “Hello, world!”:
# hello.pl
# A complex Perl program

print "Hello, world!\n";
Notice the use of ’#’ for comments, as in Unix shell scripts (C uses /* ... */
instead). (Otherwise, the syntax of Perl is very much like that of C or C++.)

Exercise: Create a file with this code, and try it. Perl scripts do not have to
be compiled; you can execute a script by passing its name as an argument to
the perl command in a command-line window: e.g., if you called the file above
hello.pl, you can execute it by typing perl hello.pl.

    The basic thing you need to know about Perl is how to write loops that
go through a whole file, line by line, applying a regular expression to each line.
Perl’s syntax for while loops and if statements is very similar to that of C or
Java [2], but its treatment of input / output is very characteristic. Files are
read via file handles: a file handle BR for reading from the file file.txt is
created using the open command, as follows [3]:

open BR, "< file.txt" or die "can't open file.txt";

Once a file handle is created, one can read from it by using the <BR> syntax.
The following command reads a line from BR and assigns it to variable $line [4]:

$line = <BR>;

The special file handles STDIN and STDOUT are always open for reading from
the standard input and writing to the standard output, respectively. Looping
through the standard input is done as follows:

while (<STDIN>) {
   .... PUT YOUR COMMANDS HERE ...
}

The loop condition is short for ‘read the next line from the standard input, and
store it in the special variable $_’; the loop stops when no input is left.


5      Regular Expressions in Perl
Once $_ is set, it is possible to check whether it contains an instance of a pattern
specified by a regular expression such as ab* by means of a second abbreviation [5]:

if (/ab*/) {
   .... PUT YOUR COMMANDS HERE ...
}

This command is shorthand for ‘if $_ matches ab*, .... ’. The matching operator
is written =~ and is an infix operator, like == in Java; a slightly more
expanded version of the if statement above would be:

if ($_ =~ /ab*/) {
   .... PUT YOUR COMMANDS HERE ...
}

    [2] For more details, see the perlsyn manual page.
    [3] More details about open can be found in the perlopentut manual page for Perl.
    [4] The treatment of variables in Perl is quite special, as well. Variables need
not be declared; their type is indicated using characters like $ (indicating that
the variable is scalar) or @ (indicating that the variable is an array or list).
See the perldata manual page.
    [5] The syntax for REs in Java is borrowed from Perl, so what you learned in the
Java tutorial will work in Perl, as well. For more details, look at the perlre
manual page.

So, the entire code for looping through the standard input, checking each line
for an instance of the pattern ab*, would be:

while (<STDIN>) {
  if (/ab*/) {
        .... PUT YOUR COMMANDS HERE ...
  }
}

We can now show the whole code for the simple version of grep seen in class:

# grep_simple.pl
# A Perl script that searches for occurrences of /ab*/ in
# its standard input.

while (<STDIN>) {
  if (/ab*/) {
        print $_;
  }
}
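
For comparison with the Java half of the lab, the same line filter can be sketched with java.util.regex. This is an assumption-laden illustration, not part of the lab code: the class name GrepSimple and the helper containsMatch are invented, and the demo reads a small built-in sample rather than the standard input (swap the StringReader for new InputStreamReader(System.in) to filter real input).

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;
import java.util.regex.Pattern;

// Hypothetical Java counterpart of grep_simple.pl.
class GrepSimple {
    private static final Pattern PATTERN = Pattern.compile("ab*");

    // True if the line contains at least one instance of the pattern,
    // which is what Perl's if (/ab*/) tests.
    static boolean containsMatch(String line) {
        return PATTERN.matcher(line).find();
    }

    public static void main(String[] args) throws IOException {
        // Sample input for the demo; replace the StringReader with
        // new java.io.InputStreamReader(System.in) to read standard input.
        BufferedReader in = new BufferedReader(new StringReader("xyz\ncab\nbbb\n"));
        String line;
        while ((line = in.readLine()) != null) {   // null marks the end of the input
            if (containsMatch(line)) {
                System.out.println(line);
            }
        }
    }
}
```

Since b* also matches zero occurrences of b, the pattern ab* is found in any line that contains an a.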

Exercise: Create a file (say, grep_simple.pl) with this code, and try it
by typing perl grep_simple.pl in a command-line window. You can then test
the script by typing in one line of input directly, then pressing Return; or,
even better, by using the pipe syntax, which passes the output of one program to
the next: e.g., type regex.txt | perl grep_simple.pl will send the output of type
(= the contents of file regex.txt) to the input of perl.
    As performing a regular expression match on each line of a file is a very
common task in Perl, it is possible to simply pass the command to be executed
on each line as an argument of the options -ne (-n wraps the command in an
implicit loop over the input lines, and -e supplies the command itself). The
script above could therefore be rewritten simply as:

perl -ne"print $_ if /ab*/;"

And tested using the type command:

type FILE | perl -ne"print $_ if /ab*/;"

Exercise: Java’s treatment of Regular Expressions, including the ideas of using
parentheses to ’group’ parts of the matching expression and to store them into
variables called $1, $2, etc., is borrowed from Perl. Modify the Perl program
above so that it prints out the parts of text that match the pattern, rather than
the entire line.
    The following script is the Perl equivalent of the RegexTestHarness.java
program seen earlier. Notice the use of the /g modifier at the end of the regular
expression to find all matches, of the parentheses to isolate a ‘group’, and of the
variable $1 to get the text matched by the first group. (Perl also provides
variables $2, etc., for further groups.)

# RegexTestHarness.pl
# Perl equivalent of RegexTestHarness.java
# Author: Massimo Poesio
# Open file regex.txt for reading; stop if the file can't be opened.
open BR, "< regex.txt" or die "can't open regex.txt";
# Read one line each for the regex and the input, and strip the
# end-of-line (LF or CRLF) without touching any other character.
$regex = <BR>; $regex =~ s/\r?\n$//;
$input = <BR>; $input =~ s/\r?\n$//;

print "Current REGEX is: ", $regex, "\n";
print "Current INPUT is: ", $input, "\n";

# Now loop through all the matches. The /g modifier makes each pass
# find the next match; $-[1] gives the start index of group 1 in the
# last successful match, $+[1] its end index.
while ($input =~ /($regex)/g) {
    print "I found the text \"", $1,
          "\" starting at index ", $-[1],
          " and ending at index ", $+[1], "\n";
}

close BR;
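
On the Java side, the roles of $1, $-[1] and $+[1] are played by the group, start and end methods of Matcher when called with a group number. The sketch below is illustrative only: the class name GroupDemo, the method firstGroup and the sample pattern c(ab*) are all made up for this purpose.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class: reports capture group 1 of the first match.
class GroupDemo {
    // Returns the text, start index and end index of group 1,
    // or "no match" if the regex does not occur in the input.
    static String firstGroup(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        if (!m.find()) {
            return "no match";
        }
        // group(1) ~ $1, start(1) ~ $-[1], end(1) ~ $+[1]
        return m.group(1) + " " + m.start(1) + " " + m.end(1);
    }

    public static void main(String[] args) {
        System.out.println(firstGroup("c(ab*)", "xxcabbx"));
    }
}
```

Here the whole match is cabb, but only the parenthesized part, abb, is reported, together with its own start and end positions.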

Finally, you’ll need to know how to replace instances of a pattern. This is
done with the substitution operator discussed in the lectures. Substitution has
a syntax very similar to that of matching, but in addition to the matching
pattern, a substitution pattern is given, and an ’s’ precedes the whole expression.
Substituting the first instance of ab* in $_ with c is done as follows (add the
/g modifier after the last slash to replace every instance):

s/ab*/c/
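
In Perl, s/ab*/c/ without the /g modifier replaces only the first instance found in $_, while s/ab*/c/g replaces all of them. The closest Java counterparts are the replaceFirst and replaceAll methods of String. The following sketch contrasts the two; the class name SubstituteDemo and its method names are invented for illustration.

```java
// Hypothetical demo class: Java counterparts of s/ab*/c/ and s/ab*/c/g.
class SubstituteDemo {
    // Like s/ab*/c/ : replace only the first instance.
    static String substituteFirst(String input) {
        return input.replaceFirst("ab*", "c");
    }

    // Like s/ab*/c/g : replace every instance.
    static String substituteAll(String input) {
        return input.replaceAll("ab*", "c");
    }

    public static void main(String[] args) {
        System.out.println(substituteFirst("abb xab"));
        System.out.println(substituteAll("abb xab"));
    }
}
```

One difference to keep in mind: the Perl operator modifies $_ in place, whereas the Java methods return a new string and leave the original untouched.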

Exercise: Write a Perl program that does what the simple ‘tokenizer’ did in a
previous example, and can be executed from the command line.
   The following command runs a tokenizer over each line of a file called
HLT_data_mining.txt:

type HLT_data_mining.txt | perl -ne"$_ =~ s/\s+/\n/g; print $_;"

What the command in quotes says is: print the current line ($_) after replacing
every sequence of white space (recall that the metacharacter \s means ‘any
white space’, and + means ‘1 or more repetitions’) with a newline character.
The result of this pipeline should be output with a separate word on each line.
(Notice how many of these ‘tokens’ are not really words you’d find in a lexicon
... )


Exercise: The ‘tokenizer’ above is slightly less crude than what we did before,
but still pretty basic. For example, in the output of this pipeline punctuation is
still attached to the preceding word (check out for example the line with
‘machines’). Can you think of a way of fixing this? (Hint: just add one more step
to the pipeline ... )


6    References
    • The Java documentation: http://java.sun.com/j2se/1.4.2/docs/api/java/util/StringTokenizer.html

    • Perl documentation: http://www.perldoc.com/perl5.6/
    • The perlre manual page
    • The Perl 5 Primer: http://www.futureone.com/~sponge/tutorial/perl/index.html



