Understanding the awk in the UNIX Shell by zwj23860




The core idea behind any programming or scripting language is to make it as natural
and as simple as possible. Still, it should allow the construction of advanced
expressions for solving complex problems. The creators of the UNIX Shell scripting
language did not forget these two simple principles. For the sake of simplicity, some
parts are broken down into multiple sub-parts. Sometimes these sub-parts evolve and
grow into programming languages of their own.

A prime example of this is awk. The name comes from the initials of the surnames of
its three creators: Aho, Weinberger and Kernighan. They defined it as “awk, a
pattern scanning and processing language.” Besides setting a concrete purpose for the
tool, this also underlines the fact that awk has its own syntax and rules. That makes
it a programming language in its own right.

Awk was designed to scan and process files such as .csv files, where data is organized
into rows and columns. The same approach works for any other source of data organized
in this way (for example, the output of the command ls -l). The principle behind awk
is to divide an input stream into records (rows) and fields (columns) and to make the
changes on these.
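
To make the idea of records and fields concrete, here is a minimal sketch. The listing
is made-up text in the shape of ls -l output (real output differs from system to
system), and the variable names are only illustrative:

```shell
# Hypothetical lines in the shape of `ls -l` output (real output varies by system).
listing='-rw-r--r-- 1 ann users 1200 Jan  5 10:00 notes.txt
-rw-r--r-- 1 ann users  340 Feb 12 09:30 todo.txt
drwxr-xr-x 2 ann users 4096 Mar  1 14:15 projects'

# Each line is one record; the whitespace-separated columns are its fields.
# Keep only the name (field 9) and the size (field 5) of every record.
name_and_size=$(printf '%s\n' "$listing" | awk '{ print $9, $5 }')
echo "$name_and_size"
```

Everything except the two columns we asked for is thrown away, which is exactly the
kind of filtering awk was built for.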

You could say that the stream editor I presented last time already accomplished the
row splitting. However, compared to sed, awk is a much more complex, powerful and
therefore more capable language. Splitting records into fields allows us to throw away
what is unnecessary and process only the useful information from a file or UNIX tool.

Awk is a scripting language best suited to solving small everyday problems. Its three
creators do not recommend it for big and complex problems. Nevertheless, there is a
long list of problems that can be solved with it. It also has a couple of advantages
over tools like sed. For instance, it can work with real numbers and follows a very
C-like syntax.

During this and a future article I will try to present it as compactly as possible
without leaving out any crucial parts of awk. Remember that this is only an
introduction to the language on a basic level. I do not intend to shed light on every
corner of it. Nevertheless, this will be enough for you to use it on the everyday
problems you may come across in the future.

The structure of Awk

There are multiple versions of awk. I will present the traits of the one found on
GNU/Linux. If you read my article about sed (the stream editor) you are already
familiar with the way it works. If you have not, I recommend that you read it. First,
awk reads in the data (by default from the standard input). It splits the data into
records and the records into fields. It executes a script on the records. Finally, it
prints the result to the standard output.

The script is a short list of commands that is executed on every single line of the
input stream. Awk scripts can grow quite long, to the point where they no longer fit
on a single line of the terminal, so you can also write them in a separate file. Later
you can name that file in place of the awk script.

Accomplish this with the -f option switch. These script files should have the .awk
extension. This is only so that we can tell them apart from other files.

cat alfa.txt | awk -f script.awk
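
As a sketch of the whole round trip, the following creates a tiny input file and a
one-rule script file in a temporary directory and runs them together. The file
contents are made up for illustration, and here the input file is passed as an
argument instead of being piped in:

```shell
# Work in a throwaway directory; all file contents here are illustrative.
workdir=$(mktemp -d)

# Sample input file.
printf 'one two three\nfour five six\n' > "$workdir/alfa.txt"

# The awk program lives in its own file instead of on the command line.
cat > "$workdir/script.awk" <<'EOF'
# Print the second field of every record.
{ print $2 }
EOF

# -f reads the program from the file; the input file is given as an argument.
second_fields=$(awk -f "$workdir/script.awk" "$workdir/alfa.txt")
echo "$second_fields"

rm -r "$workdir"
```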

The first line of these scripts, just like in the case of bash shell scripts, should be
the following:

#!/usr/bin/awk -f

With this, if we run the script file directly, the shell will figure out what to do
with it. Furthermore, we can specify the field separator at start-up with the -F
option switch. By default, fields are separated by runs of spaces and tabs (roughly the
regular expression [ \t]+), and leading and trailing blanks are ignored. You can also
initialize a variable of the script with an external value with the -v option switch.
The syntax is as follows:

-v variable_name=start_value
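
Here is a minimal sketch of the two switches working together. The account list and
the variable name wanted are invented for the example; the comparison in front of the
braces is a relational pattern, which the Patterns section describes:

```shell
# Hypothetical colon-separated records: user:shell
accounts='root:/bin/sh
ann:/bin/bash
bob:/bin/sh'

# -F: sets the field separator to a colon.
# -v wanted=/bin/sh creates the awk variable "wanted" before the script starts.
sh_users=$(printf '%s\n' "$accounts" | awk -F: -v wanted=/bin/sh '$2 == wanted { print $1 }')
echo "$sh_users"
```

Passing the value in with -v keeps the awk script itself free of shell quoting tricks.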


From awk's point of view, lines are just a sequence of fields separated by a character
or character sequence. It also calls lines records. For example, take the input:

Held up so high I am not…
Tell me.

Here we have two records. In the first one we have seven fields, where the field
separator is a single space. While we process each line we can refer to the entire
line as $0. Furthermore, you can address fields individually as $1, $2, $3 and so on,
where the number after the dollar sign refers to the n-th field.
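
A small sketch that feeds the same two records to awk and prints a few of these
references for each of them (the labels first=, second= and line= are only there to
make the output readable):

```shell
# The two-record input from the text.
input='Held up so high I am not…
Tell me.'

# For every record print the first field, the second field, and the whole line.
fields=$(printf '%s\n' "$input" | awk '{ print "first=" $1, "second=" $2, "line=" $0 }')
echo "$fields"
```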

Therefore, in the second record of the example above, $0 is the whole line, $1 is equal
to the string “Tell” and $2 to “me.”. Of course, we can change these values during
processing and also refer to fields with the help of variables:

{
    k = 1
    print $k
}

The example above in fact prints the first field, the one that corresponds to $1. You
can follow what is happening in the script with the help of some general variables.
These exist in every awk script. With awk you can also process multiple files, one
after another.

Inside NR you will find the number of records read in so far from all of the files.
FNR holds the number of records processed so far from the current file. The number of
fields in the current record is inside NF (Number of Fields). Note that these
variables are written without a dollar sign. With NF you can easily address the last
field in each record as $NF.
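
The following sketch shows how the counters behave across two files. The file names
and contents are invented for the example, and FILENAME (the name of the current input
file) is printed as well so that the switch between files is visible:

```shell
workdir=$(mktemp -d)
printf 'a b c\nd e\n' > "$workdir/first.txt"
printf 'f g h i\n'    > "$workdir/second.txt"

# NR counts records across all files, FNR restarts at 1 for each file,
# NF is the number of fields in the current record, and $NF is the last field.
counters=$(cd "$workdir" && awk '{ print FILENAME, NR, FNR, NF, $NF }' first.txt second.txt)
echo "$counters"

rm -r "$workdir"
```

On the third record NR has reached 3 while FNR is back at 1, because a new file has
started.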


Patterns

Awk programs are built up from rules. You compose a rule from a pattern and an action
(statement). The latter must be enclosed in {} braces. For instance:

pattern {action/statement}

One of the most basic commands that you can put into the awk action list is print. It
prints the values of the expressions that follow the command and finishes with a new
line. For example, the following will print “feel rain” on the screen:

echo 'I want to feel the rain' | awk '{print $4, $6}'

The pattern can be any regular expression. If you missed my article on this topic, I
will not start all over again here. Just make a quick search under my name and you
should find it under the title Regular Expressions under the UNIX Shell.

If you add a pattern, the action is executed only if the record matches the pattern.
If you give a pattern but no action, the default action is carried out, which is to
print the entire record (that is, print $0). For example, the following line will
print the word Immortal on the screen, as the input does indeed start with the “Im”
character sequence:
echo Immortal | awk '/^Im/'
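
With more than one input line the filtering effect becomes visible. This is a small
sketch over a made-up word list, once with the pattern alone and once with the pattern
followed by an action:

```shell
# Hypothetical word list, one word per line.
words='Immortal
Mortal
Impossible
Simple'

# Pattern only: the default action prints every matching record.
starts_with_im=$(printf '%s\n' "$words" | awk '/^Im/')
echo "$starts_with_im"

# Pattern plus action: the action runs only for the matching records.
labelled=$(printf '%s\n' "$words" | awk '/^Im/ { print "match:", $0 }')
echo "$labelled"
```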

Still the entire system is just a little more complicated than that. The table below will
make it all clear.
Pattern                        What does it mean?
BEGIN                          Executed before any of the lines are processed.
                               Usually a good place for initializations, setting the
                               field separator and so on.
END                            Executed after all the lines have been processed.
                               Use this to print final results, like the sum of a
                               column, and for tasks that conclude the processing and
                               need to be executed only once.
/regex/                        A simple regular expression, just as I introduced in the
                               preceding lines.
pattern1 && pattern2           The “and” operator. Both patterns must match in order to
                               execute the action that follows in the braces.
pattern1 || pattern2           Same as above, with the difference that this is the “or”
                               operator. If either of the patterns matches we are good
                               to go.
!pattern                       Negation. The action is executed if the pattern does
                               not match.
pattern ? pattern1 : pattern2  A sort of if. If pattern matches, then pattern1 decides
                               whether the record matches; otherwise pattern2 decides.
Relational expression          A comparison using one of the following relational
                               operators (mathematical comparison or string matching):
                               <   Less than
                               >   Greater than
                               <=  Less than or equal
                               >=  Greater than or equal
                               ==  Equal
                               !=  Not equal
                               ~   Match (for strings)
                               !~  No match (for strings)

                               An example of pattern matching:
                               $2 ~ /^I/
                               => True if the second field starts with a capital I.

                               Note that the other relational operators also work on
                               strings, in which case the comparison is made character
                               by character. An expression used as a pattern is true if
                               it evaluates to a non-zero number or a non-empty string.
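
Several rows of the table can be tried out in one short program. This is only a
sketch: the name, age and city records are invented, and minors is simply a counter
variable I picked for the example:

```shell
# Hypothetical records: name, age, city
people='Ann 31 Ivrea
Bob 17 Oslo
Cid 45 Izmir
Dee 22 Rome'

result=$(printf '%s\n' "$people" | awk '
BEGIN                 { print "adults in I-cities:" }   # runs once, before any input
$2 >= 18 && $3 ~ /^I/ { print $1 }                      # both patterns must match
!($2 >= 18)           { minors++ }                      # negated relational pattern
END                   { print "minors:", minors }       # runs once, after all input
')
echo "$result"
```

Every record is checked against each rule in turn, so one line of input can trigger
several actions or none at all.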


Commands and variables

The commands go between the {} braces. You can list several of them on a single line
using the “;” separator character, or put each command on its own line, as the newline
character is also a command separator. An awk script usually has the following
structure:

BEGIN   { intro commands }
pattern { commands to be executed for every record that
          matches the pattern }
END     { closing commands }

If you do not need any one of these, you can simply leave it out of the script. The
character “#” starts a comment. Any text found after this character, up to the end of
the line, is ignored when the script is interpreted. Awk variables are handled very
similarly to those of the C language. However, here we do not need to declare them.
They are created the moment they are first used, and their type depends on the
context. If you assign a number to a variable it will be a number (always represented
as a floating-point number); if you assign a string to it, it will be of string type.
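
A short sketch of undeclared variables and comments in action. The input lines and
the variable names total and names are made up for the example:

```shell
summary=$(printf '3 apples\n4.5 pears\n' | awk '
{
    total += $1                  # "total" springs into existence here, starting at 0
    names = names "[" $2 "]"     # "names" starts out as an empty string
}
END {
    print total                  # used as a number
    print names                  # used as a string
}')
echo "$summary"
```

Nothing was declared anywhere: total behaves as a number because numbers are added to
it, and names behaves as a string because strings are appended to it.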

There is a list of internal variables that are created automatically at the start of
the script (so you do not need to declare them, or modify them unless you want to
change their behavior), and these will make programming easier:

Variable      What does it contain?                                Default value
ARGC          The number of arguments on the command line          None
              (Command Line Argument Count).
FILENAME      The name of the current input file.                  None
FS            The field separator (Field Separator). If this is    A single space, which
              empty, every character becomes a separate field.     splits on runs of
                                                                   spaces and tabs
RS            The record separator (Record Separator).             Newline => \n
IGNORECASE    If not zero, regular expression matching ignores     0
              case (specific to GNU awk).
NF            The number of fields in the current input record.    None
NR            The number of records seen so far.                   None
OFMT          The output format for numbers.                       "%.6g"
OFS           The output field separator.                          Space
ORS           The output record separator: what separates the      Newline
              records that are printed. If you change this to
              an empty string the output records are not
              separated.
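
To see a few of these variables at work, here is a minimal sketch. The colon-separated
records are invented for the example:

```shell
# Hypothetical colon-separated input.
records='ann:1000:staff
bob:1001:staff'

converted=$(printf '%s\n' "$records" | awk '
BEGIN {
    FS  = ":"      # split input fields at colons
    OFS = " | "    # join output fields with " | "
}
{
    # The commas in print are replaced by OFS; NR and NF are maintained by awk.
    print NR, NF, $1, $2
}')
echo "$converted"

# ORS replaces the newline that print normally appends after every record.
joined=$(printf '%s\n' "$records" | awk -F: 'BEGIN { ORS = ";" } { print $1 }')
echo "$joined"
```

Setting FS in the BEGIN block has the same effect as the -F switch used in the second
command.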

When you write string constants you can use the escape sequences of the C language:

                     Escape sequence        What does it mean?
                     \\                     Backslash
                     \b                     Backspace
                     \f                     Form feed
                     \n                     Newline
                     \r                     Carriage return
                     \t                     Horizontal tab
                     \v                     Vertical tab
                     \xhh                   The character with the hexadecimal
                                            code hh (GNU awk).
                     \c                     Literally the character c.
                                            For example \" => "
                     \ddd                   The character with the octal code ddd
                                            (one to three digits).
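
A brief sketch with a few of these escapes inside a string constant. Only the tab,
newline, double quote and backslash escapes are used, since those behave the same in
the awk versions I know of:

```shell
# \t inserts a tab, \n a newline, \" a double quote and \\ a single backslash.
escaped=$(awk 'BEGIN { print "name\tvalue\nquote: \"hi\"\nbackslash: \\" }')
printf '%s\n' "$escaped"
```

The result is printed with printf rather than echo because some shells let echo
interpret backslashes a second time.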

With this I will stop for now. I have given you quite a chunk of information. I will
leave you a little time to read it in detail and comprehend every piece of it.
Nevertheless, we are still far from the end. So make sure you come back for the next
article, where I will treat the other, just as important, parts of awk: functions,
external commands, arrays, et cetera. We will also put all this together and solve a
couple of problems to demonstrate the power of the language.

Before I say farewell I would like to encourage you to take the time to rate the
article. If you have any kind of question, do not hesitate to ask. Remember, the blog
was constructed for this sole purpose. I also welcome any comments made in the spirit
of constructive criticism. Live With Passion!

								