					Monet Interpreter Language User Guide


1 Introduction
This document is a user manual for MIL. It is intended to be read by front-end developers, or application programmers who
want to program in MIL directly instead of passing through front-end translation such as SQL.
Beware that MIL has not been designed as the next high-level programming language to satisfy a large programmer
community. Its primary function is that of an intermediate language, with only a little syntactic sugar added. The language is mostly
used by front-end compilers or through direct interaction from an application.




2 Screws and Bolts of the Language
Some general rules:
       upper and lower case matter (MIL is case-sensitive).

       all MIL keywords, however, are accepted in both all upper case and all lower case (but not in mixed case).

       white space can be entered between all keywords, punctuation and atomic values.

       comments start with the '#' character and continue until the end of the line.




2.1 Value Types
MIL is a dynamically typed language. That is, variables can change value and type during execution. Operations and
procedures may return different types on different invocations.




2.1.1 Atomic Types

       void
        values that are no values. Occupies 0 space. You will normally never find individual void values, but they can occur
        inside tables as a zero-width column. Their meaning in that context is explained in the section on Virtual OIDs.

       bit
        boolean values being either true or false.
         chr
          character values, denoted like 'c' (standard) or '\001' (octal).

         sht
          short integers; in general represented as 16 bit integers. Since normal MIL uses standard integer notation for the INT
          type, short constant values must always be formulated as a cast: sht(42).

         int
          normal integer values. Uses 32-bit representation.

         oid
          unique numerical identifiers. The next section discusses these in detail.

         ptr
          pointers. Seldom used. Some volatile representations (like stream pointers in the streams (io) module) use it as data
          representation. This datatype may not be used in persistent BATs. Its size is that of int on 32-bit platforms and that of
          lng on 64-bit platforms.

         flt
          standard floating point numbers. Uses a 32-bit representation.

         dbl
          double floats. Like sht values, dbl values must always be written as a cast (dbl(12.5)), since the MIL parser already uses
          floating point notation for flt.

         lng
          long integers, normally represented on machines as 64-bit integers. In this case the special LL syntax can be used after
          integers to denote longs: 1200089LL.

         str
          strings, using C-type standard string notation: e.g. "hello world".

         bat
          Binary Association Table (BAT) values. A BAT contains a collection of binary tuples, called BUNs (Binary UNits). By
          convention the left value in a BUN is called the head, the right one the tail. Head-type and tail-type determine the exact
          type of a BAT (its signature).


On the syntax level in MIL, we always abbreviate BAT types as BAT, omitting the signature. For textual signature descriptions,
throughout this document (and in the Monet Extension Language), we abbreviate BAT signatures at the first level: e.g.
BAT[OID,BAT].

Below we display an example BAT. This one has signature BAT[OID,STR]. It has OIDs in the heads, and strings in the tail. It
contains five BUNs:
#--------------------------#
# head          | tail   #
#--------------------------#
[ 1788@0,           "Martin"       ]
[ 1639@0,           "Arjan"    ]
[ 4309@0,           "Niels"    ]
[ 8145@0,           "Wilko"    ]
[ 7540@0,           "Sunil"    ]

More information about the internal layout of BATs is available in the technical documentation of GDK.
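
As a rough mental model (not Monet's actual storage layout, which is described in the GDK documentation), a BAT can be pictured as an ordered list of (head, tail) pairs. The Python sketch below rebuilds the example BAT shown above; the `Oid` type, the `render` helper and the variable names are illustrative assumptions, not MIL or GDK names.

```python
from typing import NamedTuple

class Oid(NamedTuple):
    """Illustrative stand-in for a MIL OID written as number@stamp."""
    number: int
    stamp: int

    def __str__(self) -> str:
        return f"{self.number}@{self.stamp}"

# A BAT[OID,STR] modelled as a list of BUNs, each a (head, tail) pair.
names = [
    (Oid(1788, 0), "Martin"),
    (Oid(1639, 0), "Arjan"),
    (Oid(4309, 0), "Niels"),
    (Oid(8145, 0), "Wilko"),
    (Oid(7540, 0), "Sunil"),
]

def render(bat):
    """Format each BUN in the bracketed style used by the listing above."""
    return [f'[ {head}, "{tail}" ]' for head, tail in bat]

for line in render(names):
    print(line)
```

The head-type and tail-type of this toy BAT are fixed only by convention; in Monet they form the BAT's signature.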


2.1.2 Type Identifiers

In MIL, type identifiers {void,bit,chr,sht,int,bat,oid,ptr,flt,dbl,lng,str} are just synonyms for integer type numbers. Just try
int.print(); it will display an integer value. Type numbers may change over time (see the Monet Extension Language), and can
be looked up from their corresponding strings in the BAT monet_atomtbl:
monet>monet_atomtbl.print();
#-----------------#
#h       |t    #
#-----------------#
[ "void", 0            ]
[ "bit", 1      ]
[ "chr", 2          ]
[ "sht", 3          ]
[ "bat", 4          ]
[ "int", 5      ]
[ "oid", 6          ]
[ "ptr", 7         ]
[ "flt", 8     ]
[ "dbl", 9          ]
[ "lng", 10             ]
[ "str", 11         ]
>



2.1.3 Type Casts

Values of all numerical types can be cast to each other using 'TYPE(expression)' notation (e.g. int(13.5)). Not all types can
sensibly be cast to each other, of course.
Values of each type can always be created from their string description. A 'TYPE(STR)' cast is
guaranteed to exist for any type, e.g. point("0.345,4.78 : 0.89,1.274") might convert a string to a GIS
extension 2D point atom. This is especially useful for extension types (see 'Loading and Dropping
Extension Modules'), which - since they are user-created - do not have any special syntax support in
MIL.


2.1.4 NIL values

Each type has one special value called the NIL value. NIL values of any type can be described by TYPE(nil) (e.g. flt(nil)). The
'nil' keyword itself denotes a VOID-typed NIL. Most commands give error messages if such a nil-value is passed to them as a
parameter. In sorting, by convention, the NILs appear first.


2.1.5 Unique OIDs
OIDs are special kinds of unsigned integers because the system guarantees uniqueness. They have the form number@stamp.
Currently, OIDs are implemented as 32-bit numbers, where the stamp part consists of the highest two bits, and the number part
occupies the rest of the OID.
A Monet server stamps each OID it generates with its own database-ID. By using Monet servers with different database-IDs,
cross-system OID uniqueness can be ensured.
Although OIDs should be generated with the command newoid(INT range):OID to ensure their uniqueness, you can also enter
OIDs by hand for debugging purposes.
You just type a number followed by an 'at' sign, e.g. 12345@, which specifies the OID 12345@STAMP, where STAMP is
the current database ID. You can also specify the stamp (e.g. 2) explicitly by typing 12345@2.
The database identifier concept is not yet supported in the current release; the STAMP is 0 by default: e.g. 01@3 = 01@0.
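
The 32-bit layout described above (stamp in the highest two bits, number in the remaining 30 bits) can be illustrated with a little bit arithmetic. This is a sketch of the described layout in Python, not Monet's implementation; `pack_oid` and `unpack_oid` are hypothetical helper names.

```python
from typing import Tuple

NUMBER_BITS = 30                      # 32 bits minus the 2 stamp bits
NUMBER_MASK = (1 << NUMBER_BITS) - 1  # the low 30 bits hold the number

def pack_oid(number: int, stamp: int) -> int:
    """Combine number@stamp into one 32-bit value (stamp in the top two bits)."""
    if not 0 <= stamp < 4:
        raise ValueError("stamp must fit in two bits")
    if not 0 <= number <= NUMBER_MASK:
        raise ValueError("number must fit in 30 bits")
    return (stamp << NUMBER_BITS) | number

def unpack_oid(raw: int) -> Tuple[int, int]:
    """Split a 32-bit value back into (number, stamp)."""
    return raw & NUMBER_MASK, raw >> NUMBER_BITS

raw = pack_oid(12345, 2)
number, stamp = unpack_oid(raw)
print(f"{number}@{stamp}")
```

Under this layout at most four distinct stamps exist, and each stamp leaves room for 2^30 distinct numbers.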

2.1.6 Search Accelerator Identifiers
With search accelerators we mean:
         special data structures,

         attached to a column of a table, i.e. head or tail of a BAT.

         maintained under table updates by the database automatically.
All decent database systems maintain such structures, often called indices. They are used for speeding up database operations,
like select and/or join. Famous examples are hash-tables, B-trees, R-trees, Grid-files, etc.
Whereas these structures are vital to the efficiency of many database operations, they do not play a big role on the user
interface level.
These accelerator identifiers serve to identify a specific search accelerator, and are used by the
search accelerator management commands, which can be found in the BAT module. This set can
change over time (see the Monet Extension Language), and the mapping to strings can be looked
up in the system BAT named monet_acctbl:
monet>monet_acctbl.print();
#-----------------#
#h       |t    #
#-----------------#
[ "hash", 1        ]
[ "index", 2       ]
The hash identifier stands for Monet's built-in bucket-chained hash-tables. The index identifier stands for Monet's built-in
binary search trees.


2.2 Expressions
Values can be computed with MIL by formulating expressions:

       constant
        Constants as described above, are the simplest form of expressions (examples: true (bit), 42 (integer), 3.14 (float),
        "foo" (string)).

       TYPE(expr)
        Expressions can be cast to other types, as described in the previous section (e.g. sht(42) (short), lng(42) (long),
        dbl(3.14) (double)).

       (expr)
        Enclosing parentheses are useful, sometimes even necessary tools, to specify unambiguous MIL expressions.

       ident
        Variables (possibly constant variables) are denoted by identifiers - consisting of alphanumerical characters or
        underscores, starting with a letter. See the next section on how to create variables.

       bat(name)
        The BAT with the designated name (of type str) is looked up in the database, e.g. bat("monet_atomtbl") and
        bat("<monet_atomtbl>") both resolve to the same predefined bat.

       ident := expr
        Assignment: put a value in a MIL variable (e.g. answer := 42). See the 'Statements' section on MIL blocks and on
        how to declare variables.

       command(expr, ..expr.. )
        Algebraic commands have a (possibly empty) parameter list, whose elements are expressions themselves, yielding values after
        execution (e.g. join(a,b)). Values are passed in, execution takes place, a result value comes out.

        MIL uses command-overloading: it can select different command implementations, depending on the number of
        parameters and their actual types. A command can be either a Builtin Command (discussed at the end of this
        document), a command introduced by loading a module (see the Extension Modules), or a user-defined PROC
        (procedure, expressed in MIL). Syntactically, there is no difference.

        Command resolution: both commands and procs are checked in reverse order of definition (i.e. last defined gets
        checked first), and the first match of formal parameters with actual parameters is accepted (return types are not taken
        into account during command resolution).

       expr.command(expr, ..expr.. )
        This 'object-oriented' syntax is supported for algebraic commands ( e.g. a.join(b)). The first parameter is the target
        'object', and may be placed in front using dot notation. This is just a synonym for the normal command invocation
        notation (see previous point), no special semantics are involved.

       expr binary-operator expr
        Operators are just algebraic commands that can also be used in infix notation (you may write both +(1,2) and 1 + 2).
         Note that each binary operator (in our example +) is a command, but not each command is a binary operator!
         Operator precedence: MIL has none, so use parentheses in compound operator expressions!


        unary-operator expr
         Not only binary, but also unary prefix operator notation can be allowed for certain commands ( - 3.14 is a command
         invocation, -3.14 isn't).


        (* string-expr)(expr, ..expr..)
         Invokes a command by means of its name. Command resolution will take place at execution time. For instance:
         (*"join")(b1,b2);. For multiplexed operations and set aggregates (see later), MIL has similar syntax using the
         [* string-expr](expr, ..expr..) notation (multiplex) and the {* string-expr }(expr) notation (set aggregate).
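
The command resolution rule described above (candidates checked in reverse order of definition, first match on parameter count and types wins, return types ignored) can be modelled in a few lines of Python. This is only a sketch of the rule as stated; the `registry`, `define` and `call` names are hypothetical and do not correspond to anything in the MIL interpreter.

```python
registry = []  # (name, parameter types, implementation), in definition order

def define(name, param_types, impl):
    """Register one overloaded implementation under a command name."""
    registry.append((name, tuple(param_types), impl))

def call(name, *args):
    """Resolve and invoke: the last definition is checked first."""
    for cand_name, param_types, impl in reversed(registry):
        if cand_name != name or len(param_types) != len(args):
            continue
        # Only the actual parameter types are compared; return types play no role.
        if all(isinstance(a, t) for a, t in zip(args, param_types)):
            return impl(*args)
    raise LookupError(f"no matching signature for {name}")

define("describe", (object,), lambda x: "any")       # wildcard, defined first
define("describe", (int,), lambda x: "int")
define("describe", (int, int), lambda x, y: "int,int")

print(call("describe", 42))
print(call("describe", "foo"))
print(call("describe", 1, 2))
```

Note the consequence of the ordering: had the wildcard version been defined last, it would match every one-parameter call and shadow the `int` version.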




2.3 Statements
Interaction with an Mserver takes the form of a client session, which is defined as a sequence of statements:


        expr;
         Statements are formed by putting a semicolon after an expression.

        { statement ..statement.. }
         A sequential block specifies sequential execution of a concatenation of statements { e := 2.7; pi := 3.14; }.

          {| statement ..statement.. |}
         A parallel block specifies that a collection of statements can be executed in parallel. The parallel block terminates
         execution when all statements in it have terminated execution. Example:
         {| join(a1,b1); join(a2,b2); join(a3,b3); |}

        VAR ident, ...ident..;
         Variables must be declared (e.g. var pi,e;). Their lifetime is the execution of the nearest block (sequential, parallel) in
         which they were declared. If a variable is declared at the top level, its lifetime is the MIL session.

        VAR ident := expr, ..ident-declaration..;
         Variables may be assigned a value directly in their declaration (e.g. var pi := 3.14, e := 2.7;).

        CONST ident := expr;
         This declares a constant variable. Its type and value cannot be changed (e.g. const pi := 3.14;).

        IF (BIT-expr) statement ELSE statement
         Standard if-then-else semantics, using C-style syntax (for example: if (a < 18) print("kid"); else print("adult");).

        IF (BIT-expr) statement
         The ELSE part may be omitted.
         WHILE(BIT-expr) statement
          A simple while-loop construction. The BIT expression is tested, on FALSE the loop stops, and on TRUE the
          statement is executed and we loop to a new BIT expression evaluation ( { var i := 0; while((i := i + 1) <= 10)
          i.print(); } ).

         BAT @ ITERATOR(expr, ..expr.. ) statement
          Iterate over the elements of a BAT. For each element, the statement is executed. The symbols $h and $t (or: $head and
          $tail) can occur as variables in the statement, and denote the head- and tail-value of the current BUN respectively.
          We give an example with batloop() iterator that sequentially loops through all BUNs in a BAT[int,int] named 'ints',
          and prints the sum of all heads and tails: ints@batloop() print( $h + $t);.

         BAT @ [INT-expr] ITERATOR(expr, ..expr.. ) statement
          The same as above, except that the statement execution may now occur in parallel on multiple 'current elements' at the
          same time. The INT-expression denotes the maximum degree of parallelism (which is desirable, since BATs can
          contain billions of elements).
          We give an example with a recursive BAT[BAT,BAT] containing combinations of fragments that have to be joined
          with each other. The statement fragments@ [16] batloop() join($h, $t); will join them in parallel, with a maximum
          number of 16 jobs in parallel at the same time.
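
The bounded-parallelism iterator described above can be pictured with a thread pool whose size plays the role of the INT-expression. The Python sketch below is an analogy under that assumption, not a description of how Mserver schedules work; `parallel_batloop` is a hypothetical name and the statement is modelled as a function of the head and tail values.

```python
from concurrent.futures import ThreadPoolExecutor

def parallel_batloop(bat, statement, max_parallel):
    """Run statement(head, tail) for every BUN, at most max_parallel at a time."""
    with ThreadPoolExecutor(max_workers=max_parallel) as pool:
        futures = [pool.submit(statement, h, t) for h, t in bat]
        # Collect results in BUN order, even if the jobs finish out of order.
        return [f.result() for f in futures]

ints = [(1, 10), (2, 20), (3, 30)]
sums = parallel_batloop(ints, lambda h, t: h + t, max_parallel=16)
print(sums)
```

Capping the pool size is what keeps a BAT with a very large number of elements from spawning one job per BUN.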

During program execution, the status of variables can dynamically be changed using the built-in commands
constant(..IDENT..) (which fixes value and type) and freeze(..IDENT..) (which fixes only the type).
Also, the vars() command produces a table of all active variables in the current context:
> vars();
#----------------------------------------------------------------------------------------- #
# name                 | type | type_status | value_status | value_value                                   #
#----------------------------------------------------------------------------------------- #
[ "b",                "BAT", "liquid",           "variable",      "<tmp_23>"                   ]
[ "monet_atomtbl",            "BAT", "frozen",            "variable",      "<monet_atomtbl>"                           ]
[ "BAT_WRITE",                 "int", "frozen",         "constant",       "0"                      ]
[ "BAT_APPEND",                  "int", "frozen",         "constant",       "2"                        ]
[ "BAT_READ",                  "int", "frozen",        "constant",        "1"                      ]
[ "BUF_DONTNEED",                   "int", "frozen",         "constant",         "4"                           ]
[ "BUF_WILLNEED",                 "int", "frozen",         "constant",          "3"                        ]
[ "BUF_SEQUENTIAL",                 "int", "frozen",         "constant",         "2"                               ]
[ "BUF_RANDOM",                   "int", "frozen",         "constant",       "1"                           ]
[ "BUF_NORMAL",                  "int", "frozen",         "constant",        "0"                       ]
[ "STORE_COMPR",                   "int", "frozen",         "constant",         "2"                            ]
[ "STORE_MMAP",                   "int", "frozen",         "constant",       "1"                           ]
[ "STORE_MEM",                   "int", "frozen",         "constant",       "0"                        ]
[ "LNG_MIN",                 "lng", "frozen",         "constant",        "-9223372036854775807" ]
[ "LNG_MAX",                  "lng", "frozen",         "constant",        "9223372036854775807" ]
[ "INT_MIN",              "int", "frozen",      "constant",     "-2147483647"                   ]
[ "INT_MAX",               "int", "frozen",     "constant",     "2147483647"                    ]
[ "SHT_MIN",               "sht", "frozen",      "constant",     "-32767"               ]
[ "SHT_MAX",                "sht", "frozen",      "constant",    "32767"                    ]
[ "CHR_MIN",                "chr", "frozen",      "constant",    "'\201'"           ]
[ "CHR_MAX",                "chr", "frozen",      "constant",     "'\177'"              ]
[ "RAND_MAX",                "int", "frozen",     "constant",     "2147483647"                      ]
[ ...                                                             ]
>
The standard collection of variables consists of a number of constants that are needed by various operators
(access(BAT[any,any], int BAT_XX), madvise(BAT[any,any], int BUF_XX) and mmap(BAT[any,any], int STORE_XX)).
The value_status tells whether it is a variable or a constant. For variables, the type_status may either be liquid or frozen,
which indicates whether the variable may be assigned a value of another type.


2.3.1 Exceptions
This section lists the exception-raising commands and describes how exceptions are caught.
           Any syntax error will raise an error exception. These exceptions cannot be caught by the user. The MIL interpreter will
            stop the interpretation on the active session and wait for new input.

           BREAK;
            This statement makes MIL quit the nearest enclosing WHILE loop, or ITERATOR immediately ( while(bool) { ..
            BREAK; .. } or b@batloop() { .. BREAK; .. }).
            If a command used in Multiplex Operator form (see below) stops with a BREAK, the result BAT will have *no*
            result tuple for that execution. Multiplex Operator execution will continue with the next BUN, however.

           ERROR(STR format, ANY...);
            It is also possible to raise an error exception programmatically. This call produces an error message, much like C's
            printf() format: ERROR("you entered %d foos too many\n", foo); (see: CMDERROR())

           CATCH(expr) : STR
            Executes the MIL expression passed as parameter. If an exception occurs during execution of the MIL expression,
            execution is stopped and the error text is returned by CATCH. If no exception occurs, CATCH returns str(nil).
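
The contract of CATCH described above (return the error text when the expression raises, otherwise a nil string) resembles a small try/except wrapper. The Python sketch below models it under two assumptions: the MIL expression is represented by a zero-argument function, and `None` stands in for str(nil). The names `catch` and `too_many` are hypothetical.

```python
def catch(thunk):
    """Return the error text if thunk() raises, else None (standing in for str(nil))."""
    try:
        thunk()
    except Exception as exc:
        return str(exc)
    return None

def too_many(foo):
    # Plays the role of ERROR("you entered %d foos too many\n", foo).
    raise RuntimeError("you entered %d foos too many" % foo)

print(catch(lambda: too_many(3)))
print(catch(lambda: 1 + 1))
```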




2.4 Procedures
Users can quickly prototype functionality using the MIL procedure mechanism. Procedures follow the same signature overloading
mechanism that MIL applies to its built-in algebraic operators and commands.

           PROC ident ( type ident, ..type ident..) : type statement
            Declare a MIL-procedure. Its definition is a simple MIL statement, in which the parameter identifiers may appear as
            variables. The types from the signature follow the same rules as found in MEL signature definitions: all atomic MIL
            types are called by name, BATs are parametrized (between square brackets) with a head-type and a tail-type, and the
         special typename any indicates a wildcarded type. Like in MEL, an any-type may have a tag-number using any::1 in
         order to make multiple any-types match (using the same tag-number). A signature may end with a special sequence
         ..type.., indicating a variable number of arguments. In case of ..any.. these arguments may be of any type, but you
         may be more restrictive.
         The parameters may also be omitted in the declaration, in which case MIL assumes ..any..; the actual parameters
         during execution are always available with positional arguments $1, $2, $3,... (Unix shell script style).
         As of now, you cannot indicate a return type in a PROC signature (MIL just defaults it to an any return value).

        PROC ident;
         If recursion is applied, you need to do a forward declaration of the procedure, in order for the MIL interpreter to be
         able to recognize it correctly.

        UNDEF ident;
         Undefine a MIL-procedure.

PROCs cannot be viewed once they have been parsed. You should store them yourself in some text-file. Maybe your
$DBDIR/users/$USER is a good place for that, since MIL scripts placed there can easily be read in (see 'source()'). PROCs
cannot be made persistent either. You can, however, put them in a MIL initialization file passed as startup parameter to
Mserver.


2.4.1 Variable Arguments
For handling variable numbers of arguments in PROCs, the following special positional arguments are available:

        $0 denotes the number of actual parameters (argument count).

        $(INT-expr) references the i-th actual parameter, where i is given by an integer expression.

        $(INT-expr..) can be used as a parameter to another function call; it actually stands for multiple parameter values. In
         this way, you can pass 'the rest' of the received parameters in a procedure to another command. For example:
         proc printf(format) := { fprintf(stdout, format, $(2..)); }
         passes parameter 2-till-the-end of printf to a call to fprintf. If there is only one parameter, the $(2..) will expand to 0
         parameters without giving an error.
         Note that in the procedure definition of printf, parameters 2-till-the-end are not named as formal parameters; all
         actual parameters (if any) are still available with the positional parameter syntax $X, $(X) and $(X..).
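
For readers more familiar with variadic functions in other languages, the positional arguments above map roughly onto Python's `*params`. The sketch below mirrors the printf example under that analogy; `mil_printf` is a hypothetical name, and the output stream is passed explicitly so the result can be inspected.

```python
import io

def mil_printf(out, *params):
    argc = len(params)        # $0: the number of actual parameters
    fmt = params[0]           # $1: the first actual parameter
    rest = params[1:]         # $(2..): parameter 2 till the end
    out.write(fmt % rest)     # forward "the rest" to the formatting call
    return argc

buf = io.StringIO()
count = mil_printf(buf, "%d foos and %s\n", 3, "bars")
mil_printf(buf, "no extra parameters\n")   # here $(2..) expands to zero parameters
print(count, repr(buf.getvalue()))
```

As in MIL, calling with only the format string is not an error: the forwarded remainder is simply empty.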




3 Relational Table Support
The relational data model provides support for tables with more than two columns. They can be supported in MonetDB using a
straightforward mapping: the n-ary data model { A1,...,An } is mapped onto n decomposed tables {
BAT[OID,A1] .. BAT[OID,An] }.
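
The decomposition just described can be sketched in Python by splitting each row of an n-ary table into one (OID, value) pair per attribute. This is an illustration of the mapping only; the `decompose` helper, the choice of consecutive integers as OIDs and the sample data are assumptions.

```python
def decompose(rows, attributes, first_oid=1):
    """Split n-ary rows into one BAT[OID,Ai] per attribute, as lists of (oid, value)."""
    bats = {name: [] for name in attributes}
    for oid, row in enumerate(rows, start=first_oid):
        for name, value in zip(attributes, row):
            bats[name].append((oid, value))   # the shared OID ties the columns together
    return bats

people = [(24, "Niels"), (38, "Martin"), (27, "Arjan")]
bats = decompose(people, ["age", "name"])
print(bats["age"])
print(bats["name"])
```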
These sets of BATs, and possibly their derivatives (sub-sets, used during query processing) semantically describe different
parts of the same database objects. Their OID columns make them correspond.


3.1 Support For Corresponding BATs
Since this is a common situation, Monet provides special support for corresponding BATs, typically resulting in the following
three cases:
        natural join. Monet normally must do a natural join on the OID columns, to find out which attribute values in the
         BATs correspond. Though quick hash-lookups are used, this remains relatively expensive.

        sorted BATs. A common strategy is to keep the different BATs ordered on the OID column. Cheap merge-algorithms,
         and binary-search lookup, can then be used, even if the one BAT is a subset of the others.

        synced BATs. For BAT-sets that contain exactly the same sets of OIDs, those OID-columns can be marked synced().
         This information is then used to use positional lookup (the i-th element in BAT1 corresponds with the i-th in BAT2)
         and to turn semijoin- into very cheap copy-operations, etc.
This information (sorted, synced) about BATs is automatically kept up-to-date by Monet across updates, and propagated to
derived BATs, where appropriate (a select on a sorted BAT produces a sorted subset BAT). The operators in this section (as
well as many of the algebraic commands in the Extension Modules) use this information to accelerate processing.



3.1.1 Command Signature Notation

In command signature definitions in the remainder of this document, we use MEL (Monet Extension Language) syntax:

        command-name(TYPE name, ...VARARGTYPE...) : RETURNTYPE

If the TYPE is ANY, this denotes any possible MIL type.



3.1.2 Example: value printing

The print(ANY val) command is used to print values. It will put them on the standard output and enclose them in square
brackets.
The RETURNTYPE is omitted here; this means that a VOID value is returned.
The second version of print is described by the pattern print(BAT[ANY::1, ANY], ..BAT[ANY::1,ANY]..). It has a variable
number of arguments of type BAT.
ANY types can be tagged with a number like ANY::2, in order to indicate that certain ANY types correspond with other ANY
types.
In this case, the command signature states that the head columns of all its BAT parameters are of the same type. The print()
command prints an N-ary table (where N = #arguments+1), consisting of the equality join of all BATs on the left column (head
column), with all head columns projected out except the leftmost one. This produces a [Head,Attr1,Attr2,...AttrN] ASCII table
on standard output:
print( #----------# , #---------------# ); ==> #--------------------#
     # h | age #     # h | name         #      # oid | age | name #
     #----------#   #---------------#       #--------------------#
     [ 1, 24 ]      [ 1, "Niels" ]          [ 1, 24, "Niels" ]
     [ 2, 38 ]      [ 2, "Martin" ]         [ 2, 38, "Martin" ]
     [ 3, 27 ]       [ 3, "Arjan" ]           [ 3, 27, "Arjan" ]
     [ 4, 27 ]       [ 4, "Fred" ]            [ 4, 27, "Fred" ]
     [ 5, 27 ]       [ 5, "Wilko" ]        [ 5, 27, "Wilko" ]
                                      BAT Printing Example, and its result on Standard Output
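
The multi-BAT form of print() amounts to an equality join on the head column followed by a projection. The Python sketch below reproduces the example above under the simplifying assumption that heads are unique within each BAT; `print_table` is a hypothetical helper that returns the rows instead of writing them to standard output.

```python
def print_table(first, *others):
    """Equi-join all BATs on their head; return rows [head, attr1, ..., attrN]."""
    lookups = [dict(bat) for bat in others]   # assumes unique heads in the other BATs
    rows = []
    for head, tail in first:
        if all(head in lk for lk in lookups):
            rows.append([head, tail] + [lk[head] for lk in lookups])
    return rows

age = [(1, 24), (2, 38), (3, 27), (4, 27), (5, 27)]
name = [(1, "Niels"), (2, "Martin"), (3, "Arjan"), (4, "Fred"), (5, "Wilko")]
for row in print_table(age, name):
    print(row)
```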




3.2 Multiplex Operators
Multiplex Operators execute commands for a set of 'objects' (in their Monet representation) on their different 'attributes' (i.e. a
set of BATs). For each command or procedure with signature:
                                                 COMMAND(type1,...,typeN) : typeR

there automatically exists its Multiplex Operator equivalent with signature:
                            [COMMAND](BAT[type0,type1],...,BAT[type0,typeN]) : BAT[type0,typeR]

It applies the command operation on all tail values (corresponding by equal head value), and stores the single return value in
the tail of the return BAT (along with the corresponding head value). Note that this means that the multiplex operation
computes the natural join on equal head values of all its BAT parameters.
#----------# [+] #----------#  =  #----------#
# h | t    #     # h | t    #     # h | t    #
#----------#     #----------#     #----------#
[ 1, 2 ]         [ 1, 1 ]         [ 1, 3 ]
[ 2, 1 ]         [ 2, 1 ]         [ 2, 2 ]
[ 3, 4 ]         [ 3, 1 ]         [ 3, 5 ]
[ 4, 2 ]         [ 4, 2 ]         [ 4, 4 ]
[ 5, 2 ]         [ 5, 2 ]         [ 5, 4 ]
                         Multiplex Operator Example: using the numerical '+' operator (arith module)

         All parameter head types should be equal or void

         The values fed into the command invocations are obtained by taking the tail values of the corresponding BUNs.

If you want a constant parameter to be given to each command invocation, instead of a tail-value from some BAT, you can
pass a constant (e.g. '42'). In the case of a constant value of type BAT, you must use BAT cast notation (e.g. '[A , const B]') to
avoid confusion.
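
The semantics of a multiplex operator (apply the command to the tail values of BUNs that share a head value) can be modelled in Python as follows. This is a sketch of the behaviour described above, assuming unique heads per BAT and ignoring constant parameters; `multiplex` is a hypothetical name.

```python
import operator

def multiplex(command, *bats):
    """Apply command to the tails of BUNs that share a head value."""
    first = bats[0]
    rest = [dict(b) for b in bats[1:]]        # assumes unique heads in each BAT
    return [
        (h, command(t, *(r[h] for r in rest)))
        for h, t in first
        if all(h in r for r in rest)          # natural join on equal head values
    ]

left = [(1, 2), (2, 1), (3, 4), (4, 2), (5, 2)]
right = [(1, 1), (2, 1), (3, 1), (4, 2), (5, 2)]
print(multiplex(operator.add, left, right))
```

Heads that are missing from any parameter BAT produce no result BUN, which is the join behaviour noted in the text.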



3.3 BAT Casts: project(A,B)
The BAT cast operator allows you to enter constant values into columns:
project( #---------------# , "Monet User" ) = #-------------------#
         # h | t         #                    # h | t             #
         #---------------#                    #-------------------#
         [ 1, "Martin" ]                      [ 1, "Monet User" ]
         [ 2, "Fred" ]                        [ 2, "Monet User" ]
         [ 3, "Niels" ]                       [ 3, "Monet User" ]
         [ 4, "Arjan" ]                       [ 4, "Monet User" ]
         [ 5, "Sunil" ]                       [ 5, "Monet User" ]
                                  BAT Cast Example: Inserting a constant value in a column




3.4 Virtual OIDs
The VOID in MIL is used like a V-OID, or virtual oid. A virtual oid column in a BAT is an OID column, which has the
restriction that it must contain a dense sequence of OID numbers. That is, it contains a sequence of N numbers from base
'base': [base,base+1,base+2..,base+N-1]. Such columns need zero bytes of memory space; the BAT just contains the base
number, and the OID values are then computed on the fly by adding the BUN position (row-id) to it.
A second restriction is that only one column in a BAT can be a virtual oid (i.e. not both head and tail can be VOID).
VOID columns can be created using the project(B) operator. It creates a view on top of B, in which the right column seems to be
missing; when you print it, all tail values are nil. Another way of creating a BAT with a VOID column is to use new(VOID,T)
(you can also use the constructor bat(VOID,T)). This will yield an empty BAT with a VOID head column.
In a BAT created by hand, you need to activate the VOID column interpretation by specifying a base
OID of your choice. This can be done with the command bat[VOID,T].seqbase(oid).



3.5 Operations on BATs with a VOID column
Any MIL operation that can be executed on a BAT with OIDs can also be executed on a BAT with virtual OIDs. The result
type will automatically be converted to OID, if necessary. Only when it is known from the semantics of the MIL command
that it will again return a BAT with dense OIDs will the return BAT also have a virtual oid column.


3.5.1 Less Storage
The great advantage of virtual oids shows where tables are very large. Not having to store the OID for each item in a BAT
means at least a 50% reduction in space needs. At least 50%, because hardware alignment restrictions generally make the saving
larger (most systems require OIDs to be stored at 4-byte-aligned memory addresses). This means that a BAT[OID,CHR] containing a
4-byte OID and a 1-byte character holds 4 + 1 = 5 bytes of data per tuple, which alignment pads to 8 bytes of storage. A
BAT[VOID,CHR] takes only one byte of storage per tuple; an improvement of a factor 8!
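
The effect of alignment on tuple size can be checked with C-style structures. The Python sketch below uses the standard ctypes module to lay out a 4-byte integer next to a 1-byte character; the two structure classes are illustrative stand-ins and say nothing about how Monet actually lays out BUNs.

```python
import ctypes

class OidChrBun(ctypes.Structure):
    """Stand-in for one BUN of a BAT[OID,CHR]: 4-byte OID plus 1-byte character."""
    _fields_ = [("oid", ctypes.c_uint32), ("chr", ctypes.c_char)]

class VoidChrBun(ctypes.Structure):
    """Stand-in for one BUN of a BAT[VOID,CHR]: the OID is not stored at all."""
    _fields_ = [("chr", ctypes.c_char)]

# Raw payload versus what the platform's alignment rules actually reserve.
payload = ctypes.sizeof(ctypes.c_uint32) + ctypes.sizeof(ctypes.c_char)
print(payload, ctypes.sizeof(OidChrBun), ctypes.sizeof(VoidChrBun))
```

On platforms that align 32-bit integers to 4-byte boundaries, the padded structure occupies 8 bytes for 5 bytes of payload, while the character-only structure occupies a single byte.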


3.5.2 Query Optimization
Most MIL commands just don't see the difference between OID and VOID columns. Only in some specific, but important,
cases, Monet makes use of the specific features of a VOID: lookup on OID number can be very efficient, since you need to do
nothing more than subtract the base number in order to know the position of a BUN. For instance, the bat[VOID,T].find(oid):T
will use positional lookup.
Similar optimizations have been made in the semijoin and join when one of the join columns is virtual oid. The same goes for
the select operation when the selection column is a virtual oid.
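
The positional lookup mentioned above follows directly from the dense-sequence property: the position of a BUN is its OID minus the base. The Python class below is a toy model of a BAT[VOID,T] under that assumption; `VoidHeadBat` and its methods are hypothetical names chosen to echo seqbase() and find().

```python
class VoidHeadBat:
    """Toy BAT[VOID,T]: heads are implied by seqbase + position, only tails are stored."""

    def __init__(self, seqbase, tails):
        self.seqbase = seqbase
        self.tails = list(tails)

    def heads(self):
        # The OIDs are computed on the fly; nothing is stored for them.
        return range(self.seqbase, self.seqbase + len(self.tails))

    def find(self, oid):
        position = oid - self.seqbase          # positional lookup, no search needed
        if not 0 <= position < len(self.tails):
            raise KeyError(oid)
        return self.tails[position]

bat = VoidHeadBat(seqbase=100, tails=["Martin", "Fred", "Niels"])
print(list(bat.heads()))
print(bat.find(101))
```

A lookup costs one subtraction and one array access, independent of the size of the BAT.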



3.6 Pump
Set aggregates are expressed with the MIL pump construct
                                                          {Y}(bat-expr, bat-expr)

which allows you to interpret both BATs as a set of collections of tt values, defined as
                                                         S = { set_h | [h,any] in right }

in which
                                                             set_h = { [t,t] | [h,t] in left }

are bags/sets of tail values.
Y() is a read-only unary BAT command Y(bat[tt,any]) : rt or Y(bat[any,tt]) : rt that ignores the contents of the any column.
The pump version of Y, denoted in MIL
                                                  {Y}(bat[ht,tt], bat[ht,any]) : bat[ht,rt]

is defined as
                                                   { [h,r] | [h,any] in right and r = Y(set_h) }

The {Y}() language construct is orthogonal, and works with all commands, builtins, procs, and dereferenced address variables.
Dereferenced variables (as obtained with fcn = &sum) are called using the syntax {*fcn}(bat-expr).
To clarify it all, we give an example of the use of the sum() operator, which normally sums the tails of one BAT, as a set
aggregate {sum}().
{sum} ( #-----------# , #-----------# ) = #-----------#
        # h | t     #   # h | t     #     # h | t     #
        #-----------#   #-----------#     #-----------#
        [ 1, 20.0 ]     [ 1, nil ]        [ 1, 70.0 ]
        [ 1, 10.0 ]     [ 2, nil ]        [ 2, 40.0 ]
        [ 1, 40.0 ]     [ 3, nil ]        [ 3,  0.0 ]
        [ 2, 20.0 ]
        [ 2, 20.0 ]
In the above we see why the second (right) parameter is necessary: it makes it possible to work with empty sets/bags as well!
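
The {sum}() example can be replayed with a small Python model of the binary pump. The function name pump and the list-of-pairs representation of a BAT are assumptions of this sketch; math.fsum stands in for the MIL sum() command. Dropping left tuples whose head does not occur in the right BAT is also an assumption, since the text does not cover that case.

```python
import math


def pump(aggregate, left, right):
    # One (initially empty) bag per distinct head in the right BAT.
    bags = {}
    for head, _ in right:
        bags.setdefault(head, [])
    # Distribute the left tails over the bags; heads absent from the
    # right BAT are dropped (an assumption of this sketch).
    for head, tail in left:
        if head in bags:
            bags[head].append(tail)
    # Apply the unary aggregate to every bag, including the empty ones.
    return [(head, aggregate(bag)) for head, bag in bags.items()]


left = [(1, 20.0), (1, 10.0), (1, 40.0), (2, 20.0), (2, 20.0)]
right = [(1, None), (2, None), (3, None)]
result = pump(math.fsum, left, right)
print(result)
```

Group 3 appears in the result only because the right BAT lists it; its bag is empty, so the aggregate is applied to an empty collection.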
In the new aggrX3 module you find pumps with three parameters. At this moment, this tertiary aggregate is defined only for
the most common aggregate functions, while the binary pump works for all aggregates (including your own MIL-procs). In the
tertiary pump, the first (left) parameter of the binary pump is split by column into two BATs with synced void head columns,
where the tail of the first contains the original tail column, and the tail of the second the original head column. The rationale for
this new kind of aggregate is that many applications first compute these separate void-bats before joining them together, while
the fact that both are synced and void makes this join unnecessary (as lookup across both BATs is easy).
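
A minimal Python sketch of the three-parameter form follows, under stated assumptions: pump3 is a hypothetical name, the two synced void-headed BATs are modelled as plain lists whose positions line up, and the third argument is taken to play the role of the right parameter of the binary pump (the text does not spell this out).

```python
import math


def pump3(aggregate, values, group_ids, groups):
    # values and group_ids model the tails of two BATs with synced void heads:
    # position i in both lists belongs to the same original tuple.
    if len(values) != len(group_ids):
        raise ValueError("synced BATs must have the same length")
    bags = {g: [] for g in groups}
    # No join is needed: walking both lists in step pairs value and group.
    for value, g in zip(values, group_ids):
        if g in bags:
            bags[g].append(value)
    return [(g, aggregate(bag)) for g, bag in bags.items()]


values = [20.0, 10.0, 40.0, 20.0, 20.0]   # original tail column
group_ids = [1, 1, 1, 2, 2]               # original head column
result3 = pump3(math.fsum, values, group_ids, groups=[1, 2, 3])
print(result3)
```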



3.7 GroupBy and OrderBy
As the aggregates expect a single grouping column (be it the head of the left parameter in the binary pump or the tail in the first
parameter of the tertiary pump), we have to explain how aggregates are computed when the grouping criterion spans multiple
columns.
The Monet strategy is to first reduce the multiple grouping BATs to a single oid column. This is done by the CTgroup()
operator:
CTgroup( #----------# ) = #------------# . CTgroup( #----------# ) = #------------#
         # h   | t  #     # h   | t    #            # h   | t  #     # h   | t    #
         #----------#     #------------#            #----------#     #------------#
         [ 1@0, 2   ]     [ 1@0, 1@0   ]            [ 1@0, 1   ]     [ 1@0, 1@0   ]
         [ 2@0, 1   ]     [ 2@0, 2@0   ]            [ 2@0, 1   ]     [ 2@0, 2@0   ]
         [ 3@0, 4   ]     [ 3@0, 3@0   ]            [ 3@0, 1   ]     [ 3@0, 3@0   ]
         [ 4@0, 2   ]     [ 4@0, 1@0   ]            [ 4@0, 2   ]     [ 4@0, 4@0   ]
         [ 5@0, 2   ]     [ 5@0, 1@0   ]            [ 5@0, 2   ]     [ 5@0, 4@0   ]

The unary CTgroup(bat[void,any]):bat[void,oid] replaces each tail value with the first head oid at which that value occurs. The
binary CTgroup(bat[void,oid], bat[void,any]):bat[void,oid] expects in the first parameter the output of the unary CTgroup()
and adds a new column to the grouping, while matching tuples from both parameters like a binary multiplex.
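
The two CTgroup() forms can be modelled in a few lines of Python. The names ctgroup and ctgroup2 are hypothetical, oids are modelled as plain integers, and the void head is represented only by its base (1 here, matching the example above).

```python
def ctgroup(tails, base=1):
    # Unary form: map every tail value to the first head oid where it occurs.
    first_seen = {}
    out = []
    for pos, value in enumerate(tails):
        oid = base + pos  # the void head is implied by the position
        out.append(first_seen.setdefault(value, oid))
    return out


def ctgroup2(groups, tails, base=1):
    # Binary form: group on the pair (existing group oid, new column value).
    if len(groups) != len(tails):
        raise ValueError("both BATs must have the same length")
    return ctgroup(list(zip(groups, tails)), base)


first = ctgroup([2, 1, 4, 2, 2])
second = ctgroup2(first, [1, 1, 1, 2, 2])
print(first)
print(second)
```

Chaining ctgroup2 once per extra column reduces a grouping over any number of columns to a single group-oid column that a pump can consume.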
Order-by is handled in a similar fashion by the unary sort and the binary CTrefine(), which closely resembles the binary
CTgroup().
#----------#.reverse().sort().reverse() = #----------#.CTrefine( #----------# ) = #------------#
# h   | t  #                              # h   | t  #           # h   | t  #     # h   | t    #
#----------#                              #----------#           #----------#     #------------#
[ 1@0, 2   ]                              [ 2@0, 1   ]           [ 1@0, 1   ]     [ 2@0, 0@0   ]
[ 2@0, 1   ]                              [ 4@0, 2   ]           [ 2@0, 1   ]     [ 1@0, 1@0   ]
[ 3@0, 4   ]                              [ 1@0, 2   ]           [ 3@0, 1   ]     [ 4@0, 2@0   ]
[ 4@0, 2   ]                              [ 5@0, 2   ]           [ 4@0, 2   ]     [ 5@0, 2@0   ]
[ 5@0, 2   ]                              [ 3@0, 4   ]           [ 5@0, 2   ]     [ 3@0, 3@0   ]

The CTrefine(bat[oid,any], bat[void,any]):bat[oid,int] refines the order of the first parameter by looking at tail values from
the second that match on head. It outputs the re-ordered head column (which no longer will be VOID) and renumbers the tail
column with a newly sorted order of integers.
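
As a sketch of the refinement step, the following Python function reproduces the example above. The name ctrefine and the list-of-pairs representation are assumptions; oids are modelled as plain integers, and ties keep their incoming order because Python's sort is stable (the text does not say how Monet orders ties).

```python
def ctrefine(ordered, extra):
    # ordered: (head, tail) pairs already sorted on tail.
    # extra:   (head, value) pairs supplying the next ordering column.
    lookup = dict(extra)
    keyed = [(head, (tail, lookup[head])) for head, tail in ordered]
    keyed.sort(key=lambda item: item[1])  # stable sort on (tail, extra value)
    out = []
    rank = -1
    previous = None
    for head, key in keyed:
        if key != previous:  # a new (tail, extra value) combination
            rank += 1
            previous = key
        out.append((head, rank))
    return out


ordered = [(2, 1), (4, 2), (1, 2), (5, 2), (3, 4)]
extra = [(1, 1), (2, 1), (3, 1), (4, 2), (5, 2)]
refined = ctrefine(ordered, extra)
print(refined)
```

Tuples that agree on both ordering columns receive the same number, so the output can itself be refined by a further column.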
More information on grouping and ordering can be found in the xtables module.



3.8 Updates and Transactions
By default, Monet only has support for global database-wide transactions. A successful commit():bit ensures that all persistent
BATs have been safely written to disk. This is an all-or-nothing operation: abort() reverts the database to its state at
the last successful commit.
Tuples are appended to a BAT with insert() and removed from it with delete().
The default BAT access mode is BAT_READ (with the exception of newly created empty BATs, which have access mode
BAT_WRITE). There is also an access mode BAT_APPEND that only allows inserts.
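
The all-or-nothing behaviour of commit() and abort() can be illustrated with a toy Python model. ToyDatabase and its fields are hypothetical; the sketch keeps everything in memory and does not reflect how Monet writes BATs to disk or enforces access modes.

```python
import copy


class ToyDatabase:
    """Toy model of global commit/abort; not how Monet persists BATs."""

    def __init__(self):
        self.bats = {}        # working state: name -> list of (head, tail)
        self._committed = {}  # state as of the last successful commit

    def commit(self):
        # All-or-nothing: the whole working state becomes the committed state.
        self._committed = copy.deepcopy(self.bats)
        return True

    def abort(self):
        # Discard every change made since the last successful commit.
        self.bats = copy.deepcopy(self._committed)


db = ToyDatabase()
db.bats["b"] = [(0, "x")]
db.commit()
db.bats["b"].append((1, "y"))  # uncommitted insert
db.bats["c"] = [(0, "z")]      # uncommitted new BAT
db.abort()
print(db.bats)
```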



4 Extension Modules
All MIL primitives are in fact defined by extension modules, which have a specification in the Monet Extension Language
(MEL).



4.1 Inspecting Modules
We can query for all available modules using the modules() operator:
> modules();
#--------------------------#
# name         | database  #
#--------------------------#
[ "algebra",     "any"     ]
[ "arith",       "any"     ]
[ "ascii_io",    "any"     ]
[ "bat",         "any"     ]
[ "builtin",     "any"     ]
[ "str",         "any"     ]
[ "sys",         "any"     ]
[ "trans",       "any"     ]
[ ..                       ]
[ "monettime",   "any"     ]
[ ..                       ]
[ "xtables",     "any"     ]
>

This tells us that Monet has the monettime module available, although it is not loaded by default. This module defines a number
of temporal data types and operations.
The MEL specification of this module contains code like this:
.MODULE monettime;


    .. omitted ...


.ATOM timestamp = lng;
    .FROMSTR = timestamp_fromstr;
    .TOSTR = timestamp_tostr;
.END


    .. omitted ...
.END monettime;

It contains a new atomic type called 'timestamp' that captures times in a "Year-Month-Day hour:minute" format.
> var d := timestamp("2002-aug-14 16:14");
!ERROR: interpret: no matching MIL operator to 'timestamp(str)'.

When you try to use a 'timestamp' without loading the monettime module, a runtime error occurs!
> module(monettime);
> loaded();
#----------------------------#
# module       | usage_count #
#----------------------------#
[ "algebra",     1           ]
[ "arith",       1           ]
[ "bat",         1           ]
[ "builtin",     1           ]
[ "str",         1           ]
[ "sys",         1           ]
[ "trans",       1           ]
[ "mapi",        1           ]
[ "xtables",     1           ]
>

After loading the module, we can ask for all commands contained in it with sigs(str modulename):
> sigs("monettime");
[ ...                                           ]
[ "add(timestamp, lng) : timestamp"             ]
[ "date(timestamp) : date"                      ]
[ "date(timestamp, tzone) : date"               ]
[ "daytime(timestamp) : daytime"                ]
[ "daytime(timestamp, tzone) : daytime"         ]
[ "diff(timestamp, timestamp) : lng"            ]
[ "dst(timestamp, tzone) : bit"                 ]
[ "timestamp(date, daytime) : timestamp"        ]
[ "timestamp(date, daytime, tzone) : timestamp" ]
[ ...                                           ]
>
On-line help can be asked for each individual operator:
> help("diff");
COMMAND:  diff(timestamp, timestamp) : lng
MODULE:   monettime
COMPILED: by adm on Mon Oct 7 11:22:50 2002
returns the number of milliseconds between 'val1' and 'val2' (!DS2.2).
>

We can now create temporal values, manipulate them, and store them in BATs.
> var d := timestamp("2002-aug-14 16:14");
> d.print();
[ 2002-08-14 16:14:00.000 ]
>
> var b := bat(oid,timestamp);
> b.insert(0@0, timestamp("2005-may-13 09:10"));
> b.insert(1@0, timestamp("2005-aug-16 10:20"));
> b.insert(2@0, timestamp("2005-sep-15 11:30"));
> b.insert(3@0, timestamp("2005-jan-01 12:40"));
> b.insert(4@0, timestamp("2005-feb-26 13:50"));
> b.insert(5@0, timestamp("2005-aug-15 14:00"));
>
> b.print();
#----------------------------#
# oid  | timestamp           #
#----------------------------#
[ 0@0,   2005-may-13 09:10   ]
[ 1@0,   2005-aug-16 10:20   ]
[ 2@0,   2005-sep-15 11:30   ]
[ 3@0,   2005-jan-01 12:40   ]
[ 4@0,   2005-feb-26 13:50   ]
[ 5@0,   2005-aug-15 14:00   ]
>
> drop(monettime);
>
> b.insert(31165626@0, timestamp("2005-aug-15 15:10"));
!ERROR: interpret: no matching MIL operator to 'timestamp(str)'.
!ERROR: insert(param 3): evaluation error.

After the module is dropped, using temporal primitives will again produce the "no matching MIL operator" error shown earlier.
MIL has a parser that is based on dynamic lookup tables. When a module is loaded, its commands, operators, atoms and
accelerators become accepted keywords for that user.
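
The effect of loading and dropping a module on name lookup can be mimicked with a dictionary-based toy in Python. ToyInterpreter and its methods are hypothetical and model only the lookup-table idea; they are not the MIL parser, and the per-user scope mentioned above is left out.

```python
class ToyInterpreter:
    """Toy model of lookup tables that grow and shrink with loaded modules."""

    def __init__(self):
        self._loaded = {}  # module name -> {operator name: callable}

    def module(self, name, operators):
        # Loading a module makes its operators resolvable.
        self._loaded[name] = dict(operators)

    def drop(self, name):
        # Dropping a module removes its operators from the lookup tables.
        self._loaded.pop(name, None)

    def call(self, op, *args):
        for operators in self._loaded.values():
            if op in operators:
                return operators[op](*args)
        raise LookupError(f"no matching MIL operator to '{op}'")


interp = ToyInterpreter()
interp.module("monettime", {"timestamp": lambda text: ("timestamp", text)})
value = interp.call("timestamp", "2002-aug-14 16:14")
interp.drop("monettime")
print(value)
```

Calling interp.call("timestamp", ...) again after the drop raises LookupError, mirroring the session above.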




5 Module Reference
Extension modules are written with the Mx literate programming tool to integrate code both with the technical and user
documentation. Concerning the latter, the topmost 'hide level' of each module contains a manual page (while the second hide-
level contains the technical documentation). Below, the manual pages of all modules are grouped according to their
functionality.



5.1 Kernel Modules
As of version 4.1, MIL has seven standard modules that are always loaded.
builtin
         these are the builtin system primitives. They cannot be overloaded.
bat
         basic commands to create and manipulate BATs.
algebra
         relational core of BAT commands (select, join, etc).
arith
         simple arithmetic operators on the standard types.
sys
         system information. This module contains helpful procedures for navigating through the system (ls, help).
str
         string handling.
trans
         support for global transactions (minimal database consistency).




5.2 Query Processing
algebra
          Relational core of BAT commands (select, join, etc).
aggr
          Binary aggregation operations.
aggrX3
          Fast tertiary aggregation operations.
xtables
          Cross-table operations for data mining.
radix
          Cache-conscious query processing.
enum
          Automatic creation of enumeration types. Used for space compression.
mmath
           Mathematical operations a la math.h.
ascii_io
           Bulk loading.




5.3 Transactions
lock
           Locks and semaphores.




5.4 Database Migration
upgrade
           Since MonetDB 4.8, the database format has changed. This module provides functionality to migrate older databases
           to the new format. Note: This must only be used with MIL-generated databases, not with databases
           that were generated via SQL or XQuery!




5.5 OS APIs
unix
           Provides access to some of the C stdlib.h functions.
alarm
           OS timers and interrupts. Standard IO module a la stdio.h.
streams
           Stream IO a la stdio.h.
tcpip
           TCP/IP asynchronous communication.




5.6 Extension Atoms
bitset
        Compact bit-set manipulation.
bitvector
        Compact bit-set manipulation.
blob
        The generic variable-size atom.
decimal
        Arbitrary precision arithmetic.
monettime
        Temporal data types and operations on them.
qt
        Quadtree implementation
str
        String manipulation module supporting UTF-8.
url
          URL operations




5.7 Performance Monitoring
calib
        Calibrator that automatically extracts relevant hardware parameters from any system.
counters
        System-independent hardware counters. (First version, building directly on system-dependent libraries.)
pcl
        System-independent hardware counters. (Second version, building on the system-independent PCL library.)
mprof
        Monet performance profiling.




5.8 Benchmark Generators
ddbench
       Drill-down benchmark for data mining.
oo7
       OO7 benchmark generation and specific operations.
wisc
       Wisconsin benchmark code.




6 Managing The MIL Interpreter
Now that we have explained the basic structure of MIL and given pointers to the reference manuals of all modules, we turn
to some useful internals of the MIL interpreter of Monet.



6.1 Threads
Monet is a multi-threaded server. One or more MIL interpreter threads are active, handling user requests that are placed in a
queue. Parallel MIL blocks or parallel iterators put many jobs in this queue, which are then consumed by the MIL
interpreter threads, thus creating parallelism.
At startup of the Mserver, one such interpreter thread is created. For each incoming MapiClient session, a new thread is created
as well.
> threads();
#-----------------------#
# BAT:    thread        #
# (int)   (str)         #
#-----------------------#
[ 1,      "Interpreter" ]
>
> threadcnt(2);
#-----------------------#
# BAT:    thread        #
# (int)   (str)         #
#-----------------------#
[ 1,      "Interpreter" ]
[ 2,      "Interpreter" ]
[ 3,      "Interpreter" ]
>
> threadcnt(-1);
#-----------------------#
# BAT:    thread        #
# (int)   (str)         #
#-----------------------#
[ 1,      "Interpreter" ]
[ 2,      "Interpreter" ]
>
You can get a listing of all Monet threads with the threads() command. The threadcnt(int delta) command lets you increase or
decrease the number of MIL interpreter threads.



6.2 Client Sessions
Monet maintains a number of data structures for each client logged in. By using parallel MIL blocks or parallel iterators, one
client can keep multiple threads busy (threads are thus a lower-level concept than clients).
Clients can be:
           forked with fork(expr):int, which returns a client-id.

           killed with kill(int id).

           listed with clients().

A common application of fork() is to start a Mapi listener in the background:
> module(mapi);
> listen(45678).fork();
> clients();
#------------------------------------------------------------------#
# BAT:    name  | login                      | mil                 #
# (int)   (str) | (str)                      | (str)               #
#------------------------------------------------------------------#
[ 0,      "adm",  "Mon Oct 7 15:59:18 2002",   "clients()"         ]
[ 1,      "adm",  "Mon Oct 7 15:59:18 2002",   "listen(45678)"     ]
>
> kill(1);
>
> clients();
#------------------------------------------------------------------#
# BAT:    name  | login                      | mil                 #
# (int)   (str) | (str)                      | (str)               #
#------------------------------------------------------------------#
[ 0,      "adm",  "Mon Oct 7 15:59:18 2002",   "clients()"         ]
>

The session is ended with a quit(). Note that typing this in the Mserver console will stop the Monet server immediately,
whereas in the MapiClient it will only end the current user session. The server can be shut down from the MapiClient (and
also from the Mserver console) with shutdown().



6.3 Miscellaneous

6.3.1 Nested BATs
Since the tuples can be of any type, recursive types are possible (that is, you can make BATs that contain BATs). This feature,
though still supported, is not recommended for wide use, because recursively referenced BATs are locked into memory. It is
often more appropriate (e.g. more transparent to the user) to administer relationships between BATs through their names.

6.3.2 Multiplexing a Dereferenced String
It is possible to use a string dereferencing of a command, operator, or proc as the function in a multiplex operator:
[*fcn-expr](..params..).




7 Further Reference
We recommend you to study the commands in the following modules:
system
       builtin (special keywords), sys (especially look at the PROCs).
bats
       bat (basic bat management) and algebra (the core algebra).
advanced query processing
       aggr and aggrX3 (aggregates), xtables (groupings) and radix (cache-conscious join).

The Ph.D. thesis of Peter Boncz is a full reference of Monet and MIL. Among other topics, it discusses in detail how relational
queries (SQL) can be translated into cache and CPU-efficient MIL.

				