8.24 The Inner Core

The innermost library of the MonetDB database system is formed by the library called GDK, an abbreviation of Goblin Database Kernel. Its development was originally rooted in the design of a pure active-object-oriented programming language, before development was shifted towards a re-usable database kernel engine.

GDK is a C library that provides ACID properties on a DSM model , using main-memory database algorithms built on virtual-memory OS primitives and multi-threaded parallelism. Its implementation has undergone various changes over its decade of development, many of which were driven by external needs to obtain a robust and fast database system.

The coding scheme explored in GDK has also laid a foundation to communicate over time experiences and to provide (hopefully) helpful advice near to the place where the code-reader needs it. Of course, over such a long time the documentation diverges from reality. Especially in areas where the environment of this package is being described. Consider such deviations as historic landmarks, e.g. crystallization of brave ideas and mistakes rectified at a later stage.

8.25 Short Outline

The facilities provided in this implementation are:

  • GDK or Goblin Database Kernel routines for session management
  • BAT routines that define the primitive operations on the database tables (BATs).
  • BBP routines to manage the BAT Buffer Pool (BBP).
  • ATOM routines to manipulate primitive types, define new types using an ADT interface.
  • HEAP routines for manipulating heaps: linear spaces of memory that are GDK's vehicle of mass storage (on which BATs are built).
  • DELTA routines to access inserted/deleted elements within a transaction.
  • HASH routines for manipulating GDK's built-in linear-chained hash tables, for accelerating lookup searches on BATs.
  • TM routines that provide basic transaction management primitives.
  • TRG routines that provided active database support. [DEPRECATED]
  • ALIGN routines that implement BAT alignment management.

The Binary Association Table (BAT) is the lowest level of storage considered in the Goblin runtime system . A BAT is a self-descriptive main-memory structure that represents the binary relationship between two atomic types. The association can be defined over:

void:
virtual-OIDs: a densely ascending column of OIDs (takes zero-storage).
bit:
Booleans, implemented as one byte values.
chr:
A single character (8 bits integers). DEPRECATED for storing text (Unicode not supported).
bte:
Tiny (1-byte) integers (8-bit integers).
sht:
Short integers (16-bit integers).
int:
This is the C int type (32-bit).
oid:
Unique long int values uses as object identifier. Highest bit cleared always. Thus, oids-s are 31-bit numbers on 32-bit systems, and 63-bit numbers on 64-bit systems.
wrd:
Machine-word sized integers (32-bit on 32-bit systems, 64-bit on 64-bit systems).
ptr:
Memory pointer values. DEPRECATED. Can only be stored in transient BATs.
flt:
The IEEE float type.
dbl:
The IEEE double type.
lng:
Longs: the C long long type (64-bit integers).
str:
UTF-8 strings (Unicode). A zero-terminated byte sequence.
bat:
Bat descriptor. This allows for recursive adminstered tables, but severely complicates transaction management. Therefore, they CAN ONLY BE STORED IN TRANSIENT BATs.

This model can be used as a back-end model underlying other -higher level- models, in order to achieve better performance and data independence in one go. The relational model and the object-oriented model can be mapped on BATs by vertically splitting every table (or class) for each attribute. Each such a column is then stored in a BAT with type bat[oid,attribute], where the unique object identifiers link tuples in the different BATs. Relationship attributes in the object-oriented model hence are mapped to bat[oid,oid] tables, being equivalent to the concept of join indexes .

The set of built-in types can be extended with user-defined types through an ADT interface. They are linked with the kernel to obtain an enhanced library, or they are dynamically loaded upon request.

Types can be derived from other types. They represent something different than that from which they are derived, but their internal storage management is equal. This feature facilitates the work of extension programmers, by enabling reuse of implementation code, but is also used to keep the GDK code portable from 32-bits to 64-bits machines: the oid and ptr types are derived from int on 32-bits machines, but is derived from lng on 64 bits machines. This requires changes in only two lines of code each.

To accelerate lookup and search in BATs, GDK supports one built-in search accelerator: hash tables. We choose an implementation efficient for main-memory: bucket chained hash . Alternatively, when the table is sorted, it will resort to merge-scan operations or binary lookups.

BATs are built on the concept of heaps, which are large pieces of main memory. They can also consist of virtual memory, in case the working set exceeds main-memory. In this case, GDK supports operations that cluster the heaps of a BAT, in order to improve performance of its main-memory.

8.25.1 Rationale

The rationale for choosing a BAT as the building block for both relational and object-oriented system is based on the following observations:

  • - Given the fact that CPU speed and main-memory increase in current workstation hardware for the last years has been exceeding IO access speed increase, traditional disk-page oriented algorithms do no longer take best advantage of hardware, in most database operations.

    Instead of having a disk-block oriented kernel with a large memory cache, we choose to build a main-memory kernel, that only under large data volumes slowly degrades to IO-bound performance, comparable to traditional systems .

  • - Traditional (disk-based) relational systems move too much data around to save on (main-memory) join operations.

    The fully decomposed store (DSM assures that only those attributes of a relation that are needed, will have to be accessed.

  • - The data management issues for a binary association is much easier to deal with than traditional struct-based approaches encountered in relational systems.
  • - Object-oriented systems often maintain a double cache, one with the disk-based representation and a C pointer-based main-memory structure. This causes expensive conversions and replicated storage management. GDK does not do such `pointer swizzling'. It used virtual-memory (mmap()) and buffer management advice (madvise()) OS primitives to cache only once. Tables take the same form in memory as on disk, making the use of this technique transparent .

A RDBMS or OODBMS based on BATs strongly depends on our ability to efficiently support tuples and to handle small joins, respectively.

The remainder of this document describes the Goblin Database kernel implementation at greater detail. It is organized as follows:

GDK Interface:
It describes the global interface with which GDK sessions can be started and ended, and environment variables used.
Binary Association Tables:
As already mentioned, these are the primary data structure of GDK. This chapter describes the kernel operations for creation, destruction and basic manipulation of BATs and BUNs (i.e. tuples: Binary UNits).
BAT Buffer Pool:
All BATs are registered in the BAT Buffer Pool. This directory is used to guide swapping in and out of BATs. Here we find routines that guide this swapping process.
GDK Extensibility:
Atoms can be defined using a unified ADT interface. There is also an interface to extend the GDK library with dynamically linked object code.
GDK Utilities:
Memory allocation and error handling primitives are provided. Layers built on top of GDK should use them, for proper system monitoring. Thread management is also included here.
Transaction Management:
For the time being, we just provide BAT-grained concurrency and global transactions. Work is needed here.
BAT Alignment:
Due to the mapping of multi-ary datamodels onto the BAT model, we expect many correspondences among BATs, e.g. bat(oid,attr1),.. bat(oid,attrN) vertical decompositions. Frequent activities will be to jump from one attribute to the other (`bunhopping'). If the head columns are equal lists in two BATs, merge or even array lookups can be used instead of hash lookups. The alignment interface makes these relations explicitly manageable.

In GDK, complex data models are mapped with DSM on binary tables. Usually, one decomposes N-ary relations into N BATs with an oid in the head column, and the attribute in the tail column. There may well be groups of tables that have the same sets of oids, equally ordered. The alignment interface is intended to make this explicit. Implementations can use this interface to detect this situation, and use cheaper algorithms (like merge-join, or even array lookup) instead.

BAT Iterators:
Iterators are C macros that generally encapsulate a complex for-loop. They would be the equivalent of cursors in the SQL model. The macro interface (instead of a function call interface) is chosen to achieve speed when iterating main-memory tables.
Common BAT Operations:
These are much used operations on BATs, such as aggregate functions and relational operators. They are implemented in terms of BAT- and BUN-manipulation GDK primitives.

8.26 Interface Files

In this section we summarize the user interface to the GDK library. It consist of a header file (gdk.h) and an object library (gdklib.a), which implements the required functionality. The header file must be included in any program that uses the library. The library must be linked with such a program.

8.26.1 Database Context

The MonetDB environment settings are collected in a configuration file. Amongst others it contains the location of the database directory. First, the database directory is closed for other servers running at the same time. Second, performance enhancements may take effect, such as locking the code into memory (if the OS permits) and preloading the data dictionary. An error at this stage normally lead to an abort.

8.26.2 GDK session handling

int GDKinit (char *db, char *dbfarm, int allocmap)
int GDKexit (int status)

The session is bracketed by GDKinit and GDKexit. Initialization involves setting up the administration for database access, such as memory allocation for the database buffer pool. During the exit phase any pending transaction is aborted and the database is freed for access by other users. A zero is returned upon encountering an erroneous situation.

8.27 Binary Association Tables

Having gone to the previous preliminary definitions, we will now introduce the structure of Binary Association Tables (BATs) in detail. They are the basic storage unit on which GDK is modelled.

The BAT holds an unlimited number of binary associations, called BUNs (Binary UNits). The two attributes of a BUN are called head (left) and tail (right) in the remainder of this document.

The above figure shows what a BAT looks like. It consists of two columns, called head and tail, such that we have always binary tuples (BUNs). The overlooking structure is the BAT record. It points to a heap structure called the BUN heap. This heap contains the atomic values inside the two columns. If they are fixed-sized atoms, these atoms reside directly in the BUN heap. If they are variable-sized atoms (such as string or polygon), however, the columns has an extra heap for storing those (such variable-sized atom heaps are then referred to as Head Heaps and Tail Heaps). The BUN heap then contains integer byte-offsets (fixed-sized, of course) into a head- or tail-heap.

The BUN heap contains a contiguous range of BUNs. It starts after the first pointer, and finishes at the end in the free area of the BUN. All BUNs after the inserted pointer have been added in the last transaction (and will be deleted on a transaction abort). All BUNs between the deleted pointer and the first have been deleted in this transaction (and will be reinserted at a transaction abort).

The location of a certain BUN in a BAT may change between successive library routine invocations. Therefore, one should avoid keeping references into the BAT storage area for long periods.

Passing values between the library routines and the enclosing C program is primarily through value pointers of type ptr. Pointers into the BAT storage area should only be used for retrieval. Direct updates of data stored in a BAT is forbidden. The user should adhere to the interface conventions to guarantee the integrity rules and to maintain the (hidden) auxiliary search structures.

8.27.1 GDK variant record type

When manipulating values, MonetDB puts them into value records. The built-in types have a direct entry in the union. Others should be represented as a pointer of memory in pval or as a string, which is basically the same. In such cases the len field indicates the size of this piece of memory.

     

8.27.2 The BAT record

The elements of the BAT structure are introduced in the remainder. Instead of using the underlying types hidden beneath it, one should use a BAT type that is supposed to look like this:

typedef struct {
          // static BAT properties
          bat    batCacheid;       // bat id: index in BBPcache
          int    batPersistence;   // persistence mode
          bit    batCopiedtodisk;  // BAT is saved on disk?
          bit    batSet;           // all tuples in the BAT are unique?
          // dynamic BAT properties
          int    batHeat;          // heat of BAT in the BBP
          sht    batDirty;         // BAT modified after last commit?
          bit    batDirtydesc;     // BAT descriptor specific dirty flag
          Heap*  batBuns;          // Heap where the buns are stored
          // DELTA status
          BUN    batDeleted;       // first deleted BUN
          BUN    batFirst;         // empty BUN before the first alive BUN
          BUN    batInserted;      // first inserted BUN
          BUN    batCount;         // Tuple count
          // Head properties
          int    htype;            // Head type number
          str    hident;           // name for head column
          bit    hkey;             // head values should be unique?
          bit    hsorted;          // are head values currently ordered?
          bit    hvarsized;        // for speed: head type is varsized?
          bit    hnonil;           // head has no nils
          oid    halign;          // alignment OID for head.
          // Head storage
          int    hloc;             // byte-offset in BUN for head elements
          Heap   *hheap;           // heap for varsized head values
          Hash   *hhash;           // linear chained hash table on head
          // Tail properties
          int    ttype;            // Tail type number
          str    tident;           // name for tail column
          bit    tkey;             // tail values should be unique?
          bit    tnonil;           // tail has no nils
          bit    tsorted;          // are tail values currently ordered?
          bit    tvarsized;        // for speed: tail type is varsized?
          oid    talign;           // alignment OID for head.
          // Tail storage
          int    tloc;             // byte-offset in BUN for tail elements
          Heap   theap;            // heap for varsized tail values
          Hash   thash;            // linear chained hash table on tail
 } BAT;

The internal structure of the BAT record is in fact much more complex, but GDK programmers should refrain of making use of that.

The reason for this complex structure is to allow for a BAT to exist in two incarnations at the time: the normal view and the reversed view. Each bat b has a BATmirror(b) which has the negative cacheid of b in the BBP.

Since we don't want to pay cost to keep both views in line with each other under BAT updates, we work with shared pieces of memory between the two views. An update to one will thus automatically update the other. In the same line, we allow synchronized BATs (BATs with identical head columns, and marked as such in the BAT Alignment interface) now to be clustered horizontally.

8.27.3 Heap Management

Heaps are the low-level entities of mass storage in BATs. Currently, they can either be stored on disk, loaded into memory, or memory mapped.

int HEAPalloc (Heap *h, size_t nitems, size_t itemsize);
int HEAPfree (Heap *h);
int HEAPextend (Heap *h, size_t size);
int HEAPload (Heap *h, str nme,ext, int trunc);
int HEAPsave (Heap *h, str nme,ext);
int HEAPcopy (Heap *dst,*src);
int HEAPdelete (Heap *dst, str o, str ext);
int HEAPwarm (Heap *h);

These routines should be used to alloc free or extend heaps; they isolate you from the different ways heaps can be accessed.


   

8.27.4 Internal HEAP Chunk Management

Heaps are used in BATs to store data for variable-size atoms. The implementor must manage malloc()/free() functionality for atoms in this heap. A standard implementation is provided here.

void
HEAP_initialize (Heap* h, size_t nbytes, size_t nprivate, int align )
void
HEAP_destroy (Heap* h)
var_t
HEAP_malloc (Heap* heap, size_t nbytes)
void
HEAP_free (Heap *heap, var_t block)
int
HEAP_private (Heap* h)
void
HEAP_printstatus (Heap* h)
void
HEAP_check (Heap* h)

The heap space starts with a private space that is left untouched by the normal chunk allocation. You can use this private space e.g. to store the root of an rtree HEAP_malloc allocates a chunk of memory on the heap, and returns an index to it. HEAP_free frees a previously allocated chunk HEAP_private returns an integer index to private space.


   

8.27.5 BAT construction

BAT* BATnew (int headtype, int tailtype, BUN cap)
BAT* BATextend (BAT *b, BUN newcap)
A temporary BAT is instantiated using BATnew with the type aliases of the required binary association. The aliases include the built-in types, such as TYPE_int....TYPE_ptr, and the atomic types introduced by the user. The initial capacity to be accommodated within a BAT is indicated by cap. Their extend is automatically incremented upon storage overflow. Failure to create the BAT results in a NULL pointer.

The routine BATclone creates an empty BAT storage area with the properties inherited from its argument.

8.27.6 BUN manipulation

BAT* BATins (BAT *b, BAT *c, bit force)
BAT* BATappend (BAT *b, BAT *c, bit force)
BAT* BATdel (BAT *b, BAT *c, bit force)
BAT* BUNins (BAT *b, ptr left, ptr right, bit force)
BAT* BUNappend (BAT *b, ptr right, bit force)
BAT* BUNreplace (BAT *b, ptr left, ptr right, bit force)
int BUNdel (BAT *b, ptr left, ptr right, bit force)
int BUNdelHead (BAT *b, ptr left, bit force)
BUN BUNfnd (BAT *b, ptr head)
void BUNfndOID (BUN result, BATiter bi, oid *head)
void BUNfndSTD (BUN result, BATiter bi, ptr head)
BUN BUNlocate (BAT *b, ptr head, ptr tail)
ptr BUNhead (BAT *b, BUN p)
ptr BUNtail (BAT *b, BUN p)
The BATs contain a number of fixed-sized slots to store the binary associations. These slots are called BUNs or BAT units. A BUN variable is a pointer into the storage area of the BAT, but it has limited validity. After a BAT modification, previously obtained BUNs may no longer reside at the same location.

The association list does not contain holes. This density permits users to quickly access successive elements without the need to test the items for validity. Moreover, it simplifies transport to disk and other systems. The negative effect is that the user should be aware of the evolving nature of the sequence, which may require copying the BAT first.

The update operations come in three flavors. Element-wise updates can use BUNins, BUNappend, BUNreplace, BUNdel, and BUNdelHead. The batch update operations are BATins, BATappend and BATdel.

Only experts interested in speed may use BUNfastins, since it skips most consistency checks, does not update search accelerators, and does not maintain properties such as the hsorted and tsorted flags. Beware!

The routine BUNfnd provides fast access to a single BUN providing a value for the head of the binary association. A very fast shortcut for BUNfnd if the selection type is known to be integer or OID, is provided in the form of the macro BUNfndOID.

To select on a tail, one should use the reverse view obtained by BATmirror.

The routines BUNhead and BUNtail return a pointer to the first and second value in an association, respectively. To guard against side effects on the BAT, one should normally copy this value into a scratch variable for further processing.

Behind the interface we use several macros to access the BUN fixed part and the variable part. The BUN operators always require a BAT pointer and BUN identifier.

  • BAThtype(b) and BATttype(b) find out the head and tail type of a BAT.
  • BUNfirst(b) returns a BUN pointer to the first BUN as a BAT.
  • BUNlast(b) returns the BUN pointer directly after the last BUN in the BAT.
  • BUNhead(b, p) and BUNtail(b, p) return pointers to the head-value and tail-value in a given BUN.
  • BUNhloc(b, p) and BUNtloc(b, p) do the same thing, but knowing in advance that the head-atom resp. tail-atom of a BAT is fixed size.
  • BUNhvar(b, p) and BUNtvar(b, p) do the same thing, but knowing in advance that the head-atom resp. tail-atom of a BAT is variable sized.

8.27.7 BAT properties

BUN BATcount (BAT *b)
void BATsetcapacity (BAT *b, BUN cnt)
void BATsetcount (BAT *b, BUN cnt)
BUN BATbuncount (BAT *b)
str BATrename (BAT *b, str nme)
BAT * BATkey (BAT *b, int onoff)
BAT * BATset (BAT *b, int onoff)
BAT * BATmode (BAT *b, int mode)
BAT * BATsetaccess (BAT *b, int mode)
int BATdirty (BAT *b)
int BATgetaccess (BAT *b)
The function BATcount returns the number of associations stored in the BAT.

The function BATbuncount returns the space that is occupied in associations in the BAT. This is not the same as BATcount, since the first N associations may be unused or delta data.

The BAT is given a new logical name using BATrename.

The integrity properties to be maintained for the BAT are controlled separately. A key property indicates that duplicates in the association dimension are not permitted. The BAT is turned into a set of associations using BATset. Key and set properties are orthogonal integrity constraints. The strongest reduction is obtained by making the BAT a set with key restrictions on both dimensions.

The persistency indicator tells the retention period of BATs. The system support three modes: PERSISTENT, TRANSIENT, and SESSION. The PERSISTENT BATs are automatically saved upon session boundary or transaction commit. TRANSIENT BATs are removed upon transaction boundary. SESSION BATs are removed at the end of a session. They are normally used to maintain temporary results. All BATs are initially TRANSIENT unless their mode is changed using the routine BATmode.

The BAT properties may be changed at any time using BATkey, BATset, and BATmode.

Valid BAT access properties can be set with BATsetaccess and BATgetaccess: BAT_READ, BAT_APPEND, and BAT_WRITE. BATs can be designated to be read-only. In this case some memory optimizations may be made (slice and fragment bats can point to stable subsets of a parent bat). A special mode is append-only. It is then allowed to insert BUNs at the end of the BAT, but not to modify anything that already was in there.

8.27.8 BAT manipulation

BAT * BATclear (BAT *b)
BAT * BATcopy (BAT *b, int ht, int tt, int writeable)
BAT * BATmark (BAT *b, oid base)
BAT * BATmark_grp (BAT *b, BAT *g, oid *s)
BAT * BATnumber (BAT *b)
BAT * BATmirror (BAT *b)
BAT * BATreset (BAT *b)
The routine BATclear removes the binary associations, leading to an empty, but (re-)initialized BAT. Its properties are retained. A temporary copy is obtained with BATcopy. The new BAT has an unique name. The routine BATmark creates a binary association that introduces a new tail column of fresh densely ascending OIDs. The base OID can be given explicitly, or if oid_nil is passed, is chosen as a new unique range by the system. A similar routine is BATnumber, which copies the heads and assigns an integer index to the tail. It plays a crucial role in administration of query results.

The routine BATmirror returns the mirror image BAT (where tail is head and head is tail) of that same BAT. This does not involve a state change in the BAT (as previously): both views on the BAT exist at the same time.

8.27.9 BAT Input/Output

BAT * BATload (str name)
BAT * BATsave (BAT *b)
int BATmmap (BAT *b, int hb, int tb, int hh, int th, int force )
int BATmadvise (BAT *b, int hb, int tb, int hh, int th )
int BATdelete (BAT *b)

A BAT created by BATnew is considered temporary until one calls the routine BATsave or BATmode. This routine reserves disk space and checks for name clashes in the BAT directory. It also makes the BAT persistent. The empty BAT is initially marked as ordered on both columns. Failure to read or write the BAT results in a NULL, otherwise it returns the BAT pointer.

8.27.10 Heap Storage Modes

The discriminative storage modes are memory-mapped, compressed, or loaded in memory. The BATmmap() changes the storage mode of each heap associated to a BAT. As can be seen in the bat record, each BAT has one BUN-heap (bn), and possibly two heaps (hh and th) for variable-sized atoms. The BATmadvise call works in the same way. Using the madvise() system call it issues buffer management advice to the OS kernel, as for the expected usage pattern of the memory in a heap.

8.27.11 Printing

int BATprintf (stream *f, BAT *b)
int BATmultiprintf (stream *f, int argc, BAT *b[], int printoid, int order, int printorderby)

The functions to convert BATs into ASCII and the reverse use internally defined formats. They are primarily meant for ease of debugging and to a lesser extent for output processing. Printing a BAT is done essentially by looping through its components, printing each association. If an index is available, it will be used. The BATmultiprintf command assumes a set of BATs with corresponding oid-s in the head columns. It performs the multijoin over them, and prints the multi-column result on the file.

8.27.12 BAT clustering

BAT * BATsort (BAT *b)
BAT * BATsort_rev (BAT *b)
BAT * BATorder (BAT *b)
BAT * BATorder_rev (BAT *b)
BAT * BATrevert (BAT *b)
int BATordered (BAT *b)
When working in a main-memory situation, clustering of data on disk-pages is not important. Whenever mmap()-ed data is used intensively, reducing the number of page faults is a hot issue.

The above functions rearrange data in MonetDB heaps (used for storing BUNs var-sized atoms, or accelerators). Applying these clusterings will allow that MonetDB's main-memory oriented algorithms work efficiently also in a disk-oriented context.

The BATsort functions return a copy of the input BAT, sorted in ascending order on the head column. BATordered starts a check on the head values to see if they are ordered. The result is returned and stored in the hsorted field of the BAT. BATorder is similar to BATsort, but sorts the BAT itself, rather than returning a copy (BEWARE: this operation destroys the delta information. TODO:fix). The BATrevert puts all the live BUNs of a BAT in reverse order. It just reverses the sequence, so this does not necessarily mean that they are sorted in reverse order!

8.28 BAT Buffer Pool

int BBPfix (bat bi)
int BBPunfix (bat bi)
int BBPincref (bat bi, int logical)
int BBPdecref (bat bi, int logical)
void BBPhot (bat bi)
void BBPcold (bat bi)
str BBPname (bat bi)
bat BBPindex (str nme)
BAT* BATdescriptor (bat bi)
bat BBPcacheid (BAT *b)
The BAT Buffer Pool module contains the code to manage the storage location of BATs. It uses two tables BBPlogical and BBphysical to relate the BAT name with its corresponding file system name. This information is retained in an ASCII file within the database home directory for ease of inspection. It is loaded upon restart of the server and saved upon transaction commit (if necessary).

The remaining BBP tables contain status information to load, swap and migrate the BATs. The core table is BBPcache which contains a pointer to the BAT descriptor with its heaps. A zero entry means that the file resides on disk. Otherwise it has been read or mapped into memory.

BATs loaded into memory are retained in a BAT buffer pool. They retain their position within the cache during their life cycle, which make indexing BATs a stable operation. Their descriptor can be obtained using BBPcacheid.

The BBPindex routine checks if a BAT with a certain name is registered in the buffer pools. If so, it returns its BAT id. The BATdescriptor routine has a BAT id parameter, and returns a pointer to the corresponding BAT record (after incrementing the reference count). The BAT will be loaded into memory, if necessary.

8.29 GDK Extensibility

GDK can be extended with new atoms, search accelerators and storage modes.

8.29.1 Atomic Type Descriptors

The atomic types over which the binary associations are maintained are described by an atom descriptor.

void ATOMproperty (str nme, char *property, int (*fcn)(), int val);
int ATOMindex (char *nme);
int ATOMdump ();
void ATOMdelete (int id);
str ATOMname (int id);
int ATOMsize (int id);
int ATOMalign (int id);
int ATOMvarsized (int id);
ptr ATOMnilptr (int id);
int ATOMfromstr (int id, str s, int* len, ptr* v_dst);
int ATOMtostr (int id, str s, int* len, ptr* v_dst);
hash_t ATOMhash (int id, ptr val, in mask);
int ATOMcmp (int id, ptr val_1, ptr val_2);
int ATOMconvert (int id, ptr v, int direction);
int ATOMfix (int id, ptr v);
int ATOMunfix (int id, ptr v);
int ATOMheap (int id, Heap *hp, size_t cap);
void ATOMheapconvert (int id, Heap *hp, int direction);
int ATOMheapcheck (int id, Heap *hp, HeapRepair *hr);
int ATOMput (int id, Heap *hp, BUN pos_dst, ptr val_src);
int ATOMdel (int id, Heap *hp, BUN v_src);
int ATOMlen (int id, ptr val);
ptr ATOMnil (int id);
int ATOMformat (int id, ptr val, char** buf);
int ATOMprint (int id, ptr val, stream *fd);
ptr ATOMdup (int id, ptr val );

8.29.2 Atom Definition

User defined atomic types can be added to a running system with the following interface:.

  • ATOMproperty() registers a new atom definition, if there is no atom registered yet under that name. It then installs the attribute of the named property. Valid names are "size", "align", "null", "fromstr", "tostr", "cmp", "hash", "put", "get", "del", "length" and "heap".
  • ATOMdelete() unregisters an atom definition.
  • ATOMindex() looks up the atom descriptor with a certain name.

8.29.3 Atom Manipulation

  • The ATOMname() operation retrieves the name of an atom using its id.
  • The ATOMsize() operation returns the atoms fixed size.
  • The ATOMalign() operation returns the atoms minimum alignment. If the alignment info was not specified explicitly during atom install, it assumes the maximum value of {1,2,4,8} smaller than the atom size.
  • The ATOMnilptr() operation returns a pointer to the nil-value of an atom. We usually take one dedicated value halfway down the negative extreme of the atom range (if such a concept fits), as the nil value.
  • The ATOMnil() operation returns a copy of the nil value, allocated with GDKmalloc().
  • The ATOMheap() operation creates a new var-sized atom heap in 'hp' with capacity 'cap'.
  • The ATOMhash() computes a hash index for a value. `val' is a direct pointer to the atom value. Its return value should be an hash_t between 0 and 'mask'.
  • The ATOMcmp() operation computes two atomic values. Its parameters are pointers to atomic values.
  • The ATOMlen() operation computes the byte length for a value. `val' is a direct pointer to the atom value. Its return value should be an integer between 0 and 'mask'.
  • The ATOMdel() operation deletes a var-sized atom from its heap `hp'. The integer byte-index of this value in the heap is pointed to by `val_src'.
  • The ATOMput() operation inserts an atom `src_val' in a BUN at `dst_pos'. This involves copying the fixed sized part in the BUN. In case of a var-sized atom, this fixed sized part is an integer byte-index into a heap of var-sized atoms. The atom is then also copied into that heap `hp'.
  • The ATOMfix() and ATOMunfix() operations do bookkeeping on the number of references that a GDK application maintains to the atom. In MonetDB, we use this to count the number of references directly, or through BATs that have columns of these atoms. The only operator for which this is currently relevant is BAT. The operators return the POST reference count to the atom. BATs with fixable atoms may not be stored persistently.
  • The ATOMfromstr() parses an atom value from string `s'. The memory allocation policy is the same as in ATOMget(). The return value is the number of parsed characters.
  • The ATOMprint() prints an ASCII description of the atom value pointed to by `val' on file descriptor `fd'. The return value is the number of parsed characters.
  • The ATOMformat() is similar to ATOMprint(). It prints an atom on a newly allocated string. It must later be freed with GDKfree. The number of characters written is returned. This is minimally the size of the allocated buffer.
  • The ATOMdup() makes a copy of the given atom. The storage needed for this is allocated and should be removed by the user.

These wrapper functions correspond closely to the interface functions one has to provide for a user-defined atom. They basically (with exception of ATOMput(), ATOMprint() and ATOMformat()) just have the atom id parameter prepended to them.

8.29.4 Unique OIDs

oid OIDseed (oid seed);
oid OIDnew (oid inc);

OIDs are special kinds of unsigned integers because the system guarantees uniqueness. For system simplicity and performance, OIDs are now represented as (signed) integers; however this is hidden in the system internals and shouldn't affect semantics.

The OIDnew(N) claims a range of N contiguous unique, unused OIDs, and returns the starting value of this range. The highest OIDBITS designate site. [ DEPRECATED]

8.29.5 Built-in Accelerator Functions

BAT* BAThash (BAT *b, BUN masksize)
BAT * BAThashsplit (BAT *b, BUN n, int unary)
BAT * BATrangesplit (BAT *b, int n)

The current BAT implementation supports one search accelerator: hashing. The routine BAThash makes sure that a hash accelerator on the head of the BAT exists. A zero is returned upon failure to create the supportive structures.

The hash data structures are currently maintained during update operations.

A BAT can be redistributed over n buckets using a hash function with BAThashsplit. The return value is a list of BAT pointers. Similarly, a range partitioning based is supported.

8.29.6 Multilevel Storage Modes

We should bring in the compressed mode as the first, maybe built-in, mode. We could than add for instance HTTP remote storage, SQL storage, and READONLY (cd-rom) storage.

8.30 GDK Utilities

Interfaces for memory management, error handling, thread management and system information.

8.30.1 GDK memory management

void* GDKmalloc (size_t size)
void* GDKzalloc (size_t size)
void* GDKmallocmax (size_t size, size_t *maxsize, int emergency)
void* GDKrealloc (void* pold, size_t size)
void* GDKreallocmax (void* pold, size_t size, size_t *maxsize, int emergency)
void GDKfree (void* blk)
str GDKstrdup (str s)
void* GDKvmalloc (size_t size, size_t *maxsize, int emergency)
void* GDKvmrealloc (void* pold, size_t oldsize, size_t newsize, size_t oldmax, size_t *maxsize, int emergency)
void GDKvmfree (void* blk, size_t size, size_t maxsize)
These utilities are primarily used to maintain control over critical interfaces to the C library. Moreover, the statistic routines help in identifying performance and bottlenecks in the current implementation.

Compiled with -DMEMLEAKS the GDK memory management log their activities, and are checked on inconsistent frees and memory leaks.

8.30.2 GDK error handling

str GDKmessage
bit GDKsilent
int GDKfatal(str msg)
int GDKwarning(str msg)
int GDKerror (str msg)
int GDKgoterrors ()
int GDKsyserror (str msg)
str GDKerrbuf
GDKsetbuf (str buf)

The error handling mechanism is not sophisticated yet. Experience should show if this mechanism is sufficient. Most routines return a pointer with zero to indicate an error.

The error messages are also copied to standard output unless GDKsilent is set to a non-zero value. The last error message is kept around in a global variable.

Error messages can also be collected in a user-provided buffer, instead of being echoed to a stream. This is a thread-specific issue; you want to decide on the error mechanism on a thread-specific basis. This effect is established with GDKsetbuf. The memory (de)allocation of this buffer, that must at least be 1024 chars long, is entirely by the user. A pointer to this buffer is kept in the pseudo-variable GDKerrbuf. Normally, this is a NULL pointer.


   

The GDKembedded variable is a property set in the configuration file to indicate that the kernel is only allowed to run as a single process. This can be used to remove all locking overhead. The actual state of affairs is maintained in GDKprotected, which is set when locking is required, e.g. when multiple threads become active.


   

The kernel maintains a central table of all active threads. They are indexed by their tid. The structure contains information on the input/output file descriptors, which should be set before a database operation is started. It ensures that output is delivered to the proper client. The Thread structure should be ideally made directly accessible to each thread. This speeds up access to tid and file descriptors.

8.31 Transaction Management

int TMcommit ()
int TMabort ()
int TMsubcommit ()

MonetDB by default offers a global transaction environment. The global transaction involves all activities on all persistent BATs by all threads. Each global transaction ends with either TMabort or TMcommit, and immediately starts a new transaction. TMcommit implements atomic commit to disk on the collection of all persistent BATs. For all persistent BATs, the global commit also flushes the delta status for these BATs (see BATcommit/BATabort). This allows to perform TMabort quickly in memory (without re-reading all disk images from disk). The collection of which BATs is persistent is also part of the global transaction state. All BATs that where persistent at the last commit, but were made transient since then, are made persistent again by TMabort. In other words, BATs that are deleted, are only physically deleted at TMcommit time. Until that time, rollback (TMabort) is possible.

Use of TMabort is currently NOT RECOMMENDED due to two bugs:

  • TMabort after a failed %TMcommit does not bring us back to the previous committed state; but to the state at the failed TMcommit.
  • At runtime, TMabort does not undo BAT name changes, whereas a cold MonetDB restart does.

In effect, the problems with TMabort reduce the functionality of the global transaction mechanism to consistent checkpointing at each TMcommit. For many applications, consistent checkpointingis enough.

Extension modules exist that provide fine grained locking (lock module) and Write Ahead Logging (sqlserver). Applications that need more fine-grained transactions, should build this on top of these extension primitives.

TMsubcommit is intended to quickly add or remove BATs from the persistent set. In both cases, rollback is not necessary, such that the commit protocol can be accelerated. It comes down to writing a new BBP.dir.

Its parameter is a BAT-of-BATs (in the tail); the persistence status of that BAT is committed. We assume here that the calling thread has exclusive access to these bats. An error is reported if you try to partially commit an already committed persistent BAT (it needs the rollback mechanism).

8.31.1 Delta Management

BAT * BATcommit (BAT *b)
BAT * BATfakeCommit (BAT *b)
BAT * BATundo (BAT *b)
BAT * BATprev (BAT *b)
BAT * BATalpha (BAT *b)
BAT * BATdelta (BAT *b)

The BAT keeps track of updates with respect to a 'previous state'. Do not confuse 'previous state' with 'stable' or 'commited-on-disk', because these concepts are not always the same. In particular, they diverge when BATcommit, BATfakecommit, and BATundo are called explictly, bypassing the normal global TMcommit protocol (some applications need that flexibility).

BATcommit make the current BAT state the new 'stable state'. This happens inside the global TMcommit on all persistent BATs previous to writing all bats to persistent storage using a BBPsync[?].

EXPERT USE ONLY: The routine BATfakeCommit updates the delta information on BATs and clears the dirty bit. This avoids any copying to disk. Expert usage only, as it bypasses the global commit protocol, and changes may be lost after quitting or crashing MonetDB.

BATabort undo-s all changes since the previous state. The global TMabort[?] achieves a rollback to the previously committed state by doing BATabort on all persistent bats.

BUG: after a failed TMcommit, TMabort does not do anything because TMcommit does the BATcommits before attempting to sync to disk instead of after doing this.

The previous state can also be queried. BATprev is a view on the current BAT as it was in the previous state. BATalpha shows only the BUNs inserted since the previous state, and BATdelta the deleted buns.

CAVEAT: BATprev, BATalpha and BATdelta only return views if the underlying BATs are read-only (often not the case when BATs are being updated). Otherwise, copies must be made anyway.


   

8.32 BAT Alignment and BAT views

int ALIGNsynced (BAT* b1, BAT* b2)
int ALIGNsync (BAT *b1, BAT *b2)
int ALIGNrelated (BAT *b1, BAT *b2)
int ALIGNsetH ((BAT *dst, BAT *src)


BAT * BATpropcheck (BAT *b, int mode)


BAT* VIEWcreate (BAT *h, BAT *t)
int isVIEW (BAT *b)
bat VIEWhparent (BAT *b)
bat VIEWtparent (BAT *b)
BAT* VIEWhead (BAT *b)
BAT* VIEWcombine (BAT *b)
BAT* VIEWreset (BAT *b)
BAT* BATmaterialize (BAT *b)
Alignments of two columns of a BAT means that the system knows whether these two columns are exactly equal. Relatedness of two BATs means that one pair of columns (either head or tail) of both BATs is aligned. The first property is checked by ALIGNsynced, the latter by ALIGNrelated.

The BATpropcheck examines a BAT and tries to set all applicable properties (key,sorted,align,dense).

All algebraic BAT commands propagate the properties - including alignment properly on their results.

VIEW BATs are BATs that lend their storage from a parent BAT. They are just a descriptor that points to the data in this parent BAT. A view is created with VIEWcreate. The cache id of the parent (if any) is returned by VIEWhparent and VIEWtparent (otherwise it returns 0).

VIEW bats are read-only!!

The VIEWcombine gives a view on a BAT that has two head columns of the parent. The VIEWhead constructs a BAT view that has the same head column as the parent, but has a void column with seqbase=nil in the tail. VIEWreset creates a normal BAT with the same contents as its view parameter (it converts void columns with seqbase!=nil to materialized oid columns).

The BATmaterialize materializes a VIEW (TODO) or void bat inplace. This is useful as materialization is usually needed for updates.

8.33 BAT Iterators

BATloop (BAT *b; BUN p, BUN q)
BATloopDEL (BAT *b; BUN p; BUN q; int dummy)
DELloop (BAT *b; BUN p, BUN q, int dummy)
HASHloop (BAT *b; Hash *h, size_t dummy; ptr value)
HASHloop_bit (BAT *b; Hash *h, size_t idx; bit *value, BUN w)
HASHloop_chr (BAT *b; Hash *h, size_t idx; char *value, BUN w)
HASHloop_bte (BAT *b; Hash *h, size_t idx; bte *value, BUN w)
HASHloop_sht (BAT *b; Hash *h, size_t idx; sht *value, BUN w)
HASHloop_bat (BAT *b; Hash *h, size_t idx; bat *value, BUN w)
HASHloop_ptr (BAT *b; Hash *h, size_t idx; ptr *value, BUN w)
HASHloop_int (BAT *b; Hash *h, size_t idx; int *value, BUN w)
HASHloop_oid (BAT *b; Hash *h, size_t idx; oid *value, BUN w)
HASHloop_wrd (BAT *b; Hash *h, size_t idx; wrd *value, BUN w)
HASHloop_flt (BAT *b; Hash *h, size_t idx; flt *value, BUN w)
HASHloop_lng (BAT *b; Hash *h, size_t idx; lng *value, BUN w)
HASHloop_dbl (BAT *b; Hash *h, size_t idx; dbl *value, BUN w)
HASHloop_str (BAT *b; Hash *h, size_t idx; str value, BUN w)
HASHlooploc (BAT *b; Hash *h, size_t idx; ptr value, BUN w)
HASHloopvar (BAT *b; Hash *h, size_t idx; ptr value, BUN w)
SORTloop (BAT *b,p,q,tl,th,s)

The BATloop() looks like a function call, but is actually a macro. The following example gives an indication of how they are to be used:

void
print_a_bat(BAT *b)
{
	BATiter bi = bat_iterator(b);
	BUN p, q;

	BATloop(b, p, q)
		printf("Element %3d has value %d\n",
			   *(int*) BUNhead(bi, p), *(int*) BUNtail(bi, p));
}

8.33.1 simple sequential scan

The first parameter is a BAT, the p and q are BUN pointers, where p is the iteration variable.

     */
     #define BATloop(r, p, q)
      for(q = BUNlast(r), p = BUNfirst(r);p < q; p++)
     
     /*

8.33.2 batloop where the current element can be deleted/updated

Normally it is strictly forbidden to update the BAT over which is being iterated, or delete the current element. This can only be done with the specialized batloop below. When doing a delete, do not forget to update the current pointer with a p = BUNdelete(b,p) (the delete may modify the current pointer p). After the delete/update has taken place, the pointer p is in an inconsistent state till the next iteration of the batloop starts.

     */
     #define BATloopDEL(r, p, q)
      for(p = BUNfirst(r), q = BUNlast(r); p < q;
          q = MIN(q,BUNlast(r)), p++)
     
     /*

8.33.3 sequential scan over deleted BUNs

Stable BUNS that were deleted, are conserved to transaction end. You may inspect these data items. Again, the b is a BAT, p and q are BUNs, where p is the iteration variable.

     */
     #define DELloop(b, p, q)
      for (q = (b)->batFirst, p = (b)->batDeleted; p < q; p++)
     
     /*

8.33.4 hash-table supported loop over BUNs

The first parameter `b' is a BAT, the second (`h') should point to `b->H->hash', and `v' a pointer to an atomic value (corresponding to the head column of `b'). The 'hb' is an integer index, pointing out the `hb'-th BUN.

     */
     #define GDK_STREQ(l,r) (*(char*) (l) == *(char*) (r) && !strcmp(l,r))
     
     #define HASHloop(bi, h, hb, v)
      for (hb = h->hash[HASHprobe(h, v)]; hb != BUN_NONE; hb = h->link[hb])
       if (ATOMcmp(h->type, v, BUNhead(bi, hb)) == 0)
     #define HASHloop_str_hv(bi, h, hb, v)
      for (hb = (h)->hash[((BUN *) (v))[-1]&(h)->mask]; hb != BUN_NONE; hb = (h)->link[hb])
       if (GDK_STREQ(v, BUNhvar(bi, hb)))
     #define HASHloop_str(bi, h, hb, v)
      for (hb = (h)->hash[strHash(v)&(h)->mask]; hb != BUN_NONE; hb = (h)->link[hb])
       if (GDK_STREQ(v, BUNhvar(bi, hb)))
     /*

For string search, we can optimize if the string heap has eliminated all doubles. This is the case when not too many different strings are stored in the heap. You can check this with the macro strElimDoubles() If so, we can just compare integer index numbers instead of strings:

     */
     #define HASHloop_fstr(bi, h, hb, idx, v)
      for (hb = h->hash[strHash(v)&h->mask], idx = strLocate((bi.b)->H->vheap,v);
           hb != BUN_NONE; hb = h->link[hb])
       if (VarHeapValRaw((bi).b->H->heap.base, hb, (bi).b->H->width) == idx)
     /*

The following example shows how the hashloop is used:

void
print_books(BAT *author_books, str author)
{
        BAT *b = author_books;
        BUN i;

        printf("%s\n==================\n", author);
        HASHloop(b, (b)->H->hash, i, author)
			printf("%s\n", ((str) BUNtail(b, i));
}

Note that for optimization purposes, we could have used a HASHloop_str instead, and also a BUNtvar instead of a BUNtail (since we know the tail-type of author_books is string, hence variable-sized). However, this would make the code less general.

8.33.5 specialized hashloops

HASHloops come in various flavors, from the general HASHloop, as above, to specialized versions (for speed) where the type is known (e.g. HASHloop_int), or the fact that the atom is fixed-sized (HASHlooploc) or variable-sized (HASHloopvar).

     */
     #define HASHlooploc(bi, h, hb, v)
      for (hb = h->hash[HASHprobe(h, v)]; hb != BUN_NONE; hb = h->link[hb])
       if (ATOMcmp(h->type, v, BUNhloc(bi, hb)) == 0)
     #define HASHloopvar(bi, h, hb, v)
      for (hb = h->hash[HASHprobe(h, v)]; hb != BUN_NONE; hb = h->link[hb])
       if (ATOMcmp(h->type, v, BUNhvar(bi, hb)) == 0)
     /*

   

8.33.6 loop over a BAT with ordered tail

Here we loop over a BAT with an ordered tail column (see for instance BATsort). Again, 'p' and 'q' are iteration variables, where 'p' points at the current BUN. 'tl' and 'th' are pointers to atom corresponding to the minimum (included) and maximum (included) bound in the selected range of BUNs. A nil-value means that there is no bound. The 's' finally is an integer denoting the bunsize, used for speed.

     */
     #define SORTloop(b,p,q,tl,th)
      if (!(BATtordered(b)&1)) GDKerror("SORTloop: BAT not sorted. n");
      else for (p = (ATOMcmp((b)->ttype,tl,ATOMnilptr((b)->ttype))?
              SORTfndfirst(b,tl):BUNfirst(b)),
         q = (ATOMcmp((b)->ttype,th,ATOMnilptr((b)->ttype))?
              SORTfndlast(b,th):BUNlast(b)); p < q; p++)
     
     /* OIDDEPEND */
     #if SIZEOF_OID == SIZEOF_INT
     #define SORTfnd_oid(b,v) SORTfnd_int(b,v)
     #define SORTfndfirst_oid(b,v) SORTfndfirst_int(b,v)
     #define SORTfndlast_oid(b,v) SORTfndlast_int(b,v)
     sortloop[?.10](oid,int,oid,simple,&oid_nil)
     #else
     #define SORTfnd_oid(b,v) SORTfnd_lng(b,v)
     #define SORTfndfirst_oid(b,v) SORTfndfirst_lng(b,v)
     #define SORTfndlast_oid(b,v) SORTfndlast_lng(b,v)
     sortloop[?.10](oid,lng,oid,simple,&oid_nil)
     #endif
     #if SIZEOF_WRD == SIZEOF_INT
     #define SORTfnd_wrd(b,v) SORTfnd_int(b,v)
     #define SORTfndfirst_wrd(b,v) SORTfndfirst_int(b,v)
     #define SORTfndlast_wrd(b,v) SORTfndlast_int(b,v)
     sortloop[?.10](wrd,int,wrd,simple,&wrd_nil)
     #else
     #define SORTfnd_wrd(b,v) SORTfnd_lng(b,v)
     #define SORTfndfirst_wrd(b,v) SORTfndfirst_lng(b,v)
     #define SORTfndlast_wrd(b,v) SORTfndlast_lng(b,v)
     sortloop[?.10](wrd,lng,wrd,simple,&wrd_nil)
     #endif
     #define SORTloop_bit(b,p,q,tl,th) SORTloop_chr(b,p,q,tl,th)
     /*

8.34 Common BAT Operations

Much used, but not necessarily kernel-operations on BATs.

8.34.1 BAT aggregates

BAT* BAThistogram(BAT *b)
BAT* BATsample(BAT* b,BUN n)

The routine BAThistogram produces a new BAT with a frequency distribution of the tail of its operand.

The routine BATsample returns a random sample on n BUNs of a BAT.

For each BAT we maintain its dimensions as separately accessible properties. They can be used to improve query processing at higher levels.


   

8.34.2 Alignment transformations

Some classes of algebraic operators transform a sequence in an input BAT always in the same way in the output result. An example are the () function (including histogram(b), which is identical to (b.reverse)). That is to say, if synced(b2,b2) => synced((b1),(b2))

Another example is b.fetch(position-bat). If synced(b2,b2) and the same position-bat is fetched with, the results will again be synced. This can be mimicked by transforming the alignment-id of the input BAT with a one-way function onto the result.

We use output->halign = NOID_AGGR(input->halign) for the output = (input) case, and output->align = NOID_MULT(input1->align,input2->halign) for the fetch.


   

8.34.3 BAT relational operators

BAT * BATjoin (BAT *l, BAT *r, BUN estimate)
BAT * BATouterjoin (BAT *l, BAT *r, BUN estimate)
BAT * BATthetajoin (BAT *l, BAT *r, int mode, BUN estimate)
BAT * BATsemijoin (BAT *l, BAT *r)
BAT * BATselect (BAT *b, ptr tl, ptr th)
BAT * BATfragment (BAT *b, ptr l, ptr h, ptr L, ptr H)

BAT * BATsunique (BAT *b)
BAT * BATkunique (BAT *b)
BAT * BATsunion (BAT *b, BAT *c)
BAT * BATkunion (BAT *b, BAT *c)
BAT * BATsintersect (BAT *b, BAT *c)
BAT * BATkintersect (BAT *b, BAT *c)
BAT * BATsdiff (BAT *b, BAT *c)
BAT * BATkdiff (BAT *b, BAT *c)
The BAT library comes with a full-fledged collection of relational operators. The two selection operators BATselect and BATfragment produce a partial copy of the BAT. The former performs a search on the tail; the latter considers both dimensions. The BATselect operation takes two inclusive ranges as search arguments. Interpretation of a NULL argument depends on the position, i.e. a domain lower or upper bound.

The operation BATsort sorts the BAT on the header and produces a new BAT. A side effect is the clustering of the BAT store on the sort key.

The BATjoin over R[A, B] and S[C, D] performs an equi-join over B and C. It results in a BAT over A and D. The BATouterjoin implements a left outerjoin over the BATs involved. The BATsemijoin over R[A, B] and S[C, D] produces the subset of R[A, B] that satisfies the semijoin over A and C.

The full-materialization policy intermediate results in MonetDB means that a join can produce an arbitrarily large result and choke the system. The Data Distilleries tool therefore first computes the join result size before the actual join (better waste time than crash the server). To exploit that perfect result size knowledge, an result-size estimate parameter was added to all equi-join implementations. TODO: add this for semijoin/select/unique/diff/intersect

The routine BATsunique considers both dimensions in the double elimination it performs; it produces a set. The routine BATtunique considers only the head column, and produces a unique head column.

BATs that satisfy the set property can be further processed with the set operations BATsunion, BATsintersect, and BATsdiff. The same operations are also available in versions that only look at the head column:BATkunion, BATkdiff, and BATkintersect (which shares its implementation with BATsemijoin).

     */