2.9.1 Separate Documents vs Document Collections

2.9.1.1 Storage Overhead

In MonetDB/XQuery, all XML is stored in relational tables. Each document is stored in a separate table and, because MonetDB uses column-wise storage, each column is stored in a separate file and memory array. However, each table (and column) occupies some space on disk and in memory, even if it is empty. For the XML tables, the minimum size of an empty table is around 32KB.

Therefore, if the average size of your XML documents is much less than 32KB and you have many of them (thousands or millions), storing each as a separate document in MonetDB/XQuery wastes a lot of memory and disk space and makes queries run slower. For example, one million separately stored documents occupy at least roughly 32GB (1,000,000 × 32KB), however small the documents themselves are.

For such usage scenarios, it is much better to have MonetDB/XQuery store many XML documents together in a single relational table.

This is made possible by the XQuery concept of a collection. When you add an XML document to the database with pf:add-doc(url,name), it is stored in a new, separate collection that has the same name as the document (name).

However, if you pass an extra parameter, as in pf:add-doc(url,name,collection), the document is added to the collection named collection. If that collection already exists, the document is appended to it.
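
As a rough sketch of the three-argument form described above, the following query adds two small documents to one shared collection. The URLs, the document names, and the collection name "orders" are hypothetical placeholders. The sketch also assumes that both calls may appear in a single query; if your setup does not allow that, run them as two separate queries.

```xquery
(: Hypothetical URLs and names, for illustration only. :)
(
  (: If "orders" does not exist yet, this call is expected to create it. :)
  pf:add-doc("http://example.org/orders/order-0001.xml", "order-0001.xml", "orders"),
  (: Once "orders" exists, further documents are appended to it. :)
  pf:add-doc("http://example.org/orders/order-0002.xml", "order-0002.xml", "orders")
)
```

Both documents then share one relational table, instead of each occupying its own.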

2.9.1.2 fn:collection() vs pf:collection()

XQuery supports the collection concept through the standard built-in function fn:collection(name) as node()*, which returns the set of document nodes that belong together. In MonetDB/XQuery it is perfectly feasible to have collections that contain millions of (small) documents.

XML documents are trees, and in MonetDB/XQuery a collection is also made into a tree by automatically adding a super-root node above all document nodes of the collection. MonetDB/XQuery also provides the built-in extension function pf:collection(name) as node(), which returns this super-root. Thus, fn:collection(name) is roughly equivalent to pf:collection(name)/child::*. On collections that have thousands of documents (or more), the extension function pf:collection() can be much faster than fn:collection(). The reason is that the former returns just a single node, whereas the latter may return thousands.
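
A minimal sketch of the difference, assuming a collection named "orders" whose documents contain item elements with an id attribute (all of these names are hypothetical). Both expressions search the same documents, but the first starts from one document node per document, while the second starts from the single super-root.

```xquery
(: "orders", item and @id are hypothetical names, for illustration only. :)

(: Standard function: the path starts from every document node. :)
let $via-fn := fn:collection("orders")//item[@id = "42"]

(: Extension function: the path starts from the single super-root. :)
let $via-pf := pf:collection("orders")//item[@id = "42"]

(: Both variants would be expected to find the same elements. :)
return (count($via-fn), count($via-pf))
```

In practice you would use only one of the two variants; they appear together here just to make the comparison explicit.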

2.9.1.3 Frequently Adding/Deleting Documents From Collections

If you have many small documents, store them together in a single collection or in a few collections. Storing them physically together makes MonetDB/XQuery more efficient.

By default, collections are read-only. MonetDB/XQuery exploits the fact that no updates occur on such collections by creating fully ordered inverted lists as index structures. However, such a fully sorted index must be rebuilt from scratch each time a new document is added to the collection.

Note that updatable XML collections do not use the fully sorted inverted lists, but use hash tables instead. Hash tables can be maintained cheaply under updates and do not need to be rebuilt from scratch when a document is added to a collection.

Therefore, in situations where an existing collection is frequently extended with new documents, we recommend making that collection updatable. This is done by passing yet another parameter, perc, to the first pf:add-doc(url,name,collection,perc) call, that is, the call that creates the collection. The perc value is the percentage of free space left on each page to accommodate updates, and must be between 1 and 100 (a good value is 10).
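
As a hedged illustration of the four-argument form, the following call would create an updatable collection with 10 percent free space per page. The URL, the document name, and the collection name "orders" are hypothetical, and the collection is assumed not to exist yet.

```xquery
(: Hypothetical URL and names. The fourth argument (perc = 10) requests
   10% free space per page and makes the newly created collection updatable. :)
pf:add-doc("http://example.org/orders/order-0001.xml", "order-0001.xml", "orders", 10)
```

Since perc is passed on the call that creates the collection, later additions to the same collection would presumably use the three-argument form pf:add-doc(url,name,collection).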