Fundamental Specifications of Kyoto Cabinet Version 1

Copyright (C) 2009-2012 FAL Labs
Last Update: Fri, 04 Mar 2011 23:07:26 -0800

Table of Contents

  1. Introduction
  2. Features
  3. Installation
  4. Tutorial
  5. Tips and Hacks
  6. License

Introduction

Kyoto Cabinet is a library of routines for managing a database. The database is a simple data file containing records, each of which is a pair of a key and a value. Every key and value is a series of bytes of variable length. Both binary data and character strings can be used as a key or a value. Each key must be unique within a database. There is no concept of data tables or data types. Records are organized in a hash table or a B+ tree.

The following access methods are provided to the database: storing a record with a key and a value, deleting a record by a key, and retrieving a record by a key. Moreover, traversal access to every key is provided. These access methods are similar to those of the original DBM library (and its followers NDBM and GDBM) defined in the UNIX standard. Kyoto Cabinet is an alternative to DBM because of its higher performance.

Each operation of the hash database has a time complexity of "O(1)". Therefore, in theory, the performance is constant regardless of the scale of the database. In practice, the performance is determined by the speed of the main memory or the storage device. If the size of the database is less than the capacity of the main memory, the performance approaches in-memory speed, which is faster than std::map of the STL. Of course, the database size can be greater than the capacity of the main memory; the upper limit is 8 exabytes. Even in that case, each operation needs only one or two seeks of the storage device.

Each operation of the B+ tree database has a time complexity of "O(log N)". Therefore, in theory, the performance is logarithmic to the scale of the database. Although random access of the B+ tree database is slower than that of the hash database, the B+ tree database supports sequential access in order of the keys, which enables prefix search for strings and range search for integers. The performance of sequential access is much faster than that of random access.

As the API is based on object-oriented design, the hash database and the B+ tree database have the same methods, inherited from a common abstract class. Besides them, seven other kinds of databases are provided under the same base class. The prototype hash database is powered by the standard container std::unordered_map. The prototype tree database is powered by the standard container std::map. The stash database is powered by an original implementation of a naive hash map that saves memory. The cache hash database is powered by an original implementation of a doubly-linked hash map with an LRU deletion algorithm. The cache tree database is powered by the cache hash database and provides the B+ tree mechanism. The directory hash database is powered by the directory mechanism of the file system and stores records as respective files in a directory. The directory tree database is powered by the directory hash database and provides the B+ tree mechanism. All databases have practical utility methods related to transactions and cursors. Programs for the command line interface are also included in the package.

Kyoto Cabinet runs very fast. For example, the elapsed time to store one million records is 0.9 seconds for the hash database and 1.1 seconds for the B+ tree database. Moreover, the database size is very small. For example, the overhead for a record is 16 bytes for the hash database and 4 bytes for the B+ tree database. Furthermore, the scalability of Kyoto Cabinet is great. The database size can be up to 8EB (9.22e18 bytes).

Kyoto Cabinet is written in the C++ language and provides APIs for C++, C, Java, Python, Ruby, Perl, and Lua. Kyoto Cabinet is available on platforms whose API conforms to C++03 with the TR1 library extensions. Kyoto Cabinet is free software licensed under the GNU General Public License. The FOSS License Exception is also provided in order to accommodate products under other free and open source licenses. The Specific FOSS Library Linking Exception is also provided in order to make the library available to some specific FOSS libraries. On the other hand, a commercial license is also provided. If you use Kyoto Cabinet within proprietary software, the commercial license is required.


Features

This section describes the features of Kyoto Cabinet.

Genealogy

The original DBM was developed by Kenneth Thompson as a part of the original AT&T UNIX. After that, many followers developed DBM-like products such as NDBM, SDBM, GDBM, TDB, and BerkeleyDB. In 2003, I developed QDBM to replace GDBM for performance reasons.

In 2007, Tokyo Cabinet was developed as the successor to QDBM. Its design goals were achieved, and Tokyo Cabinet could replace conventional DBM products.

In 2009, Kyoto Cabinet was developed as another successor to QDBM, pursuing several advantages over its sibling product, Tokyo Cabinet. However, the performance of Tokyo Cabinet is higher than that of Kyoto Cabinet, at least in single-threaded operations.

I'll maintain both Tokyo Cabinet and Kyoto Cabinet because their values are different.

Effective Implementation of Hash Database

Kyoto Cabinet uses a hash algorithm to retrieve records. If the bucket array has a sufficient number of elements, the time complexity of retrieval is "O(1)". That is, the time required for retrieving a record is constant, regardless of the scale of the database. The same holds for storing and deleting. Collisions of hash values are managed by separate chaining, and the data structure of the chains is a binary search tree. Even if the bucket array has unusually few elements, the time complexity of retrieval is "O(log n)".

Kyoto Cabinet improves retrieval performance by loading the whole bucket array into RAM. If the bucket array is in RAM, it is possible to access the region of a target record with about one set of file operations such as `lseek', `read', and `write'. The bucket array saved in the file is not read into RAM with the `read' call but directly mapped to RAM with the `mmap' call. Therefore, preparation time when connecting to a database is very short, and two or more processes can share the same memory map.

The hash function used for the hash table is MurmurHash 2.0. If the number of elements in the bucket array is about half the number of records stored in the database, then, although it depends on the characteristics of the input, the probability of hash collision is about 55.3% (35.5% if the numbers are equal, 20.4% if the bucket array is twice as large, 11.0% if four times, 5.7% if eight times). In that case, a record can be retrieved with two or fewer sets of file operations. As a performance index: to handle a database containing one million records, a bucket array with half a million elements is required. The size of each element is 6 bytes. That is, if 3MB of RAM is available, a database containing one million records can be handled.

When overwriting a record with a value whose size is greater than that of the existing one, it is necessary to move the record's region to another position in the file. Because the time complexity of this operation depends on the size of the record's region, extending values successively is inefficient. Kyoto Cabinet deals with this problem by alignment: if the incremental data fits in the padding region trailing the record, the region of the record does not have to be moved.

Generally speaking, as updates accumulate, available regions fragment and the size of the database grows rapidly. Kyoto Cabinet deals with this problem with the free block pool and the automatic defragmentation mechanism. If a record is removed or shifted to another position, the region is treated as a free block. The free block pool manages free blocks and reuses the best-fit region for a new record. Automatic defragmentation shifts records and free blocks apart so that successive free blocks coalesce into one.

Useful Implementation of B+ Tree Database

Although the B+ tree database is slower than the hash database, it features ordered access to records. The order can be assigned by users. Records in the B+ tree database are sorted and arranged in logical pages. A sparse index, organized as a B tree (a multiway balanced tree), is maintained for the pages. Thus, the time complexity of retrieval and other operations is "O(log n)". A cursor is provided to access records in order. The cursor can jump to a position specified by a key and can step forward or backward from the current position. Because the pages are arranged as a doubly-linked list, the time complexity of stepping the cursor is "O(1)".

The B+ tree database is implemented on top of the hash database described above. Because each page of the B+ tree database is stored as a record in the hash database, the B+ tree database inherits the storage management efficiency of the hash database. Because the header of each record is smaller and the alignment of each page is adjusted according to the page size, in most cases the size of the database file is cut in half compared to that of the hash database.

Although updating the B+ tree database requires operations on many pages, Kyoto Cabinet expedites the process with a page cache that reduces file operations. The page cache is implemented as a double-layered LRU list: frequently accessed pages are cached in the "hot" list and recently accessed pages in the "warm" list. If the page cache works efficiently and the whole sparse index is cached in memory, it is possible to retrieve a record with one or fewer sets of file operations.

Each page of the B+ tree database can be stored in compressed form. The default compression method is "Deflate" by ZLIB. Because records in a page have similar patterns, high compression efficiency is expected thanks to the Lempel-Ziv algorithm. When handling text data, the size of the database is reduced to about 50% or less. If the database is large and disk I/O is the bottleneck, enabling compression improves processing speed to a large extent. Moreover, you can specify external compression algorithms such as LZO and LZMA.

Practical Functionality

Kyoto Cabinet features a transaction mechanism. It is possible to commit a series of operations between the beginning and the end of a transaction in a lump, or to abort the transaction and roll back to the state before the transaction. Two isolation levels are supported: "serializable" and "read uncommitted". Durability is secured by write-ahead logging and shadow paging.

Automatic transaction and automatic recovery mechanisms are also supported. If the automatic transaction option is specified when opening the database, every updating operation is guarded by a transaction that is committed implicitly. Therefore, durability can be assured without explicit transaction operations. The automatic recovery mechanism works after the database has crashed outside a transaction. If inconsistency of the database is detected when opening it, all regions are scanned as with "fsck" and the database is implicitly reconstructed from the surviving records.

Kyoto Cabinet provides two modes for connecting to a database: "reader" and "writer". A reader can perform retrieval but neither storing nor deleting. A writer can perform all access methods. Exclusion control between processes is performed by file locking when connecting to a database. While a writer is connected to a database, neither readers nor writers can connect. While a reader is connected, other readers can connect, but writers cannot. This mechanism guarantees data consistency with simultaneous connections in multitasking environments.

Functions of the API are reentrant and available in multi-threaded environments. Different database objects can be operated on entirely in parallel. For simultaneous operations against the same database object, an rwlock (reader-writer lock) is used for exclusion control. That is, while a writing thread is operating on an object, other reading and writing threads are blocked; while a reading thread is operating on an object, other reading threads are not blocked. Locking granularity depends on the data structure: the hash database uses record locking, and the B+ tree database uses page locking.

In order to improve performance and concurrency, Kyoto Cabinet uses atomic operations built into popular CPUs, such as atomic increment and CAS (compare-and-swap). Lock primitives provided by the native environment, such as the POSIX thread package, are replaced with its own primitives based on CAS.

Simple but Flexible Interfaces

Kyoto Cabinet provides simple APIs based on object-oriented design. Every database operation is encapsulated and published as lucid methods such as `open', `close', `set', `remove', `get', and so on. The classes of the hash database and the B+ tree database are derived from the common abstract class that defines the interface, so porting an application from one database to another is easy. Moreover, the polymorphic database API is provided to bind a database type at run time.

Kyoto Cabinet supports the "visitor" pattern. You can define arbitrary database operations with callback functions. The visitor class encapsulates those callback functions and their state data. The database class has the "accept" method, which accepts an instance of the visitor class and calls its functions with the record data. The return value of the callback function is reflected as the new state of the record.

In addition, a lot of useful utilities are provided, such as "prefix search", "regex search", "logging", "hot backup", "pseudo-snapshot", and "merging". A framework for "MapReduce" is also provided. Although it is not distributed, it is useful for aggregate calculations with low CPU load and low memory usage. The plain text database is an interface to treat a plain text file as a database file; it is useful as input or output data for the MapReduce framework. The index database is a wrapper around a polymorphic database that improves the efficiency of the `append' operation; it is useful for constructing inverted indices.

While the core API is provided for C++, bindings for other languages such as C, Java, Python, Ruby, Perl, and Lua are also provided. Command line interfaces corresponding to each API are also provided. They are useful for prototyping, testing, and debugging.


Installation

This section describes how to install Kyoto Cabinet with the source package. As for a binary package, see its installation manual.

Preparation

Kyoto Cabinet is available on UNIX-like systems such as Linux and FreeBSD.

gcc (GNU Compiler Collection) 4.2 or later and make (GNU Make) are required to install Kyoto Cabinet from the source package. They are installed by default on Linux, FreeBSD, and so on.

Kyoto Cabinet depends on external libraries, so install them beforehand. ZLIB is required by the default configuration; LZO and LZMA are optional.

Installation

After extracting the archive file of Kyoto Cabinet, change the current working directory to the generated directory and perform the installation.

Run the configuration script.

$ ./configure

Build the programs.

$ make

Perform the self-diagnostic test. This takes a while.

$ make check

Install the programs. This operation must be carried out by the root user.

# make install

Result

When this series of steps finishes, the following files will be installed.

/usr/local/include/kccommon.h
/usr/local/include/kcutil.h
/usr/local/include/kcthread.h
/usr/local/include/kcfile.h
/usr/local/include/kccompress.h
/usr/local/include/kccompare.h
/usr/local/include/kcmap.h
/usr/local/include/kcregex.h
/usr/local/include/kcdb.h
/usr/local/include/kcplantdb.h
/usr/local/include/kcprotodb.h
/usr/local/include/kcstashdb.h
/usr/local/include/kccachedb.h
/usr/local/include/kchashdb.h
/usr/local/include/kcdirdb.h
/usr/local/include/kctextdb.h
/usr/local/include/kcpolydb.h
/usr/local/include/kcdbext.h
/usr/local/include/kclangc.h
/usr/local/lib/libkyotocabinet.a
/usr/local/lib/libkyotocabinet.so.x.y.z
/usr/local/lib/libkyotocabinet.so.x
/usr/local/lib/libkyotocabinet.so
/usr/local/lib/pkgconfig/kyotocabinet.pc
/usr/local/bin/kcutiltest
/usr/local/bin/kcutilmgr
/usr/local/bin/kcprototest
/usr/local/bin/kcstashtest
/usr/local/bin/kccachetest
/usr/local/bin/kcgrasstest
/usr/local/bin/kchashtest
/usr/local/bin/kchashmgr
/usr/local/bin/kctreetest
/usr/local/bin/kctreemgr
/usr/local/bin/kcdirtest
/usr/local/bin/kcdirmgr
/usr/local/bin/kcforesttest
/usr/local/bin/kcforestmgr
/usr/local/bin/kcpolytest
/usr/local/bin/kcpolymgr
/usr/local/bin/kclangctest
/usr/local/man/man1/...
/usr/local/share/doc/kyotocabinet/...

Options of Configure

Various options, such as those enabling or disabling the optional compression libraries, can be specified with `./configure'.

`--prefix' and other options are also available, as with usual UNIX software packages. If you want to install Kyoto Cabinet under `/usr' instead of `/usr/local', specify `--prefix=/usr'. Likewise, if the library search path does not include `/usr/local/lib', it is necessary to set the environment variable `LD_LIBRARY_PATH' to include `/usr/local/lib' before running applications of Kyoto Cabinet.
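
For example, with a Bourne-compatible shell:

$ export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/usr/local/lib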

ZLIB is enabled by default. LZO and LZMA are disabled by default. Note that the programs on which Kyoto Cabinet depends are under different licenses. For example, ZLIB is under the zlib license, LZO is under the GPL, and LZMA is under the LGPL. You must also comply with their license terms if you enable them.

How to Use the Library

Kyoto Cabinet provides an API for the C++ language, available to programs conforming to the C++03 standard. The header files of Kyoto Cabinet are provided as `kcutil.h', `kchashdb.h', and so on, and applications should include the ones they need in order to use the API. The library is provided as `libkyotocabinet.a' and `libkyotocabinet.so', which depend on `libz.so', `libstdc++.so', `librt.so', `libpthread.so', `libm.so', and `libc.so', so the build command requires the corresponding linker options. The typical build command is the following.

$ g++ -I/usr/local/include example.cc -o example \
  -L/usr/local/lib -lkyotocabinet -lz -lstdc++ -lrt -lpthread -lm -lc

For Windows

Microsoft Visual Studio (Visual C++) is required to build Kyoto Cabinet on Windows. The build configuration is described in the file `VCmakefile', which should be edited according to your environment. Then, run the following command in a command prompt window.

> nmake -f VCmakefile

If you want, perform the self-diagnostic test.

> nmake -f VCmakefile check

If all build processes finish successfully, the static library `kyotocabinet.lib' and some executable files are generated. For now, neither a DLL nor an installation tool is provided. Please install the header files, the library file, and the executable files by yourself.

If the header files and the library file are in the current directory, you can build an application program with the following command.

> cl /I. example.cc kyotocabinet.lib

By default, the library is built linked against `LIBCMT.LIB' with the `/MT' option. If you want to use `MSVCRT.LIB', which is required by the MFC, rebuild the library with the `/MD' option and set the same option when building applications. If your environment is a 64-bit version, add the `/D_SYS_WIN64_' option to the compiler options to improve performance.

Many people use virus-checking software on Windows. However, it can interfere with database software. To avoid unexpected errors and performance overhead in database functions, exclude the data directory from virus checking. If the above self-diagnostic test fails, please check the configuration of the virus checker first.


Tutorial

This section describes how to use Kyoto Cabinet with the command line utilities and some sample application programs.

Command Line Utility for the File Hash Database

To begin with, let's build a file hash database with the command line utility `kchashmgr'. The database stores a list of the staff of a company. Each record is composed of a key and a value: the key is the staff ID and the value is the person's name.

The database must be created just once before any database operation. Let's create a database file "staffs.kch" with the default configuration.

$ kchashmgr create staffs.kch

Register some staff members in the database.

$ kchashmgr set staffs.kch 1001 "George Washington"
$ kchashmgr set staffs.kch 1002 "John Adams"
$ kchashmgr set staffs.kch 1003 "Thomas Jefferson"
$ kchashmgr set staffs.kch 1004 "James Madison"

Check the current contents.

$ kchashmgr list -pv staffs.kch
1001    George Washington
1002    John Adams
1003    Thomas Jefferson
1004    James Madison

To retrieve the value of a record, search the database with the key.

$ kchashmgr get staffs.kch 1003
Thomas Jefferson

Of course, you can remove a record with the key.

$ kchashmgr remove staffs.kch 1003

That's all for the fundamental operations. The DBM family has kept improving performance by discarding extra functionality.

Sample Application of the File Hash Database

Next, let's write a sample application program handling a file hash database. See the following source code.

#include <kchashdb.h>

using namespace std;
using namespace kyotocabinet;

// main routine
int main(int argc, char** argv) {

  // create the database object
  HashDB db;

  // open the database
  if (!db.open("casket.kch", HashDB::OWRITER | HashDB::OCREATE)) {
    cerr << "open error: " << db.error().name() << endl;
  }

  // store records
  if (!db.set("foo", "hop") ||
      !db.set("bar", "step") ||
      !db.set("baz", "jump")) {
    cerr << "set error: " << db.error().name() << endl;
  }

  // retrieve a record
  string value;
  if (db.get("foo", &value)) {
    cout << value << endl;
  } else {
    cerr << "get error: " << db.error().name() << endl;
  }

  // traverse records
  DB::Cursor* cur = db.cursor();
  cur->jump();
  string ckey, cvalue;
  while (cur->get(&ckey, &cvalue, true)) {
    cout << ckey << ":" << cvalue << endl;
  }
  delete cur;

  // close the database
  if (!db.close()) {
    cerr << "close error: " << db.error().name() << endl;
  }

  return 0;
}

Save the above code as a file "example.cc". Then, run the following command line. The command `kcutilmgr conf' prints the build configuration.

$ g++ `kcutilmgr conf -i` -o example example.cc `kcutilmgr conf -l`

Execute the application program built above.

$ ./example
hop
foo:hop
bar:step
baz:jump

The API of the file hash database is defined in the header `kchashdb.h'. So, include the header near the front of a source file. All symbols of Kyoto Cabinet are packaged in the namespace `kyotocabinet'. You can use them without any prefix by importing the namespace.

#include <kchashdb.h>
using namespace kyotocabinet;

The class `HashDB' contains all the functionality of the file hash database, and each instance represents a database file.

HashDB db;

Each database file must be opened with the `open' method before any database operation. The flag `HashDB::OWRITER' means the process will update the database. The flag `HashDB::OCREATE' means a database file will be created if it does not exist yet.

db.open("casket.kch", HashDB::OWRITER | HashDB::OCREATE);

Every opened database must be closed with the `close' method when it is no longer in use. Closing the database is very important to avoid data corruption and memory leaks.

db.close();

To store a record, use the `set' method with the key and the value.

db.set("foo", "hop");

To retrieve the value of a record, use the `get' method with the key. On success, the return value is true and the result is assigned to the string object pointed to by the second parameter.

string value;
if (db.get("foo", &value)) {
  cout << value << endl;
}

Besides `set' and `get', there are other methods: `add', `replace', `append', `remove', `increment', and `cas'. Each method has two versions: one for `std::string' parameters and one for `char*' and `size_t' parameters.
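
For example, the following sketch exercises the string versions of those methods; see the API documents for the exact signatures and failure semantics.

// store only if the key does not exist yet
db.add("color", "blue");
// store only if the key already exists
db.replace("color", "red");
// concatenate the value after the existing one
db.append("color", ",green");
// treat the value as a number and add to it
db.increment("counter", 1);
// compare-and-swap: replace the value only if it matches the expected one
db.cas("color", "red,green", "yellow");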

Traversing records is a slightly more complicated task. It needs a cursor object, which expresses the current position in the sequence of all records in the database. Each cursor is created by the `cursor' method of the database object and should be initialized with the `jump' method before actual record operations.

DB::Cursor* cur = db.cursor();
cur->jump();

The cursor class has methods that operate on the record at the current position, such as `set_value', `remove', `get_key', `get_value', and `get'. Most methods have an optional stepping parameter to shift the current position to the next record atomically. Therefore, iterating such methods with the stepping parameter visits all records.

string ckey, cvalue;
while (cur->get(&ckey, &cvalue, true)) {
  cout << ckey << ":" << cvalue << endl;
}

More Complex Example Using the Visitor Pattern

Every operation against a record can be abstracted with the "visitor" pattern. A visitor object provides callback methods which receive the state of a record and return its new state. Let's see the next sample using the visitor pattern.

#include <kchashdb.h>

using namespace std;
using namespace kyotocabinet;

// main routine
int main(int argc, char** argv) {

  // create the database object
  HashDB db;

  // open the database
  if (!db.open("casket.kch", HashDB::OREADER)) {
    cerr << "open error: " << db.error().name() << endl;
  }

  // define the visitor
  class VisitorImpl : public DB::Visitor {
    // call back function for an existing record
    const char* visit_full(const char* kbuf, size_t ksiz,
                           const char* vbuf, size_t vsiz, size_t *sp) {
      cout << string(kbuf, ksiz) << ":" << string(vbuf, vsiz) << endl;
      return NOP;
    }
    // call back function for an empty record space
    const char* visit_empty(const char* kbuf, size_t ksiz, size_t *sp) {
      cerr << string(kbuf, ksiz) << " is missing" << endl;
      return NOP;
    }
  } visitor;

  // retrieve a record with visitor
  if (!db.accept("foo", 3, &visitor, false) ||
      !db.accept("dummy", 5, &visitor, false)) {
    cerr << "accept error: " << db.error().name() << endl;
  }

  // traverse records with visitor
  if (!db.iterate(&visitor, false)) {
    cerr << "iterate error: " << db.error().name() << endl;
  }

  // close the database
  if (!db.close()) {
    cerr << "close error: " << db.error().name() << endl;
  }

  return 0;
}

The methods `accept' and `iterate' receive visitor objects. Each visitor object must implement the interface `DB::Visitor'. The method `visit_full' is called for an existing record. The parameters specify the pointer to the key buffer, the length of the key buffer, the pointer to the value buffer, the length of the value buffer, and the pointer to a variable in which to store the length of the return value. The return value is `NOP' for no update, `REMOVE' to remove the record, or otherwise a pointer to a buffer containing an arbitrary byte pattern to be set as the new value. The method `visit_empty' is called for an empty record space. Its specification is the same as that of `visit_full' except that it does not receive a record value.

In fact, almost all built-in operations like `set', `remove', and `get' are implemented with the `accept' method. The following code stores a record.

class Setter : public DB::Visitor {
public:
  Setter(const char* vbuf, size_t vsiz) : vbuf_(vbuf), vsiz_(vsiz) {}
private:
  const char* visit_full(const char* kbuf, size_t ksiz,
                         const char* vbuf, size_t vsiz, size_t* sp) {
    *sp = vsiz_;
    return vbuf_;
  }
  const char* visit_empty(const char* kbuf, size_t ksiz, size_t* sp) {
    *sp = vsiz_;
    return vbuf_;
  }
  const char* vbuf_;
  size_t vsiz_;
};
Setter setter("foo", 3);
db.accept(kbuf, ksiz, &setter, true);

Features of Various Database Classes

Kyoto Cabinet provides various database classes, and they share a common base class, which defines the common interface of all database classes. That is, you can use the following classes in the same way.

class        file         description
ProtoHashDB  kcprotodb.h  prototype hash database: on-memory database implemented with std::unordered_map
ProtoTreeDB  kcprotodb.h  prototype tree database: on-memory database implemented with std::map
StashDB      kcstashdb.h  stash database: on-memory database saving memory
CacheDB      kccachedb.h  cache hash database: on-memory database featuring LRU deletion
GrassDB      kccachedb.h  cache tree database: on-memory database of B+ tree: cache with order
HashDB       kchashdb.h   file hash database: file database of hash table: typical DBM
TreeDB       kchashdb.h   file tree database: file database of B+ tree: DBM with order
DirDB        kcdirdb.h    directory hash database: respective files in a directory of the file system
ForestDB     kcdirdb.h    directory tree database: directory database of B+ tree: huge DBM with order
TextDB       kctextdb.h   plain text database: emulation to handle a plain text file as a database
PolyDB       kcpolydb.h   polymorphic database: dynamic binding of the above databases

Each database has different features in persistence, algorithm, time complexity, record sequence, and lock model for concurrency.

class        persistence  algorithm       complexity  sequence       lock unit
ProtoHashDB  volatile     hash table      O(1)        undefined      whole (rwlock)
ProtoTreeDB  volatile     red-black tree  O(log N)    lexical order  whole (rwlock)
StashDB      volatile     hash table      O(1)        undefined      record (rwlock)
CacheDB      volatile     hash table      O(1)        undefined      record (mutex)
GrassDB      volatile     B+ tree         O(log N)    custom order   page (rwlock)
HashDB       persistent   hash table      O(1)        undefined      record (rwlock)
TreeDB       persistent   B+ tree         O(log N)    custom order   page (rwlock)
DirDB        persistent   undefined       undefined   undefined      record (rwlock)
ForestDB     persistent   B+ tree         O(log N)    custom order   page (rwlock)
TextDB       persistent   plain text      undefined   stored order   record (rwlock)

The C language binding is also provided as a wrapper of the polymorphic database API. Include the header file `kclangc.h' and use a pointer to `KCDB' as the database object.
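
For example, a minimal sketch assuming the conventional functions of `kclangc.h' such as `kcdbnew', `kcdbopen', `kcdbset', `kcdbget', `kcdbclose', and `kcdbdel'; consult the C API documents for the exact signatures.

#include <kclangc.h>
#include <stdio.h>

int main(void) {
  /* create the database object */
  KCDB* db = kcdbnew();
  /* open the database */
  if (!kcdbopen(db, "casket.kch", KCOWRITER | KCOCREATE)) {
    fprintf(stderr, "open error: %s\n", kcecodename(kcdbecode(db)));
  }
  /* store a record */
  if (!kcdbset(db, "foo", 3, "hop", 3)) {
    fprintf(stderr, "set error: %s\n", kcecodename(kcdbecode(db)));
  }
  /* retrieve the record; the returned region must be released with kcfree */
  size_t vsiz;
  char* vbuf = kcdbget(db, "foo", 3, &vsiz);
  if (vbuf) {
    printf("%s\n", vbuf);
    kcfree(vbuf);
  }
  /* close and delete the database object */
  kcdbclose(db);
  kcdbdel(db);
  return 0;
}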

Please see the API documents for details. Writing your own sample application is the best way to learn this library. The specifications of the command line utilities are also useful.


Tips and Hacks

This section describes tips and hacks to use Kyoto Cabinet.

Tuning the Stash Database

The stash database (StashDB) is an on-memory database that saves memory. The following tuning methods are provided.

The default bucket number is about one million. If you intend to store more records, call `tune_buckets' to set the bucket number. The suggested bucket number is about the same as the total number of records, and anywhere from 80% to 400% is acceptable. If the ratio falls below 100%, time efficiency decreases rapidly because the collision chain is a linear linked list.
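
For example, a sketch of tuning the stash database for about four million records, in the same style as the tuning examples below:

db.tune_buckets(4LL * 1000 * 1000);
db.open(...);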

Tuning the Cache Hash Database

The cache hash database (CacheDB) is an on-memory database featuring LRU deletion. The following tuning methods are provided.

The optional features set by `tune_options' are useful for reducing memory usage at the expense of time efficiency. If `CacheDB::TCOMPRESS' is specified, the key and the value of each record are compressed implicitly when stored. Compression is effective if the value is about 1KB or larger.

The default bucket number is about one million. If you intend to store more records, call `tune_buckets' to set the bucket number. The suggested bucket number is about the same as the total number of records, and anywhere from 50% to 400% is acceptable. If the ratio falls below 100%, time efficiency decreases gradually because the collision chain is a binary search tree.

The default compression algorithm of the `CacheDB::TCOMPRESS' option is "Deflate" by ZLIB. If you want to use another algorithm, call `tune_compressor' to set a functor which implements compression and decompression functions.

By default, the cache hash database maintains all records in memory and no record is expired. If you want to expire old records to keep memory usage constant, call `cap_count' and/or `cap_size' to limit the capacity.

If you want to cache ten million records and keep memory usage under 8GB, the following tuning is suggested, for example.

db.tune_buckets(10LL * 1000 * 1000);
db.cap_count(10LL * 1000 * 1000);
db.cap_size(8LL << 30);
db.open(...);

All tuning methods must be called before the database is opened.

Tuning the Cache Tree Database

The cache tree database (GrassDB) is an on-memory database of B+ tree. Because each node of the B+ tree is serialized as a page buffer and treated as a record in the cache hash database, all tuning methods of the cache hash database, except for capacity limitation, are inherited by the cache tree database. Moreover, the following tuning methods are added.

The page size set by `tune_page' does not have to be modified in most cases. The default is 8192, which is twice the typical page size of popular environments. If the size of a node exceeds the parameter, the node is divided in two.

The default capacity of the page cache is 64MB. If you want to reduce memory usage, call `tune_page_cache' to convert most pages into flat byte arrays, which are serialized or compressed in order to improve space efficiency.

The default record comparator is the lexical ordering function. That is, records in the B+ tree database are placed in the lexical order of their keys. If you want to use another ordering, call `tune_comparator' to set a functor which implements the ordering function.

If you want to cache ten million records and keep memory usage as small as possible, the following tuning is suggested, for example.

db.tune_options(GrassDB::TCOMPRESS);
db.tune_buckets(500LL * 1000);
db.tune_page(32768);
db.tune_page_cache(1LL << 20);
db.open(...);

All tuning methods must be called before the database is opened.

Tuning the File Hash Database

The file hash database (HashDB) is a file database based on a hash table. The following tuning methods are provided.

The default alignment power is 3, which means the address of each record is aligned to a multiple of 8 (1<<3) bytes. If you trust that the database is constructed at one time and not updated often, call `tune_alignment' to set the alignment power to 0, which means 1 (1<<0) byte. If the typical size of each record is expected to be larger than 1KB, set the alignment power to 8 or more.

The free block pool setting by `tune_fbp' does not have to be modified in most cases. The default is 10, which means the capacity of the free block pool is 1024 (1<<10).

The optional features set by `tune_options' are useful for reducing the size of the database file at the expense of scalability or time efficiency. If `HashDB::TSMALL' is specified, the width of record addressing is reduced from 6 bytes to 4 bytes. As a result, the footprint for each record is reduced from 16 bytes to 12 bytes. However, it limits the maximum size of the database file to 16GB (2GB multiplied by the alignment). If `HashDB::TLINEAR' is specified, the data structure of the collision chain of the hash table is changed from a binary tree to a linear linked list. In that case, the footprint of each record is reduced from 16 bytes to 10 bytes, although time efficiency becomes sensitive to the number of hash buckets. If `HashDB::TCOMPRESS' is specified, the value of each record is compressed implicitly when stored in the file. Compression is effective if the value is about 1KB or larger.

The default bucket number is about one million. If you intend to store more records, call `tune_buckets' to set the bucket number. The suggested bucket number is about twice the total number of records, and anywhere from 100% to 400% is acceptable. If the ratio falls below 100%, time efficiency decreases gradually. If you set the bucket number, setting the `HashDB::TLINEAR' option is recommended to improve time and space efficiency.

The default size of the internal memory-mapped region is 64MB. If the database size is expected to be larger than 64MB, call `tune_map' to set the map size larger than the expected size of the database. Although the capacity of the RAM on the machine limits the map size, increasing the map size is effective for improving performance.

By default, auto defragmentation is disabled. If existing records in the database are modified (removed or modified with varying size), fragmentation of available regions proceeds gradually. In that case, call `tune_defrag' to enable auto defragmentation and set the unit step number. The suggested unit step number is 8, which means that a set of defragmentation operations is performed every 8 updating operations. The larger the unit, the higher the space efficiency, but the lower the time efficiency.

The default compression algorithm of the `HashDB::TCOMPRESS' option is "Deflate" by ZLIB. If you want to use another algorithm, call `tune_compressor' to set a functor which implements compression and decompression functions.

If you intend to store ten thousand records and reduce the database size as much as possible, the following tuning is suggested, for example.

db.tune_alignment(0);
db.tune_options(HashDB::TSMALL | HashDB::TLINEAR);
db.tune_buckets(10LL * 1000);
db.tune_defrag(8);
db.open(...);

If you have a monster machine with 512GB RAM and intend to store ten billion records with the best possible time efficiency, the following tuning is suggested, for example.

db.tune_options(HashDB::TLINEAR);
db.tune_buckets(20LL * 1000 * 1000 * 1000);
db.tune_map(300LL << 30);
db.open(...);

All tuning methods must be called before the database is opened. Because the settings of `tune_alignment', `tune_fbp', `tune_options', and `tune_buckets' are recorded in the meta data of the database, those methods must be called before the database is created and cannot be modified afterward. Because the other tuning parameters are not recorded in the database, they should be specified every time before opening the database.

Tuning the File Tree Database

The file tree database (TreeDB) is a file database based on a B+ tree. Because each node of the B+ tree is serialized as a page buffer and stored as a record in the file hash database, all tuning methods of the file hash database are inherited by the file tree database. Moreover, the following tuning methods are added.

The page size set by `tune_page' does not have to be modified in most cases. The default is 8192, which is twice the typical page size of popular environments. If the size of a node exceeds the parameter, the node is divided in two.

The default capacity of the page cache is 64MB. If your machine has abundant RAM, call `tune_page_cache' to load all nodes in the page cache. If RAM is not abundant, it is better to keep the default page cache size and assign the RAM to the internal memory-mapped region with `tune_map'.

The default record comparator is the lexical ordering function. That is, records in the B+ tree database are placed in the lexical order of their keys. If you want to use another ordering, call `tune_comparator' to set a functor which implements the ordering function.
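
For example, a minimal sketch of a custom ordering functor, assuming the `Comparator' interface declared in `kccompare.h' (include <cstring> for std::memcmp):

class ReverseComparator : public Comparator {
  int32_t compare(const char* akbuf, size_t aksiz,
                  const char* bkbuf, size_t bksiz) {
    size_t msiz = aksiz < bksiz ? aksiz : bksiz;
    int32_t rv = (int32_t)std::memcmp(akbuf, bkbuf, msiz);
    if (rv == 0) rv = (int32_t)aksiz - (int32_t)bksiz;
    return -rv;  // negate the lexical order to sort in reverse
  }
};
ReverseComparator revcomp;
db.tune_comparator(&revcomp);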

The default alignment of the file tree database is 256 (1<<8) and the default bucket number is about 65536. The other default tuning parameters are the same as for the file hash database. Note that the bucket number should be calculated from the number of pages. The suggested bucket number is about 10% of the number of records. If the compression option is specified, all records in each page are compressed at once. Therefore, compression is more effective for the file tree database than for the file hash database.

If you intend to store ten thousand records and reduce the database size as much as possible, the following tuning is suggested, for example.

db.tune_options(TreeDB::TLINEAR | TreeDB::TCOMPRESS);
db.tune_buckets(1LL * 1000);
db.tune_defrag(8);
db.tune_page(32768);
db.open(...);

If you have a monster machine with 512GB RAM and intend to store ten billion records with the best possible time efficiency, the following tuning is suggested, for example.

db.tune_options(TreeDB::TLINEAR);
db.tune_buckets(1LL * 1000 * 1000 * 1000);
db.tune_map(300LL << 30);
db.tune_page_cache(8LL << 30);
db.open(...);

All tuning methods must be called before the database is opened. Because the setting of `tune_page' is recorded in the meta data of the database, the method must be called before the database is created and cannot be modified afterward. Because the other tuning parameters are not recorded in the database, they should be specified every time before opening the database.

Tuning the Directory Hash Database

The directory hash database (DirDB) is powered by the directory mechanism of the file system and stores records as respective files in a directory. The following tuning methods are provided.

The optional features set by `tune_options' are useful for reducing the size of the database at the expense of time efficiency. If `DirDB::TCOMPRESS' is specified, the key and the value of each record are compressed implicitly when stored in a file. Compression is effective if the value is about 1KB or larger.

Performance of the directory hash database depends strongly on the file system implementation and its tuning. Some file systems such as EXT2 are not good at storing a lot of files in a directory, while others such as EXT3 and ReiserFS are relatively efficient in that situation. In general, file systems featuring B trees or their variants are more suitable than those using linear search algorithms.

All tuning methods must be called before the database is opened. Because the settings of `tune_options' are recorded in the meta data of the database, the method must be called before the database is created and cannot be modified afterward.

Tuning the Directory Tree Database

The directory tree database (ForestDB) is a directory database of B+ tree. Just as the file tree database is based on the file hash database, the directory tree database is based on the directory hash database. So, all tuning methods of the directory hash database are inherited by the directory tree database. The additional tuning methods are the same as those of the file tree database.

The performance characteristics of the directory tree database are similar to those of the directory hash database. However, because records are organized in pages, fewer I/O operations are needed and performance is better in many cases.

Choosing Suitable Databases

In order to choose suitable database types for your application, it is important to clarify the requirement specification. If it does not require persistence of records, on-memory databases are suggested: the prototype hash database (ProtoHashDB), the prototype tree database (ProtoTreeDB), the stash database (StashDB), the cache hash database (CacheDB), and the cache tree database (GrassDB). If the order of keys is important for your application logic, the cache tree database is suitable; if not, the stash database is suitable. The memory usage of the cache tree database is smaller than that of the others. The cache hash database can delete old records implicitly and keep memory usage constant. The prototype databases are rarely the best choice.

If your application requires persistence of records, persistent databases are suggested: the file hash database (HashDB), the file tree database (TreeDB), the directory hash database (DirDB), and the directory tree database (ForestDB). If the order of keys is important for your application logic, the file tree database is suitable; if not, the file hash database is suitable. In most cases, the performance and concurrency of the file hash database are better than the others. If the size of each record is large, the directory hash database is suitable.

Most DBM implementations, including Kyoto Cabinet and Tokyo Cabinet, are optimized to store small records. For that reason, if you want to handle very large records, using the file system directly is a better solution than using a DBM. When storing and retrieving large records, the processing time of the `write' and `read' system calls dominates over that of the `open' and `lseek' system calls. Although typical DBMs reduce the workload of locating the position of each record, they increase the workload of reading/writing the data of each record. If you want to handle large records but don't want to use the file system directly, use the directory hash database; it is a mere wrapper around the directory mechanism of the file system. If you want to handle large records in order of keys, use the directory tree database. The directory tree database is the final weapon of scalability.

If you want to decide the database type by doing performance tests with the production code, use the polymorphic database. It can specify the database type dynamically when opening the database. In fact, the polymorphic database is suggested for most use cases, though it involves a little runtime performance overhead. That's why the official script language bindings support the polymorphic database only.
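
For example, the following sketch opens a file tree database through the polymorphic API; the concrete type is chosen at runtime from the file extension ("kch" selects the file hash database, "kct" the file tree database, and so on).

PolyDB db;
if (!db.open("casket.kct", PolyDB::OWRITER | PolyDB::OCREATE)) {
  cerr << "open error: " << db.error().name() << endl;
}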

Transaction

If an application process which opened a database terminates without closing it, there is a risk that some records may be missing and the database may be broken. By default, durability is settled only when the database is closed properly; it is not settled for each updating operation. Kyoto Cabinet deals with this problem with a transaction mechanism based on WAL (write-ahead logging). A transaction is begun and committed explicitly by the application, and durability during a transaction is settled for each updating operation. A transaction can also be aborted by the application; in that case, all update operations during the transaction are voided and the content of the database is rolled back. Although transactions are very useful like this, the throughput of updating operations decreases to about 50% of the default manner due to the overhead of writing WAL data.

db.begin_transaction();
db.set("japan", "tokyo");
db.set("korea", "seoul");
db.end_transaction();

If you can't be bothered to begin and commit transactions explicitly, use the auto transaction mechanism by specifying the `BasicDB::OAUTOTRAN' option when opening the database. An auto transaction is begun and committed implicitly for each operation. The overhead of auto transactions is lighter than that of explicit transactions.

db.open("casket.kch", HashDB::OWRITER | HashDB::OCREATE | HashDB::OAUTOTRAN);
db.set("japan", "tokyo");
db.set("china", "beijing");

After all, it is important to choose the usage according to the requirements of your application.

default
  risk on process crash: some records may be missing
  risk on system crash: some records may be missing
  performance penalty: none
  remark: auto recovery after a crash will take time in proportion to the database size
transaction
  implicit usage: open(..., BasicDB::OAUTOTRAN);
  explicit usage: begin_transaction(false); ...; end_transaction(true);
  risk on process crash: none
  risk on system crash: some records may be missing
  performance penalty: throughput will drop to about 30% or less
transaction + synchronize
  implicit usage: open(..., BasicDB::OAUTOTRAN | BasicDB::OAUTOSYNC);
  explicit usage: begin_transaction(true); ...; end_transaction(true);
  risk on process crash: none
  risk on system crash: none
  performance penalty: throughput will drop to about 1% or less

Various Iterators

Kyoto Cabinet supports four kinds of iterators: cursor, atomic iterator, parallel iterator, and MapReduce. Each of them has its particular features, and it is important to choose the one best suited to your task.

A cursor is also called an "external iterator". You can create multiple cursors for a database object simultaneously, and each cursor can be located at a different place.

DB::Cursor* cur = db.cursor();
cur->jump("foo");
string key, value;
cur->get(&key, &value);
delete cur;

The atomic iterator is useful for retrieving or updating records in a database atomically. Only one thread can use the atomic iterator at a time, and other threads accessing any record are blocked during its iteration.

class VisitorImpl : public DB::Visitor { /* implement visit_full */ };
VisitorImpl visitor;
db.iterate(&visitor, false);

The parallel iterator is useful for retrieving records in parallel. Although it cannot update any record, concurrent processing improves throughput. Multiple implicitly generated threads scan records simultaneously.

class VisitorImpl : public DB::Visitor { /* implement visit_full */ };
VisitorImpl visitor;
db.scan_parallel(&visitor, 8);

MapReduce is useful for computing over all records and organizing extracted data chunks by arbitrary keys. MapReduce is composed of three phases: "map", "shuffle", and "reduce". "Map" (the "mapper") is a user-defined function to scan all records and extract data, which is emitted as key-value records by the built-in "emit" function. "Shuffle" is an implicit procedure to organize the extracted data by keys. And "reduce" (the "reducer") is a user-defined function to process the set of values organized under the same key.

class MapReduceImpl : public MapReduce {
  bool map(const char* kbuf, size_t ksiz, const char* vbuf, size_t vsiz) {
    emit(...);
    return true;
  }
  bool reduce(const char* kbuf, size_t ksiz, ValueIterator* iter) {
    const char* vbuf; size_t vsiz;
    while ((vbuf = iter->next(&vsiz)) != NULL) { ... }
    return true;
  }
};
MapReduceImpl mr;
mr.execute(&db, "/tmp", MapReduce::XPARAMAP | MapReduce::XPARARED);

The default behavior of MapReduce is based on the atomic iterator. The `MapReduce::XNOLOCK' option makes the mapper use a cursor. The `MapReduce::XPARAMAP' option makes the mapper use the parallel iterator. The `MapReduce::XPARARED' option makes the reducer use a parallel processing model with a thread pool.

The plain text database (TextDB) allows you to use a plain text file as the input or output database of the MapReduce framework. When you store a record in a plain text database, the key is ignored and the value is appended at the end of the plain text, with each record separated by a line feed character. When you use an iterator or cursor, each line is retrieved as a record and the key is generated automatically from the offset of the line. The parallel iterator is also supported, so you can easily process a text file in parallel.
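
For example, a sketch that writes lines through the TextDB class, assuming it accepts the common open flags of the base database class:

TextDB db;
if (db.open("dataset.txt", TextDB::OWRITER | TextDB::OCREATE)) {
  db.set("", "the key is ignored; this value becomes one line");
  db.set("", "another line");
  db.close();
}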

Backup

Any hardware can break down suddenly; storage devices such as HDDs and SSDs are especially fragile. Therefore, making backup files of your database periodically is very important, even if you use transactions. You can copy a database file with such popular commands as `cp' and `tar' while the database is not being updated by another process.

If an application uses multiple threads and you want to make a backup file of the database safely, use the `BasicDB::copy' method, which synchronizes the database status with the database file and makes a copy. During the copying operation, it is assured that the database file is not updated.

db.copy("backup.kch");

You may want a "hot backup", which means that other threads are not blocked while a thread is creating a backup file. In that case, use the `BasicDB::synchronize' method, which synchronizes the database file and calls an arbitrary user-defined function. The callback function can execute a "snapshot" command provided by the operating system.

class BackupImpl : public BasicDB::FileProcessor {
  bool process(const std::string& path, int64_t size, int64_t count) {
    char cmd[1024];
    snprintf(cmd, sizeof(cmd), "snapshot.sh %s", path.c_str());
    return system(cmd) == 0;
  }
} proc;
db.synchronize(false, &proc);

Pseudo-snapshot

Chiefly for the cache hash database and the cache tree database, the "pseudo-snapshot" mechanism is provided. The `BasicDB::dump_snapshot' method dumps all records into a stream or a file. The `BasicDB::load_snapshot' method loads records from a stream or a file. Although the operations are performed atomically, they don't finish momentarily but take time in proportion to the database size, blocking the other threads. Because the format of pseudo-snapshot data is common among all database classes, it is useful for migrating records between them.

db.dump_snapshot("backup.kcss");
db.load_snapshot("backup.kcss");

If you don't want the other threads to be blocked, use the cursor mechanism and save/load records by yourself.
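
For example, a sketch that copies all records into another opened database object (here named `backup', a hypothetical destination), blocking the other threads only briefly for each record:

DB::Cursor* cur = db.cursor();
cur->jump();
string key, value;
while (cur->get(&key, &value, true)) {
  backup.set(key, value);
}
delete cur;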

Encrypted Database

The `tune_compressor' method of the file tree database and others can set an arbitrary data compression functor. In fact, the functor can perform not only data compression but also data encryption. The class `ArcfourCompressor' implements a lightweight cipher algorithm based on Arcfour (a.k.a. RC4). It is useful to improve the security of your database casually without high overhead.

ArcfourCompressor comp;
comp.set_key("foobarbaz", 9);
TreeDB db;
db.tune_options(TreeDB::TCOMPRESS);
db.tune_compressor(&comp);
db.open(...);
comp.begin_cycle(((uint64_t)getpid() << 32) + (uint64_t)time(NULL));

If you use the polymorphic database, it is very easy to enable encryption. The naming option "zcomp" specifies the compression algorithm and the naming option "zkey" specifies the encryption key.

PolyDB db;
db.open("casket.kct#zcomp=arc#zkey=foobarbaz", ...);

"zcomp" supports "zlib" for the raw format of ZLIB, "def" for the Deflate format, "gz" for the gzip format, "arc" for Arcfour encryption, and "arcz" for Arcfour encryption compressed by ZLIB.

Note that the hash database types (CacheDB, HashDB, DirDB) compress only the value of each record; the key is not compressed. However, the tree database types (GrassDB, TreeDB, ForestDB) compress all data in the database. So, if you want to use the compressor for encryption, choose one of the tree database types.

Space Efficiency of On-memory Databases

The stash database (StashDB), the cache hash database (CacheDB), and the cache tree database (GrassDB) are useful for saving the memory usage of associative arrays of strings. They can be substituted for std::map in C++, java.lang.Map in Java, and the built-in associative array mechanisms of many scripting languages. The stash database and the cache hash database improve space efficiency by serializing the key and the value of each record into a byte array. The cache tree database improves space efficiency by serializing the records of each page into a byte array.

For example, if you store ten million records, each composed of an 8-byte key and an 8-byte value, `std::map<std::string, std::string>' (ProtoTreeDB) uses about 1.2GB of memory. In the same situation, the stash database uses 465MB, the cache hash database uses 618MB, and the cache tree database uses 318MB. The cache tree database provides the best space efficiency in this use case. However, as for time efficiency, the stash database and the cache hash database are superior to the cache tree database, due to the difference between hash tables and B+ trees. Note that a B+ tree is very good at sequential access but not suitable for random access. To improve the time efficiency of the B+ tree, set the page size to 1024 or less.

If you want to reduce memory usage to the extreme, use the cache tree database with the compression option. Moreover, set the bucket number to around 5% of the record number, set the page size to 32768 or more, and set the capacity of the page cache to less than 5% of the total record size. For example, for the above situation of ten million records, the bucket number should be 500 thousand, the page size should be 32768, and the page cache should be 8MB. Those are expressed as "%#opts=c#bnum=500k#psiz=32768#pccap=8m" for the polymorphic database. Consequently, the ten million records take only 60MB of memory.
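
For example, those tuning parameters can be passed as the naming options of the polymorphic database; the name "%" selects the cache tree database:

PolyDB db;
db.open("%#opts=c#bnum=500k#psiz=32768#pccap=8m",
        PolyDB::OWRITER | PolyDB::OCREATE);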

Sharing One database by Multiple Processes

Multiple processes cannot access one database file at the same time. A database file is locked by a reader-writer lock while a process is connected to it. Note that the `BasicDB::ONOLOCK' option should not be used to escape the file locking mechanism; this option is a workaround for file systems such as NFS that do not support file locking.

If you want multiple processes to share one database, use Kyoto Tycoon instead. It is a lightweight database server providing a network interface to Kyoto Cabinet.


License

To use Kyoto Cabinet, you can choose either the GNU GPL or a commercial license. If you choose the GPL, the source code of your application programs has to be licensed under a GPL-compatible license. If you choose the commercial license, you are exempted from that duty imposed by the GPL.

GNU General Public License

Kyoto Cabinet is free software: you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation, either version 3 of the License, or any later version.

Kyoto Cabinet is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details.

You should have received a copy of the GNU General Public License along with this program. If not, see `http://www.gnu.org/licenses/'.

FOSS License Exception

The FOSS License Exception is also provided in order to accommodate products under other free and open source licenses. See the body text for details.

Specific FOSS Library Linking Exception

The Specific FOSS Library Linking Exception is also provided in order to be available in some specific FOSS libraries. See the body text for details.

Commercial License

If you use Kyoto Cabinet within proprietary software, a commercial license is required.

The commercial license allows you to utilize Kyoto Cabinet by including it in your applications for the purpose of developing and producing your applications, and to utilize Kyoto Cabinet in order to transfer, sell, rent, lease, distribute, or sublicense your applications to any third parties. See the license guide for details.

Author

Kyoto Cabinet was written and is maintained by FAL Labs. You can contact the author by e-mail to `info@fallabs.com'.