Tkrzw: a set of implementations of DBM
Overview
DBM (Database Manager) is a concept of libraries to store an associative array on permanent storage. In other words, DBM allows an application program to store key-value pairs in a file and reuse them later. Each key and value is a string or a sequence of bytes. The key of each record must be unique within the database, and a value is associated with it. You can retrieve a stored record with its key very quickly. Thanks to the simple structure of DBM, its performance can be extremely high.
Tkrzw is a C++ library implementing DBM with various algorithms. It features high degrees of performance, concurrency, scalability and durability. The following classes are provided.
- HashDBM : File database manager implementation based on hash table.
- TreeDBM : File database manager implementation based on B+ tree.
- SkipDBM : File database manager implementation based on skip list.
- TinyDBM : On-memory database manager implementation based on hash table.
- BabyDBM : On-memory database manager implementation based on B+ tree.
- CacheDBM : On-memory database manager implementation with LRU deletion.
- Std(Hash|Tree)DBM : On-memory DBM implementations using std::unordered_map and std::map.
- (Poly|Shard)DBM : Polymorphic and sharding database manager adapters.
- AsyncDBM : Asynchronous database manager adapter.
- (File|Mem)Index : Secondary index implementations.
All database classes share the same interface so that applications can use any of them with the common API. All classes are thread-safe so that multiple threads can access the same database simultaneously. Basically, the database file is opened with the "Open" method and closed with the "Close" method. You can store records with the "Set" method, retrieve records with the "Get" method, and remove records with the "Remove" method. An iterator is also provided to retrieve each and every record of the database.
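For illustration, here is a minimal sketch of that common flow with the file hash database (class and method names follow the C++ API; error checks are omitted for brevity).

#include "tkrzw_dbm_hash.h"
#include <iostream>
#include <memory>
#include <string>

using namespace tkrzw;

int main(int argc, char** argv) {
  HashDBM dbm;
  // Opens the database file, creating it if it doesn't exist.
  dbm.Open("casket.tkh", true);
  // Stores records.
  dbm.Set("foo", "hop");
  dbm.Set("bar", "step");
  // Retrieves a record.
  std::string value;
  if (dbm.Get("foo", &value) == Status::SUCCESS) {
    std::cout << value << std::endl;
  }
  // Removes a record.
  dbm.Remove("bar");
  // Visits each and every record with an iterator.
  std::unique_ptr<DBM::Iterator> iter = dbm.MakeIterator();
  iter->First();
  std::string key;
  while (iter->Get(&key, &value) == Status::SUCCESS) {
    std::cout << key << ":" << value << std::endl;
    iter->Next();
  }
  // Closes the database.
  dbm.Close();
  return 0;
}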
Transactional features are also supported. If you want to evaluate a record and update it atomically, you use the "Process" method, which takes a key and an arbitrary callback function to process the record. If you want to evaluate multiple records and update them atomically, you use the "ProcessMulti" method, which takes keys and arbitrary callback functions to process the records. If you want to do long transactions without locking records, the "CompareExchange" method and the "CompareExchangeMulti" method are useful.
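As a small sketch of the optimistic approach (the method calls follow the usage shown in the examples later in this document), the following code updates a record only if its current value is still the expected one.

#include "tkrzw_dbm_tiny.h"
#include <iostream>

using namespace tkrzw;

int main(int argc, char** argv) {
  TinyDBM dbm;
  dbm.Set("state", "jump");
  // Atomically replaces the value with "land" only if it is still "jump".
  if (dbm.CompareExchange("state", "jump", "land") == Status::SUCCESS) {
    std::cout << "state is now " << dbm.GetSimple("state") << std::endl;
  }
  return 0;
}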
Each data class has different approaches for durability. The file database classes support the auto restore feature to recover records after the process crashes. The file hash/tree databases support the appending update mode, which prevents data corruption of existing records. The file skip database and all on-memory databases atomically replace the entire database with a completed temporary file, which realizes the strongest durability on a single machine. All database classes support producing online update logs, which enables online data replication and staged backup.
If the data structure is more complex than key-value pairs, you can treat the value as a serialized text/binary of any data structure. You can use any serialization format. To look up records by some properties except for the primary key, you can use secondary indices. MemIndex and FileIndex classes are useful for the purpose.
HashDBM and TreeDBM support automatic data compression and encryption with various algorithms. If you enable one of them, the record data is encoded automatically before being stored. Likewise, the record data in the file is decoded automatically when it is loaded. Compression allows you to store large records in smaller files. Encryption allows you to store secret records in local files safely.
Tkrzw adopts dependency injection for file implementations. Bundled implementations support memory mapping I/O, normal read/write I/O, and direct block I/O. You can inject your own implementations to support specific data storages and file systems. Whereas such file implementations use APIs (system calls) of the underlying operating systems, the upper layer for database algorithms doesn't depend on them so that the maximum portability is achieved. Tkrzw supports any Unix-like systems and Windows.
You can use some bridging interfaces to use Tkrzw in other programming languages than C++. Currently, C, Java, Python, Ruby, and Go interfaces are provided. The C interface is bundled in the main package. The other interfaces are packaged separately.
Command line utilities are provided as well. With them, you can create databases, set records, retrieve records, remove records, rebuild databases, and restore databases. A tool for performance tests is also bundled.
If you want to share a database among multiple processes or programs on remote machines, use the database service of Tkrzw-RPC.
Download
You can download source packages in the following directories.
- C++ source packages
- Java source packages
- Python source packages
- Ruby source packages
- Go source packages
Documents
The following are API documents for each language.
Installation
Tkrzw is implemented based on the C++17 standard and POSIX/Windows API. On a Unix-like system (Linux, FreeBSD, and Mac OS X), GCC (version 7.3 or later) is required. On a Windows system, Visual C++ (2019 or later) is required.
For UNIX
To build and install the library, usually, you will run the following commands.
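A typical sequence looks like the following; adjust the configure flags, which are explained later in this section, to your environment. The "sudo" prefix is needed only if the installation directory requires administrator privileges.

./configure --enable-opt-native --enable-most-features
make
make check
sudo make install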
"make check" is just for testing and it can take several minutes. You can omit it. If you build binary packages on resource-limited systems, use "make check-light" instead. If you want to install the files under "/usr", specify "--prefix=/usr" with the configure script.
On some UNIX-like systems including FreeBSD and Solaris, "make" defaults to bsdmake or other non-GNU make, so you might need to replace "make" with "gmake" in the above instructions.
By default, the library and related files are installed under "/usr/local".
Let's check whether the command utilities work properly.
Read the C++ API documents for usage of the library. Then, let's build a sample program. Make a file of the following C++ code and save it as "helloworld.cc".
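A minimal program along the following lines will do; the sample bundled with the package may differ slightly.

#include "tkrzw_dbm_hash.h"
#include <iostream>

using namespace tkrzw;

int main(int argc, char** argv) {
  // Creates the database file and stores a record.
  HashDBM dbm;
  dbm.Open("casket.tkh", true);
  dbm.Set("hello", "world");
  // Prints the stored value.
  std::cout << dbm.GetSimple("hello") << std::endl;
  dbm.Close();
  return 0;
}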
To build an application program, you'll typically run a command like this. The compiler flag "-std=c++17" is necessary if the default C++ version of your compiler is older than C++17. The flags "-llzma", "-llz4", "-lzstd", and "-lz" should be set only if corresponding features (LZMA, LZ4, ZStd, and ZLib) are enabled by the configure script. If you don't use threading functions in the application code, you can omit the compiler flag "-pthread". Even so, the linking flags "-latomic" and "-lpthread" are necessary because the library uses threading functions.
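For example, with the default installation prefix "/usr/local" and all compression features enabled, the command might look like this.

g++ -std=c++17 -I/usr/local/include -pthread helloworld.cc -o helloworld \
  -L/usr/local/lib -ltkrzw -llzma -llz4 -lzstd -lz -latomic -lpthread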
The bundled command "tkrzw_build_util" is useful to know necessary CPPFLAGS (flags for the preprocessor), and LDFLAGS (flags for the linker).
In some environments, you can also use the "pkg-config" command to list those flags.
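For example, the following invocations print the compiler flags and the linker flags respectively, assuming Tkrzw's pkg-config metadata is installed where pkg-config can find it.

pkg-config --cflags tkrzw
pkg-config --libs tkrzw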
To run the test suite, GoogleTest is required. Although testing the library is not necessary for most users, you can do so just in case.
The "--enable-opt-native" flag of the configure script is to optimize the operation code specifically to the native platform. If it is omitted, normal optimization is done. The "--enable-most-features" flag is to enable as many features as possible on the platform. Data compression features using external libraries are integrated. You can enable each of them separately by setting "--enable-zlib", "--enable-zstd", "--enable-lz4", and "--enable-lzma". If you use another prefix than "/usr/local" for extra packages, specify the prefix with "--with-extra", such as "--with-extra=/opt/homebrew". If you don't need dynamic linking library, set "--disable-shared". If you want the utility commands with static linking, set "--enable-static". If you develop Tkrzw itself, set "--enable-devel" or "--enable-debug".
For Windows
To build and install the library, you use "nmake". Edit VCMakefile to ensure that the base path of Visual C++ and the base path of the system SDK are correct. The directory names differ among minor versions, so checking them is important.
Open "x64 Native Tools Command Prompt for VS 2019" in the control panel. It sets command paths and other necessary environment configurations. Set the current directory to Tkrzw's source directory. Run the following command.
> nmake -f VCMakefile
> nmake -f VCMakefile check
]]>
To install the library, open another command prompt of VC++ as Administrator and run the following command.
> nmake -f VCMakefile install
]]>
By default, the library and related files are installed under "C:\Program Files\tkrzw". You can change it by editing VCMakefile before installation.
Read the C++ API documents for usage of the library. Then, let's build a sample program. Make a file of the following C++ code and save it as "helloworld.cc".
To build an application program, you'll typically run a command like this. The compiler flag "/std:c++17" is necessary if the default C++ version of your compiler is older than C++17.
> cl /std:c++17 /Zc:__cplusplus \
/I "C:\Program Files (x86)\tkrzw\include" \
/O2 /EHsc /W3 /MT helloworld.cc tkrzw.lib /OUT:helloworld.exe \
/link /libpath:"C:\Program Files (x86)\tkrzw\lib"
]]>
The bundled command "tkrzw_build_util" is useful to know necessary CPPFLAGS (flags for the preprocessor), and LDFLAGS (flags for the linker).
> tkrzw_build_util config -i
/I "C:\Program Files\tkrzw\include"
> tkrzw_build_util config -l
/libpath:"C:\Program Files\tkrzw\lib" tkrzw.lib
]]>
By default, the compiler flag "/MT" is specified to link to the multithread, static version of the run-time library. You can change it to "/MD" to link to the dynamic run-time library. You can also make a DLL of Tkrzw with the linker option "/LD".
The data compression libraries are not integrated on Windows by default. If you want to enable some of them, define macros by adding some of "/D_TKRZW_COMP_ZLIB", "/D_TKRZW_COMP_ZSTD", "/D_TKRZW_COMP_LZ4", and "/D_TKRZW_COMP_LZMA" to CLFLAGS in VCMakefile. Then, add the library (lib/dll) files to EXTRALIBRARIES in VCMakefile. You'll also need to add the library files to the command line when building your own programs.
Performance
Each database class has specific traits. It is important to select the best one or combine some of them for the use case.
class | medium | algorithm | ordered access | time complexity | update method | threading mutex | typical footprint |
---|---|---|---|---|---|---|---|
HashDBM | File | Hash table | No | O(1) | Online | Record RW-lock | 14 bytes/rec |
TreeDBM | File | B+ tree | Yes | O(log N) | Online | Page RW-lock | 3 bytes/rec |
SkipDBM | File | Skip list | Yes | O(log N) | Batch | Whole RW-lock | 4 bytes/rec |
TinyDBM | RAM | Hash table | No | O(1) | Online | Record RW-lock | - |
BabyDBM | RAM | B+ tree | Yes | O(log N) | Online | Page RW-lock | - |
CacheDBM | RAM | Hash table | No | O(1) | Online | Slot RW-lock | - |
StdHashDBM | RAM | Hash table | No | O(1) | Online | Whole RW-lock | - |
StdTreeDBM | RAM | Red-black tree | Yes | O(log N) | Online | Whole RW-lock | - |
Performance also differs among the database classes. Suppose one thread sets 1 million records. The key of each record is an 8-byte string from "00000000", "00000001", through "00999999" in ascending order. The value is an 8-byte string. Then, the thread retrieves all records and finally removes them. The following table shows the throughput of each operation on a computer with a 3.5 GHz 6-core CPU.
class | Set | Get | Remove | memory usage | file size |
---|---|---|---|---|---|
HashDBM | 2,194,855 QPS | 2,544,004 QPS | 1,918,255 QPS | 28.9 MB | 26.8 MB |
TreeDBM | 1,799,118 QPS | 1,897,519 QPS | 1,654,547 QPS | 58.5 MB | 17.8 MB |
SkipDBM | 1,252,369 QPS | 448,360 QPS | 1,299,623 QPS | 78.3 MB | 19.3 MB |
TinyDBM | 3,000,138 QPS | 3,718,786 QPS | 3,473,247 QPS | 52.8 MB | n/a |
BabyDBM | 2,580,311 QPS | 2,734,840 QPS | 2,362,725 QPS | 40.9 MB | n/a |
CacheDBM | 2,940,138 QPS | 3,880,209 QPS | 3,633,801 QPS | 71.3 MB | n/a |
StdHashDBM | 1,799,697 QPS | 2,987,751 QPS | 2,876,639 QPS | 99.9 MB | n/a |
StdTreeDBM | 1,489,055 QPS | 3,716,134 QPS | 3,489,438 QPS | 107.0 MB | n/a |
Next, we use 10 threads, each of which sets 100 thousand records. Thus, 1 million records in total are set. Then, the threads retrieve all records and finally remove them. As seen below, the on-memory hash database is extremely good at concurrent updating. The file hash database is also very good at concurrent updating although it is based on the file storage.
class | Set | Get | Remove | memory usage | file size |
---|---|---|---|---|---|
HashDBM | 2,235,611 QPS | 4,805,936 QPS | 2,690,573 QPS | 29.3 MB | 26.8 MB |
TreeDBM | 1,802,679 QPS | 4,085,953 QPS | 1,805,765 QPS | 79.4 MB | 19.0 MB |
SkipDBM | 914,857 QPS | 1,897,854 QPS | 1,000,715 QPS | 86.8 MB | 19.3 MB |
TinyDBM | 6,649,593 QPS | 10,278,645 QPS | 8,986,748 QPS | 55.5 MB | n/a |
BabyDBM | 3,195,461 QPS | 11,635,750 QPS | 3,436,732 QPS | 47.4 MB | n/a |
CacheDBM | 8,467,970 QPS | 12,087,366 QPS | 11,949,141 QPS | 72.0 MB | n/a |
StdHashDBM | 1,167,830 QPS | 15,775,355 QPS | 1,784,857 QPS | 100.0 MB | n/a |
StdTreeDBM | 1,124,624 QPS | 16,023,594 QPS | 2,130,565 QPS | 107.2 MB | n/a |
The above test cases should show the ideal performance of each database class because records are accessed sequentially and the entire data can be cached in memory by the file system. Then, let's move on to tougher test cases. We build a large database of 100 million records with random keys between "00000000" and "99999999". As some keys are duplicated and such records are overwritten, the actual number of unique records is about 63 million.
class | Set | Get | Remove | memory usage | file size |
---|---|---|---|---|---|
HashDBM | 1,312,484 QPS | 2,052,518 QPS | 1,759,782 QPS | 1788.1 MB | 1828.2 MB |
TreeDBM | 191,631 QPS | 148,509 QPS | 389,659 QPS | 4012.0 MB | 1714.3 MB |
SkipDBM | 454,955 QPS | 213,812 QPS | 453,710 QPS | 3036.7 MB | 1225.7 MB |
TinyDBM | 2,534,238 QPS | 2,744,532 QPS | 3,083,089 QPS | 3573.5 MB | n/a |
BabyDBM | 518,104 QPS | 517,197 QPS | 514,992 QPS | 2605.4 MB | n/a |
CacheDBM | 2,351,585 QPS | 2,375,056 QPS | 3,051,809 QPS | 4753.8 MB | n/a |
StdHashDBM | 1,912,922 QPS | 2,443,481 QPS | 2,850,411 QPS | 6410.1 MB | n/a |
StdTreeDBM | 482,476 QPS | 530,598 QPS | 522,506 QPS | 6596.3 MB | n/a |
Finally, we do the same operations with 10 threads. These test cases aim to simulate typical use cases where a large number of random records are inserted and retrieved simultaneously. As suggested by the theoretical time complexity O(1), the actual throughputs of the file hash database and the on-memory hash database are not affected by access patterns or the number of records. Meanwhile, throughputs of the file tree database and the file skip database decline although they can still respond to hundreds of thousands of queries per second.
class | Set | Get | Remove | memory usage | file size |
---|---|---|---|---|---|
HashDBM | 2,558,225 QPS | 6,041,697 QPS | 4,122,059 QPS | 1788.3 MB | 1828.2 MB |
TreeDBM | 343,564 QPS | 562,983 QPS | 568,628 QPS | 5195.4 MB | 1722.2 MB |
SkipDBM | 393,086 QPS | 1,384,863 QPS | 418,209 QPS | 3036.8 MB | 1225.7 MB |
TinyDBM | 8,351,605 QPS | 9,880,235 QPS | 9,272,530 QPS | 3573.4 MB | n/a |
BabyDBM | 1,239,584 QPS | 3,754,941 QPS | 1,187,900 QPS | 2645.8 MB | n/a |
CacheDBM | 9,314,902 QPS | 9,711,285 QPS | 11,446,497 QPS | 4754.0 MB | n/a |
StdHashDBM | 1,279,515 QPS | 13,572,641 QPS | 1,794,882 QPS | 6410.1 MB | n/a |
StdTreeDBM | 374,263 QPS | 4,118,937 QPS | 407,559 QPS | 6596.1 MB | n/a |
In general, if you want a key-value storage with the highest performance, choosing the file hash database is recommended. If you need ordered access of records, choosing the file tree database is recommended. If you need scalability of ordered databases, choosing the file skip database is recommended. If you need extreme performance, the on-memory hash database and the on-memory tree database are useful although they require another way to save/load records. If you want a cache system with deletion of LRU records, the cache database is useful.
HashDBM: The File Hash Database
The file hash database stores key-value structure in a single file. It uses a hash table and linked lists of records from buckets. Therefore, given the number of records N and the number of buckets M, the average time complexity of data retrieval is O(N/M). If M is large enough, the time complexity can be regarded as O(1).
Thread concurrency is pursued in this implementation. Only reader-writer locking is applied to each hash bucket. Therefore, even writer threads, which can set or remove records, perform in parallel. Blocking is done only when multiple writers try to update the same record simultaneously. Reader threads which retrieve records don't block each other even for the same record.
Durability is also pursued. Updating operations support two modes: in-place and appending. In the in-place mode, the region of an existing record in the file is re-written to modify the value of the record. In the appending mode, the region in the file for an existing record is never re-written. Instead, new data to overwrite the value is appended at the end of the file. The new data is put at the top of the linked list for the record so that it is found before the old one. In both modes, even if the hash table is broken for some reason, the database is restored automatically to salvage record data. In the in-place mode, the record being updated when the breakage happens can be lost but the other records can be recovered. In the appending mode, no record is ever lost unless the file system itself breaks.
The hash table is static and not reorganized implicitly. Thus, if the load factor of the hash table is high, the database should be rebuilt explicitly by the application. Rebuilding the database is effective to resolve fragmentation which can happen in both the in-place mode and the appending mode. Rebuilding the database doesn't block other database operations. In other words, you can retrieve and update records while the database is being rebuilt.
See the tips of Tuning HashDBM too. Tuning has a significant influence on performance. For details of the API, see the API document.
Example Code
This is a code example where basic operations are done without checking errors.
iter = dbm.MakeIterator();
iter->First();
std::string key, value;
while (iter->Get(&key, &value) == Status::SUCCESS) {
std::cout << key << ":" << value << std::endl;
iter->Next();
}
// Closes the database.
dbm.Close();
return 0;
}
]]>
This is a code example which represents a more serious use case with performance tuning and thorough error checks.
iter = dbm.MakeIterator();
status = iter->First();
if (status != Status::SUCCESS) {
// Failure of the First operation is critical so we stop.
Die("First failed: ", status);
}
while (true) {
// Retrieves the current record data.
std::string iter_key, iter_value;
status = iter->Get(&iter_key, &iter_value);
if (status == Status::SUCCESS) {
std::cout << iter_key << ":" << iter_value << std::endl;
} else {
// This happens at the end of iteration.
if (status != Status::NOT_FOUND_ERROR) {
// Error types other than NOT_FOUND_ERROR are critical.
Die("Iterator::Get failed: ", status);
}
break;
}
// Moves the iterator to the next record.
status = iter->Next();
if (status != Status::SUCCESS) {
// This could happen if another thread removed the current record.
if (status != Status::NOT_FOUND_ERROR) {
// Error types other than NOT_FOUND_ERROR are critical.
Die("Iterator::Get failed: ", status);
}
std::cerr << "missing: " << status << std::endl;
break;
}
}
// Closes the database.
// Even if you forgot to close it, the destructor would close it.
// However, checking the status is a good manner.
status = dbm.Close();
if (status != Status::SUCCESS) {
// The Close operation shouldn't fail. So we stop if it happens.
Die("Close failed: ", status);
}
return 0;
}
]]>
The file hash database, as well as the other database classes, supports callback functions to process the value of a record. The operation is done while the record is locked so that the update is applied atomically.
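The following is a hedged sketch of such a callback: a processor that appends a suffix to the value of an existing record, or creates the record if it is missing. The ProcessFull/ProcessEmpty method names and the return convention are assumed from the DBM::RecordProcessor interface; check the API document for the exact signatures.

// A record processor which appends a suffix to the existing value, or
// creates the record with the suffix when it is missing.
class AppendProcessor final : public DBM::RecordProcessor {
 public:
  explicit AppendProcessor(std::string suffix) : suffix_(std::move(suffix)) {}
  // Called while the record exists and is locked.
  std::string_view ProcessFull(std::string_view key, std::string_view value) override {
    new_value_ = std::string(value) + suffix_;
    return new_value_;
  }
  // Called while the record is absent and its slot is locked.
  std::string_view ProcessEmpty(std::string_view key) override {
    new_value_ = suffix_;
    return new_value_;
  }
 private:
  std::string suffix_;
  std::string new_value_;  // must outlive the Process call
};

// Usage: the evaluation and the update are done under the record lock.
//   AppendProcessor proc(":visited");
//   dbm.Process("page001", &proc, true);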
Format
The hash database file is composed of five sections: the metadata section, the bucket section, the free block pool section, the record header section, and the record section.
The metadata section dominates the first 128 bytes of the file. It contains the following fields.
- Magic data
- From the offset 0. A 9-byte string "TkrzwHDB\n".
- The front cyclic magic data.
- From the offset 9. A 1-byte integer.
- The package major version
- From the offset 10. A 1-byte integer.
- The package minor version
- From the offset 11. A 1-byte integer.
- The static flags
- From the offset 12. A 1-byte integer.
- The offset width
- From the offset 13. A 1-byte integer.
- The alignment power
- From the offset 14. A 1-byte integer.
- The closure flags
- From the offset 15. A 1-byte integer.
- The number of buckets
- From the offset 16. An 8-byte big-endian integer.
- The number of records
- From the offset 24. An 8-byte big-endian integer.
- The effective data size.
- From the offset 32. An 8-byte big-endian integer.
- The file size
- From the offset 40. An 8-byte big-endian integer.
- The last modified timestamp.
- From the offset 48. An 8-byte big-endian integer.
- The database type
- From the offset 56. A 4-byte big-endian integer.
- The opaque data
- From the offset 62. Arbitrary string data.
- The back cyclic magic data.
- From the offset 127. A 1-byte integer.
The magic data and the package version data are used for identifying the kind of the file. The two figures of the cyclic magic data are used to check consistency of the meta data. The version data indicates the version of the Tkrzw package when the file is created.
The static flags field specifies flags which are not changed during the lifetime of the database. The first bit and the second bit represent the update mode. 0x1 means the in-place update mode. 0x2 means the appending update mode. 0x0 and 0x3 are invalid. The third bit and the fourth bit represent the record CRC mode. 0x0 means no CRC. 0x1 means CRC-8. 0x2 means CRC-16. 0x3 means CRC-32. The fifth, sixth, and seventh bits represent the compression mode. 0x0 means no compression. 0x1 means ZLib. 0x2 means ZStd. 0x3 means LZ4. 0x4 means LZMA. 0x5 means RC4. 0x6 means AES.
The offset width specifies how many bytes are used to store an offset value. Given the offset width W, the maximum value is 2^(W*8). So, W=3 sets the maximum value 16,777,216 and W=4 sets it 4,294,967,296. The offset width affects the size of the buckets and the footprint of each record. The alignment power specifies the alignment of the offset value. Given the alignment power P, the alignment is 2^P. So, P=2 sets the alignment 4 and P=3 sets it 8. The alignment affects the size of space for each record. The maximum database size is determined as 2^(W*8+P). The default value of the offset width is 4 and the default value of the alignment power is 3. So, the maximum database size is 32GiB by default.
When a database is opened in the writable mode, the metadata section is updated immediately to set the closure flag to zero and set the last modified timestamp to the current UNIX time in microseconds. If the process crashes without closing the database, the flag and the timestamp help us detect the incident. If the writable database is closed normally, the closure flag is set to one and the last modified timestamp is updated too.
The number of buckets determines how many buckets are used for the hash table. The number of records indicates the current number of living key-value pairs in the database. The effective data size indicates the total amount of bytes used for the key and the value data of the living records. In other words, it doesn't include data for removed data, overwritten data, or other footprints. This figure is used to calculate storage efficiency.
The database type and the opaque data are used for any purposes by the application. Modifying them is done in place so it doesn't increase the file size even in the appending mode. The file tree database uses the opaque data to store its metadata.
The bucket section dominates from the offset 128 to an offset determined by the number of buckets and the offset width. Given the number of buckets M and the offset width W, the end offset is 128 + M * W. The default value of the number of buckets is 1,048,583. Then, the bucket section dominates from the offset 128 to 4,194,460. Each bucket is an integer in big-endian. If there's no record matching the bucket index, the value is zero. Otherwise, it indicates the offset of the data of the first record in a linked list of records matching the bucket index.
The hash function is the 64-bit version of MurmurHash 3. It is applied to the whole key data. If the number of buckets is less than 2^32, the hash value is folded by taking XOR of the greater 32 bits and the lower 32 bits in the format of (([32,47]<<16)|[48,63]) ^ (([0,15]<<16)|[16,31]). The bucket index of the key is the remainder of division of the hash value by the number of buckets.
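The following is an illustrative reading of that folding rule, not the library's actual code: the upper 32 bits and the lower 32 bits of the 64-bit hash value are XOR-ed together, and the bucket index is the remainder of the division by the number of buckets.

#include <cstdint>

uint64_t BucketIndex(uint64_t hash, uint64_t num_buckets) {
  if (num_buckets < (1ULL << 32)) {
    // (([32,47]<<16)|[48,63]) ^ (([0,15]<<16)|[16,31]) is equivalent to
    // XOR-ing the lower 32 bits with the upper 32 bits.
    hash = (hash & 0xFFFFFFFFULL) ^ (hash >> 32);
  }
  return hash % num_buckets;
}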
The record section dominates from an aligned offset after the bucket section to the end of the file. The start point is equal to or after the figure calculated as 128 + num_buckets * W + 1024 and it is aligned to 4096 and 2^P. The record section contains multiple record data in sequence. Each record is serialized in the following format.
- The magic data
- 1 byte integer.
- The child offset
- A big-endian integer with the offset width.
- The key size
- A byte delta encoded integer.
- The value size
- A byte delta encoded integer.
- The padding size
- A byte delta encoded integer.
- The CRC value
- Optional: A 1-byte, 2-byte, or 4-byte integer.
- The key data
- Optional: Arbitrary string data.
- The value data
- Optional: Arbitrary string data.
- The padding data
- Optional: A magic data and null data.
The magic data is composed of the operation type part and the checksum part. The upper 2 bits are the operation type part. 0xC0 (0x3 << 6) indicates that the operation is to do nothing. 0x80 (0x2 << 6) is to set the value of an existing record. 0x40 (0x1 << 6) is to remove an existing record. 0x00 (0x0 << 6) is to add a new record. The lower 6 bits are the checksum part. The checksum is the sum of the seed value 11 and every byte of the concatenated sequence of the record key and the record value, modulo 61. Then, 3 is added to the result so that the range is between 3 and 63 inclusive. The checksum helps us detect data corruption.
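As an illustrative reading of the checksum rule, not the library's actual code, the checksum part can be computed like this.

#include <cstdint>
#include <string_view>

uint8_t RecordChecksum(std::string_view key, std::string_view value) {
  uint32_t sum = 11;  // the seed value
  for (unsigned char c : key) sum = (sum + c) % 61;
  for (unsigned char c : value) sum = (sum + c) % 61;
  return static_cast<uint8_t>(sum + 3);  // between 3 and 63, fits in the lower 6 bits
}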
The child offset indicates the offset of the next record data in the linked list of the bucket. When a new record data of the same bucket is added, the bucket points to the new record data and the new record data points to the top one of the existing linked list, or zero if it doesn't exist.
The key size, the value size, and the padding size are represented in byte delta encoding. A value between 0 and 127 takes 1 byte. A value between 128 and 16,383 takes 2 bytes. A value between 16,384 and 2,097,151 takes 3 bytes. A value between 2,097,152 and 268,435,455 takes 4 bytes.
The CRC value is optional. If it exists, the value is the result of CRC-8, CRC-16, or CRC-32 applied to the concatenated data of the key and the value. The CRC value is used to detect inconsistency when the database is restored.
The key data and the value data are optional, which means an empty key and an empty value respectively. The key data is stored as-is. The value data is also stored as-is by default. If the record compression is enabled, compressed data by the algorithm is stored.
The padding data is optional. If it exists, it begins with a byte of 0xDD. The remaining data is null codes. The actual padding size includes additional size of the metadata for the padding size. That is, if the padding size in the record metadata is more than 127, the actual padding size is less by one byte. Likewise, larger than 16383 means less by two bytes, larger than 2097151 means less by three bytes.
With the default 4-byte offset width and small-sized (less than 128 bytes) keys and medium-sized (less than 16384 bytes) values, the footprint for each record is 1 + 4 + 1 + 2 + 1 = 9 bytes. However, due to alignment, padding bytes can be added. With the default 8-byte alignment, average padding size is about 4 bytes.
The 16 bytes before the record section is the record header section. It is composed of "TkrzwREC\n", a 1-byte integer for the static flags, a 1-byte integer for the offset width, and a 1-byte integer for the alignment power. They would be used to recover records if the main header were broken.
The 1008 bytes before the record header section is the free block pool section. It contains pairs of an offset and a size of each free block. The offset is a big-endian integer of the offset width. The size is a big-endian integer of 4 bytes. The maximum number of pairs is determined as 1008 / (W + 4). Given the offset width 4, the maximum number is 126.
As long as the meta data for the file size indicates the actual content size, padding data of null codes can be appended at the end of the file. This is useful to align the file size to the requirement of direct I/O.
TreeDBM: The File Tree Database
The file tree database stores key-value structure in a single file. It uses a multiway balanced tree structure called B+ tree. Therefore, given the number of records N, the average time complexity of data retrieval is O(log N). Because records are ordered by the key, range searches including forward matching search are supported.
A B+ tree is composed of nodes which are composed of records and links. Each node is serialized as a "page", which is stored in the file hash database. Therefore, the file tree database inherits many characteristics of the file hash database: performance, scalability, and durability. The file tree database uses a double-layered LRU cache system to reduce the frequency of page input and output; pages which have been repeatedly accessed are stored in the "hot" cache so that they are not wiped out even if many pages are accessed by a few traversal operations, which use the "warm" cache.
Thread concurrency is pursued in this implementation. Only reader-writer locking is applied to each page. Therefore, if threads randomly access records, they don't block each other. If multiple threads access the same page, a writer blocks the others but readers don't block other readers. The page cache system is managed with slotted mutexes so that multiple threads can access the cache system simultaneously.
If records are accessed sequentially in order of the key, performance of the file tree database can be better than the file hash database because file I/O is done by the page and its frequency is much lower. However, if records are accessed randomly, performance of the file tree database is worse than the file hash database. Likewise, space efficiency of the file tree database can be better than the file hash database if records are inserted in order. However, if records are inserted randomly, one third of the storage space is used for padding. As with the file hash database, rebuilding the database can reduce the file size and it is done without blocking other threads.
See the tips of Tuning TreeDBM too. Tuning has a significant influence on performance. For details of the API, see the API document.
Example Code
This is a code example where basic operations are done without checking errors.
iter = dbm.MakeIterator();
iter->Jump("ba");
std::string key, value;
while (iter->Get(&key, &value) == Status::SUCCESS) {
if (!StrBeginsWith(key, "ba")) break;
std::cout << key << ":" << value << std::endl;
iter->Next();
}
// Closes the database.
dbm.Close();
return 0;
}
]]>
This is a code example which represents a more serious use case with performance tuning and thorough error checks.
iter = dbm.MakeIterator();
status = iter->First();
if (status != Status::SUCCESS) {
// Failure of the First operation is critical so we stop.
Die("First failed: ", status);
}
while (true) {
// Retrieves the current record data.
std::string iter_key, iter_value;
status = iter->Get(&iter_key, &iter_value);
if (status == Status::SUCCESS) {
std::cout << iter_key << ":" << iter_value << std::endl;
} else {
// This happens at the end of iteration.
if (status != Status::NOT_FOUND_ERROR) {
// Error types other than NOT_FOUND_ERROR are critical.
Die("Iterator::Get failed: ", status);
}
break;
}
// Moves the iterator to the next record.
status = iter->Next();
if (status != Status::SUCCESS) {
// This could happen if another thread removed the current record.
if (status != Status::NOT_FOUND_ERROR) {
// Error types other than NOT_FOUND_ERROR are critical.
Die("Iterator::Get failed: ", status);
}
std::cerr << "missing: " << status << std::endl;
break;
}
}
// Closes the database.
// Even if you forgot to close it, the destructor would close it.
// However, checking the status is a good manner.
status = dbm.Close();
if (status != Status::SUCCESS) {
// The Close operation shouldn't fail. So we stop if it happens.
Die("Close failed: ", status);
}
return 0;
}
]]>
This is an example to make a dictionary from a TSV file. The keys are normalized into lower case and forward matching queries are supported.
columns = StrSplit(line, "\t");
if (columns.size() < 2) continue;
const std::string key = StrLowerCase(columns[0]);
ConcatProcessor proc(line);
dbm.Process(key, &proc, true);
}
// Find records by forward matching with "app".
std::unique_ptr<DBM::Iterator> iter = dbm.MakeIterator();
iter->Jump("app");
std::string key, value;
while (iter->Get(&key, &value) == Status::SUCCESS) {
if (!StrBeginsWith(key, "app")) break;
std::cout << "---- " << key << " ----" << std::endl;
std::cout << value << std::endl;
iter->Next();
}
// Close the input file
file.Close();
// Closes the database.
dbm.Close();
return 0;
}
]]>
Format
The tree database file is based on the hash database file. Metadata of the B+ tree is stored in the opaque metadata section, and nodes of the B+ tree are stored as records of the hash database.
The opaque metadata section of the hash database is a space to store an arbitrary 64-byte string, which the tree database uses to store its metadata. It contains the following fields.
- Magic data
- From the offset 0. A string "TDB" and a trailing null '\0' character.
- The number of records
- From the offset 4. A 6-byte big-endian integer.
- The effective data size.
- From the offset 10. A 6-byte big-endian integer.
- The root page ID
- From the offset 16. A 6-byte big-endian integer.
- The first page ID.
- From the offset 22. A 6-byte big-endian integer.
- The last page ID.
- From the offset 28. A 6-byte big-endian integer.
- The number of leaf nodes
- From the offset 34. A 6-byte big-endian integer.
- The number of inner nodes
- From the offset 40. A 6-byte big-endian integer.
- The maximum page size
- From the offset 46. A 3-byte big-endian integer.
- The maximum number of branches
- From the offset 49. A 3-byte big-endian integer.
- The level of the tree
- From the offset 52. A 1-byte integer.
- The key comparator type
- From the offset 53. A 1-byte integer.
- The opaque data
- From the offset 54. Arbitrary string data.
The magic data is used for identifying the kind of the database. The number of records indicates the current number of living key-value pairs in the database. The effective data size indicates the total amount of bytes used for the key and the value data of the living records. In other words, it doesn't include data for removed data, overwritten data, or other footprints. This figure is used to calculate storage efficiency.
There are two kinds of nodes in the B+ tree: leaf nodes and inner nodes. Leaf nodes contain records of key-value pairs. Inner nodes contain keys and links to other nodes. The root ID indicates the ID of the root node of the B+ tree. The root node is used as the entry point of search. The first ID indicates the ID of the first leaf node of the B+ tree. Leaf nodes are linked in a doubly linked list and the first leaf node is the entry point of sequential access. The level of the tree indicates how many layers exist in the tree. If there's no inner node, the level is 1.
The key comparator type indicates a function used to compare keys of records. The default LexicalKeyComparator is 1. LexicalCaseKeyComparator which ignores case is 2. DecimalKeyComparator which compares keys as numeric decimal integer expressions is 3. HexadecimalKeyComparator which compares keys as numeric hexadecimal integer expressions is 4. RealNumberKeyComparator which compares keys as decimal real number expressions is 5. SignedBigEndianKeyComparator which compares keys as big-endian binary signed integer expressions is 6. FloatBigEndianKeyComparator which compares keys as big-endian binary floating-point number expressions is 7. Their counterparts for pair strings are from 101 to 107. 255 means that another custom comparator is set. In that case, the comparator must be specified every time when the database is opened.
A leaf node has a key which is a 6-byte big-endian integer of its ID. The ID is a unique number not less than 1. Its value is serialized in the following format. Records are sorted in ascending order of the key.
- The previous leaf node ID
- A 6-byte big-endian integer.
- The next leaf node ID
- A 6-byte big-endian integer.
- Repetition of records
- Pairs of key-value records.
- The key is composed of the key size in byte delta encoding and an arbitrary key data.
- The value is composed of the value size in byte delta encoding and an arbitrary value data.
An inner node has a key which is a 6-byte big-endian integer of its ID. The ID is a unique number not less than 2^46 * 3. Its value is serialized in the following format. Links are sorted in ascending order of the key.
- The heir node ID
- A 6-byte big-endian integer.
- Repetition of links
- Pairs of key/ID links.
- The key is composed of the key size in byte delta encoding and an arbitrary key data.
- The ID is a 6-byte big-endian integer.
By default, the maximum payload size of each leaf node is 8130. When the page size exceeds it, the leaf node is divided into two and the half size 4065 fits the typical page size 4096 of the operating system. This maximizes space efficiency. If a node is divided and there is no parent node, a new inner node is created and the ID of the first divided node becomes the heir ID. The first key of the second divided node and the ID of the second divided node compose a link, which is inserted in the parent node. By default, the maximum number of links of each inner node is 256. When the number of links exceeds it, the inner node is divided into two and their links are inserted to the parent. This operation is done recursively to the root of the tree.
When the page size of the leaf node becomes less than half of the maximum payload size, the leaf is merged with a sibling node. Therefore, actually, the page size of the merged leaf node can be 1.5 times of the maximum payload size. When the number of links of an inner node becomes less than half of the maximum number of links, the inner node is merged to a sibling node. This operation is done recursively to the root of the tree. If the root is an inner node and it has only one child, the root is removed and the only child becomes the new root.
SkipDBM: The File Skip Database
The file skip database stores key-value structure in a single file. It uses a skip list of sorted records. Therefore, given the number of records N, the average time complexity of data retrieval is O(log N). Because records are ordered by the key, range searches including forward matching search are supported. Retrieving a record by the positional index is also supported.
Although the skip list is a static structure, you can modify the database with the usual DBM methods, like the Set and Remove methods. Updates are reflected in the file in a batch manner. After a series of updates, you have to synchronize the database by calling the Synchronize method or close the database with the Close method. Then, you can retrieve records.
The file skip database supports two building modes: at-random and in-order. In the default at-random mode, records can be inserted in random order. When the database is synchronized, input records are sorted by merge sort with temporary files. Thus, you can build a large database whose size exceeds the RAM capacity of your machine. The average time complexity of building the entire database is O(N * log N) in the at-random mode. In the in-order mode, records must be inserted in ascending order of the key. The average time complexity of building the entire database is O(N) in the in-order mode. In both modes, duplicated keys are allowed, and you can apply an arbitrary reducer function to resolve them later.
Records can be inserted concurrently in the at-random mode although synchronizing the database takes most of the time of building the database. In the in-order mode, there is no point in inserting records concurrently because the order cannot be assured by concurrent insertion. For retrieval, only reader locking is applied so that multiple threads can access the database in parallel.
See the tips of Tuning SkipDBM too. Tuning has a significant influence on performance. For details of the API, see the API document.
Example Code
This is a code example where basic operations are done without checking errors.
iter = dbm.MakeIterator();
iter->First();
std::string key, value;
while (iter->Get(&key, &value) == Status::SUCCESS) {
std::cout << key << ":" << value << std::endl;
iter->Next();
}
// Closes the database.
dbm.Close();
return 0;
}
]]>
This is a code example which represents a more serious use case with performance tuning and thorough error checks.
std::map<std::string, std::string> input_records;
input_records["foo"] = "hop";
input_records["bar"] = "step";
input_records["baz"] = "jump";
// Stores records.
for (const auto& input_record : input_records) {
status = dbm.Set(input_record.first, input_record.second);
if (status != Status::SUCCESS) {
// The Set operation shouldn't fail. So we stop if it happens.
Die("Set failed: ", status);
}
}
// Closes the database.
status = dbm.Close();
if (status != Status::SUCCESS) {
// The Close operation shouldn't fail. So we stop if it happens.
Die("Close failed: ", status);
}
// Opens the existing database in the reader mode.
status = dbm.Open("casket.tks", false);
if (status != Status::SUCCESS) {
// Failure of the Open operation is critical so we stop.
Die("Open failed: ", status);
}
// Retrieves records.
// If there was no record, NOT_FOUND_ERROR would be returned.
std::string value;
status = dbm.Get("foo", &value);
if (status == Status::SUCCESS) {
std::cout << value << std::endl;
} else {
std::cerr << "missing: " << status << std::endl;
}
// Traverses records with forward matching with "ba"
std::unique_ptr<DBM::Iterator> iter = dbm.MakeIterator();
status = iter->Jump("ba");
if (status == Status::NOT_FOUND_ERROR) {
// This could happen if there are no records equal to or greater than it.
std::cerr << "missing: " << status << std::endl;
} else if (status != Status::SUCCESS) {
// Other errors are critical so we stop.
Die("Jump failed: ", status);
}
while (true) {
// Retrieves the current record data.
std::string iter_key, iter_value;
status = iter->Get(&iter_key, &iter_value);
if (status == Status::SUCCESS) {
if (!StrBeginsWith(iter_key, "ba")) break;
std::cout << iter_key << ":" << iter_value << std::endl;
} else {
// This happens at the end of iteration.
if (status != Status::NOT_FOUND_ERROR) {
// Error types other than NOT_FOUND_ERROR are critical.
Die("Iterator::Get failed: ", status);
}
break;
}
// Moves the iterator to the next record.
status = iter->Next();
if (status != Status::SUCCESS) {
// This could happen if another thread cleared the database.
if (status != Status::NOT_FOUND_ERROR) {
// Error types other than NOT_FOUND_ERROR are critical.
Die("Iterator::Get failed: ", status);
}
std::cerr << "missing: " << status << std::endl;
break;
}
}
// Closes the database.
// Even if you forgot to close it, the destructor would close it.
// However, checking the status is a good manner.
status = dbm.Close();
if (status != Status::SUCCESS) {
// The Close operation shouldn't fail. So we stop if it happens.
Die("Close failed: ", status);
}
return 0;
}
]]>
Although the file skip database is not suitable for incessant updates, it supports updating content by adding some records and merging them with the entire database. You can apply an arbitrary reducing function to handle records with the same key.
Format
The skip database file is composed of two sections: the metadata section and the record section.
The metadata section dominates the first 128 bytes of the file. It contains the following fields.
- Magic data
- From the offset 0. A 9-byte string "TkrzwSDB\n".
- The front cyclic magic data.
- From the offset 9. A 1-byte integer.
- The package major version
- From the offset 10. A 1-byte integer.
- The package minor version
- From the offset 11. A 1-byte integer.
- The offset width
- From the offset 12. A 1-byte integer.
- The step unit
- From the offset 13. A 1-byte integer.
- The maximum level
- From the offset 14. A 1-byte integer.
- The closure flags
- From the offset 15. A 1-byte integer.
- The number of records
- From the offset 24. An 8-byte big-endian integer.
- The effective data size.
- From the offset 32. An 8-byte big-endian integer.
- The file size
- From the offset 40. An 8-byte big-endian integer.
- The last modified timestamp.
- From the offset 48. An 8-byte big-endian integer.
- The database type
- From the offset 56. A 4-byte big-endian integer.
- The opaque data
- From the offset 62. Arbitrary string data.
- The back cyclic magic data.
- From the offset 127. A 1-byte integer.
The magic data, the cyclic magic data and the package version data are used for identifying the kind of the file. The two figures of cyclic magic data are used to check consistency of the meta data. The version data indicates the version of the Tkrzw package when the file is created.
The offset width specifies how many bytes are used to store an offset value. Given the offset width W, the maximum value is 2^(W*8). So, W=3 sets the maximum value 16,777,216 and W=4 sets it 4,294,967,296. The offset width affects the size of the buckets and the footprint of each record. The default value of the offset width is 4, which means the maximum file size is 4GB. The step unit specifies how often records are given links to forward records. Given the step unit S, records whose indices can be divided by S have level 1 links, which refer to the forward record S steps ahead. Records whose indices can be divided by S^2 have level 2 links, which refer to the forward record S^2 steps ahead. Likewise, fewer records have higher level links. The highest level is limited by the maximum level property. To get the optimal performance, given the number of records N, the maximum level should be higher than log_S N - 1. So, given N=1,000,000 and S=4, the maximum level should be higher than log_4 1,000,000 - 1 ≈ 8.97. The default value of the step unit is 4 and the default value of the maximum level is 14. So, they are optimal until the number of records reaches 4^15 = 1,073,741,824.
When a database is opened in the writable mode, the metadata section is updated immediately to set the closure flag to zero and set the last modified timestamp to the current UNIX time in microseconds. If the process crashes without closing the database, the flag and the timestamp help us detect the incident. If the writable database is closed normally, the closure flag is set to one and the last modified timestamp is updated too.
The record section dominates from 128 to the end of the file. The record section contains multiple record data in sequence in ascending order of the key. Each record is serialized in the following format.
- The magic data
- 1 byte integer.
- The offsets of the forward records
- Big-endian integers with the offset width.
- The key size
- A byte delta encoded integer.
- The value size
- A byte delta encoded integer.
- Optional: The key data
- Arbitrary string data.
- Optional: The value data
- Arbitrary string data.
The magic data is 0xFF and helps us detect data corruption.
The offsets of the forward records are composed of big-endian integers whose number is the same as the level of the record. The level is determined by how many times the index of the record can be divided by the step unit without a remainder. Given the step unit S and the index C, the destination of the first level link is the record whose index is C + S^1. The destination of the second level link is the record whose index is C + S^2.
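As an illustrative reading of that rule, not the library's actual code, the level of a record can be computed from its index like this.

#include <cstdint>

int RecordLevel(uint64_t index, uint64_t step_unit, int max_level) {
  // Counts how many times the index can be divided by the step unit without a
  // remainder.  Index 0 is divisible by any step unit, so the count is capped
  // at the maximum level.
  int level = 0;
  while (level < max_level && index % step_unit == 0) {
    ++level;
    index /= step_unit;
  }
  return level;
}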
The key size and the value size are represented in byte delta encoding. A value between 0 and 127 takes 1 byte. A value between 128 and 16,383 takes 2 bytes. A value between 16,384 and 2,097,151 takes 3 bytes. A value between 2,097,152 and 268,435,455 takes 4 bytes.
The key data and the value data are optional, which means an empty key and an empty value respectively. The key data and the value data are stored as-is.
As long as the meta data for the file size indicates the actual content size, padding data of null codes can be appended at the end of the file. This is useful to align the file size to the requirement of direct I/O.
TinyDBM: The On-memory Hash Database
The on-memory hash database stores key-value structure on-memory. It uses a hash table and linked lists of records from buckets. Therefore, given the number of records N and the number of buckets M, the average time complexity of data retrieval is O(N/M). If M is large enough, the time complexity can be regarded as O(1).
Thread concurrency is pursued in this implementation. Only reader-writer locking is applied to each hash bucket. Therefore, even writer threads which can set or remove records perform in parallel. Blocking is done only when multiple writers try to update the same record simultaneously. Reader threads which retrieve records don't block each other even for the same record.
Whereas thread safety and thread performance are the most important features of the on-memory hash database, memory efficiency is also remarkable. Because the key, the value, and all metadata are serialized into a single sequence of bytes, the memory footprint is minimal. Typically, the pure footprint, excluding overhead from the memory allocator, is 10 bytes per record. Assuming the key and the value are 8-byte strings, actual memory usage is about 55% of std::map<std::string, std::string>.
You don't have to open or close on-memory databases to use them. You can just set records and retrieve them as if they were std::unordered_map. However, you can associate the database to a file by opening it with a file path. Then, when you close the database, all records are saved in the file. And, you can reuse the records by opening the database with the same file path.
For details of the API, see the API document.
Example Code
This is a code example where basic operations are done without checking errors.
iter = dbm.MakeIterator();
iter->First();
std::string key, value;
while (iter->Get(&key, &value) == Status::SUCCESS) {
std::cout << key << ":" << value << std::endl;
iter->Next();
}
return 0;
}
]]>
This is a code example which represents a more serious use case where records are saved in a file and they are reused later.
First();
std::string key;
while (iter->Get(&key) == Status::SUCCESS) {
if (dbm.CompareExchange(key, "jump", "land") == Status::SUCCESS) {
std::cout << key << " landed" << std::endl;
}
std::cout << key << " is now " << dbm.GetSimple(key) << std::endl;
iter->Next();
}
// Closes the database.
// In the reader mode, the file is not updated and no error occurs.
dbm.Close();
return 0;
}
]]>
BabyDBM: The On-memory Tree Database
The on-memory tree database stores key-value structure on-memory. It uses a multiway balanced tree structure called B+ tree. Therefore, given the number of records N, the average time complexity of data retrieval is O(log N). Because records are ordered by the key, range searches including forward matching search are supported.
A B+ tree is composed of nodes which are composed of records and links. Each node is serialized as a "page", which is kept in memory. Thread concurrency is pursued in this implementation. Only reader-writer locking is applied to each page. Therefore, if threads randomly access records, they don't block each other. If multiple threads access the same page, a writer blocks the others but readers don't block other readers.
Whereas thread safety and thread performance are the most important features of the on-memory tree database, memory efficiency is also remarkable. Because the key, the value, and all metadata are serialized into a single sequence of bytes, the memory footprint is minimal. Typically, the pure footprint, excluding overhead from the memory allocator, is 10 bytes per record. Assuming the key and the value are 8-byte strings, actual memory usage is about 50% of std::map<std::string, std::string>.
You don't have to open or close on-memory databases to use them. You can just set records and retrieve them as if they were std::unordered_map. However, you can associate the database to a file by opening it with a file path. Then, when you close the database, all records are saved in the file. And, you can reuse the records by opening the database with the same file path.
For details of the API, see the API document.
Example Code
This is a code example where basic operations are done without checking errors.
iter = dbm.MakeIterator();
iter->Jump("ba");
std::string key, value;
while (iter->Get(&key, &value) == Status::SUCCESS) {
if (!StrBeginsWith(key, "ba")) break;
std::cout << key << ":" << value << std::endl;
iter->Next();
}
return 0;
}
]]>
This is a code example which represents a more serious use case where records are saved in a file and they are reused later.
Last();
std::string key, value;
while (iter->Get(&key, &value) == Status::SUCCESS) {
std::cout << key << ":" << value << std::endl;
iter->Previous();
}
// Closes the database.
// In the reader mode, the file is not updated and no error occurs.
dbm.Close();
return 0;
}
]]>
CacheDBM: The On-memory Cache Database with LRU Deletion
The on-memory cache database stores key-value structure on-memory. It uses a hash table and three kinds of links: from the buckets to each record, from the first record to the last record, and from the last record to the first record. When a record is accessed, it is moved to the end of the list. Therefore, the least recently used record is placed at the first position. The cache database has a capacity to keep the memory usage stable. When the number of records exceeds the capacity, the least recently used records are removed implicitly. The average time complexity is O(1).
The cache database is sharded internally so that locking is done for each shard, in order to improve concurrent performance. The actual concurrent performance of the cache database is almost the same as that of the normal on-memory hash database. Whereas space efficiency of the cache database is worse than that of the on-memory hash database and the on-memory tree database, it is suitable as a cache system, thanks to the LRU deletion feature.
You don't have to open or close on-memory databases to use them. You can just set records and retrieve them as if they were std::unordered_map. However, you can associate the database to a file by opening it with a file path. Then, when you close the database, all records are saved in the file. And, you can reuse the records by opening the database with the same file path.
For details of the API, see the API document.
Example Code
This is a code example where basic operations are done without checking errors.
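As a minimal sketch (the bundled sample may differ), the cache database can be used like the other on-memory databases, without opening a file.

#include "tkrzw_dbm_cache.h"
#include <iostream>
#include <memory>
#include <string>

using namespace tkrzw;

int main(int argc, char** argv) {
  // Creates the cache database.  Capacity limits can be given as tuning
  // parameters; see the API document for details.
  CacheDBM dbm;
  // Stores records.
  dbm.Set("foo", "hop");
  dbm.Set("bar", "step");
  // Retrieves a record.
  std::cout << dbm.GetSimple("foo") << std::endl;
  // Visits all remaining records.
  std::unique_ptr<DBM::Iterator> iter = dbm.MakeIterator();
  iter->First();
  std::string key, value;
  while (iter->Get(&key, &value) == Status::SUCCESS) {
    std::cout << key << ":" << value << std::endl;
    iter->Next();
  }
  return 0;
}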
Std(Hash|Tree)DBM: The STL On-memory Databases
The standard on-memory hash database StdHashDBM stores key-value structure on memory with std::unordered_map. The standard on-memory tree database stores key-value structure on memory with std::map. As the hash database uses a hash table, the average time complexity of data retrieval is O(1). As the tree database uses a binary tree, the average time complexity of data retrieval is O(log N).
Thread safety is assured in these implementations. Reader-writer locking is applied to the whole database. Therefore, multiple reader threads can access the same database without blocking whereas a writer thread blocks other threads. Thread safety is the only benefit of these implementations over using bare std::unordered_map or std::map.
You don't have to open or close on-memory databases to use them. You can just set records and retrieve them as if they were std::unordered_map and std::map. However, you can associate the database to a file by opening it with a file path. Then, when you close the database, all records are saved in the file. And, you can reuse the records by opening the database with the same file path.
In contrast with the hash database, the tree database StdTreeDBM assures that all keys are ordered in alphabetical order. The iterator supports the Jump method, which enables range searches including forward-matching search.
For details of the API, see the API document.
Example Code
This is a code example where basic operations are done without checking errors.
// Traverses records whose keys begin with "ba".
std::unique_ptr<DBM::Iterator> iter = dbm.MakeIterator();
iter->Jump("ba");
std::string key, value;
while (iter->Get(&key, &value) == Status::SUCCESS) {
  if (!StrBeginsWith(key, "ba")) break;
  std::cout << key << ":" << value << std::endl;
  iter->Next();
}
return 0;
}
This is a code example which represents a more serious use case where records are saved in a file and they are reused later.
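A minimal sketch of that pattern with StdTreeDBM follows; the header name "tkrzw_dbm_std.h" and the file name are assumptions.
#include <iostream>

#include "tkrzw_dbm_std.h"

using namespace tkrzw;

int main(int argc, char** argv) {
  StdTreeDBM dbm;

  // Associates the on-memory database with a file and stores records.
  dbm.Open("casket.tkst", true);
  dbm.Set("foo", "hop");
  dbm.Set("bar", "step");
  // All records are saved in the file when the database is closed.
  dbm.Close();

  // Opens the same file again in the reader mode and reuses the records.
  dbm.Open("casket.tkst", false);
  std::cout << dbm.GetSimple("foo", "*") << std::endl;
  dbm.Close();

  return 0;
}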
(Poly|Shard)DBM: Polymorphic and Sharding DBM Adapters
Although all database classes share the same interface defined by the DBM class, constructors and tuning parameters are different. The polymorphic database manager adapter PolyDBM is provided in order to absorb such differences so that you can always use the same class PolyDBM. The concrete class is determined by the extension of the path. You can open a hash database by the following code.
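For instance, a sketch like the following opens a file hash database just by using the ".tkh" extension; the header name "tkrzw_dbm_poly.h" is an assumption.
#include "tkrzw_dbm_poly.h"

using namespace tkrzw;

int main(int argc, char** argv) {
  PolyDBM dbm;
  // The ".tkh" extension selects HashDBM as the concrete class.
  dbm.Open("casket.tkh", true);
  dbm.Set("hello", "world");
  dbm.Close();
  return 0;
}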
The following extensions are recognizable.
- HashDBM : .tkh, .hash
- TreeDBM : .tkt, .tree
- SkipDBM : .tks, .skip
- TinyDBM : .tkmt, .tiny, .flat
- BabyDBM : .tkmb, .baby
- CacheDBM : .tkmc, .cache
- StdHashDBM : .tksh, .stdhash
- StdTreeDBM : .tkst, .stdtree
You can specify the class name directly by passing the "dbm" parameter to the OpenAdvanced method. It also takes tuning parameters. You can use the hash database and tune it by the following code.
PolyDBM dbm;
const std::map<std::string, std::string> params = {
    {"dbm", "HashDBM"}, {"align_pow", "0"}, {"num_buckets", "10000"}};
dbm.OpenAdvanced("casket", true, File::OPEN_DEFAULT, params);
On-memory databases also have to be opened by the Open or OpenAdvanced method. However, if the path is empty, records are not saved in or loaded from any file. If you specify a non-empty path, records are saved in and loaded from the file. The following code opens an on-memory tree database without associating it to any file.
PolyDBM dbm;
const std::map<std::string, std::string> params = {
    {"dbm", "BabyDBM"}, {"key_comparator", "DecimalKeyComparator"}};
dbm.OpenAdvanced("", true, File::OPEN_DEFAULT, params);
The polymorphic database manager adapter is also useful when you develop interfaces for other programming languages, like Java, Python, Ruby, Go, Lua etc. Whereas writing code for each database class is tedious, just writing code for PolyDBM allows you to use all database classes.
The sharding database manager adapter ShardDBM is a wrapper of multiple PolyDBM instances for sharding a database. Whereas an instance of ShardDBM behaves like a usual DBM, records are stored in multiple database files, which are called shards. A hash function is used to determine which shard each record belongs to.
By separating one database file into several files, concurrent performance improves: the probability that a thread is blocked by a mutex is divided by the number of shards. Moreover, the efficiency of rebuilding the database improves. Usually, rebuilding a database file requires making a temporary file whose size is proportional to the original database file. With sharding, rebuilding is done for each shard one by one, so the size of each temporary file is also divided.
ShardDBM can shard any database, whether it is ordered or unordered. With an ordered database, the iterator gathers records from each shard using a heap tree structure so that records are retrieved in ascending order of the key.
You specify the number of shards by passing the "num_shards" parameter to the OpenAdvanced method. It also takes the same tuning parameters as PolyDBM.
ShardDBM dbm;
const std::map<std::string, std::string> params = {
    {"num_shards", "4"}, {"dbm", "TreeDBM"}};
dbm.OpenAdvanced("casket", true, File::OPEN_DEFAULT, params);
If the file path is "casket" and the number of shards is 4, "casket-00000-of-00004", "casket-00001-of-00004", "casket-00002-of-00004", and "casket-00003-of-00004" are created. To open an existing database, specify the path without the suffix. You can omit the "num_shards" parameter to open existing files.
For details of the API, see the API document of PolyDBM and ShardDBM.
Example Code
This is a code example of basic usage of PolyDBM.
const std::map<std::string, std::string> open_params = {
{"update_mode", "UPDATE_APPENDING"},
{"max_page_size", "1000"}, {"max_branches", "128"},
{"key_comparator", "DecimalKeyComparator"},
};
dbm.OpenAdvanced("casket.tkt", true, File::OPEN_TRUNCATE, open_params).OrDie();
// Stores records.
dbm.Set("1", "one").OrDie();
dbm.Set("2", "two").OrDie();
dbm.Set("3", "three").OrDie();
// Rebuilds the database.
const std::map<std::string, std::string> rebuild_params = {
{"update_mode", "UPDATE_IN_PLACE"},
{"max_page_size", "4080"}, {"max_branches", "256"},
};
dbm.RebuildAdvanced(rebuild_params).OrDie();
// Retrieves records.
std::cout << dbm.GetSimple("1", "*") << std::endl;
std::cout << dbm.GetSimple("2", "*") << std::endl;
std::cout << dbm.GetSimple("3", "*") << std::endl;
std::cout << dbm.GetSimple("4", "*") << std::endl;
// Closes the database.
dbm.Close().OrDie();
return 0;
}
This is a code example of basic usage of ShardDBM.
const std::map<std::string, std::string> open_params = {
{"num_shards", "10"},
{"dbm", "TreeDBM"}, {"key_comparator", "DecimalKeyComparator"},
};
dbm.OpenAdvanced("casket", true, File::OPEN_TRUNCATE, open_params).OrDie();
// Stores records.
for (int32_t i = 1; i <= 100; i++) {
const std::string key = ToString(i);
const std::string value = ToString(i * i);
dbm.Set(key, value).OrDie();
}
// Retrieve records whose keys are 50 or more.
auto iter = dbm.MakeIterator();
iter->Jump("50");
std::string key, value;
while (iter->Get(&key, &value).IsOK()) {
std::cout << key << ":" << value << std::endl;
iter->Next();
}
// Closes the database.
dbm.Close().OrDie();
return 0;
}
AsyncDBM: Asynchronous Database Manager Adapter
Database operations can take time because of file I/O operations done inside. So, a thread calling DBM methods can block for a while. If you want to avoid blocking of the current thread, using the asynchronous API is useful. The class AsyncDBM is a wrapper of the normal synchronous API of the DBM interface. Thus, any concrete class of DBM can be used asynchronously. AsyncDBM has a task queue handled by a thread pool. When you call an asynchronous method, the operation is registered as a task in the task queue. The thread pool monitoring the task queue detects tasks and the operations are done in the background by some threads simultaneously. The main thread never blocks just for registering database operations.
Each asynchronous method returns a std::future object. If you want to wait for a database operation to finish, you can call the "wait" or "wait_for" method of the future object. You can get the result of the database operation by calling the "get" method of the future object. If you are not interested in the result, you can ignore the future object without waiting for or evaluating the result. The destructor of the AsyncDBM object assures that all pending tasks are done and all threads in the thread pool are joined.
Generally speaking, the asynchronous API is not for improving throughput, because serializing tasks in the task queue and notifying worker threads of updates has a certain overhead. To improve throughput, you should create and manage multiple threads yourself to handle divided subtasks concurrently. Meanwhile, the asynchronous API is useful for interactive features, where blocking of the current thread should be avoided. You can use the TaskQueue class to execute arbitrary tasks in the background. The task queue instance used inside the AsyncDBM instance can be obtained with the GetTaskQueue method.
You can use std::async to execute an arbitrary function in the background. It also returns a std::future to wait for the operation to finish and evaluate the result, so the semantics are almost the same between AsyncDBM and std::async. However, std::async creates a new thread for each operation, which deteriorates performance and throughput. In contrast, AsyncDBM manages a thread pool where the same threads are reused again and again. Whereas the maximum throughput of std::async is less than 100,000 QPS, the maximum throughput of the TaskQueue class is more than 2,000,000 QPS. However, serializing parameters for various operations to put them in the task queue is cumbersome to implement. AsyncDBM encapsulates such complex details, so you can easily implement asynchronous database operations without deep knowledge of multithreading.
Example Code
This is a code example of basic usage of AsyncDBM.
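A minimal sketch follows; the header name "tkrzw_dbm_async.h", the constructor taking a DBM pointer and the number of worker threads, and the future-returning Set/Get methods are assumptions based on the description above.
#include <future>
#include <iostream>
#include <string>
#include <utility>

#include "tkrzw_dbm_async.h"
#include "tkrzw_dbm_hash.h"

using namespace tkrzw;

int main(int argc, char** argv) {
  HashDBM dbm;
  dbm.Open("casket.tkh", true, File::OPEN_TRUNCATE);
  {
    // Wraps the database with an asynchronous adapter using 4 worker threads.
    AsyncDBM async(&dbm, 4);

    // Registers a setting operation and ignores the result.
    async.Set("hello", "world");

    // Registers a getting operation and waits for its result.
    std::future<std::pair<Status, std::string>> result = async.Get("hello");
    const auto pair = result.get();
    if (pair.first == Status::SUCCESS) {
      std::cout << pair.second << std::endl;
    }

    // The destructor of AsyncDBM waits for all pending tasks.
  }
  dbm.Close();
  return 0;
}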
(File|Mem)Index: Secondary Indices
HashDBM is the fastest file database in theory and in practice. Its performance is stable and it tolerates frequent and concurrent updates. However, it is an unordered database so it supports only exact matching of the key. In other words, it doesn't support range search or forward matching search, which ordered databases support. Moreover, compared with relational databases, the interface of DBM is simple and it can find records only by the primary key. However, combining HashDBM and secondary indices is a practical solution to demands for high performance and complex structure.
You store the main entity data with HashDBM so that you get performance and durability. The key is a unique string which is considered as the primary key of a relational database. The value is a serialized complex data structure in a format like JSON and Protocol Buffers. You make on-memory indices for some fields of the main entity so that you can search the database for records whose specific fields match some conditions.
Typically, an index means an inverted structure made from the main database. The key of the index is composed of two factors: pattern for matching and the primary key of a record in the main database. Let's say, you build a database for employees. The primary key is the employee ID so the main database uses the employee ID as the key. The value is a data structure composed of a personal name, a division ID, a job title, salary etc. You make an index for division IDs because you want to know who are working for each division quickly without scanning the whole database for each query. Then, each record of the index should be composed of a division ID and an employee ID.
FileIndex is a class to handle a secondary index on file. It is implemented with the file tree database. MemIndex is a class to handle a secondary index on memory. It is implemented with the on-memory tree database. And, StdIndex is a template class to realize secondary indices by wrapping std::set<std::pair<X, Y>> composed of arbitrary data types. In order to use FileIndex and MemIndex via the same interface, PolyIndex class is also provided.
For the division ID index, the first element can be int64_t for the division ID and the second element can be int64_t for the employee ID. Therefore, using StdIndex<int64_t, int64_t> is a good idea. By having the primary key of the main database as the second element, each record becomes unique and you don't have to worry about duplication. You can also set an arbitrary comparator to adjust the order of records.
Secondary indices are not data that must be persisted on storage, because they can be reproduced as long as the main entity data exists. Therefore, using on-memory indices is a practical way to realize extreme throughput. To use on-memory indices, you have to load all index entries from the main database every time the process starts.
For details of the API, see the API document of FileIndex, MemIndex, and PolyIndex.
Example Code
This code represents an advanced usage of HashDBM with FileIndex, which allows for finding employees by the division ID.
const std::vector<std::string> fields = StrSplit(serialized, "\t");
if (fields.size() < 3) {
Die("Deserialize failed");
}
employee_id = StrToInt(fields[0]);
name = fields[1];
division_id = StrToInt(fields[2]);
return true;
}
void Print() {
std::cout << "employee_id=" << employee_id << " name=" << name
<< " division_id=" << division_id << std::endl;
}
};
// Updates information of an employee.
// If the name is empty, the entry is removed.
void UpdateEmployee(const Employee& employee, DBM* dbm, FileIndex* division_index) {
const std::string& primary_key = ToString(employee.employee_id);
class Updater : public DBM::RecordProcessor {
public:
Updater(const Employee& employee, FileIndex* division_index)
: employee_(employee), division_index_(division_index) {
}
// This is called if there is an existing record of the key.
std::string_view ProcessFull(std::string_view key, std::string_view value) override {
Employee old_record;
old_record.Deserialize(value);
// Removes an entry for the existing record.
if (old_record.division_id > 0) {
division_index_->Remove(ToString(old_record.division_id), key);
}
// If the new name is empty, removes the record.
if (employee_.name.empty()) {
return REMOVE;
}
// Adds an entry for the new record.
if (employee_.division_id > 0) {
division_index_->Add(ToString(employee_.division_id), key);
}
// Updates the record.
new_value_ = employee_.Serialize();
return new_value_;
}
// This is called if there is no existing record of the key.
std::string_view ProcessEmpty(std::string_view key) override {
// If the new name is empty, nothing is done.
if (employee_.name.empty()) {
return NOOP;
}
// Adds an entry for the new record.
if (employee_.division_id > 0) {
division_index_->Add(ToString(employee_.division_id), key);
}
// Updates the record.
new_value_ = employee_.Serialize();
return new_value_;
}
private:
const Employee& employee_;
FileIndex* division_index_;
std::string new_value_;
} updater(employee, division_index);
dbm->Process(primary_key, &updater, true).OrDie();
}
// Main routine.
int main(int argc, char** argv) {
// Creates the database manager.
HashDBM dbm;
// Prepare an index for divisions and their members.
FileIndex division_index;
// Opens a new database,
dbm.Open("casket.tkh", true, File::OPEN_TRUNCATE).OrDie();
// Opens a new index.
// Setting PairDecimalKeyComparator is to sort the division ID in numeric order.
// Otherwise, "10" would come earlier than "2" in lexical order.
TreeDBM::TuningParameters division_index_params;
division_index_params.key_comparator = PairDecimalKeyComparator;
Status status = division_index.Open(
"casket-division-index.tkt", true, File::OPEN_TRUNCATE, division_index_params);
// Register employee records.
const std::vector<Employee> employees = {
{10001, "Anne", 301}, {10002, "Marilla", 301}, {10003, "Matthew", 301},
{10004, "Diana", 401},
{10005, "Gilbert", 501},
};
for (const Employee& employee : employees) {
UpdateEmployee(employee, &dbm, &division_index);
}
// Closes the index.
division_index.Close().OrDie();
// Closes the database.
dbm.Close().OrDie();
// Opens an existing database,
dbm.Open("casket.tkh", true).OrDie();
// Opens an existing index,
division_index.Open("casket-division-index.tkt", true).OrDie();
// Updates for employees.
const std::vector<Employee> updates = {
{10001, "Anne", 501}, // Anne works with Gilbert.
{10003, "", 0}, // Matthew left the company.
{10006, "Minnie May", 401},
{10007, "Davy", 301}, {10008, "Dora", 301},
};
for (const Employee& employee : updates) {
UpdateEmployee(employee, &dbm, &division_index);
}
// Prints members for each division.
for (const int64_t division_id : {301, 401, 501}) {
std::cout << "-- Division " << division_id << " --" << std::endl;
for (const std::string employee_id :
division_index.GetValues(ToString(division_id))) {
Employee employee;
employee.Deserialize(dbm.GetSimple(employee_id));
employee.Print();
}
}
// Closes the index.
division_index.Close().OrDie();
// Closes the database.
dbm.Close().OrDie();
return 0;
}
Let's do the same thing as the above with StdIndex instead of FileIndex. We scan the main database when the process starts in order to build the on-memory secondary index.
// The division index is composed of the division ID and the employee ID.
typedef StdIndex<int64_t, int64_t> DivisionIndex;
// Structure for an employee.
// Actually, you should use a serious implementation like Protocol Buffers.
struct Employee {
int64_t employee_id = 0;
std::string name;
int64_t division_id = 0;
std::string Serialize() const {
return StrCat(employee_id, "\t", name, "\t", division_id);
}
void Deserialize(const std::string_view& serialized) {
const std::vector<std::string> fields = StrSplit(serialized, "\t");
if (fields.size() < 3) {
Die("Deserialize failed");
}
employee_id = StrToInt(fields[0]);
name = fields[1];
division_id = StrToInt(fields[2]);
}
void Print() {
std::cout << "employee_id=" << employee_id << " name=" << name
<< " division_id=" << division_id << std::endl;
}
};
// Loads index records from the main database.
void LoadIndexRecords(DBM* dbm, DivisionIndex* division_index) {
class IndexBuilder : public DBM::RecordProcessor {
public:
explicit IndexBuilder(DivisionIndex* division_index)
: division_index_(division_index) {}
// This is called for each existing record.
std::string_view ProcessFull(std::string_view key, std::string_view value) override {
const int64_t key_num = StrToInt(key);
Employee employee;
employee.Deserialize(value);
division_index_->Add(employee.division_id, key_num);
return NOOP;
}
private:
DivisionIndex* division_index_;
} index_builder(division_index);
dbm->ProcessEach(&index_builder, false).OrDie();
}
// Updates information of an employee.
// If the name is empty, the entry is removed.
void UpdateEmployee(const Employee& employee, DBM* dbm, DivisionIndex* division_index) {
const std::string& primary_key = ToString(employee.employee_id);
class Updater : public DBM::RecordProcessor {
public:
Updater(const Employee& employee, DivisionIndex* division_index)
: employee_(employee), division_index_(division_index) {
}
// This is called if there is an existing record of the key.
std::string_view ProcessFull(std::string_view key, std::string_view value) override {
const int64_t key_num = StrToInt(key);
Employee old_record;
old_record.Deserialize(value);
// Removes an entry for the existing record.
if (old_record.division_id > 0) {
division_index_->Remove(old_record.division_id, key_num);
}
// If the new name is empty, removes the record.
if (employee_.name.empty()) {
return REMOVE;
}
// Adds an entry for the new record.
if (employee_.division_id > 0) {
division_index_->Add(employee_.division_id, key_num);
}
// Updates the record.
new_value_ = employee_.Serialize();
return new_value_;
}
// This is called if there is no existing record of the key.
std::string_view ProcessEmpty(std::string_view key) override {
const int64_t key_num = StrToInt(key);
// If the new name is empty, nothing is done.
if (employee_.name.empty()) {
return NOOP;
}
// Adds an entry for the new record.
if (employee_.division_id > 0) {
division_index_->Add(employee_.division_id, key_num);
}
// Updates the record.
new_value_ = employee_.Serialize();
return new_value_;
}
private:
const Employee& employee_;
DivisionIndex* division_index_;
std::string new_value_;
} updater(employee, division_index);
dbm->Process(primary_key, &updater, true).OrDie();
}
// Main routine.
int main(int argc, char** argv) {
// Creates the database manager.
HashDBM dbm;
// Prepare an index for divisions and their members.
DivisionIndex division_index;
// Opens a new database,
dbm.Open("casket.tkh", true, File::OPEN_TRUNCATE).OrDie();
// After opening the database, indices should be loaded.
// As the database is empty at this point, there's no effect here.
LoadIndexRecords(&dbm, &division_index);
// Register employee records.
const std::vector<Employee> employees = {
{10001, "Anne", 301}, {10002, "Marilla", 301}, {10003, "Matthew", 301},
{10004, "Diana", 401},
{10005, "Gilbert", 501},
};
for (const Employee& employee : employees) {
UpdateEmployee(employee, &dbm, &division_index);
}
// Closes the database.
dbm.Close().OrDie();
// Opens an existing database,
dbm.Open("casket.tkh", true).OrDie();
// After opening the database, indices should be loaded.
// All existing records are reflected on the index.
LoadIndexRecords(&dbm, &division_index);
// Updates for employees.
const std::vector<Employee> updates = {
{10001, "Anne", 501}, // Anne works with Gilbert.
{10003, "", 0}, // Matthew left the company.
{10006, "Minnie May", 401},
{10007, "Davy", 301}, {10008, "Dora", 301},
};
for (const Employee& employee : updates) {
UpdateEmployee(employee, &dbm, &division_index);
}
// Prints members for each division.
for (const int64_t division_id : {301, 401, 501}) {
std::cout << "-- Division " << division_id << " --" << std::endl;
for (int64_t employee_id : division_index.GetValues(division_id)) {
Employee employee;
employee.Deserialize(dbm.GetSimple(ToString(employee_id)));
employee.Print();
}
}
// Closes the database.
dbm.Close().OrDie();
return 0;
}
C Language Interface
The C++ API cannot be used directly in C code. So, the C interface is provided as a wrapper of PolyDBM and ShardDBM. Those adapter classes absorb differences among concrete database classes so you can use HashDBM, TreeDBM, SkipDBM and other databases in C code. The C interface is useful to embed the functionality in other languages which provide a bridging interface for C but not for C++, as the name mangling of C++ sometimes hinders simple integration.
You can create the database object and open a database by calling the tkrzw_dbm_open function. It returns a pointer to a TkrzwDBM object. Most methods of PolyDBM and ShardDBM in C++ have their counterparts in the C interface. For methods of the database object, there are counterpart functions whose names begin with "tkrzw_dbm_" and they receive the pointer to a TkrzwDBM object as the first parameter. An iterator object is created by calling the tkrzw_dbm_make_iterator function. For methods of the iterator object, there are counterpart functions whose names begin with "tkrzw_dbm_iter_" and they receive the pointer to a TkrzwDBMIter object as the first parameter.
The generic file functions are also useful to handle flat files. They are wrappers of the PolyFile class in C++. You can use files in the memory mapping I/O mode, the normal I/O mode, and the direct I/O mode with the same interface. Their names begin with "tkrzw_file_" and they receive the pointer to a TkrzwFile object as the first parameter.
Secondary indices are also available. They are wrappers of the PolyIndex class in C++, which wraps both FileIndex and MemIndex. The function names begin with "tkrzw_index_" and they receive the pointer to a TkrzwIndex object as the first parameter.
The status of the last called database operation or file operation is stored as a thread-local variable. You can get it with the tkrzw_get_last_status function. Exceptions thrown in the internal C++ code are caught implicitly. Whereas Tkrzw's code doesn't throw exceptions by itself, std::bad_alloc can be thrown if memory allocation by the "new" operator fails. In such error cases, zero, false, null, and failure values are returned and the status is set to TKRZW_STATUS_SYSTEM_ERROR.
The library header is based on the C99 and C11 standards. On UNIX-like systems, typically, your program is built like this:
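A typical command might look like the following, assuming the library is installed under /usr/local and that the C++ runtime and pthread libraries have to be linked explicitly:
gcc -std=c11 -I/usr/local/include -O2 -Wall helloworld.c \
  -o helloworld -L/usr/local/lib -ltkrzw -lstdc++ -lpthread -lm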
On Windows, typically, your program is built like this:
cl /std:c11 \
/I "C:\Program Files (x86)\tkrzw\include" \
/O2 /EHsc /W3 /MT helloworld.c tkrzw.lib /OUT:helloworld.exe \
/link /libpath:"C:\Program Files (x86)\tkrzw\lib"
For details of the API, see the API document.
Example Code
You can use a hash database by the following code.
#include "tkrzw_langc.h"
// Main routine.
int main(int argc, char** argv) {
// Opens the database file.
TkrzwDBM* dbm = tkrzw_dbm_open(
"casket.tkh", true, "truncate=true,num_buckets=100");
// Stores records.
tkrzw_dbm_set(dbm, "foo", -1, "hop", -1, true);
tkrzw_dbm_set(dbm, "bar", -1, "step", -1, true);
tkrzw_dbm_set(dbm, "baz", -1, "jump", -1, true);
// Retrieves a record.
char* value_ptr = tkrzw_dbm_get(dbm, "foo", -1, NULL);
if (value_ptr) {
puts(value_ptr);
free(value_ptr);
}
// Traverses records.
TkrzwDBMIter* iter = tkrzw_dbm_make_iterator(dbm);
tkrzw_dbm_iter_first(iter);
while (true) {
char* key_ptr = NULL;
if (!tkrzw_dbm_iter_get(iter, &key_ptr, NULL, &value_ptr, NULL)) {
break;
}
printf("%s:%s\n", key_ptr, value_ptr);
free(key_ptr);
free(value_ptr);
tkrzw_dbm_iter_next(iter);
}
tkrzw_dbm_iter_free(iter);
// Closes the database file.
tkrzw_dbm_close(dbm);
return 0;
}
Note that every created TkrzwDBM object should be closed by the tkrzw_dbm_close function. Every region allocated by tkrzw_dbm_get, tkrzw_dbm_iter_get and other functions should be released by the free function. Every created TkrzwDBMIter object should be released by the tkrzw_dbm_iter_free function.
Command Line Utilities
tkrzw_build_util: Build Utilities
tkrzw_build_util is a command for build utilities. Thereby, you can obtain flags to build your own program using the library of Tkrzw. The usage is the following. A subcommand name comes as the first argument. Some options and other arguments can follow.
tkrzw_build_util config [options]
- Prints configurations.
tkrzw_build_util version
- Prints the version information.
- Options of the config subcommand:
-v : Prints the version number.
-i : Prints C++ preprocessor options for build.
-l : Prints linker options for build.
-p : Prints the prefix for installation.
The following is a way to build your own application without setting complicated flags.
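For example, a C++ program might be built with a command like this, where the compile and link flags come from the config subcommand (the file name myapp.cc is hypothetical):
g++ -std=c++17 $(tkrzw_build_util config -i) myapp.cc -o myapp $(tkrzw_build_util config -l)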
tkrzw_dbm_util: DBM Utilities
tkrzw_dbm_util is a command for managing database files. Thereby, you can create databases, set records, retrieve records, remove records, rebuild databases, and restore databases. The usage is the following. A subcommand name comes as the first argument. Some options and other arguments can follow.
tkrzw_dbm_util create [common_options] [tuning_options] [options] file
- Creates a database file.
tkrzw_dbm_util inspect [common_options] file [attr]
- Prints inspection of a database file.
tkrzw_dbm_util get [common_options] file key
- Gets a record and prints it.
tkrzw_dbm_util set [common_options] [options] file key
- Sets a record.
tkrzw_dbm_util remove [common_options] file key
- Removes a record.
tkrzw_dbm_util rekey [common_options] file old_key new_key
- Changes the key of a record.
tkrzw_dbm_util list [common_options] file
- Lists up records and prints them.
tkrzw_dbm_util rebuild [common_options] [tuning_options] file
- Rebuilds a database file for optimization.
tkrzw_dbm_util restore [common_options] old_file [new_file]
- Restores a broken database file.
tkrzw_dbm_util merge [common_options] dest_file src_files...
- Merges database files.
tkrzw_dbm_util export [common_options] [options] dbm_file rec_file
- Exports records to a flat record file.
tkrzw_dbm_util import [common_options] [options] dbm_file rec_file
- Imports records from a flat record file.
- Common options:
--dbm impl : The name of a DBM implementation: auto, hash, tree, skip, tiny, baby, cache, stdhash, stdtree, poly, shard. (default: auto)
--file impl : The name of a file implementation: std, mmap-para, mmap-atom, pos-para, pos-atom. (default: mmap-para)
--no_wait : Fails if the file is locked by another process.
--no_lock : Omits file locking.
--sync_hard : Synchronizes the file physically when closing.
--alloc_init num : The initial allocation size. (default: 1048576)
--alloc_inc num : The allocation increment factor. (default: 2.0)
--block_size num : The block size of the positional access file. (default: 1)
--direct_io : Enables the direct I/O option of the positional access file.
--sync_io : Enables the synchronous I/O option of the positional access file.
--padding : Enables padding at the end of the file.
--pagecache : Enables the mini page cache in the process.
--multi : Calls xxxMulti methods for get, set, and remove subcommands.
- Options for the create subcommand:
--truncate : Truncates an existing database file.
- Options for the inspect subcommand:
--validate : Validates records.
- Options for the set subcommand:
--no_overwrite : Fails if there's an existing record with the same key.
--append str : Appends the value at the end after the given delimiter.
--incr num : Increments the value with the given initial value.
--reducer func : Sets the reducer for the skip database: none, first, second, last, concat, concatnull, concattab, concatline, total. (default: none)
- Options for the rekey subcommand:
--no_overwrite : Fails if there's an existing record with the same key.
- Options for the list subcommand:
--move type : Type of movement: first, jump, jumplower, jumplowerinc, jumpupper, jumpupperinc. (default: first)
--jump_key str : Specifies the jump key. (default: empty string)
--items num : The number of items to print (default: 10).
--escape : C-style escape is applied to the TSV data.
--keys : Prints keys only.
- Options for the rebuild subcommand:
--restore : Skips broken records to restore a broken database.
- Options for the merge subcommand:
--reducer func : Sets the reducer for the skip database: none, first, second, last, concat, concatnull, concattab, concatline, total. (default: none)
- Options for the restore subcommand:
--auto str : The restore mode automatically done: none, default, default-ns, sync, sync-ns. (default: none)
--end_offset : The exclusive end offset of records to read. (default: -1)
--class : The class name given to PolyDBM or ShardDBM.
- Options for the export and import subcommands:
--tsv : The record file is in TSV format instead of flat record.
--escape : C-style escape/unescape is applied to the TSV data.
--keys : Exports keys only.
--ulog num : Uses update logs based on the timestamp.
--ulog_ids num num : Sets the server ID and the DBM index of update logs.
- Tuning options for HashDBM:
--in_place : Uses in-place rather than pre-defined ones.
--append : Uses the appending mode rather than the in-place mode.
--record_crc num : The record CRC mode: -1, 0, 8, 16, 32. (default: 0 or -1)
--record_comp str : The record compression mode: default, none, zlib, zstd, lz4, lzma, rc4, aes. (default: none or default)
--offset_width num : The width to represent the offset of records. (default: 4 or -1)
--align_pow num : Sets the power to align records. (default: 3 or -1)
--buckets num : Sets the number of buckets for hashing. (default: 1048583 or -1)
--cipher_key str : Sets the encryption key for cipher compressors. (default: empty)
- Tuning options for TreeDBM:
--in_place : Uses in-place rather than pre-defined ones.
--append : Uses appending rather than pre-defined ones.
--record_crc num : The record CRC mode: -1, 0, 8, 16, 32. (default: 0 or -1)
--record_comp str : The record compression mode: default, none, zlib, zstd, lz4, lzma, rc4, aes. (default: none or default)
--offset_width num : The width to represent the offset of records. (default: 4 or -1)
--align_pow num : Sets the power to align records. (default: 10 or -1)
--buckets num : Sets the number of buckets for hashing. (default: 131101 or -1)
--cipher_key str : Sets the encryption key for cipher compressors. (default: empty)
--max_page_size num : Sets the maximum size of a page. (default: 8130 or -1)
--max_branches num : Sets the maximum number of branches of inner nodes. (default: 256 or -1)
--comparator func : Sets the key comparator: lex, lexcase, dec, hex, real, float. (default: lex)
- Tuning options for SkipDBM:
--offset_width num : The width to represent the offset of records. (default: 4)
--step_unit num : Sets the step unit of the skip list. (default: 4)
--max_level num : Sets the maximum level of the skip list. (default: 14)
--sort_mem_size num : Sets the memory size used for sorting. (default: 268435456)
--insert_in_order : Inserts records in ascending order of the key.
- Options for PolyDBM and ShardDBM:
--params str : Sets the parameters in "key=value,key=value" format.
The following is a sample usage to make a hash database file to associate country names to their capital cities' names. If the extension of the file is "tkh", the type of the database is regarded as HashDBM.
Let's say, you have a dictionary in TSV format. The first field is an index word and the second field is its description. Then, make a skip database of the dictionary. If the extension of the file is "tks", the type of the database is regarded as SkipDBM.
You can look up words which forward-match a pattern.
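A hypothetical session, assuming the dictionary file is named dict.tsv and using the import and list subcommands described above:
tkrzw_dbm_util create --dbm skip dict.tks
tkrzw_dbm_util import --tsv dict.tks dict.tsv
tkrzw_dbm_util list --move jump --jump_key "epi" --items 10 dict.tks
The list subcommand with --move jump starts at the first key at or after the given pattern, which is how a forward-matching search can be emulated.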
tkrzw_dbm_perf: DBM Performance Checker
tkrzw_dbm_perf is a command for checking performance of DBM implementations. Thereby, you can check performance of database operations in various scenarios including multithreading. The usage is the following. A subcommand name comes as the first argument. Some options and other arguments can follow.
tkrzw_dbm_perf sequence [options]
- Checks setting/getting/removing performance in sequence.
tkrzw_dbm_perf parallel [options]
- Checks setting/getting/removing performance in parallel.
tkrzw_dbm_perf wicked [options]
- Checks consistency with various operations.
tkrzw_dbm_perf index [options]
- Checks performance of on-memory indexing.
- Common options:
--dbm impl : The name of a DBM implementation: auto, hash, tree, skip, tiny, baby, cache, stdhash, stdtree, poly, shard. (default: auto)
--iter num : The number of iterations. (default: 10000)
--size num : The size of each record value. (default: 8)
--threads num : The number of threads. (default: 1)
--random_seed num : The random seed or negative for real RNG. (default: 0)
--verbose : Prints verbose reports.
--path path : The path of the file to write or read.
--file impl : The name of a file implementation: std, mmap-para, mmap-atom, pos-para, pos-atom. (default: mmap-para)
--no_wait : Fails if the file is locked by another process.
--no_lock : Omits file locking.
--sync_hard : Synchronizes the file physically when closing.
--alloc_init num : The initial allocation size. (default: 1048576)
--alloc_inc num : The allocation increment factor. (default: 2.0)
--block_size num : The block size of the positional access file. (default: 1)
--direct_io : Enables the direct I/O option of the positional access file.
--sync_io : Enables the synchronous I/O option of the positional access file.
--padding : Enables padding at the end of the file.
--pagecache : Enables the mini page cache in the process.
--ulog str : Sets the prefix of update log files.
- Options for the sequence subcommand:
--random_key : Uses random keys rather than sequential ones.
--random_value : Uses random length values rather than fixed ones.
--set_only : Does only setting.
--get_only : Does only getting.
--iter_only : Does only iterating.
--remove_only : Does only removing.
--validate : Validates records.
- Options for the parallel subcommand:
--random_key : Uses random keys rather than sequential ones.
--random_value : Uses random length values rather than fixed ones.
--keys : The number of unique keys.
--rebuild : Rebuilds the database occasionally.
--sleep num : The duration to sleep between iterations. (default: 0)
--validate : Validates records.
- Options for the wicked subcommand:
--iterator : Uses iterators occasionally.
--clear : Clears the database occasionally.
--rebuild : Rebuilds the database occasionally.
--validate : Validates records.
- Options for the index subcommand:
--type expr : The types of the key and value of the index: file, mem, n2n, n2s, s2n, s2s, str. (default: file)
--random_key : Uses random keys rather than sequential ones.
--random_value : Uses random length values rather than fixed ones.
- Options for HashDBM:
--append : Uses appending rather than pre-defined ones.
--record_crc num : The record CRC mode: -1, 0, 8, 16, 32. (default: 0 or -1)
--record_comp str : The record compression mode: default, none, zlib, zstd, lz4, lzma, rc4, aes. (default: none or default)
--offset_width num : The width to represent the offset of records. (default: 4)
--align_pow num : Sets the power to align records. (default: 3)
--buckets num : Sets the number of buckets for hashing. (default: 1048583)
--fbp_cap num : Sets the capacity of the free block pool. (default: 2048)
--min_read_size num : Sets the minimum reading size to read a record. (default: -1)
--cache_buckets : Caches the hash buckets on memory.
--cipher_key str : Sets the encryption key for cipher compressors. (default: empty)
- Options for TreeDBM:
--append : Uses the appending mode rather than the in-place mode.
--record_crc num : The record CRC mode: -1, 0, 8, 16, 32. (default: 0 or -1)
--offset_width num : The width to represent the offset of records. (default: 4)
--align_pow num : Sets the power to align records. (default: 10)
--buckets num : Sets the number of buckets for hashing. (default: 131101)
--fbp_cap num : Sets the capacity of the free block pool. (default: 2048)
--min_read_size num : Sets the minimum reading size to read a record. (default: -1)
--cache_buckets : Caches the hash buckets on memory.
--cipher_key str : Sets the encryption key for cipher compressors. (default: empty)
--max_page_size num : Sets the maximum size of a page. (default: 8130)
--max_branches num : Sets the maximum number of branches of inner nodes. (default: 256)
--max_cached_pages num : Sets the maximum number of cached pages. (default: 10000)
--page_update_write : Writes updated pages immediately.
- Options for SkipDBM:
--offset_width num : The width to represent the offset of records. (default: 4)
--step_unit num : Sets the step unit of the skip list. (default: 4)
--max_level num : Sets the maximum level of the skip list. (default: 14)
--sort_mem_size num : Sets the memory size used for sorting. (default: 268435456)
--insert_in_order : Inserts records in ascending order of the key.
--max_cached_records num : Sets the number of cached records. (default: 65536)
--reducer func : Sets the reducer: none, first, second, last, concat, total. (default: none)
- Options for TinyDBM and StdHashDBM:
--buckets num : Sets the number of buckets for hashing. (default: 1048583)
- Options for CacheDBM:
--cap_rec_num num : Sets the maximum number of records. (default: 1048576)
--cap_mem_size num : Sets the total memory size to use.
- Options for PolyDBM and ShardDBM:
--params str : Sets the parameters in "key=value,key=value" format.
The following is a sample usage to make a hash database file and then store 1 million records of 8-byte keys and 8-byte values, retrieve all of them, and remove all of them in sequence. The number of threads is 10 and each thread handles 100 thousand records. The number of buckets is 2 million.
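A command along these lines matches that scenario; the flags are taken from the option list above and the file name is arbitrary:
tkrzw_dbm_perf sequence --dbm hash --iter 100000 --threads 10 --size 8 \
  --buckets 2000000 --path casket.tkh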
The following is a sample usage to make a skip database file and then store 1 million records of 8-byte keys and 8-byte values, retrieve all of them. The number of threads is 10 and each thread handles 100 thousand records. The step unit and the maximum level of the skip list are 3 and 13 respectively.
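Likewise, a command such as the following would cover that scenario:
tkrzw_dbm_perf sequence --dbm skip --iter 100000 --threads 10 --size 8 \
  --step_unit 3 --max_level 13 --path casket.tks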
tkrzw_ulog_util: Update Logging Utilities
tkrzw_ulog_util is a command for managing update log files. Thereby, you can add update logs, inspect update logs, list up files, and remove old files. The usage is the following. A subcommand name comes as the first argument. Some options and other arguments can follow.
tkrzw_ulog_util writeset [options] prefix key value
- Writes a SET operation.
tkrzw_ulog_util writeremove [options] prefix key
- Writes a REMOVE operation.
tkrzw_ulog_util writeclear [options] prefix
- Writes a CLEAR operation.
tkrzw_ulog_util read [options] prefix
- Reads each log.
tkrzw_ulog_util listfiles [options] prefix
- Lists log files.
tkrzw_ulog_util removeoldfiles [options] prefix
- Removes old log files.
- Options for the write subcommands:
--timestamp num : The timestamp to record. (default: current time)
--max_file_size num : The maximum file size. (default: 1Gi)
--server_id num : The server ID to record. (default: 0)
--dbm_index num : The DBM index to record. (default: 0)
- Options for the read subcommand:
--timestamp num : The timestamp to start at. (default: 0)
--server_id num : The server ID to focus on. (default: all)
--dbm_index num : The DBM index to focus on. (default: all)
--items num : The number of items to print. (default: 10)
--escape : C-style escape is applied to the TSV data.
- Options for the listfiles subcommand:
--timestamp num : The timestamp to start at. (default: 0)
- Options for the removeoldfiles subcommand:
--timestamp num : The timestamp of threshold. (default: -7D = 7 days ago)
--exclude_latest : Avoid removing the latest file.
The following is a sample usage to make a file of update logs to write three records and apply it to a database.
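A hypothetical sequence is shown below; the prefix "ulog" and the database file name are arbitrary, and applying the logs through the --ulog option of the import subcommand of tkrzw_dbm_util is an assumption based on the option lists above.
tkrzw_ulog_util writeset ulog one first
tkrzw_ulog_util writeset ulog two second
tkrzw_ulog_util writeset ulog three third
tkrzw_dbm_util import --ulog 0 casket.tkh ulog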
To remove old update logs, you run the "removeoldfiles" subcommand. The "--timestamp" option can take a relative time from the current time with a value beginning with "+" or "-". The value can have a suffix for the unit: "D" for days, "H" for hours, "M" for minutes, and "S" for seconds. No unit means milliseconds. It is good practice not to remove the latest file regardless of its age. If you remove files which are older than 2 days from now excluding the latest one, you run the following. Don't forget the "-".
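With the options described above, such an invocation would look like this (the prefix "ulog" is arbitrary):
tkrzw_ulog_util removeoldfiles --timestamp -2D --exclude_latest ulog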
Tips
Storing Structured Data
If you want to store structured records in the database, using std::map<string, string> as a container is practical. Serialize each attribute of various types into a string and set it as a record of a string map. Then, serialize the map itself into a string.
The header "tkrzw_str_util.h" provides a lot of useful functions for serialization/deserialization. For a string vector, use SerializeStrVector and DeserializeStrVector. For basic numeric types, use SerializeBasicValue and DeserializeBasicValue. For a vector of basic numeric types, use SerializeBasicVector and DeserializeBasicVector. Finally, for a string map, use SerializeStrMap and DeserializeStrMap.
#include <iostream>
#include <map>
#include <string>
#include <vector>
#include "tkrzw_dbm_hash.h"
#include "tkrzw_str_util.h"
// Main routine.
int main(int argc, char** argv) {
// All symbols of Tkrzw are under the namespace "tkrzw".
using namespace tkrzw;
// Creates the database manager.
HashDBM dbm;
// Opens a new database.
dbm.Open("casket.tkh", true, File::OPEN_TRUNCATE);
// Makes a map representing structured data.
std::map<std::string, std::string> record;
record["name"] = "John Doe";
record["id"] = SerializeBasicValue<int32_t>(98765);
record["rating"] = SerializeBasicValue<double>(0.8);
std::vector<std::string> medals = {"gold", "silver", "bronze"};
record["medals"] = SerializeStrVector(medals);
std::vector<int32_t> rivals = {123456, 654321};
record["rivals"] = SerializeBasicVector(rivals);
std::vector<double> scores = {48.9, 52.5, 60.3};
record["scores"] = SerializeBasicVector(scores);
// Stores the serialized string.
dbm.Set("john", SerializeStrMap(record));
// Retrieves the serialized string and restore it.
std::string serialized;
std::map<std::string, std::string> restored;
if (dbm.Get("john", &serialized).IsOK()) {
restored = DeserializeStrMap(serialized);
}
// Shows the restored content.
std::cout << "name: " << restored["name"] << std::endl;
std::cout << "id: " <<
DeserializeBasicValue(restored["id"]) << std::endl;
std::cout << "rating: " <<
DeserializeBasicValue(restored["rating"]) << std::endl;
for (const auto& value : DeserializeStrVector(restored["medals"])) {
std::cout << "medals: " << value << std::endl;
}
for (const auto& value :
DeserializeBasicVector<int32_t>(restored["rivals"])) {
std::cout << "rivals: " << value << std::endl;
}
for (const auto& value :
DeserializeBasicVector<double>(restored["scores"])) {
std::cout << "scores: " << value << std::endl;
}
// Closes the database.
dbm.Close();
return 0;
}
Whereas interoperability of the database file itself is assured by the library, interoperability of the content of records is a responsibility of the application. If you store binary data in records, taking care of the byte order is necessary. As long as you use the utility functions in the above example, interoperability is assured by the library. You can also use a standardized serialization format like JSON and Protocol Buffers.
When you store numeric values in a record, using SerializeBasicValue and SerializeBasicVector is a straightforward way. The two functions serialize a number into a fixed-length byte sequence. For example, int32_t and float take 4 bytes respectively, and int64_t and double take 8 bytes respectively. Meanwhile, if the value is an integer and its value is typically small, using IntToStrDelta and SerializeIntVectorDelta is a good idea. The two functions serialize an integer into a variable-length byte sequence. The second parameter specifies whether the zigzag encoding is applied to support negative values. Deserialization is done by StrToIntDelta and DeserializeIntVectorDelta respectively.
std::vector<int64_t> numvec = {-123, 0, 123};
std::string numvec_str = SerializeIntVectorDelta(numvec, true);
std::vector<int64_t> restored_numvec = DeserializeIntVectorDelta<int64_t>(numvec_str, true);
std::vector<uint64_t> unumvec = {0, 123, 18235};
std::string unumvec_str = SerializeIntVectorDelta(unumvec, false);
std::vector<uint64_t> restored_unumvec = DeserializeIntVectorDelta<uint64_t>(unumvec_str, false);
Tuning HashDBM
The file hash database uses a static hash table, whose size doesn't change implicitly. The number of hash buckets is set when the database is created. The load factor, which is defined as the ratio of the number of records to the number of buckets, should be less than 1 in order to maintain good performance. So, the number of buckets should be larger than the estimated number of records. By default, the number of buckets is about 1 million.
To create a very small database, you'll run a command like this.
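A sketch with the tuning options of tkrzw_dbm_util might be the following, which shrinks the offset width, disables alignment, and uses only ten buckets:
tkrzw_dbm_util create --dbm hash --offset_width 3 --align_pow 0 --buckets 10 casket.tkh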
The same thing can be done in C++.
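A minimal sketch in C++, assuming the corresponding fields of HashDBM::TuningParameters:
HashDBM dbm;
HashDBM::TuningParameters params;
params.offset_width = 3;
params.align_pow = 0;
params.num_buckets = 10;
dbm.OpenAdvanced("casket.tkh", true, File::OPEN_DEFAULT, params);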
To make a very large database, you'll set offset_width=5, align_pow=4, and num_buckets=100000000 or something.
When the load factor exceeds 1 and performance decreases, you'll rebuild the database. By default, the number of buckets is set to twice the number of records so that the load factor becomes 0.5.
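With the command line utility, that is the rebuild subcommand; a bucket count can also be given explicitly via the tuning options:
tkrzw_dbm_util rebuild casket.tkh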
The same thing can be done in C++. The RebuildAdvanced method can take tuning parameters so that you can specify them explicitly.
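A sketch in C++; the num_buckets field is set only to show that an explicit value is honored, otherwise the default described above is applied:
HashDBM::TuningParameters params;
params.num_buckets = 2000000;
dbm.RebuildAdvanced(params);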
One of the most important features of Tkrzw's file hash database is that the operation to rebuild the database doesn't block other threads trying to update the database. Rebuilding is done by making a temporary database file while accepting updates to the original file. When the rebuilding is done, updates during the rebuilding are applied to the new file incrementally and then the original file is replaced with the new file atomically. All these operations are done just by calling the Rebuild method by a dedicated thread.
Rebuilding the database is also effective to resolve data fragmentation. The file hash database has a metadata property of effective data size, which is the total size of keys and values of all records. When the effective data size is less than 30% of the file size, it's time to rebuild the database. To know whether the database should be rebuilt, you'll run a command like this.
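With the command line utility, the inspect subcommand prints the database metadata, which should include the file size and the effective data size:
tkrzw_dbm_util inspect casket.tkh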
The same thing can be done in C++.
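In C++, the ShouldBeRebuilt method provides such a check; a minimal sketch:
bool tobe = false;
dbm.ShouldBeRebuilt(&tobe);
if (tobe) {
  dbm.Rebuild();
}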
Alignment is effective to mitigate fragmentation when the size of the value of records changes frequently. By default, the alignment is set to 2^3 = 8 bytes so that every record begins at an offset which is a multiple of 8. If the record size is not a multiple of 8, padding bytes are put at the end. If the size of a record increases but the padding can contain the increment, the record doesn't have to be moved to another place. If the record is moved, the original region becomes a free block. In the in-place update mode, the free block can be reused, but reuse is not assured. Having 8-byte alignment means that the expected size of padding is 4 bytes. If the typical record size is around 30 bytes, 4 bytes of padding seems acceptable and effective.
The position of each record is represented by a 4-byte integer by default, which means the maximum value is 2^32 = 4GiB. Actually, the position is calculated as the value multiplied by the alignment, so the maximum size is 4GiB * 8 = 32GiB by default. If the database size is about to exceed the limit, you should rebuild the database with an increased offset width. Usually, a 5-byte integer will do because it can represent up to 1TiB. Alignment is not useful in the appending mode, and databases in the appending mode tend to be very large. Therefore, you'll create a database in the appending mode like this.
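A sketch of such a command, combining the appending mode with a wider offset and no alignment:
tkrzw_dbm_util create --dbm hash --append --offset_width 5 --align_pow 0 --buckets 100000000 casket.tkh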
By default, I/O of the file hash database file is done via memory mapping. Thus, if the size of the database file exceeds the capacity of the virtual memory, the process can crash with segmentation fault or be killed by the OOM killer. To avoid it, you can use the normal I/O system calls, by creating the database object with the specific file object. See the tips on File Classes for details. If the database file is much larger than the RAM on your machine, using direct I/O is the best practice. See Direct I/O for details. Compressing record data is useful to calibrate tradeoff of time efficiency and space efficiency. See Data Compression for details.
Update Modes of HashDBM and TreeDBM
You specify an update mode when creating or rebuilding a HashDBM or TreeDBM database by passing the update_mode tuning parameter to the OpenAdvanced or RebuildAdvanced method. The value can be UPDATE_IN_PLACE (the default) or UPDATE_APPENDING.
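For example, a HashDBM can be opened in the appending mode with a sketch like this, using the update_mode field of the tuning parameters:
HashDBM dbm;
HashDBM::TuningParameters params;
params.update_mode = HashDBM::UPDATE_APPENDING;
dbm.OpenAdvanced("casket.tkh", true, File::OPEN_DEFAULT, params);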
In-Place
In the in-place mode, an existing record is updated in place if possible. Let's say there are two records in the database, A (key="A", value="aaa") and B (key="B", value="bbb"). The record section in the database file looks like this. "xxx" means metadata and "..." means padding.
[xxx:A:aaa.....][xxx:B:bbb.....]
Now, the value of A is updated into "aaaaaa". The padding can contain the additional size so the record is not moved.
[xxx:A:aaaaaa..][xxx:B:bbb.....]
The value of A is updated into "aaaaaaaaa". The padding cannot contain the additional size so the record is moved. The original section becomes a free block.
[xxx:*:........][xxx:B:bbb.....][xxx:A:aaaaaaaaa.......]
The free block is registered in the free block pool so it is reused for a new record whose size can be contained in it. Let's say, a new record C (key="C", value="cccccc") is added.
[xxx:C:cccccc..][xxx:B:bbb.....][xxx:A:aaaaaaaaa.......]
If a record is removed, the region for the record becomes a free block and it is reused later.
Therefore, in the in-place mode, the size of the database doesn't increase rapidly even if records are set or deleted often. Even so, fragmentation can still gradually creep in, so you should rebuild the database occasionally.
Appending
Meanwhile, in the appending mode, we never write to the existing region, which is the region of the file where records have already been stored. There is no free block pool, and space for removed records is never reused.
Let's say the value of record A is updated from "aaa" to "aa". A new record region is appended at the end of the file even though "aa" could fit in the old space.
[xxx:A:aaa][xxx:B:bbb][xxx:A:aa]
If A is removed, a new record region to mark it as removed is appended. This removal record stores a key, but no value:
[xxx:A:aaa][xxx:B:bbb][xxx:A:aa][xxx:A:(removed)]
Similarly, if we add a new record C with value "c", we append a record for it rather than using the space of the removed record A:
[xxx:A:aaa][xxx:B:bbb][xxx:A:aa][xxx:A:(removed)][xxx:C:c]
So in appending mode, each updating operation appends one record to the file. These records are linked together in a chain from the most recent operation to the oldest operation. Each bucket of the hash table points to the record for the most recent operation on any record whose key falls in that bucket. For example, here is what it looks like when a given hash bucket has a chain of two records and we add one more new record:
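A simplified sketch, assuming the keys A, B, and C all fall in the same bucket; the bucket points to the newest record and each record links to the next older one:
before: [bucket] -> [xxx:B:bbb] -> [xxx:A:aaa]
after:  [bucket] -> [xxx:C:ccc] -> [xxx:B:bbb] -> [xxx:A:aaa]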
Notice how each new record gets prepended to the existing chain of records. So whenever HashDBM needs to search for a record with a given key (for example, to do a Get or Set operation), or iterate keys, HashDBM will encounter the most recent records first. That is how HashDBM can correctly fetch the most recent value for a key even if that key's value has been set multiple times, and correctly determine that a previously-added key has been removed.
Because one record gets appended for each updating operation, the size of the database file increases rapidly if you update a lot, even if you repeatedly update the same key. Furthermore, after many update operations, the chain of records for a given hash bucket grows longer and longer, and so that potentially slows down most Tkrzw operations.
So, it is very important to rebuild the database periodically to save space and restore good performance. Tkrzw lets you rebuild a live database (from a background thread, if you desire) while it is still in use.
By now you may be thinking that the appending mode is totally crazy. But it has two major advantages over the in-place mode: durability and updating performance. Because appending mode never modifies the existing area, the existing information is never lost. This lets us provide stronger durability guarantees in case of an operating system crash or power failure. See ACID Transaction for details.
You might think that the in-place mode is faster and that the appending mode is slower. However, that is not necessarily the case. In the in-place mode, updated "dirty" pages of the I/O buffer are scattered across the database file, so synchronizing the dirty pages with the device is inefficient. Meanwhile, in the appending mode, dirty pages occur only at the end of the file, so synchronizing them with the device is efficient. In other words, a database in the appending mode causes inefficient reading with random access but efficient writing with sequential access. Therefore, if rebuilding the database is done at a reasonable frequency, the appending mode can be faster than the in-place mode. SSD devices are good at random reading but not good at random writing. Especially if you build a huge database on SSD storage, the appending mode is worth considering.
Performance Comparison
Let's build a hash database and store 100 million records (10 million for each of 10 threads). And then retrieve and remove all of them. First, we check performance of the in-place mode.
On my environment, setting operations took 48 seconds, which meant 2.0 million QPS. Getting operations took 25 seconds, which meant 3.9 million QPS. And, removing operations took 83 seconds, which meant 1.2 million QPS. The file size was 2.9GiB. Removing operations took longer time than setting operations because removing existing records requires random access writing to reorganize bucket chains. Then, let's do the same thing in the appending mode.
This time, setting operations took 44 seconds, which meant 2.2 million QPS. Getting operations took 26 seconds, which meant 3.8 million QPS. And, removing operations took 54 seconds, which meant 1.8 million QPS. The file size was 4.5GiB. Throughput of setting and getting operations in this setting should be the same between the in-place mode and the appending mode. However, removing operations in the appending mode should be faster than the in-place mode because appending causes only sequential writing.
Tuning TreeDBM
The file tree database uses a B+ tree. Records are organized in ascending order of the key. Records at adjacent positions are chunked as a unit called a "page". The larger each page is, the better the space efficiency is, and the performance of sequential access is also better. Meanwhile, the performance of random access deteriorates if pages are too big. The default maximum page size of 8130 is suitable in most cases.
If the machine has abundant memory, increasing the max_cached_pages parameter contributes to better random access performance. By default, 10,000 pages are cached. The average page size is about 8KB, so about 80MB of memory is used for the cache system. Let's say the machine has 2GB of free memory. Then, specifying 200,000 or so is feasible.
Usually, there's no need to modify the default tuning parameters. By default, the maximum database size is 1TB because the offset width is 4 and the alignment power is 8. If the number of records exceeds 20 million, the num_buckets parameter should be increased for better performance.
To create a very small database, you'll run a command like this.
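A sketch with the tuning options of tkrzw_dbm_util might be:
tkrzw_dbm_util create --dbm tree --offset_width 3 --align_pow 0 --buckets 10 casket.tkt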
The same thing can be done in C++.
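A minimal sketch in C++, assuming the corresponding fields of TreeDBM::TuningParameters:
TreeDBM dbm;
TreeDBM::TuningParameters params;
params.offset_width = 3;
params.align_pow = 0;
params.num_buckets = 10;
dbm.OpenAdvanced("casket.tkt", true, File::OPEN_DEFAULT, params);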
To make a very large database which is accessed at random, you'll set align_pow=12 and max_page_size=4080. Due to the alignment power, every record is aligned at a multiple of 4096 bytes, which maximizes the efficiency of the underlying I/O system. As the maximum page size doesn't include the 16-byte footprint of the hash record, 4096 - 16 = 4080 is the ideal size. Note that 4080 is the maximum size, so the actual usage is less than that. However, having padding space is beneficial to mitigate movement of page positions. Increasing max_cached_pages is also better as long as the memory capacity allows for it.
Let's say, there is a leaf page containing only one small record. If the alignment is 4096 bytes, most space is filled with padding bytes.
[XXX:A:aaa............................]
Some records are inserted into the leaf page. If the total page size doesn't exceed the allocated size, the page doesn't have to be reallocated.
[XXX:A:aaa:B:bbb:C:ccc:D:ddd..........]
If a new record is added to the leaf page and the total page size exceeds the allocated size, the page is reallocated and the original position becomes a free space.
[XXX:*********************************][XXX:A:aaa:B:bbb:C:ccc:D:ddd:E:eee:F:fff:G:ggg:H:hhh:I:iii...]
However, if the maximum page size setting is less than the alignment size, the page is divided, which avoids reallocation. Half of the records are moved to a new page. Therefore, having a large alignment and setting the maximum page size a bit less than the alignment pay off in the long run.
[XXX:A:aaa:B:bbb:C:ccc:D:ddd:E:eee:...][XXX:F:fff:G:ggg:H:hhh:I:iii..........]
If the cache cannot contain all records, performance of random access cannot be comparable to the file hash database, however you tune the tree database. Thus, if you don't need ordered record access, using the file hash database is recommended. If you need ordered record access and updating is done at random, consider using the file skip database, which is more scalable.
By default, updated cached pages are just kept in memory and the updates of each page are not written to the file until the page is cached out or the database is closed. If the process or the system crashes before the page is written, the updates are lost. If you are afraid of process/system crashes but you don't want to call the Synchronize method periodically, setting the tuning parameter page_update_mode to PAGE_UPDATE_WRITE is useful. It ensures that each page is written for each update. Although setting PAGE_UPDATE_WRITE is much slower than the default setting, it is faster than calling Synchronize for each update operation. Calling Synchronize periodically (e.g. every 5 minutes) is faster than setting PAGE_UPDATE_WRITE, but less durable.
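A sketch of enabling that mode, assuming the PAGE_UPDATE_WRITE constant lives in TreeDBM like the other tuning enums:

#include "tkrzw_dbm_tree.h"

using namespace tkrzw;

int main() {
  TreeDBM dbm;
  TreeDBM::TuningParameters params;
  params.page_update_mode = TreeDBM::PAGE_UPDATE_WRITE;  // write each page on every update
  dbm.OpenAdvanced("casket.tkt", true, File::OPEN_DEFAULT, params).OrDie();
  dbm.Set("hello", "world").OrDie();
  dbm.Close().OrDie();
  return 0;
}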
By default, I/O of the file tree database is done via memory mapping. Thus, if the size of the database file exceeds the capacity of the virtual memory, the process can crash with a segmentation fault or be killed by the OOM killer. To avoid that, you can use the normal I/O system calls instead, by creating the database object with a specific file object. See the tips on File Classes for details. If the database file is much larger than the RAM on your machine, using direct I/O is the best practice. See Direct I/O for details. Compressing record data is useful to calibrate the tradeoff between time efficiency and space efficiency. See Data Compression for details.
Comparators of TreeDBM
By default, the file tree database uses a lexical comparator which compares keys in lexicographical order. This is suitable for many use cases. The following built-in comparators can be specified via the tuning parameters.
- LexicalKeyComparator : Compares keys as strings in lexicographical order.
- LexicalCaseKeyComparator : Compares keys as strings, ignoring case in ASCII code.
- DecimalKeyComparator : Compares keys as decimal integers like "1" and "99".
- HexadecimalKeyComparator : Compares keys as hexadecimal integers like "8A" and "FF3C".
- RealNumberKeyComparator : Compares keys as decimal real numbers like "2.1" and "-5.82".
- SignedBigEndianKeyComparator : Compares keys as big-endian binaries of signed integers.
- FloatBigEndianKeyComparator : Compares keys as big-endian binaries of floating-point numbers.
You can implement your own comparator whose signature is "int32_t (*)(std::string_view, std::string_view)", a.k.a. KeyComparator. In that case, you must specify it as "param.key_comparator" every time you open the database.
If you use integers as keys, the most convenient way is to set DecimalKeyComparator as the comparator and convert keys into decimal integer strings by SPrintF or ToString. To restore an integer from a decimal string, use StrToInt or std::atoi.
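A sketch of that pattern with TreeDBM (the key value and file name are arbitrary):

#include "tkrzw_dbm_tree.h"
#include "tkrzw_str_util.h"
#include <iostream>

using namespace tkrzw;

int main() {
  TreeDBM dbm;
  TreeDBM::TuningParameters params;
  params.key_comparator = DecimalKeyComparator;  // keys are compared as decimal integers
  dbm.OpenAdvanced("casket.tkt", true, File::OPEN_TRUNCATE, params).OrDie();
  dbm.Set(ToString(12345), "hello").OrDie();
  // Iterates records and restores each integer key from its decimal string form.
  auto iter = dbm.MakeIterator();
  iter->First();
  std::string key, value;
  while (iter->Get(&key, &value) == Status::SUCCESS) {
    std::cout << StrToInt(key) << ": " << value << std::endl;
    iter->Next();
  }
  dbm.Close().OrDie();
  return 0;
}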
If keys of integers are large numbers and you pursue space efficiency, representing keys as big-endian binary is a good idea. The default comparator sorts them properly. To convert keys, use IntToStrBigEndian. When you restore a big-endian binary into an integer, use StrToIntBigEndian.
If you use real numbers as keys, the most convenient way is to set RealNumberKeyComparator as the comparator and convert keys into decimal real number strings by SPrintF or ToString. To restore a real number from a decimal string, use StrToDouble or std::atof.
If keys of signed integers are large numbers, representing keys as big-endian binary is a good idea. Set SignedBigEndianKeyComparator as the comparator and use IntToStrBigEndian to convert keys. When you restore a big-endian binary into an integer, use StrToIntBigEndian and cast the return value into int64_t.
// Example helper: restores a signed integer from a big-endian key and prints it.
void PrintSignedKey(std::string_view key) {
  std::cout << static_cast<int64_t>(StrToIntBigEndian(key)) << std::endl;
}
If real-number keys are large numbers or have long fractional parts and you pursue space efficiency, representing keys as big-endian binaries is a good idea. Set FloatBigEndianKeyComparator as the comparator and use FloatToStrBigEndian to convert keys. When you restore a big-endian binary into a real number, use StrToFloatBigEndian.
By the way, databases other than the file skip database don't support duplicated keys. That's because duplicated keys deteriorate time and space efficiency. If you have to handle duplicated keys, build a skip database first and merge duplicated records. Then, you can convert it to another type of database.
Data Compression for HashDBM and TreeDBM
Data compression can be applied to the record value so that the size of the database file becomes smaller at the cost of processing time. In general, using data compression decreases performance and throughput. However, if the database file is large and I/O is the bottleneck, data compression sometimes increases performance and throughput because less data is transferred to and from the storage.
You can integrate the following compression libraries when you build Tkrzw.
- ZLib : Probably the most popular compression library, which is used in GZip. It combines good compression ratio and good processing speed.
- ZStd : Probably the most useful compression library. While its compression ratio is almost the same as ZLib's, its processing speed is much faster.
- LZ4 : A very fast compression library. Though compression ratio is worse than ZLib and ZStd, processing speed is much faster.
- LZMA : A library known for the best compression. Though compression ratio is better than ZLib and ZStd, the processing speed is much slower.
You can enable a compressor by setting RECORD_COMP_ZLIB etc. to record_comp_mode of TuningParameters. If a compressor is enabled, record data stored in the database is compressed automatically. When you retrieve the record, the data is decompressed automatically. In HashDBM, compression is done for each record and only the record value is compressed. Thus, compression is useful only if the record value is large. In TreeDBM, compression is done for each page, which includes multiple records. Thus, compression is more useful in TreeDBM.
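A sketch of enabling ZStd compression (shown here with HashDBM; TreeDBM takes the same record_comp_mode field, and the file name and record are placeholders):

#include "tkrzw_dbm_hash.h"

using namespace tkrzw;

int main() {
  HashDBM dbm;
  HashDBM::TuningParameters params;
  params.record_comp_mode = HashDBM::RECORD_COMP_ZSTD;  // record values are compressed transparently
  dbm.OpenAdvanced("casket.tkh", true, File::OPEN_TRUNCATE, params).OrDie();
  dbm.Set("key", "a long value which compresses well ...").OrDie();
  std::string value;
  dbm.Get("key", &value).OrDie();  // decompressed automatically
  dbm.Close().OrDie();
  return 0;
}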
Which algorithm is the best suited should be determined according to the data size, the data content, and the access pattern. However, ZStd is the one you should try first. If ZStd is not available on your platform, you use ZLib instead. If you prefer faster speed, try LZ4. If the database is almost read-only for archival usage, try LZMA.
To check performance, let's write and read 1 million records in HashDBM and TreeDBM. The key is an 8-byte numeric string like "00000001" and "00000002". The value is a 100-byte semi-random alphabetic string like "cvexgzibb...".
class | Hash-Set | Hash-Get | Hash Size | Tree-Set | Tree-Get | Tree Size |
---|---|---|---|---|---|---|
Raw | 1,515,645 QPS | 1,950,478 QPS | 114.63 MB | 1,399,361 QPS | 1,658,892 QPS | 106.49 MB |
ZLib | 150,205 QPS | 2,044,919 QPS | 89.03 MB | 350,346 QPS | 1,126,326 QPS | 12.72 MB |
ZStd | 293,260 QPS | 2,054,519 QPS | 97.44 MB | 1,040,141 QPS | 1,159,513 QPS | 14.84 MB |
LZ4 | 917,089 QPS | 1,975,289 QPS | 115.81 MB | 1,309,562 QPS | 1,392,015 QPS | 20.30 MB |
LZMA | 1,478 QPS | 1,829,766 QPS | 149.90 MB | 31,802 QPS | 795,360 QPS | 13.58 MB |
Data Encryption for HashDBM and TreeDBM
Data encryption is also supported in the same mechanism as data compression. As with the compressors, HashDBM encrypts only the value of each record and TreeDBM encrypts the keys and values of multiple records in each page. Therefore, if you want to encrypt the key too, you should use TreeDBM.
Available algorithms are RC4 and AES. Both use a private key and an initialization vector (IV) to encrypt each record. You set the private key when you open a database. A unique IV is generated for each record and stored with the record data. RC4 (a.k.a. ArcFour) is a stream cipher. It uses a 6-byte IV. AES (a.k.a. Rijndael) is a block cipher. It uses a 16-byte IV and padding data. In terms of space efficiency, RC4 is better than AES. In terms of time efficiency, it depends: if the length of the data to encrypt is longer than 3000 bytes, RC4 is faster. In terms of cipher strength, AES is better.
You can enable a cipher by setting RECORD_COMP_RC4 or RECORD_COMP_AES to record_comp_mode of TuningParameters. You should also set a cipher key to cipher_key of TuningParameters. If a cipher is enabled, record data stored in the database is encrypted automatically. When you retrieve the record, the data is decrypted automatically. Note that the cipher key is not stored in the database, so you must set the cipher key every time you open the database.
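A sketch of enabling AES encryption on a hash database (the key string, file name, and record are placeholders; remember the cipher key must be supplied on every open):

#include "tkrzw_dbm_hash.h"

using namespace tkrzw;

int main() {
  HashDBM dbm;
  HashDBM::TuningParameters params;
  params.record_comp_mode = HashDBM::RECORD_COMP_AES;  // encrypt each record value
  params.cipher_key = "0123456789abcdef";              // private key, not stored in the file
  dbm.OpenAdvanced("casket.tkh", true, File::OPEN_TRUNCATE, params).OrDie();
  dbm.Set("account", "secret balance").OrDie();
  dbm.Close().OrDie();
  return 0;
}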
To check performance, let's write and read 1 million records in HashDBM and TreeDBM. The key is an 8-byte numeric string like "00000001" and "00000002". The value is a 100-byte semi-random alphabetic string like "cvexgzibb...".
class | Hash-Set | Hash-Get | Hash Size | Tree-Set | Tree-Get | Tree Size |
---|---|---|---|---|---|---|
Raw | 1,515,645 QPS | 1,950,478 QPS | 114.63 MB | 1,399,361 QPS | 1,658,892 QPS | 106.49 MB |
RC4 | 342,577 QPS | 2,041,478 QPS | 120.35 MB | 1,100,026 QPS | 1,187,321 QPS | 106.65 MB |
AES | 915,297 QPS | 2,055,004 QPS | 143.24 MB | 1,001,877 QPS | 1,045,518 QPS | 107.29 MB |
Tuning SkipDBM
The file skip database uses a skip list based on sorted records. Whereas its time efficiency is worse than that of the hash table, its space efficiency is better. Typically, the footprint for each record is less than 5 bytes. Usually, there's no need to modify the default tuning parameters. If the size of the database can exceed 4GB, the offset width parameter should be set to 5. If the number of records is more than 1 billion, the maximum level parameter should be set to 15 or more. If the main usage of the database is sequential access, the step unit can be 16 and the maximum level can be 8.
To create a very small database, you'll run a command like this.
The same thing can be done in C++.
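A sketch of what that C++ code might look like (the parameter values are illustrative small ones, and the file name is a placeholder):

#include "tkrzw_dbm_skip.h"

using namespace tkrzw;

int main() {
  SkipDBM dbm;
  SkipDBM::TuningParameters params;
  params.offset_width = 3;  // enough for a tiny file
  params.step_unit = 2;
  params.max_level = 4;
  dbm.OpenAdvanced("casket.tks", true, File::OPEN_TRUNCATE, params).OrDie();
  dbm.Set("hello", "world").OrDie();
  dbm.Synchronize(false).OrDie();  // records become visible after synchronization
  dbm.Close().OrDie();
  return 0;
}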
To make a very large database, you'll set offset_width=5, step_unit=8, max_level=12, and sort_mem_size=2GB. Increasing the step_unit parameter reduces frequency of random access. Increasing the sort_mem_size parameter is a good idea because the default value is 256MB.
If you want to optimize the time/space efficiency by all means, rebuilding the database might help because optimal parameters are set implicitly.
The same thing can be done in C++. The RebuildAdvanced method can take tuning parameters so that you can specify them explicitly.
Rebuilding the database blocks other threads completely. In the first place, the file skip database is not suitable for random updates. Use the file hash database for that purpose. The skip list database is suitable for batch processing.
Although the file skip database uses considerable resources to build the database file, the space efficiency of the resulting file is the best. Therefore, it is practical to build a large database on a high-performance machine and distribute it to low-performance machines where data retrieval is done.
You can improve performance of data retrieval on the file skip database by increasing the maximum number of cached records. By default, 65536 records are cached. The cached records are the nodes whose indices are multiples of the step unit or its powers. You can set the parameter like this.
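A sketch of raising the cache size via the tuning parameters (the value and file name are arbitrary):

#include "tkrzw_dbm_skip.h"
#include <string>

using namespace tkrzw;

int main() {
  SkipDBM dbm;
  SkipDBM::TuningParameters params;
  params.max_cached_records = 1 << 20;  // cache about a million nodes instead of 65536
  dbm.OpenAdvanced("casket.tks", false, File::OPEN_DEFAULT, params).OrDie();
  std::string value;
  dbm.Get("hello", &value);
  dbm.Close().OrDie();
  return 0;
}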
By default, I/O of the file skip database is done via memory mapping. Thus, if the size of the database file exceeds the capacity of the virtual memory, the process can crash with a segmentation fault or be killed by the OOM killer. To avoid that, you can use the normal I/O system calls instead, by creating the database object with a specific file object. See the tips on File Classes for details.
Building SkipDBM
The skip database is composed of records sorted by the key. Although you can insert records randomly, they are not visible until you call the Synchronize() method, which sorts records implicitly and merges them with existing records of the database. In other words, updating the skip database is done in an offline (batch) manner, not in an online manner. If you already have records which are sorted in ascending order of the key, you can use the insert_in_order mode, which is very quick and scalable. You can input multiple records of the same key, and the order of insertion within records of the same key is preserved in the database. Note that the order of records of different keys must be strictly consistent with std::less<std::string> if you use the insert_in_order mode. In contrast, if your records are not sorted, you use the default mode, which sorts the records implicitly with merge sort on temporary files. The reason for using temporary files is to be able to build a huge database exceeding the memory capacity. You can input multiple records with the same key here too. The order within records of the same key is preserved during merge sort because it is a stable sort.
You can specify an arbitrary reducer to handle records of the same key. By default, no reducer is applied and all records are passed through. Whereas the following built-in reducers are bundled, you can implement your own reducers too.
- ReduceToFirst : Keeps the first record.
- ReduceToSecond : Keeps the second record if it exists; otherwise, keeps the first record.
- ReduceToLast : Keeps the last record.
- ReduceConcat : Concatenates all values and makes a single record.
- ReduceConcatWithNull : Concatenates all values and makes a single record, using the null code as the separator.
- ReduceConcatWithTab : Concatenates all values and makes a single record, using the tab as the separator.
- ReduceConcatWithLine : Concatenates all values and makes a single record, using the linefeed as the separator.
- ReduceToTotal : Totals the numeric values and makes a single record.
If the input records have multiple records of the same key and you want to keep the first one, you'll finish the database with ReduceToFirst. The following code keeps records of {"foo":"first"} and {"bar":"primero"} only.
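A sketch of that finishing step, assuming SynchronizeAdvanced takes a reducer as its third argument as in the word-count example later in this section (the inserted records follow the description, and the file name is a placeholder):

#include "tkrzw_dbm_skip.h"

using namespace tkrzw;

int main() {
  SkipDBM dbm;
  dbm.Open("casket.tks", true, File::OPEN_TRUNCATE).OrDie();
  // Multiple records of the same key coexist until synchronization.
  dbm.Set("foo", "first").OrDie();
  dbm.Set("foo", "second").OrDie();
  dbm.Set("foo", "third").OrDie();
  dbm.Set("bar", "primero").OrDie();
  dbm.Set("bar", "segundo").OrDie();
  // Keep only the first record of each key.
  dbm.SynchronizeAdvanced(false, nullptr, SkipDBM::ReduceToFirst).OrDie();
  dbm.Close().OrDie();
  return 0;
}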
In principle, the file skip database should be considered a "static" database which is not updated after being built. However, you can modify it with the usual DBM methods. You can insert arbitrary records, which are put after existing records of the same keys. Then, you can reduce them when calling the Synchronize method. If you want the latest value to overwrite the old value, you'll synchronize the database with ReduceToLast. Assuming we use the above finished database, the following code keeps records of {"foo":"fifth"} and {"bar":"segundo"} only.
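A sketch of that update pass, reopening the database built above (the path casket.tks and the inserted values are illustrative):

#include "tkrzw_dbm_skip.h"

using namespace tkrzw;

int main() {
  SkipDBM dbm;
  dbm.Open("casket.tks", true).OrDie();
  // New records are appended after the existing records of the same keys.
  dbm.Set("foo", "fifth").OrDie();
  dbm.Set("bar", "segundo").OrDie();
  // Keep only the last record of each key, so the new values win.
  dbm.SynchronizeAdvanced(false, nullptr, SkipDBM::ReduceToLast).OrDie();
  dbm.Close().OrDie();
  return 0;
}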
You can insert new data, but how do you remove the existing data? It is implemented as insertion of a special value. If you call the Remove method, a special value called REMOVING_VALUE is inserted. When synchronizing the database, if the special value is detected, records of the same key before the special value are removed. Then, the reducer is applied if any. The following code removes records of "foo" but doesn't remove records of "bar".
Note that synchronizing the database requires rebuilding the entire database even if only one record is modified. Therefore, updating records should be done in a batch processing style, where you do every operation and then synchronize the database. Note again that you cannot access new records until you call the Synchronize method.
You can merge multiple skip database files after building them separately. You can apply a reducer there. If you have to aggregate a large amount of data, building partial results as separate database files and merging them afterward is a practical way. You can do it in C++ by calling the MergeSkipDatabase method. You can do the same thing with a command like this.
The "merge" subcommand of tkrzw_dbm_util is useful to apply a reducer even if you specify no source databases to merge.
SkipDBM as a MapReduce
You can use SkipDBM as a framework for MapReduce. Map means an operation to generate and classify records from the given data source. Reduce means an operation to process records for each class and store the result. Between Map and Reduce, the framework does a merge sort to organize records for each class. Reading the data source and adding records to the database constitute the Map operation. And, the reducer applied when finishing the database represents the Reduce operation. The following code implements word counting.
// Opens the source database of documents and the destination database for counts
// (the paths are examples).
SkipDBM text_dbm;
text_dbm.Open("text.tks", false).OrDie();
SkipDBM count_dbm;
count_dbm.Open("count.tks", true, File::OPEN_TRUNCATE).OrDie();
// Map: reads each document, splits it into words with an application-defined
// GetWords helper, and adds one record per occurrence.
auto iter = text_dbm.MakeIterator();
iter->First();
std::string doc_id, text;
while (iter->Get(&doc_id, &text) == Status::SUCCESS) {
  const std::vector<std::string> words = GetWords(text);
  for (const auto& word : words) {
    count_dbm.Set(word, "1");
  }
  iter->Next();
}
// Reduce: merge sort by the word and total the occurrences of each word.
count_dbm.SynchronizeAdvanced(false, nullptr, SkipDBM::ReduceToTotal);
count_dbm.Close();
text_dbm.Close();
Tuning TinyDBM, BabyDBM, and CacheDBM
On-memory databases are convenient to manage a large number of objects while saving as much memory as possible. Objects to be stored must be serialized as strings. Thus, choosing an efficient serialization format is important. Of all on-memory databases, TinyDBM is the best in time efficiency. BabyDBM is the best in space efficiency, and it also supports ordered access of records. CacheDBM is suitable for handling cache data where old records are discarded implicitly.
TinyDBM has a tuning parameter for the number of buckets. The number of buckets should be the same as or more than the number of stored records. It is set with the constructor.
BabyDBM has a tuning parameter for the comparison function of keys. The default comparison function is a lexical comparator, which is also usable for integers serialized as big-endian binary strings. It is set with the constructor.
CacheDBM has two tuning parameters for the capacity: the maximum number of records and the maximum memory usage. The maximum number of records is necessary because it is used to determine the number of buckets. Therefore, setting too large a value is not good for space efficiency. The maximum memory usage is used as an insurance to ensure that memory usage doesn't exceed the specified value.
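A sketch of those constructor parameters (the numbers are arbitrary, and the exact constructor signatures, in particular CacheDBM taking both capacity limits, are assumptions to check against the API reference):

#include "tkrzw_dbm_tiny.h"
#include "tkrzw_dbm_baby.h"
#include "tkrzw_dbm_cache.h"
#include "tkrzw_key_comparators.h"

using namespace tkrzw;

int main() {
  // Bucket count should be at least the number of expected records.
  TinyDBM tiny_dbm(1000000);
  // The key comparator is given to the constructor.
  BabyDBM baby_dbm(DecimalKeyComparator);
  // Capacity by record count and by total memory usage.
  CacheDBM cache_dbm(1000000, 256LL << 20);
  tiny_dbm.Set("one", "first");
  baby_dbm.Set("2", "second");
  cache_dbm.Set("three", "third");
  return 0;
}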
If you use integers as keys and/or values, using functions IntToStrBigEndian and StrToIntBigEndian as a serializer and a deserializer is a good idea for space efficiency. This technique is usable for file databases too.
DBM::Process
One of the nicest features of Tkrzw is the Process method, shared by all database classes, which lets your application read and modify a single record of the database in a way that is atomic to all other threads. So it will never be possible for another thread to sneak in and change the record value between when your thread reads it and modifies it. This is one example of Atomicity and Isolation in the ACID traits.
More specifically, when you call Process, you provide a key and a "writeable" boolean indicating whether or not you want to modify/add records with that key. Multiple threads can all get read access (writeable==false) at the same time. Only one thread can have write access (writeable==true), and while a thread has write access, it blocks all reader threads. Most of the other DBM methods like Get, Set and Remove use Process under the covers, so they also acquire the appropriate read or write lock.
Locking is based on keys, not records. So when you acquire a read or write lock on a key, you prevent any other thread from modifying the record with that key or creating a new record with that key or removing a record with that key. When you acquire a write lock on a key, you also gain the right to modify the record with that key or create a new record with that key or remove a record with that key.
Different database classes implement key locking with different granularities. For example, HashDBM internally has a bank of 256 locks and keys get grouped together and guarded with these locks, whereas StdHashDBM internally has a single lock for all keys in the database. These internal differences will not affect the correctness of your application, but they may impact its performance. See the individual section for each database type to learn more.
Process takes a callback that Tkrzw calls with the record lock held so that you may perform the atomic operation. For that callback you can pass either an instance of a subclass of DBM::RecordProcessor or a lambda function. A subclass of DBM::RecordProcessor overrides the ProcessFull and ProcessEmpty methods.
- If the database has an existing record for the key passed to Process, Tkrzw calls your ProcessFull(key, value) with the key and the existing value, and you can either:
- return DBM::RecordProcessor::NOOP to make no change to the database, or
- return DBM::RecordProcessor::REMOVE to remove that record from the database, or
- return a new record value to modify that record in the database.
- If the database has no existing record for the key passed to Process, Tkrzw calls your ProcessEmpty(key), and you can either:
- return DBM::RecordProcessor::NOOP or DBM::RecordProcessor::REMOVE to make no change to the database, or
- return a new record value to add that record to the database.
The Process method can also take a lambda function, whose signature is like the following.
- If the database has an existing record for the key passed to Process, Tkrzw calls your lambda with the key and the existing value, and you can either:
- return DBM::RecordProcessor::NOOP to make no change to the database, or
- return DBM::RecordProcessor::REMOVE to remove that record from the database, or
- return a new record value to modify that record in the database.
- If the database has no existing record for the key passed to Process, Tkrzw calls your lambda with the key and a special value of string_view where value.data() == DBM::RecordProcessor::NOOP.data(), and you can either:
- return DBM::RecordProcessor::NOOP or DBM::RecordProcessor::REMOVE to make no change to the database, or
- return a new record value to add that record to the database.
When testing "value" passed to your callback against NOOP, make sure that you compare the pointers with the code "value.data() == DBM::RecordProcessor::NOOP.data()" rather than the incorrect code "value == DBM::RecordProcessor::NOOP", which would only compare the string values, or the incorrect code "&value == &NOOP", which would compare the object addresses. The uniqueness of the special NOOP value comes from the value of its NOOP.data() pointer. This applies to all callbacks for Process, ProcessMulti, ProcessFirst, and ProcessEach.
When returning a new value from your callbacks including lambda ones, remember that you are returning a std::string_view, and so the memory to which the string_view.data() points must remain valid throughout Process. This often means you must return a std::string_view that points to a member of your DBM::RecordProcessor subclass or a variable in a higher scope captured by reference in your lambda.
The following example has a lambda callback that works as an access counter. It takes an arbitrary ID string, increments the record value for it, saves the value as a decimal numeric string, and then returns the current access count. Note the check of NOOP by the data() pointer, and note how the returned string is declared outside the lambda so it lasts until Process finishes, as explained above.
// Increments the access count of the given ID and returns the current count.
int64_t CountUp(DBM* dbm, std::string_view id) {
  int64_t count = 0;
  std::string new_value;
  dbm->Process(id,
      [&](std::string_view key, std::string_view value) -> std::string_view {
        if (value.data() == DBM::RecordProcessor::NOOP.data()) {
          count = 1;
          return "1";
        }
        count = StrToInt(value) + 1;
        new_value = ToString(count);
        return new_value;
      }, true).OrDie();
  return count;
}
If your callback needs to spend a substantial amount of time between when it reads the existing database value and when it calculates the correct return value (NOOP, REMOVE, or a new value), this forces you to potentially degrade the performance of other threads, because your callback runs while Tkrzw is holding a lock on that record (and possibly other records) in the database. For many applications, there is a clever way around this dilemma called CompareExchange.
DBM::ProcessFirst
If you want to do an atomic operation on the first record without checking the key beforehand, the ProcessFirst method is useful. Using ProcessFirst is more efficient than creating an external iterator and calling its First and Process methods in sequence.
As with the Iterator class, ProcessFirst can be slow if it is used with unordered databases (HashDBM, TinyDBM, CacheDBM, StdHashDBM) because finding the first record might scan the whole hash buckets. Meanwhile, Iterator and ProcessFirst are effective with ordered databases (TreeDBM, SkipDBM, BabyDBM, StdTreeDBM).
If you want to use the database as a queue (FIFO), you can use TreeDBM or BabyDBM and their PushLast and PopFirst methods. PushLast takes an arbitrary record value and generates its key from the current timestamp. Then, a record of the generated key and the given value is stored. Deduplication of the key is done implicitly. Thus, the enqueueing operation is written like the following.
// Example enqueue helper; the key is generated from the current timestamp.
void Enqueue(DBM* dbm, std::string_view value) {
  dbm->PushLast(value);
}
PopFirst is implemented based on ProcessFirst. It retrieves the first record and removes it atomically. Thus, the dequeueing operation is written like the following.
// Example dequeue helper; retrieves the value of the first record and removes it atomically.
Status Dequeue(DBM* dbm, std::string* value) {
  return dbm->PopFirst(nullptr, value);
}
DBM::ProcessEach
The ProcessEach method is a way to make bulk changes to the database atomically. It works as the following procedure.
- locks the entire database so no other thread can read or write any record.
- calls your callback once with a special sentinel value, giving you a chance to do preprocessing. For a DBM::RecordProcessor subclass, the sentinel is ProcessEmpty(key) with key.data() == DBM::RecordProcessor::NOOP.data(). For a lambda, the sentinel is func(key, value) with both key.data() == DBM::RecordProcessor::NOOP.data() and value.data() == DBM::RecordProcessor::NOOP.data().
- calls your callback once for every existing record in the database, just as with Process. So you will either get ProcessFull() or lambda func(key,value) with non-NOOP key and value.
- calls your callback one more time with the special sentinel value again, giving you a chance to do postprocessing.
- unlocks the entire database.
The following function multiplies the values of all records by a given factor. If the resulting value of a record is less than 1, the record is removed.
// Multiplies the value of every record by the given factor.
// Records whose new value is less than 1 are removed.
void MultiplyValues(DBM* dbm, double factor) {
  std::string new_value;
  dbm->ProcessEach(
      [&](std::string_view key, std::string_view value) -> std::string_view {
        if (value.data() == DBM::RecordProcessor::NOOP.data()) {
          return DBM::RecordProcessor::NOOP;
        }
        const int64_t count = StrToInt(value) * factor;
        if (count < 1) {
          return DBM::RecordProcessor::REMOVE;
        }
        new_value = ToString(count);
        return new_value;
      }, true).OrDie();
}
The Iterator class of each DBM type also has the Process method, and it can also take a DBM::RecordProcessor or lambda function as explained above. If you want to modify a specific record in an atomic manner, use DBM::Process. If you want to modify some or all records, each in an atomic manner (but not atomically as a whole), use DBM::Iterator::Process. If you want to access every record atomically and modify some of them, use DBM::ProcessEach. If you need to modify multiple records as a batch/transaction in an atomic manner, use DBM::ProcessMulti, which is described below.
DBM::ProcessMulti
Transactional operations on multiple records can be done with the ProcessMulti method. It takes multiple record keys and multiple callback functions to operate on the records. As with the Process method, you can use lambda functions as callbacks. All records specified by the keys are locked while all callback functions are executed. Thus, other threads don't observe transitional states of the transaction. ProcessMulti works as follows.
- It acquires read or write locks on all of your specified keys in the database in exactly the same way we described above for Process. So this prevents other threads from modifying, adding, or removing any records with those keys. If writeable==true, you have the right to modify, add, or remove records with those keys yourself.
- Then it calls your callbacks in sequence, one per key.
- Then releases the locks on all the records at once, so other threads will see all the records change state atomically rather than seeing any intermediate states of records.
Let's assume that you manage bank accounts using a file hash database. You transfer 1000 from account A to account B. This transaction is composed of these operations.
- Confirm that the account B exists.
- Evaluate the balance of account A and confirm it is not less than 1000.
- Decrease the balance of account A by 1000.
- Increase the balance of account B by 1000.
To keep the consistency as a financial system, all operations must be done atomically. You can implement this money transfer function as the following.
// Transfers the given amount from the source account to the destination account.
Status TransferMoney(DBM* dbm, std::string_view src_key,
                     std::string_view dest_key, int64_t amount) {
  Status op_status(Status::SUCCESS);
  // Callback to check the existence of the destination account.
  auto check_dest =
      [&](std::string_view key, std::string_view value) -> std::string_view {
if (value.data() == DBM::RecordProcessor::NOOP.data()) {
op_status.Set(Status::NOT_FOUND_ERROR, "no such destination account");
}
return DBM::RecordProcessor::NOOP;
};
// Callback to operate the source account.
std::string new_src_value;
auto proc_src =
[&](std::string_view key, std::string_view value) -> std::string_view {
if (!op_status.IsOK()) {
return DBM::RecordProcessor::NOOP;
}
if (value.data() == DBM::RecordProcessor::NOOP.data()) {
op_status.Set(Status::NOT_FOUND_ERROR, "no such source account");
return DBM::RecordProcessor::NOOP;
}
const int64_t current_balance = StrToInt(value);
if (current_balance < amount) {
op_status.Set(Status::INFEASIBLE_ERROR, "insufficient balance");
return DBM::RecordProcessor::NOOP;
}
new_src_value = ToString(current_balance - amount);
return new_src_value;
};
// Callback to operate the destination account.
std::string new_dest_value;
auto proc_dest =
[&](std::string_view key, std::string_view value) -> std::string_view {
if (!op_status.IsOK()) {
return DBM::RecordProcessor::NOOP;
}
if (value.data() == DBM::RecordProcessor::NOOP.data()) {
op_status.Set(Status::APPLICATION_ERROR, "inconsistent transaction");
return DBM::RecordProcessor::NOOP;
}
const int64_t current_balance = StrToInt(value);
new_dest_value = ToString(current_balance + amount);
return new_dest_value;
};
// Invokes the three callbacks in sequence atomically.
const Status status = dbm->ProcessMulti({
{dest_key, check_dest},
{src_key, proc_src},
{dest_key, proc_dest}}, true);
if (!status.IsOK()) {
return status;
}
return op_status;
}
If your callbacks need to spend a substantial amount of time between when they read the existing database values and when they calculate the correct return values, this forces you to potentially degrade the performance of other threads, because your callbacks run while Tkrzw is holding locks on one or more records in the database. For many applications, there is a clever way around this dilemma called CompareExchangeMulti.
DBM::CompareExchange and DBM::CompareExchangeMulti
Callbacks of the Process or ProcessMulti method shouldn't spend much time because they hold locks of the target records and prevent any other threads from making progress. However, doing the same thing with separate method calls is not acceptable because another thread may simultaneously change the values that you are processing. To get out of this dilemma, you can often use the extremely clever CompareExchange and CompareExchangeMulti methods.
The key insight is that, for many applications, if you read the value of a record, do some lengthy processing of that record, and then at the moment you write the result, you find that someone else has modified the value of the original record, you can simply re-run the lengthy process again with the new value, or (for some applications) just forget about running the lengthy process because it is no longer needed. You only actually need atomic behavior at the very last step, where you are deciding to commit your processed data or not based on whether or not the value has changed since the beginning of the process.
CompareExchange and CompareExchangeMulti are wrappers around Process and ProcessMulti which help you accomplish this. The procedure is as follows.
- first read the value of the records you want (using Get(); no special locking needed at this stage)
- then process the value (the lengthy part)
- then call CompareExchange or CompareExchangeMulti, passing in the old record value(s) that you worked on, and the new record value(s) that you computed. CompareExchange and CompareExchangeMulti atomically check that the record(s) still have the same old value(s), and if so they set your new value(s) and return success. If not (if the value(s) changed while you were processing), CompareExchange and CompareExchangeMulti fail so you can try again.
In this example, you read the current balances of the accounts, calculate the desired balances after the transaction, and then call CompareExchangeMulti so that it only succeeds if the balances have not changed. You repeat the transaction until it succeeds.
// Transfers the given amount from the source account to the destination account,
// retrying when the balances change during the computation.
Status TransferMoney(DBM* dbm, std::string_view src_key,
                     std::string_view dest_key, int64_t amount) {
  constexpr int32_t kMaxRetries = 100;
  for (int32_t retry = 0; retry < kMaxRetries; retry++) {
    // Gets the balance of the source account.
    std::string src_value;
    Status status = dbm->Get(src_key, &src_value);
if (!status.IsOK()) {
return status;
}
const int64_t src_balance = StrToInt(src_value);
if (src_balance < amount) {
return Status(Status::INFEASIBLE_ERROR, "insufficient balance");
}
// Gets the balance of the destination account.
std::string dest_value;
status = dbm->Get(dest_key, &dest_value);
if (!status.IsOK()) {
return status;
}
const int64_t dest_balance = StrToInt(dest_value);
// Generate new values for the new balances.
const std::string src_new_value = ToString(src_balance - amount);
const std::string dest_new_value = ToString(dest_balance + amount);
// Finish the transaction atomically if the balances are not modified.
const std::vector<std::pair<std::string_view, std::string_view>> expected =
{{src_key, src_value}, {dest_key, dest_value}};
const std::vector<std::pair<std::string_view, std::string_view>> desired =
{{src_key, src_new_value}, {dest_key, dest_new_value}};
status = dbm->CompareExchangeMulti(expected, desired);
if (status.IsOK()) {
return Status(Status::SUCCESS);
}
if (status != Status::INFEASIBLE_ERROR) {
return status;
}
}
return Status(Status::INFEASIBLE_ERROR, "too busy");
}
Notification
In multi-thread programming, you sometimes want to use notification features. The SignalBroker class is useful to notify other threads of generic updates. The "producer" threads and the "consumer" threads share an instance of SignalBroker. The producer threads call its Send method after they update the database.
PushLast("hello");
broker->Send();
}
]]>
The consumer threads check the database first. If the checked data doesn't satisfy the precondition of the succeeding operation, they wait for the next database update and its signal. To prevent the signal from being sent between checking the current state and starting to wait, the Waiter object should be created before checking the current state.
// Consumer: waits until a record can be popped from the queue.
void Consume(DBM* dbm, SignalBroker* broker) {
  while (true) {
    SignalBroker::Waiter waiter(broker);
    std::string value;
    if (dbm->PopFirst(nullptr, &value) == Status::SUCCESS) {
      std::cout << value << std::endl;
      break;
    }
    waiter.Wait();
  }
}
If you want to use signals associated with arbitrary identifiers, you can use SlottedKeySignalBroker. The Send method takes a key and wakes up threads which are waiting for the same key.
// Producer: sets a record and sends a signal associated with its key.
void Produce(DBM* dbm, SlottedKeySignalBroker<std::string>* broker) {
  dbm->Set("greeting", "hello");
  broker->Send("greeting");
}
As in the SignalBroker case, the consumer uses a Waiter object to make a critical section to check the database and wait for the next update.
// Consumer: waits until the "greeting" record appears.
void Consume(DBM* dbm, SlottedKeySignalBroker<std::string>* broker) {
  while (true) {
    SlottedKeySignalBroker<std::string>::Waiter waiter(broker, "greeting");
    std::string value;
    if (dbm->Get("greeting", &value) == Status::SUCCESS) {
      std::cout << value << std::endl;
      break;
    }
    waiter.Wait();
  }
}
Using SignalBroker and SlottedKeySignalBroker is better in terms of concurrency performance than using std::mutex and std::condition_variable. SignalBroker and SlottedKeySignalBroker don't cause exclusive locking for the database operations but use shared locking. Moreover, the Wait method of the Waiter class takes an optional timeout parameter so that you can use the long polling idiom.
Restoring Broken Databases
An opened database should be closed properly. Otherwise, the database file might be broken. When a database is opened, a field in the metadata section is marked as "opened". When the database is closed, the field is marked as "closed". If the process opening a database crashes without closing the database, the field is still marked as "opened". When the database file is opened the next time, the library detects the past incident and the database is considered "unhealthy".
By default, an unhealthy database is restored automatically when you open the database. The whole database file is scanned and as many records as possible are salvaged. This behavior can be changed by setting the restore_mode tuning parameter. The default value is RESTORE_DEFAULT, which causes a best-effort restore. If it is RESTORE_SYNC, the database is restored to the state at the time when the Synchronize method was last called. If it is RESTORE_READ_ONLY, no restore operations are done and the database becomes read-only to avoid further inconsistency. If it is RESTORE_NOOP, the database is treated as healthy without any restore operation.
You can add optional settings to RESTORE_DEFAULT or RESTORE_SYNC by bit-or operation, like RESTORE_SYNC | RESTORE_NO_SHORTCUTS. By default, the following shortcuts are checked and applied before doing the full-fledged restoration. The RESTORE_NO_SHORTCUTS option suppresses them.
- #1 : If the following condition is satisfied, the database is considered healthy and restoration is skipped.
- The update mode is the appending mode.
- The actual file size is the same as the metadata's file size.
- The validation check of the bucket section passes.
- #2 : If the following condition is satisfied, the database is considered restored without doing actual restoration.
- The update mode is the in-place mode.
- The actual file size is equal to or larger than the metadata's file size.
- The validation check of the bucket section passes.
- The validation check of the entire record section passes.
- #3 : If the following condition is satisfied, the database is considered restored without doing actual restoration.
- The update mode is the appending mode.
- The actual file size is equal to or larger than the metadata's file size.
- The validation check of the bucket section passes.
- The validation check of the record section after the metadata's file size passes.
If the database is in the appending update mode and you call the Synchronize method periodically, it is likely that the third shortcut is applied. And, it finishes very quickly. That's another reason why the appending mode is suitable for online services.
Another option is RESTORE_WITH_HARDSYNC, which synchronizes temporary data during the restoration operation. You set the option if you are concerned about another system crash during the restoration operation.
You can check whether a database is healthy or not by a command like this.
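A sketch with the bundled command line utility (the inspect subcommand name is an assumption to verify against the tool's help; the output includes the health flag and the metadata described below):

$ tkrzw_dbm_util inspect casket.tkh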
"num_records", "file_size", and "mod_time" represent information when the database was closed successfully (or the Synchronize method is called) the last time. So, even if you add many records before the last crash, the metadata doesn't reflect them. The restore operation recompute the metadata from salvaged records.
In contrast to the default API setting, the command line tool opens the database file in the RESTORE_READ_ONLY mode, so the file is not restored implicitly. If you want to do the restore operation explicitly, you'll run a command like this.
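A sketch, assuming the restore subcommand takes the broken file and the path of the new restored file:

$ tkrzw_dbm_util restore casket.tkh casket-restored.tkh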
If you want to set tuning parameters to the new restored database, you can prepare the new database with the tuning parameters and set it as the destination. This operation can be done even for a healthy database to rebuild it.
The same thing can be done in C++. HashDBM, TreeDBM, and SkipDBM have the RestoreDatabase method. PolyDBM and ShardDBM also provide wrapper functions.
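A sketch of the C++ call, assuming RestoreDatabase is a static method taking the old path, the new path, and an optional end offset as used by the command line tool:

#include "tkrzw_dbm_hash.h"

using namespace tkrzw;

int main() {
  // Salvages as many records as possible from the broken file into a new one.
  const Status status =
      HashDBM::RestoreDatabase("casket.tkh", "casket-restored.tkh", -1);
  return status.IsOK() ? 0 : 1;
}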
To restore sharded database files like (casket.tkh-00000-of-00002 and casket.tkh-00001-of-00002), you'll run a command like this.
HashDBM and TreeDBM support the appending update mode. Databases in the appending mode can utilize an interesting feature called snapshot reproduction. As all updating operations are recorded in the database, you can consider the database as a list of REDO logs. You can replay them up to a certain moment. And, that moment is represented as the size of the database file at the time. Given that you know the file size at a moment by calling the GetFileSize method, you can restore the database to the state at that moment.
You'll check the behavior by commands like this.
If the value of --end_offset is 0, the database is restored to the state at the time when the Synchronize method was last called. If the value of --end_offset is -1, as many records as possible are salvaged.
ACID Transaction
In the context of databases, a transaction is a concept of executing one or more database operations in a meaningful and reliable manner. We expect transactions to have the ACID traits. Tkrzw's transactional features are designed to achieve the ACID traits. We should consider transactions in two aspects: transactions across threads and transactions across crashes.
Process, ProcessFirst, ProcessEach, ProcessMulti, CompareExchange, and CompareExchangeMulti assure Atomicity and Isolation. Specifically, they assure that other threads reading the same database will either observe no changes to any of the records, or they will observe the complete set of changes to all of the records involved in the transaction, regardless of the relative timing of all threads in the process. The business logic that you write inside your callbacks for those methods assure the Consistency of the transaction, another important aspect of ACID.
File database implementations provide Durability across crashes as well as the other ACI traits. If the application or system crashes in the middle of a transaction, the database will roll back cleanly to the state before the transaction on the next Open. Durability of a transaction is guaranteed only when the next Synchronize or Close method finishes successfully. In this sense, on-memory databases associated with files also provide the same level of durability.
Unlike some other databases, Tkrzw doesn't have an explicit rollback operation to cancel a set of previously requested updating operations. Once you call Set, Remove, or once your callback of the Process returns a new value or REMOVE, there is no way to take the operation back. Usually, checking the precondition in ProcessMulti or before CompareExchangeMulti makes rollback unnecessary. If you want rollback features, you have to record "undo" transaction logs in another database and apply them explicitly by yourself. If the database is HashDBM or TreeDBM and it is in the appending update mode, you can use the snapshot reproduction feature that lets you rollback your database to a specific point in its transaction history.
The Synchronize method is done within an exclusive lock against other threads. Thus, atomicity of transactional operations on multiple records is protected as long as you use ProcessMulti or CompareExchangeMulti.
There is a lot of misunderstanding about durability because "durability" is not a single yes-or-no, or even single-valued, concept. Different "durability" configurations of hardware and software can offer you different guarantees, or probabilities, that your data will not be damaged in specific threat situations. We consider the following situations as typical threats. The lower items are more severe and include all risks of the upper items.
- #1 : The process finishes without closing the database.
- The metadata is not written into the file.
- Record data cached in the process is not written into the file.
- #2 : The process crashes suddenly.
- Multiple write operations for one record updating can stop in the middle.
- One write operation can stop in the middle.
- #3 : The operating system crashes suddenly.
- A part of dirty pages is not written into the device.
- Inconsistent data blocks can be written into the device.
- #4 : The hardware breaks down.
- Anything unpredictable can happen.
- The entire file can be lost.
To deal with hardware breakage (#4), you should consider data replication services or distributed computing, which are not in the scope of Tkrzw's features. Tkrzw provides features for ACID traits across crashes of the operating system (#3) and the milder situations above it.
There is often a tradeoff between performance (CPU, memory, disk space) and durability. For example, by adding synchronization points with the Synchronize method at strategic points, we can often gain certain durability benefits, but at some performance cost. Or, by appending new changes to the end of the file with the appending update mode of HashDBM/TreeDBM, we can gain certain durability benefits, but at the cost of extra disk space usage and fragmentation. In its default configuration, Tkrzw always favors performance. But Tkrzw offers a host of options to trade off durability for performance depending on your needs.
Settings for Higher Durability
HashDBM and TreeDBM give you higher durability with the following settings. Each has benefits for durability at the cost of performance.
- Open the database in the appending mode by setting TuningParameters.update_mode to UPDATE_APPENDING and passing it to OpenAdvanced (see the sketch after this list).
- The file size of a database in the appending update mode grows with each update operation. Thus, you should call the Rebuild method periodically to keep time and space efficiency.
- Open the database with TuningParameters.restore_mode set to RESTORE_SYNC and passed to OpenAdvanced.
- This restores unhealthy databases to the state at the last time when Synchronize succeeds. Adding RESTORE_NO_SHORTCUTS reduces the risk of possible missing links in case that null codes are written unexpectedly. Adding RESTORE_WITH_HARDSYNC to the restore mode gives you more safety against another crash during the restoration operation.
- Call the Synchronize method after the transaction which you must consider completed/consistent.
- The "hard" flag should be true to tolarate system crashes. If it is false, it can toralate only process crashes. If hard=true is more costly since it writes all cached data to disk and waits for confirmation from the hardware.
- Open the database with a file open option OPEN_SYNC_HARD.
- This option causes the same physical synchronization as above just before closing the database file.
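A sketch combining those settings for HashDBM (the file name and record are placeholders, and the periodic rebuild is omitted):

#include "tkrzw_dbm_hash.h"

using namespace tkrzw;

int main() {
  HashDBM dbm;
  HashDBM::TuningParameters params;
  params.update_mode = HashDBM::UPDATE_APPENDING;  // append-only updates
  params.restore_mode = HashDBM::RESTORE_SYNC;     // roll back to the last sync point on crash
  dbm.OpenAdvanced("casket.tkh", true,
                   File::OPEN_DEFAULT | File::OPEN_SYNC_HARD, params).OrDie();
  dbm.Set("key", "value").OrDie();
  dbm.Synchronize(true).OrDie();  // hard synchronization after a completed transaction
  dbm.Close().OrDie();
  return 0;
}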
In addition, if you are concerned that the storage device could taint the existing region on the same disk block when you append data, you should set TuningParameters.align_pow, passed to OpenAdvanced, to a value matching the disk block size. Typically, it's from 9 (meaning 2^9 = 512 bytes) to 14 (meaning 2^14 = 16384 bytes). However, that means the minimum record size becomes 512 or 16384 bytes, whereby the space efficiency can be impractically low.
SkipDBM gives you higher durability with the following settings. Each has benefits for durability at the cost of performance. The same can be said of the on-memory databases associated with files.
- Call the Synchronize method after the transaction which you must consider completed/consistent.
- The "hard" flag should be true to tolarate system crashes. If it is false, it can toralate only process crashes. If hard=true is more costly since it writes all cached data to disk and waits for confirmation from the hardware.
- Open the database with a file open option OPEN_SYNC_HARD.
- This option causes the same physical synchronization as above just before closing the database file.
The Synchronize method of SkipDBM and the on-memory databases makes a new database file as a temporary file. Then, the old database file is replaced with the temporary file atomically. Thus, the database never breaks and no record is ever lost. Even so, SkipDBM has restore features just in case the file is corrupted for unexpected reasons. RESTORE_DEFAULT and RESTORE_SYNC are equivalent for SkipDBM. Actually, calling Synchronize often is not practical in terms of performance. Just setting OPEN_SYNC_HARD and delaying synchronization as much as possible is better for performance.
If you use ShardDBM for sharding databases, ACID traits across crashes are not guaranteed regardless of the inner database classes. That's because the Synchronize method is called one by one for each inner database without global locking. It can end up with a partial success of Synchronize.
The above settings can realize the durability that most ACID-supporting databases provide, although some important details vary between databases. Even if you do not configure Tkrzw into the ACID settings, you can still call Process, ProcessMulti, etc. to get the benefits of Atomicity and Isolation between threads in your callback. But if you want ACID transactions, where you can modify two or more records atomically even in the presence of crashes, you must use the ACID settings. Even applications which don't require ACID traits may still want to choose the ACID setting, because it offers stronger durability benefits against various forms of disk corruption.
Some applications might want ACID transactions or other benefits of the ACID settings, but cannot accept the performance penalty of calling Synchronize. Such applications can still get some partial protection by opening the database with the default restore mode. With RESTORE_DEFAULT, Tkrzw will make a best-effort attempt to recover as many records as possible after a crash. There is no guarantee about which records, or even how many records, can be recovered. However, you can often recover many records beyond the last Synchronize, which might be a better trade-off depending on the application. Unlike the ACID settings, this method works even with a ShardDBM database. Such applications should also consider using UPDATE_APPENDING, which offers some protection against disk corruption.
Detecting Corruption
Regardless of whether you choose the ACID settings or not, HashDBM and TreeDBM provide partial protection against corruption to record data in the file in the form of checksums on each record. The checksums allow Tkrzw and your application to detect database corruption with a probability that you can tune:
- By default, you get a 6-bit checksum on each record, which can detect errors at a rate of 98.3%
- You can pass RECORD_CRC_8 as the TuningParameters.record_crc_mode to OpenAdvanced to add 8 more bits of CRC-8 checksum stored with each record, raising the error detection rate to 1 - 1/ (61 * 256) = 99.993%
- You can pass RECORD_CRC_16 as the TuningParameters.record_crc_mode to OpenAdvanced to add 16 more bits of CRC-16 checksum stored with each record, raising the error detection rate to 1 - 1 / (61 * 65536) = 99.999975%
- You can pass RECORD_CRC_32 as the TuningParameters.record_crc_mode to OpenAdvanced() to add 32 more bits of CRC-32 checksum stored with each record, raising the error detection rate to 1 - 1 / (61 * 4294967296) = 99.9999999996%
CRCs are better than checksums in that they detect one contiguous error sequence within the width of CRC at the 100% rate. Typically RECORD_CRC_8 or RECORD_CRC_16 are enough for situations where power failures and hardware failures are not common.
The checksum and CRC are not checked in normal database operation like Get and Set. They are checked when the database is restored by the auto restore feature or an explicit call of the RestoreDatabase function. Consistency of metadata is checked for each record access. If you want to validate all records using the checksum and CRC, call the ValidateRecords method explicitly.
Tkrzw can also detect corruption in the file header containing the metadata of the database. If corruption of the file header is detected, the backup metadata is read from the record header section. If Tkrzw can restore some or all of your data, the Open method returns a success status and the IsAutoRestored method returns true. If your database is too corrupted to be read at all, the Open method returns an error status.
In all cases, any form of corruption to the database file shouldn't make Tkrzw crash your process. Tkrzw will either detect the corruption and return an error as described above, or it will fail to detect the corruption and return a success status. Some forms of corruption that Tkrzw cannot detect may cause you to lose data. For example, if a rogue process modifies a certain header field in a certain way, some of your records can disappear on the next Open, but your process will not crash.
Making Backup Data Online
Let's say, your service runs 24 hours a day and any downtime is not allowed. Then, you have to make backup data of the database online without halting the service. There are two methods to do so: CopyFileData and Export.
The CopyFileData method synchronizes the updated content to the database file and copies the whole data to another file. As copying is done while the content of the database is synchronized and frozen, the copied file is considered a snapshot of the database at a certain moment. And, the copying operation is done in the fastest way available on the system. If the file system is Btrfs or XFS on Linux, reflink (copy-on-write) is used and the copying operation finishes in a split second. Otherwise, on Linux, the sendfile system call is used.
The downside of the CopyFileData method is that other threads trying to update the database are blocked during the copying operation, which can take several seconds or minutes depending on the file size. If that is not acceptable, use the Export method, which accesses each and every record and adds them to another database. As other threads updating the database are not blocked during the iteration, atomicity of the whole database is not assured. Atomicity of each record is assured.
Some file systems provide a feature to make a snapshot of a file. The Synchronize method can call a callback function while the database is synchronized. Therefore, if the callback function uses the snapshot feature, you can make an atomic snapshot at a moment without making other threads wait.
Update Logging and Incremental Backup
If you have a backup database file, you can recover the state of the database at the time of taking the backup file. However, you lose updates after that time. Taking update logs mitigates this issue. Each database class supports injection of an arbitrary update logger instance, whose class is derived from the UpdateLogger class. The WriteSet method is called just before a record is modified or added by the Set method and the like. The WriteRemove method is called just before a record is removed by the Remove method and the like. The WriteClear method is called just before all records are removed by the Clear method. If you want to monitor all updates on the terminal, you'll write the following code.
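A sketch of such a logger, assuming the interface is the nested DBM::UpdateLogger class with Status-returning Write* methods as described above (treat the exact signatures as assumptions to check against the API reference):

#include "tkrzw_dbm_hash.h"
#include <iostream>

using namespace tkrzw;

// Prints every update to the terminal.
class PrintUpdateLogger : public DBM::UpdateLogger {
 public:
  Status WriteSet(std::string_view key, std::string_view value) override {
    std::cout << "SET " << key << " " << value << std::endl;
    return Status(Status::SUCCESS);
  }
  Status WriteRemove(std::string_view key) override {
    std::cout << "REMOVE " << key << std::endl;
    return Status(Status::SUCCESS);
  }
  Status WriteClear() override {
    std::cout << "CLEAR" << std::endl;
    return Status(Status::SUCCESS);
  }
};

int main() {
  HashDBM dbm;
  dbm.Open("casket.tkh", true, File::OPEN_TRUNCATE).OrDie();
  PrintUpdateLogger logger;
  dbm.SetUpdateLogger(&logger);  // hook the logger into the database
  dbm.Set("hello", "world").OrDie();
  dbm.Remove("hello").OrDie();
  dbm.Close().OrDie();
  return 0;
}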
Note that all updates represent the current content of the database. Let's say, if the value "foo" of a record is updated by Append method to be "foobar", WriteSet("foobar") is called. It's not WriteAppend("bar"). If the Remove method is called but the record doesn't exist, no callback is called.
By storing update logs into files, you can redo the operations later. It's like row-level REDO logs of an RDBMS. One easy way to achieve it is to use another DBM object. The class DBMUpdateLoggerDBM wraps any DBM object as an UpdateLogger. The following code replicates every update of the main database to the backup database.
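A sketch of wiring a backup DBM in as the logger (assuming DBMUpdateLoggerDBM takes a pointer to the backup database and lives in the update-logger header; the file names are placeholders):

#include "tkrzw_dbm_hash.h"
#include "tkrzw_dbm_ulog.h"

using namespace tkrzw;

int main() {
  HashDBM main_dbm, backup_dbm;
  main_dbm.Open("main.tkh", true).OrDie();
  backup_dbm.Open("backup.tkh", true).OrDie();
  // Every update applied to main_dbm is replayed on backup_dbm.
  DBMUpdateLoggerDBM ulog(&backup_dbm);
  main_dbm.SetUpdateLogger(&ulog);
  main_dbm.Set("hello", "world").OrDie();
  main_dbm.Close().OrDie();
  backup_dbm.Close().OrDie();
  return 0;
}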
However, using two databases is not ideal in most cases. As the backup database is also online, you cannot stop it while the main database is in operation. Moreover, the throughput of updating operations is halved.
Instead, it is recommended to store update logs into flat files by the DBMUpdateLoggerMQ class. It serializes update logs in a binary format and appends them at the end of a flat file. Thanks to the simple structure, overhead for taking logs is relatively low. Moreover, file rotation is done implicitly; If the size of the current file exceeds a threshold, a new file is created and then newer update logs are stored there. The timestamp when the update happens is attached to each update log. Thus, you can retrieve update logs in a specific time range. Thanks to those features, you can maintain incremental backup data for online databases.
The DBMUpdateLoggerMQ class uses a message queue managed by the MessageQueue class. It realizes thread-safe operations of writing and reading logs simultaneously. You specify a prefix of log files and the maximum file size when you call the Open method of MessageQueue. The actual file name has a suffix of ten digits of a sequential ID, like "casket-ulog.0000000000" and "casket-ulog.0000000001".
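The setup can look like the following minimal sketch, which assumes that MessageQueue::Open takes the file prefix and the maximum file size and that DBMUpdateLoggerMQ wraps a pointer to the message queue; see the API documents of DBMUpdateLoggerMQ and MessageQueue for the exact signatures.
#include "tkrzw_dbm_hash.h"
#include "tkrzw_dbm_ulog.h"
#include "tkrzw_message_queue.h"

using namespace tkrzw;

int main() {
  // Log files are named like "casket-ulog.0000000000", "casket-ulog.0000000001", ...
  MessageQueue mq;
  mq.Open("casket-ulog", 64 << 20).OrDie();  // rotate after about 64MB
  DBMUpdateLoggerMQ ulog(&mq);               // assumed constructor
  HashDBM dbm;
  dbm.Open("casket.tkh", true).OrDie();
  dbm.SetUpdateLogger(&ulog);
  dbm.Set("hello", "world");  // serialized and appended to the current log file
  dbm.Close().OrDie();
  mq.Close().OrDie();
  return 0;
}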
The same thing can be done with PolyDBM and ShardDBM by setting tuning parameters.
If the database is completely broken or lost somehow, you can restore the current state by applying the update logs to the latest backup file. It is a good practice to copy the backup file for restoration to keep the original file intact, just in case.
The typical scenario of operations with incremental backup data is like the following. After restoration and before resuming the normal operations, the restored database should be copied as the new backup file.
- Make a full backup of the database.
- Do normal update operations while taking update logs.
- The system crashes and the main database is lost somehow.
- Apply the update logs to the full backup file.
- Archive or remove old update log files.
The restore operation can be done with the command line utility too. If you take update logs from the beginning, you can make a full backup file by applying all update logs to an empty database. The "--ulog" parameter specifies the minimum timestamp in milliseconds of the updates to be applied, so 0 means all updates.
Until the system crashes and restoration is done, the update log files continue to accumulate. Thus, you should apply the update logs periodically and archive/remove unnecessary files. The above command works even with update log files which are in operation by another process: as race conditions are avoided by checking metadata for each record, you can read update log files which are being written by another process simultaneously. After you apply all existing update logs, you can remove the update log files except for the latest one. The following command is useful to remove update log files which are older than a threshold (3 days ago), excluding the latest one.
Even if the update logs in the latest file are applied multiple times, it is safe. Update logs for each record are idempotent, which means that applying the same operation multiple times has the same effect as applying it once. The same is true for a sequence of update logs sorted by occurrence time. Thus, you can apply update log files multiple times as long as they are applied in order of file ID. That said, applying update logs from some seconds before the timestamp of the database is a more efficient way. The "some seconds" allowance absorbs possible clock skew of the system. You can set a relative timestamp in milliseconds by giving a negative value to the "--ulog" parameter. The following commands import updates from 10 seconds before the timestamp of the backup database.
Let's confirm the behavior of the update logs using the command line utilities.
The reason why the MessageQueue class is used for taking logs is that another thread can monitor the latest updates and read them immediately. This feature is utilized to implement data replication between network servers in Tkrzw-RPC. That's why each update log has extra properties for the server and the DBM index. You can also use DBMUpdateLoggerMQ to implement data replication locally. See the API of DBMUpdateLoggerMQ and MessageQueue for details.
The message file begins with a 32-byte metadata section. It is composed of the magic data "TkrzwMQX\n" at offset 0, 1-byte cyclic magic data at offset 9, a 6-byte file ID at offset 16, a 6-byte timestamp at offset 22, and 1-byte cyclic magic data at offset 31. A sequence of message data comes after that. Each record begins with an 11-byte record header, composed of 1-byte magic data (0xFF), a 6-byte timestamp, and a 4-byte message size. The message body comes after that. A 1-byte checksum comes at the end of the message: the modulo-251 checksum of the bytes, plus 1. The timestamp is represented as a UNIX epoch time in milliseconds, serialized in big-endian.
The message data for an update log begins with its metadata section, composed of a 1-byte operation type (0xA1=Set, 0xA2=Remove, 0xA3=Clear), a varint server ID, and a varint DBM index. After that, the Set operation has a key and a value, and the Remove operation has a key. Each of the key and the value is composed of a varint size and arbitrary binary data.
The per-message footprint of the message format is 12 bytes (the 11-byte header plus the 1-byte checksum). The minimum footprint of each update log of the Set operation is 5 bytes. Thus, the minimum total footprint for each update log is 17 bytes. Let's say each record has an 8-byte key and a 16-byte value on average and 100 million updates are done each day. Then, (17 + 8 + 16) * 100 million = 3.81 GiB of data is added each day. You should estimate the data increment in your case with such a formula to schedule the backup rotation.
Data Migration
A simple format called FlatRecord is provided as a vehicle of data migration. Flat records are also used to save records of on-memory databases in a file. A binary file of FlatRecord contains a sequence of records. Each record represents a sequence of bytes, in the following format.
- The magic data : A 1-byte integer.
- The data size : A byte-delta-encoded integer.
- The data : Arbitrary string data.
The magic data is 0xFF or 0xFE and helps us detect data corruption. 0xFF is for normal records and 0xFE is for metadata. The data size is represented in byte delta encoding. A value between 0 and 127 takes 1 byte. A value between 128 and 16383 takes 2 bytes. A value between 16384 and 2097151 takes 3 bytes. A key-value pair is represented by two records in sequence. To represent a key-value database, keys and values of records appear alternately.
You can export all records of an existing database into a flat record file, by a command like this.
If all keys and values are plain text, exporting them as TSV is a feasible option. You can do it like this.
Records in exported data files can be imported into a database, by commands like these.
Whereas data of flat records can include any kind of binary data or text data, TSV data cannot include tab, linefeed or binary data. If you want to escape output data in C-style for exporting and unescape input data for importing, add the "--escape" flag.
Data migration can be done via update logs although space efficiency is not very good. The "export" subcommand takes the "--ulog" option which sets the output type to update logs and specifies the timestamp added to each update log. If the value is negative, the timestamp of the database is used. All existing records are treated as the "Set" operations. The "import" command also takes the "--ulog" option which specifies the minimum timestamp for update logs to import.
Inverted Index for Full-text Search
The DBM interface provides the Append method, which appends the given value at the end of the value of the existing record. Most DBM implementations use a wrapper which calls the Get method and the Set method atomically. However, if you call the Append method N times through such a wrapper, the time complexity is O(N^2) because the whole existing value is copied on every call. To overcome the problem, TinyDBM and BabyDBM override it with specialized implementations, which write the given value directly at the end of the existing value if possible. The amortized time complexity of the specialized implementations is O(N). Therefore, TinyDBM and BabyDBM are useful as buffers to build inverted indices for full-text search systems.
Typically, a full-text search system uses a buffer for posting lists of document IDs for each word. The contents of the buffer are dumped into separate DBM files periodically according to the memory capacity. If necessary, you can merge the separate DBM files into one file, as in the following example.
#include <iostream>
#include <string>
#include <string_view>
#include <vector>

#include "tkrzw_dbm_baby.h"
#include "tkrzw_dbm_skip.h"
#include "tkrzw_str_util.h"

using namespace tkrzw;

// Dumps the contents of the on-memory buffer into a skip database file.
void DumpBuffer(BabyDBM* buffer, int32_t file_id);

int main() {
const std::vector<std::string> documents = {
"this is a pen", "this boy loves a girl", "the girl has the pen", "tokyo go",
"she loves tokyo", "tokyo is a big city", "boy meets girl", "girl meets boy"};
constexpr int32_t BUFFER_CAPACITY = 10;
// Register all documents into separate index files.
BabyDBM buffer;
int32_t num_files = 0;
int32_t num_words_in_buffer = 0;
int64_t doc_id = 0;
for (const auto& doc : documents) {
const std::string value = IntToStrBigEndian(doc_id);
const std::vector<std::string> words = StrSplit(doc, " ");
for (const std::string& word : words) {
buffer.Append(word, value);
num_words_in_buffer++;
}
if (num_words_in_buffer > BUFFER_CAPACITY) {
DumpBuffer(&buffer, num_files);
buffer.Clear();
num_files++;
num_words_in_buffer = 0;
}
doc_id++;
}
if (num_words_in_buffer > 0) {
DumpBuffer(&buffer, num_files);
buffer.Clear();
num_files++;
}
// Merge separate index files into one.
SkipDBM merged_dbm;
merged_dbm.Open("index-merged.tks", true, File::OPEN_TRUNCATE).OrDie();
for (int file_id = 0; file_id < num_files; file_id++) {
const std::string file_name = SPrintF("index-%05d.tks", file_id);
merged_dbm.MergeSkipDatabase(file_name).OrDie();
}
merged_dbm.SynchronizeAdvanced(false, nullptr, SkipDBM::ReduceConcat).OrDie();
merged_dbm.Close().OrDie();
// Search the merged index file.
merged_dbm.Open("index-merged.tks", false).OrDie();
const std::vector<std::string> queries = {"pen", "boy", "girl", "tokyo"};
for (const auto& query : queries) {
std::cout << query << ":";
std::string value;
if (merged_dbm.Get(query, &value).IsOK()) {
size_t pos = 0;
while (pos < value.size()) {
doc_id = StrToIntBigEndian(std::string_view(value.data() + pos, sizeof(int64_t)));
std::cout << " " << doc_id;
pos += sizeof(int64_t);
}
}
std::cout << std::endl;
}
merged_dbm.Close().OrDie();
}
void DumpBuffer(BabyDBM* buffer, int32_t file_id) {
const std::string file_name = SPrintF("index-%05d.tks", file_id);
SkipDBM::TuningParameters tuning_params;
tuning_params.insert_in_order = true;
SkipDBM dbm;
dbm.OpenAdvanced(file_name, true, File::OPEN_TRUNCATE, tuning_params).OrDie();
auto iter = buffer->MakeIterator();
iter->First();
std::string key, value;
while (iter->Get(&key, &value).IsOK()) {
dbm.Set(key, value).OrDie();
iter->Next();
}
dbm.Close().OrDie();
}
Utility Functions
Utility functions are provided to complement the functionality of the DBM interface. They are defined in "tkrzw_dbm_common_impl.h". In particular, the following are useful.
- SearchDBM : Searches a database and gets keys which match a pattern.
- SearchDBMForwardMatch : Searches a database and gets keys which begin with a pattern.
- SearchDBMRegex : Searches a database and gets keys which match a regular expression.
- SearchDBMEditDistance : Searches a database and gets keys whose edit distance with a pattern is the least.
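For example, SearchDBMForwardMatch can be used like the following minimal sketch, which assumes that each search function takes the database, the pattern, an output vector of matched keys, and an optional capacity limit; check tkrzw_dbm_common_impl.h for the exact signatures.
#include <iostream>
#include <string>
#include <vector>

#include "tkrzw_dbm_common_impl.h"
#include "tkrzw_dbm_hash.h"

using namespace tkrzw;

int main() {
  HashDBM dbm;
  dbm.Open("casket.tkh", true).OrDie();
  dbm.Set("apple", "red");
  dbm.Set("apricot", "orange");
  dbm.Set("banana", "yellow");
  std::vector<std::string> keys;
  // Gets at most 10 keys which begin with "ap" (assumed parameter order).
  SearchDBMForwardMatch(&dbm, "ap", &keys, 10);
  for (const auto& key : keys) {
    std::cout << key << std::endl;
  }
  dbm.Close().OrDie();
  return 0;
}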
Scanning the whole database is sometimes costly. If the database is not updated frequently, exporting keys into a text file and scanning it is more efficient. The ExportDBMKeysAsLines function is provided for that purpose, and the functions SearchTextFile, SearchTextFileRegex, and SearchTextFileEditDistance are available for such text files.
Multitasking
Databases of Tkrzw are designed to be shared among multiple threads. However, they are not designed to be shared among multiple processes. When one process opens a database file as a writer, an exclusive lock is applied to the file so that other processes trying to open the same database file are blocked until the database is closed. When one process opens a database file as a reader, a shared lock is applied to the file so that other reader processes can open the same database but other writer processes are blocked until the database is closed. Thereby, consistency of the state of the database and atomicity of updating operations are assured in the multi-processing environment.
A workaround to share the same database among multiple processes is to open the database, do the operation, and close the database for each access, as sketched below. Although this causes unnecessary read/write operations for each database access, it is feasible up to a certain throughput (100 QPS or so). The workaround is useful with traditional shell commands and CGI scripts.
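The following is a minimal sketch of that pattern, using only the basic Open, Set, Get, and Close methods.
#include <iostream>
#include <string>

#include "tkrzw_dbm_hash.h"

using namespace tkrzw;

int main() {
  {  // One process (or request handler) writes a record.
    HashDBM dbm;
    dbm.Open("casket.tkh", true).OrDie();   // writer: exclusive lock
    dbm.Set("hello", "world").OrDie();
    dbm.Close().OrDie();                    // the file lock is released here
  }
  {  // Another process reads the record later.
    HashDBM dbm;
    dbm.Open("casket.tkh", false).OrDie();  // reader: shared lock
    std::string value;
    if (dbm.Get("hello", &value).IsOK()) {
      std::cout << value << std::endl;
    }
    dbm.Close().OrDie();
  }
  return 0;
}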
If possible, using the service of Tkrzw-RPC is more practical. A server process serves one or more databases. Clients access the server via the gRPC protocol, which realizes throughput of more than 10,000 QPS and can handle more than 10,000 clients at the same time.
File Classes
The file databases are built on an abstract file storage interface so that you can build a database on any kind of file system. Several concrete implementations based on the standard C++ API and the POSIX API are bundled. MemoryMapParallelFile is used for all databases by default.
- StdFile : Based on "fstream" of the standard C++ API. Access to the data is guarded by mutex and update looks atomic.
- MemoryMapParallelFile : Based on memory mapping I/O by "mmap". Access to the data is not guarded by mutex.
- MemoryMapAtomicFile : Based on memory mapping I/O by "mmap". Access to the data is guarded by mutex and update looks atomic.
- PositionalParallelFile : Based on positional I/O by "pread" and "pwrite". Access to the data is not guarded by mutex.
- PositionalAtomicFile : Based on positional I/O by "pread" and "pwrite". Access to the data is guarded by mutex and update looks atomic.
Consistency of the data as a database is guaranteed by the database classes, so atomicity of updates is not required of the file class. Therefore, there is usually no need to use MemoryMapAtomicFile or PositionalAtomicFile except for performance tests.
Because the MemoryMapParallelFile uses memory mapping, efficiency of I/O is the best. However, the file size cannot exceed the size of the virtual memory. If the file size is larger, the PositionalParallelFile should be used.
To check performance, let's write 10 million records each of which is 100 bytes with one thread. The file size is 954MB.
class | Sequential Write | Sequential Read | Random Write | Random Read |
---|---|---|---|---|
StdFile | 432,839 QPS | 719,570 QPS | 461,922 QPS | 613,791 QPS |
MemoryMapParallelFile | 7,581,742 QPS | 12,014,696 QPS | 4,563,364 QPS | 4,838,227 QPS |
MemoryMapAtomicFile | 7,858,269 QPS | 12,794,807 QPS | 4,480,116 QPS | 4,905,746 QPS |
PositionalParallelFile | 950,962 QPS | 1,351,362 QPS | 739,976 QPS | 1,018,328 QPS |
PositionalAtomicFile | 950,187 QPS | 1,331,649 QPS | 735,836 QPS | 993,415 QPS |
Do the same things with 10 threads. Each thread writes 1 million records and reads them.
class | Sequential Write | Sequential Read | Random Write | Random Read |
---|---|---|---|---|
StdFile | 106,545 QPS | 132,158 QPS | 106,669 QPS | 120,760 QPS |
MemoryMapParallelFile | 7,698,532 QPS | 10,020,994 QPS | 7,643,432 QPS | 9,144,095 QPS |
MemoryMapAtomicFile | 2,378,891 QPS | 10,230,659 QPS | 1,427,267 QPS | 9,538,727 QPS |
PositionalParallelFile | 1,084,185 QPS | 5,610,970 QPS | 770,063 QPS | 5,360,002 QPS |
PositionalAtomicFile | 182,761 QPS | 5,434,638 QPS | 136,015 QPS | 5,011,813 QPS |
For the command line utilities, the "--file" option must always be set if you use a file class other than MemoryMapParallelFile.
In C++, a File object is given to the constructor of the HashDBM class. The same thing can be done with all DBM classes.
HashDBM dbm(std::make_unique<PositionalParallelFile>());  // any File subclass can be given
dbm.Open("casket.tkh", true);
All file classes except for StdFile use a preliminary allocation strategy by default to improve performance: when the file size exceeds a threshold, the file size is doubled, which reduces the frequency of allocation operations. The initial file size is 1MB by default. PositionalParallelFile and PositionalAtomicFile can disable this feature by calling the SetAllocationStrategy method.
As MemoryMapParallelFile and MemoryMapAtomicFile use memory mapping, the file cannot be empty and the file size must be aligned to the page size of the OS, which is usually 4096, 8192, or 16384. The file size is adjusted to the actual written data size when you call the Close method. The file size also matches the actual written data size between a call to the Synchronize method and the next updating method.
The File class is an interface class and you can implement subclasses of it to support any kind of storage which supports random access. Like the polymorphic database class (PolyDBM), the polymorphic file class (PolyFile) is provided. It is useful to handle files of different implementation with the same interface. See the API document for details.
Direct I/O
Many operating systems support direct I/O, whereby input and output to/from the storage device are done without the cache of the underlying file system. Direct I/O is enabled by non-standard features depending on the OS (the O_DIRECT open flag on Linux, the F_NOCACHE fcntl command on Mac OS X, and the FILE_FLAG_NO_BUFFERING CreateFile flag on Windows). If direct I/O is enabled, I/O operations on the file must be done in the block I/O style, where every data sequence is aligned to at least a 512-byte block. Say you want to update just one byte of a file: you must read one 512-byte block from the device into a buffer, update the one byte, and write the buffer back to the device.
Usually, such cumbersome jobs are done by the file system, which uses the so-called "page cache" mechanism. Multiple operations against the same 4096-byte page are reflected on a cached buffer, which is written back to the storage device periodically. Using the page cache is effective to improve performance in most cases. However, the page cache uses memory. If your file is larger than the available memory and there's no locality in your access pattern, the cache hit rate is low, which just wastes memory space and CPU time. In that case, using direct I/O to suppress page caching is useful. Though each operation of direct I/O is much slower than cached I/O in normal situations, it can be the best practice for managing huge databases.
You should consider using direct I/O if the size of the database is more than the available RAM or the typical I/O pattern is larger than 512 bytes. With HashDBM, the logical record size directly determines the I/O pattern, so if the record size is large, like 2KB or 10KB, using direct I/O is practical. With TreeDBM, multiple records are organized into a B+ tree page and the I/O pattern is implicitly adjusted to 4KB or more, so TreeDBM is often suitable for direct I/O.
The file classes PositionalParallelFile and PositionalAtomicFile support direct I/O. Usually, you use PositionalParallelFile. You call the SetAccessStrategy method to set the block size to 512 and enable the ACCESS_DIRECT option. Moreover, both HashDBM and TreeDBM have the "align_pow" parameter, which sets the alignment as a power of 2. You should set it to at least 9, which means 2^9 = 512. You should also set the "cache_buckets" flag to improve performance of the hash table. Therefore, you'll write code like the following.
auto file = std::make_unique<PositionalParallelFile>();
file->SetAccessStrategy(512, PositionalFile::ACCESS_DIRECT);
HashDBM dbm(std::move(file));
HashDBM::TuningParameters tuning_params;
tuning_params.align_pow = 9;
tuning_params.cache_buckets = true;
dbm.OpenAdvanced("casket.tkh", true, File::OPEN_DEFAULT, tuning_params);
As long as the block size is 512 or a multiple of it, it works. The best block size is determined by the typical record size and the physical block size of the device. While the alignment size has no hard restriction, the best alignment size should also be determined by the typical record size. For HashDBM, 9 is good for most cases. If the typical record size is less than 256 bytes, setting 8 or less is reasonable. However, if the alignment is smaller than the block size and you use multiple threads, you must use PositionalAtomicFile because file access across block borders is not thread-safe. Moreover, adding ACCESS_PADDING to ACCESS_DIRECT is necessary to make the file size aligned to the block size. Adding ACCESS_PAGECACHE is optional but good for performance. Therefore, you'll write code like the following.
auto file = std::make_unique<PositionalAtomicFile>();
file->SetAccessStrategy(512, PositionalFile::ACCESS_DIRECT |
    PositionalFile::ACCESS_PADDING | PositionalFile::ACCESS_PAGECACHE);
HashDBM dbm(std::move(file));
HashDBM::TuningParameters tuning_params;
tuning_params.align_pow = 8;
tuning_params.cache_buckets = true;
dbm.OpenAdvanced("casket.tkh", true, File::OPEN_DEFAULT, tuning_params);
For TreeDBM, the default settings are usually suitable for direct I/O too. By default, each B+ tree page is stored with an alignment of 1024 bytes, while the data size of each page is tuned to be between 4096 and 8192 bytes. You should enable "cache_buckets" for performance. Therefore, you'll write code like the following.
auto file = std::make_unique<PositionalParallelFile>();
file->SetAccessStrategy(512, PositionalFile::ACCESS_DIRECT | PositionalFile::ACCESS_PADDING);
TreeDBM dbm(std::move(file));
TreeDBM::TuningParameters tuning_params;
tuning_params.cache_buckets = true;
dbm.OpenAdvanced("casket.tkt", true, File::OPEN_DEFAULT, tuning_params);
You don't have to change the block size of 512 for tuning purposes; as long as the underlying device driver accepts 512-byte alignment, it's OK. Meanwhile, you should tune "align_pow" for performance. If you use TreeDBM, tuning "max_page_size" is also important.
Whereas using direct I/O avoids tainting the file system cache, using a small page cache system in the process can improve performance in some cases. In general, locality is low with a large database. However, there is some locality in the following cases.
- You use the same query often or repeatedly regardless of the database type. Then, exactly the same blocks are accessed.
- You add/update small records in HashDBM. Then, the block at the end of the file is updated repeatedly.
- You use SkipDBM. The same or adjacent blocks are read repeatedly at the last steps of a search operation.
If you add PositionalFile::ACCESS_PAGECACHE to the access strategy options, the in-process page cache system is enabled. It manages "pages" on memory, each of which is associated with one or more blocks of the file. The 256 most recently accessed pages are kept in memory. Reading data from cached blocks causes no operation on the storage device. Writing data to cached blocks marks them as "dirty", and dirty pages are written back when they are removed from the cache. If you use direct I/O with HashDBM or SkipDBM, enabling the in-process page cache is usually beneficial. If you use TreeDBM, you shouldn't enable it because the B+ tree nodes already work as a cache.
Performance of ShardDBM
You can shard a database into several database files by using ShardDBM. Because contention on each locking mutex is divided by the number of shards, concurrency performance improves. To confirm this, let's do performance tests. This time we store 10 million records with 10 threads. The key is an 8-byte string from "00000000" to "09999999", selected at random. We use ShardDBM, and the internal database class varies for each test run. First, let's see the result with one shard.
class | Set | Get | Remove | memory usage | file size |
---|---|---|---|---|---|
HashDBM | 2,281,733 QPS | 3,711,545 QPS | 2,577,823 QPS | 181.0 MB | 182.8 MB |
TreeDBM | 715,372 QPS | 2,398,091 QPS | 625,034 QPS | 385.7 MB | 119.8 MB |
SkipDBM | 510,634 QPS | 866,662 QPS | 839,042 QPS | 420.5 MB | 122.5 MB |
Then, let's see the result with ten shards. Tuning parameters such as the number of buckets and the maximum number of cached pages are specified to see optimal performance for all test runs.
class | Set | Get | Remove | memory usage | file size |
---|---|---|---|---|---|
HashDBM | 5,163,622 QPS | 6,321,919 QPS | 5,653,435 QPS | 181.3 MB | 182.8 MB |
TreeDBM | 2,373,253 QPS | 3,739,910 QPS | 2,263,250 QPS | 387.5 MB | 119.9 MB |
SkipDBM | 564,395 QPS | 1,712,049 QPS | 1,220,438 QPS | 387.8 MB | 122.5 MB |
As seen above, sharding is effective for improving throughput with multi-threading. Especially if updating operations are the bottleneck of the service, sharding is beneficial. Although there is a drawback that the iterator can be slower, the extent is usually acceptable. The following commands reproduce the performance tests on your environment.
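In C++, a sharded database can be opened through the polymorphic API. The following is a minimal sketch, assuming the "num_shards" and "num_buckets" tuning parameters of ShardDBM's OpenAdvanced, given as a string-keyed map; check the API document of ShardDBM for the exact parameter names and the shard file naming.
#include <map>
#include <string>

#include "tkrzw_dbm_shard.h"

using namespace tkrzw;

int main() {
  ShardDBM dbm;
  const std::map<std::string, std::string> params = {
      {"num_shards", "10"}, {"num_buckets", "1000000"}};
  // The database is split into 10 hash database files sharing the base name.
  dbm.OpenAdvanced("casket.tkh", true, File::OPEN_DEFAULT, params).OrDie();
  dbm.Set("hello", "world").OrDie();
  dbm.Close().OrDie();
  return 0;
}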
Performance of Secondary Indices
If you use secondary indices, choosing suitable classes is important. FileIndex is relatively slow, but the data is persisted in a file. MemIndex is fast and memory efficient. StdIndex is a template class; time efficiency and space efficiency differ by the types of the key and the value. Let's say 10 threads set 1 million records in total. The key and the value of each record are numeric values. We compare int64_t (from 0 to 999999) and decimal std::string (from "0" to "999999"). In addition, we check StdIndexStr, which is a specialized version for memory efficiency in string-to-string use cases.
class | Add | Check | Remove | Memory Usage |
---|---|---|---|---|
FileIndex | 700,275 QPS | 2,451,437 QPS | 750,214 QPS | 21.9 MB |
MemIndex | 1,031,907 QPS | 5,703,882 QPS | 1,003,391 QPS | 50.0 MB |
StdIndex<int64_t, int64_t> | 1,228,569 QPS | 11,051,806 QPS | 1,739,578 QPS | 62.0 MB |
StdIndex<int64_t, string> | 944,862 QPS | 9,911,980 QPS | 1,309,429 QPS | 76.9 MB |
StdIndex<string, int64_t> | 806,142 QPS | 8,744,327 QPS | 1,040,206 QPS | 76.8 MB |
StdIndex<string, string> | 580,958 QPS | 7,972,128 QPS | 862,372 QPS | 106.7 MB |
StdIndexStr | 397,067 QPS | 6,731,285 QPS | 700,590 QPS | 76.8 MB |
Using StdIndex<int64_t, int64_t> as much as possible is good for both time and space efficiency. Otherwise, using MemIndex is the second best. If you need file indices, using FileIndex is the sole solution.
The above test cases should show ideal performance of each index class because records are accessed sequentially and the entire data can be cached on memory. Then, let's move on to tougher test cases. We build a large database of 100 million records with random keys between "00000000" and "99999999". As some keys are duplicated and such records are overwritten, the actual number of unique records is about 63 million. We use max_cached_pages=400000 for FileIndex.
class | Add | Check | Remove | Memory Usage |
---|---|---|---|---|
FileIndex | 61,779 QPS | 216,758 QPS | 63,744 QPS | 7202.7 MB |
MemIndex | 455,743 QPS | 2,428,402 QPS | 392,666 QPS | 6144.4 MB |
StdIndex<int64_t, int64_t> | 158,152 QPS | 4,935,703 QPS | 142,164 QPS | 5962.8 MB |
StdIndex<int64_t, string> | 128,086 QPS | 4,528,615 QPS | 128,206 QPS | 7453.0 MB |
StdIndex<string, int64_t> | 114,269 QPS | 3,768,720 QPS | 110,755 QPS | 7452.9 MB |
StdIndex<string, string> | 105,488 QPS | 3,477,022 QPS | 101,512 QPS | 10433.1 MB |
StdIndexStr | 98,652 QPS | 2,798,903 QPS | 96,561 QPS | 10422.1 MB |
In these settings, using FileIndex slows down significantly. If you need more throughput than 61K QPS, you should use on-memory indices. If parallelism of updating operations is important, using MemIndex is the best. Otherwise, using StdIndex is recommended. Using integer types as much as possible is a good idea.
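As a usage sketch, the method names below follow the columns of the tables above (Add, Check, Remove); GetValues and the exact signatures are assumptions, so check tkrzw_index.h for details.
#include <iostream>

#include "tkrzw_index.h"

using namespace tkrzw;

int main() {
  MemIndex index;
  // Map a secondary key (division name) to the primary key of the main database.
  index.Add("sales", "employee:1001");
  index.Add("sales", "employee:1005");
  index.Add("dev", "employee:1002");
  if (index.Check("sales", "employee:1001")) {
    // Look up all primary keys registered under "sales" (assumed method).
    for (const auto& value : index.GetValues("sales")) {
      std::cout << value << std::endl;
    }
  }
  index.Remove("sales", "employee:1005");
  return 0;
}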
DBM Performance Tests
Performance tests described in the performance section can be done by the following commands.
Frequently Asked Questions
- Q: What does Tkrzw stand for? How should I pronounce Tkrzw?
- A: There's no meaning. Please pronounce it as you like.
- Q: What's the relationship between Tkrzw and Kyoto Cabinet (Tokyo Cabinet, QDBM)?
- A: Tkrzw is a successor of them, utilizing C++17 features. I believe Tkrzw is superior in many aspects to them. Because API and data formats have been changed, there's no compatibility with the predecessors.
- Q: Do you have a plan to write a database server like Kyoto Tycoon?
- A: Please see Tkrzw-RPC.
- Q: Do you have a plan to host binary packages?
- A: I cannot do it myself. I appreciate people who maintain binary packages for various environments.
License
Tkrzw is written mainly by Mikio Hirabayashi, copyrighted by Google LLC, and distributed under the Apache License 2.0. See the COPYING file in the package for details.
Tkrzw is NOT an official Google product but developed by an individual as a personal hobby product. It is copyrighted by the company due to his employment contract. To contact the author, send an e-mail to <hirarin@gmail.com>.