Fundamental Specifications of Tokyo Dystopia Version 1

Copyright (C) 2007-2010 FAL Labs
Last Update: Thu, 05 Aug 2010 15:19:29 +0900

Table of Contents

  1. Introduction
  2. Installation
  3. The Core API
  4. The Q-gram Database API
  5. The Simple API
  6. The Word Database API
  7. Command Line Interfaces
  8. Tutorial
  9. License

Introduction

Tokyo Dystopia is a full-text search system. It searches a large number of records for those that include specified patterns.

Tokyo Dystopia is available on platforms whose APIs conform to C99 and POSIX. Tokyo Dystopia is free software licensed under the GNU Lesser General Public License.


Installation

Install the latest version of Tokyo Cabinet beforehand and get the package of Tokyo Dystopia.

After extracting the archive file of Tokyo Dystopia, change the current working directory to the generated directory and perform installation.

Run the configuration script.

./configure

Build programs.

make

Perform self-diagnostic test.

make check

Install programs. This operation must be carried out by the root user.

make install

When these steps finish, the following files are installed.

/usr/local/include/tcqdb.h
/usr/local/include/dystopia.h
/usr/local/include/tcwdb.h
/usr/local/include/laputa.h
/usr/local/lib/libtokyodystopia.a
/usr/local/lib/libtokyodystopia.so.x.y.z
/usr/local/lib/libtokyodystopia.so.x
/usr/local/lib/libtokyodystopia.so
/usr/local/lib/pkgconfig/tokyodystopia.pc
/usr/local/bin/tcqtest
/usr/local/bin/tcqmgr
/usr/local/bin/dysttest
/usr/local/bin/dystmgr
/usr/local/bin/tcwtest
/usr/local/bin/tcwmgr
/usr/local/bin/laputest
/usr/local/bin/lapumgr
/usr/local/libexec/dystsearch.cgi
/usr/local/libexec/lapusearch.cgi
/usr/local/share/tokyodystopia/...
/usr/local/man/man1/...
/usr/local/man/man3/...

The C API is available to programs conforming to the C89 (ANSI C) standard or the C99 standard. As the header files of Tokyo Dystopia are provided as `dystopia.h', `tcqdb.h', `laputa.h', and `tcwdb.h', applications should include the ones they need to use the API. The library is provided as `libtokyodystopia.a' and `libtokyodystopia.so', which depend on `libtokyocabinet.so', `libz.so', `libbz2.so', `libpthread.so', `libm.so', and `libc.so', so the corresponding linker options are required by the build command. The typical build command is the following.

gcc -I/usr/local/include td_example.c -o td_example \
  -L/usr/local/lib -ltokyodystopia -ltokyocabinet -lz -lbz2 -lpthread -lm -lc

You can also use Tokyo Dystopia in programs written in C++. Because each header is wrapped in C linkage (`extern "C"' block), you can simply include them into your C++ programs.


The Core API

An indexed database is a directory containing a hash database file and its index files. The key of each record is a positive number. The value of each record is arbitrary text whose encoding is UTF-8. See `dystopia.h' for the entire specification.

Description

To use the core API, include `dystopia.h' and related standard header files. Usually, write the following description near the front of a source file.

#include <dystopia.h>
#include <stdlib.h>
#include <stdbool.h>
#include <stdint.h>

Objects whose type is pointer to `TCIDB' are used to handle indexed databases. An indexed database object is created with the function `tcidbnew' and is deleted with the function `tcidbdel'. To avoid memory leaks, it is important to delete every object when it is no longer in use.

Before operations to store or retrieve records, it is necessary to open a database directory and connect the indexed database object to it. The function `tcidbopen' is used to open a database directory and the function `tcidbclose' is used to close the database directory. To avoid data loss or corruption, it is important to close every database directory when it is no longer in use.

API

The function `tcidberrmsg' is used in order to get the message string corresponding to an error code.

const char *tcidberrmsg(int ecode);
`ecode' specifies the error code.
The return value is the message string of the error code.

The function `tcidbnew' is used in order to create an indexed database object.

TCIDB *tcidbnew(void);
The return value is the new indexed database object.

The function `tcidbdel' is used in order to delete an indexed database object.

void tcidbdel(TCIDB *idb);
`idb' specifies the indexed database object.
If the database is not closed, it is closed implicitly. Note that the deleted object and its derivatives can not be used anymore.

The function `tcidbecode' is used in order to get the last happened error code of an indexed database object.

int tcidbecode(TCIDB *idb);
`idb' specifies the indexed database object.
The return value is the last happened error code.
The following error codes are defined: `TCESUCCESS' for success, `TCETHREAD' for threading error, `TCEINVALID' for invalid operation, `TCENOFILE' for file not found, `TCENOPERM' for no permission, `TCEMETA' for invalid meta data, `TCERHEAD' for invalid record header, `TCEOPEN' for open error, `TCECLOSE' for close error, `TCETRUNC' for trunc error, `TCESYNC' for sync error, `TCESTAT' for stat error, `TCESEEK' for seek error, `TCEREAD' for read error, `TCEWRITE' for write error, `TCEMMAP' for mmap error, `TCELOCK' for lock error, `TCEUNLINK' for unlink error, `TCERENAME' for rename error, `TCEMKDIR' for mkdir error, `TCERMDIR' for rmdir error, `TCEKEEP' for existing record, `TCENOREC' for no record found, and `TCEMISC' for miscellaneous error.

The function `tcidbtune' is used in order to set the tuning parameters of an indexed database object.

bool tcidbtune(TCIDB *idb, int64_t ernum, int64_t etnum, int64_t iusiz, uint8_t opts);
`idb' specifies the indexed database object which is not opened.
`ernum' specifies the expected number of records to be stored. If it is not more than 0, the default value is specified. The default value is 1000000.
`etnum' specifies the expected number of tokens to be stored. If it is not more than 0, the default value is specified. The default value is 1000000.
`iusiz' specifies the unit size of each index file. If it is not more than 0, the default value is specified. The default value is 536870912.
`opts' specifies options by bitwise-or: `IDBTLARGE' specifies that the size of the database can be larger than 2GB by using 64-bit bucket array, `IDBTDEFLATE' specifies that each page is compressed with Deflate encoding, `IDBTBZIP' specifies that each page is compressed with BZIP2 encoding, `IDBTTCBS' specifies that each page is compressed with TCBS encoding.
If successful, the return value is true, else, it is false.
Note that the tuning parameters should be set before the database is opened.
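
The rule "if it is not more than 0, the default value is specified" can be modeled in plain C as follows. This is only a sketch of the documented behavior, not code taken from the library: the type `tune_params' and the function `resolve_tune_params' are hypothetical names introduced for illustration, and only the default values come from the description above.

```c
#include <stdint.h>

/* Hypothetical container for the numeric arguments of `tcidbtune'. */
typedef struct {
  int64_t ernum;  /* expected number of records */
  int64_t etnum;  /* expected number of tokens */
  int64_t iusiz;  /* unit size of each index file, in bytes */
} tune_params;

/* Replace every value that is not more than 0 with its documented default. */
tune_params resolve_tune_params(tune_params in){
  tune_params out = in;
  if(out.ernum <= 0) out.ernum = 1000000;
  if(out.etnum <= 0) out.etnum = 1000000;
  if(out.iusiz <= 0) out.iusiz = 536870912;  /* 512 * 1024 * 1024 bytes */
  return out;
}
```

For example, passing the values 5000000, -1, and 0 keeps the record estimate and falls back to the defaults for the other two parameters.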

The function `tcidbsetcache' is used in order to set the caching parameters of an indexed database object.

bool tcidbsetcache(TCIDB *idb, int64_t icsiz, int32_t lcnum);
`idb' specifies the indexed database object which is not opened.
`icsiz' specifies the capacity size of the token cache. If it is not more than 0, the default value is specified. The default value is 134217728.
`lcnum' specifies the maximum number of cached leaf nodes of B+ tree. If it is not more than 0, the default value is specified. The default value is 64 for writer or 1024 for reader.
If successful, the return value is true, else, it is false.
Note that the caching parameters should be set before the database is opened.

The function `tcidbsetfwmmax' is used in order to set the maximum number of forward matching expansion of an indexed database object.

bool tcidbsetfwmmax(TCIDB *idb, uint32_t fwmmax);
`idb' specifies the indexed database object.
`fwmmax' specifies the maximum number of forward matching expansion.
If successful, the return value is true, else, it is false.
Note that the matching parameters should be set before the database is opened.

The function `tcidbopen' is used in order to open an indexed database object.

bool tcidbopen(TCIDB *idb, const char *path, int omode);
`idb' specifies the indexed database object.
`path' specifies the path of the database directory.
`omode' specifies the connection mode: `IDBOWRITER' as a writer, `IDBOREADER' as a reader. If the mode is `IDBOWRITER', the following may be added by bitwise-or: `IDBOCREAT', which means it creates a new database if one does not exist, `IDBOTRUNC', which means it creates a new database regardless of whether one exists. Both `IDBOREADER' and `IDBOWRITER' can be combined by bitwise-or with `IDBONOLCK', which means it opens the database directory without file locking, or `IDBOLCKNB', which means locking is performed without blocking.
If successful, the return value is true, else, it is false.

The function `tcidbclose' is used in order to close an indexed database object.

bool tcidbclose(TCIDB *idb);
`idb' specifies the indexed database object.
If successful, the return value is true, else, it is false.
Update of a database is assured to be written when the database is closed. If a writer opens a database but does not close it appropriately, the database will be broken.

The function `tcidbput' is used in order to store a record into an indexed database object.

bool tcidbput(TCIDB *idb, int64_t id, const char *text);
`idb' specifies the indexed database object connected as a writer.
`id' specifies the ID number of the record. It should be positive.
`text' specifies the string of the record, whose encoding should be UTF-8.
If successful, the return value is true, else, it is false.
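
Since the ID should be positive and the text should be UTF-8, an application may want to check both before calling `tcidbput'. The following is a minimal sketch of such a pre-check in standard C. The functions `is_valid_utf8' and `record_is_storable' are hypothetical helpers, not part of Tokyo Dystopia, and how the library itself treats invalid input is not described here.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Check that a NUL-terminated string is well-formed UTF-8. */
bool is_valid_utf8(const char *s){
  static const uint32_t min_cp[] = {0, 0x80, 0x800, 0x10000};
  const unsigned char *p = (const unsigned char *)s;
  while(*p){
    uint32_t cp;
    int extra;  /* number of continuation bytes expected */
    if(*p < 0x80){ cp = *p; extra = 0; }
    else if((*p & 0xE0) == 0xC0){ cp = *p & 0x1F; extra = 1; }
    else if((*p & 0xF0) == 0xE0){ cp = *p & 0x0F; extra = 2; }
    else if((*p & 0xF8) == 0xF0){ cp = *p & 0x07; extra = 3; }
    else return false;  /* stray continuation byte or invalid lead byte */
    p++;
    for(int i = 0; i < extra; i++, p++){
      /* this test also fails on the terminating NUL of a truncated sequence */
      if((*p & 0xC0) != 0x80) return false;
      cp = (cp << 6) | (*p & 0x3F);
    }
    if(cp < min_cp[extra]) return false;             /* overlong encoding */
    if(cp >= 0xD800 && cp <= 0xDFFF) return false;   /* surrogate */
    if(cp > 0x10FFFF) return false;                  /* beyond Unicode */
  }
  return true;
}

/* True if the pair satisfies the documented constraints on `id' and `text'. */
bool record_is_storable(int64_t id, const char *text){
  return id > 0 && text != NULL && is_valid_utf8(text);
}
```

A caller could skip or report records for which `record_is_storable' returns false instead of passing them to `tcidbput'.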

The function `tcidbout' is used in order to remove a record of an indexed database object.

bool tcidbout(TCIDB *idb, int64_t id);
`idb' specifies the indexed database object connected as a writer.
`id' specifies the ID number of the record. It should be positive.
If successful, the return value is true, else, it is false.

The function `tcidbget' is used in order to retrieve a record of an indexed database object.

char *tcidbget(TCIDB *idb, int64_t id);
`idb' specifies the indexed database object connected as a writer.
`id' specifies the ID number of the record. It should be positive.
If successful, the return value is the string of the corresponding record, else, it is `NULL'.
Because the region of the return value is allocated with the `malloc' call, it should be released with the `free' call when it is no longer in use.

The function `tcidbsearch' is used in order to search an indexed database.

uint64_t *tcidbsearch(TCIDB *idb, const char *word, int smode, int *np);
`idb' specifies the indexed database object.
`word' specifies the string of the word to be matched to.
`smode' specifies the matching mode: `IDBSSUBSTR' as substring matching, `IDBSPREFIX' as prefix matching, `IDBSSUFFIX' as suffix matching, `IDBSFULL' as full matching, `IDBSTOKEN' as token matching, `IDBSTOKPRE' as token prefix matching, or `IDBSTOKSUF' as token suffix matching.
`np' specifies the pointer to the variable into which the number of elements of the return value is assigned.
If successful, the return value is the pointer to an array of ID numbers of the corresponding records. `NULL' is returned on failure.
Because the region of the return value is allocated with the `malloc' call, it should be released with the `free' call when it is no longer in use.

The function `tcidbsearch2' is used in order to search an indexed database with a compound expression.

uint64_t *tcidbsearch2(TCIDB *idb, const char *expr, int *np);
`idb' specifies the indexed database object.
`expr' specifies the string of the compound expression.
`np' specifies the pointer to the variable into which the number of elements of the return value is assigned.
If successful, the return value is the pointer to an array of ID numbers of the corresponding records. `NULL' is returned on failure.
Because the region of the return value is allocated with the `malloc' call, it should be released with the `free' call when it is no longer in use.

The function `tcidbiterinit' is used in order to initialize the iterator of an indexed database object.

bool tcidbiterinit(TCIDB *idb);
`idb' specifies the indexed database object.
If successful, the return value is true, else, it is false.
The iterator is used in order to access the ID number of every record stored in a database.

The function `tcidbiternext' is used in order to get the next ID number of the iterator of an indexed database object.

uint64_t tcidbiternext(TCIDB *idb);
`idb' specifies the indexed database object.
If successful, the return value is the ID number of the next record, else, it is 0. 0 is also returned when no record remains to be fetched from the iterator.
It is possible to access every record by calling this function repeatedly. It is allowed to update or remove records whose keys are fetched during the iteration. However, the result is not assured if other updates to the database occur during the iteration. Besides, the order of this traversal access method is arbitrary, so it is not assured that the order of storing matches the order of the traversal access.

The function `tcidbsync' is used in order to synchronize updated contents of an indexed database object with the files and the device.

bool tcidbsync(TCIDB *idb);
`idb' specifies the indexed database object connected as a writer.
If successful, the return value is true, else, it is false.
This function is useful when another process connects the same database directory.

The function `tcidboptimize' is used in order to optimize the files of an indexed database object.

bool tcidboptimize(TCIDB *idb);
`idb' specifies the indexed database object connected as a writer.
If successful, the return value is true, else, it is false.
This function is useful to reduce the size of the database files with data fragmentation by successive updating.

The function `tcidbvanish' is used in order to remove all records of an indexed database object.

bool tcidbvanish(TCIDB *idb);
`idb' specifies the indexed database object connected as a writer.
If successful, the return value is true, else, it is false.

The function `tcidbcopy' is used in order to copy the database directory of an indexed database object.

bool tcidbcopy(TCIDB *idb, const char *path);
`idb' specifies the indexed database object.
`path' specifies the path of the destination directory. If it begins with `@', the trailing substring is executed as a command line.
If successful, the return value is true, else, it is false. False is returned if the executed command returns non-zero code.
The database directory is assured to be kept synchronized and not modified while the copying or executing operation is in progress. So, this function is useful to create a backup directory of the database directory.

The function `tcidbpath' is used in order to get the directory path of an indexed database object.

const char *tcidbpath(TCIDB *idb);
`idb' specifies the indexed database object.
The return value is the path of the database directory or `NULL' if the object does not connect to any database directory.

The function `tcidbrnum' is used in order to get the number of records of an indexed database object.

uint64_t tcidbrnum(TCIDB *idb);
`idb' specifies the indexed database object.
The return value is the number of records or 0 if the object does not connect to any database directory.

The function `tcidbfsiz' is used in order to get the total size of the database files of an indexed database object.

uint64_t tcidbfsiz(TCIDB *idb);
`idb' specifies the indexed database object.
The return value is the size of the database files or 0 if the object does not connect to any database directory.

Compound Expression of Search

The function `tcidbsearch2' searches with a compound expression. In the compound expression, tokens are separated by one or more white space characters. If one token is specified, records including the specified pattern are searched for. Upper and lower case are not distinguished, and accent marks and diacritical marks are ignored. If two or more tokens are specified, records including all of the patterns are searched for. The compound expression can also contain sub expressions combined with operators such as "&&" and "||".

Note that the priority of "||" is higher than that of "&&".
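
As a rough model of these rules, the following sketch decides whether a single text matches a compound expression. It is not the parser of the library. It handles only whitespace-separated tokens and the operators "&&" and "||", folds case for ASCII letters only, and does not ignore accent marks. The function names `contains_nocase' and `expr_matches' are hypothetical.

```c
#include <ctype.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* ASCII-only, case-insensitive test of whether `text' contains `pat'. */
static bool contains_nocase(const char *text, const char *pat, size_t plen){
  size_t tlen = strlen(text);
  for(size_t i = 0; i + plen <= tlen; i++){
    size_t j = 0;
    while(j < plen && tolower((unsigned char)text[i + j]) ==
                      tolower((unsigned char)pat[j])) j++;
    if(j == plen) return true;
  }
  return false;
}

/* Evaluate `expr' against `text'.  Tokens joined by "||" form a group, and
   all groups must match, so "||" binds more tightly than "&&".  Adjacent
   tokens without an operator are treated like "&&". */
bool expr_matches(const char *text, const char *expr){
  bool all = true;         /* conjunction of the groups finished so far */
  bool group = false;      /* value of the group being built */
  bool in_group = false;   /* whether a group has been started */
  bool pending_or = false; /* whether the previous token was "||" */
  const char *p = expr;
  while(*p){
    while(*p && isspace((unsigned char)*p)) p++;
    if(!*p) break;
    const char *start = p;
    while(*p && !isspace((unsigned char)*p)) p++;
    size_t len = (size_t)(p - start);
    if(len == 2 && memcmp(start, "||", 2) == 0){ pending_or = true; continue; }
    if(len == 2 && memcmp(start, "&&", 2) == 0){ pending_or = false; continue; }
    bool hit = contains_nocase(text, start, len);
    if(pending_or && in_group){
      group = group || hit;            /* extend the current group */
    } else {
      if(in_group) all = all && group; /* close the previous group */
      group = hit;
      in_group = true;
    }
    pending_or = false;
  }
  if(in_group) all = all && group;
  return all;
}
```

Under this model, "john && thomas || adams" is read as "john && (thomas || adams)", so a text that contains only "Adams" does not match.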

Example Code

The following code is an example to use an indexed database.

#include <dystopia.h>
#include <stdio.h>
#include <stdlib.h>
#include <stdbool.h>
#include <stdint.h>

int main(int argc, char **argv){
  TCIDB *idb;
  int ecode, rnum, i;
  uint64_t *result;
  char *text;

  /* create the object */
  idb = tcidbnew();

  /* open the database */
  if(!tcidbopen(idb, "casket", IDBOWRITER | IDBOCREAT)){
    ecode = tcidbecode(idb);
    fprintf(stderr, "open error: %s\n", tcidberrmsg(ecode));
  }

  /* store records */
  if(!tcidbput(idb, 1, "George Washington") ||
     !tcidbput(idb, 2, "John Adams") ||
     !tcidbput(idb, 3, "Thomas Jefferson")){
    ecode = tcidbecode(idb);
    fprintf(stderr, "put error: %s\n", tcidberrmsg(ecode));
  }

  /* search records */
  result = tcidbsearch2(idb, "john || thomas", &rnum);
  if(result){
    for(i = 0; i < rnum; i++){
      text = tcidbget(idb, result[i]);
      if(text){
        printf("%d\t%s\n", (int)result[i], text);
        free(text);
      }
    }
    free(result);
  } else {
    ecode = tcidbecode(idb);
    fprintf(stderr, "search error: %s\n", tcidberrmsg(ecode));
  }

  /* close the database */
  if(!tcidbclose(idb)){
    ecode = tcidbecode(idb);
    fprintf(stderr, "close error: %s\n", tcidberrmsg(ecode));
  }

  /* delete the object */
  tcidbdel(idb);

  return 0;
}

The Q-gram Database API

A q-gram database is a file containing an index of text. The key of each record is a positive number. The value of each record is arbitrary text whose encoding is UTF-8. Note that a q-gram database is a pure index and does not store the records themselves. See `tcqdb.h' for the entire specification.
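
For readers unfamiliar with the term: in general, a q-gram is a run of q consecutive characters, and a q-gram index maps each such run to the records that contain it. The following sketch enumerates the overlapping q-grams of a string. It illustrates the general idea only. The value of q and the character handling used by Tokyo Dystopia are not stated in this section, so the function names `qgram_count' and `qgram_at' and the byte-wise treatment of the text are assumptions.

```c
#include <stddef.h>
#include <string.h>

/* Number of overlapping q-grams in a byte string (0 if shorter than q). */
size_t qgram_count(const char *text, size_t q){
  size_t len = strlen(text);
  if(q == 0 || len < q) return 0;
  return len - q + 1;
}

/* Copy the q-gram starting at `index' into `out', which needs q + 1 bytes.
   Returns 1 on success or 0 if the index is out of range. */
int qgram_at(const char *text, size_t q, size_t index, char *out){
  if(index >= qgram_count(text, q)) return 0;
  memcpy(out, text + index, q);
  out[q] = '\0';
  return 1;
}
```

With q = 2, the string "John" yields the three q-grams "Jo", "oh", and "hn".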

Description

To use the q-gram database API, include `tcqdb.h' and related standard header files. Usually, write the following description near the front of a source file.

#include <tcqdb.h>
#include <stdlib.h>
#include <stdbool.h>
#include <stdint.h>

Objects whose type is pointer to `TCQDB' are used to handle q-gram databases. A q-gram database object is created with the function `tcqdbnew' and is deleted with the function `tcqdbdel'. To avoid memory leaks, it is important to delete every object when it is no longer in use.

Before operations to store or retrieve records, it is necessary to open a database file and connect the q-gram database object to it. The function `tcqdbopen' is used to open a database file and the function `tcqdbclose' is used to close the database file. To avoid data loss or corruption, it is important to close every database file when it is no longer in use.

API

The constant `tdversion' is the string containing the version information.

extern const char *tdversion;

The function `tcqdberrmsg' is used in order to get the message string corresponding to an error code.

const char *tcqdberrmsg(int ecode);
`ecode' specifies the error code.
The return value is the message string of the error code.

The function `tcqdbnew' is used in order to create a q-gram database object.

TCQDB *tcqdbnew(void);
The return value is the new q-gram database object.

The function `tcqdbdel' is used in order to delete a q-gram database object.

void tcqdbdel(TCQDB *qdb);
`qdb' specifies the q-gram database object.
If the database is not closed, it is closed implicitly. Note that the deleted object and its derivatives can not be used anymore.

The function `tcqdbecode' is used in order to get the last happened error code of a q-gram database object.

int tcqdbecode(TCQDB *qdb);
`qdb' specifies the q-gram database object.
The return value is the last happened error code.
The following error codes are defined: `TCESUCCESS' for success, `TCETHREAD' for threading error, `TCEINVALID' for invalid operation, `TCENOFILE' for file not found, `TCENOPERM' for no permission, `TCEMETA' for invalid meta data, `TCERHEAD' for invalid record header, `TCEOPEN' for open error, `TCECLOSE' for close error, `TCETRUNC' for trunc error, `TCESYNC' for sync error, `TCESTAT' for stat error, `TCESEEK' for seek error, `TCEREAD' for read error, `TCEWRITE' for write error, `TCEMMAP' for mmap error, `TCELOCK' for lock error, `TCEUNLINK' for unlink error, `TCERENAME' for rename error, `TCEMKDIR' for mkdir error, `TCERMDIR' for rmdir error, `TCEKEEP' for existing record, `TCENOREC' for no record found, and `TCEMISC' for miscellaneous error.

The function `tcqdbtune' is used in order to set the tuning parameters of a q-gram database object.

bool tcqdbtune(TCQDB *qdb, int64_t etnum, uint8_t opts);
`qdb' specifies the q-gram database object which is not opened.
`etnum' specifies the expected number of tokens to be stored. If it is not more than 0, the default value is specified. The default value is 1000000.
`opts' specifies options by bitwise-or: `QDBTLARGE' specifies that the size of the database can be larger than 2GB by using 64-bit bucket array, `QDBTDEFLATE' specifies that each page is compressed with Deflate encoding, `QDBTBZIP' specifies that each page is compressed with BZIP2 encoding, `QDBTTCBS' specifies that each page is compressed with TCBS encoding.
If successful, the return value is true, else, it is false.
Note that the tuning parameters should be set before the database is opened.

The function `tcqdbsetcache' is used in order to set the caching parameters of a q-gram database object.

bool tcqdbsetcache(TCQDB *qdb, int64_t icsiz, int32_t lcnum);
`qdb' specifies the q-gram database object which is not opened.
`icsiz' specifies the capacity size of the token cache. If it is not more than 0, the default value is specified. The default value is 134217728.
`lcnum' specifies the maximum number of cached leaf nodes of B+ tree. If it is not more than 0, the default value is specified. The default value is 64 for writer or 1024 for reader.
If successful, the return value is true, else, it is false.
Note that the caching parameters should be set before the database is opened.

The function `tcqdbsetfwmmax' is used in order to set the maximum number of forward matching expansion of a q-gram database object.

bool tcqdbsetfwmmax(TCQDB *qdb, uint32_t fwmmax);
`qdb' specifies the q-gram database object.
`fwmmax' specifies the maximum number of forward matching expansion.
If successful, the return value is true, else, it is false.
Note that the matching parameters should be set before the database is opened.

The function `tcqdbopen' is used in order to open a q-gram database object.

bool tcqdbopen(TCQDB *qdb, const char *path, int omode);
`qdb' specifies the q-gram database object.
`path' specifies the path of the database file.
`omode' specifies the connection mode: `QDBOWRITER' as a writer, `QDBOREADER' as a reader. If the mode is `QDBOWRITER', the following may be added by bitwise-or: `QDBOCREAT', which means it creates a new database if one does not exist, `QDBOTRUNC', which means it creates a new database regardless of whether one exists. Both `QDBOREADER' and `QDBOWRITER' can be combined by bitwise-or with `QDBONOLCK', which means it opens the database file without file locking, or `QDBOLCKNB', which means locking is performed without blocking.
If successful, the return value is true, else, it is false.

The function `tcqdbclose' is used in order to close a q-gram database object.

bool tcqdbclose(TCQDB *qdb);
`qdb' specifies the q-gram database object.
If successful, the return value is true, else, it is false.
Update of a database is assured to be written when the database is closed. If a writer opens a database but does not close it appropriately, the database will be broken.

The function `tcqdbput' is used in order to store a record into a q-gram database object.

bool tcqdbput(TCQDB *qdb, int64_t id, const char *text);
`qdb' specifies the q-gram database object connected as a writer.
`id' specifies the ID number of the record. It should be positive.
`text' specifies the string of the record, whose encoding should be UTF-8.
If successful, the return value is true, else, it is false.

The function `tcqdbout' is used in order to remove a record of a q-gram database object.

bool tcqdbout(TCQDB *qdb, int64_t id, const char *text);
`qdb' specifies the q-gram database object connected as a writer.
`id' specifies the ID number of the record. It should be positive.
`text' specifies the string of the record, which should be same as the stored one.
If successful, the return value is true, else, it is false.

The function `tcqdbsearch' is used in order to search a q-gram database.

uint64_t *tcqdbsearch(TCQDB *qdb, const char *word, int smode, int *np);
`qdb' specifies the q-gram database object.
`word' specifies the string of the word to be matched to.
`smode' specifies the matching mode: `QDBSSUBSTR' as substring matching, `QDBSPREFIX' as prefix matching, `QDBSSUFFIX' as suffix matching, or `QDBSFULL' as full matching.
`np' specifies the pointer to the variable into which the number of elements of the return value is assigned.
If successful, the return value is the pointer to an array of ID numbers of the corresponding records. `NULL' is returned on failure.
Because the region of the return value is allocated with the `malloc' call, it should be released with the `free' call when it is no longer in use.
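
To make the four matching modes concrete, the following sketch applies them to a single string in standard C. It models only what the mode names suggest and compares bytes exactly. Any normalization performed by the library is not covered. The type `match_mode', its enumerators, and the function `mode_matches' are hypothetical and are not the library constants.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-ins for the four matching modes; these are NOT the
   library constants and their values are arbitrary. */
typedef enum { MODE_SUBSTR, MODE_PREFIX, MODE_SUFFIX, MODE_FULL } match_mode;

/* Test whether `text' matches `word' under the given mode. */
bool mode_matches(const char *text, const char *word, match_mode mode){
  size_t tlen = strlen(text);
  size_t wlen = strlen(word);
  if(wlen > tlen) return false;
  switch(mode){
    case MODE_SUBSTR: return strstr(text, word) != NULL;
    case MODE_PREFIX: return strncmp(text, word, wlen) == 0;
    case MODE_SUFFIX: return strcmp(text + (tlen - wlen), word) == 0;
    case MODE_FULL:   return tlen == wlen && strcmp(text, word) == 0;
  }
  return false;
}
```

In this model, the word "Adams" matches the text "John Adams" in the substring and suffix modes, but not in the prefix or full modes.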

The function `tcqdbsync' is used in order to synchronize updated contents of a q-gram database object with the file and the device.

bool tcqdbsync(TCQDB *qdb);
`qdb' specifies the q-gram database object connected as a writer.
If successful, the return value is true, else, it is false.
This function is useful when another process connects the same database file.

The function `tcqdboptimize' is used in order to optimize the file of a q-gram database object.

bool tcqdboptimize(TCQDB *qdb);
`qdb' specifies the q-gram database object connected as a writer.
If successful, the return value is true, else, it is false.
This function is useful to reduce the size of the database file with data fragmentation by successive updating.

The function `tcqdbvanish' is used in order to remove all records of a q-gram database object.

bool tcqdbvanish(TCQDB *qdb);
`qdb' specifies the q-gram database object connected as a writer.
If successful, the return value is true, else, it is false.

The function `tcqdbcopy' is used in order to copy the database file of a q-gram database object.

bool tcqdbcopy(TCQDB *qdb, const char *path);
`qdb' specifies the q-gram database object.
`path' specifies the path of the destination file. If it begins with `@', the trailing substring is executed as a command line.
If successful, the return value is true, else, it is false. False is returned if the executed command returns non-zero code.
The database file is assured to be kept synchronized and not modified while the copying or executing operation is in progress. So, this function is useful to create a backup file of the database file.

The function `tcqdbpath' is used in order to get the file path of a q-gram database object.

const char *tcqdbpath(TCQDB *qdb);
`qdb' specifies the q-gram database object.
The return value is the path of the database file or `NULL' if the object does not connect to any database file.

The function `tcqdbtnum' is used in order to get the number of tokens of a q-gram database object.

uint64_t tcqdbtnum(TCQDB *qdb);
`qdb' specifies the q-gram database object.
The return value is the number of tokens or 0 if the object does not connect to any database file.

The function `tcqdbfsiz' is used in order to get the size of the database file of a q-gram database object.

uint64_t tcqdbfsiz(TCQDB *qdb);
`qdb' specifies the q-gram database object.
The return value is the size of the database file or 0 if the object does not connect to any database file.

Example Code

The following code is an example to use a q-gram database.

#include <tcqdb.h>
#include <stdio.h>
#include <stdlib.h>
#include <stdbool.h>
#include <stdint.h>

int main(int argc, char **argv){
  TCQDB *qdb;
  int ecode, rnum, i;
  uint64_t *result;

  /* create the object */
  qdb = tcqdbnew();

  /* open the database */
  if(!tcqdbopen(qdb, "casket", QDBOWRITER | QDBOCREAT)){
    ecode = tcqdbecode(qdb);
    fprintf(stderr, "open error: %s\n", tcqdberrmsg(ecode));
  }

  /* store records */
  if(!tcqdbput(qdb, 1, "George Washington") ||
     !tcqdbput(qdb, 2, "John Adams") ||
     !tcqdbput(qdb, 3, "Thomas Jefferson")){
    ecode = tcqdbecode(qdb);
    fprintf(stderr, "put error: %s\n", tcqdberrmsg(ecode));
  }

  /* search records */
  result = tcqdbsearch(qdb, "John", QDBSSUBSTR, &rnum);
  if(result){
    for(i = 0; i < rnum; i++){
      printf("%d\n", (int)result[i]);
    }
    free(result);
  } else {
    ecode = tcqdbecode(qdb);
    fprintf(stderr, "search error: %s\n", tcqdberrmsg(ecode));
  }

  /* close the database */
  if(!tcqdbclose(qdb)){
    ecode = tcqdbecode(qdb);
    fprintf(stderr, "close error: %s\n", tcqdberrmsg(ecode));
  }

  /* delete the object */
  tcqdbdel(qdb);

  return 0;
}

The Simple API

A tagged database is a directory containing a hash database file and its tagging files. The key of each record is a positive number. The value of each record is arbitrary text whose encoding is UTF-8. See `laputa.h' for the entire specification.

Description

To use the simple API, include `laputa.h' and related standard header files. Usually, write the following description near the front of a source file.

#include <laputa.h>
#include <stdlib.h>
#include <stdbool.h>
#include <stdint.h>

Objects whose type is pointer to `TCJDB' are used to handle tagged databases. A tagged database object is created with the function `tcjdbnew' and is deleted with the function `tcjdbdel'. To avoid memory leaks, it is important to delete every object when it is no longer in use.

Before operations to store or retrieve records, it is necessary to open a database directory and connect the tagged database object to it. The function `tcjdbopen' is used to open a database directory and the function `tcjdbclose' is used to close the database directory. To avoid data loss or corruption, it is important to close every database directory when it is no longer in use.

API

The function `tcjdberrmsg' is used in order to get the message string corresponding to an error code.

const char *tcjdberrmsg(int ecode);
`ecode' specifies the error code.
The return value is the message string of the error code.

The function `tcjdbnew' is used in order to create a tagged database object.

TCJDB *tcjdbnew(void);
The return value is the new tagged database object.

The function `tcjdbdel' is used in order to delete a tagged database object.

void tcjdbdel(TCJDB *jdb);
`jdb' specifies the tagged database object.
If the database is not closed, it is closed implicitly. Note that the deleted object and its derivatives can not be used anymore.

The function `tcjdbecode' is used in order to get the last happened error code of a tagged database object.

int tcjdbecode(TCJDB *jdb);
`jdb' specifies the tagged database object.
The return value is the last happened error code.
The following error codes are defined: `TCESUCCESS' for success, `TCETHREAD' for threading error, `TCEINVALID' for invalid operation, `TCENOFILE' for file not found, `TCENOPERM' for no permission, `TCEMETA' for invalid meta data, `TCERHEAD' for invalid record header, `TCEOPEN' for open error, `TCECLOSE' for close error, `TCETRUNC' for trunc error, `TCESYNC' for sync error, `TCESTAT' for stat error, `TCESEEK' for seek error, `TCEREAD' for read error, `TCEWRITE' for write error, `TCEMMAP' for mmap error, `TCELOCK' for lock error, `TCEUNLINK' for unlink error, `TCERENAME' for rename error, `TCEMKDIR' for mkdir error, `TCERMDIR' for rmdir error, `TCEKEEP' for existing record, `TCENOREC' for no record found, and `TCEMISC' for miscellaneous error.

The function `tcjdbtune' is used in order to set the tuning parameters of a tagged database object.

bool tcjdbtune(TCJDB *jdb, int64_t ernum, int64_t etnum, int64_t iusiz, uint8_t opts);
`jdb' specifies the tagged database object which is not opened.
`ernum' specifies the expected number of records to be stored. If it is not more than 0, the default value is specified. The default value is 1000000.
`etnum' specifies the expected number of tokens to be stored. If it is not more than 0, the default value is specified. The default value is 1000000.
`iusiz' specifies the unit size of each index file. If it is not more than 0, the default value is specified. The default value is 536870912.
`opts' specifies options by bitwise-or: `JDBTLARGE' specifies that the size of the database can be larger than 2GB by using 64-bit bucket array, `JDBTDEFLATE' specifies that each page is compressed with Deflate encoding, `JDBTBZIP' specifies that each page is compressed with BZIP2 encoding, `JDBTTCBS' specifies that each page is compressed with TCBS encoding.
If successful, the return value is true, else, it is false.
Note that the tuning parameters should be set before the database is opened.

The function `tcjdbsetcache' is used in order to set the caching parameters of a tagged database object.

bool tcjdbsetcache(TCJDB *jdb, int64_t icsiz, int32_t lcnum);
`jdb' specifies the tagged database object which is not opened.
`icsiz' specifies the capacity size of the token cache. If it is not more than 0, the default value is specified. The default value is 134217728.
`lcnum' specifies the maximum number of cached leaf nodes of B+ tree. If it is not more than 0, the default value is specified. The default value is 64 for writer or 1024 for reader.
If successful, the return value is true, else, it is false.
Note that the caching parameters should be set before the database is opened.

The function `tcjdbsetfwmmax' is used in order to set the maximum number of forward matching expansion of a tagged database object.

bool tcjdbsetfwmmax(TCJDB *jdb, uint32_t fwmmax);
`jdb' specifies the tagged database object.
`fwmmax' specifies the maximum number of forward matching expansion.
If successful, the return value is true, else, it is false.
Note that the matching parameters should be set before the database is opened.

The function `tcjdbopen' is used in order to open a tagged database object.

bool tcjdbopen(TCJDB *jdb, const char *path, int omode);
`jdb' specifies the tagged database object.
`path' specifies the path of the database directory.
`omode' specifies the connection mode: `JDBOWRITER' as a writer, `JDBOREADER' as a reader. If the mode is `JDBOWRITER', the following may be added by bitwise-or: `JDBOCREAT', which creates a new database if none exists, and `JDBOTRUNC', which creates a new database even if one already exists. Either `JDBOREADER' or `JDBOWRITER' may also be combined by bitwise-or with `JDBONOLCK', which opens the database directory without file locking, or `JDBOLCKNB', which performs locking without blocking.
If successful, the return value is true, else, it is false.

The function `tcjdbclose' is used in order to close a tagged database object.

bool tcjdbclose(TCJDB *jdb);
`jdb' specifies the tagged database object.
If successful, the return value is true, else, it is false.
Updates to a database are assured to be written when the database is closed. If a writer opens a database but does not close it properly, the database will be broken.

The function `tcjdbput' is used in order to store a record into a tagged database object.

bool tcjdbput(TCJDB *jdb, int64_t id, const TCLIST *words);
`jdb' specifies the tagged database object connected as a writer.
`id' specifies the ID number of the record. It should be positive.
`words' specifies a list object containing the words of the record, whose encoding should be UTF-8.
If successful, the return value is true, else, it is false.

The function `tcjdbput2' is used in order to store a record with a text string into a tagged database object.

bool tcjdbput2(TCJDB *jdb, int64_t id, const char *text, const char *delims);
`jdb' specifies the tagged database object connected as a writer.
`id' specifies the ID number of the record. It should be positive.
`text' specifies the string of the record, whose encoding should be UTF-8.
`delims' specifies a string containing delimiting characters of the text. If it is `NULL', space characters are specified.
If successful, the return value is true, else, it is false.

The function `tcjdbout' is used in order to remove a record of a tagged database object.

bool tcjdbout(TCJDB *jdb, int64_t id);
`jdb' specifies the tagged database object connected as a writer.
`id' specifies the ID number of the record. It should be positive.
If successful, the return value is true, else, it is false.

The function `tcjdbget' is used in order to retrieve a record of a tagged database object.

TCLIST *tcjdbget(TCJDB *jdb, int64_t id);
`jdb' specifies the tagged database object.
`id' specifies the ID number of the record. It should be positive.
If successful, the return value is a list object containing the words of the corresponding record, else, it is `NULL'.
Because the object of the return value is created with the function `tclistnew', it should be deleted with the function `tclistdel' when it is no longer in use.

The function `tcjdbget2' is used in order to retrieve a record as a string of a tagged database object.

char *tcjdbget2(TCJDB *jdb, int64_t id);
`jdb' specifies the tagged database object.
`id' specifies the ID number of the record. It should be positive.
If successful, the return value is the string of the corresponding record, else, it is `NULL'. Each word is separated by a tab character.
Because the region of the return value is allocated with the `malloc' call, it should be released with the `free' call when it is no longer in use.
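
Since `tcjdbget2' returns the words joined by tab characters, a caller usually has to split the string again. The following is a minimal sketch of such a splitter. The helper name `split_words' is an assumption made for illustration and is not part of Tokyo Dystopia; it works on any tab-separated string and does not call the library.

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical helper, not part of Tokyo Dystopia: count the words in a
   tab-separated record string and, if `out' is not NULL, print each word
   on its own line.  Empty fields are skipped. */
int split_words(const char *text, FILE *out){
  int num = 0;
  while(*text){
    /* length of the current field, up to the next tab or the end */
    size_t len = strcspn(text, "\t");
    if(len > 0){
      if(out) fprintf(out, "%.*s\n", (int)len, text);
      num++;
    }
    text += len;
    if(*text == '\t') text++;  /* step over the separator */
  }
  return num;
}
```

A typical call would pass the string returned by `tcjdbget2' together with `stdout', and release the string with `free' afterwards.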

The function `tcjdbsearch' is used in order to search a tagged database.

uint64_t *tcjdbsearch(TCJDB *jdb, const char *word, int smode, int *np);
`jdb' specifies the tagged database object.
`word' specifies the string of the word to be matched to.
`smode' specifies the matching mode: `JDBSSUBSTR' as substring matching, `JDBSPREFIX' as prefix matching, `JDBSSUFFIX' as suffix matching, `JDBSFULL' as full matching.
`np' specifies the pointer to the variable into which the number of elements of the return value is assigned.
If successful, the return value is the pointer to an array of ID numbers of the corresponding records. `NULL' is returned on failure.
Because the region of the return value is allocated with the `malloc' call, it should be released with the `free' call when it is no longer in use.

The function `tcjdbsearch2' is used in order to search a tagged database with a compound expression.

uint64_t *tcjdbsearch2(TCJDB *jdb, const char *expr, int *np);
`jdb' specifies the tagged database object.
`expr' specifies the string of the compound expression.
`np' specifies the pointer to the variable into which the number of elements of the return value is assigned.
If successful, the return value is the pointer to an array of ID numbers of the corresponding records. `NULL' is returned on failure.
Because the region of the return value is allocated with the `malloc' call, it should be released with the `free' call when it is no longer in use.

The function `tcjdbiterinit' is used in order to initialize the iterator of a tagged database object.

bool tcjdbiterinit(TCJDB *jdb);
`jdb' specifies the tagged database object.
If successful, the return value is true, else, it is false.
The iterator is used in order to access the ID number of every record stored in a database.

The function `tcjdbiternext' is used in order to get the next ID number of the iterator of a tagged database object.

uint64_t tcjdbiternext(TCJDB *jdb);
`jdb' specifies the tagged database object.
If successful, the return value is the ID number of the next record, else, it is 0. 0 is also returned when no record remains to be fetched from the iterator.
Every record can be accessed by calling this function repeatedly. Records whose ID numbers have already been fetched may be updated or removed during the iteration. However, the behavior of the iteration is not assured if other updates to the database occur during the iteration. In addition, the traversal order is arbitrary, so it is not assured to match the order in which the records were stored.

The function `tcjdbsync' is used in order to synchronize updated contents of a tagged database object with the files and the device.

bool tcjdbsync(TCJDB *jdb);
`jdb' specifies the tagged database object connected as a writer.
If successful, the return value is true, else, it is false.
This function is useful when another process connects to the same database directory.

The function `tcjdboptimize' is used in order to optimize the files of a tagged database object.

bool tcjdboptimize(TCJDB *jdb);
`jdb' specifies the tagged database object connected as a writer.
If successful, the return value is true, else, it is false.
This function is useful for reducing the size of database files that have become fragmented through successive updates.

The function `tcjdbvanish' is used in order to remove all records of a tagged database object.

bool tcjdbvanish(TCJDB *jdb);
`jdb' specifies the tagged database object connected as a writer.
If successful, the return value is true, else, it is false.

The function `tcjdbcopy' is used in order to copy the database directory of a tagged database object.

bool tcjdbcopy(TCJDB *jdb, const char *path);
`jdb' specifies the tagged database object.
`path' specifies the path of the destination directory. If it begins with `@', the trailing substring is executed as a command line.
If successful, the return value is true, else, it is false. False is returned if the executed command returns non-zero code.
The database directory is assured to be kept synchronized and not modified while the copying or executing operation is in progress. So, this function is useful to create a backup directory of the database directory.

The function `tcjdbpath' is used in order to get the directory path of a tagged database object.

const char *tcjdbpath(TCJDB *jdb);
`jdb' specifies the tagged database object.
The return value is the path of the database directory or `NULL' if the object does not connect to any database directory.

The function `tcjdbrnum' is used in order to get the number of records of a tagged database object.

uint64_t tcjdbrnum(TCJDB *jdb);
`jdb' specifies the tagged database object.
The return value is the number of records or 0 if the object does not connect to any database directory.

The function `tcjdbfsiz' is used in order to get the total size of the database files of a tagged database object.

uint64_t tcjdbfsiz(TCJDB *jdb);
`jdb' specifies the tagged database object.
The return value is the size of the database files or 0 if the object does not connect to any database directory.

Compound Expression of Search

The function `tcjdbsearch2' searches with a compound expression. In the compound expression, tokens are separated by one or more white space characters. If one token is specified, records including the specified pattern are searched for. Upper and lower case are not distinguished. Accent marks and diacritical marks are ignored. If two or more tokens are specified, records including all of the patterns are searched for. The compound expression includes the following sub expressions.

Note that the priority of "||" is higher than the one of "&&".
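
The effect of this priority rule can be illustrated without the library. The following is a minimal sketch, assuming only plain words, "&&", and "||" appear in the expression. The function `expr_matches' is hypothetical and is not how Tokyo Dystopia evaluates expressions internally; unlike the real search, it matches case-sensitively and does not normalize accent marks.

```c
#include <stdbool.h>
#include <string.h>

/* Hypothetical helper, not part of Tokyo Dystopia: evaluate a simplified
   compound expression against one text.  Only plain words, "&&" and "||"
   are handled, and matching is a case-sensitive substring test. */
bool expr_matches(const char *expr, const char *text){
  bool all = true;          /* AND of the groups completed so far */
  bool group = false;       /* OR of the words in the current group */
  bool have_group = false;  /* whether any word has been seen */
  bool pending_or = false;  /* whether the previous token was "||" */
  char tok[64];
  const char *p = expr;
  for(;;){
    size_t len;
    bool hit;
    p += strspn(p, " \t");
    len = strcspn(p, " \t");
    if(len == 0) break;
    if(len >= sizeof(tok)) return false;
    memcpy(tok, p, len);
    tok[len] = '\0';
    p += len;
    if(strcmp(tok, "||") == 0){
      pending_or = true;
      continue;
    }
    if(strcmp(tok, "&&") == 0) continue;  /* AND is implied between groups */
    hit = strstr(text, tok) != NULL;
    if(pending_or && have_group){
      group = group || hit;  /* "||" binds tighter: extend the current group */
    } else {
      if(have_group) all = all && group;  /* close the previous group */
      group = hit;
      have_group = true;
    }
    pending_or = false;
  }
  if(have_group) all = all && group;
  return have_group && all;
}
```

Under this reading, "john || thomas && adams" is grouped as "(john || thomas) && adams", so a text containing only "john" does not match.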

Example Code

The following code is an example to use a tagged database.

#include <laputa.h>
#include <stdlib.h>
#include <stdio.h>
#include <stdbool.h>
#include <stdint.h>

int main(int argc, char **argv){
  TCJDB *jdb;
  int ecode, rnum, i;
  uint64_t *result;
  char *text;

  /* create the object */
  jdb = tcjdbnew();

  /* open the database */
  if(!tcjdbopen(jdb, "casket", JDBOWRITER | JDBOCREAT)){
    ecode = tcjdbecode(jdb);
    fprintf(stderr, "open error: %s\n", tcjdberrmsg(ecode));
  }

  /* store records */
  if(!tcjdbput2(jdb, 1, "George Washington", NULL) ||
     !tcjdbput2(jdb, 2, "John Adams", NULL) ||
     !tcjdbput2(jdb, 3, "Thomas Jefferson", NULL)){
    ecode = tcjdbecode(jdb);
    fprintf(stderr, "put error: %s\n", tcjdberrmsg(ecode));
  }

  /* search records */
  result = tcjdbsearch2(jdb, "john || thomas", &rnum);
  if(result){
    for(i = 0; i < rnum; i++){
      text = tcjdbget2(jdb, result[i]);
      if(text){
        printf("%d\t%s\n", (int)result[i], text);
        free(text);
      }
    }
    free(result);
  } else {
    ecode = tcjdbecode(jdb);
    fprintf(stderr, "search error: %s\n", tcjdberrmsg(ecode));
  }

  /* close the database */
  if(!tcjdbclose(jdb)){
    ecode = tcjdbecode(jdb);
    fprintf(stderr, "close error: %s\n", tcjdberrmsg(ecode));
  }

  /* delete the object */
  tcjdbdel(jdb);

  return 0;
}

The Word Database API

A word database is a file containing an index of text. The key of each record is a positive number. The value of each record is a list of words whose encoding is UTF-8. Note that a word database is a pure index and does not contain the records themselves. See `tcwdb.h' for the entire specification.

Description

To use the word database API, include `tcwdb.h' and related standard header files. Usually, write the following description near the front of a source file.

#include <tcwdb.h>
#include <stdlib.h>
#include <stdbool.h>
#include <stdint.h>

Objects whose type is pointer to `TCWDB' are used to handle word databases. A word database object is created with the function `tcwdbnew' and is deleted with the function `tcwdbdel'. To avoid memory leak, it is important to delete every object when it is no longer in use.

Before operations to store or retrieve records, it is necessary to open a database file and connect the word database object to it. The function `tcwdbopen' is used to open a database file and the function `tcwdbclose' is used to close the database file. To avoid data loss or corruption, it is important to close every database file when it is no longer in use.

API

The function `tcwdberrmsg' is used in order to get the message string corresponding to an error code.

const char *tcwdberrmsg(int ecode);
`ecode' specifies the error code.
The return value is the message string of the error code.

The function `tcwdbnew' is used in order to create a word database object.

TCWDB *tcwdbnew(void);
The return value is the new word database object.

The function `tcwdbdel' is used in order to delete a word database object.

void tcwdbdel(TCWDB *wdb);
`wdb' specifies the word database object.
If the database is not closed, it is closed implicitly. Note that the deleted object and its derivatives can not be used anymore.

The function `tcwdbecode' is used in order to get the most recent error code of a word database object.

int tcwdbecode(TCWDB *wdb);
`wdb' specifies the word database object.
The return value is the most recent error code.
The following error codes are defined: `TCESUCCESS' for success, `TCETHREAD' for threading error, `TCEINVALID' for invalid operation, `TCENOFILE' for file not found, `TCENOPERM' for no permission, `TCEMETA' for invalid meta data, `TCERHEAD' for invalid record header, `TCEOPEN' for open error, `TCECLOSE' for close error, `TCETRUNC' for trunc error, `TCESYNC' for sync error, `TCESTAT' for stat error, `TCESEEK' for seek error, `TCEREAD' for read error, `TCEWRITE' for write error, `TCEMMAP' for mmap error, `TCELOCK' for lock error, `TCEUNLINK' for unlink error, `TCERENAME' for rename error, `TCEMKDIR' for mkdir error, `TCERMDIR' for rmdir error, `TCEKEEP' for existing record, `TCENOREC' for no record found, and `TCEMISC' for miscellaneous error.

The function `tcwdbtune' is used in order to set the tuning parameters of a word database object.

bool tcwdbtune(TCWDB *wdb, int64_t etnum, uint8_t opts);
`wdb' specifies the word database object which is not opened.
`etnum' specifies the expected number of tokens to be stored. If it is not more than 0, the default value is specified. The default value is 1000000.
`opts' specifies options by bitwise-or: `WDBTLARGE' specifies that the size of the database can be larger than 2GB by using 64-bit bucket array, `WDBTDEFLATE' specifies that each page is compressed with Deflate encoding, `WDBTBZIP' specifies that each page is compressed with BZIP2 encoding, `WDBTTCBS' specifies that each page is compressed with TCBS encoding.
If successful, the return value is true, else, it is false.
Note that the tuning parameters should be set before the database is opened.

The function `tcwdbsetcache' is used in order to set the caching parameters of a word database object.

bool tcwdbsetcache(TCWDB *wdb, int64_t icsiz, int32_t lcnum);
`wdb' specifies the word database object which is not opened.
`icsiz' specifies the capacity size of the token cache. If it is not more than 0, the default value is specified. The default value is 134217728.
`lcnum' specifies the maximum number of cached leaf nodes of B+ tree. If it is not more than 0, the default value is specified. The default value is 64 for writer or 1024 for reader.
If successful, the return value is true, else, it is false.
Note that the caching parameters should be set before the database is opened.

The function `tcwdbsetfwmmax' is used in order to set the maximum number of forward matching expansion of a word database object.

bool tcwdbsetfwmmax(TCWDB *wdb, uint32_t fwmmax);
`wdb' specifies the word database object.
`fwmmax' specifies the maximum number of forward matching expansion.
If successful, the return value is true, else, it is false.
Note that the matching parameters should be set before the database is opened.

The function `tcwdbopen' is used in order to open a word database object.

bool tcwdbopen(TCWDB *wdb, const char *path, int omode);
`wdb' specifies the word database object.
`path' specifies the path of the database file.
`omode' specifies the connection mode: `WDBOWRITER' as a writer, `WDBOREADER' as a reader. If the mode is `WDBOWRITER', the following may be added by bitwise-or: `WDBOCREAT', which creates a new database if none exists, and `WDBOTRUNC', which creates a new database even if one already exists. Either `WDBOREADER' or `WDBOWRITER' may also be combined by bitwise-or with `WDBONOLCK', which opens the database file without file locking, or `WDBOLCKNB', which performs locking without blocking.
If successful, the return value is true, else, it is false.

The function `tcwdbclose' is used in order to close a word database object.

bool tcwdbclose(TCWDB *wdb);
`wdb' specifies the word database object.
If successful, the return value is true, else, it is false.
Updates to a database are assured to be written when the database is closed. If a writer opens a database but does not close it properly, the database will be broken.

The function `tcwdbput' is used in order to store a record into a word database object.

bool tcwdbput(TCWDB *wdb, int64_t id, const TCLIST *words);
`wdb' specifies the word database object connected as a writer.
`id' specifies the ID number of the record. It should be positive.
`words' specifies a list object containing the words of the record, whose encoding should be UTF-8.
If successful, the return value is true, else, it is false.

The function `tcwdbput2' is used in order to store a record with a text string into a word database object.

bool tcwdbput2(TCWDB *wdb, int64_t id, const char *text, const char *delims);
`wdb' specifies the word database object connected as a writer.
`id' specifies the ID number of the record. It should be positive.
`text' specifies the string of the record, whose encoding should be UTF-8.
`delims' specifies a string containing delimiting characters of the text. If it is `NULL', space characters are specified.
If successful, the return value is true, else, it is false.

The function `tcwdbout' is used in order to remove a record of a word database object.

bool tcwdbout(TCWDB *wdb, int64_t id, const TCLIST *words);
`wdb' specifies the word database object connected as a writer.
`id' specifies the ID number of the record. It should be positive.
`words' specifies a list object containing the words of the record, which should be the same as the stored ones.
If successful, the return value is true, else, it is false.

The function `tcwdbout2' is used in order to remove a record with a text string of a word database object.

bool tcwdbout2(TCWDB *wdb, int64_t id, const char *text, const char *delims);
`wdb' specifies the word database object connected as a writer.
`id' specifies the ID number of the record. It should be positive.
`text' specifies the string of the record, which should be the same as the stored one.
`delims' specifies a string containing delimiting characters of the text. If it is `NULL', space characters are specified.
If successful, the return value is true, else, it is false.

The function `tcwdbsearch' is used in order to search a word database.

uint64_t *tcwdbsearch(TCWDB *wdb, const char *word, int *np);
`wdb' specifies the word database object.
`word' specifies the string of the word to be matched to.
`np' specifies the pointer to the variable into which the number of elements of the return value is assigned.
If successful, the return value is the pointer to an array of ID numbers of the corresponding records. `NULL' is returned on failure.
Because the region of the return value is allocated with the `malloc' call, it should be released with the `free' call when it is no longer in use.
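
`tcwdbsearch' takes a single word, so a caller that wants records containing several words has to combine the result arrays itself. The following is a minimal sketch of one way to do that by intersecting two arrays of ID numbers. The helper names `cmp_u64' and `intersect_ids' are assumptions made for illustration and are not part of Tokyo Dystopia. Because the order of the returned ID numbers is not stated here, the sketch sorts both arrays first.

```c
#include <stdint.h>
#include <stdlib.h>

/* Comparison callback for qsort over uint64_t values. */
static int cmp_u64(const void *a, const void *b){
  uint64_t x = *(const uint64_t *)a, y = *(const uint64_t *)b;
  return (x > y) - (x < y);
}

/* Hypothetical helper, not part of Tokyo Dystopia: keep in `a' only the
   ID numbers that also appear in `b'.  Both arrays are sorted in place;
   the return value is the new number of elements in `a'. */
int intersect_ids(uint64_t *a, int anum, uint64_t *b, int bnum){
  int i = 0, j = 0, n = 0;
  qsort(a, anum, sizeof(*a), cmp_u64);
  qsort(b, bnum, sizeof(*b), cmp_u64);
  while(i < anum && j < bnum){
    if(a[i] < b[j]){
      i++;
    } else if(a[i] > b[j]){
      j++;
    } else {
      a[n++] = a[i];  /* present in both arrays */
      i++;
      j++;
    }
  }
  return n;
}
```

Both input arrays would come from separate calls of `tcwdbsearch' and should still be released with `free' after the intersection.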

The function `tcwdbsync' is used in order to synchronize updated contents of a word database object with the file and the device.

bool tcwdbsync(TCWDB *wdb);
`wdb' specifies the word database object connected as a writer.
If successful, the return value is true, else, it is false.
This function is useful when another process connects to the same database file.

The function `tcwdboptimize' is used in order to optimize the file of a word database object.

bool tcwdboptimize(TCWDB *wdb);
`wdb' specifies the word database object connected as a writer.
If successful, the return value is true, else, it is false.
This function is useful for reducing the size of a database file that has become fragmented through successive updates.

The function `tcwdbvanish' is used in order to remove all records of a word database object.

bool tcwdbvanish(TCWDB *wdb);
`wdb' specifies the word database object connected as a writer.
If successful, the return value is true, else, it is false.

The function `tcwdbcopy' is used in order to copy the database file of a word database object.

bool tcwdbcopy(TCWDB *wdb, const char *path);
`wdb' specifies the word database object.
`path' specifies the path of the destination file. If it begins with `@', the trailing substring is executed as a command line.
If successful, the return value is true, else, it is false. False is returned if the executed command returns non-zero code.
The database file is assured to be kept synchronized and not modified while the copying or executing operation is in progress. So, this function is useful to create a backup file of the database file.

The function `tcwdbpath' is used in order to get the file path of a word database object.

const char *tcwdbpath(TCWDB *wdb);
`wdb' specifies the word database object.
The return value is the path of the database file or `NULL' if the object does not connect to any database file.

The function `tcwdbtnum' is used in order to get the number of tokens of a word database object.

uint64_t tcwdbtnum(TCWDB *wdb);
`wdb' specifies the word database object.
The return value is the number of tokens or 0 if the object does not connect to any database file.

The function `tcwdbfsiz' is used in order to get the size of the database file of a word database object.

uint64_t tcwdbfsiz(TCWDB *wdb);
`wdb' specifies the word database object.
The return value is the size of the database file or 0 if the object does not connect to any database file.

Example Code

The following code is an example to use a word database.

#include <tcwdb.h>
#include <stdlib.h>
#include <stdio.h>
#include <stdbool.h>
#include <stdint.h>

int main(int argc, char **argv){
  TCWDB *wdb;
  int ecode, rnum, i;
  uint64_t *result;

  /* create the object */
  wdb = tcwdbnew();

  /* open the database */
  if(!tcwdbopen(wdb, "casket", WDBOWRITER | WDBOCREAT)){
    ecode = tcwdbecode(wdb);
    fprintf(stderr, "open error: %s\n", tcwdberrmsg(ecode));
  }

  /* store records */
  if(!tcwdbput2(wdb, 1, "George Washington", NULL) ||
     !tcwdbput2(wdb, 2, "John Adams", NULL) ||
     !tcwdbput2(wdb, 3, "Thomas Jefferson", NULL)){
    ecode = tcwdbecode(wdb);
    fprintf(stderr, "put error: %s\n", tcwdberrmsg(ecode));
  }

  /* search records */
  result = tcwdbsearch(wdb, "John", &rnum);
  if(result){
    for(i = 0; i < rnum; i++){
      printf("%d\n", (int)result[i]);
    }
    free(result);
  } else {
    ecode = tcwdbecode(wdb);
    fprintf(stderr, "search error: %s\n", tcwdberrmsg(ecode));
  }

  /* close the database */
  if(!tcwdbclose(wdb)){
    ecode = tcwdbecode(wdb);
    fprintf(stderr, "close error: %s\n", tcwdberrmsg(ecode));
  }

  /* delete the object */
  tcwdbdel(wdb);

  return 0;
}

Command Line Interfaces

To use the core API easily, the commands `dysttest' and `dystmgr' are provided. To use the q-gram database API easily, the commands `tcqtest' and `tcqmgr' are provided. To use the simple API easily, the commands `laputest' and `lapumgr' are provided. To use the word database API easily, the commands `tcwtest' and `tcwmgr' are provided.

dysttest

The command `dysttest' is a utility for facility test and performance test of the core API. This command is used in the following format. `path' specifies the path of a database directory. `rnum' specifies the number of iterations.

dysttest write [-tl] [-td|-tb|-tt] [-er num] [-et num] [-iu num] [-ic num] [-xnt] [-nl|-nb] [-la num] [-en] path rnum
Store records with random texts.
dysttest read [-nl|-nb] [-la num] [-lm num] [-en] [-sp|-ss|-sf|-st|-stp|-sts] path rnum
Search for records with random texts.
dysttest wicked [-tl] [-td|-tb|-tt] [-er num] [-et num] [-iu num] [-ic num] [-nl|-nb] [-la num] [-en] path rnum
Perform updating operations selected at random.

Options feature the following.

This command returns 0 on success and a nonzero value on failure.

dystmgr

The command `dystmgr' is a utility for test and debugging of the core API and its applications. `path' specifies the path of a database directory. `ernum' specifies the expected number of records. `etnum' specifies the expected number of tokens. `id' specifies the ID number of a record. `text' specifies the text of a record. `word' specifies a search word. `file' specifies the input file.

dystmgr create [-tl] [-td|-tb|-tt] path [ernum [etnum]]
Create a database directory.
dystmgr inform [-nl|-nb] path
Print miscellaneous information to the standard output.
dystmgr put [-nl|-nb] path id text
Store a record.
dystmgr out [-nl|-nb] path id
Remove a record.
dystmgr get [-nl|-nb] path id
Print the text of a record.
dystmgr search [-nl|-nb] [-eu|-ei|-ed] [-sp|-ss|-sf|-st|-stp|-sts] [-max num] [-ph] [-pv] path [word...]
Search for records.
dystmgr list [-nl|-nb] [-max num] [-pv] path
Print ID numbers of all records.
dystmgr optimize [-nl|-nb] path
Optimize a database directory.
dystmgr importtsv [-ic num] [-nl|-nb] path [file]
Store records of TSV in each line of a file.
dystmgr version
Print the version information of Tokyo Dystopia.

Options feature the following.

This command returns 0 on success and a nonzero value on failure.

tcqtest

The command `tcqtest' is a utility for facility test and performance test of the q-gram database API. This command is used in the following format. `path' specifies the path of a database file. `rnum' specifies the number of iterations.

tcqtest write [-tl] [-td|-tb|-tt] [-et num] [-ic num] [-nl|-nb] [-la num] [-en] [-rc] [-ra] [-rs] path rnum
Store records with random texts.
tcqtest read [-nl|-nb] [-la num] [-lm num] [-en] [-rc] [-ra] [-rs] [-sp|-ss|-sf] path rnum
Search for records with random texts.
tcqtest wicked [-tl] [-td|-tb|-tt] [-et num] [-ic num] [-nl|-nb] [-la num] [-en] [-rc] [-ra] [-rs] path rnum
Perform updating operations selected at random.

Options feature the following.

This command returns 0 on success and a nonzero value on failure.

tcqmgr

The command `tcqmgr' is a utility for test and debugging of the q-gram database API and its applications. `path' specifies the path of a database file. `etnum' specifies the expected number of tokens. `id' specifies the ID number of a record. `text' specifies the text of a record. `word' specifies a search word. `file' specifies the input file.

tcqmgr create [-tl] [-td|-tb|-tt] path [etnum]
Create a database file.
tcqmgr inform [-nl|-nb] path
Print miscellaneous information to the standard output.
tcqmgr put [-nl|-nb] [-rc] [-ra] [-rs] path id text
Store a record.
tcqmgr out [-nl|-nb] [-rc] [-ra] [-rs] path id text
Remove a record.
tcqmgr search [-nl|-nb] [-rc] [-ra] [-rs] [-eu|-ed] [-sp|-ss|-sf] [-max num] [-ph] path [word...]
Search for records.
tcqmgr optimize [-nl|-nb] path
Optimize a database file.
tcqmgr importtsv [-ic num] [-nl|-nb] [-rc] [-ra] [-rs] path [file]
Store records of TSV in each line of a file.
tcqmgr normalize [-rc] [-ra] [-rs] text
Normalize a text.
tcqmgr version
Print the version information of Tokyo Dystopia.

Options feature the following.

This command returns 0 on success and a nonzero value on failure.

laputest

The command `laputest' is a utility for facility test and performance test of the simple API. This command is used in the following format. `path' specifies the path of a database directory. `rnum' specifies the number of iterations.

laputest write [-tl] [-td|-tb|-tt] [-er num] [-et num] [-iu num] [-ic num] [-xnt] [-nl|-nb] [-la num] [-en] path rnum
Store records with random texts.
laputest read [-nl|-nb] [-la num] [-lm num] [-en] [-sm|-sp|-ss] path rnum
Search for records with random texts.
laputest wicked [-tl] [-td|-tb|-tt] [-er num] [-et num] [-iu num] [-ic num] [-nl|-nb] [-la num] [-en] path rnum
Perform updating operations selected at random.

Options feature the following.

This command returns 0 on success and a nonzero value on failure.

lapumgr

The command `lapumgr' is a utility for test and debugging of the simple API and its applications. `path' specifies the path of a database directory. `ernum' specifies the expected number of records. `etnum' specifies the expected number of tokens. `id' specifies the ID number of a record. `text' specifies the text of a record. `word' specifies a search word. `file' specifies the input file.

lapumgr create [-tl] [-td|-tb|-tt] path [ernum [etnum]]
Create a database directory.
lapumgr inform [-nl|-nb] path
Print miscellaneous information to the standard output.
lapumgr put [-nl|-nb] path id text
Store a record.
lapumgr out [-nl|-nb] path id
Remove a record.
lapumgr get [-nl|-nb] path id
Print the text of a record.
lapumgr search [-nl|-nb] [-eu|-ei|-ed] [-sm|-sp|-ss] [-max num] [-ph] [-pv] path [word...]
Search for records.
lapumgr list [-nl|-nb] [-max num] [-pv] path
Print ID numbers of all records.
lapumgr optimize [-nl|-nb] path
Optimize a database directory.
lapumgr importtsv [-ic num] [-nl|-nb] path [file]
Store records of TSV in each line of a file.
lapumgr version
Print the version information of Tokyo Dystopia.

Options feature the following.

This command returns 0 on success and a nonzero value on failure.

tcwtest

The command `tcwtest' is a utility for facility test and performance test of the word database API. This command is used in the following format. `path' specifies the path of a database file. `rnum' specifies the number of iterations.

tcwtest write [-tl] [-td|-tb|-tt] [-et num] [-ic num] [-nl|-nb] [-la num] [-en] [-rc] [-ra] [-rs] path rnum
Store records with random texts.
tcwtest read [-nl|-nb] [-la num] [-lm num] [-en] [-rc] [-ra] [-rs] path rnum
Search for records with random texts.
tcwtest wicked [-tl] [-td|-tb|-tt] [-et num] [-ic num] [-nl|-nb] [-la num] [-en] [-rc] [-ra] [-rs] path rnum
Perform updating operations selected at random.

Options feature the following.

This command returns 0 on success and a nonzero value on failure.

tcwmgr

The command `tcwmgr' is a utility for test and debugging of the word database API and its applications. `path' specifies the path of a database directory. `etnum' specifies the expected number of tokens. `id' specifies the ID number of a record. `text' specifies the text of a record. `word' specifies a search word. `file' specifies the input file.

tcwmgr create [-tl] [-td|-tb|-tt] path [etnum]
Create a database file.
tcwmgr inform [-nl|-nb] path
Print miscellaneous information to the standard output.
tcwmgr put [-nl|-nb] [-rc] [-ra] [-rs] path id text
Store a record.
tcwmgr out [-nl|-nb] [-rc] [-ra] [-rs] path id text
Remove a record.
tcwmgr search [-nl|-nb] [-rc] [-ra] [-rs] [-eu|-ed] [-max num] [-ph] path [word...]
Search for records.
tcwmgr optimize [-nl|-nb] path
Optimize a database file.
tcwmgr importtsv [-ic num] [-nl|-nb] [-rc] [-ra] [-rs] path [file]
Store records of TSV in each line of a file.
tcwmgr normalize [-rc] [-ra] [-rs] text
Normalize a text.
tcwmgr version
Print the version information of Tokyo Dystopia.

Options feature the following.

This command returns 0 on success and a nonzero value on failure.
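
Because each of the commands above reports success with the exit status 0, a calling program can check the result of a run. The following is a minimal sketch for POSIX systems. The helper `command_succeeded' is a hypothetical name, not part of Tokyo Dystopia, and the command line is passed to the shell as is, so it should not be built from untrusted input.

```c
#include <stdlib.h>
#include <sys/wait.h>

/* Hypothetical helper, not part of Tokyo Dystopia: run a command line
   through the shell and return 1 if it exited normally with status 0,
   or 0 otherwise. */
int command_succeeded(const char *cmdline){
  int status = system(cmdline);
  if(status == -1) return 0;   /* the shell could not be started */
  return WIFEXITED(status) && WEXITSTATUS(status) == 0;
}
```

For example, `command_succeeded("tcwmgr inform casket")' would be expected to return 1 only if the command ran and reported success.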


Tutorial

In this tutorial, we use the command `dystmgr' of the core API and try to make and search an indexed database.

Indexing

To begin with, make the database from a TSV (tab-separated values) file. Each line of the TSV file represents a record. The first field specifies the ID number of the record, and the second field specifies the text. The ID number can be any positive integer. The encoding of the text should be UTF-8. The following is an example.

1       United States
33      France
34      Spain
44      United Kingdom
49      Germany
55      Brazil
81      Japan

If the TSV file is named `calling.tsv', it can be indexed by the following command. As a result, the database `casket' is generated.

dystmgr importtsv casket calling.tsv
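
The record format described above (an ID number, a tab, then the text) can be checked before import with a small helper. The following is a minimal sketch, not the parser used by `dystmgr importtsv'. The function name `parse_tsv_record' is hypothetical, and the rule that the line must begin with a digit is an assumption made for this example.

```c
#include <assert.h>
#include <ctype.h>
#include <errno.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical helper, not part of Tokyo Dystopia: split one TSV line into
   the ID number and the text.  Returns 1 on success, 0 if the line is
   malformed.  On success, `*textp' points into `line' just after the tab,
   so a trailing newline, if any, is still part of the text. */
int parse_tsv_record(const char *line, int64_t *idp, const char **textp){
  char *end;
  long long id;
  assert(line && idp && textp);
  if(!isdigit((unsigned char)line[0])) return 0;  /* must begin with a digit */
  errno = 0;
  id = strtoll(line, &end, 10);
  if(errno == ERANGE) return 0;                   /* ID does not fit */
  if(id < 1) return 0;                            /* ID must be positive */
  if(*end != '\t') return 0;                      /* separator must be a tab */
  *idp = (int64_t)id;
  *textp = end + 1;                               /* text follows the tab */
  return 1;
}
```

A caller would read the file line by line, strip the trailing newline, and skip or report any line for which the helper returns 0.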

It is possible to add a record to an existing database. To add "China" with the ID number 83, perform the following command.

dystmgr put casket 83 "China"

It is possible to remove a record from a database. To remove the record with the ID number 55, perform the following command.

dystmgr out casket 55

To print all records in a database, perform the following command.

dystmgr list -pv casket

Search

To search for records including "united", perform the following command.

dystmgr search -pv casket "united"

The search expression of the command `dystmgr search' (and of the API function `tcidbsearch2') is called a "compound expression". This section describes typical uses of the compound expression. As a first example, the following expression searches for records including both "john" and "doe". That is, white space characters are treated as operators of logical intersection.

john doe

The following is equivalent to the above.

john && doe

Logical union is also supported. Use the operator "||" instead of "&&". For example, the following expression searches for records including "john" or "james".

john || james

The token "john" matches such words as "johnson", "johnny", "demijohn" and so on. However, a token expression enclosed between "[[" and "]]" matches only whole words. To search for records including words exactly matching "john", specify the following expression.

[[john]]

The wild card "*" can be used in the token expression. To search for records including words beginning with "john", specify the following expression.

[[john*]]
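
The difference between a plain token and a token expression can be illustrated with a toy matcher. This is only a model of the behavior described above, not the matching code of Tokyo Dystopia. The function name `word_matches' is hypothetical, the pattern is passed without the surrounding "[[" and "]]", words are assumed to be separated by white space, and only ASCII case is folded.

```c
#include <ctype.h>
#include <string.h>

/* Compare `len' bytes of `a' and `b', ignoring ASCII case. */
static int same_nocase(const char *a, const char *b, size_t len){
  size_t i;
  for(i = 0; i < len; i++){
    if(tolower((unsigned char)a[i]) != tolower((unsigned char)b[i])) return 0;
  }
  return 1;
}

/* Toy model of "[[word]]" and "[[word*]]": return 1 if some white-space
   separated word of `text' equals `pat', or begins with it when `pat'
   ends with the wild card '*'.  Return 0 otherwise. */
int word_matches(const char *pat, const char *text){
  size_t plen = strlen(pat);
  int prefix = plen > 0 && pat[plen - 1] == '*';
  const char *p = text;
  if(prefix) plen--;                       /* drop the trailing wild card */
  if(plen == 0) return 0;
  while(*p != '\0'){
    const char *w;
    size_t wlen;
    while(*p != '\0' && isspace((unsigned char)*p)) p++;
    w = p;                                 /* start of the next word */
    while(*p != '\0' && !isspace((unsigned char)*p)) p++;
    wlen = (size_t)(p - w);
    if(wlen == 0) break;                   /* only trailing white space left */
    if(prefix ? wlen >= plen : wlen == plen){
      if(same_nocase(w, pat, plen)) return 1;
    }
  }
  return 0;
}
```

In this model, `word_matches("john", "demijohn bottle")' is 0 although the plain token "john" would match that text, and `word_matches("john*", "Johnny Doe")' is 1.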

Double quotation ("") is useful to disable the meta characters described above. So, to search for "xyz cookie" but not "xyzcookie", "cookie xyz", or "xyz abc cookie", specify the following expression.

"xyz cookie"

The above expression also matches "vwxyz cookie" and "xyz cookies". So, to search for the exact word sequence "xyz cookie", specify the following expression.

[[xyz cookie]]

The priority of "||" is higher than that of "&&". So, the following expression searches for records including "english" or "british", and also including "bread" or "roll" or "bun".

english || british bread || roll || bun
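
The precedence rule can be illustrated with a toy evaluator. This is only a model of the semantics described in this section, not the parser behind `tcidbsearch2'. The function name `expr_matches' is hypothetical. The model assumes that operators are separated from tokens by white space, matches tokens as substrings ignoring ASCII case, and does not handle "[[...]]" or double quotation.

```c
#include <ctype.h>
#include <string.h>

/* Return 1 if `text' contains the first `len' bytes of `tok',
   ignoring ASCII case. */
static int contains_token(const char *text, const char *tok, size_t len){
  size_t tlen = strlen(text), i, j;
  for(i = 0; i + len <= tlen; i++){
    for(j = 0; j < len; j++){
      if(tolower((unsigned char)text[i + j]) != tolower((unsigned char)tok[j])) break;
    }
    if(j == len) return 1;
  }
  return 0;
}

/* Toy model of a compound expression: tokens joined by "||" form a group,
   and groups are joined by "&&" or plain white space.  Because "||" binds
   tighter, the result is the intersection of the groups. */
int expr_matches(const char *expr, const char *text){
  int result = 1, group = 0, have_group = 0, join_or = 0;
  const char *p = expr;
  while(*p != '\0'){
    const char *tok;
    size_t len;
    int hit;
    while(*p != '\0' && isspace((unsigned char)*p)) p++;
    if(*p == '\0') break;
    tok = p;
    while(*p != '\0' && !isspace((unsigned char)*p)) p++;
    len = (size_t)(p - tok);
    if(len == 2 && memcmp(tok, "&&", 2) == 0){ join_or = 0; continue; }
    if(len == 2 && memcmp(tok, "||", 2) == 0){ join_or = 1; continue; }
    hit = contains_token(text, tok, len);
    if(join_or && have_group){
      group = group || hit;                  /* extend the current group */
    } else {
      if(have_group) result = result && group;  /* close the previous group */
      group = hit;                           /* start a new group */
      have_group = 1;
    }
    join_or = 0;
  }
  if(!have_group) return 0;                  /* an empty expression matches nothing */
  return result && group;
}
```

With the expression above, the model accepts the text "British roll" and rejects both "English muffin" and "French bread", because each of the two groups must be satisfied.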

CGI Script

The CGI script `dystsearch.cgi' is provided to search a database through a Web interface. As it is installed as `/usr/local/libexec/dystsearch.cgi', copy it into a directory published by the Web server. The database to be searched should be named `casket' and placed in the same directory as the CGI script.

Then, access the URL of the CGI script with a Web browser. A search form and some options are displayed there.


License

Tokyo Dystopia is free software; you can redistribute it and/or modify it under the terms of the GNU Lesser General Public License as published by the Free Software Foundation; either version 2.1 of the License or any later version.

Tokyo Dystopia is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU Lesser General Public License for more details.

You should have received a copy of the GNU Lesser General Public License along with Tokyo Dystopia (See the file `COPYING'); if not, write to the Free Software Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA.

Tokyo Dystopia was written by FAL Labs. You can contact the author by e-mail to `info@fallabs.com'.