P2P Guide

Last Update: Tue, 06 Mar 2007 12:05:18 +0900

Introduction
Architecture
Tutorial
Node Master Usage
Protocol
Node API
Command of a Client
Pseudo Node Master
Meta Search Gateway

Introduction

This guide describes Hyper Estraier's client/server (C/S) and peer to peer (P2P) architecture. If you haven't read user's guide yet, now is a good moment to do so.

There was several problems which where motivation for C/S architecture. estseek.cgi is inefficient because it reconnects to database for each search query. Database updates using estcmd prevents searches on same database because estcmd uses locks when doing update. To solve those problems, estmaster server process is provided. This is resident (daemon) process which has control over database and provides services via network.

This approach also leads to following advantages:

server and clients can be distributed across different machines
multiple clients and servers can work in parallel
client crash doesn't leave database in inconsistent state
clients implementation isn't specific to any programming language

Protocol between clients and servers is based on HTTP, so normal web browsers can be used as simple clients. Clients can be implemented using any languages which supports HTTP protocol like JavaScript or Flash.

Distributed processing is based on peer to peer (P2P) architecture. This allows horizontal scalability. For example, if you use 10 servers, each with million documents, you can search 10 million documents without much additional effort. Since all servers are equal, search service is provided even if some of servers are unavailable. There is notion of relevance of each index which can improve search results (if some parts of index are more important that others).

This guide describes the node API which can be used by client applications to implement search and update capabilities without using network protocol between client and server. The node API is implemented by not only C language but also Java and Ruby.

Architecture

This section describes the P2P architecture of Hyper Estraier.

Node Master and Node Server

When using multiple indexes, it is highly inefficient to run one server for each index. estmaster is server process (daemon) implemented as single process available on network through HTTP protocol on some hight port that provides services for multiple indexes. This component is called node master. Each index has it's own unique URL which is served by node server. You can think about node server as a virtual servers for each index within one node master.

Client application just need to know URL of node server and issue queries to it on any available node masters. Term node corresponds to peer in P2P architecture. Clients can also connect to node masters directly to manage configuration of that master.

Meta Search and Credit

Each node server can have unidirectional links to other node servers. This allows distributed searching called meta search. When client sends query to node server, it will forward that query to all other node servers and merge responses before sending result back to client.

Meta search is performed hierarchically. Loops in links are detected and restrained automatically, so searching is always performed on tree structured network of nodes. This allows infinite scalability just by adding additional nodes.

There is a notion of relevance for each link between nodes called credit. It's used for weighting of scores when merging results from different nodes. Document coming from node with larger credit will have higher rank in search results. Application can set links and credits for them, and this allows improving of search precision by modifying credits of frequently used nodes.

Authentication

Client connections to node master or node server are authenticated using login names and passwords. Users and permissions are defined on node masters and all node servers on that master inherit those permissions. Users are divided into two groups: super users and normal users. Normal users have permission to search index while super users also have permission to create and update indexes. Moreover, for each node, normal users are divided into administrators and guest users.

Example Application

mod_estraier is the most interesting application of the node API of Hyper Estraier. It works as a module of a Web server (Apache) and register contents mediated by its proxy mechanism. Using it as a forward proxy, you can enjoy a search engine whose targets are documents which you (or your comrades) have seen. Using it as a reverse proxy, you can add search feature to any web application as well as BBS and Wiki. That is, the greatest characteristic of the node API is helping you to develop advanced search systems combining various applications.

Tutorial

Concept of P2P seems difficult at first, so it's best to start with a tutorial which will show commands used to setup search network.

Start and Stop

First step is creation of the server root directory which contains configuration files and indexes. Following command will create casket, the server root directory:

estmaster init casket

Next, we need to start node master using:

estmaster start casket

To stop node master you can press Ctrl-C on terminal in which it's running on issue following command:

estmaster stop casket

Administration Interface

After starting node master you can access it's administration interface pointing web browser at http://localhost:1978/master_ui. This will bring HTTP authorization dialog requesting user name and password. Enter "admin" and "admin". You will be presented with administration commands menu.

First option is Manage Master which provides options to shutdown node master and to synchronize databases. However, they are not used for now.

Since you are logged as admin which has super user privileges, we can edit users. We will create new account with super user privileges and switch to it. We will also erase admin user because it has well-known password and it's possible security hole.

Select Manage Users. There are input boxes for user name, password, flags, full name and miscellaneous information at bottom. Enter following data for new user: "clint", "tnilc", "s", "Clint Eastwood" and "Dirty Harry". Flag "s" denotes user with super user privileges. User name is limited to alphanumeric characters only.

You can now delete admin user using DELE link and confirm that by clicking on sure button.

Select Manage Nodes next. Since we just erased admin user, you will be again asked for user name and password. Enter "clint" and "tnilc" to login again. Input boxes at bottom allows you to enter new index name and label. Index name is limited to alphanumeric characters. Enter "test1" as name and "First Node" as description. Create another node with name "test2" and "Second Node" as description.

Register Documents

We'll now register new documents in index. For that, we will again use command-line utilities. Since terminal in which you started estmaster is busy outputting log messages, open another one.

Documents which are about to be registered in index must be in document draft format described in User's Guide. Create file data001.est with following content:

@uri=data001
@title=Material Girl

Living in a material world
And I am a material girl
You know that we are living in a material world
And I am a material girl

We are going to register this document to node test1. Since updating of index needs super user permissions, we will use -auth option to specify user name and password of user that we created in previous step.

estcall put -auth clint tnilc http://localhost:1978/node/test1 data001.est

Document registration will finish shortly, and absence of any error message denotes that it has been successful.

To demonstrate meta search mentioned before, we will register another document to node test2. Create file data002.est with following content:

@uri=data002
@title=Liberian Girl

Liberian girl
You came and you changed My world
A love so brand new

Then, register document using:

estcall put -auth clint tnilc http://localhost:1978/node/test2 data002.est

As you can see, it's possible to register documents from any machine connected to network on which estmaster is running. You can register few additional documents if you want now.

Search for Documents

Next step is to find some of documents we just registered. You can try search by executing:

estcall search http://localhost:1978/node/test1 "material world"

This will produce search result with draft of the document including phrase material world.

We can also create a link between two nodes and try meta search. We need super user, URL of source and destination node, link name and credit.

estcall setlink -auth clint tnilc http://localhost:1978/node/test1 \
  http://localhost:1978/node/test2 TEST02 8000

This command creates link from node test1 to node test2 called TEST02 and assign it credit of 8000.

You can now repeat search and add option -dpt which will activate meta search over both nodes.

estcall search -dpt 1 http://localhost:1978/node/test1 "girl"

Although your search is accessing node test1, we can see merged results from test2. Since nodes can be located on separate machines, meta search enables us to do distributed processing using P2P architecture described before.

We can also influence ranking of results by increasing credit for particular node. If we change it to 12000 and re-run search we will see different ranking of results.

estcall setlink -auth clint tnilc http://localhost:1978/node/test1 \
  http://localhost:1978/node/test2 TEST02 12000

Results can also be returned in XML format which is defined in estresult.dtd.

estcall search -dpt 1 -vx http://localhost:1978/node/test1 "girl"

Each node server also has a web search interface. You can go to http://localhost:1978/node/test1/search_ui and access search interface of node test1.

Node Master Usage

estmaster is provided as a command to manage the node master. This section describes how to use estmaster.

Synopsis and Description

estmaster is an aggregation of sub commands. The name of a sub command is specified by the first argument. Other arguments are parsed according to each sub command. The argument rootdir specifies the server root directory which contains configuration file and so on.

estmaster init [-ex] rootdir: Create the server root directory.; If -ex is specified, some users and some nodes are set for example. By default, only a super user whose name and password are both "admin" is set.

estmaster start [-bg] [-st] rootdir: Start the node master.; If -bg is specified, the server runs in background as a daemon process.; If -ro is specified, the server runs in read-only mode regardless of the configuration.; If -st is specified, the server runs in single thread mode.

estmaster stop rootdir: Stop the running node master.

estmaster unittest rootdir: Perform unit tests.

estmaster crypt key [hash]: Output an encrypted string of a string.; key specifies a target string.; If hash is specified, it checks whether the key and the hash matches.

All sub commands return 0 if the operation is success, else return 1. A running node master finishes with closing the database when it catches the signal 1 (SIGHUP), 2 (SIGINT), 3 (SIGQUIT), or 15 (SIGTERM). Moreover, when a node master running as a daemon catches the signal 1 (SIGHUP), the process is re-start and re-read the configuration files.

A running node server should be finished by valid means by command line or via network. Otherwise, the index may be broken.

Constitution of the Server Root Directory

The server root directory contains the following files and directories.

_conf : prime configuration file.
_user : user account file.
_log : log file.
_meta : database file for meta data.
_pid : file recording the process ID.
_stop : file to stop the node master.
_dfdb : document frequency database.
_node/ : node directory.
_sess/ : session directory.

The prime configuration file can be edit with a text editor. However, the user account file should not be edit during the node master is running.

If you have an index created by estcmd, move it into the node directory and reboot the server. So, the index will work as a node.

Prime Configuration File

The prime configuration file is composed of lines and the name of an variable and the value separated by ":" are in each line. By default, the following configuration is there.

bindaddr: 0.0.0.0
portnum: 1978
publicurl:
runmode: 1
authmode: 2
recvmax: 1024
maxconn: 30
idleflush: 20
idlesync: 300
sessiontimeout: 600
searchtimeout: 15
searchmax: 1000
searchdepth: 5
rateuri: 1
mergemethod: 2
proxyhost:
proxyport:
logfile: _log
loglevel: 2
backupcmd:
scalepred: 2
scoreexpr: 2
attrindex: @mdate{{!}}seq
attrindex: @title{{!}}str
docroot:
indexfile:
trustednode:
denyuntrusted: 0
cachesize: 64
cacheanum: 8192
cachetnum: 1024
cachernum: 256
specialcache:
helpershift: 0.9
wildmax: 256
limittextsize: 128
snipwwidth: 480
sniphwidth: 96
snipawidth: 96
scancheck: 1
smlrvnum: 32
extdelay: 4096
adminemail: magnus@hyperestraier.gov
uireplace: ^file:///home/mikio/public_html/{{!}}http://localhost/
uireplace: /index\.html?${{!}}/
uiextattr: @author|Author
uiextattr: @mdate|Modification Date
uiphraseform: 2
uismlrtune: 16 1024 4096

Meaning of each variable is the following.

bindaddr : specifies the binding address of the server. 0.0.0.0 means every address of the host.
portnum : specifies the port number of the server.
publicurl : specifies the public URL (absolute URL). By default, it is generated with "http://", the hostname, ":", and the port number.
runmode : specifies running mode (1:normal, 2:readonly).
authmode : specifies authorization mode (1:none, 2:admin, 3:all).
recvmax : maximum length of data to receive (in kilobytes).
maxconn : specifies the maximum number of connections at the same time.
idleflush : specifies idle time to start flushing (in seconds).
idlesync : specifies idle time to start synchronizing (in seconds).
sessiontimeout : specifies timeout of a session (in seconds).
searchtimeout : specifies timeout of search (in seconds).
searchmax : specifies the maximum number of documents to send.
searchdepth : specifies the maximum depth of meta search.
rateuri : specifies whether to rate URI for scoring (0:no, 1:yes).
mergemethod : specifies merge method of meta search (1:score, 2:score and rank, 3:rank).
proxyhost : specifies the host name of the proxy server.
proxyport : specifies the port number of the proxy server.
logfile : specifies the path of the log file (relative path or absolute path).
loglevel : specifies logging level (1:debug, 2:information, 3:warning, 4:error, 5:none).
backupcmd : specifies the command for backup (absolute path of a command).
scalepred : specifies scale prediction of each node (1:small, 2:medium, 3:large, 4:huge).
scoreexpr : specifies score expression (1:void, 2:char, 3:int, 4:asis).
attrindex : specifies the attribute indexes (attribute name and data type). This can be more than once.
docroot : specifies document root directory (absolute path of a directory to be public).
indexfile : specifies index file (name of directory index files).
trustednode : specifies decimal IP addresses of trusted nodes. This can be more than once.
denyuntrusted : specifies whether to deny all nodes except for trusted nodes (0:no, 1:yes).
cachesize : specifies the maximum size of the index cache (in megabytes).
cacheanum : specifies the maximum number of cached records for document attributes.
cachetnum : specifies the maximum number of cached records for document texts.
cachernum : specifies the maximum number of cached records for occurrence results.
specialcache : specifies the name of the attribute of the special cache.
helpershift : specifies the lower limit of cache usage size to use the helper.
wildmax : specifies the maximum number of expansion of wild cards.
limittextsize : specifies the text size limitation of registered documents by kilobytes.
snipwwidth : specifies whole width of the snippet of each shown document.
sniphwidth : specifies width of strings picked up from the beginning of the text.
snipawidth : specifies width of strings picked up around each highlighted word.
scancheck : specifies whether to check documents by scanning (0:no, 1:yes).
smlrvnum : specifies the number of keywords for similarity search (0 means disabled).
extdelay : specifies the number of documents for delay of keyword extraction.
adminemail : specifies an e-mail address of the administrator.
uireplace : specifies regular expressions and replacement string to convert the URI of each document. Regular expressions and replacement strings are separated by "{{!}}". This can be more than once.
uiextattr : specifies an attribute to be shown. The name and the label are separated by "|". This can be more than once.
uiphraseform : specifies mode of phrase form (1:usual, 2:simplified, 3:rough, 4:union: 5:intersection)
uismlrtune : specifies tuning parameters for similarity search. The number of keywords, the number of documents per keyword, and the number of all candidates are separated by space characters.

User Account File

The user account file is composed of lines and each includes the name, the encrypted password, the flags, the full name, and the miscellaneous information separated by tabs. The character encoding is UTF-8. By default, the following account is there.

admin   21232f297a57a5a743894a0e4a801fc3        s       Carolus Magnus  Administrator

The password is expressed as MD5 hash value. In the flags, "s" is for super users, and "b" is for banned users. Flags, full name, and miscellaneous information can be omitted.

Embedded User Interfaces

By accessing the absolute URL "/master_ui" of the node master with a web browser, you can use the administration interface. It requires authentication as a super user.

By accessing the URL which is the URL of a node server followed by "/search_ui", you can use the search interface. There are some anchors with such labels as "LINK#1" in the result page. They express nodes linked from the current node. If you select one of them, you move to the node and then search with the current conditions are performed.

If you select the link "Atom" or "RSS" in the result page, you get the search result in Atom or RSS format. By registering the URL to an application supporting Atom 1.0 or RSS 1.0, you can monitor the search result periodically. In order to work together with such site as Google, Wikipedia and so on, Hyper Estraier supports OpenSearch 1.1. To get the description for OpenSearch, access the URL of the node server trailing "/opensearch".

Protocol

Communication between nodes and communication between clients and nodes are carried out by a protocol based on HTTP. This section describes the protocol.

Introduction

The node master and node servers implement HTTP/1.0. As for now, such particular features of HTTP/1.1 as keep-alive connection, chunked encoding, and content negotiation are not supported.

While both of GET and POST are allowed for the request method of HTTP, GET is preferred if the command retrieves information, POST is preferred if the command update the node master or a node server. As the character encoding of parameters is UTF-8, meta characters and multi-byte characters should be escaped by URL encoding (application/x-www-form-urlencoded). The maximum length of data sent with the GET method is 8000. Authentication information is passed in the basic authentication mechanism of HTTP.

If an operation is done successfully, the status code 200 or 202 is returned. On error, one of the following status code is returned.

400 : parameters are invalid.
401 : authentication information lacks or is invalid.
403 : the account has not the permission.
404 : the node does not exist.
500 : some errors of due to the server occurred.

The result of operation of search or retrieve is sent as message body of response. As the format of the data is plain text whose encoding is UTF-8, it can be structured with tabs and line feeds. If a client supports deflate encoding, compressed data is sent.

Operation of Node Master

To operate the node master, connect to the path "/master" of the server. For example, if the host name is "skyhigh.estraier.go.jp" and the port number is 8888, connect to "http://skyhigh.estraier.go.jp:8888/master". Only super users are granted to operate the node master. There are some sub commands for operations of the node master. The name of a sub command is specified by the parameter "action". Other parameters vary according to each sub command.

/master ? action=shutdown: Shutdown the node master.; No parameter is used.; On success, the status code 202 is returned.

/master ? action=sync: Synchronize databases of all nodes and the disk.; No parameter is used.; On success, the status code 202 is returned.

/master ? action=backup: Synchronize databases and perform the backup command.; No parameter is used.; On success, the status code 202 is returned.

/master ? action=userlist: Get the list of user accounts.; No parameter is used.; On success, the status code 200 is returned. The entity body of response expresses the list of user accounts whose format is TSV. Each line is information of each user. There is fields of the user name, the encrypted password, the flags, the full name, and the miscellaneous information.

/master ? action=useradd & name=str & passwd=str & flags=str & fname=str & misc=str: Add a user account.; name specifies the name of a new user. It is essential. Only alphanumeric characters are in the name. If the specified name overlaps the name of an existing user, it is treated as an error.; passwd specifies the password. It is essential.; flags specifies the flags. It is optional. "s" is for super users, and "b" is for banned users.; fname specifies the full name. It is optional.; misc specifies the miscellaneous information. It is optional.; On success, the status code 200 is returned.; Because the password is sent, POST method should be used.

/master ? action=userdel & name=str: Delete a user account.; name specifies the name of a user. It is essential.; On success, the status code 200 is returned.

/master ? action=nodelist: Get the list of node servers.; No parameter is used.; On success, the status code 200 is returned. The entity body of response expresses the list of node servers whose format is TSV. Each line is information of each node. There is fields of the node name, the label, the number of documents, the number of unique words, and the size.

/master ? action=nodeadd & name=str & label=str: Add a node server.; name specifies the name of a new node. It is essential. Only alphanumeric characters are in the name. If the specified name overlaps the name of an existing node, it is treated as an error.; label specifies the label. It is optional. If it is omitted, the label is to be same as the name.; On success, the status code 200 is returned.

/master ? action=nodedel & name=str: Delete a node server.; name specifies the name of a node. It is essential.; On success, the status code 200 is returned.

/master ? action=nodeclr & name=str: Clear registered documents in a node server.; name specifies the name of a node. It is essential.; On success, the status code 200 is returned. Information of users and links is still kept.

/master ? action=logrtt: Rotate the log file.; No parameter is used.; On success, the status code 200 is returned. The existing log file is cleared and its content is escaped into a new file whose name is trailed by the date information in the format of "-YYYYMMDDhhmmss".

Operation of Node Server

To operate a node server, connect to a path which begins "/node/" and is followed by the name of the node. For example, if the host name is "skyhigh.estraier.go.jp" and the port number is 8888 and the name of the node is "foo", connect to "http://skyhigh.estraier.go.jp:8888/node/foo". There are some sub commands for operations of node servers. The name of a sub command is specified after the node name. Parameters vary according to each sub command.

/node/name/inform: Get information of a node.; No parameter is used.; On success, the status code 200 is returned. The entity body of response expresses node information whose format is TSV. The first line includes fields of the node name, the label, the number of documents, the number of unique words, and the size. The next line is empty. The succeeding lines to the next empty line are names of administrators. And, the succeeding lines to the next empty line are names of guests. And, the succeeding lines to the end are link information. There are fields of the URL, the label, and the credit.

/node/name/cacheusage: Get the cache usage of a node.; No parameter is used.; On success, the status code 200 is returned. The entity body of response expresses the cache usage in decimal format.

/node/name/search ? phrase=str & attr=str & order=str & max=num & options=num & auxiliary=num & distinct=str & depth=num & wwidth=num & hwidth=num & awidth=num & skip=num & mask=num: Search for documents.; phrase specifies the search phrase. It is optional. The format is as with the one of the core API.; attr specifies an attribute search condition. It is optional. From attr1 to attr9 works as with it. The format is as with the one of the core API.; order specifies the order expression. It is optional. The format is as with the one of the core API.; max specifies the maximum number of shown documents. It is optional. By default, it is 10.; options specifies options of the search condition. It is optional. The value is as with the one of the core API.; auxiliary specifies permission to adopt result of the auxiliary index. It is optional. By default, it is 32.; distinct specifies the attribute distinction filter. It is optional.; depth specifies depth of meta search. It is optional. By default, it is 0.; wwidth specifies whole width of the snippet of each shown document. It is optional. By default, it is due to the server configuration. If it is 0, no snippet is sent. If it is negative, whole body text is sent instead of snippet.; hwidth specifies width of strings picked up from the beginning of the text. It is optional. By default, it is due to the server configuration.; awidth specifies width of strings picked up around each highlighted word. It is optional. By default, it is due to the server configuration.; skip specifies the number of documents to be skipped. It is optional. By default, it is 0.; mask specifies the mask of search targets. It is optional. 1 means the node itself, 2 means the first link, 4 means the second link, 8 means the third link, and power values of 2 and their summation compose the mask. Otherwise, it can be specifies by expressions as from mask0=on to mask9=on.; On success, the status code 200 is returned. The entity body of response expresses the search result. The format is explained later.

/node/name/list ? max=num & prev=str: Retrieve a list of documents.; max specifies the maximum number of shown documents. It is optional. By default, it is 10.; prev specifies the URI of the previous element of iteration. It is optional.; On success, the status code 200 is returned. The entity body of response expresses the retrieval result. The format is explained later.

/node/name/get_doc ? id=num & uri=str: Get information of a document.; id specifies the ID number of a document. It is optional.; uri specifies the URI of a document. It is optional.; On success, the status code 200 is returned. The entity body of response expresses the search result. The format is document draft.

/node/name/get_doc_attr ? id=num & uri=str & attr=str: Get the value of an attribute of a document.; id specifies the ID number of a document. It is optional.; uri specifies the URI of a document. It is optional.; attr specifies the name of an attribute. It is essential.; On success, the status code 200 is returned. The entity body of response expresses the value of the attribute.

/node/name/etch_doc ? id=num & uri=str: Extract keywords of a document.; id specifies the ID number of a document. It is optional.; uri specifies the URI of a document. It is optional.; On success, the status code 200 is returned. The entity body of response expresses keywords and their scores. The format is TSV.

/node/name/uri_to_id ? uri=str: Get the ID number of a document specified by URI.; uri specifies the URI of a document. It is essential.; On success, the status code 200 is returned. The entity body of response expresses the ID number of the document.

/node/name/put_doc ? draft=str: Register a document. It is available by administrators only.; draft specifies content of a document in document draft format. It is essential.; On success, the status code 200 is returned.

/node/name/out_doc ? id=num & uri=str: Remove a document. It is available by administrators only.; id specifies the ID number of a document. It is optional.; uri specifies the URI of a document. It is optional.; On success, the status code 200 is returned.

/node/name/edit_doc ? draft=str: Edit attributes of a document. It is available by administrators only.; draft specifies content of a document in document draft format. It is essential.; On success, the status code 200 is returned.

/node/name/sync: Synchronize updating contents of the database. It is available by administrators only.; No parameter is used.; On success, the status code 200 is returned.

/node/name/optimize: Optimize the database. It is available by administrators only.; No parameter is used.; On success, the status code 200 is returned.

/node/name/_set_user ? name=str & mode=num: Set permission of a user. It is available by administrators only.; name specifies the name of a user. It is essential.; mode specifies operation mode. It is essential. 1 means to set the user as an administrator, 2 means to set the user as a guest, and 0 means to revoke the user account.; On success, the status code 200 is returned.

/node/name/_set_link ? url=str & label=str & credit=num: Set a link to another node. It is available by administrators only.; url specifies the URL of a destination node. It is essential. If the specified URL overlaps the URL of an existing link, the label and the credit is overwritten.; label specifies the label of the link. It is essential.; credit specifies the credit of the link. It is optional. If it is omitted, the link is removed.; On success, the status code 200 is returned.

Note that while super users has permission to administrate all nodes, an administrator of a node may not be a super user. Moreover, setting of guests of each node have meaning only when the authorization mode is 3 (all).

Format of Search Result

The format of the entity body of response of search command is alike to multipart of MIME. The following is an example.

--------[2387AD2E34554FFF]--------
VERSION 1.0
NODE    http://localhost:1978/node/sample1
HIT     2
HINT#1  give    2
DOCNUM  2
WORDNUM 31
TIME    0.006541
TIME#i  0.000058
TIME#0  0.002907
TIME#1  0.001578
LINK#0  http://localhost:1978/node/sample1      Sample1 10000   2       31      2731304 2
LINK#1  http://localhost:1978/node/sample2      Sample2 4000    3       125     8524522 1
VIEW    SNIPPET

--------[2387AD2E34554FFF]--------
#nodelabel=Sample Node One
#nodescore=7823432
#nodeurl=http://localhost:1978/node/sample1
@id=1
@uri=http://localhost/foo.html
%VECTOR give    8502    dispose 7343    griefs  5932    king    2343    void    1232

You may my glories and my state dispose, But not my griefs; still am I king of those. (
Give    give
 it u

p, Yo!
Give    give
 it up, Yo!)

--------[2387AD2E34554FFF]--------
#nodelabel=Sample Node One
#nodescore=5623772
#nodeurl=http://localhost:1978/node/sample1
@id=2
@uri=http://localhost/bar.html
%VECTOR faster  9304    give    7723    griefs  6632    go      5343    you     3289

The faster I go, the behinder I get. (
Give    give
 it up, Yo!
Give    give
 it up, Yo!)

--------[2387AD2E34554FFF]--------:END

Each line feed is a single LF. The first line is definition of the border string. Each parts are delimited by the border string. The last border string is followed by ":END". The first part is the meta section. The other parts are document sections.

The format of the meta section is TSV. Meaning of each string is picked out by the first field. There are the following kinds.

VERSION : specifies the version of the protocol.
NODE : specifies the URL of the node.
HIT : specifies the total number of the corresponding documents.
HINT#n : specifies the number of documents corresponding each word. The second field specifies the word. The third field specifies the number.
DOCNUM : specifies the total number of documents in target nodes.
WORDNUM : specifies the total number of words in target nodes.
TIME : specifies total elapsed time in seconds.
TIME#n : specifies elapsed time for each node in seconds. "TIME#i" means elapsed time for the local inverted index only.
LINK#n : specifies information of each linked node. Fields express the URL, the label, the credit, the number of documents, the number of unique words, the size of the database, and the number of corresponding documents. "LINK#0" means the node it self.
VIEW : specifies the format of document parts. As for now, it is constantly "SNIPPET".

Each document part expresses attributes and a snippet of a document. Top lines to the first empty line expresses attributes. Their format is as with the one of document draft. If keywords are attached to the document, they are expressed with the "%VECTOR" control command. If score is specified explicitly, it is expressed with the "%SCORE" control command. The format of the snippet is TSV. There are tab separated values. Each line is a string to be shown. Though most lines have only one field, some lines have two fields. If the second field exists, the first field is to be shown with highlighted, and the second field means its normalized form.

The following pseudo-attributes are added to each result documents of the search command or the get_doc command includes.

#nodeurl : specifies the URL of the node into which the document was registered.
#nodescore : specifies the node local score of the document.
#nodelabel : specifies the label of the node into which the document was registered.

Format of List Retrieval

The format of the entity body of response of list command is in TSV format. The following is an example though some strings follows "..." actually.

181     http://localhost/data/ihaveadream.xml   31e51df5f33131943dda22bd0fd755a0 ...
1       http://localhost/prog/hyperestraier-1.0.2/doc/index.html        45368fa3c...
2       http://localhost/prog/hyperestraier-1.0.2/doc/index.ja.html     0e9edf4ae...
3       http://localhost/prog/hyperestraier-1.0.2/doc/intro-en.html     ec622d19a...
18      http://localhost/prog/hyperestraier-1.0.2/doc/intro-ja.html     96f743fa6...
5       http://localhost/prog/hyperestraier-1.0.2/doc/javanativeapi/allclasses-fr...
25      http://localhost/prog/hyperestraier-1.0.2/doc/javanativeapi/allclasses-no...
26      http://localhost/prog/hyperestraier-1.0.2/doc/javanativeapi/constant-valu...
20      http://localhost/prog/hyperestraier-1.0.2/doc/javanativeapi/estraier/Cmd....
1022    http://localhost/prog/hyperestraier-1.0.2/doc/javanativeapi/estraier/Docu...

Each line feed is a single LF. Each line expresses each document and has 14 fields of system attributes. They are "@id", "@uri", "@digest", "@cdate", "@mdate", "@adate", "@title", "@author", "@type", "@lang", "@genre", "@size", "@weight", and "@misc". Fields of undefined attributes are empty strings.

Special Format for Document Registration

Because URL encoding is not efficient as for large data sent for the put_doc command and edit_doc command, the raw mode is supported. If the value of "Content-Type" is "text/x-estraier-draft", the entity body is treated as a document draft itself. The following is an example.

POST /node/foo/put_doc HTTP/1.0
Content-Type: text/x-estraier-draft
Content-Length: 138

@uri=http://gogo.estraier.go.jp/sample.html
@title=Twinkle Twinkle Little Star

Twinkle, twinkle, little star,
How I wonder what you are.

Node API

As it is a bother to implement HTTP, the node API is useful. This section describes how to use the node API.

Introduction

Using the node API, you can implement clients communicating node severs without considering such low level processing as TCP/IP and HTTP. Though the node API has overhead comparing to the core API, it is important to be able to execute at remote host and to perform parallel processing without discrimination of readers and writers.

In each source of applications of the node API, include `estraier.h', `estnode.h', `cabin.h', and `stdlib.h'.

#include <estraier.h>
#include <estnode.h>
#include <cabin.h>
#include <stdlib.h>

To build an application, perform the following command. It is same as with the core API.

gcc `estconfig --cflags` -o foobar foobar.c `estconfig --ldflags` `estconfig --libs`

Because the node API uses features of the core API also, if you have never read the programming guide, please read it beforehand.

API for Initializing

For preparation to use the node API, initialize the network environment at the beginning of a program. Moreover, the environment should be freed at the end of the program.

The function `est_init_net_env' is used in order to initialize the networking environment.

int est_init_net_env(void);: The return value is true if success, else it is false. As it is allowable to call this function multiple times, it is needed to call the function `est_free_net_env' at the same frequency.

The function `est_free_net_env' is used in order to free the networking environment.

void est_free_net_env(void);: There is no parameter and no return value.

API for Nodes

The type of the structure `ESTNODE' is for abstraction of connection to a node. A node has its own URL. No entity of `ESTNODE' is accessed directly, but it is accessed by the pointer. The term of node connection object means the pointer and its referent. A node connection object is created by the function `est_node_new' and destroyed by `est_node_delete'. Every created node connection object should be destroyed.

The following is a typical use case of node connection object.

ESTNODE *node;

/* create a node connection object */
node = est_node_new("http://estraier.gov:1978/node/foo");

/* set the proxy, the timeout, and the authentication */
est_node_set_proxy(node, "proxy.qdbm.go.jp", 8080);
est_node_set_timeout(node, 5);
est_node_set_auth(node, "mikio", "oikim");

  /* register documents or search for documents here */

/* destroy the object */
est_node_delete(node);

The function `est_node_new' is used in order to create a node connection object.

ESTNODE *est_node_new(const char *url);: `url' specifies the URL of a node. The return value is a node connection object.

The function `est_node_delete' is used in order to destroy a node connection object.

void est_node_delete(ESTNODE *node);: `node' specifies a node connection object.

The function `est_node_set_proxy' is used in order to set the proxy information of a node connection object.

void est_node_set_proxy(ESTNODE *node, const char *host, int port);: `node' specifies a node connection object. `host' specifies the host name of a proxy server. `port' specifies the port number of the proxy server.

The function `est_node_set_timeout' is used in order to set timeout of a connection.

void est_node_set_timeout(ESTNODE *node, int sec);: `node' specifies a node connection object. `sec' specifies timeout of the connection in seconds.

The function `est_node_set_auth' is used in order to set the authentication information of a node connection object.

void est_node_set_auth(ESTNODE *node, const char *name, const char *passwd);: `node' specifies a node connection object. `name' specifies the name of authentication. `passwd' specifies the password of the authentication.

The function `est_node_status' is used in order to get the status code of the last request of a node.

int est_node_status(ESTNODE *node);: `node' specifies a node connection object. The return value is the status code of the last request of the node. -1 means failure of connection.

The function `est_node_sync' is used in order to synchronize updating contents of the database of a node.

int est_node_sync(ESTNODE *node);: `node' specifies a node connection object. The return value is true if success, else it is false.

The function `est_node_optimize' is used in order to optimize the database of a node.

int est_node_optimize(ESTNODE *node);: `node' specifies a node connection object. The return value is true if success, else it is false.

The function `est_node_put_doc' is used in order to add a document to a node.

int est_node_put_doc(ESTNODE *node, ESTDOC *doc);: `node' specifies a node connection object. `doc' specifies a document object. The document object should have the URI attribute. The return value is true if success, else it is false. If the URI attribute is same with an existing document in the node, the existing one is deleted.

The function `est_node_out_doc' is used in order to remove a document from a node.

int est_node_out_doc(ESTNODE *node, int id);: `node' specifies a node connection object. `id' specifies the ID number of a registered document. The return value is true if success, else it is false.

The function `est_node_out_doc_by_uri' is used in order to remove a document specified by URI from a node.

int est_node_out_doc_by_uri(ESTNODE *node, const char *uri);: `node' specifies a node connection object. `uri' specifies the URI of a registered document. The return value is true if success, else it is false.

The function `est_node_edit_doc' is used in order to edit attributes of a document in a node.

int est_node_edit_doc(ESTNODE *node, ESTDOC *doc);: `node' specifies a node connection object. `doc' specifies a document object. The return value is true if success, else it is false. The ID can not be changed. If the URI is changed and it overlaps the URI of another registered document, this function fails.

The function `est_node_get_doc' is used in order to retrieve a document in a node.

ESTDOC *est_node_get_doc(ESTNODE *node, int id);: `node' specifies a node connection object. `id' specifies the ID number of a registered document. The return value is a document object. It should be deleted with `est_doc_delete' if it is no longer in use. On error, `NULL' is returned.

The function `est_node_get_doc_by_uri' is used in order to retrieve a document specified by URI in a node.

ESTDOC *est_node_get_doc_by_uri(ESTNODE *node, const char *uri);: `node' specifies a node connection object. `uri' specifies the URI of a registered document. The return value is a document object. It should be deleted with `est_doc_delete' if it is no longer in use. On error, `NULL' is returned.

The function `est_node_get_doc_attr' is used in order to retrieve the value of an attribute of a document in a node.

char *est_node_get_doc_attr(ESTNODE *node, int id, const char *name);: `node' specifies a node connection object. `id' specifies the ID number of a registered document. `name' specifies the name of an attribute. The return value is the value of the attribute or `NULL' if it does not exist. Because the region of the return value is allocated with the `malloc' call, it should be released with the `free' call if it is no longer in use.

The function `est_node_get_doc_attr_by_uri' is used in order to retrieve the value of an attribute of a document specified by URI in a node.

char *est_node_get_doc_attr_by_uri(ESTNODE *node, const char *uri, const char *name);: `node' specifies a node connection object. `uri' specifies the URI of a registered document. `name' specifies the name of an attribute. The return value is the value of the attribute or `NULL' if it does not exist. Because the region of the return value is allocated with the `malloc' call, it should be released with the `free' call if it is no longer in use.

The function `est_node_etch_doc' is used in order to extract keywords of a document.

CBMAP *est_node_etch_doc(ESTNODE *node, int id);: `node' specifies a node connection object. `id' specifies the ID number of a registered document. The return value is a new map object of keywords and their scores in decimal string or `NULL' on error. Because the object of the return value is opened with the function `cbmapopen', it should be closed with the function `cbmapclose' if it is no longer in use.

The function `est_node_etch_doc_by_uri' is used in order to extract keywords of a document specified by URI in a node.

CBMAP *est_node_etch_doc_by_uri(ESTNODE *node, const char *uri);: `node' specifies a node connection object. `uri' specifies the URI of a registered document. The return value is a new map object of keywords and their scores in decimal string or `NULL' on error. Because the object of the return value is opened with the function `cbmapopen', it should be closed with the function `cbmapclose' if it is no longer in use.

The function `est_node_uri_to_id' is used in order to get the ID of a document specified by URI.

int est_node_uri_to_id(ESTNODE *node, const char *uri);: `node' specifies a node connection object. `uri' specifies the URI of a registered document. The return value is the ID of the document. On error, -1 is returned.

The function `est_node_name' is used in order to get the name of a node.

const char *est_node_name(ESTNODE *node);: `node' specifies a node connection object. The return value is the name of the node. On error, `NULL' is returned. The life duration of the returned string is synchronous with the one of the node object.

The function `est_node_label' is used in order to get the label of a node.

const char *est_node_label(ESTNODE *node);: `node' specifies a node connection object. The return value is the label of the node. On error, `NULL' is returned. The life duration of the returned string is synchronous with the one of the node object.

The function `est_node_doc_num' is used in order to get the number of documents in a node.

int est_node_doc_num(ESTNODE *node);: `node' specifies a node connection object. The return value is the number of documents in the node. On error, -1 is returned.

The function `est_node_word_num' is used in order to get the number of unique words in a node.

int est_node_word_num(ESTNODE *node);: `node' specifies a node connection object. The return value is the number of unique words in the node. On error, -1 is returned.

The function `est_node_size' is used in order to get the size of the database of a node.

double est_node_size(ESTNODE *node);: `node' specifies a node connection object. The return value is the size of the database of the node. On error, -1.0 is returned.

The function `est_node_cache_usage' is used in order to get the usage ratio of the cache of a node.

double est_node_cache_usage(ESTNODE *node);: `node' specifies a node connection object. The return value is the usage ratio of the cache of the node. On error, -1.0 is returned.

The function `est_node_admins' is used in order to get a list of names of administrators of a node.

const CBLIST *est_node_admins(ESTNODE *node);: `node' specifies a node connection object. The return value is a list object of names of administrators. On error, `NULL' is returned. The life duration of the returned object is synchronous with the one of the node object.

The function `est_node_users' is used in order to get a list of names of users of a node.

const CBLIST *est_node_users(ESTNODE *node);: `node' specifies a node connection object. The return value is a list object of names of users. On error, `NULL' is returned. The life duration of the returned object is synchronous with the one of the node object.

The function `est_node_links' is used in order to get a list of expressions of links of a node.

const CBLIST *est_node_links(ESTNODE *node);: `node' specifies a node connection object. The return value is a list object of expressions of links. Each element is a TSV string and has three fields of the URL, the label, and the score. On error, `NULL' is returned. The life duration of the returned object is synchronous with the one of the node object.

The function `est_node_search' is used in order to search a node for documents corresponding a condition.

ESTNODERES *est_node_search(ESTNODE *node, ESTCOND *cond, int depth);: `node' specifies a node connection object. `cond' specifies a condition object. `depth' specifies the depth of meta search. The return value is a node result object. It should be deleted with `est_noderes_delete' if it is no longer in use. On error, `NULL' is returned.

The function `est_node_set_snippet_width' is used in order to set width of snippet in the result from a node.

void est_node_set_snippet_width(ESTNODE *node, int wwidth, int hwidth, int awidth);: `node' specifies a node connection object. `wwidth' specifies whole width of a snippet. By default, it is 480. If it is 0, no snippet is sent. If it is negative, whole body text is sent instead of snippet. `hwidth' specifies width of strings picked up from the beginning of the text. By default, it is 96. If it is negative 0, the current setting is not changed. `awidth' specifies width of strings picked up around each highlighted word. By default, it is 96. If it is negative, the current setting is not changed.

The function `est_node_set_user' is used in order to manage a user account of a node.

int est_node_set_user(ESTNODE *node, const char *name, int mode);: `node' specifies a node connection object. `name' specifies the name of a user. `mode' specifies the operation mode. 0 means to delete the account. 1 means to set the account as an administrator. 2 means to set the account as a guest. The return value is true if success, else it is false.

The function `est_node_set_link' is used in order to manage a link of a node.

int est_node_set_link(ESTNODE *node, const char *url, const char *label, int credit);: `node' specifies a node connection object. `url' specifies the URL of the target node of a link. `label' specifies the label of the link. `credit' specifies the credit of the link. If it is negative, the link is removed. The return value is true if success, else it is false.

API for Search Results of Nodes

The type of the structure `ESTNODERES' is for abstraction of search result from a node. A result is composed of a list of corresponding documents and information of hints. No entity of `ESTNODERES' is accessed directly, but it is accessed by the pointer. The term of node result object means the pointer and its referent. A node result object is created by the function `est_node_search' and destroyed by `est_noderes_delete'. Every created node connection object should be destroyed.

The type of the structure `ESTRESDOC' is for abstraction of a document in search result. A result document is composed of some attributes and a snippet. No entity of `ESTRESDOC' is accessed directly, but it is accessed by the pointer. The term of result document object means the pointer and its referent. A result document object is gotten by the function `est_noderes_get_doc' but it should not be destroyed because the entity is managed inside the node result object.

The following is a typical use case of node connection object and result document object.

ESTNODERES *nres;
CBMAP *hints;
ESTRESDOC *rdoc;
int i;

/* create a node result object */
nres = est_node_search(node, cond, 1);

/* get hints */
hints = est_noderes_hints(nres);

   /* show the hints here */

/* scan documents in the result */
for(i = 0; i < est_noderes_doc_num(nres); i++){

  /* get a result document object */
  rdoc = est_noderes_get_doc(nres, i);

  /* show the result document object here */

}

/* destroy the node result object */
est_noderes_delete(nres);

The function `est_noderes_delete' is used in order to delete a node result object.

void est_noderes_delete(ESTNODERES *nres);: `nres' specifies a node result object.

The function `est_noderes_hints' is used in order to get a map object for hints of a node result object.

CBMAP *est_noderes_hints(ESTNODERES *nres);: `nres' specifies a node result object. The return value is a map object for hints. Keys of the map are "VERSION", "NODE", "HIT", "HINT#n", "DOCNUM", "WORDNUM", "TIME", "TIME#n", "LINK#n", and "VIEW". The life duration of the returned object is synchronous with the one of the node result object.

The function `est_noderes_eclipse' is used in order to eclipse similar documents of a node result object.

void est_noderes_eclipse(ESTNODERES *nres, int num, double limit);: `nres' specifies a node result object. `num' specifies the number of documents to be shown. If it is not more than 0, eclipse is undone. `limit' specifies the lower limit of similarity for documents to be eclipsed. Similarity is between 0.0 and 1.0.

The function `est_noderes_doc_num' is used in order to get the number of documents in a node result object.

int est_noderes_doc_num(ESTNODERES *nres);: `nres' specifies a node result object. The return value is the number of documents in a node result object.

The function `est_noderes_get_doc' is used in order to refer a result document object in a node result object.

ESTRESDOC *est_noderes_get_doc(ESTNODERES *nres, int index);: `nres' specifies a node result object. `index' specifies the index of a document. The return value is a result document object or `NULL' if `index' is equal to or more than the number of documents. The life duration of the returned object is synchronous with the one of the node result object.

The function `est_resdoc_uri' is used in order to get the URI of a result document object.

const char *est_resdoc_uri(ESTRESDOC *rdoc);: `rdoc' specifies a result document object. The return value is the URI of the result document object. The life duration of the returned string is synchronous with the one of the result document object.

The function `est_resdoc_attr_names' is used in order to get a list of attribute names of a result document object.

CBLIST *est_resdoc_attr_names(ESTRESDOC *rdoc);: `rdoc' specifies a result document object. The return value is a new list object of attribute names of the result document object. Because the object of the return value is opened with the function `cblistopen', it should be closed with the function `cblistclose' if it is no longer in use.

The function `est_resdoc_attr' is used in order to get the value of an attribute of a result document object.

const char *est_resdoc_attr(ESTRESDOC *rdoc, const char *name);: `rdoc' specifies a result document object. `name' specifies the name of an attribute. The return value is the value of the attribute or `NULL' if it does not exist. The life duration of the returned string is synchronous with the one of the result document object.

The function `est_resdoc_snippet' is used in order to get the snippet of a result document object.

const char *est_resdoc_snippet(ESTRESDOC *rdoc);: `rdoc' specifies a result document object. The return value is a string of the snippet of the result document object. There are tab separated values. Each line is a string to be shown. Though most lines have only one field, some lines have two fields. If the second field exists, the first field is to be shown with highlighted, and the second field means its normalized form. The life duration of the returned string is synchronous with the one of the result document object.

The function `est_resdoc_keywords' is used in order to get keywords of a result document object.

const char *est_resdoc_keywords(ESTRESDOC *rdoc);: `rdoc' specifies a result document object. The return value is a string of serialized keywords of the result document object. There are tab separated values. Keywords and their scores come alternately. The life duration of the returned string is synchronous with the one of the result document object.

The function `est_resdoc_shadows' is used in order to get an array of documents eclipsed by a result document object.

ESTRESDOC **est_resdoc_shadows(ESTRESDOC *rdoc, int *np);: `rdoc' specifies a result document object. `np' specifies the pointer to a variable to which the number of elements of the return value is assigned. The return value is an array of eclipsed result document objects. The life duration of the returned array and its elements is synchronous with the one of the result document object.

The function `est_resdoc_similarity' is used in order to get similarity of an eclipsed result document object.

double est_resdoc_similarity(ESTRESDOC *rdoc);: `rdoc' specifies a result document object. The return value is similarity of the result document object to the front document or -1.0 if it is not eclipsed.

Paralleling

Each of node connection objects, node result objects, and result document objects can not be shared by threads. If you use multi threads, make each thread have its own objects. If the precondition is kept, functions of the node API can be treated as thread-safe functions.

Example of Gatherer

The following is the simplest implementation of a gatherer.

#include <estraier.h>
#include <estnode.h>
#include <cabin.h>
#include <stdlib.h>
#include <stdio.h>

int main(int argc, char **argv){
  ESTNODE *node;
  ESTDOC *doc;
  /* initialize the network environment */
  if(!est_init_net_env()){
    fprintf(stderr, "error: network is unavailable\n");
    return 1;
  }
  /* create and configure the node connection object */
  node = est_node_new("http://localhost:1978/node/test1");
  est_node_set_auth(node, "admin", "admin");
  /* create a document object */
  doc = est_doc_new();
  /* add attributes to the document object */
  est_doc_add_attr(doc, "@uri", "http://estraier.gov/example.txt");
  est_doc_add_attr(doc, "@title", "Over the Rainbow");
  /* add the body text to the document object */
  est_doc_add_text(doc, "Somewhere over the rainbow.  Way up high.");
  est_doc_add_text(doc, "There's a land that I heard of once in a lullaby.");
  /* register the document object to the node */
  if(!est_node_put_doc(node, doc))
    fprintf(stderr, "error: %d\n", est_node_status(node));
  /* destroy the document object */
  est_doc_delete(doc);
  /* destroy the node object */
  est_node_delete(node);
  /* free the networking environment */
  est_free_net_env();
  return 0;
}

Example of Searcher

The following is the simplest implementation of a searcher.

#include <estraier.h>
#include <estnode.h>
#include <cabin.h>
#include <stdlib.h>
#include <stdio.h>

int main(int argc, char **argv){
  ESTNODE *node;
  ESTCOND *cond;
  ESTNODERES *nres;
  ESTRESDOC *rdoc;
  int i;
  const char *value;
  /* initialize the network environment */
  if(!est_init_net_env()){
    fprintf(stderr, "error: network is unavailable\n");
    return 1;
  }
  /* create the node connection object */
  node = est_node_new("http://localhost:1978/node/test1");
  /* create a search condition object */
  cond = est_cond_new();
  /* set the search phrase to the search condition object */
  est_cond_set_phrase(cond, "rainbow AND lullaby");
  /* get the result of search */
  nres = est_node_search(node, cond, 0);
  if(nres){
    /* for each document in the result */
    for(i = 0; i < est_noderes_doc_num(nres); i++){
      /* get a result document object */
      rdoc = est_noderes_get_doc(nres, i);
      /* display attributes */
      if((value = est_resdoc_attr(rdoc, "@uri")) != NULL)
        printf("URI: %s\n", value);
      if((value = est_resdoc_attr(rdoc, "@title")) != NULL)
        printf("Title: %s\n", value);
      /* display the snippet text */
      printf("%s", est_resdoc_snippet(rdoc));
    }
    /* delete the node result object */
    est_noderes_delete(nres);
  } else {
    fprintf(stderr, "error: %d\n", est_node_status(node));
  }
  /* destroy the search condition object */
  est_cond_delete(cond);
  /* destroy the node object */
  est_node_delete(node);
  /* free the networking environment */
  est_free_net_env();
  return 0;
}

Command of a Client

estcall is provided as a client command to manage the node server. This section describes how to use estcall.

Synopsis and Description

estcall is an aggregation of sub commands. The name of a sub command is specified by the first argument. Other arguments are parsed according to each sub command. The argument nurl specifies the URL of a node. The option -proxy specifies the host name and the port number of a proxy server. The option -tout specifies timeout in seconds. The option -auth specifies the user name and the password of authentication information.

estcall put [-proxy host port] [-tout num] [-auth user pass] nurl [file]: Register a document of document draft to a node.; file specifies a target file. If it is omitted, the standard input is read.

estcall out [-proxy host port] [-tout num] [-auth user pass] nurl expr: Remove information of a document from a node.; expr specifies the ID number or the URI of a document.

estcall edit [-proxy host port] [-tout num] [-auth user pass] nurl expr name [value]: Edit an attribute of a document in a node.; expr specifies the ID number or the URI of a document.; name specifies the name of an attribute.; value specifies the value of the attribute. If it is omitted, the attribute is removed.

estcall get [-proxy host port] [-tout num] [-auth user pass] nurl expr [attr]: Output document draft of a document in a node.; expr specifies the ID number or the URI of a document.; If attr is specified, only the value of the attribute is output.

estcall etch [-proxy host port] [-tout num] [-auth user pass] nurl expr: Output TSV of keywords and their scores of a document in a node.; expr specifies the ID number or the URI of a document.

estcall uriid [-proxy host port] [-tout num] [-auth user pass] nurl uri: Output the ID number of a document specified by URI.; uri specifies the URI of a document.

estcall inform [-proxy host port] [-tout num] [-auth user pass] [-ia|-iu|-il] nurl: Output the name, the label, the number of documents, the number of unique words, and the cache usage of a node.; If -ia is specified, names of administrators are output.; If -iu is specified, names of users are output.; If -il is specified, expressions of links are output.

estcall sync [-proxy host port] [-tout num] [-auth user pass] nurl: Synchronize updating contents of the database of a node.

estcall optimize [-proxy host port] [-tout num] [-auth user pass] nurl: Optimize the database of a node.

estcall search [-proxy host port] [-tout num] [-auth user pass] [-vu|-vx] [-kw] [-ec rn] [-sf] [-attr expr] [-ord expr] [-max num] [-sk num] [-aux num] [-dis name] [-dpt num] [-mask num] nurl [phrase]: Search a node for documents.; phrase specifies the search phrase.; If -vu is specified, TSV of URI and node information are output.; If -vx is specified, XML including including attributes and snippets is output.; If -kw is specified, keyword vectors are retrieved.; -ec specifies lower limit of similarity eclipse.; If -sf is specified, the phrase is treated as a simplified form.; -attr specifies an attribute search condition. This option can be specified multiple times.; -ord specifies the order expression. By default, it is descending by score.; -max specifies the maximum number of show documents. Negative means unlimited. By default, it is 10.; -sk specifies the number of documents to be skipped. By default, it is 0.; -aux specifies permission to adopt result of the auxiliary index. If it is not more than 0, the auxiliary index is not used. By default, it is 32.; -dis specifies the name of the distinct attribute.; -dpt specifies the depth of meta search. by default it is 0.; -mask specifies the mask of meta search. by default it is 0.

estcall list [-proxy host port] [-tout num] [-auth user pass] nurl: Retriave a list of all documents.

estcall setuser [-proxy host port] [-tout num] [-auth user pass] nurl name mode: Set permission of a user.; name specifies the name of a user.; mode specifies operation mode. 1 means to set the user as an administrator, 2 means to set the user as a guest, and 0 means to revoke the user account.

estcall setlink [-proxy host port] [-tout num] [-auth user pass] nurl url label credit: Set a link to another node.; url specifies the URL of a destination node.; label specifies the label of the link.; credit specifies the credit of the link. If the value is negative, the link is removed.

estcall raw [-proxy host port] [-tout num] [-auth user pass] [-np] [-eh expr] url [file]: Output response of an HTTP request.; url specifies the URL of a target.; If file is specified, the content is sent by POST method. If not, GET method is used. If "-" is specified, the standard input is read.; If -np is specified, output response headers also.; -eh specifies an additional HTTP header. By default, "Host", "Connection", "User-Agent", and "Content-Length" is added.

All sub commands return 0 if the operation is success, else return 1.

Examples

Operations for the node maser itself is not provided as APIs, use the raw sub command for that purpose. For example, the following command is used in order to shutdown the node master.

estcall raw -auth admin admin \
  'http://localhost:1978/master?action=shutdown'

In order to add a user, perform the following command.

estcall raw -auth admin admin \
  'http://localhost:1978/master?action=useradd&name=mikio&passwd=iloveyou'

In order to use POST method, perform the following command.

echo -n 'action=useradd&name=mikio&passwd=iloveyou' |
  estcall raw -auth admin admin \
    -eh 'Content-Type: application/x-www-form-urlencoded' \
    'http://localhost:1978/master' -

Pseudo Node Master

estfraud.cgi is a CGI script working as a node master.

Constitution

Running a node master bothers you to write start-up and shutdown scripts. And, it may be forbidden to run any daemon process on some shared servers. In that case, pseudo node master is useful. You can publish some indexes as node servers, on any environment where a web server runs and CGI scripts are available. Such virtual node server working on a pseudo master is called pseudo node server. Each pseudo node server can be searched with the node API.

Pseudo node master is useful in case that the number of node is enormous. Different from usual node server which regidents with keeping file descriptios, pseudo node server connects the database on demand of each request. So, file descriptors are not exhausted. Actually, the maximum number of node servers running on a usual node master is about 30. However, a pseudo node master can handle one thousand or more pseudo node servers.

To use pseudo node master, deploy a CGI script estfraud.cgi and a configuration file estfraud.conf into a directory where CGI scripts are available.

Configuration File

The configuration file is composed of lines and the name of an variable and the value separated by ":" are in each line. Lines leaded by "#" are ignored as comments. By default, the following configuration is there.

indexdir: .
runmode: 2
pidxsuffix: -pidx
pidxdocmax: 256
pidxdocmin: 0
lockindex: 0
searchmax: 1000
rateuri: 1
mergemethod: 2
scoreexpr: 2
wildmax: 256
snipwwidth: 480
sniphwidth: 96
snipawidth: 96
scancheck: 1
smlrvnum: 32
extdelay: 4096

Means of each variable is the following.

indexdir : specifies the path of a directory containing indexes.
runmode : specifies running mode (1:normal, 2:readonly).
pidxsuffix : specifies the suffix of pseudo indexes.
pidxdocmax : specifies the maximum number of documents in each pseudo index.
pidxdocmin : specifies the minimum number of documents in each pseudo index.
lockindex : specifies whether to perform file locking to the database (0:no, 1:yes).
searchmax : specifies the maximum number of documents to send.
rateuri : specifies whether to rate URI for scoring (0:no, 1:yes).
mergemethod : specifies merge method of meta search (1:score, 2:score and rank, 3:rank).
scoreexpr : specifies score expression (1:void, 2:char, 3:int, 4:asis).
wildmax : specifies the maximum number of expansion of wild cards.
snipwwidth : specifies whole width of the snippet of each shown document.
sniphwidth : specifies width of strings picked up from the beginning of the text.
snipawidth : specifies width of strings picked up around each highlighted word.
scancheck : specifies whether to check documents by scanning (0:no, 1:yes).
smlrvnum : specifies the number of keywords for similarity search (0 means disabled).
extdelay : specifies the number of documents for delay of keyword extraction.

For example, if the value of indexdir is "/home/mikio/myindex" and the URL of estfraud.cgi is "http://abc.def/ghi/estfraud.cgi", the URL "http://abc.def/ghi/estfraud.cgi/foo" is to access the index "/home/mikio/myindex/foo".

Restrictions

The following conditions must be satisfied for updating indexes by node servers.

Running mode of the pseudo node master is normal (set "runmode: 1").
Pseudo indexes exist (make directories whose names are the index names trailed by "-pidx".
The user executing the CGI script has permission to write indexes and pseudo indexes.

If documents are registered via node server, they are stored in a pseudo index. When the number of files in the pseudo index reaches the number specified by pidxdocmax, content of the pseudo index is merged into the index and the pseudo index become empty. Due to the mechanism, the CGI script which does not resident as daemon can update the index. If pidxdocmax is set larger, faster the updating process performs, but slower the search process performs. To merge the pseudo index into the index explicitly, call the sync method (est_node_sync).

As pseudo node server does not support query relay to other nodes, the depth parameter of search query is ignored. Moreover, pseudo node server does not support changing the URI attribute by document editing (est_node_doc_edit).

Meta Search Gateway

estscout.cgi is a gateway for intersection meta search and estsupt.cgi is a gateway for union meta search.

Constitution

In order to retrieve intersection set of search results from plural indexes by using such identifiers as URI attribute, the intersection meta search gateway is useful. For example, you can search a staff register for records whose name and address match some full-text search conditions, by using URI attribute of staff number as identifier. Moreover, the union meta search gateway is provided in order to get union set of search results from plural intersection meta search gateways.

The two gateways are implemented as CGI scripts and used with parameters specified by URL. No such abstraction interface as API is provided. Because meta search is performed by multi thread or multi process, improvement of scalability for distributed processing is expected.

Intersection Meta Search Gateway

To use the intersection meta search gateway, deploy a CGI script estscout.cgi and a configuration file estscout.conf into a directory where CGI scripts are available.

The configuration file is composed of lines and the name of an variable and the value separated by ":" are in each line. By default, the following configuration is there.

indexname: casket-1
indexname: casket-2
indexname: casket-3
lockindex: 1
condgstep: 2
dotfidf: true
scancheck: 3
phraseform: 2
wildmax: 256
stmode: 0
idattr: @uri
idsuffix:
ordexpr: @uri STRA
dupcheck: 0
union: 0
tmpdir: /tmp
cclife: 300
logfile:
logformat: {time}\t{REMOTE_ADDR}:{REMOTE_PORT}\t{cond}\t{hnum}\n

Means of each variable is the following.

indexname : specifies the name of an index. This can be more than once.
lockindex : specifies whether to perform file locking to the database (0:no, 1:yes).
condgstep : specifies accuracy of N-gram checking.
dotfidf : specifies whether to do TF-IDF score tuning (0:no, 1:yes).
scancheck : specifies the number of checked documents by scanning.
phraseform : specifies the phrase form (1:usual form, 2:simplified form, 3:rough form, 4:union form, 5:intersection form).
wildmax : specifies the maximum number of expansion of wild cards.
stmode : specifies whether to be single thread mode (0:no, 1:yes).
idattr : specifies the name of an attribute used as document identifier. If it is an empty string, score is used.
idsuffix : specifies the name of an attribute used as the suffix of document identifier. This can be more than once.
ordexpr : specifies an ordering expression used when attribute search.
dupcheck : specifies whether to check duplication of identifiers (0:no, 1:yes).
union : specifies whether to perform union meta search.
score : specifies a scoring method when logical operation (0:sum, 1:max, 2:min, 3:average).
tmpdir : specifies the path of the directory for temporary files.
cclife : specifies the lifetime of cache files (in seconds). If it is negative, the lifetime is unlimited.
logfile : specifies the path of the log file.
logformat : specifies the format of each log data.

The intersection meta search gateway receives the following parameters. The order expression is not supported.

phrasen : specifies full-text search conditions with n assigned from 1 to 9.
attrn : specifies attribute search condition with n assigned from 1 to 9.
max : specifies the maximum number of documents to send.
distinct : the name of the distinct attribute.
fresh : specifies whether to disable cache temporarily (0:no, 1:yes).

In the output, the first line specifies the approximation number of corresponding documents. Each of the subsequent lines specifies identifier and score of a corresponding document. For example, the following is output.

256
file:///home/mikio/tako.html   12561
file:///home/mikio/ika.html    11624
file:///home/mikio/uni.html    9232
file:///home/mikio/kani.html   8293
file:///home/mikio/ebi.html    8312

Union Meta Search Gateway

To use the union meta search gateway, deploy a CGI script estsupt.cgi and a configuration file estsupt.conf into a directory where CGI scripts are available.

The configuration file is composed of lines and the name of an variable and the value separated by ":" are in each line. By default, the following configuration is there.

targeturl: http://searcher1/estscout.cgi
targeturl: http://searcher2/estscout.cgi
stmode: 0
tmpdir: /tmp
cclife: 1800
#shareurl: http://merger1/estsupt.cgi
#shareurl: http://merger2/estsupt.cgi
failfile:
logfile:
logformat: {time}\t{REMOTE_ADDR}:{REMOTE_PORT}\t{cond}\t{hnum}\n

Means of each variable is the following.

targeturl : specifies the URL of an intersection meta search gateway. This can be more than once.
stmode : specifies whether to be single thread mode (0:no, 1:yes).
score : specifies a scoring method when logical operation (0:sum, 1:max, 2:min, 3:average).
tmpdir : specifies the path of the directory for temporary files.
cclife : specifies the lifetime of cache files (in seconds). If it is negative, the lifetime is unlimited.
shareurl : specifies the URL of a union meta search gateway for shared cache. This can be more than once.
failfile : specifies the path of the fail rate file.
logfile : specifies the path of the log file.
logformat : specifies the format of each log data.

The format of output and parameters of the union meta search gateway are same with ones of the intersection meta search gateway.

P2P Guide

Table of Contents

Introduction

Architecture

Node Master and Node Server

Meta Search and Credit

Authentication

Example Application

Tutorial

Start and Stop

Administration Interface

Register Documents

Search for Documents

Node Master Usage

Synopsis and Description

Constitution of the Server Root Directory

Prime Configuration File

User Account File

Embedded User Interfaces

Protocol

Introduction

Operation of Node Master

Operation of Node Server

Format of Search Result

Format of List Retrieval

Special Format for Document Registration

Node API

Introduction

API for Initializing

API for Nodes

API for Search Results of Nodes

Paralleling

Example of Gatherer

Example of Searcher

Command of a Client

Synopsis and Description

Examples

Pseudo Node Master

Constitution

Configuration File

Restrictions

Meta Search Gateway

Constitution

Intersection Meta Search Gateway

Union Meta Search Gateway