Fundamental Specifications of Estraier Version 1

Last Update: Sat, 07 Jan 2006 21:37:07 +0900

Overview
Installation
Managing Inverted Index
User Interface for Search
Meta Search System
Tips
Bugs
Copying

Overview

Estraier is a full-text search system for personal use. Full-text search means searching a large collection of documents for those that contain specified words. The principal purpose of Estraier is to provide full-text search for a web site. It functions similarly to Google, but for a personal web site or sites in an intranet. Estraier has the following features.

Its search speed is high.
Its results are easy to read.
It implements relational document search.
It can handle various languages.
It can handle various file formats.
It can handle a large number of documents.
Its installation is easy.

Estraier realizes fast full-text search using a database called an inverted index. The inverted index is created by the administrator of a web site, and it is managed on the machine that runs the web server. Users access the CGI script installed on that machine with a web browser and perform full-text search. The user interface can be customized by editing a simple template file. A simple web server featuring full-text search is also provided.
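
As a rough illustration of the idea (not Estraier's actual storage format, which this document does not describe), an inverted index maps each word to the set of documents containing it, so a search looks up only the query words instead of scanning every document. The function names and sample documents below are hypothetical.

```python
from collections import defaultdict


def build_index(docs):
    """Map each lowercased word to the set of document names containing it."""
    index = defaultdict(set)
    for name, text in docs.items():
        for word in text.lower().split():
            index[word].add(name)
    return index


def search(index, words):
    """Return the names of documents that contain every given word."""
    sets = [index.get(w.lower(), set()) for w in words]
    return set.intersection(*sets) if sets else set()


# Hypothetical sample documents.
docs = {
    "a.txt": "full text search with an inverted index",
    "b.txt": "a personal web site",
    "c.txt": "search a web site",
}
index = build_index(docs)
print(sorted(search(index, ["search", "site"])))
```

Building the index costs time up front, but each search then touches only the entries for the query words.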

When a user enters a search phrase into the form of the search page, a list of documents matching the search condition is shown with summaries of their text. Each summary is generated by extracting phrases around the search words in the document. Search words in a summary are highlighted. Documents in the result are sorted in descending order of their scores for the search words. The score of each document is calculated from the number and the fraction of the search words in the document.
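
The summary generation described above can be sketched as follows. This is only an illustration under simple assumptions: words are split on white space, highlighting is shown as `*word*`, and the `width` parameter (the number of neighboring words kept on each side) is hypothetical. Estraier's real summaries are rendered as HTML by the CGI script.

```python
def summarize(text, words, width=3):
    """Keep the words around each hit and mark the search words as *word*."""
    tokens = text.split()
    targets = {w.lower() for w in words}
    keep = set()
    for i, tok in enumerate(tokens):
        if tok.lower() in targets:
            # Remember the window of positions around this hit.
            keep.update(range(max(0, i - width), min(len(tokens), i + width + 1)))
    parts, prev = [], None
    for i in sorted(keep):
        if prev is not None and i != prev + 1:
            parts.append("...")  # gap between two extracted phrases
        tok = tokens[i]
        parts.append(f"*{tok}*" if tok.lower() in targets else tok)
        prev = i
    return " ".join(parts)
```

Calling `summarize(text, ["apple", "orange"], width=1)` keeps one word on each side of every occurrence and joins distant phrases with `...`.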

Estraier implements relational document search. This is a function that offers documents related to a document in a search result. Documents in the result are sorted in descending order of relational degree, which is calculated with the vector space model. Simply put, you can search for documents which have similar tendencies in the occurrence of words. Moreover, document clustering is also supported. It is a feature that categorizes result documents automatically by their similarity.
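
A minimal sketch of the vector space model follows. It treats each document as a vector of word counts and uses cosine similarity as the relational degree. The exact weighting used by Estraier is not given here (the `relate` sub command mentions TF-IDF, which this sketch omits), so the formula and function names should be read as assumptions.

```python
import math
from collections import Counter


def cosine(a, b):
    """Cosine similarity between the word-count vectors of two texts."""
    va, vb = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(va[w] * vb[w] for w in va.keys() & vb.keys())
    norm = math.sqrt(sum(c * c for c in va.values())) * math.sqrt(
        sum(c * c for c in vb.values())
    )
    return dot / norm if norm else 0.0


def related(target, docs):
    """Return the other document names, most similar to `target` first."""
    others = [name for name in docs if name != target]
    return sorted(others, key=lambda n: cosine(docs[target], docs[n]), reverse=True)
```

Documents that share no words get a similarity of 0, and documents with the same proportions of words get a similarity close to 1.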

Because Estraier expresses characters as Unicode (UCS-2), it can handle not only European languages such as English but also Asian languages such as Japanese. The current version can analyze European languages and Japanese with practical accuracy.

Estraier provides functions to extract text from files on the local file system. Supported formats are plain text, HTML, and MIME (e-mail and MHTML). Moreover, files of various other formats can be handled by calling an arbitrary external command. For example, MS-Word files can be handled with `wvWare', and PDF files can be handled with `pdftotext'.

Estraier can construct an inverted index of more than a hundred thousand documents. The software itself imposes no upper limit on the number of registered documents. However, the practical limit is determined by the capability of the hardware, because of the time needed to construct an inverted index. Meanwhile, search time is almost constant regardless of the scale of the inverted index. If about a hundred thousand documents are registered, the search result is shown in one second or less.

Installation of Estraier is very easy. In most cases, installation is completed within twenty minutes. To construct an inverted index, you only have to execute one or two commands. The inverted index is built within some minutes to some hours. Then, you can enjoy full-text search by accessing the installed CGI script.

Estraier is available on Linux, Solaris, HP-UX, FreeBSD, NetBSD, OpenBSD, Mac OS X, and Windows (Cygwin). Other UNIX systems are also supported. Estraier is free software licensed under the GNU General Public License.

Installation

Preparation

To install Estraier from a source package, GCC version 2.8 or later and `make' are required. On Linux and BSD systems, they are usually already installed.

As Estraier uses GNU libiconv, you have to install it beforehand. Even if your system has libiconv by default, you should use the newest version of GNU libiconv. Moreover, note that versions earlier than 1.9.1 have a memory leak problem. You can get GNU libiconv at the following site.

http://www.gnu.org/software/libiconv/

As Estraier uses zlib, you have to install it beforehand. As for many systems, zlib is already installed by default. You can get zlib at the following site.

http://www.gzip.org/zlib/

To build Estraier on Windows, the Cygwin environment is needed. If you are not familiar with Cygwin, use the binary package of Estraier instead. You can get Cygwin at the following site.

http://www.cygwin.com/

After extracting the archive file of Estraier, change the current working directory to the generated directory and perform installation.

Installation

Run the configuration script.

./configure

Build programs. On Windows, run `make win' instead.

make

Perform self-diagnostic test.

make check

Install programs. This operation must be carried out by the root user. On Windows, run `make install-win' instead.

make install

Result

When a series of work finishes, the following files will be installed.

/usr/local/bin/estindex
/usr/local/bin/estserver
/usr/local/bin/estxview
/usr/local/bin/estsiutil
/usr/local/bin/estmbtomh
/usr/local/bin/estpdfhtml
/usr/local/bin/estdochtml
/usr/local/bin/estxlshtml
/usr/local/bin/estppthtml
/usr/local/bin/estmanhtml
/usr/local/bin/estgzhtml
/usr/local/bin/estxdwhtml
/usr/local/bin/estxdthtml
/usr/local/bin/estfind
/usr/local/bin/estautoreg
/usr/local/bin/estwolels
/usr/local/libexec/estsearch.cgi
/usr/local/libexec/estmerge.cgi
/usr/local/libexec/estspellen
/usr/local/share/estraier/estxview.dtd
/usr/local/share/estraier/estxview.css
/usr/local/share/estraier/estxview.xsl
/usr/local/share/estraier/estsearch.conf
/usr/local/share/estraier/estsearch.tmpl
/usr/local/share/estraier/estsearch.top
/usr/local/share/estraier/estmerge.conf
/usr/local/share/estraier/estmerge.tmpl
/usr/local/share/estraier/estmerge.top
/usr/local/share/estraier/locale/...
/usr/local/share/estraier/skins/...

That is all you have to know at first. You may skip the following configuration options and jump to the section Managing Inverted Index.

Configurations of the Web Server

This document does not cover the configuration of any particular web server. Read the manual of your web server and configure it to enable CGI scripts. Any web server supporting CGI is fine, for example Apache, Microsoft IIS, or AnHTTPd. Alternatively, you can use the search server provided with Estraier, which is a program integrating a web server and a full-text search system.

Using Regular Expressions

By default, a search word is matched exactly as it is specified. However, wild cards and regular expressions become available if Estraier is built with regular expression support. To enable it, configure the build environment as follows.

./configure --enable-regex

Functions for regular expressions are included in the GNU standard library (glibc). If you use another library without regular expression features, use GNU regex.

Using Dynamic Linked Filters

When Estraier calls an external command for text filtering, it incurs the overhead of invoking the filter command via the system shell. You can avoid this overhead by using a filter function implemented in a dynamically linked library. To enable this feature, configure the build environment as follows.

./configure --enable-dlfilter

Estraier uses the `dlopen' function to load dynamically linked libraries. It is implemented at least on Linux, FreeBSD, Solaris, and HP-UX.

Options for Text Analyzing

By default, words in European text are divided by space characters and mark characters. If you prefer dividing by space characters only, configure the build environment as follows.

./configure --enable-strict

By default, overly generic words such as `a', `the', and `to' are excluded as stop words. If you do not want this behavior, configure the build environment as follows.

./configure --disable-stopword
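
The effect of these two build options can be sketched as follows. This is an illustration only: the `strict` and `stopword` parameters are hypothetical stand-ins for the configure options, "mark characters" are approximated by the regular expression `\W`, and the stop word set is a tiny made-up subset, since the real list is not given in this document.

```python
import re

# Tiny illustrative subset; the real stop word list is not given in this document.
STOP_WORDS = {"a", "the", "to"}


def split_words(text, strict=False, stopword=True):
    """Split text into lowercased words.

    strict=True mimics --enable-strict (split on white space only).
    stopword=False mimics --disable-stopword (keep generic words).
    """
    pattern = r"\s+" if strict else r"\W+"
    words = [w.lower() for w in re.split(pattern, text) if w]
    if stopword:
        words = [w for w in words if w not in STOP_WORDS]
    return words


print(split_words("Go to the co-op, now!"))
print(split_words("Go to the co-op, now!", strict=True))
```

In strict mode, punctuation stays attached to words, so `co-op,` remains one token instead of becoming `co` and `op`.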

Other Options

Depending on your environment, you may have to specify the following options.

--disable-lock : build for environments without file locking. This option is helpful to make an inverted index on NFS. However, an application is responsible for exclusion control when updating the inverted index.
--disable-zlib : build without zlib. This option is helpful for testing and debugging also.
--with-sysqdbm : build with QDBM already installed in the system. QDBM should be built with zlib and iconv enabled.

Managing Inverted Index

Typical Example

To enable full-text search, you should construct an inverted index beforehand. For example, if your web contents are under `/home/mikio/public_html' and CGI scripts are available there, perform the following steps.

cd /home/mikio/public_html
estindex register casket
estindex relate casket

Then, all plain text, HTML, and MIME files are registered into an inverted index named `casket'.

When your site is updated, perform the following steps.

cd /home/mikio/public_html
estindex purge casket
estindex register casket
estindex optimize casket
estindex relate casket

Then, deleted files are removed from the inverted index, and new or modified files are registered in it.
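
If you run the update from a scheduled job, the four steps above can be wrapped in a small script. The following is a sketch, not part of Estraier: the function names are hypothetical, and it assumes `estindex' is on the command search path. It relies only on the documented behavior that each sub command returns 0 on success.

```python
import subprocess


def update_commands(index="casket"):
    """Return the update steps in the order shown above."""
    return [["estindex", sub, index]
            for sub in ("purge", "register", "optimize", "relate")]


def update_index(index="casket", cwd="."):
    """Run each step in directory `cwd`, stopping at the first failure."""
    for cmd in update_commands(index):
        # check=True raises CalledProcessError on a non-zero exit status.
        subprocess.run(cmd, cwd=cwd, check=True)
```

For the example site, this would be called as `update_index("casket", cwd="/home/mikio/public_html")`.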

That is all you have to know at first. You may skip the following usage and jump to the section User Interface for Search.

Usage

The usage of the command `estindex', which manages an inverted index, is as follows. The command is composed of sub commands for specific purposes. The name of a sub command is specified by the argument that follows the command name. If `*' is specified as a suffix rule, any file matches it. The name of an encoding should be the formal name registered with IANA. When an external command is called as a filter, the first argument specifies the name of the input file, the second argument specifies the name of the output file, and the environment variable `ESTORIG' specifies the name of the original file. If the name of the external command begins with `@', the dynamically linked library named by the rest of the string is loaded, and its function named `estfilter' is called.

The sub command `register' is used in order to construct or update an inverted index.

estindex register [-list file] [-force] [-relax] [-wmax num] [-tsuf sufs] [-hsuf sufs] [-msuf sufs] [-mn] [-xsuf sufs type cmd] [-xtype type cmd] [-xt] [-xm] [-iz] [-ipre pres] [-isiz size] [-enc code] [-pt code] [-ft code] [-tattr attrs] [-rich] [-plute] name [dir]
`name' specifies the name of an inverted index.
`dir' specifies a directory which contains files to register. If it is omitted, the current directory is used. The specified directory is scanned recursively and symbolic links are followed.
If the option `-list' is specified, the file specified by `file' is read and the files whose paths are given on each line of it are registered. If `file' is `-', the standard input is read. This option is useful in combination with the `find' command of UNIX. If a line contains a tab character, the string after the tab is treated as the value of the attribute `realuri' of the registered document.
If the option `-force' is specified, files already registered in the inverted index and not modified are also registered again.
If the option `-relax' is specified, the process sleeps moderately and relaxes the stress of the system.
If the option `-wmax' is specified, only the number of words specified by `num' is recorded as information for generating summaries. By default, all words are recorded for summaries. This option is useful to reduce the size of the inverted index and improve search response.
The option `-tsuf' specifies suffixes of files to be handled as plain text. `sufs' specifies a list of suffixes separated with commas. By default, it is the same as `-tsuf .txt,.asc'.
The option `-hsuf' specifies suffixes of files to be handled as HTML. `sufs' specifies a list of suffixes separated with commas. By default, it is the same as `-hsuf .html,.htm'.
The option `-msuf' specifies suffixes of files to be handled as MIME. `sufs' specifies a list of suffixes separated with commas. By default, it is the same as `-msuf .eml,.mht'.
If the option `-mn' is specified, attributes of the content body of MIME take priority when determining the attributes of the document.
The option `-xsuf' specifies suffixes of files to be processed by an arbitrary external command. `sufs' specifies a list of suffixes separated with commas. `type' specifies a media type. `cmd' specifies a command to convert the original data to HTML.
The option `-xtype' specifies a media type to be processed by an arbitrary external command. `type' specifies a media type. `cmd' specifies a command to convert the original data to HTML. This option is used in combination with the `estfind' command.
If the option `-xt' is specified, output of the external command is treated as plain text.
If the option `-xm' is specified, output of the external command is treated as MIME.
If the option `-iz' is specified, empty documents are not registered.
The option `-ipre' specifies prefixes of files to be ignored. `pres' specifies a list of prefixes separated with commas.
If the option `-isiz' is specified, files whose size is larger than the specified size are ignored. `size' specifies the size in bytes.
The option `-enc' specifies the encoding of the registered files with `code'. By default, the encoding of each file is detected automatically from the extracted text.
If the option `-pt' is specified, the title of each registered document is overwritten with its path on the local file system. The encoding of the file system is specified with `code'.
If the option `-ft' is specified, the title of each registered document is overwritten with its file name on the local file system. The encoding of the file system is specified with `code'.
The option `-tattr' specifies attributes to be merged into the text and to be treated as search words. `attrs' specifies a list of attribute names separated with commas.
If the option `-rich' is specified, RAM and disk are used generously, which suits large sites (more than 100 thousand documents).
If the option `-plute' is specified, RAM and disk are used even more generously, which suits very large sites (more than 500 thousand documents).
If a document already registered in the inverted index is registered again, it is registered only if its last modified time is newer than its registration time in the inverted index. Otherwise, it is ignored.

The sub command `relate' is used in order to add score information for relational document search.

estindex relate [-list file] [-force] [-relax] [-ni] name [prefix]
`name' specifies the name of an inverted index.
`prefix' specifies a prefix of the URI of target documents. If it is omitted, all documents are related.
If the option `-list' is specified, the file specified by `file' is read and the files whose paths are given on each line of it are processed. If `file' is `-', the standard input is read.
If the option `-force' is specified, score information of all target documents is registered regardless of whether it is already registered.
If the option `-relax' is specified, the process sleeps moderately and relaxes the stress of the system.
If the option `-ni' is specified, TF-IDF is disabled. By default, it is enabled.
If you do not need relational document search, you do not have to perform this sub command.

The sub command `purge' is used in order to reflect deleted files to an inverted index.

estindex purge [-list file] [-force] [-relax] name [prefix]
`name' specifies the name of an inverted index.
`prefix' specifies a prefix of the URI of target documents. If it is omitted, all documents are checked.
If the option `-list' is specified, the file specified by `file' is read and the files whose paths are given on each line of it are processed. If `file' is `-', the standard input is read.
If the option `-force' is specified, all target documents are removed from the inverted index regardless of whether the files exist.
If the option `-relax' is specified, the process sleeps moderately and relaxes the stress of the system.

The sub command `optimize' is used in order to delete useless information which has accumulated through updates of an inverted index.

estindex optimize [-relax] [-small] name
`name' specifies the name of an inverted index.
If the option `-relax' is specified, the process sleeps moderately and relaxes the stress of the system.
If the option `-small' is specified, optimization preferring size reduction is performed.

The sub command `inform' is used in order to get information of an inverted index.

estindex inform name
`name' specifies the name of an inverted index.

The sub command `merge' is used in order to merge multiple inverted indexes.

estindex merge [-relax] [-rich] [-plute] name elems...
`name' specifies the name of an inverted index.
`elems' specifies the names of element inverted indexes.
If the option `-relax' is specified, the process sleeps moderately and relaxes the stress of the system.
If the option `-rich' is specified, RAM and disk are used generously, which suits large sites (more than 100 thousand documents).
If the option `-plute' is specified, RAM and disk are used even more generously, which suits very large sites (more than 500 thousand documents).

The sub command `pree' is used in order to test text extraction and word breaking.

estindex pree [-h] [-m] [-x type cmd] [-xt] [-xm] [-enc code] [-pt code] [-ft code] [-tattr attrs] [-wl] [file]
`file' specifies the name of a file to read. If it is omitted, the standard input is read.
If the option `-h' is specified, the input is handled as HTML. The default is plain text.
If the option `-m' is specified, the input is handled as e-mail. The default is plain text.
The option `-x' is used for the input to be processed by an arbitrary external command. `type' specifies a media type. `cmd' specifies a command to convert the original data to HTML.
If the option `-xt' is specified, output of the external command is treated as plain text.
If the option `-xm' is specified, output of the external command is treated as MIME.
The option `-enc' specifies the encoding of the registered files with `code'. By default, the encoding of each file is detected automatically from the extracted text.
If the option `-pt' is specified, the title of each registered document is overwritten with its path on the local file system. The encoding of the file system is specified with `code'.
If the option `-ft' is specified, the title of each registered document is overwritten with its file name on the local file system. The encoding of the file system is specified with `code'.
The option `-tattr' specifies attributes to be merged into the text and to be treated as search words. `attrs' specifies a list of attribute names separated with commas.
If the option `-wl' is specified, only the split words in normalized form are output.

The sub command `version' is used in order to know the version information of Estraier.

estindex version
This sub command takes no argument and no option.

Each sub command returns 0 if it finishes successfully, or 1 if any error has occurred. If the environment variable `ESTDBGFD' is set, debug information is output to the specified file descriptor. To abort a running command, send it one of the signals SIGINT (Control-C), SIGQUIT (Control-\), or SIGTERM. Then the inverted index is closed normally and the command finishes. Any other means of forced termination may destroy the inverted index.
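
When a sub command is driven from a script, the points above suggest checking the exit status and, if the command must be stopped, sending SIGINT rather than killing the process. The following is a sketch for POSIX systems; the function name and the timeout policy are hypothetical and not part of Estraier.

```python
import signal
import subprocess
import sys


def run_gracefully(cmd, timeout):
    """Run cmd and return its exit status.

    If it runs longer than `timeout` seconds, send SIGINT (never SIGKILL)
    and wait for the command to finish on its own.
    """
    proc = subprocess.Popen(cmd)
    try:
        return proc.wait(timeout=timeout)
    except subprocess.TimeoutExpired:
        proc.send_signal(signal.SIGINT)
        return proc.wait()


if __name__ == "__main__":
    # Stand-in command so the sketch runs without Estraier installed;
    # in practice cmd would be e.g. ["estindex", "register", "casket"].
    status = run_gracefully([sys.executable, "-c", "pass"], timeout=60)
    print("success" if status == 0 else "error")
```

Waiting after sending the signal matters, because the command needs time to close the inverted index before it exits.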

Extraction of Text and Attributes

When parsing plain text, the following steps are performed.

Detect the encoding and normalize it to UTF-8.
Extract whole text as the text of the registered document. Prefixed quoting marks of each line are deleted and folded lines are concatenated.
Extract the last modified time of the file as the `date' attribute of the registered document.
Specify the `type' attribute of the registered document as `text/plain'.
Extract the detected encoding name as the `encoding' attribute of the registered document.
Extract the size of the data as the `size' attribute of the registered document.

When parsing HTML, the following steps are performed.

Detect the encoding and normalize it to UTF-8.
If the encoding is specified with a `meta' element, normalize the encoding again.
Divide the data into tags and text sections.
Extract text sections in the `body' element as the text of the registered document. However, contents of `script' and `style' elements are ignored.
Extract the text section in the `title' element as the `title' attribute of the registered document. As the value is included in the body text, it is reflected on the inverted index.
If the value of the `name' attribute of a `meta' element is `author', the value of the `content' attribute is extracted as the `author' attribute of the registered document.
Extract the last modified time of the file as the `date' attribute of the registered document.
Specify the `type' attribute of the registered document as `text/html'.
Extract the detected encoding name as the `encoding' attribute of the registered document.
Extract the size of the data as the `size' attribute of the registered document.
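
A few of the HTML steps above can be sketched with Python's standard `html.parser' module. This is an illustration, not Estraier's parser: the class and function names are hypothetical, and encoding detection, the `date', `encoding', and `size' attributes are left out.

```python
from html.parser import HTMLParser


class DocExtractor(HTMLParser):
    """Collect body text plus the `title' and `author' attributes."""

    def __init__(self):
        super().__init__()
        self.attrs = {"type": "text/html"}
        self.text = []
        self._stack = []  # currently open title/body/script/style elements

    def handle_starttag(self, tag, attrs):
        a = dict(attrs)
        if tag == "meta" and (a.get("name") or "").lower() == "author":
            self.attrs["author"] = a.get("content") or ""
        if tag in ("title", "body", "script", "style"):
            self._stack.append(tag)

    def handle_endtag(self, tag):
        if tag in self._stack:
            # Pop up to and including the element being closed.
            while self._stack and self._stack.pop() != tag:
                pass

    def handle_data(self, data):
        if not data.strip() or not self._stack:
            return
        top = self._stack[-1]
        if top == "title":
            self.attrs["title"] = data.strip()
        elif top == "body":  # script and style contents are ignored
            self.text.append(data.strip())


def extract_html(html_text):
    """Return (body text, attributes) for an HTML string."""
    parser = DocExtractor()
    parser.feed(html_text)
    parser.close()
    return " ".join(parser.text), parser.attrs
```

Text is collected only while the innermost tracked element is `body', which is how the contents of `script' and `style' elements are skipped.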

When parsing MIME, the following steps are performed.

Divide the headers and the message body.
Detect the media type and the encoding with the `Content-Type' header.
Check the encoding of the message body, and if it conflicts with the assignment by the header, normalize the encoding again.
Extract the value of the `Subject' header as the `title' attribute of the registered document. As the value is included in the body text, it is reflected on the inverted index.
Extract the value of the `From' header as the `author' attribute of the registered document.
Extract the value of the `To' header as the `recipient' attribute of the registered document.
Extract the value of the `Cc' header as the `multicast' attribute of the registered document.
Extract the value of the `Date' header as the `date' attribute of the registered document.
Specify the `type' attribute of the registered document as `message/rfc822'.
Extract the detected encoding name as the `encoding' attribute of the registered document.
Process the message body as the plain text or HTML. If the media type is multipart, perform the above steps against the first part recursively.
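
The header-to-attribute mapping and the handling of multipart messages can be sketched with Python's standard `email' package. This is only an illustration of the steps above: the function name is hypothetical, encoded (RFC 2047) header values are not decoded, and the check for a conflicting body encoding is omitted.

```python
from email import message_from_string

# Header names and the attribute names listed in the steps above.
HEADER_TO_ATTR = {
    "Subject": "title",
    "From": "author",
    "To": "recipient",
    "Cc": "multicast",
    "Date": "date",
}


def extract_mime(raw):
    """Return (body text, attributes, media type of the body)."""
    msg = message_from_string(raw)
    attrs = {"type": "message/rfc822"}
    for header, attr in HEADER_TO_ATTR.items():
        if msg[header] is not None:
            attrs[attr] = str(msg[header])
    part = msg
    while part.is_multipart():
        part = part.get_payload(0)  # descend into the first part
    charset = part.get_content_charset() or "us-ascii"
    body = part.get_payload(decode=True).decode(charset, errors="replace")
    return body, attrs, part.get_content_type()
```

The returned media type tells the caller whether to continue with the plain text steps or the HTML steps.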

If the `title' attribute of a registered document is not extracted, and if the encoding of its file name is US-ASCII, the file name is treated as the `title' attribute. Moreover, if the attribute `realuri' is specified, search programs show the value instead of the URI of the registered document.

Filter Programs

The source package of Estraier provides the command `estpdfhtml' to handle PDF files, `estdochtml' to handle MS-Word files, `estxlshtml' to handle MS-Excel files, `estppthtml' to handle MS-PowerPoint files, `estmanhtml' to handle man pages of UNIX, `estgzhtml' to handle gzipped or zipped files of plain text or HTML, `estxdwhtml' to handle DocuWorks files, and `estxdthtml' to handle files in various formats on Windows. They are called as filter programs when constructing an inverted index. Each of them reads data from the file specified with the first argument, converts the data to HTML, and writes the HTML to the file specified with the second argument.

To use `estpdfhtml', you should install `pdftotext' beforehand. It works on UNIX. Usually, you will perform the following command in order to register PDF files into an inverted index.

estindex register -tsuf "" -hsuf "" -msuf "" \
  -xsuf .pdf application/pdf estpdfhtml casket

To use `estdochtml', you should install `wvWare' beforehand. It works on UNIX. Usually, you will perform the following command in order to register MS-Word files into an inverted index.

estindex register -tsuf "" -hsuf "" -msuf "" \
  -xsuf .doc application/msword estdochtml casket

To use `estxlshtml', you should install `xlhtml' beforehand. It works on UNIX. Usually, you will perform the following command in order to register MS-Excel files into an inverted index.

estindex register -tsuf "" -hsuf "" -msuf "" \
  -xsuf .xls application/vnd.ms-excel estxlshtml casket

To use `estppthtml', you should install `ppthtml' beforehand. It works on UNIX. Usually, you will perform the following command in order to register MS-PowerPoint files into an inverted index.

estindex register -tsuf "" -hsuf "" -msuf "" \
  -xsuf .ppt application/vnd.ms-powerpoint estppthtml casket

To use `estmanhtml', you should install `groff' beforehand. It works on UNIX. Usually, you will perform the following command in order to register `man' files into an inverted index.

estindex register -tsuf "" -hsuf "" -msuf "" \
  -xsuf .1,.2,.3,.4,.5,.6,.7,.8 application/x-troff-man estmanhtml casket

To use `estgzhtml', you should install `gzip' beforehand. It works on UNIX. Usually, you will perform the following command in order to register gzipped or zipped files into an inverted index.

estindex register -tsuf "" -hsuf "" -msuf "" \
  -xsuf .txt.gz,.asc.gz,.txt.zip,.asc.zip,.html.gz,.htm.gz,.html.zip,.htm.zip \
      text/html estgzhtml casket

To use `estxdwhtml', you should install `xdw2text' beforehand. It works on UNIX. Usually, you will perform the following command in order to register DocuWorks files into an inverted index.

estindex register -tsuf "" -hsuf "" -msuf "" \
  -xsuf .xdw application/vnd.fujixerox.docuworks estxdwhtml casket

To use `estxdthtml', you should install `xdoc2txt' beforehand. It works on Windows, and can process PDF, RTF, MS-Word, MS-Excel, MS-PowerPoint files, and so on. Usually, you will perform the following command.

estindex register -tsuf "" -hsuf "" -msuf "" \
  -xsuf .pdf,.rtf,.doc,.xls,.ppt application/octet-stream estxdthtml casket

The utility command `estfind' makes it possible to combine several filters. For example, the following registers PDF, MS-Word, and MS-Excel files at the same time.

./estfind -pdf -doc -xls . |
  ./estindex register \
    -xtype application/pdf estpdfhtml \
    -xtype application/msword estdochtml \
    -xtype application/vnd.ms-excel estxlshtml \
    -list - casket

`estfind' is a utility to locate files under a directory and output TSV of paths and media types. Its usage is as follows.

estfind [options] directory [iregex mime] ...
`directory' specifies the directory of the starting point.
`iregex' specifies a regular expression for file names.
`mime' specifies the MIME type corresponding to the regular expression.
The option `-html' is a shortcut for ".*\.htm\(l\)?" "text/html".
The option `-text' is a shortcut for ".*\.\(txt\|asc\)" "text/plain".
The option `-pdf' is a shortcut for ".*\.pdf" "application/pdf".
The option `-doc' is a shortcut for ".*\.doc" "application/msword".
The option `-xls' is a shortcut for ".*\.xls" "application/vnd.ms-excel".
The option `-ppt' is a shortcut for ".*\.ppt" "application/vnd.ms-powerpoint".
The option `-xdw' is a shortcut for ".*\.xdw" "application/vnd.fujixerox.docuworks".
The option `-man' is a shortcut for ".*\.[0-9]" "application/x-troff-man".
The option `-magic' specifies to use the "file" command to determine the MIME type as a last resort.
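
The behavior of `estfind' can be approximated by the following sketch, which walks a directory and emits one line per matching file. It is not the real implementation: the rule table covers only three of the shortcuts, the patterns are rewritten in Python's regular expression syntax, and the column layout (path, tab, media type) is an assumption based on the description above.

```python
import os
import re

# Subset of the shortcut rules, matched case-insensitively against the path.
RULES = [
    (re.compile(r".*\.html?$", re.I), "text/html"),
    (re.compile(r".*\.(txt|asc)$", re.I), "text/plain"),
    (re.compile(r".*\.pdf$", re.I), "application/pdf"),
]


def classify(path, rules=RULES):
    """Return the media type of the first matching rule, or None."""
    for pattern, mime in rules:
        if pattern.match(path):
            return mime
    return None


def find_tsv(directory, rules=RULES):
    """Return 'path<TAB>media type' lines for matching files under directory."""
    lines = []
    for dirpath, dirnames, filenames in os.walk(directory):
        dirnames.sort()  # make the traversal order deterministic
        for name in sorted(filenames):
            path = os.path.join(dirpath, name)
            mime = classify(path, rules)
            if mime is not None:
                lines.append(f"{path}\t{mime}")
    return lines
```

Files matching no rule are simply skipped, which corresponds to running `estfind' without the `-magic' option.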

You can get programs on which the filters depend, at the following sites. Refer to the license of each product for use conditions.

pdftotext : http://www.foolabs.com/xpdf/
wvWare : http://wvware.sourceforge.net/
xlhtml : http://chicago.sourceforge.net/xlhtml/
ppthtml : included in the package of `xlhtml'
groff : http://www.gnu.org/software/groff/groff.html
xdw2text : http://www.fujixerox.co.jp/soft/docuworks/
xdoc2txt : http://www31.ocn.ne.jp/~h_ishida/xdoc2txt.html

You can also use your own filter programs in addition to the above. A filter program can be implemented in any language. C, Perl, and shell scripts are all fine.
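
As an illustration of the filter interface (input file as the first argument, output file as the second), here is a minimal sketch of a custom filter written in Python. It wraps plain text in HTML and uses the first line as the title. The function names and the choice of title are hypothetical, and the input is assumed to be UTF-8.

```python
import html
import sys


def to_html(text):
    """Wrap plain text in a minimal HTML document, using line 1 as the title."""
    lines = text.strip().splitlines()
    title = lines[0] if lines else ""
    return (
        "<html><head>"
        '<meta http-equiv="Content-Type" content="text/html; charset=UTF-8">'
        f"<title>{html.escape(title)}</title></head>\n"
        f"<body><pre>{html.escape(text)}</pre></body></html>\n"
    )


def convert(src_path, dst_path):
    """Read the input file and write the converted HTML to the output file."""
    with open(src_path, encoding="utf-8", errors="replace") as f:
        text = f.read()
    with open(dst_path, "w", encoding="utf-8") as f:
        f.write(to_html(text))


if __name__ == "__main__" and len(sys.argv) == 3:
    # Called as a filter: argv[1] is the input file, argv[2] the output file.
    convert(sys.argv[1], sys.argv[2])
```

Such a script would be passed as `cmd' to the `-xsuf' or `-xtype' option of `estindex register', in the same way as the bundled filters.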

User Interface for Search

Typical Example

To expose full-text search, you should deploy a CGI script and its configuration files. For example, if the inverted index is placed in `/home/mikio/public_html' and CGI scripts are available there, perform the following steps.

cd /home/mikio/public_html
cp /usr/local/libexec/estsearch.cgi .
cp /usr/local/share/estraier/estsearch.conf .
cp /usr/local/share/estraier/estsearch.tmpl .
cp /usr/local/share/estraier/estsearch.top .

`estsearch.cgi' is the CGI script for users to access. Each configuration file can be edited with a text editor. At first, you do not have to modify them. For the time being, just access `estsearch.cgi' via the Web.

Search Conditions

When a user enters search words into the input form for search conditions and pushes the `Search' button, the search result is shown. If one word is specified, documents including the word are shown. If two or more words are specified, documents including all of them are shown. Words are delimited by half-width or full-width spaces.

The second and subsequent search words can be preceded by operators. Each operator and word should be separated with a space. All operators have equal precedence and are evaluated from left to right.

`[and] word`	keep only documents containing the specified word in the result. This is the default behavior when no operator is given.
`[or] word`	add documents containing the specified word to the result.
`[not] word`	remove documents containing the specified word from the result.

For example, the following condition searches for documents containing both `apple' and `orange'.

apple orange

The following condition has the same meaning as the above, written with an explicit operator.

apple [and] orange

The following condition searches for documents containing either `apple' or `orange'.

apple [or] orange

The following condition searches for documents containing `apple' but not `orange'.

apple [not] orange

The following condition searches for documents containing both `apple' and `orange' but not `grape'.

apple [and] orange [not] grape

The following condition searches for documents containing either `apple' or `orange', and containing `grape', but without `melon'.

apple [or] orange [and] grape [not] melon

The following condition searches for documents containing `apple' but neither `orange' nor `grape'.

apple [not] orange [not] grape

However, it is impossible to express a condition such as documents containing `apple' but not containing `orange' and `grape' together, because operators are always evaluated from left to right.
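
The left-to-right evaluation can be sketched with set operations. In this illustration, `index' is a hypothetical mapping from each word to the set of documents containing it, and the function name is made up. It mirrors the rules above but is not the code of Estraier.

```python
def evaluate(condition, index):
    """Evaluate 'word [op] word ...' strictly from left to right."""
    tokens = condition.split()
    if not tokens:
        return set()
    result = set(index.get(tokens[0], set()))
    op = "[and]"
    for tok in tokens[1:]:
        if tok in ("[and]", "[or]", "[not]"):
            op = tok
            continue
        docs = index.get(tok, set())
        if op == "[and]":
            result &= docs   # narrow to documents containing the word
        elif op == "[or]":
            result |= docs   # add documents containing the word
        else:
            result -= docs   # remove documents containing the word
        op = "[and]"  # a word without an operator defaults to [and]
    return result


# Hypothetical index: word -> set of document numbers.
index = {
    "apple": {1, 2, 3, 4},
    "orange": {2, 4, 5},
    "grape": {3, 4, 6},
    "melon": {6},
}
print(sorted(evaluate("apple [or] orange [and] grape [not] melon", index)))
```

Because each operator acts on the whole result accumulated so far, there is no way to apply `[not]' to the combination of two later words, which is the limitation noted above.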

Some compound words are split into several words even if they are not separated with spaces. In that case, the second and subsequent words are treated as having the same operator as the first one, or `[and]' if there is no operator. Using this trick, you can search with a condition composed of natural text.

By setting `n per page', you can change the number of shown documents in a page. By setting `n clusters', you can see the result documents categorized with their similarity. By setting `n level rep', you can pick up the top document in the same directory of URI. By setting `sort by order', you can change the order of sorting shown documents.

If Estraier was built with regular expression support enabled, you can select one of several expression formats. If `as-is expression' is selected, each word expresses the specified pattern itself. If `with wild cards' is selected, `*' can be placed as a wild card in each expression. The wild card matches any string. If `regular expressions' is selected, each word is treated as a regular expression (POSIX Extended Regular Expressions). The following are elements of regular expressions.

`P`	matching `P' itself.
`PQ`	matching a sequence of `P' and `Q'.
`.`	matching any one character.
`P*`	matching a sequence of 0 or more matches of `P'.
`P+`	matching a sequence of 1 or more matches of `P'.
`P?`	matching a sequence of 0 or 1 matches of `P'.
`P|Q`	matching either `P' or `Q'.
`^P`	matching `P' at the beginning.
`P$`	matching `P' at the end.
`(P)`	giving higher priority to evaluation of `P'.

For example, to express words which begin with `work' (for example, `work', `worked', `worker', and `works'), select wild cards and input `work*', or select regular expressions and input `^work.*'.

You can combine wild cards or regular expressions with operators. For example, to search for documents including `apple' or `orange', and including `grape' or `melon', but including neither `kiwi' nor `banana', select regular expressions and input the following.

^(apple|orange)$ [and] ^(grape|melon)$ [not] ^(kiwi|banana)$

The maximum number of candidate words expanded from wild cards or regular expressions is 1024.
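As a rough illustration of how a pattern stands for a set of candidate words, the following Python sketch expands one expression against a small vocabulary. The vocabulary, the function name `expand', and the choice to truncate at the limit are assumptions made for this example; Estraier itself uses POSIX Extended Regular Expressions, whose syntax differs from Python's `re' module in some details.

```python
import re

MAX_CANDIDATES = 1024  # the limit stated in the text

def expand(pattern, vocabulary, expr="asis"):
    """Return the words of the vocabulary that a pattern stands for.

    expr is `asis', `wild', or `regex', mirroring the three formats.
    """
    if expr == "asis":
        matched = [w for w in vocabulary if w == pattern]
    elif expr == "wild":
        # Only `*' is a wild card; every other character is literal.
        # Assumption: the pattern must cover the whole word.
        regex = re.compile(".*".join(re.escape(p) for p in pattern.split("*")))
        matched = [w for w in vocabulary if regex.fullmatch(w)]
    elif expr == "regex":
        regex = re.compile(pattern)
        matched = [w for w in vocabulary if regex.search(w)]
    else:
        raise ValueError("unknown expression format: " + expr)
    # Assumption: candidates beyond the limit are simply dropped.
    return matched[:MAX_CANDIDATES]

# Hypothetical index vocabulary.
vocabulary = ["work", "worked", "worker", "works", "network",
              "apple", "orange", "pineapple"]
print(expand("work*", vocabulary, "wild"))
print(expand("^work.*", vocabulary, "regex"))
print(expand("^(apple|orange)$", vocabulary, "regex"))
```

Note how the anchors `^' and `$' keep `pineapple' out of the last result; without them, a regular expression matches anywhere inside a word.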

Results

Attributes and a summary of each document matching the search condition are shown as the search result. A list of keywords is also shown in each summary. When you select a keyword, the word is added to the current search condition and the search is performed again. This feature is useful to narrow down a result. As search words in each summary are highlighted, it is easy to see how the words are used.

When you select `(detail)' in a summary, a list of words registered for the summary is shown. As search words are highlighted there as in the summary, it is easy to see where the words occur.

When you select `(related)' in a summary, documents related to the selected document are searched for. Relational document search is a feature to retrieve documents which have similar tendencies of word occurrences. Even when you cannot come up with appropriate words, you may reach your target documents by following related documents.

If more documents match than the specified number, they are shown over several pages. You can move backward or forward through the pages by selecting `PREV' or `NEXT'.

If a great many documents match, the result is shown without evaluating all of them. In that case, the link `(or more)' is shown at the top of the page. You can get the complete result by selecting it.

Setting

`estsearch.cgi' is the CGI script for search. It is installed as `/usr/local/libexec/estsearch.cgi'. Copy it into a directory which is public via Web. Usually, an inverted index and some configuration files are placed in the same directory.

Configuration files are `estsearch.conf', `estsearch.tmpl', and `estsearch.top'. Their default descriptions are installed in `/usr/local/share/estraier'. Copy and customize them.

Configuration files should be placed in the current directory of a process of `estsearch.cgi'. In most cases, the current directory of a CGI script is the same directory where the script is placed. However, it depends on implementations of web servers. As for Microsoft IIS, the current directory of a CGI script is the root of a `virtual directory'.

Prime Configuration File

`estsearch.cgi' reads configurations of the file named as `estsearch.conf' in the current directory. Each line of the configuration file begins with the name of an attribute tailed with `:'. The value of each attribute is placed after `:'. The encoding of the prime configuration file should be US-ASCII or UTF-8. Principal attributes are the following.

indexname: casket
tmplfile: estsearch.tmpl
topfile: estsearch.top
prefix: ./
suffix:
replace:
diridx: index.html
decuri: false
boolunit: 1024
relkeys: 16
defmax: 8
reevmax: 1024
showkeys: 8
sumall: 96
sumtop: 24
sumwidth: 16
clustunit: 128
clustkeys: 8
logfile:

`indexname' specifies the name of the inverted index. If you want to use another name or place it in another directory, change the value. `tmplfile' specifies the template file described below. `topfile' specifies the top page file described below. If you want to use other names, change their values.

`prefix' specifies the prefix string of the URI of each document in the results. For example, if it is `http://foo.bar/baz/', `./apple.html' is shown as `http://foo.bar/baz/apple.html'. `suffix' specifies the suffix string of the URI of each document in the results. For example, if it is `.html', `./data/751' is shown as `./data/751.html'. `replace' specifies an expression to replace the URI of each document. The before string and the after string are delimited with space characters. For example, if it is `/foo/ /bar/', `./foo/apple.html' is replaced with `./bar/apple.html'. This attribute can be specified multiple times. `diridx' specifies the index file of each directory. For example, if it is `index.html', `./foo/index.html' is shown as `./foo/'. If `decuri' is `true', the URI of each document in the results is decoded as a URL-encoded string. This feature is useful with files saved by the GNU `wget' command.

`boolunit' and `relkeys' are parameters for the accuracy of search. Increasing them raises accuracy but lowers processing speed. Decreasing them lowers accuracy but raises processing speed. Usually, you do not have to change them.

`defmax' specifies the default number of shown documents in the results. `reevmax' specifies the maximum number of words expanded from a regular expression. `sumall' specifies the number of words in the summary of a document. `sumtop' specifies the number of words picked from the top of a document. `sumwidth' specifies the number of words picked around each search word. `clustunit' specifies the number of targets of document clustering. `clustkeys' specifies the number of shown keywords for each cluster. `logfile' specifies the log file into which input search conditions are output. Besides, attributes for label strings are defined in the configuration file.
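The `name: value' layout described above is simple enough to read with a few lines of code. The following Python sketch is only an illustration of the format, not the parser used by `estsearch.cgi'. It assumes that lines without `:' can be skipped and that surrounding white space is not significant; every value is kept in a list because `replace' may appear several times. The second `replace' line in the sample is hypothetical.

```python
def parse_conf(text):
    """Parse `name: value' lines into a dict mapping names to value lists."""
    conf = {}
    for line in text.splitlines():
        if ":" not in line:
            continue  # assumption: lines without `:' carry no attribute
        # Split at the first `:' only, so URIs in values stay intact.
        name, value = line.split(":", 1)
        conf.setdefault(name.strip(), []).append(value.strip())
    return conf

sample = """\
indexname: casket
prefix: http://foo.bar/baz/
suffix:
replace: /foo/ /bar/
replace: /old/ /new/
defmax: 8
"""
conf = parse_conf(sample)
print(conf["prefix"])
print(conf["replace"])
```

Splitting at the first `:' only is the important detail: a `prefix' such as `http://foo.bar/baz/' contains a colon of its own.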

Template File

The template file specifies the template of the user interface for search. It is a so-called skin. Although the name of this file is usually `estsearch.tmpl', you can change it by editing the prime configuration file. You can customize the user interface freely by editing the template file. The encoding of the template file should be US-ASCII or UTF-8. While most contents are reflected directly on the user interface, the following expressions are replaced.

: show the input form for search conditions.
: show the search results.
: show the number of documents and words in the inverted index.
: show the processing time of the CPU.
: show the version information of Estraier.
: show contents of a file specified with `name'.
: show the result of a command specified with `command' with the shell.
{ESTPHRASE}: show the search phrase input by users.
{ESTMAX}: show the number of shown documents specified by users.
{ESTDNUM}: show the number of documents in the inverted index.
{ESTWNUM}: show the number of words in the inverted index.

By using `', you can insert HTML dynamically into the page. It is a so-called plug-in. The called command can get the value of the `phrase' parameter from the value of the environment variable `ESTPHRASE'. The command can get the other parameters by analyzing the value of the environment variable `QUERY_STRING'.

Top Page File

The top page file specifies the messages shown when a user does not input any search condition. In other words, the messages are shown when a user first visits the search page. Although the name of this file is usually `estsearch.top', you can change it by editing the prime configuration file. As the data of the top page file are inserted in the place of `', they should be parts of HTML. The encoding of the top page file should be US-ASCII or UTF-8.

You do not have to leave the logo and version information of Estraier intact. You can apply your favorite design as long as the syntax of the page is valid as HTML (XHTML). It is suggested that you give a description of help or a tutorial of this search system in your language. Sample files for localization are installed under `/usr/local/share/estraier/locale'. Sample template files of different flavors are installed under `/usr/local/share/estraier/skins'.

Log File

If `logfile' is specified in the prime configuration file, search conditions input by users are output into the log file. Each condition is separated with a line feed, and the following terms separated with tabs are recorded.

Current date
IP address of the client
Search phrase
Number of shown documents
Document clustering
Level of directory rep
Order of sorting
Number of the page
ID number of the seed document for relational search
ID number of the document for detail viewing
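Since each record is one line of tab-separated terms in the order listed above, a log can be read back with a short script. The following Python sketch is a hedged example: the field names are labels invented here for the ten terms, the sample line is hypothetical, and all values are kept as strings because the exact format of the date and of empty terms is not specified in this document.

```python
# Names chosen for this example, in the order the terms are recorded.
FIELDS = ("date", "client", "phrase", "max", "clustering",
          "drep", "sort", "page", "relid", "detid")

def parse_log_line(line):
    """Split one log line into a dict keyed by the field names above."""
    # Strip only the line feed, so that empty trailing terms survive.
    values = line.rstrip("\n").split("\t")
    if len(values) != len(FIELDS):
        raise ValueError("expected %d terms, got %d" % (len(FIELDS), len(values)))
    return dict(zip(FIELDS, values))

# Hypothetical record; real logs may format each term differently.
sample = "2006-01-07 21:37:07\t192.168.0.5\tapple [and] orange\t8\t0\t0\tscore\t1\t\t\n"
record = parse_log_line(sample)
print(record["client"], record["phrase"])
```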

Search Form

If you want to place the search form in another page, write such HTML as the following. The parameter `phrase' specifies the search phrase. The parameter `enc' specifies the character encoding of the page where the search form is.

<form method="get" action="estsearch.cgi">
<div>
<input type="text" name="phrase" value="" size="64" tabindex="1" />
<input type="submit" value="Search" tabindex="2" />
<input type="hidden" name="enc" value="UTF-8" />
</div>
</form>

Search Server

Because `estsearch.cgi' is implemented as a CGI script, every access has overhead to connect the database and the morphological analyzer. To deal with this problem, Estraier provides `estserver' which is a web server featuring full-text search. Usage of this command is the following. It publishes contents under the current directory, and when the URL `/estsearch' is requested, it provides interfaces of full-text search as with `estsearch.cgi'. If no argument is specified, `casket' is read as the inverted index, `estsearch.conf' is read as the configuration file, `estsearch.tmpl' is read as the template file, and `estsearch.top' is read as the top page file.

estserver [-host name] [-port num] [-div num] [-uid num] [-dtype type] [-auth user:pass] [index conffile tmplfile topfile]

`index' specifies an inverted index created with the command `estindex'.
`conffile' specifies the configuration file commonly used by `estsearch.cgi'.
`tmplfile' specifies the template file commonly used by `estsearch.cgi'.
`topfile' specifies the top page file commonly used by `estsearch.cgi'.
`-host name' specifies the host name. By default, the server is bound to all network interfaces, and the formal name of the system is used as the host name.
`-port num' specifies the port number of the server. By default, it is 4210.
`-div num' specifies the number of child processes. By default, it is 4.
`-uid num' specifies the user ID of child processes. By default, it is the same as that of the parent. This option is available only if the parent process is run by the super user.
`-dtype type' specifies the content type of files without any suffix. By default, it is `application/octet-stream'.
`-auth user:pass' specifies a user name and password for basic authentication. By default, no authentication is performed.

For example, to run the server on port 8888 of the host `estraier.foo.edu' and publish contents under `/home/mikio/public_html', install the configuration files and the inverted index there, and run the following commands. After that, access `http://estraier.foo.edu:8888/estsearch'.

cd /home/mikio/public_html
estserver -host estraier.foo.edu -port 8888 \
  casket estsearch.conf estsearch.tmpl estsearch.top

Supported methods are GET, POST, and HEAD. CGI is not supported. The settings of `indexname', `tmplfile', and `topfile' in the prime configuration file are ignored. Log messages are output to the standard output.

This command runs as one parent process and multiple child processes. Requests from clients are handled by the children. Each child exits after it has handled 128 requests. The parent spawns a replacement as soon as a child exits. This mechanism provides safe, parallel processing. To stop the service, send the SIGINT signal (Control-C) to the parent. If 0 is specified as the number of children, the parent itself handles requests.

Parameters

`estsearch.cgi' and `estserver' read the following parameters. The value of each parameter should be URL-encoded as with usual CGI.

phrase : the search phrase.
max : the number of shown documents per page.
clshow : the number of shown clusters of document clustering.
clcode : the code of the cluster for narrowing.
clall : whether to show all clusters. By default, it is disabled. If `true', it is enabled.
drep : the level of directory rep. By default, it is disabled.
sort : the order of sorting the result. It is `score', `r-score', `date', or `r-date'. By default, it is by score.
expr : the expression format of search words. It is `asis', `wild', or `regex'. By default, it is as-is expression.
page : the number of a page. It does not have to be specified explicitly.
skip : the number of documents omitted to be shown. It does not have to be specified explicitly.
unit : the unit number of documents searched for. It does not have to be specified explicitly.
relid : the ID number of a seed document of relational search.
detid : the ID number of a document shown in detail.
tfidf : whether to enable TF-IDF. By default, it is enabled. If `false', it is disabled.
mrglbl : the label specified by the merger of meta search.
showsc : whether to enable score output for meta search. By default, it is disabled. If `true', it is enabled.
relsc : scores of the seed document specified by the merger of meta search.
enc : the encoding of the search phrase and label. By default, it is UTF-8.
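Because the parameters are ordinary URL-encoded CGI parameters, a search request can be built with any standard library for query strings. The following Python sketch composes such a URL; the helper name `search_url' is an assumption of this example, and the host is the one used in the `estserver' example above.

```python
from urllib.parse import urlencode

def search_url(base, phrase, **params):
    """Build a search URL for `estsearch.cgi' or `estserver'.

    Extra keyword arguments are passed through as parameters
    such as max, sort, or expr.
    """
    query = {"phrase": phrase, "enc": "UTF-8"}
    query.update(params)
    # urlencode applies URL encoding to every name and value.
    return base + "?" + urlencode(query)

url = search_url("http://estraier.foo.edu:8888/estsearch",
                 "apple [and] orange", max=16, sort="date")
print(url)
```

Sending `enc' explicitly, as the search form shown earlier does, keeps the encoding of the phrase unambiguous even though UTF-8 is the default.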

Search Command

In order to search documents from shells or scripting languages, the command `estxview' is provided. This command outputs results in XML. By processing them, it is possible to build original user interfaces. The following is the usage of this command. By default, the search condition is treated as a search phrase compatible with `estsearch.cgi'. The format of the option `-relsc' is TSV and each line has a word and its score.

estxview [-id] [-uri] [-rel] [-relsc] [-expr type] [-sort type] [-drep num] [-clshow num] [-clcode num] [-max num] [-ni] [-tiny] [-nt] [-snum num] [-nk] [-css] [-xsl] [-dtd] [-rf] [-dlist] [-wlist] [-ic code] [-oc code] name [expression...]

`name' specifies the name of an inverted index.
`expression' specifies the search condition.
If `-id' is specified, a document which has ID of the expression is retrieved.
If `-uri' is specified, a document which has URI of the expression is retrieved.
If `-rel' is specified, related documents with a document which has ID of the expression are retrieved.
If `-relsc' is specified, related documents with the score expression are retrieved.
`-expr' specifies the expression format of search words. `type' is `asis', `wild', or `regex'. By default, it is as-is expression.
`-sort' specifies the order of sorting the result. `type' is `score', `r-score', `date', or `r-date'. By default, it is by score.
`-drep' specifies the level of directory rep. By default, it is disabled.
`-clshow' specifies the number of shown clusters of document clustering. By default, it is disabled.
`-clcode' specifies the code of the cluster for narrowing. By default, it is disabled.
`-max' specifies the number of shown documents. By default, it is 8.
If `-ni' is specified, TF-IDF is disabled. By default, it is enabled.
If `-tiny' is specified, it outputs only ID and score of each corresponding document.
If `-nt' is specified, text of documents is hidden.
`-snum' specifies the number of words in summary. By default, it is 80.
If `-nk' is specified, keywords of documents and clusters are hidden.
If `-css' is specified, the processing instruction referring to `estxview.css' is embedded.
If `-xsl' is specified, the processing instruction referring to `estxview.xsl' is embedded.
If `-dtd' is specified, the document type declaration referring to `estxview.dtd' is embedded.
If `-rf' is specified, the search condition is read from the file specified with `expression'.
If `-dlist' is specified, information of all registered documents is output.
If `-wlist' is specified, information of all index terms is output.
`-ic' specifies the character encoding of the arguments. By default, it is UTF-8.
`-oc' specifies the character encoding of the output. By default, it is UTF-8.

This command returns 0 if searching finishes successfully, or 1 if any error has occurred. If the environment variable `ESTDBGFD' is set, debug information is output to the specified file descriptor. The syntax and semantics of the output are explained in `estxview.dtd'.

Scoring Algorithm

Scores shown when searching are calculated by the following algorithm.

tf = the frequency of the search word in the document
wt = 5000 if the search word appears first in the front 10% of the document, else 0
ds = the number of words in the document
rsc = the raw score of the search word = ((tf * 10000) + wt) / ((log(ds) ^ 3) / 8)
df = the frequency of documents including the search word
ad = the number of documents registered into the inverted index
asc = the adjusted score of the search word = (rsc / ((log(df) ^ 3) / 8)) * log(ad)
tsc = the shown score = the total of `asc' of all search words
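The formulas above can be transcribed directly into code. The following Python sketch does so under two assumptions that the text does not state: `log' is taken to be the natural logarithm, and `^' is read as exponentiation. Both `ds' and `df' must be greater than 1 here, because the logarithm of 1 is 0 and the division would fail; how Estraier itself treats that case is not described in this document.

```python
import math

def raw_score(tf, ds, in_front):
    """rsc: raw score of one search word in one document."""
    wt = 5000 if in_front else 0  # bonus for a first hit in the front 10%
    return (tf * 10000 + wt) / (math.log(ds) ** 3 / 8)

def adjusted_score(rsc, df, ad):
    """asc: raw score adjusted by document frequency and index size."""
    return rsc / (math.log(df) ** 3 / 8) * math.log(ad)

def shown_score(words, ad):
    """tsc: total of asc over all search words.

    words is a list of (tf, ds, in_front, df) tuples, one per search word.
    """
    return sum(adjusted_score(raw_score(tf, ds, front), df, ad)
               for tf, ds, front, df in words)

# Hypothetical figures: two search words in a 1000-word document,
# in an index of 50000 documents.
print(shown_score([(3, 1000, True, 120), (1, 1000, False, 4000)], ad=50000))
```

The shape of the formulas is visible in the code: a higher term frequency raises the score, while a longer document or a word found in many documents lowers it.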

Similarities shown in relational document search are calculated by the following algorithm.

ovec = the vector composed of greatest `asc' of words in the seed document.
tvec = the vector composed of greatest `asc' of words in the target document, which is mapped into the space of `ovec'.
sim = the shown similarity = (ovec * tvec) / (|ovec| * |tvec|)
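The similarity above is the cosine of the angle between the two vectors. The following Python sketch shows one reading of it, in which "mapped into the space of `ovec'" means that only the words of the seed document are considered and a word missing from the target counts as 0. The dictionaries of word scores are hypothetical data.

```python
import math

def similarity(ovec, target):
    """Cosine similarity between a seed vector and a target document.

    ovec and target map words to their `asc' scores.
    """
    # Project the target onto the words of the seed document.
    tvec = {word: target.get(word, 0.0) for word in ovec}
    dot = sum(ovec[word] * tvec[word] for word in ovec)
    norm = (math.sqrt(sum(v * v for v in ovec.values())) *
            math.sqrt(sum(v * v for v in tvec.values())))
    return dot / norm if norm else 0.0

seed = {"search": 3.0, "index": 2.0, "estraier": 5.0}
other = {"search": 1.5, "index": 1.0, "estraier": 2.5, "cooking": 9.0}
print(similarity(seed, other))
```

Under this reading, words that occur only in the target document (such as `cooking' here) do not affect the result.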

Meta Search System

Description

The number of documents which can be stored in an inverted index depends on the size of the documents, but in practice it is limited to about one million. If you want to build a larger search system, you should deploy inverted indexes on several computers. When searching, an agent sends the query to every computer at once and merges the results. `estmerge.cgi' is such an agent.

`estmerge.cgi' can merge the result of not only `estsearch.cgi' and `estserver', but also `estmerge.cgi' recursively. Multilevel meta search realizes a very large search system. For example, if the number of targets at the first level is 10 and the number of targets in the second level is 10, it can search 100 sites.

Setting

`estmerge.cgi' is the CGI script for meta search. It is installed as `/usr/local/libexec/estmerge.cgi'. Copy it into a directory which is public via Web. Whether the search targets are on the same host does not matter.

Configuration files are `estmerge.conf', `estmerge.tmpl', and `estmerge.top'. Their default descriptions are installed in `/usr/local/share/estraier'. Copy and customize them.

Configuration files should be placed in the current directory of a process of `estmerge.cgi'. In most cases, the current directory of a CGI script is the same directory where the script is placed.

Prime Configuration File

`estmerge.cgi' reads configurations of the file named as `estmerge.conf' in the current directory. Each line of the configuration file begins with the name of an attribute tailed with `:'. The value of each attribute is placed after `:'. The encoding of the prime configuration file should be US-ASCII or UTF-8. Principal attributes are the following.

target: Foo@http://www.foofoo.go.jp/foo/estsearch.cgi
target: Bar@http://www.barbar.ad.jp/bar/estsearch.cgi
target: Baz@http://www.bazbaz.ac.jp/baz/estsearch.cgi
proxy: proxy.mydomain.gov:3128
tmplfile: estmerge.tmpl
topfile: estmerge.top
hidetl: false
defmax: 8
logfile:

`target' specifies a search target. Its label and its URL are divided with `@'. This attribute can be specified multiple times. `proxy' specifies a proxy of HTTP. Its host name and its port number are divided with `:'. If the value is empty, no proxy is used. `tmplfile' specifies the after-mentioned template file. `topfile' specifies the after-mentioned top page file. If `hidetl' is `true', labels of the targets are hidden. `defmax' specifies the default number of shown documents in the results. `logfile' specifies the log file into which input search conditions are output. Besides, attributes for label strings are defined in the configuration file.
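The two delimiters described above can be illustrated with a short Python sketch. The function names are assumptions of this example. A target value is split at the first `@' so that the URL is kept whole, and a proxy value is split at the last `:' into host name and port number.

```python
def parse_target(value):
    """Split `Label@URL' into (label, url) at the first `@'."""
    label, url = value.split("@", 1)
    return label, url

def parse_proxy(value):
    """Split `host:port' into (host, port); an empty value means no proxy."""
    if not value:
        return None
    host, port = value.rsplit(":", 1)
    return host, int(port)

print(parse_target("Foo@http://www.foofoo.go.jp/foo/estsearch.cgi"))
print(parse_proxy("proxy.mydomain.gov:3128"))
```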

Template File

The template file specifies the template of the user interface for search. Although the name of this file is usually `estmerge.tmpl', you can change it by editing the prime configuration file. You can customize the user interface freely by editing the template file. The encoding of the template file should be US-ASCII or UTF-8. While most contents are reflected directly on the user interface, the following expressions are replaced.

: show the input form for search conditions.
: show the search results.
: show the version information of Estraier.
: show contents of a file specified with `name'.
: show the result of a command specified with `command' with the shell.
{ESTPHRASE}: show the search phrase input by users.
{ESTMAX}: show the number of shown documents specified by users.

Top Page File

The top page file specifies the messages shown when a user does not input any search condition. In other words, the messages are shown when a user first visits the search page. Although the name of this file is usually `estmerge.top', you can change it by editing the prime configuration file. As the data of the top page file are inserted in the place of `', they should be parts of HTML. The encoding of the top page file should be US-ASCII or UTF-8.

Tips

Constructing a Large Index

When you construct a full-text search system for a large site, it takes a long time to build the inverted index for the first time. The processing speed begins to slow down when the number of registered documents reaches tens of thousands. If your system has 1GB of RAM or more, you should add the option `-plute' to the command `estindex register'. Thus, you can construct an inverted index more quickly.

Alternatively, you can construct inverted indexes in several parts and merge them. The command `estautoreg' is provided to simplify those steps. Change the current directory to the root directory of the contents and run this command. Then, the inverted index is built, and the CGI script and its configuration files are deployed there. You can specify the number of documents in each element index with the first argument.

cd /home/mikio/public_html
estautoreg

Because `estautoreg' is a simple shell script, you can customize it as much as you like. By default, the number of documents in each element index is 65536, and the number of words for summary is 4096.

To construct an inverted index quickly, the system should have abundant RAM. As a guideline, allow 20MB per ten thousand documents. For example, if the number of target documents is one million, 2GB of RAM is required. Moreover, you should tune the system to reduce the frequency of synchronizing I/O buffers. For example, if you are on Linux 2.4, add the following line to the file `/etc/sysctl.conf'. To learn the function of each parameter, run `update -d'.

vm.bdflush = 80 1000 0 0 0 10000 100 0 0
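The memory guideline given above (20MB per ten thousand documents) is a simple proportion. The following Python sketch works it out; the function name is an assumption of this example, and the result is only the rough measure from the text, not a guarantee.

```python
MB_PER_10K_DOCS = 20  # guideline stated in the text

def estimated_ram_mb(num_docs):
    """Return the suggested amount of RAM in MB for indexing num_docs documents."""
    return num_docs * MB_PER_10K_DOCS / 10000

# One million documents come to 2000MB, that is, about 2GB.
print(estimated_ram_mb(1000000))
```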

If your system has more RAM than the size of an inverted index, you can construct the inverted index quickly by using a RAM disk. On Linux, perform the following steps. These operations must be carried out by the root user.

mkdir -p /mnt/ram
mount -t tmpfs -o size=512m /dev/shm /mnt/ram
cd /home/mikio/public_html
estindex register /mnt/ram/casket
estindex relate /mnt/ram/casket
rm -rf casket
cp -r /mnt/ram/casket .
umount /mnt/ram
rmdir /mnt/ram

Even if you do not use a RAM disk but a hard disk, an inverted index can be constructed more quickly on another disk rather than the same disk of target documents. When you merge some inverted indexes also, processing speed is higher if the merged index is created on another disk rather than the same disk of element indexes.

For Availability

For operational availability, an inverted index should be duplicated. The reason is that an inverted index cannot be searched while it is being updated. Moreover, the index may be broken during an update by an unexpected shutdown of the system or by the disk becoming full. Thus, you should update the inverted index by the following steps. It is useful to automate these operations with `crond' of UNIX.

cd /home/mikio/public_html &&     # change the current directory
test -d casket &&                 # confirm the original exists
test ! -d casket-tmp &&           # confirm the copy does not exist
cp -r casket casket-tmp &&        # create the copy
estindex purge casket-tmp &&      # reflect deleted documents on the copy
estindex register casket-tmp &&   # reflect new or modified documents on the copy
estindex optimize casket-tmp &&   # optimize the copy
estindex relate casket-tmp &&     # reflect relational information on the copy
rm -rf casket &&                  # remove the original
mv casket-tmp casket              # rename the copy as the original

Actually, `purge' and `optimize' do not need to be performed so often. A reasonable schedule is to run `register' every day, `relate' every week, and `purge' and `optimize' every month.

If you use Estraier on a shared rental server or a system provided by a service provider, the heavy load of constructing the inverted index may disturb other users. In that case, you can construct the inverted index on your local system and upload it to the system of the web server. However, this technique is not available between systems with different byte orders.

Utility for SI

The command `estsiutil' is provided as a utility for system integration. This command is helpful for building an application of Estraier and its installer. Moreover, it makes it easy to write a CGI script as a shell script. This command is used in the following format.

estsiutil prefix: Output the prefix of the install directory.
estsiutil bindir: Output the path of the install directory of commands.
estsiutil libexecdir: Output the path of the install directory of CGI scripts.
estsiutil datadir: Output the path of the install directory of configuration files.
estsiutil version: Output the version number.
estsiutil cgiparam name [qstr]: Parse CGI parameters and output the value of a variable. `name' specifies the name of a variable. If it is an empty string, names and values of all variables are output. If the method is GET, assign the value of the environment variable `QUERY_STRING' to `qstr'. If the method is POST, omit `qstr', and the standard input is read.
estsiutil htmlesc str: Output a string, escaping meta characters of HTML. `str' specifies a string.
estsiutil urlenc str: Output a string, encoding it with URL encoding. `str' specifies a string.
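For readers more familiar with a scripting language, the three text-processing subcommands above correspond roughly to functions of the Python standard library. The following sketch shows those analogues only; it does not call `estsiutil', and the exact set of characters that `estsiutil' escapes or encodes may differ from what Python does.

```python
import html
from urllib.parse import parse_qs, quote

def cgiparam(name, qstr):
    """Analogue of `estsiutil cgiparam': look up one variable in a query string.

    With an empty name, all variables are returned as a dict of lists.
    """
    params = parse_qs(qstr, keep_blank_values=True)
    if name == "":
        return params
    values = params.get(name)
    return values[0] if values else ""

def htmlesc(text):
    """Analogue of `estsiutil htmlesc': escape meta characters of HTML."""
    return html.escape(text)

def urlenc(text):
    """Analogue of `estsiutil urlenc': apply URL encoding to a string."""
    return quote(text, safe="")

print(cgiparam("phrase", "phrase=apple+pie&max=8"))
print(htmlesc("<b>apple & orange</b>"))
print(urlenc("apple pie"))
```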

Searching Remote Sites

To construct a full-text search system for remote sites, retrieve contents of remote sites with `wget' and build the inverted index of them. For example, if target sites are `http://estraier.foo.edu' and `http://snatcher.foo.edu', perform the following steps.

cd /home/mikio/public_html
wget -r -N -np -l inf -A ".txt,.TXT,.html,.HTML,.htm,.HTM" \
  "http://estraier.foo.edu/"
wget -r -N -np -l inf -A ".txt,.TXT,.html,.HTML,.htm,.HTM" \
  "http://snatcher.foo.edu/"
estindex register casket
estindex relate casket
cp /usr/local/libexec/estsearch.cgi /usr/local/share/estraier/estsearch.* .

In `estsearch.conf', set the value of `prefix' to `http://', and set the value of `decuri' to `true'. Thus, when a user follows a link in the search results, the user visits the original site, not the local cache.

Searching the Mailbox

If you use an e-mail client that manages the mail box in `mh' format (Sylpheed, Mew, and so on), you can construct a full-text search system for the mail box. For example, if mails are stored in the folders `business' and `friends', perform the following steps.

cd /home/mikio/public_html
ln -s /home/mikio/Mail/inbox .
estindex register -msuf "*" -tattr "author,recipient,multicast" \
  casket ./inbox/business
estindex register -msuf "*" -tattr "author,recipient,multicast" \
  casket ./inbox/friends
estindex relate casket
cp /usr/local/libexec/estsearch.cgi /usr/local/share/estraier/estsearch.* .

If you use a client managing the mail box in its own format (Becky, Outlook Express, and so on), export files in `eml' format, and construct the inverted index of them. Moreover, if you use a client managing the mail box in mbox format (Eudora, Thunderbird, and so on), use the command `estmbtomh' to convert the mail box to files in mh format. Its usage is the following.

estmbtomh [-pre str] [-suf str] [-col num] [file]

`file' specifies a file in mbox format. If it is omitted, the standard input is read.
The option `-pre' specifies the prefix of each output file with `str'.
The option `-suf' specifies the suffix of each output file with `str'.
The option `-col' specifies the number of columns of each output file with `num'.

Because it is risky to expose personal mails via web, you should configure the web server to deny accesses of unknown users, and set the permissions of the `mail' directory appropriately. When sending MIME via web, the value of `Content-Type' should be `message/rfc822'.

Searching Caches of an HTTP Proxy

Estraier lets you search the caches of an HTTP proxy, WWWOFFLE. The command `estwolels' is provided to help with making the inverted index. This command outputs the path and URI of each document in the cache. You can make the inverted index for the cache of WWWOFFLE with the following commands. Usually, these operations must be carried out by the root user.

estwolels | estindex register -list - -tsuf "" -hsuf "" -msuf "*" -iz -mn casket
estindex relate casket
cp /usr/local/libexec/estsearch.cgi /usr/local/share/estraier/estsearch.* .

Moreover, you should edit `estsearch.conf' and set the value of the attribute `prefix' to be an empty string.

Easy Customization of the UI

Even if customizing with the prime configuration file is not enough, your aim may be satisfied with CSS or JavaScript in the template file. For example, all you have to do for narrowing the width of the input form for search phrase is to write the following into the definition of CSS.

#phrase { width: 32ex; }

If you want to delete the input forms of `max', `drep', and `sort', write the following. To delete input forms individually, set the style of `display: none;' to an element whose ID is `maxspan', `drepspan', `sortspan', or `exprspan'.

#advancedform { display: none; }

If you use JavaScript, add the attribute `onload' to the element `body', and the specified function will be called when the page is loaded.

<body onload="startup();">

Moreover, define the function in the element `script' in the element `head'. For example, if you want to narrow the width of the input form, define the following.

<script type="text/javascript">
function startup(){
  document.getElementById('phrase').setAttribute('size', '32');
}
</script>

If you want customization beyond the capability of CSS and JavaScript, make a wrapper script which calls `estsearch.cgi'. Because parameters of CGI are given as environment variables, you can pass them to the original script by calling it inside the wrapper. Then, process the output with `awk', `perl', and so on. For example, the following wrapper script narrows the width of the input form.

#! /bin/sh

./estsearch.cgi | awk '
{
  if(match($0, /^<input.*id="phrase"/)){
    sub(/size="[0-9]*"/, "size=\"6\"", $0)
  }
  printf("%s\n", $0)
}
'

Plug-in to show spelling alternation

`estspellen' is provided to show spelling alternation of the search phrase. To use the plug-in, the command `aspell' must be installed on your system. Moreover, edit `estsearch.tmpl' or `estmerge.tmpl' and add the line `' above the line `'.

Bugs

When a version of Estraier is upgraded, backward compatibility of inverted indexes is not assured. So, you should rebuild the inverted indexes when upgrading.

If you find any bug, please report it to the author, with the information of the version of Estraier, the operating system, and the compiler. If possible, please send me `config.log' which was generated when performing `./configure'.

Copying

Estraier is free software; you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation; either version 2.1 of the License or any later version.

Estraier is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details.

You should have received a copy of the GNU General Public License along with Estraier (See the file `COPYING'); if not, write to the Free Software Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA.

Estraier was written by Mikio Hirabayashi. You can contact the author by e-mail to `mikio@fallabs.com'.
