Estraier is a full-text search system for personal use. Full-text search means searching a large collection of documents for those that contain specified words. The principal purpose of Estraier is to provide full-text search for a web site. It works similarly to Google, but for a personal web site or for sites in an intranet. The features of Estraier are as follows.
Estraier realizes fast full-text search using a database called an inverted index. The inverted index is created by the administrator of a web site, and its operations are performed on the system of the web server. Users access the CGI script installed on that system with a web browser and perform full-text search. The user interface can be customized by editing a simple template file. A simple web server featuring full-text search is also provided.
When a user enters a search phrase into the form on the search page, a list of documents matching the search condition is shown, together with a summary of the text of each document. Each summary is generated by extracting phrases around the search words in the document, and the search words in a summary are highlighted. Documents in the result are sorted in descending order of their scores for the search words. The score of each document is calculated from the number and the fraction of the search words in the document.
Estraier implements relational document search, a function that offers documents related to a document in a search result. Documents in the result are sorted in descending order of relational degree, which is calculated based on the vector space model. Simply put, you can search for documents that have similar tendencies of word occurrence. Document clustering is also supported. It is a feature that categorizes result documents automatically by their similarity.
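As a rough illustration of what "similar tendencies of word occurrence" means in a vector space model, the following Python sketch represents each document as a vector of word counts and ranks documents by cosine similarity. It is a generic textbook formulation, not the algorithm Estraier actually uses; the function names and the whitespace tokenizer are assumptions made for this example.

```python
import math
from collections import Counter


def word_vector(text):
    """Turn a text into a bag-of-words vector (word -> occurrence count)."""
    return Counter(text.lower().split())


def cosine_similarity(vec_a, vec_b):
    """Cosine of the angle between two sparse word vectors (0.0 to 1.0)."""
    dot = sum(count * vec_b[word] for word, count in vec_a.items() if word in vec_b)
    norm_a = math.sqrt(sum(c * c for c in vec_a.values()))
    norm_b = math.sqrt(sum(c * c for c in vec_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank_related(seed, documents):
    """Return (name, similarity) pairs in descending order of similarity to seed."""
    seed_vec = word_vector(seed)
    scored = [(name, cosine_similarity(seed_vec, word_vector(text)))
              for name, text in documents.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


if __name__ == "__main__":
    docs = {"a": "apple orange apple", "b": "grape melon", "c": "apple orange"}
    for name, similarity in rank_related("apple orange", docs):
        print(name, round(similarity, 3))
```

Documents that share no words with the seed get a similarity of 0, and documents with the same proportions of words get a similarity close to 1, regardless of their length.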
Because Estraier represents characters as Unicode (UCS-2), it can handle not only European languages such as English but also Asian languages such as Japanese. The current version can analyze European languages and Japanese with practical accuracy.
Estraier provides functions to extract text from files on the local file system. Supported formats are plain text, HTML, and MIME (e-mail and MHTML). Moreover, files of various other formats can be handled by calling an arbitrary external command. For example, MS-Word files can be handled with `wvWare', and PDF files can be handled with `pdftotext'.
Estraier can construct an inverted index of more than a hundred thousand documents. The software itself imposes no upper limit on the number of registered documents. However, the actual limit is determined by the capability of the hardware, because of the time needed to construct an inverted index. Meanwhile, search time is almost constant regardless of the scale of the inverted index. If the number of registered documents is about a hundred thousand, the search result will be shown in one second or less.
Installation of Estraier is very easy. In most cases, installation is completed in twenty minutes. To construct an inverted index, you only have to execute one or two commands. The inverted index will be built within minutes to hours. Then, you can enjoy full-text search by accessing the installed CGI script.
Estraier is available on Linux, Solaris, HP-UX, FreeBSD, NetBSD, OpenBSD, Mac OS X, and Windows (Cygwin). Other UNIX systems are also supported. Estraier is free software licensed under the GNU General Public License.
To install Estraier from a source package, GCC 2.8 or later and `make' are required. On Linux and the BSDs, they are usually installed already.
As Estraier uses GNU libiconv, you have to install it beforehand. Even if your system has libiconv by default, you should use the newest version of GNU libiconv. Moreover, note that versions earlier than 1.9.1 have a memory leak problem. You can get GNU libiconv at the following site.
As Estraier uses zlib, you have to install it beforehand. On many systems, zlib is already installed by default. You can get zlib at the following site.
If you build Estraier on Windows, the Cygwin environment is needed. If you are not familiar with Cygwin, use the binary package of Estraier instead. You can get Cygwin at the following site.
After extracting the archive file of Estraier, change the current working directory to the generated directory and perform the installation.
Run the configuration script.
./configure
Build the programs. On Windows, run `make win' instead.
make
Run the self-diagnostic test.
make check
Install the programs. This operation must be carried out by the root user. On Windows, run `make install-win' instead.
make install
When these steps finish, the following files are installed.
/usr/local/bin/estindex
/usr/local/bin/estserver
/usr/local/bin/estxview
/usr/local/bin/estsiutil
/usr/local/bin/estmbtomh
/usr/local/bin/estpdfhtml
/usr/local/bin/estdochtml
/usr/local/bin/estxlshtml
/usr/local/bin/estppthtml
/usr/local/bin/estmanhtml
/usr/local/bin/estgzhtml
/usr/local/bin/estxdwhtml
/usr/local/bin/estxdthtml
/usr/local/bin/estfind
/usr/local/bin/estautoreg
/usr/local/bin/estwolels
/usr/local/libexec/estsearch.cgi
/usr/local/libexec/estmerge.cgi
/usr/local/libexec/estspellen
/usr/local/share/estraier/estxview.dtd
/usr/local/share/estraier/estxview.css
/usr/local/share/estraier/estxview.xsl
/usr/local/share/estraier/estsearch.conf
/usr/local/share/estraier/estsearch.tmpl
/usr/local/share/estraier/estsearch.top
/usr/local/share/estraier/estmerge.conf
/usr/local/share/estraier/estmerge.tmpl
/usr/local/share/estraier/estmerge.top
/usr/local/share/estraier/locale/...
/usr/local/share/estraier/skins/...
That is all you need to know at first. You may skip the following configuration options and jump to the section Managing Inverted Index.
This document does not cover the configuration of web servers. Consult the manual of your web server and configure it to enable CGI scripts. Any web server supporting CGI is fine, for example Apache, Microsoft IIS, or AnHTTPd. Alternatively, you can use the search server provided with Estraier, a program that integrates a web server and a full-text search system.
By default, a search word must match a whole word exactly. However, wild cards and regular expressions can be used if Estraier is built with regular expression support. To enable it, configure the build environment as follows.
./configure --enable-regex
Regular expression functions are included in the GNU standard library (glibc). If you use another library that lacks them, use GNU regex.
When Estraier calls an external command for text filtering, it incurs the overhead of calling the filter command via the system shell. You can avoid this overhead by using a filter function implemented in a dynamically linked library. To enable this feature, configure the build environment as follows.
./configure --enable-dlfilter
Estraier uses the `dlopen' function to load dynamically linked libraries. It is implemented at least on Linux, FreeBSD, Solaris, and HP-UX.
By default, words in European text are divided by space characters and punctuation marks. If you prefer dividing by space characters only, configure the build environment as follows.
./configure --enable-strict
By default, overly generic words such as `a', `the', and `to' are excluded as stop words. If you do not want this behavior, configure the build environment as follows.
./configure --disable-stopword
Depending on your environment, you may have to specify the following options.
To enable full-text search, you must construct an inverted index beforehand. For example, if your web contents are under `/home/mikio/public_html' and CGI scripts are available there, perform the following steps.
cd /home/mikio/public_html
estindex register casket
estindex relate casket
Then, all plain text, HTML, and MIME files are registered in an inverted index named `casket'.
When your site is updated, perform the following steps.
cd /home/mikio/public_html
estindex purge casket
estindex register casket
estindex optimize casket
estindex relate casket
Then, deleted files as well as new and modified files are reflected in the inverted index.
That is all you need to know at first. You may skip the following usage and jump to the section User Interface for Search.
The usage of the command `estindex', which manages an inverted index, is as follows. This command is composed of sub commands for different purposes. The name of a sub command is specified by the second argument. If `*' is specified as a suffix rule, any file matches it. The name of an encoding should be the formal name registered with IANA. When an external command is called as a filter, the first argument specifies the name of the input file, the second argument specifies the name of the output file, and the environment variable `ESTORIG' specifies the name of the original file. If the name of the external command begins with `@', the dynamically linked library named by the rest of the string is loaded, and its function named `estfilter' is called.
The sub command `register' is used in order to construct or update an inverted index.
The sub command `relate' is used in order to add score information for relational document search.
The sub command `purge' is used in order to reflect deleted files to an inverted index.
The sub command `optimize' is used in order to delete useless information that arises from updating an inverted index.
The sub command `inform' is used in order to get information about an inverted index.
The sub command `merge' is used in order to merge multiple inverted indexes.
The sub command `pree' is used in order to test text extraction and word breaking.
The sub command `version' is used in order to show the version information of Estraier.
Each sub command returns 0 if it finishes successfully, or 1 if any error has occurred. If the environment variable `ESTDBGFD' is set, debug information is output to the specified file descriptor. To abort a running command, send it one of the signals SIGINT (Control-C), SIGQUIT (Control-\), or SIGTERM. The inverted index is then closed normally and the command finishes. Any other means of forced termination may destroy the inverted index.
When parsing plain text, the following steps are performed.
When parsing HTML, the following steps are performed.
When parsing MIME, the following steps are performed.
If the `title' attribute of a registered document is not extracted, and if the encoding of its file name is US-ASCII, the file name is treated as the `title' attribute. Moreover, if the attribute `realuri' is specified, search programs show the value instead of the URI of the registered document.
The source package of Estraier provides the command `estpdfhtml' to handle PDF files, `estdochtml' to handle MS-Word files, `estxlshtml' to handle MS-Excel files, `estppthtml' to handle MS-PowerPoint files, `estmanhtml' to handle UNIX man pages, `estgzhtml' to handle gzipped or zipped files of plain text or HTML, `estxdwhtml' to handle DocuWorks files, and `estxdthtml' to handle files in various formats on Windows. They are called as filter programs when constructing an inverted index. Each of them reads data from the file specified by the first argument, converts the data to HTML, and writes the HTML to the file specified by the second argument.
To use `estpdfhtml', install `pdftotext' beforehand. It works on UNIX. Typically, you run the following command to register PDF files in an inverted index.
estindex register -tsuf "" -hsuf "" -msuf "" \
  -xsuf .pdf application/pdf estpdfhtml casket
To use `estdochtml', install `wvWare' beforehand. It works on UNIX. Typically, you run the following command to register MS-Word files in an inverted index.
estindex register -tsuf "" -hsuf "" -msuf "" \
  -xsuf .doc application/msword estdochtml casket
To use `estxlshtml', install `xlhtml' beforehand. It works on UNIX. Typically, you run the following command to register MS-Excel files in an inverted index.
estindex register -tsuf "" -hsuf "" -msuf "" \
  -xsuf .xls application/vnd.ms-excel estxlshtml casket
To use `estppthtml', install `ppthtml' beforehand. It works on UNIX. Typically, you run the following command to register MS-PowerPoint files in an inverted index.
estindex register -tsuf "" -hsuf "" -msuf "" \
  -xsuf .ppt application/vnd.ms-powerpoint estppthtml casket
To use `estmanhtml', install `groff' beforehand. It works on UNIX. Typically, you run the following command to register `man' files in an inverted index.
estindex register -tsuf "" -hsuf "" -msuf "" \
  -xsuf .1,.2,.3,.4,.5,.6,.7,.8 application/x-troff-man estmanhtml casket
To use `estgzhtml', install `gzip' beforehand. It works on UNIX. Typically, you run the following command to register gzipped or zipped files in an inverted index.
estindex register -tsuf "" -hsuf "" -msuf "" \
  -xsuf .txt.gz,.asc.gz,.txt.zip,.asc.zip,.html.gz,.htm.gz,.html.zip,.htm.zip \
  text/html estgzhtml casket
To use `estxdwhtml', install `xdw2text' beforehand. It works on UNIX. Typically, you run the following command to register DocuWorks files in an inverted index.
estindex register -tsuf "" -hsuf "" -msuf "" \
  -xsuf .xdw application/vnd.fujixerox.docuworks estxdwhtml casket
To use `estxdthtml', install `xdoc2html' beforehand. It works on Windows and can process PDF, RTF, MS-Word, MS-Excel, MS-PowerPoint, and other files. Typically, you run the following command.
estindex register -tsuf "" -hsuf "" -msuf "" \
  -xsuf .pdf,.rtf,.doc,.xls,.ppt application/octet-stream estxdthtml casket
The utility command `estfind' makes it possible to combine several filters. For example, the following registers PDF, MS-Word, and MS-Excel files at the same time.
./estfind -pdf -doc -xls . | ./estindex register \
  -xtype application/pdf estpdfhtml \
  -xtype application/msword estdochtml \
  -xtype application/vnd.ms-excel estxlshtml \
  -list - casket
`estfind' is a utility that locates files under a directory and outputs TSV of paths and media types. Its usage is as follows.
You can get the programs on which the filters depend at the following sites. Refer to the license of each product for its conditions of use.
Besides the filters above, you can use your own filter programs, implemented in any language. C, Perl, and shell scripts are all fine.
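As a minimal sketch of a custom filter, the following Python script follows the calling convention described for filter programs: the first argument names the input file, the second names the output file, and the environment variable `ESTORIG' names the original file. The script itself is hypothetical. It assumes the input is UTF-8 plain text and simply wraps it in HTML, using the original file name as the title.

```python
#!/usr/bin/env python3
"""Hypothetical Estraier filter: wraps a plain-text file in minimal HTML."""
import html
import os
import sys


def convert(input_path, output_path, original_name):
    """Read text from input_path and write an HTML rendition to output_path."""
    # Assumption: the input is UTF-8; undecodable bytes are replaced.
    with open(input_path, encoding="utf-8", errors="replace") as src:
        body = src.read()
    title = html.escape(os.path.basename(original_name))
    with open(output_path, "w", encoding="utf-8") as dst:
        dst.write("<html><head><title>%s</title></head>\n" % title)
        dst.write("<body><pre>%s</pre></body></html>\n" % html.escape(body))


def main(argv, environ):
    """Return 0 on success or 1 when the arguments are wrong."""
    if len(argv) != 3:
        sys.stderr.write("usage: %s input output\n" % argv[0])
        return 1
    # ESTORIG names the original file; fall back to the input path if unset.
    original = environ.get("ESTORIG", argv[1])
    convert(argv[1], argv[2], original)
    return 0


if __name__ == "__main__":
    sys.exit(main(sys.argv, os.environ))
```

Such a script would be named in the same position where the commands above name `estpdfhtml' or `estdochtml'.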
To make full-text search available, you must deploy a CGI script and its configuration files. For example, if the inverted index is placed in `/home/mikio/public_html' and CGI scripts are available there, perform the following steps.
cd /home/mikio/public_html
cp /usr/local/libexec/estsearch.cgi .
cp /usr/local/share/estraier/estsearch.conf .
cp /usr/local/share/estraier/estsearch.tmpl .
cp /usr/local/share/estraier/estsearch.top .
`estsearch.cgi' is the CGI script that users access. Each configuration file can be edited with a text editor, but you do not have to modify them at first. For now, access `estsearch.cgi' via the Web.
When a user enters search words into the input form and pushes the `Search' button, the search result is shown. If one word is specified, documents containing the word are shown. If two or more words are specified, documents containing all of them are shown. Words are delimited by a half-width or full-width space.
The second and later search words can be preceded by operators. Each operator and word must be separated by a space. All operators have equal precedence and are evaluated from left to right.
[and] word | remove documents not containing the specified word from the result. It is default behavior without any operator. |
[or] word | add documents containing the specified word to the result. |
[not] word | remove documents containing the specified word from the result. |
For example, the following condition searches for documents containing both `apple' and `orange'.
apple orange
The following condition has the same meaning as the above, with an explicit operator.
apple [and] orange
The following condition searches for documents containing either `apple' or `orange'.
apple [or] orange
The following condition searches for documents containing `apple' but not `orange'.
apple [not] orange
The following condition searches for documents containing both `apple' and `orange' but not `grape'.
apple [and] orange [not] grape
The following condition searches for documents containing either `apple' or `orange', and containing `grape', but without `melon'.
apple [or] orange [and] grape [not] melon
The following condition searches for documents containing `apple' but neither `orange' nor `grape'.
apple [not] orange [not] grape
Because operators are simply evaluated from the left, it is impossible to search for documents containing `apple' but not containing `orange' and `grape' together.
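The left-to-right evaluation described above can be sketched with set operations. The following Python function is only an illustration of the semantics, not the implementation used by `estsearch.cgi'; the function name and the toy index (a mapping from each word to the set of documents containing it) are assumptions made for this example.

```python
def search(condition, index):
    """Evaluate a condition such as 'apple [or] orange [not] grape'.

    index maps each word to the set of documents containing it.
    Operators have equal precedence and are applied from left to right.
    """
    tokens = condition.split()
    if not tokens:
        return set()
    result = set(index.get(tokens[0], set()))
    operator = "[and]"
    for token in tokens[1:]:
        if token in ("[and]", "[or]", "[not]"):
            operator = token
            continue
        docs = index.get(token, set())
        if operator == "[and]":
            result &= docs   # keep only documents that also contain the word
        elif operator == "[or]":
            result |= docs   # add documents containing the word
        else:
            result -= docs   # remove documents containing the word
        operator = "[and]"   # a word without an operator defaults to [and]
    return result


if __name__ == "__main__":
    toy_index = {
        "apple": {"d1", "d2", "d3"},
        "orange": {"d2", "d4"},
        "grape": {"d2", "d3", "d4", "d5"},
        "melon": {"d3"},
    }
    print(sorted(search("apple [or] orange [and] grape [not] melon", toy_index)))
```

With this toy index, `apple [not] orange [not] grape' yields only `d1', whereas removing just the documents that contain both `orange' and `grape' would leave `d1' and `d3'. The second result cannot be produced by any left-to-right chain of these operators.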
Some compound words are split into several words even though they are not separated by spaces. In that case, the second and later words are treated as having the same operator as the preceding word, or `[and]' if there is none. Thanks to this, you can search with a condition written as natural text.
By setting `n per page', you can change the number of documents shown per page. By setting `n clusters', you can see the result documents categorized by their similarity. By setting `n level rep', you can pick up only the top document among those in the same directory of the URI. By setting `sort by order', you can change the sort order of the documents shown.
If Estraier was built with regular expression support enabled, you can select one of several expression formats. If `as-is expression' is selected, each word expresses the specified pattern itself. If `with wild cards' is selected, `*' can be placed as a wild card in each expression. The wild card matches any string. If `regular expressions' is selected, each word is treated as a regular expression (POSIX Extended Regular Expressions). The elements of regular expressions are as follows.
P | matching `P' itself. |
PQ | matching a sequence of `P' and `Q'. |
. | matching any one character. |
P* | matching a sequence of 0 or more matches of `P'. |
P+ | matching a sequence of 1 or more matches of `P'. |
P? | matching a sequence of 0 or 1 matches of `P'. |
P|Q | matching either `P' or `Q'. |
^P | matching `P' at the beginning. |
P$ | matching `P' at the end. |
(P) | giving higher priority to evaluation of `P'. |
For example, to express words which begin with `work' (for example, `work', `worked', `worker', and `works'), select wild cards and input `work*', or select regular expressions and input `^work.*'.
You can combine wild cards or regular expressions with operators. For example, to search for documents including `apple' or `orange', and including `grape' or `melon', but including neither `kiwi' nor `banana', select regular expressions and input the following.
^(apple|orange)$ [and] ^(grape|melon)$ [not] ^(kiwi|banana)$
The maximum number of candidate words expanded from wild cards or regular expressions is 1024.
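To see how an expression expands into candidate words, the following Python sketch matches patterns against a small word list. It uses Python's `re' module, which is not a POSIX implementation but behaves the same way for the elements listed above. The helper names are hypothetical, and the assumption that a wild card expression must match the whole word is inferred from the `work*' example rather than documented.

```python
import re

WORDS = ["work", "worked", "worker", "works", "homework", "word"]


def expand(pattern, words, limit=1024):
    """Return the words matching a regular expression, at most `limit` of them."""
    regex = re.compile(pattern)
    return [word for word in words if regex.search(word)][:limit]


def wildcard_to_regex(expression):
    """Translate a '*' wild card expression into an anchored regular expression.

    Assumption: a wild card expression has to match the whole word.
    """
    parts = expression.split("*")
    return "^" + ".*".join(re.escape(part) for part in parts) + "$"


if __name__ == "__main__":
    print(expand("^work.*", WORDS))
    print(expand(wildcard_to_regex("work*"), WORDS))
    print(expand("work", WORDS))
```

An unanchored pattern such as `work' also matches `homework', which is why the combined example above anchors each alternative with `^' and `$'.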
The attributes and a summary of each document matching the search condition are shown as the search result. A list of keywords is also shown with each summary. When you select a keyword, the word is added to the current search condition and the search is performed again. This feature is useful for narrowing down a result. As search words in each summary are highlighted, it is easy to see how the words are used.
When you select `(detail)' in a summary, a list of the words registered for that document is shown. As search words are highlighted there as in the summary, it is easy to see where the words occur.
When you select `(related)' in a summary, documents related to the selected document are searched for. Relational document search is a feature to retrieve documents which have similar tendencies of word occurrence. Even when you cannot come up with appropriate words, you may reach your target documents by following related documents.
If more documents match than the specified number, they are shown over several pages. You can move backward or forward through the pages by selecting `PREV' or `NEXT'.
If a great many documents match, the result is shown without evaluating all of them, and the link `(or more)' is shown at the top of the page. You can get the complete result by selecting it.
`estsearch.cgi' is the CGI script for search. It is installed as `/usr/local/libexec/estsearch.cgi'. Copy it into a directory that is published via the Web. Usually, the inverted index and the configuration files are placed in the same directory.
The configuration files are `estsearch.conf', `estsearch.tmpl', and `estsearch.top'. Their default versions are installed in `/usr/local/share/estraier'. Copy and customize them.
The configuration files should be placed in the current directory of the `estsearch.cgi' process. In most cases, the current directory of a CGI script is the directory where the script is placed. However, this depends on the implementation of the web server. With Microsoft IIS, the current directory of a CGI script is the root of a `virtual directory'.
`estsearch.cgi' reads its configuration from the file named `estsearch.conf' in the current directory. Each line of the configuration file begins with the name of an attribute followed by `:'. The value of each attribute is placed after the `:'. The encoding of the prime configuration file should be US-ASCII or UTF-8. The principal attributes are as follows.
indexname: casket
tmplfile: estsearch.tmpl
topfile: estsearch.top
prefix: ./
suffix:
replace:
diridx: index.html
decuri: false
boolunit: 1024
relkeys: 16
defmax: 8
reevmax: 1024
showkeys: 8
sumall: 96
sumtop: 24
sumwidth: 16
clustunit: 128
clustkeys: 8
logfile:
`indexname' specifies the name of the inverted index. If you want to use another name or place it in another directory, change the value. `tmplfile' specifies the template file described later. `topfile' specifies the top page file described later. If you want to use other names, change their values.
`prefix' specifies the prefix string of the URI of each document in the results. For example, if it is `http://foo.bar/baz/', `./apple.html' is shown as `http://foo.bar/baz/apple.html'. `suffix' specifies the suffix string of the URI of each document in the results. For example, if it is `.html', `./data/751' is shown as `./data/751.html'. `replace' specifies an expression to replace part of the URI of each document. The string before replacement and the string after replacement are delimited with space characters. For example, if it is `/foo/ /bar/', `./foo/apple.html' is replaced with `./bar/apple.html'. This attribute can be specified multiple times. `diridx' specifies the index file of each directory. For example, if it is `index.html', `./foo/index.html' is shown as `./foo/'. If `decuri' is `true', the URI of each document in the results is treated as URL-encoded and decoded. This feature is useful with files saved by the GNU `wget' command.
`boolunit' and `relkeys' are parameters for the accuracy of search. Increasing them improves accuracy but lowers processing speed, and decreasing them does the opposite. Usually, you do not have to change them. `defmax' specifies the default number of documents shown in the results. `reevmax' specifies the maximum number of words expanded from a regular expression. `sumall' specifies the number of words in the summary of a document. `sumtop' specifies the number of words picked from the top of a document. `sumwidth' specifies the number of words picked around each search word. `clustunit' specifies the number of documents targeted by document clustering. `clustkeys' specifies the number of keywords shown for each cluster.
`logfile' specifies the log file to which input search conditions are written. Besides these, attributes for label strings are defined in the configuration file.
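The URI rewriting attributes can be illustrated with the examples given above. The following Python functions are a sketch that reproduces those examples only. The function names are hypothetical, and details that the text does not state (whether `replace' is a literal string, what happens to a URI that does not begin with `./', and the order in which the attributes are applied) are assumptions marked in the comments.

```python
def apply_prefix(uri, prefix):
    """Replace the leading './' of a stored URI with the configured prefix."""
    if uri.startswith("./"):
        return prefix + uri[2:]
    return uri  # assumption: other URIs are left unchanged


def apply_suffix(uri, suffix):
    """Append the configured suffix to the URI."""
    return uri + suffix


def apply_replace(uri, rule):
    """Apply a rule of the form 'before after', delimited by space characters."""
    before, after = rule.split()
    return uri.replace(before, after)  # assumption: literal string replacement


def apply_diridx(uri, diridx):
    """Hide the directory index file name at the end of a URI."""
    if diridx and uri.endswith("/" + diridx):
        return uri[: -len(diridx)]
    return uri


if __name__ == "__main__":
    print(apply_prefix("./apple.html", "http://foo.bar/baz/"))
    print(apply_suffix("./data/751", ".html"))
    print(apply_replace("./foo/apple.html", "/foo/ /bar/"))
    print(apply_diridx("./foo/index.html", "index.html"))
```

Each function is kept separate because the order in which `estsearch.cgi' applies the attributes is not stated here.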
The template file specifies the template of the user interface for search. It is a so-called skin. Although this file is usually named `estsearch.tmpl', you can change the name by editing the prime configuration file. You can customize the user interface freely by editing the template file. The encoding of the template file should be US-ASCII or UTF-8. Most of its content is passed through to the user interface as is, but the following expressions are replaced.
By using `<!--ESTEXEC:command-->', you can insert HTML dynamically into the page. It is a so-called plug-in. The called command can get the value of the `phrase' parameter from the environment variable `ESTPHRASE'. The command can get the other parameters by analyzing the value of the environment variable `QUERY_STRING'.
The top page file specifies the messages shown when a user has not entered any search condition. In other words, the messages are shown when a user first visits the search page. Although this file is usually named `estsearch.top', you can change the name by editing the prime configuration file. As the data of the top page file are inserted in place of `<!--ESTRESULT-->', they should be fragments of HTML. The encoding of the top page file should be US-ASCII or UTF-8.
You do not have to leave the logo and version information of Estraier intact. You can apply your favorite design as long as the page remains valid HTML (XHTML). It is suggested that you provide a help text or tutorial for this search system in your language. Sample files for localization are installed under `/usr/local/share/estraier/locale'. Sample template files of different flavors are installed under `/usr/local/share/estraier/skins'.
If `logfile' is specified in the prime configuration file, search conditions input by users are output into the log file. Each condition is separated with a line feed, and the following terms separated with tabs are recorded.
If you want to place the search form in another page, write HTML such as the following. The parameter `phrase' specifies the search phrase. The parameter `enc' specifies the character encoding of the page that contains the search form.
<form method="get" action="estsearch.cgi">
<div>
<input type="text" name="phrase" value="" size="64" tabindex="1" />
<input type="submit" value="Search" tabindex="2" />
<input type="hidden" name="enc" value="UTF-8" />
</div>
</form>
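A script can build the same request that this form submits. The following Python sketch URL-encodes the `phrase' and `enc' parameters and appends them to the address of the search script. The function name and the example address are assumptions, and only the two parameters named above are used.

```python
from urllib.parse import urlencode


def search_url(base, phrase, enc="UTF-8"):
    """Build a GET URL carrying the search phrase and its character encoding."""
    # The phrase is encoded in the character encoding announced by `enc`.
    query = urlencode({"phrase": phrase, "enc": enc}, encoding=enc)
    return base + "?" + query


if __name__ == "__main__":
    print(search_url("http://estraier.foo.edu:8888/estsearch", "apple [or] orange"))
```

Spaces in the phrase are encoded as `+' and the brackets of operators as `%5B' and `%5D', which is what a browser sends for a form using the GET method.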
Because `estsearch.cgi' is implemented as a CGI script, every access incurs the overhead of connecting to the database and the morphological analyzer. To deal with this problem, Estraier provides `estserver', a web server featuring full-text search. The usage of this command is as follows. It publishes the contents under the current directory and, when the URL `/estsearch' is requested, provides the same full-text search interface as `estsearch.cgi'. If no argument is specified, `casket' is read as the inverted index, `estsearch.conf' is read as the configuration file, `estsearch.tmpl' is read as the template file, and `estsearch.top' is read as the top page file.
For example, to run the server on port 8888 of the host `estraier.foo.edu' and publish the contents under `/home/mikio/public_html', install the configuration files and the inverted index there and perform the following command. After that, access `http://estraier.foo.edu:8888/estsearch'.
cd /home/mikio/public_html
estserver -host estraier.foo.edu -port 8888 \
  casket estsearch.conf estsearch.tmpl estsearch.top
Supported methods are GET, POST, and HEAD. CGI is not supported. The settings of `indexname', `tmplfile', and `topfile' in the prime configuration file are ignored. Log messages are written to the standard output.
The processes of this command consist of one parent and multiple children. Requests from clients are handled by the children. Each child exits after it has handled 128 requests, and the parent spawns a replacement as soon as a child exits. This mechanism provides safe, parallel processing. To stop the service, send the SIGINT signal (Control-C) to the parent. If 0 is specified as the number of children, the parent itself handles requests.
`estsearch.cgi' and `estserver' read the following parameters. The value of each parameter should be URL-encoded as with usual CGI.
To search documents from shells or scripting languages, the command `estxview' is provided. This command outputs results in XML. By processing them, you can build your own user interfaces. The usage of this command is as follows. By default, the search condition is treated as a search phrase compatible with `estsearch.cgi'. The format used with the option `-relsc' is TSV in which each line has a word and its score.
This command returns 0 if searching finishes successfully, or 1 if any error has occurred. If the environment variable `ESTDBGFD' is set, debug information is output to the specified file descriptor. The syntax and semantics of the output are explained in `estxview.dtd'.
Scores shown when searching are calculated by the following algorithm.
Similarities shown in relational document search are calculated by the following algorithm.
The number of documents that can be stored in an inverted index depends on the size of the documents, but in practice it is limited to about one million. If you want to build a larger search system, you should deploy inverted indexes on multiple computers. When searching, an agent sends the query to every computer at once and merges the results. `estmerge.cgi' is such an agent.
`estmerge.cgi' can merge the results not only of `estsearch.cgi' and `estserver' but also of other instances of `estmerge.cgi', recursively. Multilevel meta search makes a very large search system possible. For example, if the number of targets at the first level is 10 and each of them has 10 targets at the second level, 100 sites can be searched.
`estmerge.cgi' is the CGI script for meta search. It is installed as `/usr/local/libexec/estmerge.cgi'. Copy it into a directory that is published via the Web. It does not matter whether the search targets are on the same host.
Configuration files are `estmerge.conf', `estmerge.tmpl', and `estmerge.top'. Their default descriptions are installed in `/usr/local/share/estraier'. Copy and customize them.
Configuration files should be placed in the current directory of a process of `estmerge.cgi'. In most cases, the current directory of a CGI script is the same directory where the script is placed.
`estmerge.cgi' reads configurations of the file named as `estmerge.conf' in the current directory. Each line of the configuration file begins with the name of an attribute tailed with `:'. The value of each attribute is placed after `:'. The encoding of the prime configuration file should be US-ASCII or UTF-8. Principal attributes are the following.
target: Foo@http://www.foofoo.go.jp/foo/estsearch.cgi
target: Bar@http://www.barbar.ad.jp/bar/estsearch.cgi
target: Baz@http://www.bazbaz.ac.jp/baz/estsearch.cgi
proxy: proxy.mydomain.gov:3128
tmplfile: estmerge.tmpl
topfile: estmerge.top
hidetl: false
defmax: 8
logfile:
`target' specifies a search target. Its label and its URL are divided with `@'. This attribute can be specified multiple times. `proxy' specifies a proxy of HTTP. Its host name and its port number are divided with `:'. If the value is empty, no proxy is used. `tmplfile' specifies the after-mentioned template file. `topfile' specifies the after-mentioned top page file. If `hidetl' is `true', labels of the targets are hidden. `defmax' specifies the default number of shown documents in the results. `logfile' specifies the log file into which input search conditions are output. Besides, attributes for label strings are defined in the configuration file.
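The configuration format described above is simple enough to read with a few lines of code. The following Python sketch parses `name: value' lines, splitting each `target' at `@' into a label and a URL and `proxy' at `:' into a host name and a port number. It is an illustration of the file format only, not the parser used by `estmerge.cgi'; the function name and the returned structure are assumptions.

```python
def parse_merge_conf(text):
    """Parse 'name: value' lines of a configuration in the estmerge.conf style.

    'target' may appear multiple times; other attributes keep their last value.
    """
    conf = {"target": [], "proxy": None}
    for line in text.splitlines():
        name, separator, value = line.partition(":")
        if not separator:
            continue  # skip lines without an attribute name
        name, value = name.strip(), value.strip()
        if name == "target":
            label, _, url = value.partition("@")
            conf["target"].append((label, url))
        elif name == "proxy":
            host, _, port = value.partition(":")
            # An empty value means that no proxy is used.
            conf["proxy"] = (host, int(port)) if host and port else None
        else:
            conf[name] = value
    return conf


if __name__ == "__main__":
    sample = (
        "target: Foo@http://www.foofoo.go.jp/foo/estsearch.cgi\n"
        "proxy: proxy.mydomain.gov:3128\n"
        "defmax: 8\n"
        "logfile:\n"
    )
    print(parse_merge_conf(sample))
```

Only the first `:' on a line separates the name from the value, so the colons inside a target URL are preserved.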
The template file is to specify the template of the user interface for search. Although the name of this file is `estmerge.tmpl' usually, you can change it by editing the prime configuration file. You can customize the user interface freely by editing the template file. The encoding of the template file should be US-ASCII or UTF-8. As most contents are reflected directly on the user interface, the following expressions are replaced.
The top page file specifies the messages shown when a user has not input any search condition. In other words, the messages are shown when a user first visits the search page. Although this file is usually named `estmerge.top', you can change the name by editing the prime configuration file. Because the data of the top page file are inserted in the place of `<!--ESTRESULT-->', they should be fragments of HTML. The encoding of the top page file should be US-ASCII or UTF-8.
When you construct a full-text search system for a large site, it takes a long time to build the inverted index for the first time. Processing begins to slow down once the number of registered documents reaches the tens of thousands. If your system has 1GB of RAM or more, you should add the option `-plute' to the command `estindex register'. Thus, you can construct an inverted index more quickly.
Alternatively, you can construct the inverted index in several parts and merge them. The command `estautoreg' is provided to simplify those steps. Change the current directory to the root directory of the contents and perform this command. Then the inverted index is built, and the CGI script and its configuration files are deployed there. You can specify the number of documents in each element index with the first argument.
    cd /home/mikio/public_html
    estautoreg
Because `estautoreg' is a simple shell script, you can customize it as much as you like. By default, the number of documents in each element index is 65536, and the number of words for summary is 4096.
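To get a feeling for how many element indexes a site will be split into, the default of 65536 documents per element index can be turned into a small calculation. This is a sketch only: the function name `element_index_count' is hypothetical, and it assumes that documents are simply divided into groups of the given size, with the last group possibly smaller.

```sh
# Number of element indexes needed for a given number of documents.
# $1 = number of documents, $2 = documents per element index (default 65536)
element_index_count() {
  docs=$1
  per_index=${2:-65536}
  # integer division rounded up
  echo $(( (docs + per_index - 1) / per_index ))
}

element_index_count 200000    # prints 4
```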
To construct an inverted index quickly, the system should have abundant RAM. As a rule of thumb, allow 20MB per ten thousand documents. For example, if the number of target documents is one million, 2GB of RAM is required. Moreover, you should tune the system to reduce the frequency of synchronizing I/O buffers. For example, if you are on Linux 2.4, add the following line to the file `/etc/sysctl.conf'. To learn the function of each parameter, perform `update -d'.
vm.bdflush = 80 1000 0 0 0 10000 100 0 0
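As for the memory estimate, the rule of thumb of 20MB per ten thousand documents can be written as a one-line calculation. The function name `estimate_ram_mb' is hypothetical, and rounding partial blocks of ten thousand documents upward is an assumption made here to stay on the safe side.

```sh
# Rough RAM estimate in MB: 20MB per 10,000 documents, rounded up
# to the next full block of 10,000 documents.
estimate_ram_mb() {
  docs=$1
  echo $(( (docs + 9999) / 10000 * 20 ))
}

estimate_ram_mb 1000000    # one million documents: prints 2000 (about 2GB)
```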
If your system has more RAM than the size of the inverted index, you can construct the inverted index quickly using a RAM disk. On Linux, perform the following steps. These operations must be carried out by the root user.
    mkdir -p /mnt/ram
    mount -t tmpfs -o size=512m /dev/shm /mnt/ram
    cd /home/mikio/public_html
    estindex register /mnt/ram/casket
    estindex relate /mnt/ram/casket
    rm -rf casket
    cp -r /mnt/ram/casket .
    umount /mnt/ram
    rmdir /mnt/ram
Even if you use a hard disk instead of a RAM disk, an inverted index can be constructed more quickly on a different disk from the one that holds the target documents. Likewise, when you merge inverted indexes, processing is faster if the merged index is created on a different disk from the one that holds the element indexes.
For operational availability, the inverted index should be duplicated: keep the original in service and update a copy. The reason is that an inverted index cannot be searched while it is being updated. Moreover, the index may be broken during an update by an unexpected shutdown of the system or by a full disk. Thus, you should update the inverted index by the following steps. It is useful to automate these operations with `crond' of UNIX.
    cd /home/mikio/public_html &&        # change the current directory
    test -d casket &&                    # confirm the original exists
    test ! -d casket-tmp &&              # confirm the copy does not exist
    cp -r casket casket-tmp &&           # create the copy
    estindex purge casket-tmp &&         # reflect deleted documents on the copy
    estindex register casket-tmp &&      # reflect new or modified documents on the copy
    estindex optimize casket-tmp &&      # optimize the copy
    estindex relate casket-tmp &&        # reflect relational information on the copy
    rm -rf casket &&                     # remove the original
    mv casket-tmp casket                 # rename the copy as the original
Actually, `purge' and `optimize' need not be performed so often. A reasonable schedule is to perform `register' every day, `relate' every week, and `purge' and `optimize' every month.
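One way to apply such a schedule from a single daily job is to decide the subcommands from the date. The following sketch prints the `estindex' subcommands for a given day, in the same order as the steps above. The function name `steps_for' is hypothetical, and running the monthly steps on the first day of the month and the weekly step on Sunday is merely an assumption for illustration.

```sh
# Print the estindex subcommands to perform on a given day.
# $1 = day of month (as printed by `date +%d')
# $2 = day of week, 1 = Monday ... 7 = Sunday (as printed by `date +%u')
steps_for() {
  dom=$1
  dow=$2
  steps=""
  if [ "$dom" -eq 1 ]; then steps="purge"; fi              # monthly
  steps="$steps register"                                  # daily
  if [ "$dom" -eq 1 ]; then steps="$steps optimize"; fi    # monthly
  if [ "$dow" -eq 7 ]; then steps="$steps relate"; fi      # weekly
  echo $steps    # unquoted on purpose: trims the leading blank
}

steps_for 1 7     # prints: purge register optimize relate
steps_for 15 3    # prints: register
```

In the update procedure, the four fixed `estindex' lines could then be replaced by a loop over `steps_for "$(date +%d)" "$(date +%u)"' that performs `estindex' with each printed subcommand on `casket-tmp' and stops at the first failure.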
If you use Estraier on a shared rental server or a system provided by a service provider, the heavy load of building the inverted index may disturb other users. In that case, you can build the inverted index on your local system and upload it to the system of the web server. However, this technique is not available between systems with different byte orders.
The command `estsiutil' is provided as a utility for system integration. This command is helpful when you make an application based on Estraier and an installer for it. Moreover, it makes it easy to write a CGI script as a shell script. This command is used in the following format.
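Before uploading an index built elsewhere, it is worth comparing the byte order of the two systems. The following sketch uses only `printf' and `od' and can be performed on both the local system and the server. The function name `byte_order' is hypothetical.

```sh
# Report the byte order of the current system as "little" or "big".
byte_order() {
  # Emit the bytes 0x01 0x00 and let od read them as one 16-bit unit:
  # a little-endian system shows 0001, a big-endian system shows 0100.
  case "$(printf '\001\000' | od -An -tx2 | tr -d ' \n')" in
    0001) echo little ;;
    0100) echo big ;;
    *)    echo unknown ;;
  esac
}

byte_order
```

If the two systems print different results, build the inverted index on the server side instead of uploading it.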
To construct a full-text search system for remote sites, retrieve contents of remote sites with `wget' and build the inverted index of them. For example, if target sites are `http://estraier.foo.edu' and `http://snatcher.foo.edu', perform the following steps.
    cd /home/mikio/public_html
    wget -r -N -np -l inf -A ".txt,.TXT,.html,.HTML,.htm,.HTM" \
      "http://estraier.foo.edu/"
    wget -r -N -np -l inf -A ".txt,.TXT,.html,.HTML,.htm,.HTM" \
      "http://snatcher.foo.edu/"
    estindex register casket
    estindex relate casket
    cp /usr/local/libexec/estsearch.cgi /usr/local/share/estraier/estsearch.* .
In `estsearch.conf', set the value of `prefix' to `http://' and the value of `decuri' to `true'. Then, when a user follows a link in the search results, the browser goes to the original site, not to the local cache.
If you use an e-mail client that manages the mail box in `mh' format (Sylpheed, Mew, and so on), you can construct a full-text search system for the mail box. For example, if mails are stored in the folders `business' and `friends', perform the following steps.
    cd /home/mikio/public_html
    ln -s /home/mikio/Mail/inbox .
    estindex register -msuf "*" -tattr "author,recipient,multicast" \
      casket ./inbox/business
    estindex register -msuf "*" -tattr "author,recipient,multicast" \
      casket ./inbox/friends
    estindex relate casket
    cp /usr/local/libexec/estsearch.cgi /usr/local/share/estraier/estsearch.* .
If you use a client managing the mail box in its own format (Becky, Outlook Express, and so on), export files in `eml' format, and construct the inverted index of them. Moreover, if you use a client managing the mail box in mbox format (Eudora, Thunderbird, and so on), use the command `estmbtomh' to convert the mail box to files in mh format. Its usage is the following.
Because it is risky to expose personal mails via the web, you should configure the web server to deny access by unknown users, and set the permissions of the `mail' directory appropriately. When sending MIME messages via the web, the value of `Content-Type' should be `message/rfc822'.
Estraier lets you search the cache of the HTTP proxy WWWOFFLE. The command `estwolels' is provided to help with making the inverted index. This command outputs the path and the URI of each document in the cache. You can make the inverted index for the cache of WWWOFFLE by the following commands. Usually, these operations must be carried out by the root user.
    estwolels | estindex register -list - -tsuf "" -hsuf "" -msuf "*" -iz -mn casket
    estindex relate casket
    cp /usr/local/libexec/estsearch.cgi /usr/local/share/estraier/estsearch.* .
Moreover, you should edit `estsearch.conf' and set the value of the attribute `prefix' to be an empty string.
Even when customization with the prime configuration file is not enough, CSS or JavaScript in the template file may do what you want. For example, to narrow the width of the input form for the search phrase, all you have to do is write the following in the CSS definition.
#phrase { width: 32ex; }
If you want to hide the input forms of `max', `drep', and `sort', write the following. To hide input forms individually, set the style `display: none;' on the element whose ID is `maxspan', `drepspan', `sortspan', or `exprspan'.
#advancedform { display: none; }
If you use JavaScript, add the attribute `onload' to the element `body', and the specified function will be called when the page is loaded.
<body onload="startup();">
Moreover, define the function in the element `script' in the element `head'. For example, if you want to narrow the width of the input form, define the following.
    <script type="text/javascript">
    function startup(){
      document.getElementById('phrase').setAttribute('size', '32');
    }
    </script>
If you want customization beyond the capability of CSS and JavaScript, make a wrapper script which calls `estsearch.cgi'. Because CGI parameters are given as environment variables, they are passed on to the original script when the wrapper calls it. Then, process the output with `awk', `perl', and so on. For example, the following wrapper script narrows the width of the input form.
    #! /bin/sh
    ./estsearch.cgi | awk '
    {
      if(match($0, /^<input.*id="phrase"/)){
        sub(/size="[0-9]*"/, "size=\"6\"", $0)
      }
      printf("%s\n", $0)
    }
    '
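The filter of such a wrapper can be tried without a web server by feeding it sample lines on the standard input. In the following sketch, the function name `narrow_phrase' and the two sample `input' lines are hypothetical. The real output of `estsearch.cgi' may be laid out differently, in which case the pattern must be adjusted.

```sh
# The same awk filter as in the wrapper, reading HTML from stdin
# instead of from ./estsearch.cgi.
narrow_phrase() {
  awk '
  {
    if(match($0, /^<input.*id="phrase"/)){
      sub(/size="[0-9]*"/, "size=\"6\"", $0)
    }
    printf("%s\n", $0)
  }
  '
}

# Only the first line has id="phrase", so only its size is rewritten.
printf '%s\n' \
  '<input type="text" name="phrase" size="80" id="phrase" />' \
  '<input type="text" name="max" size="4" id="max" />' |
narrow_phrase
```

Note that the pattern matches only lines that begin with `<input', so an element that is indented or that shares a line with other markup is passed through unchanged.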
`estspellen' is a plug-in provided to show spelling alternatives for the search phrase. To use the plug-in, the command `aspell' must be installed on your system. Moreover, edit `estsearch.tmpl' or `estmerge.tmpl' and add the line `<!--ESTEXEC:/usr/local/libexec/estspellen-->' above the line `<!--ESTRESULT-->'.
When a version of Estraier is upgraded, backward compatibility of inverted indexes is not assured. So, you should rebuild the inverted indexes when upgrading.
If you find any bug, please report it to the author, with the information of the version of Estraier, the operating system, and the compiler. If possible, please send me `config.log' which was generated when performing `./configure'.
Estraier is free software; you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation; either version 2.1 of the License or any later version.
Estraier is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details.
You should have received a copy of the GNU General Public License along with Estraier (See the file `COPYING'); if not, write to the Free Software Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA.
Estraier was written by Mikio Hirabayashi. You can contact the author by e-mail to `mikio@fallabs.com'.