User's Guide
Copyright (C) 2004-2007 Mikio Hirabayashi
Last Update: Tue, 06 Mar 2007 12:05:18 +0900
Table of Contents
- Introduction
- Attributes
- File Formats
- Search Conditions
- Administration Command
- CGI Script for Search
- CGI Script for Highlight
Introduction
This guide describes in detail how to use the applications of Hyper Estraier. If you have not yet read the introduction document, please read it first.
Hyper Estraier is a full-text search system that uses an index database. Before searching, you must therefore prepare an index in which the target documents have been registered. Hyper Estraier provides the administration command `estcmd' and the CGI script `estseek.cgi' for search. The former administers the index through a command line interface. The latter searches the index for documents from a web browser.
estcmd can handle various file formats and offers various operations for administering an index. How to use it is described in this guide.
Hyper Estraier supports various search methods, such as combining several search phrases and searching by document attributes. Moreover, the presentation can be customized through the configuration of estseek.cgi. How to do so is described in this guide.
Attributes
Documents handled by Hyper Estraier can carry not only the body text but also attributes such as the title and the modification date. Attributes are used for purposes such as attribute search and deciding whether a document needs to be updated.
Attribute Name
Every attribute has a name. Although names can be chosen arbitrarily, some names are reserved for use as system attributes. Names of system attributes begin with "@". The following system attributes exist.
- @id : the ID number determined automatically when the document is registered.
- @uri : the location of the document; every document should have one.
- @digest : the message digest calculated automatically when the document is registered.
- @cdate : the creation date.
- @mdate : the last modification date.
- @adate : the last access date.
- @title : the title used as a headline in the search result.
- @author : the author.
- @type : the media type.
- @lang : the language.
- @genre : the genre.
- @size : the size.
- @weight : the scoring weight.
- @misc : miscellaneous information.
Attributes other than system attributes are called user-defined attributes. They can be defined in a document draft, which is described later. Meta attributes in HTML and headers of MIME are also treated as user-defined attributes. No attribute name should begin with "%".
Attribute Type
There are two data types for attributes: string and number. Data of the string type are arbitrary strings, and support operations such as full matching, forward matching, backward matching, and partial matching. Data of the number type are numbers or dates. A value of the number type is converted into a number according to the following rules. If the value is a date, it is computed relative to the UNIX epoch (1 Jan 1970).
- If all characters are digits, it is computed as a decimal number.
- If it begins with "0x", it is computed as a hexadecimal number.
- If it conforms to W3CDTF (e.g. 1978-02-11T18:05:32+09:00), it is computed as a date.
- If it conforms to RFC822 (e.g. Sat, 11 Feb 1978 18:05:32 +0900), it is computed as a date.
- If it is in YYYY/MM/DD format (e.g. 1978/02/11 18:05:32), it is computed as a date.
- Else, it is computed as -1.
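The conversion rules above can be illustrated with a minimal Python sketch. It is not the code Hyper Estraier uses; the function name `attr_to_number` is hypothetical, only the common full-length forms of each date format are covered, and treating a date without a time zone as UTC is an assumption of this sketch.

```python
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime


def _epoch(dt):
    """Return seconds since the UNIX epoch for a datetime."""
    if dt.tzinfo is None:
        # Assumption of this sketch: a date without a zone is taken as UTC.
        dt = dt.replace(tzinfo=timezone.utc)
    return int(dt.timestamp())


def attr_to_number(value):
    """Convert an attribute value to a number, following the listed rules."""
    s = value.strip()
    if s.isascii() and s.isdigit():          # all digits: decimal number
        return int(s)
    if s.startswith("0x"):                   # "0x" prefix: hexadecimal number
        try:
            return int(s[2:], 16)
        except ValueError:
            return -1
    try:                                     # W3CDTF, e.g. 1978-02-11T18:05:32+09:00
        return _epoch(datetime.fromisoformat(s))
    except ValueError:
        pass
    try:                                     # RFC822, e.g. Sat, 11 Feb 1978 18:05:32 +0900
        return _epoch(parsedate_to_datetime(s))
    except (TypeError, ValueError):
        pass
    for fmt in ("%Y/%m/%d %H:%M:%S", "%Y/%m/%d"):   # YYYY/MM/DD
        try:
            return _epoch(datetime.strptime(s, fmt))
        except ValueError:
            pass
    return -1                                # the fallback named in the last rule
```

With this sketch, the W3CDTF and RFC822 examples from the list denote the same instant and therefore convert to the same number.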
The data type is not determined at registration time; it is determined at search time. The length of an attribute value is not limited.
Attributes and the body text of a document should be expressed in UTF-8. Text in any other encoding should be converted into UTF-8. Note that estcmd detects the encoding automatically if it is not explicitly specified.
estcmd defines a URI attribute beginning with "file://" for each document. However, if a document defines its own URI, that takes precedence. The URI on the local file system is stored in an attribute whose name is "_lpath". The absolute path on the local file system is stored in an attribute whose name is "_lreal". The file name, normalized to UTF-8, is stored in an attribute whose name is "_lfile". The encoding of each attribute value is normalized to UTF-8. Attributes whose names begin with "_" are hidden in the detail display of estseek.cgi.
File Formats
estcmd handles four file formats. This section describes how each of the four is processed.
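The default suffix rules stated for each of the four formats in this section can be collected into a single lookup, shown here as a small Python sketch. The names `DEFAULT_SUFFIXES` and `detect_format` and the short format labels are hypothetical. The guide does not say here how letter case or unknown suffixes are handled, so the sketch compares suffixes exactly and returns None for anything else.

```python
import os

# Default suffixes per format, as listed in this section.
DEFAULT_SUFFIXES = {
    ".txt": "text", ".text": "text", ".asc": "text",
    ".html": "html", ".htm": "html", ".xhtml": "html", ".xht": "html",
    ".eml": "mime", ".mime": "mime", ".mht": "mime", ".mhtml": "mime",
    ".est": "draft",
}


def detect_format(path):
    """Return the format label for a file name, or None if the suffix is unknown."""
    suffix = os.path.splitext(path)[1]
    return DEFAULT_SUFFIXES.get(suffix)
```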
Plain Text
A plain-text document is composed of strings with no structure. By default, files whose names end with ".txt", ".text", or ".asc" are treated as plain text.
- The character encoding is detected automatically.
- "text/plain" is recorded as the "@type" attribute.
- The file size is recorded as the "@size" attribute.
HTML
HTML documents are used as hypertext on the Web. By default, files whose names end with ".html", ".htm", ".xhtml", or ".xht" are treated as HTML.
- The character encoding is detected automatically. However, if the encoding is specified by a "meta" element, that takes precedence.
- If there is a "title" element, its content is recorded as the "@title" attribute.
- If the "name" attribute of a "meta" element specifies "author", the value of the "content" attribute is recorded as the "@author" attribute.
- If the "html" element has the "lang" or "xml:lang" attribute, the value is recorded as the "@lang" attribute.
- "text/html" is recorded as the "@type" attribute.
- The file size is recorded as the "@size" attribute.
- If a "meta" element specifies "name" or "http-equiv", the value of its "content" attribute is recorded as an attribute whose name is the "name" or "http-equiv" value converted to lower case.
- The value of the attribute "@title" is treated as a hidden text.
MIME (e-mail)
MIME is used for e-mail communication based on RFC822 and related standards. By default, files whose names end with ".eml", ".mime", ".mht", or ".mhtml" are treated as MIME.
- The character encoding is detected automatically. However, if the encoding is specified by the "Content-Type" header, that takes precedence.
- If the "Subject" header is present, its value is recorded as the "@title" attribute.
- If the "From" header is present, its value is recorded as the "@author" attribute.
- If the "Date" header is present, its value is recorded as the "@cdate" attribute and the "@mdate" attribute.
- "message/rfc822" is recorded as the "@type" attribute.
- The file size is recorded as the "@size" attribute.
- The value of each header is recorded as an attribute whose name is the header name converted to lower case.
- The value of the attribute "@title" is treated as a hidden text.
If a part of a multipart message has the content type "text/plain", "text/html", or "message/rfc822", its content is treated as part of the body text, so that web archives are supported.
Document Draft
Document draft is a format original to Hyper Estraier. Using document draft as an intermediate format makes it possible to handle various formats in a uniform way. By default, files whose names end with ".est" are treated as document draft.
Though the format of document draft is similar to RFC822, the details differ. The delimiter for headers is not ":" but "=". Moreover, no space character is needed after "=". The following is example data for handling a MIDI document.
@uri=http://www.music-estraier.com/mididb/t/tw/twinkle.kar
@title=Twinkle Twinkle Little Star
@author=Jane Taylor
@cdate=2004-11-01T23:11:18+09:00
@mdate=2005-03-21T08:07:45+09:00
category=chorus,dance

Twinkle, twinkle, little star,
How I wonder what you are.
Up above the world so high,
Like a diamond in the sky.
Twinkle, twinkle, little star,
How I wonder what you are!
	Twinkle Twinkle Little Star
	Jane Taylor
The following specifications are required for document draft.
- It is composed of valid UTF-8 strings.
- Line breaks are either UNIX style (LF) or DOS style (CR+LF).
- It consists of an attribute section and a text section, separated by the first empty line.
- Each line in the attribute section specifies an attribute. The name and the value are separated by the first "=".
- Each line in the text section specifies a sentence of the body text. If a line begins with a tab character, the line is treated as hidden text.
In the attribute section, lines which begin with "%" are regarded as control commands and are ignored. Lines which begin with "%VECTOR" followed by a tab express keyword vectors. The format of the keyword vector is TSV, in which keywords and their scores alternate. Lines which begin with "%SCORE" followed by a tab express the substitute score.
Hidden text is the same as normal text except that it is not displayed in the snippet of the search result. It is useful for making the values of some attributes, such as the title, searchable.
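The structural rules above can be sketched as a small Python parser. This is an illustration of the format only, not the parser Hyper Estraier uses; the function name `parse_draft` is hypothetical, and skipping every line that begins with "%" (including "%VECTOR" and "%SCORE") is a simplification of this sketch.

```python
def parse_draft(data):
    """Split document draft text into (attributes, body lines, hidden lines)."""
    attrs, texts, hidden = {}, [], []
    in_text = False
    # Accept both LF and CR+LF line breaks.
    for line in data.replace("\r\n", "\n").split("\n"):
        if in_text:
            if line.startswith("\t"):
                hidden.append(line[1:])      # a leading tab marks hidden text
            elif line:
                texts.append(line)
        elif line == "":
            in_text = True                   # the first empty line ends the attributes
        elif not line.startswith("%") and "=" in line:
            name, value = line.split("=", 1) # split at the first "=" only
            attrs[name] = value
    return attrs, texts, hidden
```

Because the split happens at the first "=", an attribute value may itself contain "=" characters.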
Search Conditions
Two kinds of search conditions are supported. One is for full-text search and the other is for attribute search. If both are specified at the same time, documents that satisfy both are searched for. Moreover, full-text search conditions can be written in several forms, including the usual form and the simplified form.
Full-text Search Conditions
The purpose of a full-text search expression is to search for documents that include the specified words. For example, to search for documents that include the word "computer", specify "computer" as the search phrase as it is.
You can specify two or more words. For example, if you specify "United Nations", documents including "united" followed by "nations" are searched for. In the simplified form, specify the following.
"united nations"
The intersection operation is supported by the "AND" operator. For example, if you specify "internet AND security", documents including both "internet" and "security" are searched for. In the simplified form, specify the following.
internet security
The difference operation is supported by the "ANDNOT" operator. For example, if you specify "hacker ANDNOT cracker", documents including "hacker" but not including "cracker" are searched for. In the simplified form, specify the following.
hacker ! cracker
The union operation is supported by the "OR" operator. For example, if you specify "proxy OR firewall", documents including one or both of "proxy" and "firewall" are searched for. In the simplified form, specify the following.
proxy | firewall
Note that the priority of "OR" is higher than that of "AND" and "ANDNOT". For example, if you specify "F1 OR F-1 OR Formula One AND Champion OR Victory", documents including at least one of "f1", "f-1", and "formula one", and also including one or both of "champion" and "victory", are searched for. In the simplified form, specify the following.
F1 | F-1 | "Formula One" Champion | Victory
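The correspondence between the simplified form and the usual form shown in the examples above can be sketched in Python as follows. This is only an illustration of the mapping (space for AND, "!" for ANDNOT, "|" for OR, double quotes for a phrase); the function name `simplified_to_usual` is hypothetical and the sketch is not the parser used by Hyper Estraier.

```python
import re

# Operator symbols of the simplified form and their usual-form spellings.
OPERATORS = {"|": "OR", "!": "ANDNOT"}


def simplified_to_usual(phrase):
    """Rewrite a simplified-form search phrase into the usual form."""
    out = []
    prev_was_term = False
    # A token is either a double-quoted phrase or a run of non-space characters.
    for token in re.findall(r'"[^"]*"|\S+', phrase):
        if token in OPERATORS:
            out.append(OPERATORS[token])
            prev_was_term = False
            continue
        if prev_was_term:
            out.append("AND")                # adjacent terms are intersected
        out.append(token.strip('"'))         # a quoted phrase loses its quotes
        prev_was_term = True
    return " ".join(out)
```

Applied to the last simplified example above, the sketch reproduces the usual-form phrase quoted in the preceding paragraph.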
Search words are case insensitive. However, operators are case sensitive. If you want to search for documents including "AND", specify "and" instead.
Wild cards are also supported. They can be used for forward match search and backward match search. For example, "[BW] euro" matches words which begin with "euro". "[EW] sphere" matches words which end with "sphere". Moreover, regular expressions are also supported. For example, "[RX] ^inter.*al$" matches words which begin with "inter" and end with "al". In the simplified form, "euro*", "*sphere", and "*^inter.*al$*" are used instead.
Attribute Search Conditions
The purpose of an attribute search expression is to search for documents whose attributes match the specified expression. An attribute search expression is composed of an attribute name, an operator, and a value, separated by space characters. For example, if you specify "@title STRINC IMPORTANT", documents whose title includes "IMPORTANT" are searched for. The following operators for attribute search are supported.
- STREQ : is equal to the string
- STRNE : is not equal to the string
- STRINC : includes the string
- STRBW : begins with the string
- STREW : ends with the string
- STRAND : includes all tokens in the string
- STROR : includes at least one token in the string
- STROREQ : is equal to at least one token in the string
- STRRX : matches regular expressions of the string
- NUMEQ : is equal to the number or date
- NUMNE : is not equal to the number or date
- NUMGT : is greater than the number or date
- NUMGE : is greater than or equal to the number or date
- NUMLT : is less than the number or date
- NUMLE : is less than or equal to the number or date
- NUMBT : is between the two numbers or dates
If an operator is preceded by "!", its meaning is inverted. If an operator is preceded by "I", the case of the value is ignored. If no operator is specified, all documents that have the attribute match regardless of its value. STRAND, STROR, STROREQ, and NUMBT take multiple parameters separated by spaces. The range of NUMBT includes its border values. Two or more attribute names can be listed, separated by ",", to mean logical addition (OR).
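To make the operator semantics concrete, here is a hedged Python sketch that evaluates an attribute search expression against one document held as a dictionary. It is not Hyper Estraier's implementation; the names `match_attr`, `STR_OPS`, and `NUM_OPS` are hypothetical. The sketch assumes that "!" is written before "I" when both prefixes are combined, handles only decimal numbers (no dates), and treats a document lacking the attribute as not matching before inversion is applied.

```python
import operator
import re

STR_OPS = {
    "STREQ": lambda v, p: v == p,
    "STRNE": lambda v, p: v != p,
    "STRINC": lambda v, p: p in v,
    "STRBW": lambda v, p: v.startswith(p),
    "STREW": lambda v, p: v.endswith(p),
    "STRAND": lambda v, p: all(t in v for t in p.split()),
    "STROR": lambda v, p: any(t in v for t in p.split()),
    "STROREQ": lambda v, p: v in p.split(),
}
NUM_OPS = {
    "NUMEQ": operator.eq, "NUMNE": operator.ne,
    "NUMGT": operator.gt, "NUMGE": operator.ge,
    "NUMLT": operator.lt, "NUMLE": operator.le,
}


def _test(op, value, param, icase):
    """Apply one operator to one attribute value."""
    if op.startswith("NUM"):
        v = int(value)                       # decimal only in this sketch
        nums = [int(t) for t in param.split()]
        if op == "NUMBT":                    # inclusive of both border values
            return len(nums) >= 2 and min(nums[:2]) <= v <= max(nums[:2])
        return bool(nums) and NUM_OPS[op](v, nums[0])
    if op == "STRRX":
        return re.search(param, value, re.I if icase else 0) is not None
    if icase:
        value, param = value.lower(), param.lower()
    return STR_OPS[op](value, param)


def match_attr(doc, expr):
    """Return True if the document matches the attribute search expression."""
    parts = expr.split(None, 2)
    names = parts[0].split(",")              # "," between names means OR
    if len(parts) == 1:                      # no operator: attribute must exist
        return any(n in doc for n in names)
    op = parts[1]
    param = parts[2] if len(parts) > 2 else ""
    invert = op.startswith("!")
    if invert:
        op = op[1:]
    icase = op.startswith("I")
    if icase:
        op = op[1:]
    hit = any(n in doc and _test(op, doc[n], param, icase) for n in names)
    return hit != invert
```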
Order of the Result
You can specify the order of the result by an expression. An ordering expression is composed of an attribute name and an operator. For example, if you specify "@size NUMA", documents in the result are in ascending order of the size. The following operators for ordering are supported.
- STRA : ascending by string
- STRD : descending by string
- NUMA : ascending by number or date
- NUMD : descending by number or date
By default, the order of the result is descending by score. The score is calculated by the number of specified words in each document.
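The four ordering operators can be sketched in Python as a sort over documents held as dictionaries. The function name `order_documents` is hypothetical, and the handling of missing or non-decimal values (they sort as -1 under the NUM operators) is an assumption of this sketch rather than documented behaviour.

```python
def order_documents(docs, expr):
    """Sort documents by an ordering expression such as "@size NUMA"."""
    name, op = expr.split()
    if op not in ("STRA", "STRD", "NUMA", "NUMD"):
        raise ValueError("unknown ordering operator: " + op)

    def key(doc):
        value = doc.get(name, "")
        if op.startswith("NUM"):
            # Decimal values only; anything else sorts as -1 in this sketch.
            return int(value) if value.isascii() and value.isdigit() else -1
        return value

    # Operators ending in "D" sort in descending order.
    return sorted(docs, key=key, reverse=op.endswith("D"))
```

The difference between the string and number operators shows up with values such as "900" and "1200": as strings "1200" sorts first, as numbers 900 does.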
Other Features
estseek.cgi provides other features as well. "[...] per page" in the form specifies the number of documents shown per page. If the matching documents span more than one page, you can move to another page via the "PREV" and "NEXT" anchors at the bottom of the page. "clip by [...]" in the form specifies the strength of clipping similar documents. It is useful when documents that are too similar occupy the page. Each "[detail]" link in the result shows detailed information. Each "[similar]" link in the result searches for similar documents. Each "[include]" link in the result includes the clipped documents.
The search phrase has other formats as well: rough form, union form, and intersection form. Though the rough form is similar to the simplified form, negative conditions are specified by tokens preceded by "-". The union form specifies only union conditions by tokens. The intersection form specifies only intersection conditions by tokens. These forms support neither wild cards nor other special operators.
Administration Command
This section describes the specification of estcmd. estcmd can perform not only indexing but also search.
Synopsis and Description
estcmd is an aggregation of sub commands. The name of a sub command is specified by the first argument. The other arguments are parsed according to each sub command. The argument db specifies the path of an index.
- estcmd create [-tr] [-apn|-acc] [-xs|-xl|-xh|-xh2|-xh3] [-sv|-si|-sa] [-attr name type] db
- Create an index.
- If -tr is specified, a new index is created regardless of whether one already exists.
- If -apn is specified, N-gram analysis is performed against European text also.
- If -acc is specified, character category analysis is performed instead of N-gram analysis.
- If -xs is specified, the index is tuned to register less than 50000 documents.
- If -xl is specified, the index is tuned to register more than 300000 documents.
- If -xh is specified, the index is tuned to register more than 1000000 documents.
- If -xh2 is specified, the index is tuned to register more than 5000000 documents.
- If -xh3 is specified, the index is tuned to register more than 10000000 documents.
- If -sv is specified, scores are stored as void.
- If -si is specified, scores are stored as 32-bit integer.
- If -sa is specified, scores are stored as-is and marked not to be tuned when search.
- -attr specifies an attribute index and its data type. This option can be specified multiple times.
- estcmd put [-tr] [-cl] [-ws] [-apn|-acc] [-xs|-xl|-xh|-xh2|-xh3] [-sv|-si|-sa] db [file]
- Register a document of document draft to an index.
- file specifies a target file. If it is omitted, the standard input is read.
- If -tr is specified, a new index is created regardless of whether one already exists.
- If -cl is specified, regions of an overwritten document are cleaned up.
- If -ws is specified, scores are weighted statically with score weighting attribute.
- If -apn is specified, N-gram analysis is performed against European text also.
- If -acc is specified, character category analysis is performed instead of N-gram analysis.
- If -xs is specified, the index is tuned to register less than 50000 documents.
- If -xl is specified, the index is tuned to register more than 300000 documents.
- If -xh is specified, the index is tuned to register more than 1000000 documents.
- If -xh2 is specified, the index is tuned to register more than 5000000 documents.
- If -xh3 is specified, the index is tuned to register more than 10000000 documents.
- If -sv is specified, scores are stored as void.
- If -si is specified, scores are stored as 32-bit integer.
- If -sa is specified, scores are stored as-is and marked not to be tuned when search.
- estcmd out [-cl] [-pc enc] db expr
- Remove information of a document from an index.
- expr specifies the ID number, the URI, or the local path of a document.
- If -cl is specified, regions of the document are cleaned up.
- -pc specifies the encoding of file paths. By default, it is ISO-8859-1.
- estcmd edit [-pc enc] db expr name [value]
- Edit an attribute of a document in an index.
- expr specifies the ID number, the URI, or the local path of a document.
- name specifies the name of an attribute.
- value specifies the value of the attribute. If it is omitted, the attribute is removed.
- -pc specifies the encoding of the file path and the attribute value. By default, it is ISO-8859-1.
- estcmd get [-nl|-nb] [-pidx path] [-pc enc] db expr [attr]
- Output document draft of a document in an index.
- expr specifies the ID number, the URI, or the local path of a document.
- If attr is specified, only the value of the attribute is output.
- If -nl is specified, the index is opened without file locking.
- If -nb is specified, file locking is performed without blocking.
- -pidx specifies the path of a pseudo index. This option can be specified multiple times.
- -pc specifies the encoding of file paths. By default, it is ISO-8859-1.
- estcmd list [-nl|-nb] [-lp] db
- Output a list of all documents in an index.
- If -nl is specified, the index is opened without file locking.
- If -nb is specified, file locking is performed without blocking.
- If -lp is specified, local path equivalent to URL of "file://" is output.
- estcmd uriid [-pidx path] [-nl|-nb] [-pc enc] db expr
- Output the ID number of a document specified by URI.
- expr specifies the URI or the local path of a document.
- If -nl is specified, the index is opened without file locking.
- If -nb is specified, file locking is performed without blocking.
- -pidx specifies the path of a pseudo index. This option can be specified multiple times.
- -pc specifies the encoding of file paths. By default, it is ISO-8859-1.
- estcmd meta db [name [value]]
- Handle meta data.
- name specifies the name of a piece of meta data. If it is omitted, a list of all names is output.
- value specifies the value of the meta data to be recorded. If it is omitted, the current value is output. If it is an empty string, the meta data is removed.
- estcmd inform [-nl|-nb] db
- Output the number of documents and the number of unique words in an index.
- If -nl is specified, the index is opened without file locking.
- If -nb is specified, file locking is performed without blocking.
- estcmd optimize [-onp] [-ond] db
- Optimize an index and clean up dispensable regions.
- If -onp is specified, it is omitted to clean up dispensable regions.
- If -ond is specified, it is omitted to optimize the database files.
- estcmd merge [-cl] db target
- Merge another index.
- target specifies the path of another index.
- If -cl is specified, regions of overwritten documents are cleaned up.
- estcmd repair [-rst|-rsh] db
- Repair a broken index.
- If -rst is specified, strict consistency check is performed.
- If -rsh is specified, consistency check is omitted.
- estcmd search [-nl|-nb] [-pidx path] [-ic enc] [-vu|-va|-vf|-vs|-vh|-vx|-dd] [-sn wnum hnum anum] [-kn num] [-um] [-ec rn] [-gs|-gf|-ga] [-cd] [-ni] [-sf|-sfr|-sfu|-sfi] [-hs] [-attr expr] [-ord expr] [-max num] [-sk num] [-aux num] [-dis name] [-sim id] db [phrase]
- Search an index for documents.
- phrase specifies the search phrase.
- If -nl is specified, the index is opened without file locking.
- If -nb is specified, file locking is performed without blocking.
- -pidx specifies the path of a pseudo index. This option can be specified multiple times.
- -ic specifies the input encoding. By default, it is UTF-8.
- If -vu is specified, TSV of ID number and URI are output.
- If -va is specified, multipart format including attributes is output.
- If -vf is specified, multipart format including document draft is output.
- If -vs is specified, multipart format including attributes and snippets is output.
- If -vh is specified, human readable format including attributes and snippets is output.
- If -vx is specified, XML including attributes and snippets is output.
- If -dd is specified, document draft data are dumped and saved into separated files.
- -sn specifies the whole width of the snippet, the width of the string picked up from the beginning of the text, and the width of the strings picked up around each highlighted word.
- -kn specifies the number of keywords to be extracted. By default, keyword extraction is not performed.
- If -um is specified, morphological analyzers are used for keyword extraction.
- -ec specifies lower limit of similarity eclipse. If it is negative, similarity is weighted by URL. "serv" specifies server basis. "dir" specifies directory basis. "file" specifies file basis.
- If -gs is specified, every key of N-gram is checked. By default, every other key is checked.
- If -gf is specified, keys of N-gram are checked every three.
- If -ga is specified, keys of N-gram are checked every four.
- If -cd is specified, whether documents match the search phrase definitely is checked.
- If -ni is specified, TF-IDF tuning is omitted.
- If -sf is specified, the phrase is treated as a simplified form.
- If -sfr is specified, the phrase is treated as a rough form.
- If -sfu is specified, the phrase is treated as a union form.
- If -sfi is specified, the phrase is treated as an intersection form.
- If -hs is specified, score information is output as an attribute.
- -attr specifies an attribute search condition. This option can be specified multiple times.
- -ord specifies the order expression. By default, it is descending by score.
- -max specifies the maximum number of shown documents. Negative means unlimited. By default, it is 10.
- -sk specifies the number of documents to be skipped. By default, it is 0.
- -aux specifies permission to adopt result of the auxiliary index. If it is not more than 0, the auxiliary index is not used. By default, it is 32.
- -dis specifies the name of the distinct attribute.
- -sim specifies the ID number of the seed document for similarity search.
- estcmd gather [-tr] [-cl] [-ws] [-no] [-fe|-ft|-fh|-fm] [-fx sufs cmd] [-fz] [-fo] [-rm sufs] [-ic enc] [-il lang] [-bc] [-lt num] [-lf num] [-pc enc] [-px name] [-aa name value] [-apn|-acc] [-xs|-xl|-xh|-xh2|-xh3] [-sv|-si|-sa] [-ss name] [-sd] [-cm] [-cs num] [-ncm] [-kn num] [-um] db [file|dir]
- Scan the local file system and register documents into an index.
- If the third argument is the name of a file, a list of paths of target documents is read from it. If it is "-", the standard input is read.
- If the third argument is the name of a directory, all files under the directory are treated as target documents.
- If -tr is specified, a new index is created regardless of whether one already exists.
- If -cl is specified, regions of overwritten documents are cleaned up.
- If -ws is specified, scores are weighted statically with score weighting attribute.
- If -no is specified, operations are printed but not executed actually.
- If -fe is specified, target files are treated as document draft. By default, the format is detected by the suffix of each document.
- If -ft is specified, target files are treated as plain text.
- If -fh is specified, target files are treated as HTML.
- If -fm is specified, target files are treated as MIME.
- If -fx is specified, target files with the specified suffixes are processed by the specified outer command. "*" matches any file. If the command is prefixed with "T@", the output of the command is treated as plain text. If the command is prefixed with "H@", the output of the command is treated as HTML. If the command is prefixed with "M@", the output of the command is treated as MIME. Otherwise, the output is treated as document draft. This option can be specified multiple times.
- If -fz is specified, documents which do not correspond to the condition of -fx are ignored.
- If -fo is specified, target files are not read. It is useful for efficient process of the outer command.
- If -rm is specified, target files with the specified suffixes are removed. "*" matches any file. This option can be specified multiple times.
- -ic specifies the input encoding. By default, it is detected automatically.
- -il specifies the preferred input language. By default, English is preferred.
- If -bc is specified, binary files are detected and ignored.
- -lt specifies the text size limitation by kilobytes. By default, it is 128KB. If it is negative, the size is unlimited.
- -lf specifies the file size limitation by megabytes. By default, it is 32MB. If it is negative, the size is unlimited.
- -pc specifies the encoding of file paths. By default, it is ISO-8859-1.
- -px specifies the name of an attribute read from the list of paths. The list of paths can be in TSV format: the first field is treated as the path of a target document, and the second and following fields are attribute values. Each -px option names the attribute for one of those fields, in order. This option can be specified multiple times.
- -aa specifies the name and the value of an additional attribute. This option can be specified multiple times.
- If -apn is specified, N-gram analysis is performed against European text also.
- If -acc is specified, character category analysis is performed instead of N-gram analysis.
- If -xs is specified, the index is tuned to register less than 50000 documents.
- If -xl is specified, the index is tuned to register more than 300000 documents.
- If -xh is specified, the index is tuned to register more than 1000000 documents.
- If -xh2 is specified, the index is tuned to register more than 5000000 documents.
- If -xh3 is specified, the index is tuned to register more than 10000000 documents.
- If -sv is specified, scores are stored as void.
- If -si is specified, scores are stored as 32-bit integer.
- If -sa is specified, scores are stored as-is and marked not to be tuned when search.
- -ss specifies the name of an attribute for substitute score.
- If -sd is specified, the modification date of each file is recorded as an attribute.
- If -cm is specified, documents whose modification date has not changed are ignored.
- -cs specifies the size of cache memory by megabytes. By default, it is 64MB.
- If -ncm is specified, checking availability of the virtual memory is omitted.
- -kn specifies the number of keywords to be extracted. By default, keyword extraction is not performed.
- If -um is specified, morphological analyzers are used for keyword extraction.
- estcmd purge [-cl] [-no] [-fc] [-pc enc] [-attr expr] db [prefix]
- Purge information of documents which do not exist on the file system.
- If prefix is specified, only documents whose URIs begin with it are processed. It can be specified by the local path of a directory.
- If -cl is specified, regions of the deleted documents are cleaned up.
- If -no is specified, operations are printed but not executed actually.
- If -fc is specified, information of all target documents are deleted.
- -pc specifies the encoding of file paths. By default, it is ISO-8859-1.
- -attr specifies an attribute search condition. This option can be specified multiple times.
- estcmd extkeys [-no] [-fc] [-dfdb file] [-ncm] [-ni] [-kn num] [-um] [-attr expr] db [prefix]
- Create a database of keywords extracted from documents.
- If prefix is specified, only documents whose URIs begin with it are processed.
- If -no is specified, operations are printed but not executed actually.
- If -fc is specified, all target documents are processed whether or not they have existing records.
- -dfdb specifies an outer database of document frequency. By default, document frequency is calculated dynamically according to the index.
- If -ncm is specified, checking availability of the virtual memory is omitted.
- If -ni is specified, TF-IDF tuning is omitted.
- -kn specifies the number of keywords to be extracted. By default, it is 32.
- If -um is specified, morphological analyzers are used for keyword extraction.
- -attr specifies an attribute search condition. This option can be specified multiple times.
- estcmd words [-nl|-nb] [-dfdb file] [-kw|-kt] db
- Output a list of all unique words and the size of each record, which is treated as the document frequency.
- If -nl is specified, the index is opened without file locking.
- If -nb is specified, file locking is performed without blocking.
- -dfdb specifies an outer database where the result is stored. By default, the result is output to the standard output as TSV. If the outer database already exists, the value of each record is incremented.
- If -kw is specified, keywords and numbers of corresponding documents are output.
- If -kt is specified, keywords and their related terms are output.
- estcmd draft [-ft|-fh|-fm] [-ic enc] [-il lang] [-bc] [-lt num] [-kn num] [-um] [file]
- For test and debug.
- estcmd break [-ic enc] [-il lang] [-apn|-acc] [-wt] [file]
- For test and debug.
- estcmd iconv [-ic enc] [-il lang] [-oc enc] [file]
- For test and debug.
- estcmd regex [-inv] [-repl str] expr [file]
- For test and debug.
- estcmd scandir [-tf|-td] [-pa|-pu] [dir]
- For test and debug.
- estcmd multi [-db db] [-nl|-nb] [-ic enc] [-gs|-gf|-ga] [-cd] [-ni] [-sf|-sfr|-sfu|-sfi] [-hs] [-hu] [-attr expr] [-ord expr] [-max num] [-sk num] [-aux num] [-dis name] [phrase]
- For test and debug.
- estcmd randput [-ren|-rla|-reu|-ror|-rjp|-rch] [-cs num] db dnum
- For test and debug.
- estcmd wicked db dnum
- For test and debug.
- estcmd regression db
- For test and debug.
- estcmd version
- Show the version information.
All sub commands return 0 if the operation succeeds and 1 otherwise. The sub commands put, out, gather, purge, randput, wicked, and regression close the database and finish when they catch signal 1 (SIGHUP), 2 (SIGINT), 3 (SIGQUIT), 13 (SIGPIPE), or 15 (SIGTERM).
The data type of attribute indexes specified by -attr
option of create
sub command should be "seq
" for sequencial type, "str
" for string type, or "num
" for number type.
Each pseudo index specified by -pidx
option of search
sub command and so on is a directory containing files of document draft. If you search a main index with pseudo indexes, meta search of the main index and pseudo indexes is performed.
The encoding name specified by -ic
option should be such name registered to IETF as UTF-8
, ISO-8859-1
, and so on. The language name specified by -il
option should be one of "en
" (English), "ja
" (Japanese), "zh
" (Chinese), "ko
" (Korean).
The outer command specified by the -fx option of gather receives the path of the target document as the first argument and the path for output as the second argument. The original path of the target document is given as the value of the environment variable `ESTORIGFILE'.
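The calling convention of an outer command can be sketched in Python as follows. This is only an illustration: the function names and the conversion itself (dropping lines that start with "#") are hypothetical, and since this passage does not describe the output format a filter must produce, plain text is assumed.

```python
import os


def convert(src_path, dst_path, environ=os.environ):
    """Read the target document and write the converted text."""
    # ESTORIGFILE carries the original path of the target document.
    original = environ.get("ESTORIGFILE", src_path)
    with open(src_path, "r", encoding="utf-8", errors="replace") as src:
        # Hypothetical conversion: keep every line that is not a "#" line.
        kept = [line for line in src if not line.startswith("#")]
    with open(dst_path, "w", encoding="utf-8") as dst:
        dst.write("Source: " + original + "\n")  # illustrative header only
        dst.writelines(kept)
    return 0


def main(argv, environ=os.environ):
    """argv[1] is the target document, argv[2] is the path for output."""
    if len(argv) != 3:
        return 1
    return convert(argv[1], argv[2], environ)
```

To run the sketch as a command, a script would end with a call such as `sys.exit(main(sys.argv))` after importing `sys`.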
Note that similarity search is very slow by default. To improve its performance, running "estcmd extkeys" beforehand is strongly recommended.
Examples
The following registers mail files in MH format.
find /home/mikio/Mail -type f | egrep 'inbox/(business|friends)/[0-9]+$' |
estcmd gather -cl -fm -cm casket -
The following registers MS-Office files. estfxmsotohtml requires wvWare and xlhtml.
PATH=$PATH:/usr/local/share/hyperestraier/filter ; export PATH
estcmd gather -cl -fx ".doc,.xls,.ppt" "H@estfxmsotohtml" -fz -sd -cm casket .
The following registers PDF files. estfxpdftohtml requires pdftotext.
PATH=$PATH:/usr/local/share/hyperestraier/filter ; export PATH
estcmd gather -cl -fx ".pdf" "H@estfxpdftohtml" -fz -sd -cm casket .
The following registers cache files of WWWOFFLE, a proxy server. estwolefind requires WWWOFFLE.
estwolefind /var/spool/wwwoffle |
estcmd gather -cl -fm -bc -px @uri -px _lfile -sd -cm casket -
The following outputs the search result as XML.
estcmd search -vx -max 8 casket 'socket AND shutdown'
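Because every sub command returns 0 on success and 1 otherwise, a calling program can test the exit status. The following Python sketch builds the argument list of the XML search example and checks the status of a command. The function names are hypothetical, and the sketch assumes that estcmd is found on the PATH.

```python
import subprocess


def build_search_argv(db, phrase, max_hits=8):
    """Build the argument list of the XML search example."""
    return ["estcmd", "search", "-vx", "-max", str(max_hits), db, phrase]


def succeeded(argv):
    """Run a command and report whether its exit status is 0."""
    completed = subprocess.run(argv, stdout=subprocess.PIPE)
    return completed.returncode == 0
```

For example, `succeeded(build_search_argv("casket", "socket AND shutdown"))` would report whether the search ran successfully. If the command cannot be found at all, `subprocess.run` raises an exception instead of returning a status.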
CGI Script for Search
This section describes the specification of estseek.cgi. The main topic is how to write its configuration files.
Composition
estseek.cgi needs four configuration files: the prime configuration file, the template file, the top page file, and the help file. Their default names are `estseek.conf', `estseek.tmpl', `estseek.top', and `estseek.help'.
The name of the prime configuration file is determined by changing the suffix of the CGI script to ".conf". If you rename `estseek.cgi' to `estsearch.cgi', `estsearch.conf' is read. Names of the other files are specified in the prime configuration file, so you can install several sets of search scripts in one directory.
estseek.cgi is installed as `/usr/local/libexec/estseek.cgi'; copy it to a directory for CGI scripts. Sample configuration files are installed in `/usr/local/share/hyperestraier/'; copy and modify them.
Prime Configuration File
The prime configuration file is composed of lines, each holding the name of a variable and its value separated by ":". Lines beginning with "#" are ignored as comments. The default configuration is as follows.
indexname: casket
tmplfile: estseek.tmpl
topfile: estseek.top
helpfile: estseek.help
lockindex: true
pseudoindex:
replace: ^file:///home/mikio/public_html/{{!}}http://localhost/
replace: /index\.html?${{!}}/
showlreal: false
deftitle: Hyper Estraier: a full-text search system for communities
formtype: normal
perpage: 10 100 10
attrselect: false
#genrecheck: private{{!}}private
#genrecheck: business{{!}}business
#genrecheck: misc{{!}}miscellaneous
attrwidth: 80
showscore: true
extattr: author|Author
extattr: from|From
extattr: to|To
extattr: cc|Cc
extattr: date|Date
snipwwidth: 480
sniphwidth: 96
snipawidth: 96
condgstep: 2
dotfidf: true
scancheck: 3
phraseform: 2
dispproxy:
candetail: true
candir: false
auxmin: 32
smlrvnum: 32
smlrtune: 16 1024 4096
clipview: 2
clipweight: none
relkeynum: 0
spcache:
wildmax: 256
qxpndcmd:
logfile:
logformat: {time}\t{REMOTE_ADDR}:{REMOTE_PORT}\t{cond}\n
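The line format of this file can be read with a few lines of code. The following Python sketch is an illustration only and not the parser used by estseek.cgi; the function name is hypothetical, and the handling of surrounding whitespace is an assumption. Values are collected in lists because some variables may appear more than once.

```python
def parse_config(text):
    """Parse "name: value" lines into a dict mapping names to value lists."""
    config = {}
    for line in text.splitlines():
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and comment lines
        # Split at the first ":" only, so values may contain ":" themselves.
        name, sep, value = line.partition(":")
        if not sep:
            continue  # no separator, so not a variable line
        config.setdefault(name.strip(), []).append(value.strip())
    return config


sample = """indexname: casket
#genrecheck: private{{!}}private
extattr: author|Author
extattr: from|From
replace: ^file:///home/mikio/public_html/{{!}}http://localhost/
pseudoindex:
"""
conf = parse_config(sample)
print(conf["indexname"])
print(conf["extattr"])
```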
The meaning of each variable is as follows.
- indexname : specifies the name of the index.
- tmplfile : specifies the path of the template file.
- topfile : specifies the path of the top page file.
- helpfile : specifies the path of the help file.
- lockindex : specifies whether to perform file locking to the database. "true" or "false".
- pseudoindex : specifies the path of a pseudo index. This can be specified more than once.
- replace : specifies a regular expression and a replacement string to convert the URI of each document. The regular expression and the replacement string are separated by "{{!}}". Each "&" in a replacement string is expanded to the matched string. Special escapes "\1" through "\9" referring to the corresponding matching sub-expressions are also supported. This can be specified more than once.
- showlreal : specifies whether to show the absolute path instead of the URI. "true" or "false".
- deftitle : specifies the default title of the page.
- formtype : specifies the type of the input form. "normal" for generic purpose, "web" for web site, "file" for file system, or "mail" for mail box.
- perpage : specifies parameters of the perpage select box; the minimum number, the maximum number, and the step of increment.
- attrselect : specifies whether to use select boxes for attribute conditions.
- genrecheck : specifies a check box to narrow by genre attribute. A condition value and a label are separated by "{{!}}". This can be specified more than once.
- attrwidth : specifies the maximum width of each shown attribute.
- showscore : specifies whether to show scores.
- extattr : specifies an attribute to be shown. The name and the label are separated by "|". This can be specified more than once.
- snipwwidth : specifies whole width of the snippet of each shown document.
- sniphwidth : specifies width of strings picked up from the beginning of the text.
- snipawidth : specifies width of strings picked up around each highlighted word.
- condgstep : specifies accuracy of N-gram checking. "1" checks every key, "2" checks every second key, "3" every third, and "4" every fourth.
- dotfidf : specifies whether to do TF-IDF score tuning. "true" or "false".
- scancheck : specifies the number of documents checked by scanning.
- phraseform : specifies the phrase form. "1" is usual form. "2" is simplified form. "3" is rough form. "4" is union form. "5" is intersection form.
- dispproxy : specifies the URL of estproxy.cgi to display original pages with highlighted words. "[URI]" is replaced by the URI of each document.
- candetail : specifies whether to enable detail display of a document. "true" or "false".
- candir : specifies whether to enable directory display of a document. "true" or "false".
- auxmin : specifies the minimum hits to adopt result of the auxiliary index. If it is less than 1, the auxiliary index is not used.
- smlrvnum : specifies the number of keywords for similarity search. If it is less than 1, similarity search is disabled.
- smlrtune : specifies tuning parameters for similarity search; the number of keywords, the number of documents per keyword, and the number of all candidates are separated by space characters.
- clipview : specifies the number of clipped documents to be shown. If it is negative, similarity eclipse is disabled.
- clipweight : specifies weighting algorithm of documents clipping. "none" or "url".
- relkeynum : specifies the number of related terms to be shown.
- spcache : specifies the name of an attribute of the special cache.
- wildmax : specifies the maximum number of expansion of wild cards.
- qxpndcmd : specifies a command line for query expansion. If it is an empty string, query expansion is disabled.
- logfile : specifies the path of the log file.
- logformat : specifies the format of each log data.
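The behavior of the replace variable described above can be illustrated in Python. This is a sketch under assumptions: Python's re module stands in for the regular expression engine of the CGI script, only the first match is rewritten, and the function name is hypothetical.

```python
import re

SEPARATOR = "{{!}}"


def apply_replace(rule, uri):
    """Apply one rule of the form regex, separator, replacement to a URI."""
    pattern, _, template = rule.partition(SEPARATOR)
    match = re.search(pattern, uri)
    if match is None:
        return uri  # the rule does not apply

    def expand(token):
        text = token.group(0)
        if text == "&":
            return match.group(0)  # the whole matched string
        index = int(text[1])  # the digit of "\1" through "\9"
        if index > match.re.groups:
            return ""  # no such sub-expression in the pattern
        return match.group(index) or ""

    replacement = re.sub(r"&|\\[1-9]", expand, template)
    return uri[:match.start()] + replacement + uri[match.end():]


# Apply the two default rules in the order they are written.
uri = "file:///home/mikio/public_html/doc/index.html"
for rule in (
    "^file:///home/mikio/public_html/{{!}}http://localhost/",
    r"/index\.html?${{!}}/",
):
    uri = apply_replace(rule, uri)
print(uri)  # prints http://localhost/doc/
```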
Template File
The template file determines the appearance of the page. It is written in HTML and its content is output as it is, except that "<!--ESTTITLE-->" is replaced by the page title, "<!--ESTFORM-->" by the form to input search conditions, "<!--ESTRESULT-->" by the search result, and "<!--ESTINFO-->" by information about the index.
Top Page File
When a user first accesses the CGI script, or if no search condition is input, the content of the top page file is displayed instead of the search result. By default, the banner of Hyper Estraier is described there.
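How the template placeholders and the top page fit together can be sketched in Python. This is a hedged illustration rather than the logic of estseek.cgi: the function names are hypothetical, and it is an assumption that the top page content takes exactly the place of "<!--ESTRESULT-->" when no search phrase is given.

```python
import re

# The four placeholder comments recognized in the template file.
PLACEHOLDER = re.compile(r"<!--(ESTTITLE|ESTFORM|ESTRESULT|ESTINFO)-->")


def render_page(template, parts):
    """Replace each placeholder with its part; other HTML is kept as it is."""
    return PLACEHOLDER.sub(lambda m: parts.get(m.group(1), ""), template)


def result_part(phrase, search_result, top_page):
    """Choose the top page content when no search phrase is given."""
    return search_result if phrase.strip() else top_page


template = "<h1><!--ESTTITLE--></h1><!--ESTFORM--><!--ESTRESULT-->"
parts = {
    "ESTTITLE": "Search",
    "ESTFORM": "<form></form>",
    "ESTRESULT": result_part("", "<p>3 hits</p>", "<p>Welcome</p>"),
}
page = render_page(template, parts)
```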
Help File
When a user selects the "help" link near the input form, the content of the help file is displayed instead of the search result. By default, usage of the CGI script is described there.
Search Form
If you want to place the search form in another page, write the following HTML.
<form method="get" action="estseek.cgi">
<div>
<input type="text" name="phrase" value="" size="32" />
<input type="submit" value="Search" />
<input type="hidden" name="enc" value="UTF-8" />
</div>
</form>
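Submitting such a form issues a GET request whose query string carries the "phrase" and "enc" parameters. The following Python sketch builds an equivalent URL directly; the function name and the example address of the CGI script are illustrative assumptions.

```python
from urllib.parse import urlencode


def search_url(cgi_uri, phrase, enc="UTF-8"):
    """Build the URL that the search form would request."""
    # The phrase is percent-encoded in the same encoding named by "enc".
    query = urlencode({"phrase": phrase, "enc": enc}, encoding=enc)
    return cgi_uri + "?" + query


url = search_url("http://localhost/cgi-bin/estseek.cgi", "socket AND shutdown")
```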
Change "estseek.cgi" to the URI of estseek.cgi. Change "UTF-8" to the encoding name of the page.
Query Expansion
If you want query expansion, enable an outer command by editing qxpndcmd in estseek.conf. It specifies the absolute path of an arbitrary command which outputs synonyms of the word given by the environment variable `ESTWORD'.
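A query expansion command can be sketched in Python as follows. This is only an illustration: the synonym table and the function names are hypothetical, and the output format (one term per line, with the original word first) is an assumption that this guide does not specify.

```python
import os

# Hypothetical synonym table; a real command would consult a thesaurus.
SYNONYMS = {
    "car": ["automobile", "vehicle"],
    "quick": ["fast", "rapid"],
}


def expand_word(environ=os.environ):
    """Return the word given by ESTWORD followed by its synonyms."""
    word = environ.get("ESTWORD", "").strip().lower()
    if not word:
        return []
    return [word] + SYNONYMS.get(word, [])


def main(environ=os.environ):
    """Print one term per line (assumed output format)."""
    for term in expand_word(environ):
        print(term)
    return 0
```

A script used as the command would call `main()` at its end.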
Set qxpndcmd to "/usr/local/share/hyperestraier/filter/estwnetxpnd" to enable query expansion with WordNet, an English thesaurus. It requires WordNet to be installed on your system.
CGI Script for Highlight
This section describes the specification of estproxy.cgi. It displays documents with the search words highlighted.
Composition and Features
estproxy.cgi needs the configuration file estproxy.conf. The name of the configuration file is determined by changing the suffix of the CGI script to ".conf".
estproxy.cgi is installed as `/usr/local/libexec/estproxy.cgi'; copy it to a directory for CGI scripts. A sample configuration is installed as `/usr/local/share/hyperestraier/estproxy.conf'; copy and modify it.
estproxy.cgi works like a proxy server though it is implemented as a CGI script. So, you can browse arbitrary documents on the Web via the script by giving their URLs with the `url' parameter. Any data format other than HTML is converted into HTML. Moreover, words specified by the parameters `word1' through `word32' are highlighted.
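The parameters described above can be assembled into a request URL as in the following Python sketch. The function name and the example addresses are illustrative assumptions; only the parameter names `url' and `word1' through `word32' come from this guide.

```python
from urllib.parse import urlencode

MAX_WORDS = 32  # parameters run from word1 to word32


def proxy_url(proxy_uri, target_url, words):
    """Build a URL that asks estproxy.cgi to show a highlighted document."""
    params = [("url", target_url)]
    # Number the highlight words from 1 and ignore any beyond the limit.
    for number, word in enumerate(words[:MAX_WORDS], start=1):
        params.append(("word%d" % number, word))
    return proxy_uri + "?" + urlencode(params)


link = proxy_url(
    "http://localhost/cgi-bin/estproxy.cgi",
    "http://example.com/a.html",
    ["socket", "shutdown"],
)
```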
Supported protocols are HTTP and FILE. If a URL beginning with "http://" is specified, the document is fetched by HTTP (as for now, HTTPS is not supported). If a URL beginning with "file://" is specified, the document is read directly from the local file system of the server.
File formats supported by built-in filters are plain text (text/plain), HTML (text/html), and MIME (message/rfc822). Other formats can be handled with arbitrary outer commands.
Configuration
The configuration file is composed of lines, each holding the name of a variable and its value separated by ":". Lines beginning with "#" are ignored as comments. The default configuration is as follows.
#replace: ^http://localhost/{{!}}file:///home/mikio/public_html/
allowrx: ^http://
#allowrx: ^file://
denyrx: /\.
passaddr: 1
limitsize: 32
urlrule: \.est${{!}}text/x-estraier-draft
urlrule: \.(eml|mime|mht|mhtml)${{!}}message/rfc822
typerule: ^text/x-estraier-draft${{!}}[DRAFT]
typerule: ^text/plain${{!}}[TEXT]
typerule: ^(text/html|application/xhtml+xml)${{!}}[HTML]
typerule: ^message/rfc822${{!}}[MIME]
language: 0
shownavi: 1
The meaning of each variable is as follows.
- replace : specifies a regular expression and a replacement string to convert the URL of the target document. The regular expression and the replacement string are separated by "{{!}}". Each "&" in a replacement string is expanded to the matched string. Special escapes "\1" through "\9" referring to the corresponding matching sub-expressions are also supported. This can be specified more than once.
- allowrx : specifies a regular expression of URLs that are allowed to be visited. This can be specified more than once.
- denyrx : specifies a regular expression of URLs that are denied. This can be specified more than once.
- passaddr : specifies whether to pass the IP addresses of the clients to remote servers (0:no, 1:yes).
- limitsize : specifies the maximum size of downloaded data (in megabytes).
- urlrule : specifies URL rules (regular expressions and media types). This can be specified more than once.
- typerule : specifies media type rules (regular expressions and filter commands). This can be specified more than once.
- language : specifies the preferred language (0:English, 1:Japanese, 2:Chinese, 3:Korean, 4:misc).
- shownavi : specifies whether to show the navigation bar (0:no, 1:yes).
allowrx and denyrx are evaluated in the order of description. Alphabetical characters are matched case-insensitively.
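The evaluation of allowrx and denyrx can be sketched in Python. The guide states only that the rules are evaluated in order and that matching is case-insensitive. That the last matching rule decides, and that a URL matching no rule is refused, are assumptions of this sketch, as is the function name.

```python
import re


def is_allowed(url, rules):
    """rules is a list of (kind, regex) pairs in the order of description."""
    allowed = False  # assumption: a URL matching no rule is refused
    for kind, pattern in rules:
        if re.search(pattern, url, re.IGNORECASE):
            # assumption: a later matching rule overrides an earlier one
            allowed = (kind == "allowrx")
    return allowed


# The active rules of the default configuration.
DEFAULT_RULES = [("allowrx", r"^http://"), ("denyrx", r"/\.")]
```

Under these assumptions, the default rules admit "http://example.com/index.html" but refuse "http://example.com/.htaccess", since the later denyrx rule also matches it.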