This guide describes the usage of Hyper Estraier's web crawler. If you have not yet read the user's guide and the P2P guide, now is a good moment to do so.
estcmd can index files on the local file system only. Files on remote hosts can be indexed through NFS or SMB remote mounts, but the countless web sites on the Internet cannot be mounted that way. Web crawlers such as wget can prefetch such files, but doing so involves high overhead and wastes much disk space.
The command estwaver crawls arbitrary web sites and indexes their documents directly. estwaver is intelligent enough to support not only depth-first and breadth-first order but also similarity-oriented order: it preferentially crawls documents similar to specified seed documents.
The first step is to create the crawler root directory, which contains a configuration file and some databases. The following command creates casket, the crawler root directory:
estwaver init casket
By default, the configuration starts crawling at the project page of Hyper Estraier. Let's try it as is:
estwaver crawl casket
Documents are then fetched one after another and indexed. To stop the operation, press Ctrl-C on the terminal.
When the operation finishes, there is a directory _index in the crawler root directory. It is an index that can be handled with estcmd and related tools. Try searching the index with the following command:
estcmd search -vs casket/_index "hyper estraier"
To resume the crawling operation, run estwaver crawl again.
This section describes the specification of estwaver, whose purpose is to index documents on the Web.
estwaver is an aggregation of sub commands. The name of a sub command is specified by the first argument; the remaining arguments are parsed according to each sub command. The argument rootdir specifies the crawler root directory, which contains the configuration file and so on.
All sub commands return 0 if the operation succeeds, and 1 otherwise. A running crawler closes its databases and finishes when it catches signal 1 (SIGHUP), 2 (SIGINT), 3 (SIGQUIT), or 15 (SIGTERM).
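For example, assuming a POSIX shell, a background crawl can be stopped cleanly by sending SIGTERM. In this sketch, sleep stands in for a long-running estwaver crawl so the example is self-contained; the kill and wait lines are the same either way.

```shell
# Stand-in sketch: `sleep 1000` takes the place of `estwaver crawl casket &`.
sleep 1000 &
pid=$!
kill -TERM "$pid"    # signal 15: estwaver would close its databases and exit
wait "$pid" || true  # reap the process; estwaver itself exits with 0 or 1
echo "stopped"
```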
When crawling finishes, there is a directory _index in the crawler root directory. It is an index usable by estcmd and related tools.
The crawler root directory contains the following files and directories.
The configuration file is composed of lines; each line contains the name of a variable and its value, separated by ":". By default, the configuration is as follows.
seed: 1.5|http://fallabs.com/hyperestraier/uguide-en.html
seed: 1.0|http://fallabs.com/hyperestraier/pguide-en.html
seed: 1.0|http://fallabs.com/hyperestraier/nguide-en.html
seed: 0.0|http://fallabs.com/qdbm/
proxyhost:
proxyport:
interval: 500
timeout: 30
strategy: 0
inherit: 0.4
seeddepth: 0
maxdepth: 20
masscheck: 500
queuesize: 50000
replace: ^http://127.0.0.1/{{!}}http://localhost/
allowrx: ^http://
denyrx: \.(css|js|csv|tsv|log|md5|crc|conf|ini|inf|lnk|sys|tmp|bak)$
denyrx: \.(zip|tar|tgz|gz|bz2|tbz2|z|lha|lzh)(\?.*)?$
denyrx: ://(localhost|[a-z]*\.localdomain|127\.0\.0\.1)/
noidxrx: /\?[a-z]=[a-z](;|$)
urlrule: \.est${{!}}text/x-estraier-draft
urlrule: \.(eml|mime|mht|mhtml)${{!}}message/rfc822
typerule: ^text/x-estraier-draft${{!}}[DRAFT]
typerule: ^text/plain${{!}}[TEXT]
typerule: ^(text/html|application/xhtml+xml)${{!}}[HTML]
typerule: ^message/rfc822${{!}}[MIME]
language: 0
textlimit: 128
seedkeynum: 256
savekeynum: 32
threadnum: 10
docnum: 10000
period: 10000s
revisit: 7d
cachesize: 256
#nodeserv: 1|http://admin:admin@localhost:1978/node/node1
#nodeserv: 2|http://admin:admin@localhost:1978/node/node2
#nodeserv: 3|http://admin:admin@localhost:1978/node/node3
logfile: _log
loglevel: 2
draftdir:
entitydir:
postproc:
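The replace variable holds a rewrite rule of the form regex{{!}}replacement, applied to URLs encountered by the crawler. As an illustrative sketch, sed can reproduce by hand the effect of the default rule, which maps loopback URLs to localhost (estwaver applies the rule internally; sed is used here only for demonstration):

```shell
# Reproduce the default replace rule by hand with sed (illustration only).
echo "http://127.0.0.1/index.html" |
  sed 's|^http://127\.0\.0\.1/|http://localhost/|'
# prints: http://localhost/index.html
```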
The meaning of each variable is as follows.
Variables such as seed and nodeserv can be specified more than once. allowrx, denyrx, and noidxrx are evaluated in the order of description, and alphabetical characters in their patterns are case-insensitive.
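To see how the default allow and deny patterns interact, here is a small sketch: grep stands in for the crawler's internal regex matching, and the hostname is a made-up example. A URL is crawled only if it matches allowrx and none of the denyrx patterns.

```shell
# Check a candidate URL against the default allowrx and the archive denyrx
# (illustration only; estwaver evaluates these patterns itself).
url="http://example.com/archive.tar.gz"
if echo "$url" | grep -qE '^http://' &&
   ! echo "$url" | grep -qE '\.(zip|tar|tgz|gz|bz2|tbz2|z|lha|lzh)(\?.*)?$'
then
  echo "crawl: $url"
else
  echo "skip: $url"
fi
# prints: skip: http://example.com/archive.tar.gz
```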
Arbitrary filter commands can be specified with typerule. The interface of a filter command is the same as that of the -fx option of estcmd gather. For example, the following specifies how to process PDF documents.
typerule: ^application/pdf${{!}}H@/usr/local/share/hyperestraier/filter/estfxpdftohtml
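As a sketch of what such a filter looks like, assuming the convention that the command receives the input file path and the output file path as its two arguments (and that the H@ prefix in the rule above marks the filter's output as HTML), a hypothetical trivial filter could wrap plain text in minimal HTML:

```shell
# Hypothetical filter sketch. Assumption: the filter is invoked as
#   command <infile> <outfile>
# and, because of the H@ prefix, its output is treated as HTML.
wrap_filter() {
  infile="$1"; outfile="$2"
  {
    printf '<html><body><pre>'
    cat "$infile"
    printf '</pre></body></html>'
  } > "$outfile"
}

# Usage sketch with a throwaway input file:
echo "hello crawler" > /tmp/in.txt
wrap_filter /tmp/in.txt /tmp/out.html
cat /tmp/out.html
```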