Crawler Guide

Copyright (C) 2004-2007 Mikio Hirabayashi
Last Update: Tue, 06 Mar 2007 12:05:18 +0900

Table of Contents

  1. Introduction
  2. Tutorial
  3. Crawler Command

Introduction

This guide describes the usage of Hyper Estraier's web crawler. If you haven't read the user's guide and the P2P guide yet, now is a good moment to do so.

estcmd can index files on the local file system only. Although files on remote hosts can be indexed via the NFS or SMB remote mount mechanisms, the countless web sites on the Internet cannot be mounted that way. Web crawlers such as wget can prefetch such files for local indexing, but doing so involves high overhead and wastes much disk space.

The command estwaver is useful to crawl arbitrary web sites and to index their documents directly. estwaver supports not only depth-first and breadth-first order but also similarity-oriented order: it crawls documents similar to specified seed documents preferentially.


Tutorial

The first step is to create the crawler root directory, which contains a configuration file and some databases. The following command creates casket, the crawler root directory:

estwaver init casket
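
estwaver init also accepts tuning options, described in the Crawler Command section below. For example, a sketch of creating a crawler root whose index is tuned for more than 300000 documents, using character category analysis instead of N-gram analysis:

estwaver init -acc -xl casket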

By default, the configuration starts crawling at the project page of Hyper Estraier. Let's try it as is:

estwaver crawl casket

Documents are then fetched one after another and indexed. To stop the operation, press Ctrl-C on the terminal.

When the operation finishes, there is a directory _index in the crawler root directory. It is an index which can be handled with estcmd and so on. Let's search the index with the following command:

estcmd search -vs casket/_index "hyper estraier"
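
Other options of estcmd search work here as well. As a sketch, the following limits output to 8 hits and adds a condition on the system attribute @title (the query words are arbitrary):

estcmd search -vs -max 8 -attr "@title STRINC guide" casket/_index "crawler"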

If you want to resume the crawling operation, run estwaver crawl again.
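
The crawl sub command also accepts modes, detailed in the next section. For example, to revisit the documents collected so far, or to revisit them and then continue crawling:

estwaver crawl -revisit casket
estwaver crawl -revcont casket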


Crawler Command

This section describes the specification of estwaver, whose purpose is to index documents on the Web.

Synopsis and Description

estwaver is an aggregation of sub commands. The name of a sub command is specified by the first argument. Other arguments are parsed according to each sub command. The argument rootdir specifies the crawler root directory, which contains the configuration file and so on.

estwaver init [-apn|-acc] [-xs|-xl|-xh] [-sv|-si|-sa] rootdir
Create the crawler root directory.
If -apn is specified, N-gram analysis is performed against European text also.
If -acc is specified, character category analysis is performed instead of N-gram analysis.
If -xs is specified, the index is tuned to register less than 50000 documents.
If -xl is specified, the index is tuned to register more than 300000 documents.
If -xh is specified, the index is tuned to register more than 1000000 documents.
If -sv is specified, scores are stored as void.
If -si is specified, scores are stored as 32-bit integer.
If -sa is specified, scores are stored as-is and marked not to be tuned at search time.
estwaver crawl [-restart|-revisit|-revcont] rootdir
Start crawling.
If -restart is specified, crawling is restarted from the seed documents.
If -revisit is specified, collected documents are revisited.
If -revcont is specified, collected documents are revisited and then crawling is continued.
estwaver unittest rootdir
Perform unit tests.
estwaver fetch [-proxy host port] [-tout num] [-il lang] url
Fetch a document.
url specifies the URL of a document.
-proxy specifies the host name and the port number of the proxy server.
-tout specifies timeout in seconds.
-il specifies the preferred language. By default, it is English.
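
For instance, a sketch of fetching a single page through a proxy with a 15 second timeout; proxy.example.com and its port 8080 are placeholders:

estwaver fetch -proxy proxy.example.com 8080 -tout 15 http://fallabs.com/hyperestraier/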

All sub commands return 0 if the operation succeeds, or 1 otherwise. A running crawler closes the database and finishes when it catches the signal 1 (SIGHUP), 2 (SIGINT), 3 (SIGQUIT), or 15 (SIGTERM).
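
Because the crawler shuts down cleanly on those signals, a long crawl can also be stopped from the shell. A minimal sketch, assuming the crawler was started in the background of the same shell session:

estwaver crawl casket &
sleep 3600
kill -TERM $!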

When crawling finishes, there is a directory _index in the crawler root directory. It is an index usable with estcmd and so on.
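
Because _index is an ordinary index, other estcmd sub commands apply to it as well; for example, a sketch of optimizing the index after a long crawl:

estcmd optimize casket/_index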

Constitution of the Crawler Root Directory

The crawler root directory contains the configuration file, the log file (_log by default), internal databases used to manage crawling, and the _index directory holding the resulting index.

Configuration File

The configuration file is composed of lines; each line contains the name of a variable and its value, separated by ":". By default, the configuration is as follows.

seed: 1.5|http://fallabs.com/hyperestraier/uguide-en.html
seed: 1.0|http://fallabs.com/hyperestraier/pguide-en.html
seed: 1.0|http://fallabs.com/hyperestraier/nguide-en.html
seed: 0.0|http://fallabs.com/qdbm/
proxyhost:
proxyport:
interval: 500
timeout: 30
strategy: 0
inherit: 0.4
seeddepth: 0
maxdepth: 20
masscheck: 500
queuesize: 50000
replace: ^http://127.0.0.1/{{!}}http://localhost/
allowrx: ^http://
denyrx: \.(css|js|csv|tsv|log|md5|crc|conf|ini|inf|lnk|sys|tmp|bak)$
denyrx: \.(zip|tar|tgz|gz|bz2|tbz2|z|lha|lzh)(\?.*)?$
denyrx: ://(localhost|[a-z]*\.localdomain|127\.0\.0\.1)/
noidxrx: /\?[a-z]=[a-z](;|$)
urlrule: \.est${{!}}text/x-estraier-draft
urlrule: \.(eml|mime|mht|mhtml)${{!}}message/rfc822
typerule: ^text/x-estraier-draft${{!}}[DRAFT]
typerule: ^text/plain${{!}}[TEXT]
typerule: ^(text/html|application/xhtml+xml)${{!}}[HTML]
typerule: ^message/rfc822${{!}}[MIME]
language: 0
textlimit: 128
seedkeynum: 256
savekeynum: 32
threadnum: 10
docnum: 10000
period: 10000s
revisit: 7d
cachesize: 256
#nodeserv: 1|http://admin:admin@localhost:1978/node/node1
#nodeserv: 2|http://admin:admin@localhost:1978/node/node2
#nodeserv: 3|http://admin:admin@localhost:1978/node/node3
logfile: _log
loglevel: 2
draftdir:
entitydir:
postproc:

The meaning of each variable is as follows.

allowrx, denyrx, and noidxrx are evaluated in the order in which they are written. Alphabetical characters are case-insensitive.
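
As a sketch of customizing these rules, the following crawls only a hypothetical site www.example.com, denying its /private/ area (both the host and the path are placeholders):

seed: 1.0|http://www.example.com/
allowrx: ^http://www\.example\.com/
denyrx: ^http://www\.example\.com/private/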

Arbitrary filter commands can be specified with typerule. The interface of a filter command is the same as with the -fx option of estcmd gather. For example, the following specifies how to process PDF documents.

typerule: ^application/pdf${{!}}H@/usr/local/share/hyperestraier/filter/estfxpdftohtml
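
The prefix before "@" selects how the filter's output is treated, mirroring the -fx convention of estcmd: "T@" for plain text, "H@" for HTML, and "M@" for MIME. For instance, a sketch with a hypothetical plain text filter /usr/local/bin/doctotext:

typerule: ^application/msword${{!}}T@/usr/local/bin/doctotext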