This guide describes the usage of Hyper Estraier's web crawler. If you have not yet read the user's guide and the P2P guide, now is a good moment to do so.
estcmd can index files on the local file system only. Files on remote hosts can be indexed through NFS or SMB remote mounts, but the countless web sites on the Internet cannot be mounted that way. Web crawlers such as wget can prefetch such files, but doing so involves high overhead and wastes much disk space.
The command estwaver crawls arbitrary web sites and indexes their documents directly. estwaver is intelligent enough to support not only depth-first and breadth-first order but also similarity-oriented order: it preferentially crawls documents similar to specified seed documents.
The first step is to create the crawler root directory, which contains a configuration file and some databases. The following command creates casket, the crawler root directory:
estwaver init casket
By default, the configuration starts crawling at the project page of Hyper Estraier. Let's try it as is:
estwaver crawl casket
Documents are then fetched one after another and indexed. To stop the operation, press Ctrl-C on the terminal.
When the operation finishes, there is a directory _index in the crawler root directory. It is an index that can be handled with estcmd and related tools. Try searching the index with the following command:
estcmd search -vs casket/_index "hyper estraier"
To resume the crawling operation, run estwaver crawl again.
This section describes the specification of estwaver, whose purpose is to index documents on the Web.
estwaver is an aggregation of sub commands. The name of a sub command is specified by the first argument; the remaining arguments are parsed according to each sub command. The argument rootdir specifies the crawler root directory, which contains the configuration file and so on.
All sub commands return 0 if the operation succeeds, and 1 otherwise. A running crawler closes its databases and finishes when it catches signal 1 (SIGHUP), 2 (SIGINT), 3 (SIGQUIT), or 15 (SIGTERM).
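For example, assuming a POSIX shell, a background crawl can be stopped cleanly by sending SIGTERM. In this sketch, sleep stands in for a long-running estwaver crawl so the example is self-contained; the kill and wait lines are the same either way.

```shell
# Stand-in sketch: `sleep 1000` takes the place of `estwaver crawl casket &`.
sleep 1000 &
pid=$!
kill -TERM "$pid"    # signal 15: estwaver would close its databases and exit
wait "$pid" || true  # reap the process; estwaver itself exits with 0 or 1
echo "stopped"
```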
When crawling finishes, there is a directory _index in the crawler root directory. It is an index usable by estcmd and related tools.
The crawler root directory contains the following files and directories.
The configuration file is composed of lines; each line contains the name of a variable and its value, separated by ":". By default, the configuration is as follows.
seed: 1.5|http://fallabs.com/hyperestraier/uguide-en.html
seed: 1.0|http://fallabs.com/hyperestraier/pguide-en.html
seed: 1.0|http://fallabs.com/hyperestraier/nguide-en.html
seed: 0.0|http://fallabs.com/qdbm/
proxyhost:
proxyport:
interval: 500
timeout: 30
strategy: 0
inherit: 0.4
seeddepth: 0
maxdepth: 20
masscheck: 500
queuesize: 50000
replace: ^http://127.0.0.1/{{!}}http://localhost/
allowrx: ^http://
denyrx: \.(css|js|csv|tsv|log|md5|crc|conf|ini|inf|lnk|sys|tmp|bak)$
denyrx: \.(zip|tar|tgz|gz|bz2|tbz2|z|lha|lzh)(\?.*)?$
denyrx: ://(localhost|[a-z]*\.localdomain|127\.0\.0\.1)/
noidxrx: /\?[a-z]=[a-z](;|$)
urlrule: \.est${{!}}text/x-estraier-draft
urlrule: \.(eml|mime|mht|mhtml)${{!}}message/rfc822
typerule: ^text/x-estraier-draft${{!}}[DRAFT]
typerule: ^text/plain${{!}}[TEXT]
typerule: ^(text/html|application/xhtml+xml)${{!}}[HTML]
typerule: ^message/rfc822${{!}}[MIME]
language: 0
textlimit: 128
seedkeynum: 256
savekeynum: 32
threadnum: 10
docnum: 10000
period: 10000s
revisit: 7d
cachesize: 256
#nodeserv: 1|http://admin:admin@localhost:1978/node/node1
#nodeserv: 2|http://admin:admin@localhost:1978/node/node2
#nodeserv: 3|http://admin:admin@localhost:1978/node/node3
logfile: _log
loglevel: 2
draftdir:
entitydir:
postproc:
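The replace variable holds a rewrite rule of the form regex{{!}}replacement, applied to URLs encountered by the crawler. As an illustrative sketch, sed can reproduce by hand the effect of the default rule, which maps loopback URLs to localhost (estwaver applies the rule internally; sed is used here only for demonstration):

```shell
# Reproduce the default replace rule by hand with sed (illustration only).
echo "http://127.0.0.1/index.html" |
  sed 's|^http://127\.0\.0\.1/|http://localhost/|'
# prints: http://localhost/index.html
```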
The meaning of each variable is as follows.
Variables such as seed and nodeserv can be specified more than once. allowrx, denyrx, and noidxrx are evaluated in the order of description, and alphabetical characters in their patterns are case-insensitive.
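To see how the default allow and deny patterns interact, here is a small sketch: grep stands in for the crawler's internal regex matching, and the hostname is a made-up example. A URL is crawled only if it matches allowrx and none of the denyrx patterns.

```shell
# Check a candidate URL against the default allowrx and the archive denyrx
# (illustration only; estwaver evaluates these patterns itself).
url="http://example.com/archive.tar.gz"
if echo "$url" | grep -qE '^http://' &&
   ! echo "$url" | grep -qE '\.(zip|tar|tgz|gz|bz2|tbz2|z|lha|lzh)(\?.*)?$'
then
  echo "crawl: $url"
else
  echo "skip: $url"
fi
# prints: skip: http://example.com/archive.tar.gz
```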
Arbitrary filter commands can be specified with typerule. The interface of a filter command is the same as that of the -fx option of estcmd gather. For example, the following specifies how to process PDF documents.
typerule: ^application/pdf${{!}}H@/usr/local/share/hyperestraier/filter/estfxpdftohtml
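As a sketch of what such a filter looks like, assuming the convention that the command receives the input file path and the output file path as its two arguments (and that the H@ prefix in the rule above marks the filter's output as HTML), a hypothetical trivial filter could wrap plain text in minimal HTML:

```shell
# Hypothetical filter sketch. Assumption: the filter is invoked as
#   command <infile> <outfile>
# and, because of the H@ prefix, its output is treated as HTML.
wrap_filter() {
  infile="$1"; outfile="$2"
  {
    printf '<html><body><pre>'
    cat "$infile"
    printf '</pre></body></html>'
  } > "$outfile"
}

# Usage sketch with a throwaway input file:
echo "hello crawler" > /tmp/in.txt
wrap_filter /tmp/in.txt /tmp/out.html
cat /tmp/out.html
```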