When indexer finds a new URL during crawling,
it checks whether the URL has
a corresponding Web space definition command
Server,
Realm or
Subnet
given in indexer.conf.
URLs which do not have a corresponding Web space definition
command are not added into the database.
Also, URLs that are already present in the search database
but no longer have a corresponding Web space definition command
are deleted from the database. This can happen after some of
the definition commands have been removed from indexer.conf.
The Web space definition commands have the following format:
Server [Method] [SubSection] <pattern> [alias]
Realm [Method] [CaseType] [MatchType] [CmpType] <pattern> [alias]
Subnet [Method] [MatchType] <pattern>
The mandatory parameter pattern specifies a URL,
a part of a URL, or a pattern.
The optional parameter method
specifies the action for this command.
It can take one of the following values:
Allow,
Disallow,
HrefOnly,
CheckOnly,
Skip,
CheckMP3,
CheckMP3Only.
By default, the value Allow is used.
Allow
Allow specifies that all corresponding
documents will be indexed and scanned for new links.
Depending on Content-Type,
an external parser can be executed if needed.
Disallow
Disallow specifies that all corresponding
documents will be ignored and deleted from the database.
HrefOnly
HrefOnly specifies that all corresponding
documents will only be scanned for new links, but their
content won't be indexed. This is useful, for example,
when indexing mail archives, when the index pages are only
scanned for links to new messages.
Server HrefOnly Page http://www.mail-archive.com/general%40mnogosearch.org/
Server Allow Path http://www.mail-archive.com/general%40mnogosearch.org/
CheckOnly
CheckOnly specifies that all corresponding
documents will be requested using the HEAD
HTTP method instead of the default GET method.
When using CheckOnly, only brief information
about the documents (such as size, last modification time,
content type) will be fetched. This method can be helpful
to detect broken links on your site. For example:
Server HrefOnly http://www.mnogosearch.org/
Realm CheckOnly *
These commands instruct indexer
to scan all documents on the site www.mnogosearch.org
and collect all outgoing links. Brief info about every document
outside www.mnogosearch.org will be requested
using the HEAD method. After indexing is done,
use the indexer -S command to see if there are
any pages with the status 404 Not found.
Skip
Skip specifies that all corresponding
documents will be skipped while indexing. This is useful
when you need to temporarily disable reindexing of some sites
but keep them available for search with their previous content.
indexer will mark these documents as "fresh"
and put them at the end of its queue.
CheckMP3
CheckMP3 specifies that the corresponding
documents will be checked for MP3 tags even if the
Content-Type is not equal
to audio/mpeg. This is useful if the remote
server sends application/octet-stream as
Content-Type for MP3 files. If
MP3 tags are found in a document, they will be indexed;
otherwise the document will be further processed according
to its Content-Type.
CheckMP3Only
This method is very similar to CheckMP3,
but if MP3 tags are not found in a document,
the document is not processed any further.
The optional SubSection parameter specifies
the pattern match method, which can be one of the following values:
page, path,
site, world,
with path being the default.
Server path
All URLs from the same directory match. For example, if:
Server path http://localhost/path/to/index.html
is given, all URLs starting with
http://localhost/path/to/
will match this command.
The following commands have the same effect
when searching for a matching Web space definition command:
Server path http://localhost/path/to/index.html
Server path http://localhost/path/to/index
Server path http://localhost/path/to/index.cgi?q=bla
Server path http://localhost/path/to/index?q=bla
Server site
All URLs from the same host match.
For example, Server site http://localhost/path/to/a.html
allows indexing of the entire site
http://localhost/.
Server world
If the world subsection is specified,
then absolutely any URL will correspond
to this definition command. See the explanation below.
Server page
Exact match: only the given URL matches this command.
subsection in news:// schema
Subsection is always considered as site
for the news:// URL schema.
This is because unlike ftp:// or
http://, the news:// schema
has no recursive paths.
Use Server news://news.server.com/ to index
the whole news server or, for example,
Server news://news.server.com/udm to index all
messages from the /udm hierarchy.
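The four SubSection modes described above can be sketched in a few lines of Python. This is only an illustration of the matching rules as stated in the text, not the code used by indexer; the function name server_matches is hypothetical, and the special handling of news:// URLs is left out.

```python
from urllib.parse import urlsplit

def server_matches(subsection, pattern, url):
    """Return True if url corresponds to a Server command with this subsection."""
    if subsection == "world":
        return True                      # absolutely any URL corresponds
    if subsection == "page":
        return url == pattern            # exact match only
    p, u = urlsplit(pattern), urlsplit(url)
    if subsection == "site":
        # Same scheme and host
        return (p.scheme, p.netloc) == (u.scheme, u.netloc)
    if subsection == "path":
        # Keep the pattern up to and including the last "/" of its path;
        # the file name and the query string are ignored.
        base = f"{p.scheme}://{p.netloc}{p.path.rsplit('/', 1)[0]}/"
        return url.startswith(base)
    raise ValueError(f"unknown subsection: {subsection}")
```

With this sketch, the four equivalent Server path commands listed above all reduce to the same prefix, http://localhost/path/to/.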
The optional parameter CaseType specifies case
sensitivity for string comparison. It can take one of the following
values: case - case sensitive comparison,
or nocase - case insensitive comparison.
The optional parameter CmpType specifies
the comparison type and can take two values:
Regex and String.
String (wildcard comparison) is the default comparison type.
You can use the ? and * signs in the pattern;
they mean "one character" and "any number of characters" respectively.
For example, if you want to index all HTTP sites in the
.ru domain, you can use this command:
Realm http://*.ru/*
The regex comparison type says that
the pattern is a regular expression. For example, you can describe
everything in the .ru domain using the
regex comparison type:
Realm Regex ^http://.*\.ru/
The optional parameter MatchType
can be Match or NoMatch,
with Match as default.
Realm NoMatch has the reverse effect:
URLs that do not match the given pattern
will correspond to this Realm command.
For example, use this command to index everything but the
.com domain:
Realm NoMatch http://*.com/*
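The interplay of CmpType and MatchType can be illustrated with a short Python sketch. It assumes that a String pattern must cover the whole URL and that a Regex pattern may match anywhere in it; both are assumptions made for the illustration, and the names wildcard_to_regex and realm_matches are hypothetical.

```python
import re

def wildcard_to_regex(pattern):
    """Translate a String pattern: '*' = any number of chars, '?' = one char."""
    escaped = re.escape(pattern)
    return escaped.replace(r"\*", ".*").replace(r"\?", ".")

def realm_matches(pattern, url, cmp_type="String", match_type="Match"):
    """Return True if url corresponds to a Realm command with these options."""
    if cmp_type == "Regex":
        hit = re.search(pattern, url) is not None
    else:
        hit = re.fullmatch(wildcard_to_regex(pattern), url) is not None
    # NoMatch reverses the result of the comparison
    return hit if match_type == "Match" else not hit
```

For example, realm_matches("http://*.com/*", url, match_type="NoMatch") is true for every URL outside the .com domain.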
The optional alias argument provides
URL rewrite rules, described in detail in the section called Aliases.
indexer examines the
Web space definition commands in the order of their appearance
in indexer.conf.
Thus, if you want to give different parameters to
a site and its subsections, you can add the command
describing a subsection before the command describing
the entire site. Imagine that you have a subdirectory
which contains news articles and want those articles
to be reindexed more often than the rest of the site.
The following combination can be useful in this case:
# Add subsection
Period 200000
Server http://servername/news/
# Add server
Period 600000
Server http://servername/
These commands give different reindexing periods for the
/news/ subdirectory and the rest of the site.
indexer will choose the first command for
the URL http://servername/news/page1.html.
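The first-match rule can be sketched as a simple ordered lookup. The snippet below is only an illustration: the list COMMANDS mirrors the two Server commands above, and the function name period_for is hypothetical.

```python
# (URL prefix, Period value), listed in the order used in indexer.conf
COMMANDS = [
    ("http://servername/news/", 200000),
    ("http://servername/", 600000),
]

def period_for(url, commands=COMMANDS):
    """Return the Period of the first command whose prefix matches, else None."""
    for prefix, period in commands:
        if url.startswith(prefix):
            return period
    return None
```

If the two entries were swapped, the command for the entire site would match first and the shorter period for /news/ would never be applied.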
By default, indexer follows every link it finds
that has a corresponding Web space definition command
in the indexer.conf file.
indexer jumps between sites if both
of them have a corresponding Web space definition command.
For example, there are two commands:
Server http://www/
Server http://web/
When indexing http://www/page1.html
indexer WILL follow the link http://web/page2.html.
Note that although these pages are on different sites, BOTH of
them have a corresponding Web space definition command.
If we delete one of the commands, indexer
will remove the URLs of that server from the database as they expire
during the next crawling sessions.
mnoGoSearch offers a flexible technique
of aliases and reverse aliases, making it possible to index sites by downloading
documents from another location. For example, if you index your local web server,
it is possible to load pages directly from the hard disk without involving your
web server in the crawling process. Another example is building a search engine
for the primary site while using its mirror to download the documents.
Different ways of using aliases are described in the next sections.
The Alias indexer.conf command
uses this format:
Alias <masterURL> <mirrorURL>
For example, if you wish to index http://www.mnogosearch.ru/
using the nearest German mirror
http://www.gstammw.de/mirrors/mnoGoSearch/, you can add these lines
into your indexer.conf:
Server http://www.mnogosearch.ru/
Alias http://www.mnogosearch.ru/ http://www.gstammw.de/mirrors/mnoGoSearch/
When crawling, indexer will download the
documents from the mirror site http://www.gstammw.de/mirrors/mnoGoSearch/.
At search time search.cgi will display URLs from
the master site http://www.mnogosearch.ru/.
Another example: You want to index all sites from the domain
udm.net. Suppose one of the servers (e.g.
http://home.udm.net/) is stored on the local machine in
the directory /home/httpd/htdocs/. These commands will be useful:
Realm http://*.udm.net/
Alias http://home.udm.net/ file:///home/httpd/htdocs/
indexer will load documents from the site home.udm.net
using the local disk, and will use HTTP for the other sites.
Aliases are searched in the order of their appearance in indexer.conf.
So, you can create different aliases for a server and its parts:
# First, create alias for example for /stat/ directory which
# is not under common location:
Alias http://home.udm.net/stat/ file:///usr/local/stat/htdocs/
# Then create alias for the rest of the server:
Alias http://home.udm.net/ file:///usr/local/apache/htdocs/
Note: If you change the order of these commands, the alias for the
directory /stat/ will never be found.
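The effect of the alias order can be reproduced with a small Python sketch. It treats every alias as a plain prefix replacement and takes the first one that matches, which is how the text describes the lookup; the function name apply_alias is hypothetical.

```python
# (master URL prefix, location to download from), in indexer.conf order
ALIASES = [
    ("http://home.udm.net/stat/", "file:///usr/local/stat/htdocs/"),
    ("http://home.udm.net/", "file:///usr/local/apache/htdocs/"),
]

def apply_alias(url, aliases=ALIASES):
    """Rewrite url using the first alias whose prefix matches."""
    for master, mirror in aliases:
        if url.startswith(master):
            return mirror + url[len(master):]
    return url  # no alias: download from the original location
```

With the entries reversed, http://home.udm.net/stat/index.html would be mapped under /usr/local/apache/htdocs/stat/ instead of /usr/local/stat/htdocs/.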
You can specify the location used by indexer as an optional argument
in a Server command:
Server http://home.udm.net/ file:///home/httpd/htdocs/
Aliases in the Realm command
are based on regular expressions.
The implementation of this feature resembles PHP's preg_replace()
function. Aliases in the Realm command
work only if the regex comparison type is used, and do not work
with the string comparison type.
Use this syntax for Realm aliases:
Realm regex <URL_pattern> <alias_pattern>
When indexer finds a URL matching
URL_pattern, it builds an alias using
alias_pattern. alias_pattern
can contain references of the form $n, where n is a number in the range of 0-9.
Every reference is replaced with the text captured by the
n-th parenthesized sub-pattern.
$0 refers to text matched by the whole pattern.
Opening parentheses are counted from left to right
(starting from 1) to obtain the number of the capturing
sub-pattern.
Example: your company hosts a few hundred users with their own domains in the form
of www.username.yourname.com. All user sites are stored on
the disk in the subdirectory /htdocs under their home
directories: /home/username/htdocs/.
You can write this command into indexer.conf
(note that the dot '.' character has a special meaning in regular expressions
and should be escaped with a '\' sign when dot is used in its literal meaning):
Realm regex (http://www\.)(.*)(\.yourname\.com/)(.*) file:///home/$2/htdocs/$4
Imagine that indexer processes a document
located at http://www.john.yourname.com/news/index.html.
These patterns will be captured:
$0 = http://www.john.yourname.com/news/index.html (the whole pattern match)
$1 = http://www. - subpattern matching (http://www\.)
$2 = john - subpattern matching (.*)
$3 = .yourname.com/ - subpattern matching (\.yourname\.com/)
$4 = news/index.html - subpattern matching (.*)
After the matches are found, the subpatterns $2
and $4 are substituted into
alias_pattern, which results in this alias:
file:///home/john/htdocs/news/index.html
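The substitution can be checked with Python's re module, whose group numbering follows the same left-to-right rule. This is a sketch of the mechanism only, not the code used by indexer; the function name realm_alias is hypothetical, and replacing a reference to a non-existent group with an empty string is an assumption.

```python
import re

def realm_alias(url_pattern, alias_pattern, url):
    """Build an alias for url, or return None if url_pattern does not match."""
    m = re.search(url_pattern, url)
    if m is None:
        return None

    def expand(ref):
        n = int(ref.group(1))
        # $0 is the whole match, $1..$9 are the parenthesized sub-patterns.
        # Assumption: a reference to a missing group expands to nothing.
        return (m.group(n) or "") if n <= m.re.groups else ""

    return re.sub(r"\$(\d)", expand, alias_pattern)

alias = realm_alias(
    r"(http://www\.)(.*)(\.yourname\.com/)(.*)",
    "file:///home/$2/htdocs/$4",
    "http://www.john.yourname.com/news/index.html",
)
print(alias)
```

For the URL above, the sketch produces the same alias as the one shown in the example.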
AliasProg can be useful for
a web hosting company that indexes its customers' web sites by loading documents
directly from the disk, without involving the HTTP server in the
crawling process (to offload the server). The document layout can be too complex
to describe using the Server or
Realm
commands. AliasProg defines an external
program that is executed with a URL as a command line argument and
returns the corresponding alias on STDOUT.
Use $1 to pass URLs to the command line.
The command in this example uses the replace program
from the MySQL distribution and replaces the URL
substring http://www.apache.org/ with
file:///usr/local/apache/htdocs/:
AliasProg "echo $1 | /usr/local/mysql/bin/mysql/replace http://www.apache.org/ file:///usr/local/apache/htdocs/"
You can write your own complex program for converting URLs into
their aliases using any preferred programming language.
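As an illustration, a minimal program of this kind could look like the Python script below. It is a sketch under several assumptions: it rewrites only the leading part of the URL (the replace program substitutes the substring wherever it occurs), it prints the URL unchanged when no mapping applies, and the names MASTER, LOCAL and to_alias are hypothetical.

```python
#!/usr/bin/env python3
"""Hypothetical AliasProg helper: print the alias for the URL given in argv[1]."""
import sys

MASTER = "http://www.apache.org/"
LOCAL = "file:///usr/local/apache/htdocs/"

def to_alias(url):
    """Map a URL under MASTER to the matching file under LOCAL."""
    if url.startswith(MASTER):
        return LOCAL + url[len(MASTER):]
    return url  # assumption: unmapped URLs are returned as they are

if __name__ == "__main__" and len(sys.argv) > 1:
    # The alias is written to STDOUT, where indexer reads it
    print(to_alias(sys.argv[1]))
```

If the script were saved under a path such as /usr/local/bin/url2alias.py (a made-up location), it could be referenced as AliasProg "/usr/local/bin/url2alias.py $1".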
The ReverseAlias indexer.conf
command allows mapping of URLs before a URL is inserted into the database. Unlike the
Alias command (which
performs mapping right before a document is downloaded), the ReverseAlias command performs mapping
immediately after a new link is found.
ReverseAlias http://name2/ http://name2.yourname.com/
Server http://name2.yourname.com/
In the above example, all links with the short server name
will be converted to links with the full server name and will be put
into the database after the conversion.
Another possible use of the ReverseAlias
is stripping off various undesired query string parameters like
PHPSESSID=XXXX.
The following example strips the
PHPSESSID=XXXX part from URLs
like http://www/a.php?PHPSESSID=XXX, where there
are no query string parameters other than PHPSESSID.
The question mark is deleted as well:
ReverseAlias regex (http://[^?]*)[?]PHPSESSID=[^&]*$ $1
The next example strips PHPSESSID=XXXX from URLs
like http://www/a.php?PHPSESSID=xxx&.., that is, when
PHPSESSID=XXXX is the very first query string
parameter and is followed by other parameters.
The ampersand sign & after the
PHPSESSID=XXXX part is deleted as well.
The question mark ? is not deleted:
ReverseAlias regex (http://[^?]*[?])PHPSESSID=[^&]*&(.*) $1$2
The last example strips the PHPSESSID=XXXX part from URLs
like http://www/a.php?a=b&PHPSESSID=xxx or
http://www/a.php?a=b&PHPSESSID=xxx&c=d,
where PHPSESSID=XXXX is not the first parameter.
The ampersand sign & before
PHPSESSID=XXXX is deleted:
ReverseAlias regex (http://.*)&PHPSESSID=[^&]*(.*) $1$2
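The three cases can be tried out in Python before they are put into indexer.conf. The sketch below rewrites the rules in Python's replacement syntax (\1 instead of $1) and applies the first rule that matches; how indexer itself combines several ReverseAlias commands is not covered by the sketch, and the function name strip_session is hypothetical.

```python
import re

# (pattern, replacement) pairs for the three cases described above
RULES = [
    # PHPSESSID is the only parameter: drop it together with the "?"
    (r"(http://[^?]*)[?]PHPSESSID=[^&]*$", r"\1"),
    # PHPSESSID is the first of several parameters: drop it and the "&" after it
    (r"(http://[^?]*[?])PHPSESSID=[^&]*&(.*)", r"\1\2"),
    # PHPSESSID is not the first parameter: drop it and the "&" before it
    (r"(http://.*)&PHPSESSID=[^&]*(.*)", r"\1\2"),
]

def strip_session(url):
    """Return url without its PHPSESSID parameter."""
    for pattern, repl in RULES:
        new_url, count = re.subn(pattern, repl, url, count=1)
        if count:
            return new_url
    return url
```

For instance, strip_session("http://www/a.php?a=b&PHPSESSID=xxx&c=d") gives http://www/a.php?a=b&c=d.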
It is also possible to define aliases in the search template (search.htm).
The Alias command in search.htm
is identical to the one in indexer.conf, but is
applied at search time rather than during crawling.
The syntax of the Alias
command in search.htm is similar to indexer.conf:
Alias <find-prefix> <replace-prefix>
Suppose your search.htm has the following
command:
Alias http://localhost/ http://www.mnogo.ru/
When search.cgi returns a page with
the URL http://localhost/news/article10.html,
it will be replaced with
http://www.mnogo.ru/news/article10.html.
Note: When you need aliases, you can put aliases either into indexer.conf
(to convert the remote notation to the local notation during crawling
time) or into search.htm (to convert the
local notation to the remote notation during search time). Use the
approach which looks more convenient for you.
The quickest way to specify URLs to be indexed by mnoGoSearch is
to list them using the Server or Realm directives in the indexer.conf file.
However, in some cases users might already have URLs saved in an SQL database,
and it would be much simpler to have mnoGoSearch use this information. This can be
done using the ServerTable command,
which is available in mnoGoSearch starting from version 3.3.7.
When ServerTable mysql://user:pass@host/dbname/my_server?srvinfo=my_srvinfo
is specified, indexer loads server information from
the given SQL table my_server and loads
the server parameters from the table my_srvinfo.
The following sections provide step-by-step instructions on how to create,
populate and load Server tables.
The tables server and srvinfo
that are already present in mnoGoSearch are used internally. Do not
use these tables to insert your own URLs. Instead, create
your own tables with similar structures.
For example, with MySQL you can do:
CREATE TABLE my_server LIKE server;
CREATE TABLE my_srvinfo LIKE srvinfo;
Note:
You may find it useful to modify some of the column types,
for example, to add the AUTO_INCREMENT flag to rec_id.
However, don't change the column names - mnoGoSearch looks up the columns
by their names.
Now that you have your custom tables, you can load data:
INSERT INTO my_server (rec_id, enabled, command, url) VALUES (1, 1, 'S', 'http://server1/');
INSERT INTO my_srvinfo (srv_id, sname, sval) VALUES (1, 'Period', '30d');
INSERT INTO my_server (rec_id, enabled, command, url) VALUES (2, 1, 'S', 'http://server2/');
INSERT INTO my_srvinfo (srv_id, sname, sval) VALUES (2, 'MaxHops', '3');
The columns rec_id, enabled and url
must be specified in the INSERT INTO my_server statements.
The columns parent and pop_weight
should NOT be specified, as these columns are used by mnoGoSearch internally.
The columns tag, category,
ordre, weight can be specified optionally.
my_srvinfo is a child table of my_server.
These tables are joined using the condition
my_server.rec_id = my_srvinfo.srv_id.
sname in the table my_srvinfo
is the name of a directive that might be specified for the particular URL
in indexer.conf. For example, you might want to specify Period
of "30d" for the respective URL,
so you insert a record with sname="Period" and sval="30d",
or set MaxHops to "3",
so you insert a record with sname="MaxHops" and sval="3".
The meaning of various columns is explained in the Section called Database schema in Chapter 12.
Note:
Look at the data in the srvinfo table for examples of how it is used.
Now that you have data in your custom Server tables, you need to specify the new tables in indexer.conf.
Just add the following line:
ServerTable mysql://user:pass@host/dbname/my_server?srvinfo=my_srvinfo
Note:
If the srvinfo parameter is omitted,
parameters are loaded from the table with name srvinfo
by default.
A quick way to test if your Server table works fine is to insert one or two URLs into
the my_server table that do not already exist in your indexer.conf,
then run indexer and specify that only the given URLs are to be indexed, e.g.:
./indexer -a -u http://server1/
./indexer -a -u http://server2/
If it is working properly, you should see the test URLs being indexed.
1) You can create as many custom server/srvinfo tables as you like,
and then specify each pair in the indexer.conf file using a different
ServerTable directive with the appropriate values.
2) Using your own Server table does not stop other URLs that
are specified in your indexer.conf from being indexed. indexer will do both.
So you can define some non-changing URLs in the indexer.conf file, and
put the URLs that tend to come and go into your custom Server table.
You can also write some scripts that copy URLs from
your own database into your custom Server table used by mnoGoSearch.
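Such a script might look like the following Python sketch. To stay self-contained it uses an in-memory SQLite database with a simplified schema as a stand-in for the real tables; the schema, the automatically assigned rec_id, and the function name copy_urls are all assumptions made for the illustration, and a real script would connect to the database named in the ServerTable command.

```python
import sqlite3

def copy_urls(conn, urls, period="30d"):
    """Insert each URL into my_server and a Period row into my_srvinfo."""
    cur = conn.cursor()
    for url in urls:
        cur.execute(
            "INSERT INTO my_server (enabled, command, url) VALUES (1, 'S', ?)",
            (url,),
        )
        # srv_id must refer to the rec_id of the row inserted just above
        cur.execute(
            "INSERT INTO my_srvinfo (srv_id, sname, sval) VALUES (?, 'Period', ?)",
            (cur.lastrowid, period),
        )
    conn.commit()

# Stand-in database with a simplified, assumed schema
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE my_server (rec_id INTEGER PRIMARY KEY, enabled INTEGER,
                            command TEXT, url TEXT);
    CREATE TABLE my_srvinfo (srv_id INTEGER, sname TEXT, sval TEXT);
""")
copy_urls(conn, ["http://server1/", "http://server2/"])

# Join the two tables on my_server.rec_id = my_srvinfo.srv_id
rows = conn.execute(
    "SELECT s.url, i.sname, i.sval FROM my_server s "
    "JOIN my_srvinfo i ON s.rec_id = i.srv_id ORDER BY s.rec_id"
).fetchall()
```

The final query uses the same join condition that links the parameters to their URLs.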