Chapter 11. mnoGoSearch cluster
Starting with version 3.3.0, mnoGoSearch provides a clustered
solution that scales search across several computers,
extending the database size to several dozen or even hundreds
of millions of documents.
A typical cluster consists of several database machines (cluster nodes)
and a single front-end machine. The front-end machine receives HTTP
requests from a user's browser, forwards search queries to the
database machines over HTTP, receives a limited
number of the top search results (in a simple XML format
based on the OpenSearch specifications) from every database machine,
then parses and merges the results, ordering them by score,
and displays them using an HTML template. This approach
distributes the CPU- and disk-intensive operations
across the database machines in parallel, leaving only the simple merge
and HTML template processing to the front-end
machine.
In a clustered mnoGoSearch installation, all expensive operations
are done on the database machines.
The approximate distribution of time spent
on the different search steps is:
5% - fetching word information
30% - grouping words by doc ID
30% - calculating score for each document
20% - sorting documents according to score
10% - fetching cached copies and generating excerpts for the best 10 documents
5% - generating XML or HTML results using template
Each database machine returns its results to the front-end
in an XML format like the following:
<rss>
  <channel>
    <!-- General search results information block -->
    <openSearch:totalResults>20</openSearch:totalResults>
    <openSearch:startIndex>1</openSearch:startIndex>
    <openSearch:itemsPerPage>2</openSearch:itemsPerPage>
    <!-- Word statistics block -->
    <mnoGoSearch:WordStatList>
      <mnoGoSearch:WordStatItem order="0" count="300" word="apache"/>
      <mnoGoSearch:WordStatItem order="1" count="103" word="web"/>
      <mnoGoSearch:WordStatItem order="2" count="250" word="server"/>
    </mnoGoSearch:WordStatList>
    <!-- Document information block -->
    <item>
      <id>1</id>
      <score>70.25%</score>
      <title>Test page of the Apache HTTP Server</title>
      <link>http://hostname/index.html</link>
      <description>...to use the images below on Apache
      and Fedora Core powered HTTP servers. Thanks for using Apache ...
      </description>
      <updated>2006-12-13T18:30:02Z</updated>
      <content-length>3956</content-length>
      <content-type>text/html</content-type>
    </item>
    <!-- more items, typically 10 items total -->
  </channel>
</rss>
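As a rough illustration of how a front-end could read such a response, the following Python sketch parses a trimmed copy of the sample with the standard library. The namespace URIs declared on the rss element are placeholders assumed for this example (the sample does not show them), and parse_node_response is a hypothetical helper name, not part of mnoGoSearch.

```python
import xml.etree.ElementTree as ET

# Placeholder namespace URIs: assumptions made for this sketch only.
NS = {
    "openSearch": "http://example.org/opensearch",
    "mnoGoSearch": "http://example.org/mnogosearch",
}

SAMPLE = """<rss xmlns:openSearch="http://example.org/opensearch"
     xmlns:mnoGoSearch="http://example.org/mnogosearch">
<channel>
<openSearch:totalResults>20</openSearch:totalResults>
<openSearch:startIndex>1</openSearch:startIndex>
<openSearch:itemsPerPage>2</openSearch:itemsPerPage>
<mnoGoSearch:WordStatList>
<mnoGoSearch:WordStatItem order="0" count="300" word="apache"/>
<mnoGoSearch:WordStatItem order="1" count="103" word="web"/>
</mnoGoSearch:WordStatList>
<item>
<id>1</id>
<score>70.25%</score>
<title>Test page of the Apache HTTP Server</title>
<link>http://hostname/index.html</link>
</item>
</channel>
</rss>"""


def parse_node_response(xml_text):
    """Return (total_results, word_counts, items) from one node's response."""
    channel = ET.fromstring(xml_text).find("channel")
    total = int(channel.findtext("openSearch:totalResults", namespaces=NS))
    # Word statistics: word -> number of occurrences reported by the node.
    words = {
        w.get("word"): int(w.get("count"))
        for w in channel.iterfind(
            "mnoGoSearch:WordStatList/mnoGoSearch:WordStatItem", NS
        )
    }
    # Document entries; the score is sent as a percentage string.
    items = [
        {
            "id": int(item.findtext("id")),
            "score": float(item.findtext("score").rstrip("%")),
            "title": item.findtext("title"),
            "link": item.findtext("link"),
        }
        for item in channel.iterfind("item")
    ]
    return total, words, items
```

Calling parse_node_response(SAMPLE) yields the total result count, a word-to-count mapping, and a list of item dictionaries with numeric scores, which is the form a merge step would need.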
The front-end machine receives an XML response
from every database machine. On the first query,
the front-end machine requests the top 10 results
from every database machine. An XML response for
the top 10 results is about 5 KB. Parsing
each XML response takes less than 1% of the search time.
Thus, a cluster of 50 machines is about 50%
slower than a single machine (50 responses at up to 1% each),
but can search a collection of documents
50 times bigger.
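The estimate is simple arithmetic. The sketch below spells it out, assuming that parsing cost grows linearly with the number of responses and that each response costs about 1% of a single machine's search time. The function name is illustrative only.

```python
def frontend_overhead_percent(nodes, parse_percent_per_response=1):
    """Extra front-end parsing time, as a percentage of one node's search time.

    Assumes a linear model: every node contributes one response, and each
    response costs roughly the same share of time to parse.
    """
    return nodes * parse_percent_per_response
```

With 50 nodes and 1% per response, the model gives an overhead of 50%. Since the text puts the per-response cost at less than 1%, this is best read as an upper estimate.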
If the user is not satisfied with the search results
on the first page and navigates to later pages,
the front-end machine requests ps*np results
from each database machine, where ps is the page size
(10 by default) and np is the page number.
Thus, to display the 5th result page, the front-end
machine requests 50 results from every database
machine and has to do five times as much parsing,
which makes search on later pages slightly
slower. Typically, however, users look through no more
than 2-3 pages.
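The request size and the merge step can be sketched as follows. This is a minimal illustration, not mnoGoSearch's actual code: the function names are hypothetical, results are modeled as (score, url) tuples, and the final slice assumes the front-end shows positions (np-1)*ps to np*ps of the merged list.

```python
import heapq


def results_per_node(page_number, page_size=10):
    """Number of results requested from each node for a given page (ps * np)."""
    return page_size * page_number


def merge_page(node_results, page_number, page_size=10):
    """Merge per-node result lists and return the requested page.

    node_results: one list per node of (score, url) tuples, each list
    already limited to results_per_node(page_number, page_size) entries.
    """
    wanted = results_per_node(page_number, page_size)
    all_results = [r for results in node_results for r in results]
    # Keep the best `wanted` results overall, highest score first.
    best = heapq.nlargest(wanted, all_results, key=lambda r: r[0])
    start = (page_number - 1) * page_size
    return best[start:start + page_size]
```

Requesting ps*np results from every node is what makes the merge correct in the worst case, where all results for the requested page happen to live on a single node.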
mnoGoSearch supports two cluster types.
A "merge" cluster joins results from multiple independent machines,
each with a database created using its own indexer.conf. This type of cluster is recommended
when your web space can be divided evenly into separate databases
using URL patterns by means of the Server or
Realm commands.
For example, if you need to index three sites, siteA, siteB and siteC, with
approximately equal numbers of documents on each site, and you have three
cluster machines, nodeA, nodeB and nodeC, you can put each site on a separate
machine using a corresponding Server command in the indexer.conf file
on each cluster machine:
# indexer.conf on machine nodeA:
DBAddr mysql://root@localhost/test/
Server http://siteA/
# indexer.conf on machine nodeB:
DBAddr mysql://root@localhost/test/
Server http://siteB/
# indexer.conf on machine nodeC:
DBAddr mysql://root@localhost/test/
Server http://siteC/
A "distributed" cluster is created by a single indexer.conf,
with "indexer" automatically distributing search data
between database machines. This type of cluster is recommended when
it is hard to distribute web space between cluster machines using
URL patterns, for example when you don't know your site sizes or
the site sizes are very different.
Note:
Even distribution of search data between cluster machines is important to achieve
the best performance. The search front-end waits for the slowest cluster node.
Thus, if cluster machines nodeA and nodeB
return search results in 0.1 seconds and nodeC returns results
in 0.5 seconds, the overall cluster response time will be about 0.5 seconds.
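The effect of the slowest node can be modeled in one line. The sketch below assumes the nodes are queried in parallel and ignores the front-end's own merge and template time; the function name is illustrative.

```python
def cluster_response_time(node_times):
    """Approximate cluster latency in seconds: the slowest node dominates."""
    return max(node_times.values())
```

For the example in the note, cluster_response_time({"nodeA": 0.1, "nodeB": 0.1, "nodeC": 0.5}) gives 0.5, so speeding up nodeA or nodeB alone would not change the overall response time.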
On each database machine, install mnoGoSearch using the usual procedure:
Configure indexer.conf: edit DBAddr,
usually specifying a database installed on
the local host, for example:
DBAddr mysql://localhost/dbname/?dbmode=blob
Add a "Server" command corresponding
to the desired collection of documents,
a separate collection on every database machine.
Index the collection on every database machine
by running "indexer" and then "indexer -Eblob".
Configure search.htm by copying the DBAddr
command from indexer.conf.
Make sure that search works in the "usual" (non-clustered)
mode by opening http://hostname/cgi-bin/search.cgi
in your browser and typing a search query,
for example the word "test" or some other word
that is present in the document collection.
In addition to the usual installation steps,
you also need to configure the XML interface
on every database machine.
Go through the following steps:
cd /usr/local/mnogosearch/etc/
cp node.xml-dist node.xml
Edit node.xml, specifying the same DBAddr as in indexer.conf.
Make sure that XML search returns a well-formed response
(in the format shown above) by opening
http://hostname/cgi-bin/search.cgi/node.xml?q=test
After these steps, you will have several separate
document collections, each indexed into
its own database, and an XML interface configured
on every database machine.
On the front-end machine, install mnoGoSearch using the usual procedure,
then do the following additional steps:
cd /usr/local/mnogosearch/etc/
cp search.htm-dist search.htm
Edit search.htm, specifying the URLs of the XML interfaces
of all database machines and adding "?${NODE_QUERY_STRING}"
after "node.xml":
DBAddr http://hostname1/cgi-bin/search.cgi/node.xml?${NODE_QUERY_STRING}
DBAddr http://hostname2/cgi-bin/search.cgi/node.xml?${NODE_QUERY_STRING}
DBAddr http://hostname3/cgi-bin/search.cgi/node.xml?${NODE_QUERY_STRING}
You're done. Now open http://frontend-hostname/cgi-bin/search.cgi
in your browser and test searches.
Note: "DBAddr file:///path/to/response.xml" is also understood;
it loads an XML-formatted response from a static file.
This is mostly for testing purposes.
Install mnoGoSearch on a single database machine.
Edit indexer.conf by specifying multiple
DBAddr commands:
DBAddr mysql://hostname1/dbname/?dbmode=blob
DBAddr mysql://hostname2/dbname/?dbmode=blob
DBAddr mysql://hostname3/dbname/?dbmode=blob
and describe the web space using Realm or Server
commands. For example:
#
# The entire top level domain .ru,
# using http://www.ru/ as a start point
#
Server http://www.ru/
Realm http://*.ru/*
After that, install mnoGoSearch on all other database
machines and copy indexer.conf from the first database
machine. The indexer configuration is now complete. You
can start "indexer" on any database machine,
and then run "indexer -Eblob" after it finishes.
indexer will distribute data between the databases
specified in the DBAddr commands.
The number of the database a document is put into
is calculated by dividing the url.seed range (0..255)
among the DBAddr commands specified in indexer.conf,
where url.seed is calculated using hash(URL).
Thus, for an indexer.conf with three DBAddr commands,
the distribution is as follows:
URLs with seed 0..85 go to the first DBAddr
URLs with seed 86..170 go to the second DBAddr
URLs with seed 171..255 go to the third DBAddr
Prior to version 3.3.0, indexer could also distribute
data between several databases, but the distribution
was done using the remainder of dividing url.seed by the
number of DBAddr commands.
The new distribution style, introduced in 3.3.0,
simplifies manual redistribution of an existing clustered
database when adding a new DBAddr (i.e. a new database machine).
Future releases will likely provide automatic tools
for redistributing data when adding or removing machines in
an existing cluster, as well as more configuration commands to
control distribution.
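The two distribution styles described above can be compared with a small sketch. Everything here is illustrative: url_seed is a hypothetical stand-in for mnoGoSearch's hash(URL), and the formula seed * n // 256 is an assumption chosen because it splits the seed values 0..255 into contiguous, near-equal ranges. It is not taken from the mnoGoSearch source.

```python
import zlib


def url_seed(url):
    """Hypothetical stand-in for hash(URL): maps a URL to a value in 0..255."""
    return zlib.crc32(url.encode("utf-8")) & 0xFF


def db_index_by_range(seed, num_dbs):
    """Range-style distribution (assumed formula): contiguous seed ranges."""
    return seed * num_dbs // 256


def db_index_by_remainder(seed, num_dbs):
    """Pre-3.3.0 style as described in the text: remainder of the division."""
    return seed % num_dbs
```

With three databases, the assumed range formula sends seeds 0..85 to the first database, 86..170 to the second and 171..255 to the third. The remainder style instead interleaves neighbouring seeds across all databases, so no database owns a contiguous block of seeds.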
Configure the front-end machine following the same instructions
as for the "merge" cluster type.
Starting with version 3.3.9,
mnoGoSearch allows adding new nodes to a cluster (or removing nodes)
without having to re-crawl the documents.
Suppose you have 5 cluster nodes and want to extend
the cluster to 10 nodes. Go through the
following steps:
Stop all indexer processes.
Create 10 new SQL databases and
create a new .conf file with 10
DBAddr commands.
Note that the old and the new SQL databases must NOT overlap.
The new databases must be freshly created, empty databases.
Run
indexer -d /path/to/new.conf -Ecreate
to create the table structure in all 10 new SQL databases.
Make sure you have enough disk space: you will need extra disk space
of about twice the total size of the original SQL databases.
Create a directory, say, /usr/local/mnogosearch/dump-restore/,
where you will put the dump file, then go to this directory.
Run
indexer -Edumpdata | gzip > dumpfile.sql.gz
This creates the dump file.
Run
zcat dumpfile.sql.gz | indexer -d /path/to/new.conf -Esql -v2
This loads the information from the dump file into the new SQL databases.
Note that all document IDs will be automatically re-assigned.
Check that the restore worked correctly. These two commands should report
the same statistics:
indexer -Estat
indexer -Estat -d /path/to/new.conf
Run
indexer -d /path/to/new.conf -Eblob
to create the
inverted search index in the new SQL databases.
Configure a new search front-end to use the new SQL databases and
check that search returns the same results from the old and the new
databases.
As of version 3.3.0, mnoGoSearch allows joining up to 256
database machines into a single cluster.
Only the "body" and "title" sections are currently supported.
Support for other standard sections, such as meta.keywords and meta.description,
as well as for user-defined sections, will be added in future 3.3.x releases.
Clusters of clusters are not fully functional yet. They will likely
be implemented in one of the upcoming 3.3.x releases.
Popularity rank does not work well
with clusters. Links between documents residing on different
cluster nodes are not taken into account. This will be fixed
in future releases.