| Language group | Languages | Character sets |
| Group 1 | Western Europe: Albanian, Catalan, Danish, Dutch, English, Faeroese, Finnish, French, Galician, German, Icelandic, Italian, Norwegian, Portuguese, Spanish, Swedish | ASCII 8, CP437, CP850, CP860, CP1252, ISO 8859-1, ISO 8859-15, MacRoman, MacIceland |
| Group 2 | Eastern Europe: Croatian, Czech, Estonian, Hungarian, Latvian, Lithuanian, Polish, Romanian, Slovak, Slovene | CP852, CP1250, ISO 8859-2, MacCentralEurope, MacRomania, MacCroatian |
| Group 4 | Baltic | CP1257, ISO 8859-4, ISO 8859-13 |
| Group 5 | Cyrillic: Bulgarian, Belorussian, Macedonian, Russian, Serbian, Ukrainian | CP855, CP866, CP1251, ISO 8859-5, Koi8-r, Koi8-u, MacCyrillic |
| Group 6 | Arabic | CP864, CP1256, ISO 8859-6, MacArabic |
| Group 7 | Greek | CP869, CP1253, ISO 8859-7, MacGreek |
| Group 8 | Hebrew | CP1255, ISO 8859-8, MacHebrew |
| Group 9 | Turkish | CP857, CP1254, ISO 8859-9, MacTurkish |
| Group 101 | Japanese | Shift-JIS, EUC-JP, ISO-2022-JP |
| Group 102 | Simplified Chinese (PRC) | GB2312 |
| Group 103 | Traditional Chinese (ROC) | Big5 |
| Group 104 | Korean | EUC-KR |
| Group 105 | Thai | CP874, TIS 620, MacThai |
| Group 106 | Vietnamese | CP1258 |
| Group 107 | Indian | MacGujarati, TSCII |
| Group 108 | Georgian | geostd8 |
| Unicode | Over 650 languages | UTF-8 (Unicode) |
For example, if your search engine is configured with a LocalCharset from group 5 (Cyrillic), you can index servers containing documents in Bulgarian, Belorussian, Macedonian, Russian, Serbian and Ukrainian. Indexing a multi-language document in UTF-8 is also possible; however, the indexer will extract and save only the Cyrillic content of the page. To support more than 650 languages, set LocalCharset to UTF-8.
The indexer recodes all documents to the character set specified by the LocalCharset command in indexer.conf. Internal recoding is implemented using Unicode. Please note that some recoding procedures may lose data. For example, recoding between any Greek and any Russian charset loses all national characters. This does not matter for a single-language site. If you want to build a multilingual search engine, use UTF-8 as the LocalCharset.
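The data loss described above can be reproduced with a minimal sketch. It uses Python's built-in codecs as a stand-in for the indexer's own recoding tables, and the `recode` helper is hypothetical. Replacing unmappable characters with `?` is an assumption made for illustration; how the indexer itself handles them is not specified here.

```python
# Illustration only: Python's codecs stand in for the indexer's recoding tables.
def recode(raw: bytes, source: str, target: str) -> bytes:
    """Recode bytes from one charset to another, going through Unicode."""
    text = raw.decode(source)                      # source charset -> Unicode
    return text.encode(target, errors="replace")   # Unicode -> target charset

greek = "Αθήνα".encode("iso8859-7")     # a Greek document
russian = "Москва".encode("koi8-r")     # a Russian document

# With a Cyrillic local charset, Russian survives and Greek does not.
greek_in_cyrillic = recode(greek, "iso8859-7", "cp1251")
russian_in_cyrillic = recode(russian, "koi8-r", "cp1251")

# With UTF-8 as the local charset, both survive.
mixed_in_utf8 = (recode(greek, "iso8859-7", "utf-8")
                 + b" "
                 + recode(russian, "koi8-r", "utf-8"))
```

Every Greek letter is missing from CP1251, so the whole word is lost, while the Russian word round-trips unchanged.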
You may use the BrowserCharset command to choose the charset in which search results are displayed. BrowserCharset may differ from LocalCharset.
Each charset is recognized by any of its aliases, because different web servers may report the same charset in different notations. For example, ISO-8859-2, ISO8859-2 and latin2 all name the same charset. The search engine understands the following charset aliases:
Table 7-2. Charsets aliases
| ISO-2022-JP: | ISO-2022-JP |
| ISO-8859-1: | CP819, CSISOLATIN, IBM819, ISO-8859-1, ISO-IR-100, ISO_8859-1, ISO_8859-1:1987, L1, LATIN1 |
| ISO-8859-10: | CSISOLATIN6, ISO-8859-10, ISO-IR-157, ISO_8859-10, ISO_8859-10:1992, L6, LATIN6 |
| ISO-8859-11: | ISO-8859-11, TIS-620, TIS620, TACTIS |
| ISO-8859-13: | ISO-8859-13, ISO-IR-179, ISO_8859-13, L7, LATIN7 |
| ISO-8859-14: | ISO-8859-14, ISO-IR-199, ISO_8859-14, ISO_8859-14:1998, L8, LATIN8 |
| ISO-8859-15: | ISO-8859-15, ISO-IR-203, ISO_8859-15, ISO_8859-15:1998 |
| ISO-8859-16: | ISO-8859-16, ISO-IR-226, ISO_8859-16, ISO_8859-16:2000 |
| ISO-8859-2: | CSISOLATIN2, ISO-8859-2, ISO-IR-101, ISO_8859-2, ISO_8859-2:1987, L2, LATIN2 |
| ISO-8859-3: | CSISOLATIN3, ISO-8859-3, ISO-IR-109, ISO_8859-3, ISO_8859-3:1988, L3, LATIN3 |
| ISO-8859-4: | CSISOLATIN4, ISO-8859-4, ISO-IR-110, ISO_8859-4, ISO_8859-4:1988, L4, LATIN4 |
| ISO-8859-5: | CSISOLATINCYRILLIC, CYRILLIC, ISO-8859-5, ISO-IR-144, ISO_8859-5, ISO_8859-5:1988 |
| ISO-8859-6: | ARABIC, ASMO-708, CSISOLATINARABIC, ECMA-114, ISO-8859-6, ISO-IR-127, ISO_8859-6, ISO_8859-6:1987 |
| ISO-8859-7: | CSISOLATINGREEK, ECMA-118, ELOT_928, GREEK, GREEK8, ISO-8859-7, ISO-IR-126, ISO_8859-7, ISO_8859-7:1987 |
| ISO-8859-8: | CSISOLATINHEBREW, HEBREW, ISO-8859-8, ISO-IR-138, ISO_8859-8, ISO_8859-8:1988 |
| ISO-8859-9: | CSISOLATIN5, ISO-8859-9, ISO-IR-148, ISO_8859-9, ISO_8859-9:1989, L5, LATIN5 |
| armscii-8: | ARMSCII-8, ARMSCII8 |
| big5: | BIG-5, BIG-FIVE, BIG5, BIGFIVE, CN-BIG5, CSBIG5 |
| cp1250: | CP1250, MS-EE, WINDOWS-1250 |
| cp1251: | CP1251, MS-CYRL, WINDOWS-1251 |
| cp1252: | CP1252, MS-ANSI, WINDOWS-1252 |
| cp1253: | CP1253, MS-GREEK, WINDOWS-1253 |
| cp1254: | CP1254, MS-TURK, WINDOWS-1254 |
| cp1255: | CP1255, MS-HEBR, WINDOWS-1255 |
| cp1256: | CP1256, MS-ARAB, WINDOWS-1256 |
| cp1257: | CP1257, WINBALTRIM, WINDOWS-1257 |
| cp1258: | CP1258, WINDOWS-1258 |
| cp437: | 437, CP437, IBM437 |
| cp850: | 850, CP850, CSPC850MULTILINGUAL, IBM850 |
| cp852: | 852, CP852, IBM852 |
| cp855: | 855, CP855, IBM855 |
| cp857: | 857, CP857, IBM857 |
| cp860: | 860, CP860, IBM860 |
| cp861: | 861, CP861, IBM861 |
| cp862: | 862, CP862, IBM862 |
| cp863: | 863, CP863, IBM863 |
| cp864: | 864, CP864, IBM864 |
| cp865: | 865, CP865, IBM865 |
| cp866: | 866, CP866, CSIBM866, IBM866 |
| cp869: | 869, CP869, IBM869 |
| cp874: | CP874, WINDOWS-874 |
| EUC-JP: | CSEUCJP, EUC-JP, EUCJP, UJIS, X-EUC-JP |
| EUC-KR: | CSEUCKR, EUC-KR, EUCKR |
| GB2312: | CHINESE, CSGB2312, CSISO58GB231280, GB2312, GB_2312-80, ISO-IR-58 |
| koi8-r: | CSKOI8R, KOI8-R, KOI8R |
| koi8-u: | KOI8-U, KOI8U |
| shift-JIS: | CSSHIFTJIS, MS_KANJI, S-JIS, SHIFT-JIS, SHIFT_JIS, SJIS |
| cp367: | ANSI_X3.4-1968, ASCII, CP367, CSASCII, IBM367, ISO-IR-6, ISO646-US, ISO_646.IRV:1991, US, US-ASCII |
| UTF8: | UTF-8, UTF8 |
| viscii: | CSVISCII, VISCII, VISCII1.1-1 |
| MacCyrillic: | MACCYRILLIC, X-MAC-CYRILLIC |
| MacRoman: | MACROMAN, MACINTOSH, CSMACINTOSH, MAC |
| MacCentralEurope: | MACCENTRALEUROPE, MACCE |
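Alias resolution of this kind can be sketched as a lookup table. The excerpt below copies a few rows from the table above; the `canonical_charset` helper is hypothetical, and case-insensitive matching is an assumption (suggested by the lowercase "latin2" example), not a documented guarantee.

```python
# Hypothetical excerpt of the alias table: canonical name -> known aliases.
ALIASES = {
    "ISO-8859-2": ["CSISOLATIN2", "ISO-8859-2", "ISO-IR-101", "ISO_8859-2",
                   "ISO_8859-2:1987", "L2", "LATIN2"],
    "cp1251": ["CP1251", "MS-CYRL", "WINDOWS-1251"],
    "koi8-r": ["CSKOI8R", "KOI8-R", "KOI8R"],
    "UTF8": ["UTF-8", "UTF8"],
}

# Invert the table once so each lookup is a single dictionary access.
LOOKUP = {
    alias.upper(): canonical
    for canonical, names in ALIASES.items()
    for alias in names
}

def canonical_charset(name):
    """Return the canonical charset for an alias, or None if it is unknown."""
    # Assumption: aliases are compared case-insensitively.
    return LOOKUP.get(name.strip().upper())
```

For instance, `canonical_charset("latin2")` and `canonical_charset("ISO_8859-2")` both resolve to the same entry, while an unknown name returns `None`.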
The indexer detects a document's charset in this order:
1. The HTTP header: "Content-type: text/html; charset=xxx"
2. The document itself: <META NAME="Content-Type" CONTENT="text/html; charset=xxx"> (for HTML documents) or <?xml version="1.0" encoding="xxx"?> (for XML documents). This step can be switched off with the "GuesserUseMeta no" command in your indexer.conf.
3. The default: the "Charset" setting of the corresponding Server or Realm command.
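The detection order above can be sketched as a simple fallback chain. This is an illustration of the ordering only, not the indexer's actual parser: the function name, its parameters and the regular expressions are all assumptions, and `use_meta=False` plays the role of "GuesserUseMeta no".

```python
import re

# Simplified patterns, assumed for illustration; real documents vary more.
HEADER_RE = re.compile(r'charset=([\w.:-]+)', re.IGNORECASE)
META_RE = re.compile(
    r'<meta[^>]+content\s*=\s*["\'][^"\']*charset=([\w.:-]+)', re.IGNORECASE)
XML_RE = re.compile(
    r'<\?xml[^>]*encoding\s*=\s*["\']([\w.:-]+)["\']', re.IGNORECASE)

def detect_charset(content_type_header, body, default_charset, use_meta=True):
    """Return the first charset found, checking sources in priority order."""
    # 1. The HTTP Content-type header.
    if content_type_header:
        match = HEADER_RE.search(content_type_header)
        if match:
            return match.group(1)
    # 2. The META tag or XML declaration, unless switched off.
    if use_meta:
        match = META_RE.search(body) or XML_RE.search(body)
        if match:
            return match.group(1)
    # 3. The configured default for this server.
    return default_charset
```

A header charset wins over a META tag, and with `use_meta=False` a document without a header charset falls straight through to the default.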
Since version 3.2.0, mnoGoSearch has included an automatic charset and language guesser. It currently recognizes more than 100 charsets and languages. Charset and language detection is implemented using the "N-Gram-Based Text Categorization" technique. There are a number of so-called "language map" files, one for each language-charset pair. By default they are installed in the /usr/local/mnogosearch/etc/langmap/ directory; look there for the list of charset-language pairs currently provided. The guesser works well for texts longer than 500 characters. Shorter texts may not be guessed reliably.
To build your own language map, use the mguesser utility. You also need to collect files with language samples in the desired charset. To create a new language map, use the following command:
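The general idea behind N-gram-based text categorization can be sketched as follows. This is a toy version of the published technique (rank the most frequent character n-grams, then compare rank orders with an "out-of-place" distance); it is not the mguesser implementation, and the function names, the profile size and the tiny training samples are all assumptions. Real language maps are built from much larger samples.

```python
from collections import Counter

PROFILE_SIZE = 300  # assumed cap on the number of n-grams kept per profile

def ngram_profile(text, max_n=3):
    """Return character n-grams (length 1..max_n), most frequent first."""
    counts = Counter()
    for word in text.lower().split():
        padded = f"_{word}_"  # mark word boundaries
        for n in range(1, max_n + 1):
            for i in range(len(padded) - n + 1):
                counts[padded[i:i + n]] += 1
    # Break ties alphabetically so the ranking is deterministic.
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [gram for gram, _ in ranked[:PROFILE_SIZE]]

def out_of_place(doc_profile, lang_profile):
    """Sum of rank differences; n-grams missing from the map cost the maximum."""
    rank = {gram: i for i, gram in enumerate(lang_profile)}
    return sum(
        abs(i - rank[gram]) if gram in rank else PROFILE_SIZE
        for i, gram in enumerate(doc_profile)
    )

def guess(text, profiles):
    """Return the name of the profile closest to the text."""
    doc = ngram_profile(text)
    return min(profiles, key=lambda name: out_of_place(doc, profiles[name]))

# Toy "language maps" built from one sentence each.
profiles = {
    "en": ngram_profile(
        "the quick brown fox jumps over the lazy dog and the cat sleeps in "
        "the warm house while the children are playing in the garden with "
        "their friends"),
    "de": ngram_profile(
        "der schnelle braune fuchs springt über den faulen hund und die "
        "katze schläft in dem warmen haus während die kinder im garten mit "
        "ihren freunden spielen"),
}
```

Because the distance depends on n-gram rank statistics, very short inputs give unstable rankings, which is consistent with the note that short texts may not be guessed well.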
mguesser -p -c charset -l language < FILENAME > language.charset.lm
You can also use the mguesser utility to guess a document's language and charset from the existing language maps. To do this, use the following command:
mguesser [-n maxhits] < FILENAME
For some languages, you may use several different charsets. To convert from one charset supported by mnoGoSearch to another, use the mconv utility:
mconv [OPTIONS] -f charset_from -t charset_to [configfile] < infile > outfile
By default, both mguesser and mconv utilities are installed into the /usr/local/mnogosearch/sbin/ directory.
Since version 3.2.14, mnoGoSearch can update language and charset maps automatically while indexing, if the remote server supplies pages with an explicitly specified language and charset. To enable this function, specify the command
LangMapUpdate yes
in your indexer.conf file.
Use the RemoteCharset indexer.conf command to choose the default charset of indexed servers.
You can set the default language for servers by using the DefaultLang indexer.conf command. This is useful for restricting search results by language later.