[/
  Copyright 2006-2007 John Maddock.
  Distributed under the Boost Software License, Version 1.0.
  (See accompanying file LICENSE_1_0.txt or copy at
  http://www.boost.org/LICENSE_1_0.txt).
]

[section:locale Localization]

Boost.Regex provides extensive support for run-time localization. The
localization model can be split into two parts: front-end and back-end.

Front-end localization deals with everything that the user sees:
error messages, and the regular expression syntax itself. For example, a
French application could change \[\[:word:\]\] to \[\[:mot:\]\] and \\w to \\m.
Modifying the front-end locale requires active support from the developer,
who must provide the library with a message catalogue to load, containing the
localized strings. The front-end locale is affected by the LC_MESSAGES category only.

Back-end localization deals with everything that occurs after the expression
has been parsed, in other words everything that the user does not see or
interact with directly. It deals with case conversion, collation, and character
class membership. The back-end locale does not require any intervention from
the developer: the library acquires all the information it requires for
the current locale from the underlying operating system or run-time library.
This means that if the program user does not interact with regular
expressions directly, for example if the expressions are embedded in your
C++ code, then no explicit localization is required, as the library will
take care of everything for you. For example, embedding the expression
\[\[:word:\]\]+ in your code will always match a whole word. If the
program is run on a machine with, for example, a Greek locale, then it
will still match a whole word, but in Greek characters rather than Latin ones.
The back-end locale is affected by the LC_CTYPE and LC_COLLATE categories.

There are three separate localization mechanisms supported by Boost.Regex:

[h4 Win32 localization model.]

This is the default model when the library is compiled under Win32, and is
encapsulated by the traits class `w32_regex_traits`. When this model is in
effect each [basic_regex] object gets its own LCID. By default this is
the user's default setting as returned by GetUserDefaultLCID, but you can
call `imbue` on the `basic_regex` object to set its locale to some other
LCID if you wish. All the settings used by Boost.Regex are acquired directly
from the operating system, bypassing the C run-time library. Front-end
localization requires a resource dll containing a string table with the
user-defined strings. The traits class exports the function:

   static std::string set_message_catalogue(const std::string& s);

which needs to be called with a string identifying the name of the resource
dll, before your code compiles any regular expressions (but not necessarily
before you construct any `basic_regex` instances):

   boost::w32_regex_traits<char>::set_message_catalogue("mydll.dll");

The library provides full Unicode support under NT. Under Windows 9x
the library degrades gracefully: characters 0 to 255 are supported, and the
remainder are treated as "unknown" graphic characters.

[h4 C localization model.]

This model has been deprecated in favor of the C++ locale for all non-Windows
compilers that support it. This locale is encapsulated by the traits class
`c_regex_traits`. Win32 users can force this model to take effect by
defining the pre-processor symbol BOOST_REGEX_USE_C_LOCALE. When this model is
in effect there is a single global locale, as set by `setlocale`. All settings
are acquired from your run-time library; consequently Unicode support is
dependent upon your run-time library implementation.

Front-end localization is not supported.

Note that calling `setlocale` invalidates all compiled regular expressions.
Calling `setlocale(LC_ALL, "C")` will make this library behave equivalently to
most traditional regular expression libraries, including version 1 of this library.

[h4 C++ localization model.]

This model is the default for non-Windows compilers.

When this model is in effect each instance of [basic_regex] has its own
instance of `std::locale`. Class [basic_regex] also has a member function
`imbue` which allows the locale for the expression to be set on a
per-instance basis. Front-end localization requires a POSIX message catalogue,
which will be loaded via the `std::messages` facet of the expression's locale.
The traits class exports the symbol:

   static std::string set_message_catalogue(const std::string& s);

which needs to be called with a string identifying the name of the
message catalogue, before your code compiles any regular expressions
(but not necessarily before you construct any `basic_regex` instances):

   boost::cpp_regex_traits<char>::set_message_catalogue("mycatalogue");

Note that calling `basic_regex<>::imbue` will invalidate any expression
currently compiled in that instance of [basic_regex].

Finally, note that if you build the library with a non-default localization model,
then the appropriate pre-processor symbol (BOOST_REGEX_USE_C_LOCALE or
BOOST_REGEX_USE_CPP_LOCALE) must be defined both when you build the support
library, and when you include `<boost/regex.hpp>` or `<boost/cregex.hpp>`
in your code. The best way to ensure this is to add the #define to
`<boost/regex/user.hpp>`.

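The per-instance locale, and the invalidation caused by `imbue`, can be
illustrated with a minimal sketch. To keep it compilable without Boost, the
sketch uses the standard library's `std::regex` as a stand-in: its `imbue`
has the same shape and, per the C++ standard, also leaves the expression
matching nothing until a pattern is assigned again. The helper names
`compile_with_locale` and `matches_after_imbue` are hypothetical, and the
sketch assumes that `boost::basic_regex` follows the same imbue-then-assign
order described above.

```cpp
#include <cassert>
#include <locale>
#include <regex>
#include <string>

// Hypothetical helper (not part of Boost.Regex): gives the expression its
// own locale and then compiles the pattern. imbue() discards whatever was
// compiled before, so the order matters: imbue first, assign second.
inline std::regex compile_with_locale(const std::string& pattern,
                                      const std::locale& loc)
{
   std::regex e;              // nothing compiled yet
   e.imbue(loc);              // this instance now carries its own locale
   assert(e.getloc() == loc); // sanity check on the per-instance locale
   e.assign(pattern);         // compiled using loc
   return e;
}

// Hypothetical helper: changes the locale of an already compiled copy of
// the expression and reports whether it still matches `text`.
inline bool matches_after_imbue(std::regex e, const std::string& text,
                                const std::locale& loc)
{
   e.imbue(loc);              // invalidates the compiled expression
   return std::regex_match(text, e);
}
```

The design point is the ordering: set the locale first and compile afterwards,
because changing the locale later discards the compiled state and the pattern
has to be assigned again.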
[h4 Providing a message catalogue]

In order to localize the front end of the library, you need to provide the
library with the appropriate message strings contained either in a resource
dll's string table (Win32 model), or a POSIX message catalogue (C++ model).
In the latter case the messages must appear in message set zero of the
catalogue. The messages and their IDs are as follows:

[table
[[Message ID][Meaning][Default value]]
[[101][The character used to start a sub-expression.]["(" ]]
[[102][The character used to end a sub-expression declaration.][")" ]]
[[103][The character used to denote an end of line assertion.]["$" ]]
[[104][The character used to denote the start of line assertion.]["^" ]]
[[105][The character used to denote the "match any character expression".]["." ]]
[[106][The match zero or more times repetition operator.]["*" ]]
[[107][The match one or more repetition operator.]["+" ]]
[[108][The match zero or one repetition operator.]["?" ]]
[[109][The character set opening character.]["\[" ]]
[[110][The character set closing character.]["\]" ]]
[[111][The alternation operator.]["|" ]]
[[112][The escape character.]["\\" ]]
[[113][The hash character (not currently used).]["#" ]]
[[114][The range operator.]["-" ]]
[[115][The repetition operator opening character.]["{" ]]
[[116][The repetition operator closing character.]["}" ]]
[[117][The digit characters.]["0123456789" ]]
[[118][The character which when preceded by an escape character represents the word boundary assertion.]["b" ]]
[[119][The character which when preceded by an escape character represents the non-word boundary assertion.]["B" ]]
[[120][The character which when preceded by an escape character represents the word-start boundary assertion.]["<" ]]
[[121][The character which when preceded by an escape character represents the word-end boundary assertion.][">" ]]
[[122][The character which when preceded by an escape character represents any word character.]["w" ]]
[[123][The character which when preceded by an escape character represents a non-word character.]["W" ]]
[[124][The character which when preceded by an escape character represents a start of buffer assertion.]["`A" ]]
[[125][The character which when preceded by an escape character represents an end of buffer assertion.]["'z" ]]
[[126][The newline character. ]["\\n" ]]
[[127][The comma separator.]["," ]]
[[128][The character which when preceded by an escape character represents the bell character.]["a" ]]
[[129][The character which when preceded by an escape character represents the form feed character.]["f" ]]
[[130][The character which when preceded by an escape character represents the newline character.]["n" ]]
[[131][The character which when preceded by an escape character represents the carriage return character.]["r" ]]
[[132][The character which when preceded by an escape character represents the tab character.]["t" ]]
[[133][The character which when preceded by an escape character represents the vertical tab character.]["v" ]]
[[134][The character which when preceded by an escape character represents the start of a hexadecimal character constant.]["x" ]]
[[135][The character which when preceded by an escape character represents the start of an ASCII escape character.]["c" ]]
[[136][The colon character.][":" ]]
[[137][The equals character.]["=" ]]
[[138][The character which when preceded by an escape character represents the ASCII escape character.]["e" ]]
[[139][The character which when preceded by an escape character represents any lower case character.]["l" ]]
[[140][The character which when preceded by an escape character represents any non-lower case character.]["L" ]]
[[141][The character which when preceded by an escape character represents any upper case character.]["u" ]]
[[142][The character which when preceded by an escape character represents any non-upper case character.]["U" ]]
[[143][The character which when preceded by an escape character represents any space character.]["s" ]]
[[144][The character which when preceded by an escape character represents any non-space character.]["S" ]]
[[145][The character which when preceded by an escape character represents any digit character.]["d" ]]
[[146][The character which when preceded by an escape character represents any non-digit character.]["D" ]]
[[147][The character which when preceded by an escape character represents the end quote operator.]["E" ]]
[[148][The character which when preceded by an escape character represents the start quote operator.]["Q" ]]
[[149][The character which when preceded by an escape character represents a Unicode combining character sequence.]["X" ]]
[[150][The character which when preceded by an escape character represents any single character.]["C" ]]
[[151][The character which when preceded by an escape character represents end of buffer operator.]["Z" ]]
[[152][The character which when preceded by an escape character represents the continuation assertion.]["G" ]]
[[153][The character which when preceded by (? indicates a zero width negated forward lookahead assert.]["!" ]]
]

Custom error messages are loaded as follows:

[table
[[Message ID][Error message ID][Default string ]]
[[201][REG_NOMATCH]["No match" ]]
[[202][REG_BADPAT]["Invalid regular expression" ]]
[[203][REG_ECOLLATE]["Invalid collation character" ]]
[[204][REG_ECTYPE]["Invalid character class name" ]]
[[205][REG_EESCAPE]["Trailing backslash" ]]
[[206][REG_ESUBREG]["Invalid back reference" ]]
[[207][REG_EBRACK]["Unmatched \[ or \[^" ]]
[[208][REG_EPAREN]["Unmatched ( or \\(" ]]
[[209][REG_EBRACE]["Unmatched \\{" ]]
[[210][REG_BADBR]["Invalid content of \\{\\}" ]]
[[211][REG_ERANGE]["Invalid range end" ]]
[[212][REG_ESPACE]["Memory exhausted" ]]
[[213][REG_BADRPT]["Invalid preceding regular expression" ]]
[[214][REG_EEND]["Premature end of regular expression" ]]
[[215][REG_ESIZE]["Regular expression too big" ]]
[[216][REG_ERPAREN]["Unmatched ) or \\)" ]]
[[217][REG_EMPTY]["Empty expression" ]]
[[218][REG_E_UNKNOWN]["Unknown error" ]]
]

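The tables above pair each message ID with a default string. The following is
a minimal sketch of how such a lookup could behave, assuming (hypothetically)
that the catalogue has already been read into a `std::map`. The names
`message_catalogue` and `lookup_message` are illustrative only and are not
part of Boost.Regex. The shape mirrors `std::messages<char>::get`, which also
takes a default string to return when the lookup fails.

```cpp
#include <cassert>
#include <map>
#include <string>

// Hypothetical stand-in for a loaded message catalogue: message ID -> text.
using message_catalogue = std::map<int, std::string>;

// Returns the custom message for `id` if the catalogue supplies one,
// otherwise the built-in default. A custom entry replaces the default.
inline std::string lookup_message(const message_catalogue& cat, int id,
                                  const std::string& default_text)
{
   assert(id > 0);   // the IDs listed in the tables are all positive
   const auto pos = cat.find(id);
   return pos == cat.end() ? default_text : pos->second;
}
```

With a catalogue that defines only ID 201, a lookup of 201 returns the custom
text, while a lookup of 202 falls back to "Invalid regular expression".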
Custom character class names are loaded as follows:

[table
[[Message ID][Description][Equivalent default class name ]]
[[300][The character class name for alphanumeric characters.]["alnum" ]]
[[301][The character class name for alphabetic characters.]["alpha" ]]
[[302][The character class name for control characters.]["cntrl" ]]
[[303][The character class name for digit characters.]["digit" ]]
[[304][The character class name for graphics characters.]["graph" ]]
[[305][The character class name for lower case characters.]["lower" ]]
[[306][The character class name for printable characters.]["print" ]]
[[307][The character class name for punctuation characters.]["punct" ]]
[[308][The character class name for space characters.]["space" ]]
[[309][The character class name for upper case characters.]["upper" ]]
[[310][The character class name for hexadecimal characters.]["xdigit" ]]
[[311][The character class name for blank characters.]["blank" ]]
[[312][The character class name for word characters.]["word" ]]
[[313][The character class name for Unicode characters.]["unicode" ]]
]

Finally, custom collating element names are loaded starting from message
id 400, and terminating when the first load thereafter fails. Each message
looks something like: "tagname string" where tagname is the name used
inside \[\[.tagname.\]\] and string is the actual text of the collating element.
Note that the value of collating element \[\[.zero.\]\] is used for the
conversion of strings to numbers. If you replace this with another value then
that will be used for string parsing: for example, use the Unicode
character 0x0660 for \[\[.zero.\]\] if you want to use Unicode Arabic-Indic
digits in your regular expressions in place of Latin digits.

Note that the POSIX defined names for character classes and collating elements
are always available, even if custom names are defined. In contrast,
custom error messages and custom syntax messages replace the default ones.

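The loading rule for collating element names described above (start at
message id 400, stop at the first id that fails to load, split each message
into a tag name and its text) can be sketched as follows. This is an
illustration under stated assumptions, not the library's implementation: the
catalogue is modelled as a `std::map`, the names `message_catalogue`,
`collating_names` and `load_collating_names` are hypothetical, and skipping
entries that lack a space separator is a choice made for this sketch only.

```cpp
#include <cassert>
#include <map>
#include <string>

// Hypothetical stand-in for a loaded message catalogue: message ID -> text.
using message_catalogue = std::map<int, std::string>;
// Tag name -> text of the collating element.
using collating_names = std::map<std::string, std::string>;

// Reads "tagname string" messages starting at ID 400 and stops at the first
// ID that is missing, mirroring "terminating when the first load fails".
inline collating_names load_collating_names(const message_catalogue& cat)
{
   collating_names names;
   for (int id = 400; ; ++id)
   {
      const auto pos = cat.find(id);
      if (pos == cat.end())
         break;                        // first failed load ends the scan
      const std::string& msg = pos->second;
      const std::string::size_type space = msg.find(' ');
      if (space == std::string::npos || space == 0)
         continue;                     // assumption: ignore malformed entries
      names[msg.substr(0, space)] = msg.substr(space + 1);
   }
   return names;
}
```

Because the scan stops at the first gap, a catalogue that defines IDs 400, 401
and 403 yields only the first two names. The custom names are additions: the
POSIX defined names remain available alongside them.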
[endsect]