Regular Expressions and Character Sets
Regular Expression Syntax
This table shows the Check Point implementation of standard regular expression metacharacters.
Metacharacter | Name | Description |
---|---|---|
\ | Backslash | escapes metacharacters; introduces non-printable characters and character types |
[ ] | Square Brackets | character class definition |
( ) | Parentheses | sub-pattern, to use metacharacters on the enclosed string |
{min[,max]} | Curly Brackets | min/max quantifier: {n} - exactly n occurrences; {n,m} - from n to m occurrences; {n,} - at least n occurrences |
. | Dot | match any character |
? | Question Mark | zero or one occurrences (equals {0,1}) |
* | Asterisk | zero or more occurrences of the preceding character |
+ | Plus Sign | one or more occurrences (equals {1,}) |
\| | Vertical Bar | alternative |
^ | Circumflex | anchor pattern to start of buffer (usually a word) |
$ | Dollar | anchor pattern to end of buffer (usually a word) |
- | Hyphen | range in character class |
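The metacharacters in the table can be tried out with a short sketch. It uses Python's `re` module as a stand-in, on the assumption that these basic metacharacters behave the same way there; it is not the DLP engine, and the sample patterns and strings are invented for illustration.

```python
import re

# Each entry: (pattern, candidate text, whether the whole text should match).
# Python's re module is a stand-in for the DLP engine here (assumption).
EXAMPLES = [
    (r"colou?r", "color", True),         # ?     : zero or one 'u'
    (r"colou?r", "colouur", False),      # ?     : two 'u' characters are too many
    (r"\d{3}-\d{4}", "555-1234", True),  # {n}   : exactly n occurrences
    (r"\d{2,3}", "1234", False),         # {n,m} : from 2 to 3 digits, not 4
    (r"(ab)+", "ababab", True),          # ( ) + : one or more of the sub-pattern
    (r"cat|dog", "dog", True),           # |     : alternative
    (r"[a-c]x*", "b", True),             # [ ] - *: class range, zero or more 'x'
    (r"a.c", "a-c", True),               # .     : any character
    (r"1\.5", "1x5", False),             # \     : escaped dot is a literal dot
    (r"^abc$", "abc", True),             # ^ $   : anchors at start and end
]


def matches_whole(pattern, text):
    """Return True if the pattern matches the entire text."""
    return re.fullmatch(pattern, text) is not None


for pattern, text, expected in EXAMPLES:
    assert matches_whole(pattern, text) == expected
```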
Non-Printable Characters
To use non-printable characters in patterns, escape the reserved character with a backslash.
Character | Description |
---|---|
\a | alarm, the BEL character (hex code 07) |
\cX | "control-X", where X is any character |
\e | escape (hex code 1B) |
\f | formfeed (hex code 0C) |
\n | newline (hex code 0A) |
\r | carriage return (hex code 0D) |
\t | tab (hex code 09) |
\ddd | character with octal code ddd |
\xhh | character with hex code hh |
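As a rough illustration, the sketch below matches non-printable characters by escape, by hex code, and by octal code. It uses Python's `re` module, which is an assumption and not the DLP engine. Python's `re` does not accept `\e` or `\cX`, so those two escapes are not shown. The sample line is invented.

```python
import re

# A tab-separated record ending in carriage return + newline (invented sample).
line = "name\tvalue\r\n"

# \t separates the two fields; \r\n ends the line.
fields = re.fullmatch(r"(\w+)\t(\w+)\r\n", line)

# \xhh addresses a character by hex code: \x09 is the tab character.
hex_tab = re.search(r"\x09", line) is not None

# \ddd addresses a character by octal code: \011 (octal 11 = hex 09) is also tab.
octal_tab = re.search(r"\011", line) is not None

# \a matches BEL; there is none in this line.
has_bell = re.search(r"\a", line) is not None
```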
Character Types
To specify types of characters in patterns, escape the reserved character with a backslash.
Character | Description |
---|---|
\d | any decimal digit [0-9] |
\D | any character that is not a decimal digit |
\s | any whitespace character |
\S | any character that is not whitespace |
\w | any word character (underscore or alphanumeric character) |
\W | any non-word character (not underscore or alphanumeric) |
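The character types can be illustrated with a small sketch, again using Python's `re` module as a stand-in for the DLP engine (an assumption). The `re.ASCII` flag restricts `\d` and `\w` to the ASCII ranges, which is the closest match to the `[0-9]` definition in the table. The sample strings are invented.

```python
import re

text = "Order 66: user_id=ab12, total 3 items"

# \d+ : runs of decimal digits
digits = re.findall(r"\d+", text, flags=re.ASCII)

# \w+ : runs of word characters (letters, digits, underscore)
words = re.findall(r"\w+", text, flags=re.ASCII)

# \D : remove everything that is not a decimal digit
digits_only = re.sub(r"\D", "", "tel: 555-1234", flags=re.ASCII)

# \s+ : split on any run of whitespace (spaces and tabs here)
tokens = re.split(r"\s+", "a  b\tc")
```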
Supported Character Sets
The DLP Gateway examines text in the UTF-8 Unicode character encoding. It therefore converts each message and file that it examines from its original encoding to UTF-8.
Before the DLP Gateway can convert a message or file, it must identify the original encoding. To do this, the DLP Gateway uses the meta data or the MIME headers. If neither is available, it uses the default encoding of the gateway.
The DLP Gateway determines the encoding of the message or file it examines as follows:
- If the file contains meta data, the DLP Gateway reads the encoding from there. For example: Microsoft Word files contain the encoding in the file.
- Some files have no meta data, but do have MIME headers. For example, text files or the body of an email. For those files the DLP Gateway reads the encoding from the MIME headers:
  Content-Type: text/plain; charset="iso-2022-jp"
- Some files do not have meta data or MIME headers. For those files, the DLP Gateway assumes that the encoding of the original message or file is the default encoding of the gateway. A log message is written to $DLPDIR/log/dlpe_problem_files.log:
  Charset for file <file name> is not provided. Using the default: <charset name>
The out-of-the-box default encoding is Windows Code Page 1252 (Latin I). This can be changed.
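The order in which the encoding is determined can be sketched as follows. This is a minimal illustration of the decision order described above, not the gateway's implementation. The function name `pick_charset` and its parameters are hypothetical, and Python's standard `email` package is used only to parse the `charset` parameter of a `Content-Type` header.

```python
from email.message import Message

# Out-of-the-box default named in the text above.
DEFAULT_CHARSET = "windows-1252"


def pick_charset(metadata_charset=None, mime_content_type=None,
                 default=DEFAULT_CHARSET):
    """Return (charset, source) following the documented order.

    Hypothetical helper: the parameters stand for whatever the caller
    already extracted from the file or message.
    """
    # 1. Encoding recorded in the file's own meta data.
    if metadata_charset:
        return metadata_charset, "metadata"

    # 2. charset parameter of the MIME Content-Type header, if present.
    if mime_content_type:
        msg = Message()
        msg["Content-Type"] = mime_content_type
        charset = msg.get_content_charset()  # lower-cased, or None
        if charset:
            return charset, "mime"

    # 3. Neither source is available: use the gateway default.
    return default, "default"
```

For example, `pick_charset(mime_content_type='text/plain; charset="iso-2022-jp"')` returns `("iso-2022-jp", "mime")`, and `pick_charset()` returns `("windows-1252", "default")`.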
To change the default encoding of the DLP Gateway:
- On the DLP Gateway, edit the $FWDIR/conf/file_convert.conf file.
- In the engine section, find the default_charset_for_text_files field. For example:
  :default_charset_for_text_files (windows-1252)
  Use one of the supported aliases as the value of this field. Each character set has one or more optional aliases.
  For example, to make the default character set encoding Russian KOI8-R, change the field value as follows:
  :default_charset_for_text_files (KOI8-R)
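If the change needs to be scripted, the field value can be rewritten with a small text substitution. The sketch below works on a string and makes no claim about the rest of the file: the surrounding layout in `sample` is an assumption, only the `:default_charset_for_text_files (...)` line comes from the text above, and `set_default_charset` is a hypothetical helper name.

```python
import re

FIELD = "default_charset_for_text_files"


def set_default_charset(conf_text, alias):
    """Return conf_text with the value of the field replaced by alias."""
    # Capture ':field (' and ')' so only the value between them changes.
    pattern = re.compile(r"(:" + re.escape(FIELD) + r"\s*\()[^)]*(\))")
    new_text, count = pattern.subn(
        lambda m: m.group(1) + alias + m.group(2), conf_text
    )
    if count != 1:
        raise ValueError(f"expected exactly one {FIELD} field, found {count}")
    return new_text


# Assumed layout around the field; the real file may look different.
sample = (
    ":engine (\n"
    "\t:default_charset_for_text_files (windows-1252)\n"
    ")\n"
)

updated = set_default_charset(sample, "KOI8-R")
```

The function raises an error instead of guessing when the field is missing or appears more than once, so a malformed file is left for manual review.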
If the DLP Gateway cannot use an encoding for a message or file, an error message shows in $DLPDIR/log/dlpe_problem_files.log:
  File <file name> has unsupported charset: <charset name>. Trying to convert anyway
If the DLP Gateway cannot use an encoding, it is possible that it cannot change the message (or parts of it) to UTF-8. If that is so, the DLP Gateway does not fully examine the message.
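The effect of an unusable encoding can be illustrated with a short sketch. It is only an analogy built on Python's own codecs, under the assumption that "trying to convert anyway" means a best-effort conversion. The function `to_utf8` and its fallback choices are hypothetical and do not describe the gateway's internal logic.

```python
def to_utf8(raw, charset, default="windows-1252"):
    """Return (utf8_bytes, fully_converted) for the given raw bytes.

    Hypothetical helper illustrating a best-effort conversion.
    """
    try:
        # Normal case: the charset is known and every byte is valid.
        return raw.decode(charset).encode("utf-8"), True
    except LookupError:
        # Unknown charset name: fall back to the default, replacing
        # bytes that cannot be decoded with U+FFFD.
        text = raw.decode(default, errors="replace")
        return text.encode("utf-8"), False
    except UnicodeDecodeError:
        # Known charset, but some bytes are invalid: those parts are
        # replaced, so the text is only partly recoverable.
        text = raw.decode(charset, errors="replace")
        return text.encode("utf-8"), False
```

When the second element of the result is `False`, some of the original content may have been lost, which corresponds to a message that cannot be fully examined.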
Character Set Aliases
The table below shows character sets you can use as the default input character set of the DLP Gateway.

Name of Character Set | Alias |
---|---|
UTF-8 Encoded Unicode | UTF-8 |
UTF-7 Encoded Unicode | UTF-7 |
ASCII (7-bit) | ASCII |
Japanese (JIS) | JIS_X0201 |
Japanese (EUC) | EUC-JP |
Korean Standard | KSC_5601 |
Simplified Chinese | GB2312 |
EBCDIC Code Page 37 (United States) | IBM037 |
EBCDIC Code Page 273 (Germany) | IBM273 |
EBCDIC Code Page 274 (Belgium) | IBM274 |
EBCDIC Code Page 277 (Denmark, Norway) | IBM277 |
EBCDIC Code Page 278 (Finland, Sweden) | IBM278 |
EBCDIC Code Page 280 (Italy) | IBM280 |
EBCDIC Code Page 284 (Latin America, Spain) | IBM284 |
EBCDIC Code Page 285 (Ireland, UK) | IBM285 |
EBCDIC Code Page 297 (France) | IBM297 |
EBCDIC Code Page 500 (International) | IBM500 |
EBCDIC Code Page 1026 (Turkey) | IBM1026 |
DOS Code Page 850 (Multilingual Latin I) | IBM850 |
DOS Code Page 852 (Latin II) | IBM852 |
DOS Code Page 855 (Cyrillic) | IBM855 |
DOS Code Page 857 (Turkish) | IBM857 |
DOS Code Page 860 (Portuguese) | IBM860 |
DOS Code Page 861 (Icelandic) | IBM861 |
DOS Code Page 863 (French) | IBM863 |
DOS Code Page 865 (Danish, Norwegian) | IBM865 |
DOS Code Page 869 (Greek) | IBM869 |
Windows Code Page 932 (Japanese Shift-JIS) | Shift_JIS |
Windows Code Page 874 (Thai) | ibm874 |
Windows Code Page 949 (Korean) | KS_C_5601-1987 |
Windows Code Page 950 (Traditional Chinese Big 5) | csBig5 |
Windows Code Page 1250 (Central Europe) | windows-1250 |
Windows Code Page 1251 (Cyrillic) | windows-1251 |
Windows Code Page 1252 (Latin I) | windows-1252 |
Windows Code Page 1253 (Greek) | windows-1253 |
Windows Code Page 1254 (Turkish) | windows-1254 |
Windows Code Page 1255 (Hebrew) | windows-1255 |
Windows Code Page 1256 (Arabic) | windows-1256 |
Windows Code Page 1257 (Baltic) | windows-1257 |
ISO-8859-1 (Latin 1) | ISO-8859-1 |
ISO-8859-2 (Latin 2) | ISO-8859-2 |
ISO-8859-3 (Latin 3) | ISO-8859-3 |
ISO-8859-4 (Baltic) | ISO-8859-4 |
ISO-8859-5 (Cyrillic) | ISO-8859-5 |
ISO-8859-6 (Arabic) | ISO-8859-6 |
ISO-8859-7 (Greek) | ISO-8859-7 |
ISO-8859-8 (Hebrew) | ISO-8859-8 |
ISO-8859-9 (Turkish) | ISO-8859-9 |
Mac OS Roman | csMacintosh |
Russian KOI8-R | KOI8-R |