This table shows the Check Point implementation of standard regular expression metacharacters.
| Metacharacter | Name | Description |
|---|---|---|
| \ | Backslash | escapes metacharacters; introduces non-printable characters and character types |
| [ ] | Square Brackets | character class definition |
| ( ) | Parentheses | sub-pattern, to use metacharacters on the enclosed string |
| {min[,max]} | Curly Brackets | min/max quantifier: {n} - exactly n occurrences; {n,m} - from n to m occurrences; {n,} - at least n occurrences |
| . | Dot | match any character |
| ? | Question Mark | zero or one occurrences (equals {0,1}) |
| * | Asterisk | zero or more occurrences of preceding character |
| + | Plus Sign | one or more occurrences (equals {1,}) |
| \| | Vertical Bar | alternative |
| ^ | Circumflex | anchor pattern to beginning of buffer (usually a word) |
| $ | Dollar | anchor pattern to end of buffer (usually a word) |
| - | Hyphen | range in character class |
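The metacharacters above can be tried out in a short sketch. It uses Python's `re` module as a stand-in, which is an assumption: the Check Point engine may differ in details such as anchoring and supported escapes. The helper name `matches` is hypothetical.

```python
import re

def matches(pattern: str, text: str) -> bool:
    """Return True if the pattern matches somewhere in the text."""
    return re.search(pattern, text) is not None

# {n,m}: from 2 to 4 digits; ^ and $ anchor the pattern to both ends
print(matches(r"^[0-9]{2,4}$", "123"))      # True
print(matches(r"^[0-9]{2,4}$", "12345"))    # False

# ( ) groups a sub-pattern so that + applies to the whole string "ab"
print(matches(r"^(ab)+$", "ababab"))        # True

# | separates alternatives; ? makes the preceding "s" optional
print(matches(r"^(cat|dog)s?$", "dogs"))    # True

# \. is a literal dot, while an unescaped . matches any character
print(matches(r"^a\.b$", "axb"))            # False
print(matches(r"^a.b$", "axb"))             # True
```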
To use non-printable characters in patterns, escape the reserved character set.
| Character | Description |
|---|---|
| \a | alarm; the BEL character (hex code 07) |
| \cX | "control-X", where X is any character |
| \e | escape (hex code 1B) |
| \f | formfeed (hex code 0C) |
| \n | newline (hex code 0A) |
| \r | carriage return (hex code 0D) |
| \t | tab (hex code 09) |
| \ddd | character with octal code ddd |
| \xhh | character with hex code hh |
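As a rough illustration, the sketch below finds non-printable characters with the hex, octal, and named escape forms. It assumes Python's `re` module as a stand-in for the DLP engine. Python's `re` does not support `\e` or `\cX`, so those two forms are not shown.

```python
import re

# Build the sample from explicit code points so that nothing depends on
# how this source file itself is encoded.
sample = "col1" + chr(0x09) + "col2" + chr(0x0D) + chr(0x0A) + chr(0x07)

tab_hex = re.search(r"\x09", sample)     # \xhh form
tab_octal = re.search(r"\011", sample)   # \ddd form (octal 011 is hex 09)
tab_named = re.search(r"\t", sample)     # named escape
crlf = re.search(r"\r\n", sample)        # carriage return followed by newline
bell = re.search(r"\a", sample)          # BEL character

print(tab_hex.start(), tab_octal.start(), tab_named.start())  # 4 4 4
print(crlf.start())                                           # 9
print(bell.start())                                           # 11
```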
To specify types of characters in patterns, escape the reserved character.
| Character | Description |
|---|---|
| \d | any decimal digit [0-9] |
| \D | any character that is not a decimal digit |
| \s | any whitespace character |
| \S | any character that is not whitespace |
| \w | any word character (underscore or alphanumeric character) |
| \W | any non-word character (not underscore or alphanumeric) |
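The character types can be sketched the same way. This example again assumes Python's `re` module as a stand-in. The `re.ASCII` flag is used so that `\d` means [0-9] and `\w` means underscore or ASCII alphanumeric, as in the table; whether the DLP engine treats non-ASCII letters as word characters is not stated here.

```python
import re

text = "Order_42 shipped on 2024-01-15"

digits = re.findall(r"\d+", text, flags=re.ASCII)     # runs of decimal digits
words = re.findall(r"\w+", text, flags=re.ASCII)      # runs of word characters
non_words = re.findall(r"\W", text, flags=re.ASCII)   # single non-word characters
parts = re.split(r"\s+", text)                        # split on whitespace

print(digits)     # ['42', '2024', '01', '15']
print(words)      # ['Order_42', 'shipped', 'on', '2024', '01', '15']
print(non_words)  # [' ', ' ', ' ', '-', '-']
print(parts)      # ['Order_42', 'shipped', 'on', '2024-01-15']
```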
The DLP gateway scans text in the UTF-8 Unicode character encoding. It therefore converts the messages and files that it scans from their original encoding to UTF-8.
Before it can change the encoding of a message or file, the DLP gateway must identify the original encoding. It does this using the metadata or the MIME headers. If neither exists, the default gateway encoding is used.
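The DLP gateway determines the encoding of the message or file it scans as follows:
1. If the metadata or the MIME headers specify the encoding, the DLP gateway uses it. For example:
   Content-Type: text/plain; charset="iso-2022-jp"
2. Otherwise, the DLP gateway uses the default encoding, and a message shows in $DLPDIR/log/dlpe_problem_files.log:
   Charset for file <file name> is not provided. Using the default: <charset name>
The out-of-the-box default encoding is Windows Code Page 1252 (Latin I). This can be changed.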
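The decision described above (use the declared charset if there is one, otherwise fall back to the default) can be sketched as follows. This is only an illustration built on Python's standard `email` package, not the gateway's actual logic; the function name `detect_charset` and the constant `DEFAULT_CHARSET` are hypothetical.

```python
from email.message import Message

DEFAULT_CHARSET = "windows-1252"  # out-of-the-box default named in the text

def detect_charset(content_type_header, default=DEFAULT_CHARSET):
    """Return (charset, used_default) for a Content-Type header value or None."""
    if content_type_header:
        msg = Message()
        msg["Content-Type"] = content_type_header
        charset = msg.get_content_charset()  # lowercased charset, or None
        if charset:
            return charset, False
    # No header, or a header without a charset parameter: use the default
    return default, True

print(detect_charset('text/plain; charset="iso-2022-jp"'))  # ('iso-2022-jp', False)
print(detect_charset("text/plain"))                         # ('windows-1252', True)
print(detect_charset(None))                                 # ('windows-1252', True)
```

The second element of the result marks the cases in which the gateway would log that no charset was provided.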
To change the default encoding of the DLP gateway:
1. Open the configuration file:
   $DLPDIR/config/dlp.conf
   $FWDIR/conf/file_convert.conf
2. In the engine section, search for the default_charset_for_text_files field. For example:
   :default_charset_for_text_files (windows-1252)
3. Use one of the supported aliases as the value of this field. Each character set has one or more optional aliases.
For example, to make Russian KOI8-R the default character set encoding, change the field value as follows:
:default_charset_for_text_files (KOI8-R)
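For readers who script such changes, the edit can be sketched as a text substitution. This is a minimal sketch that works on a string rather than on the real file, and it assumes only the `:field (value)` syntax shown in the example line; the function name `set_default_charset` is hypothetical. Back up the configuration file before changing it.

```python
import re

FIELD = "default_charset_for_text_files"

def set_default_charset(config_text: str, alias: str) -> str:
    """Return config_text with the value of the default charset field replaced."""
    # Capture everything up to the opening parenthesis and the closing one,
    # and replace only the value between them.
    pattern = re.compile(r"(:" + FIELD + r"\s*\()[^)]*(\))")
    new_text, count = pattern.subn(
        lambda m: m.group(1) + alias + m.group(2), config_text
    )
    if count == 0:
        raise ValueError(FIELD + " field not found")
    return new_text

line = ":default_charset_for_text_files (windows-1252)"
print(set_default_charset(line, "KOI8-R"))
# :default_charset_for_text_files (KOI8-R)
```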
If the DLP gateway cannot use an encoding for a message or file, an error message shows in $DLPDIR/log/dlpe_problem_files.log:
File <file name> has unsupported charset: <charset name>. Trying to convert anyway
If the DLP gateway cannot use an encoding, it might not be able to convert the message (or parts of it) to UTF-8. In that case, the DLP gateway does not fully scan the message.
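The effect of a failed or partial conversion can be illustrated with a small sketch. It uses Python's built-in codecs, whose charset names and behavior are not necessarily those of the DLP gateway; the function name `to_utf8` is hypothetical.

```python
def to_utf8(raw: bytes, charset: str):
    """Return (utf8_bytes, fully_converted) for raw bytes in the given charset."""
    try:
        return raw.decode(charset).encode("utf-8"), True
    except LookupError:
        # Unknown charset name: keep the ASCII bytes, replace everything else
        text = raw.decode("ascii", errors="replace")
    except UnicodeDecodeError:
        # Known charset, but some bytes are invalid in it: replace those bytes
        text = raw.decode(charset, errors="replace")
    return text.encode("utf-8"), False

russian = "\u041f\u0440\u0438\u0432\u0435\u0442"      # a Russian word, as code points
converted, ok = to_utf8(russian.encode("koi8-r"), "koi8-r")
print(ok, converted == russian.encode("utf-8"))        # True True

print(to_utf8(b"abc", "no-such-charset"))              # (b'abc', False)
```

When the second element is False, part of the content was replaced during conversion, which corresponds to a message that cannot be fully scanned.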
The character sets that can be used as the default input character set of the DLP gateway are:
| Name of Character Set | Alias |
|---|---|
| UTF-8 Encoded Unicode | UTF-8 |
| UTF-7 Encoded Unicode | UTF-7 |
| ASCII (7-bit) | ASCII |
| Japanese (JIS) | JIS_X0201 |
| Japanese (EUC) | EUC-JP |
| Korean Standard | KSC_5601 |
| Simplified Chinese | GB2312 |
| EBCDIC Code Page 37 (United States) | IBM037 |
| EBCDIC Code Page 273 (Germany) | IBM273 |
| EBCDIC Code Page 274 (Belgium) | IBM274 |
| EBCDIC Code Page 277 (Denmark, Norway) | IBM277 |
| EBCDIC Code Page 278 (Finland, Sweden) | IBM278 |
| EBCDIC Code Page 280 (Italy) | IBM280 |
| EBCDIC Code Page 284 (Latin America, Spain) | IBM284 |
| EBCDIC Code Page 285 (Ireland, UK) | IBM285 |
| EBCDIC Code Page 297 (France) | IBM297 |
| EBCDIC Code Page 500 (International) | IBM500 |
| EBCDIC Code Page 1026 (Turkey) | IBM1026 |
| DOS Code Page 850 (Multilingual Latin I) | IBM850 |
| DOS Code Page 852 (Latin II) | IBM852 |
| DOS Code Page 855 (Cyrillic) | IBM855 |
| DOS Code Page 857 (Turkish) | IBM857 |
| DOS Code Page 860 (Portuguese) | IBM860 |
| DOS Code Page 861 (Icelandic) | IBM861 |
| DOS Code Page 863 (French) | IBM863 |
| DOS Code Page 865 (Danish, Norwegian) | IBM865 |
| DOS Code Page 869 (Greek) | IBM869 |
| Windows Code Page 932 (Japanese Shift-JIS) | Shift_JIS |
| Windows Code Page 874 (Thai) | ibm874 |
| Windows Code Page 949 (Korean) | KS_C_5601-1987 |
| Windows Code Page 950 (Traditional Chinese Big 5) | csBig5 |
| Windows Code Page 1250 (Central Europe) | windows-1250 |
| Windows Code Page 1251 (Cyrillic) | windows-1251 |
| Windows Code Page 1252 (Latin I) | windows-1252 |
| Windows Code Page 1253 (Greek) | windows-1253 |
| Windows Code Page 1254 (Turkish) | windows-1254 |
| Windows Code Page 1255 (Hebrew) | windows-1255 |
| Windows Code Page 1256 (Arabic) | windows-1256 |
| Windows Code Page 1257 (Baltic) | windows-1257 |
| ISO-8859-1 (Latin 1) | ISO-8859-1 |
| ISO-8859-2 (Latin 2) | ISO-8859-2 |
| ISO-8859-3 (Latin 3) | ISO-8859-3 |
| ISO-8859-4 (Baltic) | ISO-8859-4 |
| ISO-8859-5 (Cyrillic) | ISO-8859-5 |
| ISO-8859-6 (Arabic) | ISO-8859-6 |
| ISO-8859-7 (Greek) | ISO-8859-7 |
| ISO-8859-8 (Hebrew) | ISO-8859-8 |
| ISO-8859-9 (Turkish) | ISO-8859-9 |
| Mac OS Roman | csMacintosh |
| Russian KOI8-R | KOI8-R |