Regular Expressions and Character Sets

Regular Expression Syntax

This table shows the Check Point implementation of standard regular expression metacharacters.

Metacharacter	Name	Description
\	Backslash	escapes a metacharacter; also introduces non-printable characters and character types
[ ]	Square Brackets	character class definition
( )	Parentheses	sub-pattern; lets metacharacters apply to the enclosed string
{min[,max]}	Curly Brackets	min/max quantifier: {n} - exactly n occurrences; {n,m} - from n to m occurrences; {n,} - at least n occurrences
.	Dot	match any character
?	Question Mark	zero or one occurrences (equals {0,1})
*	Asterisk	zero or more occurrences of preceding character
+	Plus Sign	one or more occurrences (equals {1,})
|	Vertical Bar	alternative
^	Circumflex	anchor pattern to beginning of buffer (usually a word)
$	Dollar	anchor pattern to end of buffer (usually a word)
-	Hyphen	range in character class
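
The metacharacters in the table can be tried out with any engine that uses the same standard syntax. The following is a minimal sketch using Python's re module, which is an assumption made for illustration only: it is not the Check Point engine, and the two may differ in details. The helper name matches and the sample strings are hypothetical.

```python
import re

def matches(pattern: str, text: str) -> bool:
    """Return True if the pattern matches anywhere in text."""
    return re.search(pattern, text) is not None

# (pattern, text, expected result)
examples = [
    (r"\d{3}-\d{4}", "call 555-1234", True),  # {n}: exactly n occurrences
    (r"^ab{2,3}c$", "abbc", True),            # {n,m}: from n to m occurrences
    (r"^ab{2,3}c$", "abc", False),            # only one "b", so no match
    (r"colou?r", "color", True),              # ?: zero or one occurrence
    (r"^(ab)+$", "ababab", True),             # ( ) and +: repeat a sub-pattern
    (r"cat|dog", "hotdog", True),             # |: alternative
    (r"^[a-f0-9]+$", "c0ffee", True),         # [ ] and -: class with ranges
    (r"^1.5$", "1x5", True),                  # .: any character
    (r"^1\.5$", "1x5", False),                # \.: escaped dot is literal
]

for pattern, text, expected in examples:
    assert matches(pattern, text) == expected
```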

Using Non-Printable Characters

To match a non-printable character in a pattern, use one of these escape sequences.

Character	Description
\a	alarm; the BEL character (hex code `07`)
\cX	"control-X", where X is any character
\e	escape (hex code `1B`)
\f	formfeed (hex code `0C`)
\n	newline (hex code `0A`)
\r	carriage return (hex code `0D`)
\t	tab (hex code `09`)
\ddd	character with octal code `ddd`
\xhh	character with hex code `hh`
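
As an illustration of how these escapes map to character codes, here is a small sketch in Python's re module (an assumption for illustration, not the Check Point engine). Python's re does not accept \e or \cX, so the escape character is written in its hex and octal forms instead. The helper name matches_char is hypothetical.

```python
import re

def matches_char(pattern: str, ch: str) -> bool:
    """Return True if the pattern matches exactly the single character ch."""
    return re.fullmatch(pattern, ch) is not None

# Raw strings pass the backslash through to the regex engine unchanged.
ESCAPES = {
    r"\a": 0x07,    # alarm (BEL)
    r"\f": 0x0C,    # formfeed
    r"\n": 0x0A,    # newline
    r"\r": 0x0D,    # carriage return
    r"\t": 0x09,    # tab
    r"\x1B": 0x1B,  # escape, written as \xhh
    r"\033": 0x1B,  # escape, written as octal \ddd
}

for pattern, code in ESCAPES.items():
    assert matches_char(pattern, chr(code))
```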

Using Character Types

To match a type of character in a pattern, use one of these escaped characters.

Character	Description
\d	any decimal digit [0-9]
\D	any character that is not a decimal digit
\s	any whitespace character
\S	any character that is not whitespace
\w	any word character (underscore or alphanumeric character)
\W	any non-word character (not underscore or alphanumeric)
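
The character types can be compared side by side on one sample string. This sketch uses Python's re module as a stand-in (an assumption, not the Check Point engine) and sets the re.ASCII flag so that \d means [0-9] as in the table. The helper name find_all and the sample string are hypothetical.

```python
import re

def find_all(pattern: str, text: str) -> list:
    """Return all non-overlapping matches, using ASCII-only character types."""
    return re.findall(pattern, text, flags=re.ASCII)

sample = "id_42 x-9"

assert find_all(r"\d", sample) == ["4", "2", "9"]     # decimal digits
assert find_all(r"\D+", sample) == ["id_", " x-"]     # runs of non-digits
assert find_all(r"\s", sample) == [" "]               # whitespace
assert find_all(r"\S+", sample) == ["id_42", "x-9"]   # runs of non-whitespace
assert find_all(r"\w+", sample) == ["id_42", "x", "9"]  # word characters
assert find_all(r"\W", sample) == [" ", "-"]          # non-word characters
```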

Supported Character Sets

The DLP gateway scans text in the UTF-8 Unicode character encoding. It therefore converts the messages and files that it scans from their original encoding to UTF-8.

Before it can change the encoding of a message or file, the DLP gateway must identify the original encoding. The DLP gateway does this using the metadata or the MIME headers. If neither exists, the default gateway encoding is used.

The DLP gateway determines the encoding of the message or file it scans as follows:

If the file contains metadata, the DLP gateway reads the encoding from there. For example, Microsoft Word files store the encoding in the file.
Some files, such as text files or the body of an email, have no metadata but do have MIME headers. For those files, the DLP gateway reads the encoding from the MIME headers:
Content-Type: text/plain; charset="iso-2022-jp"
Some files have neither metadata nor MIME headers. For those files, the DLP gateway assumes that the original encoding of the message or file is the default encoding of the gateway, and writes a log message to $DLPDIR/log/dlpe_problem_files.log:
Charset for file <file name> is not provided. Using the default: <charset name>
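
The order of precedence described above (metadata, then MIME headers, then the gateway default) can be sketched as follows. This is an illustrative Python sketch only, not the gateway's implementation. The function names and the source labels are hypothetical, and the default value shown is the out-of-the-box default named in this section.

```python
from email.message import Message
from typing import Optional, Tuple

# Assumption: the out-of-the-box gateway default named in this section.
DEFAULT_CHARSET = "windows-1252"

def charset_from_mime(content_type: str) -> Optional[str]:
    """Extract the charset parameter from a Content-Type header value."""
    msg = Message()
    msg["Content-Type"] = content_type
    return msg.get_content_charset()  # lowercased, or None if absent

def pick_charset(
    metadata_charset: Optional[str],
    content_type: Optional[str],
    default: str = DEFAULT_CHARSET,
) -> Tuple[str, str]:
    """Return (charset, source), trying metadata, then MIME, then the default."""
    if metadata_charset:
        return metadata_charset, "metadata"
    if content_type:
        mime_charset = charset_from_mime(content_type)
        if mime_charset:
            return mime_charset, "mime"
    return default, "default"

print(pick_charset(None, 'text/plain; charset="iso-2022-jp"'))
```

A caller that receives the "default" source label would be the natural place to write the log message shown above.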

The out-of-the-box default encoding is Windows Code Page 1252 (Latin I). This can be changed.

To change the default encoding of the DLP gateway:

On the DLP gateway, edit the file:
- R77, R77.10, R77.20 - $DLPDIR/config/dlp.conf
- R77.30 - $FWDIR/conf/file_convert.conf
In the engine section, find the default_charset_for_text_files field. For example:
:default_charset_for_text_files (windows-1252)

Use one of the supported aliases as the value of this field. Each character set has one or more optional aliases.

For example, to make the default character set encoding Russian KOI8-R, change the field value as follows:

:default_charset_for_text_files (KOI8-R)
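
If the change has to be repeated on several gateways, the edit can be scripted. The following is a minimal Python sketch that rewrites the field value in configuration text held in memory. The function name set_default_charset is hypothetical, the sketch assumes the field appears exactly once in the form shown above, and it does not read or write the file itself. Back up the configuration file before applying any scripted change.

```python
import re

FIELD = "default_charset_for_text_files"

def set_default_charset(config_text: str, alias: str) -> str:
    """Return config_text with the value of the charset field replaced by alias."""
    # Capture ":field (" and ")" so that only the value between them changes.
    pattern = re.compile(r"(:" + re.escape(FIELD) + r"\s*\()[^)]*(\))")
    new_text, count = pattern.subn(
        lambda m: m.group(1) + alias + m.group(2), config_text
    )
    if count != 1:
        raise ValueError(f"expected exactly one {FIELD} field, found {count}")
    return new_text

before = ":default_charset_for_text_files (windows-1252)"
print(set_default_charset(before, "KOI8-R"))
```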

If the DLP gateway cannot use an encoding for a message or file, it writes an error message to $DLPDIR/log/dlpe_problem_files.log:

File <file name> has unsupported charset: <charset name>. Trying to convert anyway

If the DLP gateway cannot use an encoding, it may be unable to convert the message (or parts of it) to UTF-8. In that case, the DLP gateway does not fully scan the message.
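
To show why a failed conversion leads to an incomplete scan, here is a hedged Python sketch of converting bytes to UTF-8. It is not the gateway's conversion logic: the function name to_utf8, the return convention, and the use of replacement characters for undecodable bytes are all assumptions made for illustration.

```python
from typing import Optional, Tuple

def to_utf8(data: bytes, charset: str) -> Tuple[Optional[bytes], bool]:
    """Convert data from charset to UTF-8.

    Returns (utf8_bytes, fully_converted). utf8_bytes is None when the
    charset is unknown to Python's codec registry.
    """
    try:
        return data.decode(charset).encode("utf-8"), True
    except LookupError:
        # Unknown charset: nothing can be converted in this sketch.
        return None, False
    except UnicodeDecodeError:
        # Some bytes are invalid in this charset: convert what is possible
        # and mark the rest with U+FFFD replacement characters.
        text = data.decode(charset, errors="replace")
        return text.encode("utf-8"), False

converted, complete = to_utf8("Привет".encode("koi8-r"), "koi8-r")
print(converted.decode("utf-8"), complete)
```

Text that comes back as replacement characters, or does not come back at all, is text that a pattern cannot match, which is why such a message would be only partly scanned.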

Character Set Aliases

The character sets that can be used as the default input character set of the DLP gateway are:

Name of Character Set	Alias
UTF-8 Encoded Unicode	UTF-8
UTF-7 Encoded Unicode	UTF-7
ASCII (7-bit)	ASCII
Japanese (JIS)	JIS_X0201
Japanese (EUC)	EUC-JP
Korean Standard	KSC_5601
Simplified Chinese	GB2312
EBCDIC Code Page 37 (United States)	IBM037
EBCDIC Code Page 273 (Germany)	IBM273
EBCDIC Code Page 274 (Belgium)	IBM274
EBCDIC Code Page 277 (Denmark, Norway)	IBM277
EBCDIC Code Page 278 (Finland, Sweden)	IBM278
EBCDIC Code Page 280 (Italy)	IBM280
EBCDIC Code Page 284 (Latin America, Spain)	IBM284
EBCDIC Code Page 285 (Ireland, UK)	IBM285
EBCDIC Code Page 297 (France)	IBM297
EBCDIC Code Page 500 (International)	IBM500
EBCDIC Code Page 1026 (Turkey)	IBM1026
DOS Code Page 850 (Multilingual Latin I)	IBM850
DOS Code Page 852 (Latin II)	IBM852
DOS Code Page 855 (Cyrillic)	IBM855
DOS Code Page 857 (Turkish)	IBM857
DOS Code Page 860 (Portuguese)	IBM860
DOS Code Page 861 (Icelandic)	IBM861
DOS Code Page 863 (French)	IBM863
DOS Code Page 865 (Danish, Norwegian)	IBM865
DOS Code Page 869 (Greek)	IBM869
Windows Code Page 932 (Japanese Shift-JIS)	Shift_JIS
Windows Code Page 874 (Thai)	ibm874
Windows Code Page 949 (Korean)	KS_C_5601-1987
Windows Code Page 950 (Traditional Chinese Big 5)	csBig5
Windows Code Page 1250 (Central Europe)	windows-1250
Windows Code Page 1251 (Cyrillic)	windows-1251
Windows Code Page 1252 (Latin I)	windows-1252
Windows Code Page 1253 (Greek)	windows-1253
Windows Code Page 1254 (Turkish)	windows-1254
Windows Code Page 1255 (Hebrew)	windows-1255
Windows Code Page 1256 (Arabic)	windows-1256
Windows Code Page 1257 (Baltic)	windows-1257
ISO-8859-1 (Latin 1)	ISO-8859-1
ISO-8859-2 (Latin 2)	ISO-8859-2
ISO-8859-3 (Latin 3)	ISO-8859-3
ISO-8859-4 (Baltic)	ISO-8859-4
ISO-8859-5 (Cyrillic)	ISO-8859-5
ISO-8859-6 (Arabic)	ISO-8859-6
ISO-8859-7 (Greek)	ISO-8859-7
ISO-8859-8 (Hebrew)	ISO-8859-8
ISO-8859-9 (Turkish)	ISO-8859-9
Mac OS Roman	csMacintosh
Russian KOI8-R	KOI8-R