Regular Expressions and Character Sets

In This Section:

Regular Expression Syntax

Supported Character Sets

Regular Expression Syntax

This table shows the Check Point implementation of standard regular expression metacharacters.

Metacharacter  Name             Description
\              Backslash        escape metacharacters; non-printable characters; character types
[ ]            Square Brackets  character class definition
( )            Parentheses      sub-pattern, to use metacharacters on the enclosed string
{min[,max]}    Curly Brackets   min/max quantifier:
                                {n} - exactly n occurrences
                                {n,m} - from n to m occurrences
                                {n,} - at least n occurrences
.              Dot              match any character
?              Question Mark    zero or one occurrences (equals {0,1})
*              Asterisk         zero or more occurrences of preceding character
+              Plus Sign        one or more occurrences (equals {1,})
|              Vertical Bar     alternative
^              Circumflex       anchor pattern to beginning of buffer (usually a word)
$              Dollar           anchor pattern to end of buffer (usually a word)
-              Hyphen           range in character class
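
The metacharacters in this table behave much like those in other common regular expression engines. As an illustrative sketch only, the following uses Python's re module as a stand-in. This is an assumption: it is not the Check Point engine, and edge-case behavior may differ. The patterns and sample strings are hypothetical.

```python
import re

# Hypothetical sample patterns; Python's re module stands in for the
# DLP engine here, so treat the results as illustrative only.

# [ ] character class with a hyphen range, and {n}: exactly 4 digits.
four_digits = re.compile(r"^[0-9]{4}$")

# ( ) sub-pattern with | alternative, and ? for zero or one occurrence.
color = re.compile(r"^(gray|grey) ?cats?$")

# \ escapes the dot so it is matched literally; + is one or more occurrences.
file_name = re.compile(r"^[a-z]+\.txt$")

# {n,m}: from 2 to 3 occurrences of the sub-pattern "ab".
repeated = re.compile(r"^(ab){2,3}$")


def matches(pattern, text):
    """Return True if the compiled pattern matches the text."""
    return pattern.search(text) is not None
```

For example, matches(four_digits, "2024") is true and matches(four_digits, "20245") is false, because the ^ and $ anchors require the whole buffer to be exactly four digits.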

Using Non-Printable Characters

To match non-printable characters in a pattern, use these backslash escape sequences.

Character  Description
\a         alarm; the BEL character (hex code 07)
\cX        "control-X", where X is any character
\e         escape (hex code 1B)
\f         formfeed (hex code 0C)
\n         newline (hex code 0A)
\r         carriage return (hex code 0D)
\t         tab (hex code 09)
\ddd       character with octal code ddd
\xhh       character with hex code hh
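
As a rough illustration, the sketch below shows that the named, hex, and octal forms of an escape refer to the same character. It uses Python's re module as a stand-in for the DLP engine (an assumption). Python's re accepts \a, \f, \n, \r, \t, \ddd and \xhh, but not \cX or \e, so those two are not shown. The sample text is hypothetical.

```python
import re

# Python's re module is a stand-in here (an assumption, not the DLP engine).
# Hypothetical sample: a tab-separated pair ending in carriage return + newline.
text = "name\tvalue\r\n"

# Three ways to write the tab character: named escape, hex code, octal code.
tab_forms = [r"\t", r"\x09", r"\011"]
tab_positions = [re.search(p, text).start() for p in tab_forms]

# Carriage return followed by newline, written with hex codes 0D and 0A.
crlf = re.search(r"\x0D\x0A", text)
```

All three tab forms find the same position in the sample text, and the hex-coded pattern finds the line ending.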

Using Character Types

To match a type of character in a pattern, use these backslash escape sequences.

Character  Description
\d         any decimal digit [0-9]
\D         any character that is not a decimal digit
\s         any whitespace character
\S         any character that is not whitespace
\w         any word character (underscore or alphanumeric character)
\W         any non-word character (not underscore or alphanumeric)
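
The character types can be tried out with a short sketch. Python's re module is again used as a stand-in (an assumption). The re.ASCII flag restricts \d, \s and \w to ASCII so that \d means [0-9] as described in the table. Whether the DLP engine treats non-ASCII digits or letters the same way is not stated here. The sample string is hypothetical.

```python
import re

# Python's re module is a stand-in (an assumption, not the DLP engine).
# re.ASCII keeps \d equal to [0-9] and \w equal to [A-Za-z0-9_].
sample = "Order_42 shipped on 2024-01-15"

digit_runs = re.findall(r"\d+", sample, flags=re.ASCII)   # runs of digits
words = re.findall(r"\w+", sample, flags=re.ASCII)        # runs of word characters
non_word = re.findall(r"\W", sample, flags=re.ASCII)      # each non-word character
parts = re.split(r"\s+", sample, flags=re.ASCII)          # split on whitespace
digits_only = re.sub(r"\D", "", sample, flags=re.ASCII)   # drop every non-digit
```

Note that the underscore in "Order_42" counts as a word character, so \w+ keeps it in one piece, while the hyphens in the date are non-word characters.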

Supported Character Sets

The DLP gateway scans text in the UTF-8 Unicode character encoding. It therefore converts the messages and files that it scans from their original encoding to UTF-8.

Before it can convert a message or file, the DLP gateway must identify its encoding. The DLP gateway does this using the metadata or the MIME headers. If neither exists, the default gateway encoding is used.

The DLP gateway determines the encoding of the message or file it scans as follows:

  1. If the file contains metadata, the DLP gateway reads the encoding from there. For example, Microsoft Word files store the encoding in the file.
  2. Some files, such as text files or the body of an email, have no metadata but do have MIME headers. For those files, the DLP gateway reads the encoding from the MIME headers:

    Content-Type: text/plain; charset="iso-2022-jp"

  3. Some files have neither metadata nor MIME headers. For those files, the DLP gateway assumes that the encoding of the original message or file is the default encoding of the gateway. A log message is written to $DLPDIR/log/dlpe_problem_files.log:

    Charset for file <file name> is not provided. Using the default: <charset name>

    The out-of-the-box default encoding is Windows Code Page 1252 (Latin I). This can be changed.
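
The order of precedence in the steps above can be sketched as follows. This is a minimal illustration, not the gateway's implementation: the function names and the "metadata" / "mime" / "default" labels are hypothetical, and the metadata charset is assumed to have been extracted already by some other means. Only the precedence order, the sample header, and the windows-1252 default come from the text.

```python
from email import message_from_string
from typing import Optional, Tuple

# Out-of-the-box default named in the text; all other names are hypothetical.
DEFAULT_CHARSET = "windows-1252"


def charset_from_mime(header_line: str) -> Optional[str]:
    """Extract the charset parameter from a Content-Type header line."""
    msg = message_from_string(header_line + "\n\n")
    return msg.get_content_charset()  # lower-cased name, or None if absent


def pick_charset(metadata_charset: Optional[str],
                 mime_header: Optional[str],
                 default: str = DEFAULT_CHARSET) -> Tuple[str, str]:
    """Return (charset, source) using the order: metadata, MIME, default."""
    if metadata_charset:
        return metadata_charset, "metadata"
    if mime_header:
        charset = charset_from_mime(mime_header)
        if charset:
            return charset, "mime"
    return default, "default"
```

With the header from step 2, pick_charset(None, 'Content-Type: text/plain; charset="iso-2022-jp"') selects iso-2022-jp from the MIME headers. With no metadata and no header, it falls back to the default.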

To change the default encoding of the DLP gateway:

  1. On the DLP gateway, edit the file:
    • R77, R77.10, R77.20 - $DLPDIR/config/dlp.conf
    • R77.30 - $FWDIR/conf/file_convert.conf
  2. In the engine section, search for the default_charset_for_text_files field. For example:

    :default_charset_for_text_files (windows-1252)

Use one of the supported aliases as the value of this field. Each character set has one or more optional aliases.

For example, to make the default character set encoding Russian KOI8-R, change the field value as follows:

    :default_charset_for_text_files (KOI8-R)
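
If the change needs to be scripted, the field value can be rewritten with a small helper. The sketch below is hypothetical and assumes only what the example lines show: the field appears once, in the form :default_charset_for_text_files (value). It works on the file contents as a string. Reading, backing up, and writing the file are left to the caller.

```python
import re

FIELD = "default_charset_for_text_files"


def set_default_charset(conf_text: str, alias: str) -> str:
    """Return conf_text with the value of the default charset field replaced.

    Hypothetical helper: assumes the field occurs exactly once, written as
    ':default_charset_for_text_files (value)'.
    """
    pattern = re.compile(r"(:" + re.escape(FIELD) + r"\s*\()[^)]*(\))")
    # A function is used as the replacement so the alias is inserted literally.
    new_text, count = pattern.subn(
        lambda m: m.group(1) + alias + m.group(2), conf_text
    )
    if count != 1:
        raise ValueError(f"expected exactly one {FIELD} field, found {count}")
    return new_text
```

The helper raises an error instead of guessing when the field is missing or appears more than once, so a malformed file is not silently rewritten.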

If the DLP gateway does not support the encoding of a message or file, it writes an error message to $DLPDIR/log/dlpe_problem_files.log:

    File <file name> has unsupported charset: <charset name>. Trying to convert anyway

If the encoding is not supported, the DLP gateway might be unable to convert the message, or parts of it, to UTF-8. In that case, the DLP gateway does not fully scan the message.
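
To see why an unsupported or wrong charset can leave part of a message unscanned, consider the sketch below. How the gateway attempts the conversion is not described here, so this shows only one possible approach, and the function name and fallback choice are assumptions. It uses Python's codec registry, whose set of known charset names is not the same as the gateway's alias list.

```python
import codecs
from typing import Tuple


def to_utf8(raw: bytes, charset: str,
            fallback: str = "windows-1252") -> Tuple[bytes, bool]:
    """Return (utf8_bytes, fully_converted). Hypothetical helper."""
    supported = True
    try:
        codecs.lookup(charset)
    except LookupError:
        # Unknown charset: try to convert anyway using the fallback.
        charset = fallback
        supported = False
    try:
        text = raw.decode(charset)
        clean = True
    except UnicodeDecodeError:
        # Undecodable bytes become U+FFFD, so their content is lost.
        text = raw.decode(charset, errors="replace")
        clean = False
    return text.encode("utf-8"), supported and clean
```

When the second value is False, some of the original text was either guessed at or replaced, which corresponds to content that a scanner working on the UTF-8 result cannot reliably inspect.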

Character Set Aliases

The character sets that can be used as the default input character set of the DLP gateway are:

Name of Character Set                              Alias
UTF-8 Encoded Unicode                              UTF-8
UTF-7 Encoded Unicode                              UTF-7
ASCII (7-bit)                                      ASCII
Japanese (JIS)                                     JIS_X0201
Japanese (EUC)                                     EUC-JP
Korean Standard                                    KSC_5601
Simplified Chinese                                 GB2312
EBCDIC Code Page 37 (United States)                IBM037
EBCDIC Code Page 273 (Germany)                     IBM273
EBCDIC Code Page 274 (Belgium)                     IBM274
EBCDIC Code Page 277 (Denmark, Norway)             IBM277
EBCDIC Code Page 278 (Finland, Sweden)             IBM278
EBCDIC Code Page 280 (Italy)                       IBM280
EBCDIC Code Page 284 (Latin America, Spain)        IBM284
EBCDIC Code Page 285 (Ireland, UK)                 IBM285
EBCDIC Code Page 297 (France)                      IBM297
EBCDIC Code Page 500 (International)               IBM500
EBCDIC Code Page 1026 (Turkey)                     IBM1026
DOS Code Page 850 (Multilingual Latin I)           IBM850
DOS Code Page 852 (Latin II)                       IBM852
DOS Code Page 855 (Cyrillic)                       IBM855
DOS Code Page 857 (Turkish)                        IBM857
DOS Code Page 860 (Portuguese)                     IBM860
DOS Code Page 861 (Icelandic)                      IBM861
DOS Code Page 863 (French)                         IBM863
DOS Code Page 865 (Danish, Norwegian)              IBM865
DOS Code Page 869 (Greek)                          IBM869
Windows Code Page 932 (Japanese Shift-JIS)         Shift_JIS
Windows Code Page 874 (Thai)                       ibm874
Windows Code Page 949 (Korean)                     KS_C_5601-1987
Windows Code Page 950 (Traditional Chinese Big 5)  csBig5
Windows Code Page 1250 (Central Europe)            windows-1250
Windows Code Page 1251 (Cyrillic)                  windows-1251
Windows Code Page 1252 (Latin I)                   windows-1252
Windows Code Page 1253 (Greek)                     windows-1253
Windows Code Page 1254 (Turkish)                   windows-1254
Windows Code Page 1255 (Hebrew)                    windows-1255
Windows Code Page 1256 (Arabic)                    windows-1256
Windows Code Page 1257 (Baltic)                    windows-1257
ISO-8859-1 (Latin 1)                               ISO-8859-1
ISO-8859-2 (Latin 2)                               ISO-8859-2
ISO-8859-3 (Latin 3)                               ISO-8859-3
ISO-8859-4 (Baltic)                                ISO-8859-4
ISO-8859-5 (Cyrillic)                              ISO-8859-5
ISO-8859-6 (Arabic)                                ISO-8859-6
ISO-8859-7 (Greek)                                 ISO-8859-7
ISO-8859-8 (Hebrew)                                ISO-8859-8
ISO-8859-9 (Turkish)                               ISO-8859-9
Mac OS Roman                                       csMacintosh
Russian KOI8-R                                     KOI8-R