Regular Expressions and Character Sets

Regular Expression Syntax

This table shows the Check Point implementation of standard regular expression metacharacters.

Metacharacter   Name              Description

\               Backslash         escape metacharacters
                                  non-printable characters
                                  character types
[ ]             Square Brackets   character class definition
( )             Parentheses       sub-pattern, to use metacharacters on the enclosed string
{min[,max]}     Curly Brackets    min/max quantifier
                                  {n} - exactly n occurrences
                                  {n,m} - from n to m occurrences
                                  {n,} - at least n occurrences
.               Dot               match any character
?               Question Mark     zero or one occurrences (equals {0,1})
*               Asterisk          zero or more occurrences of the preceding character (equals {0,})
+               Plus Sign         one or more occurrences (equals {1,})
|               Vertical Bar      alternative
^               Circumflex        anchor pattern to start of buffer (usually a word)
$               Dollar            anchor pattern to end of buffer (usually a word)
-               Hyphen            range in character class
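
The metacharacters in the table above can be tried out with any PCRE-style engine. The following is a minimal sketch in Python, assuming that Python's re module behaves like the Check Point implementation for these basic constructs; the sample text and the pattern names are hypothetical and are not part of the product.

```python
import re

# Hypothetical sample text; the patterns use only metacharacters from the table.
text = "Card 4580-1234-5678-9012 issued to ID 123456789"

# \d is a character type, {4} an exact quantifier, ( ) a sub-pattern,
# and {3} repeats that whole sub-pattern three times.
card = re.compile(r"\d{4}(-\d{4}){3}")

# [ ] defines a character class, - a range inside it, + one or more occurrences.
word = re.compile(r"[A-Za-z]+")

# ^ and $ anchor the pattern to the start and end of the buffer,
# | separates alternatives, ? makes the preceding character optional.
label = re.compile(r"^(secret|confidential)s?$")


def find_card(s):
    """Return the first card-like number in s, or None if there is none."""
    m = card.search(s)
    return m.group(0) if m else None
```

Here find_card(text) returns the 16-digit number with its hyphens, and label matches "secret", "secrets", and "confidential" but not "top secret", because the circumflex anchors the pattern to the start of the buffer.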

Non-Printable Characters

To use non-printable characters in patterns, use the backslash escape sequences in this table.

Character   Description

\a          alarm, the BEL character (hex code 07)
\cX         "control-X", where X is any character
\e          escape (hex code 1B)
\f          formfeed (hex code 0C)
\n          newline (hex code 0A)
\r          carriage return (hex code 0D)
\t          tab (hex code 09)
\ddd        character with octal code ddd
\xhh        character with hex code hh
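
As an illustration of these escapes, here is a short Python sketch. It assumes Python's re module as a stand-in for the gateway's engine. Python accepts \a, \f, \n, \r, \t, \ddd and \xhh, but it does not accept \e or \cX, so the sketch writes those two with their hex codes instead. All pattern names are hypothetical.

```python
import re

# \t is understood directly: two words separated by a tab.
tab_separated = re.compile(r"\w+\t\w+")

# \xhh names a character by hex code: \x07 is BEL, the same character as \a.
bel_hex = re.compile(r"\x07")
bel_alarm = re.compile(r"\a")

# \ddd names a character by octal code: octal 011 is hex 09, the tab.
tab_octal = re.compile(r"\w+\011\w+")

# Python's re rejects \e and \cX, so the hex forms stand in for them here.
esc = re.compile(r"\x1b")      # same character as \e
ctrl_c = re.compile(r"\x03")   # same character as \cC
```

The hex and octal forms are interchangeable with the named escapes: bel_hex and bel_alarm match the same byte, as do tab_separated and tab_octal.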

Character Types

To specify types of characters in patterns, use the backslash escape sequences in this table.

Character   Description

\d          any decimal digit [0-9]
\D          any character that is not a decimal digit
\s          any whitespace character
\S          any character that is not whitespace
\w          any word character (underscore or alphanumeric character)
\W          any non-word character (not underscore or alphanumeric)
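
The character types can be checked quickly in Python. This is a sketch only: it assumes Python's re module with the re.ASCII flag, which restricts \d to exactly [0-9] as described in the table (without the flag, Python also matches non-ASCII digits and letters). The sample string is hypothetical.

```python
import re

# re.ASCII keeps \d, \w and \s to their ASCII meanings, so \d is exactly [0-9].
FLAGS = re.ASCII

digits = re.compile(r"\d+", FLAGS)
non_digits = re.compile(r"\D+", FLAGS)
words = re.compile(r"\w+", FLAGS)
non_words = re.compile(r"\W", FLAGS)
gaps = re.compile(r"\s+", FLAGS)

sample = "user_42 paid\t1999 USD"   # hypothetical sample text

print(digits.findall(sample))      # ['42', '1999']
print(words.findall(sample))       # ['user_42', 'paid', '1999', 'USD']
print(gaps.split(sample))          # ['user_42', 'paid', '1999', 'USD']
print(non_words.findall(sample))   # [' ', '\t', ' ']
```

The underscore in user_42 is a word character, so \w+ keeps user_42 together while \d+ picks out only 42.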

Supported Character Sets

The DLP Gateway examines text in the UTF-8 Unicode character encoding. It therefore converts the messages and files that it examines from their original encoding to UTF-8.

Before the DLP Gateway can convert a message or file, it must identify the original encoding. To do this, the DLP Gateway uses the meta data or the MIME headers. If neither is available, it uses the default gateway encoding.

The DLP Gateway determines the encoding of the message or file it examines as follows:

  1. If the file contains meta data, the DLP Gateway reads the encoding from there. For example, Microsoft Word files store the encoding in the file.

  2. Some files, such as text files or the body of an email, have no meta data but do have MIME headers. For those files, the DLP Gateway reads the encoding from the MIME headers:

    Content-Type: text/plain; charset="iso-2022-jp"

  3. Some files do not have meta data or MIME headers. For those files, the DLP Gateway assumes that the encoding of the original message or file is the default encoding of the gateway. A log message is written to $DLPDIR/log/dlpe_problem_files.log:

    Charset for file <file name> is not provided. Using the default: <charset name>

    The out-of-the-box default encoding is Windows Code Page 1252 (Latin I). This can be changed.
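
The three-step order above can be sketched in Python. This is an illustration of the decision order only, not the gateway's actual code: the function names pick_charset and charset_from_mime are hypothetical, and the sketch assumes that the meta data charset, if any, has already been extracted by some other means.

```python
from email.message import Message

DEFAULT_CHARSET = "windows-1252"  # the out-of-the-box default named above


def charset_from_mime(header_value):
    """Return the charset parameter of a Content-Type header value, or None."""
    msg = Message()
    msg["Content-Type"] = header_value
    return msg.get_content_charset()


def pick_charset(metadata_charset=None, content_type=None,
                 default=DEFAULT_CHARSET):
    """Follow the order: meta data first, then MIME headers, then the default.

    Returns a (charset, source) pair so the caller can log which step applied.
    """
    if metadata_charset:
        return metadata_charset, "metadata"
    if content_type:
        mime_charset = charset_from_mime(content_type)
        if mime_charset:
            return mime_charset, "mime"
    # Neither source named a charset: fall back to the gateway default.
    return default, "default"
```

For the header shown in step 2, pick_charset(content_type='text/plain; charset="iso-2022-jp"') yields the iso-2022-jp charset from the MIME step, while a call with no arguments falls through to the default.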

To change the default encoding of the DLP Gateway:

  1. On the DLP Gateway, edit the $FWDIR/conf/file_convert.conf file.

  2. In the engine section, find the default_charset_for_text_files field.

    For example:

    :default_charset_for_text_files (windows-1252)

    Use one of the supported aliases as the value of this field. Each character set has one or more optional aliases.

    For example, to make the default character set encoding Russian KOI8-R, change the field value as follows:

    :default_charset_for_text_files (KOI8-R)

If the DLP Gateway cannot use the encoding of a message or file, it writes an error message to $DLPDIR/log/dlpe_problem_files.log:

File <file name> has unsupported charset: <charset name>. Trying to convert anyway

If the DLP Gateway cannot use an encoding, it is possible that it cannot change the message (or parts of it) to UTF-8. If that is so, the DLP Gateway does not fully examine the message.
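
To make the consequence of a failed conversion concrete, the following Python sketch converts bytes to UTF-8 and reports whether the conversion was complete. It is an assumption-laden illustration, not the gateway's implementation: the function to_utf8 is hypothetical, Python's codec registry stands in for the gateway's list of supported character sets, and falling back to the default charset for an unknown name is only one possible reading of "Trying to convert anyway".

```python
import codecs


def to_utf8(raw, charset, default="windows-1252"):
    """Return (utf8_bytes, fully_converted) for raw bytes in the named charset."""
    try:
        codecs.lookup(charset)
        known = True
    except LookupError:
        known = False
    # Assumption: an unknown charset is still attempted, using the default.
    codec = charset if known else default
    text = raw.decode(codec, errors="replace")
    # U+FFFD marks bytes that could not be converted; those parts of the
    # message would not be available for full examination.
    fully_converted = known and "\ufffd" not in text
    return text.encode("utf-8"), fully_converted
```

A KOI8-R message converts cleanly, so the second element of the result is True. With an unrecognized charset name the bytes are still decoded, but the result is flagged as not fully converted.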

Character Set Aliases

The table below shows character sets you can use as the default input character set of the DLP Gateway.