Some more words about Regular Expressions

Here I am trying to clarify a little bit more what regular expressions are. I do hope it will help you. Please don't hesitate contacting me if you have further questions.

1.

Regular expressions are key concepts in Unix. Basically you can find them in two major fields: wildcards under shell (like ls a* or rm d?.tmp), and as searching patterns for grep, sed, Perl, etc.

The main idea is that a regular expression describes a set of strings (a set of character sequences). Or, if you prefer, a regular expression describes some property, and some strings will match this property (so they will be the element of the set), while others won't.

For instance, the d?.tmp describes the following set of character strings when appearing as the argument of a Unix command:

{da.tmp, db.tmp, ..., dz.tmp, d0.tmp, d1.tmp, ... , d9.tmp, d$.tmp, d?.tmp, d+.tmp, d(.tmp, ...}

Putting that into words: any string beginning with the character 'd', followed by one character (that can be any possible character), and followed by the substring '.tmp'. If you want, this is the property which characterizes the elements of the above set: all strings that match this property, will be elements of the set, but all strings that do not match this property will be out of the above set.

Another example: under grep, the expression ' [a-z]..x' describes any string that begins with a character within the range 'a' to 'z', followed by two characters (that can be letters, numbers, punctuation marks, etc.), and ending with the character 'x'. If you want, the regular expression describes this set of strings. Or, if you prefer, the regular expression describes this property (of "beginning with ..., ending with..."). The point is that it can be always decided whether a string matches this property (is the element of the set), or not.

Generally speaking, a regular expression is a formalism to describe a set of character strings. Or, if you prefer: a regular expression describes a pattern / a property , and the set corresponding to the regular expression is the set of strings that match this pattern.

2.

The details of the formalism and its purpose of using can be very different. For instance:

- In shell, if you use wildcards in the argument of a commands (e.g. ls a* or rm d?.tmp), then the preprocessor of the shell will replace the expression with all filenames that match this expression (that belong to the set described by the expression). All this applies only if there is at least one filename (in the current directory, or in the referred directory, -ies) that would match the expression, otherwise the wildcards are understood as being regular characters. Furthermore, all this is done before the actual command is run (in a preprocessing stage), therefore this replacement is executed independently from the command itself (whether it would look for filenames as arguments or not).

- In grep: the command would return all lines of the input that have a substring that matches the regular expression (a sequence within the line that is an element of the set described by the regular expression).

- In sed, the command /RegExp/d will delete all lines of the input string that have a substring that matches the regular expression. The command s/RegExp/String/ will change in each line the (first) substring that matches RegExp into String.

- In Perl the expression String =~ RegExp is true if and only if String has a substring that matches RegExp. (!~ is its negation.)

You can see that both in grep, sed and Perl, the regular expressions describe a pattern, and the question is always whether it is possible to find such a substring within some longer strings. This longer string is a line within a file in the first two cases, and it is the value of some variables or the value of some expressions in the third one.

3.

So now the question is what tools you have in order to describe the patterns that you want to be matched by filenames or by substrings? In other words: what tools do you have in order to describe this set of strings?

In fact, this question is part of a much larger field that is called "formal language theory". It is a subfield of mathematics (of algebra), and it was developed in the last 50-60 years in order to meet the needs of computational science and modern linguistics. (Therefore you will have a special course on formal languages during your studies at Alfa-Informatica.) The starting point of formal language theory is the following: you have an alphabet (e.g. a-z, A-Z, 0-9, punctuation marks, etc.), and using this alphabet you can write down all kinds of words. A language is a set of words written down using this alphabet.

Let us consider the following language: {a, ab, aba, abab, ababa, ababab, abababa,...}. Now let's try to describe that language. How would you do that? The rule is the following: a word must begin with the character 'a', and then a 'b' must be preceeded by an 'a', and an 'a' (except of the first one) must be preceeded by a 'b'. So the word 'abababababababababa' is an element of this language, because it follows the rule. But 'babababab' is not an element of this language, because its first character is not an 'a' (this was the first part of our rule). And 'abaababbaabbaba' is not an element of this language either, because the second part of the rule is not satisfied.

The goal of formal language theory is to invent different sets of formal tools with which you can describe such a language; and then to investigate the following question: what kind of tools can describe what kind of languages?

In the case of regular expressions for example, we have the following four basic tools:

Some characters (a, b, &, >, 0, 6, ...), i.e. you can use the elements of your alphabet.

Concatenation, i.e. writing the two strings next to each other, forming one long string: the concatenation of the string 'ta' with the string 'mas' is 'tamas'.

Kleene-star: the previous unit zero times, once, twice, three times, etc. (In shell '*' means the Kleene-star of "any character", while in grep, sed and Perl '*' refers to the Kleene-star of the previous string).

Set union, i.e. "logical or": for instance in both shell and grep or sed or Perl, [abc] means 'a' or 'b' or 'c', while [a-m] means 'a' or 'b' or ... or 'm'.

Then you can have additional tools, for instance:

Any character of the character set (? in shell, . in grep, sed and Perl).

Kleene-plus: it refers to the previous unit ones or twice or more times (but not zero times; a Kleene-plus is nothing but one instance being compulsory followed by a Kleene-star).

Optionality: zero times or once (? in grep, sed and Perl).

Using these, you can derive how to do something exactly n times, minimum n times and maximum n times (using curly bracket {...,...}, cf. examples in the lecture notes).

Parantheses, e.g. used to show to which string does the *, +, ? symbols refer to in grep, sed and Perl.

Negation: anything, but not these ones. For instance, in grep, sed and Perl [^bc,] refers to any character but not the ones listed ('b', 'c' or ',').

For instance, how would we formalize the above rule, describing the language {a, ab, aba, abab, ababa,..}, using the tools of regular expressions? Well, let us reformulate our rule. The string should start with a character 'a', then you can have any number of the bi-gram 'ba' (zero or more times), followed by an optional 'b'. Try it out, this rule will describe exactly the above language. So now we can formulate this set in a formal way, using the tools of regular expressions (using the notations of grep, sed or Perl):

a(ba)*b?

A further remark: since grep, sed and Perl usually use regular expressions for looking for a pattern within a string, you sometimes want to refer to the place of this pattern within the bigger string. For instance you want your pattern to occur at the beginning of your larger string (e.g. at the beginning of the line in the case of grep and sed). Therefore you can use the ^ character that reads "beginning of the line / beginning of the string", so the RegExp ^[aeiou][0-9] means: 'at the beginning of the line, you want first a vowel, then a digit'. The same applies to the $ symbol, meaning 'end of the string / end of the line'.

By the way, this helps you in formulating queries when you don't want just part of your string / part of your line to match a certain regular expression, but you want instead the whole string / the whole line to match your regular expression. So ^RegExp$ means, that the regular expression covers exactly the string / the line, because on the left it must hit the beginning, and on the right end it must hit the end of the string / of the line.

4.

Finally, let me give you the mathematical definition of a regular expression:

Definition: Let S be your alpahabet. Then the set of regular expressions with respect to the set S meets the following criteria:

(i) The empty string is a regular expression and the set {a} for any element a of S is a regular expression.

(ii) If P and Q are regular expressions, then
     - their concatenation P.Q is also a regular expression;
     - their union P U Q is also a regular expression;
     - the Kleene-star of P, i.e. P* is also a regular expression.

(iii) There are no other regular expressions.

Finally, let me mention that regular expressions are closely related to the so-called Finite State Automata and to Regular Grammars, and the set of languages that can be described by all of these tools are called regular languages. The Alfa Informatica Department is proud of being one of the most important centers in the world in implementing these mathematical tools for linguistic models. Here is the Finite State web-site of Jan Daciuk, and here is the web page of FSA Utilities developped by Gertjan van Noord.