Text manipulation, week 4

 

 
 
 
 
 
 
 
 

1. Metacharacters, escapes, quotes

End of last week: what was the problem with

echo (13/26)*4 | bc
and why does
echo \(13/26\)*4 | bc
work?

Because the characters '(' and ')' have a special meaning for the shell, which tries to interpret them itself in its preprocessing stage instead of passing them on to 'echo'. The same is true for the '*' and '?' symbols in

echo What does the * symbol mean?
Characters with a special meaning (like ?, *, (, ), <, >, |, &, etc.) are called metacharacters. If you want to use one as a simple character, you have two options: put a backslash (\) before it, as in the 'bc' example above, or put it between quotation marks (single or double). Remark: the difference between the apostrophe ('hard quotation') and the double quotation mark ("soft quotation") will become clear when dealing with variables: a variable placed between double quotation marks is replaced with its actual value, while between apostrophes it is left as it is.
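A minimal sketch of escaping and of the two kinds of quotation. The variable name 'fruit' is made up for this illustration only.

```shell
# A hypothetical variable, used only for this illustration.
fruit=apple

# A backslash neutralizes the single character that follows it.
escaped=$(echo What does the \* symbol mean\?)

# Quotation marks neutralize the metacharacters between them.
quoted=$(echo 'What does the * symbol mean?')

# Soft quotation: the variable is replaced with its value.
soft=$(echo "I eat an $fruit")

# Hard quotation: the text is taken literally.
hard=$(echo 'I eat an $fruit')

echo "$escaped"   # What does the * symbol mean?
echo "$quoted"    # What does the * symbol mean?
echo "$soft"      # I eat an apple
echo "$hard"      # I eat an $fruit
```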

Try

echo "Here you have 'single quotation marks' between double ones"
echo 'And here you have "double quotation marks" between single ones'

echo There is plenty of room                     between these words
echo There is plenty of room "                 " between these words
 

Another way to calculate something is to put the expression between '$[' and ']' (an older bash notation), like:
echo $[3+5] $[11/5]
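The same calculations can be sketched with the '$((' ... '))' notation, which is the form defined for POSIX shells. The variable names are chosen for this illustration only. Note that the shell calculates with integers, so the fraction of a division is dropped.

```shell
# Arithmetic expansion: the shell itself computes the value.
sum=$((3+5))
quotient=$((11/5))     # integer division: the fraction is dropped
remainder=$((11%5))    # what is left over after the division

echo "$sum $quotient $remainder"   # 8 2 1
```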
Question: the backslash character (\) is a metacharacter itself. How do you neutralize it? By typing it twice (\\).

An important principle in UNIX: the parameters are separated by spaces. Therefore the space is also a metacharacter, as seen in the above "plenty of room" examples, so you need to neutralize it, too, whenever it is meant literally.

Remark: you can use most metacharacters in file names (e.g. spaces, as in Windows since Windows 95), but then you always have to remember to neutralize them. Therefore it is better to avoid them. For instance, " rm file name " would remove two files, one named "file" and one named "name", and not the file named "file name", if you happen to have all of them. To remove this third file, use " rm 'file name' ".
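The 'rm' example can be tried out safely in a scratch directory. This is only a sketch: it assumes that the 'mktemp' command is available (it is on most current systems), and the variable names are made up.

```shell
# Work in a scratch directory so that no real files are touched.
dir=$(mktemp -d)
cd "$dir" || exit 1

# Create three files: 'file', 'name' and 'file name'.
touch file name 'file name'

# Quoted: only the single file called 'file name' is removed.
rm 'file name'
after_quoted=$(ls | wc -l | tr -d ' ')

# Unquoted: the shell passes two parameters, so 'file' and 'name' are removed.
rm file name
after_unquoted=$(ls | wc -l | tr -d ' ')

echo "$after_quoted $after_unquoted"   # 2 0

# Clean up the (now empty) scratch directory.
cd / && rmdir "$dir"
```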
 
 
 
 

2. Coding different alphabets, code pages, etc.
 

"In UNIX everything is text": what does this mean?

Originally computers were built and used by English-speaking people. English is a special language, because it (practically) doesn't use diacritical marks. What about other languages that have all kinds of strange characters (e.g. á, é, í, ó, ú, ö, ü,...)? What about languages using non-Latin alphabets (Arabic, Greek, Hebrew, languages using the Cyrillic alphabet, Asian languages,...)?

You have to differentiate between the following "levels":

1. Keyboard layout: Which key is understood as what character? Different languages have different standards: e.g. English: QWERTY..., German: QWERTZ..., French: AZERTY, etc. Pressing a key sends a signal to the computer (to the central unit of the machine, and not to the computer used as a terminal!), that is understood as a number according to the layout set up.

2. Code page: a number (either a signal received from the keyboard, i.e. the standard input, or a byte appearing in a file) is associated with a character (e.g. 32 is a space, 48 is the character '0', 65 is the character 'A' in the ASCII standard). This associating process needs a coding system. On modern computers (from the late 80's) there are several coding systems, different standards for associating numbers with characters. These standards are called code pages.

3. Font: the graphical shape of the characters (e.g. Times, Times New Roman, Arial, Courier, Helvetica,...). Originally fonts were defined using a matrix of points (either points of the screen or points of a plotter / printer); e.g. Commodore computers used a matrix of 8 × 8 points. Nowadays people use "vector-graphic" fonts, which can be enlarged without losing the quality of the text. Variants such as "bold", "italic" and "underlined" multiply the number of possible fonts.
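The association between numbers and characters at the code page level can be made visible with standard tools. A small sketch, assuming the ASCII values quoted above; the variable names are made up.

```shell
# From character to number: od prints every byte as an unsigned decimal value.
# The unquoted inner substitution lets echo squeeze od's padding to single spaces.
codes=$(echo $(printf ' 0A' | od -An -tu1))

# From number to character: printf accepts a byte as an octal escape;
# octal 101 is decimal 65.
letter=$(printf '\101')

echo "$codes"    # 32 48 65
echo "$letter"   # A
```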


Standard code pages are:

ASCII: American Standard Code for Information Interchange, using 7 bits. (The 8th bit used to be a parity or security bit, to check whether something went wrong.)

ANSI standards: American National Standards Institute
                                 e.g. ANSI-1252: Latin1 for Windows, ANSI-1250: Central and Eastern European characters

ISO-8859 series: International Organization for Standardization
                                 e.g. 8859-1: Latin-1 (Western European languages),
                                         8859-2: Central and Eastern European languages,
                                         8859-4: Baltic languages,
                                         8859-5: Cyrillic languages,
                                         8859-6: Arabic,
                                         8859-7: Greek,
                                         8859-8: Hebrew,
                                         8859-11: Thai,
                                         8859-15: Latin-9, a revision of Latin-1 with the Euro symbol

Unicode (ISO-10646): originally it used two bytes for one character (possible by now, due to the increased memory and storage capacity of computers), so that there were not 128 (cf. ASCII) or 256 (cf. ANSI, ISO) but 65,536 possibilities. Unicode has since grown beyond this limit, and its characters are stored in encodings such as UTF-8, which uses one to four bytes per character. This makes it possible to use several alphabets at the same time; before Unicode it was difficult to edit a document containing several languages with different alphabets.
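How many bytes a character takes depends on the coding system. The sketch below compares the letter 'é' in ISO 8859-1 with the same letter in UTF-8, a widely used way of storing Unicode text. The bytes are written as octal escapes so that the example does not depend on the settings of your terminal; the variable names are made up.

```shell
# The letter 'é' stored in ISO 8859-1: a single byte (octal 351).
latin1_bytes=$(printf '\351' | wc -c | tr -d ' ')

# The same letter stored in UTF-8: two bytes (octal 303 251).
utf8_bytes=$(printf '\303\251' | wc -c | tr -d ' ')

# A plain ASCII letter is still a single byte in UTF-8.
ascii_bytes=$(printf 'A' | wc -c | tr -d ' ')

echo "$latin1_bytes $utf8_bytes $ascii_bytes"   # 1 2 1
```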

Here is a link to Egyptian hieroglyphs in Unicode, as well as to information about coding different "exotic" languages in different ways.

Furthermore, we have to differentiate between two types of characters:


Tip: try out 'man ascii', and check its "SEE ALSO" section, too!
 
 

3. Regular expressions using 'grep'
 

A regular expression describes a set of strings. It is defined using concatenation (joining substrings), Kleene-star and Kleene-plus (concatenating a finite number of elements taken from a given set of strings; in the case of Kleene-star this number can also be zero), as well as the union, intersection or complement of previously defined regular expressions.

In short: concatenation, repetition, union, intersection, complement.

'grep' is a very useful command that we will use a lot. Its simplest syntax is:

grep <reg_ex> [file_names]
What does it do? It outputs the lines of the given file(s), or of the standard input if no file is specified, that match the given regular expression. It can be seen as a filter that collects only the useful information (e.g. when a program produces too much output).
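A small sketch of 'grep' as a filter, once reading from the standard input and once from a file. The sample lines and the variable names are made up for this illustration, and the temporary file assumes that 'mktemp' is available.

```shell
# A hypothetical sample input of three lines.
sample=$(printf 'apple pie\nbanana split\ncherry pie\n')

# Filter the standard input: keep the lines containing 'pie'.
from_pipe=$(printf '%s\n' "$sample" | grep 'pie')

# Filter a file: the same lines, now read from a named file.
tmp=$(mktemp)
printf '%s\n' "$sample" > "$tmp"
from_file=$(grep 'pie' "$tmp")
rm "$tmp"

# Both ways print the lines 'apple pie' and 'cherry pie'.
echo "$from_pipe"
echo "$from_file"
```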

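If you want the lines that match both of two conditions (conjunction), then use a pipeline of two 'grep' commands. If you want the lines that match (at least) one of two conditions (disjunction), then give each pattern with its own -e option (the -F option or the 'fgrep' command, in contrast, treats the pattern as a fixed string and not as a regular expression). A few others of its most important options: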

-c    returns you only the number of lines matching the given regular expression
-i    ignore case distinction: does not differentiate between capital and lowercase letters
-v    inverse: returns those lines that don't match the condition
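The options can be combined and tried out on a few lines. In this sketch the sample lines and the variable names are made up; the disjunction is written with one -e option per pattern, which is one way of doing it that POSIX 'grep' supports.

```shell
# A hypothetical sample input of three lines.
sample=$(printf 'Red apple\ngreen apple\nred cherry\n')

# Conjunction with a pipeline: 'red' in any case AND 'apple'.
both=$(printf '%s\n' "$sample" | grep -i 'red' | grep 'apple')

# Disjunction: 'green' OR 'cherry', one -e option per pattern.
either=$(printf '%s\n' "$sample" | grep -e 'green' -e 'cherry')

# -c counts the matching lines, -v selects the non-matching ones.
count=$(printf '%s\n' "$sample" | grep -c 'apple')
others=$(printf '%s\n' "$sample" | grep -v 'apple')

echo "$both"     # Red apple
echo "$either"   # the lines 'green apple' and 'red cherry'
echo "$count"    # 2
echo "$others"   # red cherry
```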


But, the syntax of a regular expression here is slightly different from the one used for file names (remember the wildcards).
The metacharacters are the following: . (period), *, [ ], \, ^ and $, as well as '-' within the [   ] brackets. Their meanings are:

.        any character (as ? for file names)
*       Kleene-star (Kleene closure): the repetition of the expression before it, any number of times (even 0 times)
^       beginning of the line (only at the beginning, otherwise it matches itself)
$       end of the line (at the end of the outermost expression, otherwise it matches itself)
[  ]    any character within the brackets. Special rules for this:
- an interval of characters can be abbreviated by '-': [a-z], [0-9], [m-p]
- a ^ written in the first position means the inverse of the listed characters (anything except those)
- if you want to list the character "]" within this list, you should put it into the first position, thus '[][]' matches either a left bracket or a right bracket.
Furthermore, you have so-called "character classes", like:
[:upper:]    uppercase letters (A-Z, and including some further, non English characters depending on your system)
[:lower:]   lowercase letters (similarly)
[:alpha:]    all letters (A-Z, a-z, and maybe more)
[:digit:]       the digits 0 through 9, precisely
[:xdigit:]    the hexadecimal digits (0-9, A-F, a-f)
[:punct:]    the punctuation characters, such as !"#$%&'()*+,-./:;<=>?@[\]^_`{|}~
[:graph:]    all "graphic" characters, that is the mutually exclusive 'alpha', 'digit' and 'punct' classes together; <space> is not included
[:print:]       all "printable" characters, like graphic characters and <space>
[:blank:]     <space> and <tab>
E.g. grep '[[:digit:]]' will return all lines containing a digit. Notice the double brackets: a character class is only recognized within the [ ] brackets of a list!
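A few of the bracket rules and character classes, tried out on made-up sample lines (the variable names are also just illustrations). The patterns are put between apostrophes so that the shell leaves the brackets alone.

```shell
# A hypothetical sample input of four lines.
sample=$(printf 'room 12\nRoom B\n[draft]\nplain\n')

# A character class inside a list: lines containing a digit.
with_digit=$(printf '%s\n' "$sample" | grep '[[:digit:]]')

# Lines beginning with an uppercase letter.
starts_upper=$(printf '%s\n' "$sample" | grep '^[[:upper:]]')

# ']' in the first position of the list is taken literally:
# lines containing a left or a right bracket.
with_bracket=$(printf '%s\n' "$sample" | grep '[][]')

# A ^ in the first position inverts the list:
# lines beginning with something that is not a letter.
no_letter_first=$(printf '%s\n' "$sample" | grep '^[^[:alpha:]]')

echo "$with_digit"        # room 12
echo "$starts_upper"      # Room B
echo "$with_bracket"      # [draft]
echo "$no_letter_first"   # [draft]
```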

Further possibilities:

\{m\}        matches the preceding expression exactly m times
\{m,n\}     matches it between m and n times
\{0,n\}       matches it at most n times
\{m,\}        matches it at least m times
These are called BRE = Basic Regular Expressions.

Remarks: The concatenation (written one after the other) of two regular expressions is also a regular expression. The so-called Kleene-plus (any number of repetitions of the given regular expression, but at least one) can be realized as: <reg_ex><reg_ex>*, or as <reg_ex>\{1,\}.
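The repetition counts of the basic regular expressions can be compared on a small made-up input, in which the number of 'b' characters between 'a' and 'c' grows from zero to three. The variable names are only illustrations.

```shell
# A hypothetical sample input: 0, 1, 2 and 3 times 'b' between 'a' and 'c'.
sample=$(printf 'ac\nabc\nabbc\nabbbc\n')

exactly_two=$(printf '%s\n' "$sample" | grep 'ab\{2\}c')      # abbc
one_or_two=$(printf '%s\n' "$sample" | grep 'ab\{1,2\}c')     # abc, abbc
two_or_more=$(printf '%s\n' "$sample" | grep 'ab\{2,\}c')     # abbc, abbbc

# Kleene-star allows zero repetitions; Kleene-plus written as 'bb*' does not.
star=$(printf '%s\n' "$sample" | grep -c 'ab*c')              # 4 lines
plus=$(printf '%s\n' "$sample" | grep -c 'abb*c')             # 3 lines

echo "$exactly_two"
echo "$one_or_two"
echo "$two_or_more"
echo "$star $plus"
```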

Further possibilities (ERE = Extended Regular Expressions), using egrep or grep -E :

- don't use the backslash (\) before the {, } symbols in {m,n}, etc.
- ? matches 0 or 1 time
- + matches at least one time (Kleene-plus)
- | means disjunction (OR), like in:  echo aaa | grep -E 'a|b'
- you can form groups with ( and ), e.g. when having a disjunction
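The extended operators can be sketched as follows, using 'grep -E' and a made-up list of words. The variable names are only illustrations.

```shell
# A hypothetical sample input of five words.
sample=$(printf 'color\ncolour\ncat\ndog\ngood\n')

# ? : the 'u' may occur 0 or 1 time.
optional=$(printf '%s\n' "$sample" | grep -E 'colou?r')      # color, colour

# | with a group: the whole line is either 'cat' or 'dog'.
either=$(printf '%s\n' "$sample" | grep -E '^(cat|dog)$')    # cat, dog

# + : at least one 'o' between 'g' and 'd'.
plus=$(printf '%s\n' "$sample" | grep -E 'go+d')             # good

# {m} without backslashes: two consecutive 'o' characters.
twice=$(printf '%s\n' "$sample" | grep -E 'o{2}')            # good

echo "$optional"
echo "$either"
echo "$plus"
echo "$twice"
```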


A few examples:

[oai]n            either 'on' or 'an' or 'in'
[0-9][0-9]      two consecutive digits
^[aeiou]        a vowel at the beginning of the line
^.[aeiou]       a vowel at the second position of a line
^[aeiou]$      a line consisting exactly of a vowel
[^0-9]            anything but a digit (it will return all lines that (also) contain something different from a digit)
[^0-9]$          a line ending with something different from a digit
^[d-]              a line beginning with a 'd' or a '-'; the '-' is written last in the list so that it is not read as an interval (when is it useful?)
abb*              an occurrence of 'a', followed by any number of occurrences of 'b' (but at least one)
[0-9][0-9]*    a sequence of any number (but at least one) of digits (an unsigned integer)
Don't forget to use escape characters or quotes when needed!
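Some of these patterns can be tried out on a made-up input, of which the first three lines imitate the output of 'ls -l'. This is only a sketch with hypothetical sample lines and variable names; here the '-' stands in the last position of the list, where it is certainly not read as an interval.

```shell
# A hypothetical sample input of six lines.
sample=$(printf '%s\n' 'drwxr-xr-x docs' '-rw-r--r-- notes.txt' 'lrwxrwxrwx link' 'a' '42' 'abc7')

# Lines beginning with 'd' or '-': the first two lines (a directory and a plain file).
dirs_and_files=$(printf '%s\n' "$sample" | grep '^[d-]')

# A line consisting of exactly one vowel.
one_vowel=$(printf '%s\n' "$sample" | grep '^[aeiou]$')          # a

# A line consisting of an unsigned integer only.
integer=$(printf '%s\n' "$sample" | grep '^[0-9][0-9]*$')        # 42

# The number of lines ending with something different from a digit.
not_digit_end=$(printf '%s\n' "$sample" | grep -c '[^0-9]$')     # 4

echo "$dirs_and_files"
echo "$one_vowel $integer $not_digit_end"
```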
 
 
 
 

4. Link, 'ln'
 

A person can have several names, like diminutives or aliases. Similarly, you can link several names to the same file.

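First, let's understand a little bit of the mechanism of the file structure. The information relating to a given file is found on three levels: the file name (an entry in a directory), the inode (the administrative data of the file, such as its permissions and the location of its content), and the content itself.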

There are two types of links: hard links and soft (symbolic) links. The command 'ln <existing_file> <new_name>' creates a hard link. Adding the -s option creates a soft link instead.

The number appearing after the permissions in the long list ('ls -l') shows the number of hard links to the given file.  The very first character in the long list is 'l' in the case of symbolic links.

Changing the content of a linked file will affect the third level. Moving and removing a file will affect only the first level, that is, only the file name. If you accidentally delete a file that has been linked with a hard link to another one, you are on the safe side, because the content still exists, and the other file name points to it. On the other hand, if the given content is pointed to by only one file name (independently of whether this file is pointed to by a soft link), then deleting this file will result in the loss of the content, too.
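The behaviour of hard links can be checked in a scratch directory. A minimal sketch, assuming that 'mktemp' is available; the file names 'original', 'hard' and 'soft' and the variable names are made up.

```shell
# Work in a scratch directory so that no real files are touched.
dir=$(mktemp -d)
cd "$dir" || exit 1

echo 'some content' > original
ln original hard       # hard link: a second name for the same content
ln -s original soft    # soft link: a name pointing to the name 'original'

# The second field of the long list is the number of hard links.
links=$(ls -l original | awk '{print $2}')

# The first character of the long list is 'l' for a symbolic link.
kind=$(ls -l soft | cut -c1)

# Removing one name leaves the content reachable through the other one.
rm original
survivor=$(cat hard)

echo "$links $kind $survivor"   # 2 l some content

# Clean up the scratch directory.
cd / && rm -r "$dir"
```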

Try out what happens if you have a soft link 'A' pointing to a file 'B', and then you delete 'B': what happens to 'A'? And what happens if you create a new file with the name 'B'?