1. Metacharacters, escapes, quotes
End of last week: what was the problem with
echo (13/26)*4 | bc
and why does
echo \(13/26\)*4 | bc
work?
Because the characters '(' and ')' have a special meaning for the shell, which misinterprets them in its preprocessing stage, before 'echo' is run at all. The same is true for the '*' and '?' symbols in
echo What does the * symbol mean?
Characters that have a special meaning (like ?, *, (, ), <, >, |, &, etc.) are called metacharacters. If you want to use one of them as a simple character, you have two options: put a backslash (\) in front of it (an escape), or enclose it in single or double quotation marks.
Try
echo "Here you have 'single quotation marks' between double ones"
echo 'And here you have "double quotation marks" between single ones'
echo There is plenty of room        between these words
echo There is plenty of room "        " between these words
Another way to calculate something is to put the expression between '$[' and ']' (an older bash notation; the portable form uses '$((' and '))'), like:
echo $[3+5] $[11/5]
Question: the backslash character (\) is a metacharacter itself. How do you neutralize it? By typing it twice (\\).
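The two ways of neutralizing metacharacters can be compared side by side. The lines below are only a sketch: 'printf' is used instead of 'echo' because it prints its arguments unchanged in every shell, and the arithmetic uses the portable '$(( ))' notation rather than '$[ ]'.

```shell
# printf '%s\n' prints each argument unchanged, one per line.
printf '%s\n' \(13/26\)\*4    # every metacharacter escaped with a backslash
printf '%s\n' '(13/26)*4'     # the whole expression in single quotes
printf '%s\n' "(13/26)*4"     # the whole expression in double quotes
printf '%s\n' \\              # a doubled backslash yields one literal backslash

# Portable shell arithmetic; the division is integer division.
echo "$((3 + 5)) $((11 / 5))"   # prints: 8 2
```

The first three commands print the same text, (13/26)*4, because in each case the shell passes the characters on without interpreting them.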
An important principle in UNIX: parameters are separated by spaces. The space is therefore also a metacharacter, as the two "plenty of room" examples above show, so you need to neutralize it, too, whenever it has to be part of a parameter.
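To see how the shell cuts a command line into parameters, you can count them. The function 'count_args' below is a hypothetical helper written for this illustration, not a standard command.

```shell
# Hypothetical helper: prints how many parameters it received.
count_args() { echo "$#"; }

count_args There is plenty of room        between these words      # 8
count_args There is plenty of room "        " between these words  # 9
count_args "There is plenty of room        between these words"    # 1
count_args There\ is plenty                                        # 2
```

Unquoted runs of spaces only separate parameters, whatever their length; quoted or escaped spaces become part of a parameter.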
Remark: you can use most metacharacters in file names (e.g. spaces, as in Windows 95 and later), but then you must always remember to neutralize them. It is therefore better to avoid them. For instance, "rm file name" would remove two files, one named "file" and one named "name", and not the file named "file name", if you happen to have all three of them. To remove this third file, use "rm 'file name'".
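The remark can be tried out safely in a scratch directory. This is a minimal sketch; the directory comes from 'mktemp -d', and the three file names are the ones used in the remark.

```shell
# Work in a scratch directory so that nothing real is touched.
tmpdir=$(mktemp -d)
cd "$tmpdir" || exit 1

# Create three files: "file", "name" and "file name".
touch file name 'file name'

# Quoting makes the space part of one single parameter.
rm 'file name'

# Escaping the space with a backslash has the same effect.
touch file\ name
rm file\ name

# Only "file" and "name" are left.
count=$(ls | wc -l | tr -d ' ')
echo "$count"    # prints: 2
```

Had the 'rm' command been given as "rm file name" instead, the two files "file" and "name" would have been removed and "file name" would have stayed.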
2. Coding different alphabets, code pages, etc.
"In UNIX everything is text": what does this mean?
Originally computers were built and used by English-speaking people. English is a special language, because it (practically) doesn't use diacritical marks. What about other languages that have all kinds of additional characters (e.g. á, é, í, ó, ú, ö, ü,...)? What about languages using non-Latin alphabets (Arabic, Greek, Hebrew, languages using the Cyrillic alphabet, Asian languages,...)?
You have to differentiate between the following "levels":
1. Keyboard layout: Which key is understood as what character? Different languages have different standards: e.g. English: QWERTY..., German: QWERTZ..., French: AZERTY, etc. Pressing a key sends a signal to the computer (to the central unit of the machine, and not to the computer used as a terminal!), which is understood as a number according to the layout that has been set up.
2. Code page: a number (either a signal received from the keyboard, i.e. the standard input, or a byte appearing in a file) is associated with a character (e.g. 32 is a space, 48 is the character '0', 65 is the character 'A' in the ASCII standard). This association needs a coding system. On modern computers (from the late 1980s on) there are several coding systems, i.e. different standards for associating numbers with characters. These standards are called code pages.
3. Font: the graphics of the different characters (e.g. Times, Times New Roman, Arial, Courier, Helvetica,...). Originally fonts were defined using a matrix of points (either points of the screen or points of a plotter / printer). (E.g. Commodore computers used a matrix of 8 × 8 points.) Nowadays people use "vector-graphic" fonts, which allow enlarging without losing the quality of the text. "Bold", "italic", "underlined", etc. multiply the number of possible fonts.
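The numbers mentioned under level 2 can be checked with the standard 'od' command, which dumps the bytes of its input. A small sketch, assuming an ASCII-compatible system:

```shell
# Print the byte values (unsigned decimal) of a space, a '0' and an 'A'.
codes=$(printf ' 0A' | od -An -tu1)

# The unquoted expansion lets the shell collapse od's padding into single spaces.
echo $codes    # prints: 32 48 65
```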
Standard code pages are:
ASCII: American Standard Code for Information Interchange, using 7 bits. (The 8th bit used to be a parity bit, a check bit to detect whether something went wrong.)
ANSI standards: American National Standards Institute
e.g. ANSI-1252: Latin-1 for Windows, ANSI-1250: Central and Eastern European characters
ISO-8859 series: International Organization for Standardization
e.g. 8859-1: Latin-1 (Western European languages),
8859-2: Central and Eastern European languages,
8859-4: Baltic languages,
8859-5: Cyrillic languages
8859-6: Arabic
8859-7: Greek
8859-8: Hebrew
8859-11: Thai
8859-15: Latin-1 with the euro symbol (officially called Latin-9)
Unicode (ISO-10646): it was originally designed to use two bytes for one character (possible by now, due to the increased memory and storage capacity of computers), so that there are not 128 (cf. ASCII) or 256 (cf. ANSI, ISO-8859), but 65,536 possibilities. Today Unicode has room for more than a million characters, and text is usually stored in a variable-length encoding such as UTF-8 (one to four bytes per character). This makes it possible to use several alphabets at the same time; with code pages it used to be difficult to edit a document containing several languages with different alphabets. Unicode even covers scripts such as Egyptian hieroglyphs.
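The difference between a one-byte code page and Unicode shows up in the number of bytes a character takes. The sketch below assumes the UTF-8 encoding of Unicode and writes the bytes as octal escapes, so that it does not depend on how your terminal is configured.

```shell
# The letter é is the single byte 351 (octal) in ISO-8859-1,
# and the two bytes 303 251 (octal) in the UTF-8 encoding of Unicode.
latin1_len=$(printf '\351' | wc -c)
utf8_len=$(printf '\303\251' | wc -c)

# wc -c counts bytes, not characters.
echo $latin1_len $utf8_len    # prints: 1 2
```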
Furthermore, we have to differentiate between two types of characters:
Tip: try out 'man ascii', and check its "SEE ALSO" section, too!
3. Regular expressions using 'grep'
A regular expression defines a set of strings, built up using "concatenation" (joining substrings), Kleene-star and Kleene-plus (concatenating finitely many elements taken from a given set of strings; in the case of Kleene-star this can be zero elements, too), as well as the union, intersection and complement of previously defined regular expressions.
In short: concatenation, repetition, union, intersection, complement.
'grep' is a very useful command, and we will use it a lot. Its simplest syntax is:
grep <reg_ex> [file_names]
What does it do? It outputs those lines of the given file or files (or, if no file is specified, of the standard input) that match the given regular expression. It can be seen as a filter that collects only the useful information (e.g. when a program produces too much output).
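If you want the lines that match both of two conditions (conjunction), use a pipeline of two grep commands. If you want the lines that match at least one of two conditions (disjunction), give each pattern with its own -e option, or use the | operator of extended regular expressions (grep -E). (The -F option, or the 'fgrep' command, is something else: it treats the pattern as a fixed string instead of a regular expression.) A few others of its most important options: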
-c returns you only the number of lines matching the given regular expression
-i ignore case distinction: does not differentiate between capital and lowercase letters
-v inverse: returns those lines that don't match the condition
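The options can be tried on a small test file. The file content below is made up for this illustration; the file itself is created with 'mktemp'.

```shell
# A made-up four-line sample file.
sample=$(mktemp)
printf '%s\n' 'Apple pie' 'apple juice' 'Banana bread' 'cherry pie' > "$sample"

grep 'pie' "$sample"                   # Apple pie, cherry pie
grep -c 'pie' "$sample"                # 2
grep -i 'apple' "$sample"              # Apple pie, apple juice
grep -v 'pie' "$sample"                # apple juice, Banana bread

# Conjunction: lines with "pie" AND a capital A (a pipeline of two greps).
grep 'pie' "$sample" | grep 'A'        # Apple pie

# Disjunction: lines with "juice" OR "bread" (one -e per pattern).
grep -e 'juice' -e 'bread' "$sample"   # apple juice, Banana bread
```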
But, the syntax of a regular expression here is slightly
different from the one used for file names (remember the wildcards).
The metacharacters are the following: . (period), *,
[ ], \, ^ and $, as well as '-' within the [ ] brackets. Their
meanings are:
. any character (as ? for file names)
* Kleene-star (Kleene closure): the repetition of the expression before it, any number of times (even 0 times)
^ beginning of the line (only at the beginning of the expression, otherwise it matches itself)
$ end of the line (only at the end of the outermost expression, otherwise it matches itself)
[ ] any one character within the brackets. Special rules for this:
- an interval of characters can be abbreviated by '-': [a-z], [0-9], [m-p]
- a ^ written in the first position means the inverse of the listed characters (anything except those)
- if you want to list the character "]" within this list, you should put it into the first position, thus '[][]' matches either a right bracket or a left bracket.
Furthermore, you have so-called "character classes", like:
[:upper:] uppercase letters (A-Z, and including some further, non-English characters depending on your system)
[:lower:] lowercase letters (similarly)
[:alpha:] all letters (A-Z, a-z, and maybe more)
[:digit:] the digits 0 through 9, precisely
[:xdigit:] the hexadecimal digits (0-9, A-F, a-f)
[:punct:] the punctuation characters, such as !"#$%&'()*+,-./:;<=>?@[\]^_`{|}~
[:graph:] all "graphic" characters, i.e. the mutually exclusive 'alpha', 'digit' and 'punct' classes together; <space> is not included
[:print:] all "printable" characters, i.e. the graphic characters and <space>
[:blank:] <space> and <tab>
E.g. grep '[[:digit:]]' will return all lines containing a digit. Notice the double brackets: the class name, together with its own brackets, has to stand inside a bracket expression.
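Character classes in action, as a small sketch. The four input lines are made up, and LC_ALL=C is set so that the classes cover plain ASCII only and the result does not depend on your system's language settings.

```shell
# Four made-up input lines.
lines=$(printf '%s\n' 'room 101' 'no numbers here' 'UPPER' '0xFF')

# Lines containing at least one digit.
printf '%s\n' "$lines" | LC_ALL=C grep '[[:digit:]]'         # room 101, 0xFF

# Number of lines consisting of uppercase letters only.
printf '%s\n' "$lines" | LC_ALL=C grep -c '^[[:upper:]]*$'   # 1

# Number of lines containing a space or a tab.
printf '%s\n' "$lines" | LC_ALL=C grep -c '[[:blank:]]'      # 2
```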
Further possibilities:
\{m\} matches exactly m times
\{m,n\} matches between m and n times
\{0,n\} matches at most n times
\{m,\} matches at least m times
These are called BRE = Basic Regular Expressions.
Remarks: The concatenation (written one after the other) of two regular expressions is also a regular expression. The so-called Kleene-plus (any number of repetitions of the given regular expression, but at least one) can be realized as: <reg_ex> <reg_ex>*.
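The repetition counts and the Kleene-plus construction can be checked on a few made-up lines consisting only of the letter 'a'. The ^ and $ anchors make sure that the whole line is counted, not just a part of it.

```shell
# Four made-up lines: a, aa, aaa, aaaa.
lines=$(printf '%s\n' 'a' 'aa' 'aaa' 'aaaa')

printf '%s\n' "$lines" | grep -c '^a\{2\}$'     # 1  (aa)
printf '%s\n' "$lines" | grep -c '^a\{2,3\}$'   # 2  (aa, aaa)
printf '%s\n' "$lines" | grep -c '^a\{0,2\}$'   # 2  (a, aa)
printf '%s\n' "$lines" | grep -c '^a\{3,\}$'    # 2  (aaa, aaaa)

# Kleene-plus written as <reg_ex><reg_ex>*: one 'a', then any number of 'a'.
printf '%s\n' "$lines" | grep -c '^aa*$'        # 4
```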
Further possibilities (ERE = Extended Regular Expressions), using egrep or grep -E :
- don't use the backslash (\) before the {, } symbols in {m, n}, etc.
- ? matches 0 or 1 time
- + matches at least one time (Kleene-plus)
- | means disjunction (OR), like in: echo aaa | grep -E 'a|b'
- you can form groups with ( and ), e.g. when having a disjunction
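A short sketch of the extended operators, using 'grep -E' on a made-up word list:

```shell
# A made-up word list, one word per line.
words=$(printf '%s\n' 'color' 'colour' 'colouur' 'cat' 'dog' 'bird')

printf '%s\n' "$words" | grep -E -c '^colou?r$'     # 2  (color, colour)
printf '%s\n' "$words" | grep -E -c '^colou+r$'     # 2  (colour, colouur)
printf '%s\n' "$words" | grep -E -c '^colou{2}r$'   # 1  (colouur)

# The group limits the disjunction, so ^ and $ apply to both alternatives.
printf '%s\n' "$words" | grep -E -c '^(cat|dog)$'   # 2  (cat, dog)
```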
A few examples:
[oai]n either 'on' or 'an' or 'in'
[0-9][0-9] two consecutive digits
^[aeiou] a vowel at the beginning of the line
^.[aeiou] a vowel at the second position of a line
^[aeiou]$ a line consisting of exactly one vowel
[^0-9] anything but a digit (it will return all lines that contain at least one character different from a digit)
[^0-9]$ a line ending with something different from a digit
^[d-] a line beginning with a 'd' or a '-' (when is it useful?); the '-' is written last within the brackets, so that it is not read as an interval
abb* an occurrence of 'a', followed by any number of occurrences of 'b' (but at least one)
[0-9][0-9]* a sequence of any number (but at least one) of digits (an unsigned integer)
Don't forget to use escape characters or quotes when needed!
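Some of these patterns can be tried on a made-up list of lines; the counts in the comments refer to exactly this list. The patterns are quoted so that the shell leaves them alone, and LC_ALL=C keeps the intervals to plain ASCII.

```shell
# Ten made-up lines, among them three that look like 'ls -l' permissions.
text=$(printf '%s\n' 'on' 'under' 'a' 'xe' '42' 'abbb' 'a7' \
       '-rw-r--r--' 'drwxr-xr-x' 'lrwxrwxrwx')

printf '%s\n' "$text" | LC_ALL=C grep -c '[oai]n'         # 1  (on)
printf '%s\n' "$text" | LC_ALL=C grep -c '^.[aeiou]'      # 1  (xe)
printf '%s\n' "$text" | LC_ALL=C grep -c '^[aeiou]$'      # 1  (a)
printf '%s\n' "$text" | LC_ALL=C grep -c '[^0-9]'         # 9  (all except 42)
printf '%s\n' "$text" | LC_ALL=C grep -c '[^0-9]$'        # 8  (all except 42 and a7)
printf '%s\n' "$text" | LC_ALL=C grep -c '^[d-]'          # 2  (-rw-r--r--, drwxr-xr-x)
printf '%s\n' "$text" | LC_ALL=C grep -c '[0-9][0-9]*'    # 2  (42, a7)
```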
4. Link, 'ln'
A person can have several names, like diminutives or aliases. Similarly, you can link several names to the same file.
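First, let's understand a little bit the mechanism of the file structure. The information relating to a given file is to be found on three levels: the file name (an entry in a directory), the inode (a record holding the properties of the file, such as its owner, its permissions and the location of its data), and the content itself. A hard link is a further file name pointing to the same inode; a symbolic (or soft) link is a small separate file that only stores the name of the file it points to.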
The number appearing after the permissions in the long list ('ls -l') shows the number of hard links to the given file. The very first character in the long list is 'l' in the case of symbolic links.
Changing the content of a linked file will affect the third level. Moving and removing a file will affect only the first level, that is, only the file name. If you delete by accident a file that has been linked with a hard link to another one, you are on the safe side, because the content still exists, and the other file name points to it. On the other hand, if the given content is pointed to by only one file name (independently of whether this file is pointed to by a soft link), then deleting this file will result in the loss of the content, too.
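The behaviour of hard links can be tried out in a scratch directory. This is only a sketch; the names 'original', 'hard' and 'soft' are made up for the illustration.

```shell
# Work in a scratch directory.
dir=$(mktemp -d)
cd "$dir" || exit 1

echo 'some content' > original
ln original hard       # hard link: a second name for the same content
ln -s original soft    # symbolic link: stores only the name "original"

# The second column of the long list is the number of hard links.
links=$(ls -l original | awk '{print $2}')
echo "$links"          # prints: 2

# The long list of a symbolic link starts with the character 'l'.
kind=$(ls -l soft | cut -c1)
echo "$kind"           # prints: l

# Removing one name leaves the content reachable through the other name.
rm original
cat hard               # prints: some content
```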
Try out what happens if you have a soft link 'A' pointing
to a file 'B', and then you delete 'B': what happens to 'A'? And what happens
if you create a new file with the name 'B'?