Text manipulation (Tekstmanipulatie), week 10

Two mottos for the day:

In Unix everything is a file.
In Unix everything is text.

1. A few more commands, just for fun (?)

echo: displays a line of text.

Example:

> echo hello world
hello world
> echo I love you!
I love you!

expr: evaluates a mathematical expression. The operands and the operators are given to expr as separate arguments, so they must be separated by spaces. Check man expr for further details (logical operations, modulo, etc.). Example:

> expr 3 + 5 / 2
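
As a quick check of how expr treats precedence and division, here is a minimal sketch. It assumes a POSIX-style shell with the standard expr utility; the variable names sum and rest are only illustrative.

```shell
# expr applies the usual precedence: division comes before addition.
# Division is integer division, so 5 / 2 gives 2.
sum=$(expr 3 + 5 / 2)
echo "$sum"            # 5

# The remainder (modulo) operator is %.
rest=$(expr 17 % 5)
echo "$rest"           # 2
```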

2. Wild cards

UNIX gives you the possibility of using wildcards. These are:

The character * denotes any sequence of zero or more characters
The character ? denotes a single character
The construct [cset] denotes any single character in the cset.

A few examples:

x*                any name beginning with 'x' (e.g. x, xold, xerxes)
*x*               any name containing an 'x' (e.g. x, xold, fox, maxi, xx)
x?                any two-character-long name beginning with 'x' (e.g. xx, xy, x2)
[ptks]            either the character 'p' or 't' or 'k' or 's'.
x[aeiou]          any two-character-long name beginning with an 'x' followed by a vowel (e.g. xa, xu)
x[aeiou]*         any name beginning with an 'x' followed by a vowel (e.g. xa, xaver, xerxes)
x[aeiou]*[abc]*x  any name beginning with an 'x' followed by a vowel, then any characters (or none), then 'a' or 'b' or 'c', then any characters (or none), and ending with an 'x' (e.g. xabx, xanax, xenmnaqwx).
*.*               any name containing a period
[A-Z]*            any name beginning with a capital letter.
[1-9]             any single non-zero digit
????              any four-character-long name
????*             any name at least four characters long
???*[0-9.x]       any name at least four characters long ending with a digit, a period or an 'x'
[!T]*             any name not beginning with a capital T.

Remark: Wildcards do not match the '/' character that separates directory names, nor the initial period of file names that begin with a period; these have to be typed explicitly.
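
The patterns above can be tried out safely with echo. The following is only a sketch: it assumes a POSIX-style shell (sh or bash), the file names are made up, and LC_ALL=C is set so that ranges and the sort order behave predictably.

```shell
# Work in a throwaway directory so that no real files are touched.
# LC_ALL=C makes ranges like [A-Z] and the sort order predictable.
export LC_ALL=C
demo_dir=$(mktemp -d)
cd "$demo_dir" || exit 1
touch x xa xray xerxes fox maxi Apple note.txt .hidden

echo x*          # x xa xerxes xray
echo x?          # xa
echo x[aeiou]*   # xa xerxes
echo [A-Z]*      # Apple
echo *.*         # note.txt  (.hidden is not matched: initial period)
echo ????        # maxi xray
```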

How to use them?

> rm *x* : removes all files matching the expression (e.g. 'xerxes', 'maxi', 'fox')
> cp x[aeiou] My_Directory : copies files 'xa', 'xe',..., 'xu' to 'My_Directory'.
> ls ???? : lists all files whose name contains exactly four characters.
> ls apple/???* : lists all files in the directory 'apple' whose name contains at least three characters.
> mv /[0-9]* . : moves all files in the root directory whose name starts with a digit into the current directory.
> ls ~/*[0-9]* : lists all files in the home directory whose name contains a digit.
> ls ../[A-Z][A-Z][A-Z] : lists all files in the parent directory whose name consists of exactly three capital letters.
> ls ./* : same as simply ls. Why?

!! Never type anything like 'rm a_*' (where '_' stands for a space): the shell would read it as 'rm a' followed by '*', that is, by every file in the current directory!
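
One cautious habit, sketched here under the assumption of a POSIX-style shell: put echo in front of a destructive command first, so that the shell only shows the expanded command line instead of running it. The file names are made up.

```shell
export LC_ALL=C
demo_dir=$(mktemp -d)
cd "$demo_dir" || exit 1
touch a1 a2 b1 b2

# Intended command: only the names beginning with 'a' are listed.
echo rm a*     # rm a1 a2

# Mistyped command with a space: 'a' plus EVERY file in the directory.
echo rm a *    # rm a a1 a2 b1 b2
```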

How does this work? An expression containing a wildcard describes a set of strings. For instance: x[aeiou] = {xa, xe, xi, xo, xu}. A file name matches a wildcard expression iff it is an element of this set.

When the shell encounters an expression with a wildcard in the command line, it automatically replaces it with the list of the matching file names (in the current directory). If no file name matches, the expression is not replaced: depending on the shell, it is either passed on unchanged or reported as an error.

The command rm (or ls) can receive multiple arguments and removes / lists all of them. Therefore, in the case of a command line like rm a* (or ls a*), the following happens. First, the pre-processor of the shell replaces the expression with the names of all files matching it, so you get something like rm apple ananas amerika a1445. Then this command is executed, deleting (listing) all of them.
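
To see that the command really receives the expanded list, one can count its arguments. This is a minimal sketch assuming sh or bash; count_args is a hypothetical helper defined only for this illustration.

```shell
# Hypothetical helper: reports how many arguments it received.
count_args() {
    echo "$# argument(s): $*"
}

export LC_ALL=C
demo_dir=$(mktemp -d)
cd "$demo_dir" || exit 1
touch apple ananas amerika a1445

count_args a*      # 4 argument(s): a1445 amerika ananas apple
count_args 'a*'    # 1 argument(s): a*   (quoted, so not expanded)
count_args z*      # 1 argument(s): z*   (no match: sh and bash pass it on unchanged)
```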

The commands mv and cp can be used in two ways. If they receive exactly two arguments which are file names, then they move or copy the first one into the second one. But if they receive an (unbounded) list of file names followed by the name of a directory, then they move / copy all files in the list into the directory, retaining their names. This is how e.g. cp x[aeiou] My_Directory works. But if the last name is not a directory, this does not work. Why? Therefore it is not possible e.g. to change the first character of all file names from 'a' to 'A' by giving the command mv a* A* .
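
For the curious: such a renaming can be done by calling mv once per file. The sketch below uses a for loop and the ${name#a} construct, neither of which has been introduced in these notes yet, so take it as a preview only. It assumes a POSIX-style shell, and the file names are made up.

```shell
export LC_ALL=C
demo_dir=$(mktemp -d)
cd "$demo_dir" || exit 1
touch apple ananas banana

# Rename each file separately: ${name#a} is the name without its leading 'a'.
for name in a*; do
    [ -e "$name" ] || continue      # skip the literal pattern if nothing matched
    mv "$name" "A${name#a}"
done

ls     # lists Ananas, Apple and banana
```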

Similarly, in order to concatenate all your files with a name beginning with 'a', you can simply use:

cat a*

This preprocessing by the shell takes place before the actual command is run (or the commands, if there are several in a command line), independently of the command used. Therefore the shell does this for all commands, even if the command does not expect any file names as arguments.

For example:

> echo What does the * symbol stand for ? : what would this command line do?
> echo *x* : lists all file names matching the expression (e.g. 'xerxes', 'maxi', 'fox')
> expr 3 * 4 : this gives you an error message. Why?
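
A way to investigate the last example is to let echo show the command line that expr would actually receive. This is a sketch assuming sh or bash and a directory that contains the two made-up files fox and maxi.

```shell
export LC_ALL=C
demo_dir=$(mktemp -d)
cd "$demo_dir" || exit 1
touch fox maxi

# echo prints the command line after the shell has expanded the *.
echo expr 3 * 4                    # expr 3 fox maxi 4

# expr cannot make sense of these arguments and fails.
expr 3 * 4 || echo "expr failed"
```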

3. Metacharacters, escapes, quotes

Let's try to calculate (3 + 4) * 7. We would like to type something like

> expr ( 3 + 4 ) * 7

But this won't work. How can we overcome this problem?

The problem is that some characters have special meanings; such characters are called metacharacters. We have seen so far the special meanings of the characters '*', '?', '[' and ']'. The space also has a special meaning: it is the delimiter between two arguments of a command, so a file name containing a space will also cause problems under all versions of Unix. How can we escape from the special meanings of the metacharacters?

The way is to introduce further metacharacters that neutralize the effect of the metacharacters. There are two types of neutralizing characters. The escape character \ (backslash) neutralizes the effect of the character that follows it, whereas the two types of quotes ('...' and "...") neutralize all the metacharacters within them.

What is the difference between the following two commands?

echo What does the * symbol mean?
echo What does the \* symbol mean\?

Now these should work:

> expr 3 \* 7
21
> expr 3 '*' 7
21
> expr "3 * 7"
3 * 7

The problem with the last one is that the command receives only one argument, because the quotes also neutralize the two spaces that should separate the three arguments (3, * and 7). This does not work either:

> expr ( 3 + 4 ) * 7

because the parentheses are also metacharacters. Further metacharacters are: ( , ) , & , < , > , | , etc. The solution therefore is something like

> expr \( 3 + 4 \) "*" 7
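
The parenthesised calculation can be protected from the shell in several equivalent ways. A minimal sketch, assuming a POSIX-style shell; the variable name result is only illustrative.

```shell
# Three equivalent ways to neutralize the metacharacters ( ) and *.
expr \( 3 + 4 \) \* 7        # 49
expr '(' 3 + 4 ')' '*' 7     # 49
expr "(" 3 + 4 ")" "*" 7     # 49

result=$(expr \( 3 + 4 \) \* 7)
```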

Remark: the difference between the apostrophe ('hard quotation') and the double quotation mark ("soft quotation") will become clear when we deal with variables. If you put a variable between double quotation marks, the variable is replaced with its current value, unlike in the case of hard quotation. Another advantage of having two types of quotation marks is that you can quote even the quotation marks:

> echo "the ' symbol is an apostrophe"
the ' symbol is an apostrophe
> echo 'The word "apple" is English'
The word "apple" is English
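
As a small preview of the remark about variables (which are only introduced later), here is a sketch assuming a POSIX-style shell; the variable fruit is made up for the illustration.

```shell
fruit=apple      # illustrative variable

# Soft quotation: the variable is replaced with its value.
echo "I like $fruit"     # I like apple

# Hard quotation: the text is taken literally.
echo 'I like $fruit'     # I like $fruit
```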

Try also these:

echo "Here you have 'single quotation marks' between double ones"
echo 'And here you have "double quotation marks" between single ones'
echo There is plenty of room        between these words
echo There is plenty of room "        " between these words
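
What happens to a run of spaces can be checked directly. A minimal sketch, assuming a POSIX-style shell:

```shell
# Unquoted: the shell splits the line at any run of spaces,
# and echo joins its arguments again with single spaces.
echo plenty of room      here       # plenty of room here

# Quoted: the spaces belong to one single argument and survive.
echo "plenty of room      here"
```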

Another way to calculate something is to put the expression between '$[' and ']' (an older notation that bash still accepts), like:

echo $[3+5] $[11/5]
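
The same calculations can also be written with the $(( ... )) notation, which POSIX-style shells (sh, bash) understand. A minimal sketch:

```shell
# Arithmetic expansion: no escaping is needed inside $(( ... )).
echo $((3 + 5)) $((11 / 5))     # 8 2
echo $(( (3 + 4) * 7 ))         # 49
```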

Question: the backslash character (\) is a metacharacter itself. How to neutralize it? By typing it twice.

Note: The arguments of a command in Unix are separated by spaces. So the space is also a metacharacter, as you could see in the above examples ("echo There is plenty of room..."), and you may want to neutralize it, too.

You can use most metacharacters in file names, for example spaces (as in Windows since Windows 95), but then you must always remember to neutralize them. Therefore it is better to avoid them. For instance, rm file name would remove two files, one named "file" and one named "name", and not the file whose name is "file name", if you happen to have all of them. To remove this third file, escape the space: use rm 'file name' or rm file\ name.
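
The file-name example can be replayed in a throwaway directory. A sketch assuming a POSIX-style shell:

```shell
demo_dir=$(mktemp -d)
cd "$demo_dir" || exit 1

# Three files: "file", "name" and "file name".
touch file name 'file name'

# The escaped space keeps "file name" together as one argument.
rm file\ name

ls     # "file" and "name" are still there
```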

4. Text editors (vi, pico, emacs, xemacs), e-mailing (pine, xemacs)

Intro to vi

Since we will need to create and to edit longer files than the ones we have created with 'cat', we need to learn how to use some text editors.

vi has long been considered the standard editor for Unix. Hopefully you will never need to use it, and you will always have an alternative editor available. (But you never know...) So let's have 10 minutes of vi, just in order to be able to appreciate any other text editor.

Start vi by typing: vi <filename>. Then don't panic!

There are three modes in vi:
- command mode: typing a character means a command, and does not appear on the screen.
- input mode: now you can type whatever text you wish; it is inserted at the place where the cursor is.
- status-line mode (last-line mode): issuing long commands that will appear on the bottom line of your screen.

Changing between these modes:
- when you enter vi, you are in the command mode;
- pressing the 'Esc' button ('escape') always brings you back to command mode;
- from command mode 'a' or 'i' brings you to input mode;
- from command mode ':' brings you to the last-line mode.

In the input mode you can just type in your text, but you will sometimes be surprised that you are not able to delete it. Then go back to the command mode (by using Esc) and bring the cursor onto the character you wish to delete. Then just press 'x'. After that, 'a' or 'i' will bring you back to input mode.

In fact 'x' deletes the given character by putting it into a buffer ("cut"). 'yl' will put the given character into the buffer without deleting it ("copy"), and 'p' will paste it at the current position of the cursor. If you wish to put more than one character into the buffer (e.g. copying or moving an entire word), then just type the number of characters before 'x' or 'yl'. For instance '5x' will delete five characters, and put them into the buffer. Finally, 'dd' will delete a whole line, and put it into the buffer. If you get lost, just don't panic...

By pressing ':', you get to the last-line mode. Typing 'vi' there will start a new file ("new document"), 'vi <file_name>' opens the mentioned file. Typing 'w' will save (write) your file, while typing 'w <file_name>' saves it under the given name.

Leaving this sadistic editor is possible by typing 'q' (quit), 'q!' (quit, even if not saved) or by 'wq' (save and quit).

The remaining beauty of vi (like different variations of the mentioned commands, further commands, searching, using the 36 buffers, etc.) is left for those of you who have some masochistic inclinations...

Intro to pico

Therefore let's rather try out another editor. This is 'pico', the text editor of the emailing program called 'pine'. Just run 'pico <file_name>' (or simply 'pico' if you wish to start a new file, and to give a name to it only at the end), and enjoy! (Compared to vi...)

(By the way: both in the case of vi and of pico, the file name given when launching the program need not exist beforehand. If it already exists, then you can continue editing it, while if it doesn't, the file will be created as an empty document.)

In pico you always have the list of commands in the last lines of your screen. ^ stands for the CTRL-button. Thus you can ask for help with ^G (CTRL + g), cut a line with ^K (paste it at the same place or at another place with ^U), etc. You can insert the content of another file at the current position of the cursor with ^R, search for a string of characters (e.g. a word) in your longer file with ^W, etc. With ^Y and ^V you can jump one page backwards or forwards. ^T will even check your English spelling... (I wouldn't trust it too much...) When you are done, ^O saves your file, and ^X exits. If you exit without having saved, you are asked if you want to save your file. Finally, ^C is the universal escape-combination.

Is it too much information? Don't worry, the only thing to remember is always to check the last lines of the screen.

Intro to pine

Once you are familiar with pico, just type pine, and you have entered an emailing program. The logic is exactly the same, just always check the menu: on the front page you have your main menu (typing 'M' will bring you back to this), otherwise you have the menu at the bottom of the screen. (You usually don't need to press the CTRL button; in that case there is no ^ symbol in your menu.)

From the main menu just type L to get into your folder list. Then it is up to you to create new folders (with 'A') or to delete them. Then you have your emails within your folders, which you can delete (D), undelete (U), save to another folder (S) or export to a file (E), forward (F), reply to (R), etc. Note the option O that will list another dozen commands.

The only way to really learn it is just by playing with it!

Emacs and XEmacs

Emacs is a text editor with more options than pico, and easier to use than vi. Although one has to get used to it...

But there is a tutorial (CTRL-H T), and by going through it (with a lot of exercises) you will get the practice...

XEmacs is a variant (fork) of Emacs.

You can use them also for writing and reading emails.

Run XEmacs by typing xemacs. Then read the manual for more information. Don't forget to ask me for a synopsis of the commands.

5. Coding different alphabets.

"In UNIX everything is text": what does this mean?

Originally computers were built and used by English speaking people. English is a special language, because it (practically) doesn't use diacritical marks. What about other languages, which have all kinds of strange characters (e.g. á, é, í, ó, ú, ö, ü,...)? What about languages using non-Latin alphabets (Arabic, Greek, Hebrew, languages using the Cyrillic alphabet, Asian languages,...)?

You have to differentiate between the following "levels":

Keyboard layout: Which key is understood as what character? Different languages have different standards: e.g. English: QWERTY..., German: QWERTZ..., French: AZERTY, etc. Pressing a key sends a signal to the computer (to the central unit of the machine, and not to the computer used as a terminal!), which is understood as a number according to the layout that has been set up.
Code page: a number (either a signal received from the keyboard, i.e. the standard input, or a byte appearing in a file) is associated with a character (e.g. 32 is a space, 48 is the character '0', 65 is the character 'A' in the ASCII standard). This associating process needs a coding system. On modern computers (from the late 80's) there are several coding systems, different standards for associating numbers with characters. These standards are called code pages.
Font: the exact graphical image of the different characters (e.g. Times, Times New Roman, Arial, Courier, Helvetica,...). Originally fonts were defined using a matrix of points (either points of the screen or points of a plotter / printer). (E.g. the Commodore 64 used a matrix of 8 × 8 points.) Nowadays people use "vector-graphic" fonts, which allow enlarging without losing the quality of the text. "Bold", "italic", "underlined", etc. multiply the number of possible fonts.
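
The numbers mentioned for the ASCII code page can be verified with the standard od utility. This is only a sketch, assuming a POSIX-style shell; byte_value is a hypothetical helper written for this illustration.

```shell
# Hypothetical helper: prints the decimal value of each byte of its argument.
byte_value() {
    printf '%s' "$1" | od -An -tu1 | tr -d ' \n'
    echo
}

byte_value ' '     # 32
byte_value 0       # 48
byte_value A       # 65
```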

Standard code pages are:

ASCII: American Standard Code for Information Interchange, using 7 bits. (The 8th bit used to be a parity or security bit, to check whether something went wrong.)

ANSI standards: American National Standards Institute
    e.g. ANSI-1252: Latin-1 for Windows, ANSI-1250: Central and Eastern European characters
ISO-8859 series: International Organization for Standardization
    e.g. 8859-1: Latin-1 (Western European languages),
         8859-2: Central and Eastern European languages,
         8859-4: Baltic languages,
         8859-5: Cyrillic languages
         8859-6: Arabic
         8859-7: Greek
         8859-8: Hebrew
         8859-11: Thai
         8859-15: Latin-9, a revision of Latin-1 that includes the Euro symbol
Unicode (ISO-10646): it was originally designed to use two bytes for one character (possible by now, due to the increased memory and storage capacity of computers), therefore there are not 128 (cf. ASCII) or 256 (cf. ANSI, ISO) possibilities, but 65,536. (Later versions of the standard go beyond this limit, and encodings such as UTF-8 use a variable number of bytes per character.) This makes it possible to use the characters of more than one code page at the same time. In the past it was very difficult to edit a document containing several languages with different alphabets.
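
That the same byte means different characters in different code pages can be shown with the iconv utility. This is a sketch under the assumption that iconv is installed and knows these code page names; the byte 0xF5 (octal 365) is chosen only as an example.

```shell
# The same byte, read once as Latin-1 and once as Latin-2.
# od shows the UTF-8 bytes that iconv produces for each reading.
printf '\365' | iconv -f ISO-8859-1 -t UTF-8 | od -An -tx1    # c3 b5  (õ)
printf '\365' | iconv -f ISO-8859-2 -t UTF-8 | od -An -tx1    # c5 91  (ő)
```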

Here is a link to Egyptian hieroglyphs, using Unicode, as well as about coding different "exotic" languages in different ways.

Furthermore, it is useful to differentiate between two types of characters:

Printable characters: letters, digits, punctuation marks, graphical symbols, white space, tab, etc.
Non-printable characters: they are sort of 'commands' depending on the system, like ring the bell, delete the next / previous character, end-of-line (new-line, move-home), end-of-file (in some systems), change font, etc. In the ASCII standard these characters have codes between 0 and 31.
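
The codes of such characters can be made visible with od. A minimal sketch, assuming a POSIX-style shell; the exact spacing of the od output may differ between systems.

```shell
# 'a', a tab, 'b' and a new-line, shown as decimal byte values.
printf 'a\tb\n' | od -An -tu1     # 97 9 98 10

# The bell character has code 7.
printf '\a' | od -An -tu1         # 7
```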

Tip: try out 'man ascii', and check its "SEE ALSO" section, too!

Bíró Tamás

Last modified: Thu Jul 3 11:39:17 METDST 2003