

Debian Tutorial (Obsolete Documentation)
Chapter 11 - Text tools


head, tail, grep, wc, tr, sed, perl and so on


11.1 Regular expressions

A regular expression is a pattern that describes a set of strings. You can use it to search through a file by looking for text that matches the pattern. Regular expressions are analogous to shell wildcards (see Filename expansion ("Wildcards"), Section 6.6), but they are both more complicated and more powerful.

A regular expression is made up of text and metacharacters. A metacharacter is just a character with a special meaning. Metacharacters include: . * [] - \ ^ $.

If a regular expression contains only text (no metacharacters), then it matches that text. For example, the regular expression 'my regular expression' matches the text 'my regular expression', and nothing else. Regular expressions are usually case-sensitive.
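As a small sketch of literal matching, the commands below search some made-up sample lines rather than a real file. They use grep -E, which accepts the same extended regular expressions as egrep, and the standard -i option to ignore case; the variable names are only for illustration.

```shell
# Sample input; the text is made up for this illustration.
sample='my regular expression
My Regular Expression
some other line'

# A pattern with no metacharacters matches only that exact text.
literal_hits=$(printf '%s\n' "$sample" | grep -E 'my regular expression')
printf '%s\n' "$literal_hits"

# -i asks grep to ignore case, so both capitalizations match.
nocase_hits=$(printf '%s\n' "$sample" | grep -E -i 'my regular expression')
printf '%s\n' "$nocase_hits"
```

The first search prints only the all-lowercase line; the second prints the first two lines.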

You can use the egrep command to display all lines in a file which contain a match for a regular expression. Its syntax is:

     egrep 'regexp' filename1 ...

For example, to find all lines in the GPL which contain the word GNU, you type:

     egrep 'GNU' /usr/doc/copyright/GPL

egrep will print the lines to standard output.

If you want all lines which contain freedom, followed by some indeterminate text, followed by GNU, you can do:

     egrep 'freedom.*GNU' /usr/doc/copyright/GPL

The . means "any character"; the * means "zero or more of the preceding thing," in this case "zero or more of any character." So .* matches pretty much any text at all. egrep only matches on a line-by-line basis, so freedom and GNU have to be on the same line.
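If you don't have the GPL file at that path, you can try the same pattern on a few lines of your own. The sketch below feeds made-up lines to grep -E (equivalent to egrep) on standard input; the sample text and the variable names are assumptions for the demonstration.

```shell
# Made-up sample lines standing in for the GPL text.
sample='the freedom to share GNU software
freedom is discussed on this line
but GNU only appears on the next one
GNU comes before freedom here'

# A line matches only if "freedom" is followed, later on the same line, by "GNU".
hits=$(printf '%s\n' "$sample" | grep -E 'freedom.*GNU')
printf '%s\n' "$hits"
```

Only the first line is printed. The second and third lines each contain just one of the two words, and in the last line GNU comes before freedom rather than after it.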

Here's a summary of regular expression metacharacters:

.

Matches any single character except newline.

*

Matches zero or more occurrences of the preceding thing. So the expression a* matches 0 or more lowercase a, and .* matches zero or more characters.

[characters]

The brackets must contain one or more characters; the whole bracketed expression matches exactly one character out of the set. So [abc] matches one a, one b, or one c; it does not match 0 characters, and it does not match a character other than these three.

^

Anchors your search at the beginning of the line. The expression ^The matches The only at the beginning of a line; there can't be spaces or other text before The. If you want to allow spaces, you can permit 0 or more space characters like this: ^ *The.

$

Anchors at the end of the line. end$ requires the text end to be at the end of the line, with no intervening spaces or text.

[^characters]

^ reverses the sense of a bracketed character list. So [^abc] matches any single character, except a, b, or c.

[character-character]

You can include ranges in a bracketed character list. To match any lowercase letter, use [a-z]. You can have more than one range; so to match the first three or last three letters of the alphabet, try [a-cx-z]. To get any letter, any case, try [a-zA-Z]. You can mix ranges with single characters and with the ^ metacharacter; for example, [^a-zBZ] means "anything except a lowercase letter, capital B, or capital Z."

()

You can use parentheses to group parts of the regular expression, just as you do in a mathematical expression.

|

| means "or"; you can use it to provide a series of alternative expressions. Usually you want to put the alternatives in parentheses, like this: c(ad|ab|at) matches cad or cab or cat. Without the parentheses, it would match cad or ab or at instead.

\

Escapes any special characters; if you want to find a literal *, you type \*. The backslash means to ignore *'s usual special meaning.
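To experiment with the metacharacters summarized above, you can test one pattern against one piece of text at a time. The sketch below defines a small helper, matches, which is a hypothetical name invented here and not a standard command; it uses grep -E with the standard -q option, which prints nothing and only reports through the exit status whether a match was found.

```shell
# matches PATTERN TEXT: hypothetical helper; succeeds if TEXT contains a match.
matches() {
    printf '%s\n' "$2" | grep -E -q -- "$1"
}

matches '^ *The'      '   The end' && echo 'leading spaces are allowed by ^ *The'
matches 'end$'        'the end '   || echo 'a trailing space blocks end$'
matches '[^abc]'      'abc'        || echo 'abc has no character outside a, b, c'
matches '[a-cx-z]'    'y'          && echo 'y falls in the range x-z'
matches 'c(ad|ab|at)' 'cat'        && echo 'grouped alternatives match cat'
matches 'c(ad|ab|at)' 'at'         || echo 'grouped alternatives need the leading c'
matches 'cad|ab|at'   'at'         && echo 'ungrouped alternatives match at alone'
matches '\*'          '2 * 3'      && echo 'an escaped star matches a literal *'
```

Each line prints its message, because the && lines are cases that match and the || lines are cases that do not.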

Here are some more examples, to help you get a feel for things:

c.pe

matches cope, cape, caper

c\.pe

matches c.pe, c.per

sto*p

matches stp, stop, stoop

car.*n

matches carton, cartoon, carmen

xyz.*

matches xyz and anything after it; some tools, like egrep, only match until the end of the line.

^The

matches The at the beginning of a line

atime$

matches atime at the end of a line

^Only$

matches a line which consists solely of the word Only: no spaces, no other characters, nothing. Only the text Only is allowed.

b[aou]rn

matches barn, born, burn

Ver[D-F]

matches VerD, VerE, VerF

Ver[^0-9]

matches Ver followed by any non-digit

the[ir][re]

matches their, therr, there, theie

[A-Za-z][A-Za-z]*

matches any word which consists of only letters, and at least one letter. Will not match numbers or spaces.
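You can check a few of these examples yourself by running the patterns over a short word list. This is only a sketch: the word list is made up, the variable names are arbitrary, and grep -E is used as the equivalent of egrep.

```shell
# One candidate per line; the word list is made up for this illustration.
words='stp
stop
stoop
step
barn
born
bern
Only
Only one'

# sto*p: "st", then zero or more "o", then "p".
stop_hits=$(printf '%s\n' "$words" | grep -E 'sto*p')

# b[aou]rn: exactly one of a, o, or u between "b" and "rn".
burn_hits=$(printf '%s\n' "$words" | grep -E 'b[aou]rn')

# ^Only$: the whole line must be exactly "Only".
only_hits=$(printf '%s\n' "$words" | grep -E '^Only$')

printf '%s\n' "$stop_hits" "$burn_hits" "$only_hits"
```

The output lists stp, stop, stoop, barn, born, and Only. The words step and bern are rejected because e is not allowed at that position, and the line Only one fails because of the text after Only.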


29 December 2009

Havoc Pennington hp@debian.org