We mentioned in the beginning that Perl is well suited to text manipulation. A particularly useful feature in this regard is a powerful match operator that extends naturally to a search-and-replace operator.
The match operator is =~, which allows us to match a string variable against a pattern, usually delimited by /. For instance, to print all lines from a file that contain the string CMI, we would write:
while ($line = <INFILE>) {
    if ($line =~ /CMI/) {
        print $line;
    }
}
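The loop above can be tried without a real file by opening a filehandle on a string. This is only a minimal sketch: the sample text and the names $text, $in and @hits are made up for illustration and do not come from the original notes.

```perl
use strict;
use warnings;

# Sample input standing in for the contents of a file (hypothetical data).
my $text = "Notes from CMI\nnothing to see\ncmi in lower case\nCMI again\n";

# Open a read-only filehandle on the string instead of a file on disk.
open(my $in, '<', \$text) or die "Cannot open in-memory file: $!";

my @hits;    # matching lines, kept so they can be inspected afterwards
while (my $line = <$in>) {
    if ($line =~ /CMI/) {    # true if CMI occurs anywhere in the line
        push @hits, $line;
        print $line;
    }
}
close($in);
```

The third sample line is not printed, because the pattern /CMI/ is case sensitive.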
More generally, the pattern could be a regular expression. The syntax for regular expressions is similar to that in text editors like vi or emacs or in the grep command of Unix. In a regular expression, an ordinary character stands for itself. A sequence of characters in square brackets stands for a choice of characters. Thus, the pattern /[Cc][Mm][Ii]/ would match any combination of lower and upper case letters that make up the string cmi, for instance CMI, CmI, and so on. The character . is special and matches any single character other than a newline, so, for instance, the pattern /[Cc].i/ would match any three-character string beginning with C or c and ending with i. We can specify a case-insensitive search by appending the modifier i at the end of the pattern.
if ($line =~ /CMI/i) {    # same as ($line =~ /[Cc][Mm][Ii]/)
    print $line;
}
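To see how the three patterns discussed above differ, one can run them against a handful of short strings. The word list and the hash %seen below are assumptions chosen for illustration.

```perl
use strict;
use warnings;

my @words = ('CMI', 'CmI', 'cmi', 'Cai', 'Ci');
my %seen;    # results per word, kept so they can be inspected afterwards

for my $word (@words) {
    $seen{$word} = {
        class  => ($word =~ /[Cc][Mm][Ii]/) ? 'yes' : 'no',   # explicit choices
        dot    => ($word =~ /[Cc].i/)       ? 'yes' : 'no',   # any middle character
        nocase => ($word =~ /CMI/i)         ? 'yes' : 'no',   # case-insensitive
    };
    printf "%-4s class=%-3s dot=%-3s nocase=%s\n",
        $word, @{ $seen{$word} }{qw(class dot nocase)};
}
```

Note that CmI matches the first and third patterns but not /[Cc].i/, which insists on a lower case i at the end, while Ci is too short for any of them.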
Perl provides some special abbreviations for commonly used choices of alternatives. The expression \w (for word) represents any of the characters _, a,...,z, A,...,Z, 0,...,9. The expression \d (for digit) represents 0,...,9, while \s represents a whitespace character (such as space, tab or newline).
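The abbreviations can be explored one character at a time. The helper classify below is a hypothetical function written for this sketch, not something built into Perl, and the sample characters are plain ASCII.

```perl
use strict;
use warnings;

# Classify a single character using the abbreviations \d, \w and \s.
sub classify {
    my ($ch) = @_;
    return 'digit'      if $ch =~ /\d/;   # tested first: digits are also word characters
    return 'word'       if $ch =~ /\w/;
    return 'whitespace' if $ch =~ /\s/;
    return 'other';
}

# "\t" is a tab; it is shown as TAB in the output to keep the listing readable.
for my $ch ('a', 'Z', '_', '7', ' ', "\t", '-') {
    my $shown = $ch eq "\t" ? 'TAB' : "'$ch'";
    printf "%-4s => %s\n", $shown, classify($ch);
}
```

The order of the tests matters: since \w also covers 0,...,9, checking \w first would report every digit as a word character.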
Repetition is described using * (zero or more repetitions), + (one or more repetitions) and ? (zero or one repetition). For instance, the expression \d+ matches a nonempty sequence of digits, while \s*a\s* matches a single a along with all its surrounding white space, if any. More controlled repetition is given by the syntax {m,n}, which specifies between m and n repetitions. Thus \d{6,8} matches a sequence of 6 to 8 digits.
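The repetition operators can be compared on a few sample strings. One assumption in this sketch goes beyond the text above: the patterns use the anchors ^ and $, which tie a match to the start and end of the string. Without them, /\d{6,8}/ would also succeed on a nine-digit string, because some run of 8 digits occurs inside it. The sample strings are made up.

```perl
use strict;
use warnings;

# + needs at least one digit, whereas * also accepts none.
my $plus_empty = ('' =~ /^\d+$/) ? 'match' : 'no match';
my $star_empty = ('' =~ /^\d*$/) ? 'match' : 'no match';
print "empty string against \\d+: $plus_empty\n";
print "empty string against \\d*: $star_empty\n";

# ? makes the preceding character optional.
my %optional;
for my $word ('color', 'colour', 'colouur') {
    $optional{$word} = ($word =~ /^colou?r$/) ? 'match' : 'no match';
    print "$word against colou?r: $optional{$word}\n";
}

# {6,8} accepts 6, 7 or 8 digits when the pattern is anchored.
my %length_ok;
for my $number ('12345', '123456', '12345678', '123456789') {
    $length_ok{$number} = ($number =~ /^\d{6,8}$/) ? 'match' : 'no match';
    print "$number against \\d{6,8}: $length_ok{$number}\n";
}
```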
A close relative of the match operator is the search and replace operator, which is given by =~ s/pattern/replacement/. For instance, we can replace a tab (\t) in $line by a single space by writing
$line =~ s/\t/ /;
More precisely, this replaces the first tab in $line by a space. To replace all tabs we have to add the modifier g at the end, as follows.
$line =~ s/\t/ /g;
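The difference between the two forms is easiest to see side by side. In this sketch the sample string and the variable names are illustrative. It also relies on the fact that s/// returns the number of substitutions it made.

```perl
use strict;
use warnings;

my $line = "one\ttwo\tthree";    # two tabs separate three words

# Without g: only the first tab is replaced.
my $first_only = $line;
$first_only =~ s/\t/ /;

# With g: every tab is replaced; the return value counts the replacements.
my $all_tabs = $line;
my $count = ($all_tabs =~ s/\t/ /g);

print "first only: $first_only\n";
print "all tabs:   $all_tabs\n";
print "replaced:   $count\n";
```

Working on copies leaves $line itself untouched, which is convenient when the original text is still needed.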
Often, we need to reuse the portion that was matched in the search pattern in the replacement string. Suppose that we have a file with lines of the form
phone-number name
which we wish to read and print out in the form
name phone-number
If we match each line against the pattern /\d+\s*\w.*/, then the portion \d+ would match the phone number, the portion \s* would match all spaces between the phone number and the first part of the name (which could have many parts) and the portion \w.* would match the rest of the line, containing all parts of the name. We are interested in reproducing the phone number and the name, corresponding to the first and third portions of the pattern, in the output. To do this, we group the portions that we want to "capture" within parentheses and then use \1, \2, ... to recover each of the captured portions. (In the replacement part, current Perl prefers the equivalent forms $1, $2, ..., and warns about \1 when warnings are enabled.) In particular, if $line contains a line of the form phone-number name, to modify it to the new form name phone-number we could write
$line =~ s/(\d+)\s*(\w.*)/\2 \1/;   # \1 is what \d+ matches,
                                    # \2 is what \w.* matches
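The substitution can be tried on a few sample lines. The phone numbers and names here are invented for the sketch. The replacement is written with $1 and $2 rather than \1 and \2, because the script enables warnings and Perl then reports that \1 is better written as $1.

```perl
use strict;
use warnings;

# Hypothetical input lines of the form "phone-number name".
my @lines = ('5551234 Asha Rao', '5559876   Ravi', 'no number here');
my @result;

for my $line (@lines) {
    my $swapped = $line;
    # $1 is what \d+ matched, $2 is what \w.* matched.
    if ($swapped =~ s/(\d+)\s*(\w.*)/$2 $1/) {
        print "swapped:   $swapped\n";
    } else {
        print "unchanged: $swapped\n";    # no digits, so the pattern fails
    }
    push @result, $swapped;
}
```

The substitution returns a true value only when the pattern matched, so the same statement serves both as the test and as the edit.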
One thing to remember is that if we assigned a value to $line using the <> operator, then it would initially have a trailing newline character. In the search and replace that we wrote above this happens to be harmless: since . does not match a newline, \2 stops just before it and the newline stays at the end of the line. With a pattern that can match a newline (for instance one ending in \s, or one using . with the modifier s), the newline would be captured and the output would have a new line between the name and the phone number. The function chomp $line removes the trailing newline from $line, if it exists, and should be used to strip off unwanted newlines when reading data from a file.
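Putting the pieces together, a read loop would typically chomp each line before working on it. The sketch below reads from a string through an in-memory filehandle so that it runs without an input file. The data and the names $text, $in and @swapped are assumptions made for the example.

```perl
use strict;
use warnings;

# Hypothetical file contents: one "phone-number name" entry per line.
my $text = "5551234 Asha Rao\n5559876 Ravi\n";
open(my $in, '<', \$text) or die "Cannot open in-memory file: $!";

my @swapped;
while (my $line = <$in>) {
    chomp $line;                          # drop the trailing newline, if any
    $line =~ s/(\d+)\s*(\w.*)/$2 $1/;     # name first, then phone number
    push @swapped, $line;
}
close($in);

# With the newlines gone, the entries can be joined on a single output line.
print join(' | ', @swapped), "\n";
```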