4.4 Alternation


 We can specify alternations in a subpattern using |. So, when we write

 

/a|b/

 

we are looking for either an a or a b. Each alternative can contain more than one character. Therefore, if we write

 

/ab|ac|ad/

 

we are looking for either ab or ac or ad. We could have written this pattern more compactly as

 

/a(b|c|d)/

 

as well. Suppose now that we want to modify our citation program a little more. Let us assume that the year in the citation index can be written with either two digits or four digits. In addition, if it is four digits long, its first two digits must be either 19 or 20; in other words, the two-digit century prefix is optional. We can now rewrite the last program as given below.

Program 4.12

#!/usr/bin/perl

while (<>){
    # \{ matches a literal left brace; newer Perl versions expect it to be escaped
    if (/\\cite\{([A-Z][a-zA-Z]*(19|20)?\d{2},)*[A-Z][a-zA-Z]*(19|20)?\d{2}}/){
        print $_;
    }
}
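To see what the pattern of Program 4.12 accepts and rejects, the following minimal sketch applies it to a few in-memory lines instead of reading from a file. The sample citations, the variable $cite_re, and the helper has_citation are hypothetical names introduced only for this illustration; the literal { is escaped as \{ because newer Perl versions expect that.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Pattern of Program 4.12, precompiled with qr// (an assumption made
# for this sketch; the original program writes the pattern inline).
my $cite_re = qr/\\cite\{([A-Z][a-zA-Z]*(19|20)?\d{2},)*[A-Z][a-zA-Z]*(19|20)?\d{2}}/;

# Hypothetical helper: returns 1 if the line contains a citation
# that the pattern accepts, and 0 otherwise.
sub has_citation {
    my ($line) = @_;
    return $line =~ $cite_re ? 1 : 0;
}

# Hypothetical sample lines; single quotes keep the backslash as typed.
my @samples = (
    'See \cite{Kalita99} for details.',   # two-digit year
    '\cite{Badler1996,Kalita2001}',       # four-digit years with 19 and 20
    '\cite{Badler1896}',                  # prefix 18 is not allowed
    '\cite{Badler996}',                   # three digits are not allowed
    '\cite{Badler96,}',                   # trailing comma is not allowed
);

for my $line (@samples) {
    printf "%-34s %s\n", $line, has_citation($line) ? "match" : "no match";
}
```

The first two sample lines should be reported as matches and the last three as non-matches, since the alternation (19|20) admits only those two prefixes and the final reference must be followed directly by }.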

Program 4.12 writes the same large subpattern twice. This is not a good idea: it duplicates effort, it invites typing mistakes, and it makes the pattern very long. A more compact regular expression that does more or less the same thing is given below.

Program 4.13

#!/usr/bin/perl

while (<>){
    if (/\\cite\{([A-Z][a-zA-Z]*(19|20)?\d{2},?)+}/){
        print $_;
    }
}

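However, this version is not perfect either. Because the comma is optional, it accepts citations such as \cite{Badler96,Kalita99,}, where there is an extraneous , at the end, and even \cite{Badler96Kalita99}, where there is no comma between the two references.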
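The over-acceptance can be checked directly. The sketch below compares a compact pattern in the style of Program 4.13 (written here with both prefixes, 19 and 20, allowed) against the stricter pattern of Program 4.12. The names $loose_re, $strict_re, and accepts, as well as the sample citations, are hypothetical and exist only for this illustration.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Compact pattern in the style of Program 4.13: the comma after each
# reference is optional, which is what makes it too permissive.
my $loose_re  = qr/\\cite\{([A-Z][a-zA-Z]*(19|20)?\d{2},?)+}/;

# Stricter pattern of Program 4.12, for comparison.
my $strict_re = qr/\\cite\{([A-Z][a-zA-Z]*(19|20)?\d{2},)*[A-Z][a-zA-Z]*(19|20)?\d{2}}/;

# Hypothetical helper: returns 1 if $line matches $re, and 0 otherwise.
sub accepts {
    my ($re, $line) = @_;
    return $line =~ $re ? 1 : 0;
}

my @samples = (
    '\cite{Badler96,Kalita99}',    # well formed
    '\cite{Badler96,Kalita99,}',   # extraneous trailing comma
    '\cite{Badler96Kalita99}',     # comma missing between references
);

for my $line (@samples) {
    printf "%-28s loose=%d strict=%d\n",
        $line, accepts($loose_re, $line), accepts($strict_re, $line);
}
```

Only the first sample should be accepted by both patterns; the other two should be accepted by the compact pattern alone.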

A third version of the program is given below.

Program 4.14

#!/usr/bin/perl
#file findcite351.pl
$pattern = "[A-Z][a-zA-Z]*(19|20)?[0-9]{2}";
#$pattern = "[A-Z][a-zA-Z]*(19|20)?\\d{2}";

while (<>){
    if (/\\cite\{($pattern,)*$pattern}/){
        print $_;
    }
}

In this version of the program, we have defined a scalar variable $pattern that stores the part of the regular expression describing a single reference: an author’s name, an optional year prefix (either 19 or 20), and the two-digit year. In the conditional of the if inside the while loop, where we perform the pattern matching, we use the variable $pattern twice. The part of the regular expression given as

 

($pattern,)*

 

matches zero or more occurrences of a reference string followed by a comma, so each reference string is separated from the next by a comma. Next, the pattern contains $pattern}, which matches the last reference string followed by } and no comma. Note that there are two versions of the assignment to the $pattern variable in the program, one of which is commented out. Both work. The commented-out version shows that when a \ is used inside a double-quoted string that holds a pattern, it must be escaped. Therefore, we have \\d{2} instead of \d{2}.
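The escaping rule depends on how the pattern text is quoted. As a minimal sketch, the code below builds the same reference pattern in three ways: in a double-quoted string (where \ must be doubled), in a single-quoted string, and with the qr// operator (neither of which needs the doubling for \d). The names $double, $single, $compiled, and cites_ok are hypothetical and are used only for this illustration.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Double-quoted string: the backslash must be written as \\ to survive.
my $double   = "[A-Z][a-zA-Z]*(19|20)?\\d{2}";

# Single-quoted string: \d is kept as typed.
my $single   = '[A-Z][a-zA-Z]*(19|20)?\d{2}';

# qr// compiles the text as a regular expression; \d is kept as typed.
my $compiled = qr/[A-Z][a-zA-Z]*(19|20)?\d{2}/;

# The two strings hold exactly the same characters.
print "double and single quoting give the same text\n" if $double eq $single;

# Hypothetical helper: interpolates a reference pattern into the
# citation pattern, as in Program 4.14, and returns 1 on a match.
sub cites_ok {
    my ($line, $ref) = @_;
    return $line =~ /\\cite\{($ref,)*$ref}/ ? 1 : 0;
}

for my $ref ($double, $single, $compiled) {
    my $ok = cites_ok('\cite{Badler96,Kalita1999}', $ref);
    print $ok ? "match\n" : "no match\n";
}
```

All three forms should behave identically here. One reason to prefer qr// is that the interpolated pattern is wrapped in its own group, so a pattern containing a top-level | would still be treated as a single unit when it is combined with other pattern text.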