4.11 Finding Citations: Another try
We now write a program that makes the discussion of finding citations in a TeX/LaTeX file more complete. Unlike files created by word processors such as Microsoft Word or Adobe FrameMaker, the files processed by the typesetting system TeX, and by the more user-friendly LaTeX built on top of it, are plain text files. It is therefore easy for a user to manipulate such files directly. In this program, we focus only on the initial processing of citations.
We assume that the format of the bibliographic database is similar to what we have discussed in a previous section. To simplify our program, let us assume that we have only books in the bibliographic text database.
The bibliographic database may have a large number of entries. Many of these entries may not be referenced in the text of our article or paper. When we refer to a book or a paper in an article, we do so by using the \cite command. In the simplest use of \cite, we specify the index of only one book or paper. For example, to cite Allen’s book, we write in the text of the article, \cite{Allen95} since Allen95 is the citation index of the book. There can be one or more spaces surrounding the left and the right braces.
There are some complications that can arise when we specify citations in an article. Some of the complications are given below.
• We assume that each reference index consists of an uppercase letter, followed by zero or more lowercase or uppercase letters, followed by either two or four digits. If there are two digits, the year of publication is assumed to be in the twentieth century. If there are four digits, the first two must be 19 or 20. According to this self-imposed syntax, Allen95, Allen1995 and Allen2005 are all acceptable. We check for this syntax in our program. It is not imposed by TeX or LaTeX; we have chosen it to enforce uniformity in the creation of index terms.
• It is possible that we can have more than one use of \cite in one line of text.
• A single use of \cite may reference more than one citation. For example, a line in the paper may look like the following:
Natural language parsing techniques are discussed in \cite{Allen95,Fong92}.
In such cases, the entries need to be separated by a comma. In addition, there cannot be any space between the entries although spaces are allowed surrounding the braces.
• When we specify a citation, it is possible to have an optional field included inside square brackets as in \cite[p. 200]{Kalita91}. In such a case, there can be blank spaces between \cite and the optional entry. Also, there can be spaces between the right square bracket and the left curly brace. There can be spaces inside the optional entry although no space is allowed inside the actual citation index.
• It is quite possible that the use of \cite is followed by a punctuation mark such as a period, a comma, a question mark, etc.
• There may be a single lowercase letter after the year as in \cite{Kalita91a} if there are several citations attributed to the same author in the same year.
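The self-imposed syntax from the first bullet, together with the optional trailing letter from the last one, can be sketched as a single pattern. The subroutine name parseIndex is a hypothetical helper introduced only for this illustration; it is not part of the program in this section. It returns the name, a four-digit year, and the suffix letter, or an empty list if the index does not conform.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# parseIndex is a hypothetical helper used only for illustration.
# It returns (name, four-digit year, suffix) for a conforming index
# and an empty list otherwise.
sub parseIndex {
    my ($index) = @_;
    # Uppercase letter, more letters, optional century 19 or 20,
    # two more digits, and an optional lowercase letter.
    my ($name, $century, $yy, $suffix) =
        $index =~ /^([A-Z][a-zA-Z]*)(19|20)?(\d{2})([a-z]?)$/
        or return;
    # A two-digit year is assumed to lie in the twentieth century.
    $century = '19' unless defined $century;
    return ($name, "$century$yy", $suffix);
}

foreach my $index (qw(Allen95 Allen1995 Allen2005 Kalita91a allen95 Allen195)) {
    my @parts = parseIndex($index);
    print "$index: ",
          (@parts ? "name=$parts[0] year=$parts[1]" : "rejected"), "\n";
}
```

The last two sample indices are rejected: the first does not start with an uppercase letter, and the second has three digits.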
TeX/LaTeX processes the citations, removes duplicates, sorts the citations in some user-specified order and then prints them out at the end of the article following some user-specified format. Depending on the publication or the publisher, there are many standard formats for writing bibliographies and references.
In the program given below, we want to perform some of the tasks that TeX or LaTeX performs on its own.
Program 4.46
#!/usr/bin/perl
use strict;
use warnings;

my (@refs, $refString, @tempRefs, $tempRefString);

while (<>){
    # Break the line into words and keep the words that contain a citation.
    my @words = split (/\s+/, $_);
    my @citations = grep (/\\cite *{ *[^}]+ *}/, @words);
    # Drop a punctuation mark that directly follows a citation.
    s/[,.?!]$// foreach @citations;
    foreach $refString (@citations){
        # Remember the index or indices inside the braces.
        ($tempRefString) = $refString =~ /\\cite *{ *([^}]+) *}/;
        if ($tempRefString =~ /,/){
            @tempRefs = split /,/, $tempRefString;
            push @refs, @tempRefs;
        }
        else{
            push @refs, $tempRefString;
        }
    }
}

# Keep only the indices that follow our self-imposed syntax.
@refs = grep /^[A-Z][a-zA-Z]*(19|20)?\d{2}[a-z]?$/, @refs;
@refs = removeDuplicates (@refs);
@refs = sort (@refs);
print (join ("\n", @refs), "\n\n");

# Return the distinct elements of a list, in no particular order.
sub removeDuplicates{
    my @list = @_;
    my (%tempList, $element);
    # Count how often each element occurs; the keys are the distinct elements.
    foreach $element (@list){
        $tempList {$element}++;
    }
    @list = ();
    foreach $element (keys %tempList){
        push @list, $element;
    }
    return @list;
}
The program is actually quite simple. We repeatedly use split and grep. split takes a pattern and a string as its two arguments and returns an array or list where the original string has been broken up into several parts using the pattern as the separator. grep takes a pattern and a list as its two arguments. It returns a list containing those elements of the list that satisfy the pattern.
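The two built-in functions can be tried in isolation. The following is a minimal sketch; the subroutine name wordsMatching and the sample line are assumptions made for this illustration only.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# wordsMatching is a hypothetical helper: it splits a line on runs of
# whitespace and keeps the words that match the given pattern.
sub wordsMatching {
    my ($pattern, $line) = @_;
    my @words = split /\s+/, $line;      # pattern first, then the string
    return grep { /$pattern/ } @words;   # keep the elements that match
}

my $line  = 'parsing is discussed in \cite{Allen95} and \cite{Fong92}';
my @found = wordsMatching(qr/\\cite/, $line);
print join("\n", @found), "\n";
```

Here split produces eight words, and grep keeps the two that contain \cite.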
If we call the file that contains the program findTeXcites.pl, a call to the program will be something like
%findTeXcites.pl paper.tex
where paper.tex is the TeX-formatted file containing the paper.
The program reads the lines of the file one by one and processes each of them. Below we see some lines from the file.
\cite{Winograd72}, Herskovits \cite{Herskovits86}, and many others
deficiencies and drawbacks of CDGs \cite{Wilks75,Levin87,Palmer85}. Unlike
unification-based grammar formalisms \cite[p. 185]{Carpenter92},
The first line of the program inside the while loop takes a line of text from the file and breaks it apart using one or more whitespace characters as the separator. The separated substrings are returned in the list or array @words. So, after the first line is processed, the array @words contains the following elements.
\cite{Winograd72}, Herskovits \cite{Herskovits86}, and many others
There are six substrings and two of the substrings contain the comma inside them. Next, the program looks at this list of strings and picks out or greps those elements that contain the pattern
/\\cite *{ *[^}]+ *}/
Here, the / is used as the pattern delimiter. In this pattern, we are looking for the string \cite first. Since the backslash is the escape character in patterns, the backslash of \cite must itself be escaped with a second backslash. After \cite, there can be zero or more blank spaces followed by the left brace {. After the opening brace, there can be zero or more blank spaces. After the blank spaces, if any, we are looking for one or more characters that are not the closing brace }. Before the closing brace, we can also have zero or more blank spaces.
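As a quick check of the pattern just described, the following sketch applies it to a few sample words. The subroutine name looksLikeCitation is a hypothetical helper. The braces are escaped with backslashes here, which matches the same strings as the unescaped form.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# looksLikeCitation is a hypothetical helper: it returns 1 if the word
# contains \cite, optional spaces, and a non-empty brace group.
sub looksLikeCitation {
    my ($word) = @_;
    return $word =~ /\\cite *\{ *[^}]+ *\}/ ? 1 : 0;
}

foreach my $word ('\cite{Winograd72},', 'Herskovits', '\cite{}',
                  '\cite { Allen95 }', '\cite[p.') {
    print "$word => ", (looksLikeCitation($word) ? "match" : "no match"), "\n";
}
```

The empty brace group fails because the pattern demands at least one character other than the closing brace, and the last word fails because no left brace follows \cite.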
Once this grep command works on @words, the resulting @citations list contains all the citations in the line. In this specific case, the @citations list contains the following strings.
\cite{Winograd72}, \cite{Herskovits86},
For the second line given above, the @citations list contains the following single string at this point.
\cite{Wilks75,Levin87,Palmer85}.
The third line behaves differently. Because its optional argument contains a space, split breaks the citation into two words, the first ending in p. and the second beginning with 185]. Neither word matches the pattern, which expects \cite to be followed only by optional spaces and then the left brace. For the third line, @citations is therefore empty, and the citation is not picked up by the program as written. For the same reason, a citation typed with spaces around its braces is not found. To handle such citations, the pattern would have to allow for the optional argument and be matched against the whole line rather than against individual words.
The statement that follows grep removes a punctuation mark, if there is one, from the end of each element of @citations. Next, the program has a foreach loop that goes over every element of the current value of @citations. For the first line, this list contains two elements; for the second line, it contains one. Inside the loop, the program has the following line.
($tempRefString) = $refString =~ /\\cite *{ *([^}]+) *}/;
The program is looking for the pattern
/\\cite *{ *([^}]+) *}/
The pattern matching operation is performed on the variable $refString, which holds the element of the array @citations currently being looked at. If the pattern match succeeds, the parentheses make the program remember the citation index or indices inside the braces, without any spaces that follow the left brace. So, for the first citation of the first line, it remembers Winograd72, and for the second citation of the first line, it remembers Herskovits86. For the second line, there is only one citation and the remembered string is the following.
Wilks75,Levin87,Palmer85
If the remembered string has several indices as in the case of the second line, the program breaks up the string using the comma as the separator. The program stores each individual citation index from such a separated list in the list @refs. If the remembered substring has no comma inside it, it has only one index and this index is also pushed into the @refs array.
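A citation whose optional argument contains a space, such as \cite[p. 185]{Carpenter92}, is cut in two when a line is split on whitespace. One way to catch such citations is to match against the whole line. The following is a minimal sketch of that idea, not the program of this section; the subroutine name citedIndices is a hypothetical helper, and the pattern assumes that the optional argument contains no right square bracket.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# citedIndices is a hypothetical helper. It scans one whole line instead
# of individual words, so a space inside the optional [...] argument or
# around the braces does not hide the citation.
sub citedIndices {
    my ($line) = @_;
    my @indices;
    # \cite, an optional [...] argument, then the brace group; the /g
    # modifier finds every use of \cite on the line in turn.
    while ($line =~ /\\cite\s*(?:\[[^\]]*\])?\s*\{\s*([^}]*?)\s*\}/g) {
        push @indices, split /,/, $1;   # several indices are comma-separated
    }
    return @indices;
}

foreach my $line (
    '\cite{Winograd72}, Herskovits \cite{Herskovits86}, and many others',
    'deficiencies and drawbacks of CDGs \cite{Wilks75,Levin87,Palmer85}. Unlike',
    'unification-based grammar formalisms \cite[p. 185]{Carpenter92},',
) {
    print join(" ", citedIndices($line)), "\n";
}
```

Because the match stops at the closing brace, punctuation that follows a citation needs no separate treatment in this variant.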
So, once the foreach loop has been executed for all lines of the paper, we have all the citation indices stored in the @refs array. Next, the program checks to make sure that each index satisfies the syntax that we have required it to have. Once again, note that this syntax is self-imposed and is not required by TeX/LaTeX. The line that does this parsing uses the grep command.
@refs = grep /^[A-Z][a-zA-Z]*(19|20)?\d{2}[a-z]?$/, @refs;
It keeps those entries of @refs that match the pattern specified above. This pattern requires that the index start with an uppercase letter, followed by zero or more lowercase or uppercase letters, followed by an optional 19 or 20 and then two digits. Finally, there is optionally a single lowercase letter. Then the index must end. We use the anchors ^ and $ to make sure that the index contains nothing else.
After picking out the citation indices that conform to our syntax, we remove duplicates and sort them and then print them out. In a real program, we should at this point consult the bibliographic database file and then print details corresponding to each entry that has been cited in the paper.
To remove duplicates, we use the subroutine removeDuplicates. This subroutine looks at every element of the list, makes the element a key of an associative array and stores a count for the element in the associative array. Once all elements of the list are processed, the associative array contains every citation index as a key and the number of times the citation index is used in the paper as the value. The program simply goes through every key of this associative array and puts it in a list. The subroutine then returns the list. Since an associative array can return the elements in any order, the list that is returned needs to be sorted in the main program.
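The same idea of using an associative array to detect repeats can be written more compactly. The sketch below is an alternative, not the subroutine used in the program above; the name uniqueInOrder is hypothetical. Unlike removeDuplicates, it keeps the elements in the order of their first occurrence.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# uniqueInOrder is a hypothetical helper. %seen counts how often each
# element has been met so far; grep keeps an element only the first time.
sub uniqueInOrder {
    my %seen;
    return grep { !$seen{$_}++ } @_;
}

my @refs = uniqueInOrder(qw(Levin87 Allen95 Levin87 Fong92 Allen95));
print join("\n", sort @refs), "\n";
```

Since the final list is sorted anyway, either version serves the program equally well.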
Output of the program is a set of indices that are sorted.
Beckwith1991
Borigault1992
Chen1994
Church1988
CoreLex98
Deerwester1990
Harman1995
Jesse1997
Kalita1986
Kalita1989
Laham1997
Lancaster1969
Lehnert1986
Levi1978
Mahesh1995
Mahesh1997
Paice1993
Rus1997
Salton1990
Salton1994
Swanson1989
Voutilainen1993
This program can be expanded a little to obtain the details from the bibliographic text file for the cited references. In an earlier section, we saw how we can obtain such details from the bibliographic file. However, that program needs to be modified so that bibliographic details are printed only for the cited entries and not for all records in the bibliographic text database.
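The lookup step might be sketched as follows, assuming that the bibliographic records have already been read into an associative array keyed by citation index. The subroutine name detailsForCited and the record texts are placeholders invented for this illustration; reading the actual database file is not shown.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# detailsForCited is a hypothetical helper. $bib is a reference to a hash
# that maps a citation index to the text of its record; the records used
# below are placeholders, not real bibliographic data.
sub detailsForCited {
    my ($bib, @cited) = @_;
    my @lines;
    foreach my $index (sort @cited) {
        if (exists $bib->{$index}) {
            push @lines, "$index: $bib->{$index}";
        }
        else {
            push @lines, "$index: no record in the database";
        }
    }
    return @lines;
}

my %bib = (
    Allen95  => 'placeholder record for Allen95',
    Fong92   => 'placeholder record for Fong92',
    Kalita91 => 'placeholder record for Kalita91',
);
print "$_\n" foreach detailsForCited(\%bib, qw(Fong92 Allen95 Rus1997));
```

Only the cited indices are looked up, so the record for Kalita91 is not printed, while the cited index Rus1997 is reported as missing from the database.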