4.14 Exercises
1. (Easy: Documentation Reading)
Read the Perl documentation on regular expressions and pattern matching by running the following.
%perldoc perlre
2. (Easy: String Processing, Tabs)
Write a program tabbify that replaces strings of blank spaces by the minimum number of tabs and blank spaces to achieve the same spacing. Assume a tab stop is n blank characters at most. The program takes n as a command-line argument.
3. (Easy to Medium: String Processing, File Reading)
Assume that you have a file where each row contains a person’s name followed by two tabs followed by his or her job title.
Kalita Associate Professor
Sebesta Associate Professor
Badal Professor
Augusteijn Professor
Nystuen Senior Instructor
Write a program that reads this file in one read operation into a multi-line string. Next, the program takes this string that has tabs and newlines inside it and converts the tabs into the requisite number of spaces. The purpose of the tabs is to align the characters that immediately follow the tabs. So, in what we have above, the S from Senior Instructor, the A from Associate Professor, and the P from Professor must be aligned to be in the same column.
Note that the maximum number of spaces a tab expands to is 8. So, \t\t can expand to a maximum of 16 spaces. Depending on the number of characters before the tab, the number of spaces may be fewer. Assume that the name before the tabs in each line are not more than 15 characters long.
4. (Medium: Text Processing)
Write a program that reads a file and breaks long lines into smaller lines. It takes a command-line argument that is an integer n and checks to make sure that it is an integer. It breaks lines after the last non-blank character that occurs before the nth column of input.
Extend the program to do the same for all text files in a directory.
5. (Medium to Hard: Text Processing, Program Parsing)
Write a program to check a Perl program for unbalanced parentheses, braces and brackets. Also, look to see if single and double quotes are properly ended. Note that a quote cannot cross line boundary.
6. (Medium: Text Processing)
Write a program to remove trailing blanks and tabs from each line of input. Also, delete more than one consecutive blank line.
7. (Medium: E-mail Addresses, Data Mining)
Write a program that mines e-mail addresses from all files in a directory. For example, if you store e-mail messages you receive and send in a certain directory, the files in this directory are likely to contain many e-mail addresses. Print all collected e-mail addresses to a file. Do not print an e-mail address twice. Sort the e-mail addresses such that
• they are sorted by the mail server name, i.e., the part of the e-mail address after @. If the portion after @ is not fully specified, fill it up correctly so that it is a complete e-mail address. For this purpose, provide a default complete ending. For example, an e-mail address with an incomplete server name may be kalita@pikespeak, or even kalita. If the default mail server is pikespeak.uccs.edu,
the e-mail address can be completed as
kalita@pikespeak.uccs.edu. so that it is a fully qualified Internet address.
• For each e-mail server name, sort the e-mail addresses alphabetically.
8. (Medium: File Reading, Histograms, Statistics, Web Graphics)
Write a program to print a histogram of the number of words in the text files in a directory. First, draw the histogram horizontally using a character such as x. There is a row for each file. Repeat the character as many times as necessary. Scale the number of words so that the histogram is visible on the screen. Next, modify the program so that the results can be displayed on a Web page. In other words, create HTML and draw nice lines for the histogram. You should do better than a character-based histogram.
9. (Medium: Web Log Analysis, HTML Tables)
All accesses to a Web or HTTP server are stored in the file /var/httpd/logs/access_log on a Red Hat Linx 7.1 system. Find out where this file is on your system. The file may be large. Please look at a line of this file given below. It is a single line although it has been broken it up into two lines here.
aazoli.cstp.umkc.edu - - [27/Apr/2001:08:03:59 -0600]
``GET /~kalita/index.html HTTP/1.0'' 200 595
It has the machine that has accessed a page, the date and time of access, and information about the HTTP protocol command that it received. It also stores the name of the file that is accessed.
Process this file to find which machines have accessed a certain user’s files how many times: today, this month, and this year. Specify the name of the user as a command-line argument. Now, create an HTML file that will contain a presentation of this information in tabular form. See how this page looks like in a Web browser.
10. (Medium: Bibtex, Creating HTML, Research)
There are two related text processing packages called TeX and LaTeX that many in academia frequently use. These are markup languages that were invented by Donald Knuth in the late 70’s and early 80’s. Create a large bibliographic file for this word processor using the Bibtex format. One such example file is at http://www.cs.uccs.edu/\~kalita/work/bibtex.html.
A bibliographic file has a large number of entries which follow a given format. Entries look like the following.
@InProceedings{Kalita91,
author = ``Jugal K. Kalita and Norman I. Badler'',
title = ``Interpreting prepositions
physically'',
booktitle = ``Proceedings of the Annual Conference of the American
Association for Artificial Intelligence ``,
year = ``July, 1991'',
address = ``Anaheim, California'',
pages = ``105-110''
}
@Book{edward-gait-1906,
author = ``Gait~b1863, Edward Albert'',
title = ``A History of Assam'',
publisher = ``Thacker, Spink \& Co.'',
year = ``1906'',
editor = ``'',
volume = ``'',
series = ``'',
address = ``Calcutta'',
edition = ``'',
month = ``'',
note = ``PATG''
}
The first entry tells us that it is an entry from a conference proceedings. It has an internal index called Kalita91. Then, it gives us some details about the entry.
Please note that there are several kinds of records with different fields. In addition, some fields may not be specified.
Please look at the file provided. Find documentation on Bibtex format on the Web. let the user supply one or more keywords and search the “ database”. Get the entries that mention the keyword(s). Process the selected entries to produce a presentable HTML document that contains the bibliographic entries. Find out from a book of styles (e.g., The Chicago Manual of Style) how they can be presented. Once again, load this HTML file into a browser to display it.
11. (Easy: Text Processing)
Write a program that reads a file a paragraph at a time. It reports the lengths of the longest and the smallest paragraphs in the document, in terms of numbers of words. It also reports the lengths of the shortest and the longest sentences, in numbers of words. Make any simple assumptions you need regarding when a sentence starts.
12. (Easy: Text Processing)
In Perl you can include one file in another by writing either require file or use file. Write a program that takes a set of file names as command line arguments and prints which file includes which ones. The output should look something like the following:
Name Includes
----- --------
file1 file5, file6, file20
file2 file3, file4
13. You have a text file customers with a large number of records. A record looks like the following
#
name = ``Kalita, Jugal'';
e-mail = ``kalita@pikespeak.uccs.edu'';
phone = ``719-574-3656''
# next record begins
Note that the fields may be jumbled up. Also, there may be any number of spaces or newlines scattered in the file.
Treat this file as a mailing list. Generate a mail message with a body that looks like the following to every person listed in the “database.”
Dear Jugal,
How are you doing? Is 719-574-3656 still your phone number?
Sincerely,
Customer Service Representative
This mail message is sent to kalita@pikespeak.uccs.edu.
14. (Medium: Text Processing, “Database” Processing)
Let us make an initial effort at writing a grocery store program.
We make up a syntax for entering our grocery store items. This may not be the best syntax, but it gives us the flexibility that we need for our purposes.
Let us assume that we have grocery items belonging to the following categories Produce, Deli, Dairy, Meat, Beverages, Snack Foods, Paper Products, Books and Magazines, Spices and Condiment, and Frozen Foods.
Write up a file for the grocery items that we have in the store. Let this file have 15 items belonging to any five categories given above. Have 3 items per category.
An entry in this file looks like the following.
@Item{
ID = ``20-130-5000'',
name = ``Philadelphia Cream Cheese'',
category = ``Dairy'',
manufacturer = ``General Foods, Inc.'',
price = ``0.99'',
unit= ``ounce'',
unitsPerItem = ``16'',
packaging = ``plastic and paper'',
inventory = ``1000'',
description = ``Low calorie; No fat''
}
It tells us that it is a grocery item. It has an internal ID. Then, it gives us details about the entry. Note that some fields may not be specified. Also, a field may be specified over several lines.
Our goal now is to make up a Web page from the item file. By processing this file, we want to make an HTML page where the item categories are shown as clickable links. Show the items in a sorted manner. When a user clicks on a link for a certain category, show all the products in that category with detailed information that you deem important. You should sort the items.
15. (Easy: Text Processing)
Suppose we have XML elements that look like the following.
<Photographs>
<Photograph>
<URL>http://www.cs.uccs.edu/~kalita/kalita1.jpg</URL>
<Caption>Jugal Kalita on a summer afternoon</Caption>
<Width>150</Width>
<Height>150</Height>
</Photograph>
<Photograph>
<URL>http://www.cs.uccs.edu/~kalita/jk1.jpg</URL>
<Caption>Jugal Kalita on a boat on Lake Umtru</Caption>
<Width>200</Width>
<Height>200</Height>
</Photograph>
</Photographs>
Parse this XML fragment to obtain the Photograph elements. For each Photograph element, obtain the sub-elements. Create a string for each Photograph element where the sub-elements are separated by the separator ::. Write the strings to a file, one string per line.
16. (Easy to Hard: Text Processing)
Write a program that compares two files like the diff program in Unix. Implement as many options as you can.
17. (Easy: Text Processing, Loops)
Write definitions for the following group of functions. In each case, the function takes two arguments. The first argument must evaluate to a list and the second is a regular expression. What each function returns is given below:
(a) forall lst pattern: This function returns 1 if the pattern is satisfied for every item in the list lst and 0 otherwise. This function should always return 1 if the first argument is the empty list.
(b) forsome lst pattern: This function returns 1 if the pattern is satisfied for some item in the list lst and 0 otherwise. This function should always return 0 if its first argument is the empty list.
(c) formost lst pattern: This function returns 1 if the pattern is satisfied for most of the items in the list lst and 0 otherwise. Let most mean half or more of the items in the list. This function should always return 1 if the first argument is the empty list.
18. (Medium to Hard: Calculus, Pattern Matching, Symbolic Differentiation)
This problem asks you to complete a symbolic differentiation function. The goal of the function is to take an arithmetic expression containing variables and to produce another representing its first derivative with respect to the variable X. Thus given the expressions:
((x ^ 2) + (3 * x)),
we want to produce the new expression
((2 * x) + 3)}.
We assume the expressions are fully parenthesized for simplicity.
The standard rules for differentiation are given below. Here, x stands for the variable x , c for any constant symbol, u and v for any arbitrary arithmetic expressions and du and dv for their derivatives. Note that the rules are recursive. The fourth rule, for example, states that the derivative of the sum of two expressions is the sum of their derivatives.
(a) d/dx[c] = 0 , where c is a numeric constant.
(b) d/dx[x] = 1
(c) d/dx[v] = 0 , where v is a variable other than x .
(d) d/dx[u + v] = d/dx[u] + d/dx[v]
(e) d/dx[u - v] = d/dx[u] - d/dx[v]
(f) d/dx[-v] = - d/dx[v]
(g) d/dx[u * v] = u * d/dx[v] + v * d/dx[u]
(h) d/dx[u / v] = (v * d/dx[u] - u * d/dx[v]) / v
Try to simplify the results of differentiation. Take the following approach. Use simplifying patterns such as the following.
|
pattern |
x+0 |
0+x |
x*0 |
0*x |
x*1 |
1*x |
x/1 x |
0/x |
|
|||||||||
|
|
result |
x |
x |
0 |
0 |
x |
x |
x |
0 |
|||||||||
In the returned expression, look for any of the above pattern and additional patterns that you write, and if you find any replace it by the corresponding result. And, keep on doing this till no more simplifications can be done. Note that x need not be a symbol; it can be any complicated expression.