9.6 Exercises

1.  (Medium to Hard: E-Commerce)

Write a program that compares the prices of books from five different Web bookstores. First, present an HTML page where a visitor can search for books by title words, author name, or ISBN. Your program should search the five stores, collate the results, and present them in an easily readable format. What are some of the problems you face? How do you solve them? How can you make your program more efficient, more user-friendly, and more robust? Discuss ideas regarding these issues.

2.  (Hard: E-Commerce, Can be a long-term project)

Write a program that compares prices of electronic goods such as cameras from three to five different e-commerce sites.

3.  (Hard: E-Commerce, Long-term project)

Write a program that compares the prices of airline tickets. This program is harder to write than the previous ones, mainly because an itinerary may consist of several legs.

4.  (Hard: E-Commerce, Long-term project)

Write a program that compares the prices of automobiles from a few dealers. This is going to be difficult in general because of the plethora of options that manufacturers usually provide for cars.

5.  (Hard: Meta-Search Engine, Long-term project)

This problem instructs you to write a meta-search engine. You have a Web page where a user can search for keywords, just like a commercial search engine such as Google or Lycos. Your CGI program connects to three or four commercial search engines, performs searches on their sites, collates the results, and prints the results to the browser. At a minimum, remove duplicates, and present the information consistently. Different search engines usually return the search results using different formats. What are some of the problems you face?
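One problem you are likely to meet is that the same page comes back from several engines under slightly different URLs. The following is a minimal sketch of duplicate removal, written in Python for brevity (the same steps translate directly to Perl). It assumes each engine's results have already been parsed into (title, URL) pairs. The function names and the normalization rules are illustrative assumptions, not a complete solution.

```python
from urllib.parse import urlsplit, urlunsplit


def normalize_url(url):
    """Reduce a URL to a canonical form for comparison (assumed rules)."""
    parts = urlsplit(url.strip())
    host = parts.netloc.lower()
    # Treat "http://host" and "http://host/" as the same page.
    path = parts.path.rstrip("/") or "/"
    # Drop the fragment; keep the query string, which may select the page.
    return urlunsplit((parts.scheme.lower(), host, path, parts.query, ""))


def merge_results(result_lists):
    """Merge per-engine result lists, keeping the first hit for each URL."""
    seen = set()
    merged = []
    for results in result_lists:
        for title, url in results:
            key = normalize_url(url)
            if key not in seen:
                seen.add(key)
                merged.append((title, url))
    return merged
```

Keeping the first hit preserves the ranking of whichever engine is listed first; a fuller solution might combine the ranks from all engines instead.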

6.  (Easy: Fetching Web Page, Research)

Write a program that fetches a Web page, given its URL, only if it has been modified since the last time it was fetched. Run the program from time to time.
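HTTP supports this directly through conditional requests: the client sends an If-Modified-Since header (and, where available, If-None-Match with a previously seen ETag), and the server answers 304 Not Modified if nothing has changed. The sketch below is in Python; the function names are hypothetical, and storing the saved header values between runs is left to you.

```python
import urllib.error
import urllib.request


def build_conditional_request(url, last_modified=None, etag=None):
    """Build a GET request that asks the server to skip unchanged pages."""
    req = urllib.request.Request(url)
    if last_modified:
        req.add_header("If-Modified-Since", last_modified)
    if etag:
        req.add_header("If-None-Match", etag)
    return req


def fetch_if_modified(url, last_modified=None, etag=None):
    """Return (body, last_modified, etag); body is None if unchanged."""
    req = build_conditional_request(url, last_modified, etag)
    try:
        with urllib.request.urlopen(req, timeout=30) as resp:
            return (resp.read(),
                    resp.headers.get("Last-Modified"),
                    resp.headers.get("ETag"))
    except urllib.error.HTTPError as err:
        if err.code == 304:  # Not Modified: keep the cached copy
            return (None, last_modified, etag)
        raise
```

Not every server sends Last-Modified or ETag headers, so a robust program needs a fallback, such as comparing the fetched content with the saved copy.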

7.  (Easy to Medium: Monitoring Web Site)

Write a program that detects changes in a certain Web page of interest to you. It checks the page every few hours and, if it detects a change, alerts you with a mail message. Such a program can be valuable alongside the programs you are asked to write in the previous problems. You may spend a lot of time writing code that performs searches on Web sites, yet the format in which those sites return search results can change at any time and break your program. A monitor that checks whether the format of the returned results has changed at a given Web site can alert you so that you can make the appropriate changes in your own program.
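One simple way to detect a change is to store a fingerprint (a cryptographic hash) of the page and compare it with the fingerprint of the next fetch. The Python sketch below shows only that comparison; fetching the page, saving the fingerprint between runs, and sending the mail message are left out, and the function names are assumptions made for illustration.

```python
import hashlib


def fingerprint(page_bytes):
    """Return a hex digest that changes whenever the page content changes."""
    return hashlib.sha256(page_bytes).hexdigest()


def has_changed(page_bytes, previous_fingerprint):
    """Return (changed, new_fingerprint).

    On the first run there is no previous fingerprint, so the page is
    reported as unchanged and only the new fingerprint is recorded.
    """
    current = fingerprint(page_bytes)
    changed = previous_fingerprint is not None and current != previous_fingerprint
    return changed, current
```

A hash of the whole page reports every change, including rotating advertisements or dates. To monitor only the format of search results, you may prefer to fingerprint just the sequence of HTML tags rather than the full text.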

8.  (Medium: Web-site Mirroring)

Write a program that mirrors a Web site locally on your machine. This program copies every directory recursively to your machine. However, it does not copy everything each time. It runs from time to time, and copies only those files and directories that are new or have changed. What problems arise in making such a program efficient? Implement some of your ideas.

9.  (Medium: HTML Forms, Research)

There is a Perl module called HTML::Form that can parse a Web page and capture the forms in it. Install this module if you do not have it already. Use this module to capture a form, fill it in, and submit it. Rewrite the programs discussed in the text of this chapter and in the problems here so that they use HTML::Form.

10.  (Medium to Hard: Authentication, Cryptography, Research)

Many Web sites require a client to authenticate with a name and a password. Perl provides modules for doing such authentication. Research these modules. Write a program that fetches a Web page that requires authentication.

11.  (Easy: Secure Web Sites, Research)

Some Web sites are secure. Their URLs start with https:// instead of http://. These sites encrypt the data transferred between the client and the server so that eavesdroppers cannot read it. Research how a Perl program can obtain a Web page from such a secure server. Write a program that obtains such a page.

12.  (Hard: Movie Database, Can be a long-term project)

Write a program that searches for reviews of a movie at a site such as www.imdb.com (the Internet Movie Database). Write code for an HTML form that allows one to search for a movie with a keyword. The CGI program associated with this form performs the search at the IMDb or a similar site. It then obtains the names of the movies that the site returns. Your program further obtains reviews of these movies from the site, if necessary by following additional hyperlinks. The CGI program you write presents the name of each movie followed by its review.

13.  (Medium to Hard: Text Processing, Vector Computation, Can be a long-term research project)

Write a program that is given a URL as command-line argument. It fetches the URL. Assume it is a text file. Remove all the HTML tags from the page. Remove commonly used words such as it, is, some, etc. Count the frequencies of occurrence of the words in the file.
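The text-processing steps above can be sketched as follows, in Python for brevity (the same steps carry over to Perl). The sketch starts from HTML text that has already been fetched. The stop-word list is a tiny illustrative sample, and the class and function names are assumptions, not part of any library.

```python
import re
from collections import Counter
from html.parser import HTMLParser

# Illustrative sample only; a real stop-word list is much longer.
STOP_WORDS = {"it", "is", "some", "a", "an", "the", "of", "and", "to", "in"}


class TextExtractor(HTMLParser):
    """Collect the text between tags, discarding the tags themselves."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)


def word_frequencies(html_text):
    """Return a Counter mapping each non-stop word to its frequency."""
    parser = TextExtractor()
    parser.feed(html_text)
    parser.close()
    text = " ".join(parser.chunks).lower()
    words = re.findall(r"[a-z]+", text)
    return Counter(w for w in words if w not in STOP_WORDS)
```

This simple extractor also keeps the contents of script and style elements, and it splits words only on non-alphabetic characters; both are simplifications you may want to refine.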

Now, you are given the URL of another Web page. Repeat the above steps for the second file as well. Make a single sorted list of the words in the two files. Assign a numeric ID to each word. Form a frequency vector for each URL, giving two vectors in all. The vector for a file should contain the frequencies of the words in the file, in order of word ID. Words that do not occur in a file have a frequency of zero for that file. Find a way to normalize each vector so that its values do not grow without bound for long pages. In addition, it is easier to do computation with normalized numbers.
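As a sketch of this step in Python, assume the word frequencies of each page are available as a dictionary mapping words to counts. A word's numeric ID is then its position in the sorted combined word list. Scaling each vector to unit Euclidean length is one possible normalization; dividing by the total word count is another. The function names are illustrative.

```python
import math


def frequency_vectors(freqs_a, freqs_b):
    """Return (vocabulary, vec_a, vec_b) over the combined sorted word list."""
    vocabulary = sorted(set(freqs_a) | set(freqs_b))
    # A word missing from a page contributes a frequency of zero.
    vec_a = [freqs_a.get(word, 0) for word in vocabulary]
    vec_b = [freqs_b.get(word, 0) for word in vocabulary]
    return vocabulary, vec_a, vec_b


def normalize(vector):
    """Scale a vector to unit length; an all-zero vector is returned as is."""
    length = math.sqrt(sum(x * x for x in vector))
    if length == 0:
        return list(vector)
    return [x / length for x in vector]
```

For example, the pages {"perl": 3, "cgi": 4} and {"perl": 1, "web": 2} share the word list cgi, perl, web, so their vectors are [4, 3, 0] and [0, 1, 2].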

Given these two vectors, we can perform a computation called the cosine computation to find how similar the two files are. The cosine computation finds the cosine of the angle between the two vectors. Although the dimension of the two vectors is large, we can still find the angle between them by computing its cosine. Find the cosine of the angle between the two vectors for the two URLs. It tells us how similar the two Web pages that we started with are. A cosine close to 1 means that the angle between the two vectors is small, so the pages are similar; a cosine close to 0 means that the pages share few words.
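The cosine of the angle between two vectors is their dot product divided by the product of their lengths. A minimal Python sketch follows; the function name is illustrative, and returning 0.0 for an all-zero vector (a page with no countable words) is an assumed convention.

```python
import math


def cosine_similarity(vec_a, vec_b):
    """Return the cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(vec_a, vec_b))
    norm_a = math.sqrt(sum(a * a for a in vec_a))
    norm_b = math.sqrt(sum(b * b for b in vec_b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # assumed convention: an empty page is similar to nothing
    return dot / (norm_a * norm_b)
```

Because word frequencies are never negative, the result lies between 0 and 1. Two pages with the same relative word frequencies score 1 (up to rounding), and two pages with no words in common score 0. If both vectors have already been scaled to unit length, the dot product alone gives the same value.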

Now, we will extend the program. We are still given the first URL to start with. It is the base page with which we will compare all the other pages. The program is also given another URL. It traverses the Web site represented by the second URL recursively. It finds the similarity of each page with the base page. It prints the similarity numbers in a table. It also sorts the pages by the similarity values in ascending order.

A program such as this can be used to traverse the Web and automatically classify pages according to whether they are similar to a base page or not. The base page’s vector does not have to correspond to a real page, but could be the representation of a class of pages. For example, a vector can represent the characteristic word frequencies for a class such as news. A program like the one you have written can then crawl the Web and automatically find Web sites that are news related.