9.3 Using Cookies
Cookies are a mechanism by which a Web server can store information on a Web client and retrieve it later. Many Web servers use cookies to keep track of who visits their sites. The first time someone visits a site, the server can send one or more cookies to the client. The client stores the cookies in a local file, if it is enabled to do so. Many Web servers do not allow a client to visit the site unless cookies are enabled. Some sites allow only partial access if the client does not accept cookies.
The following program is somewhat complex. It is a small part of a much bigger program, written by the author and others, that fetches news items from various on-line sources, puts them together, removes duplicates, classifies them automatically, and then presents the news items to each individual who visits the Web site as a personalized newspaper. The details are fairly involved and use sophisticated learning techniques from artificial intelligence.
Figure 9.20: The Top Page at www.apnewstracker.com
The program presented here logs onto the site apnewstracker.com using a login name and a password. The top page of the site at a certain instant of time is shown in Figure 9.20. If the login is unsuccessful, the program tries to log in several more times, waiting a specified number of seconds after each attempt. The number of attempts is obtained from a separate file that we call a configuration file. The site sends back cookies to the client, and the client needs to save them to be able to proceed further with retrieving the news items posted on the site. If the login is successful, the site automatically returns a page containing an index of news items with a headline for each item and a link to a URL containing the text of the item. The index page for a particular instant of time is shown in Figure 9.21.
Figure 9.21: The News Index Page at www.apnewstracker.com
The program parses the index page sent to it and culls the URLs of the individual news items. There are several parts to the index page, and the program picks out the news items in the first part only. The first part is identified by the presence of certain specific syntactic items in the HTML of the index Web page. Next, the program fetches each of the news items. A partial Web page for a specific news item is shown in Figure 9.22.
Figure 9.22: A News Item at www.apnewstracker.com
To fetch each news item URL, the program has to present cookies to the Web server along with the request to fetch the page. In the more sophisticated program not discussed here, the program parses each news item’s text, saves it in some form in a database, classifies it into pre-determined categories, and performs a good deal of additional computation. We do not present these other processing steps in this simplified version of the program since they are irrelevant to our current discussion.
The program’s text is given below, followed by more explanation.
Program 9.12
#!/usr/bin/perl
#file apget1.pl
use strict;
#use LWP::Debug qw (+);
use HTTP::Response;
use LWP::UserAgent;
use HTTP::Cookies;
use HTTP::Request::Common;
use Time::Local;
use AP_config;
###Main Program ##################################
##########Global declarations. Site-specific values come from AP_config####
my ($cookie_jar, $ua, $loginResponse, $newsItemIndex);
my (%newsIndexHash);
my ($itemURL, $itemText);
#Create cookie_jar to be stored in file $AP_config::cookieFile
$cookie_jar = HTTP::Cookies -> new (file => $AP_config::cookieFile,
                                    ignore_discard => 1);
#define the user agent
$ua = LWP::UserAgent -> new ();
$ua -> agent ($AP_config::userAgent);
$ua -> timeout ($AP_config::userAgentTimeout); #default is 180 seconds
#Do for ever. Login+fetch news items every
#$AP_config::waitTimeBetweenFetches seconds
while (1){
    my $noLoginTriesThisTime = $AP_config::noLoginTries;
    LOGIN:{
        print "\nTrying to login to $AP_config::loginURL...\n\n";
        #log into the AP site and Fetch the news item index
        $loginResponse = &login($ua, $cookie_jar);
        print "After login\n";
        if (!$loginResponse){
            print "Couldn't login\n";
            $noLoginTriesThisTime--;
            #Reduce the number of times login will be tried sequentially
            if ($noLoginTriesThisTime){
                print "SLEEPING $AP_config::waitTimeBetweenFailedLogins seconds...\n";
                sleep ($AP_config::waitTimeBetweenFailedLogins);
                goto LOGIN; #Try to login again
            }else{
                print "Tried logging in $AP_config::noLoginTries times, didn't succeed, exiting...\n";
                exit 1;
            }
        }
    } #LOGIN block ends
    print "Just logged onto the http://www.apnewstracker.com site...\n";
    for (my $i = 1; $i <= $AP_config::repeatFetchTimesBetweenLogins; $i++){
        #logged in at this point
        $newsItemIndex = &fetchIndexPage ($ua, $cookie_jar);
        #Parse the news index page that comes back; if there is an HTTP error,
        #$newsItemIndex is the number 0
        %newsIndexHash = ();
        if (!$newsItemIndex){
            print "Couldn't fetch news items index page...\n";
            print "SLEEPING $AP_config::waitTimeBetweenFailedLogins ";
            print "seconds before retrying login...\n";
            sleep ($AP_config::waitTimeBetweenFailedLogins);
            goto LOGIN; #Try to login again
        }
        &parseIndexPage ($newsItemIndex); #This puts values in %newsIndexHash
        #Next, need to fetch the URLs specified in the index page
        #But, need to fetch only those URLs that are not already there in cache
        foreach $itemURL (reverse (sort (keys %newsIndexHash))){
            $itemText = &fetchNewsItemURL ($ua, $cookie_jar, $itemURL);
            if (!$itemText){
                print "Could not fetch url: $itemURL\n";
            }
        } #foreach ends
        #sleep for a while
        print "\nSLEEPING FOR $AP_config::waitTimeBetweenFetches SECONDS...\n\n";
        sleep ($AP_config::waitTimeBetweenFetches);
    } #for my $i ends, fetched news indices
      #$AP_config::repeatFetchTimesBetweenLogins times
} #while (1) ends
#########SUBROUTINES####################################
################sub login################################
#login to the AP newstracker page. This page is at
#$AP_config::baseURL = http://www.apnewstracker.com.
#The form to fill in to login is at $AP_config::loginURL
sub login{
    my ($ua, $cookie_jar) = @_;
    my ($request, $response);
    my ($indexContents);
    #add cookie information to the request
    #the request logs onto the apnewstracker.com site
    $request = POST $AP_config::loginURL,
        [username => $AP_config::loginName, password => $AP_config::loginPassword];
    print $request->as_string . "\n";
    $cookie_jar -> add_cookie_header ($request);
    #Get the response to the request and extract cookies from what comes back
    $response = $ua -> request ($request);
    $cookie_jar -> extract_cookies ($response);
    #print HTTP error message if the user couldn't login or there was an HTTP error
    if ($response -> is_success){
        return $response;
    } else {
        print "Could not login to the http://www.apnewstracker.com site\n";
        print $response -> error_as_HTML;
        return 0; #If it couldn't login or there was an HTTP error, it returns the number 0
    }
} #sub login ends
################sub fetchIndexPage################################
#Fetch the index page of news items
sub fetchIndexPage{
    my ($ua, $cookie_jar) = @_;
    #print "Inside fetchIndexPage subroutine\n";
    my ($request, $response);
    my ($indexContents);
    #Now, get the actual page where the news index is.
    #This page http://www.apnewstracker.com/is3/runprofile.hts is updated
    #every minute or two.
    $request = GET $AP_config::newsIndexURL;
    $cookie_jar -> add_cookie_header ($request);
    $response = $ua -> request ($request);
    $cookie_jar -> extract_cookies ($response);
    #Parse the news index page if it comes back. If there
    #is HTTP error, print an error page
    if ($response -> is_success){
        print "Obtained $AP_config::newsIndexURL\n";
        $indexContents = $response -> content;
        return $indexContents;
    } else {
        print $response -> error_as_HTML;
        return 0; #return 0 if there is a HTTP error
    }
} #sub fetchIndexPage ends
################sub parseIndexPage####################################
#A subroutine that parses the AP News index page:
# http://www.apnewstracker.com/is3/runprofile.hts
#The page is passed to the subroutine as a string
sub parseIndexPage{
    my ($fileContents) = @_;
    my (@tables, @items);
    my ($item, $month, $day, $year, $url, $title, $hour, $minute, $headline);
    my ($relevantPart) =
        ($fileContents =~ m#DataStream A(.+?)technology<#s);
    @items = ($relevantPart =~ m#(
The program uses several modules we have seen earlier: HTTP::Response, LWP::UserAgent, HTTP::Cookies and HTTP::Request::Common. These are pre-defined, well-regarded modules that can be downloaded from the Internet. We also have a module called AP_config that declares and gives values to certain variables used in the program. The module is shown below. The login name and the password have been replaced by a sequence of Xs.
Program 9.13
package AP_config;
#Define all site-specific global variables
use strict;
use vars qw($baseURL $loginURL $newsIndexURL);
use vars qw($loginName $loginPassword); #To login to apnewstracker.com
use vars qw($waitTimeBetweenFetches $waitTimeBetweenFailedLogins);
use vars qw($repeatFetchTimesBetweenLogins $noLoginTries);
use vars qw($userAgent $userAgentTimeout);
$baseURL = "http://www.apnewstracker.com";
$loginURL = "$baseURL" . "/is3/wsh-kickoff.hts";
$newsIndexURL = "$baseURL" . "/is3/runprofile.hts";
$loginName = "XXXXXXX";
$loginPassword = "XXXXXX";
#Global variables controlling timing and retries
$waitTimeBetweenFetches = 180;
$waitTimeBetweenFailedLogins = 180;
$noLoginTries = 10;
$repeatFetchTimesBetweenLogins = 20;
$userAgent = "Mozilla/4.7 (Compatible; MSIE 5.0; Windows2000)";
$userAgentTimeout = 400;
1;
One of the first things that the main program does is to specify a file to store cookies sent by a Web server.
#Create cookie_jar to be stored in file $AP_config::cookieFile
$cookie_jar = HTTP::Cookies -> new (file => $AP_config::cookieFile,
                                    ignore_discard => 1);
This is done by creating a new HTTP::Cookies object. The constructor method new takes one or more parameters, given as key-value pairs. The file argument gives the name of the file in which cookies sent to the client by the Web server are stored. The second argument, ignore_discard, is a Boolean and is optional. When it is true, the program saves even those cookies that the server has asked the client to discard. Initially, the cookie file is empty.
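The constructor call can be tried in isolation with the following minimal sketch. It assumes a temporary file in place of $AP_config::cookieFile and adds one cookie by hand so that there is something to write. It also assumes the autosave option, which Program 9.12 does not use: according to the HTTP::Cookies documentation, the jar is written to its file only when the save method is called or when autosave is true and the jar object is destroyed.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use File::Temp qw(tempdir);
use HTTP::Cookies;

# A file in a temporary directory stands in for $AP_config::cookieFile
# (an assumption made so that the sketch is self-contained).
my $dir = tempdir (CLEANUP => 1);
my $cookieFile = "$dir/cookies.txt";

{
    my $cookie_jar = HTTP::Cookies -> new (
        file           => $cookieFile,
        autosave       => 1,  # write the file when the jar is destroyed
        ignore_discard => 1,  # also save cookies marked "discard"
    );

    # Normally extract_cookies fills the jar. Here one session cookie is
    # added by hand; the arguments are version, key, value, path, domain,
    # port, path_spec, secure, maxage and discard.
    $cookie_jar -> set_cookie (0, "sessionid", "470092", "/",
                               "www.apnewstracker.com", undef, 1, 0, undef, 1);
}   # the jar goes out of scope here, so autosave writes the file

open (my $fh, "<", $cookieFile) or die "Cannot read $cookieFile: $!";
my @lines = <$fh>;
close ($fh);
print @lines;
```

Without ignore_discard, the session cookie above would be left out of the file, since it carries the discard flag.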
Next, the program creates a user agent $ua, gives the agent a description using the agent method, and uses the timeout method to specify how long the user agent should wait before giving up. Once the setup is over, the program enters a while loop in which it attempts to log in to the www.apnewstracker.com site. Logging in is done by the login subroutine, which takes the user agent and the cookie jar, $cookie_jar, as arguments. If the login is not successful, the program sleeps for a pre-determined amount of time and retries. It attempts to log in only a pre-specified number of times, say 10. If it has still not succeeded after all attempts, the program exits. This can happen when, for example, the server is down, overloaded, or unreachable.
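The retry logic can also be written without goto. The following is a minimal sketch, not the form used in Program 9.12: the subroutine retry and the stand-in $fakeLogin are hypothetical names, and the wait is set to zero seconds only so that the example finishes quickly.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Call $attempt up to $maxTries times, sleeping $waitSeconds after each
# failure. Returns the first true result, or 0 if every try failed.
sub retry {
    my ($attempt, $maxTries, $waitSeconds) = @_;
    for my $try (1 .. $maxTries) {
        my $result = $attempt -> ($try);
        return $result if $result;
        print "Attempt $try of $maxTries failed\n";
        sleep ($waitSeconds) if $try < $maxTries;
    }
    return 0;
}

# Hypothetical stand-in for &login: fails twice, then succeeds.
my $calls = 0;
my $fakeLogin = sub { return ++$calls >= 3 ? "logged in" : 0 };

my $outcome = retry ($fakeLogin, 10, 0);  # 0-second wait for the demo
print "Result: $outcome after $calls calls\n";
```

In the real program, the first argument would wrap the call to login, and the other two would come from $AP_config::noLoginTries and $AP_config::waitTimeBetweenFailedLogins.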
Once the program has logged in to the site, it fetches the index page a specified number of times after which it logs in again. This makes the program robust because it was found that the site logged the program out after a certain amount of time. The way the program is written now, it can go for many days fetching news items without any problems. The fetching of the news index page is done using the following statement.
#logged in at this point
$newsItemIndex = &fetchIndexPage ($ua, $cookie_jar);
If the news item index page cannot be fetched, it generally means that the program has been logged out. Therefore, in such a case, the program tries to log in again. The cookie file, which backs the cookie jar passed as an argument to fetchIndexPage, may look like the following after the program has logged in.
#LWP-Cookies-1.0
Set-Cookie3: entitlements="dsa{?tag"; path="/";
domain="www.apnewstracker.com"; path_spec;
discard; ; version=0
Set-Cookie3: numhits=12; path="/"; domain="www.apnewstracker.com";
path_spec;
discard; ; version=0
Set-Cookie3: sessionid=470092; path="/"; domain="www.apnewstracker.com"; path_spec;
discard; ; version=0
Set-Cookie3: sortval=1; path="/"; domain="www.apnewstracker.com";
path_spec;
discard; ; version=0
Set-Cookie3: userid=6001210; path="/"; domain="www.apnewstracker.com";
path_spec;
discard; ; version=0
Set-Cookie3: usertype=A; path="/"; domain="www.apnewstracker.com";
path_spec;
discard; ; version=0
Each cookie occupies a single line in the actual file. The lines have been broken here since they are too long to be printed.
We do not discuss the syntax of the cookies here. In the cookie file displayed above, each Set-Cookie3 entry stands for one cookie.
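Although we do not go into the file syntax, the contents of a cookie file can be listed from Perl. The sketch below writes two of the entries shown above to a temporary file (an assumption made to keep the example self-contained), loads the file into a jar, and walks the jar with the scan method of HTTP::Cookies, which calls the given subroutine once per cookie.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use File::Temp qw(tempfile);
use HTTP::Cookies;

# Write a small cookie file in the format shown above. Each cookie is
# on a single line, as in a real cookie file.
my ($fh, $cookieFile) = tempfile (UNLINK => 1);
print $fh <<'COOKIES';
#LWP-Cookies-1.0
Set-Cookie3: numhits=12; path="/"; domain="www.apnewstracker.com"; path_spec; discard; version=0
Set-Cookie3: sessionid=470092; path="/"; domain="www.apnewstracker.com"; path_spec; discard; version=0
COOKIES
close ($fh);

# Giving the file argument makes the constructor load the file.
my $cookie_jar = HTTP::Cookies -> new (file => $cookieFile,
                                       ignore_discard => 1);

my %seen;
$cookie_jar -> scan (sub {
    my ($version, $key, $val, $path, $domain, $port,
        $path_spec, $secure, $expires, $discard) = @_;
    $seen{$key} = $val;
    printf "%-10s = %-8s domain=%s path=%s discard=%s\n",
        $key, $val, $domain, $path, ($discard ? "yes" : "no");
});
```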
After fetching an index page, the program parses the index page using the following call.
&parseIndexPage ($newsItemIndex); #This puts values in %newsIndexHash
Parsing places the news items in the hash %newsIndexHash. Finally, the program goes through each item in this hash and fetches the text of each news item by following the URL for the news item.
foreach $itemURL (reverse (sort (keys %newsIndexHash))){
    $itemText = &fetchNewsItemURL ($ua, $cookie_jar, $itemURL);
    if (!$itemText){
        print "Could not fetch url: $itemURL\n";
    }
} #foreach ends
In the call to fetchNewsItemURL, which fetches the URL of an individual news item, the cookies held in the cookie jar are sent to the Web server. Cookies must be sent in this way every time we attempt to retrieve a page from the Web server. This is important. If the appropriate cookies are not sent with the request, the server rejects the request. In this particular case, the cookies to be sent come from the cookie jar, which was updated with the cookies from the previous page fetched, so everything should be in proper order.
The program has several subroutines: login, fetchIndexPage, parseIndexPage and fetchNewsItemURL. We briefly look at each one next.
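Program 9.12 adds and extracts cookies by hand around every request. LWP::UserAgent can also do this itself once a cookie jar is attached to it with the cookie_jar option or method. The following sketch shows this alternative; it is not how the program above is written. The host www.example.com and the cookie are made up, and a request_send handler answers the request locally so that the example runs without network access.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use LWP::UserAgent;
use HTTP::Cookies;
use HTTP::Response;

my $cookie_jar = HTTP::Cookies -> new (ignore_discard => 1);
# A made-up session cookie for a made-up host.
$cookie_jar -> set_cookie (0, "sessionid", "470092", "/",
                           "www.example.com", undef, 1, 0, undef, 1);

# With the jar attached, the agent adds the Cookie header to each
# request on its own.
my $ua = LWP::UserAgent -> new (cookie_jar => $cookie_jar);

# Demo only: answer locally instead of contacting a server, and
# record the Cookie header that the agent prepared.
my $sentCookie;
$ua -> add_handler (request_send => sub {
    my ($request) = @_;
    $sentCookie = $request -> header ("Cookie");
    return HTTP::Response -> new (200, "OK",
        ["Content-Type" => "text/plain"], "canned page\n");
});

my $response = $ua -> get ("http://www.example.com/is3/item1.hts");
print "Cookie header sent: $sentCookie\n";
print "Status: ", $response -> status_line, "\n";
```

When requests go to a real server, the attached jar is also updated from the responses, so the explicit calls to add_cookie_header and extract_cookies become unnecessary.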
The subroutine login takes a user agent and a cookie jar as the arguments. It makes a POST request using the following statement.
$request = POST $AP_config::loginURL,
    [username => $AP_config::loginName, password => $AP_config::loginPassword];
The request is printed on the screen as a string for verification purposes. The HTTP::Request object needs to have appropriate cookie headers set. This is done using the following statement.
$cookie_jar -> add_cookie_header ($request);
The contents of the cookies, as well as all HTTP interactions taking place between the server and the client, can be seen if we use the LWP::Debug module and import the + symbol. In the current program, the line that imports this module has been commented out. The line looks like the following and occurs at the top of the program.
#use LWP::Debug qw (+);
If we want to see the HTTP interactions, this line needs to be uncommented. Seeing what is going back and forth between the Web server and the Web client is instructive as well as useful for debugging.
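In recent releases of libwww-perl, LWP::Debug is documented as deprecated, so the line above may have no effect. A similar trace can be obtained with the handler mechanism of LWP::UserAgent. The sketch below is one possible arrangement under that assumption: the first two handlers record and print each request and response, and a third handler, present only for the demonstration, answers locally so that no network access is needed. The URL is a placeholder.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use LWP::UserAgent;
use HTTP::Response;

my $ua = LWP::UserAgent -> new ();
my @trace;

# Record and print every outgoing request.
$ua -> add_handler (request_send => sub {
    my ($request) = @_;
    push (@trace, "REQUEST " . $request -> method . " " . $request -> uri);
    print $request -> as_string, "\n";
    return;  # returning nothing lets processing continue
});

# Record and print the status and headers of every response.
$ua -> add_handler (response_done => sub {
    my ($response) = @_;
    push (@trace, "RESPONSE " . $response -> status_line);
    print $response -> headers -> as_string, "\n";
    return;
});

# Demo only: answer locally instead of contacting a server.
$ua -> add_handler (request_send => sub {
    return HTTP::Response -> new (200, "OK",
        ["Content-Type" => "text/html"], "<html></html>\n");
});

$ua -> get ("http://www.example.com/is3/runprofile.hts");
print join ("\n", @trace), "\n";
```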
The HTTP::Request object $request must have a valid URL attribute before the add_cookie_header method can be called. A response is obtained whether the request is successful or not. The cookies are extracted from the response and the cookie jar is updated using the following statement.
$cookie_jar -> extract_cookies ($response);
The subroutine then looks to see if the status code associated with the response indicates success. If so, it returns the response to the calling program. Otherwise, it prints an error message and returns 0 to the calling program. Of course, the extracting of cookies can be done after checking the status code.
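The two cookie jar calls used in login can be exercised without a server. In the following sketch, the login URL, the form fields, and the Set-Cookie headers are all made up, and a hand-built HTTP::Response stands in for what $ua -> request would return. Note that extract_cookies looks at the request attached to the response to learn which host and path the cookies belong to.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use HTTP::Cookies;
use HTTP::Response;
use HTTP::Request::Common qw(GET POST);

my $cookie_jar = HTTP::Cookies -> new (ignore_discard => 1);

# Hypothetical login URL and form fields.
my $loginRequest = POST "http://www.example.com/login",
    [username => "someone", password => "secret"];

# Hand-built reply standing in for the server's response.
my $loginResponse = HTTP::Response -> new (200, "OK");
$loginResponse -> push_header ("Set-Cookie" => "sessionid=470092; path=/");
$loginResponse -> push_header ("Set-Cookie" => "usertype=A; path=/");
# Record which request produced this response.
$loginResponse -> request ($loginRequest);

# Move the cookies from the response into the jar.
$cookie_jar -> extract_cookies ($loginResponse);

# Every later request to the same site gets the cookies back.
my $nextRequest = GET "http://www.example.com/index";
$cookie_jar -> add_cookie_header ($nextRequest);
my $cookieHeader = $nextRequest -> header ("Cookie");
print "Cookie: $cookieHeader\n";
```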
The fetchIndexPage subroutine also takes the user agent and the cookie jar as arguments. It obtains the index page for news items using the GET function of the HTTP::Request::Common module. Cookie headers are added to the request. The cookies are then extracted from the response, and the cookie jar is updated if any cookie value is new. If successful, fetchIndexPage returns the content of the page fetched to the calling program.
The parseIndexPage subroutine parses the contents of the index page fetched earlier. We examine the HTML of the index page carefully and determine where the individual news items are. Thus, the HTML-level parsing that takes place in this subroutine is very much dependent on the syntax of the HTML page under consideration. Web sites are known to change the format of their Web pages frequently, and hence, the parsing performed here will have to be changed if the structure of the Web page is found to change. The components obtained for a news item are: $year, $month, $day, $hour, $minute, $file and $headline. Once the items of news have been obtained, they are joined to form a single string called $itemData. This string is stored in the global variable %newsIndexHash with the URL of the news item as its key.
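The general shape of such parsing can be sketched as follows. The markup, the section markers, and the "|" separator used to build $itemData are all invented for illustration; the real index page and the real parseIndexPage differ in their details.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Invented markup for illustration; the real index page differs.
my $indexPage = <<'HTML';
<html><body>
<!-- section A -->
<a href="/is3/story1.hts">03/15/2001 14:05 Markets rally on rate cut</a><br>
<a href="/is3/story2.hts">03/15/2001 13:40 Storm closes airports</a><br>
<!-- section B -->
<a href="/is3/other.hts">03/14/2001 09:00 Not wanted</a><br>
</body></html>
HTML

my %newsIndexHash;

# Keep only the text between the two (invented) section markers.
my ($relevantPart) =
    ($indexPage =~ m#<!-- section A -->(.+?)<!-- section B -->#s);

# One match per link: URL, date, time and headline.
while ($relevantPart =~
       m#<a href="([^"]+)">(\d\d)/(\d\d)/(\d{4}) (\d\d):(\d\d) (.+?)</a>#g) {
    my ($url, $month, $day, $year, $hour, $minute, $headline) =
        ($1, $2, $3, $4, $5, $6, $7);
    # Join the components into one string, keyed by the item's URL.
    my $itemData = join ("|", $year, $month, $day, $hour, $minute, $headline);
    $newsIndexHash{$url} = $itemData;
}

foreach my $url (sort keys %newsIndexHash) {
    print "$url => $newsIndexHash{$url}\n";
}
```

Because the patterns are tied to one page layout, a change in the site's HTML makes them silently match nothing, which is why the program treats an empty index as a reason to investigate.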
The fetchNewsItemURL subroutine is simple. It takes the user agent, the cookie jar, and the URL to fetch as arguments. It creates a GET request, adds the cookie headers to it, and fetches the page.
As indicated earlier, the program described above is the shell of a much more complex program written by the author and others. The program shows that cookies play an important role in Web programming. A lot of Web servers use such cookies, which are small pieces of data, to keep track of information about the clients fetching pages from them. Some cookies are short-lived whereas others are valid for longer periods, such as days or months. Some cookies are to be discarded by the client whereas others are to be saved for a certain duration of time. In the cookie file shown earlier, all cookies are to be discarded after the current session is over.
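The difference between the two kinds of cookies can be seen in a jar built by hand. In the sketch below, the host and the 30-day lifetime are assumptions chosen for illustration. The session cookie has no lifetime and carries the discard flag, whereas the persistent cookie is given a maximum age in seconds.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use HTTP::Cookies;

my $cookie_jar = HTTP::Cookies -> new ();

# Session cookie: no lifetime (maxage is undef) and the discard flag set.
$cookie_jar -> set_cookie (0, "sessionid", "470092", "/",
                           "www.example.com", undef, 1, 0, undef, 1);
# Persistent cookie: valid for 30 days, given as a maximum age in seconds.
$cookie_jar -> set_cookie (0, "userid", "6001210", "/",
                           "www.example.com", undef, 1, 0,
                           30 * 24 * 60 * 60, 0);

my $everything     = $cookie_jar -> as_string;
my $persistentOnly = $cookie_jar -> as_string (1);  # skip discardable cookies

print "All cookies:\n$everything\n";
print "Cookies that outlive the session:\n$persistentOnly\n";

# At the end of a session, the temporary cookies can be dropped.
$cookie_jar -> clear_temporary_cookies;
my $afterClear = $cookie_jar -> as_string;
print "After clear_temporary_cookies:\n$afterClear\n";
```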
During a single session, as a Web client goes from page to page within a certain Web site, the server checks the names and contents of the cookies for various pieces of information about the client. Without the appropriate cookie names and values, the Web server may refuse to send the requested Web page to the client. A cookie may indicate whether the user is logged in during this session, when the user last logged in, the type of machine and browser being used, etc. Therefore, when we program a Web client to interact with such a Web server, cookie headers need to be appropriately filled in for each Web page request. The cookies are saved in a cookie file, and usually this file is updated after each interaction with the Web server. The program discussed above uses cookies fairly extensively and illustrates how they can be manipulated. More detailed discussion of cookies can be found on Web sites such as www.netscape.com and www.cookiecentral.com.