


(1)  To become familiar with this module, first run the script:

             retrieve_with_VSM.pl

     This script carries out basic VSM-based model construction and
     retrieval.

     When you run the above script, you will be asking the module to
     retrieve the Java files in the 'corpus' directory that match the
     un-commented query shown at the top of the script.  Try running the
     script with one of other queries by uncommenting the appropriate query
     line.



(2)  Next run the script:

             retrieve_with_LSA.pl

     For basic LSA-based model construction and retrieval.  As with the
     previous script, you will be asking the module to retrieve the Java
     files in the 'corpus' directory that match the un-commented query
     shown at the top of the script.  Try running the script with one of
     other queries by uncommenting the appropriate query line.



(3)  The first script mentioned above deposits the calculated VSM model 
     in disk-based database files.  To retrieve from the disk-based
     VSM model created by the 'retrieve_with_VSM.pl' script, run the script

             retrieve_with_disk_based_VSM.pl

     Try running this script with different queries.  You can uncomment one
     of the other queries that are currently commented out or you can 
     create fresh queries of your own.



(4)  The disk-based information that is saved away by retrieve_with_VSM.pl
     can also be used for a faster calculation of the LSA model and 
     retrieval from the model thus created.  For this you'd need to call:

             retrieve_with_disk_based_LSA.pl

     Try running this script with different queries.  You can uncomment one
     of the other queries that are currently commented out or you can 
     create fresh queries of your own.



(5)  For your first experiments with measuring the accuracy of retrieval  
     performance, execute the script

             calculate_precision_and_recall_for_VSM.pl
   
     This script first tries to estimate the relevancies of the corpus
     files to each of the queries in the file 'test_queries.txt'.  The
     module calculates the two measures Precision@rank and Recall@rank.
     The area under the Precision vs. Recall curve for each query is the
     accuracy of retrieval for that query.  Averaging of this result over
     all the queries yields the more global metric MAP (Mean Average
     Precision).
     
     As mentioned elsewhere in the module documentation, estimating
     relevancies in the manner carried out by the module is not safe.  
     Relevancies are supposed to be supplied by humans.  All that a computer 
     can do to estimate relevancies is to count the number of query words in a
     document.  But, measuring relevancies in this manner creates a circular
     dependency between the retrieval algorithm and the estimated
     relevancy.


(6)  Do the same as in the previous step, but this time for LSA, by
     executing the script

             calculate_precision_and_recall_for_LSA.pl


(7)  As mentioned above in the note for step (5), measuring retrieval
     accuracy requires human-supplied relevancy judgments.  Assuming 
     that such judgments are made available to the module through the
     file named through the constructor parameter 'relevancy_file',
     you can run the script

          calculate_precision_and_recall_from_file_based_relevancies_for_VSM.pl

     for the case of VSM.  This script will print out the average
     precisions for the different test queries and calculate the MAP metric
     of retrieval accuracy.


(8)  To do the same as above but for the case of LSA, run the script


          calculate_precision_and_recall_from_file_based_relevancies_for_LSA.pl


(9)  To carry out significance testing for comparing two different retrieval
     algorithms (VSM or LSA with different values for some of the
     constructor parameters), run the script

          significance_testing.pl  randomization
    
     or 


          significance_testing.pl  t-test

     As currently set up, the case-study incorporated in the script
     significance_testing.pl is for two different versions of the LSA
     algorithm with two different thresholds for the important parameter
     lsa_svd_threshold.  Note that the command-line argument determines
     which type of significance test will be carried out, the one based on
     randomization or the one based on student-t.
