PageRank Calculation using Sparse Matrix in Clustered Computer Environment



                Kwan Dong Kim, Chang Min Kim




Table of Contents

1. Introduction
2. PageRank
       A. Definition
       B. PageRank Algorithm
3. Sparse Matrix Representation
       A. Definition
       B. Compression Algorithm
4. Hardware Environment
       A. V2 Cluster
       B. Parallel Java
       C. Parallel Java in V2 Cluster
5. PageRank Iteration in Cluster
6. Experiments and Results
       A. Methodology
       B. Results and Conclusion
7. Acknowledgement
8. Reference
9. Appendix
       A. Establishing SSH Connection and Running MPI program
       B. Shell scripts
       C. Java Codes

1. Introduction
     The purpose of the Web Laboratory is to provide data and computing tools for research
     about the Web and the information on the Web. “It is funded in part by National Science
     Foundation grants CNS-0403340, DUE-0127308, SES-0537606, and IIS 0634677. The
     Web Lab is an NSF Next Generation Cyberinfrastructure project”[1]. To achieve this goal,
     three teams work on the Web Lab project: the Index, User Interface and PageRank teams.
     The Index team provides full-text indexing for the other teams using a Linux cluster
     server. The User Interface team is responsible for the user interface and for
     pre-processing the data for PageRank. The PageRank team works on the compressed sparse
     matrix and the parallel programming needed to calculate PageRank at large scale. This
     document is the project report of the PageRank team. It explains the methods we used to
     calculate large-scale PageRank and then presents the results of an experiment that tests
     the performance and scalability of the algorithm.


2. PageRank
    A document on the web is typically measured by two metrics, “relevance” and
    “importance”. Relevance is based on a comparison between the document terms and the
    query terms, while importance is based on an estimate of the popularity of the document.
    Although search engines combine the two metrics in practice, importance by popularity is
    considered the key metric for ranking a document on the web. Computing relevance requires
    every document in a set to be compared with the query terms, which is neither practical
    nor desirable for the web because of the cost, the growing number of documents and the
    uneven quality of those documents. Relevance is better suited to a controlled collection
    of documents. Among the several popularity measures for a web page, “PageRank” provides
    the most suitable and efficient semantics and algorithm.

    A. Definition
    Brin and Page (1998) suggested “PageRank” as a measure for estimating the popularity
    of a web page[2]. PageRank calculates the probability of reaching a given web page
    based on the in-links to that page. “PageRank is basically modified version of
    Pinski and Narin’s influence weights applied to the web graph” (Arms, 2006). The PageRank
    calculation iterates matrix multiplications and additions, but it is proven to converge
    in a reasonable amount of time, and the number of iterations does not depend strongly on
    the size of the input data, i.e. the number of pages. This is the major reason why
    PageRank can be applied to a large set of documents such as those on the web, unlike
    other algorithms such as “Hubs and Authorities” (Kleinberg, 1997)[3].




B. PageRank Algorithm
PageRank essentially calculates the probability of reaching a page under the “Random
Surfer” model. In this model, a user surfs the web without any restriction except one
assumption: the user either follows a link on the current page or jumps to another page at
random. The probability of reaching a page from another page is determined purely by the
number of links on the page the surfer is currently on. From the point of view of the page
being reached, a page with many in-links has a higher probability of being reached and thus
gets a higher PageRank. The surfer begins with the same probability of reaching any page in
the set of documents, which is virtually the whole web.
Suppose the number of pages in the set is n.
Then W0 is a vector in which every element is 1/n, the probability of starting at each page.
Next, set up a square matrix B that contains all the link information of the pages in the
set. The column index represents the page a link is from and the row index represents the
page a link is to. Each cell contains 1 if there is a link from the column page to the row
page. Then normalize the matrix by dividing each column by the total number of links in that
column, so that each cell contains the probability of reaching the row page from the column
page. A page that has no links to other pages is called “a dangling node”, and every cell in
its column is 1/n, because a surfer who reaches that page finds no out-link and jumps to
another page at random.


W0 = [¼, ¼, ¼, ¼]

                  1   2   3   4

B =           1   0   ½   ¼   1/3
              2   ½   0   ¼   1/3
              3   0   0   ¼   1/3
              4   ½   ½   ¼   0

        * page 3 is a dangling page


The probabilities of reaching each page after one step from the starting page, without a
random jump to another page, can be calculated by W1 = B * W0’. If we consider a random
jump with probability 1-d (d is the damping factor), then the equations become


W1 = d * (B * W0’) + (1-d) * W0’, and
W2 = d * (B * W1’) + (1-d) * W0’
W3 = d * (B * W2’) + (1-d) * W0’
.
.
.
Wk = d * (B * Wk-1’) + (1-d) * W0’

Here W0’ is the vector with every element 1/n, so the last term spreads the jump
probability 1-d equally over all pages.


The sum of the elements of each W is 1 because the sum is the probability of reaching some
page in the set. Every element of W eventually converges, and the resulting W is the
PageRank of the pages in the set.
The convergence to a unique vector for any given starting vector W0 is a property of Markov
chains: it holds because the transition matrix, once random jumps are included, is
stochastic, irreducible, and aperiodic (Arms, 2006).
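
The iteration can be sketched in Java on the four-page example. This is a minimal sketch
under stated assumptions, not the project's code: the class name, the damping factor 0.85,
the tolerance and the iteration cap are all hypothetical choices. The update it uses is
Wk = d * (B * Wk-1) + (1-d) * W0 with W0 the uniform vector, which is the form that agrees
with equation 4 later in this section.

```java
// Sketch: dense power iteration on the four-page example matrix B.
// Class name, damping factor, tolerance and iteration cap are assumptions.
final class DensePageRankSketch {

    // One step: W_k = d * (B * W_{k-1}) + (1 - d) * W_0, where W_0[i] = 1/n.
    static double[] step(double[][] b, double[] w, double d) {
        int n = w.length;
        double[] next = new double[n];
        for (int row = 0; row < n; row++) {
            double sum = 0.0;
            for (int col = 0; col < n; col++) {
                sum += b[row][col] * w[col];
            }
            next[row] = d * sum + (1.0 - d) / n;
        }
        return next;
    }

    // Repeats step() until the largest change in any element is below tol.
    static double[] iterate(double[][] b, double d, double tol, int maxIterations) {
        int n = b.length;
        double[] w = new double[n];
        java.util.Arrays.fill(w, 1.0 / n);
        for (int k = 0; k < maxIterations; k++) {
            double[] next = step(b, w, d);
            double change = 0.0;
            for (int i = 0; i < n; i++) {
                change = Math.max(change, Math.abs(next[i] - w[i]));
            }
            w = next;
            if (change < tol) {
                break;
            }
        }
        return w;
    }

    // The example matrix B; the third column belongs to the dangling page.
    static double[][] exampleB() {
        return new double[][] {
            {0.0, 0.5, 0.25, 1.0 / 3},
            {0.5, 0.0, 0.25, 1.0 / 3},
            {0.0, 0.0, 0.25, 1.0 / 3},
            {0.5, 0.5, 0.25, 0.0}
        };
    }

    public static void main(String[] args) {
        double[] w = iterate(exampleB(), 0.85, 1e-10, 1000);
        for (int i = 0; i < w.length; i++) {
            System.out.println("page " + (i + 1) + ": " + w[i]);
        }
    }
}
```

Because every column of B sums to 1, each step keeps the elements of W summing to 1, which
is a convenient sanity check when experimenting with other link matrices.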


B is stored as a dense matrix, but in the real world a typical web page does not have more
than several hundred links at most, and even that is unusual. In our experiment from Amazon
dataset the average out-link in a page was , and the page with most link was . Under this
condition, iterating matrix multiplication on a dense matrix reduces the performance of the
PageRank calculation significantly. We can construct a sparse matrix by rewriting B as


C = S + (1/n) * e * a’                                              --- 1


Then C accounts for the random jump from dangling pages.
S is the initial B matrix before 1/n is added for dangling pages, e is a vector with every
element 1, and “a” is a vector whose element is 1 if the corresponding page is dangling and
0 otherwise. If we define a matrix L


L = d * C + (1-d)(1/n) * E                                  --- 2


Then L considers random jump from any page by user choice.
E is a square Matrix with every element 1. Then


Wk = L * Wk-1’,                                                     --- 3


If we substitute L using equations 1 and 2


    Wk = (d * C + (1-d)(1/n) * E) * Wk-1’
       = d * S * Wk-1’ + d(1/n) * e * a’ * Wk-1’ + (1-d)(1/n) * E * Wk-1’
       = d * S * Wk-1’ + d(1/n) * e * a’ * Wk-1’ + (1-d)(1/n) * e            --- 4
    (note that E * Wk-1’ is e, because the elements of Wk-1 sum to 1)


    There is no dense matrix in this equation, so iterating with it is significantly more
    efficient as the set of pages grows. We used equation 4 for our experiments.
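
One step of equation 4 can be sketched in Java as follows. This is an illustration under
assumptions rather than the project's implementation: S is held as three parallel arrays of
(row, column, value) entries instead of the compressed structure of section 3, indices are
0-based, and the class name and the damping factor 0.85 are hypothetical.

```java
// Sketch: one step of equation 4 using only the non-zero entries of S.
// Storage layout, names and the damping factor are assumptions.
final class SparsePageRankSketch {

    // rows[i], cols[i], vals[i] describe one non-zero entry of S (0-based).
    // dangling[j] is true when page j has no out-links (the vector "a").
    static double[] step(int[] rows, int[] cols, double[] vals,
                         boolean[] dangling, double[] w, double d) {
        int n = w.length;

        // a' * W_{k-1}: the total probability currently on dangling pages.
        double danglingMass = 0.0;
        for (int j = 0; j < n; j++) {
            if (dangling[j]) {
                danglingMass += w[j];
            }
        }

        // The second and third terms of equation 4 are the same for every row:
        // d * (1/n) * (a' * W) + (1 - d) * (1/n).
        double shared = d * danglingMass / n + (1.0 - d) / n;
        double[] next = new double[n];
        java.util.Arrays.fill(next, shared);

        // First term: d * S * W_{k-1}, touching only the stored entries.
        for (int i = 0; i < vals.length; i++) {
            next[rows[i]] += d * vals[i] * w[cols[i]];
        }
        return next;
    }

    public static void main(String[] args) {
        // S for the four-page example; the page with index 2 is dangling.
        int[] rows = {1, 3, 0, 3, 0, 1, 2};
        int[] cols = {0, 0, 1, 1, 3, 3, 3};
        double[] vals = {0.5, 0.5, 0.5, 0.5, 1.0 / 3, 1.0 / 3, 1.0 / 3};
        boolean[] dangling = {false, false, true, false};

        double[] w = {0.25, 0.25, 0.25, 0.25};
        for (int k = 0; k < 50; k++) {
            w = step(rows, cols, vals, dangling, w, 0.85);
        }
        System.out.println(java.util.Arrays.toString(w));
    }
}
```

The work per step is proportional to the number of stored links plus n, rather than n x n,
because the dangling and random-jump terms collapse into one scalar added to every row.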


    However, calculating PageRank on a very large set of pages, as the PageRank team intends
    to, raises a memory problem. Constructing a huge sparse matrix in uncompressed form is
    not desirable in terms of either memory utilization or computational efficiency. We can
    transform the sparse matrix into a compressed sparse matrix using the sparse matrix
    representation technique, which is discussed in section 3.


    Even with this representation, in-memory computation on a single machine is impractical
    when the data set is too large, so we need to move to a clustered computing environment.
    (Considering the large number of web pages, this is a necessity rather than an option.)
    Using the “Message Passing Interface (MPI)”, multiple nodes can exchange the relevant
    portions of the data and compute on them concurrently. Message passing is expensive in
    terms of speed, but given the size of real data it is unavoidable. The hardware
    environment and setup for clustered computing are discussed in section 4.


    In this clustered computing environment we need to modify the algorithm slightly so that
    the information each node needs can be distributed and gathered among the nodes
    effectively and without loss. The actual process of the algorithm and the problems with
    clustered machines are discussed in section 5.


3. Sparse Matrix Representation
    A. Definition
    A sparse matrix is a matrix that consists primarily of zeros. Sparse matrix calculation
    is important for PageRank because the PageRank calculation is fundamentally a
    combination of several sparse matrix calculations. Therefore, the more efficiently we
    handle the matrix, the larger the PageRank calculation we can process. Fig-1 is an
    example of a sparse matrix. It contains many “0”s and only a few values greater than
    “0”.




               index         1      2       3      4       5      6       7      8      9
               1             0      0       1      0       0      0       0      0      0
               2             0      0       0      0       0      0       0      0      0
               3             1      0       0      1       0      0       0      2      0
               4             0      0       0      0       0      0       0      0      0
               5             0      0       0      0       0      0       0      1      0
               6             .      .       .      .       .      .       .      .      .
                          (Fig - 1) Example of sparse matrix


As we mentioned, the PageRank calculation is a combination of several sparse matrix
calculations. If there are 100 pages, we have to work with a 100 x 100 matrix to calculate
PageRank without compression. For the Web Lab project there are approximately 13 million
pages, so a matrix of at least 13 million x 13 million must be processed. If we use a
regular array, vector or linked list, an enormous amount of memory is required. For
example, suppose there are one million pages and each page has approximately 10 out-links
on average. With a regular array structure in which each cell needs 4 bytes, the matrix
requires 4 bytes x one million x one million = 4 x 10 ^ 12 bytes. However, if we ignore all
zero values and store only the values greater than zero, we need only 4 x 10 ^ 7 bytes for
the values (4 bytes x one million x 10). As a result, the compressed sparse matrix algorithm
saves an enormous amount of memory when we calculate the PageRank.
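
The arithmetic above can be reproduced with a few lines of Java. The class and method names
are hypothetical; the figures (one million pages, 10 out-links per page, 4 bytes per cell)
are the ones used in the example.

```java
// Sketch: storage arithmetic for the one-million-page example.
// Class and method names are assumptions for illustration.
final class MatrixMemorySketch {

    // Bytes for a full pages x pages array with bytesPerCell bytes per cell.
    static long denseBytes(long pages, long bytesPerCell) {
        return pages * pages * bytesPerCell;
    }

    // Bytes for the non-zero values only: pages * avgOutLinks stored cells.
    static long sparseValueBytes(long pages, long avgOutLinks, long bytesPerCell) {
        return pages * avgOutLinks * bytesPerCell;
    }

    public static void main(String[] args) {
        long pages = 1_000_000L;
        System.out.println("dense bytes:        " + denseBytes(pages, 4));
        System.out.println("sparse value bytes: " + sparseValueBytes(pages, 10, 4));
    }
}
```

The sparse figure counts the stored values only. A compressed format also has to store
where each value sits, so the real total is somewhat larger, but it stays proportional to
the number of links rather than to the square of the number of pages.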


B. Compression Algorithm
Three linked lists are used to implement the compressed sparse matrix. The first is the row
list, which stores the number of non-zero entries in each row. The second is the column
list, which stores the column index of each non-zero entry. The third is the value list,
which stores the value of each non-zero entry [4]. Fig-2 shows the structure of the
compressed sparse matrix, and Fig-3 shows the equivalent regular matrix representation.




     Row index         1 2 3 4 5 6 7

     Row List          5 0 4 2 1 0 . . . . . . .
                       (the first entry is the number of columns stored for the first row)

     Column List       2 3 7 8 9 1 3 4 9 2 4 8 . .

     Value List        1 2 1 1 2 1 3 1 2 3 4 5 . .

    (Fig - 2) Compressed sparse matrix representation


                   index        1       2      3      4       5      6      7       8      9
                   1            0       1      2      0       0      0      1       1      2
                   2            0       0      0      0       0      0      0       0      0
                   3            1       0      3      1       0      0      0       0      2
                   4            0       3      0      4       0      0      0       0      0
                   5            0       0      0      0       0      0      0       5      0
                   6            .       .      .      .       .      .      .       .      .
                              (Fig - 3) Regular Matrix Representation
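
The correspondence between Fig-2 and Fig-3 can be sketched in Java. This is an illustration
only: it uses plain arrays instead of the linked lists described above, the class name is
hypothetical, and row and column indices are 1-based to match the figures.

```java
// Sketch: building the three lists of Fig-2 from a regular matrix (Fig-3).
// Uses arrays rather than linked lists; names are assumptions.
final class CompressedMatrixSketch {
    final int[] rowCounts; // number of stored entries in each row
    final int[] columns;   // 1-based column index of each stored entry, row by row
    final int[] values;    // value of each stored entry, in the same order as columns

    CompressedMatrixSketch(int[][] dense) {
        int nonZeros = 0;
        for (int[] row : dense) {
            for (int v : row) {
                if (v != 0) {
                    nonZeros++;
                }
            }
        }
        rowCounts = new int[dense.length];
        columns = new int[nonZeros];
        values = new int[nonZeros];
        int next = 0;
        for (int r = 0; r < dense.length; r++) {
            for (int c = 0; c < dense[r].length; c++) {
                if (dense[r][c] != 0) {
                    rowCounts[r]++;
                    columns[next] = c + 1;
                    values[next] = dense[r][c];
                    next++;
                }
            }
        }
    }

    // Returns the cell at (row, col), both 1-based; cells not stored are 0.
    int get(int row, int col) {
        int start = 0;
        for (int r = 0; r < row - 1; r++) {
            start += rowCounts[r];
        }
        for (int i = start; i < start + rowCounts[row - 1]; i++) {
            if (columns[i] == col) {
                return values[i];
            }
        }
        return 0;
    }

    public static void main(String[] args) {
        int[][] fig3 = {
            {0, 1, 2, 0, 0, 0, 1, 1, 2},
            {0, 0, 0, 0, 0, 0, 0, 0, 0},
            {1, 0, 3, 1, 0, 0, 0, 0, 2},
            {0, 3, 0, 4, 0, 0, 0, 0, 0},
            {0, 0, 0, 0, 0, 0, 0, 5, 0}
        };
        CompressedMatrixSketch m = new CompressedMatrixSketch(fig3);
        System.out.println("Row List:    " + java.util.Arrays.toString(m.rowCounts));
        System.out.println("Column List: " + java.util.Arrays.toString(m.columns));
        System.out.println("Value List:  " + java.util.Arrays.toString(m.values));
    }
}
```

Looking up a cell walks the row counts to find where its row starts. Keeping a running
offset per row instead of a count would avoid that walk, at the cost of departing from the
layout shown in Fig-2.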



4. Hardware Environment
    A. V2 Cluster
    Even with the compressed sparse matrix representation, the Web Lab data is still too
    large to process on a single machine, so we decided to process it on multi-node Linux
    cluster servers. The Cornell Theory Center has several hundred nodes of Linux cluster
    servers, which we can use. To calculate PageRank on the cluster, we developed a program
    that runs on multiple Linux cluster nodes. It divides the work into jobs and processes
    each job on a separate node simultaneously. Each node produces a result when its job is
    finished. When all processes have finished, the results are returned to the Linux login
    machine, where they are merged together.


B. Parallel Java
As we mentioned, the Linux cluster servers in the Cornell Theory Center are used to run our
PageRank program. According to the Cornell Theory Center manual for parallel programming,
only the C and Fortran languages are supported on this system. However, our goal is to
develop a Java PageRank program that runs on the Linux cluster servers in the Cornell
Theory Center, so we looked for Java packages that support parallel programming.


The parallel programming packages we considered are listed below:
1. mpiJava (http://aspen.ucs.indiana.edu/pss/HPJava/mpiJava.html)
2. Open MPI (http://www.open-mpi.org/)
3. OpenMP API (http://docs.sun.com/app/docs/doc/819-3694)
4. MPICH2 (http://www-unix.mcs.anl.gov/mpi/mpich/)
5. Parallel Java (http://www.cs.rit.edu/~ark/pj.shtml)


Although there are many packages supporting parallel programming, we selected Parallel Java
(PJ), implemented by Professor Alan Kaminsky at Rochester Institute of Technology, because
this package is well documented and includes several examples that are easy to follow.


There are several important classes and programs in the Parallel Java package: the “Comm”
and “Buffers” classes, and the “JobScheduler”, “Backend” and “Frontend” programs. “Comm” and
“Buffers” are used for communication between the backend and the frontend. The
“JobScheduler” assigns jobs to the nodes (frontend or backend). The frontend node controls
all of the backend nodes, gathers their results and merges them together. The backend nodes
process the actual data and send the results back to the frontend node. The “JobScheduler”,
“Backend” and “Frontend” programs are required when we run our parallel program (refer to
the Parallel Java documentation) [5].


C. Parallel Java in V2 Cluster
To set up the appropriate environment for the parallel program, it is essential to
understand the structure of the Linux cluster servers in the Cornell Theory Center. The
center has many systems, but we use only three: the Linux Login Machine, the File Server
and the V2 Linux Cluster Servers. The overall structure of the system is shown in Fig-4. A
client can connect only to the Linux Login machine at first. After logging in there, the
user can connect to the V2 Linux Cluster Servers using the SSH command. User directories and
files are stored and managed on the File Server. When the client connects to the Linux Login
machine or to the V2 Linux Cluster Servers, the user directory from the File Server is
mounted automatically. Therefore the user sees the same file system everywhere in the
system, including the Linux Login machine and the V2 Linux Cluster Servers[6].




[Diagram: Client, Linux Login Machine, File Server, and V2 Linux Cluster Servers
(Vii0001, Vii0002, ..., Vii000k)]

(Fig-4) The structure of Linux Cluster Servers


As we mentioned, the Linux cluster servers in the Cornell Theory Center support only the C
and Fortran languages, so the Parallel Java (PJ) package needs a special setup. There are
five essential conditions for running the Parallel Java package on the Linux cluster
servers.


These conditions are listed below:
    1.     Java must be installed on each node (V2 Linux Cluster Servers)
    2.     The nodes can communicate with each other using SSH without any authentication
           prompt
    3.     The packages must be deployed on each node
    4.     The Job Scheduler should be running in each node
    5.     The program should be developed as explained in the PJ documentation


    To satisfy the first condition, we asked the Cornell Theory Center administrator to
    install Java 1.5 on the Linux cluster servers. We set up a public key for our account to
    allow SSH communication without an authentication prompt. We then wrote shell scripts
    that copy all required programs to each node and generate the configuration file for the
    job scheduler. Finally, these scripts start the Job Scheduler and our program. The
    detailed shell code for the setup and the procedure for setting up the public key are
    supplied in the appendix.


5. PageRank Iteration in Cluster
    Besides the hardware setup for clustered machines, we need to modify the PageRank
    process slightly. It is important to understand the relationship between a matrix and
    its compressed representation, how the PageRank iteration uses the compressed
    representation, the characteristics of the input data, the hardware environment, and
    some facts about the “Parallel Java” package.
    First, when we convert a sparse matrix to the compressed format, we intentionally
    eliminate the cells with value 0 to get rid of unnecessary information, at the cost of
    losing some information. In our implementation, we keep a row in the data structure even
    when it has no entry, as long as there was an attempt to store any value in that row,
    even a 0. This is good enough for the PageRank calculation because the link matrix is
    always square: even without full column information, we can derive the column dimension
    from the row dimension. However, if the compressed representation is used in other
    calculations, it should be modified to keep the column dimension.
    Second, in the clustered PageRank calculation, each node needs to hold only the portion
    of the input file that it calculates. However, each node must have the full W vector to
    compute an iteration, and e, a’ and ea’ must be sliced so that the appropriate portion
    takes part in each node's calculation. The W vector that results on each node is only a
    part of the full W vector, so the parts must be gathered by the master node and the full
    vector re-distributed to every node.
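
The slicing and gathering described above can be illustrated with a single-process Java
sketch. It makes no Parallel Java or MPI calls: the gather and re-distribution are simulated
by copying slices into one array. The class name, the even row-block split and the
coordinate storage of S are assumptions, not the project's design.

```java
// Sketch: row-block partitioning of one PageRank step, simulated in one process.
// No message passing is performed; all names here are hypothetical.
final class PartitionedStepSketch {

    // Returns {start, end}: node `rank` of `size` owns rows start..end-1 of n rows.
    static int[] rowRange(int rank, int size, int n) {
        int base = n / size;
        int extra = n % size;
        int start = rank * base + Math.min(rank, extra);
        int end = start + base + (rank < extra ? 1 : 0);
        return new int[] {start, end};
    }

    // Computes rows start..end-1 of equation 4. The node needs the full vector w
    // and the full dangling flags. In this sketch every entry of S is passed in
    // and filtered; a real node would hold only the entries of its own rows.
    static double[] partialStep(int[] rows, int[] cols, double[] vals,
                                boolean[] dangling, double[] w, double d,
                                int start, int end) {
        int n = w.length;
        double danglingMass = 0.0;
        for (int j = 0; j < n; j++) {
            if (dangling[j]) {
                danglingMass += w[j];
            }
        }
        double[] slice = new double[end - start];
        java.util.Arrays.fill(slice, d * danglingMass / n + (1.0 - d) / n);
        for (int i = 0; i < vals.length; i++) {
            if (rows[i] >= start && rows[i] < end) {
                slice[rows[i] - start] += d * vals[i] * w[cols[i]];
            }
        }
        return slice;
    }

    // Stand-in for gather and re-distribution: compute every node's slice and
    // copy the slices, in rank order, into the next full vector.
    static double[] simulatedIteration(int[] rows, int[] cols, double[] vals,
                                       boolean[] dangling, double[] w, double d,
                                       int nodes) {
        double[] next = new double[w.length];
        for (int rank = 0; rank < nodes; rank++) {
            int[] range = rowRange(rank, nodes, w.length);
            double[] slice = partialStep(rows, cols, vals, dangling, w, d,
                                         range[0], range[1]);
            System.arraycopy(slice, 0, next, range[0], slice.length);
        }
        return next;
    }

    public static void main(String[] args) {
        // S for the four-page example (0-based); the page with index 2 is dangling.
        int[] rows = {1, 3, 0, 3, 0, 1, 2};
        int[] cols = {0, 0, 1, 1, 3, 3, 3};
        double[] vals = {0.5, 0.5, 0.5, 0.5, 1.0 / 3, 1.0 / 3, 1.0 / 3};
        boolean[] dangling = {false, false, true, false};
        double[] w = {0.25, 0.25, 0.25, 0.25};
        double[] next = simulatedIteration(rows, cols, vals, dangling, w, 0.85, 2);
        System.out.println(java.util.Arrays.toString(next));
    }
}
```

Each node's slice depends on the whole of the previous W, which is why the full vector has
to travel to every node once per iteration.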
    Third, there are several behaviours of the “Parallel Java” package that we do not fully
    understand. For example, when we set up buffers for sending a Boolean variable and
    receiving a Boolean array, we encountered a null pointer exception. The code was




    BooleanItemBuf bif = new BooleanItemBuf();
    // Allocates the array only; each of its elements is still null at this point.
    BooleanItemBuf [] bifa = new BooleanItemBuf[size];
    bif.set(true);
    world.gather(0, bif, bifa);


    This problem was resolved after we re-ordered the lines as follows:


    BooleanItemBuf bif = new BooleanItemBuf();
    bif.set(true);
    BooleanItemBuf [] bifa = new BooleanItemBuf[size];
    world.gather(0, bif, bifa);


We still cannot explain this, but in practice the order of the statements mattered.


We also struggled for a long time with the jobScheduler not running correctly. The reason
was that the jobScheduler takes some time to start up fully before it works correctly. We
therefore modified the shell script to wait for a while after it issues the java command for
the jobScheduler.


Moreover, there are serious environmental problems with the clustered machines in the
Cornell Theory Center. First, it is hard to get a quota on a cluster, especially a
development cluster. We are supposed to use the v2 linuxdev cluster for development, which
has only 4 nodes and is hardly ever idle. Second, and worst, this cluster is unstable and
the server goes down often. For example, exactly the same code works for one account but not
for another account on the same cluster, and many times clearly working code did not run
until Theory Center staff rebooted the system. We contacted the staff when we believed our
code was correct, and rebooting the system usually resolved the problem. The cluster uses
“mpiexec” to distribute the necessary files from the login machine to the nodes, and it
often failed for no evident reason. Executing the same program does not guarantee the same
behaviour on the v2 linux and v2 linuxdev clusters; the same code behaves differently
depending on the server. All of these were major obstacles that slowed development.


Because we spent most of our time verifying the PageRank iteration equation and setting up
the hardware environment, we had little time to work on the PageRank calculation program for
clustered machines. We have now reached the point where message passing works correctly for
all data types except the object data type, and most parts of the PageRank calculation
program work correctly.
One remaining task for PageRank in a clustered environment is to figure out how to exchange
the object data type. After each node calculates its partial wCurrent, which is the result
matrix of an iteration on that node, all partial matrices must be gathered and broadcast to
every node for the next iteration. wCurrent is a compressed matrix, and it should be
exchanged using an Object (Array, Item) Buffer, but we did not have a chance to work on this
because of time restrictions. Instead, we decomposed the compressed matrix into its three
component arrays and exchanged those. This is not the approach we ultimately want, and it
breaks the encapsulation of the compressed matrix, but it lets us calculate the PageRank
iteration. We still need to verify the program with more data. Because we could not get
enough data in time, we tested only with very small samples, some real and some randomly
generated, ranging from 5 by 5 to 1000 by 1000. Although the current code gives reasonable
results, it might not work for larger data sets or under specific conditions. As always,
several points should also be optimized. One good example is transpose() in
CCompressdMatrix, which converts between the row-based and column-based matrix
representations.
Another issue is to understand the problem with the buffer code shown earlier in this
section. Given the characteristics of an object-oriented language, such behaviour is hard to
expect, but it did occur in the Cornell Theory Center (CTC) environment in our experiments.
It is possible that the Parallel Java (PJ) package is not well tuned for the CTC
environment, because it is open to the public for academic purposes and not designed
specifically for CTC. We may try another package with the same code.
Another issue is that we could not implement the function that writes the result to our
server. To implement it, each cluster node can write a result file, and the login machine
can collect all of the result files and merge them. This function has not been developed yet
because we spent our time testing and experimenting with our algorithms.
Lastly, the most important task, and probably the first that should be done, is to establish
a stable clustered computing environment. For the reasons described above, it is really hard
to work on the program. Besides the purely programming difficulties of a clustered
environment, when there was an error or bug it was hard to tell whether the cause was a bad
implementation or the unstable cluster.




6. Experiments and Results
    A. Experiment method
    Our compressed matrix structure and parallel program were tested to measure their
    performance. Performance is measured by comparing the time needed to finish each run as
    the data size increases. To measure the performance of our parallel program, we used 4
    nodes and data sets of 50, 100, 150 and 200 web links. (A data set of 50 contains a
    sparse matrix structure of 50 web links.)



    B. Result and conclusion
    According to our experimental results, the performance of our parallel program decreases
    rapidly as the data size increases. As the data grows, the communication between the
    backend and the frontend increases rapidly, so the overall performance suffers because
    of excessive communication. As a result, the running time does not grow linearly with
    the data size but approximately exponentially. Fig-5 shows the result of the performance
    measurement.

    [Chart: Time (Sec), axis from 0 to 6000, against the number of Web Links (50, 100, 150,
    200)]

            (Fig -5) the result of performance measurement



7. Acknowledgement
    We would like to thank the following people:
            William Y. Arms, our Project Advisor
            Daniel Ira Sverdlik, Consultant in the Cornell Theory Center
            Alan Kaminsky, author of the Parallel Java package

    The Web Lab team wishes to thank the Internet Archive for their assistance and support.
    This work is funded in part by National Science Foundation grants CNS-0403340,
    SES-0537606, and IIS-0634677.


8. Reference
  [1] W. Arms, The Web Laboratory: A Joint Project of Cornell University and the Internet Archive,
  http://www.infosci.cornell.edu/SIN/WebLab/about.html
  [2] S. Brin & L. Page, The Anatomy of a Large-Scale Hypertextual Web Search Engine, Stanford, USA,
  1998
  [3] J. Kleinberg, Authoritative sources in a hyperlinked environment, Journal of the ACM 46, 1999;
  IBM Research Report, 1997
  [4] N. Goharian, T. El-Ghazawi & D. Grossman, Enterprise Text Processing: A Sparse Matrix Approach,
  IEEE, 2001
  [5] A. Kaminsky, Parallel Java Library, http://www.cs.rit.edu/~ark/pj.shtml, 2007
  [6] Cornell Theory Center, Computing Resources for CTC Users,
  http://www.tc.cornell.edu/Services/CTC+Resources.htm, 2007
  [7] J. Willcock & A. Lumsdaine, Accelerating Sparse Matrix Computations via Data Compression, ACM
  Press, 2006




9. Appendix
        A. Establishing SSH Connection and Running MPI program
        A.1 Creating ssh keys and a required MPI file

        Run the following script before submitting a batch job. From a linux login node, type:

        /ctc/tools/setup_ssh_mpd_linux.sh

        This script creates ssh keys and a required MPI file. MPICH2 is the supported MPI
        implementation on the Linux clusters.

         The script /ctc/tools/setup_ssh_mpd_linux.sh performs the steps detailed below.
         Note: you do not need to issue any of the commands illustrated here, they are done for
         you automatically when you run the script.

                  On linuxlogin1.tc.cornell.edu or linuxlogin2.tc.cornell.edu, creates the directory
                  ~<your_userid>/.ssh

                           cd ~<your_userid>

                           mkdir .ssh

                           cd .ssh

                  Creates an SSH keypair to automate logons .




                 ssh-keygen -b 1024 -t dsa -C <your_userid>

         Adds SSH public key to authorized_keys file (file is visible from all machines.)

                 cat id_dsa.pub >> authorized_keys

                 Note: Use append (>>) when adding keys to the authorized_keys file so
                 any existing keys are not overwritten.


         Adds a required public key to your authorized_keys file. This is required to allow
         the scheduler to launch jobs with your userid. In addition to adding the key, it is
         also necessary to set the proper permissions on both the ssh folder and
         authorized_keys file for ssh to function.

                 cat /ctc/tools/velocity.pub >> ~<your_userid>/.ssh/authorized_keys


         Changes permissions on authorized_keys to 600 and on the .ssh directory to
         700. Returns to your home directory:

                 chmod 600 authorized_keys

                 cd ..

                 chmod 700 .ssh


         Creates the file .mpd.conf in your home folder. It will contain the parameter
         MPD_SECRETWORD.


         Sets permissions so only you can read it.

                 chmod 600 .mpd.conf

Running the script creates the following files:


$HOME/.ssh/authorized_keys
$HOME/.ssh/id_dsa
$HOME/.ssh/id_dsa.pub
$HOME/.mpd.conf
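
Since ssh refuses to use keys whose permissions are too open, it can be worth checking the result once the script has run. The following is only a minimal sketch and is not part of the CTC tooling: the function name check_ssh_setup is made up for illustration, and it assumes GNU stat (the "stat -c %a" form), which is normally available on Linux login nodes.

```shell
#!/bin/sh
# Hypothetical helper (not part of /ctc/tools): verifies the permissions
# that section A.1 says the setup script applies. Assumes GNU coreutils stat.

# check_ssh_setup HOME_DIR
# Returns 0 when HOME_DIR/.ssh has mode 700 and authorized_keys has mode 600.
check_ssh_setup() {
    home_dir=$1
    ssh_dir="$home_dir/.ssh"
    keys="$ssh_dir/authorized_keys"

    [ -d "$ssh_dir" ] || { echo "missing directory: $ssh_dir"; return 1; }
    [ -f "$keys" ]    || { echo "missing file: $keys"; return 1; }

    # %a prints the permission bits in octal, e.g. 700
    dir_mode=$(stat -c %a "$ssh_dir")
    key_mode=$(stat -c %a "$keys")

    [ "$dir_mode" = "700" ] || { echo "$ssh_dir has mode $dir_mode, expected 700"; return 1; }
    [ "$key_mode" = "600" ] || { echo "$keys has mode $key_mode, expected 600"; return 1; }
    echo "ssh setup looks consistent"
}
```

A typical call would be: check_ssh_setup "$HOME"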



A.2 Testing MPI Interactively

Create a hosts file, mpd.hosts.

        On compute nodes that have been assigned by vsched, this is very easy to do:

                vsched -m




                mv machines mpd.hosts

        Alternatively, use a text editor like nano or vi or emacs to add any machine
        names you want to mpd.hosts, one name per line, and save it.

Start the mpd daemons.

        Preferred method using mpdboot:

                mpdboot -n <numberofhosts>

        Alternate method if mpdboot doesn't work:

                At the command prompt, enter: mpd &

                To find the port, run
                mpdtrace -l
                It will return with the port number it's running on

                To start mpd's on the other machines, run
                ssh <nextmachinename> mpd -h <firstmachinename> -p <port> -d

        Verify all the mpd daemons are running correctly.

        Run mpdtrace to get a quick trace and see all the machines

        Run mpdringtest 3000 to run a ring around the mpd daemons

        Verify you've got the right hosts with mpiexec -n <numberofhosts> hostname



A.3 Running an MPI Program Interactively

Make sure mpd.hosts is in the same directory as your executable. Then in that directory,
issue:

        mpiexec -n <numberofhosts> <mycodename>

When you are done, close all the daemons by running:

        mpdallexit
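
The interactive steps in A.2 and A.3 can be collected into one small script. This is a sketch under stated assumptions rather than a tested CTC procedure: the function names count_hosts and run_mpi_session are illustrative, the host count is taken from the non-empty lines of mpd.hosts, and the "-n" and "-f" options are used the same way as in parallel.sh (section B.2).

```shell
#!/bin/sh
# Sketch of the A.2/A.3 session. count_hosts and run_mpi_session are
# illustrative names; the MPICH2 commands are the ones listed above.

# count_hosts FILE: prints the number of non-empty lines in the hosts file.
count_hosts() {
    grep -c '[^[:space:]]' "$1"
}

# run_mpi_session HOSTS_FILE PROGRAM
# Boots the mpd ring, runs PROGRAM on every host, then stops the daemons.
run_mpi_session() {
    hosts_file=$1
    program=$2
    n=$(count_hosts "$hosts_file")

    mpdboot -n "$n" -f "$hosts_file" || return 1
    mpdtrace                      # list the machines that joined the ring
    mpiexec -n "$n" "$program"
    status=$?
    mpdallexit                    # stop the daemons even if the run failed
    return $status
}
```

For example: run_mpi_session mpd.hosts ./<mycodename>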



B. Shell scripts
        B.1 PageRank.xml

                <?xml version="1.0" ?>

                <!-- Sample XML Job File -->

                <job>

                <nodes>4</nodes>

                <minutes>20</minutes>

                <type>interactive</type>



       <affiliation>v2linuxdev</affiliation>

       <run>/bin/sh $HOME/Lab/parallel.sh</run>

       </job>

B.2 parallel.sh

      #!/bin/sh

      # parallel.sh

      # @author Kwan Dong Kim



      #Description of the script

      # This script sets up the environment for Parallel Java on the V2 Linux cluster



      # Set the number of machines

      NMACHINES=4

      # Set up the number of processes

      NPROCS=4

      # Set ROOTDIR

      ROOTDIR=$HOME/Lab

      export ROOTDIR

      # Set and create an output directory

      tmphost=`hostname | cut -f1 -d"."`

      OUTDIR=$ROOTDIR/output

      mkdir -v $OUTDIR

      export OUTDIR

      #Change directory from "linuxlogin" to "vii000XX" node

      cd /tmp

      # Get the list of assigned machines from the scheduler and start the mpd daemons

      vsched -m

      mpdboot -n $NMACHINES -f /tmp/machines

      # Create a local directory on /tmp

      # Copy files to local disk(vii000XX)

      mpiexec -n $NPROCS $ROOTDIR/setup.sh



        TMPDIR=/tmp/$USER

        export TMPDIR

        mv /tmp/machines $TMPDIR/machines

        # Run the executable from local disk(vii000XX)

        cd $TMPDIR

        # Copy the "config_generator.sh" file

        echo "copying $ROOTDIR/config_generator.sh $TMPDIR/config_generator.sh"

        cp $ROOTDIR/config_generator.sh $TMPDIR/config_generator.sh

        cp $ROOTDIR/colex $TMPDIR/colex

        cp $ROOTDIR/linex $TMPDIR/linex

        #Change directory from "linuxlogin" to "vii000XX" node

        cd /tmp/$USER

        echo "Host name"

        echo `hostname`

        # Generate config file

        ./config_generator.sh $NMACHINES

        # Export PJ package class path

        export CLASSPATH=.:/tmp/kk386/pj.jar   #:/home/nfs/ctcfsrv11/m/$USER/Lab/pj.jar

        #Run the JobScheduler

        echo "/usr/java/jre1.5.0_11/bin/java edu/rit/pj/cluster/JobScheduler scheduler.conf"

        /usr/java/jre1.5.0_11/bin/java edu/rit/pj/cluster/JobScheduler scheduler.conf &

        # Wait 10 seconds so that edu/rit/pj/cluster/JobScheduler has time to start

        echo "Start 10-second sleep"

        sleep 10

        # Export PJ package class path

        export CLASSPATH=.:/tmp/kk386/pj.jar

        #Run the CMainClusterProcess

        echo "/usr/java/jre1.5.0_11/bin/java -Dpj.np=3 CMainClusterProcess pidList >& result.$tmphost.out"




        /usr/java/jre1.5.0_11/bin/java -Dpj.np=3 CMainClusterProcess pidList >& result.$tmphost.out

        # Copy output files to your output directory on the fileserver

        # Delete all remaining files on /tmp/$USER

        mpiexec -n $NPROCS $ROOTDIR/cleanup.sh

        # Release all processes and nodes once every process has finished.

        vsched -c

 B.3 setup.sh

        #!/bin/sh

        # setup.sh

        # @author Kwan Dong Kim



        #Description of the script

        # This script copies the required files to each node (both frontend and backend)



        # Remove any data created by a previous run

        rm -f -r /tmp/$USER

        #Set up TMP directory

        TMPDIR=/tmp/$USER

        #Make TMP directory

        mkdir $TMPDIR

        echo $TMPDIR

        export TMPDIR

        #Make Root Directory

        # -p avoids an error when a directory already exists ($TMPDIR was created above)
        mkdir -p $ROOTDIR $TMPDIR

        echo "cp -r $ROOTDIR/ $TMPDIR/"

        # Copy all needed data and programs to each cluster node (vii000XX)

        cp $ROOTDIR/* $TMPDIR/

        cp -r $ROOTDIR/edu $TMPDIR/

        cp -r $ROOTDIR/pr $TMPDIR/




B.4 config_generator.sh
    #!/bin/sh

      # config_generator.sh

      # @author Kwan Dong Kim



      #Description of the script

      # This script generates the configuration file for the job scheduler


    # save variable java_path
    JAVA_PATH="/usr/java/jre1.5.0_11/bin/java"
    # save variable PJ_package_path
    PJ_PATH="/tmp/$USER/pj.jar"
    # save variable log file
    LOG_PATH="/tmp/$USER/scheduler.log"
    # save variable web host
    WEB_HOST_PATH=".tc.cornell.edu"
    # save number of cluster to use
    Num_Cluster=$1
    #Set up the variable index i
    i=1
    #Get the Frontend node by using vsched, colex and linex command
    Frontend=`vsched -u v2Linuxdev | grep $USER | ./colex 1 | ./linex 1`
    #Remove empty space in the variable Frontend
    parsedFrontend=`echo "$Frontend" | tr -c '\012[a-zA-Z][0-9].\-_' '\n' | uniq`
    echo -e "$parsedFrontend"
    #Increase index i
    i=`expr $i + 1`
    # Save variable web host
    WEB_HOST_PATH="$parsedFrontend.tc.cornell.edu"
    #start write the config file
    echo "#Parallel Java Job Scheduler Configuration file" > scheduler.conf
    echo "#Frontend processor : $Frontend" >>scheduler.conf
    echo "cluster v2linuxdev" >>scheduler.conf
    echo "logfile $LOG_PATH" >>scheduler.conf
    echo "webhost $WEB_HOST_PATH" >>scheduler.conf
    echo "webport 8080" >>scheduler.conf
    echo "schedulerhost localhost" >>scheduler.conf




          echo "schedulerport 20617" >>scheduler.conf
          echo "frontendhost $WEB_HOST_PATH">>scheduler.conf
          #Start process Backend node
          while [ $i -le $Num_Cluster ]
          do
              #Get the Backend node by using vsched, colex and linex command
              Backend=`vsched -u v2Linuxdev | grep $USER | ./colex 1 | ./linex $i`
              #Remove empty space in the variable Backend
              parsedBackend=`echo "$Backend" | tr -c '\012[a-zA-Z][0-9].\-_' '\n' | uniq`
              #Write the Backend node in the config file
              echo "backend $parsedBackend $parsedBackend $JAVA_PATH $PJ_PATH">>scheduler.conf


          #Increase index i
              i=`expr $i + 1`
          done
          # Copy scheduler.conf file to Linux login machine
          cp $TMPDIR/scheduler.conf $ROOTDIR/output/scheduler.conf
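
To make the layout of the generated file easier to see, the writing step can be separated from the host discovery. The sketch below is not the authors' script: the function name write_scheduler_conf is hypothetical, the host names are passed as arguments instead of being parsed from vsched output, and the example host names in the usage line are placeholders following the vii000XX pattern. The keys and fixed values are the ones config_generator.sh writes.

```shell
#!/bin/sh
# Sketch only: same file layout as config_generator.sh, but the host names
# are supplied as arguments. write_scheduler_conf is a hypothetical name.

user=${USER:-$(id -un)}            # fall back to id when USER is not set
JAVA_PATH="/usr/java/jre1.5.0_11/bin/java"
PJ_PATH="/tmp/$user/pj.jar"
LOG_PATH="/tmp/$user/scheduler.log"

# write_scheduler_conf FRONTEND BACKEND...
# Prints the job scheduler configuration to standard output.
write_scheduler_conf() {
    frontend=$1
    shift
    echo "#Parallel Java Job Scheduler Configuration file"
    echo "#Frontend processor : $frontend"
    echo "cluster v2linuxdev"
    echo "logfile $LOG_PATH"
    echo "webhost $frontend.tc.cornell.edu"
    echo "webport 8080"
    echo "schedulerhost localhost"
    echo "schedulerport 20617"
    echo "frontendhost $frontend.tc.cornell.edu"
    # one "backend" line per remaining argument
    for backend in "$@"; do
        echo "backend $backend $backend $JAVA_PATH $PJ_PATH"
    done
}
```

For example, with placeholder host names: write_scheduler_conf vii00011 vii00012 vii00013 > scheduler.conf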

     B.5 cleanup.sh
          #!/bin/sh

              # cleanup.sh

              # @author Kwan Dong Kim



              #Description of the script

              # This script copies the final results, cleans up the nodes and finishes all processes


          # Copy all log and result files to the Linux login machine
          cd $TMPDIR/
          cp $TMPDIR/result.* $OUTDIR
          cp $TMPDIR/*.log $OUTDIR


          #Remove all data and program used in each node(Vii000XX)
          rm -f -r /tmp/$USER
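
One caveat with the last command: if $USER were ever empty, "rm -f -r /tmp/$USER" would expand to "/tmp/" itself. A defensive variant is sketched below; the function names scratch_dir and cleanup_scratch are illustrative and are not part of the original scripts.

```shell
#!/bin/sh
# Sketch: refuse to build the scratch path when USER is empty, so that
# rm -rf can never be pointed at /tmp itself. scratch_dir and
# cleanup_scratch are illustrative names.

# scratch_dir [BASE]: prints BASE/$USER (BASE defaults to /tmp),
# or fails when USER is unset or empty.
scratch_dir() {
    base=${1:-/tmp}
    [ -n "${USER:-}" ] || { echo "USER is not set" >&2; return 1; }
    echo "$base/$USER"
}

# cleanup_scratch [BASE]: removes only the per-user scratch directory.
cleanup_scratch() {
    dir=$(scratch_dir "$@") || return 1
    rm -rf -- "$dir"
}
```

In cleanup.sh the final line could then read: cleanup_scratch /tmp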

C. Java Codes



C.1 CCompressedMatrix.java
/**
* CCompressedMatrix.java
* Created in 2007. 02. 04
* @author Chang Min Kim
* netID : ck273
*/


/**
* Description of the Class
*         This class represents a compressed sparse matrix (CSM).
* A CSM consists of three linked lists.
* The first holds the row representation.
* The second holds the column representation.
* The last holds the values of the matrix elements.
* For more information, refer to the documentation.
*/


package pr;


import java.util.*;
import Jama.Matrix;


public class CCompressedMatrix {
          // LinkedList containing beginning cell number of colList for the row
          private LinkedList<Integer> rowList;
          // LinkedList containing column number of the corresponding row
          private LinkedList<Integer> colList;
          // LinkedList containing value of corresponding cell
          private LinkedList<Double> valueList;
          // These four variables only keep information about the original matrix,
          // even where it has no non-zero values
          // last number of column that actually contains value other than 0
          private int lastCol;
          // last number of row that actually contains value other than 0
          private int lastRow;
          // actual row dimension of this matrix including the rows with only 0's
          private int rowDim;
          // actual column dimension of this matrix including the columns with only 0's
          private int colDim;


          public CCompressedMatrix() {
                      // initialize object
                      rowList = new LinkedList<Integer>();
                      colList = new LinkedList<Integer>();
                      valueList = new LinkedList<Double>();
                      lastCol = -1;




                   lastRow = -1;
                   rowDim = 0;
                   colDim = 0;
        }


        public boolean compareWithMatrix(Matrix mt) {
                   // compare this object with matrix object
                   // If dimensions don't match, they are different
                   if ( this.rowDim != mt.getRowDimension() || this.colDim != mt.getColumnDimension() ) {
                              return false;
                   }
                   // cell by cell comparison
                   for ( int i=0; i< mt.getRowDimension() ; i++ ) {
                              for ( int j=0; j< mt.getColumnDimension() ; j++ ) {
                                          if ( mt.get(i, j)!=getValueAt(i,j) ) {
                                                      System.out.println(i + " " + j);
                                                      return false;
                                          }
                              }
                   }
                   return true;
        }


        public int getRowDimension() {
                   return rowDim;
        }


        public void setRowDimension(int i) {
                   rowDim = i;
        }


        public int getColDimension() {
                   return colDim;
        }


        public void setColDimension(int i) {
                   colDim = i;
        }


        public int getLastCol() {
                   return lastCol;
        }


        public void setLastCol(int i) {
                   lastCol = i;
        }




        public int getLastRow() {
                   return lastRow;
        }


        public void setLastRow(int i) {
                   lastRow = i;
        }


        public void setDims(int i, int j) {
                   this.rowDim = i;
                   this.colDim = j;
        }


        public void clear() {
                   // initialize this object
                   rowDim = 0;
                   colDim = 0;
                   lastRow = -1;
                   lastCol = -1;
                   rowList.clear();
                   colList.clear();
                   valueList.clear();
        }


        public boolean empty() {
                   // tells if compressedMatrix object has any element
                   if ( rowDim == 0 || lastRow == -1 || colDim == 0 || lastCol == -1 )
                                return true;
                   else
                                return false;
        }


        public boolean isEqualTo(CCompressedMatrix cm) {
                   // compares two compressedMatrix objects
                   // If dimensions don't match, they are different
                   if ( this.rowDim != cm.getRowDimension() || this.colDim != cm.getColDimension() ) {
                                return false;
                   }
                   // cell by cell comparison
                   for ( int i=0 ; i<rowDim ; i++ ) {
                                for ( int j=0; j<colDim; j++ ) {
                                              if ( getValueAt(i,j) != cm.getValueAt(i, j) ) {
                                                         return false;
                                              }
                                }




                     }
                     return true;
          }


          public LinkedList<Integer> getRowList() {
                     // returns rowList
                     return rowList;
          }


          public LinkedList<Integer> getColList() {
                     // returns colList
                     return colList;
          }


          public LinkedList<Double> getValueList() {
                     // returns valueList
                     return valueList;
          }


          // Adds a cell element to the compressed matrix.
          // Because CCompressedMatrix is a row-based compression,
          // every column of a row must be processed before the row index increases.
          // For data ordered by column first, simply add the elements first and then transpose.
          // For more information about the representation, refer to the documentation.
          public void addElement(int i, int j, double k) {
                     // checks if there is already existing value in that cell
                     if ( getValueAt(i,j) == 0 ) {
                                // increase dimension
                                // This is necessary to keep as much information as possible about the
                                // original matrix (or the number of pages)
                                // in case the last rows don't have any link to them.
                                if ( rowDim <= i ) {
                                            rowDim = i+1;
                                }
                                if ( colDim <= j ) {
                                            colDim = j+1;
                                }
                                // if the value is not 0, the compressed matrix stores the value for the cell
                                if ( k != 0.0 ) {
                                            // beginning of new row
                                            if ( lastRow < i ) {
                                                        // If there were rows with no values before this row i,
                                                        // the rows from lastRow+1 to i-1 are filled with -1
                                                        // to indicate that they are empty




                                                        for ( int rowLoc = lastRow+1; rowLoc<i; rowLoc++ ) {
                                                                    rowList.add(new Integer(-1));
                                                        }
                                                        // now lastRow with any values is i
                                                        lastRow = i;
                                                        // the current size of the column list is the index at which
                                                        // this row's non-zero values begin
                                                        rowList.add(new Integer(colList.size()));
                                                        // column list contains the column index of the input cell
                                                        colList.add(new Integer(j));
                                                        // value list contains the value for the given cell
                                                        valueList.add(new Double(k));
                                                        // lastCol indicates the last column index whose
                                                        // corresponding cell has a non-zero value
                                                        if ( lastCol < j ) {
                                                                    lastCol = j;
                                                        }
                                             }
                                             // a cell with the same row number as the given cell was
                                             // already inserted into the matrix
                                             else if ( lastRow == i ) {
                                                        // We don't need to modify rowList; simply add the
                                                        // column and value of the cell to the lists.
                                                        colList.add(new Integer(j));
                                                        valueList.add(new Double(k));
                                                        if ( lastCol < j ) {
                                                                    lastCol = j;
                                                        }
                                             } else {
                                                        System.out.println("You are trying to add an element which should be inserted earlier");
                                                        System.out.println("Check if you are trying to insert elements in sorted order as specified");
                                             }
                                 }
                      } else {
                                 System.out.println("There is already an existing value in that cell, try 'setValueAt(i,j,k)'");
                      }
          }


          public CCompressedMatrix multiply(CCompressedMatrix cm) {
                      // inner dimension should match to perform matrix multiplication
                      if ( this.colDim != cm.getRowDimension() ) {
                                 System.out.println("Dimension mis-match for multiplication");




                                  return null;
                      } else {
                                  CCompressedMatrix resCM = new CCompressedMatrix();
                                  for ( int i = 0; i <= lastRow; i++ ) {
                                              // for efficiency, proceed only up to the last non-zero column
                                              // index of this matrix and of cm
                                              for ( int j = 0; j <= cm.getLastCol(); j++ ) {
                                                             double accumulator = 0.0;
                                                             for ( int k = 0; k <= lastCol; k++ ) {
                                                                        accumulator += getValueAt(i,k) * cm.getValueAt(k,j);
                                                             }
                                                             resCM.addElement(i, j, accumulator);
                                              }
                                  }
                                  resCM.setColDimension(cm.getColDimension());
                                  resCM.setRowDimension(this.rowDim);
                                  return resCM;
                      }
         }


         // transpose this matrix
         // after(i,j) = before(j,i)
         public CCompressedMatrix transpose() {
                      CCompressedMatrix resCM = new CCompressedMatrix();
                      for ( int i=0; i <= lastCol; i++ ) {
                                  for ( int j=0; j<= lastRow; j++ ) {
                                              resCM.addElement(i, j, getValueAt(j,i));
                                  }
                      }
                      resCM.setLastCol(this.lastRow);
                      resCM.setLastRow(this.lastCol);
                      resCM.setColDimension(this.rowDim);
                      resCM.setRowDimension(this.colDim);
                      return resCM;
         }


         // multiplies the values by scalar
         public CCompressedMatrix ScalarMultiply(double x) {
                      CCompressedMatrix resCM = new CCompressedMatrix();
                      // multiply values by the scalar
                      for ( int i = 0 ; i < rowList.size(); i++ ) {
                                  resCM.getRowList().add(new Integer(rowList.get(i).intValue()));
                      }


                      for ( int i = 0 ; i < valueList.size(); i++ ) {
                                  resCM.getColList().add(new Integer(colList.get(i).intValue()));




                                resCM.getValueList().add(new Double(valueList.get(i).doubleValue() * x));
                    }
                    resCM.setLastCol(this.lastCol);
                    resCM.setLastRow(this.lastRow);
                    resCM.setColDimension(this.colDim);
                    resCM.setRowDimension(this.rowDim);
                    return resCM;
         }


         public CCompressedMatrix plus(CCompressedMatrix cm) {
                    // dimension check
                    if ( this.colDim == cm.getColDimension() && this.rowDim == cm.getRowDimension() ) {
                                // for efficiency, proceed only up to the larger of the two matrices' last indices
                                int lc, lr;
                                if ( lastCol < cm.getLastCol() ) {
                                              lc = cm.getLastCol();
                                } else {
                                              lc = lastCol;
                                }
                                if ( lastRow < cm.getLastRow() ) {
                                              lr = cm.getLastRow();
                                } else {
                                              lr = lastRow;
                                }
                                CCompressedMatrix resCM = new CCompressedMatrix();
                                for ( int i =0; i <= lr; i++ ) {
                                              for ( int j=0; j<= lc; j++ ) {
                                                          resCM.addElement(i, j, getValueAt(i,j) + cm.getValueAt(i, j));
                                              }
                                }
                                resCM.setLastCol(lc);
                                resCM.setLastRow(lr);
                                resCM.setColDimension(this.colDim);
                                resCM.setRowDimension(this.rowDim);
                                return resCM;
                    } else {
                                System.out.println("Dimension mis-match for addition");
                                return null;
                    }
         }


         public CCompressedMatrix minus(CCompressedMatrix cm) {
                    // minus is the same as plus with the matrix multiplied by -1




                     return plus(cm.ScalarMultiply(-1));
          }


          // converts matrix to compressed matrix data structure
          // This is not necessary for this project.
          // Written only for convenience for experiments
          public void compressFrom(Matrix mt) {
                     // reset every field, including lastRow and lastCol, so that stale
                     // indices from earlier contents cannot make addElement reject cells
                     clear();
                     for ( int i=0; i<mt.getRowDimension(); i++ ) {
                                  for ( int j=0; j<mt.getColumnDimension(); j++ ) {
                                             addElement(i,j,mt.get(i,j));
                                  }
                     }
                     this.rowDim = mt.getRowDimension();
                     this.colDim = mt.getColumnDimension();
          }


          // Some vectors, such as the page vector, are represented as compressed matrices for
          // convenient calculation.
          // Sometimes it is necessary to set the row dimension of the matrix explicitly so that
          // dimension information is not lost when the last rows have values of 0.
          // This is due to the characteristics of the compressed representation of a sparse matrix.
          // This is not necessary for our given data set.
          public void toVerticalVectorInCompressedMatrixFormat(int i) {
                     rowDim = i;
                     colDim = 1;
                     lastCol = 0;
          }


          // sets value of the cell indexed by (i,j) in regular matrix with value k
          public void setValueAt(int i, int j, double k) {
                     int begin;
                     int end;
                     boolean valFound = false;
                     // first, check if cell (i,j) has a value which is not 0; if so modify it,
                     // otherwise do nothing
                     // This can be done more simply as follows:
                     // if (getValueAt(i,j) != 0)
                     //                      modify corresponding valueList
                     // It was written this way for more clarity
                     if ( i < rowList.size() - 1 ) {
                                  begin = rowList.get(i).intValue();
                                  end = rowList.get(i+1).intValue();
                                  if ( begin != -1 && end != -1 ) {




                                            for ( int l = begin; l < end && !valFound; l++ ) {
                                                        if ( colList.get(l).intValue() == j ) {
                                                                    valueList.set(l, new Double(k));
                                                                    valFound = true;
                                                        }
                                            }
                                            if ( !valFound ) {
                                                        System.out.println("There is no value for the cell, try addElement");
                                            }
                                } else if ( begin == -1 ) {
                                            System.out.println("There is no value for the cell, try addElement");
                                } else if ( end == -1 ) {
                                            int idx = i+2;
                                            while ( end == -1 && idx < rowList.size() ) {
                                                        end = rowList.get(idx).intValue();
                                                        idx++;
                                            }
                                            if ( end == -1 ) {
                                                        for ( int l = begin; l < colList.size() && !valFound; l++ ) {
                                                                    if ( colList.get(l).intValue() == j ) {
                                                                                valueList.set(l, new Double(k));
                                                                                valFound = true;
                                                                    }
                                                        }
                                                        if ( !valFound ) {
                                                                    System.out.println("There is no value for the cell, try addElement");
                                                        }
                                            } else {
                                                        for ( int l = begin; l < end && !valFound; l++ ) {
                                                                    if ( colList.get(l).intValue() == j ) {
                                                                                valueList.set(l, new Double(k));
                                                                                valFound = true;
                                                                    }
                                                        }
                                                        if ( !valFound ) {
                                                                    System.out.println("There is no value for the cell, try addElement");
                                                        }
                                            }
                                }
                    } else if ( i == rowList.size() - 1 ) {




                                begin = rowList.get(i).intValue();
                                end = colList.size();
                                if ( begin == -1 ) {
                                           System.out.println("There is no value for the cell, try addElement");
                                } else {
                                           for ( int l = begin; l < end && !valFound; l++ ) {
                                                        if ( colList.get(l).intValue() == j ) {
                                                                    valueList.set(l, new Double(k));
                                                                    valFound = true;
                                                        }
                                           }
                                           if ( !valFound ) {
                                                        System.out.println("There is no value for the cell, try addElementAt");
                                           }
                                }
                   } else {
                                System.out.println("There is no value for the cell, try addElementAt");
                   }
        }


        // get value of the cell indexed by (i,j) in regular matrix
        public double getValueAt(int i, int j) {
                   int begin;
                   int end;
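                   // Find the colList range for the given row i first, then look for j in colList within that range.
                   // A rowList entry of -1 marks a row with no stored element, so the end of the range is
                   // taken from the next row whose entry is not -1 (or from the end of colList).
                   // If i and j can be located in the data structure, the cell (i,j) holds a nonzero value;
                   // otherwise return 0.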
                   if ( i < rowList.size() - 1 ) {
                                begin = rowList.get(i).intValue();
                                end = rowList.get(i+1).intValue();
                                if ( begin != -1 && end != -1 ) {
                                           for ( int k = begin; k < end; k++ ) {
                                                        if ( colList.get(k).intValue() == j ) {
                                                                    return valueList.get(k).doubleValue();
                                                        }
                                           }
                                           return 0;
                                } else if ( begin == -1 ) {
                                           return 0;
                                } else if ( end == -1 ) {
                                           int idx = i+2;
                                           while ( end == -1 && idx < rowList.size() ) {
                                                        end = rowList.get(idx).intValue();
                                                        idx++;
                                            }
                                            if ( end == -1 ) {
                                                        for ( int k = begin; k < colList.size(); k++ ) {
                                                                    if ( colList.get(k).intValue() == j ) {
                                                                               return valueList.get(k).doubleValue();
                                                                    }
                                                        }
                                                        return 0;
                                            } else {
                                                        for ( int k = begin; k < end; k++ ) {
                                                                    if ( colList.get(k).intValue() == j ) {
                                                                               return valueList.get(k).doubleValue();
                                                                    }
                                                        }
                                                        return 0;
                                            }
                                }
                   } else if ( i == rowList.size() - 1 ) {
                                begin = rowList.get(i).intValue();
                                end = colList.size();
                                if ( begin == -1 ) {
                                            return 0;
                                } else {
                                            for ( int k = begin; k < end; k++ ) {
                                                        if ( colList.get(k).intValue() == j ) {
                                                                    return valueList.get(k).doubleValue();
                                                        }
                                            }
                                            return 0;
                                }
                   } else {
                                return 0;
                   }
                   return 0;
         }


         // for convenience, get value from given compressed matrix other than 'this' matrix
         public double getValueAt(int i, int j, CCompressedMatrix cm) {
                   int begin;
                   int end;
                   if ( i < cm.getRowList().size() - 1 ) {
                                begin = cm.getRowList().get(i).intValue();
                                end = cm.getRowList().get(i+1).intValue();
                                if ( begin != -1 && end != -1 ) {
                                          for ( int k = begin; k < end; k++ ) {
                                                      if ( cm.getColList().get(k).intValue() == j ) {
                                                                  return cm.getValueList().get(k).doubleValue();
                                                      }
                                          }
                                          return 0;
                              } else if ( begin == -1 ) {
                                          return 0;
                              } else if ( end == -1 ) {
                                          int idx = i+2;
                                          while ( end == -1 && idx < cm.getRowList().size() ) {
                                                      end = cm.getRowList().get(idx).intValue();
                                                      idx++;
                                          }
                                          if ( end == -1 ) {
                                                      for ( int k = begin; k < cm.getColList().size(); k++ ) {
                                                                  if ( cm.getColList().get(k).intValue() == j ) {
                                                                            return cm.getValueList().get(k).doubleValue();
                                                                  }
                                                      }
                                                      return 0;
                                          } else {
                                                      for ( int k = begin; k < end; k++ ) {
                                                                  if ( cm.getColList().get(k).intValue() == j ) {
                                                                            return cm.getValueList().get(k).doubleValue();
                                                                  }
                                                      }
                                                      return 0;
                                          }
                              }
                   } else if ( i == cm.getRowList().size() - 1 ) {
                              begin = cm.getRowList().get(i).intValue();
                              end = cm.getColList().size();
                              if ( begin == -1 ) {
                                          return 0;
                              } else {
                                          for ( int k = begin; k < end; k++ ) {
                                                      if ( cm.getColList().get(k).intValue() == j ) {
                                                                  return cm.getValueList().get(k).doubleValue();
                                                      }
                                          }
                                            return 0;
                               }
                   } else {
                               return 0;
                   }
                   return 0;
         }


         public CCompressedMatrix cutRow(int i, int j){
                   CCompressedMatrix resCM = new CCompressedMatrix();
                   for ( int k = i ; k < j ; k++ ){
                               for ( int l = 0 ; l < colDim; l++){
                                            resCM.addElement(k-i, l, getValueAt(k, l));
                               }
                   }
                   resCM.setRowDimension(j-i);
                   resCM.setColDimension(colDim);
                   return resCM;
         }
}
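
The lookups in getValueAt above mark an empty row with -1 in rowList and therefore have to scan forward to find where a row ends. For comparison, the following is a minimal sketch of the conventional compressed-sparse-row layout, which stores one extra offset so that row i always spans rowStart[i] to rowStart[i+1] and an empty row simply has equal start and end. It is not the authors' CCompressedMatrix; the class name CsrLookupSketch and its fields are hypothetical names chosen for illustration.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration class; not part of the pr package listed in this appendix.
public class CsrLookupSketch {
    private final int[] rowStart; // row i occupies positions rowStart[i] .. rowStart[i+1]-1
    private final int[] cols;     // column index of each stored value
    private final double[] vals;  // stored nonzero values, row by row

    // Compress a dense matrix, keeping only its nonzero cells.
    public CsrLookupSketch(double[][] dense) {
        int rows = dense.length;
        rowStart = new int[rows + 1];
        List<Integer> c = new ArrayList<Integer>();
        List<Double> v = new ArrayList<Double>();
        for (int i = 0; i < rows; i++) {
            rowStart[i] = c.size();
            for (int j = 0; j < dense[i].length; j++) {
                if (dense[i][j] != 0.0) {
                    c.add(j);
                    v.add(dense[i][j]);
                }
            }
        }
        rowStart[rows] = c.size(); // sentinel: end of the last row
        cols = new int[c.size()];
        vals = new double[v.size()];
        for (int k = 0; k < cols.length; k++) {
            cols[k] = c.get(k);
            vals[k] = v.get(k);
        }
    }

    // Return the value of cell (i,j), or 0 if nothing is stored there.
    public double getValueAt(int i, int j) {
        if (i < 0 || i >= rowStart.length - 1) {
            return 0.0;
        }
        for (int k = rowStart[i]; k < rowStart[i + 1]; k++) {
            if (cols[k] == j) {
                return vals[k];
            }
        }
        return 0.0;
    }

    public static void main(String[] args) {
        // Small link matrix with an empty middle row.
        double[][] links = { {0, 1, 0}, {0, 0, 0}, {1, 0, 1} };
        CsrLookupSketch m = new CsrLookupSketch(links);
        System.out.println(m.getValueAt(0, 1));
        System.out.println(m.getValueAt(1, 2));
        System.out.println(m.getValueAt(2, 2));
    }
}
```

With the extra offset, the three cases that getValueAt distinguishes (ordinary row, empty row, last row) collapse into a single loop. The trade-off is that appending an element to an earlier row means shifting every later offset.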

B. CMainProcess.java


/**
* CMainProcess.java
* Created in 2007. 01. 30
* @author Chang Min Kim
* netID : ck273
* email : ck273@cs.cornell.edu
* Master of Engineering
* Computer Science at Cornell University
*/


/**
* Description of the Class
*
*/


/**
* Description of the Variables
*
*/


package pr;


import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.StringReader;
import java.net.MalformedURLException;
import java.net.URL;
import java.util.LinkedList;
import java.util.StringTokenizer;
import java.io.BufferedWriter;
import java.io.FileWriter;
import Jama.Matrix;


public class CMainProcess implements HTMLHandler {
          final double DAMP = 0.80;
          LinkedList<String> urlList;
          Matrix linkMatrix;
          Matrix rank;
          String perDoc;
          boolean title = false;
          boolean link = false;
          boolean mail = false;
          boolean img = false;
          String linkRec = "";
          String fileName;
          CCompressedMatrix compMatrix;
          int siteSize;


          public CMainProcess(String fn) {
                     siteSize = 130;
                     compMatrix = new CCompressedMatrix();
                     //fileName = fn;
                     fileName = "test4.txt";
                     urlList = new LinkedList<String>();
                     perDoc = "";
                     linkRec = "";
                     importURL();
                     System.out.println("Canonicalizing Sites");
                     System.out.println("This may take a while depending on network connection\n");
                     canonicalize();


                     //load test data
                     //importTest();


                     CCompressedMatrix cmTest = new CCompressedMatrix();
                     cmTest.compressFrom(linkMatrix);


                     if ( compMatrix.isEqualTo(cmTest)){
                                   System.out.println("test matrix and compressed matrix are identical\n");
                    }
                    else{
                               System.out.println("test matrix and compressed matrix are *** not *** identical\n");
                    }


                    if ( compMatrix.compareWithMatrix(linkMatrix)){
                               System.out.println("Link matrix and compressed matrix are identical\n");
                    }
                    else{
                               System.out.println("Link matrix and compressed matrix are *** not *** identical\n");
                    }


                    //outputLinkMatrix();
                    //outputPIDList();
                    calc_pageRank();
                    calc_pageRank_compMatrix();
          }


          public void importTest() {
                    linkMatrix = new Matrix(5,5);
                    rank = new Matrix(5,1);
                    try {
                               BufferedReader in = new BufferedReader(new FileReader("list.txt"));
                               String inStr;
                               while( (inStr = in.readLine()) != null) {
                                            StringTokenizer st = new StringTokenizer(inStr);
                                            int from = Integer.parseInt(st.nextToken());
                                            int to = Integer.parseInt(st.nextToken());
                                            linkMatrix.set(from, to, 1.0d);
                                            compMatrix.addElement(from, to, 1.0d);
                               }
                               compMatrix.setDims(5, 5);
                    }
                    catch(IOException e){
                               // report the failure instead of silently ignoring it
                               e.printStackTrace();
                    }
          }




          public void importURL() {
                    // import test4.txt
                    try {
                      BufferedReader in = new BufferedReader(new FileReader(fileName));
                      String inStr;
                      while( (inStr = in.readLine()) != null) {
                                  // trim first, so that the "/" test and the duplicate check
                                  // see the same string that is stored in urlList
                                  inStr = inStr.trim();
                                  if(inStr.endsWith("/")) {
                                              inStr += "index.html";
                                  }
                                  if(!urlList.contains(inStr)) {
                                              urlList.add(inStr);
                                  }
                      }
          }
          catch(IOException e){
                      // report the failure instead of silently ignoring it
                      e.printStackTrace();
          }
          linkMatrix = new Matrix(urlList.size(),urlList.size());
          rank = new Matrix(urlList.size(),1);
}




public void canonicalize() {
          for(int i =0; i<urlList.size(); i++) {
                      perDoc = "";
                      title = false;
                      link = false;
                      img = false;
                      String content = "";
                      URL url = null;
                      String tmp = urlList.get(i);
                      tmp = tmp.trim();
                      if (tmp.endsWith("/")) {
                                  System.out.println("bad form of url");
                      }
                      try {
                                  url = new URL(tmp);
                      }
                      catch ( MalformedURLException m) {
                                  System.out.println("Illegal Format of URL");
                                  // url is still null here; skip this entry to avoid a NullPointerException below
                                  continue;
                      }


                      perDoc += url.toString() + "\n";


                      try {
                      // Read all the text returned by the server
                      InputStreamReader sr = new InputStreamReader(url.openStream());
                      BufferedReader in = new BufferedReader(sr);


                      String str;
                      String contentToBeRevised = "";
                            while ((str = in.readLine()) != null) {
                                          contentToBeRevised += str.trim() + " ";
                            }


                            // remove one space on either side of each '=' so that attributes read name=value
                            int count = 0;
                            while( count < contentToBeRevised.length()-5) {
                                          if ( contentToBeRevised.charAt(count) == '=' ) {
                                                     // count > 0 guards against a '=' at the very start of the text
                                                     if(count > 0 && contentToBeRevised.charAt(count-1) == ' ') {
                                                               contentToBeRevised = contentToBeRevised.substring(0, count-1)
                                                                         + contentToBeRevised.substring(count);
                                                               count--;
                                                     }
                                                     if(contentToBeRevised.charAt(count+1) == ' ') {
                                                               contentToBeRevised = contentToBeRevised.substring(0, count+1)
                                                                         + contentToBeRevised.substring(count+2);
                                                     }
                                          }
                                          count++;
                            }
                            content = contentToBeRevised;


                            sr.close();
                            in.close();
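                            } catch (MalformedURLException e) {
                                        System.out.println("Illegal Format of URL");
                            } catch (IOException e) {
                                        // the page could not be fetched; content stays empty for this URL
                                        System.out.println("Could not read " + url);
                            }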




//                         String cacheSite = "cache_" + i + ".txt";
//                         try {
//                                     BufferedWriter caWriter = new BufferedWriter(new FileWriter(cacheSite));
//                                     caWriter.write(content);
//                                     caWriter.close();
//                         } catch (IOException e1) {
//                                     e1.printStackTrace();
//                         }




                           try {
                                       //System.out.println(i+"th site");
                                       if ( (i % 25 == 0) && ( i !=0) ) {
                                                   System.out.println(( (int)(((float)i/urlList.size())*100)) + "% done");
                                       }


                                       HTMLParserFactory parserFactory = HTMLParserFactory.getInstance();
                                       HTMLParser saxParser = parserFactory.getNewSAXHtmlParser();
                                       saxParser.parse(content, this);
                           }
                           catch (Exception e) {
                                       e.printStackTrace();
                           }


                                // analyze perDoc string to get info
                                BufferedReader br = new BufferedReader(new StringReader(perDoc));
                                String lineString = "";


                                try {
                                           while((lineString = br.readLine())!= null) {
                                                        if( lineString.trim().equalsIgnoreCase("<title>")) {
                                                                   title = true;
                                                                   link = false;
                                                                   img = false;
                                                        }
                                                        else if( lineString.trim().equalsIgnoreCase("</title>")) {
                                                                   title = false;
                                                                   link = false;
                                                                   img = false;
                                                        }
                                                        else if( lineString.trim().equalsIgnoreCase("<a>")) {
                                                                   link = true;
                                                                   img = false;
                                                                   title = false;
                                                        }
                                                        else if( lineString.trim().equalsIgnoreCase("</a>")) {
                                                                   link = false;
                                                                   title = false;
                                                                   img = false;
                                                        }
                                                        else if( lineString.trim().equalsIgnoreCase("<img>")) {
                                                                   img = true;
                                                                   title = false;
                                                        }
                                                        else if( (title==true) && (link==false) && (img==false)) {
                                                                   title = false;
                                                        }
                                                        else if( (title==false) && (link==true) && (img==false)) {
                                                                   StringTokenizer toc = new StringTokenizer(lineString);
                                                                   String t = toc.nextToken();
                                                                   if ( t.trim().equalsIgnoreCase("href")) {
                                                                              t = toc.nextToken();
                                                                              linkRec += t.trim()+ "\t";
                                                                   }
                                                                   else {
                                                                              linkRec += lineString.trim() + "\n";
                                                                              link = false;
                                                                              img = false;
                                                                   }
                                                        }
                                                        else if ((title==false) && (link==true) && (img==true)) {
                                                                   StringTokenizer toc = new StringTokenizer(lineString);
                                                                   String t = toc.nextToken();
                                                                   if ( t.trim().equalsIgnoreCase("alt")) {
                                                                              while(toc.hasMoreTokens()) {
                                                                                         t = toc.nextToken();
                                                                                         linkRec += t.trim() + " ";
                                                                              }
                                                                              linkRec += "\n";
                                                                              img = false;
                                                                              link = false;
                                                                   }
                                                        }
                                           }
                               } catch (IOException e) {
                                           e.printStackTrace();
                               }




                               //filling up link matrix
                              BufferedReader links = new BufferedReader(new StringReader(linkRec));
                              String lnk = "";
                              try {
                                         while( (lnk = links.readLine()) != null) {
                                                    StringTokenizer st = new StringTokenizer(lnk);
                                                    String add = st.nextToken();
                                                    if(add.endsWith("/")) {
                                                                add += "index.html";
                                                    }


                                                    if(add.startsWith("mailto")) {


                                                    }
                                                    else if(add.startsWith("file")) {


                                                    }
                                                    else if(add.startsWith("http")) {
                                                                if ( add.contains("#")) {
                                                                            int ix = add.indexOf("#");
                                                                            add = add.substring(0, ix);
                                                                }
                                                                int idx = urlList.indexOf(add.trim());
                                                                if(idx != -1) {
                                                                            linkMatrix.set(i, idx, 1.0);
                                                                            if (compMatrix.getValueAt(i, idx) == 0.0 ){
                                                                                        compMatrix.addElement(i, idx, 1.0);
                                                                            }


                                                                            // the rest is value of the hyper link, put them appropriately
                                                                            String at = "";
                                                                            while(st.hasMoreTokens()) {
                                                                                        String part = st.nextToken();
                                                                                        if ( part.startsWith("(")) {
                                                                                                   part = part.substring(1);
                                                                                        }
                                                                                        if ( part.startsWith("\"")) {
                                                                                                   part = part.substring(1);
                                                                                        }
                                                                                        if ( part.startsWith("\'")) {
                                                                                                   part = part.substring(1);
                                                                                        }
                                                                                        if ( part.endsWith(")")) {
                                                                                                   part = part.substring(0, part.length()-1);
                                                                                        }
                                                                                        if ( part.endsWith("\"")) {
                                                                                                   part = part.substring(0, part.length()-1);
                                                                                        }
                                                                                        if ( part.endsWith("\'")) {
                                                                                                   part = part.substring(0, part.length()-1);
                                                                                        }
                                                                                        if ( part.endsWith(",")) {
                                                                                                   part = part.substring(0, part.length()-1);
                                                                                        }
                                                                                        if ( part.endsWith(".")) {
                                                                                                   part = part.substring(0, part.length()-1);
                                                                                        }
                                                                                        if ( part.endsWith("?")) {
                                                                                                   part = part.substring(0, part.length()-1);
                                                                                        }
                                                                                        if ( part.endsWith(":")) {
                                                                                                   part = part.substring(0, part.length()-1);
                                                                                        }
                                                                                        if ( part.endsWith(";")) {
                                                                                                   part = part.substring(0, part.length()-1);
                                                                                        }
                                                                                        at += part + " ";
                                                                            }
                                                                    }
                                                         }
                                                         else if(add.startsWith("../")) {
                                                                    if ( add.contains("#")) {
                                                                                int ix = add.indexOf("#");
                                                                                add = add.substring(0, ix);
                                                                    }
                                                                    int ccc = 0;
                                                                    while( add.substring(3).startsWith("../")) {
                                                                                add = add.substring(3);
                                                                                ccc++;
                                                                    }
                                                                    String [] el = url.toString().split("/");
                                                                    String site = "";
                                                                    for(int x=0; x<el.length-1-ccc; x++) {
                                                                                site += el[x].trim() + "/";
                                                                    }
                                                                    add = site + add;
                                                                    int idx = urlList.indexOf(add.trim());
                                                                    if (idx != -1) {
                                                                                linkMatrix.set(i, idx, 1.0);
                                                                                if (compMatrix.getValueAt(i,
idx) == 0.0 ){


          compMatrix.addElement(i, idx, 1.0);
                                                                                }
                                                                                // the rest is value of the hyper link, put them appropriately
                                                                                String at = "";
                                                                                while(st.hasMoreTokens()) {
                                                                                            String part = st.nextToken();
                                                                                            if (part.startsWith("(")) {
                                                                                                        part = part.substring(1);
                                                                                            }
                                                                                            if (part.startsWith("\"")) {
                                                                                                        part = part.substring(1);
                                                                                            }
                                                                                            if (part.startsWith("\'")) {
                                                                                                        part = part.substring(1);
                                                                                            }
                                                                                            if (part.endsWith(")")) {
                                                                                                        part = part.substring(0, part.length()-1);
                                                                                            }
                                                                                            if (part.endsWith("\"")) {
                                                                                                        part = part.substring(0, part.length()-1);
                                                                                            }
                                                                                            if (part.endsWith("\'")) {
                                                                                                        part = part.substring(0, part.length()-1);
                                                                                            }
                                                                                            if (part.endsWith(",")) {
                                                                                                        part = part.substring(0, part.length()-1);
                                                                                            }
                                                                                            if (part.endsWith(".")) {
                                                                                                        part = part.substring(0, part.length()-1);
                                                                                            }
                                                                                            if (part.endsWith("?")) {
                                                                                                        part = part.substring(0, part.length()-1);
                                                                                            }
                                                                                            if (part.endsWith(":")) {
                                                                                                        part = part.substring(0, part.length()-1);
                                                                                            }
                                                                                            if (part.endsWith(";")) {
                                                                                                        part = part.substring(0, part.length()-1);
                                                                                            }
                                                                                            at += part + " ";
                                                                                }
                                                                    }
                                                        }




                                                else {
                                                         if ( add.contains("#")) {
                                                                     int ix = add.indexOf("#");
                                                                     add = add.substring(0, ix);
                                                         }
                                                         String [] el = url.toString().split("/");
                                                         String site = "";
                                                         for(int x=0; x<el.length-1; x++) {
                                                                     site += el[x].trim() + "/";
                                                         }
                                                         add = site + add;
                                                         int idx = urlList.indexOf(add.trim());
                                                         if (idx != -1) {
                                                                     linkMatrix.set(i, idx, 1.0);
                                                                     if (compMatrix.getValueAt(i, idx) == 0.0) {
                                                                                 compMatrix.addElement(i, idx, 1.0);
                                                                     }
                                                                     // the rest is value of the hyper link, put them appropriately
                                                                     String at = "";
                                                                     while(st.hasMoreTokens()) {
                                                                                 String part = st.nextToken();
                                                                                 if (part.startsWith("(")) {
                                                                                             part = part.substring(1);
                                                                                 }
                                                                                 if (part.startsWith("\"")) {
                                                                                             part = part.substring(1);
                                                                                 }
                                                                                 if (part.startsWith("\'")) {
                                                                                             part = part.substring(1);
                                                                                 }
                                                                                 if (part.endsWith(")")) {
                                                                                             part = part.substring(0, part.length()-1);
                                                                                 }
                                                                                 if (part.endsWith("\"")) {
                                                                                             part = part.substring(0, part.length()-1);
                                                                                 }
                                                                                 if (part.endsWith("\'")) {
                                                                                             part = part.substring(0, part.length()-1);
                                                                                 }
                                                                                 if (part.endsWith(",")) {
                                                                                             part = part.substring(0, part.length()-1);
                                                                                 }
                                                                                 if (part.endsWith(".")) {
                                                                                             part = part.substring(0, part.length()-1);
                                                                                 }
                                                                                 if (part.endsWith("?")) {
                                                                                             part = part.substring(0, part.length()-1);
                                                                                 }
                                                                                 if (part.endsWith(":")) {
                                                                                             part = part.substring(0, part.length()-1);
                                                                                 }
                                                                                 if (part.endsWith(";")) {
                                                                                             part = part.substring(0, part.length()-1);
                                                                                 }
                                                                                 at += part + " ";
                                                                      }
                                                                  }
                                                    }
                                         }
                               } catch (IOException e) {
                                         e.printStackTrace();
                               }


                               // initialize perDoc and content
                               content = "";
                               perDoc = "";
                               linkRec = "";
                     }




                  System.out.println();
                  compMatrix.setDims(siteSize, siteSize);
             }


  public void startElement(String pElementName, HTMLAttributeList pAttrList)
  {
             perDoc += "<" + pElementName + ">" + "\n";
      if(pAttrList != null)
         for(int i=0 ; i< pAttrList.size(); i++)
         {
                 HTMLAttribute attribute = (HTMLAttribute) pAttrList.get(i);
                 perDoc += "\t" + attribute.getAttributeName() + "\t" + attribute.getAttributeValue() + "\n";
         }
  }


  public void endElement(String pElementName)
  {
             perDoc += "</"+ pElementName + ">" + "\n";
  }


  public void elementValue(String pElementValue)
  {
             perDoc += "\t" + pElementValue + "\n";
  }


  public void startDocument()
  {


  }


  public void endDocument()
  {


  }


  public void outputLinkMatrix() {
                          String fName = "lMatrix";
                          try {
                                     BufferedWriter caWriter = new BufferedWriter(new FileWriter(fName));
                                     for(int i=0; i<linkMatrix.getRowDimension(); i++) {
                                                for(int j=0; j<linkMatrix.getColumnDimension(); j++) {
                                                           caWriter.write(linkMatrix.get(i, j) + "\t");
                                                }
                                                caWriter.write("\n");
                                     }
                                     caWriter.close();




                         } catch (IOException e1) {
                                    e1.printStackTrace();
                         }
     }


     public void outputPIDList() {
           try {
                         BufferedWriter caWriter = new BufferedWriter(new FileWriter("pidList"));
                         for(int i=0 ; i<linkMatrix.getRowDimension(); i++) {
                                    for(int j=0 ; j<linkMatrix.getColumnDimension(); j++) {
                                                  if(linkMatrix.get(i, j) != 0.0 ) {
                                                              caWriter.write(i + "\t" + j + "\n");
                                                  }
                                    }
                         }
                         caWriter.close();
                         } catch (IOException e) {
                                    e.printStackTrace();
                         }
     }


     void calc_pageRank() {
//          linkMatrix
           double [] dangle = new double[linkMatrix.getRowDimension()];
           for(int i=0; i<linkMatrix.getRowDimension();i++) {
                         dangle[i] = 0.0d;
           }


           for(int i =0; i<linkMatrix.getRowDimension(); i++) {
                         for(int j=0 ; j<linkMatrix.getRowDimension(); j++) {
                                    if(linkMatrix.get(j, i) == 1.0) {
                                                  dangle[i] += 1.0d;
                                    }
                         }
           }


           //normalize
           for(int i =0; i<linkMatrix.getRowDimension(); i++) {
                         if(dangle[i] != 0.0) {
                                    for(int j=0 ; j<linkMatrix.getRowDimension(); j++) {
                                                  linkMatrix.set(j, i, (double)linkMatrix.get(j,i)/dangle[i]);
                                    }
                         }
           }




        // construct a, e matrix
        Matrix e = new Matrix(linkMatrix.getRowDimension(),1);
        Matrix a = new Matrix(linkMatrix.getRowDimension(),1);
        for(int i=0; i<linkMatrix.getRowDimension(); i++) {
                   e.set(i,0, 1.0d);
                   if(dangle[i] == 0.0) {
                              a.set(i, 0, 1.0d);
                   }
                   else {
                              a.set(i, 0, 0.0d);
                   }
        }


        // calc rank
        // pk = dHpk-1 + e(da'pk-1+1-d)(1/n)
        // pk = dHpk-1+(d/n)(ea')pk-1+(1-d)(1/n)e
        // pk = d(H + (1/n)ea')pk-1 + ((1-d)/n)e


                   Matrix wBefore = new Matrix(linkMatrix.getRowDimension(),1);
                   for(int g=0; g<linkMatrix.getRowDimension(); g++) {
                              wBefore.set(g,0,((double)1.0/linkMatrix.getRowDimension()));
                   }


                   Matrix wCurrent = new Matrix(linkMatrix.getRowDimension(),1);
        Matrix tmp1;
        Matrix tmp2 = new Matrix(linkMatrix.getRowDimension(),1);
        boolean conv = false;
        while(!conv) {
                   conv = true;
                   tmp1 = linkMatrix.plus(e.times(a.transpose()).times(1.0d/linkMatrix.getRowDimension()))
                                     .times(DAMP).times(wBefore);
                   tmp2 = e.times((1.0d-DAMP)/linkMatrix.getRowDimension());
                   wCurrent = tmp1.plus(tmp2);
                   for(int z=0; z<wBefore.getRowDimension(); z++) {
                              // compare against a small tolerance; exact equality may never be reached in floating point
                              if ( Math.abs(wBefore.get(z,0)- wCurrent.get(z,0)) >= 1.0e-10 ) {
                                            conv = false;
                              }
                   }
                   wBefore = wCurrent;
        }


        rank = wCurrent;


        double sum = 0.0;
        for(int i=0; i<rank.getRowDimension(); i++) {
                   sum += rank.get(i,0);




         }
         System.out.println("Sum of pageRank from matrix is " + sum + "\n");
  }


  void calc_pageRank_compMatrix(){
         // record dangling pages by counting outlinks
         double [] dangle = new double[siteSize];
         for(int i=0; i<siteSize;i++) {
                    dangle[i] = 0.0d;
         }


         for(int i =0; i<siteSize; i++) {
                    for(int j=0 ; j<siteSize; j++) {
                                if(compMatrix.getValueAt(j, i) == 1.0 ) {
                                             dangle[i] += 1.0d;
                                }
                    }
         }


         //normalize by dividing each column by outlink count
         for(int i =0; i<siteSize; i++) {
                    if(dangle[i] != 0.0) {
                                for(int j=0 ; j<siteSize; j++) {
                                             if(compMatrix.getValueAt(j,i) !=0) {
                                                       compMatrix.setValueAt(j, i, (double)compMatrix.getValueAt(j,i)/dangle[i]);
                                             }
                    }
                    }
         }


         // construct a, e matrix
         CCompressedMatrix e = new CCompressedMatrix();
         CCompressedMatrix a = new CCompressedMatrix();
         for(int i=0; i<siteSize; i++) {
                    e.addElement(i, 0, 1.0d);
                    if(dangle[i] == 0.0) {
                                a.addElement(i, 0, 1.0d);
                    }
                    else {
                                a.addElement(i, 0, 0.0d);
                    }
         }
         e.toVerticalVectorInCompressedMatrixFormat(siteSize);
         a.toVerticalVectorInCompressedMatrixFormat(siteSize);


         // calculate PageRank




          // pk = d(H + (1/n)ea')pk-1 + ((1-d)/n)e


                    // initial column vector in compressed matrix format; every element is 1/(number of pages)
          CCompressedMatrix wBefore = new CCompressedMatrix();
                    for(int g=0; g<siteSize; g++) {
                               wBefore.addElement(g,0,(double)1.0/siteSize);
                    }


                    CCompressedMatrix wCurrent = new CCompressedMatrix();
                    wCurrent.setDims(siteSize, siteSize);
          CCompressedMatrix tmp1;
          CCompressedMatrix tmp2 = new CCompressedMatrix();
          boolean conv = false;
          while(!conv) {
                    conv = true;
                    tmp1 = compMatrix.plus(e.multiply(a.transpose()).ScalarMultiply(1.0d/siteSize))
                                     .ScalarMultiply(DAMP).multiply(wBefore);
                    tmp2 = e.ScalarMultiply((1.0d-DAMP)/siteSize);
                    wCurrent = tmp1.plus(tmp2);
                    for(int z=0; z<siteSize; z++) {
                               // compare against a small tolerance; exact equality may never be reached in floating point
                               if ( Math.abs(wBefore.getValueAt(z,0)- wCurrent.getValueAt(z,0)) >= 1.0e-10 ) {
                                          conv = false;
                               }
                    }
                    wBefore = wCurrent;
          }


          double sum = 0;
          for(int i=0; i<wCurrent.getRowDimension(); i++) {
                               sum += wCurrent.getValueAt(i, 0);
          }
          System.out.println("sum of pageRank from compressedMatrix is : " + sum);
      }
}
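
The update rule used by calc_pageRank and calc_pageRank_compMatrix, pk = d(H + (1/n)ea')pk-1 + ((1-d)/n)e, can also be evaluated without forming the n-by-n product ea'. The following is a minimal sketch under stated assumptions, not the implementation used in this project: the class PageRankSketch, the method pageRank and the (from, to) link array are hypothetical names introduced only for illustration, H is assumed to be normalized so that each page divides its rank equally among its outlinks, and convergence is tested against a tolerance.

```java
import java.util.Arrays;

// Hypothetical illustration class; it does not use CCompressedMatrix.
class PageRankSketch {

    /**
     * Power iteration for pk = d(H + (1/n)ea')pk-1 + ((1-d)/n)e.
     * links[k] = {from, to} is one hyperlink (assumed input format).
     */
    static double[] pageRank(int n, int[][] links, double damp, double tol) {
        // outlink count per page; a page with count 0 is a dangling page
        double[] outDegree = new double[n];
        for (int[] link : links) {
            outDegree[link[0]] += 1.0;
        }

        // initial vector: every element is 1/n
        double[] before = new double[n];
        Arrays.fill(before, 1.0 / n);

        while (true) {
            // a'pk-1 : total rank currently held by dangling pages
            double dangling = 0.0;
            for (int i = 0; i < n; i++) {
                if (outDegree[i] == 0.0) {
                    dangling += before[i];
                }
            }

            // (d/n)(a'pk-1)e + ((1-d)/n)e is the same for every page
            double[] current = new double[n];
            Arrays.fill(current, damp * dangling / n + (1.0 - damp) / n);

            // d * H * pk-1, touching only the nonzero entries of H
            for (int[] link : links) {
                current[link[1]] += damp * before[link[0]] / outDegree[link[0]];
            }

            // converged when no element changed by tol or more
            double maxDiff = 0.0;
            for (int i = 0; i < n; i++) {
                maxDiff = Math.max(maxDiff, Math.abs(current[i] - before[i]));
            }
            before = current;
            if (maxDiff < tol) {
                return current;
            }
        }
    }

    public static void main(String[] args) {
        // three pages: 0 -> 1, 1 -> 0, 1 -> 2; page 2 is dangling
        int[][] links = { {0, 1}, {1, 0}, {1, 2} };
        double[] rank = pageRank(3, links, 0.65, 1.0e-10);
        double sum = 0.0;
        for (double r : rank) {
            sum += r;
        }
        System.out.println("ranks = " + Arrays.toString(rank));
        System.out.println("sum of pageRank is " + sum);
    }
}
```

Because the dangling-page term and the (1-d)/n term are identical for every page, each iteration costs time proportional to the number of links rather than n squared. The ranks should sum to 1 up to rounding, which is the same sanity check printed at the end of the two methods above.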

C. CMainClusterProcess.java
/**
* CMainClusterProcess.java
* Created in 2007. 03. 25
* @author Chang Min Kim
* netID : ck273
*/


/**
* Description of the Class
*         This class calculates PageRank of given pages by file "pidList"




* using compressed sparse matrix representation and Google PageRank algorithm
* in clustered environment using Parallel Java Middleware system.
*/


package pr;


import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.FileNotFoundException;
import java.io.FileReader;
import java.io.FileWriter;
import java.io.IOException;
import java.util.StringTokenizer;


import edu.rit.mp.buf.*;
import edu.rit.pj.Comm;


public class CMainClusterProcess {


// Prevent construction. Constructor is not necessary
         private CMainClusterProcess() {


         }


// Shared variables for clustered processes.


         // World communicator.
         static Comm world;      // world of clustered computers
         static int size;        // number of clustered computers
         static int rank;        // rank of individual computer in the world


         // total count of links in each column
         static double[] total_count;


         // compressed link matrix
         static CCompressedMatrix comp;


         // PageRank linkedlist
         // the size of page should be siteSize
         static CCompressedMatrix page;


         static int siteSize; // total number of sites
         static int siteSize_per_node ; // number of sites each node except the master should process
         static int siteSize_master; // number of sites the master should process


         static final double DAMP = 0.65; // damping factor as it is defined by pagerank algorithm




            static final double TOLERANCE = 0.0000000001; // tolerance of convergence for each iteration


            /**
            * Main program of PageRank calculation.
            */
            public static void main(String[] args) throws Throwable{
                      //Initialize world communicator.
                      Comm.init (args);
                      world = Comm.world();
                      size = world.size();
                      siteSize = -1;
                      rank = world.rank();


                      int slave_size = size - 1;


                      if ( rank == 0 ) {


                                  System.out.println("##############################################################");
                                  System.out.println("              PageRank Calculation using Sparse Matrix Compression");
                                  System.out.println("          in clustered computer environment with Message Passing");
                                  System.out.println("##############################################################");
                                  // time checker
                                  long t0 = System.currentTimeMillis();
                                  System.out.println(rank + " : Starting time at master node is " + t0);
                      }


                      String infile = null, outfile = null;


                      // Parse command line arguments for main process.
                      if ( args.length != 1 ) usage1();
                      infile = args[0];
                      outfile = "result";


                      // First, all nodes read the pid:pid input file and construct the compressed sparse matrix
                      // This is better for efficiency because it reduces the number of messages passed


                      // Master process will take care of remainder part of site list
                      // For example, if there are 201 pages and 3 nodes in the cluster,
                      // the master will take care of first page and the other two nodes
                      // will take care of 100 pages in order from second page




                  // The portion the master takes care of is typically much smaller than the portion of the slaves
                  // This is to allow the master more time to work on its roles during message passing
                  BufferedReader br = new BufferedReader(new FileReader(infile));
                  String lastLine = null;
                  String line = null;
                  while ( (line=br.readLine())!=null ) {
                              lastLine = line;
                              StringTokenizer st = new StringTokenizer(lastLine);
                              int index = Integer.parseInt(st.nextToken());
                              // Given the structure of pidList, sorted by (from, to) link pair,
                              // the first token of the last line in the file is the biggest numbered page in the closure.
                              // Index begins with 0
                              siteSize = index + 1;
                  }


                  // We now know how many pages we should process (= siteSize),
                  // so assign the appropriate number of sites to each process
                  siteSize_master = siteSize % slave_size; // the number of sites the master process should load up
                  siteSize_per_node = siteSize / slave_size; // the number of sites each slave process should load up


                  // array dangle contains the number of links in the compressed sparse matrix, column by column
                  // first initialize this so that it can be passed by MP protocol
                  // If this is not initialized here MP will not work as specified by the package
                  double [] dangle = new double[siteSize];
                  for ( int i=0; i<siteSize;i++ ) {
                              dangle[i] = 0.0d;
                  }


                  // now nodes set up page vector
                  page = new CCompressedMatrix();
                  for ( int i=0; i< siteSize; i++ ) {
                              page.addElement(i, 0, 1.0d/(double)siteSize);
                  }


                  // now all nodes read input file and records links in comp
                  readInputFile(infile);


                  // all nodes calculate total number of links column by column


                  for ( int i =0; i<siteSize; i++ ) {
                              for ( int j=0 ; j<siteSize; j++ ) {




                        if ( comp.getValueAt(j, i) == 1.0 ) {
                                     dangle[i] += 1.0d;
                        }
            }
}


// master collects all the dangle array
DoubleArrayBuf [] gatheredSize = new DoubleArrayBuf[size];
for ( int i =0 ; i<size; i++) {
            gatheredSize[i] = DoubleArrayBuf.buffer(dangle);
}
DoubleArrayBuf sizeArray = DoubleArrayBuf.buffer(dangle);
world.gather(0, sizeArray, gatheredSize);


// total count will hold the total number of links for each page
total_count = new double[siteSize];


// master aggregating counts
if ( rank==0 ) {
            for ( int i = 0; i < siteSize; i++ ) {
                        for ( int j = 0; j < gatheredSize.length; j++ ) {
                                     total_count[i] = gatheredSize[j].get(i);
                        }
            }
}


// now master broadcasts aggregated size to each process
DoubleArrayBuf bf = DoubleArrayBuf.buffer(total_count);
world.broadcast(0, bf);
for ( int i=0 ; i<bf.length(); i++ ) {
            dangle[i] = bf.get(i);
}


// These should be declared inside of pagerank calculation
// but due to the restriction of parallel java that every process should call
// the "gather" and "broadcast" to make them work correctly,
// these were defined here intentionally


// temporary matrices for calculation
CCompressedMatrix tmp1;
CCompressedMatrix tmp2 = new CCompressedMatrix();


// e is a column vector with every element 1
CCompressedMatrix e = new CCompressedMatrix();
// a is a column vector. An element is 1 if that page is a dangling page, otherwise 0
CCompressedMatrix a = new CCompressedMatrix();




                    // wBefore is pagerank for each page before each iteration
                    CCompressedMatrix wBefore = new CCompressedMatrix();
                    // Because after each iteration a node holds only the pagerank elements of the pages it loaded,
                    // we save the last result for the node and compare the iteration result with it
                    // to find whether they have converged.
                    CCompressedMatrix wBefore_node = new CCompressedMatrix();
                    // wCurrent is pagerank for each page after each iteration
                    CCompressedMatrix wCurrent = new CCompressedMatrix();
                    // These two matrices(vectors) should be compared after every iteration
                    // If the difference is smaller than the tolerance given,
                    // we consider the matrix is converged to some point


                    //normalize by dividing each column by outlink count
                    for ( int i =0; i<siteSize; i++ ) {
                                if ( dangle[i] != 0.0 ) {
                                            for ( int j=0 ; j<siteSize; j++ ) {
                                                          if ( comp.getValueAt(j,i) !=0 ) {
                                                                     comp.setValueAt(j, i, (double)comp.getValueAt(j,i)/dangle[i]);
                                                          }
                                            }
                                }
                    }


                    // construct a, e matrix
                    for ( int i=0; i<siteSize; i++ ) {
                                e.addElement(i, 0, 1.0d);
                                if ( dangle[i] == 0.0 ) {
                                            a.addElement(i, 0, 1.0d);
                                } else {
                                            a.addElement(i, 0, 0.0d);
                                }
                    }


                    // These make sure that the dimension of the compressed matrix is the same as that of the original matrix
                    // We don't actually have to do this due to closure property of the given data
                    e.toVerticalVectorInCompressedMatrixFormat(siteSize);
                    a.toVerticalVectorInCompressedMatrixFormat(siteSize);


                    // calculate PageRank
                    // pk = d(H + (1/n)ea')pk-1 + ((1-d)/n)e
                    // Refer to the documentation for details


                    // initial column matrix each of which element is 1/(number of pages)
                    wBefore = new CCompressedMatrix();




                    for ( int g=0; g<siteSize; g++ ) {
                               wBefore.addElement(g,0,(double)1.0/siteSize);
                    }


                    if ( rank == 0 && siteSize_master == 0 ){


                    }
                    else if ( rank == 0 && siteSize_master !=0 ) {
                               for ( int g=0; g<siteSize_master; g++) {
                                             wBefore_node.addElement(g,0,(double)1.0/siteSize);
                               }
                    }
                    else if ( rank != 0 ){
                               for ( int g=0; g<siteSize_per_node; g++) {
                                             wBefore_node.addElement(g,0,(double)1.0/siteSize);
                               }
                    }




                    wCurrent = new CCompressedMatrix();


                    CCompressedMatrix ea = new CCompressedMatrix();
                    ea = e.multiply(a.transpose());


                    // We need to cut ea and e according to the rank so that only the proper portion of each
                    // will remain for calculation
                    int begin = -1;
                    int end = -1;
                    if ( rank == 0 && siteSize_master !=0 ) {
                               begin = 0;
                               end = siteSize_master;
                    }
                    else if ( rank != 0 ){
                               begin = (rank - 1) * siteSize_per_node + siteSize_master;
                               end = rank * siteSize_per_node + siteSize_master;
                    }


                    ea = ea.cutRow(begin, end);
                    e = e.cutRow(begin, end);


                    // begins iteration
                    boolean conv = false;
                    int iterCount = 0;
                    while ( !conv ) {
                               iterCount++;
                               if(rank ==0){




                                        System.out.println(rank + " : in the middle of " + iterCount + " th iteration");
                            }
                            // the PageRank calculation is skipped on the master node if the master
                            // has no pages to process
                            conv = true;


                            if ( rank == 0 && siteSize_master == 0 ){
                                        wCurrent = new CCompressedMatrix();
                            }
                            else {
                                        tmp1 = comp.plus(ea.ScalarMultiply(1.0d/siteSize))
                                                   .ScalarMultiply(DAMP).multiply(wBefore);
                                        tmp2 = e.ScalarMultiply((1.0d-DAMP)/siteSize);
                                        wCurrent = tmp1.plus(tmp2);
                            }


                            // check if each row of pagerank is converged
                            if ( rank == 0 && siteSize_master != 0){
                                        for ( int z=0; z<siteSize_master; z++ ) {
                                                     if ( Math.abs(wBefore_node.getValueAt(z,0) - wCurrent.getValueAt(z,0)) >= TOLERANCE ) {
                                                                   conv = false;
                                                     }
                                        }
                            }
                            else if ( rank != 0 ){
                                        for ( int z=0; z<siteSize_per_node; z++ ) {
                                                     if ( Math.abs(wBefore_node.getValueAt(z,0) - wCurrent.getValueAt(z,0)) >= TOLERANCE ) {
                                                                   conv = false;
                                                     }
                                        }
                            }


                            // decide whether each process has converged
                            BooleanItemBuf converged = new BooleanItemBuf();
                            if ( rank == 0 && siteSize_master == 0 ){
                                        converged.set(true);
                            }
                            else {
                                        converged.set(conv);
                            }
                            BooleanItemBuf [] allConverged = new BooleanItemBuf[size];
                            for ( int i = 0; i < size; i++ ) {
                                        allConverged[i] = new BooleanItemBuf();
                            }




                           world.gather(0, converged, allConverged);
                           boolean allConv = true;
                           for ( int i=0; i<size; i++ ){
                                         if(!allConverged[i].get()){
                                                     allConv = false;
                                         }
                           }




                           // now broadcast if all process converged
                           BooleanItemBuf convergeDone = new BooleanItemBuf();
                           convergeDone.set(allConv);
                           world.broadcast(0, convergeDone);


                           // if all process are converged, stop iteration
                           conv = convergeDone.get();


                           wBefore_node = wCurrent;


                           // wBefore_node should be gathered and combined by the master to form wBefore




                               // The gather step below is the point where we are stuck.
                               // It causes a NullPointerException because the object is not
                               // being serialized properly.
                               ObjectItemBuf part_ob = new ObjectItemBuf();
                               part_ob.set(wBefore_node);
                               ObjectItemBuf[] part_wBefore = new ObjectItemBuf[size];
                               for ( int i = 0; i < size; i++ ) {
                                             part_wBefore[i] = new ObjectItemBuf();
                               }
                               world.gather(0, part_ob, part_wBefore);


                               CCompressedMatrix toBeSent = new CCompressedMatrix();


                           if ( rank == 0 ) {
                                         int count = 0;
                                         for ( int i=0 ; i<size; i++ ) {
                                                     CCompressedMatrix part = (CCompressedMatrix)part_wBefore[i].get();
                                                     for ( int j=0; j<part.getRowDimension(); j++ ) {
                                                                 toBeSent.addElement(count, 0, part.getValueAt(j, 0));
                                                                 count++;
                                                     }
                                         }




                               }


                               for(int i=0; i<size; i++){
                                            part_wBefore[i].reset();
                               }
                               // now broadcast combined pagerank matrix(vector) to nodes
                               ObjectItemBuf full_ob = new ObjectItemBuf();
                               full_ob.set(toBeSent);
                               world.broadcast(0, full_ob);
                               wBefore = (CCompressedMatrix)full_ob.get();
                   }


                   // now page has PageRank for every page
                   page = wBefore;


                   // write computation result to file
                   writeOutputFile(outfile);
         }


         // Hidden operations.
         // Loading site list to compMatrix
         // Depending on the rank, loading range will be different
         private static void readInputFile(String inputFile) {
                   // in case the data format is -> from_pid : to_pid , just transpose comp matrix
                   // assuming data format is -> to_pid : from_pid
                   comp = new CCompressedMatrix();
                   int begin = -1;
                   int end = -1;
                   if ( rank == 0 ){
                               begin = 0;
                               end = siteSize_master;
                   }
                   else if( rank != 0 ){
                               begin = (rank - 1) * siteSize_per_node + siteSize_master;
                               end = rank * siteSize_per_node + siteSize_master;
                   }


                   BufferedReader br;
                   try {
                               br = new BufferedReader(new FileReader(inputFile));
                               String line;
                               int currentIndex = -1;
                               // skip the pairs that do not belong to this process and load the
                               // pairs that are in the range
                               while ( (line=br.readLine())!=null && currentIndex < end ) {
                                            StringTokenizer st = new StringTokenizer(line);
                                            int argb = Integer.parseInt(st.nextToken());




                                            int arge = Integer.parseInt(st.nextToken());
                                            if ( argb >= begin && argb < end ) {
                                                         comp.addElement(argb, arge, 1.0d);
                                            }
                                           currentIndex = argb;
                                }
                                if ( rank == 0 ){
                                           comp.setRowDimension(siteSize_master);
                                           comp.setColDimension(siteSize);
                                }
                                else if ( rank != 0 ){
                                           comp.setRowDimension(siteSize_per_node);
                                           comp.setColDimension(siteSize);
                                }


                      } catch ( NumberFormatException e ) {
                                e.printStackTrace();
                      } catch ( FileNotFoundException e ) {
                                e.printStackTrace();
                      } catch ( IOException e ) {
                                e.printStackTrace();
                      }
         }


         // resultant page file in master node will hold PageRank for every page, so save it to disk
         private static void writeOutputFile(String outFileName) {
                      try {
                                if ( rank == 0 ) {
                                           BufferedWriter out = new BufferedWriter(new FileWriter(outFileName));
                                           double sum = 0;
                                           // the format of the result file is -> pid<tab>pagerank
                                           for ( int i=0; i<page.getRowDimension(); i++ ) {
                                                         out.write( i + "\t" + page.getValueAt(i, 0) + "\n");
                                                         sum += page.getValueAt(i,0);
                                           }
                                           System.out.println("Sum of pagerank from each page is " + sum);
                                           out.close();
                                           System.out.println("Results are written to a file '" + outFileName + "'");
                                           System.out.println("All jobs done at " + System.currentTimeMillis());
                                }
                      } catch ( FileNotFoundException e1 ) {


                                e1.printStackTrace();




                      } catch ( IOException e1 ) {


                                 e1.printStackTrace();
                      }
          }


          // when the argument given is not appropriate
          private static void usage1() {
                      System.err.println ("For Every Process");
                      System.err.println ("Usage: java CMainClusterProcess <infile>");
                      System.err.println ("<infile> = Input file with ' pid<tab>pid ' format");
                      System.exit (1);
          }
}
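
The gather step in the listing above is reported to fail with a NullPointerException that the authors attribute to serialization. One thing worth checking, assuming the object buffer relies on standard Java object serialization, is whether the matrix class and every field it holds implement java.io.Serializable. The following is a minimal sketch of such a check. SparseColumn is a hypothetical stand-in for CCompressedMatrix, whose fields are not shown in this listing, and the helper names serialize and deserialize are likewise illustrative.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.util.TreeMap;

public class SerializationCheck {

    // Hypothetical stand-in for CCompressedMatrix: a sparse column vector
    // that stores only its non-zero rows. Not the authors' class.
    static class SparseColumn implements Serializable {
        private static final long serialVersionUID = 1L;
        private final TreeMap<Integer, Double> values = new TreeMap<>();

        void addElement(int row, double value) {
            values.put(row, value);
        }

        double getValueAt(int row) {
            return values.getOrDefault(row, 0.0);
        }
    }

    // Writes the object to a byte array with standard Java serialization.
    // Throws NotSerializableException if the object graph contains a
    // class that does not implement Serializable.
    static byte[] serialize(Object o) throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
            out.writeObject(o);
        }
        return bytes.toByteArray();
    }

    // Reads one object back from the byte array.
    static Object deserialize(byte[] data) throws IOException, ClassNotFoundException {
        try (ObjectInputStream in = new ObjectInputStream(new ByteArrayInputStream(data))) {
            return in.readObject();
        }
    }

    public static void main(String[] args) throws Exception {
        SparseColumn w = new SparseColumn();
        w.addElement(0, 0.25);
        w.addElement(3, 0.75);

        // Round trip: if this fails locally, the same object is unlikely
        // to survive being sent between processes.
        SparseColumn copy = (SparseColumn) deserialize(serialize(w));
        System.out.println("row 3 after round trip: " + copy.getValueAt(3));
    }
}
```

Running the same round trip on a real CCompressedMatrix instance in a single process, before involving the cluster, would separate a serialization problem in the class from a problem in how the buffers are set up.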

D. DataGenerator.java
/**
* DataGenerator.java
* @author Kwan Dong Kim
**/


/**
* Description of the Class
*         This class generates the pseudo data for CMainClusterProcess
* args[0] ---------------- max number of input
*/




import java.util.*;
import java.io.*;


class DataGenerator
{
          public static void main(String[] args) throws Exception
          {


          String Column_Out;
          int tmp_Column_Out;
          String Output_String;
          String Section_Output_String="";


          //init the tree set structure (sorted, no duplicate targets)
          TreeSet<Integer> Second_Column = new TreeSet<Integer>();


          //parse the argument[0] as a max number of input
          int max = Integer.parseInt(args[0]);
          String outfile = "input_"+max+".txt";




          // init the file writer
          FileWriter fw = new FileWriter(outfile);
          BufferedWriter bw = new BufferedWriter(fw);


        // init the random generator
        Random generator = new Random();




                    for (int i=1 ; i < max ; i++ )
                    {
                                  // draw a random count that bounds the number of "to-links" of page i
                                  int number_of_second_column = generator.nextInt(100) +1;

                                  for (int j=1; j< number_of_second_column; j++ )
                                  {
                                             // draw a random "to-link" target; the set drops duplicates
                                             int second_column = generator.nextInt(max)+1 ;
                                             Second_Column.add(second_column) ;
                                  }

                                  // iterate over each "from to" link of page i
                                  Iterator iterator = Second_Column.iterator ( ) ;
                                  while ( iterator.hasNext ( ) )
                                  {
                                             tmp_Column_Out = (Integer)iterator.next();
                                             Column_Out = tmp_Column_Out+"";
                                             Output_String = i+" "+Column_Out+"\n";
                                             Section_Output_String = Section_Output_String + Output_String;
                                  }

                                  Second_Column.clear();
                                  // write the links of page i through the buffered writer
                                  bw.write(Section_Output_String);
                                  Section_Output_String="";
                    }
                    // close the writer (this also flushes and closes fw)
                    bw.close();


        }




}
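
For small generated inputs, the clustered result can be compared against a single-process reference. The following is a minimal sketch of the same update that the iteration in CMainClusterProcess computes, a damped link term plus a uniform share for pages without out-links plus the (1 - DAMP)/siteSize term, written with plain arrays instead of CCompressedMatrix. It assumes pages are numbered 0..n-1 and that each link is given as a {from, to} pair. The class name DensePageRankSketch and the method pageRank are illustrative and not part of the original code.

```java
import java.util.Arrays;

public class DensePageRankSketch {

    // links[k] = {from, to}; pages are numbered 0..n-1 (assumption of this sketch).
    // damp is the damping factor, tol the per-page convergence tolerance.
    static double[] pageRank(int n, int[][] links, double damp, double tol) {
        int[] outDegree = new int[n];
        for (int[] link : links) {
            outDegree[link[0]]++;
        }

        // Start from the uniform vector, as the clustered code does.
        double[] w = new double[n];
        Arrays.fill(w, 1.0 / n);

        boolean converged = false;
        while (!converged) {
            // Rank held by pages without out-links is spread uniformly.
            double dangling = 0.0;
            for (int p = 0; p < n; p++) {
                if (outDegree[p] == 0) {
                    dangling += w[p];
                }
            }

            double[] next = new double[n];
            Arrays.fill(next, (1.0 - damp) / n + damp * dangling / n);
            for (int[] link : links) {
                next[link[1]] += damp * w[link[0]] / outDegree[link[0]];
            }

            // Converged only when every page moved by less than tol.
            converged = true;
            for (int p = 0; p < n; p++) {
                if (Math.abs(next[p] - w[p]) >= tol) {
                    converged = false;
                }
            }
            w = next;
        }
        return w;
    }

    public static void main(String[] args) {
        int[][] links = { {0, 1}, {1, 2}, {2, 0}, {2, 1} };
        double[] w = pageRank(3, links, 0.85, 1e-10);
        double sum = 0.0;
        for (double v : w) {
            sum += v;
        }
        System.out.println(Arrays.toString(w) + " sum=" + sum);
    }
}
```

As in writeOutputFile, the sum of all ranks is a quick sanity check: with this formulation it should stay close to 1 at every iteration.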



