PageRank Calculation using Sparse Matrix in Clustered
Computer Environment
Kwan Dong Kim, Chang Min Kim
Table of Contents
1. Introduction
2. PageRank
   A. Definition
   B. PageRank Algorithm
3. Sparse Matrix Representation
   A. Definition
   B. Compression Algorithm
4. Hardware Environment
   A. V2 Cluster
   B. Parallel Java
   C. Parallel Java in V2 Cluster
5. PageRank Iteration in Cluster
6. Experiments and Results
   A. Experiment method
   B. Result and conclusion
7. Acknowledgement
8. Reference
9. Appendix
   A. Establishing SSH Connection and Running MPI program
   B. Shell scripts
   C. Java Codes
1. Introduction
The purpose of the Web Laboratory is to provide data and computing tools for research
about the Web and the information on the Web. “It is funded in part by National Science
Foundation grants CNS-0403340, DUE-0127308, SES-0537606, and IIS 0634677. The
Web Lab is an NSF Next Generation Cyberinfrastructure project”[1]. To achieve this goal,
three teams work on the Web Lab project: the Index, User Interface, and PageRank teams.
The Index team provides full-text indexing for the other teams using a Linux cluster server.
The User Interface team is responsible for the user interface and for pre-processing the data
for PageRank. The PageRank team works on the compressed sparse matrix and on
parallel programming to calculate large-scale PageRank. This document is the project
report of the PageRank team. It explains the methods used to calculate
large-scale PageRank and then presents the results of an experiment that tests the
performance and scalability of the algorithm.
2. PageRank
A document on the web is typically measured by two metrics, “relevance” and
“importance”. Relevance is based on a comparison between the document terms and the
query terms, while importance is based on an estimate of the popularity of the document.
Although search engines combine the two metrics in practice, importance by popularity
is considered the key metric for ranking a document on the web. Computing the relevance
metric requires every document in a set to be compared with the query terms, which is
neither practical nor desirable for the web because of the cost, the growing number of
documents, and the uneven quality of the documents on the web. Relevance is better
suited to a controlled collection of documents. Among the several popularity measures
for a web page, “PageRank” provides the most suitable and efficient semantics and algorithm.
A. Definition
Brin and Page (1998) suggested “PageRank” as a measure for estimating the popularity
of a web page[2]. PageRank calculates the probability of reaching a certain web page
based on the in-links to that page. “PageRank is basically modified version of
Pinski and Narin’s influence weights applied to the web graph” (Arms, 2006). The PageRank
calculation is an iteration of matrix multiplications and additions, but it has been shown
to converge in a reasonable number of iterations, and this number does not depend on the
size of the input data, i.e. the number of pages that are calculated. This is the major
reason why PageRank can be applied to a large set of documents such as those on the web,
unlike other algorithms such as “Hubs and Authorities” (Kleinberg, 1997)[3].
B. PageRank Algorithm
PageRank essentially calculates the probability of reaching a page under the
“Random Surfer” model. In this model, a user surfs the web without any restriction,
under a single assumption: at each step the user either follows a link on the current
page or jumps to another page at random. The probability of reaching a page from another
page is decided purely by the number of links on the page the surfer is on. From the
point of view of the page being reached, a page with many in-links has a higher
probability of being reached and thus gets a higher PageRank. The surfer begins with
the same probability of reaching any page in the set of documents, which is virtually
the whole web.
Suppose the number of all pages in the set is n.
Then W0 is a vector in which every element is 1/n, the initial probability of reaching
each page.
Then set up a square matrix B that contains all the link information of the pages in the set.
The column index represents the page a link is from, and the row index represents the
page a link is to. Each cell in the matrix contains 1 if there is a link from the column
page to the row page. Then normalize the matrix by dividing each column by the total
number of links in the column, so that each cell contains the probability of reaching the
page in the row from the page in the column. A page that has no link to other pages is
called a “dangling node”, and every cell in its column is set to 1/n, because a surfer
who reaches that page has no out-link to follow and jumps to another page at random.
W0 = [¼, ¼, ¼, ¼]

          1     2     3     4
B =  1    0     ½     ¼     1/3
     2    ½     0     ¼     1/3
     3    0     0     ¼     1/3
     4    ½     ½     ¼     0

* page 3 is a dangling page
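The construction of B can be sketched in Java as follows. This is a minimal illustration and not the code used in our experiments; the class name LinkMatrixSketch, the method build, and the 0-based {from, to} link pairs are assumptions made for this example. The link list in main is read off the columns of the example matrix above.

```java
import java.util.Arrays;

// Hypothetical helper for illustration only; not part of the project code.
final class LinkMatrixSketch {

    /**
     * Builds the column-normalized link matrix B for n pages.
     * Each links[k] is a pair {from, to} of 0-based page numbers.
     */
    static double[][] build(int n, int[][] links) {
        double[][] b = new double[n][n];
        int[] outCount = new int[n];
        for (int[] link : links) {
            int from = link[0];
            int to = link[1];
            if (b[to][from] == 0.0) {      // count a repeated link only once
                b[to][from] = 1.0;
                outCount[from]++;
            }
        }
        for (int col = 0; col < n; col++) {
            for (int row = 0; row < n; row++) {
                if (outCount[col] == 0) {
                    b[row][col] = 1.0 / n;         // dangling page: jump anywhere
                } else {
                    b[row][col] /= outCount[col];  // share of the page's out-links
                }
            }
        }
        return b;
    }

    public static void main(String[] args) {
        // Links of the 4-page example (0-based):
        // 1->2, 1->4, 2->1, 2->4, 4->1, 4->2, 4->3; page 3 has no out-link.
        int[][] links = {{0, 1}, {0, 3}, {1, 0}, {1, 3}, {3, 0}, {3, 1}, {3, 2}};
        for (double[] row : build(4, links)) {
            System.out.println(Arrays.toString(row));
        }
    }
}
```

Every column of the resulting matrix sums to 1, which is what makes B usable as a transition matrix for the random surfer.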
The probabilities of reaching each page one step after the starting page, without a
random jump to another page, can be calculated by W1 = B * W0’. If we consider a random
jump with probability 1-d (d is the damping factor), then the equation for
W1 will be
W1 = d * (B * W0’) + (1-d) * W0’, and
W2 = d * (B * W1’) + (1-d) * W0’
W3 = d * (B * W2’) + (1-d) * W0’
.
.
.
Wk = d * (B * Wk-1’) + (1-d) * W0’
The sum of every element in each W is 1 because the sum is the probability to reach any
page in the set. Every element in W will eventually converge and W will be the PageRank
of the pages in the set.
The convergence to a unique vector for any given starting vector W0 is a property of
Markov chains, because matrix B is stochastic, irreducible, and aperiodic (Arms, 2006).
B is a dense matrix, but in the real world a typical web page has at most several
hundred links, and even that is unusual. In our experiment with the Amazon
dataset the average number of out-links per page was , and the page with the most links had . Under this
condition, iterating the matrix multiplication on a dense matrix reduces the performance
of the PageRank calculation significantly. We can construct a sparse matrix by rewriting B as
C = S + (1/n) * e * a’ --- 1
Then C accounts for the random jump from dangling pages.
S is the initial B matrix before 1/n is added for the dangling pages, e is a vector with
every element 1, and “a” is a vector whose element is 1 if the corresponding page is
dangling and 0 otherwise. If we define a matrix L
L = d * C + (1-d)(1/n) * E --- 2
then L accounts for the random jump from any page by user choice.
E is a square matrix with every element 1. Then
Wk = L * Wk-1’ --- 3
If we substitute equations 1 and 2 into equation 3,
Wk = (d * C + (1-d)(1/n) * E) * Wk-1’
= dSWk-1’ + d(1/n)ea’ * Wk-1’ + (1-d)(1/n)EWk-1’
= dSWk-1’ + d(1/n)ea’ * Wk-1’ + (1-d)(1/n)e --- 4
(note that EWk-1’ is e, because the elements of Wk-1 sum to 1)
There is no dense matrix in this equation, so iteration with this equation is significantly
more efficient when the size of the set gets larger. We used equation 4 for our
experiments.
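One step of equation 4 can be sketched in Java as below. This is an illustrative sketch under stated assumptions, not our production code: the class and method names are hypothetical, S is held as three parallel arrays of (row, column, value) triples with 0-based indexes, and the damping factor 0.85 used in main is an assumed value that the report does not fix. The data in main is the 4-page example with the dangling column removed.

```java
import java.util.Arrays;

// Hypothetical sketch of equation 4; names and data layout are assumptions.
final class SparseIterationSketch {

    /** Computes Wk = d*S*Wk-1' + d*(1/n)*e*(a'*Wk-1') + (1-d)*(1/n)*e. */
    static double[] step(int[] rows, int[] cols, double[] vals,
                         boolean[] dangling, double[] w, double d) {
        int n = w.length;
        double[] next = new double[n];
        // d * S * w: only the stored (non-zero) cells of S are visited
        for (int k = 0; k < vals.length; k++) {
            next[rows[k]] += d * vals[k] * w[cols[k]];
        }
        // a' * w: the probability currently sitting on dangling pages
        double danglingMass = 0.0;
        for (int i = 0; i < n; i++) {
            if (dangling[i]) {
                danglingMass += w[i];
            }
        }
        // The two remaining terms add the same amount to every page
        double shared = (d * danglingMass + (1.0 - d)) / n;
        for (int i = 0; i < n; i++) {
            next[i] += shared;
        }
        return next;
    }

    public static void main(String[] args) {
        double third = 1.0 / 3.0;
        int[] rows = {0, 0, 1, 1, 2, 3, 3};
        int[] cols = {1, 3, 0, 3, 3, 0, 1};
        double[] vals = {0.5, third, 0.5, third, third, 0.5, 0.5};
        boolean[] dangling = {false, false, true, false};
        double[] w = {0.25, 0.25, 0.25, 0.25};
        for (int k = 0; k < 50; k++) {          // assumed iteration count
            w = step(rows, cols, vals, dangling, w, 0.85);
        }
        System.out.println(Arrays.toString(w));
    }
}
```

The dense terms ea’ and E never appear as matrices: each collapses to a single number that is added to every element, so the cost of a step is proportional to the number of stored links plus n.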
However, if we want to calculate PageRank on a very large set of pages, as the PageRank
team does, memory space becomes a problem. Storing a huge sparse matrix in a regular
structure is not desirable in terms of either memory utilization or computational efficiency.
We can transform the sparse matrix into a compressed sparse matrix using the sparse
matrix representation technique, which is discussed in chapter 3.
Even with this representation, in-memory computation on one machine is impractical when
the data set is too large, so we need to move to a clustered computing environment.
Considering the number of web pages, this is a necessity rather than an option. By
using the “Message Passing Interface (MPI)”, multiple nodes can exchange the relevant
portions of the data and compute on them concurrently. Message passing is expensive in
terms of speed, but considering the size of the real data it is unavoidable. The hardware
environment and setup for clustered computing are discussed in chapter 4.
In this clustered computing environment, we need to modify the algorithm slightly so that
the information each node needs can be distributed and gathered among the nodes
effectively and without loss. The actual process of the algorithm and the problems on
clustered machines are discussed in chapter 5.
3. Sparse Matrix Representation
A. Definition
A sparse matrix is a matrix that consists primarily of zeros. Sparse matrix calculation
is important for computing PageRank because the PageRank calculation is
fundamentally a combination of several sparse matrix calculations. Therefore, if we can
handle the matrix more efficiently, we can process larger PageRank calculations. Fig-1
is an example of a sparse matrix. It contains many “0”s and only a few values greater
than “0”.
index 1 2 3 4 5 6 7 8 9
1 0 0 1 0 0 0 0 0 0
2 0 0 0 0 0 0 0 0 0
3 1 0 0 1 0 0 0 2 0
4 0 0 0 0 0 0 0 0 0
5 0 0 0 0 0 0 0 1 0
6 . . . . . . . . .
(Fig - 1) Example of sparse matrix
As mentioned, the PageRank calculation is a combination of several sparse matrix
calculations. If there are 100 pages, we have to process a 100 x 100 matrix to
calculate PageRank without compression. For the Web Lab project there are
approximately 13 million pages, so at least a 13 million x 13 million matrix must
be processed. If we use a regular array, vector, or linked list, an enormous amount of
memory is required. For example, suppose there are one million pages and each page has
approximately 10 out-links on average. With a regular array structure in which each cell
needs 4 bytes, storing the matrix requires 4 bytes x one million x one million =
4 x 10^12 bytes. However, when we ignore all zero values and store only the values
greater than zero, we need only 4 bytes x one million x 10 = 4 x 10^7 bytes for the
values. As a result, the compressed sparse matrix algorithm saves an enormous amount of
memory when we calculate PageRank.
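The estimate can be written out as a short worked calculation. The first two lines restate the figures in the text. The last line is our own assumption, not a measured figure: it adds the index storage that the compressed representation also needs, taking 4-byte integers for one column index per stored value and one row-list entry per page.

```latex
\begin{align*}
\text{regular array} &= 4 \times 10^{6} \times 10^{6} = 4 \times 10^{12} \text{ bytes} \\
\text{non-zero values only} &= 4 \times 10^{6} \times 10 = 4 \times 10^{7} \text{ bytes} \\
\text{values + column list + row list} &= 4 \times 10^{7} + 4 \times 10^{7} + 4 \times 10^{6} = 8.4 \times 10^{7} \text{ bytes}
\end{align*}
```

Even with the index overhead included, the compressed form is smaller than the regular array by a factor of roughly 5 x 10^4 under these assumptions.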
B. Compression Algorithm
Three linked lists are used to implement the compressed sparse matrix. The first is the
row list, which has one entry per row and holds the number of stored (non-zero) columns
in that row. The second is the column list, which holds the column index of each stored
cell. The third is the value list, which holds the value of each stored cell [4]. Fig-2
shows the structure of the compressed sparse matrix, and Fig-3 shows the equivalent
regular matrix representation.
Index        1  2  3  4  5  6  7
Row List     5  0  4  2  1  0  .  .  .  .  .  .  .
Column List  2  3  7  8  9  1  3  4  9  2  4  8  .  .
Value List   1  2  1  1  2  1  3  1  2  3  4  5  .  .
(The first entry of the Row List, 5, is the number of columns at the first row.)
(Fig - 2) Compressed sparse matrix representation
index 1 2 3 4 5 6 7 8 9
1 0 1 2 0 0 0 1 1 2
2 0 0 0 0 0 0 0 0 0
3 1 0 3 1 0 0 0 0 2
4 0 3 0 4 0 0 0 0 0
5 0 0 0 0 0 0 0 5 0
6 . . . . . . . . .
(Fig - 3) Regular Matrix Representation
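The relationship between Fig-2 and Fig-3 can be sketched in Java as follows. This is a simplified illustration rather than our implementation: it uses plain arrays instead of linked lists, and the class name CompressedRowsSketch and its members are hypothetical. The input in main is the first five rows of Fig-3.

```java
import java.util.Arrays;

// Hypothetical array-based sketch of the three-list representation.
final class CompressedRowsSketch {
    final int[] rowCounts;  // Row List: number of stored cells in each row
    final int[] columns;    // Column List: 1-based column index of each stored cell
    final int[] values;     // Value List: value of each stored cell

    CompressedRowsSketch(int[][] dense) {
        int nonZero = 0;
        for (int[] row : dense) {
            for (int v : row) {
                if (v != 0) nonZero++;
            }
        }
        rowCounts = new int[dense.length];
        columns = new int[nonZero];
        values = new int[nonZero];
        int k = 0;
        for (int r = 0; r < dense.length; r++) {
            for (int c = 0; c < dense[r].length; c++) {
                if (dense[r][c] != 0) {
                    rowCounts[r]++;
                    columns[k] = c + 1;
                    values[k] = dense[r][c];
                    k++;
                }
            }
        }
    }

    /** Returns the value at 1-based (row, col), or 0 if the cell is not stored. */
    int get(int row, int col) {
        int start = 0;
        for (int r = 0; r < row - 1; r++) {
            start += rowCounts[r];      // skip the cells of the earlier rows
        }
        for (int k = start; k < start + rowCounts[row - 1]; k++) {
            if (columns[k] == col) return values[k];
        }
        return 0;
    }

    public static void main(String[] args) {
        int[][] fig3 = {
            {0, 1, 2, 0, 0, 0, 1, 1, 2},
            {0, 0, 0, 0, 0, 0, 0, 0, 0},
            {1, 0, 3, 1, 0, 0, 0, 0, 2},
            {0, 3, 0, 4, 0, 0, 0, 0, 0},
            {0, 0, 0, 0, 0, 0, 0, 5, 0}
        };
        CompressedRowsSketch m = new CompressedRowsSketch(fig3);
        System.out.println("Row List    " + Arrays.toString(m.rowCounts));
        System.out.println("Column List " + Arrays.toString(m.columns));
        System.out.println("Value List  " + Arrays.toString(m.values));
    }
}
```

Looking up a cell walks the row list to find where the row starts, so random access is slower than in a regular array; the representation is intended for row-by-row traversal such as matrix-vector multiplication.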
4. Hardware Environment
A. V2 Cluster
Even with the compressed sparse matrix representation, the data is still too large to
process on a single machine, so we decided to process it on multi-node
Linux cluster servers. The Cornell Theory Center has several hundred nodes of Linux
cluster servers that we can use. To calculate PageRank on the cluster, we developed a
program that runs on multiple cluster nodes. It divides the work into jobs and processes
each job on a separate node simultaneously. Each node produces its result when its job
is finished. When all processes are finished, the results are returned to the Linux
login machine, where they are merged.
B. Parallel Java
As mentioned, the Linux cluster servers in the Cornell Theory Center are used to run
our PageRank program. According to the Cornell Theory Center manual for parallel
programming, only the C and Fortran languages are supported on this system.
However, our goal is to develop a Java PageRank program that runs on these Linux cluster
servers, so we looked for packages that support parallel programming in Java.
The parallel programming packages we considered are listed below:
1. mpiJava(http://aspen.ucs.indiana.edu/pss/HPJava/mpiJava.html)
2. Open MPI(http://www.open-mpi.org/)
3. OpenMP API(http://docs.sun.com/app/docs/doc/819-3694)
4. MPICH2(http://www-unix.mcs.anl.gov/mpi/mpich/)
5. Parallel Java(http://www.cs.rit.edu/~ark/pj.shtml)
Although several packages support parallel programming, we selected
Parallel Java (PJ), implemented by Professor Alan Kaminsky at Rochester Institute of
Technology, because this package is well documented and comes with several examples
that are easy to follow.
The Parallel Java package contains several important components: “Comm” and “Buffers”,
and the “JobScheduler”, “Backend” and “Frontend” programs.
“Comm” and “Buffers” are used for communication between the backend and the
frontend. “JobScheduler” assigns the jobs to the nodes (frontend or backend). The
frontend node controls all of the backend nodes, gathers their results, and merges
them. The backend nodes process the actual data and send the results back to the
frontend node. The “JobScheduler”, “Backend” and “Frontend” programs are required
when we run our parallel program (refer to the Parallel Java documentation) [5].
C. Parallel Java in V2 Cluster
To set up an appropriate environment for the parallel program, it is essential to
understand the structure of the Linux cluster servers in the Cornell Theory Center. The
Cornell Theory Center has many systems, but we use only three: the Linux Login Machine,
the File Server, and the V2 Linux Cluster Servers. The overall structure is shown in
Fig-4. A client can connect only to the Linux Login Machine at first. After logging in
to the Linux Login Machine, the user can connect to the V2 Linux Cluster Servers using
the SSH command. User directories and files are stored and managed on the File Server.
When the client connects to the Linux Login Machine or to the V2 Linux Cluster Servers,
the user directory from the File Server is mounted automatically. Therefore the user
sees the same file system throughout the system, including the Linux Login Machine and
the V2 Linux Cluster Servers[6].
[Figure: Client, Linux Login Machine, File Server, and V2 Linux Cluster Servers (Vii0001, Vii0002, ..., Vii000k)]
(Fig-4) The structure of Linux Cluster Servers
As mentioned, the Linux cluster servers in the Cornell Theory Center support only the C
and Fortran languages, so the Parallel Java (PJ) package needs a special setup.
There are five essential conditions for running the Parallel Java package on the Linux
cluster servers.
These conditions are listed below:
1. Java must be installed on each node (V2 Linux Cluster Servers)
2. The nodes can communicate with each other over SSH without interactive authentication
3. The packages must be deployed on each node
4. The Job Scheduler should be running in each node
5. The program should be developed as explained in the PJ documentation
To meet the first condition, we asked the Cornell Theory Center administrator to install
Java 1.5 on the Linux cluster servers. We set up a public key for our account to allow
SSH communication without a password prompt. We then wrote shell scripts that copy all
required programs to each node and generate the configuration file for the job
scheduler. Finally, these scripts run the Job Scheduler and our PageRank program. The
detailed shell code for the setup and the way to set up the public key are given in
the appendix.
5. PageRank Iteration in Cluster
Besides the hardware setup for clustered machines, we need to modify the PageRank
process slightly. It is important to understand the relationship between the matrix and
its compressed representation, between the PageRank iteration and the compressed
representation, and the characteristics of the input data, as well as the hardware
environment and some facts about the “Parallel Java” package.
First, when we convert a sparse matrix to the compressed format, we intentionally
eliminate the cells with value 0 to get rid of unnecessary information, at the cost of
losing some information. In our implementation, a row is kept in the data structure even
when it has no entry, as long as there was an attempt to insert any data into that row,
even a 0. This is sufficient for the PageRank calculation because the link matrix is
always square, so even without full column information we can derive the column
dimension from the row dimension. However, if the compressed representation is used in
other calculations, it should be modified to keep the column dimension.
Second, in the clustered PageRank calculation, each node needs to hold only the portion
of the input file that it calculates. However, each node must have the full W vector to
calculate an iteration, and e, a’ and ea’ should be sliced so that only the appropriate
portion takes part in the calculation on each node. The resulting W vector on each node
is only a part of the full W vector, so the parts must be gathered by the master node
and redistributed to every node.
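The row slicing and the gather step can be sketched in a single Java process as below. This is only a simulation of the idea and makes several assumptions: it does not use the Parallel Java API, the class and method names are hypothetical, the damping factor is passed in by the caller, and S is kept as a dense array for brevity where the real program would hold a compressed slice. Each simulated node reads the full vector w but writes only the rows it owns.

```java
// Hypothetical single-process simulation of the clustered iteration.
final class RowSliceSketch {

    /** Returns {firstRow, lastRowExclusive} of the slice owned by the given rank. */
    static int[] slice(int n, int nodes, int rank) {
        int base = n / nodes;
        int extra = n % nodes;                 // the first 'extra' ranks get one more row
        int first = rank * base + Math.min(rank, extra);
        int size = base + (rank < extra ? 1 : 0);
        return new int[] {first, first + size};
    }

    /** Computes rows [first, last) of equation 4 from the full vector w. */
    static double[] partialStep(double[][] s, boolean[] dangling, double[] w,
                                double d, int first, int last) {
        int n = w.length;
        double danglingMass = 0.0;
        for (int i = 0; i < n; i++) {
            if (dangling[i]) danglingMass += w[i];
        }
        double shared = (d * danglingMass + (1.0 - d)) / n;
        double[] part = new double[last - first];
        for (int r = first; r < last; r++) {
            double sum = 0.0;
            for (int c = 0; c < n; c++) {
                sum += s[r][c] * w[c];
            }
            part[r - first] = d * sum + shared;
        }
        return part;
    }

    /** One iteration: every "node" computes its slice, then the master gathers them. */
    static double[] iterate(double[][] s, boolean[] dangling, double[] w,
                            double d, int nodes) {
        double[] full = new double[w.length];
        for (int rank = 0; rank < nodes; rank++) {
            int[] range = slice(w.length, nodes, rank);
            double[] part = partialStep(s, dangling, w, d, range[0], range[1]);
            System.arraycopy(part, 0, full, range[0], part.length);  // gather
        }
        return full;  // in the cluster this vector would be sent back to every node
    }

    public static void main(String[] args) {
        double t = 1.0 / 3.0;
        double[][] s = {{0, 0.5, 0, t}, {0.5, 0, 0, t}, {0, 0, 0, t}, {0.5, 0.5, 0, 0}};
        boolean[] dangling = {false, false, true, false};
        double[] w = {0.25, 0.25, 0.25, 0.25};
        w = iterate(s, dangling, w, 0.85, 3);
        System.out.println(java.util.Arrays.toString(w));
    }
}
```

In this sketch the dangling-page sum a’ * w is recomputed on every node from the full vector, which avoids an extra message per iteration at the cost of a little redundant work.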
Third, there are several issues with the “Parallel Java” package that we do not fully
understand. For example, when we set up buffers for sending a Boolean variable and
receiving a Boolean array, we encountered a null pointer exception. The code was
    BooleanItemBuf bif = new BooleanItemBuf();
    BooleanItemBuf[] bifa = new BooleanItemBuf[size];
    bif.set(true);
    world.gather(0, bif, bifa);
This problem was resolved after we re-ordered the lines as follows:
    BooleanItemBuf bif = new BooleanItemBuf();
    bif.set(true);
    BooleanItemBuf[] bifa = new BooleanItemBuf[size];
    world.gather(0, bif, bifa);
We still cannot find the justification for this, but in practice the order of the
statements was important.
We also suffered for a long time from the jobScheduler not running correctly. The reason
was that the jobScheduler takes some time to start up fully before it works correctly,
so we modified the shell script to wait for some time after it issues the java command
for the jobScheduler.
Moreover, there are quite serious environmental problems with the clustered machines in
the Cornell Theory Center. First, it is quite hard to get a quota on a cluster,
especially a development cluster. We are supposed to use the v2 linuxdev cluster for
development, which has only 4 nodes and is hardly ever idle. Second, and worst,
this cluster is quite unstable and the server goes down often. For
example, exactly the same code works for one account but not for another account on
the same cluster, and many times clearly working code did not work until
Theory Center agents rebooted the system. We spoke with the agents when we believed our
code was right, and rebooting the system usually resolved the problem. The cluster uses
“mpiexec” to distribute the necessary files from the login machine to the nodes, and
this often failed for no evident reason. Executing the same program does not guarantee
the same behavior on the v2 linux and v2 linuxdev clusters; the same code
behaves differently depending on the server. All of these were major obstacles
that slowed development.
Because we spent most of our time verifying the PageRank iteration equation and
setting up the hardware environment, we had little chance to work on the PageRank
calculation program for clustered machines. We have now reached the point where message
passing works correctly for all data types except the object data type, and most parts
of the PageRank calculation program work correctly.
One of the remaining tasks for PageRank in a clustered environment is, first, to figure
out how to exchange the object data type. After each node calculates its partial
wCurrent, the result matrix of an iteration, all partial matrices should be gathered
and broadcast to every node for the next iteration. wCurrent is a compressed matrix,
and it should be exchanged using an Object (Array, Item) Buffer, but we did not have a
chance to work on this because of time restrictions. Instead, we decomposed the
compressed matrix into its three component arrays and exchanged those. This is not the
approach we ultimately want, and it breaks the encapsulation of the compressed matrix,
but it lets us calculate the PageRank iteration. We still need to verify the program
with more data. Because we could not get enough data in time, we tested only with
very small data samples, some real and some randomly generated, ranging
from 5 by 5 to 1000 by 1000. Although the current code produces quite reasonable
results, it might not work for larger data sets or under specific conditions. As
always, there are also several points that should be optimized. One good example is
transpose() in CCompressdMatrix, which converts between the row-based and column-based
matrix representations.
Another issue is to figure out the statement-ordering problem shown in the code above.
Given the characteristics of an object-oriented language, such behavior is hard to
expect, but according to our experiments it did occur, at least in the Cornell Theory
Center (CTC) environment. It is possible that the Parallel Java (PJ) package is not well
tuned for the CTC environment, because it is open to the public for academic purposes
and was not designed specifically for CTC. We may try another package with the same
code.
Another issue is that we could not implement the function that writes the result to our
server. To implement it, each cluster node can write a result file, and the login
machine can collect and merge all of the results. This function is not developed yet
because we spent our time testing and experimenting with our algorithms.
Lastly, the most important and probably the first thing to do is to establish a
stable clustered computing environment. For the reasons we described above,
it is really hard to work on the program. Besides the pure programming difficulties of a
clustered environment, when there was an error or bug it was hard to tell whether the
cause was a bad implementation or an unstable cluster.
6. Experiments and Results
A. Experiment method
We tested our compressed matrix structure and parallel program to measure their
performance. Performance is measured by comparing the time needed to finish each
run as the amount of data increases. To measure the performance of our parallel program,
we used 4 nodes and data sets of 50, 100, 150, and 200 web links. (A data set of 50
contains a sparse matrix structure of 50 web links.)
B. Result and conclusion
According to our experiment, the performance of our parallel program decreases rapidly
as the amount of data increases. As the data grows, the data communication between the
backend and the frontend increases rapidly, so the overall performance suffers from the
excess communication. As a result, the performance graph is not linear but
approximately exponential. Fig-5 shows the result of the performance
measurement.
[Chart: Time (Sec), axis from 0 to 6000, versus the number of web links (50, 100, 150, 200)]
(Fig-5) The result of performance measurement
7. Acknowledgement
We would like to thank the following people:
William Y. Arms, our Project Advisor
Daniel Ira Sverdlik, Consultant at the Cornell Theory Center
Alan Kaminsky, author of the Parallel Java package
The Web Lab team wishes to thank the Internet Archive for their assistance and support.
This work is funded in part by National Science Foundation grants CNS-0403340, SES-0537606,
and IIS-0634677.
8. Reference
[1] W. Arms, The Web Laboratory: A Joint Project of Cornell University and the Internet Archive,
http://www.infosci.cornell.edu/SIN/WebLab/about.html
[2] S. Brin & L. Page, The Anatomy of a Large-Scale Hypertextual Web Search Engine, Stanford, USA,
1998
[3] J. Kleinberg, Authoritative sources in a hyperlinked environment, Journal of the ACM 46, 1999; IBM
Research Report, 1997
[4] N. Goharian, T. El-Ghazawi & D. Grossman, Enterprise Text Processing: A Sparse Matrix Approach,
IEEE, 2001
[5] A. Kaminsky, Parallel Java Library, http://www.cs.rit.edu/~ark/pj.shtml, 2007
[6] Cornell Theory Center, Computing Resources for CTC Users,
http://www.tc.cornell.edu/Services/CTC+Resources.htm, 2007
[7] J. Willcock & A. Lumsdaine, Accelerating Sparse Matrix Computations via Data Compression, ACM
Press, 2006
9. Appendix
A. Establishing SSH Connection and Running MPI program
A.1 Creating ssh keys and a required MPI file
Run the following script before submitting a batch job. From a linux login node, type:
/ctc/tools/setup_ssh_mpd_linux.sh
This script creates ssh keys and a required MPI file. MPICH2 is the supported MPI
implementation on the Linux clusters.
The script /ctc/tools/setup_ssh_mpd_linux.sh performs the steps detailed below.
Note: you do not need to issue any of the commands illustrated here, they are done for
you automatically when you run the script.
On linuxlogin1.tc.cornell.edu or linuxlogin2.tc.cornell.edu, creates the directory
~<your_userid>/.ssh
    cd ~<your_userid>
    mkdir .ssh
    cd .ssh
Creates an SSH keypair to automate logons.
    ssh-keygen -b 1024 -t dsa -C <your_userid>
Adds the SSH public key to the authorized_keys file (the file is visible from all machines).
    cat id_dsa.pub >> authorized_keys
Note: Use append (>>) when adding keys to the authorized_keys file so
any existing keys are not overwritten.
Adds a required public key to your authorized_keys file. This is required to allow
the scheduler to launch jobs with your userid. In addition to adding the key, it is
also necessary to set the proper permissions on both the .ssh folder and the
authorized_keys file for ssh to function.
    cat /ctc/tools/velocity.pub >> ~<your_userid>/.ssh/authorized_keys
Changes permissions on authorized_keys to 600 and on the .ssh directory to
700. Returns to your home directory:
    chmod 600 authorized_keys
    cd ..
    chmod 700 .ssh
Creates the file .mpd.conf in your home folder. It will contain the parameter
MPD_SECRETWORD.
Sets permissions so only you can read it.
    chmod 600 .mpd.conf
Running the script creates the following files:
    $HOME/.ssh/authorized_keys
    $HOME/.ssh/id_dsa
    $HOME/.ssh/id_dsa.pub
    $HOME/.mpd.conf
A.2 Testing MPI Interactively
Create a hosts file, mpd.hosts.
On compute nodes that have been assigned by vsched, this is very easy to do:
    vsched -m
    mv machines mpd.hosts
Alternatively, use a text editor like nano or vi or emacs to add any machine
names you want to mpd.hosts, one name per line, and save it.
Start the mpd daemons.
Preferred method using mpdboot:
    mpdboot -n <numberofhosts>
Alternate method if mpdboot doesn't work:
At the command prompt, enter:
    mpd &
To find the port, run
    mpdtrace -l
It will return the port number it is running on.
To start mpd's on the other machines, run
    ssh <nextmachinename> mpd -h <firstmachinename> -p <port> -d
Verify all the mpd daemons are running correctly.
Run mpdtrace to get a quick trace and see all the machines.
Run mpdringtest 3000 to run a ring around the mpd daemons.
Verify you've got the right hosts with
    mpiexec -n <numberofhosts> hostname
A.3 Running an MPI Program Interactively
Make sure mpd.hosts is in the same directory as your executable. Then in that directory,
issue:
    mpiexec -n <numberofhosts> <mycodename>
When you are done, close all the daemons by running:
    mpdallexit
B. Shell scripts
B.1 PageRank.xml
<?xml version="1.0" ?>
<!-- Sample XML Job File -->
<job>
<nodes>4</nodes>
<minutes>20</minutes>
<type>interactive</type>
<affiliation>v2linuxdev</affiliation>
<run>/bin/sh $HOME/Lab/parallel.sh</run>
</job>
B.2 parallel.sh
#!/bin/sh
# parallel.sh
# @author Kwan Dong Kim
# Description of the script:
# This script sets up the environment for Parallel Java in the V2 Linux Cluster
# Set the number of machines
NMACHINES=4
# Set the number of processes
NPROCS=4
# Set ROOTDIR
ROOTDIR=$HOME/Lab
export ROOTDIR
# Set and create an output directory
tmphost=`hostname | cut -f1 -d"."`
OUTDIR=$ROOTDIR/output
mkdir -v $OUTDIR
export OUTDIR
# Change directory from "linuxlogin" to the "vii000XX" node
cd /tmp
# Get the list of assigned machines and start the mpd daemons
vsched -m
mpdboot -n $NMACHINES -f /tmp/machines
# Create a local directory on /tmp
# Copy files to local disk (vii000XX)
mpiexec -n $NPROCS $ROOTDIR/setup.sh
TMPDIR=/tmp/$USER
export TMPDIR
mv /tmp/machines $TMPDIR/machines
# Run the executable from local disk (vii000XX)
cd $TMPDIR
# Copy the "config_generator.sh" file
echo "copying $ROOTDIR/config_generator.sh $TMPDIR/config_generator.sh"
cp $ROOTDIR/config_generator.sh $TMPDIR/config_generator.sh
cp $ROOTDIR/colex $TMPDIR/colex
cp $ROOTDIR/linex $TMPDIR/linex
# Change directory from "linuxlogin" to the "vii000XX" node
cd /tmp/$USER
echo "Host name"
echo `hostname`
# Generate the config file
./config_generator.sh $NMACHINES
# Export the PJ package class path
export CLASSPATH=.:/tmp/kk386/pj.jar
#:/home/nfs/ctcfsrv11/m/$USER/Lab/pj.jar
# Run the JobScheduler in the background
echo "/usr/java/jre1.5.0_11/bin/java edu/rit/pj/cluster/JobScheduler scheduler.conf"
/usr/java/jre1.5.0_11/bin/java edu/rit/pj/cluster/JobScheduler scheduler.conf &
# Wait 10 seconds so that edu/rit/pj/cluster/JobScheduler starts up correctly
echo "Start 10-second sleep"
sleep 10
# Export the PJ package class path
export CLASSPATH=.:/tmp/kk386/pj.jar
# Run the CMainClusterProcess
echo "/usr/java/jre1.5.0_11/bin/java -Dpj.np=3 CMainClusterProcess pidList >& result.$tmphost.out"
/usr/java/jre1.5.0_11/bin/java -Dpj.np=3 CMainClusterProcess pidList >& result.$tmphost.out
# Copy output files to your output directory on the fileserver
# Delete all remaining files on /tmp/$USER
mpiexec -n $NPROCS $ROOTDIR/cleanup.sh
# Cancel all processes and nodes when all of the processes are finished
vsched -c
B.3 setup.sh
#!/bin/sh
# setup.sh
# @author Kwan Dong Kim
# Description of the script:
# This script copies the required files to each node (both frontend and backend)
# Remove all data created previously
rm -f -r /tmp/$USER
# Set up the TMP directory
TMPDIR=/tmp/$USER
# Make the TMP directory
mkdir $TMPDIR
echo $TMPDIR
export TMPDIR
# Make the root directory (-p avoids an error when a directory already exists)
mkdir -p $ROOTDIR $TMPDIR
echo "cp -r $ROOTDIR/ $TMPDIR/"
# Copy all needed data and programs to each cluster node (vii000XX)
cp $ROOTDIR/* $TMPDIR/
cp -r $ROOTDIR/edu $TMPDIR/
cp -r $ROOTDIR/pr $TMPDIR/
B.4 config_generator.sh
#!/bin/sh
# config_generator.sh
# @author Kwan Dong Kim
# Description of the script
# This script generates the configuration file for the job scheduler
# save variable java_path
JAVA_PATH="/usr/java/jre1.5.0_11/bin/java"
# save variable PJ_package_path
PJ_PATH="/tmp/$USER/pj.jar"
# save variable log file
LOG_PATH="/tmp/$USER/scheduler.log"
# save variable web host
WEB_HOST_PATH=".tc.cornell.edu"
# save number of cluster to use
Num_Cluster=$1
#Set up the variable index i
i=1
#Get the Frontend node by using vsched, colex and linex command
Frontend=`vsched -u v2Linuxdev | grep $USER | ./colex 1 | ./linex 1`
#Remove empty space in the variable Frontend
parsedFrontend=`echo "$Frontend" | tr -c '\012[a-zA-Z][0-9].\-_' '\n' | uniq`
echo -e "$parsedFrontend"
#Increase index i
i=`expr $i + 1`
# Save variable web host
WEB_HOST_PATH="$parsedFrontend.tc.cornell.edu"
#start write the config file
echo "#Parallel Java Job Scheduler Configuration file" > scheduler.conf
echo "#Frontend processor : $Frontend" >>scheduler.conf
echo "cluster v2linuxdev" >>scheduler.conf
echo "logfile $LOG_PATH" >>scheduler.conf
echo "webhost $WEB_HOST_PATH" >>scheduler.conf
echo "webport 8080" >>scheduler.conf
echo "schedulerhost localhost" >>scheduler.conf
echo "schedulerport 20617" >>scheduler.conf
echo "frontendhost $WEB_HOST_PATH">>scheduler.conf
#Start process Backend node
while [ $i -le $Num_Cluster ]
do
# Get the backend node by using the vsched, colex and linex commands
Backend=`vsched -u v2Linuxdev | grep $USER | ./colex 1 | ./linex $i`
# Remove empty space in the variable Backend
parsedBackend=`echo "$Backend" | tr -c '\012[a-zA-Z][0-9].\-_' '\n' | uniq`
# Write the backend node to the config file
echo "backend $parsedBackend $parsedBackend $JAVA_PATH $PJ_PATH">>scheduler.conf
#Increase index i
i=`expr $i + 1`
done
# Copy scheduler.conf file to Linux login machine
cp $TMPDIR/scheduler.conf $ROOTDIR/output/scheduler.conf
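The layout of the file that config_generator.sh writes can be sketched in Java as follows. This is only an illustration that mirrors the echo statements of the script line by line; the class SchedulerConfSketch, its build method, and the host names and paths used in main are assumptions made for this example and are not part of the project code.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper: reproduces the lines that config_generator.sh echoes into scheduler.conf.
class SchedulerConfSketch {

    // frontend and backends are bare node names; the domain suffix is the one used in the script.
    static List<String> build(String frontend, List<String> backends,
                              String javaPath, String pjPath, String logPath) {
        String webHost = frontend + ".tc.cornell.edu";
        List<String> lines = new ArrayList<String>();
        lines.add("#Parallel Java Job Scheduler Configuration file");
        lines.add("#Frontend processor : " + frontend);
        lines.add("cluster v2linuxdev");
        lines.add("logfile " + logPath);
        lines.add("webhost " + webHost);
        lines.add("webport 8080");
        lines.add("schedulerhost localhost");
        lines.add("schedulerport 20617");
        lines.add("frontendhost " + webHost);
        // One "backend" line per backend node, with the node name written twice as in the script.
        for (String b : backends) {
            lines.add("backend " + b + " " + b + " " + javaPath + " " + pjPath);
        }
        return lines;
    }

    public static void main(String[] args) {
        // Node names and the user directory below are made up for illustration.
        List<String> conf = build("vii00001", Arrays.asList("vii00002", "vii00003"),
                "/usr/java/jre1.5.0_11/bin/java", "/tmp/user/pj.jar", "/tmp/user/scheduler.log");
        for (String line : conf) {
            System.out.println(line);
        }
    }
}
```

Running main prints nine fixed lines followed by one backend line per backend node, which is the same order in which the script appends them.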
B.5 cleanup.sh
#!/bin/sh
# cleanup.sh
# @author Kwan Dong Kim
# Description of the script
# This script copies the final results, cleans up the nodes, and finishes all processes
# Copy all log and result files to the Linux login machine
cd $TMPDIR/
cp $TMPDIR/result.* $OUTDIR
cp $TMPDIR/*.log $OUTDIR
#Remove all data and program used in each node(Vii000XX)
rm -f -r /tmp/$USER
C. Java Codes
A. CCompressedMatrix.java
/**
* CCompressedMatrix.java
* Created in 2007. 02. 04
* @author Chang Min Kim
* netID : ck273
*/
/**
* Description of the Class
* This class represents a compressed sparse matrix (CSM).
* A CSM consists of three linked lists:
* the first holds, for each row, the index in the column list where that row begins;
* the second holds the column index of each stored cell;
* the third holds the value of each stored cell.
* For more information, refer to the documentation.
*/
package pr;
import java.util.*;
import Jama.Matrix;
public class CCompressedMatrix {
// LinkedList containing beginning cell number of colList for the row
private LinkedList<Integer> rowList;
// LinkedList containing column number of the corresponding row
private LinkedList<Integer> colList;
// LinkedList containing value of corresponding cell
private LinkedList<Double> valueList;
// These four variables only keep information about the original matrix, even if it has no nonzero values
// last number of column that actually contains value other than 0
private int lastCol;
// last number of row that actually contains value other than 0
private int lastRow;
// actual row dimension of this matrix including the rows with only 0's
private int rowDim;
// actual column dimension of this matrix including the columns with only 0's
private int colDim;
public CCompressedMatrix() {
// initialize object
rowList = new LinkedList<Integer>();
colList = new LinkedList<Integer>();
valueList = new LinkedList<Double>();
lastCol = -1;
lastRow = -1;
rowDim = 0;
colDim = 0;
}
public boolean compareWithMatrix(Matrix mt) {
// compare this object with matrix object
// If dimensions don't match, they are different
if ( this.rowDim != mt.getRowDimension() || this.colDim !=
mt.getColumnDimension() ) {
return false;
}
// cell by cell comparison
for ( int i=0; i< mt.getRowDimension() ; i++ ) {
for ( int j=0; j< mt.getColumnDimension() ; j++ ) {
if ( mt.get(i, j)!=getValueAt(i,j) ) {
System.out.println(i + " " + j);
return false;
}
}
}
return true;
}
public int getRowDimension() {
return rowDim;
}
public void setRowDimension(int i) {
rowDim = i;
}
public int getColDimension() {
return colDim;
}
public void setColDimension(int i) {
colDim = i;
}
public int getLastCol() {
return lastCol;
}
public void setLastCol(int i) {
lastCol = i;
}
public int getLastRow() {
return lastRow;
}
public void setLastRow(int i) {
lastRow = i;
}
public void setDims(int i, int j) {
this.rowDim = i;
this.colDim = j;
}
public void clear() {
// initialize this object
rowDim = 0;
colDim = 0;
lastRow = -1;
lastCol = -1;
rowList.clear();
colList.clear();
valueList.clear();
}
public boolean empty() {
// tells if compressedMatrix object has any element
if ( rowDim == 0 || lastRow == -1 || colDim == 0 || lastCol == -1 )
return true;
else
return false;
}
public boolean isEqualTo(CCompressedMatrix cm) {
// compares two compressedMatrix objects
// If dimensions don't match, they are different
if ( this.rowDim != cm.getRowDimension() || this.colDim !=
cm.getColDimension() ) {
return false;
}
// cell by cell comparison
for ( int i=0 ; i<rowDim ; i++ ) {
for ( int j=0; j<colDim; j++ ) {
if ( getValueAt(i,j) != cm.getValueAt(i, j) ) {
return false;
}
}
}
return true;
}
public LinkedList<Integer> getRowList() {
// returns rowList
return rowList;
}
public LinkedList<Integer> getColList() {
// returns colList
return colList;
}
public LinkedList<Double> getValueList() {
// returns valueList
return valueList;
}
// Add a cell element to the compressed matrix.
// Because CCompressedMatrix uses row-based compression,
// every column of a row must be processed before the row index increases.
// For data ordered by column first, simply add the elements and then transpose the result.
// For more information about the representation, refer to the documentation.
public void addElement(int i, int j, double k) {
// check whether the cell already holds a value
if ( getValueAt(i,j) == 0 ) {
// Increase the dimensions.
// This keeps as much information as possible about the original matrix
// (the number of pages) in case the last rows have no links to them.
if ( rowDim <= i ) {
rowDim = i+1;
}
if ( colDim <= j ) {
colDim = j+1;
}
// if the value is not 0, the compressed matrix stores the value for the cell
if ( k != 0.0 ) {
// beginning of a new row
if ( lastRow < i ) {
// If there were rows with no values before row i,
// the rows from lastRow+1 to i-1 are filled with -1
// to indicate that they are empty
for ( int rowLoc = lastRow+1; rowLoc<i; rowLoc++ ) {
rowList.add(new Integer(-1));
}
// the last row holding any value is now i
lastRow = i;
// the current size of the column list is the index at which
// the nonzero values of this row begin
rowList.add(new Integer(colList.size()));
// the column list holds the column index of the input cell
colList.add(new Integer(j));
// the value list holds the value of the given cell
valueList.add(new Double(k));
// lastCol is the largest column index whose cell has a nonzero value
if ( lastCol < j ) {
lastCol = j;
}
}
// a cell with the same row number was already inserted into the matrix
else if ( lastRow == i ) {
// rowList does not change; simply append the column and value of the cell to the lists
colList.add(new Integer(j));
valueList.add(new Double(k));
if ( lastCol < j ) {
lastCol = j;
}
} else {
System.out.println("You are trying to add an element which should be inserted earlier");
System.out.println("Check if you are trying to insert elements in sorted order as specified");
}
}
} else {
System.out.println("There is already existing value in that cell, try 'setValueAt(i,j,k)'");
}
}
public CCompressedMatrix multiply(CCompressedMatrix cm) {
// inner dimension should match to perform matrix multiplication
if ( this.colDim != cm.getRowDimension() ) {
System.out.println("Dimension mis-match for multiplication");
return null;
} else {
CCompressedMatrix resCM = new CCompressedMatrix();
for ( int i =0; i <= lastRow; i++ ) {
// for efficiency, iterate only up to the last nonzero column of cm and of this matrix
for ( int j = 0; j <= cm.getLastCol(); j++ ) {
double accumulator = 0.0;
for ( int k = 0; k <= lastCol; k++ ) {
accumulator += getValueAt(i,k)*
cm.getValueAt(k,j);
}
resCM.addElement(i, j, accumulator);
}
}
resCM.setColDimension(cm.getColDimension());
resCM.setRowDimension(this.rowDim);
return resCM;
}
}
// transpose this matrix
// after(i,j) = before(j,i)
public CCompressedMatrix transpose() {
CCompressedMatrix resCM = new CCompressedMatrix();
for ( int i=0; i <= lastCol; i++ ) {
for ( int j=0; j<= lastRow; j++ ) {
resCM.addElement(i, j, getValueAt(j,i));
}
}
resCM.setLastCol(this.lastRow);
resCM.setLastRow(this.lastCol);
resCM.setColDimension(this.rowDim);
resCM.setRowDimension(this.colDim);
return resCM;
}
// multiplies the values by scalar
public CCompressedMatrix ScalarMultiply(double x) {
CCompressedMatrix resCM = new CCompressedMatrix();
// multiply values by the scalar
for ( int i = 0 ; i < rowList.size(); i++ ) {
resCM.getRowList().add(new Integer(rowList.get(i).intValue()));
}
for ( int i = 0 ; i < valueList.size(); i++ ) {
resCM.getColList().add(new Integer(colList.get(i).intValue()));
resCM.getValueList().add(new Double(valueList.get(i).doubleValue()
* x));
}
resCM.setLastCol(this.lastCol);
resCM.setLastRow(this.lastRow);
resCM.setColDimension(this.colDim);
resCM.setRowDimension(this.rowDim);
return resCM;
}
public CCompressedMatrix plus(CCompressedMatrix cm) {
// dimension check
if ( this.colDim == cm.getColDimension() && this.rowDim ==
cm.getRowDimension() ) {
// proceed to the index of the matrix which has the larger last index, for efficiency
int lc, lr;
if ( lastCol < cm.getLastCol() ) {
lc = cm.getLastCol();
} else {
lc = lastCol;
}
if ( lastRow < cm.getLastRow() ) {
lr = cm.getLastRow();
} else {
lr = lastRow;
}
CCompressedMatrix resCM = new CCompressedMatrix();
for ( int i =0; i <= lr; i++ ) {
for ( int j=0; j<= lc; j++ ) {
resCM.addElement(i, j, getValueAt(i,j) +
cm.getValueAt(i, j));
}
}
resCM.setLastCol(lc);
resCM.setLastRow(lr);
resCM.setColDimension(this.colDim);
resCM.setRowDimension(this.rowDim);
return resCM;
} else {
System.out.println("Dimension mis-match for addition");
return null;
}
}
public CCompressedMatrix minus(CCompressedMatrix cm) {
// minus is the same as plus with the matrix multiplied by -1
return plus(cm.ScalarMultiply(-1));
}
// converts matrix to compressed matrix data structure
// This is not necessary for this project.
// Written only for convenience for experiments
public void compressFrom(Matrix mt) {
rowList.clear();
colList.clear();
valueList.clear();
for ( int i=0; i<mt.getRowDimension(); i++ ) {
for ( int j=0; j<mt.getColumnDimension(); j++ ) {
addElement(i,j,mt.get(i,j));
}
}
this.rowDim = mt.getRowDimension();
this.colDim = mt.getColumnDimension();
}
// Some vectors, such as the page vector, are represented as compressed matrices for convenient calculation.
// It is sometimes necessary to set the row dimension explicitly so that dimension information
// is not lost when the last rows have values of 0.
// This is due to the characteristics of the compressed representation of a sparse matrix.
// This is not necessary for our given data set.
public void toVerticalVectorInCompressedMatrixFormat(int i) {
rowDim = i;
colDim = 1;
lastCol = 0;
}
// sets value of the cell indexed by (i,j) in regular matrix with value k
public void setValueAt(int i, int j, double k) {
int begin;
int end;
boolean valFound = false;
// First, check whether cell (i,j) holds a nonzero value; if so, modify it, otherwise do nothing.
// This could be done more simply as follows:
// if (getValueAt(i,j) != 0)
// modify the corresponding valueList entry
// The longer form was written for clarity.
if ( i < rowList.size() - 1 ) {
begin = rowList.get(i).intValue();
end = rowList.get(i+1).intValue();
if ( begin != -1 && end != -1 ) {
for ( int l = begin; l < end && !valFound; l++ ) {
if ( colList.get(l).intValue() == j ) {
valueList.set(l, new Double(k));
valFound = true;
}
}
if ( !valFound ) {
System.out.println("There is no value for the cell, try addElement");
}
} else if ( begin == -1 ) {
System.out.println("There is no value for the cell, try addElement");
} else if ( end == -1 ) {
int idx = i+2;
while ( end == -1 && idx < rowList.size() ) {
end = rowList.get(idx).intValue();
idx++;
}
if ( end == -1 ) {
for ( int l = begin; l < colList.size() && !valFound;
l++ ) {
if ( colList.get(l).intValue() == j ) {
valueList.set(l, new
Double(k));
valFound = true;
}
}
if ( !valFound ) {
System.out.println("There is no value for the cell, try addElement");
}
} else {
for ( int l = begin; l < end && !valFound; l++ ) {
if ( colList.get(l).intValue() == j ) {
valueList.set(l, new
Double(k));
valFound = true;
}
}
if ( !valFound ) {
System.out.println("There is no value for the cell, try addElement");
}
}
}
} else if ( i == rowList.size() - 1 ) {
begin = rowList.get(i).intValue();
end = colList.size();
if ( begin == -1 ) {
System.out.println("There is no value for the cell, try addElement");
} else {
for ( int l = begin; l < end && !valFound; l++ ) {
if ( colList.get(l).intValue() == j ) {
valueList.set(l, new Double(k));
valFound = true;
}
}
if ( !valFound ) {
System.out.println("There is no value for the cell, try addElement");
}
}
} else {
System.out.println("There is no value for the cell, try addElement");
}
}
// get value of the cell indexed by (i,j) in regular matrix
public double getValueAt(int i, int j) {
int begin;
int end;
// find the colList range for the given row i first, then look for j in colList within that range
// if i and j can be located in data structure, there is non 0 value for the cell (i,j)
// otherwise return 0
if ( i < rowList.size() - 1 ) {
begin = rowList.get(i).intValue();
end = rowList.get(i+1).intValue();
if ( begin != -1 && end != -1 ) {
for ( int k = begin; k < end; k++ ) {
if ( colList.get(k).intValue() == j ) {
return valueList.get(k).doubleValue();
}
}
return 0;
} else if ( begin == -1 ) {
return 0;
} else if ( end == -1 ) {
int idx = i+2;
while ( end == -1 && idx < rowList.size() ) {
end = rowList.get(idx).intValue();
idx++;
}
if ( end == -1 ) {
for ( int k = begin; k < colList.size(); k++ ) {
if ( colList.get(k).intValue() == j ) {
return
valueList.get(k).doubleValue();
}
}
return 0;
} else {
for ( int k = begin; k < end; k++ ) {
if ( colList.get(k).intValue() == j ) {
return
valueList.get(k).doubleValue();
}
}
return 0;
}
}
} else if ( i == rowList.size() - 1 ) {
begin = rowList.get(i).intValue();
end = colList.size();
if ( begin == -1 ) {
return 0;
} else {
for ( int k = begin; k < end; k++ ) {
if ( colList.get(k).intValue() == j ) {
return valueList.get(k).doubleValue();
}
}
return 0;
}
} else {
return 0;
}
return 0;
}
// for convenience, get value from given compressed matrix other than 'this' matrix
public double getValueAt(int i, int j, CCompressedMatrix cm) {
int begin;
int end;
if ( i < cm.getRowList().size() - 1 ) {
begin = cm.getRowList().get(i).intValue();
end = cm.getRowList().get(i+1).intValue();
if ( begin != -1 && end != -1 ) {
for ( int k = begin; k < end; k++ ) {
if ( cm.getColList().get(k).intValue() == j ) {
return
cm.getValueList().get(k).doubleValue();
}
}
return 0;
} else if ( begin == -1 ) {
return 0;
} else if ( end == -1 ) {
int idx = i+2;
while ( end == -1 && idx < cm.getRowList().size() ) {
end = cm.getRowList().get(idx).intValue();
idx++;
}
if ( end == -1 ) {
for ( int k = begin; k < cm.getColList().size(); k++ )
{
if ( cm.getColList().get(k).intValue() ==
j){
return
cm.getValueList().get(k).doubleValue();
}
}
return 0;
} else {
for ( int k = begin; k < end; k++ ) {
if ( cm.getColList().get(k).intValue() ==
j){
return
cm.getValueList().get(k).doubleValue();
}
}
return 0;
}
}
} else if ( i == cm.getRowList().size() - 1 ) {
begin = cm.getRowList().get(i).intValue();
end = cm.getColList().size();
if ( begin == -1 ) {
return 0;
} else {
for ( int k = begin; k < end; k++ ) {
if ( cm.getColList().get(k).intValue() == j ) {
return
cm.getValueList().get(k).doubleValue();
}
}
return 0;
}
} else {
return 0;
}
return 0;
}
public CCompressedMatrix cutRow(int i, int j){
CCompressedMatrix resCM = new CCompressedMatrix();
for ( int k = i ; k < j ; k++ ){
for ( int l = 0 ; l < colDim; l++){
resCM.addElement(k-i, l, getValueAt(k, l));
}
}
resCM.setRowDimension(j-i);
resCM.setColDimension(colDim);
return resCM;
}
}
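The row-based compression used by CCompressedMatrix can also be sketched with plain arrays instead of linked lists. The sketch below is an illustration only and is not part of the project code: the class name CsrSketch and its methods are hypothetical, the input triples are assumed to be sorted by row, and an empty row is represented by two equal row offsets rather than by the -1 marker used above. With arrays, a row lookup does not have to walk a linked list, and a matrix-vector product touches only the stored values.

```java
// Hypothetical array-backed variant of the row-compressed layout; not part of the pr package.
class CsrSketch {
    final int rows;
    final int cols;
    final int[] rowStart;   // length rows + 1; row i occupies [rowStart[i], rowStart[i + 1])
    final int[] colIndex;   // column index of each stored cell
    final double[] value;   // value of each stored cell

    // Builds the structure from (row, column, value) triples that are sorted by row.
    CsrSketch(int rows, int cols, int[] r, int[] c, double[] v) {
        this.rows = rows;
        this.cols = cols;
        rowStart = new int[rows + 1];
        for (int x : r) {
            rowStart[x + 1]++;                 // count stored cells per row
        }
        for (int i = 0; i < rows; i++) {
            rowStart[i + 1] += rowStart[i];    // prefix sums give the row offsets
        }
        colIndex = c.clone();
        value = v.clone();
    }

    // Returns the value at (i, j), or 0 if the cell is not stored.
    double get(int i, int j) {
        for (int k = rowStart[i]; k < rowStart[i + 1]; k++) {
            if (colIndex[k] == j) {
                return value[k];
            }
        }
        return 0.0;
    }

    // Computes y = M x using only the stored cells.
    double[] times(double[] x) {
        double[] y = new double[rows];
        for (int i = 0; i < rows; i++) {
            double acc = 0.0;
            for (int k = rowStart[i]; k < rowStart[i + 1]; k++) {
                acc += value[k] * x[colIndex[k]];
            }
            y[i] = acc;
        }
        return y;
    }

    public static void main(String[] args) {
        // 3x3 example with an empty middle row: (0,1)=2, (2,0)=3, (2,2)=4
        CsrSketch m = new CsrSketch(3, 3,
                new int[] {0, 2, 2}, new int[] {1, 0, 2}, new double[] {2.0, 3.0, 4.0});
        double[] y = m.times(new double[] {1.0, 1.0, 1.0});
        System.out.println(y[0] + " " + y[1] + " " + y[2]);
    }
}
```

The trade-off is that this layout is built once from sorted input and is not meant to be extended element by element, whereas addElement above grows the lists incrementally.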
B. CMainProcess.java
/**
* CMainProcess.java
* Created in 2007. 01. 30
* @author Chang Min Kim
* netID : ck273
* email : ck273@cs.cornell.edu
* Master of Engineering
* Computer Science at Cornell University
*/
/**
* Description of the Class
*
*/
/**
* Description of the Variables
*
*/
package pr;
import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.StringReader;
import java.net.MalformedURLException;
import java.net.URL;
import java.util.LinkedList;
import java.util.StringTokenizer;
import java.io.BufferedWriter;
import java.io.FileWriter;
import Jama.Matrix;
public class CMainProcess implements HTMLHandler {
final double DAMP = 0.80;
LinkedList<String> urlList;
Matrix linkMatrix;
Matrix rank;
String perDoc;
boolean title = false;
boolean link = false;
boolean mail = false;
boolean img = false;
String linkRec = "";
String fileName;
CCompressedMatrix compMatrix;
int siteSize;
public CMainProcess(String fn) {
siteSize = 130;
compMatrix = new CCompressedMatrix();
//fileName = fn;
fileName = "test4.txt";
urlList = new LinkedList<String>();
perDoc = "";
linkRec = "";
importURL();
System.out.println("Canonicalizing Sites");
System.out.println("This may take a while depending on network connection\n");
canonicalize();
//load test data
//importTest();
CCompressedMatrix cmTest = new CCompressedMatrix();
cmTest.compressFrom(linkMatrix);
if ( compMatrix.isEqualTo(cmTest)){
System.out.println("test matrix and compressed matrix are identical\n");
}
else{
System.out.println("test matrix and compressed matrix are *** not *** identical\n");
}
if ( compMatrix.compareWithMatrix(linkMatrix)){
System.out.println("Link matrix and compressed matrix are identical\n");
}
else{
System.out.println("Link matrix and compressed matrix are *** not *** identical\n");
}
//outputLinkMatrix();
//outputPIDList();
calc_pageRank();
calc_pageRank_compMatrix();
}
public void importTest() {
linkMatrix = new Matrix(5,5);
rank = new Matrix(5,1);
try {
BufferedReader in = new BufferedReader(new FileReader("list.txt"));
String inStr;
while( (inStr = in.readLine()) != null) {
StringTokenizer st = new StringTokenizer(inStr);
int from = Integer.parseInt(st.nextToken());
int to = Integer.parseInt(st.nextToken());
linkMatrix.set(from, to, 1.0d);
compMatrix.addElement(from, to, 1.0d);
}
compMatrix.setDims(5, 5);
}
catch(IOException e){
}
}
public void importURL() {
// import test4.txt
try {
BufferedReader in = new BufferedReader(new
FileReader(fileName));
String inStr;
while( (inStr = in.readLine()) != null) {
// trim first so that the trailing-slash test and the duplicate check use the stored form
inStr = inStr.trim();
if(inStr.endsWith("/")) {
inStr += "index.html";
}
if(!urlList.contains(inStr)) {
urlList.add(inStr);
}
}
}
catch(IOException e){
}
linkMatrix = new Matrix(urlList.size(),urlList.size());
rank = new Matrix(urlList.size(),1);
}
public void canonicalize() {
for(int i =0; i<urlList.size(); i++) {
perDoc = "";
title = false;
link = false;
img = false;
String content = "";
URL url = null;
String tmp = urlList.get(i);
tmp = tmp.trim();
if (tmp.endsWith("/")) {
System.out.println("bad form of url");
}
try {
url = new URL(tmp);
}
catch ( MalformedURLException m) {
System.out.println("Illegal Format of URL");
// skip this entry; url is still null and cannot be used below
continue;
}
perDoc += url.toString() + "\n";
try {
// Read all the text returned by the server
InputStreamReader sr = new InputStreamReader(url.openStream());
BufferedReader in = new BufferedReader(sr);
String str;
String contentToBeRevised = "";
while ((str = in.readLine()) != null) {
contentToBeRevised += str.trim() + " ";
}
int count = 0;
while( count < contentToBeRevised.length()-5) {
if ( contentToBeRevised.charAt(count) == '=' ) {
if(contentToBeRevised.charAt(count-1) == ' ') {
contentToBeRevised =
contentToBeRevised.substring(0, count-1) +
contentToBeRevised.substring(count);
count--;
}
if(contentToBeRevised.charAt(count+1) == ' ') {
contentToBeRevised =
contentToBeRevised.substring(0, count+1) +
contentToBeRevised.substring(count+2);
}
}
count++;
}
content = contentToBeRevised;
sr.close();
in.close();
} catch (MalformedURLException e) {
} catch (IOException e) {
}
// String cacheSite = "cache_" + i + ".txt";
// try {
// BufferedWriter caWriter = new BufferedWriter(new FileWriter(cacheSite));
// caWriter.write(content);
// caWriter.close();
// } catch (IOException e1) {
// e1.printStackTrace();
// }
try {
//System.out.println(i+"th site");
if ( (i % 25 == 0) && ( i !=0) ) {
System.out.println(( (int)(((float)i/urlList.size())*100)) + "% done");
}
HTMLParserFactory parserFactory = HTMLParserFactory.getInstance();
HTMLParser saxParser = parserFactory.getNewSAXHtmlParser();
saxParser.parse(content, this);
}
catch (Exception e) {
e.printStackTrace();
}
// analyze perDoc string to get info
BufferedReader br = new BufferedReader(new
StringReader(perDoc));
String lineString = "";
try {
while((lineString = br.readLine())!= null) {
if( lineString.trim().equalsIgnoreCase("<title>")) {
title = true;
link = false;
img = false;
}
else
if( lineString.trim().equalsIgnoreCase("</title>")) {
title = false;
link = false;
img = false;
}
else
if( lineString.trim().equalsIgnoreCase("<a>")) {
link = true;
img = false;
title = false;
}
else
if( lineString.trim().equalsIgnoreCase("</a>")) {
link = false;
title = false;
img = false;
}
else
if( lineString.trim().equalsIgnoreCase("<img>")) {
img = true;
title = false;
}
else if( (title==true) && (link==false) &&
(img==false)) {
title = false;
}
else if( (title==false) && (link==true) &&
(img==false)) {
StringTokenizer toc = new
StringTokenizer(lineString);
String t = toc.nextToken();
if ( t.trim().equalsIgnoreCase("href")) {
t = toc.nextToken();
linkRec += t.trim()+ "\t";
}
else {
linkRec += lineString.trim() +
"\n";
link = false;
img = false;
}
}
else if ((title==false) && (link==true) &&
(img==true)) {
StringTokenizer toc = new
StringTokenizer(lineString);
String t = toc.nextToken();
if ( t.trim().equalsIgnoreCase("alt")) {
while(toc.hasMoreTokens())
{
t =
toc.nextToken();
linkRec += t.trim()
+ " ";
}
linkRec += "\n";
img = false;
link = false;
}
}
}
} catch (IOException e) {
e.printStackTrace();
}
//filling up link matrix
BufferedReader links = new BufferedReader(new
StringReader(linkRec));
String lnk = "";
try {
while( (lnk = links.readLine()) != null) {
StringTokenizer st = new StringTokenizer(lnk);
String add = st.nextToken();
if(add.endsWith("/")) {
add += "index.html";
}
if(add.startsWith("mailto")) {
}
else if(add.startsWith("file")) {
}
else if(add.startsWith("http")) {
if ( add.contains("#")) {
int ix = add.indexOf("#");
add = add.substring(0, ix);
}
int idx = urlList.indexOf(add.trim());
if(idx != -1) {
linkMatrix.set(i, idx, 1.0);
if (compMatrix.getValueAt(i,
idx) == 0.0 ){
compMatrix.addElement(i, idx, 1.0);
}
// the rest is the value of the hyperlink; put the tokens together appropriately
String at = "";
while(st.hasMoreTokens()) {
String part = st.nextToken();
if ( part.startsWith("(")) {
part = part.substring(1);
}
if ( part.startsWith("\"")) {
part = part.substring(1);
}
if ( part.startsWith("\'")) {
part = part.substring(1);
}
if ( part.endsWith(")")) {
part = part.substring(0, part.length()-1);
}
if ( part.endsWith("\"")) {
part = part.substring(0, part.length()-1);
}
if ( part.endsWith("\'")) {
part = part.substring(0, part.length()-1);
}
if ( part.endsWith(",")) {
part = part.substring(0, part.length()-1);
}
if ( part.endsWith(".")) {
part = part.substring(0, part.length()-1);
}
if ( part.endsWith("?")) {
part = part.substring(0, part.length()-1);
}
if ( part.endsWith(":")) {
part = part.substring(0, part.length()-1);
}
if ( part.endsWith(";")) {
part = part.substring(0, part.length()-1);
}
at += part + " ";
}
}
}
else if(add.startsWith("../")) {
if ( add.contains("#")) {
int ix = add.indexOf("#");
add = add.substring(0, ix);
}
int ccc = 0;
while( add.substring(3).startsWith("../")) {
add = add.substring(3);
ccc++;
}
String [] el = url.toString().split("/");
String site = "";
for(int x=0; x<el.length-1-ccc; x++) {
site += el[x].trim() + "/";
}
add = site + add;
int idx = urlList.indexOf(add.trim());
if (idx != -1) {
linkMatrix.set(i, idx, 1.0);
if (compMatrix.getValueAt(i,
idx) == 0.0 ){
compMatrix.addElement(i, idx, 1.0);
}
// the rest is the value of the hyperlink; put the tokens together appropriately
String at = "";
while(st.hasMoreTokens()) {
String part = st.nextToken();
if ( part.startsWith("(")) {
part = part.substring(1);
}
if ( part.startsWith("\"")) {
part = part.substring(1);
}
if ( part.startsWith("\'")) {
part = part.substring(1);
}
if ( part.endsWith(")")) {
part = part.substring(0, part.length()-1);
}
if ( part.endsWith("\"")) {
part = part.substring(0, part.length()-1);
}
if ( part.endsWith("\'")) {
part = part.substring(0, part.length()-1);
}
if ( part.endsWith(",")) {
part = part.substring(0, part.length()-1);
}
if ( part.endsWith(".")) {
part = part.substring(0, part.length()-1);
}
if ( part.endsWith("?")) {
part = part.substring(0, part.length()-1);
}
if ( part.endsWith(":")) {
part = part.substring(0, part.length()-1);
}
if ( part.endsWith(";")) {
part = part.substring(0, part.length()-1);
}
at += part + " ";
}
}
}
else {
if ( add.contains("#")) {
int ix = add.indexOf("#");
add = add.substring(0, ix);
}
String [] el = url.toString().split("/");
String site = "";
for(int x=0; x<el.length-1; x++) {
site += el[x].trim() + "/";
}
add = site + add;
int idx = urlList.indexOf(add.trim());
if (idx != -1) {
linkMatrix.set(i, idx, 1.0);
if (compMatrix.getValueAt(i,
idx) == 0.0 ){
compMatrix.addElement(i, idx, 1.0);
}
// the rest is the value of the hyperlink; put the tokens together appropriately
String at = "";
while(st.hasMoreTokens()) {
String part = st.nextToken();
if ( part.startsWith("(")) {
part = part.substring(1);
}
if ( part.startsWith("\"")) {
part = part.substring(1);
}
if ( part.startsWith("\'")) {
part = part.substring(1);
}
if ( part.endsWith(")")) {
part = part.substring(0, part.length()-1);
}
if ( part.endsWith("\"")) {
part = part.substring(0, part.length()-1);
}
if ( part.endsWith("\'")) {
part = part.substring(0, part.length()-1);
}
if ( part.endsWith(",")) {
part = part.substring(0, part.length()-1);
}
if ( part.endsWith(".")) {
part = part.substring(0, part.length()-1);
}
if ( part.endsWith("?")) {
part = part.substring(0, part.length()-1);
}
if ( part.endsWith(":")) {
part = part.substring(0, part.length()-1);
}
if ( part.endsWith(";")) {
part = part.substring(0, part.length()-1);
}
at += part + " ";
}
}
}
}
} catch (IOException e) {
e.printStackTrace();
}
// initialize perDoc and content
content = "";
perDoc = "";
linkRec = "";
}
System.out.println();
compMatrix.setDims(siteSize, siteSize);
}
public void startElement(String pElementName, HTMLAttributeList pAttrList)
{
perDoc += "<" + pElementName + ">" + "\n";
if(pAttrList != null)
for(int i=0 ; i< pAttrList.size(); i++)
{
HTMLAttribute attribute = (HTMLAttribute) pAttrList.get(i);
perDoc += "\t" + attribute.getAttributeName() + "\t" + attribute.getAttributeValue() + "\n";
}
}
public void endElement(String pElementName)
{
perDoc += "</"+ pElementName + ">" + "\n";
}
public void elementValue(String pElementValue)
{
perDoc += "\t" + pElementValue + "\n";
}
public void startDocument()
{
}
public void endDocument()
{
}
public void outputLinkMatrix() {
String fName = "lMatrix";
try {
BufferedWriter caWriter = new BufferedWriter(new
FileWriter(fName));
for(int i=0; i<linkMatrix.getRowDimension(); i++) {
for(int j=0; j<linkMatrix.getColumnDimension(); j++) {
caWriter.write(linkMatrix.get(i, j) + "\t");
}
caWriter.write("\n");
}
caWriter.close();
} catch (IOException e1) {
e1.printStackTrace();
}
}
public void outputPIDList() {
try {
BufferedWriter caWriter = new BufferedWriter(new
FileWriter("pidList"));
for(int i=0 ; i<linkMatrix.getRowDimension(); i++) {
for(int j=0 ; j<linkMatrix.getColumnDimension(); j++) {
if(linkMatrix.get(i, j) != 0.0 ) {
caWriter.write(i + "\t" + j + "\n");
}
}
}
caWriter.close();
} catch (IOException e) {
e.printStackTrace();
}
}
void calc_pageRank() {
// linkMatrix
double [] dangle = new double[linkMatrix.getRowDimension()];
for(int i=0; i<linkMatrix.getRowDimension();i++) {
dangle[i] = 0.0d;
}
for(int i =0; i<linkMatrix.getRowDimension(); i++) {
for(int j=0 ; j<linkMatrix.getRowDimension(); j++) {
if(linkMatrix.get(j, i) == 1.0) {
dangle[i] += 1.0d;
}
}
}
//normalize
for(int i =0; i<linkMatrix.getRowDimension(); i++) {
if(dangle[i] != 0.0) {
for(int j=0 ; j<linkMatrix.getRowDimension(); j++) {
linkMatrix.set(j, i, (double)linkMatrix.get(j,i)/dangle[i]);
}
}
}
// construct a, e matrix
Matrix e = new Matrix(linkMatrix.getRowDimension(),1);
Matrix a = new Matrix(linkMatrix.getRowDimension(),1);
for(int i=0; i<linkMatrix.getRowDimension(); i++) {
e.set(i,0, 1.0d);
if(dangle[i] == 0.0) {
a.set(i, 0, 1.0d);
}
else {
a.set(i, 0, 0.0d);
}
}
// calc rank
// p_k = d H p_{k-1} + e (d a' p_{k-1} + 1 - d)(1/n)
// p_k = d H p_{k-1} + (d/n)(e a') p_{k-1} + (1-d)(1/n) e
// p_k = d (H + (1/n) e a') p_{k-1} + ((1-d)/n) e
Matrix wBefore = new Matrix(linkMatrix.getRowDimension(),1);
for(int g=0; g<linkMatrix.getRowDimension(); g++) {
wBefore.set(g,0,((double)1.0/linkMatrix.getRowDimension()));
}
Matrix wCurrent = new Matrix(linkMatrix.getRowDimension(),1);
Matrix tmp1;
Matrix tmp2 = new Matrix(linkMatrix.getRowDimension(),1);
boolean conv = false;
while(!conv) {
conv = true;
tmp1 = linkMatrix.plus(e.times(a.transpose()).times(1.0d/linkMatrix.getRowDimension()))
.times(DAMP).times(wBefore);
tmp2 = e.times((1.0d-DAMP)/linkMatrix.getRowDimension());
wCurrent = tmp1.plus(tmp2);
for(int z=0; z<wBefore.getRowDimension(); z++) {
if ( Math.abs(wBefore.get(z,0)- wCurrent.get(z,0)) != 0.0 ) {
conv = false;
}
}
wBefore = wCurrent;
}
rank = wCurrent;
double sum = 0.0;
for(int i=0; i<rank.getRowDimension(); i++) {
sum += rank.get(i,0);
}
System.out.println("Sum of pageRank from matrix is " + sum + "\n");
}
void calc_pageRank_compMatrix(){
// record dangling pages by counting outlinks
double [] dangle = new double[siteSize];
for(int i=0; i<siteSize;i++) {
dangle[i] = 0.0d;
}
for(int i =0; i<siteSize; i++) {
for(int j=0 ; j<siteSize; j++) {
if(compMatrix.getValueAt(j, i) == 1.0 ) {
dangle[i] += 1.0d;
}
}
}
//normalize by dividing each column by outlink count
for(int i =0; i<siteSize; i++) {
if(dangle[i] != 0.0) {
for(int j=0 ; j<siteSize; j++) {
if(compMatrix.getValueAt(j,i) !=0) {
compMatrix.setValueAt(j, i,
(double)compMatrix.getValueAt(j,i)/dangle[i]);
}
}
}
}
// construct a, e matrix
CCompressedMatrix e = new CCompressedMatrix();
CCompressedMatrix a = new CCompressedMatrix();
for(int i=0; i<siteSize; i++) {
e.addElement(i, 0, 1.0d);
if(dangle[i] == 0.0) {
a.addElement(i, 0, 1.0d);
}
else {
a.addElement(i, 0, 0.0d);
}
}
e.toVerticalVectorInCompressedMatrixFormat(siteSize);
a.toVerticalVectorInCompressedMatrixFormat(siteSize);
// calculate PageRank
// p_k = d (H + (1/n) e a') p_{k-1} + ((1-d)/n) e
// initial column vector in compressed matrix form; every element is 1/(number of pages)
CCompressedMatrix wBefore = new CCompressedMatrix();
for(int g=0; g<siteSize; g++) {
wBefore.addElement(g,0,(double)1.0/siteSize);
}
CCompressedMatrix wCurrent = new CCompressedMatrix();
wCurrent.setDims(siteSize, siteSize);
CCompressedMatrix tmp1;
CCompressedMatrix tmp2 = new CCompressedMatrix();
boolean conv = false;
while(!conv) {
conv = true;
tmp1 = compMatrix.plus(e.multiply(a.transpose()).ScalarMultiply(1.0d/siteSize)).ScalarMultiply(DAMP).multiply(wBefore);
tmp2 = e.ScalarMultiply((1.0d-DAMP)/siteSize);
wCurrent = tmp1.plus(tmp2);
for(int z=0; z<siteSize; z++) {
if ( Math.abs(wBefore.getValueAt(z,0)- wCurrent.getValueAt(z,0)) !=
0.0 ) {
conv = false;
}
}
wBefore = wCurrent;
}
double sum = 0;
for(int i=0; i<wCurrent.getRowDimension(); i++) {
sum += wCurrent.getValueAt(i, 0);
}
System.out.println("sum of pageRank from compressedMatrix is : " + sum);
}
}
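The iteration p_k = d (H + (1/n) e a') p_(k-1) + ((1-d)/n) e used in both methods above can also be evaluated without building the dense n-by-n matrix e a'. Since (1/n) e a' p equals the total rank held by dangling pages divided by n, added to every entry, only the nonzero links need to be visited. The following is a minimal sketch under that observation; the class name PageRankSketch, the method pageRank, and the {from, to} pair layout are illustrative assumptions and are not part of the original project. It assumes each link appears at most once.

```java
import java.util.Arrays;

/** Hypothetical illustration class; not part of the original listings. */
final class PageRankSketch {
    static final double DAMP = 0.65;       // same damping factor as the listings
    static final double TOLERANCE = 1e-10; // same tolerance as CMainClusterProcess

    /**
     * Computes PageRank for n pages numbered 0..n-1.
     * links[k] = {from, to} means page 'from' links to page 'to' (assumed layout).
     */
    static double[] pageRank(int n, int[][] links) {
        // count outlinks per page; a page with zero outlinks is dangling
        int[] out = new int[n];
        for (int[] link : links) {
            out[link[0]]++;
        }
        // start from the uniform vector 1/n
        double[] p = new double[n];
        Arrays.fill(p, 1.0 / n);
        while (true) {
            // a' p : total rank currently held by dangling pages
            double danglingMass = 0.0;
            for (int i = 0; i < n; i++) {
                if (out[i] == 0) {
                    danglingMass += p[i];
                }
            }
            // every page receives d * (a' p) / n + (1 - d) / n
            double[] next = new double[n];
            Arrays.fill(next, DAMP * danglingMass / n + (1.0 - DAMP) / n);
            // d * H * p, visiting only the nonzero entries of H
            for (int[] link : links) {
                next[link[1]] += DAMP * p[link[0]] / out[link[0]];
            }
            // converged when no entry moved by TOLERANCE or more
            double maxDiff = 0.0;
            for (int i = 0; i < n; i++) {
                maxDiff = Math.max(maxDiff, Math.abs(next[i] - p[i]));
            }
            p = next;
            if (maxDiff < TOLERANCE) {
                return p;
            }
        }
    }

    public static void main(String[] args) {
        // three pages: 0 -> 1, 0 -> 2, 1 -> 2; page 2 is dangling
        double[] r = pageRank(3, new int[][] {{0, 1}, {0, 2}, {1, 2}});
        System.out.println(Arrays.toString(r));
    }
}
```

Each pass of this sketch costs time proportional to the number of links plus the number of pages, and the entries of the result sum to 1 up to rounding, which matches the sum check printed by the methods above.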
C. CMainClusterProcess.java
/**
* CMainClusterProcess.java
* Created in 2007. 03. 25
* @author Chang Min Kim
* netID : ck273
*/
/**
* Description of the Class
* This class calculates the PageRank of the pages listed in the file "pidList",
* using a compressed sparse matrix representation and the Google PageRank algorithm
* in a clustered environment using the Parallel Java middleware system.
*/
package pr;
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.FileNotFoundException;
import java.io.FileReader;
import java.io.FileWriter;
import java.io.IOException;
import java.util.StringTokenizer;
import edu.rit.mp.buf.*;
import edu.rit.pj.Comm;
public class CMainClusterProcess {
// Prevent construction. Constructor is not necessary
private CMainClusterProcess() {
}
// Shared variables for clustered processes.
// World communicator.
static Comm world; // world of clustered computers
static int size; // number of clustered computers
static int rank; // rank of individual computer in the world
// total count of links in each column
static double[] total_count;
// compressed link matrix
static CCompressedMatrix comp;
// PageRank linkedlist
// the size of page should be siteSize
static CCompressedMatrix page;
static int siteSize; // number of total site count
static int siteSize_per_node ; // number of sites each node should process except master
static int siteSize_master; // number of sites the master should process
static final double DAMP = 0.65; // damping factor d of the PageRank algorithm
static final double TOLERANCE = 0.0000000001; // convergence tolerance for each iteration
/**
* Main program of PageRank calculation.
*/
public static void main(String[] args) throws Throwable{
//Initialize world communicator.
Comm.init (args);
world = Comm.world();
size = world.size();
siteSize = -1;
rank = world.rank();
int slave_size = size - 1;
if ( rank == 0 ) {
System.out.println("##############################################################");
System.out.println(" PageRank Calculation using Sparse Matrix Compression");
System.out.println(" in clustered computer environment with Message Passing");
System.out.println("##############################################################");
// time checker
long t0 = System.currentTimeMillis();
System.out.println(rank + " : Starting time at master node is " + t0);
}
String infile = null, outfile = null;
// Parse command line arguments for main process.
if ( args.length != 1 ) usage1();
infile = args[0];
outfile = "result";
// First, all nodes read the pid:pid input file and construct the compressed sparse matrix.
// This is more efficient because it reduces the number of messages passed.
// The master process takes care of the remainder of the site list.
// For example, if there are 201 pages and 3 nodes in the cluster,
// the master takes the first page and the other two nodes
// take 100 pages each, in order, starting from the second page.
// The master's portion is typically much smaller than a slave's portion.
// This leaves the master more time for its roles during message passing.
BufferedReader br = new BufferedReader(new FileReader(infile));
String lastLine = null;
String line = null;
while ( (line=br.readLine())!=null ) {
lastLine = line;
StringTokenizer st = new StringTokenizer(lastLine);
int index = Integer.parseInt(st.nextToken());
// Given the structure of pidList, sorted by (from, to) link pair,
// the first token of the last line in the file is the highest-numbered page in the closure.
// Index begins with 0
siteSize = index + 1;
}
br.close(); // release the file handle; readInputFile opens the file again later
// Now we know how many pages we should process (= siteSize),
// so assign the appropriate number of sites to each process.
// Note: slave_size is size - 1, so this requires at least two processes.
siteSize_master = siteSize % slave_size; // number of sites the master process loads
siteSize_per_node = siteSize / slave_size; // number of sites each slave process loads
// The array dangle holds the number of links in the compressed sparse matrix, column by column.
// Initialize it first so that it can be passed by the MP protocol.
// If it is not initialized here, MP will not work as specified by the package.
double [] dangle = new double[siteSize];
for ( int i=0; i<siteSize;i++ ) {
dangle[i] = 0.0d;
}
// now nodes set up page vector
page = new CCompressedMatrix();
for ( int i=0; i< siteSize; i++ ) {
page.addElement(i, 0, 1.0d/(double)siteSize);
}
// now all nodes read input file and records links in comp
readInputFile(infile);
// all nodes calculate total number of links column by column
for ( int i =0; i<siteSize; i++ ) {
for ( int j=0 ; j<siteSize; j++ ) {
if ( comp.getValueAt(j, i) == 1.0 ) {
dangle[i] += 1.0d;
}
}
}
// master collects all the dangle arrays
// each receive buffer gets its own backing array so the gathered counts do not overwrite each other
DoubleArrayBuf [] gatheredSize = new DoubleArrayBuf[size];
for ( int i =0 ; i<size; i++) {
gatheredSize[i] = DoubleArrayBuf.buffer(new double[siteSize]);
}
DoubleArrayBuf sizeArray = DoubleArrayBuf.buffer(dangle);
world.gather(0, sizeArray, gatheredSize);
// total count will hold the total number of links for each page
total_count = new double[siteSize];
// master aggregates the counts by summing the per-node counts for each column
if ( rank==0 ) {
for ( int i = 0; i < siteSize; i++ ) {
for ( int j = 0; j < gatheredSize.length; j++ ) {
total_count[i] += gatheredSize[j].get(i);
}
}
}
// now master broadcasts aggregated size to each process
DoubleArrayBuf bf = DoubleArrayBuf.buffer(total_count);
world.broadcast(0, bf);
for ( int i=0 ; i<bf.length(); i++ ) {
dangle[i] = bf.get(i);
}
// These should be declared inside of pagerank calculation
// but due to the restriction of parallel java that every process should call
// the "gather" and "broadcast" to make them work correctly,
// these were defined here intentionally
// temporary matrices for calculation
CCompressedMatrix tmp1;
CCompressedMatrix tmp2 = new CCompressedMatrix();
// e is a column vector with every element 1
CCompressedMatrix e = new CCompressedMatrix();
// a is a column vector. An element is 1 if that page is a dangling page, otherwise 0
CCompressedMatrix a = new CCompressedMatrix();
// wBefore is pagerank for each page before each iteration
CCompressedMatrix wBefore = new CCompressedMatrix();
// Because after an iteration each node holds only the pagerank elements of the pages it loaded,
// we save the node's previous result and compare the new iteration result with it
// to find out whether they have converged.
CCompressedMatrix wBefore_node = new CCompressedMatrix();
// wCurrent is pagerank for each page after each iteration
CCompressedMatrix wCurrent = new CCompressedMatrix();
// These two matrices (vectors) are compared after every iteration.
// If the difference is smaller than the given tolerance,
// we consider the PageRank vector to have converged.
// normalize by dividing each column by its outlink count
for ( int i =0; i<siteSize; i++ ) {
if ( dangle[i] != 0.0 ) {
for ( int j=0 ; j<siteSize; j++ ) {
if ( comp.getValueAt(j,i) !=0 ) {
comp.setValueAt(j, i,
(double)comp.getValueAt(j,i)/dangle[i]);
}
}
}
}
// construct a, e matrix
for ( int i=0; i<siteSize; i++ ) {
e.addElement(i, 0, 1.0d);
if ( dangle[i] == 0.0 ) {
a.addElement(i, 0, 1.0d);
} else {
a.addElement(i, 0, 0.0d);
}
}
// These make sure that the dimension of the compressed matrix is the same as the original matrix.
// We don't actually have to do this due to the closure property of the given data.
e.toVerticalVectorInCompressedMatrixFormat(siteSize);
a.toVerticalVectorInCompressedMatrixFormat(siteSize);
// calculate PageRank
// pk = d(H + (1/n)ea')pk-1 + ((1-d)/n)e
// Refer to the documentation for details
// initial column vector in which every element is 1/(number of pages)
wBefore = new CCompressedMatrix();
for ( int g=0; g<siteSize; g++ ) {
wBefore.addElement(g,0,(double)1.0/siteSize);
}
if ( rank == 0 && siteSize_master == 0 ){
}
else if ( rank == 0 && siteSize_master !=0 ) {
for ( int g=0; g<siteSize_master; g++) {
wBefore_node.addElement(g,0,(double)1.0/siteSize);
}
}
else if ( rank != 0 ){
for ( int g=0; g<siteSize_per_node; g++) {
wBefore_node.addElement(g,0,(double)1.0/siteSize);
}
}
wCurrent = new CCompressedMatrix();
CCompressedMatrix ea = new CCompressedMatrix();
ea = e.multiply(a.transpose());
// We need to cut ea and e according to the rank so that only the proper portion remains for the calculation
int begin = -1;
int end = -1;
if ( rank == 0 && siteSize_master !=0 ) {
begin = 0;
end = siteSize_master;
}
else if ( rank != 0 ){
begin = (rank - 1) * siteSize_per_node + siteSize_master;
end = rank * siteSize_per_node + siteSize_master;
}
ea = ea.cutRow(begin, end);
e = e.cutRow(begin, end);
// begins iteration
boolean conv = false;
int iterCount = 0;
while ( !conv ) {
iterCount++;
if(rank ==0){
System.out.println(rank + " : in the middle of " + iterCount + " th iteration");
}
// The pagerank calculation on the master node is skipped if the master has no page to process.
conv = true;
if ( rank == 0 && siteSize_master == 0 ){
wCurrent = new CCompressedMatrix();
}
else {
tmp1 =
comp.plus(ea.ScalarMultiply(1.0d/siteSize)).ScalarMultiply(DAMP).multiply(wBefore);
tmp2 = e.ScalarMultiply((1.0d-DAMP)/siteSize);
wCurrent = tmp1.plus(tmp2);
}
// check if each row of pagerank is converged
if ( rank == 0 && siteSize_master != 0){
for ( int z=0; z<siteSize_master; z++ ) {
if ( Math.abs(wBefore_node.getValueAt(z,0)-
wCurrent.getValueAt(z,0)) >= TOLERANCE ) {
conv = false;
}
}
}
else if ( rank != 0 ){
for ( int z=0; z<siteSize_per_node; z++ ) {
if ( Math.abs(wBefore_node.getValueAt(z,0)-
wCurrent.getValueAt(z,0)) >= TOLERANCE ) {
conv = false;
}
}
}
// decide whether each process has converged
BooleanItemBuf converged = new BooleanItemBuf();
if ( rank == 0 && siteSize_master == 0 ){
converged.set(true);
}
else {
converged.set(conv);
}
BooleanItemBuf [] allConverged = new BooleanItemBuf[size];
for ( int i = 0; i < size; i++ ) {
allConverged[i] = new BooleanItemBuf();
}
world.gather(0, converged, allConverged);
boolean allConv = true;
for ( int i=0; i<size; i++ ){
if(!allConverged[i].get()){
allConv = false;
}
}
// now broadcast if all process converged
BooleanItemBuf convergeDone = new BooleanItemBuf();
convergeDone.set(allConv);
world.broadcast(0, convergeDone);
// if all process are converged, stop iteration
conv = convergeDone.get();
wBefore_node = wCurrent;
// wBefore_node should be gathered and combined by the master to form wBefore.
// The block below, up to and including world.gather (italicized in the original report),
// is the point where we are stuck.
// It causes a NullPointerException because the object is not serialized properly.
ObjectItemBuf part_ob = new ObjectItemBuf();
part_ob.set(wBefore_node);
ObjectItemBuf [] part_wBefore = new ObjectItemBuf[size];
for(int i = 0; i<size; i++){
part_wBefore[i] = new ObjectItemBuf();
}
world.gather(0, part_ob, part_wBefore);
CCompressedMatrix toBeSent = new CCompressedMatrix();
if ( rank == 0 ) {
int count = 0;
for ( int i=0 ; i<size; i++ ) {
CCompressedMatrix part =
(CCompressedMatrix)part_wBefore[i].get();
for ( int j=0; j<part.getRowDimension(); j++ ) {
toBeSent.addElement(count, 0,
part.getValueAt(j, 0));
count++;
}
}
}
for(int i=0; i<size; i++){
part_wBefore[i].reset();
}
// now broadcast combined pagerank matrix(vector) to nodes
ObjectItemBuf full_ob = new ObjectItemBuf();
full_ob.set(toBeSent);
world.broadcast(0, full_ob);
wBefore = (CCompressedMatrix)full_ob.get();
}
// now page has PageRank for every page
page = wBefore;
// write computation result to file
writeOutputFile(outfile);
}
// Hidden operations.
// Loading site list to compMatrix
// Depending on the rank, loading range will be different
private static void readInputFile(String inputFile) {
// assuming the data format is -> to_pid : from_pid
// in case the data format is -> from_pid : to_pid, just transpose the comp matrix
comp = new CCompressedMatrix();
int begin = -1;
int end = -1;
if ( rank == 0 ){
begin = 0;
end = siteSize_master;
}
else if( rank != 0 ){
begin = (rank - 1) * siteSize_per_node + siteSize_master;
end = rank * siteSize_per_node + siteSize_master;
}
BufferedReader br;
try {
br = new BufferedReader(new FileReader(inputFile));
String line;
int currentIndex = -1;
// skip the pairs that should not be loaded by this process and load the pairs that are in its range
while ( (line=br.readLine())!=null && currentIndex < end ) {
StringTokenizer st = new StringTokenizer(line);
int argb = Integer.parseInt(st.nextToken());
int arge = Integer.parseInt(st.nextToken());
if ( argb >= begin && argb < end ) {
comp.addElement(argb, arge, 1.0d);
}
currentIndex = argb;
}
if ( rank == 0 ){
comp.setRowDimension(siteSize_master);
comp.setColDimension(siteSize);
}
else if ( rank != 0 ){
comp.setRowDimension(siteSize_per_node);
comp.setColDimension(siteSize);
}
} catch ( NumberFormatException e ) {
e.printStackTrace();
} catch ( FileNotFoundException e ) {
e.printStackTrace();
} catch ( IOException e ) {
e.printStackTrace();
}
}
// resultant page file in master node will hold PageRank for every page, so save it to disk
private static void writeOutputFile(String outFileName) {
try {
if ( rank == 0 ) {
BufferedWriter out = new BufferedWriter(new
FileWriter(outFileName));
double sum = 0;
// the format of the result file is -> pid<tab>pagerank
for ( int i=0; i<page.getRowDimension(); i++ ) {
out.write( i + "\t" + page.getValueAt(i, 0) + "\n");
sum += page.getValueAt(i,0);
}
System.out.println("Sum of pagerank from each page is " +
sum);
out.close();
System.out.println("Results are written to a file '" +
outFileName + "'");
System.out.println("All jobs done at " +
System.currentTimeMillis());
}
} catch ( FileNotFoundException e1 ) {
e1.printStackTrace();
} catch ( IOException e1 ) {
e1.printStackTrace();
}
}
// when the argument given is not appropriate
private static void usage1() {
System.err.println ("For Every Process");
System.err.println ("Usage: java CMainClusterProcess <infile>");
System.err.println ("<infile> = Input file with ' pid<tab>pid ' format");
System.exit (1);
}
}
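The row assignment used by main and readInputFile above (the master takes the remainder rows, and each other process takes an equal block after them) can be isolated into a small helper, which makes the 201-page, 3-node example from the comments easy to check. The following is only a sketch; the class name PartitionSketch and the method rowRange are hypothetical and do not appear in the original code.

```java
/** Hypothetical illustration class; not part of the original listings. */
final class PartitionSketch {
    /**
     * Returns {begin, end} (end exclusive) for the rows owned by 'rank'
     * when 'siteSize' rows are split among 'size' processes, following
     * the scheme in CMainClusterProcess: rank 0 is the master.
     */
    static int[] rowRange(int rank, int size, int siteSize) {
        if (size < 2) {
            // with a single process, size - 1 would be zero and the division below would fail
            throw new IllegalArgumentException("needs a master and at least one worker");
        }
        int workers = size - 1;
        int masterRows = siteSize % workers;    // remainder goes to the master
        int rowsPerWorker = siteSize / workers; // equal share for every worker
        if (rank == 0) {
            return new int[] {0, masterRows};
        }
        int begin = (rank - 1) * rowsPerWorker + masterRows;
        return new int[] {begin, begin + rowsPerWorker};
    }

    public static void main(String[] args) {
        // the example from the comments: 201 pages and 3 nodes
        for (int rank = 0; rank < 3; rank++) {
            int[] range = rowRange(rank, 3, 201);
            System.out.println("rank " + rank + " : rows [" + range[0] + ", " + range[1] + ")");
        }
    }
}
```

With 201 pages and 3 processes the master owns row 0 only and the two workers own rows 1 to 100 and 101 to 200, which agrees with the description in the comments. When the page count divides evenly among the workers, the master's range is empty, which is the case the listing handles with its siteSize_master == 0 branches.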
D. DataGenerator.java
/**
* DataGenerator.java
* @author Kwan Dong Kim
**/
/**
* Description of the Class
* This class generates pseudo data for CMainClusterProcess
* args[0] ---------------- max number of input
*/
import java.util.*;
import java.io.*;
class DataGenerator
{
public static void main(String[] args) throws Exception
{
String Column_Out;
int tmp_Column_Out;
String Output_String;
String Section_Output_String="";
//init the tree set structure (keeps the "to-link" numbers sorted and unique)
TreeSet<Integer> Second_Column = new TreeSet<Integer>();
//parse the argument[0] as a max number of input
int max = Integer.parseInt(args[0]);
String outfile = "input_"+max+".txt";
// init the file writer
FileWriter fw = new FileWriter(outfile);
BufferedWriter bw = new BufferedWriter(fw);
// init the random generator
Random generator = new Random();
int counter;
for (int i=1 ; i < max ; i++ )
{
//generate the number of "to-links" using the random generator
int number_of_second_column = generator.nextInt(100) +1;
for (int j=1; j< number_of_second_column; j++ )
{
//generate a "to-link" number using the random generator
int second_column = generator.nextInt(max)+1 ;
Second_Column.add(second_column) ;
}
// iterate each "from to link"
Iterator iterator = Second_Column.iterator ( ) ;
while ( iterator.hasNext ( ) )
{
tmp_Column_Out = (Integer)iterator.next();
Column_Out = tmp_Column_Out+"";
Output_String = i+" "+Column_Out+"\n";
Section_Output_String =Section_Output_String
+Output_String;
}
Second_Column.clear();
//write the result of each "from to link" through the buffered writer
bw.write(Section_Output_String);
Section_Output_String="";
}
// close the writer
bw.close();
}
}
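For repeatable experiments it can help to make the generated data deterministic. The following sketch follows the same generation scheme as DataGenerator (for each "from" page, a random number of distinct, sorted "to" pages) but takes a seed and writes to any Writer. The class name DataGeneratorSketch, the method generate, and the seed parameter are assumptions added for illustration and are not part of the original program.

```java
import java.io.IOException;
import java.io.StringWriter;
import java.io.Writer;
import java.util.Random;
import java.util.TreeSet;

/** Hypothetical illustration class; not part of the original listings. */
final class DataGeneratorSketch {
    /**
     * Writes "from to" pairs, one per line, for from = 1 .. max-1.
     * The same seed always produces the same output.
     */
    static void generate(int max, long seed, Writer out) throws IOException {
        Random generator = new Random(seed);
        for (int from = 1; from < max; from++) {
            // sorted set of distinct "to" pages for this "from" page
            TreeSet<Integer> targets = new TreeSet<>();
            int draws = generator.nextInt(100) + 1;
            for (int j = 1; j < draws; j++) {
                targets.add(generator.nextInt(max) + 1);
            }
            // build the lines for this page in one buffer, then write them
            StringBuilder section = new StringBuilder();
            for (int to : targets) {
                section.append(from).append(' ').append(to).append('\n');
            }
            out.write(section.toString());
        }
    }

    public static void main(String[] args) throws IOException {
        StringWriter sample = new StringWriter();
        generate(5, 42L, sample);
        System.out.print(sample);
    }
}
```

Passing a FileWriter wrapped in a BufferedWriter instead of the StringWriter would reproduce the file output of the original class, while the seed allows the same input file to be regenerated for different cluster sizes.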