
CSC337 – Parallel Computation

					Parallel Implementation of Classifying Data by K-Nearest-Neighbor Rule

CSC537 – Parallel Computation Fall 2009 Term Project Shun Jiang

Problem Statement

This project classifies data with the K-Nearest-Neighbor rule using parallel computing. It contains two applications: one computes the classification error rate for different values of K, and the other finds out which features are most useful for the classification.

The K-Nearest-Neighbor rule is a nonparametric pattern recognition technique that is used when the density function is unknown. The procedure for classifying data with the K-Nearest-Neighbor rule is as follows:
a. Load the training data, whose classes are known, and the testing data to be classified.
b. For each testing sample, calculate the Euclidean distance from that sample to every training sample.
c. Find the K training samples that have the shortest distances to the testing sample.
d. Count how many of those K training samples belong to each class.
e. Assign the testing sample to the class that contains most of the K training samples.

The whole procedure is time consuming, especially when the number of training samples and the number of feature dimensions are large. To improve classification accuracy, a system usually collects N-dimensional features from S samples, so the computation takes a long time when S, N and K are large. In this project there are 540 feature vector files, each with 238 feature values; 440 of them are training data and 100 are testing data.

Although the K-Nearest-Neighbor rule is simple and easy to implement, it does not suit every classification problem. Calculating the classification error rate is a good way to see how well or badly the rule performs. The first application reports the error rate for different values of K, so the user can see which value of K gives the best result.

Furthermore, researchers collect many features of an object for use in classification. However, not all of the features are useful, and some may have a bad influence on the result.
The K-Nearest-Neighbor rule is based on the distances between vectors, so a feature is not useful when many objects have similar values for it. In other words, this project treats a feature with a larger standard deviation as more useful. The second application first classifies the data and calculates the error rate using only the one feature with the largest standard deviation. It then repeats the process with the two features that have the largest standard deviations, then with three, and so on. As a result, the user can see which features give a better result.
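Steps a to e of the procedure can be sketched as one small C function. This is a minimal illustration, not the project's code: the names squared_distance and knn_classify are hypothetical, the two classes 'M' and 'F' are taken from the listings later in this report, and the K nearest samples are found by repeated scanning rather than by the quicksort the project uses. Squared distances are compared directly, since taking the square root does not change which samples are nearest.

```c
#include <stddef.h>

/* Squared Euclidean distance between two feature vectors of length n. */
double squared_distance(const double *a, const double *b, size_t n)
{
    double sum = 0.0;
    for (size_t i = 0; i < n; i++) {
        double diff = a[i] - b[i];
        sum += diff * diff;
    }
    return sum;
}

/*
 * Classify one testing vector with the K-Nearest-Neighbor rule.
 *   train  : n_train training vectors stored row by row, n_feat values each
 *   labels : class of each training vector, 'M' or 'F'
 *   used   : caller-provided scratch array of n_train ints
 * Returns the majority class among the k nearest training vectors.
 * A tie goes to 'F', which is also what the original programs do.
 */
char knn_classify(const double *train, const char *labels, size_t n_train,
                  size_t n_feat, const double *test, size_t k, int *used)
{
    size_t votes_m = 0, votes_f = 0;

    for (size_t i = 0; i < n_train; i++)
        used[i] = 0;
    if (k > n_train)
        k = n_train;

    for (size_t picked = 0; picked < k; picked++) {
        size_t best = n_train;          /* n_train means "none found yet" */
        double best_d = 0.0;

        /* Find the nearest training vector that has not been picked. */
        for (size_t i = 0; i < n_train; i++) {
            if (used[i])
                continue;
            double d = squared_distance(train + i * n_feat, test, n_feat);
            if (best == n_train || d < best_d) {
                best = i;
                best_d = d;
            }
        }
        used[best] = 1;
        if (labels[best] == 'M')
            votes_m++;
        else
            votes_f++;
    }
    return votes_m > votes_f ? 'M' : 'F';
}
```

Calling knn_classify once per testing vector and comparing the result with the known class gives the error count for one value of K. Choosing an odd K avoids ties between the two classes.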

Parallel Implementation

a. Program kl
Every processor reads in a share of the testing data, classifies it, and counts its classification errors. Processor 0 then receives the sum of the error counts from all processors. The following call sums the errors:
MPI_Reduce(&error[x], &sumOfError[x], 1, MPI_INT,MPI_SUM,0,MPI_COMM_WORLD);
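The way kl divides the work can be illustrated without MPI. The sketch below is a serial stand-in under stated assumptions: cyclic_share and sum_partial_errors are hypothetical helper names, the first mirrors the loop pattern "x = id; x < n; x += number of processes" used in the listing, and the second only imitates what MPI_Reduce with MPI_SUM delivers to processor 0.

```c
#include <stddef.h>

/* Number of items handled by process `id` when `n` items are dealt out
 * cyclically to `p` processes (item x goes to process x % p). */
size_t cyclic_share(size_t n, size_t p, size_t id)
{
    size_t count = 0;
    for (size_t x = id; x < n; x += p)
        count++;
    return count;
}

/* Serial stand-in for a sum reduction: add one partial error count
 * per process, as processor 0 would receive it. */
int sum_partial_errors(const int *partial, size_t p)
{
    int total = 0;
    for (size_t id = 0; id < p; id++)
        total += partial[id];
    return total;
}
```

With the 100 testing files of this project and 8 processes, processes 0 to 3 each classify 13 files and processes 4 to 7 each classify 12. The third argument of MPI_Reduce is an element count, so an entire array of per-K error counts could also be reduced in a single call instead of one call per element.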

b. Program kp
Every processor calculates the standard deviations of a subset of the features. Processor 0 then receives all the standard deviations from the other processors, sorts the features, and sends the sorted list back. After that, each processor classifies all the testing data with a different number of features, in parallel. In the following code, processor 0 receives the standard deviations sent by the other processors and then broadcasts the list of features sorted by standard deviation:
if (x%rank>0 && id == x%rank)
    MPI_Send (&stdDev[x],1,MPI_DOUBLE, 0, 1,MPI_COMM_WORLD);
if (id == 0 && x%rank>0)
    MPI_Recv (&stdDev[x],1,MPI_DOUBLE,x%rank,1,MPI_COMM_WORLD,&status);
/* sortDevList holds int feature indices, so it is broadcast as MPI_INT */
MPI_Bcast (sortDevList,numberOfFeature,MPI_INT,0,MPI_COMM_WORLD);
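The ranking step that kp performs on processor 0 can be sketched serially as follows. This is an illustration under assumptions, not the project's code: the names ranked_feature, feature_variance and rank_features are hypothetical, the standard library qsort replaces the project's own quicksort, and the features are ordered by variance. Because the standard deviation is the square root of the variance, both give the same order. The sketch assumes n_samples is greater than zero.

```c
#include <stddef.h>
#include <stdlib.h>

/* A feature index paired with the variance of that feature. */
typedef struct {
    size_t index;
    double variance;
} ranked_feature;

/* Population variance of feature `f` over n_samples vectors that are
 * stored row by row with n_feat values each. */
double feature_variance(const double *data, size_t n_samples,
                        size_t n_feat, size_t f)
{
    double mean = 0.0, var = 0.0;
    for (size_t s = 0; s < n_samples; s++)
        mean += data[s * n_feat + f];
    mean /= (double)n_samples;
    for (size_t s = 0; s < n_samples; s++) {
        double d = data[s * n_feat + f] - mean;
        var += d * d;
    }
    return var / (double)n_samples;
}

/* qsort comparator: larger variance first, lower index first on ties. */
static int by_variance_desc(const void *a, const void *b)
{
    const ranked_feature *x = a;
    const ranked_feature *y = b;
    if (x->variance < y->variance) return 1;
    if (x->variance > y->variance) return -1;
    return (x->index > y->index) - (x->index < y->index);
}

/* Fill out[0..n_feat-1] with all features, most variable first. */
void rank_features(const double *data, size_t n_samples, size_t n_feat,
                   ranked_feature *out)
{
    for (size_t f = 0; f < n_feat; f++) {
        out[f].index = f;
        out[f].variance = feature_variance(data, n_samples, n_feat, f);
    }
    qsort(out, n_feat, sizeof out[0], by_variance_desc);
}
```

After the call, out[0].index is the feature used when classifying with one feature, out[0] and out[1] are used when classifying with two, and so on.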

Usage:

a. Data folders and files
Training data are stored in the folder train, and testing data are stored in the folder test. Each file stores the feature data of one object, and the file name indicates the object’s class. Each line in the file holds one feature value. For example:
EMWfemale28neutral.b_app
0.1
0.2
0.3
…

b. Programs
There are two programs, kl and kp. kl classifies the data with all features and calculates the classification error rate for different values of K. kp calculates and sorts the standard deviations of the features, classifies the data with different numbers of features, and calculates the classification error rate.

c. Compile the programs
mpicc -o kl kl.c -lm
mpicc -o kp kp.c -lm

d. Run a program
mpirun -nolocal -np <number of processes> kl (or kp)
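The file convention described in item a (class encoded in the file name, one feature value per line) could be handled with helpers like the two below. Both names, class_from_filename and read_features, are hypothetical. The class test follows the listings in this report, which look for the substring "female" and otherwise assume male. Unlike the listings, read_features checks the return value of fscanf and reports how many values were actually read.

```c
#include <stddef.h>
#include <stdio.h>
#include <string.h>

/* Returns 'F' when the file name contains "female", otherwise 'M'.
 * Testing for "female" is sufficient because "male" is a substring of it,
 * so a test for "male" alone could not tell the two classes apart. */
char class_from_filename(const char *name)
{
    return strstr(name, "female") != NULL ? 'F' : 'M';
}

/* Reads up to n whitespace-separated feature values from fp into out.
 * Returns the number of values read, which is less than n when the file
 * is short or contains text that is not a number. */
size_t read_features(FILE *fp, double *out, size_t n)
{
    size_t count = 0;
    while (count < n && fscanf(fp, "%lf", &out[count]) == 1)
        count++;
    return count;
}
```

A caller can compare the return value of read_features with the expected 238 values and skip or report a file that is incomplete.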

Performance:

The tables and plots below show the running time on the same data with different numbers of processors.

Program kl:

Number of processors    Time (s)
1                       2.636
2                       2.140625
4                       1.882812
6                       1.75
8                       0.984375

(Plot: Time versus Number of processors for kl.)

Program kp:

Number of processors    Time (s)
1                       27.10938
2                       18.59376
4                       17.4375
6                       15.28125
8                       11.67578
10                      10.59375

(Plot: Time versus Number of processors for kp.)

Analysis:

The tables above show that the processing time of both programs decreases as the number of processors increases. The time did not drop dramatically because the two programs parallelize only some portions of the computation and, I assume, because the machines were not equally idle during the runs.
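The size of the improvement can be quantified with the usual speedup and efficiency ratios. The sketch below is only arithmetic on the timings reported above; the function names speedup and efficiency are hypothetical.

```c
/* Speedup: single-processor time divided by the time on p processors. */
double speedup(double t1, double tp)
{
    return t1 / tp;
}

/* Parallel efficiency: speedup divided by the number of processors. */
double efficiency(double t1, double tp, int p)
{
    return t1 / (tp * (double)p);
}
```

For kl, speedup(2.636, 0.984375) is about 2.7 on 8 processors, an efficiency of about 33%. For kp, speedup(27.10938, 10.59375) is about 2.6 on 10 processors, an efficiency of about 26%. Both figures are consistent with a computation in which a sizeable part still runs serially or is repeated on every processor.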

Bibliographic References:
Parallel Programming with MPI, by Peter S. Pacheco
Pattern Classification, by Richard O. Duda, Peter E. Hart, and David G. Stork
Data set: http://compsci.cis.uncw.edu/~pattersone/courses/577fall09blog/?paged=2

Code of kl
/*
 * kl: classify the testing data with the K-Nearest-Neighbor rule using
 * all features, and report the error rate for K = 1, 3, 5, ...
 */
#include <stdio.h>
#include <sys/types.h>
#include <dirent.h>
#include <string.h>
#include <math.h>
#include "mpi.h"

void quickSort(int [], double [], int, int);
int partition(int [], double [], int, int);
void exch(int [], int, int);

int main (int argc, char *argv[])
{
    int numberOfTrain = 440;
    int numberOfTest = 100;
    int numberOfFeature = 238;
    char testAnswer [numberOfTest];
    int id;      /* rank of this process */
    int rank;    /* total number of processes, despite the name */
    double time;

    MPI_Init (&argc, &argv);
    MPI_Comm_rank (MPI_COMM_WORLD, &id);
    MPI_Comm_size (MPI_COMM_WORLD, &rank);
    MPI_Barrier (MPI_COMM_WORLD);
    time = - MPI_Wtime();

    char *trainPath = "./train/";
    char *testPath = "./test/";
    char filePath[200];
    FILE *dataFile;
    double feature;
    DIR *trainDir = opendir(trainPath);
    DIR *testDir = opendir(testPath);
    struct dirent *entry = NULL;
    int countOfTrain = 0;
    int countOfTest = 0;
    int countOfFeature = 0;
    char fileName[256];
    char trainc [numberOfTrain];
    double trainData [numberOfTrain][numberOfFeature];
    char testc [numberOfTest];
    double testData [numberOfTest][numberOfFeature];
    double testDistance [numberOfTest][numberOfTrain];
    double sum;
    double difference;
    int x,y,z;
    int sortList[numberOfTest][numberOfTrain];

    /* First index is (K-1)/2, so entry y belongs to K = 2*y+1 */
    int result [numberOfFeature/2][numberOfTest][2];
    char testResult [numberOfFeature/2][numberOfTest];
    int error [numberOfFeature/2];
    int sumOfError [numberOfFeature/2];
    double errorRate;

    //========================================================
    // Read Training data
    // All training data are stored in the folder of Train
    //========================================================
    while ((entry = readdir(trainDir)))   /* readdir returns NULL at the end
                                           * of the directory and the loop stops. */
    {
        if (strcmp(entry->d_name, ".") != 0 && strcmp(entry->d_name, "..") != 0) {
            //Get File Name
            strcpy(fileName, entry->d_name);

            //Get the class of the training vectors
            if (strstr(fileName, "female") == NULL)
                trainc [countOfTrain] = 'M';
            else
                trainc [countOfTrain] = 'F';

            //Read feature data
            strcpy(filePath, trainPath);
            strcat(filePath, fileName);
            dataFile = fopen (filePath, "r");
            if (dataFile == NULL) {
                //printf ("Can't open file\n"); fflush(stdout);
            }
            else {
                //printf ("Successfully opened file\n"); fflush(stdout);
                //read number
                for (countOfFeature = 0; countOfFeature < numberOfFeature; countOfFeature++) {
                    fscanf(dataFile, "%lf", &feature);
                    trainData [countOfTrain][countOfFeature] = feature;
                }
                fclose (dataFile);
            }
            countOfTrain = countOfTrain + 1;
        }
    }
    closedir(trainDir);

    //===============================================================
    // Read Testing data
    // All Testing data are stored in the folder of Test
    //===============================================================
    while ((entry = readdir(testDir)))
    {
        if (strcmp(entry->d_name, ".") != 0 && strcmp(entry->d_name, "..") != 0) {
            //Get File Name
            strcpy(fileName, entry->d_name);
            //Each process reads only the files it will classify
            if (countOfTest % rank == id) {
                //Get the class of the testing vectors
                if (strstr(fileName, "female") == NULL)
                    testc [countOfTest] = 'M';
                else
                    testc [countOfTest] = 'F';

                //Read feature data
                strcpy(filePath, testPath);
                strcat(filePath, fileName);
                dataFile = fopen (filePath, "r");
                if (dataFile == NULL) {
                    //printf ("Can't open file\n"); fflush(stdout);
                }
                else {
                    //printf ("Successfully opened file\n"); fflush(stdout);
                    //read number
                    for (countOfFeature = 0; countOfFeature < numberOfFeature; countOfFeature++) {
                        fscanf(dataFile, "%lf", &feature);
                        testData [countOfTest][countOfFeature] = feature;
                    }
                    fclose (dataFile);
                }
            }
            countOfTest = countOfTest + 1;
        }
    }
    closedir(testDir);

    //===============================================================
    // Calculate Distance
    //===============================================================
    for (x = id; x < numberOfTest; x = x + rank) {
        for (y = 0; y < numberOfTrain; y++) {
            sum = 0;
            for (z = 0; z < numberOfFeature; z++) {
                difference = testData[x][z] - trainData[y][z];
                sum = sum + pow (difference, 2);
            }
            testDistance [x][y] = pow (sum, 0.5);
            sortList [x][y] = y;
        }
    }

    //===============================================================
    // Sorting Data
    //===============================================================
    for (x = id; x < numberOfTest; x = x + rank) {
        /* right is the index of the last element, so numberOfTrain-1;
         * passing numberOfTrain reads one element past the row */
        quickSort (sortList [x], testDistance[x], 0, numberOfTrain - 1);
    }

    //===============================================================
    // Classify and Calculate the Error
    //===============================================================
    for (x = id; x < numberOfTest; x = x + rank) {
        for (y = 0; y < numberOfFeature/2; y++) {
            result [y][x][0] = 0;
            result [y][x][1] = 0;
            for (z = 0; z < 2*y+1; z++) {
                if (trainc[sortList[x][z]] == 'M')
                    result [y][x][0]++;
                else
                    result [y][x][1]++;
            }
            if (result[y][x][0] > result[y][x][1])
                testResult[y][x] = 'M';
            else
                testResult[y][x] = 'F';
        }
    }

    //===============================================================
    // Calculate the Error Rate
    //===============================================================
    for (y = 0; y < numberOfFeature/2; y++) {
        error[y] = 0;
        for (x = id; x < numberOfTest; x = x + rank) {
            if (testResult[y][x] != testc[x])
                error[y]++;
        }
    }

    //===============================================================
    // Sum Error
    //===============================================================
    for (x = 0; x < numberOfFeature/2; x++) {
        MPI_Reduce(&error[x], &sumOfError[x], 1, MPI_INT, MPI_SUM, 0, MPI_COMM_WORLD);
    }
    time += MPI_Wtime();

    if (id == 0) {
        for (x = 0; x < numberOfFeature/2; x++) {
            errorRate = (double) sumOfError[x] / (double) numberOfTest * 100;
            printf("K=%d, ", x*2+1);
            printf("Sum of Error = %d, ", sumOfError[x]);
            printf("Error Rate = %f%%\n", errorRate);   /* %% prints a literal % */
        }
        printf("Time Used = %f\n", time);
    }

    /* Shut down MPI */
    MPI_Finalize();
    return 0;
}

/* Sort the indices in sortList[left..right] by ascending testDistance. */
void quickSort (int sortList [], double testDistance [], int left, int right)
{
    if (right <= left) return;
    int i = partition(sortList, testDistance, left, right);
    quickSort(sortList, testDistance, left, i-1);
    quickSort(sortList, testDistance, i+1, right);
}

int partition(int sortList [], double testDistance [], int left, int right)
{
    int i = left - 1;
    int j = right;
    while (1) {
        // find item on left to swap; a[right] acts as sentinel
        while (testDistance[sortList[++i]] < testDistance[sortList[right]])
            ;
        // find item on right to swap
        while (testDistance[sortList[right]] < testDistance[sortList[--j]])
            if (j == left) break;           // don't go out-of-bounds
        if (i >= j) break;                  // check if pointers cross
        exch(sortList, i, j);               // swap two elements into place
    }
    exch(sortList, i, right);               // swap with partition element
    return i;
}

void exch(int sortList [], int i, int j)
{
    //exchanges++;
    int swap = sortList[i];
    sortList [i] = sortList [j];
    sortList [j] = swap;
}

Code of kp:
/*
 * kp: rank the features by standard deviation, then classify the testing
 * data with 1, 2, 3, ... of the highest-ranked features.
 */
#include <stdio.h>
#include <sys/types.h>
#include <dirent.h>
#include <string.h>
#include <math.h>
#include "mpi.h"

void quickSort(int [], double [], int, int);
int partition(int [], double [], int, int);
void exch(int [], int, int);

int main (int argc, char *argv[])
{
    int d;
    int numberOfK = 10;
    int numberOfTrain = 440;
    int numberOfTest = 100;
    int numberOfFeature = 238;
    char testAnswer [numberOfTest];
    int id;      /* rank of this process */
    int rank;    /* total number of processes, despite the name */
    MPI_Status status;
    double time;
    double sortTime;

    MPI_Init (&argc, &argv);
    MPI_Comm_rank (MPI_COMM_WORLD, &id);
    MPI_Comm_size (MPI_COMM_WORLD, &rank);
    MPI_Barrier (MPI_COMM_WORLD);
    time = - MPI_Wtime();
    sortTime = - MPI_Wtime();

    char *trainPath = "./train/";
    char *testPath = "./test/";
    char filePath[200];
    FILE *dataFile;
    double feature;
    DIR *trainDir = opendir(trainPath);
    DIR *testDir = opendir(testPath);
    struct dirent *entry = NULL;
    int countOfTrain = 0;
    int countOfTest = 0;
    int countOfFeature = 0;
    char fileName[256];
    char trainc [numberOfTrain];
    double trainData [numberOfTrain][numberOfFeature];
    char testc [numberOfTest];
    double testData [numberOfTest][numberOfFeature];
    int sortDevList [numberOfFeature];
    int trainList [numberOfFeature];
    double stdDev [numberOfFeature];
    double mean [numberOfFeature];
    double testDistance [numberOfTest][numberOfTrain];
    double sum;
    double difference;
    int x,y,z,k;
    int sortList[numberOfTest][numberOfTrain];

    int result [numberOfFeature/2][numberOfTest][2];
    char testResult [numberOfFeature/2][numberOfTest];
    int error [numberOfFeature/2];
    int sumOfError [numberOfFeature/2];
    double errorRate;

    //========================================================
    // Read Training data
    // All training data are stored in the folder of Train
    //========================================================
    while ((entry = readdir(trainDir)))   /* readdir returns NULL at the end
                                           * of the directory and the loop stops. */
    {
        if (strcmp(entry->d_name, ".") != 0 && strcmp(entry->d_name, "..") != 0) {
            //Get File Name
            strcpy(fileName, entry->d_name);

            //Get the class of the training vectors
            if (strstr(fileName, "female") == NULL)
                trainc [countOfTrain] = 'M';
            else
                trainc [countOfTrain] = 'F';

            //Read feature data
            strcpy(filePath, trainPath);
            strcat(filePath, fileName);
            dataFile = fopen (filePath, "r");
            if (dataFile == NULL) {
                //printf ("Can't open file\n"); fflush(stdout);
            }
            else {
                //printf ("Successfully opened file\n"); fflush(stdout);
                //read number
                for (countOfFeature = 0; countOfFeature < numberOfFeature; countOfFeature++) {
                    fscanf(dataFile, "%lf", &feature);
                    trainData [countOfTrain][countOfFeature] = feature;
                }
                fclose (dataFile);
            }
            countOfTrain = countOfTrain + 1;
        }
    }
    closedir(trainDir);

    //===============================================================
    // Sort each dimension from high standard deviation to low standard
    // deviation
    //===============================================================
    //Mean
    for (x = id; x < numberOfFeature; x = x + rank) {
        sum = 0;
        for (y = 0; y < numberOfTrain; y++) {
            sum = sum + trainData [y][x];
        }
        mean [x] = sum / numberOfTrain;
    }
    //Standard Deviation
    for (x = id; x < numberOfFeature; x = x + rank) {
        sum = 0;
        for (y = 0; y < numberOfTrain; y++) {
            sum = sum + pow(trainData [y][x] - mean[x], 2);
        }
        stdDev [x] = pow (sum / numberOfTrain, 0.5);
    }
    //Send the values of standard deviation to Process 0
    for (x = 0; x < numberOfFeature; x++) {
        if (x%rank > 0 && id == x%rank)
            MPI_Send (&stdDev[x], 1, MPI_DOUBLE, 0, 1, MPI_COMM_WORLD);
        if (id == 0 && x%rank > 0)
            MPI_Recv (&stdDev[x], 1, MPI_DOUBLE, x%rank, 1, MPI_COMM_WORLD, &status);
    }
    //Sort the stdDev (ascending, so the largest ends up last)
    if (id == 0) {
        for (x = 0; x < numberOfFeature; x++) {
            sortDevList [x] = x;
        }
        /* right is the index of the last element */
        quickSort (sortDevList, stdDev, 0, numberOfFeature - 1);
    }
    //Broadcast sort list; sortDevList holds ints, so the type is MPI_INT
    MPI_Bcast (sortDevList, numberOfFeature, MPI_INT, 0, MPI_COMM_WORLD);
    sortTime += MPI_Wtime();

    //===============================================================
    // Read Testing data
    // All Testing data are stored in the folder of Test
    //===============================================================
    while ((entry = readdir(testDir)))
    {
        if (strcmp(entry->d_name, ".") != 0 && strcmp(entry->d_name, "..") != 0) {
            //Get File Name
            strcpy(fileName, entry->d_name);

            //Get the class of the testing vectors
            if (strstr(fileName, "female") == NULL)
                testc [countOfTest] = 'M';
            else
                testc [countOfTest] = 'F';

            //Read feature data
            strcpy(filePath, testPath);
            strcat(filePath, fileName);
            dataFile = fopen (filePath, "r");
            if (dataFile == NULL) {
                //printf ("Can't open file\n"); fflush(stdout);
            }
            else {
                //printf ("Successfully opened file\n"); fflush(stdout);
                //read number
                for (countOfFeature = 0; countOfFeature < numberOfFeature; countOfFeature++) {
                    fscanf(dataFile, "%lf", &feature);
                    testData [countOfTest][countOfFeature] = feature;
                }
                fclose (dataFile);
            }
            countOfTest = countOfTest + 1;
        }
    }
    closedir(testDir);

    //===============================================================
    // Classify and calculate error in different number of feature
    //===============================================================
    /* Iteration d uses the d+1 features with the largest standard
     * deviations. The bound is d < numberOfFeature; with d == numberOfFeature
     * the index numberOfFeature-d-1 would be -1. */
    for (d = id; d < numberOfFeature; d = d + rank) {
        //Distance
        for (x = 0; x < numberOfTest; x++) {
            for (y = 0; y < numberOfTrain; y++) {
                if (d < rank) {
                    /* First pass of this process: testDistance is not set
                     * yet, so sum over all of the top d+1 features. */
                    sum = 0;
                    for (z = 0; z <= d; z++) {
                        difference = testData[x][sortDevList[numberOfFeature-z-1]]
                                   - trainData[y][sortDevList[numberOfFeature-z-1]];
                        sum = sum + pow(difference, 2);
                    }
                }
                else {
                    /* Later passes: extend the previous distance by the
                     * next `rank` features. */
                    sum = pow (testDistance [x][y], 2);
                    for (z = 1; z <= rank; z++) {
                        difference = testData[x][sortDevList[numberOfFeature-(d-rank+z)-1]]
                                   - trainData[y][sortDevList[numberOfFeature-(d-rank+z)-1]];
                        sum = sum + pow(difference, 2);
                    }
                }
                testDistance [x][y] = pow (sum, 0.5);
                sortList [x][y] = y;
            }
        }
        //Sort Distance
        for (x = 0; x < numberOfTest; x++) {
            quickSort (sortList [x], testDistance[x], 0, numberOfTrain - 1);
        }
        //Classify and Calculate the Error
        for (x = 0; x < numberOfTest; x++) {
            for (y = 0; y < numberOfK; y++) {
                result [y][x][0] = 0;
                result [y][x][1] = 0;
                for (z = 0; z < 2*y+1; z++) {
                    if (trainc[sortList[x][z]] == 'M')
                        result [y][x][0]++;
                    else
                        result [y][x][1]++;
                }
                if (result[y][x][0] > result[y][x][1])
                    testResult[y][x] = 'M';
                else
                    testResult[y][x] = 'F';
            }
        }
        for (y = 0; y < numberOfK; y++) {
            error[y] = 0;
            for (x = 0; x < numberOfTest; x++) {
                if (testResult[y][x] != testc[x])
                    error[y]++;
            }
        }

        //Report the K with the fewest errors for this number of features
        int min = error[0];
        int minIndex = 0;
        for (x = 0; x < numberOfK; x++) {
            if (error[x] < min) {
                minIndex = x;
                min = error[x];
            }
        }
        errorRate = (double) error[minIndex] / (double) numberOfTest * 100;
        printf("D=%d,K=%d, ", d, minIndex*2+1);
        printf("Sum of Error = %d, ", error[minIndex]);
        printf("Error Rate = %f%%\n", errorRate);   /* %% prints a literal % */
        fflush(stdout);
    }

    MPI_Barrier (MPI_COMM_WORLD);
    time += MPI_Wtime();
    if (id == 0) {
        printf("Sort Time Used = %f\n", sortTime);
        printf("Time Used = %f\n", time);
        fflush(stdout);
    }
    MPI_Finalize();
    return 0;
}

/* Sort the indices in sortList[left..right] by ascending testDistance. */
void quickSort (int sortList [], double testDistance [], int left, int right)
{
    if (right <= left) return;
    int i = partition(sortList, testDistance, left, right);
    quickSort(sortList, testDistance, left, i-1);
    quickSort(sortList, testDistance, i+1, right);
}

int partition(int sortList [], double testDistance [], int left, int right)
{
    int i = left - 1;
    int j = right;
    while (1) {
        // find item on left to swap; a[right] acts as sentinel
        while (testDistance[sortList[++i]] < testDistance[sortList[right]])
            ;
        // find item on right to swap
        while (testDistance[sortList[right]] < testDistance[sortList[--j]])
            if (j == left) break;           // don't go out-of-bounds
        if (i >= j) break;                  // check if pointers cross
        exch(sortList, i, j);               // swap two elements into place
    }
    exch(sortList, i, right);               // swap with partition element
    return i;
}

void exch(int sortList [], int i, int j)
{
    //exchanges++;
    int swap = sortList[i];
    sortList [i] = sortList [j];
    sortList [j] = swap;
}


				