Design and Analysis of Parallel N-Queens on Reconfigurable Hardware with Handel-C and MPI

Vikas Aggarwal, Ian Troxel, and Alan D. George
High-performance Computing and Simulation (HCS) Research Lab
Department of Electrical and Computer Engineering
University of Florida, Gainesville, FL

MAPLD 2004, #198

Outline

- Introduction
- N-Queens Solutions
- Backtracking Approach
- N-Queens Parallelization
- Experimental Setup
- Handel-C and Lessons Learned
- Results and Analysis
- Conclusions
- Future Work and Acknowledgements
- References

Introduction

- N-Queens dates back to the 19th century (studied by Gauss)
- Classical combinatorial problem, widely used as a benchmark because of its simple and regular structure
- The problem involves placing N queens on an N x N chessboard such that no queen can attack any other
- Benchmark code versions include finding the first solution and finding all solutions

Introduction

- Mathematically stated: find a permutation BOARD[1..N] of the numbers 1..N, where BOARD[i] gives the row of the queen placed in column i, such that for any i != j:

      BOARD[i] - i != BOARD[j] - j
      BOARD[i] + i != BOARD[j] + j

  (The permutation itself guarantees that no two queens share a row or column; the two conditions rule out shared diagonals.)
- Example for N = 5: BOARD = [1, 3, 5, 2, 4] satisfies both conditions, placing the five queens so that none can attack another
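
A minimal C predicate for the conditions above (our illustration, not from the original design; 0-based indexing is used here, so the constants shift but the test is the same):

    #include <stdbool.h>

    /* Illustrative check of the formulation above: board[i] is the row of the
     * queen in column i (0-based). The row test is included so the check also
     * works for arrays that are not already permutations.                    */
    static bool is_valid_placement(const int *board, int n)
    {
        for (int i = 0; i < n; i++) {
            for (int j = i + 1; j < n; j++) {
                if (board[i] == board[j]) return false;          /* same row          */
                if (board[i] - i == board[j] - j) return false;  /* same "\" diagonal */
                if (board[i] + i == board[j] + j) return false;  /* same "/" diagonal */
            }
        }
        return true;
    }

    int main(void)
    {
        int board[5] = { 0, 2, 4, 1, 3 };   /* the N = 5 example above, 0-based */
        return is_valid_placement(board, 5) ? 0 : 1;
    }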

N-Queens Solutions

- Various approaches to the problem:
  - Brute force [2]
  - Local search algorithms [4]
  - Backtracking [2], [7], [11], [12], [13]
  - Divide-and-conquer [1]
  - Permutation generation [2]
  - Mathematical solutions [6]
  - Graph theory concepts [2]
  - Heuristics and AI [4], [14]

Backtracking Approach

- One of the only approaches that guarantees a solution, though it can be slow
- Can be seen as a form of intelligent depth-first search (a sketch follows below)
- Complexity of backtracking typically rises exponentially with problem size
- Good test case for performance analysis of RC systems, as the problem is complex even for small data sizes*
- Traditional processors provide a suboptimal platform for this iterative application due to the serial nature of their processing pipelines
- Tremendous speedups achieved by adding parallelism at the logic level via RC

* For an 8x8 board, 981 moves (876 tests + 105 backtracks) are required for the first solution alone
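
The search itself is easiest to see in software. A minimal C sketch of the first-solution backtracking search (our illustration, not the Handel-C design used on the boards):

    #include <stdio.h>
    #include <stdlib.h>

    /* Place one queen per column, advancing when a safe row is found and
     * backtracking when a column has no safe row left.                   */
    static int safe(const int *row, int col, int r)
    {
        for (int c = 0; c < col; c++)
            if (row[c] == r || abs(row[c] - r) == col - c)   /* row or diagonal clash */
                return 0;
        return 1;
    }

    static int first_solution(int n, int *row)
    {
        int col = 0;
        row[0] = -1;
        while (col >= 0 && col < n) {
            int r = row[col] + 1;                 /* next candidate row in this column */
            while (r < n && !safe(row, col, r))
                r++;
            if (r < n) {                          /* safe row found: place and advance */
                row[col] = r;
                if (++col < n) row[col] = -1;
            } else {                              /* column exhausted: backtrack */
                col--;
            }
        }
        return col == n;                          /* 1 if a solution was found */
    }

    int main(void)
    {
        int row[8];
        if (first_solution(8, row))
            for (int c = 0; c < 8; c++)
                printf("column %d -> row %d\n", c + 1, row[c] + 1);
        return 0;
    }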

Backtracking Examples

- Iterative steps performed with and without backtracking*
- [Board diagrams: the current queen is moved square by square ("keep moving the queens...") until a column runs out of rows ("out of board, so backtrack!")]

* The algorithm may start from a different initial position; we choose the first row, first column

Backtracking Approach

- Tables from [7] and [8] provide an estimate of the backtracking approach's complexity: the number of operations for the first solution [7] and the number of solutions [8], as functions of board size
- The problem can be posed as finding the first solution or the total number of solutions
- Finding the total number of solutions is obviously the more challenging problem
- Interesting observation: the first solution's complexity (i.e., number of operations) does not increase monotonically with board size

N-Queens Parallelization

- Different levels of parallelism added to improve performance
- Hardware-level parallelism:
  - Parallel column check
  - Multiple row validation
  - Check for next safe position
- Worked example (assume the first four queens have been placed and the fifth queen starts from the 1st row):
  - Sequential run: 11 steps
  - Parallel column check: 3 steps
  - Multiple row check appended: 1 step
- The fully parallel check gives an 11x speedup over sequential operation for this step; a bitwise sketch of the idea follows below
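
A rough C analogue of these hardware checks, assuming a bit-vector board representation (our illustration, not the original Handel-C): the rows and diagonals already attacked are kept as masks, so every row of the current column is validated in one logical operation, and the next safe position falls out of a single AND.

    #include <stdint.h>
    #include <stdio.h>

    /* All safe rows of the current column, computed in one step. */
    static uint32_t safe_rows(int n, uint32_t rows, uint32_t diag_up, uint32_t diag_dn)
    {
        uint32_t mask = (1u << n) - 1;
        return mask & ~(rows | diag_up | diag_dn);
    }

    int main(void)
    {
        /* hypothetical 8x8 state: some rows and diagonals already attacked */
        uint32_t rows = 0x1Bu, diag_up = 0x24u, diag_dn = 0x02u;
        uint32_t free_rows = safe_rows(8, rows, diag_up, diag_dn);
        uint32_t next = free_rows & -free_rows;    /* lowest safe row, 0 if none */
        printf("safe rows mask = 0x%02X, next safe row bit = 0x%02X\n",
               (unsigned)free_rows, (unsigned)next);
        return 0;
    }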

Experimental Setup

- Experiments conducted using RC1000 boards from Celoxica, Inc., and Tarari RC boards from Tarari, Inc.
- Each RC1000 board features a Xilinx Virtex-2000 FPGA, 8 MB of on-card SRAM, and PCI Mezzanine Card (PMC) sockets for connecting two daughter cards
- Each Tarari board features two user-programmable Xilinx Virtex-II FPGAs in addition to a controller FPGA, and 256 MB of DDR SDRAM
- Configurations designed in Handel-C using Celoxica's application mapping tool DK-2, along with Xilinx ISE for place and route
- Performance compared against a 2.4 GHz Xeon server and a 1.33 GHz Athlon server

Celoxica RC1000

- PCI-based card with one Xilinx FPGA and four memory banks
- FPGA configured from the host processor over the PCI bus
- Four memory banks, each of 2 MB, accessible to both the FPGA and any other device on the PCI bus
- Data transfers: the RC1000 provides three methods of transferring data over the PCI bus between the host processor and the FPGA:
  - Bulk data transfers performed via the memory banks
  - Two unidirectional 8-bit ports, called control and status ports, for direct communication between the FPGA and the PCI bus (note: this method is used in our experiments)
  - User I/O pins USER1 and USER0 for single-bit communication with the FPGA
- API-layer calls from the host to configure and communicate with the RC board

* Figure courtesy of Celoxica RC1000 manual

Tarari Content Processing Platform

- PCI-based board with three FPGAs and a 256 MB memory bank
- Two Xilinx Virtex-II FPGAs are available for the user, who loads configuration files from the host over the PCI bus
- Each Content Processing Engine, or CPE (user FPGA), is configured with one or two agents
- The third FPGA acts as a controller, providing high-bandwidth access to memory and configuration of the CPP with agents
- 256 MB of DDR SDRAM for data sharing between the CPEs and the host application
- Configuration files are first uploaded into memory slots and then used to configure each FPGA
- Both single-word transfers and DMA transfers are supported between the host and the CPP

* Figure courtesy of Tarari CP-DK manual

Handel-C Programming Paradigm

- Handel-C acts as a bridge between VHDL and C
- Comparison with conventional C:
  - More explicit provisioning of parallelism within the code
  - Variables declared with exact bit-lengths to save space
  - Provides more bit-level manipulations beyond shifts and logic operations
  - Limited support for many ANSI C standards and extensions
- Comparison with VHDL:
  - Application porting is much faster for experienced coders
  - Similar to VHDL behavioral models
  - Lacks VHDL concurrent signal assignments, which can be suspended until changes on input triggers (Handel-C requires polling)
  - Provides more higher-level routines

Handel-C Design Specifics

- The design makes use of the following two approaches:
- Approach 1:
  - Uses an array of binary numbers, holding a '1' at a particular bit position to indicate the location of the queen in that column
  - A 32 x 32 board requires an array of 32 elements of 32 bits each
  - Correspondingly uses bit-shift and logical-AND operations to check the diagonal and row conditions
  - Corresponds more closely to the way the operations take place on the RC fabric
- Approach 2:
  - Uses an array of integers instead of binary numbers
  - Correspondingly uses the mathematical model of the problem to check the validation conditions
  - Smaller variables yield better device utilization; slices occupied drop from about 75% to about 15% for similar performance and parallelism
- Approach 2 found to be more amenable to Handel-C designs (a sketch of the Approach 1 check follows below)
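
Approach 2's test is essentially the integer predicate sketched in the Introduction. The fragment below is our C rendering of the Approach 1 style (the actual designs are in Handel-C): each column stores a one-hot bit-vector, and the row and diagonal conditions reduce to shifts and logical ANDs.

    #include <stdint.h>
    #include <stdbool.h>

    /* colbits[c] is a one-hot bit-vector marking the row of the queen in
     * column c; a one-hot candidate for the current column is tested
     * against every placed queen with shifts and ANDs.                  */
    static bool safe_bits(const uint32_t *colbits, int col, uint32_t candidate)
    {
        for (int c = 0; c < col; c++) {
            int d = col - c;                         /* column distance */
            if (colbits[c] & candidate)              /* same row        */
                return false;
            if ((colbits[c] << d) & candidate)       /* one diagonal    */
                return false;
            if ((colbits[c] >> d) & candidate)       /* other diagonal  */
                return false;
        }
        return true;
    }

    int main(void)
    {
        uint32_t colbits[8] = { 1u << 0, 1u << 4, 1u << 7 };  /* queens in columns 0..2 */
        return safe_bits(colbits, 3, 1u << 5) ? 0 : 1;        /* try row 5 in column 3  */
    }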

Lessons Learned with Handel-C

Some interesting observations:
- Code for which place and route did not work finally worked when the function parameters were replaced by global variables
- Less control at the lower level, with place and route being a consistent problem even for designs using only 40% of total slices
- Self-referenced operations (e.g., a = a + x) affect the design adversely, so use intermediate variables (see the small illustration below)
- Order of operations and conditional statements can affect the design
- Useful to reduce wider-bit operations into a sequence of narrower-bit operations
- Balancing "if" with "else" branches leads to better designs
- Comments in the main program sometimes affected synthesis, leading to place-and-route errors in fully commented code
- We are still learning more every day!
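
As a concrete illustration of the intermediate-variable point, in C syntax (the effect itself was observed in our Handel-C designs, where each assignment maps to hardware):

    /* C-syntax mock-up of the rewrite described above; the two forms are
     * functionally identical, but the intermediate-variable form tended to
     * synthesize to a better design in our Handel-C experience.           */
    static int accumulate(int acc, int x)
    {
        /* self-referenced form that caused trouble:
         *     acc = acc + x;
         * rewrite with an intermediate variable:     */
        int tmp = acc + x;
        acc = tmp;
        return acc;
    }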

Sequential First-Solution Results

- Chart: Performance comparison of the sequential version with the host (time in ms vs. board size) for the RC1000 (clocked at 40 MHz), a dual Xeon server, and an Athlon server
- The sequential version does not perform well versus the Xeon and Athlon CPUs
- The algorithm needs an efficient design to minimize resource utilization
- The results do not include the one-time configuration overhead of ~150 ms
- Chart: Performance comparison of similar versions of the bit and integer algorithms (time in ms vs. board size) for the RC1000 bit version and integer version, both at 40 MHz
- [Table: slices occupied by algorithm type (bit manipulation, integer manipulation, parallel column checks, parallel row and column checks); the values 3%, 15%, 19%, and 78% appear in the original]

Parallel First-Solution Results

- Charts: Performance comparison of the parallel algorithm and of the different versions against the host (time in ms vs. board size) for the RC1000 with parallel column check, 2-row check appended, and 6-row check appended, versus the dual Xeon and Athlon servers; RC1000 clocked at 25 MHz

  Version                                        Slices occupied   Speedup vs. Xeon
  With parallel column check (for all columns)   3%                0.18
  With 2-row check appended                      5%                0.83
  With 6-row check appended                      15%               1.74

- The most parallel algorithm runs about 20x faster than the sequential algorithm on the RC fabric
- The parallel algorithm with two row checks almost duplicates the behavior of the 2.4 GHz Xeon server, while the 6-row check outperforms it by 74%
- Further increasing the number of rows checked is likely to further improve performance for larger problem sizes

Total Number of Solutions Method

- Employ a divide-and-conquer approach
- Can be seen as a parallel depth-first search
- Solutions obtained with the queen positioned in a given row of the first column are independent of solutions with the queen in other rows
- The technique allows for a high degree of parallelism (DoP); a sketch of the decomposition follows below
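
A software sketch of this decomposition (ours, not the on-board Handel-C): the subtree rooted at each first-column row is counted independently, so each subtree can be handed to a separate functional unit and the partial counts summed afterwards.

    #include <stdint.h>
    #include <stdio.h>

    /* Recursive bit-vector counter: rows/diag_up/diag_dn mark the rows and
     * diagonals attacked in the current column.                            */
    static uint64_t count_from(int n, uint32_t rows, uint32_t diag_up, uint32_t diag_dn)
    {
        uint32_t mask = (1u << n) - 1;
        if (rows == mask)
            return 1;                                   /* all N columns filled */
        uint64_t total = 0;
        uint32_t free_rows = mask & ~(rows | diag_up | diag_dn);
        while (free_rows) {
            uint32_t bit = free_rows & -free_rows;      /* next safe row */
            free_rows -= bit;
            total += count_from(n, rows | bit, (diag_up | bit) << 1, (diag_dn | bit) >> 1);
        }
        return total;
    }

    int main(void)
    {
        int n = 10;
        uint64_t total = 0;
        /* one independent subproblem per first-column row; in the real system
         * each functional unit takes a subset of these rows                  */
        for (int r = 0; r < n; r++) {
            uint32_t bit = 1u << r;
            total += count_from(n, bit, bit << 1, bit >> 1);
        }
        printf("N = %d: %llu solutions\n", n, (unsigned long long)total);
        return 0;
    }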

One-Board Total-Solutions Results

- Chart: Comparison of Tarari CPP vs. RC1000 (time in ms vs. board size) against the Xeon and Athlon servers, with one functional unit (FU) per board; RC1000 and Tarari clocked at 33 MHz

  Target platform          Area (slices)
  RC1000 (Virtex 2000e)    10%
  Tarari CPP (Virtex-II)   94%

- Designs in hardware perform around 1.7x faster than the Xeon server
- Performance on both RC platforms is similar for the same clock rates
- The RC1000 performs a notch better for smaller chess board sizes, while the Tarari CPP's performance improves with board size
- Almost the entire Virtex-II chip on the Tarari is occupied by one FU

Multiple Functional Units (FUs)

- Additional FUs used per chip to increase parallelism per chip
- Each FU searches for the number of solutions corresponding to a subset of rows in the first column
- The controller:
  - Handles communication with the host
  - Invokes all FUs in parallel
  - Combines all results
- [Diagram: on-board FPGA containing FUs fu1-fu10 and a controller, connected to the host processor]

Total-Solutions Results with Multiple FUs

- Charts: Performance comparison with the host (time in ms vs. board size) for the RC1000 with 1, 2, and 3 FUs versus the Xeon and Athlon servers, and speedup vs. FU scaling (RC speedup over the Xeon server for board size of 17); RC1000 clocked at 30 MHz

  N-Queens optimization   Area (slices)
  1 functional unit       10%
  2 functional units      21%
  3 functional units      29%

- The RC1000 with three FUs performs almost 5x faster than the Xeon server
- Speedup increases near-linearly with the number of FUs
- Area occupied scales linearly with the number of FUs

MPI for Inter-Board Communication

- To further increase system speedup (with more functional units), multiple boards are employed
- Each FU is programmed to search a subset of the solution space
- Servers communicate using the Message Passing Interface (MPI) to start the search in parallel and to obtain the final result (a host-side sketch follows below)
- [Diagram: two host servers, each with an on-board FPGA holding one or more FUs, communicating over MPI]
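
A hypothetical host-side sketch of this coordination in C with MPI: each rank drives one board, searches its share of the first-column rows, and the partial counts are combined with a reduction. The count_rows() helper is a software stand-in for starting a board's FUs and reading back the result; it is not a real Celoxica or Tarari API call.

    #include <mpi.h>
    #include <stdio.h>

    /* same software counter as in the earlier sketch, repeated so this file
     * stands alone; on the real system this work is done by the FUs         */
    static unsigned long long count_from(int n, unsigned rows, unsigned up, unsigned dn)
    {
        unsigned mask = (1u << n) - 1;
        if (rows == mask) return 1;
        unsigned long long total = 0;
        unsigned free_rows = mask & ~(rows | up | dn);
        while (free_rows) {
            unsigned bit = free_rows & -free_rows;
            free_rows -= bit;
            total += count_from(n, rows | bit, (up | bit) << 1, (dn | bit) >> 1);
        }
        return total;
    }

    static unsigned long long count_rows(int n, int first, int last)
    {
        unsigned long long total = 0;
        for (int r = first; r < last; r++) {        /* one subtree per first-column row */
            unsigned bit = 1u << r;
            total += count_from(n, bit, bit << 1, bit >> 1);
        }
        return total;
    }

    int main(int argc, char **argv)
    {
        int rank, size, n = 16;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        /* split the first-column rows evenly across the boards (ranks) */
        int per = (n + size - 1) / size;
        int first = rank * per;
        int last = (first + per < n) ? first + per : n;

        unsigned long long part = count_rows(n, first, last);
        unsigned long long total = 0;
        MPI_Reduce(&part, &total, 1, MPI_UNSIGNED_LONG_LONG, MPI_SUM, 0, MPI_COMM_WORLD);

        if (rank == 0)
            printf("N = %d: %llu solutions\n", n, total);
        MPI_Finalize();
        return 0;
    }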

Total-Solutions Results with MPI (Tarari)

- Charts: Performance comparison (time in ms vs. board size) for 1, 2, and 4 Tarari boards versus the Xeon and Athlon servers, and speedup vs. board scaling (RC speedup over the Xeon server for board size of 12); Tarari CPP clocked at 33 MHz
- Results show total execution time including MPI overhead
- Minimal MPI overhead is incurred (high computation-to-communication ratio)
- Communication overhead is bounded to 3 ms regardless of problem size, and initialization overhead is around 750 ms
- Overhead becomes negligible for large problem sizes
- Speedup scales near-linearly with the number of boards
- The 4-board Tarari design performs about 6.5x faster than the Xeon server

Total-Solutions Results with MPI (RC1000)

- Charts: Performance comparison with the host (time in ms vs. board size) for 1, 2, and 4 RC1000 boards versus the Xeon and Athlon servers, and speedup vs. board scaling (RC speedup over the Xeon server for board size of 12); RC1000 clocked at 30 MHz
- Results again show total execution time including MPI overhead; as with the Tarari boards, communication overhead stays bounded to 3 ms regardless of problem size, initialization overhead is around 750 ms, and overhead becomes negligible for large problem sizes
- Speedup scales near-linearly with the number of boards
- The 4-board RC1000 design performs about 12x faster than the Xeon server

Total-Solutions Results with MPI (Heterogeneous)

- Chart: Performance comparison with the host (time in ms vs. board size) for the 8-board configuration versus the Xeon and Athlon servers; RC1000 clocked at 30 MHz and Tarari at 33 MHz
- A heterogeneous mixture of boards, coordinating via MPI, is employed to solve the problem
- A total of 8 boards (4 RC1000 and 4 Tarari boards) allows up to 16 (4 x 3 + 4 x 1) FUs
- Communication overheads remain low, while MPI initialization overhead increases with the number of boards (now 1316 ms for 8 boards)
- The 8 boards perform about 21x faster than the Xeon server for a chess board size of 16
- What appears to be an unfair comparison really shows how the approach scales to many more FUs per FPGA (on higher-density chips)

Conclusions

- Parallel backtracking for solving the N-Queens problem in RC shows promise for performance
  - N-Queens is an important benchmark in the HPC community
  - RC devices outperform CPUs for N-Queens due to RC's efficient processing of fine-grained, parallel, bit-manipulation operations
  - Previously inefficient methods for CPUs, like backtracking, can be improved by reexamining their design
  - This approach can be applied to many other applications
  - Numerous parallel approaches developed at several levels
- Handel-C lessons learned
  - A "C-based" programming model for application mapping provides a degree of higher-level abstraction, yet still requires the programmer to code from a hardware perspective
  - Solutions produced to date show promise for application mapping

Future Work and Acknowledgements

- Compare application mappers with HDL design in terms of mapping efficiency
- Develop and use direct communication between FPGAs to avoid MPI overhead
- Export the approach featured in this talk to a variety of algorithms and HPC benchmarks for performance analysis and optimization
- Develop a library of application and middleware kernels for RC-based HPC

We wish to thank the following for their support of this research:
- Department of Defense
- Xilinx
- Celoxica
- Tarari
- Key vendors of our HPC cluster resources (Intel, AMD, Cisco, Nortel)

References

[1] B. Abramson and M. M. Yung, "Divide and Conquer under Global Constraints: A Solution to the N-Queens Problem."
[2] C. Erbas, S. Sarkeshik, and M. M. Tanik, "Different Perspectives of the N-Queens Problem," Department of Computer Science and Engineering, Southern Methodist University, Dallas.
[3] H. S. Wilf, "Algorithms and Complexity," University of Pennsylvania, Philadelphia.
[4] R. Sosic and J. Gu, "Fast Search Algorithms for the N-Queens Problem," IEEE Transactions on Systems, Man, and Cybernetics, Vol. 21, No. 6, pp. 1572-1576, Nov./Dec. 1991.
[5] http://www.cit.gu.edu.au/~sosic/nqueens.html
[6] http://bridges.canterbury.ac.nz/features/eight.html
[7] http://www.math.utah.edu/~alfeld/queens/queens.html
[8] http://www.jsomers.com/nqueen_demo/nqueens.html
[9] "A Polynomial Time Algorithm for the N-Queens Problem."
[10] http://remus.rutgers.edu/~rhoads/Code/code.html
[11] http://www.mactech.com/articles/mactech/Vol.13/13.12/TheEightQueensProblem/index.html
[12] http://www2.ilog.com/preview/Discovery/samples/nqueens/
[13] http://www.infosun.fmi.uni-passau.de/br/lehrstuhl/Kurse/Proseminar_ss01/backtracking_nm.pdf
[14] J. Han, J. Liu, and Q. Cai, "From Alife Agents to a Kingdom of N Queens."
[15] http://www.wi.leidenuniv.nl/~kosters/nqueens.html
[16] http://www.dsitri.de/projects/NQP/