Fast Thermal Analysis on GPU for 3D-ICs with Integrated Microchannel Cooling
                   Zhuo Feng and Peng Li
Department of Electrical and Computer Engineering, {Michigan
           Technological, Texas A&M} University

                       ICCAD 2010
Outline
 Introduction
 Background
 GPU-based full-chip thermal analysis with
  microchannels
 Preconditioned iterative method on GPU
 Experimental results and conclusions
Introduction
 Effective thermal management for 3D-ICs is
  becoming increasingly challenging.
     Increasing power density and chip design
      complexity.
 Traditional heat sinks are expected to quickly
  reach their limits for meeting the cooling
  needs of 3D-ICs.
Introduction (cont.)
 Integrated on-chip microchannel cooling
  is considered a very promising
  solution.
     i.e. liquid cooling
 An experiment on a liquid-cooled 2D-IC.
     Peak on-chip temperature: from 85℃ to 57℃
     Maximum temperature variation: from 25℃ to
      6℃
Introduction (cont.)
 Existing design and optimization procedures for
  integrated microchannels are performed without
  considering the full-chip thermal profiles.
      May not provide the most “economic” solution
      Drawbacks: design complexity, packaging cost, etc.


 Hence, a comprehensive design and optimization
  flow should be closely coupled with the full-chip
  thermal analysis.
Why GPUs?
 The finite difference (FD) method is more suitable
  for general 3D full-chip thermal simulations.
 Accurate 3D thermal analysis at full-chip
  scale using the FD method can be very
  expensive, because it requires solving a huge
  linear system of equations with millions of
  unknowns.
Why GPUs? (cont.)
 GPU-based parallel computing has been
  employed in various electronic design
  automation (EDA) areas.
 Advantages
     High computing power in large-scale
      homogeneous computing, e.g. matrix
      multiplications
     Very high memory bandwidth
Contributions
 Proposes novel GPU-based full-chip thermal
  simulation methods for 3D-ICs with integrated
  microchannel cooling
      GPU-friendly data structures and algorithm flows
 Proposes a GPU-friendly two-step block relaxation
  scheme that integrates block-based vertical-line
  relaxations and liquid-flow-direction relaxations.
 Achieves good speedup.
      More than 35x faster than the CPU-based solver
      More than 360x faster than the direct solver
Background – liquid cooling in 3D ICs

 The liquid-cooled microchannels are typically
  integrated inside a wafer-level package, where the
  microchannels are connected to the liquid inlets and
  outlets using fluidic through silicon vias (TSVs).
 Heat can be removed more effectively than
  before because the thermal resistance of such
  integrated liquid-cooled heat sinks can be much lower
  than that of traditional fan-cooled heat sinks.
Background – finite difference (FD)
method
 Replacing derivative expressions with
  approximately equivalent difference quotients
  to approximate the solutions to differential
  equations.



 For some small h
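As an illustration, assuming the standard forward and central differences are meant here, the first and second derivatives of a function f can be approximated as:

```latex
f'(x) \approx \frac{f(x+h) - f(x)}{h}, \qquad
f''(x) \approx \frac{f(x+h) - 2f(x) + f(x-h)}{h^{2}}
```

The second-derivative form is the one that typically appears when a heat conduction equation is discretized on a regular grid.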
Background – full-chip thermal simulation

 Discretize the PDE of the original thermal
  circuit analysis problem by FD method.
 Solve GT = b where



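 G is the thermal conductance matrix, assembled from the
  thermal resistances between neighboring grid cells.
 T is the vector of unknown node temperatures.
 b collects the heat sources and the boundary (ambient) conditions.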
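A minimal sketch of how such a system arises, assuming a 1D chain of thermal nodes rather than a full 3D chip. The chain model, the helper names `build_chain` and `solve_tridiagonal`, and all numeric values are illustrative assumptions, not taken from the paper.

```python
def build_chain(n, g, power, t_amb):
    """Assemble the tridiagonal system G T = b for n nodes in a row.

    Assumption: neighboring nodes are linked by conductance g (W/K) and
    both end nodes are also tied to the ambient temperature t_amb via g.
    """
    lower = [-g] * n          # lower[i] couples node i to node i-1 (lower[0] unused)
    diag = [2.0 * g] * n      # every node has two links of conductance g
    upper = [-g] * n          # upper[i] couples node i to node i+1 (last one unused)
    b = [float(p) for p in power]
    b[0] += g * t_amb         # boundary terms move to the right-hand side
    b[-1] += g * t_amb
    return lower, diag, upper, b

def solve_tridiagonal(lower, diag, upper, b):
    """Thomas algorithm: direct solve of a tridiagonal system."""
    n = len(diag)
    c, d = [0.0] * n, [0.0] * n
    c[0], d[0] = upper[0] / diag[0], b[0] / diag[0]
    for i in range(1, n):
        m = diag[i] - lower[i] * c[i - 1]
        c[i] = upper[i] / m
        d[i] = (b[i] - lower[i] * d[i - 1]) / m
    x = d[:]
    for i in range(n - 2, -1, -1):
        x[i] = d[i] - c[i] * x[i + 1]
    return x

# 1 W injected at the middle node, ambient at 50 degrees C.
lower, diag, upper, b = build_chain(5, 2.0, [0, 0, 1.0, 0, 0], 50.0)
T = solve_tridiagonal(lower, diag, upper, b)
print(T)
```

A full-chip FD model has the same structure, but each node couples to neighbors in three dimensions, so G is no longer tridiagonal and iterative solvers become attractive.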
Background – GPU programming
Architecture of Nvidia GTX280
 A collection of 30 multiprocessors, with 8 streaming
  processors each.
 The 30 multiprocessors share one off-chip global
  memory.
      Access time: about 300 clock cycles
 Each multiprocessor has an on-chip memory shared
  by its 8 streaming processors.
      Access time: 2 clock cycles
Some differences between GPU and CPU

                          GPU (NVIDIA GeForce 8800 GTX)   CPU (Intel Pentium 4)
Peak FLOPS                345.6 G                         ~12 G
Memory bandwidth          86.4 GB/s (900 MHz memory       6.4 GB/s (800 MHz memory
                          clock, 384-bit interface,       clock, 32-bit interface,
                          2 issues)                       2 issues)
Access time of global     Slow (about 500 memory          Fast (about 5 memory
memory                    clock cycles)                   clock cycles)
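The bandwidth figures in the table can be reproduced from the listed clock, interface width, and issue count. The sketch below assumes that "2 issues" means two transfers per memory clock and that 1 GB = 10^9 bytes; the function name is hypothetical.

```python
def peak_bandwidth_gb_s(memory_clock_mhz, bus_width_bits, transfers_per_clock):
    """Peak memory bandwidth in GB/s (assuming 1 GB = 1e9 bytes)."""
    bytes_per_transfer = bus_width_bits / 8
    return memory_clock_mhz * 1e6 * bytes_per_transfer * transfers_per_clock / 1e9

gpu_bw = peak_bandwidth_gb_s(900, 384, 2)   # GeForce 8800 GTX row
cpu_bw = peak_bandwidth_gb_s(800, 32, 2)    # Pentium 4 row
print(gpu_bw, cpu_bw)
```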
GPU-based full-chip thermal analysis
with microchannels
 Many things need to be considered for obtaining the
  most “economic” microchannel designs.
      Pumping power, placement, sizing, …
 Fine-grained thermal modeling and analysis including
  microchannel cooling is non-trivial due to the high
  modeling complexity and simulation costs.
      Model extraction cost and thermal simulation cost
 The problem's characteristics (large, regular computations) are a good match for the GPU.
The proposed two-step block
relaxation scheme
 Considers heat dissipation in two directions
  (Z and Y).
Details
 In the first step, the nodes in a block of vertical
  lines are relaxed together (lines L1 to L3 in Fig. 4).
  Such relaxations allow fast solution updates along the
  vertical heat dissipation paths within the block.
 In the second step, a few relaxations along the
  microchannel routing direction (liquid-flow direction)
  are performed so that the solution is also updated in
  the liquid-flow direction.
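The two steps can be sketched on a small 2D grid with a vertical index z and a flow-direction index y. This is a simplified serial illustration, not the paper's GPU scheme: it assumes pure conduction with symmetric conductances `gz` and `gy`, a conductance `g0` from every node to the coolant, and temperatures measured as a rise above the coolant. All function names and values are hypothetical.

```python
def thomas(lo, di, up, rhs):
    """Direct solve of a tridiagonal system (lo[0] and up[-1] are ignored)."""
    n = len(di)
    c, d = [0.0] * n, [0.0] * n
    c[0], d[0] = up[0] / di[0], rhs[0] / di[0]
    for i in range(1, n):
        m = di[i] - lo[i] * c[i - 1]
        c[i] = up[i] / m
        d[i] = (rhs[i] - lo[i] * d[i - 1]) / m
    x = d[:]
    for i in range(n - 2, -1, -1):
        x[i] = d[i] - c[i] * x[i + 1]
    return x

def diag_entry(z, y, nz, ny, gz, gy, g0):
    """Diagonal of G at node (z, y): sum of all attached conductances."""
    return g0 + gz * ((z > 0) + (z < nz - 1)) + gy * ((y > 0) + (y < ny - 1))

def relax_line(T, b, fixed, along, nz, ny, gz, gy, g0):
    """Solve one grid line exactly while all other nodes are held fixed."""
    n = nz if along == 'z' else ny
    g_line, g_cross = (gz, gy) if along == 'z' else (gy, gz)
    di, rhs, nodes = [], [], []
    for i in range(n):
        z, y = (i, fixed) if along == 'z' else (fixed, i)
        nodes.append((z, y))
        di.append(diag_entry(z, y, nz, ny, gz, gy, g0))
        r = b[z][y]
        # Neighbors outside the line enter the right-hand side.
        for dz, dy in ((0, -1), (0, 1)) if along == 'z' else ((-1, 0), (1, 0)):
            zz, yy = z + dz, y + dy
            if 0 <= zz < nz and 0 <= yy < ny:
                r += g_cross * T[zz][yy]
        rhs.append(r)
    sol = thomas([-g_line] * n, di, [-g_line] * n, rhs)
    for (z, y), v in zip(nodes, sol):
        T[z][y] = v

def residual_norm(T, b, nz, ny, gz, gy, g0):
    total = 0.0
    for z in range(nz):
        for y in range(ny):
            r = b[z][y] - diag_entry(z, y, nz, ny, gz, gy, g0) * T[z][y]
            for zz, yy, g in ((z - 1, y, gz), (z + 1, y, gz), (z, y - 1, gy), (z, y + 1, gy)):
                if 0 <= zz < nz and 0 <= yy < ny:
                    r += g * T[zz][yy]
            total += r * r
    return total ** 0.5

def two_step_sweep(T, b, nz, ny, gz, gy, g0):
    for y in range(ny):   # step 1: vertical-line relaxations
        relax_line(T, b, y, 'z', nz, ny, gz, gy, g0)
    for z in range(nz):   # step 2: flow-direction relaxations
        relax_line(T, b, z, 'y', nz, ny, gz, gy, g0)

nz, ny, gz, gy, g0 = 4, 6, 5.0, 1.0, 1.0
b = [[1.0 if (z, y) == (0, 2) else 0.0 for y in range(ny)] for z in range(nz)]
T = [[0.0] * ny for _ in range(nz)]
r0 = residual_norm(T, b, nz, ny, gz, gy, g0)
for _ in range(100):
    two_step_sweep(T, b, nz, ny, gz, gy, g0)
r1 = residual_norm(T, b, nz, ny, gz, gy, g0)
print(r0, r1)
```

On a GPU, the independent line solves within one step would be distributed across thread blocks; the loops here only show the order of the updates.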
But why?
 Efficiencies of typical iterative methods
  usually depend on
      Efficiency of the sparse matrix-vector operations
      Effectiveness of the relaxation (iteration) scheme
 Existing iterative algorithms focus only on vertical heat
  dissipation.
      Horizontal (in-plane) dissipation in traditional 2D ICs is
       negligible because of the relatively small thermal conductance
      This is not the case in 3D ICs
Preconditioned iterative method on
GPU
 Two critical issues about run time.
     Matrix representation format
     Convergence rate of iterative method


 Use an ELL-like format and a preconditioning
  technique.
Matrix representation format
 GPU-based computations should ensure that most
  global memory accesses are coalesced, so the data
  structure and its memory access pattern must be
  designed carefully.
 Use three 1D vectors to fully represent the sparse
  matrix in a layout that allows memory coalescing.
      Diagonal values, off-diagonal values, and the corresponding indices
 2x to 3x speedup compared with the CSR format.
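One way to read the three-vector layout is sketched below. It assumes the off-diagonal values and their column indices are padded to a fixed width per row and stored column-major, as in the ELL format, so that consecutive rows (GPU threads) read consecutive addresses. The names `to_ell_like` and `spmv` and the padding convention are assumptions, not the paper's exact data structure.

```python
def to_ell_like(dense):
    """Split a square matrix into (diag, offdiag, cols, width).

    Entry k of row i is stored at flat index k * n + i. Short rows are
    padded with value 0.0 and the row's own index.
    """
    n = len(dense)
    rows = [[(j, v) for j, v in enumerate(r) if v != 0.0 and j != i]
            for i, r in enumerate(dense)]
    width = max((len(r) for r in rows), default=0)
    diag = [dense[i][i] for i in range(n)]
    offdiag = [0.0] * (width * n)
    cols = [0] * (width * n)
    for i, r in enumerate(rows):
        for k in range(width):
            j, v = r[k] if k < len(r) else (i, 0.0)
            offdiag[k * n + i] = v
            cols[k * n + i] = j
    return diag, offdiag, cols, width

def spmv(diag, offdiag, cols, width, x):
    """y = A x; the body of the loop over i is what one GPU thread would run."""
    n = len(diag)
    y = [0.0] * n
    for i in range(n):
        acc = diag[i] * x[i]
        for k in range(width):
            acc += offdiag[k * n + i] * x[cols[k * n + i]]
        y[i] = acc
    return y

A = [[4.0, -1.0, 0.0, -1.0],
     [-1.0, 4.0, -1.0, 0.0],
     [0.0, -1.0, 4.0, -1.0],
     [-1.0, 0.0, -1.0, 4.0]]
diag, offdiag, cols, width = to_ell_like(A)
y = spmv(diag, offdiag, cols, width, [1.0, 2.0, 3.0, 4.0])
print(y)
```

For a fixed k, threads i and i + 1 read `offdiag[k * n + i]` and `offdiag[k * n + i + 1]`, which are adjacent in memory. A row-major CSR layout does not give this access pattern when one thread handles one row.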
Example
Conjugate gradient (CG) method
 The CG method is an algorithm for the numerical solution
  of particular systems of linear equations, namely those
  whose matrix is symmetric and positive-definite.
 The CG is an iterative method, so it can be applied to
  sparse systems that are too large to be handled by direct
  methods such as the Cholesky decomposition. Such
  systems often arise when numerically solving partial
  differential equations
 Equivalent to minimizing f(x) = (1/2) x^T A x - x^T b.

 Assuming exact arithmetic, CG converges in at most n
  steps, where n is the size of the matrix of the system (here
  n=2).
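A minimal sketch of the textbook CG iteration in plain Python, using a dense 2x2 symmetric positive-definite system chosen only for illustration. The function names are hypothetical, and the stopping tolerance mirrors the residual criterion quoted later in the slides.

```python
def matvec(A, x):
    return [sum(a * xi for a, xi in zip(row, x)) for row in A]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def conjugate_gradient(A, b, tol=1e-6, max_iter=None):
    """Solve A x = b for symmetric positive-definite A.

    Returns (x, iterations); stops once the residual 2-norm is below tol.
    """
    n = len(b)
    max_iter = n if max_iter is None else max_iter
    x = [0.0] * n
    r = b[:]                       # residual b - A x for the start x = 0
    p = r[:]                       # first search direction
    rs = dot(r, r)
    for k in range(max_iter):
        if rs ** 0.5 < tol:
            return x, k
        Ap = matvec(A, p)
        alpha = rs / dot(p, Ap)    # exact line search along p
        x = [xi + alpha * pi for xi, pi in zip(x, p)]
        r = [ri - alpha * api for ri, api in zip(r, Ap)]
        rs_new = dot(r, r)
        p = [ri + (rs_new / rs) * pi for ri, pi in zip(r, p)]
        rs = rs_new
    return x, max_iter

A = [[4.0, 1.0], [1.0, 3.0]]
b = [1.0, 2.0]
x, iters = conjugate_gradient(A, b)
print(x, iters)
```

With n = 2 the loop runs at most twice, in line with the n-step bound stated above.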
Preconditioning
 The plain conjugate gradient (CG) method takes too many
  iterations because the matrix is usually ill-conditioned.
      Condition number
 Preconditioning reduces the iteration count, but the total
  runtime can still exceed that of plain CG if the
  preconditioner is ineffective or expensive to apply.
      Even though the number of iterations is lower
 Three methods are compared:
     CG, diagonal-preconditioned CG (DPCG), and multi-grid
      preconditioned CG (MGPCG)
Preconditioning (cont.)
 Preconditioning applies a transformation, called the
  preconditioner, that conditions a given
  problem into a form that is more suitable for
  numerical solution.
 Preconditioned system
 Preconditioned iterative method


 Practical preconditioner
   Multi-grid preconditioner
 The details are not entirely clear, but the idea is to coarsen
  the grid to reduce complexity.
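As a sketch of the simplest of the compared preconditioners, the code below runs CG with and without a diagonal (Jacobi) preconditioner, assuming the usual form in which M^{-1} A x = M^{-1} b is solved with M = diag(A). The badly scaled test matrix and the name `pcg` are made up for illustration; the multi-grid preconditioner is not shown.

```python
def matvec(A, x):
    return [sum(a * xi for a, xi in zip(row, x)) for row in A]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def pcg(A, b, m_inv=None, tol=1e-6, max_iter=5000):
    """Preconditioned CG; m_inv is the diagonal of M^{-1} (None = plain CG)."""
    n = len(b)
    m_inv = [1.0] * n if m_inv is None else m_inv
    x = [0.0] * n
    r = b[:]
    z = [mi * ri for mi, ri in zip(m_inv, r)]   # preconditioned residual
    p = z[:]
    rz = dot(r, z)
    for k in range(max_iter):
        if dot(r, r) ** 0.5 < tol:
            return x, k
        Ap = matvec(A, p)
        alpha = rz / dot(p, Ap)
        x = [xi + alpha * pi for xi, pi in zip(x, p)]
        r = [ri - alpha * api for ri, api in zip(r, Ap)]
        z = [mi * ri for mi, ri in zip(m_inv, r)]
        rz_new = dot(r, z)
        p = [zi + (rz_new / rz) * pi for zi, pi in zip(z, p)]
        rz = rz_new
    return x, max_iter

# Illustrative SPD matrix: a tridiagonal stencil whose rows and columns
# are scaled over three orders of magnitude (hypothetical test case).
n = 30
scale = [10.0 ** (3.0 * i / (n - 1)) for i in range(n)]
A = [[0.0] * n for _ in range(n)]
for i in range(n):
    A[i][i] = 4.0 * scale[i] ** 2
    if i + 1 < n:
        A[i][i + 1] = A[i + 1][i] = -scale[i] * scale[i + 1]
b = [1.0] * n

x_cg, it_cg = pcg(A, b)
x_dp, it_dp = pcg(A, b, m_inv=[1.0 / A[i][i] for i in range(n)])
print("plain CG iterations:", it_cg, " diagonal PCG iterations:", it_dp)
```

On a badly scaled system like this one, the diagonally preconditioned run is expected to need far fewer iterations than plain CG, although the exact counts depend on floating-point details. Applying M^{-1} costs only one multiplication per unknown, which is why this preconditioner is cheap per iteration.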
Experimental results
 Environment
       Intel Core 2 Quad 2.66GHz with one NVIDIA
        GeForce GTX 285
      DRAM: 6 GB for CPU, 2 GB for GPU
      C++ and CUDA on Linux
   Inlet water temperature: 50℃
   A set of 3D designs, each stacking six 2D dies.
   Convergence criterion of iterative solver: residual norm <
    10^-6.
   The error is negligible.
Experimental results (cont.)




 Traditional smoothing is vertical-line smoothing.
 Significant speedup of at least 35x.
Conclusions
 Proposes GPU-based thermal simulation
  methods for 3D ICs with integrated liquid-cooled
  microchannels.
 Proposes a GPU-friendly two-step block-based relaxation
  scheme.
 Achieves highly accurate results with significant
  speedup.