

Joint UIUC/UMD Parallel Algorithms/Programming Course

David Padua, University of Illinois at Urbana-Champaign
Uzi Vishkin, University of Maryland (speaker)
Jeffrey C. Carver, University of Alabama
Motivation 1/4
Programmers of today's parallel machines must overcome 3 productivity busters, beyond just identifying operations that can be executed in parallel:
(i) impose the often-difficult 4-step programming-for-locality recipe: decomposition, assignment, orchestration, and mapping [CS99]
(ii) reason about concurrency in threads; e.g., race conditions
(iii) for machines such as GPUs, which fall behind on serial (or low-parallelism) code, whole programs must be highly parallel

    Motivation 2/4: Commodity computer systems

If you want your program to run significantly faster … you're going to have to parallelize it
Parallelism: the only game in town

But, where are the players?
• "The Trouble with Multicore: Chipmakers are busy designing microprocessors that most programmers can't handle"—D. Patterson, IEEE Spectrum, 7/2010
• Only heroic programmers can exploit the vast parallelism in current machines—Report by CSTB, U.S. National Academies, 2011

• An education agenda must: (i) recognize this reality, (ii) adapt to it, and (iii) identify broad-impact opportunities for education
Motivation 3/4: Technical Objectives
• Parallel computing exists to provide speedups over serial computing
• Its emerging democratization ⇒ the general body of CS students & graduates must be capable of achieving good speedups

What is at stake?
A general-purpose computer that can be programmed effectively by too few programmers, or that requires excessive learning ⇒ application SW development costs more, weakening the market potential of not only the computer:
Traditionally, economists look to the manufacturing sector to better the recovery prospects of the economy. Software production is the quintessential 21st-century mode of manufacturing. These prospects are at peril if most programmers are unable to design effective software for mainstream computers.
  Motivation 4/4: Possible Roles for Education
• Facilitator. Prepare & train students and the
  workforce for a future dominated by parallelism.
• Testbed. Experiment with vertical approaches and
  refine them to identify the most cost-effective ways
  for achieving speedups.
• Benchmark. Given a vertical approach, identify the developmental stage at which it can be taught. Rationale: ease of learning/teaching is a necessary (though not sufficient) condition for ease of programming.
            The joint inter-university course
• UIUC: Parallel Programming for Science and Engineering, Prof: DP
• UMD: Parallel Algorithms, Prof: UV
• Student population: upper-division undergrads and graduate students; diverse majors and backgrounds
• ~1/2 of the fall 2010 sessions were held jointly by videoconferencing.
1. Demonstrate logistical and educational feasibility of a real-time co-taught course.
Outcome: Overall success, with minimal glitches. Helped alert students that success on material taught by the other prof is just as important.
2. Compare OpenMP on an 8-processor SMP against PRAM/XMTC on a 64-processor XMT (<1/4 of the silicon area of 2 SMP processors)
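For the OpenMP side of this comparison, a minimal illustrative sketch (not the actual course assignment; `sum_array` is a made-up example name) of an OpenMP parallel-for reduction:

```c
#include <stddef.h>

/* Minimal OpenMP-style parallel reduction (illustrative only).
 * Compiled with -fopenmp, the loop iterations are split across
 * threads and partial sums are combined by the reduction clause;
 * without -fopenmp the pragma is ignored and the same result is
 * computed serially. */
long sum_array(const long *a, size_t n) {
    long total = 0;
    #pragma omp parallel for reduction(+ : total)
    for (long i = 0; i < (long)n; i++)
        total += a[i];
    return total;
}
```

The point of the comparison above is that even such a one-pragma program must contend with per-thread scheduling overhead, which on a small SMP can erase the speedup.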

                       Joint sessions
• DP taught OpenMP programming and provided parallel-architecture background
• UV taught parallel (PRAM) algorithms, with ~20 minutes on XMTC
• 3 joint programming assignments

                    Non-shared sessions
• UIUC: mostly MPI. Students submitted more OpenMP programming assignments
• UMD: more parallel algorithms. Dry homework on design & analysis of parallel algorithms. Students submitted a more demanding XMTC programming assignment

JC: Anonymous questionnaire filled out by the students. Accessed by DP and UV only after all grades were posted, per IRB guidelines.
Rank approaches for achieving (hard) speedups
Breadth-first-search (BFS) example
• 42 students in the fall 2010 joint UIUC/UMD course
  - <1x speedups using OpenMP on an 8-processor SMP
  - 7x-25x speedups on a 64-processor XMT FPGA prototype

Questionnaire: All students but one ranked XMTC ahead of OpenMP for achieving speedups.

In view of this evidence: are we really ready for standards?
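To make the BFS assignment concrete, here is a hypothetical level-synchronous BFS over a graph in CSR form (not the students' actual code; `bfs_levels` and the CSR layout are assumptions for illustration). It is written serially, with the parallel step noted in comments:

```c
#include <stdlib.h>

/* Level-synchronous BFS sketch. Graph in CSR form: the neighbors of
 * vertex v are col[row[v]] .. col[row[v+1]-1]. Fills level[] with BFS
 * distances from src (-1 = unreachable).
 * In a parallel version (OpenMP parallel for, or an XMTC spawn), the
 * loop over the current frontier runs concurrently; claiming level[w]
 * then needs an atomic compare-and-set to resolve races when two
 * frontier vertices share a neighbor. */
void bfs_levels(int n, const int *row, const int *col, int src, int *level) {
    int *frontier = malloc((size_t)n * sizeof *frontier);
    int *next     = malloc((size_t)n * sizeof *next);
    int fsize = 0, depth = 0;

    for (int v = 0; v < n; v++) level[v] = -1;
    level[src] = 0;
    frontier[fsize++] = src;

    while (fsize > 0) {
        int nsize = 0;
        depth++;
        /* Parallel step: frontier vertices are independent. */
        for (int i = 0; i < fsize; i++) {
            int v = frontier[i];
            for (int e = row[v]; e < row[v + 1]; e++) {
                int w = col[e];
                if (level[w] == -1) {     /* atomic CAS in parallel code */
                    level[w] = depth;
                    next[nsize++] = w;
                }
            }
        }
        int *tmp = frontier; frontier = next; next = tmp;  /* swap frontiers */
        fsize = nsize;
    }
    free(frontier);
    free(next);
}
```

The irregular, data-dependent frontier sizes are exactly what makes BFS a revealing benchmark: fine-grained parallelism that a small SMP amortizes poorly but a many-core XMT can exploit.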

Parallel Random-Access Machine/Model (PRAM)

n synchronous processors, all having unit-time access to a shared memory.

You've got to be kidding; this is way:
- Too easy
- Too difficult:
Why even mention processors? What to do with n processors? How to allocate processors to instructions?
Immediate Concurrent Execution (ICE)

'Work-Depth framework' [SV82], adopted in parallel algorithms texts [J92, KKT01]. Example: pairwise parallel summation. 1st round for 8 elements: in parallel, 1st+2nd, 3rd+4th, 5th+6th, 7th+8th.
ICE is the basis for architecture specs:
V., Using simple abstraction to reinvent computing for parallelism, CACM 1/2011

Similar to the role of stored-program & program-counter in architecture specs for serial computing.
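The pairwise-summation example can be sketched as follows (an illustrative C rendering of the ICE rounds, not XMTC; `pairwise_sum` is a hypothetical name):

```c
#include <stddef.h>

/* ICE/Work-Depth-style pairwise summation (illustrative).
 * Round r adds pairs of partial sums that are 2^(r-1) apart; all
 * additions within one round are independent, so in the ICE
 * abstraction each round is a single parallel step. For n a power
 * of two this takes log2(n) rounds (depth O(log n), work O(n)).
 * Modifies a[] in place; the total ends up in a[0]. */
long pairwise_sum(long *a, size_t n) {
    for (size_t stride = 1; stride < n; stride *= 2) {
        /* In ICE, all iterations of this loop execute concurrently. */
        for (size_t i = 0; i + stride < n; i += 2 * stride)
            a[i] += a[i + stride];
    }
    return a[0];
}
```

For 8 elements, the first round performs 1st+2nd, 3rd+4th, 5th+6th, 7th+8th exactly as described above; the ICE abstraction lets the algorithm designer state the rounds without mentioning processors or their allocation.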
Feasible for many-cores

PRAM-On-Chip HW prototypes (programmer's workflow: algorithms → programming):
• 64-core, 75 MHz FPGA prototype of XMT [SPAA98..CF08]
• Toolchain: compiler + simulator [HIPS'11]; rudimentary yet stable compiler
• 128-core interconnection network, IBM 90nm: 9mm x 5mm, 400 MHz [HotI07]
• FPGA design → ASIC: IBM 90nm, 10mm x 10mm, 150 MHz

Architecture scales to 1000+ cores on-chip
XMT homepage: or search:
Has the study of PRAM algorithms helped XMT programming?
• Majority of UIUC students: No
• UMD students: Strong Yes, reinforced by written explanations

Exposure of UIUC students to PRAM algorithms and XMT programming was much more limited. Their understanding of this material was not challenged by analytic homework or exams.
For the same programming challenges, the performance of UIUC and UMD students was similar.
Must students be exposed to a minimal amount of parallel algorithms and their programming, and be properly challenged on analytic understanding, in order to internalize their merit? If yes: tension with the pressure on parallel computing courses to cover a hodge-podge of programming paradigms & architecture backgrounds.
More issues/lessons
• Recall the titles of the courses at UIUC/UMD: should we use class time only for algorithms or also for programming? Algorithms: high level of abstraction; allows covering more advanced problems. Note: understanding was tested only for UMD students.
• Made do with already-assigned courses. Next time: a more homogeneous population, e.g., a CS grad class. If interested in taking part, please let us know.
• General lesson: IRB requires pre-submission of all questionnaires. Must complete planning by then.
For parallelism to succeed serial computing in the mainstream, the first experience of students has to:
  - demonstrate solid hard speedups
  - be trauma-free
Beyond education: objective rankings of approaches for achieving hard speedups provide a clue for curing the ills of the field.

Course homepages and

For summary of the PRAM/XMT education approach:

Includes teaching experience extending from middle school to
  graduate courses, course material [class notes,
  programming assignments, video presentations of a full-
  day tutorial and a full-semester graduate course], a
  software toolchain (compiler and cycle-accurate simulator,
  HIPS 5/20) available for free download, and the XMT
How I teach parallel algorithms at different developmental stages
• Graduate: In class, the same PRAM algorithms course as in prior decades, with complexity-style dry HW. <20 minutes of XMTC programming. 6 programming assignments with hard-speedup targets, including parallel graph connectivity and XMT performance tuning.
• Upper-division undergraduate: Less dry HW. Less programming. Still demand hard speedups.
• Freshmen/HS [SIGCSE'10]: Minimal/no dry HW. Same problems as in the freshman serial programming course.
Understanding of parallel algorithms needs to be reinforced & validated by programming; otherwise most students will get very little from it.
What about architecture education?
• Badly need parallel architectures that make parallel thinking easier
• In the happy days of serial computing, stored-program + program-counter ⇒ wall between architecture and algorithms ⇒ algorithms got low priority. Not now!
• A trigger for XMT: brilliant incompetence of CSE@UMD. ECE faculty never teach undergrad algorithms courses. One can be an algorithms researcher and teach architecture courses … ⇒ XMT
Reality: Few regularly teach both architecture and (grad) algorithms courses, not to say parallel ones. But why rely on accidents?! Teach next-generation architecture students to master both, so that they can be better architects.
• "Very different thought styles are used for one and the same problem more often than are very closely related ones"—1935, Ludwik Fleck ('the Turing' of the sociology of science)
