Detailed Syllabus (tentative)
Contents
- Detailed Syllabus (tentative)
- Lectures
- 9/7: Intro, Shopping Day
- 9/14: Processor Design
- 9/21: Software strategies for single-core
- 9/28: Coarse-grain shared memory machines
- 10/5: Programming coarse-grain shared-memory machines
- 10/12: Very many cores (GPUs)
- 10/19: Distributed Memory Machines
- 10/26: Tools and Libraries
- 11/2: Common patterns of Parallel Programming
- 11/9: Implementation Concerns
- 11/16: Applications I
- 11/23: Applications II
- 11/30: Applications III
- 12/7: Overflow
- 12/14
- 12/21: Final Project Presentations
- Leftover Topics
- Thoughts
- References
Lectures
9/7: Intro, Shopping Day
- Mechanics
- Course Outline
- Administrative: Survey, Prepare Machine Access, Discussion group/mailing list
- History, Motivation
- Simple Examples of what the course teaches--give a taste of the style of the course
- Unix, Version Control (git/hg/something else?), Compiling and Running
Idea from LeVeque: homework submissions as a version-control tree
- Maybe this is not as 'missable' as the rest of this lecture?
9/14: Processor Design
- A processor designer's toolchest
- Basic makeup of a computer
- µp: Insn decode, register file, data path, functional units, load/store
- Memory subsystem: busses, banks, bandwidths, latencies
- Problems and solutions
- Memory wall: caches (lines, associativity etc.), parallelism
- Throughput: pipelining
- Branch prediction
- Exploit insn-stream parallelism: Superscalarity
- Scoreboarding, register renaming
- Amortize insn decode: SIMD
- Power: clock increase vs core count increase
- Virtual memory: TLBs
- OpenCL practicalities
- The OpenCL machine model
- OpenCL SIMD
- Timing and tuning
HW1: Better know a processor
- Reading: OpenCL spec, Ericsson's "perf engineering"
- Install AMD CPU OpenCL
Use for various microbenchmarks based on a provided template. (Provide cluster? OpenCL on cluster?)
Motivation:
- Drive the important (hard?) stuff home early so that it has enough time to sink in
- IMO, OpenCL is a good choice of basic parallelism mechanism because it does (sane) CPU SIMD, multi-core CPU, and GPU parallelism all through the same, reasonably efficient interface.
- It's also great because it makes code generation trivial: kernels are handed to the runtime as source strings and compiled at run time
9/21: Software strategies for single-core
(Andreas at GTC?)
(essentially [Ye07], lec. 2)
- Tiling
- Loop unrolling
- Cache Obliviousness
- Data dependencies, aliasing
- Data alignment
- Flops/byte vs. the memory hierarchy
- Performance modeling ([Ye07,2] again)
- Automated tuning
HW1 due
HW2:
- Single-proc matrix multiply optimization
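A minimal sketch of the tiling idea from the list above, as a possible starting point for HW2. It is plain Python for readability only and says nothing about real performance; the function names and the `tile` parameter are hypothetical, not part of any provided template. The tiled version visits the same multiply-adds as the naive one, grouped into blocks that could fit in cache.

```python
def matmul_naive(a, b):
    """Reference triple loop: C = A * B for matrices stored as lists of rows."""
    n, m, p = len(a), len(b), len(b[0])
    c = [[0.0] * p for _ in range(n)]
    for i in range(n):
        for j in range(p):
            s = 0.0
            for k in range(m):
                s += a[i][k] * b[k][j]
            c[i][j] = s
    return c


def matmul_tiled(a, b, tile=4):
    """Same product, computed block by block.

    `tile` is a hypothetical tuning parameter: in a compiled language it
    would be chosen so that three tile x tile blocks fit in cache.
    """
    n, m, p = len(a), len(b), len(b[0])
    c = [[0.0] * p for _ in range(n)]
    for ii in range(0, n, tile):          # block row of C
        for kk in range(0, m, tile):      # block along the shared dimension
            for jj in range(0, p, tile):  # block column of C
                # min(...) handles matrices whose size is not a multiple of tile
                for i in range(ii, min(ii + tile, n)):
                    row_c = c[i]
                    for k in range(kk, min(kk + tile, m)):
                        aik = a[i][k]
                        row_b = b[k]
                        for j in range(jj, min(jj + tile, p)):
                            row_c[j] += aik * row_b[j]
    return c
```

The block loops only reorder the work, so both functions should agree on any input; the interesting part of the homework is how the choice of tile size interacts with the cache sizes measured in HW1.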
9/28: Coarse-grain shared memory machines
- Characterizing target workloads
- inherently sequential, parallelizable, overlap (ex: parareal time integration?)
- memory access (=communication) patterns
- Why go parallel?
- Basic concurrency
- Sync primitives: Atomic operations, Compare-and-swap, Mutexes, condition variables, barriers, ...
- Multi-core CPUs
- Threads vs Processes
- Cache coherency
- Hyperthreading
- Data races
- Example: Dense linear algebra on SMP (draw from [Ye07] Demmel guest lectures)
10/5: Programming coarse-grain shared-memory machines
- OpenMP
- Work Sharing
- Data Sharing
- Synchronization
- Scheduling Clauses
- Initialization
- Data copying
- OpenCL on multiple cores
- The OpenCL execution model, indices
- Launch-as-synchronization
- OpenCL atomics
HW2 due
HW3:
- Various microbenchmarks on multiple cores--e.g. contention for memory bandwidth, cache snooping
- Implement a condition variable using atomic operations
Final project preliminary proposals due
10/12: Very many cores (GPUs)
(similar to the GPU lecture I've given a few times)
- Another example of the processor designer's toolchest: Choices made in GPU vs CPU
- GPU execution model
- GPU memory subsystem
HW3 due
HW4: GPU matmul
10/19: Distributed Memory Machines
- Machine models: NUMA, Cluster, Grid
- Intro MPI
- point-to-point synchronous
- deadlocks
- Sendrecv
- asynchronous point-to-point
- collectives
- one-sided
- coordinate mapping
- Interconnects
- again: bandwidth, latency
- Examples: GigE, InfiniBand
- inhomogeneity: core-to-core vs. machine-to-machine
- Odd interconnects:
- BG/P: torus (nearest-neighbor), tree, barrier
- Interaction CL+MPI
HW4 due
HW5: MPI microbenchmarks
10/26: Tools and Libraries
- BLAS, LAPACK
- PETSc/Trilinos (one of the two?)
- OProfile/TAU
- METIS/Zoltan
- Jumpshot
- Debuggers (MPI/GPU)
11/2: Common patterns of Parallel Programming
Relationship Parallelism <-> Locality
- Embarrassingly parallel
- Example: Monte Carlo simulation
- Master-slave
- Work unit processing
- Trees
- Map/Reduce, Parallel Scan
- Nearest-neighbor communication
- Game of Life
- Rendezvous
- Perhaps too specialized this early on?
- Load balancing
- Partitioning (what, not how)
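A possible in-class illustration of the embarrassingly parallel pattern and its Monte Carlo example from the list above. This is only a sketch: the function names, seeds, and sample counts are made up for illustration. Each work unit gets its own seeded generator, so the units share no state and the only communication is the final sum. In standard CPython the threads give no real speedup because of the global interpreter lock; the point here is the structure, not the timing.

```python
import random
from concurrent.futures import ThreadPoolExecutor


def count_hits(seed, samples):
    """One independent work unit: count random points inside the quarter circle."""
    rng = random.Random(seed)  # private generator, so units share no state
    hits = 0
    for _ in range(samples):
        x, y = rng.random(), rng.random()
        if x * x + y * y <= 1.0:
            hits += 1
    return hits


def estimate_pi(num_units=8, samples_per_unit=20000, workers=4):
    """Farm the work units out to a pool, then reduce the hit counts."""
    seeds = range(num_units)
    counts = [samples_per_unit] * num_units
    with ThreadPoolExecutor(max_workers=workers) as pool:
        hits = list(pool.map(count_hits, seeds, counts))
    return 4.0 * sum(hits) / (num_units * samples_per_unit)
```

Because every unit is seeded explicitly, the estimate does not depend on how many workers run or in which order the units finish.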
HW5 due
Final project final proposals due
HW6: GPU+MPI scan (two parts: write the MPI scan by next week, the GPU scan the week after, then combine)
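One way the two levels of the HW6 scan could fit together, sketched in serial Python. This is an assumed structure, not a reference solution: `blocked_inclusive_scan` and `num_blocks` are hypothetical names, each block stands in for one MPI rank, and the per-block scan stands in for the GPU part.

```python
from itertools import accumulate


def blocked_inclusive_scan(values, num_blocks):
    """Inclusive prefix sum computed block by block, as num_blocks ranks might."""
    n = len(values)
    # Contiguous chunks of (at most) `size` items; trailing chunks may be empty.
    size = -(-n // num_blocks) if n else 0  # ceiling division
    blocks = [values[i * size:(i + 1) * size] for i in range(num_blocks)]

    # Phase 1: every "rank" scans its own block independently.
    local = [list(accumulate(b)) for b in blocks]

    # Phase 2: exclusive scan over the block totals (the only global step).
    totals = [scanned[-1] if scanned else 0 for scanned in local]
    offsets = [0] + list(accumulate(totals))[:-1]

    # Phase 3: every rank adds the sum of all earlier blocks to its results.
    result = []
    for offset, scanned in zip(offsets, local):
        result.extend(x + offset for x in scanned)
    return result
```

Phases 1 and 3 need no communication at all, so only the scan over the block totals in phase 2 would have to cross ranks.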
Thoughts:
- Perhaps patterns and applications (lecture after next and following) could/should be interlaced
11/9: Implementation Concerns
- Warm-up: GPU reduction
- Sparse Matrices [Ye07]
- Fast SpMV on MPI and GPU
- Performance Modeling
- LogP
- Graph Partitioning (how)
11/16: Applications I
- Iterative system solvers
- Red-Black Gauss-Seidel
- Krylov
- Multigrid
HW6 due
Thoughts:
- The applications parts can be done somewhat interactively, taking suggestions from the students
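A small sketch of the Red-Black Gauss-Seidel item above, for the 2D Laplace equation on a regular grid with fixed boundary values. The function names and the sweep count are illustrative assumptions. The point to bring out in class is that all points of one color depend only on points of the other color, so each half-sweep could be updated in parallel.

```python
def red_black_sweep(u):
    """One red-black Gauss-Seidel sweep for the 2D Laplace equation, in place.

    `u` is a list of rows. The outermost rows and columns hold fixed
    (Dirichlet) boundary values and are never modified.
    """
    rows, cols = len(u), len(u[0])
    for color in (0, 1):  # 0 = "red" points, 1 = "black" points
        for i in range(1, rows - 1):
            for j in range(1, cols - 1):
                if (i + j) % 2 == color:
                    # Neighbors of a point always have the opposite color.
                    u[i][j] = 0.25 * (u[i - 1][j] + u[i + 1][j]
                                      + u[i][j - 1] + u[i][j + 1])


def solve_laplace(u, sweeps=200):
    """Apply a fixed number of sweeps; a real code would test a residual."""
    for _ in range(sweeps):
        red_black_sweep(u)
    return u
```

A quick sanity check is to set the boundary to the grid function i*j, which the five-point stencil reproduces exactly, and confirm that the interior converges to the same values.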
11/23: Applications II
- Structured Grids
- Halo/ghost cell exchange
- GPU
- Convolution
- Discuss GPU implementation of a few of these, e.g.
- reduction, scan, sparse matrix-vector
HW7: An OpenCL regular-grid Laplace finite-difference code (GPU-targeted, extra credit for good CPU perf)
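To make the halo/ghost cell exchange item above concrete, here is a 1D serial sketch. It is an assumed illustration only: the helper names are hypothetical, list copies stand in for the messages a real MPI code would send, and the actual HW7 code is 2D and written in OpenCL.

```python
def split_with_ghosts(u, parts):
    """Split u into `parts` equal chunks, each padded with one ghost cell per side."""
    size = len(u) // parts  # assumes len(u) is divisible by parts
    return [[0.0] + u[p * size:(p + 1) * size] + [0.0] for p in range(parts)]


def exchange_halos(chunks, left_bc, right_bc):
    """Fill ghost cells from neighboring chunks, or boundary values at the ends."""
    last = len(chunks) - 1
    for p, chunk in enumerate(chunks):
        # [-2] is the neighbor's last owned cell, [1] its first owned cell.
        chunk[0] = chunks[p - 1][-2] if p > 0 else left_bc
        chunk[-1] = chunks[p + 1][1] if p < last else right_bc


def jacobi_step(chunk):
    """Three-point average over owned cells; ghosts are read but not written."""
    new = chunk[:]
    for i in range(1, len(chunk) - 1):
        new[i] = 0.5 * (chunk[i - 1] + chunk[i + 1])
    return new


def distributed_step(u, parts, left_bc=0.0, right_bc=0.0):
    """One stencil step done chunk by chunk, then gathered back into one list."""
    chunks = split_with_ghosts(u, parts)
    exchange_halos(chunks, left_bc, right_bc)
    chunks = [jacobi_step(c) for c in chunks]
    return [x for c in chunks for x in c[1:-1]]  # drop ghosts when gathering
```

The result should not depend on the number of chunks, which makes comparison against a single-chunk run a convenient correctness test for the exchange.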
11/30: Applications III
- FFT
- The FFTW way, 'codelets'
- Parallelizing FFTs
- GPU FFTs
- N-body calculations
- FMM/tree codes? (idea only)
- Sorting (if time?)
12/7: Overflow
HW7 due
12/14
(No class?--Academic Calendar says "Legislative Day - Classes will run on a Thursday schedule")
12/21: Final Project Presentations
5:10 -- 7:00pm in room CIWW 317.
Leftover Topics
- Parallel Languages?
Thoughts
- For now, this is pretty rigidly 'bottom-up'. We might need to break this up a bit to provide motivation.
References
[Ye07] Yelick, CS267, Spring 2007.
