ASAP: Automatic Speculative Acyclic Parallelization for Clusters
Hanjun Kim
Ph.D. Thesis, Department of Computer Science, Princeton University, September 2013.
  
  
While clusters of commodity servers and switches are the most
popular form of large-scale parallel computers, many programs are not
easily parallelized for clusters due to high inter-node communication
cost and the lack of globally shared memory.  Speculative Decoupled
Software Pipelining (Spec-DSWP) is a promising automatic
parallelization technique for clusters that speculatively partitions a
loop into multiple threads that communicate in a pipelined manner.
Speculation can complement conservative static analysis, making
automatic parallelization more robust and applicable.  Pipelining
allows Spec-DSWP to speculate only rarely occurring dependences while
respecting the other dependences through communication among threads.
Acyclic communication patterns in pipelining make the parallelized
programs tolerant of the high communication latency of clusters.  However,
since Spec-DSWP partitions a loop iteration (a transaction) into
multiple sub-transactions across multiple threads according to the
pipeline stages, a special runtime system is required that supports
multi-threaded transactions (MTXs).
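
As a rough illustration of the acyclic pipelining idea only, the sketch below splits a loop into two stages that run as separate threads and communicate in one direction through a queue. It is not the ASAP system: it shows no speculation, no MTX runtime, and no cluster communication, and the names run_pipeline, stage1, stage2, and SENTINEL are hypothetical.

```python
import queue
import threading

SENTINEL = object()  # hypothetical end-of-stream marker passed between stages


def run_pipeline(items):
    """Run a loop as two pipeline stages; data flows only from stage 1 to stage 2."""
    q = queue.Queue()
    results = []

    def stage1():
        # Stage 1: the sequential part of each iteration (here, the traversal).
        for x in items:
            q.put(x)
        q.put(SENTINEL)

    def stage2():
        # Stage 2: the per-iteration work, consumed in iteration order.
        while True:
            x = q.get()
            if x is SENTINEL:
                break
            results.append(x * x)

    threads = [threading.Thread(target=stage1), threading.Thread(target=stage2)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results
```

Because stage 2 never sends anything back to stage 1, stage 1 can run ahead without waiting on a round trip, which is why a pipeline of this shape tolerates latency on the communication link.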
 
This dissertation proposes the Automatic Speculative Acyclic
Parallelization (ASAP) system that enables Spec-DSWP for clusters
without any hardware modification.  The ASAP system supports
various speculation techniques that require different validation and
communication costs, and automatically parallelizes sequential loops
using the Spec-DSWP transformation with the optimal application of the
speculation techniques.  The ASAP system efficiently supports MTXs
to correctly execute the speculatively transformed programs on
clusters.  With a synergistic combination of speculation, acyclic
communication, and runtime system support, this approach achieves, or
demonstrates a path to achieving, scalable speedups of up to
109x for a wide range of applications on clusters without any
hardware modification.
