Preparing your result...
Loading...
Press Esc to dismiss this message

Sparse Matrix-Vector Multiplication on Processors with Small Memory Footprints (29-Jun-2009)

Thumbnail
IP.com Prior Art Database Disclosure (Source: IPCOM)
Disclosure Number IPCOM000184518D dated 29-Jun-2009
Originally published in Prior Art Database
Disclosed by: IBM
Country: Undisclosed
Disclosure File: 5 pages / 32.5 KB / English (United States)

We present a sparse matrix-dense vector multiplication procedure which is suitable for a subclass of sparse matrices (which we call block-sparse matrices), and which effectively harnesses the compute power of the Cell/B.E. processor. This technique could be mapped to GPUs, Larrabee, or any low-memory footprint microprocessor. .

This text was extracted from a PDF file.
This is the abbreviated version, containing approximately 24% of the total text.

Page 1 of 5

Sparse Matrix-Vector Multiplication on Processors with Small Memory Footprints

In this invention we describe an implementation of the 3x3 block sparse matrix multiply operation that leverages the architectural elements fast local storage, fast element interconnect bus (EIB), and high-

performance SIMD units to distribute, synchronize, and balance the

workload across multiple SPEs to achieve the highest performance

To make more effective use of the Cell/B.E.'s 4-way SIMD operations and avoid the sum-across operation, it would be convenient to calculate four elements of the result vector at a time. This can be done as follows. Assuming that the matrix

A

                               can be stored in column-major form, we load the first element of each of the first four rows into a vector register, replicate ("splat") b 0 into the four elements of a second vector register, multiply the registers together, and store the product into an accumulator. We then move to the next column of the original matrix, loading the first four elements into one vector register, replicating the corresponding vector element into another vector register, and performing a multiply-add of the registers to the accumulator. When all of
the columns have been processed in this manner, we have the first four elements of the result vector in the accumulator. This calculation is repeated to process the remaining rows, four at a time.

The soft-body dynamics application with which we are concerned gives us a block-sparse matrix with 3x3 subblocks. We represent this matrix as a row-major ordered list of 3x3 subblocks; within each subblock, the matrix elements are stored in column-major order. In this scenario, each SPE can copy some number of block rows into its local storage and perform the matrix-vector multiplication for that subset of the entire sparse matrix. SPEs can then work in

parallel and asynchronously, computing parts of the result vector and copying them out to main

storage via DMA. By computing on three rows (one block row) at a time we waste one of the four positions in the SIMD registers, but we consider this a reasonable price to pay for relatively straightforward code.

Implementing sparse matrix-vector multiplication on the Cell/B.E.™ presents a number of challenges. While the operation itself parallelizes readily (each block row of the matrix combines with the input vector to produce three elements of the output vector, which are not affected by any other block rows of the matrix), the limited amount of local store on the SPEs and the alignment requirements for efficient DMA of the matrix and vector data require some planning.

Workload Parallelization

In order to divide the workload between the SPEs we used a static approach in which each SPE works on a pre-determined set of rows (which is a multiple of the 32-block-row work unit). The assignment of rows to the SPEs is done before matrix multiplication begins and it is not dynamically m...

(Source: IPCOM)
First page image
(Source: IPCOM)