# Matrix transposition on a 2D torus of processors

## Interactive animation

You can command the movement of data with the buttons below the drawing. A step counter appears in the icon above the drawing. For explanations of the function of the processor array, see the text following the drawing.

## Algorithm description

An n x n matrix the elements of which are distributed across the processors of a two-dimensional torus-interconnected processor array of size n x n is transposed in n time steps. The illustration given above concerns the case n = 4. Only the subscripts of the matrix elements are shown: for instance, ` "3,2"` stands for element A(3,2) of a matrix A.

The algorithm runs on a systolic array, an abstraction of a parallel computer, which is represented as a cellular automaton. The big boxes, referred to as the processing cells, represent processing functions realized as combinatorial circuits and the small black boxes represent clocked delay elements. This automata notation is useful for the unified representation of systolic algorithms. In a parallel computer implementation (see the data-parallel program given below), delay elements are implemented as memory locations and processing cells as processor operations. We assume that a cell and the delay elements arranged in its inputs (to the right and below of the cell) are implemented on one processing node of a parallel computer.

The cells on the main diagonal of the processor array take the data from their east and south inputs and transfer it to the the north and west, respectively. The other cells transfer data from east to west and south to north. At the beginning of the algorithm, the matrix elements are resident in the delay elements which are arranged in the horizontal interconnections and element A(i,j) resides in processing node (i,j); i = 1 ... 4; j = 1 ... 4. Upon completion of the algorithm, the elements will be resident in the in the delay elements which are arranged in the vertical interconnections and element A(j,i) will reside in processing node (i,j); i = 1 ... 4; j = 1 ... 4.

The algorithm functions as follows. The first row of the matrix is shifted to the left in the first row of the processor array. The direction of its movement is, however, changed in the leftmost processor of this row: the matrix elements which enter this row from the east are send to the north and then via the corresponding toroid interconnection to the bottommost processor of the first column. (For simplicity, the toroid wrap-around interconnections are only sketched in the figure above.) After that these elements are shifted upwards in the first column of the systolic array. The movement of the other matrix rows is similar. In general, the i-th matrix row (i = 1 ... n) is shifted to the left in the i-th row of the processor array and the lelements which enter processing node (i,i) are redirected to move upwards, after which they are shifted upwards in the i-th column of the processor array. The result is that after n steps (clock periods) the i-th matrix row resides in the i-th column of the processor array.

## Data-parallel program implementation

The above algorithm can be implemented as follows in a High Performance Fortran data-parallel program.

```       PROGRAM Transposition

INTEGER, PARAMETER   :: N=4
REAL, DIMENSION(N,N) :: A, AT, H, V
REAL, DIMENSION(N)   :: D
INTEGER :: T

!HPF\$  ALIGN H(I,J) WITH V(I,J)
!HPF\$  ALIGN D(I)   WITH V(I,I)

1      H = A
DO T=1, N
3         FORALL (I=1:N) D(I) = H(I,I)
4         FORALL (I=1:N) H(I,I) = V(I,I)
5         FORALL (I=1:N) V(I,I) = D(I)
6         H = CSHIFT(H, DIM=2, SHIFT=1)
7         V = CSHIFT(V, DIM=1, SHIFT=1)
END DO
8      AT = V

STOP
END
```

The two-dimensional data arrays H and V implement the memory locations represented by the sets of delay elements arranged in the horizontal and vertical interconnections of the systolic array given in the drawing above. The matrix to be transposed is resident in the two-dimensional array A; the result of the transposition will be stored in the data array AT. The one dimensional array D is used to swap the elements on the main diagonals of H and V. The High Performance Fortran align directives (starting with !HPF\$) are included to ensure that the counterpart elements H(i,j) and V(i,j), (i = 1 ... n; j = 1 ... n) are allocated memory in the same physical processing node when the program is compiled and executed. For a similar reason, the auxiliary array D is aligned with the main diagonals of V and H.

At the beginning of the algorithm the elements of the matrix A are assigned to the corresponding elements of the array H by means of the data-parallel statement with label 1. The counter T of the DO loop corresponds to the time counter in the interactive animation given above. The statements with labels 3, 4 and 5 realize the swapping of the diagonal elements of H and V; the auxiliary array D is used to prevent overwriting. After this swapping, the elements of H and V are circularly shifted by one position to the left and upwards by statements 6 and 7, respectively. Finally, the result of the transposition which upon completion of the algorithm is resident in V is assigned to AT (statement with label 8).

## Literature

N. Petkov: Systolic Parallel Processing (Amsterdam: North-Holland, Elsevier Sci. Publ., 1993)