You can command the movement of data with the buttons below the drawing. A step counter appears in the icon above the drawing. For explanations of the function of the processor array, see the text following the drawing.

An *n* x *n* matrix the elements of which are distributed
across the processors of a two-dimensional torus-interconnected
processor array of size *n* x *n* is transposed in *n*
time steps. The illustration given above concerns the case *n =
4*. Only the subscripts of the matrix elements are shown: for
instance, ` "3,2"`

stands for element A(3,2) of a matrix A.

The algorithm runs on a systolic array, an abstraction of a parallel computer, which is represented as a cellular automaton. The big boxes, referred to as the processing cells, represent processing functions realized as combinatorial circuits and the small black boxes represent clocked delay elements. This automata notation is useful for the unified representation of systolic algorithms. In a parallel computer implementation (see the data-parallel program given below), delay elements are implemented as memory locations and processing cells as processor operations. We assume that a cell and the delay elements arranged in its inputs (to the right and below of the cell) are implemented on one processing node of a parallel computer.

The cells on the main diagonal of the processor array take the
data from their east and south inputs and transfer it to the the north
and west, respectively. The other cells transfer data from east to
west and south to north. At the beginning of the algorithm, the matrix
elements are resident in the delay elements which are arranged in the
horizontal interconnections and element *A(i,j)* resides in
processing node *(i,j); i = 1 ... 4; j = 1 ... 4*. Upon
completion of the algorithm, the elements will be resident in the in
the delay elements which are arranged in the vertical interconnections
and element *A(j,i)* will reside in processing node *(i,j); i =
1 ... 4; j = 1 ... 4*.

The algorithm functions as follows. The first row of the matrix is
shifted to the left in the first row of the processor array. The
direction of its movement is, however, changed in the leftmost
processor of this row: the matrix elements which enter this row from
the east are send to the north and then via the corresponding toroid
interconnection to the bottommost processor of the first column. (For
simplicity, the toroid wrap-around interconnections are only sketched
in the figure above.) After that these elements are shifted upwards in
the first column of the systolic array. The movement of the other
matrix rows is similar. In general, the *i*-th matrix row (*i =
1 ... n*) is shifted to the left in the *i*-th row of the processor
array and the lelements which enter processing node *(i,i)* are
redirected to move upwards, after which they are shifted upwards in
the *i*-th column of the processor array. The result is that
after n steps (clock periods) the *i*-th matrix row resides in
the *i*-th column of the processor array.

The above algorithm can be implemented as follows in a High Performance Fortran data-parallel program.

PROGRAM Transposition INTEGER, PARAMETER :: N=4 REAL, DIMENSION(N,N) :: A, AT, H, V REAL, DIMENSION(N) :: D INTEGER :: T !HPF$ ALIGN H(I,J) WITH V(I,J) !HPF$ ALIGN D(I) WITH V(I,I) 1 H = A DO T=1, N 3 FORALL (I=1:N) D(I) = H(I,I) 4 FORALL (I=1:N) H(I,I) = V(I,I) 5 FORALL (I=1:N) V(I,I) = D(I) 6 H = CSHIFT(H, DIM=2, SHIFT=1) 7 V = CSHIFT(V, DIM=1, SHIFT=1) END DO 8 AT = V STOP END

The two-dimensional data arrays H and V implement the memory locations
represented by the sets of delay elements arranged in the horizontal
and vertical interconnections of the systolic array given in the
drawing above. The matrix to be transposed is resident in the
two-dimensional array A; the result of the transposition will be
stored in the data array AT. The one dimensional array D is
used to swap the elements on the main diagonals of H and V.
The High Performance Fortran align directives (starting with
!HPF$) are included to
ensure that the counterpart elements *H(i,j)* and *V(i,j)*, (*i =
1 ... n; j = 1 ... n*) are allocated memory in the same
physical processing node when the program is compiled and executed.
For a similar reason, the auxiliary array D
is aligned with the main diagonals of V and H.

At the beginning of the algorithm the elements of the matrix A are assigned to the corresponding elements of the array H by means of the data-parallel statement with label 1. The counter T of the DO loop corresponds to the time counter in the interactive animation given above. The statements with labels 3, 4 and 5 realize the swapping of the diagonal elements of H and V; the auxiliary array D is used to prevent overwriting. After this swapping, the elements of H and V are circularly shifted by one position to the left and upwards by statements 6 and 7, respectively. Finally, the result of the transposition which upon completion of the algorithm is resident in V is assigned to AT (statement with label 8).