Hardware/Software Interaction
We have chosen to give software access to the low-level capabilities of
our hardware. This low-level control, in conjunction with the natural
multicast capability of our interconnect, allows system software to
provide applications with a rich set of features.
We first describe some of the low-level control that is provided to
system software, and then briefly describe some of the capabilities this
control gives to applications and system software.
- System software can bypass the hardware coherence protocol to
read and write the tags, state information, and data of any
memory module in the system, any network cache, and the local
secondary cache. These accesses can be performed atomically with
respect to coherence actions by the hardware and with respect to other
such accesses by software.
The data of
the caches can be accessed either by index or by address.
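As an illustration of access "by index" versus "by address", the C sketch below shows how software could translate a physical address into the cache index and tag it occupies. The line size and cache geometry here are hypothetical placeholders, not NUMAchine's actual parameters.

```c
#include <stdint.h>
#include <assert.h>

/* Hypothetical geometry for a direct-mapped secondary cache; the real
 * NUMAchine parameters may differ. */
#define LINE_SIZE  128u        /* bytes per cache line (assumed) */
#define NUM_SETS   8192u       /* lines in the cache (assumed) */

/* Map a physical address to the index an access "by address" would
 * resolve to, so software can instead access the same line "by index". */
static uint32_t cache_index_of(uint64_t paddr)
{
    return (uint32_t)((paddr / LINE_SIZE) % NUM_SETS);
}

/* The tag is the portion of the line address above the index bits. */
static uint64_t cache_tag_of(uint64_t paddr)
{
    return (paddr / LINE_SIZE) / NUM_SETS;
}
```

Two addresses that differ by exactly NUM_SETS lines map to the same index with different tags, which is why software access by index also needs the tag read/write capability described above.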
- A processor can request any memory in the system to: invalidate
shared copies of any of its cache lines, kill dirty copies, and obtain
(at memory) a clean exclusive copy. Similarly, a processor can
request any network cache to: invalidate any shared copies of cache
lines local to the station, kill a dirty local copy, prefetch a cache
line that will soon be accessed, or write a dirty cache line back to
(remote) memory.
- Some of the above operations (e.g. requests to set the state of
a cache line, invalidate a cache line, and write-back a dirty cache
line) can be done using a block operation that affects all cache lines
in a range of physical memory. For example, a single request to a
network cache can be used to invalidate the locally cached copies of
data belonging to a sequence of physical pages. (The initiating processor
receives an interrupt when the requested operation has completed.)
Also, a block prefetch request can be made to the network cache, which
will asynchronously prefetch the requested block of data from remote
memory.
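The block-operation semantics above can be sketched as a small sequential model in C. The line size, the per-line state array, and the flag standing in for the completion interrupt are all illustrative assumptions, not NUMAchine interfaces.

```c
#include <stdint.h>
#include <stdbool.h>
#include <assert.h>

#define LINE_SIZE 128u   /* bytes per cache line (assumed) */

/* Model of a block invalidate request to a network cache: the hardware
 * walks the physical range one cache line at a time and interrupts the
 * initiating processor when the last line is done.  Here state[i] models
 * whether line i is cached locally, and *done_intr models the interrupt. */
static uint32_t block_invalidate(bool state[], uint64_t base,
                                 uint64_t len, bool *done_intr)
{
    uint64_t first = base / LINE_SIZE;
    uint64_t last  = (base + len - 1) / LINE_SIZE;
    uint32_t n = 0;

    for (uint64_t line = first; line <= last; line++, n++)
        state[line] = false;          /* drop the locally cached copy */
    *done_intr = true;                /* completion interrupt */
    return n;
}
```

A block prefetch would walk the same range but issue asynchronous fetches instead of invalidations.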
- NUMAchine supports efficient coherent memory-to-memory block
copy operations, where the unit of transfer is a multiple, n, of the
cache line size. The request to copy a region of memory is made to
the target memory module, which for each group of n cache lines: kills any
existing cached copies of the cache lines, and makes a request to
the source memory module to transfer them. The source memory module
collects any outstanding dirty copies of affected cache lines from
secondary caches, and transfers the data to the target memory using a
single large request of n cache lines. An efficient block transfer
capability facilitates page migration and replication, which are used in
NUMA systems to improve performance through improved locality [14][7].
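The copy protocol just described (kill cached copies at the target, collect dirty copies at the source, then transfer the lines in one request) can be modelled sequentially. The data structures below are purely illustrative; in hardware these steps are coherence transactions, not memcpy calls.

```c
#include <stdint.h>
#include <string.h>
#include <assert.h>

#define LINE_SIZE 128u  /* bytes per cache line (assumed) */

/* Toy model of one memory line: its contents at memory, plus a possibly
 * dirty copy held in some secondary cache. */
struct line {
    uint8_t data[LINE_SIZE];    /* contents at the memory module */
    int     dirty_cached;       /* a secondary cache holds a dirty copy */
    uint8_t cached[LINE_SIZE];  /* that dirty copy's contents */
};

/* Coherent block copy of n lines from src memory to dst memory: dirty
 * cached copies are collected at the source first, so the bulk transfer
 * always moves current data; cached copies at the target are killed. */
static void block_copy(struct line *dst, struct line *src, unsigned n)
{
    for (unsigned i = 0; i < n; i++) {
        if (src[i].dirty_cached) {              /* collect dirty copy */
            memcpy(src[i].data, src[i].cached, LINE_SIZE);
            src[i].dirty_cached = 0;
        }
        memcpy(dst[i].data, src[i].data, LINE_SIZE);  /* bulk transfer */
        dst[i].dirty_cached = 0;                /* target copies killed */
    }
}
```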
- System software can directly specify some of the fields of
packets generated in the processor module.
This can be used to implement special commands and to multicast packets
to many targets by specifying the routing mask used for distributing
the packet. For example, software can supply a routing mask to be
used for write-back packets, causing subsequent write-backs of cache
lines from the secondary cache to be multicast directly to the set of
network caches specified by the routing mask (as well as to memory).
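One way to picture routing-mask multicast is as target selection from a bit vector. The model below treats the mask as a flat one-bit-per-station map, which is a simplification: NUMAchine's actual masks reflect its ring hierarchy, and the station count here is arbitrary.

```c
#include <stdint.h>
#include <assert.h>

/* A routing mask modelled as one bit per station (an assumption).  A
 * multicast packet, e.g. a write-back, is delivered to every station
 * whose bit is set; the function returns how many targets there are. */
static unsigned multicast_targets(uint16_t routing_mask,
                                  unsigned stations[16])
{
    unsigned n = 0;
    for (unsigned s = 0; s < 16; s++)
        if (routing_mask & (1u << s))
            stations[n++] = s;
    return n;
}
```

The point of supplying the mask with the packet is that one write-back traverses the network once yet updates every selected network cache as well as memory.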
- Each processor module has two interrupt registers, one for
cross-processor interrupts and one for device interrupts.
Data written to an interrupt register is ORed with the current contents
of the register, and causes an interrupt.
When the processor reads an interrupt register, the register
is automatically cleared. For cross-processor interrupts, the
requesting processor identifies the location of the target register(s)
using a routing mask and a per-station processor bit mask. This allows an
interrupt to be multicast to any number of processors, which may be used
for an efficient implementation of TLB
shoot-down [2] or other coordinated OS
activities [19].
On a request to an I/O device, system
software can specify the processor to be interrupted as well as the
bit pattern to be written to the processor's interrupt register when
the request has completed. This flexibility allows a processor
to efficiently handle concurrent requests (such as
I/O requests and memory-to-memory transfers) involving multiple
devices distributed across the system.
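The OR-on-write and clear-on-read semantics just described can be captured in a few lines of C. The struct and function names below are ours, not NUMAchine's; in the real hardware these are memory-mapped register accesses.

```c
#include <stdint.h>
#include <assert.h>

/* Model of an interrupt register with the semantics described above. */
struct intr_reg {
    uint32_t bits;
};

/* Writes OR into the register, so concurrent senders (or multiple
 * completing devices) cannot lose each other's bits. */
static void intr_write(struct intr_reg *r, uint32_t pattern)
{
    r->bits |= pattern;       /* OR, not overwrite */
}

/* A read returns the accumulated bits and clears the register
 * automatically, acknowledging all pending sources at once. */
static uint32_t intr_read(struct intr_reg *r)
{
    uint32_t v = r->bits;
    r->bits = 0;
    return v;
}
```

Because each source can be assigned a distinct bit pattern, one read tells the handler which of its concurrent outstanding requests have completed.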
- For SPMD applications that synchronize frequently, performance
often depends on the efficiency of the barrier
implementation [18]. To efficiently support barrier
synchronization, each processor module has a barrier register
that differs from the interrupt register only in that
writes to the barrier register do not cause an interrupt. In a simple
use of these registers, when a processor
reaches a barrier it multicasts a request that sets a bit
corresponding to its ID in the barrier register of each of the
participating processors, and then spins on its local register until
all participating processors have written their bit.
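The barrier protocol above can be modelled sequentially as follows; the processor count, names, and the loop standing in for the hardware multicast are illustrative assumptions.

```c
#include <stdint.h>
#include <assert.h>

#define NPROCS 4   /* participating processors (assumed) */

/* On arrival, a processor multicasts a write that sets its ID bit in
 * the barrier register of every participant; here the multicast is
 * modelled as a loop over the registers. */
static void barrier_arrive(uint32_t regs[], unsigned id,
                           uint32_t participants)
{
    for (unsigned p = 0; p < NPROCS; p++)
        if (participants & (1u << p))
            regs[p] |= 1u << id;   /* set my bit in each participant */
}

/* Each processor then spins locally on this condition: its own barrier
 * register holds the bit of every participant. */
static int barrier_done(const uint32_t regs[], unsigned id,
                        uint32_t participants)
{
    return (regs[id] & participants) == participants;
}
```

Because the spin is on a processor-local register, waiting generates no network traffic, unlike spinning on a shared memory location.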
- At system boot time, the latency and bandwidth of all components
of the system (e.g., network components, processors) can be
constrained. This will allow practical experimentation to determine
the effects of varying the relative performance of system
components, such as processor speed, network latency, and network
bandwidth.
While much of the functionality that results from the above control is
obvious, sophisticated application and operating system software can
make use of this control in a number of non-obvious ways. In the
remainder of this section we give three non-trivial examples of how this
control could be used.
Stephen D. Brown
Wed Jun 28 18:34:27 EDT 1995