NUMAchine gives system software considerable control over how data is cached and how coherence is maintained. At the simplest level, system software can specify on a per-page basis: (i) whether caching is enabled or disabled, (ii) whether the coherence of cached data is enforced by hardware, (iii) whether hardware allows multiple processors to hold data in a shared state, or grants exclusive access to a single cache, and (iv) whether coherence is maintained using an update or an invalidate protocol, if the processor supports both.
We are currently evaluating support for both sequential consistency and a weaker model that does not quite fit any of the established weak consistency definitions. The full overhead of this support is not yet clear, and, more importantly, neither is the performance advantage: on our architecture the topology of the interconnect allows sequential consistency to be implemented at much lower overhead than on other architectures.
For cache-coherent pages, software can use some of the hardware control described above to improve performance. For example, software can multicast data to reduce latency, and can force data to be written back from any cache to reduce the cost of coherence. Similarly, under a write-update hardware protocol, processors that are no longer using the data can explicitly invalidate it from their secondary and network caches in order to reduce the overhead of updates.
Cacheable but non-coherent pages can be used to enable software-controlled cache coherence. Such techniques can take advantage of application-specific semantics to reduce the overhead of coherence for many applications. To make the implementation of these techniques more efficient, NUMAchine maintains state about cache lines (such as which processors have the data cached) that can be directly accessed by software. We also expect that the support for multicast interrupts provided by our hardware will be useful for some of these techniques.