This page is under construction...
NUMAchine Performance Monitoring
Performance Monitoring in Processor Card
UPDATE
These features are now outdated. See my thesis for
current information.
Key Features
General-Purpose 32-bit Counters
- see this table
- there are 4 counters
- each counter can be individually loaded with an initial value. after a load,
it takes about 4 cycles for the counter to be ready to start counting
(however, loading zero/resetting in 1 cycle can be arranged
with some effort)
- counters can be loaded from the local processor or a remote processor
- all counters can be loaded with the same value (at the same time)
- counters are NOT reset after a read; they must be explicitly
cleared (is this reasonable? atomic read & reset is not currently supported.
if required, reading from a specific address could clear them, or there could
be a 'mode' bit in the config registers)
- each counter produces a maskable interrupt on overflow
- events which can be counted are:
- can constantly count (counts time)
- can constantly do nothing (stops counting time/other)
- can count any of the 15 R4400 status pin events that indicate CPU use
(status items can be counted in groups as they appear here):
- run cycles (useful work)
- load
- store
- branches
- untaken conditional branch
- taken conditional branch
- other integer instruction (not load/store/conditional branch.
includes jump and ERET)
- other float instruction (not load/store/conditional branch)
- norun cycles
- stall cycles (entire pipeline blocked until result comes)
- cache stalls
- primary instruction cache stall
- primary data cache stall
- secondary cache stall
- multiprocessor stalls
- MP (multi-processor) stall -- due to external coherence transactions
- other stall type ??????
- killslip cycles
- killed cycles (taken branches kill 2 of the 3 delay instructions)
- instruction killed by branch, jump, or ERET
- instruction killed by exception
- slipped cycles (pipeline interlock due to data or resource dependency)
- integer instruction killed, pipeline slip
- floating-point instruction killed, pipeline slip
- # successful invalidates completed (software must count # unsuccessful invalidates,
which are presented to the CPU as bus errors)
- # incoming invalidates which HIT in the secondary cache ( ==> primary
cache hit too) -- this is specially provided for parallel software
tuning and requires special hardware support. monitoring this
feature disables monitoring with the SRAM counters
- perhaps above feature combined with counting invalidates to cache lines?
- # packet headers (INCLUDES or EXCLUDES nacked packets?)
- # packet headers + # data cycles
- # NACKs received
- # requests that were NACKed once or more
- total time lost due to NACKs
- latency of most recent cached read request (all of RD/RDEXCL/UPGD)
this implies the counter must be reset each time -- i'm
not sure how to do this
- total latency of cached read requests (all of RD/RDEXCL/UPGD)
- put most recent external INVALIDATE into AR, count number of RD/RDEXCL/UPGD
(especially RDEXCL) to that cache line -- this indicates possible false
sharing, but imprecise filter masks reduce the effectiveness of this
- # cycles waiting for bus_grant
- # cycles bus is busy
- # cycles my ring is busy
- # cycles upper-level ring is busy -- ring busy cycles are useful to
determine if the interconnect is too busy for prefetch requests or
other optional things.
- # cycles memory (card #1 or #2 or both?) is busy
- can we measure how long it takes for external requests to complete?
- not all items will be countable on all counters
- one master MASK feature exists, consisting of various subfeatures. it
can be applied to any of the counters. each subfeature can
be individually enabled
(when >1 subfeature is on, all enabled conditions must be simultaneously satisfied)
- only when Address matches specified mask:
- each address bit can be considered "don't care", "must be 1", or "must be 0"
- only when a packet is being sent out to the bus (mutually exclusive with the next subfeature!)
- only when a packet is coming in from the bus (mutually exclusive with the previous subfeature!)
- only when CMD == CMDMask (loadable constant)
- only when PhaseID == PhaseIDMask (loadable constant)
- only when FIFO depth > FIFOMask (loadable constant)
- how necessary is the FIFO depth requirement? the FIFO has special counters,
although it should be empty almost all of the time?
- when the countable items are mixed with the MASK feature, the following
items can be counted (for example):
- # incoming invalidates
- # incoming interventions
- # outgoing reads, read exclusives, or upgrades
- etc.
- would like to count these, but it is too hard right now:
- # primary instruction cache misses
- # primary instruction cache hits
- # primary data cache misses
- # primary data cache hits
- # secondary (instruction/data?) cache misses
- # secondary (instruction/data?) cache hits
- have counter count number of times another counter is > N ???? FIXME
- IRQ response time (start when an IRQ occurs, stop when read of IRQ done)
this is actually quite trivial -- space
needs to be allocated in GPC table
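The master MASK rules listed above (a per-bit "don't care" / "must be 1" / "must be 0" address match, ANDed with whichever other subfeatures are enabled) can be modelled in software roughly as follows. This is only a sketch for reasoning about the filter, not the hardware: the `BusEvent` fields, the `MasterMask` class, and the use of `None` to mean "subfeature disabled" are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class BusEvent:
    """Hypothetical snapshot of one bus cycle; field names are illustrative."""
    addr: int
    outgoing: bool      # True = packet sent out to the bus, False = coming in
    cmd: int
    phase_id: int
    fifo_depth: int


@dataclass
class MasterMask:
    """Hypothetical model of the master MASK; None = subfeature disabled."""
    addr_care: Optional[int] = None   # 1 bits are compared, 0 bits are "don't care"
    addr_value: int = 0               # required value of the compared bits
    outgoing: Optional[bool] = None   # one field, so in/out can never both be required
    cmd: Optional[int] = None         # CMDMask
    phase_id: Optional[int] = None    # PhaseIDMask
    fifo_min: Optional[int] = None    # FIFOMask: match only when depth > fifo_min

    def matches(self, ev: BusEvent) -> bool:
        # Every enabled subfeature must be satisfied at the same time.
        if self.addr_care is not None and (ev.addr ^ self.addr_value) & self.addr_care:
            return False
        if self.outgoing is not None and ev.outgoing != self.outgoing:
            return False
        if self.cmd is not None and ev.cmd != self.cmd:
            return False
        if self.phase_id is not None and ev.phase_id != self.phase_id:
            return False
        if self.fifo_min is not None and not ev.fifo_depth > self.fifo_min:
            return False
        return True


def count_matching(mask: MasterMask, events) -> int:
    """What a counter gated by the mask would accumulate over a trace."""
    return sum(1 for ev in events if mask.matches(ev))
```

With this model, "# incoming invalidates" would be `MasterMask(outgoing=False, cmd=...)` with whatever command code the bus uses for an invalidate; the actual encoding is not given in these notes.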
SRAM-based Counters and Histograms
- see this table
- note that counting invalidates that hit in the secondary cache
requires use of the cache index (address) bits, so the SRAM counters can
only be used to profile secondary cache activity
- SRAM counters can create histograms as well as count numerous events,
provided that the events don't occur simultaneously
- SRAM is 64k deep (16 address bits), 15ns, 32-bits wide
- contents can be read and written by R4400
- address of SRAM can come from:
- R4400 AD[17:2] lines (word-addressable for reading/writing counters)
- R4400 SCADDR[17] and SCADDR[15:1], so secondary cache accesses can be histogrammed
to help identify conflicts (note SCADDR[17] is used to split the cache into Instruction
and Data halves, so that only DATA ACCESSES can be profiled; see feasibility below)
- merging of these bits:
- StateInfo (3 bits - Global, Valid, NC_Hit),
- RequestInfo (2 bits - ExclusiveRead, RemoteRead),
- ExtendedPhaseID (11 bits, regular PhaseID is 4 LSB, 7 additional bits can
be used by OS for context switching)
- merging of these bits:
- StateInfo (3 bits - Global, Valid, NC_Hit),
- RequestInfo (2 bits - ExclusiveRead, RemoteRead),
- PhaseID (4 bits)
- ServiceTime (7 bits), floating-point representation:
- 3 exponent bits (excess-1 notation)
- 4 mantissa bits
- exponent value of 0 means mantissa is denormal, so
the value represented is 0.mantissa x 2^(0+4).
- exponent >= 1 means mantissa has an implied '1' bit in the msb position, so
the value represented is 1.mantissa x 2^(exponent-1+4)
- provides equivalent of a 12 bit counter,
with some loss of precision at high end
- merging of these bits:
- PhaseID (4 bits)
- ServiceTime (12 bits, normal representation)
- fetch-increment-and-writeback done by CPLD, takes 3 cycles,
SRAM and CPLD TIMING PROBLEMS EXIST RIGHT NOW
- can be used to do dynamic basic-block counting (via writes from SW,
minimal intrusion)
- overflows should raise maskable interrupt (HOW TO DO? DIFFICULT!)
- count external invalidates to each cache line?
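The 7-bit ServiceTime format and the 16-bit merged SRAM address described above can be sketched as below. The decode follows the two formulas in the notes, reading the normalised case as applying to every non-zero exponent. The encoder (truncate, then saturate at the top code) and the field order inside `sram_address` are assumptions of mine; the notes do not say how rounding, overflow, or bit placement are handled. Under this reading the largest code, 0x7F, decodes to 1984 cycles, which is worth checking against the 12-bit figure quoted above.

```python
MAN_BITS = 4          # 4 mantissa bits
MAX_CODE = 0x7F       # 3 exponent bits + 4 mantissa bits, all ones
MAX_CYCLES = 1984     # value of MAX_CODE under the formulas in the notes


def decode_service_time(code: int) -> int:
    """Turn a 7-bit ServiceTime code back into a cycle count."""
    exp = (code >> MAN_BITS) & 0x7
    man = code & 0xF
    if exp == 0:
        return man                      # 0.mantissa x 2^(0+4)
    return (16 + man) << (exp - 1)      # 1.mantissa x 2^(exp-1+4)


def encode_service_time(cycles: int) -> int:
    """Compress a cycle count to 7 bits (truncating; saturation is assumed)."""
    if cycles < 0:
        raise ValueError("cycle count must be non-negative")
    if cycles < 16:
        return cycles                   # denormal range is exact
    if cycles >= MAX_CYCLES:
        return MAX_CODE                 # assumption: clamp instead of wrapping
    exp = cycles.bit_length() - 4       # 16..31 -> 1, 32..63 -> 2, ...
    man = (cycles >> (exp - 1)) & 0xF   # keep the 4 bits below the leading 1
    return (exp << MAN_BITS) | man


def sram_address(state_info: int, request_info: int, phase_id: int, cycles: int) -> int:
    """Merge the fields into a 16-bit SRAM address.

    Assumed order, MSB to LSB: StateInfo(3) RequestInfo(2) PhaseID(4) ServiceTime(7).
    """
    if not (0 <= state_info < 8 and 0 <= request_info < 4 and 0 <= phase_id < 16):
        raise ValueError("field out of range")
    return (state_info << 13) | (request_info << 11) | (phase_id << 7) \
        | encode_service_time(cycles)
```

Counts below 32 cycles survive a round trip exactly; above that, each exponent step halves the resolution, which is the loss of precision at the high end mentioned in the notes.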
General Notes
- see this table
- SRAM control signals (WE*, OE*, Addr) will be pipelined to the same depth,
otherwise timing hazards will occur
- timing and logic for invalidates that hit in the secondary cache are tricky
- profiling the R4400 L2 cache requires special clocking of the
fetch-and-increment CPLD -- it may not be feasible
- keep the most recent INVALIDATE address in the (AM+AF) register. compare it to
all outgoing address requests (in AR). count the number of matches.
this gives an idea of how many times invalidated data was rereferenced.
an update strategy is likely to be better for this kind of data.
once the data is rereferenced, future comparisons must stop to avoid falsely
counting replacement/conflict misses.
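The invalidate re-reference idea in the last note can be sketched as a small state machine: latch the most recent INVALIDATE address, count the first outgoing request that matches it, then disarm until the next INVALIDATE so that later replacement or conflict misses to the same line are not counted. The class and method names are hypothetical, and addresses are assumed to be already reduced to cache-line granularity.

```python
class RereferenceCounter:
    """Hypothetical software model of the invalidate re-reference counter."""

    def __init__(self):
        self.latched = None   # most recent INVALIDATE line address, or None if disarmed
        self.matches = 0      # number of invalidated lines that were re-requested

    def on_invalidate(self, line_addr: int) -> None:
        # A new external INVALIDATE overwrites the latch and re-arms the compare.
        self.latched = line_addr

    def on_outgoing_request(self, line_addr: int) -> None:
        # Compare every outgoing request against the latched address.
        if self.latched is not None and line_addr == self.latched:
            self.matches += 1
            self.latched = None   # stop comparing until the next INVALIDATE
```

Because only one address is latched, a burst of invalidates to different lines keeps just the last one, so the count is a lower bound on how often invalidated data is re-referenced.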