3.2. Software energy modelling
instruction that is executed, the inter-instruction overheads of switching between one instruction
and another, and any external effects such as cache misses. These values are extracted from a target
system with a test harness executing specific instruction sequences and measurement equipment
collecting energy consumption data. The model is then expressed by Equation 3.1 [Tiw+96]. For
each instruction, i, in the target ISA, the base instruction cost, Bi, is multiplied by Ni, the number
of times that instruction is executed. For each pair of instructions executed in sequence, i, j, the
inter-instruction overhead, Oi,j, is multiplied by its frequency of occurrence, Ni,j. Finally, for each
external component, k, the energy cost of external effects, Ek, is determined, for example with an
external cache model. The three sums together give the program energy, Ep.
\[
E_p = \sum_{i} (B_i \times N_i) + \sum_{i,j} (O_{i,j} \times N_{i,j}) + \sum_{k} E_k \tag{3.1}
\]
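To make the structure of Equation 3.1 concrete, the following is a minimal sketch that evaluates it over an executed instruction trace. The function name, the table layout and all numeric costs are hypothetical and chosen for illustration only; they are not taken from [Tiw+96]. Instruction pairs missing from the overhead table are assumed to contribute nothing.

```python
from collections import Counter


def program_energy(trace, base_cost, overhead, external_energy):
    """Evaluate Equation 3.1 for one executed instruction trace."""
    # N_i: number of times each instruction i is executed.
    n_i = Counter(trace)
    # N_{i,j}: number of times instruction j directly follows instruction i.
    n_ij = Counter(zip(trace, trace[1:]))

    base = sum(base_cost[i] * n for i, n in n_i.items())
    # Assumption: pairs absent from the overhead table cost nothing.
    inter = sum(overhead.get(pair, 0.0) * n for pair, n in n_ij.items())
    # E_k: one pre-computed energy figure per external component.
    external = sum(external_energy.values())
    return base + inter + external


# Hypothetical costs in arbitrary energy units.
trace = ["add", "mul", "add", "load"]
base_cost = {"add": 1.0, "mul": 2.0, "load": 3.0}
overhead = {("add", "mul"): 0.5, ("mul", "add"): 0.25}
external_energy = {"cache": 4.0}

energy = program_energy(trace, base_cost, overhead, external_energy)
```

In practice the trace would come from a simulator or profiler, and the external term from a separate component model such as a cache simulator.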
Building on this research are energy modelling tools such as the Wattch framework [BTM00].
Wattch produces energy estimates of software through simulation, by modelling key components of
a processor architecture, such as the cache hierarchy and size, functional unit utilisation and branch
prediction capabilities. Wattch can model software targeting various architectures, to within 10 %
of commercial low-level hardware modelling tools. The SimpleScalar [Aus02] architecture modelling
software was used as a basis for a similar power model, resulting in Sim-Panalyzer [Sim04].
The measurement of instructions and their interactions can be broken down further, as in the model
proposed by Steinke et al. [Ste+01a]. This model extracts more information on the
source of energy consumption in the processor pipeline, such as the cost of switching in each read
action upon the register file, as well as the cost of addressing different registers for read and write-
back. The precision of the approach is shown to be within 1.7 % of the target hardware, although
it significantly increases the number of variables that must be considered when implementing the
model. Other types of processor architectures have also been modelled in similar ways, such as
VLIW DSPs [Sam+02; IRH08], with average errors of 4.8 % and 1.05 % respectively.
To model complex micro-architectures or large instruction sets, linear regression analysis can be
used. With sufficient supporting empirical data, a solution to a parameterised model can be found
that establishes values for any unknown terms. This has been utilised to model an ARM7TDMI
processor [LEM01], using empirical energy data from observed test programs to aid the solver,
yielding a model with a 2.5 % average error.
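The regression step can be sketched as an ordinary least-squares fit, in which each test program contributes one row of instruction-class counts and one measured energy value. The example below is an illustration under stated assumptions, not the procedure of [LEM01]: the three instruction classes, the counts and the energy figures are all synthetic, and the solver is a small pure-Python implementation of the normal equations.

```python
def fit_costs(counts, energies):
    """Least-squares fit of per-class costs via (A^T A) x = A^T b."""
    n = len(counts[0])
    # Augmented matrix [A^T A | A^T b].
    aug = [
        [sum(row[r] * row[c] for row in counts) for c in range(n)]
        + [sum(row[r] * e for row, e in zip(counts, energies))]
        for r in range(n)
    ]
    # Gauss-Jordan elimination with partial pivoting.
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("test programs do not separate all classes")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(n):
            if r != col:
                factor = aug[r][col] / aug[col][col]
                aug[r] = [a - factor * b for a, b in zip(aug[r], aug[col])]
    return [aug[r][n] / aug[r][r] for r in range(n)]


# Synthetic data: rows are test programs, columns are execution counts
# of three hypothetical instruction classes.
counts = [
    [100.0, 0.0, 0.0],
    [0.0, 100.0, 0.0],
    [0.0, 0.0, 100.0],
    [50.0, 30.0, 20.0],
    [10.0, 60.0, 30.0],
]
# Stand-in for measured energy, with small deviations from costs 1, 2, 3.
energies = [101.0, 198.0, 303.0, 171.0, 219.0]

costs = fit_costs(counts, energies)
predicted = [sum(c * x for c, x in zip(row, costs)) for row in counts]
mean_error_pct = 100.0 * sum(
    abs(p - e) / e for p, e in zip(predicted, energies)
) / len(energies)
```

The fit only has a unique solution when the test programs exercise the instruction classes in sufficiently different proportions, which is why the choice of test programs matters as much as the solver.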
These approaches can all deliver accuracy to within 1–10 % across various architectures. However, the
architectures that they analyse are either single-threaded or special-purpose DSPs. As such, these
models are not equipped to model a hardware multi-threaded processor. Either these approaches
must be extended, or an alternative approach found, ideally whilst maintaining comparable accuracy
to prior models.
3.2.3. Performance counter based modelling
In a number of modelling methods, hardware performance counters are used to estimate energy
consumption. The benefit is that these counters can be used by a wider range of users who do not
necessarily possess direct energy measurement capabilities for their target system.
In [CM05], a set of performance events is monitored via an Intel PXA255’s configurable performance
counter sampling mechanism. These include characteristics that have been modelled
by various means elsewhere in this literature review, such as cache misses, as well as
instruction counts, data dependency events and an abstraction of main memory behaviour derived from
some of these events. The work states that the embedded PXA255 has fewer counters than a larger
processor, requiring profiling runs in order to gather sufficient data for a robust model. It is shown
that the average error is 4 % for the SPEC2000 and Java benchmarks run on this processor via a
Linux based OS.
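A counter-based estimate of this kind can be sketched as a weighted sum of event counts plus a static term. The linear form, the event names, the per-event weights and the static power below are assumptions made for illustration; they are not the model or the coefficients reported in [CM05].

```python
def estimate_energy(counters, weights, static_power_w, elapsed_s):
    """Estimate energy in joules from sampled performance counters."""
    # Dynamic part: each event type contributes weight * count.
    dynamic = sum(weights[name] * count for name, count in counters.items())
    # Static part: power drawn regardless of activity, over the run time.
    return static_power_w * elapsed_s + dynamic


# Hypothetical counter sample for one profiling interval.
counters = {
    "instructions": 1_000_000,
    "dcache_misses": 2_000,
    "data_stalls": 5_000,
}
# Hypothetical per-event energy weights in joules.
weights = {
    "instructions": 1e-9,
    "dcache_misses": 2e-8,
    "data_stalls": 4e-9,
}

energy_j = estimate_energy(counters, weights, static_power_w=0.1, elapsed_s=0.01)
```

The weights would normally be obtained by fitting against measured energy on a reference system, after which users without measurement equipment can apply the model using counter data alone.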
The Xeon Phi, whose architecture is discussed in Chapter 10, is modelled in a similar way [SB13].
In this case, a set of micro-benchmarks is used to exercise various behaviours and extract a
performance counter led model. The Phi is significantly more complex than a PXA255, in that it
contains multiple x86 cores and multi-threading. As such, multi-threaded behaviour must also be
accounted for, with the model containing a scaling term defined by the number of active threads
in a core. The model accuracy is stated as being within 5 % of hardware energy for real world