Energy modelling of multi-threaded,
multi-core software for embedded systems
Steven P. Kerrison
A thesis submitted to the University of Bristol in accordance with the
requirements of the degree Doctor of Philosophy in the Faculty of
Engineering, Department of Computer Science, September 2015.
51,000 words.
Copyright 2015 Steven P. Kerrison, some rights reserved.
This work is licensed under the Creative Commons Attribution-NonCommercial-NoDerivatives 4.0
International License.
http://creativecommons.org/licenses/by-nc-nd/4.0/
Abstract
Efforts to reduce energy consumption are being made across all disciplines. ICT’s
contribution to global energy consumption and by-products such as CO2 emissions
continues to grow, making it an increasingly significant area in which improvements must
be made. This thesis focuses on software as a means to reducing energy consumption.
It presents methods for profiling and modelling a multi-threaded, multi-core embedded
processor at the instruction set level, establishing links between the software and the
energy consumed by the underlying hardware.
A framework is presented that profiles the energy consumption characteristics of a
multi-threaded processor core, associating energy consumption with the instruction set
and parallelism present in a multi-threaded program. This profiling data is used to
build a model of the processor that allows instruction set simulations to be used to
estimate the energy that programs will consume, with an average of 2.67 % error.
The profiling and modelling is then raised to the multi-core level, examining a
channel-based message passing system formed of a network of embedded multi-threaded
processors. Additional profiling is presented that determines network communication
costs as well as giving consideration towards system level properties such as power
supply efficiency. Then, this is used to form a system level energy model that can estimate
consumption using simulations of multi-core programs. The system level model
combines multiple instances of a core energy model with a network level communication
cost model.
The broader implications of this work are explored in the context of other
embedded and multi-core processor architectures, identifying opportunities for expanding or
transferring the models. The models in this thesis are formed at the instruction set
level, but have been demonstrated to be effective at higher levels of abstraction than
instruction set simulation, through their support of further work carried out externally.
This work is enabled by several pieces of development effort, including a profiling
framework for taking power measurements of the devices under investigation, tools for
programming, routing and debugging software on a multi-core hardware platform called
Swallow, and enhancements to an instruction set simulator for the simulation of this
multi-core system.
Through the work of this thesis, an embedded software developer for multi-threaded
and multi-core systems is equipped with tools, techniques and new understanding that
can help them in determining how their software consumes energy. This raises the
status of energy efficiency in the software development cycle as we continue our efforts
to minimise the energy impact of the world’s embedded devices.
Acknowledgements
I owe the successful completion of this PhD thesis to a great many people, and I can
directly thank but a few of them here. If I interacted with you during the course of my
research, please know that I am eternally grateful for that. Thank you to my family
for supporting and encouraging me throughout.
Thank you to my supervisor, Kerstin Eder, whose guidance helped me develop a
compelling research topic and secure funding for my work, without which none of this
would have been possible. Thank you to Simon Hollis and Jake Longo Galea for creating
a rather interesting set of research problems for us to collectively solve in the Swallow
platform. David May, your thought provoking discussions have been invaluable and
inspiring. Many thanks to my external examiners, Alex Yakovlev and Peter Marwedel,
as well as my internal coordinator José Luis Núñez-Yáñez.
My colleagues and companions in research deserve much gratitude for their input,
collaboration, and of course their tolerance. Jamie Hanlon, Roger Owens, Neville Grech,
Kyriakos Georgiou, Jeremy Morse and James Pallister, you and many others in the
department made research an exciting experience. A special thank you to Dejanira
Araiza Illan for your support, particularly during the write-up.
In the first year of my studies I was hosted by XMOS. This was an excellent place
to form ideas, gain industrial insight and motivate my work. In particular, thanks to
Henk Muller, John Ferguson, Matt Fyles and Richard Osborne for their expert advice
and support.
My work was funded by a University of Bristol PhD Scholarship, and much of it
became relevant to the ENTRA EU FP7 FET research project. I am grateful to these
funding sources for making this work possible, and for creating an ecosystem in which
to disseminate and further explore this work.
Author’s declaration
I declare that the work in this dissertation was carried out in accordance with the
requirements of the University’s Regulations and Code of Practice for Research Degree
Programmes and that it has not been submitted for any other academic award. Except
where indicated by specific reference in the text, the work is the candidate’s own work.
Work done in collaboration with, or with the assistance of, others, is indicated as such.
Any views expressed in the dissertation are those of the author.
Signed:
Date:
Contents
List of Figures 13
List of Tables 15
List of Code Listings 15
1. Introduction 17
1.1. Research questions and thesis .............................. 18
1.2. Contributions ....................................... 20
1.3. Structure ......................................... 22
1.4. Terminology and conventions .............................. 23
I. Background 25
2. Parallelism and concurrency in programs and processors 29
2.1. Concurrent programs and tasks ............................. 29
2.2. Parallelism in a single core ................................ 33
2.3. Multi-core processing ................................... 35
2.4. Summarising parallelism and concurrency ....................... 36
3. Energy modelling 39
3.1. Hardware energy modelling ............................... 40
3.2. Software energy modelling ................................ 42
3.3. Summary ......................................... 44
4. Influencing software energy consumption in embedded systems 47
4.1. Forming objectives to save energy in software ..................... 47
4.2. Energy’s many relationships ............................... 50
4.3. Can we sit back and let Moore’s Law do the work? .................. 54
4.4. Efficiency through event-driven paradigms ....................... 56
4.5. Summary ......................................... 56
5. A multi-threaded, multi-core embedded system 59
5.1. The XS1-L processor family ............................... 59
5.2. Swallow multi-core research platform .......................... 65
5.3. Research enabled by the XS1-L and Swallow ..................... 71
II. Constructing a multi-threaded, multi-core energy model 73
6. Model design and profiling of an XS1-L multi-threaded core 77
6.1. Strategy .......................................... 77
6.2. Profiling device behaviour ................................ 77
6.3. Model design considerations ............................... 79
6.4. XMProfile: A framework for profiling the XS1-L ................... 80
6.5. Generating tests ..................................... 83
6.6. Profiling summary .................................... 85
7. Core level XS1-L model implementation 87
7.1. Workflow ......................................... 87
7.2. A preliminary model ................................... 88
7.3. Preliminary model evaluation .............................. 97
7.4. An extended core energy model ............................. 100
7.5. Evaluation of the extended model ............................ 104
7.6. Beyond simulation .................................... 107
7.7. Summary ......................................... 109
8. Multi-core energy profiling and model design using Swallow 111
8.1. Core energy consumption on Swallow .......................... 111
8.2. Network communication energy profiling ........................ 113
8.3. Determining communication costs ............................ 115
8.4. Summary of Swallow profiling .............................. 117
9. Implementing and testing a multi-core energy model 119
9.1. Workflow ......................................... 119
9.2. Core and network timing simulation in axe ...................... 120
9.3. Communication aware modelling ............................ 122
9.4. Displaying multi-core energy consumption data .................... 126
9.5. Demonstration and evaluation .............................. 128
9.6. I/O as an adaptation of the network model ...................... 132
9.7. Summary ......................................... 133
10. Beyond the XS1 architecture 135
10.1. Epiphany ......................................... 135
10.2. Xeon Phi ......................................... 137
10.3. Multi-core ARM implementations ............................ 139
10.4. EZChip Tile processors ................................. 140
10.5. Summary of model transferability ............................ 141
11. Conclusions 143
11.1. Review of thesis contributions .............................. 143
11.2. Building a multi-core platform for energy modelling research ............ 144
11.3. ISA-level energy modelling for a multi-threaded embedded processor ........ 144
11.4. Multi-core software energy modelling from a network perspective .......... 145
11.5. The transferability of multi-threaded, multi-core models ............... 146
11.6. Writing energy efficient multi-threaded embedded software .............. 147
11.7. Future work ........................................ 148
11.8. Concluding remarks ................................... 149
List of acronyms 151
Bibliography 155
List of Figures
2.1. A multi-threaded task structure in a USB audio application ............. 31
2.2. An abstract example of instruction flow through a super-scalar processor ...... 34
4.1. Power savings for an Ethernet receiving with DVFS ................. 54
4.2. CPU frequencies since 1972 ............................... 55
5.1. XS1 architecture block diagram ............................. 60
5.2. Channel communication in the XS1 ISA ........................ 63
5.3. Photos of the Swallow platform ............................. 65
5.4. Dual-core XS1-L link topology ............................. 66
5.5. Swallow board JTAG chain ............................... 68
5.6. Swallow network topology ................................. 69
6.1. Process undertaken to profile/model the XS1-L core ................. 78
6.2. XMProfile test harness hardware and software structure ............... 81
6.3. Test harness process flow ................................ 82
7.1. XMTraceM workflow for a single-core multi-threaded XMOS device .......... 87
7.2. Active and inactive thread costs for the XS1-L processor ............... 89
7.3. Instruction power heat-maps for the XS1-L ...................... 91
7.4. Data constrained instruction power heat-maps for the XS1-L ............ 93
7.5. Power distribution measurements for groups of XS1 instructions .......... 95
7.6. Benchmark energy results and error margins ...................... 99
7.7. Box-whisker comparison of original and modified instruction groupings ....... 100
7.8. Extended profiling data ................................. 102
7.9. Visualisation of a regression tree for the XS1 architecture .............. 105
7.10. Completed model benchmark results .......................... 108
8.1. Power consumption of Swallow cores .......................... 113
8.2. Heat sensitivity of Swallow profiling .......................... 113
8.3. Experimental setup of the Swallow hardware and measurement apparatus ..... 114
8.4. Communication costs of Swallow system ........................ 116
9.1. XMTraceM workflow for a multi-core XMOS system .................. 119
9.2. Top-level abstraction of components in a modelled multi-core network ....... 124
9.3. Network-level energy consumption visualisation .................... 128
9.4. Multi-core modelling accuracy .............................. 131
9.5. Measured and estimated energy consumption ..................... 131
9.6. Refined modelling visualisation for Swallow ...................... 132
List of Tables
2.1. Example of a five stage processor pipeline, including warm-up and stalling ..... 33
3.1. Energy modelling technique overview .......................... 39
5.1. XS1-L routing table example .............................. 70
5.2. Swallow boot methods .................................. 71
6.1. Comparison of key differences between various architectures ............. 80
6.2. XS1-L pipeline occupancy for various thread counts ................. 83
7.1. Instruction encoding summary for the XS1 instructions under test ......... 90
7.2. Hamming weight of inputs and outputs for interleaved lmul instructions ...... 94
7.3. Power measurements for lmul under differing data conditions ............ 94
7.4. Benchmarks used to evaluate energy model accuracy ................. 98
7.5. OLS coefficients for XS1-L instruction features .................... 104
7.6. Percentage error of three evaluated models ....................... 107
8.1. Calibration tests for Swallow .............................. 112
8.2. Test combinations for communication power measurements ............. 115
8.3. Swallow communication cost validation ......................... 117
9.1. Definition of elements in axe JSON trace ....................... 121
9.2. Graph attributes for multi-core model ......................... 123
9.3. Resource instructions for network communication ................... 125
10.1. Architecture comparison summary ........................... 141
List of Code Listings
4.1. Spinlock loop ....................................... 56
4.2. Event-driven wait ..................................... 56
5.1. Sending on a channel ................................... 63
5.2. Receiving on a channel .................................. 63
6.1. Example kernel of first thread on the device under test ................ 82
6.2. Example kernel of further slave threads ........................ 82
8.1. XC top-level multi-core allocation example ....................... 115
9.1. Example JSON trace line from axe ........................... 121
9.2. XMTraceM report in text format ............................. 127
1. Introduction
The goal of saving energy is considered a contemporary challenge, motivated by several factors,
but dominated by two: managing the world’s consumption of resources and limiting the rate at
which we produce harmful by-products of that consumption, such as carbon dioxide. In computing,
however, it is not necessarily a contemporary challenge, nor do those two factors alone form the
primary goals.
Energy has always governed the uses for and effectiveness of computers. Mechanical computers
were large and slow, whilst the adoption of vacuum tubes offered higher performance. The
transistor and its subsequent miniaturisation to nanometre scale allowed computers to increase in speed,
reduce in size, and consume a small enough amount of energy to be pervasive devices in offices,
homes and vehicles.
While the practicalities of energy consumption in computing have been a governing factor for
nearly a century, the motivation to reduce processor energy continues, as the Internet of Things
(IoT) — an ever-growing number of interconnected embedded devices — creeps into the
technological lexicon. These devices must be small and consume tiny amounts of energy, often powered
by minute batteries or via energy harvesting.
Using energy consumption data from studies into data centers, PCs, network hardware and
other Information and Communication Technology (ICT) equipment, ICTs energy consumption
was determined to be 8 % of global consumption in 2008 [Pic+08]. Therefore, progress towards
both environmental and product-centric goals can be made by continuing to reduce the energy
consumption of devices.
As we reach technological limits, new techniques must be created to allow progress. For decades
we have relied, and continue to rely, upon Moore’s Law [Moo65] and trends related to it. The
shrinking of transistors and improvements in process technology yield energy efficiency
improvements, but now more aggressive energy saving techniques are devised and applied at higher levels,
from circuitry to turn off temporarily unused silicon, up to software controlled sleep states. The
advent of multi-core, which was necessary to avoid the practical limits of operating frequency and
power, introduces new opportunities but also new challenges, particularly in the areas of task
scheduling and effective programming models.
As the aggressiveness of energy saving techniques increases, the software that runs on top of
the processor eventually becomes a point of interest. This software is ultimately responsible for
the behaviour of the hardware — the hardware exists to perform the tasks defined in software by
the authors of that software. The software, therefore, is largely responsible for a device’s energy
consumption. To re-state the argument from a bottom-up perspective, a device with many energy-
saving features is inefficient if the software running on it prevents those features from being used,
or fails to adequately exploit them.
Abundant evidence of this can be found in mobile phones, where devices must be
designed to be energy efficient. However, a large number of software energy bugs have been
observed at all levels of the software stack [PHZ11]; software problems account for 35 % of
the energy bugs surveyed. Typically, these bugs prevent the hardware from entering low power
states. Energy bugs have several negative impacts, including poor reviews for buggy applications,
reports of phones with poor battery life and even increased product returns.
In order to write energy efficient software, developers must understand the energy that their
code will consume. To that end, this thesis proposes new techniques for addressing the imbalance
between understanding of hardware energy consumption and how the software running upon that
hardware affects it.
The focus of this work is on multi-threaded and multi-core processors in the embedded system
space, where the processor contributes a significant proportion of system energy consumption.
This is evident if we consider a particular device class: the mobile phone. The most significant
energy consumption within these devices is a combination of back-light, display, radio, graphics
and processor [CH10]. If we consider a more deeply embedded system, such as one not
interacting directly with humans, then the display, along with back-light and graphics processor, are
no longer present. Thus, the processor’s energy consumption becomes dominant. Further, in such
systems, energy is often in scarce supply, either due to the delivery mechanism or storage method,
for example a battery of limited capacity. It is desirable to maximise energy efficiency in order to
reduce the complexity of providing sufficient energy to these devices. The goal of this work is to
propose new methods for identifying how software consumes energy in such systems, supporting
these proposals with experimental tools, energy models, along with testing and evaluation. These
contributions can then be used as the basis for future work.
The research herein includes an in-depth study of a multi-threaded processor, assembled into
multi-core systems. The hardware’s energy consumption, and its relationship to the software running
upon it, is analysed at multiple levels, starting at the instruction set and progressing to a system
level considering multiple networked cores. Through this analysis, this thesis is able to present an
energy model for a multi-threaded embedded processor architecture and raise that modelling up
to the multi-core level. It is shown that a combination of understanding the target hardware and
writing software that fits the hardware well is essential for energy efficiency.
Software is selected to demonstrate behaviours typical of an embedded system, including multi-
threaded and multi-core examples. This software is compiled and the executables are then energy
modelled using simulation at the instruction set level. The presented core-level multi-threaded
energy model delivers accuracy within 10 % of measured hardware energy consumption and 2.67 %
on average, with a standard deviation of 4.40 percentage points. At the network level, absolute
energy estimations diverge from the hardware. However, the energy implications of
communicating tasks are made clear through the reporting and visualisation methods that are presented.
Most importantly, the relative improvements (or otherwise) from changes to the software can be
observed without the detailed hardware modelling used in processor design, and without needing
to instrument the target hardware. This makes energy modelling more accessible to the software
developer.
Finally, this work enables higher level analyses, such as static analysis, to be performed, by
feeding the model data into them. Thus, this research aims to provide insight to software
engineers with an interest in the energy consumption of their embedded software, and to other
researchers seeking new methods to provide and act upon this information through reporting and
optimisation.
The rest of this introductory chapter formally defines the research questions posed in this work,
summarises the contributions of this thesis along with related publications, outlines the structure
of the document and states the terminology and conventions used within.
1.1. Research questions and thesis
At its core, this thesis seeks to further the state of the art in energy modelling of software. It
does so by focusing at the embedded device level, observing emerging changes in how devices are
constructed and used across ICT. The fundamental question that lies beneath this work can be
posed from the perspective of an embedded systems software developer:
How much energy will the software that I am writing consume?
Without sufficient hardware knowledge, there is very little intuition when seeking the answer to
this question. Yet, in embedded systems, energy consumption is critical to the safe and correct
operation of a device. If this question can be answered, then the software developer can make
educated decisions about what action to take, be it to make changes to their software, modify the
system hardware, or revisit the specification.
This question is quite a broad one which, when asked by an embedded software developer,
indicates a specific goal: to minimise energy consumption in order to provide optimal functionality
of the embedded device, without breaking any of the constraints essential to its correct operation.
This can be phrased as a more specific question:
Software that is a good fit to the underlying hardware is more energy
efficient, but how can I achieve this?
Whilst abstraction allows a developer to avoid concerning themselves with the engineering beneath
the level at which they want to work, understanding how higher-level implementations map down to
low-level activity is fundamentally important, both in terms of performance and energy. Regardless
of energy saving features in the hardware, a piece of software that neither directly exploits the
best features of the hardware, nor passively allows the features to work, will lead to sub-optimal
power consumption [RJ97]. This is true historically and continues to be true today, and methods for allowing
this mapping to take place must continue to be developed if energy consumption of software is
to be better understood on contemporary hardware. Understanding this research question also
provides insight into what software is not a good fit to a particular system.
This thesis contributes new answers to these research questions. The statements underpinning the
work of this thesis are as follows:
Effective energy estimates for modern embedded software must consider multi-threaded, multi-
core systems. Parallelism in hardware is now necessary as a means to deliver increases in
performance. This requires multi-threading and multi-core hardware, and by extension software that
maps onto this type of system.
Energy modelling at the instruction set level provides good insight into the physical behaviour
of a system whilst preserving sufficient information about the software. To be useful to a
software developer, an energy model must be expressible in a way that relates to both the software
and the underlying hardware, exposing reasons for the behaviours that are seen.
Energy saving and energy modelling techniques are placed under greater constraints in the
embedded space. In an embedded system with hard real-time constraints, software or hardware
changes that may save energy cannot risk breaking those constraints. Similarly, the available
hardware resources, such as performance counters, may make it difficult to collect data to aid
energy modelling, either online or offline. This necessitates a modelling strategy that accounts for
these limitations or is unaffected by them.
Multi-threaded and multi-core devices introduce new characteristics that must be considered
in energy models. Embedded processors often have simpler pipelines than more general purpose
counterparts, but the introduction of multi-threading and multi-core systems into the embedded
space creates characteristics to consider. These characteristics can be unique to embedded systems,
which address the need for more performance in different ways to larger processors, to enable them
to satisfy the constraints placed on real-time systems. Further, the objective in such systems is
to satisfy an energy budget that is often defined by a limited source of energy, such as a battery.
This is in contrast to a high performance processor, which is more limited by heat dissipation and
power delivery.
Energy models that do not rely on run-time data from the processor provide greater flexibility
for multi-level analysis. Prior research has shown a variety of methods for estimating the energy
consumption of software, some of which utilise real-time data from the processor. Such methods
preclude higher level analysis, whereas this thesis presents methods that can be used across several
levels of abstraction, from instruction set simulation up to abstract network level views.
Both absolute accuracy and relative indicators provide useful information to a developer.
Where energy consumption constraints can be specified and are hard targets, an energy model
must provide sufficient accuracy to give the developer confidence that they have or have not met
that target. Using the performance of a range of prior research as a baseline, this accuracy
threshold will be established as ±10 %. Without this confidence, the development cycle is lengthened
by the need to repeatedly deploy and test on real hardware, which may be significantly more
inconvenient than running a simulation or other analysis. However, where an energy target is not
absolute, or a higher level view and understanding are required, relative measures remain appro-
priate, for example to answer the question “which version of this software uses less energy?” Given
the current lack of intuition towards software’s contribution to energy consumption, this is still a
valuable contribution to a developer’s knowledge. What is important in such cases, however, is
that a sufficiently wide view of the system is given, so that an apparent improvement in one area
is not eclipsed by a side-effect created in another.
Movement of data costs energy, no matter the form that movement takes. The embedded
processors studied in this work do not feature caches, nor do they use shared memory to communi-
cate between threads. Thus, the significant energy consumption arising from cache misses and the
memory hierarchy is not present. However, data must still be moved between threads via other
means, and a synchronisation or other flow coordination effort between threads must take place.
The cost of this must still be analysed and presented to the developer, in order to assist them in
reducing energy. A network-level view of communicating threads presents a different paradigm for
identifying how communication takes place and how improvements can be made, departing from
the often complex behaviours of large memory hierarchies that can be difficult to reason about.
Energy models for different architectures can have elements in common. Parallelism is being
provided in modern processor architectures in various ways, as challenges such as distributing
data across cores or sharing data between them are addressed. Although this creates variety in how
different processors behave and need to be programmed, an instruction set level energy model can
include at least some transferable properties between different architectures. This helps ensure
that energy models can be developed for new architectures more rapidly.
From these statements, many questions can be raised that guide the research. The structure
of this document follows these thesis statements closely, posing and investigating these questions
progressively. An explanation of this document’s structure is given in Section 1.3.
1.2. Contributions
This thesis makes contributions to research in the areas of energy modelling of software, computer
architecture and embedded systems. The main contributions and related publications are outlined
in this section.
Energy modelling a novel embedded processor architecture
The XMOS XS1 processor architecture has a number of novel aspects to it, relating to software-
defined real-time Input/Output (I/O), hardware thread scheduling, parallelism in embedded
processors and multi-core networks of message-passing processors. This thesis furthers the
understanding of these architectural features in relation to energy consumption at the software level,
defining the particular influences that software has upon this hardware.
Contributing to the creation of a multi-core research platform
The Swallow project [Hol12] was created by Simon Hollis at the University of Bristol with the
intention of building a real multi-core embedded system for demonstration and experimentation,
where previously a significant amount of research was based purely upon modelled or theoretical
systems. The Swallow platform forms an essential part of the research conducted in this thesis,
specifically in studying and modelling multi-core communication costs.
The research conducted in this thesis has resulted in a number of significant contributions to
the Swallow project, namely:
Initial bring-up and testing of the Swallow hardware, post-manufacture.
The introduction of wrapper scripts and pre-processing for the XMOS compiler tool-chain to
support the large number of processors, which the compiler did not previously handle.
Development and testing of the platform description files (XN files [XMO13a]), including
mapping Joint Test Action Group (JTAG) device IDs to XMOS network node IDs and
implementation of the deadlock-free dimension-order routing algorithm on Swallow’s unique
topology.
Code to boot Swallow devices over their network links rather than JTAG, significantly
reducing start-up time for large grids from over a minute to less than ten seconds.
An Ethernet software stack to allow both Ethernet based Trivial File Transfer Protocol
(TFTP) booting and communication with running applications.
Communication libraries to provide more flexible channel communication than what is built
into the XC language.
A significant amount of hardware surgery involving a soldering iron, microscope and scalpel.
These contributions enabled the multi-core energy data presented in this thesis to be
collected, and have assisted research by others using Swallow.
Energy modelling of a network of embedded processors
This thesis traverses various levels of system abstraction, from Instruction Set Architecture (ISA)
up to system level. At the system level, a Multi-Threaded and Multi-Core (MTMC) system is viewed as a
network of interconnected components. These components can be independently energy modelled,
as well as the interconnects between them.
The core level energy model is combined with this relatively abstract network level view and
a multi-core simulation, to provide energy modelling of embedded software with a unique level
of detail given to where the most significant quantities of energy are consumed. This provides
better insight into how software consumes energy in modern embedded systems, so
that informed decisions can be made to reduce that energy consumption, rather than relying on
undirected experimentation.
Related publications
The following publications are, at the time of writing, work directly related to this thesis. For each
publication, a brief description of the relationship to the thesis is given.
Steve Kerrison and Kerstin Eder. “Energy modelling of software for a hardware multi-
threaded embedded microprocessor”. In: Transactions on Embedded Computer Systems
(TECS) (2015) [KE15b]
This journal paper describes the initial energy profiling phase and the preliminary model that
was produced for a subset of the XMOS XS1 ISA. This thesis contains that same work,
described in more detail, and then built upon to produce a refined model for the full ISA.
Umer Liqat, Steven Kerrison, Alejandro Serrano, Kyriakos Georgiou, Pedro Lopez-Garcia,
Neville Grech, Manuel V. Hermenegildo, and Kerstin Eder. “Energy Consumption Analysis of
Programs based on XMOS ISA-Level Models”. In: 23rd International Symposium on Logic-
Based Program Synthesis and Transformation (LOPSTR’13). Springer, Sept. 2015 [Liq+15]
The model described in [KE15b] is used in this paper as the basis for providing energy
consumption predictions through static analysis of the software. The author of this thesis
contributed a description of the model to the paper, along with simulation based energy
estimation results, for comparison with the static analysis method.
Steve Kerrison and Kerstin Eder. “Measuring and modelling the energy consumption of
multi-threaded, multi-core embedded software”. In: ICT Energy Letters (July 2014), pp. 18–19.
url: http://www.nanoenergyletters.com/files/nel/ICT-Energy_Letters_8.pdf [KE14]
This letter summarises work on further development of the model in [KE15b], along with
preliminary results into the impact of multi-processor communication costs. The Swallow
project, which is also described in this thesis, is an essential part of this work.
Steve Kerrison and Kerstin Eder. “A software controlled voltage tuning system using multi-
purpose ring oscillators”. In: arXiv (2015). arXiv: 1503.05733. url: https://arxiv.org/abs/1503.05733 [KE15a]
The ring oscillators onboard the XMOS XS1-L are used in this work to calibrate an optimised
safe (faultless) core voltage for a given operating frequency. Components of this work,
particularly the background, are used in 4.2.3.
Simon J. Hollis and Steve Kerrison. “Overview of Swallow — A Scalable 480-core System
for Investigating the Performance and Energy Efficiency of Many-core Applications and
Operating Systems”. In: arXiv (2015) [HK15]
This overview of the Swallow system describes the salient parts of its construction, such as
the routing, performance, and energy consumption. This thesis and the work surrounding it
has contributed to the figures and information presented in the paper.
Neville Grech, Kyriakos Georgiou, James Pallister, Steve Kerrison, Jeremy Morse, and
Kerstin Eder. “Static analysis of energy consumption for LLVM IR programs”. In: Proceedings
of the 18th International Workshop on Software and Compilers for Embedded Systems.
SCOPES ’15. Sankt Goar, Germany: ACM, 2015. doi:10.1145/2764967.2764974 [Gre+15]
The energy models from this thesis and [KE15b] are leveraged in this paper to perform
static analysis at the LLVM IR level — the intermediate representation used in the LLVM
compiler toolchain. This provides potentially richer program information than at the ISA
level, preserving more control flow and other data, assisting the analysis process. For the
XMOS XS1 model, this work was enabled by a mapping between the instructions used in
the ISA level model and sequences of LLVM IR instructions. The author of this thesis
contributed the XMOS ISA model data, as well as hardware and simulation based energy
results. The static analysis and mappings between LLVM IR and ISA were contributed by
the other authors of the paper.
1.3. Structure
This document is structured to follow the arguments that form the thesis described in 1.1. Each
of the thesis statements builds upon the research conducted in response to the points before it.
To communicate this research effectively, the work is divided into two main parts, each comprising
several chapters.
Part I addresses prior work and essential background. Parallelism is explored in Chapter 2,
drawing attention to the topic from both a software and hardware perspective. A variety of energy
modelling methods are then detailed in Chapter 3, including discussion of the challenges that
parallelism introduces to the energy modelling process.
Chapter 4 then draws upon the previous two chapters to address the properties of modern
embedded systems that present further challenges to energy modelling of software. Part I is
concluded with Chapter 5, which examines the XMOS XS1-L processor core and a system of these
processors assembled into a grid style network; the Swallow project. The unique properties of the
processor and Swallow are discussed, in relation to the topics presented in the previous chapters.
This lays out the key challenges that guide the implementation decisions of this thesis.
Part II focuses on implementation, using the previously established background work, combined
with new research, to address the statements made in 1.1. It begins with two chapters that focus
on a single XS1-L multi-threaded processor core. Chapter 6 presents methods for relating the
energy consumption of the XS1-L to its ISA, through a newly developed profiling rig, comprising
both hardware and software. The profiling demonstrates a number of the properties of energy
consumption that are unique to this particular multi-threaded embedded processor. Chapter 7
then uses this profiling data to construct an ISA level model that can be used at various levels of
abstraction, starting with instruction set simulation. Several variations of the model are presented
and evaluated in order to determine the best possible model accuracy.
The subsequent two chapters are structured in a similar fashion, presenting the profiling
techniques and simulation tools used for the multi-core Swallow system in Chapter 8, then the model
and evaluation in Chapter 9. This completes the contribution of this thesis towards a multi-
threaded, multi-core, network-level energy model for an embedded real time processor.
A broader view is applied in Chapter 10, which looks beyond the XS1 processor to identify how
the contributions made in this work could be applied to other architectures. Several architectures
are surveyed, indicating where common characteristics may be present, and where novel features
may require new research in order to further the state of the art in energy modelling of software.
Finally, the thesis is concluded in Chapter 11. The chapter contains a review of the contributions
made, a summary of all evaluations made throughout the work, and a description of future work
opportunities that have either been discovered during this research, or created as a result of it.
1.4. Terminology and conventions
A small summary of critical terminology and chosen conventions is given herein. Other terms
are defined as necessary throughout the document. Acronyms are expanded upon the first instance
of their use and also in the List of Acronyms (LoA).
Power and energy
In this thesis the terms power and energy are used frequently. They are often used interchangeably
in the literature, but in the context of this work such conflation would not be appropriate.
For clarity, therefore, their definitions are given.
Power, P, or power dissipation, is an instantaneous measure of a rate of energy transfer, or the
rate at which work is done. It is quantified in Watts, or W. Energy, E, or energy consumption, is
a measure of total work done: work performed as charge traverses the potential differences
present in a circuit. This process transforms the energy, mostly from electrical form into thermal
form. Neither the charge flow in the system nor the potential differences are necessarily constant.
As a result, power changes continuously. Energy therefore is the integral of power during a period
of time, per Eq. (1.1). It is typically expressed in Joules, or J.
E = \int_0^T P(t) \, dt    (1.1)
Applying both the concepts of energy and power, a system that sustains a constant power
dissipation of 1 Watt for 1 second will have transferred 1 Joule of energy.
Multi-threaded and multi-core
A number of the processors in this work require a distinction between multi-core and multi-threaded
to be made. This culminates in the study of a system that has both of these properties. The term
Multi-Threaded and Multi-Core (MTMC) is used to refer specifically to this type of system. For
further clarification of the distinctions, parallelism’s various forms in both software and hardware
are detailed in Chapter 2.
Part I.
Background
Introduction
Part I of the thesis introduces the research and components that form the foundations of the
contributions presented in Part II. There are three essential topics: parallelism, energy modelling
and energy saving. These are each covered in turn, with the inclusion of the referenced research
justified in relation to the goals of this thesis.
The final chapter in this part introduces the hardware platforms upon which the majority of this
thesis bases its work. This chapter includes the work that was put into developing the Swallow
system into a platform usable for the profiling, analysis and modelling presented in Part II.
2. Parallelism and concurrency in programs
and processors
This chapter provides a review of the technology and concepts behind parallelism and concurrency
in hardware and software. It starts with programming and multi-tasking concepts in 2.1, then
examines single-core parallelism in 2.2, before reviewing multi-core technologies that are becoming
increasingly prevalent in modern computing in 2.3. Where appropriate, the literature is reviewed
in the context of embedded systems, although a broader view is suitable for much of this chapter.
The distinction between parallelism and concurrency is important to the understanding of
MTMC systems and how to express programs for them. Concurrency allows components to make
progress independently of each other, such that in a given period of time, all of the components
can have performed work. However, this can be achieved by sub-dividing the observed time period,
allocating a division of that time to each component, so that at any given point in time, only one
component is doing work. Parallelism provides simultaneous progression of components, therefore
multiple activities can happen at the same time.
The notion of parallelism is present throughout the history of computing, with Flynn establishing
a taxonomy of computer architectures that remains relevant today [Fly72]. From this taxonomy,
both Single Instruction Multiple Data (SIMD) and Multiple Instruction Multiple Data (MIMD)
require parallelism of some kind, both of which are relevant to this thesis. In addition, Single
Instruction Single Data (SISD) implementations can also contain some degree of parallelism when
sequences of SISD instructions are considered. These are all explored in this chapter.
In the software domain, programs may express solutions to problems in ways that are concurrent.
These are activities that can take place at the same time, conceptually. The execution of
these programs may be serialised and therefore not parallel, whilst still retaining the property of
concurrency [AS83, p. 4].
Parallelism and concurrency exist across the hardware/software stack, from programming
paradigms that aid the expression of concurrent problems [Pin98], through techniques to parallelize
sequential non-dependent operations, to the necessity of parallelizing hardware brought about by
technological limitations [Kah13].
2.1. Concurrent programs and tasks
There are numerous forms of concurrency exposed at the software and programming levels. Concurrency
can allow multiple independent workloads to be processed simultaneously if support for
parallelism is present in the system (2.1.1). Alternatively, a single problem, when written appropriately,
can be expressed as a concurrent program (2.1.3). When communication or response
to events is required, techniques to handle multiple events in a desirable order and with adequate
responsiveness must be used (2.1.4). All of these rely on multi-threading either in expressiveness
or implementation. This section begins with an overview of multi-threading and several related
terms.
2.1.1. Multi-threading
Multi-threading is common across all computing, from high-performance scientific computing,
through general purpose and down to embedded computing. The basic principle is to express
multiple activities that may take place concurrently. There is a distinction between a multi-threaded
view of a system, and how those threads are actually executed by the underlying processor(s).
Many of the implementation details at the hardware level are discussed in the subsequent sections
of this chapter. However, in this section, the software and Operating System (OS) level are the
focus.
Depending on the context of the system, threads might also be termed tasks or processes. The
distinction between them, if one exists, may differ. For example, in the Linux kernel [Mcc02],
which closely follows the Portable Operating System Interface (POSIX) threading standards, a
process is an address space and set of resources dedicated to the execution of a program, while a
thread is an independent path of execution within a process; there may be one or more threads in
a process. A task is a basic unit of work in Linux. If a process is cloned and some resources shared
between instances of that process, then a set of cooperating tasks is created.
Defining multi-threading
For the purposes of this thesis and in the context of embedded systems, where an OS or POSIX
implementation may not always be used, the term process is avoided except where supporting
literature uses it. Terms relating to threads and tasks are defined as follows:
Software thread A unit of sequential execution, which may form the entirety of a program, or
may work alongside other threads to achieve a common goal. This provides concurrency.
Hardware thread A front-end to a processor retaining its own program counter and other registers,
able to accommodate a software thread. The computational resources behind the front-end
may be shared, allowing multiple threads on a single processor core. This provides parallelism.
Task A separation of units of work that may have constraints such as hard real-time deadlines. A
set of tasks might be realised as a group of time-sliced software threads managed by a Real-
Time Operating System (RTOS), or they might be allocated as separate hardware threads on
a sufficiently capable system. In any case, some tasks may need to complete their activities
within a given time period.
2.1.2. Parallel tasks
An embedded system may have multiple objectives to achieve, defining multiple tasks. For example,
take an embedded real-time system which is responsible for controlling an industrial process. It
may have multiple sensors to communicate with, each of which requires data processing, along
with actuators that must be controlled based on the result of that processing. It may also need
to provide interfaces for reconfiguring the parameters that direct the processing of sensor data or
control of actuators. Several interfaces will be involved in such a system, possibly implementing a
range of protocols, such as Inter-Integrated Circuit (I2C), Ethernet and Controller Area Network
(CAN).
Figure 2.1 depicts a USB audio application for an XMOS-based platform [XMO14a]. It
comprises several inter-connected tasks. There are tasks for I/O over various interfaces, as well as
audio processing. The I/O protocols each have timing requirements, defined by the standards
and behaviour of the components that are using them. As such, the embedded system must be
able to send and receive data to and from these interfaces within their specifications. Further,
for the system to operate correctly, there may be additional timing constraints that need to be
applied. For example, in the context of an audio application, delayed audio processing could result
in undesirable latency, or audible glitches caused by lost samples.
In such a system, all of these tasks must be able to run with sufficient speed and frequency in
order to meet the timing requirements. In some implementations, an RTOS may be used to help
with allocation of resources to meet these requirements. There may still be a need for the system
software developer to correctly define priorities.
In general purpose computing, the OS also has task scheduling responsibilities, although the
majority of tasks, particularly those initiated by users, are not considered time critical or only
have soft deadlines (a missed deadline is inconvenient rather than system-breaking). Different
scheduling techniques are used depending on the OS and how it is configured. For example,
Linux and Windows have different schedulers and scheduling options [BC05; Mic12].
[Figure: a network of inter-connected tasks, including the USB XUD driver, control endpoint,
endpoint buffer, decoupler, mixer, audio driver, MIDI driver, SPDIF RX/TX, clock generator
and LED driver, attached to ULPI, I2S, PLL, MIDI, SPDIF and GPIO interfaces.]
Figure 2.1: A multi-threaded task structure in a USB audio application.
2.1.3. Parallel programs
Certain types of programming problems, such as multi-stage processing, client-server and data
parallelism, can be implemented in single programs that contain some level of parallelism. They
can be distinguished from parallel tasks in that their components either cannot be separated from
the rest of the program and remain useful, or are simply replicated instances of one component. For example, an
Ethernet interface task might be modularised to be used in multiple applications. A concurrent
matrix multiplication algorithm, however, may replicate worker threads that each process a subset
of the input data.
Software such as pigz [Adl10] allows data compression to be performed concurrently and is
designed to exploit available parallelism in a system. The POSIX threading system is used for OS
portability in pigz. The sc matrix library, used in Chapter 7, expresses a number of vector and
matrix operations concurrently, although it is targeted at bare-metal embedded programs rather
than at devices running an OS and so exploits device specific parallelism features rather than a
portable threading library.
Client-server arrangements can often exploit parallelism, in that a server may need to handle
multiple clients simultaneously. The widely used Apache web server can use multiple worker threads
or processes to serve a larger number of client connections simultaneously. The performance of such
an implementation is both workload and configuration dependent, making it an area of interest to
research in web technology [DKC08].
Many libraries and languages have been created to allow parallelism to be expressed in programs.
Open Multi-Processing (OpenMP) [DM98] provides compiler directives and library routines that
extend Fortran, C and C++ for shared-memory programming. The Message Passing Interface (MPI) standard
[Sni98] provides methods for communicating between threads in parallel systems. Open
Computing Language (OpenCL) [SGS10] provides a language and framework for leveraging parallelism
in heterogeneous systems, allowing work to be allocated to different compute units, such
as Central Processing Units (CPUs), Graphics Processing Units (GPUs) and Field Programmable
Gate Arrays (FPGAs) [Cza+12]. There are many more languages, each expressing parallelism
using different paradigms [Pin98], including Occam, MultiLisp, and Sire [Han14]. Message passing
is a commonly used abstraction for parallel programming; two notable forms are formalised as
Communicating Sequential Processes (CSP) [Hoa78] and the Actor model [Kow88]. The former
uses communication channel ends with synchronisation, whilst the latter uses mailboxes at the
receiver. The communication model of the XS1 architecture, described in Chapter 5, follows a
CSP model of parallel processing. Other methods of communicating in process networks exist,
either synchronous or asynchronous in nature [Mar11, pp. 21–118].
The choices for expressing parallelism in programs are rich and varied. To some extent, choices
are driven by particular application areas. In the embedded space, a significant proportion of
applications continue to be developed in C or its derivatives [Phi04, p. 151]. Although alternatives
exist [Taf14], the inertia of a significant body of historic code means that C is likely
to remain a popular choice for the foreseeable future.
2.1.4. Event driven software
In event driven software, waiting on the availability of data, for example through I/O, is kept
efficient by avoiding activities such as spin locks. Examples of this and alternative constructs are
given later, in 4.4. Event driven behaviours allow applications to wait without wasting CPU
cycles, and for inputs to be queued for handling with minimal blocking. Event handling is an
activity often handled by the OS, the software interfaces to which vary between OSs. Libraries
such as libevent [Mat10] provide an abstraction layer on top of these various implementations.
Languages that provide channel or other communication based abstractions must also rely on event
implementations at a lower level in order to efficiently provide their data sharing model.
Software such as the nginx web server [Sys14] uses events to handle the so-called C10k problem,
where ten thousand client connections may need to be maintained simultaneously. Although the
processing of this number of connections may not be fully parallel, the software architecture is able
to accommodate this many open connections with low overhead.
In embedded computing, interrupts are frequently used to avoid polling of devices that may
or may not be ready for some activity to be performed upon them (for example, a device buffer
may be free to receive more data). Interrupts exist in both a hardware and software sense. A
hardware interrupt uses an I/O signal to cause a context-switch in the processor that receives the
signal. Typically, an Interrupt Service Routine (ISR) is entered, which deals with the cause of the
interrupt, before returning to the previous context. Interrupts may be masked to avoid context
switching in time-critical sections of software, and interrupts may also be nested or prioritised, so
that multiple simultaneous interrupts, from numerous sources, can be handled appropriately.
If there is no computation to be performed, interrupts can be exploited for power saving. An
idle processor can sit in a low power or sleep state until an interrupt triggers a wake-up into its
fully active state. Thus, interrupts can be used not just for rapid context switching, but also power
state transitioning. For example, many ARM devices feature wait-for-interrupt instructions that
put the device into low power mode until an interrupt takes place. Similarly, the XMOS XS1-L
can do this with both conditional and unconditional wait instructions.
A software interrupt uses a similar context-switch approach, but the activity is handled, and
possibly initiated by an OS. For example, the OS may interrupt a running program to allow another
to have processor time, thus achieving time-slicing multi-tasking. Alternatively, a program may
cause a software interrupt in order to request a privileged activity from the OS, such as disk access.
Interrupts create scheduling challenges and potential context-switch overheads [Tsa07; TT09].
Certain multi-threaded architectures, such as XS1, also implement events. These are similar in
behaviour to interrupts, except context is not preserved when an event takes place; the thread
responding to the event simply jumps to a designated program location. This allows a thread
to efficiently wait to respond to one of multiple possible events, whilst other threads continue to
execute. The hardware architecture and distribution of work between threads then become the
determining factors in responsiveness, rather than context switch and ISR overheads. The XS1
event handling implementation is discussed in more detail in Chapter 5.
2.1.5. Summary
This section has provided background on various parallel processing and concurrent programming
paradigms, technologies and challenges. A number of these are relevant across computing while
others are more specific to embedded systems or at the OS level.
In task concurrency, a program may comprise multiple threads, all working on independent
tasks, with communication where necessary. Libraries such as POSIX threads, or an RTOS,
provide a means of defining these tasks.
Cycle | Stage 1 | Stage 2 | Stage 3 | Stage 4 | Stage 5
  0   |   I0    |   —     |   —     |   —     |   —
  1   |   I1    |   I0    |   —     |   —     |   —
  2   |   I2    |   I1    |   I0    |   —     |   —
  3   |   I2    |   I1    |   I0    |   —     |   —   (stall)
  4   |   I3    |   I2    |   I1    |   I0    |   —
  5   |   I4    |   I3    |   I2    |   I1    |   I0
Table 2.1: Example of a five stage processor pipeline including warm-up and stalling.
In a single concurrent program, inseparable tasks are composed together, with an expectation
that each task can progress at any given time (save for synchronisation and communication).
Programming languages such as Occam and Sire allow concurrency to be expressed in algorithms.
This concurrency can then be used to realise parallelism on a suitably equipped hardware platform.
At the hardware level and the lower levels of software, interrupts and events allow asynchronous activities
to be handled concurrently. There is motivation to minimise the overhead of handling these, both
to ensure correct operation by avoiding missed deadlines, and to provide good scaling for
ongoing application challenges such as the C10k problem.
The next two sections discuss how parallelism is made possible in hardware. Thus, the
concurrency presented by the various techniques in this section has the potential to be exploited as
parallelism by the underlying computer architecture.
2.2. Parallelism in a single core
At the hardware level, there are two main approaches to maximising performance. The first is to
increase the operating frequency of the processor, so that a newer, faster processor can deliver a
higher throughput of work in a given unit of time. The second is to do more work per clock cycle,
such that a new processor with a higher Instructions Per Clock (IPC) can do more work in a given
unit of time, at the same frequency.
The former has been maintained through lower threshold and operating voltages combined with
increased transistor count at the same power density, per Dennard’s scaling observations [DGY74].
However, at 130 nm feature size, this property has ceased to hold [Kuh09; Boh07]. This has
resulted in a plateau in processor operating frequency since 2005 [Kah13]. To maintain
competitiveness, processor manufacturers have sought and continue to seek methods for maximising IPC
and throughput, creating parallelism of various forms in a single core.
If the objective of higher performance is replaced with that of lower energy, then a processor that
can do more work per clock cycle than another can be run more slowly, and potentially at a lower
voltage, whilst maintaining performance. Under these conditions, it will most likely be the more energy
efficient processor of the two.
This section examines methods of increasing IPC and throughput, all of which have an impact
not just on performance, but on energy consumption and how such processors can be modelled for
energy, as will be discussed in Chapters 3 and 4.
2.2.1. Pipelining
The majority of computer architectures break execution of instructions down into multiple stages,
forming a pipeline. A sequence of instructions can be processed in this pipeline, progressing
between stages on each clock cycle. For an N-stage pipeline, up to N instructions can be executed
simultaneously. The execution latency of an individual instruction is not improved by pipelining.
However, the next instruction can begin to be processed before the current one is completed,
thus increasing throughput. In addition to this, smaller stages typically allow a higher operating
frequency for the processor, by shortening the critical path.
An example of instruction occupancy of a pipeline is given in Table 2.1. In ideal circumstances,
the pipeline is always full. At the start of execution, the pipeline must warm up. Further,
instructions may need to wait for an earlier instruction to complete before proceeding. For example, two
[Figure: four snapshots of instructions I0–I3 flowing through decoders, Functional Units (FUs)
and a Re-Order Buffer (ROB): start; decode; issue of I0 and I2; issue of I1 and I3 with I0
complete and I2 held; and finally I0–I2 completed.]
Figure 2.2: An abstract example of instruction flow through a super-scalar processor.
consecutive instructions may be dependent: the second may use the result of the first as a source
operand. If the first instruction has not completed before the second proceeds, the wrong operand
value will be used. This is a data hazard, which must be detected and avoided, either by stalling
the pipeline (shown at cycle 3 in Table 2.1), which reduces IPC, or by adding forwarding logic into
the pipeline [Pat85, pp. 17–19], increasing pipeline complexity.
Another example is branching, where a decision to branch may invalidate instructions that have
already entered the pipeline. This requires the pipeline to be flushed (emptied), again reducing
IPC. As the depth of a pipeline increases, the performance penalty from a flush increases. Branch
prediction techniques [Smi81] or branch delays [Pat85, pp. 12–13] can be used to try to avoid this
scenario.
2.2.2. Super-scalar
Where a processor possesses multiple Functional Units (FUs), such as an Arithmetic Logic Unit
(ALU), Floating Point Unit (FPU) and memory unit, IPC can be increased by attempting to
utilise as many of these in parallel as possible. If one unit is in use and requires multiple cycles
to complete, it may be possible to issue an instruction to another unit, provided there are no
data hazards between instructions. These super-scalar designs allow multiple instructions to be
executed simultaneously [Joh89]. Throughput can be improved further by allowing out of order
execution, where instructions are issued internally in an order that maximises FU utilisation whilst
avoiding data hazards, with re-ordering hardware at the end of the pipeline [HP06, pp. 104–114],
so that instructions are seen externally to complete in the order that was expressed within the
software thread.
An abstract view of a super-scalar processor is shown in Figure 2.2, capturing the progression of
several sequential instructions at various points in time. The diagram shows several techniques that
contribute to Instruction Level Parallelism (ILP) in a super-scalar system, including pipelining,
sub-pipelining, multiple instruction issue and re-ordering. Initially, two of the instructions can
be issued to different FUs, but a third cannot as the target FU is already in use. This results in
instruction I2 overtaking I1, so it must be held in the Re-Order Buffer (ROB). Once I1 completes,
the ROB can retire both I1 and I2. I3 is still executing at this point, because the sub-pipeline of
the FU that it is utilising takes a larger number of cycles than some other FUs.
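The in-order retirement behaviour described for Figure 2.2 can be sketched as a toy re-order buffer; this is illustrative only and models no particular micro-architecture:

```python
from collections import deque

# Minimal re-order buffer sketch: instructions may complete out of order, but
# retire (become externally visible) strictly in program order.

class ROB:
    def __init__(self, instructions):
        self.buffer = deque(instructions)  # instructions in program order
        self.done = set()                  # completed but not yet retired
        self.retired = []                  # externally visible completions

    def complete(self, instr):
        self.done.add(instr)
        # Retire from the head of the buffer while the head has completed.
        while self.buffer and self.buffer[0] in self.done:
            self.retired.append(self.buffer.popleft())

rob = ROB(["I0", "I1", "I2", "I3"])
rob.complete("I0")   # retires I0 immediately
rob.complete("I2")   # I2 is done but held: I1 has not yet completed
rob.complete("I1")   # retires I1, then the waiting I2
print(rob.retired)   # ['I0', 'I1', 'I2']
```

This mirrors the figure: I2 overtakes I1 internally, yet the retirement order seen by software remains sequential.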
2.2.3. Hardware threads
Introducing hardware threads to a processor core allows further exploitation of the previously
described methods by fetching multiple sequences of instructions that can be fed into the pipeline
and FUs. The presence of multiple hardware threads provides the benefit of avoiding data hazards
between instructions by having more than one context to choose from.
Hardware multi-threading requires multiple register banks, one for each thread, along with
additional logic to fetch and buffer instructions from multiple memory locations. Further, there
may be some replication in the instruction decode logic. This adds to processor complexity.
In ideal operation, multiple threads can be used to keep all FUs and pipeline stages full, to the
benefit of IPC. However, it is also possible that threads may contend the same FUs, leading to
similar performance to a single-threaded processor. At an OS level, the scheduler may need to be
aware of how a processor’s multi-threading is implemented, as the OS may otherwise treat each
hardware thread as an independent processor core, assuming that resources on that core are uncontested.
An example of multi-threading can be found in Intel’s Hyper-Threading [Int03a] technology,
which provides two front-ends to a single super-scalar core and has been used in a variety of
Intel processors including the Pentium 4, Xeon, Atom and Core product ranges. Other processors
implement multiple cores each with multiple front-ends, such as the Sun Sparc 32-way Niagara
processor [KAO05], which comprises eight cores each with four-way multi-threading. The AMD
Bulldozer micro-architecture implements two threads per core, but with a shared FPU and two
integer pipelines [But+11]. This created an example of the aforementioned OS scheduling issues,
which needed to be resolved to ensure best performance, for example in the Windows 7 OS [Shi12].
The XMOS XS1 processor [May09b], which is described in detail in Section 5.1, provides eight hardware
threads, sharing a simple four-stage pipeline, in which IPC is only maximised when four or more
threads are active.
2.2.4. Data parallelism
The previously described techniques all apply to SISD structures. However, SIMD can be exploited
in a single processor core for certain data manipulation tasks. Various ISAs contain extensions that
provide SIMD instructions and registers. For example, ARM NEON [ARM14] and Intel Streaming
SIMD Extensions (SSE) [RPK00] can perform instructions upon wide vectors of data up to 128
bits. Intel Advanced Vector Extensions (AVX) [Lom11] can operate on 256-bit wide data sets.
GPUs, in particular those with General Purpose GPU (GP-GPU) capabilities, can handle a
large amount of data parallelism per core. While present in some high performance embedded
systems, such as mobile phones, these devices do not fall within the area of research explored in this thesis.
Very Long Instruction Word (VLIW) processors conform to a MIMD organisation, where mul-
tiple operations are performed on a set of data in a single instruction. Such technology is most
frequently used in embedded Digital Signal Processors (DSPs) [FDF98], where software-pipelined
activities can be expressed as a series of VLIW sub-instructions. VLIW processors perform MIMD
in lock-step, where the long instruction encodes the various operations that will be performed on
each operand. Therefore, in VLIW processors, the compiler must be able to schedule instruc-
tions in order to maximise IPC and satisfy data dependences, otherwise hand-optimisation may be
required to attempt to perform useful operations in the slots available in the instruction encoding.
2.3. Multi-core processing
Multi-core processors provide several independent processing units, with no contention for the
resources on each core, forming a MIMD organisation. This distinguishes them from multi-threaded
processors, which can execute more than one instruction sequence simultaneously but share FUs
internally and may not necessarily have MIMD characteristics. However, contention for resources
is not completely removed by multi-
core architectures. The memory hierarchy and interconnection between cores can still be contended
and indeed this forms a significant problem in achieving good performance in multi-core systems.
This is particularly significant if it is not possible to scale the interconnect with the rest of the
system, which is another observation of Dennard [DGY74] that is problematic in modern processor
design [Boh07].
A multi-core processor has more than one core on a single die or chip, distinguishing it from a
multi-processor system in structure. Ultimately a system comprising a large number of cores may
be formed of multiple processors, each with multiple cores, and each capable of multi-threading.
Indeed, this is the case for the Swallow system described in Chapter 8, as well as many server
systems. For simplicity, this thesis refers to systems of multiple multi-threading capable cores as
MTMC, distinguishing between chip-local multi-core and system level multi-processing only where
necessary.
A wider view of the different types of multi-process architectures, predominantly from a general
purpose computing and server perspective, is given by Roberts and Akhter [RA06, pp. 5–13].
2.3.1. General purpose multi-core
The first general purpose x86 multi-core processors were introduced in 2005, with both AMD
and Intel offering dual-core products. The number of cores has since expanded, with currently
announced products containing as many as 18 cores, for example the Intel Xeon E5-2699V3. Many
ARM-based products are also multi-core, including a number of Cortex-A series devices, commonly
found on mobile phones, but also in servers and high performance embedded multimedia devices.
Multi-core is reasonably well suited to general purpose computing, where even a single-user ma-
chine is frequently used for multiple concurrent tasks. These may include user-triggered activities,
such as multimedia, web-browsing and gaming, but also compute-intensive background tasks, such
as virus scanning and data indexing. However, there is still sufficient demand for single-threaded
performance, that multi-core processors may offer aggressive Dynamic Voltage and Frequency Scal-
ing (DVFS) strategies that provide temporary boosts to core frequency and voltage when a single
thread will benefit. Intel’s Turbo Boost [Int15] is an example of this. Such techniques are typ-
ically only temporary performance boosts because the power demand would push the processor
beyond its Thermal Design Power (TDP) with prolonged use and sustained operation at this higher
frequency/voltage would likely have a negative impact on device longevity and reliability.
2.3.2. Embedded multi-core
Multi-core processors in the embedded space include various designs, often targeting communica-
tion or other hard real-time environments. Companies including Picochip, EZChip, XMOS and
Adapteva have developed architectures that directly serve embedded use.
The Picochip PicoArray processor [DPT03], contains an array of signal processing cells on a
Time Division Multiplexing (TDM) network, principally for implementing cellular communication
modems and codecs. The Adapteva Epiphany and EZChip Tile processors are described in more
detail in Chapter 10, forming part of this thesis’ discussion of modelling a wider range of multi-core
devices.
The application space for embedded processors is different to that of general purpose computing.
As such, the way in which multi-core is exploited is different. For example, the embedded system
is typically utilised closer to the limit of its capabilities, in order to ensure cost effectiveness. Its
life-span may be significantly longer than a general purpose device, either due to its more restricted
set of uses, or the difficulty of access to replace or upgrade it. This is indicated by support periods
for embedded versions of software such as Windows, for which the embedded variants have longer
support periods than regular versions [Mic14]. The motivation for energy efficiency in software is
therefore stronger, because the energy savings that may be obtained from hardware improvements
may be less readily available in the longer product cycle. This, coupled with the scarcity of energy
in many embedded system deployments, strengthens the motivation further.
2.4. Summarising parallelism and concurrency
Concurrency can be defined at multiple levels, from programming languages through to OS-level
task definition. Parallelism can be provided by the physical computer architecture to allow ex-
ploitation of concurrency. Expression of concurrency or parallelism at one level does not necessitate
that other levels be aware of or exploit it. Architectures can exploit implicit parallelism that may
be present in sequential programs, where independent groups of instructions may be executed
concurrently for improved performance.
Converging upon multi-core processing, such hardware requires parallelism to be present at
higher levels of abstraction in order for the hardware to be utilised efficiently. This efficiency
can be measured both in terms of performance and energy. Processors that belong to this group
form the core interest area for this thesis, wherein their relatively recent introduction poses new
research challenges. This includes the complexity in programming them, modelling their behaviour,
maximising their performance and minimising energy. While this chapter has focused on the
concepts and mechanisms of concurrency and parallelism, it has not addressed energy efficiency
at length. Considerations for energy efficiency, particularly with respect to embedded systems
hardware and software, are examined in detail in Chapter 4.
The XMOS XS1-L processor, which is introduced in Chapter 5, draws upon several of the
paradigms and technologies described in this chapter. Of particular interest are:
- Multi-threaded programming (Section 2.1.1) via a C dialect, XC.
- The multi-threaded pipeline in the processor core (Section 2.2.3).
- Multi-core (Section 2.3), featuring:
  - Channel based message passing in software.
  - Shared memory on a single core.
  - Hardware based channel communication on- and off-core, backed by a routed interconnect.
These are explored throughout this thesis. This review of parallelism has provided background
on these and other complementary methods, in order to frame the unique contributions of this
work around the state of the art and alternative approaches.
3. Energy modelling
This thesis seeks to establish both a single and multi core energy model for XS1 based embedded
processors. The processor, which is discussed in Chapter 5, has multi-threaded and multi-core
networking properties that necessitate new modelling approaches. In order to communicate the
energy consumption of this architecture with respect to software, prior work must be considered
in order to determine a feasible approach.
This chapter presents a review of energy modelling techniques at various abstraction levels,
extracting useful techniques that are applicable to this thesis and identifying the further work
needed, thus justifying the research conducted in subsequent chapters. There are two main sections:
hardware energy modelling (Section 3.1) and software energy modelling (Section 3.2), where the line
between the two is at times blurred. These are therefore loose categorisations: hardware modelling
best serves device designers, while software modelling may better serve software developers.
Both single and multi-core modelling techniques are considered, where some modelling areas have
more significant developments for MTMC than others.
Table 3.1 presents the key features of each model approach, including a section reference that
describes the approach in more detail as well as specific implementations and uses.
Hardware-oriented modelling:

- Component parameter exploration (3.1.1): provides estimates based on key parameters of devices,
  such as memory hierarchy, width, etc.; enables rapid design space exploration; requires external,
  lower-level simulations to provide the most reliable estimates.
- Modular system level (3.1.2): modular construction of various architectures and other system
  components; energy models can be attached to components; energy modelling accuracy decreases
  with more complex systems.
- Transaction level (3.1.3): analogous to activity (data/commands) exchanged between system
  components; can provide a high level view of system behaviour; modelling of individual components
  can be done differently; harder to associate with software blocks than e.g. the ISA level.

Software-oriented modelling:

- Functional block level (3.2.1): reflects energy consumed by functional units (multiplier, FPU,
  etc.) at the behest of instructions; a relationship between ISA and micro-architecture can be
  made; more access to processor design details is required, depending on profiling method.
- ISA level (3.2.2): device profiling and simulation give an ISA energy model; “external effects”
  such as cache misses are considered; various implementations exist for different architectures.
- Performance counter based (3.2.3): estimates energy based on properties such as cache hit rate;
  a possible alternative to direct hardware energy measurement; can also be used in simulation;
  requires hardware performance counters that provide sufficient data for accurate modelling.
- Software functional level (3.2.4): builds a database of energy consumed by software library calls;
  estimates program energy based on those calls; relies upon profiled library calls occupying the
  majority of execution.
Table 3.1: Energy modelling technique overview.
3.1. Hardware energy modelling
Approaching energy modelling from the perspective of hardware, physical properties such as de-
vice size, transistor behaviour and interconnection types dictate how energy consumption can be
calculated. This section examines modelling approaches that describe these characteristics in a
relatively high level of detail. They are still usable for energy consumption analysis of software.
However, lower level models are typically concerned with a level of detail that would make it
impractical to model long sequences of instructions, particularly in terms of the investment a software
engineer would be willing to put into such an activity. The time taken to perform this form of
modelling does not fit easily into a software developer’s compilation and testing process.
3.1.1. Component parameter exploration
Processors have a number of fundamental components, including processing elements, memories
and interconnects. Exploring different configurations of these can give insight into creating a more
energy efficient implementation of a processor.
CACTI
CACTI [WJ96] is a cache access, cycle time and power modeller that captures the behaviour of
cache implementations with sufficient fidelity that its initial version is accurate to within 6 % of
lower level electrical simulations such as those performed in Hspice.
The aim of CACTI is to represent the cache as a model of various key parameters, extended on
prior research by introducing features such as a transistor-level model for the decoder and load-
dependent transistor sizes. Version 1 of CACTI is purely for timing modelling. However, version
2 [RJ00] and beyond include power models. At the time of writing, the latest version of CACTI
is 6.5 [HP 14]. It features a web interface and downloadable source, and focuses on many types of
interconnect related to the memory hierarchy, including routed data and various wire types.
CACTI serves as a design space exploration tool for memory hierarchies in processor architec-
tures. In addition, its power models can be exploited in energy models at various levels, where
cache access patterns are available. For example, Bathen et al. utilise CACTI to establish mem-
ory subsystem power costs of software optimisations that seek to lower overall energy consump-
tion [Bat+09].
The movement of data around a processor and the surrounding system can form a large part of
the system’s activity. As such, architectures where memory and cache access dominate the energy
consumption can be modelled with reasonable accuracy with an appropriate CACTI configuration.
McPAT
The Multi-core Power Analysis and Timing (McPAT) tool provides a framework for modelling var-
ious micro-architectural properties of contemporary multi-core systems, with a view to estimating
the inter-related characteristics of power, size and timing [Li+09]. McPAT uses eXtensible Markup
Language (XML) files as a configuration interface to its underlying models, allowing external per-
formance, power and thermal simulators to be used as a source of further simulation data. McPAT
can be seen as a similar tool to CACTI, but one that is aimed at capturing characteristics of more
than just the memory hierarchy.
Early examples of McPAT processor models yield power and area accuracy ranging between
10.84 % and 27.3 %. These examples include Niagara, Alpha and Xeon architecture variants.
McPAT is then demonstrated as a design space exploration tool by examining the effect of changing
various parameters such as feature size and number of cores per shared cache. This was then used
to demonstrate that at 22 nm, a 4-core cluster gives a better Energy Delay Product (EDP) than 8
cores.
More recently, McPAT has been combined with the Sniper x86 performance simulator, to give
early energy optimisation opportunities for hardware and software that are both still in devel-
opment [Hei+12]. Performance counters from Sniper include component utilisation levels (duty
cycles) and cache miss rates. The reported error is between 8.3 % and 22.1 % when compared to a
real Intel Nehalem based system measured at its 230 V AC supply input. The performance counter
estimations are then used to implement improvements that increase energy efficiency by 25 % and
performance by 66 %. This results in a software implementation that is a particularly good fit to
the underlying hardware. Finding a good fit was expressed in Section 1.1 and is considered
important in creating low energy software. This is discussed further in Chapter 4.
The memory hierarchies and performance counters that relate to them form a significant part
of this type of performance and power modelling. This serves relatively large processor architec-
tures well. However, in smaller, simpler architectures targeted at embedded systems, the memory
hierarchy and system structure can be quite different, as is explained in Chapter 5 with respect to
the XS1, and Chapter 10 in relation to a variety of other architectures.
3.1.2. Gem5
Gem5 is a freely available [Gem14] system level simulator that allows a combination of components
to be characterised and assembled together. Typically, Gem5 is used as a platform to simulate new
system designs in order to perform design-space exploration and the possibility to test software
prior to the construction of a physical system.
The Gem5 simulator, as described by Binkert et al. [Bin+11], draws upon the work of two prior
simulators — M5 [Bin+06], with its configurable ISA and processor models, and GEMS [Mar+05],
which has good memory and interconnect configuration and simulation capabilities to create a
highly configurable simulator for a full embedded system.
Accuracy in Gem5 varies depending on the complexity of the system and the behaviour of the
software that is executed on the simulated platform. Butko et al. [But+12] demonstrate that errors
in performance modelling can reach almost 18 % if there are heavy external memory accesses to a
complex Double Data Rate (DDR) Dynamic Random Access Memory (DRAM). However, in other
cases performance error is as low as 1.39 %.
Energy modelling with Gem5
Although Gem5 is not itself an energy modelling tool, its modular nature allows energy models to be
attached to the components that are assembled into a simulated system, from which energy data
can be extracted.
Rethinagiri demonstrates a system level power estimation tool [Ret+14] that uses performance
counters from various simulated components in Gem5, combined with other properties of the
system, in order to estimate the power for various ARM-based systems. Performance counters
include features such as external memory accesses, cache misses and the IPC of processor cores.
System properties include bus and core frequencies. A set of simple assembly programs, consisting
of test vectors of small algorithms, were used to characterise the costs that needed to be associated
with the various performance counters and system properties.
This modelling technique is shown to achieve less than 4 % error on average, which performs sig-
nificantly better in comparison to McPAT. However, the research focuses on a single frequency and
voltage operation for each processor core, where the static property of frequency is the dominant
term in all cost equations, thus the relative error from the simulated components (performance
counters) is low. This effect is more pronounced in the Cortex-A9 dual-core processor, where the measured
system power varies by single percentage points across all tests. It is unclear how the estimation
model would perform when DVFS or other aggressive power saving features are used.
3.1.3. Transaction level modelling
In Transaction Level Modelling (TLM), a system’s components can be represented at different
levels of abstraction, but the modelling is driven from the perspective of data exchanges between
these components. System activity is viewed as combinations of reads and write events, possibly
with timing information attached to these events. The components involved in the transaction
can then update model state based on parameters given by the transaction. In the context of
energy modelling, the focus is upon using these transactions to increment energy consumption
of components as they act upon transactions, yielding an overall energy consumption estimate
for the system. The motivation behind this approach is improved performance over finer-grained
modelling such as gate-level simulation of the entire system, allowing more rapid design space
exploration whilst retaining good modelling accuracy.
TLM is demonstrated for a Multi Processor System on Chip (MPSoC) in [AND07]. The work
models an MPSoC with a group of MIPS R3000 processors combined with caches, a crossbar
interconnect, Static Random Access Memory (SRAM) and other peripheral components. Each of
the components has an energy model associated with it, which can be constructed separately using
an appropriate method. For example, the SRAM component is modelled based on data extracted
from analogue simulation of a range of device configurations, yielding a parameterised model based
on the device’s capacity and organisation (e.g. word length).
The final SRAM model considers three activities in relation to the TLM: read, write and idle.
The underlying behaviour of the state machine that forms the SRAM, which would be present
in a cycle-accurate simulation, is omitted in this approach. The work demonstrates a speedup in
simulation of more than 20 % compared to a cycle accurate approach for 16 processors, with power
estimation error of less than 10 % in all test cases. Designs with smaller numbers of processors and
larger caches were the most accurate, as error from simulation of contention in the interconnect
has less of an impact.
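The per-activity accounting described for the SRAM component above can be sketched as follows; the energy costs are invented placeholders, whereas in the cited work they are extracted from analogue simulation:

```python
# Transaction level modelling sketch: a component attaches an energy cost to
# each transaction kind (read/write/idle); system energy is accumulated as the
# transaction stream is replayed. Costs here are invented for illustration.

class SRAMModel:
    COST_NJ = {"read": 0.9, "write": 1.1, "idle": 0.05}  # per-activity costs

    def __init__(self):
        self.energy = 0.0

    def transact(self, kind):
        # Internal state-machine behaviour is deliberately omitted: the TLM
        # abstraction charges energy per transaction, not per cycle.
        self.energy += self.COST_NJ[kind]

sram = SRAMModel()
for event in ["read", "read", "write", "idle"]:
    sram.transact(event)
print(round(sram.energy, 2))  # 2.95
```

A full TLM system would wire several such component models to a shared transaction stream, summing their individual totals for a system-level estimate.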
Other research has extended TLM to support multiple levels of accuracy within the modelling
framework. Beltrame et al. [BSS07] are motivated by providing good simulation performance that
uses more resources to model interesting parts of the simulation. Conversely the uninteresting
periods are simulated at higher speed with lower accuracy. The communication channels that
carry the transactions and the modules (or components) that they interact with must be modelled
at more than one level of detail in order to make this approach possible. It is left to the designer
of the target system to choose where to switch between accuracy levels, although the research does
provide the accuracy/performance trade-off which can help in making this decision.
3.2. Software energy modelling
A software-centric energy modelling approach can, as with hardware, take place at various levels
of abstraction with trade-offs in performance, accuracy, and the granularity at which information
can be presented. This section begins at the functional block level, then raises the abstraction
level through the ISA and performance counters, finishing at a purely software level.
3.2.1. Functional block level modelling
A similar approach to the above seeks to identify processor energy by modelling activity within
particular functional blocks in the processor. For example, a processor may have various functional
units for simple arithmetic, multiplication, division and memory operations.
In [IRF08], a TI C6416T VLIW processor is separated into six blocks: the clock tree, fetch and
dispatch components, the processing unit, internal memory and the L1 cache, split into data and
instruction parts. Parameters to the model include read and write access rates from and to these
components, along with cache miss rates. Validation across a series of DSP-centric benchmarks
shows a worst case energy estimation error of 3.6 %.
Other work explores a different set of processor designs and alternative methods for classifying
functional groups and associated instructions. Blume et al. [Blu+07] show that classifying instruc-
tions into 6 groups for the ARM 940T is the optimal grouping, where fewer classes significantly
increase error and more yield only a small improvement. Additional characteristics such as mem-
ory must also be accounted for. This is presented as a hybrid functional/instruction level power
analysis approach, with a worst case estimation error of 9 % for the aforementioned ARM 940T
and 4 % for the OMAP5912 across a set of benchmarks. The work also shows that reducing model
complexity by ignoring behaviours such as cache misses can result in estimation error increasing
by over five times.
3.2.2. ISA level modelling methods
Tiwari’s early work into x86 instruction set modelling [TMW94b] seeks to estimate the energy, E,
of a program, p, by considering three components of execution: the base instruction cost of each
instruction that is executed, the inter-instruction overheads of switching between one instruction
and another, and any external effects such as cache misses. These values are extracted from a target
system with a test harness executing specific instruction sequences and measurement equipment
collecting energy consumption data. The model is then expressed by Equation 3.1 [Tiw+96]. For
all instructions, i, in the target ISA, the base instruction cost, Bi, is multiplied by, Ni, the number
of times the instruction, i, is executed. For each pair of instructions executed in sequence, i, j, the
inter-instruction overhead, Oi,j , and frequency of occurrence, Ni,j , is counted. Finally, for each
external component, k, the energy cost of external effects, Ek, is determined, for example with an
external cache model.
E_p = \sum_i (B_i \times N_i) + \sum_{i,j} (O_{i,j} \times N_{i,j}) + \sum_k E_k \quad (3.1)
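As a sketch, Equation 3.1 can be computed directly from instruction statistics. The costs below are invented placeholders; in practice they come from the hardware measurement process described above:

```python
# Sketch of Tiwari's ISA-level energy model (Equation 3.1). All numeric costs
# here are invented for illustration; real values are profiled from hardware.

def tiwari_energy(base, pair_overhead, external, counts, pair_counts):
    """E_p = sum_i Bi*Ni + sum_{i,j} O_ij*N_ij + sum_k Ek."""
    e = sum(base[i] * n for i, n in counts.items())                 # base costs
    e += sum(pair_overhead[ij] * n for ij, n in pair_counts.items())  # inter-instruction
    e += sum(external)                                              # external effects
    return e

base = {"add": 1.0, "mul": 3.0}        # Bi: per-instruction base cost (nJ)
overhead = {("add", "mul"): 0.5}       # O_ij: cost of switching between i and j
counts = {"add": 10, "mul": 2}         # Ni: execution counts
pair_counts = {("add", "mul"): 2}      # N_ij: adjacent-pair counts
external = [4.0]                       # Ek: e.g. energy from a cache-miss model

print(tiwari_energy(base, overhead, external, counts, pair_counts))  # 21.0
```

The three sums map one-to-one onto the three components of execution identified in the text: base costs, inter-instruction overheads and external effects.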
Building on this research are energy modelling tools such as the Wattch framework [BTM00].
Wattch produces energy estimates of software through simulation, by modelling key components of
a processor architecture, such as the cache hierarchy and size, functional unit utilisation and branch
prediction capabilities. Wattch can model software targeting various architectures, to within 10 %
of commercial low-level hardware modelling tools. The SimpleScalar [Aus02] architecture modelling
software was used as a basis for a similar power model, resulting in Sim-Panalyzer [Sim04].
The idea of measuring instructions and their interactions can be broken down further, a model
for which was proposed by Steinke et al. [Ste+01a]. This model extracts more information on the
source of energy consumption in the processor pipeline, such as the cost of switching in each read
action upon the register file, as well as the cost of addressing different registers for read and write-
back. The precision of the approach is shown to be within 1.7 % of the target hardware, although
it significantly increases the number of variables that must be considered when implementing the
model. Other types of processor architectures have also been modelled in similar ways, such as
VLIW DSPs [Sam+02; IRH08], with average accuracies of 4.8 % and 1.05 % respectively.
To model complex micro-architectures or large instruction sets, linear regression analysis can be
used. With sufficient supporting empirical data, a solution to a parameterised model can be found
that establishes values for any unknown terms. This has been utilised to model an ARM7TDMI
processor [LEM01], using empirical energy data from observed test programs to aid the solver,
yielding a model with a 2.5 % average error.
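A minimal sketch of the regression step follows, assuming a toy ISA with two instruction types and synthetic measurements; real implementations solve much larger systems, but the principle of fitting unknown costs to observed energies is the same:

```python
# Sketch: recover unknown per-instruction energy costs from measured
# whole-program energies by least squares (normal equations, pure Python).
# Each row of X holds the instruction counts of one test program; y holds its
# measured energy. Data is synthetic, invented for this example.

def solve_2x2(A, b):
    det = A[0][0] * A[1][1] - A[0][1] * A[1][0]
    return [(b[0] * A[1][1] - b[1] * A[0][1]) / det,
            (A[0][0] * b[1] - A[1][0] * b[0]) / det]

def fit_costs(X, y):
    # Normal equations for two unknowns: (X^T X) b = X^T y.
    xtx = [[sum(r[i] * r[j] for r in X) for j in range(2)] for i in range(2)]
    xty = [sum(r[i] * yi for r, yi in zip(X, y)) for i in range(2)]
    return solve_2x2(xtx, xty)

# Three observed test programs: counts of (add, mul) and measured energy (nJ).
X = [[10, 0], [0, 10], [5, 5]]
y = [10.0, 30.0, 20.0]          # consistent with add = 1 nJ, mul = 3 nJ
print(fit_costs(X, y))          # [1.0, 3.0]
```

With noisy empirical data the fitted costs only approximate the true values, which is why such models report average rather than exact error.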
These approaches can all deliver an accuracy of 1–10 % across various architectures. However, the
architectures that they analyse are either single threaded, or special purpose DSPs. As such, these
models are not equipped to model a hardware multi-threaded processor. Either these approaches
must be extended, or an alternative approach found, ideally whilst maintaining comparable accu-
racy to prior models.
3.2.3. Performance counter based modelling
In a number of modelling methods, hardware performance counters are used to estimate energy
consumption. The benefit is that these counters can be used by a wider range of users who do not
necessarily possess direct energy measurement capabilities for their target system.
In [CM05], a set of performance events are monitored via an Intel PXA255’s configurable per-
formance counter sampling mechanism. These include characteristics that have been modelled
via various means throughout the literature review in this section, such as cache misses, but also
instruction counts, data dependency events and an abstraction of main memory behaviour through
some of these events. The work states that the embedded PXA255 has fewer counters than a larger
processor, requiring profiling runs in order to gather sufficient data for a robust model. It is shown
that the average error is 4 % for the SPEC2000 and Java benchmarks run on this processor via a
Linux based OS.
The Xeon Phi, whose architecture is discussed in Chapter 10, is modelled in a similar way [SB13].
In this case, a set of micro-benchmarks are used to exercise various behaviours and extract a
performance counter lead model. The Phi is significantly more complex than a PXA255, in that it
contains multiple x86 cores and multi-threading. As such, multi-threaded behaviour must also be
accounted for, with the model containing a scaling term defined by the number of active threads
in a core. The model accuracy is stated as being within 5 % of hardware energy for real world
benchmarks, and the information from this model is used to demonstrate that code from the
Linpack benchmark suite can be energy optimised for the Phi, increasing efficiency by up to 10 %.
The previous examples use performance counters in lieu of direct energy measurements, the motivation being that instrumenting most hardware to measure energy is time consuming or presents a technical barrier for software developers. However, simulated performance counters can also be used to estimate the energy consumption of a program on a device to which the developer does not have access. For example, architecture simulators such as Sniper and system simulators such as Gem5 can provide performance counters that can be used to the same effect as those found in real hardware. Modelling that centres around these simulators is discussed earlier in this chapter, in 3.1.
3.2.4. Software functional level modelling
Analogous to modelling hardware activity at a functional block level, the compartmentalisation
of software into libraries of frequently used functions can also be used as the basis for an energy
model. In this case, the structure of the underlying hardware is not a concern. Instead, the energy
consumption of sets of software library calls is measured and these are then used in an energy
model, based on the frequency with which each of the calls is made within a program [Qu+00].
The work of Qu et al. shows that if a suitably large database of library call energy consumption is built, a program's energy can be modelled to within 3 % of hardware measurements. For this method to work well, it is assumed that a program spends the majority of its execution time in library calls, therefore executing code for which the energy cost is already known.
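The functional level model therefore reduces to a weighted sum over call frequencies. A minimal sketch in that spirit follows; the call names and per-call energy costs are purely illustrative, not taken from [Qu+00].

```python
# Hypothetical per-call energy costs in microjoules, as would be gathered
# into a profiling database of frequently used library functions.
call_energy_uj = {"fft": 120.0, "memcpy": 3.5, "sin": 0.8}

def program_energy_uj(call_counts):
    """Estimate a program's energy from the frequency of each library call."""
    return sum(call_energy_uj[name] * n for name, n in call_counts.items())

# A program dominated by known library calls:
profile = {"fft": 10, "memcpy": 200, "sin": 5000}
estimate = program_energy_uj(profile)  # 1200 + 700 + 4000 = 5900 uJ
```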
Beyond the challenge of having a suitable distribution of library calls in the modelled program, there are several other potential issues with this method. Firstly, the architecture-oblivious approach may require re-profiling of library calls when a new target processor is used, in order to preserve accuracy. Secondly, library calls whose energy consumption is heavily dependent on the supplied arguments may be modelled poorly if these dependencies are not considered.
3.3. Summary
This chapter has provided a review of various energy modelling approaches for a broad range
of architectures. A number of these approaches focus on detailed hardware characteristics, for
example pipeline and functional unit behaviour, whilst others utilise indicators at a higher level,
such as performance counters or library calls. The accuracy of these models varies, from within single-digit percentage error up to 20 % or more. The trends observed in published works suggest that sub-10 % error margins are more than sufficient for useful energy modelling of software. The comparisons drawn between approaches with respect to their accuracy are somewhat subjective, due to the various ways in which accuracy is calculated and ground-truth energy figures are obtained, i.e. low level gate simulation versus direct hardware measurement.
The granularity of modelling can affect the impact that such errors have. The wider the view
taken, the less of an impact sub-components have. For example, an energy model of a register file
might have an error margin of ±15 %, but if its contribution to the processor’s energy consumption
is only 10 %, its effects are diminished. A pragmatic approach to evaluating error must therefore
be taken, where the impact of an error must be considered alongside its magnitude at a given level
of detail.
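The register file example can be made concrete. A small sketch, assuming worst-case component errors contribute to the system-level error in proportion to each component's share of total energy:

```python
def system_error_bound(components):
    """Worst-case system-level model error, given (energy_share, error)
    pairs for each component; shares should sum to 1."""
    return sum(share * error for share, error in components)

# A register file with a +/-15 % model error but only a 10 % energy share
# contributes +/-1.5 % at the system level; combined with the remaining
# 90 % of energy modelled to +/-5 %, the overall worst case is +/-6 %.
bound = system_error_bound([(0.10, 0.15), (0.90, 0.05)])
```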
Many of these works are for single-threaded processors, although it is shown that more recently, multi-threaded and multi-core architectures can also be modelled and energy consumption
estimates for software running upon them can be given. However, in the embedded space, multi-threading and multi-core are less prolific than in High Performance Computing (HPC) or larger, more general purpose systems. Moreover, there remain novel architectures for which successful
energy modelling approaches are not demonstrated or sufficiently explored. One such embedded
architecture — the XMOS XS1 — forms the focus of this thesis.
This thesis is motivated by energy modelling of software and design space exploration where the
software is the focus of the design effort. As such, modelling approaches that lean towards the
software part of the stack provide the best foundation for further work. The energy consumption of
the hardware must be relatable to the software running upon it in order for meaningful information
to be made available that could prompt energy-saving changes to the software. In particular,
modelling at the ISA level provides a good intersection between the hardware and software because
it is in many senses the bridge between the two domains. This will be the main abstraction level
developed further by this thesis, although other levels will necessarily be incorporated and extended
as well.
4. Influencing software energy consumption
in embedded systems
The previous chapters have reviewed parallelism paradigms for software and hardware, as well
as existing energy modelling approaches. This chapter establishes the design space exploration
challenges that are present when attempting to save energy in an embedded system. This comprises
general approaches that are applicable to all systems and software, but also includes specific focus
on the additional constraints that are present in the embedded hardware/software design space.
This analysis of energy saving discusses both requirements and techniques. First, a set of objectives is discussed in 4.1. Then, 4.2 reviews the various interacting facets that govern energy
consumption, giving consideration to constraints that are present in embedded real-time systems.
4.3 focuses on historical and continuing reliance on Moore’s Law and how the technology roadmap
has changed in recent years in response to the ceilings of certain physical constraints being reached,
and the increase in demand for devices that are more energy efficient. In 4.4, the virtues of event-
driven programming are explored, where the avoidance of waiting-loops can dramatically improve
both performance and energy efficiency. The final section summarises the background that has
been covered and relates it to the contributions this thesis makes in Part II.
4.1. Forming objectives to save energy in software
Taking a software-oriented approach to saving energy, there are multiple opportunities for the
savings to be made. However, the scale of impact varies between approaches. Therefore, objectives
must be enumerated with appropriate priorities, in order to yield the best results and to avoid
simply deflecting the inefficiency into another area.
In 1.1, a number of research questions and thesis statements were made. This included the
argument that making sure software is a good fit to the hardware is essential for energy optimised
systems. This section expands on this argument and goes deeper into the energy optimisation
process, drawing on arguments made by Roy and Johnson [RJ97].
Roy and Johnson's work details a number of techniques and considerations for energy optimisation in the design of software. Written in 1997, it describes a state of embedded system processors somewhat different to today's, but the remarks made still apply, albeit with some adaptation in places. These will now be summarised and related to contemporary embedded processors and software.
4.1.1. Algorithm choice is the first and most important step
A poorly implemented piece of software is likely to perform poorly. In terms of energy, this
can be manifested by using algorithms that do not fit well with the computation hardware. For
example, code that relies on frequent divergent branching will not fit well to GP-GPU pipelines,
leading to significant inefficiencies, harming both performance and energy consumption. Similarly, code that is compiled for a generic instruction set, rather than a specific instruction set that includes additional features present in the target processor, misses out on opportunities to perform optimally.
An excellent example of this exists in general purpose computing, wherein a long-established
binary-heap tree algorithm is shown to be sub-optimal when virtual memory is considered [Kam10].
When correct consideration is given to the underlying system, in this case including an OS, the algorithm can be improved to deliver as much as a tenfold performance improvement. Specifically, the proposed modification to a binary-heap tree populates memory pages vertically, matching the direction in which the data structure itself is populated. This results in fewer page changes when accessing the data structure. This is a strong argument for understanding the levels of the system stack below the one at which a developer is actually implementing software. Virtual
memory is not typically a consideration in an embedded system, but other similar factors, such as
RTOS driven context switching, can be considered similarly important.
Fundamentally, this change affects performance the most. However, execution time is a significant factor in energy consumption too, and so remains essential when considering energy optimisation. Las Vegas style algorithms become an interesting counter-example when parallelism is exploited in a system. A Las Vegas algorithm will always produce the correct answer, but some randomisation in its approach means that the execution time is variable [LE94]. If such an algorithm is replicated in parallel, there is a higher likelihood of finding the fastest possible variant quickly. However, doing so effectively wastes the energy consumed by all instances except the fastest. Thus, parallelised Las Vegas algorithms can potentially be time-optimal, but very bad for energy consumption.
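This asymmetry can be illustrated with a small sketch. It assumes the best case, in which losing replicas are cancelled the instant the winner finishes, and a uniform, purely illustrative power per instance:

```python
def parallel_cost(runtimes_ms, power_per_instance_w=1.0):
    """Latency and energy of replicated Las Vegas instances run in
    parallel, with losers cancelled when the winner finishes."""
    time_ms = min(runtimes_ms)  # the fastest replica defines latency
    # Every replica dissipates power until the cancellation point.
    energy_mj = len(runtimes_ms) * time_ms * power_per_instance_w
    return time_ms, energy_mj

# Three replicas with variable runtimes: latency falls to 2 ms, but the
# energy (6 mJ) exceeds that of a single average run (~5.3 mJ at 1 W).
latency, energy = parallel_cost([2.0, 5.0, 9.0])
```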
4.1.2. Manage memory accesses
The previous examples also apply somewhat to this particular claim. Accessing memory takes time, and in a memory hierarchy that time can be unpredictable due to caches, with a significant difference in both time and energy consumption in the worst case. In [Kam10] the concern was with virtual memory, but here the physical memory implementation is the subject of interest.
In a memory hierarchy, the further away that data is kept, the longer it takes to access. At each level of caching, the performance penalty is typically an additional order of magnitude [Dre07, p. 16]. For example, if register accesses take a single cycle, then a level-1 cache access may take in the region of five cycles, whilst a level-2 access could take 15. A main memory access may take hundreds of cycles. Although some components may be able to enter a low power state during longer-running memory accesses [CSP04], avoiding these accesses altogether is preferable.
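The order-of-magnitude penalties above can be folded into a simple average access time model. The hit rates below are illustrative; the latencies follow the figures just quoted:

```python
def avg_access_cycles(hit_rates, latencies):
    """Average access cost for a memory hierarchy. hit_rates[i] is the
    probability that an access reaching level i is satisfied there;
    latencies[i] is that level's cost in cycles."""
    cycles, p_reach = 0.0, 1.0
    for hit, latency in zip(hit_rates, latencies):
        cycles += p_reach * latency  # every access reaching this level pays it
        p_reach *= 1.0 - hit         # misses fall through to the next level
    return cycles

# L1 at 5 cycles (90 % hits), L2 at 15, main memory at 200: 7.5 cycles
# on average, with the misses adding 50 % to the ideal L1-only cost.
avg = avg_access_cycles([0.9, 0.95, 1.0], [5, 15, 200])
```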
Once again, improvements in this area are best achieved by modifying the algorithm: finding ways to reduce the memory footprint and establishing a memory access pattern that makes the best use of available caches. In an embedded system, the memory hierarchy may be far simpler than that of a general purpose processor, to the point where there may be no caches at all. However, memory access should still be considered.
At the very least, register accesses are faster and consume less energy than accessing Random
Access Memory (RAM). Thus, ensuring that spills to memory are minimised will help performance
and energy. Many embedded systems execute code directly out of flash memory, allowing them
to have a smaller RAM. However, even if access times between flash and RAM are equal, the
energy consumption may differ, such that relocating frequently accessed code segments to RAM
may be preferable [PEH14]. This optimisation can be performed by a compiler, but a developer
may be able to indicate the best candidate code sections to apply this optimisation on, given their
understanding of the algorithm.
4.1.3. Utilise parallelism
If a processor has parallel features, then using them will further improve performance. When suggested by Roy and Johnson, this mainly concerned parallelism in a single core, as discussed in 2.2. This can now be expanded to include multi-core, although at the multi-core level the algorithm must be fundamentally parallel in order to exploit the hardware fully, returning the priority back to the original goal in this section: mapping the algorithm to the hardware. More subtly, implicit parallelism in code, such as independent sequences of instructions, can be exploited through the compilation process or software pipelining.
4.1.4. Utilise power management features
This goal, somewhat counter-intuitively, sits relatively low down the list of priorities. If the underlying device has power saving features, such as low power states, DVFS or the disabling of inactive units, then they should be made use of. This implies that these features are software controlled, which is not always the case. For example, tuning of device voltage can be done with
an appropriate combination of hardware and software [KE15a], but this can also be done purely
in hardware [Bur+00].
Utilising power management will save some energy, but the saving may be less than could be
obtained through the previously stated measures. The most apparent reason for this is that the
number of operations that the software needs to perform is not affected by such measures. In an
embedded system context, the best that can be achieved is to lower the frequency and voltage
to the minimum that enables deadlines to be met, then disable unused components. However,
changes at higher levels may instead use more components, but reduce the number of operations,
thus allowing even more aggressive power saving to then be applied. The balance of priority here
clearly leans towards the algorithm first, then power management; there is not a chicken or egg
dilemma to resolve.
4.1.5. Minimise inter-instruction overheads
This is the final goal defined by Roy and Johnson, and operates at the same level as a number of
the energy models described in Chapter 3, such as the Tiwari ISA model [TMW94b]. At the ISA
level, the trace of instructions that are executed is valuable for determining energy consumption
and it is shown that the precise sequence can have an impact through changing inter-instruction
effects. However, the gains to be made from careful instruction ordering and register addressing
are small compared to other efforts.
This assumes that the scheduling of instructions is purely to minimise switching activity in the processor as the sequence of instructions progresses through the pipeline. Such scheduling does not have an impact on performance. However, if the re-scheduling allows parallel utilisation of FUs or avoids a pipeline stall, this would improve performance. In that case, the change would be categorised under one of the earlier goals.
4.1.6. Multi-threaded, multi-core specific considerations
The previously stated goals largely hold today. However, additional multi-threaded and multi-core considerations can be added with respect to a number of the points made.
A sequential algorithm does not necessarily parallelise well, if at all. As such, a new dimension
is added to the challenge of developing an algorithm that maps well to the underlying hardware.
In particular, giving consideration to Amdahl’s Law [Amd67], sequential parts must be kept to a
minimum.
Certain strategies, such as instruction scheduling to minimise switching, may be impossible in a multi-threaded system. If the instruction sequence in the pipeline is sourced from multiple threads, then controlling that sequence from within a normal program is likely impossible. Fortunately, there are other, higher priority strategies that will yield better energy reductions regardless of this.
In a multi-core system, the management of memory access becomes a more complex problem
than before. If shared memory is used as the underlying mechanism for communication between
threads (regardless of the programming model that is used), then the latency of the memory
subsystem can be a significant bottleneck, reducing performance and costing energy. The impact
of the memory subsystem may vary, depending upon which threads are exchanging information.
Higher level caches, such as level-2 or level-3, may be shared between multiple cores, reducing
latency in data sharing. Multi-core caches must implement coherency mechanisms to ensure that
changes made from one core are visible to other cores when needed. The access of each core to memory may not be equal in terms of performance or connectivity, resulting in a Non-Uniform Memory Access (NUMA) architecture, further reducing the ability to predict how to make changes that will yield an improvement. In a larger scale system, messages may need to be passed along a network in order to access information in the main memory of another compute node.
Departing from shared memory and instead using a message passing implementation, such as that described for the XS1-L and Swallow in Chapter 5, may remove a number of layers of complexity from the memory hierarchy, but still requires more consideration than single-core software
development. If messages between cores are traversing a network, then the bandwidth and latency
of that network must be considered. If a core is able to do other work whilst waiting for a network
communication to take place, then latency can be hidden. However, whilst in such a case performance may be improved, the energy cost may be higher than optimal. For example, if a program suitably hides latency, but communication takes place between cores that are more distant than necessary (i.e. task placement upon the set of available processors is poor), then energy could be reduced by re-arranging tasks. In addition, many intercommunicating threads may create contention in the network, thus increasing latency and reducing throughput in an unpredictable manner.
4.1.7. Summary
This section has examined a list of goals that a software developer can seek to achieve in order to
reduce the energy of their program. These goals were then related specifically to embedded systems
where appropriate, and then extended from the original set [RJ97] to consider multi-threaded and
multi-core systems, which are the focus of this thesis.
The first target of energy optimisation in software should always be making the program a good fit to the underlying hardware. The benefit of exploiting finer-grained goals is small in comparison, and they should therefore only be explored after the best possible fit is found. As such, strategies such as DVFS and switching minimisation through instruction scheduling should not be the first optimisation activities.
In multi-core systems, both shared memory and message passing architectures add complexity
to the challenge of reducing the cost of exchanging data between threads. However, once the
processing power of the system is utilised optimally, the movement of information becomes the
most important goal, and so understanding the latencies and bandwidths present when moving
data, as well as how much these characteristics may vary, is particularly important.
4.2. Energy’s many relationships
The previous section focused on energy optimisation strategies from a software perspective. This
section examines the underlying hardware properties that dictate how energy is consumed and
how changes can be made to reduce (or increase) energy consumption. These properties, and the interactions between them, form necessary background for the energy profiling and modelling that is performed in Part II, and provide clarification for some of the reasons for the order of priorities given in 4.1.
4.2.1. The power, time, energy triangle
The relationship between power, time and energy was defined within the introductory chapter
(1.4). The distinction between power and energy is essential for this work, whereas in other
contexts it may be possible to interchange the terms without consequence.
In its simplest form, E = P × T, where the energy consumption of a system, E, is the product of the power dissipation, P, and time, T. Power is not usually a fixed value in a system, because system activity is constantly changing, resulting in the integral form of the equation, Eq. (1.1). An intuitive energy saving objective is to minimise P, T, or both simultaneously, in order to lower
the energy consumption of a system. However, the changes that must be made to deliver such a
reduction have to work within the limitations of the system itself and the components that form
it.
Seeking to improve one parameter may in fact have an opposite effect on the other. In such
cases, the desire is for the opposing negative impact to be less than the positive impact that is
introduced.
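A worked example makes this trade-off concrete; the figures below are illustrative:

```python
def energy_mj(power_w, time_ms):
    """E = P x T, assuming a fixed average power over the interval."""
    return power_w * time_ms  # W x ms = mJ

# An optimisation that raises average power by 1.8x but halves execution
# time still yields a net 10 % energy saving.
baseline = energy_mj(1.0, 100.0)   # 100 mJ
optimised = energy_mj(1.8, 50.0)   # 90 mJ
```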
4.2.2. Supplying power
In the previous subsection it is clear that energy can be reduced if time is reduced, even at the
cost of increased power dissipation, provided the former is more significant, giving a desirable net
effect. However, the change in power profile may have effects reaching beyond the computational
parts of the system.
50
4.2. Energy’s many relationships
Batteries are capable of storing a certain amount of charge, in order to provide energy to a system when no other source is feasible. However, the rate of energy transfer (power) has an
impact on the available charge, or effective capacity of the battery. It is shown in [PW99] that
battery capacity is affected by various factors, but of particular interest is the current and its
behaviour over time.
A higher current (implying higher power dissipation if the voltage is unchanged) reduces the efficiency of the battery; thus, under higher loads it will not provide as much total energy to the system. Further, lowering the average current is not necessarily sufficient to improve efficiency: Pedram et al. [PW99] also show that a current with high variance has a negative impact on efficiency.
Reconciling this with some of the previously described energy saving scenarios, it may not be desirable to make optimisations that result in a smaller execution time if the resulting power profile is higher or more variable.
Other power sources, such as DC-DC converters, also have current- and voltage-dependent efficiency characteristics. A concrete example of this is the supplies used in the Swallow system used in this thesis. The NCP1529 [ON 10] converters are most efficient at approximately 5 % of their maximum rated output current. At very low current, efficiency is extremely poor (asymptotic to zero), with a less dramatic reduction in efficiency as the load increases towards the maximum rated current.
The consequences of making poor choices when seeking energy savings may vary depending on
the power supply. For example, a current with high variance in a battery-powered system may
result in the device ceasing to work before it is expected, and it may be non-trivial to access the
device and replace the battery. However, sub-optimal DC-DC efficiency may not be so catastrophic.
Nevertheless, one cannot aggressively seek changes to execution time or power dissipation without
also considering the behaviour of the power supplies in the system.
4.2.3. Power dissipation in silicon and DVFS
The technique of DVFS is motivated by a desire to minimise energy consumption by balancing the trade-off between power and performance for a given workload [Bur+00]. In the Complementary Metal Oxide Semiconductor (CMOS) technology used by the majority of processors, power dissipation comprises two main components, both of which DVFS influences: static and dynamic power.
Static power
The main component of static power is the leakage current of the transistors in the silicon. This is
present regardless of the on/off state of transistors. As processors are fabricated on smaller process
nodes, the percentage of overall power dissipation that is attributed to leakage is growing [Kim+03],
for example due to increased leakage through thinner gate oxide layers, which must be combated
with technology such as improved high-k gate dielectrics [WWA01].
P_s = V × I_leak    (4.1)

In Equation 4.1, the static power, P_s, is the product of the device voltage, V, and the leakage current, I_leak. This is a simplified linear relationship between operating voltage and static power. In reality, the relationship is quadratic and dependent on multiple factors, including supply voltage, temperature, feature size and gate oxide thickness [BR06]. However, a linear approximation can be sufficient at a high level of modelling that does not reach any extremes of operation. Most importantly, however, static power is not directly influenced by circuit switching activity.
Dynamic power
Power dissipated in order to switch transistors on or off is termed dynamic power, P_d, and is expressed in Equation 4.2.

P_d = α C_sw V² F    (4.2)

C_sw is the capacitance of the transistors in the device and α is an activity factor: the proportion of them that are switched. The activity factor is workload specific, but is often estimated by assuming that half of the transistors in the device switch [BTM00], giving α = 0.5. F is the operating frequency of the device. Observe that changes in V have the biggest influence on dynamic power dissipation.
A reduction in V, however, will slow the transistor switching speed, increasing the delay in the critical path and requiring that F also be lowered. Thus, there is a trade-off between reduced power dissipation and the total energy consumption due to longer execution time; in some cases it is not beneficial to slow the device down further. Choosing a strategy for energy saving, be it tuning the frequency to avoid slack time, or racing to idle by operating at high speed briefly and then dropping to a low power state, depends on the type of work and the behaviour of the system; no one strategy works in all cases [ANG08].
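Combining Equations 4.1 and 4.2 with execution time shows why lowering frequency alone saves little, whereas lowering voltage along with it does. A sketch under illustrative device constants (none are measured values):

```python
def total_energy_j(v, f_hz, cycles, alpha=0.5, c_sw=1e-9, i_leak=0.01):
    """Energy to execute `cycles` clock cycles at voltage v and frequency
    f, combining static (Eq. 4.1) and dynamic (Eq. 4.2) power."""
    p_static = v * i_leak                     # Ps = V * Ileak
    p_dynamic = alpha * c_sw * v ** 2 * f_hz  # Pd = alpha * Csw * V^2 * F
    return (p_static + p_dynamic) * (cycles / f_hz)

# Halving F alone lengthens execution, so static energy grows: ~70 mJ vs
# ~60 mJ. Halving F and dropping V to 0.8 recovers a net saving: ~48 mJ.
e_fast = total_energy_j(1.0, 100e6, 1e8)
e_slow = total_energy_j(1.0, 50e6, 1e8)
e_scaled = total_energy_j(0.8, 50e6, 1e8)
```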
The relationship between voltage and frequency varies depending on manufacturing process and device implementation. Simplistic representations, such as that in [Kim+03], represent the relationship as F ∝ (V - V_th)/V, where V_th is the threshold voltage of the transistor. This representation projects that as V approaches V_th, F approaches zero. However, Sub-Threshold Voltage (STV) operation is possible [Zha+09], with F instead dropping exponentially, and therefore slow, very low voltage devices can be made.
Working above STV, the normalised operating frequency and voltage, F_norm and V_norm respectively, can therefore be related as in Equation 4.3, taken from [Kim+03], where V_max is the maximum operating voltage of the transistor.

V_norm = F_norm (1 - V_th/V_max) + V_th/V_max    (4.3)
A step reduction in voltage requires a larger step reduction in frequency. With a conservative
view, where preserving correct operation is required, the relationship can be represented linearly.
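Equation 4.3 can be evaluated directly to see this. A sketch with illustrative threshold and maximum voltages:

```python
def v_norm(f_norm, v_th=0.3, v_max=1.2):
    """Normalised supply voltage required for a normalised frequency,
    per Equation 4.3; v_th and v_max are illustrative values."""
    k = v_th / v_max
    return f_norm * (1.0 - k) + k

# Halving the frequency (f_norm = 0.5) permits only a 37.5 % voltage
# reduction (v_norm = 0.625): the voltage step is smaller than the
# frequency step, consistent with the linear relation above.
```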
Other losses
Conditions such as short-circuit current can also be factored into the overall power dissipation of
a device. Techniques such as the α-power law MOS model consider these [Sak88]. In this thesis,
however, these additional effects are considered to be part of either dynamic or static power,
depending on their relationship to transistor switching activity.
Environment and workload affect silicon speed
Transistor switching speed increases in an approximately linear relationship to voltage whereas
speed’s relationship with temperature is feature size dependent. However, higher voltages result in
greater dynamic and static power dissipation, and so the relationships between design thresholds,
workload, speed, voltage and temperature are not always straightforward. For example, for larger
feature sizes of 65 nm or more, the relationship between temperature and threshold voltage can
typically be represented linearly, but the static current leakage has an exponential relationship
with temperature [WA12]. Sub-65 nm exhibits an inversion in the temperature-speed relation-
ship [KK06].
Processor temperature may be influenced by the ambient temperature of the operating environ-
ment, but also by the workload run upon it, as this will increase energy consumption and thus
power dissipated as heat.
In order to provide a reasonable expectation of safety in a voltage tuned chip, its speed should
either be constantly monitored, or if this is not possible, it should be measured during a period
of slowest silicon performance. Inadequate monitoring or profiling could lead to an environmental
change triggering a fault, or simply sub-optimal energy usage.
Summary
This section has demonstrated that energy saving is a multi-dimensional problem, where any one effort to reduce energy may inadvertently increase it in some other way. Providing visibility of this is therefore essential if any form of design space exploration, at the hardware or software level, is to be effective.
4.2.4. Racing to idle in a real-time system
In a general purpose system, DVFS can be used as part of a race to idle strategy, where the
device voltage and frequency can be aggressively scaled back upon completion of the current task,
significantly reducing power dissipation. It may even be possible to turn off certain components through power gating, removing static leakage as well.
In an embedded real time system, hard deadlines and responsiveness constraints can work against
this strategy.
In a real time application, for a given block of code there exists a minimum timing constraint t between its endpoints. Given the number of cycles c needed to execute the block, the minimum frequency at which the block can operate and meet those constraints is:

F = c / t    (4.4)
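Equation 4.4 translates directly into code; the cycle count and deadline below are hypothetical:

```python
def min_frequency_hz(cycles, deadline_s):
    """Minimum clock frequency to execute `cycles` within `deadline_s`,
    per Equation 4.4: F = c / t."""
    return cycles / deadline_s

# A 5000-cycle handler with a 100 microsecond deadline needs >= 50 MHz.
f_min = min_frequency_hz(5000, 100e-6)
```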
In a system involving external I/O, consider the entry point to a block as the receipt of a stimulus (i.e. an interrupt or event) and the exit point, some time t later, as the response to that stimulus. If the I/O activity is not monotonic, the system must always preserve a sufficient level of readiness to respond within time t.
If DVFS is applied within idle periods, such that F is lower than required to satisfy t, then the event triggered by the I/O stimulus will require F to be raised, either automatically by the hardware, or in software. In either case, this takes time, during which either the clock is halted [Int03b, p. 31], or continues to run at the slower frequency. In response to this, the active period may require a new, higher frequency to be used, in order to complete the code within t.
MII Ethernet receiver
To illustrate the above problem, an example is given in the form of a physical layer Media Independent Interface (MII) to an Ethernet receiver. The interface has a strict timing requirement that must be met to avoid corrupting an incoming packet. The XMOS XS1-L architecture, which is explained in more detail in Chapter 5, is used as the case study processor for this example.
A 100 Megabits per second (Mbps) Ethernet frame is formed of a preamble, a Start of Packet (SoP) token, up to 1500 bytes of data, followed by 4 bytes of Cyclic Redundancy Check (CRC). There is a minimum inter-frame gap of 960 ns. The preamble takes 600 ns. Data is delivered via a port, RXD, that is 4 bits wide, receiving a nibble every 40 ns. With a buffered input on an XS1-L, 32 bits can be read at a time, thus allowing 320 ns to process each word. A separate input from the MII, RXDV, indicates that data is being received.
If a slow clock frequency is used during the inter-frame gap, then there is a 600 ns period from detecting RXDV changing to being ready to receive the SoP and subsequent data.
If 16 instructions are required to input and process each word of the Ethernet frame, then a 200 MHz core clock is needed to satisfy the instruction rate of the receiving thread on the XS1-L, as per Eq. (5.2). Once RXDV goes low, let us assume that a further 8 instructions are needed to process the end of the packet before entering a lower frequency state.
Assuming the best case DVFS switch delay of a single instruction cycle, the lowest usable slow
clock would be 6.67 MHz, at which one thread instruction issue takes the full 600 ns window.
When the mode switch latency is 29 cycles, the slow clock can be no lower than 193.33 MHz.
Beyond this, the slow clock would need to equal or exceed the fast clock in
order to satisfy response times and thus would be counter-productive. If the Ethernet interface
is not 100 % utilised, then the inter-frame gaps may be larger. If the duty cycle of the Ethernet
frames is known, then this can potentially be exploited to further save energy, by allowing limited
periods of slower operation.
Figure 4.1 shows the trade-off between power saving, frequency reduction with a known duty
cycle, and the DVFS mode switch latency. The mode switching latency dominates the ability to
save energy. The steep edges on the surface plot are the points at which the switching latency
requires a slow clock that uses a higher core voltage, reducing the potential power savings.
Online versus offline energy optimisation
It is possible to set DVFS scaling points offline, before programs run, or to determine them online
during execution. The latter has the potential to be more flexible, in that certain properties that
Figure 4.1: Percentage power saving obtained with varying Ethernet packet duty cycles and mode
switch latencies.
might not be known statically, such as the actual ingress rate of Ethernet frames, can be used to
guide parameter selection. However, performing these calculations online incurs an overhead, which
may itself have a negative impact on energy consumption. This can be mitigated by combining
both offline and online scheduling techniques [CLH09].
It is also possible to reverse the goals, instead controlling the quality of service in response to the
energy available. Computational effort can be modulated in response to the energy available to a
system [Yak11]. Appropriate decisions must be made as to how the application degrades if energy
becomes scarce, and this requires online data from the hardware along with suitably adaptive
software.
Summary
This subsection has demonstrated mostly hardware-oriented efforts that can be made to save power,
working within the constraints that may be imposed upon an embedded system. Whilst DVFS can
be beneficial, the timing constraints in an embedded system necessitate that changes in frequency
and voltage be very fast.
Once again, the software becomes the focus of optimisation effort once the hardware's energy
saving features cannot help any further. This lends further credence to the prioritisation of goals
stated in 4.1: this thesis focuses on the software level, where the greatest potential for savings
lies.
4.3. Can we sit back and let Moore’s Law do the work?
The previous section demonstrated how hardware features such as DVFS have limits, particularly
in embedded real-time systems. However, the pragmatic software developer may choose to assume
that the next generation of a hardware component can deliver sufficient improvements in energy
consumption and/or performance that spending effort on energy optimisation at the software level
is without merit. This section argues against such a viewpoint.
The often-cited Moore's Law [Moo65] is a long-standing assertion relating to the progress made
in microprocessor manufacturing over time. Moore observed that "the complexity for minimum
component costs has increased at a rate of roughly a factor of two per year" and that this rate was
likely to remain constant. This led to the now popular interpretation of Moore's Law that
states the number of transistors in a processor doubles every two years.
Figure 4.2: CPU frequencies since 1972. Generated by the Stanford CPU database [Sta12].
When viewed with some flexibility this remains the case in 2015, some fifty years after Moore’s
observations were first stated. Based on this continuing trend of smaller, more capable devices
and the improvements in performance and energy efficiency that come as part of this, a plausible
solution to energy efficiency might be to sit back and benefit from improvements to hardware.
However, multiple factors conspire to limit the benefits of Moore’s Law, or even negate them.
Other aspects of processor designs that previously benefited from growth in line with Moore’s
Law no longer do so. The most notable example is the near stall in device clock frequencies since
2005, with current International Technology Roadmap for Semiconductors (ITRS) reports observ-
ing low single-digit percentage increases in clock speeds year on year [Kah13]. This is caused by
device operating voltages no longer reducing in line with Dennard’s scaling observations [DGY74].
Transistor counts continue to increase, but frequency boosts are restricted by thermal design lim-
its. Increased performance must now be extracted principally through multi-core or other forms
of parallelism. This can easily be verified by examining the Stanford CPU DB [Dan+12; Sta12],
wherein a plateau of CPU frequencies is clearly visible after 2005, as shown in Figure 4.2.
In addition to the above, the operating voltage of processors is approaching a floor, as the difference
between Vth and Vdd becomes very small. The benefits of reduced operating voltages were dis-
cussed in 4.2.3. Work continues into silicon devices that can operate at Near-Threshold Voltage
(NTV) [Kau+12] and STV [Zha+09].
Even if transistor counts in devices continue to increase, cost remains an essential parameter in
manufacturing products. New generations of technology node become prohibitively
expensive [OrB14b], except for larger organisations and large product runs [OrB14a].
In the software realm, Wirth’s Law [Wir95] argues that as systems grow in size and capability,
the software running upon them grows in complexity, with a slow-down associated with that
growth. May’s Law refines this argument with respect to Moore’s Law’s two-year cycle, asserting
that “software efficiency halves every 18 months, compensating for Moore’s Law” [Ead11]. In
parallel programming, this is particularly significant, because the impact of slow sequential parts
of a program can lead to profound inefficiency, as dictated by Amdahl's Law [Amd67].
From this smörgåsbord of laws and observations comes a strong motivation to specifically target
software when improving energy efficiency, or indeed any kind of efficiency in a system. Without
doing so, many of the potential benefits made possible through hardware improvements are lost,
particularly in the new era of parallelism.
4.4. Efficiency through event-driven paradigms
All computing systems must perform some form of I/O. This requires data to be exchanged
between a processor and an external device. In many cases, the availability of data varies based on
external, unpredictable parameters, such as the speed at which a user types or the network delay
before the receipt of a new Ethernet frame. Busy-waiting for these unpredictable time periods to
elapse wastes energy. In concurrent programs, delays may also be incurred by waiting for another
thread to reach a particular state. In either the case of concurrency or I/O,
some form of synchronisation must be performed.
Historically, software has often adopted a busy-waiting approach to delays. Various algorithms
for synchronisation exist [GT90]. Typically they may involve a spinlock or some other form of busy
loop. A superior alternative to this is to adopt an event-driven paradigm, where a wait condition
is specified and an event vector followed upon the satisfaction of that condition. Prior to the event
condition, the processor can execute other software, such as additional tasks if a multi-tasking OS
is used, or the processor can enter a lower power state where no instructions are executed. An
outline of the programming styles of these two methods is shown in Listings 4.1 and 4.2.
1  lockedTask(lock_t l,
2             resource_t r)
3  {
4      do {
5          // Spin, repeatedly
6          // testing lock
7      } while (tryLock(l) == 0);
8      performActivity(r);
9      unlock(l);
10 }
Listing 4.1: Spinlock loop.
1  lockedTask(lock_t l,
2             resource_t r)
3  {
4      waitLock(l); // Blocks thread
5      performActivity(r);
6      unlock(l);
7
8
9
10 }
Listing 4.2: Event-driven wait.
In order for a system to reap energy savings from event-driven paradigms, both the software and
hardware must support them. Without hardware support for interrupts and associated conditions,
the best a kernel or application can do is to emulate the checking of these conditions in
what effectively becomes another spinlock. Thus, events are abstracted into busy-waiting loops.
In Listing 4.2, line 4, it is assumed that the blocking of the thread in order to wait for the
acquisition of a lock will result in de-scheduling and therefore either idle time or the execution of
another thread. However, it may be that the implementation of waitLock elaborates to lines 4–7
of Listing 4.1.
In a review of synchronisation methods for MPSoCs [GLP07], sleep-based methods were found to
consume significantly less energy than all others. These sleep-based synchronisation algorithms
either apply DVFS in idle periods between checks, or obtain notification from hardware, reducing
activity even further.
An RTOS may provide a framework for writing tasks that make use of interrupts. The RTOS
itself may use timer interrupts to allow it to enforce task scheduling without tasks needing to
manually yield or make any kind of system call. The same is true of the kernel in a general
purpose OS, although the exposure of interrupt events is typically kept to device drivers, with
software libraries providing further abstractions between the hardware and user-space applications.
The XMOS XS1 architecture is built around events. In the XS1 lexicon, an event is analogous
to an interrupt without state-saving, where the handling of an event is assumed to be the intended
outcome of a thread. This is contrary to an interrupt, which assumes that the thread will resume
from its previous position after the ISR is complete, or at some other point thereafter, if kernel
scheduling takes place. This is examined in more detail in Chapter 5.
4.5. Summary
This chapter has introduced a set of problems relating to the goal of saving energy in an embedded
system. These problems are typically multi-dimensional, and the ideal outcome constrained with
respect to these dimensions.
Strategies such as DVFS have clear benefits in certain contexts. However, selecting the correct
DVFS parameters, particularly in a real-time system, is non-trivial.
A common failing in energy saving efforts is to push the problem from one dimension into
another. For example, aggressively optimising one part of the software may place an additional
burden on another. Similarly, lack of awareness of timing constraints may preclude any benefits
from being obtained. The prioritisation of effort is particularly important. In 4.1 it was shown
that, at the software level, the choice of algorithm is critical to energy consumption; effort spent
using software to better exploit low-level hardware energy saving features is wasted if the
algorithms used in a piece of software do not map well onto the underlying hardware.
An objective of this thesis is to further the state of the art in awareness of energy consumption
in MTMC embedded systems. By doing this, the developer is empowered to explore energy saving
options such as those described in this chapter. Most importantly, the developer of embedded
systems software is given sufficient visibility to identify where energy optimisation effort would be
wasted, allowing them to favour areas with greater potential for improvement (low hanging fruit).
This helps the developer establish a good balance between potentially numerous complex trade-offs.
5. A multi-threaded, multi-core embedded
system
This is the final chapter in Part I of this thesis. It details the XS1 processor architecture at the
centre of the modelling effort in Part II, along with the Swallow platform, an assembly of these
processors into a networked, Multi-Threaded and Multi-Core (MTMC) embedded system, which
is used to extend the modelling to networked, multi-core embedded software.
This chapter begins with a discussion of why the XS1-L is used as the focus of this work, in
response to the thesis statements put forward in 1.1.
Motivation of selection
To explore the modelling of software running on a MTMC embedded system, a suitable hardware
platform is required. The XMOS XS1-L processor meets this need for a number of reasons:
- Each XS1-L core is hardware multi-threaded.
- XS1-L processors can be interconnected to form a multi-core system.
- Energy efficiency is a consideration of the architecture, with event-driven paradigms and
  efficient multi-threading built into the ISA.
- Software written for these processors is typically run bare-metal (without an OS), in the
  relatively low level languages C and XC.
- The software has a large amount of direct control over the hardware's behaviour, due to the
  processor's target market.
In addition, the architecture has a number of characteristics that make it unique and worthy
of exploration with respect to energy modelling, complementing techniques applied to existing
architectures. In particular:
- Time-deterministic instruction execution.
- No cache hierarchy.
- Single-cycle memory.
- A focus on message passing rather than shared memory, both in software abstraction and
  hardware implementation.
- I/O control and communication directly implemented in the ISA.
Motivated by these characteristics, this chapter details these and other relevant parts of the
processor's ISA, micro-architecture, and physical properties in 5.1. These details form essential
background for the single-core, multi-threaded profiling and modelling contributions made in
Chapters 6 and 7.
The Swallow project is then introduced in 5.2: a platform with which grids of XS1-L chips can be
connected and programmed. A significant amount of research effort was put into making these
boards useful for MTMC research. These multi-core implementation details form the understanding
required for the contributions made in Chapters 8 and 9.
5.1. The XS1-L processor family
The XMOS XS1-L processor [May+08] is an embedded processor implementing the XS1 ISA that
allows hardware interfaces to be written in parallel software. The execution of software is time-
deterministic and the instruction set places I/O hardware extremely close to the software. This
Figure 5.1: Block diagram of XS1 architecture, showing pipeline, per-thread register banks, pe-
ripherals and network resources.
provides timing guarantees and advanced General Purpose Input/Output (GPIO) capabilities that
mean components such as MII Ethernet, Serial Peripheral Interface (SPI), I2C and Universal Serial
Bus (USB) can be expressed as software rather than fixed hardware units. The application space
of this processor is therefore between a traditional programmable microprocessor and an FPGA.
Energy efficiency is sought both in the hardware and programming model by introducing event
driven paradigms, previously discussed in 2.1.4 and 4.4, whereby a thread may defer its own
execution until the hardware observes a particular event, such as a change in state of an I/O pin.
The hardware scheduler eliminates the need for polling loops and similarly, due to multi-threading,
interrupt service routines are not required.
The architecture also features a custom asynchronous interconnect that can be used to assemble
networks of XS1-L processors to distribute programs over many cores, with communication taking
place over two- or five-wire X-Links. This section focuses on multi-threaded operation of a single
core. The network is explained in more detail in 5.2.
5.1.1. The XS1-L multi-threaded pipeline
To bring the software as close to the physical interfaces as possible, the XS1 ISA manages threads
in hardware, with machine instructions and hardware resources dedicated to thread creation, syn-
chronisation and destruction [May09b]. In addition, each thread has its own bank of registers,
containing a program counter, general purpose and special purpose registers. This removes the
overhead of an OS and allows threads to be created in tens of clock cycles, but places a hard limit
on the number of threads that can exist on the processor at any one time.
A block-level view of the XS1-L implementation is shown in Figure 5.1. In the XS1-L, up to
eight threads are supported. These eight threads are executed in a round-robin fashion through a
four-stage pipeline. The pipeline avoids data hazards and dependencies by allowing only a single
instruction per thread to be active within the pipeline at any one time. As such, the pipeline is
only full if four or more threads are active. If there are fewer than four threads able to run, then
there will be clock cycles in which pipeline stages are inactive. This makes the micro-architecture
relatively simple to reason about, but requires at least four active threads in order to use the full
computational power of the processor. When more than four threads are active, maximum pipeline
throughput is maintained, but compute time is divided between the active threads.
Calculating instruction throughput
With a 4-stage pipeline structure, the Instructions Per Second computed by the processor, IPSp,
in relation to the core frequency, F, is simply IPSp = F in the case where the pipeline is full. It
can be expressed for any number of threads, Nt, as shown in Equation 5.1.

IPSp = F · min(4, Nt) / 4    (5.1)
The instruction frequency of an individual thread, IPSt, can similarly be expressed as:

IPSt = F / max(4, Nt)    (5.2)
An XS1-L processor typically operates at 400 MHz or 500 MHz, the latter providing a total
instruction throughput of 500 Million Instructions Per Second (MIPS) or at most 125 MIPS per
thread.
Context switching is free in time, but not in energy
By virtue of hardware threads and a four stage pipeline, the processor effectively performs a
“context switch” on every clock cycle, assuming there are sufficient active threads. In terms of
performance, this delivers significant benefits to multi-threaded programming in a single-core by
removing large overheads. However, from an energy perspective this creates interesting behaviours.
Firstly, each clock cycle brings a completely new state into the pipeline: the instruction and
data issued will be from the next thread in the schedule. This introduces an energy cost
through switching in the control and data logic of the processor. Secondly, these context switches
on every cycle change the energy characteristics of the pipeline in a way that existing ISA level
energy models do not account for.
The fetch-noop and event-noop
The XS1-L pipeline structure avoids data hazards and other behaviours that would trigger pipeline
stalls and flushes in other architectures. However, there are limited scenarios in which a delay in
execution may still be incurred, in the form of a “fetch no-op” (FNOP).
The FNOP occurs when a thread’s instruction buffer does not contain a full instruction ready
to be issued. The conditions that may lead to starvation of the instruction buffer are explained
in [May09a] and summarised here:
- Multiple sequential memory accesses in a thread prevent fetches, because fetches are
  performed in the memory stage of execution.
- Branch operations flush the instruction buffer, then fetch the branch target when word-
  aligned.
- Instructions are 16-bit aligned in memory, but are 16 or 32 bits in length.
If the above properties conspire to starve the instruction buffer, then an FNOP is issued. FNOPs
can be avoided by scheduling instructions to avoid runs of memory operations in a thread and
by aligning 32-bit instructions that are branch targets to 32-bit boundaries. If either of these is
impossible or undesirable due to the other overheads that may be incurred, then the conditions
that result in an FNOP are sufficiently well defined that they can be determined statically, as is
the case in the XMOS Timing Analyzer [XMO10].
If an event or interrupt vector is followed by the processor, then the instruction buffer for the
affected thread must be flushed in a similar way, leading to a dedicated fetch, in this case termed
an event no-op.
These implicit no-ops, while simple to reason about, can have an important impact on the time
and also energy consumed by a program. Inclusion of these behaviours is therefore necessary in
simulation and energy modelling of the processor.
5.1.2. Instruction set
The XS1-L implements the XS1 ISA. This ISA is best described as a Reduced Instruction Set
Computer (RISC) construction, containing 203 instructions in total. Along with a set of typi-
cal arithmetic, logic, memory and branch operations, as described in the ISA manual [May09b],
additional groups of instructions provide simple DSP, peripheral component control, thread man-
agement, event handling and communication. This subsection elaborates on these ISA features
and their significance with respect to the unique requirements and opportunities they create when
modelling the energy consumption of software running on the device.
Directly accessed peripheral blocks
In a conventional computer architecture, peripheral components such as timers, device interfaces
and Direct Memory Access (DMA) units are memory mapped. A memory mapped peripheral
occupies a section of the processor’s address space. Within that address space reside registers for
control, status and data relating to that peripheral. The processor interacts with a peripheral by
performing memory reads and writes to these locations [Rei99].
In the XS1 ISA, peripheral components are accessed directly with a set of resource instructions.
These allow peripherals to be allocated to a thread, controlled, and data read from/written to
them. This separates activities that can be considered I/O from memory access in the instruction
set as well as the memory hierarchy. Other ISAs, such as x86 [Int11, pp. 115, 176], distinguish
between memory and I/O in the instruction set through I/O-specific instructions. However, I/O
and memory still share the address bus.
In XS1-L, the following resources are made available:
- Thread synchronisers
- Communication channels
- Timers
- I/O ports
- Locks
ISA operations performed in relation to these instructions include data input, data output and
configuration. The exact behaviour is resource specific. For example, communication channels
need to be configured with a destination address before use, whereas locks are a simple resource
which use data in and out instructions to obtain and release the lock, with no other configuration
required.
All resources can be associated with interrupt or event vectors, causing a context switch or jump
upon the resource triggering some condition. For example, a channel resource would trigger an
event upon the availability of new data from a remote channel end. A timer could trigger an event
in a thread that has set a comparison condition against the timer, in order to then perform some
action at a specified time.
A thread can be de-scheduled when waiting for an event, meaning that it is no longer executing
instructions and thus not consuming any time within the execution pipeline. Alternatively, a
thread may continue running instructions until an event takes place. In the latter case, interrupts
are likely more useful than pure events, as the context of the thread is variable, and thus some
state should be saved.
Hardware thread management
The scheduling of threads is handled in hardware. There is no requirement for an RTOS to be used
to manage threads. The ISA provides mechanisms for thread handling through a series of TINIT
instructions, which can be used to initialise the program counter, stack pointer and other pointers
of the target thread’s register file.
Threads can be initialised as unsynchronised, or associated with a synchroniser resource to allow
barrier synchronisation of groups of threads. The XS1-L supports up to eight threads, which share
time round-robin in the previously described four stage pipeline.
An allocated thread can be in either a running state, where instructions from that thread are
issued into the pipeline, or de-scheduled against a condition. The conditions against which a thread
waits are typically resource-driven, for example waiting for the expiration of a timer, or the arrival
of data on a port.
Given that scheduling is implemented in hardware, there is no need to check the status of a
thread. This combines well with the event and interrupt system of the processor. When an
event occurs that will allow the thread to be scheduled, the hardware scheduler will take action.
Busy-waiting loops can therefore be avoided.
Communication channels
The XS1 architecture uses channel communication for the exchange of data between threads,
modelled upon the CSP formalisation [Hoa78]. Channel communication is included in XC, a
custom C dialect created by XMOS. It is also present at the ISA and hardware level.
An XS1-L core has 32 channel end resources that can be allocated to threads. Each channel
end has an address, identified as a composition of its network node ID, local channel end ID and
resource type. The bit-wise construction is shown in Eq. (5.3). To send a message over a channel
end, a destination address must be specified with a SETD instruction against the local channel
end. All OUT instructions using that channel end will then be sent to the specified channel end. A
simplified sequence of these instructions is shown in Listing 5.1 and 5.2, where a single word (the
sender’s own channel end ID, in this case), is sent to a receiving channel end.
ID[31:16] = Node ID
ID[15:8]  = Channel end ID
ID[7:0]   = 0x2    (5.3)
1  getr r0, 2        # Get chanend
2  ldw  r1, cp[0]    # Load dst
3  setd res[r0], r1  # Set dst
4  out  res[r0], r0  # TX word
Listing 5.1: Sending on a channel.
1  getr r0, 2        # Get chanend
2  ldw  r1, cp[0]    # Load dst
3  setd res[r0], r1  # Set dst
4  in   r0, res[r0]  # RX word
Listing 5.2: Receiving on a channel.
A channel end will receive data from anywhere that addresses it. Thus, at the architecture level,
many-to-one communication is possible, but one-to-many multicast or broadcast is not. At higher
levels of abstraction in XC, channels are expressed purely as point-to-point communication, without
the ability to change the destination address dynamically. Further, although the architecture
also permits core-local message passing through shared memory, the original version of XC does
not support this, due to strict parallel memory usage rules.
[Figure body: Core 0 (node ID 0x0000) with channel ends 0x00 (ID 0x00000002, Dst 0x00000102),
0x01 (ID 0x00000102, Dst 0x00000002) and 0x02 (ID 0x00000202, Dst 0x00010702); Core 1 (node
ID 0x0001) with channel end 0x07 (ID 0x00010702, Dst 0x00000202). Each core comprises a
processor switch (PSwitch) and node switch (SSwitch); the channel ends are spread across threads
0, 1 and 4.]
Figure 5.2: Channel communication in the XS1 ISA. Both core-local (green, dotted) and multi-core
(blue, dashed) communication is shown, between two pairs of channel ends allocated
across three threads in total.
With core-local channel communication, node IDs will be the same for source and destination. A
data rate of 2 Gigabits per second (Gbps) can be achieved locally with a 400 MHz core clock. Multi-
core communication uses the node ID to route messages to the correct core and is bandwidth limited
by the interconnect. Figure 5.2 depicts how channel communication between threads and
cores takes place. The network implementation is explained in more detail in 5.2.
5.1.3. XS1 product and micro-architecture variants
A number of devices are based on the XS1-L micro-architecture. Not all variations are of interest
to this work, as they introduce features that are not central to the research. In addition, the
naming of devices and some architectural components has changed during the undertaking of this
research. This work has adopted the original conventions wherever possible, for consistency.
Product naming conventions
The devices and names that may be referenced in this thesis, and their key differences, are explained
below.
XS1
The ISA, shared by all the XMOS processors modelled in this work.
XS1-G
A quad-core, 90 nm implementation of XS1. This processor is not actively used in this
research.
XS1-L
A single-core, 65-nm implementation of XS1, at the centre of this research.
XS1-L1 and XS1-L2
XS1-L based devices, assembled into either single- or dual-core products in a single package.
The XS1-L2 devices are used in Swallow.
XS1-SU1
An XS1-L based device packaged with a USB Physical layer (PHY), Analog-to-Digital
Converters (ADCs) and voltage controllers. The peripheral devices are accessed using the XS1's
channel communication paradigms.
XS1-A8, A16, U8 and U16
Single- and dual-core variations of the XS1-SU1, using the more modern XMOS naming
scheme. Devices prefixed with “U” contain USB and analogue peripheral components,
whereas “A” devices omit USB.
Architectural naming conventions
A thread, as defined in 2.1.1 and the original XMOS terminology, is a logical core in the new
terminology. A core is then termed a tile. The distinction between old and new styles can be
made by observing that care is taken to refer to "logical cores", not simply "cores", in the new
terminology. Further, the term thread is not used in the new style, and tile is not used in the
old.
The changes made to the naming conventions create some potential for confusion when cross-
referencing material. However, in isolation, this thesis maintains consistency in its use of terms.
5.1.4. Summary
This section has given an outline of the features of the XS1-L, particularly those of interest to this
research, as listed at the beginning of the chapter. The very tightly coupled hardware scheduled
threads present a new challenge for ISA level energy modelling, whilst the channel communication,
event- and resource-driven parts present new opportunities for analysis of programs in a MTMC
context.
The examination and modelling of a single XS1-L core in Chapters 6 and 7 yields new insight into
hardware energy characteristics and a new approach to energy modelling of embedded software.
However, multiple such devices form a more complex and interesting subject of study. The next
section describes a system of this nature.
5.2. Swallow multi-core research platform
The Swallow multi-core research platform is a project established within the Microelectronics
Research Group at the University of Bristol during the course of the research presented in this
thesis. A significant amount of research and development effort was put into the tools and software
for Swallow to ensure that it can serve as a platform for supporting the multi-core component of
the multi-level energy model demonstrated in Chapter 9.
This section details the Swallow platform, its purpose in relation to this thesis, and how it
and the tools developed for it are used to further the research conducted herein. A more general
description of the Swallow platform, in particular covering aspects not directly relevant to
this thesis, can be found in [HK15].
5.2.1. System design
Swallow, pictured in Figure 5.3, is designed to allow a multi-core embedded system on the order of
hundreds of cores to be assembled and used for a variety of experiments, in particular work focusing
on multi-core task allocation, network utilisation and energy efficient multi-core computing. It
achieves this with XMOS XS1 based hardware. As a result, it does not exploit emerging chip
technologies such as large on-chip networks of many cores and 3D stacking of components. However,
it does provide an experimental platform for exploring some of the considerations that must also
be taken into account in devices that use networks to communicate.
(a) A single 16-core Swallow board. (b) A 1x8 board stack.
Figure 5.3: Photos of the Swallow platform.
The key components that make up the Swallow platform are as follows:
XMOS XS1-L2 dual-core, 16-thread chip.
Eight L2 processors assembled onto a single board, giving 16 cores per board.
External link interfaces to allow multiple boards to be assembled both horizontally and
vertically.
Power measurement shunts designed into the board’s various power supplies, with pin-out to
allow measurement equipment to be coupled to the boards easily.
Some I/O exposed to allow external interaction when the X-link network cannot be used.
Support for peripheral boards that feature additional XS1 processors and provide peripherals
such as additional DRAM and Ethernet connectivity.
JTAG, flash and Ethernet based booting of cores, as well as JTAG debugging.
These will each be examined in more detail in the remainder of this section.
5. A multi-threaded, multi-core embedded system
[Figure 5.4 diagram: two XS1-L dies in the XS1-L2 package, each containing a core and a switch; the switches are joined by 4 Gbps switch links and 500 Mbps internal X-links, with un-bonded X-links and 125 Mbps external X-links brought out of the package.]
Figure 5.4: The XS1-L2 package and its relationship to the two L-series cores and switches con-
tained within it. There are a total of 16 X-links, with four connected to pins on the
package, two from each switch.
XS1-L2 processor
At the time of design, the XS1-L2 processor had the largest core count of any XMOS L-series
processor. The G-series features 4 cores, but uses an older process technology, has a more restrictive
network topology requirement, and has no Dynamic Frequency Scaling (DFS) capabilities. As such,
the XS1-L2 was the best choice for achieving maximum core density, whilst enabling exploration
of network utilisation and energy efficiency in current and future work.
The XS1-L2 is two L-series die assembled in a single package, a representation of which is
depicted in Figure 5.4. Each die contains an XS1-L core and switch. The switch provides eight
X-links, four of which are bonded to the switch of the neighbouring XS1-L within the package.
Each switch also has two of the remaining X-Links bonded out onto external pins. A number of
I/O ports are also bonded out. A single clock source and a set of power supplies are shared by both
cores via the package pin-out. The exact pin-out is described in the XS1-L2 datasheet [XMO12].
The pin-out presents some limitations with respect to efficiently laying out the chips and as-
sembling multiple chips into a network. Due to each package containing two switches and the
connectivity of those switches in relation to the package, assembling a mesh network using a
north-south/east-west connection method would result in a sub-optimal maximum number of hops
for any given assembly of such chips. It was therefore necessary to connect north-west/south-east.
This is described in more detail in the Swallow technical report [HK15]. The resultant network
and routing strategy is described in 5.2.2 of this thesis.
Eight chip Swallow board
Each Swallow board contains eight dual-core XS1-L2 processors. Vertical pairs of chips are powered
from separate voltage controllers for Vcore, with a global Vio for all chips and other I/O components
such as Light Emitting Diodes (LEDs).
All four external links of each XS1-L2 chip are used and are either connected to a neighbouring
chip, or an external connector for off-board communication. The network topology is described in
more detail in 5.2.2.
External interfaces
Ribbon connectors provide off-board transit for X-Link, I/O and JTAG signalling. The connectors
are not homogeneous in pin-out, restricting how boards can be connected. Swallow boards must
be aligned when connected, such that the top-left connector of one interfaces to the top-right
of another, and so on. Further, peripheral boards are only compatible with north and south
connectors, not any of the east or west connectors.
Power measurement
Shunt resistors are included on the 5 V, 3.30 V and each of the 1 V DC-DC power supplies, so that
the current supplied by them can be monitored when appropriate hardware is attached, such as
an INA219 measurement chip [Tex11]. The 5 V and 3.30 V supplies have a pin-out that allows a
measurement board to be attached above them.
I/O
Due to the prolific X-link usage and chip density per board, there are few spare I/O ports, and
fewer still that can be routed to a connector. However, some restricted I/O capabilities remain:
Six 1-bit and two 4-bit ports from core 0 on the top-left chip, wired to a 2x8 header at the
corner of the board. This can be used for GPIO.
Three 1-bit ports on core 6, connected to a 64 Megabit (Mb) SPI flash chip on the Swallow
board. This can be used for persistent data storage.
Four 1-bit ports connected to core 10. These are intended for an energy measurement
board, providing either sufficient I/O for the I2C interfaces to the measurement chips, or
a simple interface to the device controlling the measurement board, in order to provide
triggering and synchronisation of measurements.
Additional I/O is possible if external X-Links are re-purposed; the package-accessible X-Links on
the XS1-L2 are multiplexed with various I/O ports [XMO12]. However, a more flexible approach is
to connect an additional XS1 device over an X-Link that is designed to serve as an I/O controller.
Peripheral boards
The Swallow grid can be supported by peripheral devices consisting of one or more additional XS1
chips and additional I/O components. Currently, one such peripheral board exists.
The peripheral board features a single-core XS1-L1, which controls 32 Megabytes (MBs) of DRAM
and has a connector for interfacing with Slicekit peripherals. XMOS Slicekits are a system of
modular boards and peripherals that can be connected in various ways [XMO15]. In the case of
the Swallow peripheral board, the connector is intended to be used with an XMOS Ethernet slice,
which is a Slicekit compatible network adapter with PHY chip and RJ-45 connection.
The XS1-L1 on the board acts as an interface to the DRAM and Ethernet and can communicate
with the grid using channel communication. As such it can serve as a network bridge to allow data
to flow into and out of the grid over Ethernet and also as a volatile memory store. This allows the
grid to access significantly more memory than the 64 Kilobytes (KBs) SRAM of each of the XS1-L
cores. It is also possible to load program images onto the grid over Ethernet via TFTP, which is
significantly faster than JTAG, particularly for large numbers of cores, where the JTAG chain size
increases load times quadratically.
JTAG
JTAG provides a method of programming and debugging devices by forming a chain of Test Access
Ports (TAPs) that can be read and written serially [Rob94]. The performance of JTAG is limited
by the length of the chain and the maximum clock rate at which the slowest TAP can operate.
A single XS1-L device contains four TAPs, two for the core and two for the switch [May+08,
p. 31]. On a Swallow board there are 16 XS1-L devices, forming a chain of 64 TAPs on a single
board. The chain is formed along the chips in a clock-wise fashion, entering at the leftmost chip on
the second row. Figure 5.5 gives a graphical representation of how the chain is formed, including
optional connectivity to other boards and the debug device. Control of external JTAG connections
is made via a 4-bit rotary switch, which drives the select inputs on a set of multiplexers.
When multiple boards are connected, the chain extends horizontally between boards, with ver-
tical chaining along the leftmost set of boards. A script was written as part of this work in order
to generate both the network configuration and correct JTAG chain ordering for arbitrary board
arrangements [Ker14].
[Figure 5.5 diagram: eight XS1-L2 packages, each with two cores, chained for JTAG, with an XTAG2 debugger attached at the head of the chain.]
Figure 5.5: JTAG chain of a single Swallow board, showing switching points to form chains with
additional boards.
Any JTAG read or write operation must be shifted through the chain by a clock, with sufficient
clock cycles provided to allow any response data from TAPs to also be shifted along the chain. The
approximate time, tmsg, to send an m-bit message along a JTAG chain is therefore constrained by
its clock frequency, F, and chain length, c, described in Eq. (5.4). The achievable throughput of
multiple messages is dependent on the response time of the TAPs and the size of the response that
they give, assuming that the response needs to be interpreted before sending another message.
t_msg ≈ (m × c) / F    (5.4)
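As an illustrative sketch (not part of the thesis itself), Eq. (5.4) can be evaluated directly. The TCK frequency used below is hypothetical:

```python
def jtag_msg_time_us(m_bits, chain_len, f_hz):
    """Approximate time to shift an m-bit message through a JTAG chain
    of `chain_len` TAPs clocked at f_hz (Eq. 5.4), in microseconds."""
    return (m_bits * chain_len) / f_hz * 1e6

# A 64-TAP chain (one Swallow board) at a hypothetical 10 MHz TCK:
t = jtag_msg_time_us(32, 64, 10e6)
```

Because the chain length grows with the number of boards, per-message times like this compound as a grid is extended, which motivates the faster Ethernet-based booting described in 5.2.3.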
5.2.2. Network implementation
The Swallow network forms a two-layer unwoven lattice structure. There are three dimensions to
the network: horizontal and vertical, with respect to a board and its neighbours, and layer, with
respect to the cores within a single chip. Figure 5.6 visualises the connectivity formed by this
topology.
The connectivity of the chips is such that each core only has the freedom to communicate in two
of the three available dimensions, one of which is always the layer dimension. This forms two layers,
one in which horizontal communication is possible, and one in which vertical communication takes
place. The first core in each chip is connected to the vertical layer, whilst the second is connected
to the horizontal layer.
XS1 communication principles
When multi-core channel communication is performed in XS1, X-links are used to transmit and
receive data. The links use a credit-based flow control mechanism [May+08, pp. 12–13] to block
transmission if upstream buffers are full. Messages are transmitted on the wire as one-byte tokens,
although the ISA allows transmission of single tokens via outt and int, or of four-byte words with out
and in. A message begins with a three token destination address header, which is automatically
prepended to the first token emitted from a channel end by an out or outt instruction.
If the destination address is non-local, then the local switch begins to receive the message over an
internal link to the processor core. The most significant 16 bits of the destination address are then
[Figure 5.6 diagram: an eight-package Swallow board with cores 0 to 15; internal (layer) X-links join the two cores in each package, vertical X-links connect the vertical layer and horizontal X-links the horizontal layer, with links to further boards and an optional XScope link.]
Figure 5.6: Swallow network topology, with on-chip links providing layer transitions, and each core
having either vertical or horizontal external connectivity, similar to an unwoven lattice
on a pie. The second horizontal X-link on the left can connect to another board, or
optionally an XMOS XTAG2 for faster debug output than JTAG.
used to determine the X-link over which the message should be forwarded. A lookup is performed
against the position of the most significant destination bit that is different to the local node ID,
where the position number determines which of the device’s X-links will then be used. Switches
receiving a message over an X-link perform the same activity. If the top 16 bits match the local
node ID, then the remaining header bits (the channel ID) are forwarded to the local core, along
with the rest of the message.
Once a message has begun and the header sent, any X-links along the route are reserved in
the direction of communication. A closing control token must be sent along the route in order to
free these links for use by other messages. Either a PAUSE or an END token can be sent to achieve
this, where the former is discarded by the destination switch and the latter propagated to the
destination channel end.
This wormhole routing strategy allows both packeted messages and dedicated communication
channels to be used. In the latter case, no PAUSE or END tokens are ever issued between a pair of
channel ends, leaving the route between them permanently allocated to them. Contention can be
resolved by using multiple links between nodes (such as the four links between the two cores in the
XS1-L2 chip) and by sending information in packets, where the overhead of sending the three-byte
header should be considered, particularly for small packets.
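A rough sketch of that header overhead (illustrative, not from the thesis): each packet carries a three-token header and is terminated by one closing control token, so the payload fraction on the wire is:

```python
def packet_efficiency(payload_tokens):
    """Fraction of wire tokens carrying payload in one XS1 packet,
    assuming a 3-token header plus one closing END/PAUSE control token."""
    return payload_tokens / (payload_tokens + 3 + 1)
```

A single four-byte word sent as its own packet therefore spends only half its wire tokens on data, whereas longer packets amortise the fixed framing cost.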
Routing table:
    Bit:             15  14  · · ·  0
    Use link:         0   0  · · ·  1
Example:
    Node ID:          0   1  · · ·  1
    Destination ID:   0   0  · · ·  1
    Forwarding link:  0
Table 5.1: XS1-L routing table example, where the second most significant bit of an incoming
header is different to the local node ID. A lookup against the routing table indicates
that link 0 will be used to forward the message.
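The lookup described above can be sketched as follows. This is illustrative only; `link_table`, mapping a bit position to a link number, is a hypothetical stand-in for the switch's real routing registers:

```python
def forwarding_link(node_id, dest_id, link_table):
    """Pick the outgoing X-link for a message: find the most significant
    bit where the destination differs from the local node ID, then look
    that bit position up in the node's routing table."""
    diff = node_id ^ dest_id
    if diff == 0:
        return None  # top bits match the local node: deliver locally
    msb = diff.bit_length() - 1  # position of the most significant differing bit
    return link_table[msb]

# Table 5.1's example: the IDs differ at bit 14, which maps to link 0.
link = forwarding_link(0b0100_0000_0000_0001, 0b0000_0000_0000_0001, {14: 0})
```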
Dimension-order routing over layers
Dimension-order, or e-cube routing allows deadlock-free routing over N-dimensional network struc-
tures [SB77]. Communication traverses the network's dimensions in a pre-determined order. For
example, in a 2D grid, a valid dimension-order routing strategy is to always move to the correct
horizontal position on the grid first, then to the correct vertical position, at which point the des-
tination has been reached. In this routing strategy, the request and response may take different
paths if their locations differ in more than one dimension. This avoids deadlock by preventing
cycles between groups of communicating nodes.
In the case of Swallow, nodes do not have connectivity to all dimensions. A best-effort is achieved
by applying 2D dimension-order routing, where vertical positions are resolved first, followed by
horizontal. When resolving the vertical position, if a node is on the horizontal plane, the message
will first pass to the vertical layer. For the horizontal stage, the message will reach the horizontally
routed node closest to the destination, then pass along to the node in the vertical layer if necessary.
The result is that at most two layer transitions may be required (the first hop and/or the last hop),
but vertical and horizontal traversals happen in dimension-order.
The routing strategy is governed by the switch configuration as defined earlier in 5.2.2. A
configuration for any m×n arrangement of Swallow boards can be generated by a tool developed
during the course of this thesis [Ker14].
Network speed and width
On-chip, each core has four links to its neighbour, with a maximum data rate for a link of 500 Mbps,
giving an on-chip bisection bandwidth [HP06] of 2 Gbps. Between chips, there are single links ver-
tically and horizontally. This extends to other boards. The link speed is also 500 Mbps maximum,
but due to wire lengths it is typically a quarter of this in order to provide stability, although this
can be tuned. The bisection bandwidth of a single board, with 125 Mbps links, is 250 Mbps.
Bisecting a grid of boards horizontally (there are half as many vertical links as horizontal links
per board), the bandwidth, b, is related to the number of boards horizontally, w, and link speed, l,
in Eq. (5.5).
2wl = b bps    (5.5)
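Eq. (5.5) can be checked with a one-line sketch (illustrative, using the 125 Mbps inter-board link speed quoted above):

```python
def bisection_bandwidth_mbps(boards_wide, link_mbps=125):
    """Horizontal bisection bandwidth of a grid of Swallow boards:
    two vertical links per board cross the cut, so b = 2 * w * l."""
    return 2 * boards_wide * link_mbps
```

For a single board this gives the 250 Mbps figure quoted above; a row of four boards bisects at 1 Gbps.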
Link speeds are configurable either at compile time through an XN file, or dynamically through
configuration commands to the relevant switches. The five-wire links used in Swallow transmit two
bits per symbol, and therefore four symbols per token. The transmit time of a token is 3Ts + Tt,
where Ts is the inter-symbol delay and Tt an inter-token delay. These delays are relative to the
switch clock, which is typically either 400 MHz or 500 MHz and is usually the same as the core
clock. The minimum delay parameters are Ts = 2, Tt = 1. Both delays are 11-bit values, allowing
for link speeds significantly lower than the switch clock frequency.
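A small sketch of that token timing (illustrative; the cycle accounting follows the 3Ts + Tt expression above):

```python
def token_time_ns(ts, tt, f_switch_mhz):
    """Wire time for one token: four 2-bit symbols incur three
    inter-symbol delays (3*Ts) plus one inter-token delay (Tt),
    all measured in switch-clock cycles."""
    cycles = 3 * ts + tt
    return cycles * 1000.0 / f_switch_mhz

# Minimum delays at a 500 MHz switch clock: 7 cycles = 14 ns per token.
t = token_time_ns(2, 1, 500)
```

Raising Ts or Tt slows the link in proportion to the switch clock, which is how the lower inter-board speeds mentioned above are configured.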
JTAG. Load time: Slow. Size: Limited. Debug: Reasonable.
  Notes: Partial network (straight line); core numbering in the debugger can be unintuitive; no more than 128 cores.
JTAGv2. Load time: Slow. Size: Limited. Debug: Good.
  Notes: Full network, logical core numbering; still limited to 128-core debug; requires v13.0 of the XMOS tools.
Etherboot [Ker12b]. Load time: Fast. Size: Unlimited. Debug: Poor.
  Notes: No debug symbols (assembly only); 128-core debug limit; uses the X-Link network to boot 2 orders of magnitude faster than JTAG.
Etherboot + JTAG. Load time: Medium. Size: Limited. Debug: Reasonable.
  Notes: Full network, logical core numbering and debug; code loaded over Ethernet, then debugging via JTAG.
Video of boot process: https://www.youtube.com/watch?v=kUo11tTeYK0
Table 5.2: Swallow boot methods developed during the course of this research, each of which has
trade-offs to consider. The majority of work contributing to this thesis is done via the
JTAGv2 method.
5.2.3. Compiling & loading software for Swallow
Several techniques have been developed for loading software onto Swallow over the course of this
research, some of which have become possible due to improvements to the vendor’s toolchain,
whilst others have required more customized implementations.
The JTAGv2 boot method (Table 5.2) is used for the profiling and testing performed in Chap-
ter 9, as this provides an appropriate level of debug capabilities on a full network implementation
which can also be simulated.
5.2.4. Summary of Swallow
The statements of this thesis, made in 1.1, demand a multi-core system of embedded multi-
threaded processors in order to perform the desired research. This section has described the
Swallow platform, a system which serves this purpose.
The Swallow platform introduces hardware profiling and software energy modelling challenges
beyond those of a single multi-threaded core, for several reasons:
A significant amount of effort was required to construct, configure and program for this
system.
Multiple cores and power supplies must now be considered.
Communication of data over a credit-based, cut-through routed network can be observed.
A more general exploration of Swallow’s capabilities is presented in [HK15]. Energy profiling of
Swallow is performed in Chapter 8, and a model is proposed and evaluated in Chapter 9.
5.3. Research enabled by the XS1-L and Swallow
A collection of XS1-L processors assembled into a lattice of embedded compute nodes creates a rich
set of features that enable the novel work of this thesis to take place. These features begin in the
core with the paradigms established in the ISA and reach as far as the network-level communication
that departs from the more conventional shared memory approaches. Most importantly, these
processors are embedded devices, not high-end application specific components, or large general
purpose CPUs.
The work presented in this chapter forms the working knowledge necessary to conduct research
along the themes defined in 1.1. Tools for booting, running and debugging Swallow were con-
tributed to the Swallow project during the course of this research. This hardware/software platform
can be used both for this and future research activities. The key contributions from this chapter
will now be summarised, in relation to those research themes.
Use a multi-threaded embedded real time processor. Prior work, discussed in Chapter 3, fo-
cuses on single-threaded devices, and although recent research includes new parallel architectures,
the selection of the XS1-L allows a number of unique properties to be explored in this space.
In particular, the XS1-L ISA puts the software very close to the hardware, which may aid the
modelling of energy for software running on the device.
Extend the system into a multi-core network of processors. In response to the limits of Dennard
scaling, discussed in Chapter 4, parallelism is a necessary dimension into which both hardware and
software must expand. The Swallow platform allows this to be studied in the embedded real time
system space, where other platforms cater to different compute tasks.
Utilise novel or rarely used paradigms compared to previous work. The XS1-L and Swallow
have several characteristics worthy of exploring in relation to the goals sought by this thesis. In
particular, hardware-managed threads, time-deterministic execution, dedicated I/O instructions
with no memory mapping, and a channel based communication abstraction provide a compelling
list of features upon which to conduct novel research.
Provide a means of profiling hardware and evaluating modelling on multiple levels. The pre-
sented processor and Swallow system can be profiled as a single core or multiple cores. In the
multi-core scenario, the core power supplies and I/O supplies can be measured, in order to estab-
lish different facets of energy consumption, enriching the energy models proposed in the remainder
of this thesis.
Enable the cost of communication to be assessed in a message passing, rather than shared
memory, architecture. The XS1 architecture provides message passing at the lowest implementation
levels. This is propagated up to the software level through the XC programming language.
Swallow allows many of these message passing cores to be utilised by parallel programs. Thus, these
characteristics can be profiled and modelled to seek useful predictions of the energy consumption
of such programs.
This chapter has reviewed the XS1-L processor and the multi-core XS1-L based Swallow system,
highlighting the architecture and system level properties that are important to implementing event-
driven and multi-threaded software. An adequate understanding of the principles underpinning
the XS1-L and Swallow allows the research questions posed in this thesis to be studied further.
This concludes Part I of this thesis. It has provided the background research and knowledge
pertinent to the exploration of the new research questions posed in Chapter 1. Part II details the
efforts to answer those questions in earnest.
Part II.
Constructing a multi-threaded,
multi-core energy model
Introduction
In Part I, three relevant areas of prior research were discussed, in addition to details of the hardware
devices and platforms selected for use in this thesis. Part II presents the main contributions of this
thesis, forming answers to the research questions posed and thesis statements made in Chapter 1.
The background material from Part I will be referred to where relevant, and the relationship
between the state of the art and this thesis’s research contributions will be explored in more
technical detail. This part is structured to cover three main areas:
1. energy modelling in relation to one multi-threaded core;
2. modelling a network of such cores at a system level, and;
3. analysis of how these contributions relate to other contemporary architectures.
A concluding chapter evaluates these contributions.
Chapter 6 proposes methods for exploring and capturing energy consumption data at the ISA
level for an XS1 processor. It includes analysis of the significant parameters that need to be con-
sidered when constructing a model of this processor. These discoveries are then used in Chapter 7
to form a model. This model is then extended through regression techniques and subsequently
tested in multiple contexts, including full tracing and statistics based simulation.
Energy profiling of the Swallow platform is presented in Chapter 8 to obtain multi-core communication costs. This new data, combined with the core-level profiling and energy model, is used to
build a flexible graph oriented system level model, which is presented in Chapter 9.
Chapter 10 examines architectures other than the XS1, identifying the opportunities to apply
the contributions of this thesis to other platforms, as well as highlighting where further work is
required to achieve this.
This thesis is concluded with Chapter 11. It contains a summary evaluation of the complete
work and draws the final conclusions from the contributions made. Further work is proposed based
on the new possibilities created by this thesis, with a view to both improving upon this work and
using it for new research.
6. Model design and profiling of an XS1-L
multi-threaded core
This chapter details the first step in producing an energy model for a system of XS1-L processors:
profiling one multi-threaded core. The goal of the profiling is to collect sufficient empirical data to
provide a robust base for the model, expose processor characteristics in need of further investigation
and allow extrapolation of more complex model parameters through regression and other methods.
6.1. Strategy
The strategy for constructing the model is comprised of several parts, with a significant amount
of automation included to maximise data collection and opportunities for refinement:
Establish a modelling approach.
Create a test-bench to acquire data compatible with the selected modelling approach.
Run the test-bench.
Inspect the test-bench data, refining the tests and the framework as necessary.
Construct the model using collected data.
Verify the model against simple tests and more complex benchmarks, to determine its accu-
racy.
Continue to refine the model, both through changes to the model structure and through new
tests that provide additional data.
The process flow accommodating these parts is depicted in Figure 6.1. The remaining sections in
this chapter describe the profiling method with consideration towards the design of the model, dis-
coveries made during profiling and refinements made as a result. The subsequent chapter explores
the model itself.
6.2. Profiling device behaviour
Profiling at the ISA level brings certain benefits and also disadvantages. For example, it does
not require gate-level simulation of the device. However, without information on the exact im-
plementation of the micro-architecture, some behavioural details become a black box, where the
behaviours can potentially be exposed at the ISA level by suitable sequences of stimuli, but the
explanation or a full understanding of these behaviours may not be possible at this level.
The profiling performed in this work seeks to expose sufficient information about the processor’s
energy characteristics so that an energy model constructed against this data yields an acceptable
accuracy.
The relatively small instruction set and deterministic execution of the XS1-L processor bears
similarity to the parameters used in the ISA energy model proposed by Tiwari, as described in
3.2.2. Therefore, this approach is taken as a starting point and extended to account for the new
parameters necessary to capture multi-threading and other new behaviours in the XS1-L. More
model considerations are given in 6.3.
Such a model requires a base cost for each instruction, as well as inter-instruction overheads, plus any other effects not directly expressed by the stream of instructions that are executed.
To construct a model in such a style, power measurements must be taken for individual instruc-
tions as well as measurements for pairs of instructions, so that the instruction base cost and
[Figure 6.1 flow diagram: test patterns & constraints feed test generation; a profiling test bench produces test data from which the model is generated; a verification test bench runs verification tests & benchmarks to assess model accuracy, feeding back into profiling refinement (new or modified tests, new test bench features) and model refinement (new model parameters and features).]
Figure 6.1: The process used to profile the XS1-L then produce and verify an energy model in-
cluding refinement. Dashed lines denote manual effort and solid lines automatic. The
process starts with the definition of test patterns and constraints, becoming a cyclical
activity thereafter.
inter-instruction overhead can be considered. This data can then be used to estimate a program’s
energy, based on the sequence of instructions that it will run.
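A minimal sketch of this style of accounting, before any of the XS1-L extensions discussed below, might look as follows. The cost tables here are hypothetical placeholders, not measured values:

```python
def tiwari_energy(trace, base, overhead):
    """Tiwari-style estimate: sum each instruction's base cost, then add
    the measured inter-instruction overhead for every adjacent pair in
    the executed trace (missing pairs default to zero overhead)."""
    e = sum(base[op] for op in trace)
    e += sum(overhead.get(pair, 0.0) for pair in zip(trace, trace[1:]))
    return e

# Hypothetical per-instruction costs (e.g. nJ), purely for illustration:
base = {"add": 1.0, "mul": 2.0}
overhead = {("add", "mul"): 0.5, ("mul", "add"): 0.5}
energy = tiwari_energy(["add", "mul", "add"], base, overhead)
```

The remainder of this chapter is concerned with obtaining such base and pair costs by measurement, and with the additional terms a multi-threaded pipeline demands.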
However, the XS1-L has features that cannot be accounted for in the Tiwari style model, therefore
it must be extended. In doing so, the terminology must also be carefully selected and explained,
so as to relate the prior models and the model that will be constructed in Chapter 7.
Base costs: processor vs. instruction
In the Tiwari model, the base cost is a base instruction cost. That is, each instruction has a contri-
bution to power dissipation, before considering inter-instruction effects, but there is no separation
between the instruction cost and any other always-present power dissipation in the processor. De-
coupling instruction cost from the underlying processor cost gives a base processor cost that is
unaffected by the instruction being executed.
In a sequential, single threaded microprocessor, a no-op instruction could represent the energy
consumed when the processor is idle and thus be profiled to determine the base processor cost. The
costs of executing meaningful instructions and the interactions between them can then be built on
top of this base processor cost.
Taking into account the XS1-L's event driven architecture, and therefore idle times in which no
instructions execute, the base processor cost can be defined as the minimum energy consumed
when there are no active threads. Investigation into and establishment of a base processor cost for
the XS1-L is detailed in full in 6.4.
Instruction and inter-instruction costs
Once a base cost is established, the next challenge is how to handle instructions and inter-
instruction overheads in the context of the XS1-L. To determine these in a similar fashion to
the Tiwari model, the cost of executing each instruction and of transitioning between pairs of
instructions must be determined.
Hardware measurements are required in order to establish the magnitude and variability of inter-
instruction overheads, so that an appropriate granularity can be chosen for the model, delivering
an acceptable performance/accuracy trade-off. For example, if the contribution of inter-instruction
overhead is insignificant in comparison to the cost of each individual instruction, then it may not
be necessary to consider it, or it may be appropriate to generalise it if there is little variation in
overhead between instructions.
To produce a model appropriate for the XS1-L’s multi-threaded architecture, the processor
must be seen as a pipeline that is executing a stream of unrelated instructions from neighbouring
threads. Although dependencies and synchronisation may exist between some threads at a higher
level of abstraction, on a per-instruction level, a pair of instructions travelling together through
the pipeline are effectively unrelated in any real-world embedded application.
This precludes using a sequence of instructions in a thread as a means of exercising the proces-
sor in order to determine instruction costs and inter-instruction overheads. Instead, instruction
overheads must be measured by controlling the instructions that a collection of threads are run-
ning, such that the exact sequence of instructions passing through the processor pipeline is known.
Section 6.4.3 describes how the measurement framework achieves these guarantees in order to extract
inter-instruction overheads. Pairs of instructions remain sufficient for determining overheads, de-
spite the device’s four stage pipeline, due to the deterministic scheduling and in-order progression
of instructions through the pipeline.
Thread cost
In addition to instruction costs, the parallelism present in the XS1-L, in the form of its hardware
thread schedule, must be considered. It must be determined whether the number of active threads,
and therefore the amount of parallelism present in the system at any given time, has a measurable
impact on power that should be accounted for in the model.
6.3. Model design considerations
In addition to the practicalities of collecting power data for the XS1-L, the design and use case for
the energy model is also considered. The primary goal of the energy model is to allow a simulation
of a piece of software to produce an estimate of the energy consumed by that software. It can
provide novelty through its exposure of previously un-modelled characteristics (for example, due
to the unique design of the target processor) and by providing simulation performance that is
better than lower level hardware models such as Register Transfer Logic (RTL) based approaches.
6.3.1. Simulation performance
For a software energy model to be useful, it must be more convenient to run it than to instrument
and measure a hardware system. The modelling approach used in this thesis requires the use
of an Instruction Set Simulator (ISS). On a 2.26 GHz Intel Core i3 CPU, a full instruction trace
simulation using the standard XMOS tool, xsim [JGL09], takes 51 minutes for a 0.4 second real-
time benchmark. A simulation producing only execution statistics, using the faster axe [Osb11;
Ker12a] tool, takes 40 seconds. The accuracy of an axe simulation is the same whether a full trace
or only statistics are produced, so the reduced information present in statistics is the only risk to
model accuracy if axe is the chosen simulator. However, xsim is more accurate overall. Work on
improving the accuracy of axe to bring it in line with xsim is discussed later, in Section 9.2.
Thus, there is motivation to construct a model that can rely on instruction statistics rather than
complete trace data. However, statistics alone make it impossible to account for inter-instruction
overheads at a per-instruction level, because the exact sequences of executed instructions are not
recorded. The impact of forgoing this must be considered during data collection.
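To make this trade-off concrete, a minimal sketch of the two estimation styles is given below. The function names and all cost figures are illustrative placeholders, not values from the model; the point is only that the statistics-based estimate cannot see instruction pairings, whereas the trace-based one can charge each transition individually.

```python
# Statistics-based vs trace-based energy estimation. With statistics only,
# per-opcode counts are multiplied by per-opcode costs, so pairwise
# inter-instruction overheads cannot be recovered. All figures illustrative.

def energy_from_stats(instr_counts, instr_costs, default_cost):
    """Estimate energy (joules) from opcode -> execution count statistics."""
    return sum(count * instr_costs.get(op, default_cost)
               for op, count in instr_counts.items())

def energy_from_trace(trace, pair_costs, default_cost):
    """Estimate energy from a full trace, charging each executed pair."""
    return sum(pair_costs.get(pair, default_cost)
               for pair in zip(trace, trace[1:]))

stats = {"add": 3, "mul": 1}
costs = {"add": 1.0e-10, "mul": 1.6e-10}
e_stats = energy_from_stats(stats, costs, 1.2e-10)
```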
6.3.2. Architecture comparison
Table 6.1 illustrates the key differences between the target processor for this research and a sample
of other processors used in previous work, as detailed in Section 3.2.2. The significant differences in
pipeline implementation, threading methods, communication model and memory hierarchy serve to
justify the goal of this work in creating the foundations of a model for the XS1-L. Chapter 10 com-
pares a wider range of processor types to the XS1-L in more detail and discusses the applicability
of the modelling approaches used in this work to those processors.
6. Model design and profiling of an XS1-L multi-threaded core
Feature        XS1-L                     ARM7TDMI [LEM01]   C641T [IRF08]   Xeon Phi [SB13]
Cores          1                         1                  1               60+
Threads        8                         1                  1               4 per core
Instr. sched.  Round-robin threads,      In-order           2x4 VLIW        In-order
               in-order
Forwarding     No                        Yes                No              Yes
Com. model     Channels                  Shared memory      Shared memory   Shared memory
Mem. / cache   1-cycle SRAM, no cache    Optional caches    L2              L2 + tag-cache
                                                                            on ring network
Table 6.1: Comparison of key differences between various architectures.
The XS1-L has a unique multi-threading method compared to the other modelled processors. Further,
its single-cycle SRAM removes the need for a cache model. However, the channel communication
implementation and underlying interconnect demand new profiling and modelling techniques.
These are examined in detail with respect to multi-core modelling in Chapters 8 and 9.
6.4. XMProfile: A framework for profiling the XS1-L
XMProfile is the hardware-measurement framework constructed for this work to gather energy
consumption data for the XS1-L. It is built with consideration to the following aims:
1. To execute code with a level of granularity that delivers certainty as to the trace of instructions
through the pipeline.
2. To provide a measurement interface in order to easily collect energy data and attribute it to
test cases.
3. To perform constrained generation of tests for automation of the profiling process.
4. To support the inclusion of benchmark code to enable comparisons between the resulting
model and the actual energy characteristics of the target hardware.
As such, XMProfile is both a test-generation framework and an energy measurement tool. The two
can be used together or separately, although the generated test kernels are very tightly integrated
into the structure of the measurement framework.
6.4.1. Hardware
The hardware platform for the energy profiling effort consists of two XS1-L devices, an XK-1
development board containing the master processor and a bespoke XMOS board containing an
additional XS1-L — the slave processor or the Device Under Test (DUT). The bespoke board was
modified to provide easy access to the core power supply of its XS1-L. The XK-1 development
board controls a DC-DC power supply and an INA219 power measurement chip [Tex11], allowing
the power dissipation of the supply to be sampled at a rate of up to 8 ksamples per second, with
up to 11-bit resolution and a least-significant sample bit of 680 µW at the expected maximum
current of the XS1-L.
In addition to controlling and monitoring the power supply of the DUT, the master processor
is also responsible for synchronising tests against energy measurements in order to automate the
collection of model data.
6.4.2. Software
The collection of software in XMProfile can be broken down into four key components:
1. Power measurement and data streaming to host PC.
Figure 6.2: XMProfile test harness hardware and software structure.
2. Test loading and synchronisation with power measurements.
3. Test case generation with constrained random data for all instruction permutations in a given
instruction subset.
4. Test control software, delivering fine-grained management of instruction flow during test
kernel execution.
The master processor runs software that samples power values from the INA219. These samples
are then streamed out over a USB interface to a host PC. At the end of each test run, an average
power figure is calculated. This combination of streamed data and test run averages provides
sufficient data to feed into an energy consumption model.
Tests are synchronised with power measurements by using the XMOS XS1’s communication
architecture. The master processor and DUT form a network over a 2-wire X-Link. The link is
used as a trigger to signal the start of the next test, and to halt it once the test period is over.
6.4.3. Controlling the pipeline
Establishing the instruction costs and overheads through the pipeline requires the ability to control
the order of instructions progressing through it. When subsequent instructions come from different
threads, this is difficult to guarantee at a high level, such as with compiled C code. However, the
XS1-L’s single-cycle thread synchronisation allows the test harness to have precise control over
which instructions in a thread are executing at any one time, provided the tests do not introduce
any non-determinism with respect to execution time, such as through I/O operations. A typical
test flow is depicted in Figure 6.3.
A test thread is a loop containing the body of instructions to be profiled, with minimal prologue
and epilogue, but sufficient to ensure synchronisation and allow correct termination. Four threads,
T0 to T3, are required to fill the pipeline and create instruction interactions on every clock cycle.
To observe inter-instruction effects, the bodies of odd-numbered threads are populated with one
instruction, Iodd, whilst those of even-numbered threads are populated with another, Ieven. As the threads
execute round-robin, the instruction executed at a given pipeline stage will alternate between Iodd
and Ieven, allowing specific inter-instruction effects to be measured.
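The resulting alternation can be illustrated with a toy model of the issue order; the opcode names are arbitrary and the function is a description of the scheme above, not of the hardware.

```python
def pipeline_stream(i_even, i_odd, n_threads=4, n_cycles=8):
    """Opcode entering the pipeline on each clock cycle when threads issue
    round-robin and thread n's body holds i_even for even n, i_odd for odd n."""
    return [i_even if (cycle % n_threads) % 2 == 0 else i_odd
            for cycle in range(n_cycles)]

# With four threads the stream alternates strictly, so every adjacent
# pair of instructions exercises the i_even <-> i_odd transition.
stream = pipeline_stream("add", "mul")
```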
All threads are synchronised against the thread that creates them, known as the master thread.
In this case, T0 is the master and T1–T3 are its slaves. As such, the loop prologue and epilogue
of the master thread are slightly different to those of the slave threads. During the test, at the start
of each loop the slaves perform a synchronisation (SSYNC instruction) against the master thread
(MSYNC instruction). If the master has received the end of test signal from the test harness, then it
performs an MJOIN instead of an MSYNC. This kills the slave threads when they next synchronise.
The slave threads execute no-op instructions when the master is performing the above checks.
To minimise the overhead of the execution of loop prologues and epilogues, the loop body must
be sufficiently long. The number of body instructions, N_b, required to achieve a body-to-total
instruction ratio, R, with N_o overhead instructions, is determined through Eq. (6.1).
Figure 6.3: Test harness and DUT process flow.
N_b = N_o R / (1 − R)    (6.1)
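Applied in code, Eq. (6.1) sizes the unrolled loop body. The helper below is my sketch (the name, the exact-fraction arithmetic and the rounding-up choice are mine); with four overhead instructions, a 99 % body ratio needs 396 body instructions.

```python
import math
from fractions import Fraction

def body_length(n_overhead, ratio):
    """Minimum N_b from Eq. (6.1) so that body instructions make up at
    least `ratio` of all executed instructions, given n_overhead (N_o)
    overhead instructions per iteration."""
    r = Fraction(ratio).limit_denominator(10**6)  # exact arithmetic
    if not 0 <= r < 1:
        raise ValueError("ratio must be in [0, 1)")
    return math.ceil(n_overhead * r / (1 - r))

print(body_length(4, 0.99))  # 396
```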
Listings 6.1 and 6.2 show the minimal code padding used for a group of kernels used by threads
in an example test titled TestName. The event vector for the first thread is configured to point at
label TestNameEnd. When the test harness triggers this event, the thread will immediately jump
to this address, provided line 7 has been executed at least once. This code gives N_o = 4, a very
low overhead.
 1 TestNameT0Loop:
 2   # Retrieve synchroniser
 3   ldw r11, sp[0x3]
 4   msync res[r11]
 5   # Unrolled instructions
 6   # ...
 7   setsr 0x1
 8   bu TestNameT0Loop
 9   # ...
10 TestNameEnd:
11   mjoin res[r3]
12   # Cleanup
Listing 6.1: Example kernel of first thread on the DUT.
 1 TestNameT1Loop: # T2Loop, etc.
 2
 3
 4   ssync # Break or proceed
 5   # Unrolled instructions
 6   # ...
 7
 8   bu TestNameT1Loop
 9
10
11
12   # No cleanup
Listing 6.2: Example kernel of further slave threads.
The correctness of the thread synchronisation harness was validated in two ways. Firstly, against
the XMOS architectural simulator xsim, which provides a cycle-by-cycle trace of the harness's execution.
Secondly, the behaviour was confirmed on the hardware by putting I/O operations in the test body
of each thread and observing the associated ports on an oscilloscope, ensuring that the timing of the
signal edges was as expected.
Thread schedule. Early testing yielded an interesting discovery in relation to how threads are
scheduled into the pipeline. With a single active thread an instruction is issued once every four clock
Time-step   1 thread   2 threads   3 threads   4 threads   5 threads
    1         T0,0       T0,0        T0,0        T0,0        T0,0
    2                                T1,0        T1,0        T1,0
    3                    T1,0        T2,0        T2,0        T2,0
    4                                            T3,0        T3,0
    5         T0,1       T0,1        T0,1        T0,1        T4,0
    6                                T1,1        T1,1        T0,1
Table 6.2: Representation of instruction sequence for various active thread counts, with threads
represented as Tn,i, for thread number n and instruction number i.
cycles. When two threads are active, instructions are issued every other clock cycle. The alternative
would be to issue two instructions (one from each thread), and then have two cycles where no
instructions are issued. This is functionally equivalent, but it may have energy implications because
it affects the switching within the pipeline.
With three active threads, an instruction is issued in three of every four clock cycles. For four
or more active threads, an instruction is issued on every clock. Allocated, but inactive threads (i.e.
threads waiting on events) do not issue instructions, so have no influence on scheduling. Table 6.2
illustrates the XS1-L’s instruction and thread schedule in line with this observation.
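The observed schedule can be reproduced with a small simulator of the issue slots. The slot layout for fewer than four threads is read directly off Table 6.2, so this encodes the observation rather than the hardware's internal mechanism; all names are mine.

```python
def issue_schedule(n_threads, n_cycles):
    """(thread, instruction_index) issued on each clock cycle, or None on
    an empty slot, matching the pattern observed in Table 6.2."""
    # Issue slots within each 4-cycle window for fewer than 4 active
    # threads; 4 or more threads issue every cycle, round-robin.
    slots = {1: {0: 0}, 2: {0: 0, 2: 1}, 3: {0: 0, 1: 1, 2: 2}}
    issued = [0] * n_threads  # per-thread instruction counters
    schedule = []
    for cycle in range(n_cycles):
        if n_threads >= 4:
            thread = cycle % n_threads
        else:
            thread = slots[n_threads].get(cycle % 4)
        if thread is None:
            schedule.append(None)
        else:
            schedule.append((thread, issued[thread]))
            issued[thread] += 1
    return schedule
```

For example, `issue_schedule(2, 6)` reproduces the two-thread column of Table 6.2: an instruction on every other cycle, alternating between the two threads.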
6.5. Generating tests
Blocks of instructions are required to fill the loop bodies of test threads, the expectation being that
the majority of test time will be spent executing those body instructions, giving a power figure for
them. Initially, a number of ALU instructions were hand-coded into test loops to gain an understanding
of what to expect and to determine a good approach for automation.
Following this initial setup, the process of creating tests is largely automated. For 36 arithmetic
operations, tests are generated for every possible pairing of them.
To account for data variation, constrained random data as well as constrained random source
and destination operands are generated. This ensures that for each instruction the supplied data is
valid (i.e. cannot cause an exception condition) and that results do not overwrite source registers,
avoiding value convergence over the course of the loop body.
Constrained random data generation is used to provide different data widths to the test loops,
with bit-widths of 32 (full width), 24, 16, 8, 4, 2, 1 and 0. Bit-masking is applied at code generation
time to constrain the data range, so the test loops themselves are identical between runs at various
data widths.
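The masking step can be sketched directly. The width list comes from the text above; the generator structure itself is illustrative rather than XMProfile's actual code.

```python
import random

# Data widths profiled, from full 32-bit values down to all-zero operands.
DATA_WIDTHS = (32, 24, 16, 8, 4, 2, 1, 0)

def constrained_operand(width_bits, rng=random):
    """Random 32-bit operand with only the low `width_bits` bits free to
    be non-zero; a width of 0 always yields zero. The mask is applied at
    generation time, so the test loop itself is unchanged between widths."""
    mask = (1 << width_bits) - 1
    return rng.getrandbits(32) & mask

operands = {w: constrained_operand(w) for w in DATA_WIDTHS}
```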
Exclusions
This approach is applied to 36 of the 203 instructions in the ISA, principally covering arithmetic
operations in the CPU. This excludes branches, I/O, memory, communication and other resource
related instructions. These other instructions can affect control flow, take multiple cycles or exhibit
non-deterministic timing and so are not suitable for profiling in this way.
The divide instruction was also excluded from automated tests. The divide unit in the XS1-L
is a serial divider with early-out capability. Thus, it will take up to 32 clock cycles to complete,
potentially spanning multiple thread cycles. If the divide unit is in contention, then threads will
remain scheduled and wait until they can claim it. This affects the thread execution timing and
for this reason was avoided in the automated data collection phase.
Although it is quite possible to build test loops that utilise many of these excluded instructions,
such tests cannot necessarily be generated, nor their result data interpreted, in the same automated
way. It is necessary either to construct specific tests for these cases, with significantly more
constraints than auto-generated tests, or to produce more complex test loops comprising multiple
instructions, extrapolating instruction costs using a suitable analysis method. Further tests are
developed in Section 6.5.1 and the modelling of un-profiled instructions is explained in Section 7.4.
Generation process
The process to generate tests for each ALU instruction automatically is as follows:
1. Describe constraints on all immediate encodings (value range or set of possible values).
2. Describe characteristics of each instruction in terms of length, encoding, operand count
(source & destination), immediate type and the number of source/destination registers to
allocate.
3. For each unique pairing of instructions, generate odd and even threads for a test kernel, with
the following generated contents:
   - For each instruction in a test, generate random values to populate the source registers
     within that instruction's constraints.
   - For each instruction in a test, generate random source and destination register addresses
     within a specified range.
   - If an instruction has an immediate value, generate a random immediate within constraints.
   - Generate N_b instructions, satisfying Eq. (6.1).
4. Add test to list of tests to run.
5. Compile group of tests into framework, split into separate binaries if the processor’s program
memory limit is exceeded.
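The steps above can be sketched as a descriptor-driven generator. The InstrSpec fields mirror the characteristics listed in step 2, but all names, the register range and the pairing function are hypothetical, not XMProfile's actual interface.

```python
import itertools
import random
from dataclasses import dataclass
from typing import Optional, Sequence

@dataclass
class InstrSpec:
    """Step-2 characteristics of one instruction (fields illustrative)."""
    mnemonic: str
    n_src: int                                   # source registers
    n_dst: int                                   # destination registers
    imm_values: Optional[Sequence[int]] = None   # allowed immediates

def gen_instruction(spec, regs=range(11), rng=random):
    """Emit one assembly line with constrained random operands.
    Destination registers are drawn from those not used as sources, so
    results never overwrite source values (avoiding value convergence)."""
    srcs = rng.sample(list(regs), spec.n_src)
    dsts = rng.sample([r for r in regs if r not in srcs], spec.n_dst)
    operands = [f"r{r}" for r in dsts + srcs]
    if spec.imm_values:
        operands.append(str(rng.choice(list(spec.imm_values))))
    return spec.mnemonic + " " + ", ".join(operands)

def gen_pair_tests(specs):
    """Step 3: one (even-thread, odd-thread) body pair per unique
    unordered instruction pairing."""
    for even, odd in itertools.combinations_with_replacement(specs, 2):
        yield gen_instruction(even), gen_instruction(odd)

specs = [InstrSpec("add", 2, 1),
         InstrSpec("ashr", 1, 1, (1, 2, 4, 8, 16, 24, 32))]
tests = list(gen_pair_tests(specs))
```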
As an example of a set of constraints, take the instruction ashr, arithmetic shift right, with an
immediate shift value. In the C programming language, int Y = X >> I, with a constant shift amount
I, is functionally equivalent to the ashr instruction. This is encoded as a 32-bit instruction. It has
one input register and one output register, with an immediate value I ∈ {1,2,3,4,5,6,7,8,16,24,32};
shifts by other amounts must use the three-register form of the instruction. Due to encoding
techniques in the XS1 ISA, the immediate is encoded together with the source and destination
register addresses into 11 bits. As such it consumes fewer than four bits of the instruction,
but it is recorded here as a 4-bit wide immediate value for simplicity. The operands of the
instruction are constrained by these parameters to ensure that only valid assembly instruction
sequences are generated.
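The functional equivalence to C's signed right shift, together with the constrained immediate set, can be captured in a small reference model. Python's >> on integers is an arithmetic shift, so it stands in for the C-level behaviour here; the helper names are mine, not part of the ISA or toolchain.

```python
# Shift amounts encodable in the immediate form of ashr.
ASHR_IMMEDIATES = (1, 2, 3, 4, 5, 6, 7, 8, 16, 24, 32)

def to_signed32(x):
    """Reinterpret a 32-bit pattern as a signed integer."""
    x &= 0xFFFFFFFF
    return x - (1 << 32) if x & 0x80000000 else x

def ashr_imm(x, i):
    """Reference model of immediate-form ashr: arithmetic shift right of
    a 32-bit value, restricted to the encodable shift amounts."""
    assert i in ASHR_IMMEDIATES, "other amounts need the three-register form"
    return to_signed32(x) >> i
```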
With this process in place, energy data for the majority of arithmetic instructions can be collected
in approximately 90 minutes. Discussion and analysis of these results is presented in the following
section.
6.5.1. Custom profiling and extended tests
A number of instructions cannot be profiled using the completely automated methods described
in the previous section. However, the XMProfile framework supports hand-written test patterns
and constraints, allowing for custom profiling of instructions with more complex dependencies or
behaviours.
For example, memory operations require a section of memory to access, and this must be popu-
lated with random data in order for a profiling run to yield realistic measurements. XMProfile is
able to support this by providing a heap containing constrained random data which is re-initialised
from a shadow heap between tests. The tests themselves require additional customisation, however,
because of the fetch behaviour of the XS1-L. Repeated memory operations starve a thread's
instruction buffer, resulting in FNOPs occurring during execution (Section 5.1.1). Tests inducing
varying frequencies of FNOPs allow the cost of an FNOP to be separated from the memory operation
under test.
Instructions covered by custom profiling runs include:
- Unconditional branching.
- Divide/remainder.
- Memory loading and storing of all available widths.
- Core-local channel communication.
These custom profiling tests require additional scrutiny and parameterisation before inclusion
in the energy model. Moreover, they do not cover the remainder of the instruction set, so further
work is done to fill in these gaps. These custom and un-profiled instructions are examined in
Section 7.4, after Section 7.2 demonstrates and evaluates a model with them absent.
6.6. Profiling summary
This chapter has detailed the process of profiling the XS1-L in order to collect data for an ISA
level energy model. The model is explored in the next chapter.
The profiling is largely automated thanks to the creation of the XMProfile framework for test
generation and measurement. The profiling process allows very tight control over the processor’s
pipeline, and the test generation can be fully automated or customised as required.
This framework serves not just to provide data to be processed into a model, but also to allow
the behaviour of the processor to be examined and reasoned about. For example, the precise thread
schedule detailed in Section 6.4.3 comes from the use of this framework and not from processor
documentation.
This allows energy characteristics of the processor to be explained as well as modelled, furthering
this thesis’ goal of providing better insight into the energy consumption of embedded processors.
7. Core level XS1-L model implementation
The model presented in this chapter draws upon the research discussed in Chapter 3, extending that
work to give consideration to the behaviours distinct to the hardware multi-threaded architecture
of the XS1-L. It uses the XMProfile framework, described in the previous chapter. A significant
portion of this work is published in [KE15b].
The outcome of the work presented in this chapter is a model and workflow that can be used to
estimate the energy consumed by embedded multi-threaded programs run on the XS1-L processor.
The error of the resultant models falls as low as 2.67 % as enhancements are implemented throughout
the chapter, based on observations, improvements to the modelling technique and new features
in the modelling software.
The first stage of the modelling process focuses on the data obtained automatically via XMProfile,
creating what is termed the initial model. This is presented in Section 7.2. A model produced from more
extensive profiling, through customised XMProfile runs and regression techniques, termed the ex-
tended model, is presented in Section 7.4. In addition to the model construction and accuracy evaluation,
this chapter presents a discussion of model performance in terms of the levels at which it can be
applied, from trace-based simulation up to higher-level static analysis, in Section 7.6.
7.1. Workflow
The experimental modelling tools proposed in this thesis aim to fit within a software development
workflow. Throughout this and subsequent chapters, the tools are extended. However, they are
built upon the workflow shown in Figure 7.1.
The flow is considered in three stages: compilation, simulation and inspection. The compilation
stage is the standard compiler toolchain workflow and uses existing tools with no modification.
The simulation stage utilises an Instruction Set Simulator (ISS), in this case xsim [JGL09] or
axe [Osb11], bundled with the toolchain or available online, respectively. This is then fed into a
trace analysis tool, XMTraceM.
The XMTraceM tool is the novel contribution to the workflow, applying an energy model to the
simulated program in order to determine energy consumption and power dissipation in addition to
the execution time information that the simulator can already provide. A report is then produced,
which is considered within the final stage of the workflow, inspection. The inspection stage is an
opportunity for the developer to determine, from the energy report, whether they wish to make
further code changes and then repeat the workflow in an attempt to improve energy consumption.