Buses are shared communication media used by devices to “talk to” each other, both on-chip and
off-chip. The communication that takes place can carry both data and control structures.
In this lecture we will point out the following distinctions:
on-chip vs. off-chip buses
serial vs. parallel buses
wired vs. wireless buses
A transaction is a complete piece of communication. All transfers that take place across the bus
are divided into transactions.
There are three distinct phases of every transaction:
1. Arbitration – it is decided which device will own the bus (drive the common medium) for the
duration of the transaction, thus becoming the master. It is only applicable to multi-master buses.
2. Addressing – the master activates another listening device by broadcasting (via the bus) its
address (i.e. device address or control register address) for reception.
Addressing can be unicast (the message applies to a single device) or multicast (it applies to many).
The addressed devices are traditionally called slaves.
3. Actual data transfer.
The arbitration and addressing phases are considered overhead, as they carry no useful
information and only serve to establish a link. The actual goal of communication is to transfer the data.
There is a trade-off: increasing the size of the chunk of data transferred per arbitration and
addressing reduces the need for frequent arbitration, which reduces the overhead and improves the
throughput. However, long transfers render the bus unusable for other devices that may be
waiting for it, which increases latency.
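The trade-off can be illustrated with a few lines of arithmetic. This is only a sketch: the overhead of 2 cycles (one for arbitration, one for addressing) and the number of waiting masters are assumptions made for the example, not properties of any particular bus.

```python
def bus_efficiency(data_cycles, overhead_cycles):
    """Fraction of bus cycles that carry useful data in one transaction."""
    return data_cycles / (data_cycles + overhead_cycles)


def worst_case_wait(data_cycles, overhead_cycles, waiting_masters):
    """Cycles the last of `waiting_masters` queued devices waits for the bus,
    assuming every device ahead of it runs one full transaction."""
    return waiting_masters * (data_cycles + overhead_cycles)


OVERHEAD = 2  # assumed: 1 arbitration cycle + 1 addressing cycle
for chunk in (1, 4, 16):
    print(chunk,
          round(bus_efficiency(chunk, OVERHEAD), 2),
          worst_case_wait(chunk, OVERHEAD, waiting_masters=3))
```

Larger chunks push the efficiency towards 1, while the waiting time of the other devices grows linearly with the chunk size.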
Synchronization methods are needed to ensure that the data presented in each of the
aforementioned phases is valid (not corrupt or changing) when it is read. We can distinguish
three types of synchronization protocols across the bus:
synchronous protocol – there is a clock signal, which indicates that all the data is stable and
can be read safely.
As the vast majority of logic circuits are synchronous, the idea of extending this to buses seems
natural. All the data is read only on the CLK edge.
However, different devices are often clocked by different clock signals, which can
run at different frequencies (e.g. CPU and UART peripheral), or be completely independent of
each other (e.g. two communicating UARTs).
The question is how to implement such functionality so that blocks (master and slave)
running at different speeds can talk to each other.
semi-synchronous protocol – there is a clock signal, and requests from the master occur
synchronously to it. The response from the slave is indicated on a dedicated line (e.g. READY or WAIT),
whose value is sampled on the clock edge.
In the timing diagram above (for a read transaction), the data item at a clock edge
corresponds to the address at the previous clock edge. The slave selected in the first cycle
drives READY low to notify the master that it is not yet ready to provide the data in the
second cycle, but only in the third one, when the master can safely read the data.
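The read transaction described above can be mimicked by a small cycle-counting model. This is a hedged sketch, not a model of a specific bus: the function name, the `memory` dictionary and the `wait_states` parameter are assumptions introduced for illustration.

```python
def semi_sync_read(memory, address, wait_states):
    """Return (data, cycle) for one read; READY is sampled on every clock edge."""
    cycle = 1                      # cycle 1: the master drives the address
    remaining = wait_states
    while True:
        cycle += 1                 # next clock edge: the master samples READY
        ready = remaining == 0     # the slave holds READY low while it is busy
        if ready:
            return memory[address], cycle
        remaining -= 1


memory = {0x10: 0xAB}
print(semi_sync_read(memory, 0x10, wait_states=0))  # data in the second cycle
print(semi_sync_read(memory, 0x10, wait_states=1))  # data in the third cycle
```

With one wait state the data is read in the third cycle, as in the description of the timing diagram.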
asynchronous protocol – there are no implicit time constraints; asynchronous events are
issued by both master and slave.
[Figure: asynchronous handshake timing. Annotations: there is a valid address on the bus; the slave
acknowledges its address; the slave puts data; the master confirms reading the data; the slave has finished.]
The protocol uses two lines dedicated to asynchronous messaging: ACK and REQ.
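A REQ/ACK exchange of this kind is commonly implemented as a four-phase handshake. The sketch below records the (REQ, ACK) levels after each event of one read. The exact ordering of events and the function name are assumptions made for illustration.

```python
def async_read(memory, address):
    """Trace of a four-phase REQ/ACK handshake for one read (assumed ordering)."""
    trace = []
    req, ack = 0, 0

    req = 1                                   # master: valid address on the bus
    trace.append((req, ack, "master: valid address on the bus"))

    data = memory[address]                    # slave decodes the address
    ack = 1                                   # slave: address acknowledged, data driven
    trace.append((req, ack, "slave: address acknowledged, data on the bus"))

    req = 0                                   # master: data has been read
    trace.append((req, ack, "master: confirms reading the data"))

    ack = 0                                   # slave: finished, bus released
    trace.append((req, ack, "slave: has finished"))
    return data, trace


data, trace = async_read({4: 99}, 4)
for req, ack, event in trace:
    print(req, ack, event)
```

No clock appears anywhere in the model: each side only reacts to a change on the other side's line.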
Asynchronous buses may gain popularity in on-chip communication in the future for the following
reason. In a synchronous transfer, the clock frequency must be low enough that the slowest signal can
propagate from source to destination. A signal may even take 4-5 clock cycles to propagate across the whole
chip! There are also many signals that can reach their destination much faster, but are forced to wait for
the longest, worst-case clock delay. The asynchronous protocol does not impose such constraints.
Serial vs. parallel buses
When large bandwidth is needed, resorting to a parallel bus (where data is transmitted simultaneously
across multiple wires) is a natural solution. Parallel buses can increase throughput roughly in proportion to
the number of wires.
However, if we have many wires, the propagation conditions for each may be slightly different (the
distributed reactance can differ from one wire to another), resulting in different travel times across
the whole line. It is difficult to balance many lines so that they have equal propagation times. This effect
is called skew.
The maximal clocking frequency for parallel buses is limited by the skew.
The more lines a parallel bus has, the larger the skew, but for short distances the effect is not
so large (it is proportional to distance). So, parallel buses are the option of choice on-chip, while
off-chip communication often resorts to serial protocols.
AMBA is a feature-rich example of a parallel on-chip bus standard. It is defined by ARM and
used in µCs with ARM cores, thus being the most popular 32-bit on-chip bus. The standard defines
two buses serving different roles:
AHB – which stands for Advanced High-performance Bus – links fast peripherals, providing high
clocking frequency and large throughput. It has very complex hardware and is expensive to
implement.
APB – a slow bus, very simple in comparison and cheap to implement in hardware.
[Figure: AMBA system with three masters and a row of slaves on the AHB, and a second row of slaves
(UART, TIMER, D/A, GPIO) on the APB behind the bridge.]
The bridge functions as a slave on the AHB and is the only master of the APB. It adapts the relatively slow APB
(which can be 16x slower) to the high-speed AHB, dealing with timing, split transactions, and packing and
unpacking the bytes of an AHB word.
AHB is a fast, parallel, multi-master, pipelined bus with support for burst and split transactions.
A pipelined bus can perform each of the three aforementioned transaction phases simultaneously:
Arbitration     T1  T2  T3  T4
Addressing          T1  T2  T3
Data transfer           T1  T2
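The staggered table above can be generated mechanically, which makes the overlap explicit: in each clock cycle every phase serves a different transaction. This is a sketch under the assumption that each phase takes exactly one cycle. The function name is hypothetical.

```python
PHASES = ("Arbitration", "Addressing", "Data transfer")


def pipeline_schedule(transactions, cycles):
    """Map each phase to the transaction it serves in every clock cycle."""
    schedule = {phase: [] for phase in PHASES}
    for cycle in range(cycles):
        for stage, phase in enumerate(PHASES):
            t = cycle - stage          # index of the transaction in this stage
            schedule[phase].append(f"T{t + 1}" if 0 <= t < transactions else "--")
    return schedule


for phase, row in pipeline_schedule(transactions=4, cycles=4).items():
    print(f"{phase:14}", " ".join(row))
```

In the fourth cycle, for example, T4 is being arbitrated while T3 is addressed and T2 transfers its data.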
Burst transaction – a special type of transaction in which the master issues a burst request to the slave, which may
be able to increment (or decrement) the given address itself, thus potentially reducing the need to send
many sequential addresses over the bus. This reduces power, since computations at the slave consume less
power, and may increase performance, since address decoding is done at the destination.
Burst transactions are often used by cache controllers when fetching a block of data from memory.
The primary goal of a split transaction is to improve performance in the following scenario:
A master initiates a transaction which can potentially take a long time (e.g. communicating with APB
peripherals). Without split transactions it would hold the bus for many cycles, which
could not be utilized by other masters (e.g. DMA).
With a split transaction, the master issues a request, then releases the bus and waits for a notification,
which can come many cycles later.
APB stands for Advanced Peripheral Bus. It is a slow, parallel, single-master bus with no pipelining, burst or split
transactions. The peripherals connected to it would not utilize the high speed provided by
AHB, so it can be e.g. 12x slower than AHB. Otherwise it would be a pure waste of power and chip
area.
Typical buses used in 8-bit µCs are very similar to APB. This is because they are mostly
legacy devices fabricated in older technologies, where simplicity is valued more highly.
Connectivity in AHB
Here is a simple diagram. More detailed information can be found in AMBA specification from ARM
[Figure: AHB signals between MASTER, BUS and SLAVE]
HADDR (32 bits) – address; reaches the slave as HSEL plus the lower part (register)
HTRANS (2 bits) – transfer type: not ready, burst mode request
HWRITE – direction
HSIZE (3 bits) – data size: 8/16/32/…/1024 bits
HPROT – levels of protection
HREADY – (semi-synchronous feature)
HRESP – error condition
This represents the lines of the AHB bus (the leading H stands for AHB). Note that only a single
master side and a single slave side are drawn. In fact each slave and each master would have its own
implementation of the hardware needed, and where the bus touches the master and slave sides,
multiplexers are found if necessary.
HMASTLOCK – signals a locked transaction
HLOCK can be used to perform atomic operations such as Test-And-Set, locking the bus from other
devices that could change sensitive data in the middle of a critical section.
The decoder is not strictly necessary, but it removes redundant decoding logic from every peripheral.
There are no 3-state buses in AHB, nor open-collector lines as found in e.g. ISA or I2C, as it is optimized for
on-chip use. Wherever multiple sources could drive the same line, we find multiplexers.
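The roles of the central decoder and of the multiplexers can be sketched as follows. The memory map, the slave names and both function names are hypothetical and chosen only for the example. A real system defines its own address ranges.

```python
# Hypothetical memory map: (base address, size in bytes, slave name).
MEMORY_MAP = [
    (0x0000_0000, 0x1000_0000, "flash"),
    (0x2000_0000, 0x0001_0000, "sram"),
    (0x4000_0000, 0x0001_0000, "apb_bridge"),
]


def decode(haddr):
    """Return the HSEL flag of every slave and the offset within the selected one."""
    hsel = {name: False for _, _, name in MEMORY_MAP}
    offset = None
    for base, size, name in MEMORY_MAP:
        if base <= haddr < base + size:
            hsel[name] = True
            offset = haddr - base      # lower part of the address (register)
    return hsel, offset


def read_mux(hsel, hrdata_by_slave):
    """Multiplexer: forward the read data of the selected slave only."""
    for name, selected in hsel.items():
        if selected:
            return hrdata_by_slave[name]
    return None


hsel, offset = decode(0x2000_0010)
print(hsel, hex(offset))
print(read_mux(hsel, {"flash": 1, "sram": 2, "apb_bridge": 3}))
```

Because exactly one HSEL is active at a time, the multiplexer replaces the 3-state drivers that an off-chip bus would use.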
Interesting diagrams can be found in AMBA specification from ARM.
Additional AMBA features
An implementor of the AMBA specification may implement only part of the functionality, as long as it supports
basic transactions (e.g. off-chip). Implementations may also be as complex as having three levels of buses, split
Used to fill buffers of peripherals with sequential
The first part of the transaction is asynchronous, similarly to a normal transaction.
The master signals a request for reading/writing sequential data from/to a slave. It keeps SEQ on HTRANS
during the transfer. The processor often requests a lock on the bus, which makes it easier to keep track of
where you are. You can also have soft-buses, which can grant access to the other peripherals.
The data on HWDATA does not change on the clock trigger if HREADY signals a wait state.
This type of transaction is well suited for filling the cache.
When a request for data from the cache ends in a miss, the cache fetches the desired
location first, then continues with the following addresses, because this is how cache
controllers work. This is the order:
Fetch order: 3 4 1 2
                 ^ CPU request
The wrapping burst implements this kind of transfer in hardware. The wrapping burst size can be e.g. 4,
8 or 16 bytes, matching the cache line. This should be decided by the chip architects.
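The address sequence of a wrapping burst can be computed as below. This is a sketch with a hypothetical function name. It assumes the block (e.g. a cache line) is `beats * beat_bytes` long and aligned to its own size, and that the start address is a multiple of the beat size.

```python
def wrapping_burst(start, beats, beat_bytes):
    """Addresses of a wrapping burst: increment, and wrap at the block boundary."""
    block = beats * beat_bytes                 # e.g. one cache line
    base = start - start % block               # aligned start of the block
    return [base + (start - base + i * beat_bytes) % block
            for i in range(beats)]


# 4 beats of 4 bytes, CPU request at 0x38 inside the line 0x30..0x3F
print([hex(a) for a in wrapping_burst(0x38, beats=4, beat_bytes=4)])
```

The requested word is fetched first and the two words below it in the line come last, which corresponds to the fetch order 3 4 1 2 over the line.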
AMBA does not specify the order in which the arbiter grants access to the masters. It can be fixed priority,
round robin, a priority queue, etc.
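Two of the policies mentioned can be sketched in a few lines. The function names and the convention that a lower index means a higher priority are assumptions for the example. `requests` is a list of booleans, one per master.

```python
def fixed_priority(requests):
    """Grant the requesting master with the lowest index (highest priority)."""
    for master, wants_bus in enumerate(requests):
        if wants_bus:
            return master
    return None


def round_robin(requests, last_granted):
    """Grant the next requesting master after the one granted last."""
    n = len(requests)
    for step in range(1, n + 1):
        master = (last_granted + step) % n
        if requests[master]:
            return master
    return None


requests = [True, True, True]
print(fixed_priority(requests))               # always master 0
print(round_robin(requests, last_granted=0))  # moves on to master 1
```

Fixed priority is simpler but can starve low-priority masters. Round robin gives each requesting master a turn.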
Stages of split transaction:
1. Master initiates transaction as usual.
2. If the slave is not ready, it asserts split and remembers the active master (provided by arbiter
to anyone interested).
3. The arbiter grants the bus to other masters.
4. Slave asserts HSPLIT line to the arbiter, telling which master can resume.
5. Arbiter restores bus grant to interrupted master.
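The five stages can be traced with a toy model of the arbiter. This is a simplified sketch, not the AMBA mechanism itself: the class, its method names and the fixed-priority policy are assumptions, and the slave's behaviour is driven by hand.

```python
class SplitArbiter:
    """Toy arbiter: fixed priority, skipping masters parked by a split response."""

    def __init__(self, masters):
        self.masters = masters
        self.parked = set()           # masters waiting for their slave

    def grant(self, requests):
        for m in range(self.masters):
            if requests[m] and m not in self.parked:
                return m
        return None

    def split(self, master):          # stage 2: the slave asserts split
        self.parked.add(master)

    def hsplit(self, master):         # stage 4: the slave says who can resume
        self.parked.discard(master)


arb = SplitArbiter(masters=2)
requests = [True, True]
trace = [arb.grant(requests)]         # stage 1: master 0 starts as usual
arb.split(0)                          # stage 2: the slave is not ready
trace.append(arb.grant(requests))     # stage 3: the bus goes to master 1
arb.hsplit(0)                         # stage 4
trace.append(arb.grant(requests))     # stage 5: master 0 is granted again
print(trace)
```

While master 0 is parked, the bus cycles it would otherwise have blocked are available to master 1.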
There is a diagram in AMBA documentation, but it does not show the whole complexity.
The following picture applies to APB. The bus does not implement pipelining.
PSEL duplicates part of PADDR to simplify the decoding logic in peripherals, thus reducing
PENABLE high indicates data phase.
If PCLK = 10 MHz, the actual data rate is 5 million transfers per second, because each data item is held for 2 clock cycles.
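The same calculation, written out. The 32-bit data width used in the example call is an assumption. The two cycles per transfer come from the text above.

```python
def apb_throughput(pclk_hz, data_width_bits, cycles_per_transfer=2):
    """Transfers per second and bits per second on a non-pipelined bus."""
    transfers = pclk_hz / cycles_per_transfer
    return transfers, transfers * data_width_bits


transfers, bits = apb_throughput(pclk_hz=10_000_000, data_width_bits=32)
print(transfers, bits)
```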
The buses of 8-bit µCs look very similar to APB. They may even share physical lines for address and
data; this is used e.g. when interfacing with off-chip devices.
The APB data width does not match the AHB data width. The bridge's responsibility is to split and merge
Universal Serial Bus.
It was meant to replace the huge variety of serial protocols which existed in PCs
and embedded systems at the time of its design. It became quite a success on the PC; in the world of embedded
systems it still competes with others.
I2C – one of the competitors. Very simple, has only data and clock lines. Used e.g. for boot control, when
the CPU talks to an external ROM.
CAN, LIN, others – competitors in the automotive world. They provide e.g. guarantee of service, multiple
USB has one master (mostly a PC). Some smartphones can act as either master or slave.
It is tree-structured. Master is called a host. Slaves are called functions. There are also nodes, that
only implement connectivity, called hubs.
[Figure: USB tree topology with the host, hubs and functions.]
Functions cannot request anything from Master. Every communication is initiated by Master. There
are no interrupts. Receiving data from functions is done by polling.
USB provides “hot plugging and unplugging” capability, which is fairly unique, but is dependent on
As many as 127 devices can be plugged in at the same time.
USB has (like AMBA) several levels of performance. They are all implemented on the same physical
medium (unlike AMBA). The levels differ e.g. by speed (dictated by frequency): from 100 MB/s for
video and HDDs down to kB/s for e.g. mouse and keyboard.
USB carries both information and limited power supply to power simple functions. They can also be
connected to external power sources.
[Figure: USB cable with power supply lines and data lines.]
At the start of transmission the encoding used is binary; after that it is NRZI with bit stuffing (zeros stuffed into
sequences of ones). A zero is encoded as a transition, a one as no change. In the USB specification the two line
states are called J and K. The transmission is differential, and when both lines are in the same state this signals a
special condition – synchronization.
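The encoding rule (a zero as a transition, a one as no change, with zeros stuffed into runs of ones) can be sketched as follows. The run length of six ones before a stuffed zero is taken from the USB specification rather than from these notes. The function names and the representation of the line as levels 0/1 instead of J/K are assumptions.

```python
def bit_stuff(bits):
    """Insert a 0 after every run of six consecutive 1s."""
    out, run = [], 0
    for b in bits:
        out.append(b)
        run = run + 1 if b == 1 else 0
        if run == 6:
            out.append(0)   # the stuffed bit forces a transition on the line
            run = 0
    return out


def nrzi_encode(bits, level=1):
    """A 0 toggles the line level, a 1 keeps it; returns the level for each bit."""
    levels = []
    for b in bits:
        if b == 0:
            level ^= 1
        levels.append(level)
    return levels


payload = [1, 1, 1, 1, 1, 1, 1, 0]
stuffed = bit_stuff(payload)
print(stuffed)
print(nrzi_encode(stuffed))
```

Without stuffing, a long run of ones would leave the line unchanged and the receiver would have no transitions to keep its clock aligned.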
The lines are terminated by pull-up / pull-down resistors, which remedy noise problems on unterminated
Units of transfer.
There is a sync field which synchronizes the receiver to the transmitter. There are 8-bit alternating
There are 3 types of packets:
Token packets – control packets, i.e. requests
Data packets – pure data
Status packets – return various status information from functions
Sync      PID (packet identifier)                               Payload
8 bits    8 bits (4 bits of packet type + 4 bits complement)
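The complement half of the PID lets a receiver detect a corrupted packet identifier without any further checksum. A minimal sketch, assuming the type sits in the low nibble of the byte and its bitwise complement in the high nibble. The function names are hypothetical, and the ACK type code is the one defined in the USB specification.

```python
def make_pid(packet_type):
    """Build the 8-bit PID: low nibble = type, high nibble = its complement."""
    return (packet_type & 0xF) | ((~packet_type & 0xF) << 4)


def pid_is_valid(pid):
    """A PID is valid when its two nibbles are bitwise complements."""
    return ((pid & 0xF) ^ ((pid >> 4) & 0xF)) == 0xF


ACK = 0b0010                      # handshake packet type code
pid = make_pid(ACK)
print(hex(pid), pid_is_valid(pid))
print(pid_is_valid(pid ^ 0x01))   # a single flipped bit is detected
```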
It plays a role similar to that of a status line. It is used to confirm that data has been accepted, or that no data is
ready to be transferred.
The transfer can also be split across many functions. The hubs support split transaction requests from
After 3 unsuccessful transactions, the host decides that the peripheral is no longer connected.
Cyclic Redundancy Check is used to protect the Payload.
USB standard documentation contains illustrative diagrams on various errors that may occur during
Most intelligence is concentrated in Host – PC. This allows function implementations to be fairly
Wireless Sensor Networks
WSNs are one of the popular embedded system applications.
What characterizes WSN:
Ad-hoc wireless network
Computation based on sensing
Bell’s law: a new computer class appears every 10 years. E.g. it is smaller, fewer people are required to
handle it, it brings new modes of connectivity and interfacing, etc.
Exemplary application areas:
Environment and agriculture
o Used mostly for monitoring (e.g. fires) to alert interested parties
o Can be static or mobile (carried by animals)
o Has been prime target of academic efforts
o Example – Zebranet (Princeton) – tracking migration of zebras across Kenya
o Example – control of irrigation in response to temperature and humidity across large
o Example: gathering information about pressure and temperature across bridge
structures, which reduces human labour
o Predicting ground anomalies (land slide), investigation of causes – lots of wires would
o Most probably the first successful commercial application
o Example: lights turning on in response to human presence, door opening, etc.
o Have long lifetime requirements
Medical, health application
o Body sensor network replacing wired sensors – human-in-the-loop – i.e. chair with
o Environment that is aware of objects or phenomena, adaptive and responsive
o Does not require assistance
o Temperature across the plant
o Positioning large parts in car manufacturing with submillimeter precision
o Pattern recognition
o Control system
o Adaptive, robust, flexible
o Traffic assistance
o Driver help
cannot be restarted in field – has to be reliable
can be installed without effort of putting wires, rearranging buildings
Monitoring vs. Control
Statically placed or in motion
Optimization of power – primary goal.
Application stops when battery runs out of charge and it can be hard or impossible to be
Network density and size
o Number of nodes
o Number of neighbouring nodes – how many nodes are communicating to each other
Central vs. distributed
Hierarchy or uniformity
Cost, size and power (interrelated features) – $1 per piece should be optimal
The algorithm used should be as simple as possible to minimize power requirements
Ease-of-development and management
A 100 µW load needs about 1 cm³ of lithium battery volume for 1 year of operation at 100% duty.
Rechargeable batteries are half as efficient in terms of volume.
Need to be replaced every 9 months or recharged every 3-4 hours.
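The rule of thumb above can be turned into an energy figure, and the same arithmetic shows why duty cycling matters. This is a sketch: the active and sleep power values in the second example are purely hypothetical and do not describe any particular node.

```python
SECONDS_PER_YEAR = 365 * 24 * 3600


def energy_joules(power_watts, seconds):
    """Energy drawn by a constant load."""
    return power_watts * seconds


def average_power(active_w, sleep_w, duty_cycle):
    """Mean power of a node that is active for `duty_cycle` of the time."""
    return duty_cycle * active_w + (1 - duty_cycle) * sleep_w


# Energy implied by "100 uW for 1 year", i.e. per 1 cm3 of battery
year_at_100uw = energy_joules(100e-6, SECONDS_PER_YEAR)
print(round(year_at_100uw, 1), "J")

# Hypothetical node: 30 mW when active, 30 uW asleep, active 1% of the time
print(average_power(30e-3, 30e-6, duty_cycle=0.01), "W")
```

A node that sleeps most of the time draws an average power much closer to its sleep power than to its active power, which is why the sleep mode dominates the design.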
Architecture of sensor network
E.g. Atmega on Mica2dot board
E.g. 512k external serial flash for Atmega.
E.g. RFM or Chipcon 1000
Depends on application.
Either battery or energy scavenging.
Exemplary pre-made boards
Mica2 – big, small memory, runs Atmega
Tmote Sky – has USB port, 5 sensors integrated, runs MSP430
Imote – powerful and energy-hungry
If you don’t have an OS you have to write everything yourself: deal with all hardware nuances and
task scheduling. This may be time-consuming. A feasible solution would be a small system that runs on a very
small platform and can be programmed simply.
Usual services implemented by OS
Abstracting the system resources.
Separation of system mode and user mode for hardware.
Memory management unit.
Possible implementations of these goals
To imitate regular operating system (i.e. providing POSIX-compatibility)
Create familiar programming interface (namely: processes). Sacrifice process separation to match the
restrictions of the platform.
Small operating system targeting popular WSN platforms.
Takes about 400 bytes (which architecture?)
Created at UC Berkeley.
Requires programming in nesC – its own programming language, which is an extension of C,
influenced by VHDL
Portable across several platforms.
Component-based system (similar to VHDL components). Program is created by wiring them
together. They communicate by exchanging asynchronous events (event-driven).
Most WSN operate in sleep mode, so the system should also be able to be put to sleep. Therefore it
should react to environment conditions asynchronously, and send messages to components
Frame – state holder
Task – normal execution program (thread?)
Buffer which can be
accessed by both.
There is no context switch – tasks share the same stack. Each task runs to completion (tasks
cannot sleep). The scheduler is FIFO-based. It is similar to the single-threaded event queue dispatchers in
higher-level frameworks (win32 or java+awt event queue).
Interrupts that arrive do not cause a full context switch, but simply enqueue another task to be executed after
the currently queued tasks.
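The scheduling model described above can be imitated in a few lines. This is a Python sketch of the idea only, not nesC code and not the actual scheduler implementation. The class and the task names are hypothetical.

```python
from collections import deque


class Scheduler:
    """FIFO task queue; every task runs to completion, one after another."""

    def __init__(self):
        self.queue = deque()
        self.log = []

    def post(self, task):
        self.queue.append(task)

    def interrupt(self, task):
        # An interrupt handler does no context switch; it only posts a task.
        self.post(task)

    def run(self):
        while self.queue:
            task = self.queue.popleft()
            task(self)            # runs to completion, cannot sleep
        # queue empty: a real node would now enter sleep mode


def sample(s):
    s.log.append("sample")
    s.interrupt(send)             # e.g. an interrupt arrives during the task


def send(s):
    s.log.append("send")


def blink(s):
    s.log.append("blink")


s = Scheduler()
s.post(sample)
s.post(blink)
s.run()
print(s.log)
```

The task posted by the interrupt runs only after the tasks that were already queued, so no task is ever interrupted halfway by another task.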
Blocking resources is not an option, as there is actually no concurrency. This implies “Split-phase
Interfaces define functionality that should be implemented in component. Components provide
implementation for common interfaces. (Rather composition than inheritance).
How to develop WSN application?
And development environment for each
Platform Construction – Software
Usual system construction:
System Services Device
Platform system for WSN
Cross-platform development support is an advantage, as it enables easy switching between different hardware
platforms. The code should compile well with compilers which produce binary images for different
platforms, and use an abstracted API instead of dealing with chip resources directly.
To ease the difficulties of porting to different platforms, there is a possibility of debugging on-chip
Application is ONE linked executable composed of OS and components.
Event-driven (previous lecture).
High concurrency using little space.
Single shared stack across tasks.
No distinction between kernel and user space; the memory is shared.
Split phase of request and response - asynchronous command/event.
FIFO based scheduler which queues tasks which run one-by-one until completion.
Operating system also tailored to WSN.
Developed by University of Colorado.
Offers preemptive multithreading.
Often abbreviated to MOS.
Uses < 0.5kB in RAM (with network stack included).
Coded in standard C.
The kernel resembles the UNIX one.
Implements POSIX subsystem (mutexes and semaphores for synchronization).
MANTIS comprises the following elements:
Network Stack Command Stack User level threads
MANTIS system API
Kernel/scheduler COMM DEV
When using preemptive multithreading, a programmer does not need to take into
consideration the possibility of one task blocking another, as CPU time is instrumented by
Preemption of running tasks consumes time, memory and energy, as each time the whole
context is stored on a stack.
Although the CPU is not blocked, other types of blocking may occur when concurrently using resources
other than the CPU.
Preempting long-running tasks with short ones that deal with I/O operations reduces the need for
large buffers and reduces the possibility of buffer overflows.
[Figure: events handled by Task 1 and Task 2 under non-preemptive and preemptive multitasking.
Annotations: the producer enables to have an overflow; interleaving.]
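The effect on buffers can be shown with a deliberately simple model. It assumes one I/O event arrives per tick while a long task is running, and that a short I/O task, when allowed to preempt, drains the buffer completely. The function name and parameters are hypothetical.

```python
def simulate(long_task_ticks, buffer_size, preemptive):
    """Count events lost while a long task runs and one event arrives per tick."""
    buffered = lost = 0
    for _ in range(long_task_ticks):
        if buffered < buffer_size:
            buffered += 1          # the hardware stores the new event
        else:
            lost += 1              # buffer overflow
        if preemptive:
            buffered = 0           # the short I/O task preempts and drains the buffer
    return lost


print(simulate(long_task_ticks=10, buffer_size=4, preemptive=False))
print(simulate(long_task_ticks=10, buffer_size=4, preemptive=True))
```

Without preemption the buffer must be at least as large as the longest task is long. With preemption a single slot is enough in this model.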
Challenges of preemptive multitasking
limited memory (e.g. 4 kB on MICA)
WSN node lifetime associated with energy – the scheduler should be able to save energy by
entering sleep mode of a processor
MANTIS – further details
The tasks can have one of the following priorities:
The context switch may take approximately 10µs.
The tasks may be suspended when waiting for resources using simple API calls (such as
mos_task_suspend, mos_task_resume). This puts them on a sleep queue; they are woken up when
the resource is ready.
The thread table is the main kernel data structure. It is statically allocated – designer designates
running tasks at compile-time. It is implemented as a linked list, with additional pointer to current
The kernel is only triggered by the timer interrupt, whose period is approx. 10 ms by default.
There is an idle thread of lowest priority, which implements power-aware scheduling.
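How an idle thread leads to power-aware scheduling can be sketched as follows. This is a Python illustration of the idea, not MANTIS code: the thread table layout, the function names and the convention that a larger number means a higher priority are all assumptions.

```python
def schedule(threads):
    """Return the name of the ready thread with the highest priority."""
    ready = [t for t in threads if t["ready"]]
    return max(ready, key=lambda t: t["priority"])["name"]


def tick(threads):
    """One timer interrupt: pick a thread; the idle thread puts the CPU to sleep."""
    chosen = schedule(threads)
    return "sleep" if chosen == "idle" else f"run {chosen}"


threads = [
    {"name": "idle", "priority": 0, "ready": True},    # always ready
    {"name": "sense", "priority": 2, "ready": False},  # waiting for a sensor
    {"name": "radio", "priority": 3, "ready": False},  # waiting for a packet
]
print(tick(threads))          # nothing else is ready
threads[1]["ready"] = True
print(tick(threads))
```

Since the idle thread has the lowest priority, it is chosen only when every other thread is suspended, which is exactly the moment at which the processor can be put to sleep.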
COMM components in MANTIS
Communication API is accessed by MAC (Medium Access Control) protocol. It abstracts the hardware,
giving the programmer the following facilities:
unified interface for UART, USB and radio devices
management of packet buffers, and synchronization functions
operates mainly on four functions:
com_send is a blocking call, which means, that the calling thread is suspended until completion.
com_recv is also blocking – it waits until a buffer is filled with valid data by underlying hardware and
operating system routines.
Device drivers (DEV)
Device drivers are implemented POSIX-style. This abstracts the different types of sensors found on target
boards (acceleration, temperature, light, humidity, etc.)