Embedded System Design
• A processor built from
dedicated silicon is
referred to as a "hard"
processor.
– Such is the case for
the ARM922T inside
the Altera Excalibur
– and for the PowerPC 405
inside the Xilinx
Virtex-II Pro and Virtex-4
Source: Xilinx Inc.
• A "soft" processor is built using the FPGA’s
general-purpose logic.
• The soft processor is typically described in
a Hardware Description Language (HDL)
• Unlike the hard processor, a soft
processor must be synthesized and fit into
the FPGA fabric.
• Xilinx MicroBlaze
FPGA Cores – Hard vs Soft
• In both soft and hard processor systems,
the local memory, processor busses,
internal peripherals, peripheral controllers,
and memory controllers must be built from
the FPGA’s general-purpose logic.
Advantages of Embedded
Processors in FPGA
• An FPGA embedded processor system
offers many exceptional advantages
compared to typical microprocessors, including:
1) customization
2) obsolescence mitigation
3) component and cost reduction
4) hardware acceleration
#1 Customization
• The designer of an FPGA embedded processor
system has complete flexibility to select any
combination of peripherals and controllers.
• In fact, the designer can invent new, unique
peripherals that can be connected directly to the
• If a designer has a non-standard requirement for
a peripheral set, this can be met easily with an
FPGA embedded processor system.
– For example, a designer would not easily find an off-
the-shelf processor with ten UARTs. However, in an
FPGA, this configuration is very easily accomplished.
#2 Obsolescence Mitigation
• Some companies, in particular those supporting
military contracts, have a design requirement to
ensure a product lifespan that is much longer
than the lifespan of a standard electronics
– Component obsolescence mitigation is a difficult
• FPGA soft-processors are an excellent solution
in this case since the source HDL for the soft-
processor can be purchased.
– Ownership of the processor’s HDL code may fulfill the
requirement for product lifespan guarantee.
#3 Component and cost reduction
• With the versatility of the FPGA, previous
systems that required multiple components can
be replaced with a single FPGA.
• Certainly this is the case when an auxiliary I/O
chip or a co-processor is required next to an
off-the-shelf processor.
• By reducing the component count in a design, a
company can reduce board size and inventory
management, both of which will save design
time and cost.
#4 Hardware acceleration
• Perhaps the most compelling
reason to choose an FPGA
embedded processor is the ability
to make tradeoffs between
hardware and software to
maximize efficiency and
• If an algorithm is identified as a
software bottleneck, a custom co-
processing engine can be
designed in the FPGA specifically
for that algorithm.
• This co-processor can be attached
to the FPGA embedded processor
through special, low-latency
channels, and custom instructions
can be defined to exercise the
co-processor.
• Unlike an off-the-shelf processor, the hardware platform
for the FPGA embedded processor must be designed.
• The embedded designer becomes the hardware
processor system designer when an FPGA solution is
• Because of the integration of the hardware and software
platform design, the design tools are more complex.
• The increased tool complexity and design methodology
require more attention from the embedded designer.
• Since FPGA embedded processor software design is
relatively new compared to software design for standard
processors, the software design tools are likewise
relatively immature, although workable.
Peripherals and memory
• To facilitate FPGA embedded processor
design, both Xilinx and Altera offer
extensive libraries of intellectual property
(IP) in the form of peripherals and memory
• This IP is included in the embedded
processor toolsets provided by these
manufacturers (e.g., UART, DMA, PCI-X).
Altera Embedded Processors
Xilinx Embedded Processors
Types of Buses
• The fastest possible memory option is to put everything
in local memory.
• Xilinx local memory is made up of large FPGA memory
blocks called BlockRAM (BRAM). Embedded processor
accesses to BRAM happen in a single bus cycle.
• Since the processor and bus run at the same frequency
in MicroBlaze, instructions stored in BRAM are executed
at the full MicroBlaze processor frequency.
– In a MicroBlaze system, BRAM is essentially equivalent in
performance to a Level 1 (L1) cache.
• The PowerPC can run at frequencies greater than the
bus and has true, built-in L1 cache.
– Therefore, BRAM in a PowerPC system is equivalent in
performance to a Level 2 (L2) cache.
• Xilinx FPGA BRAM quantities differ by device.
• For example, the 1.5-million-gate Spartan-3 device
(XC3S1500) has a total BRAM capacity of 64 KB, whereas the
400,000-gate Spartan-3 device (XC3S400) has half as
much at 32 KB.
• An embedded designer using FPGAs should refer to the
device family datasheet to review a specific chip’s BRAM
• If the designer’s program fits entirely within local
memory, then the designer achieves optimal memory
• However, many embedded programs exceed this
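The capacity figures quoted above can be reproduced with a short calculation. The sketch below is only illustrative: the block counts (16 and 32) and the 2 KB of usable data per BlockRAM are assumptions that should be checked against the Spartan-3 family datasheet, and the helper names are hypothetical.

```python
# Usable data bytes per Spartan-3 BlockRAM (assumption: parity bits excluded).
BRAM_DATA_BYTES = 2 * 1024

# BlockRAM counts per device (assumption: verify against the family datasheet).
BRAM_BLOCKS = {"XC3S400": 16, "XC3S1500": 32}


def bram_capacity_bytes(device):
    """Total BRAM data capacity of a device, in bytes."""
    return BRAM_BLOCKS[device] * BRAM_DATA_BYTES


def fits_in_local_memory(device, program_bytes, reserved_blocks=0):
    """True if the program fits in BRAM after reserving some blocks
    for other uses (for example, caches or peripheral buffers)."""
    usable = (BRAM_BLOCKS[device] - reserved_blocks) * BRAM_DATA_BYTES
    return program_bytes <= usable


if __name__ == "__main__":
    for dev in BRAM_BLOCKS:
        print(dev, bram_capacity_bytes(dev) // 1024, "KB")
    # A hypothetical 40 KB program image on each device:
    print(fits_in_local_memory("XC3S400", 40 * 1024))
    print(fits_in_local_memory("XC3S1500", 40 * 1024))
```

A check like this is only a first filter; the rest of the design also competes for the same BlockRAMs.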
External Memory Interface
• Xilinx provides several memory controllers that interface
with a variety of external memory devices.
– Memory controllers are connected to the processor peripheral bus.
– The three types of volatile memory supported by Xilinx are
• static RAM (SRAM)
• single-data-rate SDRAM
• double-data-rate (DDR) SDRAM.
• The SRAM controller is the smallest and simplest inside the
FPGA, but SRAM is the most expensive of the three.
• The DDR controller is the largest and most complex
inside the FPGA, but fewer FPGA pins are required, and
DDR is the least expensive per megabyte.
Should you Cache external memory?
• A design in Spartan-3 enables 8 KB of data
cache and designates 32 MB of external
memory to be cached.
– This cache requires 12 address tag bits.
– This configuration consumes 124 logic cells and 6
BlockRAMs.
• Only 4 BlockRAMs are required in Spartan-3 to
achieve 8 KB of local memory.
• In this case, cache is 50% more expensive in
terms of BRAM usage than local memory.
– The 2 extra BRAMs are used to store address tag
bits.
• Additionally, the achievable system frequency may be
reduced when the cache is enabled.
– without any cache - 75 MHz;
– with cache - 60 MHz.
• Cache controller
– adds logic and complexity to the design,
– decreasing the achieved system frequency during FPGA place
and route.
• Consumes FPGA BRAM resources that may have
otherwise been used to increase local memory
• Cache implementation may also cause the overall
system frequency to decrease.
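The tag-bit and BRAM figures in the Spartan-3 example above can be checked with a little arithmetic. This is a minimal sketch that assumes a direct-mapped cache and power-of-two sizes; the function names are hypothetical, and the model ignores valid bits and other bookkeeping that a real cache controller stores.

```python
def log2_exact(n):
    """Base-2 logarithm of a power of two, as an integer."""
    if n <= 0 or n & (n - 1):
        raise ValueError("size must be a power of two")
    return n.bit_length() - 1


def tag_bits(cacheable_bytes, cache_bytes):
    """Tag width of a direct-mapped cache (assumed organization):
    the address bits of the cacheable range not used to index the cache."""
    return log2_exact(cacheable_bytes) - log2_exact(cache_bytes)


def bram_overhead_percent(cache_brams, local_brams):
    """Extra BRAMs a cache needs relative to the same amount of local memory."""
    return 100.0 * (cache_brams - local_brams) / local_brams


if __name__ == "__main__":
    KB, MB = 1024, 1024 * 1024
    print(tag_bits(32 * MB, 8 * KB))    # 25 - 13 address bits
    print(bram_overhead_percent(6, 4))  # 6 BRAMs for cache vs 4 for local memory
```

With 32 MB of cacheable memory (25 address bits) and an 8 KB cache (13 address bits), the difference is the 12 tag bits quoted above, and 6 BRAMs against 4 is the 50% overhead.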
Some example designs
• Considering these cautions, enabling the MicroBlaze cache,
especially the instruction cache, may improve performance, even
when the system must run at a lower frequency.
• A 60 MHz system with instruction cache enabled has a 150%
advantage over a 75 MHz system without instruction cache (both
systems store entire program in external memory).
• When both instruction and data caches are enabled, the 60 MHz
system outperforms the 75 MHz system by 308%.
– This example is not the most practical since the entire DMIPS program
will fit in the cache.
– A more realistic experiment is to use an application that is larger than
• Another precaution is regarding applications that frequently jump
beyond the size of the cache.
– Multiple cache misses degrade the performance, sometimes making a
cached external memory worse than the external memory without
• For MicroBlaze, perhaps the optimal memory configuration is to
wisely partition the program code, maximizing the system frequency
and local memory size.
• Critical data, instructions, and stack are placed in local memory.
• Data cache is not used, allowing for a larger local memory bank.
• If the local memory is not large enough to contain all instructions,
the designer should consider enabling the instruction cache for the
address range in external memory used for instructions.
• By not consuming BRAM in data cache, the local memory can be
increased to contain more space.
• An instruction cache for the instructions assigned to external
memory can be very effective.
• Experimentation or profiling shows which code items are most
heavily accessed; assigning these items to local memory provides a
greater performance improvement than caching.
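The partitioning advice above can be sketched as a simple greedy placement: rank code and data items by how often they are accessed per byte, then fill local memory in that order. This is an illustration under stated assumptions, not a tool flow; the item names, sizes, and access counts are hypothetical, and in practice the result would be expressed through linker script sections.

```python
def partition(items, local_bytes):
    """Greedy placement of profiled items into local memory.

    items: list of (name, size_bytes, access_count) tuples from profiling.
    Returns (local_names, external_names).
    """
    # Rank by access density so small, hot items are placed first.
    ranked = sorted(items, key=lambda it: it[2] / it[1], reverse=True)
    local, external = [], []
    free = local_bytes
    for name, size, _count in ranked:
        if size <= free:
            local.append(name)
            free -= size
        else:
            external.append(name)
    return local, external


if __name__ == "__main__":
    # Hypothetical profile: (name, size in bytes, number of accesses).
    profile = [
        ("isr", 512, 90000),
        ("fir_filter", 2048, 120000),
        ("init", 4096, 1),
        ("log_fmt", 3072, 500),
    ]
    local, external = partition(profile, local_bytes=4096)
    print("local memory:", local)
    print("external memory (candidate for instruction cache):", external)
```

A greedy ranking is not guaranteed to be optimal, but it is usually a reasonable starting point before experimenting with the placement by hand.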
• In addition to the memory access time, the
peripheral bus also incurs some latency.
• In MicroBlaze, the memory controllers are
attached to the On-chip Peripheral Bus (OPB).
• For example, the OPB SDRAM controller
has a latency of four to six cycles for a
write and eight to ten cycles for a
read (depending on bus clock frequency).
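To see why a cached system at a lower clock can still win, or lose when misses dominate, the read latency above can be put into a rough timing model. This is a sketch only: the one-cycle hit time, the rule that a miss costs a hit plus the full read latency, and the hit rates are all assumptions, not measured MicroBlaze behaviour. The 60 MHz and 75 MHz clocks are the ones from the earlier example.

```python
def cycles_to_ns(cycles, bus_mhz):
    """Convert bus cycles to nanoseconds at a given bus frequency."""
    return cycles * 1000.0 / bus_mhz


def uncached_read_ns(latency_cycles, bus_mhz):
    """Every read goes to external memory and pays the full latency."""
    return cycles_to_ns(latency_cycles, bus_mhz)


def cached_read_ns(latency_cycles, bus_mhz, hit_rate, hit_cycles=1):
    """Average read time with a cache (assumed model: a miss costs
    the hit time plus the external read latency)."""
    miss_cycles = hit_cycles + latency_cycles
    avg_cycles = hit_rate * hit_cycles + (1.0 - hit_rate) * miss_cycles
    return cycles_to_ns(avg_cycles, bus_mhz)


if __name__ == "__main__":
    READ_LATENCY = 8  # lower bound of the read latency quoted above
    print("uncached, 75 MHz:", round(uncached_read_ns(READ_LATENCY, 75), 1), "ns")
    for hit_rate in (0.0, 0.5, 0.9):  # hypothetical hit rates
        t = cached_read_ns(READ_LATENCY, 60, hit_rate)
        print("cached, 60 MHz, hit rate", hit_rate, ":", round(t, 1), "ns")
```

In this model the cached 60 MHz system is slower than the uncached 75 MHz system when almost every access misses, and faster once the hit rate is high, which is the trade-off described in the example designs.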
The PPC405x3 provides the following set of interfaces that support the
attachment of cores and user logic:
-- Processor local bus interface
The processor local bus (PLB) interface provides a 32-bit address and
three 64-bit data buses attached to the instruction-cache and
data-cache units.
Two of the 64-bit buses are attached to the data-cache unit, one
supporting read operations and the other supporting write
operations.
The third 64-bit bus is attached to the instruction-cache unit to support
Device control register interface
The device control register (DCR) bus
interface supports the attachment of on-
chip registers for device control.
Software can access these registers using
the mfdcr and mtdcr instructions.
• The clock and power-management interface
– supports several methods of clock distribution and power management.
• JTAG port interface
– The JTAG port interface supports the attachment of external debug
– Using the JTAG test-access port, a debug tool can single-step the
processor and examine internal-processor state to facilitate software
• On-chip interrupt controller
– combines asynchronous interrupt inputs from on-chip and off-chip
sources and presents them to the core using a pair of interrupt signals
(critical and noncritical).
– Asynchronous interrupt sources can include external signals, the JTAG
and debug units, and any other on-chip peripherals.
• On-chip memory controller interface
– Supports attachment of additional memory to the instruction and data