                    An Analysis of Secure Processor Architectures

    Siddhartha Chhabra1, Yan Solihin1, Reshma Lal2, and Matthew Hoekstra2
        1 Dept. of Electrical and Computer Engineering, North Carolina State University,
                                          Raleigh, USA
                                  {schhabr, solihin}
                           2 Intel Labs, Intel Corporation, Oregon, USA
                            {reshma.lal, matthew.hoekstra}

           Abstract. Security continues to be an increasingly important concern
           in the design of modern systems. Many systems may have security re-
           quirements such as protecting the integrity and confidentiality of data
           and code stored in the system, ensuring integrity of computations, or
           preventing the execution of unauthorized code. Making security guar-
           antees has become even harder with the emergence of hardware attacks
           where the attacker has physical access to the system and can bypass any
           software security mechanisms employed. To this end, researchers have
           proposed Secure Processor architectures that provide protection against
           hardware attacks using platform features. In this paper, we analyze three
           of the currently proposed secure uniprocessor designs in terms of their
           security, complexity of hardware required and performance overheads:
           eXecute Only Memory (XOM), Counter mode encryption and Merkle
           tree based authentication, and Address Independent Seed Encryption
           and Bonsai Merkle Tree based authentication. We then provide a dis-
           cussion on the issues in securing multiprocessor systems and survey one
           design each for Shared Memory Multiprocessors and Distributed Shared
           Memory Multiprocessors. Finally, we discuss future directions in Se-
           cure Processor research that have largely been ignored and form the
           weakest link in the security afforded by the proposed schemes, namely,
           Secure Booting and Secure Configuration. We identify potential issues
           that can serve as the foundation of further research in secure processors.

           Keywords: Secure Processor Architectures, Memory Encryption, Mem-
           ory Authentication, Secure Booting, Secure Configuration

1        Introduction

With the growth of digital information stored on modern computer systems, and
the increased ability of attackers to target this wealth of information, security
has taken a front seat in the design of systems today. One of the design goals of
security systems is to protect the integrity and privacy of code executing on the
system and prevent attackers from injecting and executing arbitrary code. This
type of protection becomes particularly important for enforcing Digital Rights
Management (DRM) (for example, in gaming systems), preventing reverse engi-
neering and software piracy.
    Software attacks have formed the most widely exploited category of attacks
so far. Recently, however, hardware attacks have appeared on the horizon and
present a new set of security challenges for security architects. Hardware attacks
involve attacks wherein the attacker has physical access to the system. Most
modern systems offer ample opportunities to attackers to exploit the design of
the system to their advantage. For example, in most modern systems, data is
stored and communicated in plaintext on the memory bus between the processor
and the system memory. The plaintext data stored in the main memory presents
the attackers with a situation where they can easily scan and dump the contents
of main memory, thereby, getting hold of potentially sensitive information such
as user passwords and cryptographic keys [1].
Although hardware attacks are more complex to mount, they are very power-
ful, as they can bypass any software security protection employed in the system.
The recent proliferation of modchips [2–10] used to bypass the DRM protec-
tion mechanisms employed by gaming systems, has shown that given significant
financial payoffs, hardware attacks are very realistic threats.
Since software security mechanisms cannot defend against hardware attacks, researchers
have proposed Secure processor architectures [11–23] to defend against hardware
attacks utilizing memory encryption and authentication as the protection mech-
anisms. These architectures assume that the on-chip components are secure and
the off-chip components are prone to attack. The data blocks evicted off the core
are encrypted and integrity protected before being sent off-chip and stored in
the main memory. When a data block is brought back on-chip, it is checked for
integrity and decrypted before being supplied to the processor. Memory encryp-
tion provides protection against passive attacks, wherein the attacker attempts
to silently observe the data with the possible intent to steal critical information
like passwords, by encrypting data that is moved off-chip and decrypting it back
when it is brought back on-chip. Memory authentication provides protection
against active attacks, where the attacker attempts to modify the data in off-chip
storage and communication channels, by computing Message Authentication Codes
(MACs) as code and data moves on and off the processor chip. The chip indus-
try has also recognized the need for secure processors, as is evident in the recent
efforts by IBM in the SecureBlue project [24], Dallas Semiconductor [25] and
Intel La Grande project [26].
    In this paper, we make the following contributions:

 – We analyze three of the proposed architectures for secure uniprocessors
   in terms of the software distribution model, security, system level issues,
   complexity of hardware required and performance overheads: eXecute-Only
   Memory (XOM) [14], Counter Mode Encryption [15–18, 20–23] and Merkle
   trees for authentication [11], and finally, Address Independent Seed Encryp-
   tion (AISE) and Bonsai Merkle Trees for authentication [27] (Section 4).
 – We then provide a discussion on the issues in securing multiprocessor sys-
   tems and survey one design each for Shared Memory Multiprocessors and
   Distributed Shared Memory Multiprocessors.
 – We identify important problems that have largely been ignored by current
   secure processor design proposals. In particular, the current proposals suffer
   from two critical limitations in terms of security. First, the current archi-
   tectures assume that the system has booted securely and only provide
   steady-state security guarantees. A system can be compromised before boot-
   ing, which in turn renders the steady-state protection afforded by the se-
   cure processor worthless. Hence, to make security guarantees from boot-up to
   steady-state execution, a secure booting solution must be in place. Second,
   the current proposals only protect the communication between the processor
   and main memory; communication with I/O devices such as network cards
   is largely ignored. We believe that the issues recognized in this
   paper can direct future research in secure processors.

    The rest of the paper is organized as follows. Section 2 discusses the assumed
attack model. Section 3 presents a brief description of the hardware attacks
on which the security of the surveyed architectures is discussed. Section 4 and
Section 5 present an analysis of the secure uniprocessor and multiprocessor ar-
chitectures, respectively, surveyed in this paper. Section 6 identifies challenges
in secure processor research and we finally conclude in Section 7.

2   Assumed Attack Model

The architectures discussed in this paper identify two regions in a system: a
trusted region and an untrusted region. The processor chip itself forms the trusted
region and it is assumed that code and data on-chip (in registers or caches)
cannot be tampered with. All off-chip components, including the memory, buses,
disk, BIOS and expansion cards are considered insecure and form the untrusted
region of the system. Any data or code stored in any of these structures can be
observed and modified by the attackers.
    We note that it is not impossible to scan the processor chip or even tamper
with the code and data on-chip, however it is much harder and more involved
than targeting off-chip components. The use of various manufacturing techniques
such as special coating [25] and the involved nature of the attack leading to po-
tentially damaging the processor chip makes it even harder to target the proces-
sor chip. Hence, we believe that this is a reasonable assumption for the attack
model.

3   A Classification of Hardware Attacks

Broadly, hardware attacks can be classified into two main categories: passive
attacks and active attacks.
3.1   Passive attacks

Passive attacks are the ones where the attacker simply observes the data and/or
code going to and from the processor chip in a non-intrusive way. This could be
done by placing a snooper or a logic analyzer on the memory bus [28, 29] and
storing the information in a multi-gigabyte device. Passive attacks leave behind
no traces, as the program execution is not interrupted in any way, and could
lead to the attacker discovering the user’s private information such as credit
card numbers and passwords.

3.2   Active attacks

Active attacks are the ones where the attacker potentially changes the code
and/or data of the program under execution in order to alter the program flow
potentially causing the program to reveal sensitive information. The most com-
mon active attacks are discussed below and will be used for the discussion of
security properties of the architectures analyzed.

1. In spoofing attacks, the attacker tries to change the data value at a memory
   location and tries to pass it off as valid data.
2. Splicing attacks involve taking valid data values and duplicating them or
   replacing them with values at other locations.
3. In Replay attacks, the attacker records old values of data blocks and re-
   plays them at a later point in time.

4     Analysis of Secure Processor Architectures

In this section we present a detailed analysis of three of the proposed secure
processor architectures in terms of software distribution model, security, system
level issues, complexity of hardware required, and performance overheads.

4.1   eXecute-Only Memory

One of the first works on secure processor architectures was presented by Lie et
al. [14]. In XOM, the processor is equipped with a private key, known only to the
processor manufacturer, and the corresponding public key is published. Hence,
anyone can encrypt the software with the public key but only the processor
having the corresponding private key can decrypt and execute the software. In
XOM, data evicted off the processor chip is encrypted using direct encryption
and integrity protected with a MAC. Hence, the memory stores ciphertext and
when a block is brought back on-chip, it first needs to be decrypted before it can
be fed to the processor. The integrity check can happen lazily in the background;
an integrity failure triggers an exception and halts program execution.

Software Distribution Model The entire software can be encrypted with the
public key corresponding to the processor on which the software is supposed
to run. However, public key cryptography is prohibitively slow compared to
symmetric key cryptography. Hence, a different software distribution model is
chosen. The software is encrypted with a symmetric key, Ks, which is itself encrypted
using the processor’s public key and shipped with the software. The processor
begins execution by first extracting the key Ks, and using this key to decrypt
rest of the software. The key, Ks, is then used as the session key for encrypting
the code and data of the software when it is evicted off the processor chip.
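This hybrid distribution model can be sketched as follows. This is a toy illustration only: the public-key wrap is modelled by a simple XOR with a processor key, and the cipher by a SHA-256-based keystream, both stand-ins for the real asymmetric and AES operations; all names and values are hypothetical.

```python
import hashlib

def keystream(key: bytes, nonce: bytes, length: int) -> bytes:
    """Toy keystream: SHA-256(key || nonce || counter), a stand-in for AES."""
    out, counter = b"", 0
    while len(out) < length:
        out += hashlib.sha256(key + nonce + counter.to_bytes(8, "big")).digest()
        counter += 1
    return out[:length]

def xor(a: bytes, b: bytes) -> bytes:
    return bytes(x ^ y for x, y in zip(a, b))

# Vendor side: encrypt the software with a fresh session key Ks, then
# wrap Ks for the target processor (the public-key wrap is modelled here
# as an XOR with the processor key -- illustration only).
processor_key = b"\x42" * 32          # stands in for the processor key pair
software = b"mov r1, r2; add r3, r1, r1"
Ks = hashlib.sha256(b"vendor-chosen session key").digest()
wrapped_Ks = xor(Ks, processor_key)
ciphertext = xor(software, keystream(Ks, b"image", len(software)))

# Processor side: unwrap Ks with the on-chip key, then decrypt the image.
recovered_Ks = xor(wrapped_Ks, processor_key)
plaintext = xor(ciphertext, keystream(recovered_Ks, b"image", len(ciphertext)))
```

Only the short key Ks ever goes through the expensive asymmetric path; the bulk of the image uses fast symmetric encryption, which is the point of the model.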

Security and System Issues XOM encrypts all data that is sent off-chip
using the session key, Ks, extracted from the software. This provides effective
defense against passive attacks, as the attackers using snoopers or bus analyzers
can only see encrypted content. The key used for encryption never goes off-chip
and is stored securely in an on-chip hardware structure called the session key
table.
    XOM can successfully defend against spoofing attacks as each data block has
a MAC associated with it. An attempt to tamper with a data block will result in
a MAC mismatch and subsequently an integrity violation will be raised. Splicing
attacks are prevented in a similar fashion. The MAC computed for a data block
includes a position dependent component, such as the address of the block,
and hence any attempt to duplicate one ciphertext block or replace one block
with another will result in a MAC mismatch. XOM, however, does not provide
protection against replay attacks. An attacker can record the ciphertext and the
corresponding MAC of a data block and replay it later without being detected,
defeating the security afforded by a XOM processor. In addition, there is a gen-
eral security concern with direct encryption: the statistical distribution of cipher-
texts matches that of the plaintexts, so sophisticated frequency attacks can
provide the attacker with useful information.
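The position-dependent MAC described above can be illustrated with a short sketch; HMAC-SHA256 stands in for whatever MAC function an implementation would use, and the key and addresses are made up.

```python
import hashlib
import hmac

MAC_KEY = b"on-chip-mac-key"   # illustrative; never leaves the chip

def block_mac(data: bytes, address: int) -> bytes:
    """MAC over the ciphertext block and its address, so a block copied to
    a different address fails verification (splicing detection)."""
    return hmac.new(MAC_KEY, address.to_bytes(8, "big") + data,
                    hashlib.sha256).digest()

block = b"ciphertext-block"
mac_at_0x1000 = block_mac(block, 0x1000)

# The same ciphertext relocated to another address no longer matches:
assert block_mac(block, 0x2000) != mac_at_0x1000
# Tampered data at the original address is also caught:
assert block_mac(b"tampered!!" + block[10:], 0x1000) != mac_at_0x1000
```

Note that nothing here prevents replaying an old (block, MAC) pair at the same address, which is exactly the replay weakness discussed above.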
    XOM uses a per process session key to encrypt the data blocks going off-chip.
This immediately makes XOM incompatible with Inter-Process Communication
(IPC) support. Two processes sharing a data block will use different keys to
encrypt/decrypt the data, hence the encrypted data written by one process to
memory will be decrypted by the other process using a key different from the
one that was used to encrypt the block in the first place. XOM does not address
this critical system level issue, making it incompatible with modern systems,
which make heavy use of IPC. A possible solution to make XOM support IPC
could be to have a single process use two keys, one for local data and another for
shared data. The shared key could be common for all processes, for example, an
additional key burnt on-chip for all the processes to use for encrypting shared
data. A more complex but plausible solution is to have the processes sharing
memory agree upon a shared key to be used for accesses to shared memory. The
process continues to use the session key for all its local data and code, however
for shared data the process uses the shared key either burnt on-chip or agreed
upon by the two processes as discussed above.
Additional hardware requirements XOM needs a cryptographic engine on-
chip which is designed to encrypt the data blocks leaving the processor boundary
and decrypt them when they are brought back on-chip as a result of a demand
miss or a prefetch request. The on-chip cryptographic engine should have im-
plementations of a symmetric key cryptography algorithm such as AES [30] and
an integrity verification algorithm such as SHA [31].

Performance Overheads The overheads of a XOM processor over a base pro-
cessor with no protection employed can be very high, slowing down applications
up to 35% even when using a very simple encryption algorithm that takes only
48 cycles to decrypt a cache line [22]. This is because the XOM decryption
path lies on the critical path. Once the block is brought on-chip, it needs
to be decrypted before the processor can use it. This directly adds to the al-
ready long memory latency, making applications suffer high overheads. Thus in
addition to the security and system level issues, performance overheads limit the
applicability of XOM to modern systems.

4.2   Counter Mode Encryption and Merkle Tree Based Authentication

Researchers have proposed counter mode encryption [15–18, 20–23] and Merkle
tree based authentication [11] to overcome some of the drawbacks of XOM both
in terms of security and performance. In counter mode encryption, the crypto-
graphic latency is decoupled from the data itself. A per-block seed is encrypted
to generate a cryptographic pad which is then XORed with the data block itself
to encrypt or decrypt it. Hence, the pad generation is overlapped with the la-
tency of fetching the block from memory. Counter-mode encryption/decryption
is summarized in the following equations:
                        Ctext = Ptext ⊕ AES_K(seed)                          (1)
                        Ptext = Ctext ⊕ AES_K(seed)                          (2)
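A minimal sketch of this counter-mode scheme follows, with SHA-256 standing in for the AES pad generator; the seed layout (address concatenated with a per-block counter) and all names are illustrative, not any paper's exact format.

```python
import hashlib

KEY = b"per-process encryption key"   # illustrative

def pad(seed: bytes, length: int) -> bytes:
    """Cryptographic pad AES_K(seed); SHA-256 is used here only as a
    stand-in keystream for illustration."""
    out, i = b"", 0
    while len(out) < length:
        out += hashlib.sha256(KEY + seed + i.to_bytes(4, "big")).digest()
        i += 1
    return out[:length]

def seed_for(address: int, counter: int) -> bytes:
    # Spatial uniqueness from the block address, temporal from the counter.
    return address.to_bytes(8, "big") + counter.to_bytes(8, "big")

def crypt(data: bytes, address: int, counter: int) -> bytes:
    # The same operation encrypts and decrypts: data XOR pad(seed).
    p = pad(seed_for(address, counter), len(data))
    return bytes(x ^ y for x, y in zip(data, p))

plain = b"64-byte cache line contents....."
ctext = crypt(plain, 0x1000, counter=5)
# The pad depends only on (address, counter), both available from the
# on-chip counter cache, so it can be computed while the block is fetched.
assert crypt(ctext, 0x1000, counter=5) == plain
# A write-back increments the counter, so re-encryption of the same
# plaintext produces a different ciphertext:
assert crypt(plain, 0x1000, counter=6) != ctext
```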
    XOM integrity protection could successfully defend against splicing and spoof-
ing attacks but was vulnerable to replay attacks. Merkle trees were proposed to
provide protection against replay attacks as well. In a Merkle tree based authen-
tication mechanism, a tree of MACs is built over the memory. When a block
is brought on-chip, its integrity is verified against the tree of MACs up to the
root. The root of the tree is stored securely on-chip in a register and never goes

Software Distribution Model These architectures do not explicitly specify
a software distribution model and assume a model similar to that of XOM.
Security and System Issues In terms of security, encryption defends against
passive attacks as discussed earlier. It also avoids the attacks possible on
direct encryption that result from the statistical distribution of ciphertexts match-
ing that of the plaintexts. Merkle tree based integrity verification defends against
splicing, spoofing and replay attacks successfully. Hence, in terms of security
this architecture provides defense against all the major categories of hardware
attacks considered in this paper.
    In order for counter mode encryption to be secure, the per-block seed needs
to be unique, both spatially and temporally. Spatial uniqueness is obtained by
including the block address (physical or virtual) as a seed component. Temporal
uniqueness on the other hand is obtained by associating a per-block counter
which is incremented each time the block is written back to memory. To hide
the latencies, the counters are cached in an on-chip cache called the counter
cache. Hence, the seed has address (virtual or physical) and a per-block counter
as two of its important components for security. However, including the address
as a seed component results in critical system level and security issues.
    If the virtual address is used for spatial uniqueness, IPC cannot be supported.
Processes sharing memory may use different virtual addresses for the shared
region and hence encrypt/decrypt the same block differently. In addition, some
important OS optimizations based on page sharing cannot be used. For example,
process forking cannot utilize the copy-on-write optimization, as the shared
pages would be encrypted differently in the parent and child processes. This lack
of IPC support is particularly problematic in the era of CMPs. Finally, using virtual
addresses as part of the seed also results in a security hole as different processes
can have the same virtual addresses, resulting in potential reuse of pads across
processes. On the other hand using physical address for spatial uniqueness re-
sults in extra re-encryption work on page swapping. A page swapped out to the
disk can be reloaded to a different physical address. Using physical address for
seed composition will require the page swapped in to be decrypted using the old
physical addresses and re-encrypted using the new physical addresses where it is
now loaded to. This adds unnecessary complexity to the hardware. The oper-
ating system could perform the re-encryption instead, but then the hardware-based
security of the architecture is contingent on an uncompromised OS.
    Merkle tree verification results in significant memory overhead (nearly 33%)
for storing the MAC nodes. This high overhead is likely to be unacceptable for
most systems and in particular for embedded systems having limited memory,
this scheme is not practical. This overhead is essentially twice that of XOM
integrity protection which uses a per-block MAC instead of a tree of MACs.
Hence, despite solving some of the security and performance issues of XOM,
this architecture introduces new system level issues that limit its applicability
to modern systems.

Additional hardware requirements This architecture requires an on-chip
cache (counter cache) for storing the per-block counters in addition to an on-
chip cryptographic engine with similar capabilities to the one required by XOM.
Performance Overheads We used SESC [32], an open-source execution-driven
simulator, to model the architectures analyzed in this paper. The architectural
parameters used are summarized in Table 1.

                       Table 1. Secure processor Parameters

                   Processor      2GHz, 3 issue, out-of-order
                  L1 D-cache      32KB, 2-way, Write-through, 2-cycle, 64-
                                  byte block, LRU
                  L1 I-cache      32KB, 2-way, 2-cycle, 64-byte block, LRU
                   L2 cache       1MB, unified, 8-way, Write-back, 10-cycle,
                                  64-byte block, LRU
                Main Memory       1GB, 200-cycle
             Cryptographic Engine Encryption/Decryption: Counter mode
                                  encryption 128-bit AES Engine, 16-stage
                                  pipeline, 80-cycle
                                  Authentication: HMAC [33] based on
                                  SHA-1 [31], 80-cycle [34], 128-bit authen-
                                  tication code
                                   Counter Cache: 32KB, 16-way, 10-cycle,
                                   64-byte block, LRU

    We use 21 C/C++ SPEC 2000 benchmarks [35] for our evaluation. For each
benchmark we use the reference input set and simulate the benchmark for one
billion instructions after skipping 5 billion. We use timely but non-precise in-
tegrity verification, i.e., each block is verified immediately when it is brought
on-chip, but we do not delay the retirement of the instruction that brings the
block on-chip if the verification has not yet completed. Figure 2 shows the perfor-
mance overheads of counter mode encryption with Merkle tree authentication.
The individual results are shown only for benchmarks that suffer an L2 miss
rate of more than 20% but the average is calculated across all 21 benchmarks.
As expected the overheads are much lower than XOM as now the cryptographic
latencies involved in pad generation are overlapped with the latency of fetching
the block from memory. However, the overheads are still substantial (mcf: 35.2%,
art: 74.4% and swim: 62.7%) and coupled with the system and security issues,
make this architecture impractical for implementation.

4.3   AISE and BMT

Address Independent Seed Encryption (AISE) and Bonsai Merkle Tree (BMT) [27,
36] based architecture aims to resolve the security, system and performance
related issues of currently proposed secure processor architectures. The work in
[27, 36] proposes a novel seed formation for encryption. The main reason for the
system and security issues with prior approaches is that address, a basic unit
of memory management, is used for security, something it was not intended to
be used for. AISE proposes a novel seed composition where the seed does not
use address for spatial uniqueness. Temporal uniqueness is still achieved using a
per-block counter which is incremented every time the block is written back to
memory. However, for spatial uniqueness, the concept of Logical Page Identifiers
(LPID) is introduced. Each physical page (frame) is assigned an LPID from an
on-chip non-volatile counter, the Global Page Counter (GPC), when it is first
loaded into the main memory from the disk. The seed is hence a concatenation
of the block counter (temporal uniqueness) and the LPID (spatial uniqueness).
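The LPID-based seed composition can be sketched as follows. This is a toy model: the GPC is a Python counter and the LPID table a dictionary, and the seed widths are illustrative.

```python
import itertools

global_page_counter = itertools.count(1)   # models the on-chip non-volatile GPC
page_lpid = {}                             # LPID assigned per physical frame

def lpid_for(frame: int) -> int:
    """Assign a fresh LPID from the GPC the first time a frame is loaded
    from disk; LPIDs are never reused."""
    if frame not in page_lpid:
        page_lpid[frame] = next(global_page_counter)
    return page_lpid[frame]

def aise_seed(frame: int, block_counter: int) -> bytes:
    """Seed = LPID (spatial uniqueness) || per-block counter (temporal)."""
    return lpid_for(frame).to_bytes(8, "big") + block_counter.to_bytes(8, "big")

# Distinct pages never share a seed, even with equal block counters:
assert aise_seed(0, 7) != aise_seed(1, 7)
# A write-back increments the block counter, changing that block's seed:
assert aise_seed(0, 8) != aise_seed(0, 7)
```

Because the seed contains no address, remapping a page on a swap or sharing it between processes does not change how its blocks are encrypted, which is what removes the system-level problems above.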

    For integrity verification, Bonsai Merkle Trees are proposed which are a re-
duced version of Merkle trees. The authors make the observation that Merkle
trees were first introduced to defend against replay attacks. A MAC associated
with each data block is strong enough to protect against splicing and spoofing
attacks. The authors show that if the LPID and the block counter can be guar-
anteed to be fresh, that is, not replayed, then the data itself is fresh. Hence, having a
Merkle tree built over the counters (LPIDs and block counters) is sufficient to de-
fend against replay attacks. Along with this, a per block MAC was kept to defend
against splicing and spoofing attacks. Figure 1 compares the two organizations
of Merkle trees. This novel organization of the Merkle tree was termed Bonsai
Merkle Trees, due to the significantly smaller region of memory (the counter
space) over which the Merkle tree is built to provide the same protection as that
afforded by standard Merkle trees.

                             (a) Standard Merkle Tree

                              (b) Bonsai Merkle Tree

              Fig. 1. Standard Merkle Trees vs Bonsai Merkle Trees.
   The previously proposed schemes do not provide protection for data swapped
out to disk, providing the attackers with an easy target to bypass any security
offered by these architectures. An attacker can change the contents of a page
swapped out to the disk which will go undetected in these schemes. In [27],
the Merkle tree was extended to provide protection to the pages swapped out
to the disk. When a page is swapped out, the root of the mini Merkle tree
built over the page is stored securely in a reserved portion of memory called
the page root directory. This page root directory automatically comes under the
protection of the Merkle tree built over the memory. Any modifications to the
pages swapped out to the swap space will be detected as an integrity violation.
This was proposed as an extension to the standard Merkle tree that is built over
the entire memory.

Software Distribution Model The processor is equipped with a per-processor
key, burnt on-chip at manufacture time. Software vendors are required to en-
crypt the software using the public key corresponding to the processor's on-chip
private key, following a distribution model similar to that of XOM.

Security and System Issues AISE and BMT afford the same security as the
counter mode encryption and Merkle tree scheme while also avoiding the pad
reuse problems. The LPIDs are never re-used and the GPC is stored in non-
volatile memory; hence, pad uniqueness is maintained across system reboots.
A 64-bit counter is used for the GPC, which is enough to last millennia. This
scheme also extends protection to the disk, which was not provided by earlier
schemes.
    Along with closing the security loopholes of earlier schemes, AISE re-
solves critical system level issues, making a secure processor implementation
practical. AISE supports IPC and does not introduce any complexity on page
swapping, unlike previous counter mode schemes, which did not support IPC
and/or introduced considerable complexity on page swapping. It also reduces
the memory overhead from 33% in a scheme utilizing standard Merkle trees
to 25%.
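The overhead figures can be reproduced with a back-of-the-envelope calculation, assuming 64-byte blocks and 128-bit (16-byte) MACs, i.e. a 4-ary MAC tree; these parameters are an assumption chosen here because they match the percentages quoted above.

```python
# Assumed parameters (illustrative): 64-byte memory blocks, 16-byte MACs.
block, mac = 64, 16

# Standard Merkle tree: each level adds mac/block of the level below it,
# giving the geometric series 1/4 + 1/16 + ... = 1/3 of memory.
standard = (mac / block) / (1 - mac / block)

# Bonsai Merkle tree: one MAC per data block (1/4 of memory); the tree
# over the much smaller counter space is neglected here.
bonsai = mac / block

print(f"standard Merkle tree ~{standard:.0%}, Bonsai Merkle tree ~{bonsai:.0%}")
```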

Additional hardware requirements The hardware required by AISE and
BMT scheme is the same as that required by counter mode encryption and stan-
dard Merkle tree architecture: A counter cache to store the per-block counters
and an on-chip cryptographic engine.

Performance Overheads AISE is a counter mode encryption scheme and the
overheads due to AISE alone are the same as that of a counter mode encryption
scheme. However, BMTs reduce the memory overhead due to the MACs as the
Merkle tree is now built over a much smaller portion of memory and more
importantly the cache pollution due to the Merkle tree nodes is reduced greatly,
allowing more cache space to be taken up by application blocks. Figure 2 shows
the benefits of this scheme compared to standard Merkle tree scheme. On an
average, BMTs reduce the performance overheads from 12.1% to 1.8%.
Fig. 2. Performance overhead comparison of AISE+BMT with Counter mode Encryp-
tion and standard Merkle tree authentication.

5     Analysis of Secure Multiprocessors

Shared Memory multiprocessors (SMPs) and Distributed Shared Memory multi-
processors are widely used as servers. SMP servers provide superior performance
for critical commercial and scientific applications (e.g. banking, air-ticketing
etc.). In many cases, small companies outsource their storage and processing
needs to a bigger company capable of providing the computation and storage
facilities needed. This model, known as "utility computing", has great potential;
however, security concerns have slowed its adoption [37]. Often, these servers
are used to store critical customer information, such as credit card numbers,
which can bring significant financial losses if stolen. Hence, it is important to
design a tamper-resistant environment to safeguard the applications running in a
multiprocessor environment. Multiprocessors still need the memory to processor
protection (both confidentiality and integrity) as needed by uniprocessors. How-
ever, multiprocessors add another dimension to providing a secure runtime envi-
ronment, the processor to processor communication. The processor to memory
communication can be protected using the same mechanisms as those used for
uniprocessor systems, however new mechanisms need to be devised for protect-
ing the processor to processor communication. Both SMPs and DSMs present a
new set of challenges when protecting the processor to processor communication.
We discuss one design each for a secure SMP [23] and a secure DSM [15].

5.1   Secure Shared Memory Multiprocessors

A simple way to protect the communication over the bus is to use the same
ciphertext for cache-to-cache traffic as the one stored in memory. This scheme,
however, is not secure. Consider a processor A modifying a cache block, D, with
pad, P, and corresponding ciphertext, C = D XOR P. Processor A can write
to this block as long as it is in its local cache, without updating the pad. Now,
suppose that processor B requests a read-only copy of this block. It would
be sent as D XOR P. Later, A modifies D to D' and B requests the block again.
This time it would be sent as D' XOR P: because the block was never evicted out
of the processor caches, the pad did not change. An attacker observing traffic
over the shared bus can simply XOR the two ciphertexts and obtain D XOR
D’, which can potentially lead to revealing the plaintext data. Hence, a different
encryption is needed for cache-to-cache transactions.
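This pad-reuse leak is easy to demonstrate with toy 16-byte values; the pad below is arbitrary bytes rather than a real AES pad, which changes nothing about the attack.

```python
def xor(a: bytes, b: bytes) -> bytes:
    return bytes(x ^ y for x, y in zip(a, b))

pad = bytes(range(16))           # pad P, unchanged while the block stays cached
D  = b"balance=00001000"         # first value of the cache block
D2 = b"balance=99999999"         # value after processor A modifies it

c1 = xor(D,  pad)    # first cache-to-cache transfer: D  XOR P
c2 = xor(D2, pad)    # later transfer of the modified block, same pad

# A bus snooper XORs the two observed ciphertexts; the pad cancels out:
leak = xor(c1, c2)
assert leak == xor(D, D2)
# Positions where the two plaintexts agree show up directly as zero bytes:
assert leak[:8] == b"\x00" * 8
```

The attacker learns D XOR D', which reveals exactly where and how the block changed, and with known plaintext structure can recover the values themselves.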
    In addition to encrypting bus transactions, a scheme for authenticating them
should also be in place since they can be tampered with just as the memory con-
tents. The attacks on the bus may include message reordering, message dropping
or message spoofing. We now discuss the approach proposed by Zhang et al. [23]
to provide confidentiality and integrity guarantees on a SMP machine. The so-
lution is based on the fact that in a SMP, all processors can snoop the data that
is put on the bus.

Bus Encryption Scheme Before we describe the encryption scheme, we briefly
talk about the proposed software distribution model. Each processor has a unique
private key. Similar to a uniprocessor software distribution model used in XOM,
the software is encrypted with a key k, and the key k is then encrypted with
each of the public keys of the processors of the SMP machine. The software
vendors may choose to allow only a subset of the processors on the SMP to run
the software by not encrypting the symmetric key k with the corresponding
public key of the processors to be excluded.
    The proposed algorithm uses both counter-mode encryption and a symmetric
block cipher in Cipher Block Chaining (CBC) mode. Counter-mode encryption
is used to provide fast encryption/decryption capabilities and AES-CBC is used
for its security strength and capability in message authentication. The following
table summarizes the encryption scheme used.

                     Table 2. SMP Bus Encryption Scheme

                        AES-CBC                Bus Encryption
             Encryption c = Data ⊕ mlast       c = Data ⊕ mlast
                        m = AES(k, c)          send c
                        send m                 m = AES(k, c)
             Decryption receive m              receive c
                        c = AES−1 (k, m)       Data = c ⊕ mlast
                        Data = c ⊕ mlast       m = AES(k, c)

    In direct AES-CBC, the input at time t is the Data XORed with the cipher-
text at time t−1, mlast. The output of AES then updates mlast and is sent out
as the ciphertext of the data. If m is used as the ciphertext sent on the bus, then
we must wait for the AES encryption to finish even after the data is ready, which
adversely affects performance. Instead, for bus encryption, AES-CBC is
adapted to send c, that is, the XOR of the Data and the ciphertext at time t−1,
mlast. In the background, AES is used to update m. This adds only a one-cycle
delay (for the XOR operation) before the data can be sent on the bus. Similar
reasoning applies on the decryption side.
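The latency-hiding adaptation can be sketched as follows. This is a toy model: a truncated keyed hash stands in for the AES block cipher, and the `BusEndpoint` class and its method names are hypothetical.

```python
import hashlib

BLOCK = 16

def prf(key: bytes, block: bytes) -> bytes:
    # Stand-in for the AES block cipher: a keyed hash truncated to one block.
    return hashlib.sha256(key + block).digest()[:BLOCK]

def xor(a: bytes, b: bytes) -> bytes:
    return bytes(x ^ y for x, y in zip(a, b))

class BusEndpoint:
    def __init__(self, key: bytes, iv: bytes):
        self.key = key
        self.m_last = iv  # chained value, kept synchronized at both endpoints

    def encrypt(self, data: bytes) -> bytes:
        c = xor(data, self.m_last)      # one-cycle XOR; c goes on the bus now
        self.m_last = prf(self.key, c)  # "AES" runs in the background
        return c

    def decrypt(self, c: bytes) -> bytes:
        data = xor(c, self.m_last)
        self.m_last = prf(self.key, c)
        return data

key, iv = b"k" * BLOCK, b"\x00" * BLOCK
sender, receiver = BusEndpoint(key, iv), BusEndpoint(key, iv)
for msg in (b"cache block 0..", b"cache block 1.."):
    msg = msg.ljust(BLOCK, b".")
    assert receiver.decrypt(sender.encrypt(msg)) == msg
```

Because both endpoints update `m_last` from the same c, the chain stays synchronized without any extra bus traffic.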

Bus Authentication Scheme Using CBC mode of a block cipher has the
advantage that it can be used to generate a MAC to authenticate bus transfers.
The principle is that if a message has t blocks, divided according to the
underlying block cipher's input width, then its MAC is the s-bit prefix of mac,
    mac = AES(k, ... AES(k, AES(k, block1) XOR block2) ... XOR blockt)
    If each bus transfer is treated as a block and a fixed number of bus transfers
are treated as a big message, the mac can authenticate the entire sequence of
bus transfers in one shot. In other words, mac represents the entire history of
messages up to time t.
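This chained construction can be sketched in Python. Again a toy: the keyed hash `prf` stands in for AES(k, ·), and the prefix length s is expressed in bytes.

```python
import hashlib

BLOCK = 16  # block cipher input width in bytes

def prf(key: bytes, block: bytes) -> bytes:
    # Stand-in for AES(k, .): a keyed hash truncated to the block width.
    return hashlib.sha256(key + block).digest()[:BLOCK]

def xor(a: bytes, b: bytes) -> bytes:
    return bytes(x ^ y for x, y in zip(a, b))

def cbc_mac(key: bytes, blocks: list, s: int = 8) -> bytes:
    # mac = AES(k, ... AES(k, AES(k, block1) XOR block2) ... XOR block_t),
    # truncated to an s-byte prefix; mac covers the whole transfer history.
    state = prf(key, blocks[0])
    for b in blocks[1:]:
        state = prf(key, xor(state, b))
    return state[:s]

transfers = [bytes([i]) * BLOCK for i in range(4)]  # four bus transfers
mac = cbc_mac(b"k" * BLOCK, transfers)
# Reordering (or tampering with) any transfer changes the MAC:
assert cbc_mac(b"k" * BLOCK, transfers[::-1]) != mac
```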
    Counter mode encryption with Merkle tree based authentication was used to
protect the processor to memory traffic. The combined overhead of protecting the
processor to memory and processor to processor traffic was reported to be 2.03%
on average across five benchmarks chosen from the SPLASH-2 benchmark suite.

5.2   Distributed Shared Memory Multiprocessors
DSMs present security architects with a new set of design challenges. Our dis-
cussion in this section is based on the first attempt to provide a secure DSM
system by Rogers et al. [15]. As discussed in the previous section, multiproces-
sor systems add a new dimension to protecting confidentiality and integrity of
data: processor to processor communication. A simple but naive solution would
be to extend the uniprocessor protection mechanisms to a DSM system, that
is, to use the same set of counters that protect processor to memory communi-
cation to also protect processor to processor communication. However, this
proves to be prohibitively slow. Assume processor A has a line in the dirty state
and processor B requests a read-only copy of the line. This appears as an
intervention request to processor A. To send the data, the block needs to be
encrypted, which requires generating a cryptographic pad; this is done by
incrementing the counter associated with the block and using it to generate the
pad. This in turn invalidates the block counter at the receiving
processor. When the encrypted block is sent to processor B, B needs the updated
block counter, so it must send a read request for the counter to processor
A. Hence, the latency of processor to processor communication is effectively
doubled. In addition, the counter communication needs to be protected against
tampering as well, so it requires high latency authentication. Similar difficulties
exist with maintaining the coherency among nodes in the Merkle tree.
    Also the protection scheme used for a SMP system cannot be used for a DSM
system. This is due to the fact that in an SMP system, all the nodes can observe
each of the coherence transactions due to the shared bus. However, ensuring that
each node in a DSM system observes each of the coherence transactions would
be prohibitively expensive or even infeasible. Hence, new mechanisms need to be
designed for protecting processor to processor communication in a DSM system.

Processor to Processor Encryption Scheme As in other secure architec-
tures, counter mode encryption is used due to its capability to hide cryptographic
latencies without compromising security. However, for counter mode encryption
to work, the communicating processors should agree on a common stream of
counter values, so that the cryptographic pads can be pre-generated before the
data is ready to be sent or received. The cryptographic pads should be glob-
ally unique. Hence, the seed used to generate the pad is a concatenation of
the communication counter (Ctr), the sending processor ID (IDs), the receiving
processor ID (IDr), and an arbitrary Encryption Initialization Vector (EIV). The
pads are thus unique for every communicating processor pair, even when the
sending and receiving processors are switched. The communication counter is in-
cremented for each message sent or received, hence the pad is also unique across
messages communicated by a processor pair. Also, the seed composition is inde-
pendent of the data itself and hence the pads can be pre-generated independent
of application data.
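The seed composition can be sketched as follows (a keyed hash stands in for the AES pad generation, and the field widths are illustrative):

```python
import hashlib
import struct

def pad_for(key: bytes, ctr: int, id_s: int, id_r: int, eiv: bytes) -> bytes:
    # Seed = Ctr || IDs || IDr || EIV; the pad is a PRF of the seed and is
    # independent of the data, so it can be generated ahead of time.
    seed = struct.pack(">QHH", ctr, id_s, id_r) + eiv
    return hashlib.sha256(key + seed).digest()

key, eiv = b"pair-key", b"EIV0"
# Unique across messages (counter) and across direction (IDs/IDr swapped):
assert pad_for(key, 1, 0, 1, eiv) != pad_for(key, 2, 0, 1, eiv)
assert pad_for(key, 1, 0, 1, eiv) != pad_for(key, 1, 1, 0, eiv)
```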
    The key issue for the encryption mechanisms to work well is the management
of communication counters and their pre-generated pads so that the counters are
synchronized at the sender and the receiver side. The work proposed three novel
counter management techniques: Private, Shared, and Cached Counter Stream.
We now briefly discuss each of the counter management schemes.
Private Counter Stream: This is the most intuitive scheme, where each pro-
cessor maintains a separate counter stream for each of the processors in the
system. Each processor maintains two separate tables, a send table and a receive
table. Each table has P-1 entries, where P denotes the total number of processors
in the system. Separate send and receive tables are needed because when two
communicating processors switch the direction of transfer, the order of concate-
nation of their processor IDs is different, and thus the pre-generated pads are
different. The counters between a processor pair are kept synchronized by having
each of the communicating processor increment the corresponding counter each
time that counter’s pad is used to encrypt or decrypt data. Each send or receive
table entry contains a 64-bit counter value, a 512-bit pre-generated encryption
pad (assuming 64-byte cache block size), a 128-bit pre-generated authentication
pad (assuming 128-bit MACs, discussed later), and a valid bit indicating that
the pad has been generated but not yet consumed. Since the generation of the
cryptographic pad is independent of the data, the encryption pads are generated
whenever the counter is updated and there are free AES engine cycles available.
If a data block finds its encryption and authentication pads ready (valid bit
set to 1) when it needs to be sent over the network to another processor, the
cryptographic latencies can be completely hidden (barring the one-cycle XOR).
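A send/receive table entry for the Private scheme can be sketched as follows. This is a toy model: the field widths follow the text, `shake_256` stands in for the AES pad generation, and the function names are hypothetical.

```python
import hashlib
import struct
from dataclasses import dataclass

PAD = 64      # 512-bit encryption pad (64-byte cache block)
MAC_PAD = 16  # 128-bit authentication pad

@dataclass
class StreamEntry:
    counter: int = 0
    enc_pad: bytes = b""
    auth_pad: bytes = b""
    valid: bool = False  # pad generated but not yet consumed

def pregenerate(e: StreamEntry, key: bytes, id_s: int, id_r: int) -> None:
    # Runs whenever the AES engine has free cycles; SHAKE stands in for
    # the AES pad generation from the (counter, IDs, IDr) seed.
    seed = struct.pack(">QHH", e.counter, id_s, id_r)
    stream = hashlib.shake_256(key + seed).digest(PAD + MAC_PAD)
    e.enc_pad, e.auth_pad, e.valid = stream[:PAD], stream[PAD:], True

def send_block(e: StreamEntry, block: bytes) -> bytes:
    assert e.valid  # pad hit: only the XOR is on the critical path
    cipher = bytes(x ^ y for x, y in zip(block, e.enc_pad))
    e.counter += 1   # both sides advance the counter after each use
    e.valid = False
    return cipher

entry = StreamEntry()
pregenerate(entry, b"pair-key", id_s=0, id_r=1)
c = send_block(entry, b"A" * 64)
assert len(c) == 64 and entry.counter == 1 and not entry.valid
```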
Shared Counter Stream: The Private scheme is simple and has almost perfect
pad hit rates (the fraction of times the cryptographic pads are found ready in the
send/receive tables to encrypt/decrypt a block); however, the storage overheads
of the Private scheme are non-trivial for a large DSM system. The Shared orga-
nization tries to reduce the storage overheads by replacing the send table with a
single counter and pad for sending data messages to any processor. This shared
counter is incremented after each sent message, to guarantee uniqueness. To pre-
generate pads that the sender can use to send a message to any of the processors
in the system, the receiving processor’s ID is not included in the seed. The pads
are still unique, as the processor updates its sending counter after sending each
message.
    As in Private, upon receiving a data message, the receiving processor ac-
cesses the entry corresponding to the sending processor and checks whether a
pre-generated pad is available for that sender. However, since the sending pro-
cessor uses the same counter to send messages to all processors, it is less likely
for a receiving processor to see back-to-back messages with contiguous counter
values. Non-contiguous counters occur when a processor receives a message from
a particular sender whose previous message was sent to a different processor.
As a result, the receiver suffers a higher pad miss rate and the full decryption
and MAC generation latencies.
Cached Counter Stream: The Private scheme achieves high performance due
to a low pad miss rate but needs larger storage, while Shared sacrifices perfor-
mance for lower storage overhead. However, neither Private nor Shared is scal-
able, in the sense that the tables are designed to support only a fixed number
of processors. Unless the tables are very large, they prevent DSMs from scaling
to larger numbers of processors. The Cached Counter scheme was proposed to
address the scalability issues of the other two schemes.
    The Cached scheme can scale to an arbitrary number of processors with fixed
send and receive table sizes, while still providing good performance. The intuition
behind its design is that processors in a DSM system often communicate with
a set of neighbors that is much smaller than the total number of processors in
the system. Therefore, the table size of each processor’s send and receive table
can be limited to some number of entries that is a fraction of the number of
processors in the system. This table can operate similarly to a cache, where
a send/receive to/from a processor that does not have an entry in the table
will create a new entry for this processor in the table, replacing the entry that
has been unused for the longest time. However, unlike a regular cache, displaced
entries are simply discarded, instead of written back to other storage. This avoids
the need to allocate off-chip storage for these entries, which would need to be
protected against attacks with additional security mechanisms. On receiving a
message from a processor for which there is no entry in the receive table, the
counter sent with the data is used to generate the decryption pad. For sending
a message to a processor for which there is no entry in the send table, each
processor uses the maximum of the counter values, maxCtr, that it has used
across all receivers, to generate the encryption pad.
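The Cached send table behaves like a small LRU cache. The sketch below is a hypothetical simplification: on a miss it seeds the new entry from maxCtr so the counter stream stays strictly increasing for every receiver, whereas the paper uses maxCtr to generate the pad directly.

```python
from collections import OrderedDict

class CachedSendTable:
    """Fixed-size send table operating like an LRU cache: displaced
    entries are simply discarded, never written back off-chip."""

    def __init__(self, capacity: int):
        self.capacity = capacity
        self.entries = OrderedDict()  # receiver processor id -> counter
        self.max_ctr = 0              # maxCtr across all receivers

    def next_counter(self, receiver: int) -> int:
        if receiver not in self.entries:
            # Miss: start a fresh entry from maxCtr so counters never
            # repeat for this receiver, even after an earlier eviction.
            if len(self.entries) >= self.capacity:
                self.entries.popitem(last=False)  # discard the LRU entry
            self.entries[receiver] = self.max_ctr
        self.entries.move_to_end(receiver)        # mark most recently used
        self.entries[receiver] += 1
        self.max_ctr = max(self.max_ctr, self.entries[receiver])
        return self.entries[receiver]

table = CachedSendTable(capacity=2)
c1 = table.next_counter(receiver=5)
c2 = table.next_counter(receiver=5)
assert c2 == c1 + 1                 # contiguous counters for a hot neighbor
table.next_counter(receiver=6)
table.next_counter(receiver=7)      # evicts receiver 5's entry
assert 5 not in table.entries
```

The capacity is fixed regardless of system size, which is what lets the scheme scale to an arbitrary number of processors.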

Authentication Scheme To hide authentication latency, a combined counter
mode encryption/authentication scheme called Galois/Counter Mode (GCM) is
used. GCM has three key benefits: first, it is cryptographically as secure as
AES; second, it utilizes the same hardware as the underlying AES encryption
algorithm; and finally, since GCM uses counters, the authentication pad can be
pre-generated whenever the counter value is known, hiding most of the authen-
tication latency. The only exposed latency is the GHASH computation, a short
chain of Galois field multiplications and XOR operations, each of which takes
one cycle.
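The GHASH chain can be sketched bit-by-bit in Python. This is a toy model of the GF(2^128) arithmetic, far slower than the single-cycle hardware multipliers described above; in real GCM the hash key H is derived from the AES key.

```python
# Reduction polynomial for GF(2^128): x^128 + x^7 + x^2 + x + 1,
# in GCM's bit-reflected representation (per NIST SP 800-38D).
R = 0xE1000000000000000000000000000000
ONE = 1 << 127  # multiplicative identity in this representation

def gf_mul(x: int, y: int) -> int:
    # Shift-and-add multiplication in GF(2^128).
    z, v = 0, x
    for i in range(127, -1, -1):
        if (y >> i) & 1:
            z ^= v
        v = (v >> 1) ^ (R if v & 1 else 0)
    return z

def ghash(h: int, blocks: list) -> int:
    # GHASH is a short chain of GF multiplications and XORs:
    #   y_i = (y_{i-1} XOR block_i) * H
    y = 0
    for b in blocks:
        y = gf_mul(y ^ b, h)
    return y

assert gf_mul(ONE, 12345) == 12345      # identity element
assert gf_mul(3, 7) == gf_mul(7, 3)     # field multiplication is commutative
assert ghash(7, [11]) == gf_mul(11, 7)  # single-block chain
```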
    To ensure the security of data messages, it is necessary to protect all parts of
the message, including the headers. Thus the MAC must cover the counter, the
processor IDs of the sender and the receiver, the address, and the type of the
coherence message. This ensures that tampering with any part of the message
results in an integrity verification failure. The authentication seed is chosen to
be the concatenation of the
counter, sender ID, receiver ID and an arbitrary Authentication Initialization
Vector (AIV). In GCM, additional authenticated data (data that is authenti-
cated but not encrypted) can be supplied along with the data ciphertext to
the GHASH function. This is leveraged to protect message header information
such as the address and message type. This protects against spoofing and splic-
ing attacks; however, replay attacks are still possible. One way
to detect replays is to note that a counter’s value is monotonically increasing.
A replayed message is detected when the received message’s counter is smaller
than the counter stored in the receive table. However, messages can be delivered
out of order and using this scheme may result in false positives. [15] discusses
one possible way to distinguish replayed messages from messages delivered out
of order.
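A minimal sketch of the counter-based replay check follows. The `<=` comparison assumes the receive table stores the last-seen counter; distinguishing true replays from out-of-order delivery needs the extra bookkeeping discussed in [15].

```python
def is_replay(received_ctr: int, last_seen_ctr: int) -> bool:
    # Counters only move forward, so a counter at or below the last-seen
    # value is flagged. Out-of-order delivery can cause false positives.
    return received_ctr <= last_seen_ctr

last_seen = 0
for ctr in (1, 2, 3):           # messages arriving in order pass the check
    assert not is_replay(ctr, last_seen)
    last_seen = max(last_seen, ctr)
assert is_replay(2, last_seen)  # the message with counter 2, seen again, is flagged
```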

6     Future Directions in Secure Processor Research

Despite the evolution of secure processor architectures from XOM to AISE, two
of the most important security challenges still remain: Secure Booting and Secure
Configuration. This is largely because most proposed architectures assumed
that the system has booted up securely, by including the booting process in
the Trusted Computing Base (TCB), and only tried to protect the communi-
cation between the processor and main memory. These assumptions do not
hold, as a system can be compromised before it has booted up, that is, before
the steady-state security of the secure processor even comes into play. We now
discuss each of these security issues and identify the main challenges in solving
them.

6.1   Secure Booting

The secure processor architectures proposed so far assume that the system has
booted up securely and aim to protect the steady state operation of the system.
This is done by including the booting process in the TCB. However, an attack
conducted at boot time before the secure processor mechanisms can start pro-
tecting the system will go undetected and the security mechanisms will work to
protect an already compromised system. The feasibility and ease of boot time
attacks are shown by the proliferation of modchips for gaming systems. In gaming
systems, the BIOS is used to enforce DRM. The game manufacturers sign the
game DVDs, and the BIOS is responsible for checking the signature on the DVD
before allowing the game to play. This prevents game users from playing pirated
DVDs or any software not produced by the game manufacturer. A modchip is a
device that mimics the original BIOS during system initialization and boot time.
However, the BIOS code given to the processor off the modchip is the original
BIOS code minus the code to verify digital signatures of game software. This
allows users to bypass the DRM measures taken by the game manufacturers
and play any software on the system, including pirated games, music, etc. In this
scenario, the user of the system is the attacker, and the attack goes undetected
because the secure processor mechanisms are effective only after the system
has booted up.
   Besides modchip attacks in gaming systems, PandaLabs security report [38]
identifies rootkits in the Master Boot Record (MBR) as one of the top security
threats. This attack goes undetected because current systems do not check the
MBR before passing control to it. Besides the MBR, other boot components such as the
PCI cards can also be targeted for sophisticated attacks. [39] demonstrates an
attack which uses the PCI expansion ROM for persisting rootkit code. Finally,
secure booting is an important issue for mobile and embedded devices as well.
    Despite the obvious and growing need for it, there has been very little work on
secure booting. Arbaugh et al. [40] proposed a chained integrity verification ap-
proach, wherein each boot component verifies the next component in the boot
chain before passing control to it. For example, the BIOS, before passing control
to the expansion ROMs, checks that the expansion ROMs have not been
compromised. This is done by verifying the signature on the expansion ROM.
In a similar way each boot component verifies the next layer and builds up
trust before passing control to the Operating System. The Trusted Computing
Group (TCG) [41] proposed a similar approach wherein a measurement of each
of the components is extended into what are called the Platform Configuration
Registers (PCRs), which can later be provided to a remote verifier that main-
tains measurements of unmodified components. This process is known as Remote
Attestation.
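The chained-verification idea can be sketched as follows (a simplification: a digest comparison stands in for signature verification, and the component names are illustrative):

```python
import hashlib

def verify_next(image: bytes, expected_digest: str) -> bool:
    # Each boot stage hashes the next component and compares it against a
    # known-good digest before jumping to it.
    return hashlib.sha256(image).hexdigest() == expected_digest

boot_chain = [b"BIOS", b"expansion ROM", b"MBR", b"OS loader"]
known_good = [hashlib.sha256(c).hexdigest() for c in boot_chain]

# Chained verification: stage i verifies stage i+1 before passing control.
assert all(verify_next(c, d) for c, d in zip(boot_chain, known_good))

# A modchip-style swap of one component breaks the chain at that link:
assert not verify_next(b"patched ROM", known_good[1])
```

The security of the whole chain rests on the first link, which is exactly why the location of the root of trust matters.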
    The fundamental flaw with the above two approaches is the fact that the
root of trust for the chained integrity verification lies in an off-chip component,
the BIOS. Arbaugh et al. use a portion of the BIOS as the root of trust. The
security of this scheme can easily be compromised by a hardware attack involving
the replacement of the BIOS, such as the modchip attack. An upgrade to a
BIOS that does not perform integrity checks can similarly subvert the security
of this approach. TCG utilizes the BIOS boot block as the root of trust. This is a
component of the BIOS that does not change across BIOS upgrades. However,
this scheme is still vulnerable to hardware attacks involving BIOS replacement.
Finally, the use of TCG is optional, and in scenarios such as modchip attacks,
where the user of the system is the attacker, the user can simply opt not
to use TCG and conduct the attack successfully.
    There have also been proposed solutions that do assume physical attacks
on off-chip components are possible. For example, in ARM TrustZone [42], a set
of security extensions to the ARM architecture, the entire boot code is moved on
chip, eliminating the possibility of physical tampering with the boot code.
Additionally, any application that needs to be secured must be kept on-chip
in a secure RAM. While this solution is secure against the physical attacks we
consider, it is too heavyweight for use in many systems. Especially for more
general-purpose systems such as PCs and gaming systems, the cost in terms of
die area of keeping the entire BIOS on-chip (which can be as large as 1MB) as
well as the code and data of applications (possibly several MBs) is prohibitive.
For this wide range of systems, a more lightweight and robust solution is needed.
    It is clear that the root of trust for a secure bootstrapping solution should
be located on-chip to provide strong security. However, an on-chip root of trust
does not alone guarantee secure bootstrapping. We find that a class of attacks,
which we classify as a type of Time-Of-Check-To-Time-Of-Use (TOCTTOU)
attacks, is still possible. The main idea is that after a boot component has been
verified, an attacker with physical access to the system can tamper with the
component before it is actually fetched and executed by the processor. Hence,
the processor will now execute an image that is different from the image that
was originally verified. Therefore, as another requirement, a secure bootstrapping
mechanism should enforce that a boot component cannot be modified (without
detection) after it is initially verified during secure boot and before it is executed
by the processor. AEGIS and TCG do not meet this requirement and hence are
susceptible to such TOCTTOU attacks. ARM TrustZone, on the other hand,
meets this requirement since verified components never leave the processor chip,
however again this is an expensive solution that is infeasible for many systems.
As a result, the currently existing solutions to secure booting are inappropriate
for implementation on secure processors.
    Hence, for a secure booting mechanism to be truly secure and practical for
implementation in real systems, it must meet three requirements. First, the root
of trust should be located on-chip. Second, the on-chip overhead added by the
root of trust should be small enough for implementation in real systems. Finally,
the solution should provide an effective defense against TOCTTOU attacks
on the boot components. We do not present a complete solution to this problem
but believe that the discussion above can provide valuable direction to security
researchers.

6.2   Secure Configuration
The current proposals on secure processor architectures have focused on protect-
ing the communication between the processor and the main memory. I/O with
other devices in the system has largely been ignored. For example, the processor
communicating with the network card sends the data packets unencrypted over
the bus. These packets are open to both passive and active attacks. However,
the problem of securing communication between the processor and I/O devices
poses a new set of challenges for designers. First, if the processor and the devices
are to communicate securely, they must somehow establish a shared key so that
the communication can be encrypted and protected from passive and active
attacks. This in itself presents the important problem of key provisioning, that
is, how is this shared key provisioned? Possible options include provisioning at
manufacture time or provisioning dynamically. Both options have their
own set of challenges. Provisioning at manufacture time requires the component
manufacturers to change their manufacturing process and coordinate with each
other to use the right key for communication. On the other hand provision-
ing dynamically, for example, through a CD shipped with the device leads to
other practicality questions, for example, if the user loses the CD and needs to
reconfigure the device.
    Just provisioning the keys is not sufficient; the next and more important
challenge is the requirement of cryptographic capabilities on the devices. A se-
cure processor itself can be provisioned with an on-chip cryptographic engine,
but having similar capabilities on all devices is not a practical solution.
Hence, a secure configuration scheme needs to provide some sort of central cryp-
tographic functionality which can be accessed securely by all devices to off-load
their cryptographic work. We are currently looking into secure configuration is-
sues and hope that this work will motivate other researchers to work toward
making secure processors truly secure.

7   Conclusions

In this paper we have analyzed the current state of research in secure processor
architectures. We believe that to make a system truly secure, three key compo-
nents are needed: a secure processor substrate, a secure booting mechanism, and
secure configuration. Secure processor research so far has concentrated on pro-
viding a secure processor substrate and making it practical for use in real sys-
tems. While secure processor substrates have evolved from XOM to AISE, which
addresses most of the system, security, and performance issues with XOM, there
has been little or no work on secure booting and configuration. In this paper, we
have presented the challenges involved in providing a secure boot and configu-
ration mechanism. We hope that this work will motivate security researchers to
concentrate on these issues to make secure processors truly secure and practical.

 1. Kumar, A.:       Discovering Passwords in Memory.    http://www.infosec- resources/ (2004)
 3. (2005)
11.   Gassend, B., Suh, G., Clarke, D., van Dijk, M., Devadas, S.: Caches and Hash Trees
      for Efficient Memory Integrity Verification. In: Proc. of the 9th International
      Symposium on High Performance Computer Architecture. (2003)
12.   Gilmont, T., Legat, J.D., Quisquater, J.J.: Enhancing the Security in the Memory
      Management Unit. In: Proc. of the 25th EuroMicro Conference. (1999)
13.   Lie, D., Mitchell, J., Thekkath, C., Horowitz, M.: Specifying and Verifying Hard-
      ware for Tamper-Resistant Software. In: Proc. of the 2003 IEEE Symposium on
      Security and Privacy. (2003)
14.   Lie, D., Thekkath, C., Mitchell, M., Lincoln, P., Boneh, D., Mitchell, J., Horowitz,
      M.: Architectural Support for Copy and Tamper Resistant Software. In: Proc.
      of the 9th International Conference on Architectural Support for Programming
      Languages and Operating Systems. (2000)
15.   Rogers, B., Solihin, Y., Prvulovic, M.: Efficient Data Protection for Distributed
      Shared Memory Multiprocessors. In: Proc. of the 15th International Conference
      on Parallel Architectures and Compilation Techniques. (2006)
16.   Shi, W., Lee, H.H., Ghosh, M., Lu, C.: Architectural Support for High Speed Pro-
      tection of Memory Integrity and Confidentiality in Multiprocessor Systems. In:
      Proc. of the 13th International Conference on Parallel Architectures and Compi-
      lation Techniques. (2004)
17.   Shi, W., Lee, H.H., Ghosh, M., Lu, C., Boldyreva, A.: High Efficiency Counter
      Mode Security Architecture via Prediction and Precomputation. In: Proc. of the
      32nd International Symposium on Computer Architecture. (2005)
18.   Shi, W., Lee, H.H., Lu, C., Ghosh, M.: Towards the Issues in Architectural Support
      for Protection of Software Execution. In: Proc. of the Workshop on Architectural
      Support for Security and Anti-virus. (2004)
19.   Suh, G., Clarke, D., Gassend, B., van Dijk, M., Devadas, S.: AEGIS: Architec-
      ture for Tamper-Evident and Tamper-Resistant Processing. In: Proc. of the 17th
      International Conference on Supercomputing. (2003)
20.   Suh, G., Clarke, D., Gassend, B., van Dijk, M., Devadas, S.: Efficient Memory
      Integrity Verification and Encryption for Secure Processor. In: Proc. of the 36th
      Annual International Symposium on Microarchitecture. (2003)
21.   Yan, C., Rogers, B., Englender, D., Solihin, Y., Prvulovic, M.: Improving Cost,
      Performance, and Security of Memory Encryption and Authentication. In: Proc.
      of the International Symposium on Computer Architecture. (2006)
22.   Yang, J., Zhang, Y., Gao, L.: Fast Secure Processor for Inhibiting Software Piracy
      and Tampering. In: Proc. of the 36th Annual International Symposium on Mi-
      croarchitecture. (2003)
23.   Zhang, Y., Gao, L., Yang, J., Zhang, X., Gupta, R.: SENSS: Security Enhancement
      to Symmetric Shared Memory Multiprocessors. In: Proc. of the 11th International
      Symposium on High-Performance Computer Architecture. (2005)
24.   IBM:        IBM Extends Enhanced Data Security to Consumer Elec-
      tronics Products.  
      news.20060410 security.html (April 2006)
25. Maxim/Dallas Semiconductor: DS5002FP Secure Microprocessor Chip. view2.cfm/qv pk/2949 (2007)
26. Intel: Intel Trusted Execution Technology. (May 2006)
27. Rogers, B., Chhabra, S., Solihin, Y., Prvulovic, M.: Using Address Independent
    Seed Encryption and Bonsai Merkle Trees to Make Secure Processors OS- and
    Performance-Friendly. In: Proc. of the 36th Annual International Symposium on
    Microarchitecture. (2007)
28. Huang, A.: Hacking the Xbox: An Introduction to Reverse Engineering. No Starch
    Press, San Francisco, CA (2003)
29. Huang, A.B.: The Trusted PC: Skin-Deep Security. IEEE Computer 35(10) (2002)
30. FIPS Publication 197: Specification for the Advanced Encryption Standard (AES).
    National Institute of Standards and Technology, Federal Information Processing
    Standards (2001)
31. FIPS Publication 180-1: Secure Hash Standard. National Institute of Standards
    and Technology, Federal Information Processing Standards (1995)
32. Renau, J., et al.: SESC. (2004)
33. Krawczyk, H., Bellare, M., Canetti, R.: HMAC: Keyed-Hashing for Message
    Authentication. (1997)
34. Kgil, T., Falk, L., Mudge, T.: ChipLock: Support for Secure Microarchitectures.
    In: Proc. of the Workshop on Architectural Support for Security and Anti-Virus.
    (October 2004)
35. Standard Performance Evaluation Corporation: (2004)
36. Chhabra, S., Rogers, B., Solihin, Y., Prvulovic, M.: Making secure processors os-
    and performance-friendly. ACM Transactions on Architecture and Code Optimiza-
    tion 5(4) (2009) 1–35
37. Bartholomew, D.: On Demand Computing – IT On Tap? &SectionID=4 (June 2005)
38. PandaLabs: Quarterly Report PandaLabs.
39. Heasman, J.: Implementing and Detecting a PCI Rootkit. And Detecting A PCI Rootkit.pdf (2006)
40. Arbaugh, W., Farber, D.J., Smith, J.M.: A Secure and Reliable Bootstrap Archi-
    tecture. In: Proc. 1997 IEEE Symposium on Security and Privacy. (1997)
41. TCG: TCG PC Client Specific Implementation Specification For Conventional
    BIOS. sspecs/PCClient/TCG
    PCClientImplementationforBIOS 1-20 100.pdf (April
42. ARM: ARM TrustZone. home.html
