Advanced Techniques for Automated Linux Kernel
Memory Footprint Reduction
Supervisor(s): Koen De Bosschere
Abstract— The Linux kernel is used more and more in embedded systems. However, Linux is not designed for the kind of modularity and compactness that is necessary for these systems with very constrained memories. Previous work investigated the use of link-time binary rewriting as a means of specializing Linux for a specific combination of hardware and software. This paper proposes strategies for removing much of the overhead that cannot be removed by the previous techniques. Code coverage or profiling information is used to guide the selective removal of parts of the kernel code. These parts are then stored in a compressed form that takes considerably less memory and will be reinstated whenever they are necessary.

Keywords— Link-time Optimization, Operating System Specialization, Code Compression

I. INTRODUCTION
More and more, the Linux kernel is used as the operating environment for embedded systems. In these systems, memory is typically very constrained. As opposed to more traditional embedded operating systems, the Linux kernel is not engineered from the ground up for modularity and compactness. Whereas a typical embedded OS kernel has a memory footprint of about 25 to 60 kiB, a minimal Linux kernel has a footprint of about 125 kiB. Previous work [1] investigated the use of link-time binary rewriting techniques to alleviate this overhead. Existing user-space code compaction techniques were adapted for use on system-level code. Furthermore, new transformations were proposed that allow a kernel to be specialized for a specific hardware/software combination.

However, extensive code coverage analyses show that even after applying all the aforementioned techniques, a significant portion (more than 50%) of the kernel code is still unexecuted [2]. This unexecuted code stems from two major causes. Firstly, the kernel contains a lot of code for handling exceptional situations (e.g. hardware failures). Code coverage analysis will show this code as being unexecuted, even though it is necessary for the correct functioning of the kernel. Secondly, the techniques used for compacting and specializing the kernel suffer from the inherent drawbacks of static program analysis: the limited accuracy and the need for conservativeness mean some useless code will go undetected.

In this paper, we propose two approaches for alleviating as much of the remaining overhead as possible. Guided by code coverage or profiling information, parts of the kernel code are selectively removed from memory. If it turns out that some of this code is needed after all, it is retrieved from some off-line storage and loaded into the kernel.

D. Chanet is with the Department of Electronics and Information Systems (ELIS), Ghent University (UGent), Gent, Belgium. E-mail: Dominique.Chanet@elis.ugent.be.

II. DESIGN ISSUES

The two main demands for any approach to dynamic loading of kernel code are reliability and performance.

A. Reliability

The Linux kernel is preemptively multi-threaded. As a result, the dynamic code loader should be reentrant. Furthermore, buffer management strategies should take this multi-threadedness into account. Whenever a piece of code has to be unloaded to make room for some new code, the buffer manager has to prove it is not being executed by any of the kernel's other threads. Any locking introduced by the dynamic loader code should also avoid deadlock situations and priority inversions.

B. Performance

As an OS kernel is very performance-critical, it is imperative that the overhead of on-demand code loading is limited. This has repercussions for the possible off-line storage methods for the unloaded code. For example, hard drives have too high a latency to be viable as a storage medium.

III. FROZEN CODE COMPRESSION

Our first approach uses code coverage information to identify the so-called frozen code [3] in the kernel. This is code that was never executed during the coverage analysis. The frozen code is partitioned into single-entry regions and compressed. Each compressed region is replaced by a stub that invokes the decompressor with a pointer to the compressed code. The decompressor allocates a buffer (using the kernel's standard memory allocator kmalloc), performs the decompression and overwrites the stub with an unconditional jump to the decompressed code.

Once decompressed, a region is never again removed from memory. This removes the problem of having to prove the code is not being executed while it is unloaded. A second advantage of this approach is that the performance impact of the dynamic code loading is very limited. For each invocation of a compressed region, only one decompression is needed. Further invocations suffer no delays at all.

Of course, the decision never to remove any decompressed code from memory has its drawbacks as well. For one, the upper limit of kernel memory usage has not decreased compared to the original kernel: in the worst case all code will be decompressed. However, we do not think this is a realistic situation. After all, this would mean that all useless code had already been removed from the kernel and that all exceptional situations that can possibly occur do in effect occur during a single run of the system. A second drawback of this decision is that it only makes sense to compress frozen code, instead of all cold (i.e. infrequently executed) code.
A case study involving a simple embedded web server on two different platforms (one i386 and one ARM) gave the results shown in Table I. In both cases, link-time compaction and specialization techniques were already applied to the kernel. Even so, it turns out that 56% to 61% of the remaining code is frozen. Accounting for all overhead (decompressor code and data, partitioning the frozen code and making it easily relocatable, stubs for all frozen regions), a net gain of 12.3% for i386 and 16.9% for ARM can be achieved. If the compressed code can be moved to some fast off-line storage (e.g. Flash memory), the in-memory gain rises to 53.8% and 57.2% respectively. Compression ratios for i386 and ARM are 0.73 and 0.65 respectively. Instruction compression for i386 performs worse than for ARM because the i386's CISC instruction set is designed for compact code size, leaving fewer opportunities for extra compression.

TABLE I: Results of the frozen code compression approach for an embedded web server case study for two different platforms: one i386-based and one ARM-based.

                                   i386                  ARM
initial code size                  622.14 kiB            1062.38 kiB
partitioning overhead              4.05 kiB              18.85 kiB
decompressor overhead              1.49 kiB              1.41 kiB
net gain                           (12.30%)              (16.90%)
net gain with off-line storage     334.49 kiB (53.76%)   607.35 kiB (57.17%)
IV. COLD CODE SWAPPING

The biggest problem with the frozen code compression technique is the necessity of keeping all decompressed code in memory until the next reboot, because it is very hard to determine whether code can safely be removed from memory. However, on systems with an MMU we can avoid this problem by using the processor's paging mechanism to swap code in and out of memory.

The infrequently executed code is separated from the frequently executed code and put on separate virtual memory pages. These pages are then left unmapped so that they do not use any physical memory. Whenever execution of the kernel enters one of these pages, a page fault occurs. The page fault handler is modified so that it retrieves the correct page from off-line storage and maps it into a pre-allocated buffer of physical memory pages. Removing loaded pages from memory is no problem: if the code on the page was being executed at the time of removal, a page fault will immediately occur and the code will be swapped in again. While this is costly in terms of CPU time, we do not expect this situation to occur often, as we are only swapping out infrequently executed code.

The major advantages of this approach are:
• Guaranteed upper bound on memory usage. As we can now easily remove code from memory, we can use a fixed-size buffer for all loaded code.
• This approach makes sense not only for frozen code but for cold code in general. This allows us to remove a much larger body of code from memory.
• No decompressor stubs are necessary. This amounts to a saving of 9 (for the i386) to 12 (for the ARM) bytes per entry into the infrequently executed code.

The big disadvantage of this approach is that it can potentially cause big slowdowns in the execution of the kernel if the pre-allocated buffer is not large enough and code frequently has to be swapped in and out of memory. While no guarantees on maximum slowdown can be given, we can make some assumptions that help keep the overhead low. The first assumption is that it is very unlikely that two independent pieces of infrequently executed code will have to be executed concurrently. The second assumption is that two regions of infrequently executed code are independent if all code paths between those two regions pass through frequently executed code.
With these assumptions, we group the infrequently executed regions into clusters. Two regions belong to the same cluster if there is a control flow path from one to the other that does not pass through frequently executed code. If a cluster is bigger than one page, the regions in it are spread over multiple pages in such a way that the maximum number of pages needed for any path through the cluster is minimized. This maximum number is then used as the size of the pre-allocated buffer.
Unfortunately, no results are available for this approach yet, but initial prototyping (with frozen code) shows that the maximum number of pages needed for any particular path through frozen code is 7. As such, it would be possible to swap out all frozen code and use a buffer of just 7 pages (= 28 kiB) to swap necessary code back in with minimal performance impact.

V. CONCLUSIONS

Link-time compaction and specialization of the Linux kernel cannot remove all overhead introduced by the fact that Linux is conceived as a general-purpose operating system as opposed to a modular, application-specific embedded OS. Using dynamic information (coverage and profile information) it is possible to remove some of the remaining overhead by storing the superfluous code off-line and reinstating it whenever it is needed.

ACKNOWLEDGMENTS

The research of Dominique Chanet is sponsored by the Flanders Fund for Scientific Research (FWO). This work is also supported in part by the HiPEAC European Network of Excellence.
REFERENCES
[1] D. Chanet, B. De Sutter, B. De Bus, L. Van Put, and K. De Bosschere, "System-wide compaction and specialization of the Linux kernel," in Proc. of the 2005 ACM SIGPLAN/SIGBED Conference on Languages, Compilers, and Tools for Embedded Systems (LCTES), 2005, pp. 95–104.
[2] J. Cours, N. Navarro, and W. Hwu, "Using coverage-based analysis to automate the customization of the Linux kernel for embedded applications," M.S. thesis, University of Illinois at Urbana-Champaign, 2004.
[3] D. Citron, G. Haber, and R. Levin, "Reducing program image size by extracting frozen code and data," in EMSOFT '04: Proceedings of the 4th ACM International Conference on Embedded Software, New York, NY, USA, 2004, pp. 297–305, ACM Press.