>> Kathryn McKinley: I'm Kathryn McKinley. I've been here, I'm going to have
my one-year anniversary this month, but this is my first candidate to host and
I'm really thrilled to have Jason Mars here. He's been working on data center
energy efficiency. Some of his work that we're going to hear about today has
been selected for IEEE Top Picks and has influenced the way that Google is
building their data centers, and I'm hoping to have him come here and influence
how we build our data centers.

>> Jason Mars: Thank you. I appreciate it. Hi. So today, I'm going to be
talking about a piece of work that's captured my interest for the past few
years. And it deals with the architecture of warehouse scale computers and how
we build a highly efficient design. I've been looking forward to this talk and
I'm happy to be here. So let's begin.

So the landscape of computing has been changing. Traditionally, when users
thought of computing, they thought of a desktop they go to to do some type of
activity, work or play, and then move on with their daily lives. However, now,
we're always connected with highly mobile portable devices and many of our
computation cycles live in these massive warehouse scale computers.

And as noted by Forrester Research, the cloud was a $40 billion market in 2011,
and this will grow to a $241 billion market by 2020. So this is really the
space that my work has been in recently, and a very interesting space for
computer science in general. So here, I show pictures of two of Google's
large warehouse scale computers. Each of these buildings is a football
field in size, and each building houses thousands of servers and machines.

On these machines, we run large-scale internet applications, like search,
mail, social networking, maps and so forth. These warehouse scale
computers are expensive, costing hundreds of millions of dollars to construct
and operate, and this is growing. And my claim is they're inefficient, as the
system and software architecture of these warehouse scale computers remains in
its infancy.

Now, when thinking about improving efficiency in warehouse scale computers,
there are a number of optimization objectives and metrics you can consider. My
work has focused on the performance of software running in these warehouse scale
computers and on utilization. As noted by Luiz Barroso and Urs Hoelzle, software
performance and server utilization are critical for efficiency in warehouse
scale computers.

So we are all familiar with performance. It's how well our software is running
on these machines. However, utilization is a particularly interesting metric
in warehouse scale computers. This graph to the right shows the utilization of
Google warehouse scale computers over fraction of time of one of their 2007
machines. So on the Y axis, we have the fraction of time and on the X axis, we
have the amount of utilization for that fraction of time.

And as you can notice, the hump of the curve indicates that we're usually
around 30 to 40 percent utilization on average, and what we'd like to do is
move this hump to the right to have higher utilization for a larger fraction of
time.

And to put these two metrics in perspective, a 1% improvement in either
performance or utilization results in millions of dollars saved at the scale of
Microsoft's and Yahoo's warehouse scale computers.

>>: So is that in that you don't have to buy the servers or is that in you
turn them off so you're not paying for the energy? Where is that?

>> Jason Mars: This millions of dollars? So it's you won't have to use the
service. So basically, there is a cost model used. So when a product group
wants to use X amount of machines, Google -- well, I've done a lot of my work
at Google, but Google will actually put a price tag on how much it costs to use
that many machines. And so if you can have a one percent improvement using
Google's cost model across the entire infrastructure, you save millions of
dollars worth of computing resources.

So that's kind of where that millions of dollars really comes from. It's an
internal model. But you can consider building a smaller data center for some
fixed amount of work. Question?

>>: So I'm curious. It seems like if you have a data center and you're
running at that utilization, if you had fewer machines that you were running at
a higher utilization. Can't you just trade off the number of machines you have
turned on to change your utilization curve?

>> Jason Mars: Yeah, so that's a good point. This curve doesn't include load,
right. So this is the utilization when all of the machines are active in some
way. So basically, there's a number of contributing factors to low
utilization. And one of those factors has to do with the lack of co-locating
things together on the same machine.

So does that -- right. So I know what you're talking about. Like so
basically, you can have across over like a month, you can have times when
basically you're not getting as many queries on the data center, but I believe
this curve, I could be wrong, but I believe this curve factors that aspect out.
I could be wrong, though.

>>: Google is not going to show us if they're turning on or off their machines
anyway, right?

>> Jason Mars:   Pretty much.

>>: I think you can assume the server doesn't just do computations and a lot
of wasted time is on things like stacks and storage and other external
resources.

>> Jason Mars:   Right.

>>: And the more to spare you have, the more time you're going to have sitting
around waiting for a network.

>> Jason Mars:   Absolutely, absolutely.

>>:   You can't use a CPU.

>> Jason Mars: And all of those factors that you just mentioned factored into
this utilization challenge.

>>: So is this CPU utilization or just any part of the system?

>> Jason Mars: It's compute utilization, yeah. So we'll see shortly how we
can start addressing that utilization problem. However, before we look into
improving efficiency in warehouse scale computers, it's important to reflect on
the design of these systems. So traditionally, companies have used a
functionality first, efficiency second approach. Where initially, they use
commodity components like, you know, off the shelf processor and open source
software. They stitch these components together for functionality, and then
they tweak these performance for -- these components for better efficiency as
time goes on.

The problem is commodity components were not initially conceived and designed
for the space of warehouse scale computers, and when you start with these
components, you may lose sight of unique characteristics in warehouse scale
computers that are critical for a highly efficient design.

And so system architects may have missed a key characteristic that steers that
design. So I argue that we must rethink the system architecture for any
characteristics we would want to design our systems to exploit to have a highly
efficient system. And so the insight of my work and one of the underlying
threads of the work I'm presenting today is that one such characteristic has
been missed, and that is the diversity in execution environments.

So let me first define what I mean by an execution environment. So here, I
show, so given the task, its execution environment is the underlying machine
configuration on which that task is running, coupled with the co-running
tasks on that machine at that given time. So the execution environment of this
task is the generation 2 Xeon with the co-runner that's running.

And so inside of a warehouse scale computer at any given time, we actually have
a number of various execution environments. We have different machines because
as new machines, as machines are retired or fail, new machines are brought in
that are of newer generations and we also have at any particular time in a
warehouse scale computer, each machine is loaded with a different number of
tasks already running.

So say we take this identical web search job and we run it on three
different execution environments. Currently in warehouse scale computers, the
entire system does not place tasks where they would like to run or where they
run best. Tasks can't adapt at the machine level to events within the execution
environment of the machine. And certain kinds of events we can't address or
measure or manage explicitly, such as the interference between tasks, and
this is critical for utilization.

So these are all hard problems and precisely what my work looks at. But before
we talk about those problems, you might ask, well, is it important to
acknowledge this diversity in execution environment? And my claim is that this
diversity in execution environment is key for a highly efficient design.

And a simple experiment can lead us to this conclusion. So if we take three
different machine configurations, just looking at machine configuration for
now, and these are actually production configurations you would find in a
Google cluster, and we run these nine large-scale Google web services across
these different machine configurations, we observe a significant impact on the
performance of these various jobs.

Some applications like Bigtable here can observe a 3x difference in
performance depending on the machine configuration. For other jobs, like
protobuf, it matters little which machine they run on, and we have jobs like
search scoring and search on maps detect face where one will prefer one
architecture while others tend to prefer another.

So we observe that task performance is heavily impacted by diverse machine
configurations.

>>: Do you have a reason why any intuition from a high level why you see
those? Is it [indiscernible].

>> Jason Mars: I do have that intuition and by and large it has to do with the
diversity in the memory subsystem of the various architectures. So across the
three architectures I've shown, we observe a very big variation in the cache
sizes used, in the prefetchers that are used on those various architectures,
and the topology of the memory subsystem across these architectures. So we
have a generation 1 Xeon, which is a Clovertown Core 2 type architecture. We
have an Opteron, Istanbul Opteron, and we have a Westmere generation 2
[indiscernible] type Core i7, but an earlier Core i7 architecture.

So if you notice across these architectures, if you look at it, the cache sizes
are very different and the hierarchy is different and the types of prefetchers
and the effectiveness of the prefetchers is different. That's our observation
as the biggest contributor to the variation.

>>: Maybe I'm [indiscernible] to your argument. Would it simplify to say the
CPU runs at infinite speed and all the delays are due to the memory subsystem?

>> Jason Mars: Infinite speed is very fast, but yeah. I mean, if we can
actually keep all of our work in the, like, first level cache, you would have a
significantly fast machine. We can't really realize that. I agree with you
completely. We don't in practice, it's always the latency that leaves the
first level cache that's causing --

>>: A local story, it's like complaining about the [indiscernible] stop at
520.

>> Jason Mars: Yeah, exactly. The potential is huge is, I think, exactly the
point that you're pointing out.

>>:   You're normalizing to one on the clover here?

>> Jason Mars: Right, exactly. So basically, the minimum performing of the
three architectures, we normalize the whole cluster to, yeah. So yeah.

>>: So are these all, these are all multithreaded benchmarks?

>> Jason Mars:   These are multithreaded benchmarks, yeah.

>>: So when you run on an Xeon, which I guess the squares are showing it has
two CPUs and the Istan has four and the West has four? So do you fork a number
of different threads?

>> Jason Mars: So we bin -- in this experiment, I bin the work to one core.
So these workloads are essentially a part of a suite of pre-made benchmarks
inside of Google which are composed of the commercial binary coupled with a
giant log of hundreds of thousands of queries that it just turns on a pre-made
package log of queries. It's called perf lab.

>>:   So these are all running in one thread?

>> Jason Mars:   Right in this particular --

>>:   So they get the whole memory systems to themselves?

>> Jason Mars:   Exactly.   So we don't see --

>>:   [indiscernible] memory system itself.

>> Jason Mars: Yeah, exactly. So when we look at -- so that's machine
configuration. We observe a lot of variation in performance. What about
co-runners? We also observe, when we take the same type of experiment but only
look at the degradation due to a co-runner running alongside on the machine,
that we have a significant degradation in performance in some cases and not so
much in other cases.

So across the various clusters we can see -- and the key thing to take out of
this graph is as you change the co-runner, you can have a different amount of
degradation. So we observe that the task performance is highly impacted by the
co-runners on that machine as well.

>>:    So the co-runner, you have three co-runners there?

>> Jason Mars: Yeah, so here we fill up the whole -- because we're studying
contention here. So we just fill up the whole chip with threads so it's binned
based on half the -- the policy is half the cores are one application. The
other half of the cores are the other application. Because we want it to kind
of, if we ran the experiment where we only limited it to two threads, we
wouldn't really see the true pressure of contention, you know, given the memory
system. Because you won't have as many accesses, memory accesses coming from the
various applications if they're limited to one thread. It would be a slower
rate. So that was the rationale to fill up the whole chip.

>>:    Okay.   So again, each of these apps that you have, they get one core?

>> Jason Mars:     So in this experiment, they get half the cores on the machine.

>>:    They get half the cores, okay.

>> Jason Mars:     It's a half policy.

>>: So I imagine there's a lot going on on these systems at any point in time,
even in the performance lab, and I imagine that there's a lot of variability
that you get just from running these things multiple times. So are these means
over multiple runs?

>> Jason Mars: Right. So these are means over three runs. However, this perf
lab, it's called perf lab, this benchmark suite that Google's developed
internally, they've worked really hard to get the performance variation across
runs down within the one percent range. So that's really what's used here.

>>: When you start adding interference, then you have timing issues of when
you actually start the applications, right?

>> Jason Mars:   Yes.

>>:   Is the harness handling that.

>> Jason Mars: So the harness here is handling -- it spawns both applications
at the same time. The applications run for a while so they run for 20 minutes
so we --

>>:   So you minimize that.

>> Jason Mars: We minimize it, but there might be a little, yeah. So the
observation is that the execution environment has a significant impact on
performance, and for this reason I claim that exploiting, and not just
exploiting but adapting to, diversity in execution environments is critical
for improving performance and utilization in warehouse scale computers.

So I've done a number of works dealing with these types of issues of
optimization and the interaction between workloads and the underlying system
across disciplines in computer science such as characterization, workload
studies, compilers, static and dynamic analyses, runtime systems and software
systems, and computer architecture.

But today, I'll be talking about a number of these works that deal with this
issue of diversity in execution environments and how we can exploit that. So
there are three aspects to the work. There are three design points where we
need to acknowledge the diversity in execution environments. At the cluster
level, we want to map jobs to execution environments they prefer. At the
machine level, we want to enable jobs to adapt to the execution environment on
the machine and changes in that execution environment. And then there's a very
important event that happens within an execution environment, important for
utilization, the interference between tasks, and we want to be able to
acknowledge and explicitly manage this interference to improve utilization. And
that will be clear later in the talk.

So to summarize the contribution, I've developed an intelligent mapping
approach that will dynamically learn how jobs perform in various execution
environments using continuous profiling, and then exploit that knowledge to do
this mapping, and that results in a 15% performance improvement overall across
a cluster. And this has been validated using the same kind of applications I'm
showing: real production machines and real production workloads.

And at the machine level, I've developed a new mechanism that allows for
applications to adapt and respond to events in the execution environment and
solved two pressing problems in warehouse scale computers. The first is
selecting your aggressive optimizations based on if they're effective or not
dynamically so you can dynamically detect if your aggressive optimizations are
giving you a benefit and then apply them only when they're giving you a
benefit.

And then also, we really need a mechanism in software, and this is actually,
you know, at the time was a challenging problem is to think of a mechanism in
software to detect that contention is occurring. To be able to positively say
that we observe contention. And so I've used this runtime system to detect and
respond to contention dynamically, which is important for utilization.

And then finally, when dealing with the interference within an execution
environment, certain applications have a precise amount of performance
they must guarantee, and certain co-locations between applications can violate
or not violate that performance requirement. I've developed a technique that
can precisely predict which applications are okay to co-locate on a given
machine, to allow more co-locations to occur over a policy of just disallowing
co-locations. It will be clear when I get into it. And that presents a 50%
improvement in utilization over a 500-machine cluster in the scenario that I
will present.

>>: So each of the scenarios is a 500-machine cluster?

>> Jason Mars: Oh, no, no. So it differs from experiment to experiment. So
at the cluster level and these two works, this work, the cluster level and the
interference in the EE, it's 500 machines. And we run an experimental scenario
on those 500 machines using, you know, canned executions. It will be clear in
just a few.

So let's look at the cluster level. I'll be going through each of these
cluster level, machine level and interference. I'll be dealing with each of
these in the talk. So we start with cluster level. At the cluster level, what
we're really trying to attack is the assumption of homogeneity in the way
systems are designed currently.

So many systems, like systems used at Google, view the entire warehouse
scale computer as a collection of thousands of cores and petabytes of memory.
And basically, when a job comes in to be scheduled on this giant computer, it
finds the first available machine with the prescribed number of cores and
amount of memory needed by that job. Each job has a configuration file with the
number of cores and memory needed.
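
The first-fit placement just described can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `Machine` and `Job` classes, their field names, and the example machine names and sizes are hypothetical and are not taken from Google's scheduler.

```python
from dataclasses import dataclass


@dataclass
class Machine:
    name: str          # hypothetical machine identifier
    free_cores: int
    free_mem_gb: int


@dataclass
class Job:
    name: str
    cores: int         # cores requested in the job's configuration
    mem_gb: int        # memory requested in the job's configuration


def first_fit(job, machines):
    """Place the job on the first machine with enough free cores and memory.

    Note that the machine type is never consulted, which is exactly the
    homogeneity assumption the talk goes on to question.
    """
    for m in machines:
        if m.free_cores >= job.cores and m.free_mem_gb >= job.mem_gb:
            m.free_cores -= job.cores
            m.free_mem_gb -= job.mem_gb
            return m.name
    return None  # no machine currently has room


# Illustrative, made-up capacities.
machines = [Machine("clovertown-0", 2, 8), Machine("westmere-0", 8, 32)]
placement = first_fit(Job("websearch", cores=4, mem_gb=16), machines)
```

Here the job skips the first machine only because it is too small, not because of how well the job would run on it.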

However, what actually happens is we do have this machine heterogeneity, and as
I've shown you before, it matters a lot in the graphs I've shown, where the
jobs end up landing.

So how do we exploit this? How do we take advantage of this off the cuff?
What's the first thing to do when taking advantage of this diversity? So we
want to be able to map jobs where they run best and we want to take advantage
of the unique properties of warehouse scale computers to do so. We know the
application services we're going to be running continuously on these machines
and we also know that we have continuous profiling as a service that lives in
production, where we can continuously profile all of the jobs to get
performance information using hardware performance counters.

In this work, I use GWP, which is Google-Wide Profiling. And there's a paper
in IEEE Micro from 2010 that discusses that in more detail.

So what I did here was develop this approach I call SmartyMap. And what
SmartyMap does is it exploits this continuous profiling to build a knowledge
bank of how jobs perform, scores of how jobs perform in different execution
environments, defined by both the underlying machine configuration and
the co-running jobs.

And based on this knowledge bank, SmartyMap will train the knowledge bank and
use it to map jobs in places where it predicts they will have the highest score
using Google-Wide Profiling. Question?

>>: Is this from production profiling or on the test cluster where you run these
benchmarks?

>> Jason Mars: The evaluation will show the test cluster.

>>:   The maps and the models.

>> Jason Mars: Oh, this is live. SmartyMap runs live and with the continuous
profiling information, it refines its knowledge of the different types of jobs
running and the types of jobs it likes to run.

>>: So how do you incorporate the variance in workloads coming in and other
noise that you see?

>> Jason Mars: So statistical methods is what we use. We would use the
average, basically, that we observe across different execution environments to
score.
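
The averaging described here can be pictured as a table keyed by execution environment. The following is a minimal sketch, assuming a key of (job, machine type, co-runner) and a plain arithmetic mean of the samples; the `KnowledgeBank` class and its method names are hypothetical and are not the talk's implementation.

```python
from collections import defaultdict


class KnowledgeBank:
    """Running average score per (job, machine type, co-runner) key."""

    def __init__(self):
        self._total = defaultdict(float)
        self._count = defaultdict(int)

    def record(self, job, machine, co_runner, score):
        # Fold one profiling sample into the running totals.
        key = (job, machine, co_runner)
        self._total[key] += score
        self._count[key] += 1

    def score(self, job, machine, co_runner):
        # Return the mean of all samples, or None if never observed.
        key = (job, machine, co_runner)
        count = self._count.get(key, 0)
        if count == 0:
            return None
        return self._total[key] / count


bank = KnowledgeBank()
# Made-up sample values, in instructions per second.
bank.record("websearch", "westmere", "bigtable", 1.0e9)
bank.record("websearch", "westmere", "bigtable", 3.0e9)
avg = bank.score("websearch", "westmere", "bigtable")
unseen = bank.score("websearch", "clovertown", "bigtable")
```

Because each new sample only updates a sum and a count, the bank can keep refining its scores as profiling data arrives.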

>>: And one more question. How do you see applications are instrumented to
tell you this is the start of a request, this is when it's done so that you
know how long it's taking? Because if you just see arbitrary, you know, binary
running, you don't know the latencies, right? The user would --

>> Jason Mars: Right, right. So that's not factored in. So what happens is
it uses hardware performance counters. So the profiling service continuously
samples all day long all night long hardware counters across all machines and
it also collects what was running on that machine.

So we have all of these logs of just hardware performance counter information
for all the jobs running as a service. So it does it at the hardware
performance counter level and it doesn't give you higher level information such
as, you know, quality of service information.

>>:   So how do you know from the counters if this is slow or if this is fast.

>> Jason Mars: So here in this experiment, I use a metric I call instructions
per second. So I'll talk about it in terms of improving the performance we're
getting from the whole cluster in terms of how many instructions per second am
I getting from the cluster. Kind of thinking of the whole cluster as a chip,
kind of. But it's kind of -- I like it.




>>:   So why is IPC the right metric?

>> Jason Mars: So IPC particularly wouldn't be the right metric because we
have different types of architectures. But IPS, which is essentially IPC but
generalized to seconds, works because seconds are generalizable across
architectures. But, I mean, I think you're hinting toward a really good point,
which is you may not be optimizing -- you may not want to optimize strictly for
how fast something is running. There might be other things you'd want to
optimize for. And for that, you can use a similar methodology. You would just
change the feature you're optimizing for.
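
The IPC versus IPS distinction can be made concrete with a small calculation. This is a sketch with made-up counter values and clock rates; the function names are hypothetical, and the only relation used is that elapsed seconds equal cycles divided by clock frequency.

```python
def ipc(instructions, cycles):
    """Instructions per cycle: depends on each machine's own clock."""
    return instructions / cycles


def ips(instructions, cycles, clock_hz):
    """Instructions per second: comparable across architectures."""
    seconds = cycles / clock_hz
    return instructions / seconds


# Two hypothetical machines with identical IPC but different clock rates.
slow = dict(instructions=2.0e9, cycles=2.0e9, clock_hz=2.0e9)
fast = dict(instructions=3.0e9, cycles=3.0e9, clock_hz=3.0e9)

ipc_slow = ipc(slow["instructions"], slow["cycles"])
ipc_fast = ipc(fast["instructions"], fast["cycles"])
ips_slow = ips(**slow)
ips_fast = ips(**fast)
```

Both machines report the same IPC, yet the faster-clocked one retires more instructions per second, which is why a per-second metric is the one that can be summed across a mixed cluster.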

So if you could collect from GWP, if you could get a notion of, let's even make
it high level user -- how satisfied a user is. If you can use a continuous
profiling service to see how satisfied a user is from these types of services
in different execution environments, you could apply the same methodology that
I'll show or the same approach that I'll show in the next couple slides.

>>: Even though you're using instructions per second, your approach is general
enough to handle other --

>> Jason Mars: Right. And you'll see on the next slide. It's because we
formulate -- it's a great question. It's because we formulate the mapping
problem as an optimization problem. So we can leverage standard numerical
optimization techniques to arrive at a good solution so we have an objective
function and we're just trying to optimize to that objective function.

So that's how we formulate the problem. Given that we have this knowledge bank
and that we can refine these scores using the statistics I just mentioned,
given this knowledge bank, what SmartyMap does is whenever it has a set of jobs
it needs to map into a cluster, SmartyMap will internally simulate the
different mappings using the information it has from the knowledge bank and
use, in this -- use a numerical optimization to optimize toward the highest
performance it predicts that you can get from the cluster.

So we use a stochastic hill climbing approach to perform this optimization. It
works well for this kind of optimization.
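
A stochastic hill climb over mappings might look like the sketch below. It makes several simplifying assumptions that are not from the talk: one job per machine, scores that depend only on (job, machine type), a neighbor move that swaps two jobs, and made-up score values. All names are hypothetical.

```python
import random


def total_score(assignment, jobs, machines, score):
    """Predicted aggregate score: sum over jobs of score[(job, machine)]."""
    return sum(score[(jobs[i], machines[assignment[i]])]
               for i in range(len(jobs)))


def hill_climb(jobs, machines, score, steps=1000, seed=0):
    """Randomly swap two jobs' machines; keep the swap only if it helps.

    Assumes len(jobs) == len(machines) and at least two jobs.
    """
    rng = random.Random(seed)
    assignment = list(range(len(jobs)))  # job i starts on machine i
    best = total_score(assignment, jobs, machines, score)
    for _ in range(steps):
        i, j = rng.sample(range(len(jobs)), 2)
        assignment[i], assignment[j] = assignment[j], assignment[i]
        candidate = total_score(assignment, jobs, machines, score)
        if candidate > best:
            best = candidate  # accept the improving move
        else:
            # Undo the swap: the move did not improve the objective.
            assignment[i], assignment[j] = assignment[j], assignment[i]
    return assignment, best


jobs = ["bigtable", "protobuf"]
machines = ["clovertown", "westmere"]
# Made-up normalized scores: one job is machine-sensitive, one is not.
score = {
    ("bigtable", "clovertown"): 1.0, ("bigtable", "westmere"): 3.0,
    ("protobuf", "clovertown"): 1.0, ("protobuf", "westmere"): 1.0,
}
assignment, best = hill_climb(jobs, machines, score)
```

Because the objective is just a function of the mapping, swapping in a different score table (for example, one built from a user-satisfaction signal) leaves the search unchanged. Like any hill climb, this can stop at a local optimum on larger instances.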

>>: Is this sort of a one-shot optimization that when I have a bunch of jobs
to place, I just optimize for those, ignoring everybody else and keeping
everybody else fixed, or will you get additional benefit by sort of globally
optimizing?

>> Jason Mars: So how this works is imagine you want to turn a couple of
services on inside of a cluster. When you turn those services on, there's a
set of many types of jobs that need to run. So when you have the set of jobs
you need to run, you run it through an optimizer that will simulate to use the
knowledge bank to optimize to get the best aggregate performance it predicts.

So it kind of works like that. I think you're hinting towards two issues,
which is what if there are things already running. And that would be where you
start your optimization. You'd have things running and then you would be
mapping, simulating things being added on to those machines and you optimize
accordingly.

Another thing I think you're hinting to, which is a great point, is what about
changing the optimization dynamically, can you move things around.

>>:   Can you turn off that service that's already running for a while.

>> Jason Mars: Yeah, I think that's an absolutely fine idea and probably would
produce good results. But I didn't do that particularly specifically.

So I'm not going to talk too much about -- go ahead.

>>: How long does it -- so it could take a long time to compute that and you
have some requirements on getting these jobs scheduled quickly. So what kind
of deadline do you give yourself for the scheduling algorithm.

>> Jason Mars: So the scheduling algorithm we have here, and with the size of
problems that, you know, in the thousands of machines and a particular mapping
scenario, so it takes, in a matter of seconds, you'll have a map that is
predicted to be good by SmartyMap. So in this situation, we're not doing -- so
these mappings won't be done often, because it happens when you turn on the
services in the warehouse scale computer. You can kind of do that once. Let's
assume you want to have search and maps and mail running on cluster 5. You can
do that once and then let that run until you make a different decision. Like
all right, I don't want to run maps in that cluster. I want to move it to this
other cluster and then you can do another remapping.

>>: So it's not as jobs arrive. It's just when the service gets turned on?

>> Jason Mars: Right, right, exactly right. So as time passes, the quality of
the knowledge bank improves, because we keep refining it with new updated
information. And I'm not going to get into too much detail about the specific
classes of information you can use for mapping. Suffice it to say there's
smarty-M, which is if you were to only take information at the machine level,
like if you were to only consider machines in the heterogeneous environment,
what is the complexity of the amount of information you would need to collect,
and then what if you were to take machine level plus all the permutations of
possible co-runners, for which the complexity gets much higher.

So I would just point to this, because there's an interesting observation as to
how SmartyMap works across these two things. And I think I present that in the
slide after this. So just to describe the experimental setup, to see how well
SmartyMap works.

So we have 500 machines, three production types of machines. These are
actually, you know, real machine type configurations used at Google, which
might be interesting for you guys to see. I got approved to have that. I think
it slipped past the approval, because normally, they wouldn't approve that.

Anyway, so we have a thousand jobs that we'd like to map. And I've used the nine
Google applications presented earlier, same -- there's two sets of experimental
test beds. There's the Google one that I did at Google, and then there's an
experimental test bed where I used SPEC and off-the-shelf machines when I was
not at Google to do the same experimentation.

So what we observed. So this is normalized IPS, and it's normalized as if all
jobs had the best machine it could run on and it was running alone. So that's
the normal, that's one. And as you observe, across the different metrics, the
light blue bar shows what we'd expect currently, right. Where it doesn't take
into account any of the heterogeneity across machines or co-runners. And then
we show different classes of information that we would collect.
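
The normalization described above can be sketched as follows. This assumes the cluster-wide figure is the ratio of summed observed IPS to summed best-case IPS (each job alone on its best machine); the talk does not say whether sums or per-job ratios are averaged, so that choice, the function name, and the numbers are all assumptions.

```python
def normalized_ips(observed, best_alone):
    """Aggregate observed IPS divided by the ideal aggregate IPS.

    observed:   job -> IPS measured under the mapping being evaluated
    best_alone: job -> IPS when running alone on its best machine
    A value of 1.0 means every job achieved its best-case performance.
    """
    ideal = sum(best_alone[job] for job in observed)
    return sum(observed.values()) / ideal


# Made-up values for two jobs.
observed = {"websearch": 1.5e9, "bigtable": 1.0e9}
best_alone = {"websearch": 3.0e9, "bigtable": 2.0e9}
ratio = normalized_ips(observed, best_alone)
```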

But observe that machine -- so only these two, the -- only the blue and the red
show where we consider machines and not just co-runners. And notice that when
you consider machines, you get most of the benefit. So the diversity across
co-runners in the execution environment matters a lot less than the diversity
across machines. Which is a nice, interesting observation.

But an interesting thing in the paper that this is attached to, that you guys
haven't seen yet, the interesting thing is, right, it depends on how diverse
your machines are. So across these three machines that actually exist in a
real cluster, the diversity is relatively high. But it's kind of surprising,
because we were just talking about going from Core 2 to Core i7, basically, and
having the Opteron competitor to the Core i7 in there. So the generation gaps
aren't that huge.

>>:   You are doubling the size of the memory system from the previous slide.

>> Jason Mars: Exactly. And the prefetchers got so much better, especially
from Core 2 to Core i7, they fixed a lot of things in the prefetcher. You're
absolutely right. Good point.

So that's the result. But there's one more interesting observation. So we
have a 15% improvement overall across those two examples. Now, one interesting
observation in this, and this might be really useful to some of the data center
people here is that it tells you something about how do we build our cluster.
If you were to acknowledge this diversity, does it change the way we would want
to go buy machines to fill a cluster. And it does, right.

So here we show applying the SmartyMap to clusters that are fully composed of
one of the three types of machines. And then we show clusters that are half
filled with each of two types: Clovertown and Istanbul, Istanbul and Westmere,
and Westmere and Clovertown, the three different types of machines. And what
we can observe is so -- oh, no.

All right. Oh, I see, I was clicking. So what we observe is if you have all
Istanbul versus having half Istanbul and half the older architecture, Clovertown,
when you're doing the optimization of mapping jobs where they like to run, you
can match the performance of the all-newer-machine cluster, while this one
would be a cheaper cluster to build, and you actually beat the performance.
Why? It's because some of those workloads prefer the Clovertown over
Istanbul. If you can do the mapping right, you can actually have a cheaper
mixed cluster and get matching or better performance, which we see in this
case.
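
The mapping argument above can be sketched in a few lines. This is a minimal illustration and not the SmartyMap implementation: the job names, the per-machine performance numbers, and the brute-force search are all made-up assumptions, chosen only to show how a mixed cluster with a good mapping can beat a uniform cluster of the newer machine.

```python
from itertools import permutations

# Hypothetical normalized performance (1.0 = best machine, running alone)
# of each job on each machine type. These numbers are invented.
PERF = {
    "job_a": {"clovertown": 1.00, "istanbul": 0.80},
    "job_b": {"clovertown": 0.70, "istanbul": 1.00},
    "job_c": {"clovertown": 0.95, "istanbul": 0.90},
    "job_d": {"clovertown": 0.60, "istanbul": 1.00},
}

def total_perf(jobs, slots):
    """Sum of per-job performance when jobs[i] runs on slots[i]."""
    return sum(PERF[j][m] for j, m in zip(jobs, slots))

def best_mapping(jobs, slots):
    """Exhaustively try every assignment of jobs to machine slots."""
    return max(permutations(slots), key=lambda p: total_perf(jobs, p))

def expected_random(jobs, slots):
    """Average total performance over all assignments (a random mapper)."""
    perms = list(permutations(slots))
    return sum(total_perf(jobs, p) for p in perms) / len(perms)

jobs = list(PERF)
mixed = ["clovertown", "clovertown", "istanbul", "istanbul"]
all_istanbul = ["istanbul"] * 4

print("all Istanbul:        ", total_perf(jobs, all_istanbul))
print("mixed, random map:   ", expected_random(jobs, mixed))
print("mixed, optimized map:", total_perf(jobs, best_mapping(jobs, mixed)))
```

With these invented numbers the optimized mixed cluster comes out ahead of the all-Istanbul cluster, while a random mapping onto the same mixed cluster does worse than both. A real scheduler would need something cheaper than exhaustive search.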

Which I think is really interesting and might tell us that we shouldn't just
buy all one machine.

>>:   Although your West number says we should just buy memory machines, right?

>> Jason Mars: Yeah, that's because West is just such a strong processor. But
if you look at the, at least if you look at the difference between the
expected, the random, the expected performance and the difference between the
when optimizing performance, you do a little better than if you were to
randomly map. So that's why I didn't -- I don't want to -- because this one is
a little bit more subtle, but you have to look at maybe there's another graph I
would need to build to really show this point, but you get better efficiency.
You can get better efficiency than the random when you're doing the mapping is
what I mean, going across the two.

But you're right. In this case, you would just, for best performance, you just
buy all Westmere, done.

>>: But if you look to performance for energy, the powering up that memory
system is efficient so you might get a better energy efficiency point on one of
the other two graphs, depending upon how much it costs to run it.

>> Jason Mars: That would be a very interesting data point to show, and I kind
of wish we did that now. Because if I can sell you the energy benefit that
would be actually interesting. But yeah. So that's that result. So let's
move to the machine level.

So at the machine level, we want to be able to allow jobs to adapt to their
execution environment, because there are a number of problems that require this
and I'll show why. So what we're dealing with here is the rigidness.
Rigidness at the application level. So traditionally, when a compiler
generates your binary, your binary is fixed regardless of where it runs in the
execution environment.

Now, there's a class of problems that would require this application to adapt
to its execution environment. Two I'll highlight and address in this work.
The first problem is the problem of aggressive compiler optimizations. These
are optimizations known to either improve or degrade performance based on
certain dynamic effects or certain situations across different applications. So
we would want -- oh, so in one execution environment, you may see a performance
improvement. While in the other, you might experience a slowdown. So that's
the first problem that we need this kind of flexibility.

The second problem is this issue of detecting and responding to interference
online. So your web search job will have performance requirements and may be
co-located with some low-priority batch applications. Currently, there's no way
for us to know programmatically, or to ask, well, are these jobs contending?

And so because if we could ask this question, and if we could get a response
from the software system, we might be able to do something to reduce the
interference that is generated. So the ability to detect and respond requires
a runtime system to be there to ask that question to.

But before we can address these two problems, we need a mechanism for cracking
this rigidness, for allowing this online adaptation. So when realizing this
kind of mechanism, we have binary translators. This is for arbitrary native
applications. So for arbitrary binary applications, we have dynamic binary
translators. However, they haven't been adopted in many production environments,
including warehouse scale computers, because they're huge in complexity and
sacrifice significant performance. And the argument of using these binary
translators for optimization hasn't really been realized by our community in a
convincing way for production.

So to achieve deployability in warehouse scale computers, we need an approach
that is lightweight and low in complexity. So we need a new technology for
online adaptation, and this new technology is one of the major challenges and
contributions of this work, right. So let's take a quick look at conventional
approaches to binary adaptation. So your application runs on top of a runtime,
and this runtime is responsible for keeping fine-grained control at the
instruction and basic block levels. So it uses something we call a code cache,
and the runtime will dynamically translate the application at the instruction
and basic block level into this code cache, and then execution is only allowed
to occur out of this code cache.

If execution were to transition from this code cache back to the original
binary, things will break. So that's a big no-no. However, there are expensive
transitions between the code cache and the runtime that occur. So when we have
indirect branches that are not easy to predict and on the critical path of your
application, you may take many expensive transitions between the code cache and
the runtime, and that's very expensive.

And then to monitor the execution environment, you need to add instrumentation.
And then this instrumentation to employ some kind of adaptation might need to
also make transitions to the runtime and back. So we have a lot of complexity,
a lot of overhead.

So we need a new approach for binary adaptation. So I asked the question, why
not let the application run directly on the processor, right? And we can use a
very tiny runtime that uses hardware performance counters to keep us informed
about what's going on dynamically. So we have no fine grained control and we
execute directly.

But I may ask, well, how do you adapt the application code then? So I advocate
and propose that we use software, what I call scenario-based multi-versioning,
where we can specialize instances and regions of code for dynamic situations we
anticipate and we can just select which regions and code to execute when we
experience that dynamic situation.

So this is very lightweight, very low overhead to actually employ when you want
to change your code dynamically, and you get the compiler, so there's some
added benefit over something that only uses the binary. So it's an interesting
argument, because there are things that you might not be able to do since
you're not compiling dynamically, but then there are things you can do because
you have a compiler. The caveat is you have to statically acknowledge what you
want to specialize for dynamically. That's the caveat of this approach. And we
also need, especially when we want to coordinate multiple co-running
applications, to enable coordination between runtimes. So we need to let these
runtime systems talk to each other so that we can have coordinated adaptation
policies across applications.

So this is the mechanism. So this is low in complexity and low overhead. So
I'm not going to go into too much detail about the runtime system, which I
call LOAF, the lightweight online adaptation framework. But I'll show how this
diagram can kind of summarize technically what's really going on, right, for
scenario-based multi-versioning.

So basically, we have an application where each call to a region of code that
you specialize statically for dynamic situations is an indirect branch that uses
a table, a global dispatch table, to know which version is active, and you can
reroute execution dynamically using this dynamic introspection engine. So the
beautiful thing about this approach is, as your program is running along as
fast as it can, doing its thing with one particular routing, your dynamic
engine can simply write new values into this table and then restructure,
reconfigure how your binary is running.

So you can think of it as a reconfigurable binary, and this is nice because, as
opposed to traditional multi-versioning where you have some kind of conditional
that's on your critical path, if that conditional had to look something up in
your environment, you'd have to execute it every time. But here, the program
itself never has to, conceptually, pick what version to use.
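
The dispatch-table idea described above can be sketched as follows. This is a minimal analogy in Python, not the LOAF runtime: the function names, the `DISPATCH` table, and the `reroute` helper are hypothetical, and in the real system the versions are compiler-generated native code reached through indirect branches.

```python
# Hypothetical specialized versions of one hot region of code.
def hot_loop_conservative(data):
    return sum(data)

def hot_loop_aggressive(data):
    # Stand-in for a version compiled with aggressive optimizations.
    total = 0
    for x in data:
        total += x
    return total

# Global dispatch table: one slot per multi-versioned region.
DISPATCH = {"hot_loop": hot_loop_conservative}

def call_region(name, *args):
    """Every call to a versioned region is an indirect call through the table."""
    return DISPATCH[name](*args)

def reroute(name, version):
    """What the introspection engine does: overwrite one table entry."""
    DISPATCH[name] = version

print(call_region("hot_loop", [1, 2, 3]))   # runs the conservative version
reroute("hot_loop", hot_loop_aggressive)    # the engine flips the table entry
print(call_region("hot_loop", [1, 2, 3]))   # same call site, new version
```

The point of the design is that the call site never tests a condition; only the table entry changes, and it changes from outside the application's critical path.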

>>:   Are you trading an indirect branch for --

>> Jason Mars:   A direct branch.

>>: Well, yeah, exactly, for a test that's going to -- a branch
[indiscernible]. So isn't this actually more overhead than what you just said?

>> Jason Mars: Oh, so okay. There is a tiny bit of overhead. I've actually
run the experiments. The overhead is less than one percent. Just if you were
to -- so the experiment to test that overhead was just keep these values the
same and just let the application run, right. And that overhead is less than
one percent.

>>:   The branch predictor works?

>> Jason Mars:   Yeah, and you're not changing --

>>:   The direct branch or indirect branch?

>> Jason Mars: Yeah. And when you're re-routing, you know, a lot of it
depends on the granularity at which you're changing this table. If you're
changing this table at milliseconds, and tens or hundreds of milliseconds pass
before the next change comes, then you don't suffer a major penalty from
switching the table too much.

It's much lower than -- it's much lower than traditional approaches.

>>:   So the one percent was redirecting every call through this table?

>> Jason Mars: Right, every call of the -- it's not all the functions. It's
the hottest functions. Every call to the hottest functions, so kind of like
every call since they're the hottest functions.

>>: Arnold and Ryder did an instrumentation framework like this for Matt's
Ph.D. thesis that's very similar. They had the same kind of results even when
the branch predictors weren't that good. Or as good as they are now.

>>: There seems to be a testing problem. You have 20 such options. You have
over a million versions of the program at any one time and even a bigger
infinity if it's a long acting bug. What would the testing [indiscernible] say
about that?

>> Jason Mars:   I haven't heard that comment before.   That's an interesting
point.

>>:   Your compiler --.

>> Jason Mars: You're right. You allude to a really good point. Statically,
if you want to do too much online, there's a state explosion, right? You can
have all kinds of permutations of versions for different dynamic scenarios. So
it's up to the optimization designer to design -- to not go too crazy, or
it will start to not be as performant. So that's an interesting point.




>>: Did you do anything so now you have filter version so you have code
explosion. So do you do anything to control that so your cache doesn't blow
up?

>> Jason Mars: Yeah. So I don't do anything explicitly. Really, everything I
do is in the design -- so you identify your dynamic scenarios and you
specialize accordingly. And it's in controlling the number of specializations
you want to do. But I don't do anything clever dynamically. But it's not hard
to imagine coming up with some ideas to keep that state explosion in check.
Actually, that's a very, very interesting point. Yeah, we should chat more
about that.

So, time to address those problems now that we have this runtime system. So I
mentioned the two problems, aggressive optimizations and detecting contention,
which SBO and CAER address. So I'm going to talk about SBO first and then
contention aware execution. So how do we use this kind of approach to detect
when our aggressive optimizations are effective and then use them only when
they're effective?

So I use an approach I call a competition heuristic, where initially,
throughout execution, you learn to identify whether or not aggressive -- you
try aggressive and compare that to the non-aggressive version and identify
which version produces the better performance using the dynamic information
that gives you performance information online and once you make that decision,
you let that run for a while before coming back around to do another test and
this continues, this functions continuously throughout execution. And so
that's the competition heuristic.
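
A minimal sketch of a competition loop of this shape follows. It is an illustration only: the `measure` hook, the tick-based timing, and the version names are assumptions, standing in for performance-counter readings and the compiled code versions.

```python
def compete(versions, measure, trial_ticks=1):
    """Run each version for a short trial and return the name of the best.

    `measure(name, ticks)` is a hypothetical hook that runs the named
    version for `ticks` intervals and returns a throughput figure
    (higher is better), e.g. derived from hardware performance counters.
    """
    scores = {name: measure(name, trial_ticks) for name in versions}
    return max(scores, key=scores.get)

def run(versions, measure, total_ticks, trial_ticks=1, hold_ticks=10):
    """Alternate between a competition phase and a longer hold phase."""
    schedule = []  # which version ran during each tick
    while len(schedule) < total_ticks:
        winner = compete(versions, measure, trial_ticks)
        schedule += list(versions) * trial_ticks  # ticks spent on trials
        schedule += [winner] * hold_ticks         # ticks spent on the winner
    return schedule[:total_ticks]

# Toy measurement: pretend the unrolled version is 20 percent faster.
def fake_measure(name, ticks):
    return {"o2": 1.0, "unroll": 1.2}[name]

schedule = run(["o2", "unroll"], fake_measure, total_ticks=24)
print(schedule.count("unroll"), "of", len(schedule), "ticks on the winner")
```

The ratio of `hold_ticks` to `trial_ticks` bounds how much time is spent running a losing version, which is the main cost of testing online.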

It's shown to work quite well. So on this graph, these are SPEC applications. I
show the execution time normalized to O2. Basically it's just O2, no aggressive
optimizations. And then each of the three bars, the blue, red and -- the
green, red and light green bars show the performance when statically applying
these different permutations. And then the third bar shows the performance
when you dynamically select the optimizations.

And we consistently do better than our baseline, which is O2, the conservative
approach, which is just to not use aggressive optimization.

So we get a clear win over the conservative approach, and, by and large, we
even beat statically applying any one of them when selecting dynamically. So
it's pretty effective.
And overall, it is about a 4 percent to 10 percent performance gain over the --

>>:   Which one is doing it dynamically versus the fixed?

>> Jason Mars: So just this one. And all the others are some fixed aggressive
optimizations. No aggressive optimizations, cache prefetching, loop unrolling
and both cache prefetching and loop unrolling using the GCC 4.3. So we get
about a 4 to 10 percent performance improvement there.

And ask for my dissertation if you find any of this interesting. All of this
stuff is in detail in my dissertation and there's papers too.

So now let's look at contention aware execution where we need to detect and
respond to contention. So let's make it clear what is this contention problem?
So if we have multiple applications --

>>: Before you get into this, can you give me a better sense, when you're
compiling for different versions of these [indiscernible], what's the
difference between the [indiscernible] choices.

>> Jason Mars: Right. So the choices I made in this work were to use the --
so GCC is configured with some canned heuristic knob settings, right. So I use
the canned -march, the micro [indiscernible], maybe I'm getting too technical,
but the designers of GCC have selected values for loop unrolling, which
is aggressive, and then they've selected values for the heuristics
for how to do the cache prefetching.

So I've just used those. I didn't have versions for each knob setting, and this
gets into the state explosion, right. Because if you have different versions,
if you want one for each knob setting, you can start to really blow up.

>>:   Like the unroll is the unroll by four?

>> Jason Mars:   Yeah, exactly.

>>:   Which I happen to have in my head.

>> Jason Mars:   Actually, that's precisely right.   The unroll is unroll by
four.

>>:   That's the magic number.

>> Jason Mars: Four. Just go four. Four and you can't go wrong. That's a
great point. You have to really think carefully about how you specialize.

>>: One of the issues in doing trials is that you, when you measure something
at a fine grain, you could measure one instance of the loop that only has ten
[indiscernible] and you can measure one that has a million. So, of course, the
one that takes a million takes longer even though it is the faster version,
because it's optimized more correctly for the million version, right?

>> Jason Mars:   Yeah.

>>: So how do you take that into account, or are your workloads so homogeneous
that you didn't see that very much?

>> Jason Mars: Yeah, so I use -- that's a good point. I use kind of a
coarse granularity, in the milliseconds. So it's ten milliseconds for
comparing the different versioning schemes. So the space of versioning
schemes. And I refresh the information continuously, after some fixed amount
of time. I think it was a couple seconds. So these are SPEC applications,
right. So after a couple seconds, it would just do the evaluation again,
aggressive, nonaggressive, and update. So it's really dynamic, truly dynamic
in this sense. Does that answer?

>>:   So you are just making local choices based on the current averages?

>> Jason Mars:   Precisely.   Absolutely correct, yeah.

>>: And how do you alternate between the versions? Are you running two
separate versions and measuring both of them, or are you switching between
them?

>> Jason Mars: Switching between them using that rerouting capability that I
-- just updating that table.

>>:   You could get unlucky because you --

>> Jason Mars:   Switch at the wrong time, make the wrong decision.

>>:   But you're not seeing that in --

>> Jason Mars: Yeah, not across these applications. So that's good. All
right. So what's contention? So when you have multiple applications, if the
working set of each application fits neatly into the early levels of cache, we
don't have interference. We get to scale with the number of cores. However,
as soon as we use shared resources like the shared cache or the bandwidth to
memory, then these applications can slow each other down by contending for
these shared resources. So that's how we get interference. Now, in warehouse
scale computers, this is particularly problematic. If we have applications A
and B each run on their own machines, they'll run full speed. However, as soon
as we co-locate them on the same machine, we can have a performance
degradation, and a high priority application may have a quality of service
requirement that it must meet.

And so for this reason, in warehouse scale computers, co-location between
high priority applications and other applications is simply disallowed. The
whole machine is given to the high priority application. So that actually
wastes potential utilization, because not all potential co-runners
will degrade the performance of our high priority application.

So if we can do a check dynamically to see whether these programs are
contending, we can identify those that don't have any contention at all, right?

So what we need to do is have a runtime approach that can detect contention.
So you know, at the time I did this, a lot of folks told me that, from
software, there's no counter for contention. Like if you have a dip in
performance, how do you know that it's actually contention? So is detection
possible?

Well, I came up with this approach. I call it the shutter approach. So on your
runtime, you have your -- this is the high priority, latency-sensitive
application. And we have a batch application. The runtime can shutter the
execution of the low priority application, and if it observes a corresponding
spike in last level cache misses, corresponding with the shutter of the low
priority application, then we can assert they're contending, because we see it
with the little test. It's a little online test. And we can see that they're
actually contending.

So we do this little test, and we can know they're contending. And there are a
whole number of ways you can respond. You can respond by killing it. You can
respond by slowing it down, which is what's in the results I show. You can
respond in a number of ways.
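
One way to read the shutter test is as a comparison of the latency-sensitive job's last-level cache misses while the batch job is paused against those while it runs. The sketch below follows that reading, which is an interpretation rather than the CAER implementation; the 1.5x threshold, the function names, and the halving response are all placeholder assumptions.

```python
from statistics import mean

def contending(misses_shuttered, misses_running, threshold=1.5):
    """Return True if the latency-sensitive job's last-level cache misses
    are markedly higher while the batch job runs than while it is paused.

    Both arguments are lists of per-interval miss counts for the
    latency-sensitive job. The 1.5x threshold is an arbitrary placeholder.
    """
    quiet = mean(misses_shuttered)
    busy = mean(misses_running)
    return busy > threshold * max(quiet, 1)

def respond(is_contending, batch_speed):
    """One possible response: throttle the batch job when contention is seen."""
    return batch_speed * 0.5 if is_contending else batch_speed

# Toy counter samples for the latency-sensitive job.
paused = [100, 110, 90]     # batch job shuttered
running = [300, 320, 310]   # batch job running alongside

verdict = contending(paused, running)
print("contending:", verdict, "-> batch speed", respond(verdict, 1.0))
```

Co-runners for which the test never fires are the ones that can simply be left running.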

But the cool thing is you can rule out those that don't contend at all, and you
can let those run with your little shutter approach. This also works quite
well, and uses the same runtime infrastructure that I presented before. So
basically, what this graph shows is the degradation across these SPEC
applications when co-located with LBM. And when we just allow co-location, the
degradation can be very high. But when we co-locate with CAER, then the
degradation is significantly reduced. So this is across two different
heuristics of applying CAER. For brevity, I'll just tell you that we reduce the
interference from 17 percent to 4 percent on average and get 60 percent
utilization from the neighboring core.

So this experiment was done with two cores on the machine, and so with the 60
percent of the neighboring core, we get an overall improvement in utilization
of that machine of 30 percent, where we're allowing the co-location and
getting the reduction here.

The problem with this is that it's not precise. It doesn't deal with
performance requirements. And the next part of my talk deals with just that.
And that's the Bubble-Up work, which I'm going to get to right after this
question. And so, okay, there's some insights that come from this that I
really believe is how we should think about doing software runtimes as we move
on.

I think it's good to allow some vertical integration. So what I demonstrated
was integration between the compiler, the runtime and the architecture. In
developing dynamic solutions, we need to, one option is to allow the compiler
to stitch in reconfigurability in our binaries. I think that's an option.
That's something we might want to think about doing.

Performance monitors are the key enabler of online techniques moving forward
and we really need to think about how to build monitors not just for debugging,
but for also for online dynamic techniques.

The approaches I presented are realizable today and I believe will gain more
traction as performance counters are built into the application binary
interface, right, where currently, you know, there's a lot of systems where if
VMware is using your counters, you can't use them. As soon as we have this
built in to the application binary interface, we'll solve that. And I believe
ultimately, the day will come when all code and systems will be continually
restructuring like the warming of a cache.

Okay. So finally, I'd like to show how we manage and measure this interaction,
this interference so we can allow for precise predictions and precise
co-locations.

So what does that mean? So some applications have strict performance
requirements, as I mentioned before. And for some applications that interfere,
we can allow the co-location if it doesn't hurt our performance enough to
violate that requirement. So here's a simple example.

If we take search renderer, each one of these bars shows the performance of
search renderer when co-located with each of these applications. Now we're
back with Google's large scale, real applications. And what we observe is
when we co-locate search renderer with each one of these applications, search
renderer's performance is impacted variably.

If we have a 90 percent performance requirement, I'm going to call this the
quality of service threshold, then some of our co-runners violate that
performance requirement, while others do not violate that performance
requirement. So what we want to do is be able to identify all of the
co-locations that don't violate that performance requirement for precision,
because of this over provisioning of resources we have where latency sensitive
applications get run alone and we waste those cores.

So we want to eliminate the uncertainty of that interference penalty. And we
want to precisely predict the impact of quality of service to allow safe
co-locations. So the goal is to have a general methodology that's platform
agnostic. It can run on all those machines I presented before, and is
deployable at a scale of warehouse scale computers.

Now, prior work, there's a lot of work that predicts whether an application is
contentious, but we don't have a way to predict how much applications will hurt
each other when running together. If you have two arbitrary applications, we
don't have a way to know how much they'll penalize each other from a
performance standpoint when running together on a machine.

And here, we're trying to capture an interaction with resources that are not
explicitly manageable or visible to software. A quadratic brute force
profiling methodology is straightforward. You take all of your possible jobs
and co-locate them, you know, with all the other possible jobs, but that's not
suitable at the scale of warehouse scale computers, as jobs are updated
frequently and there's a large number of different types of jobs. So it
doesn't work at that scale.

The question is, is a linear approach possible? Can you look at each type of
job once and then tell arbitrarily how much they will interfere with each other
from a performance standpoint? Especially considering that that impact is
based on the interaction with the co-runner.
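
To make the quadratic-versus-linear contrast concrete, here is a small count of profiling runs. The job counts are arbitrary examples, and counting one run per ordered pair is an assumption about how brute-force profiling would be organized.

```python
def brute_force_runs(n_job_types):
    """Every job type co-located with every job type (ordered pairs)."""
    return n_job_types * n_job_types

def linear_runs(n_job_types, profiles_per_job=1):
    """Each job type characterized once, independent of its co-runners."""
    return n_job_types * profiles_per_job

for n in (10, 100, 1000):
    print(n, "job types:", brute_force_runs(n), "pairwise runs vs",
          linear_runs(n), "per-job profiles")
```

The gap widens further when jobs are updated frequently, since a brute-force scheme has to redo a full row and column of pairings for every updated job.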

So the insight that led me to this approach is, given a white box approach,
where we try to analytically model all of the different aspects
of the processor, the prefetchers, the caches, the interconnect, bandwidth,
memory controller, replacement, queues, buffers, and even the secret sauce,
which is private, like how the prefetchers on real architectures work, it may
not be the right approach. It's high in complexity. It's not portable.
You'll have to do this for every type of machine. Do work for every type of
machine, and it may not be feasible because of this secret sauce stuff.
If this really matters a lot with the performance. So the question is, is a
black box approach possible, where we can just treat the whole memory subsystem
as a black box. So it's lower in complexity. It's portable. You can move it
to any black box and run the same approach, and it's deployable on real systems
with secret sauce stuff.

So when thinking of this approach, I'm going to use an analogy: the man or
woman in a dark room. You can think of the memory subsystem as a dark room
where you try to feel -- in a dark room, you can feel out the furniture. You
can kind of get an idea of the layout of the room. Can we do that as a
methodology? And that will come in the slide after next. It will be clearer
how that actually plays in.

But first, let me tell you essentially how Bubble-Up works. So we capture a
representation of sensitivity and a representation of aggressiveness of each of
the applications. When deciding a co-location, we take that representation of
sensitivity and aggressiveness and combine them to produce the performance
degradation. And it will be clear from this animation.

So we have a representation from our profiling. We have a representation of
sensitivity, which is a sensitivity curve, which you'll see, and we have a
representation of the co-runner's aggressiveness, which is the aggressiveness
score. So if we want to predict how much they will hurt each other when running
on this real system, we can take the sensitivity curve and we can also take the
aggressiveness score. The sensitivity curve shows, on the Y axis, quality of
service as you increase this notion of pressure in the memory
subsystem, and you'll see how that works in the next slide.

But you can take this aggressiveness score and use the sensitivity curve to
predict an actual performance level when co-located. So then the question
becomes, well, how do you get this magical, awesome sensitivity curve and these
aggressiveness scores. Well, we use a bubble to produce the sensitivity curve
and a reporter to produce the aggressiveness score. And this is where the
analogy comes in.

So the bubble is essentially a stress test that provides a performance dial
where you can turn up the pressure on the memory subsystem holistically. Just
using loads and stores from the application, you can turn up the pressure dial
that holistically turns up pressure on the memory subsystem. As you turn up
this pressure dial as shown in the animation, you generate a curve as to how
the performance degrades as you increase the pressure on the memory subsystem.

And that's how you get the sensitivity curve. Now, to get a score, use a
similar kind of intuition. You have a reporter that sits on the memory
subsystem, that has a presence on the memory subsystem. And you let your
application run alongside that reporter, right. And the reporter, it's
trained. It's trained to know -- it's complicated how it's trained. It
actually has a trained sensitivity curve in itself that's used in reverse. But
I'm not going to get into too much detail. But it's trained and it basically
reports how much it thinks you're hurting it. It basically reports what it
thinks the application's aggressiveness score is, based on how its
own performance is being affected. So that's how the reporter works.

So these are the two steps of the profiling approach. You take each
application and you run it through this profiling approach.
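
The prediction step described above, reading a co-runner's aggressiveness score off an application's sensitivity curve, can be sketched as a simple interpolation. This is an illustration under assumptions: the curve points are invented, the score is assumed to be on the same pressure scale as the bubble dial, and linear interpolation between points is a choice made here, not something stated in the talk.

```python
from bisect import bisect_left

def predict_qos(curve, score):
    """Linearly interpolate a sensitivity curve at a given pressure score.

    `curve` is a list of (pressure, qos) points sorted by pressure, as
    produced by running the application next to the bubble at each dial
    setting. `score` is the co-runner's aggressiveness score, expressed
    on the same pressure scale.
    """
    pressures = [p for p, _ in curve]
    if score <= pressures[0]:
        return curve[0][1]
    if score >= pressures[-1]:
        return curve[-1][1]
    i = bisect_left(pressures, score)
    (p0, q0), (p1, q1) = curve[i - 1], curve[i]
    return q0 + (q1 - q0) * (score - p0) / (p1 - p0)

# Made-up sensitivity curve: QoS (fraction of solo performance) vs pressure.
curve = [(0, 1.00), (10, 0.98), (20, 0.90), (30, 0.80)]

for score in (5, 15, 25):
    print("aggressiveness", score, "-> predicted QoS", predict_qos(curve, score))
```

Because each application contributes one curve and one score, any pairing can be predicted from per-application profiles alone.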

So what is this bubble? Well, how can we conceptualize the bubble? Well, if
you take the degradation on some application A by a co-runner C, you can
conceptualize the degradation as a summation, over each resource Ri, of the
sensitivity of application A to pressure on that resource Ri and the pressure
of the co-runner on that resource Ri. And essentially, the bubble is an
approximation of this actual degradation, replacing the co-runner with a
bubble of some k, where k is closest to the co-runner.
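
Written out, the idea is roughly the following. The notation is an assumption chosen for illustration: Sen, Pre, the combining function f, and B_k (the bubble at dial setting k) are not taken from the talk, and the paper should be consulted for the precise formulation.

```latex
\mathrm{Deg}(A, C)
  \;=\; \sum_{i} f\bigl(\mathrm{Sen}_A(R_i),\ \mathrm{Pre}_C(R_i)\bigr)
  \;\approx\; \sum_{i} f\bigl(\mathrm{Sen}_A(R_i),\ \mathrm{Pre}_{B_k}(R_i)\bigr)
  \;=\; \mathrm{Deg}(A, B_k)

\text{error} \;=\; \bigl|\, \mathrm{Deg}(A, C) - \mathrm{Deg}(A, B_k) \,\bigr|
```

The approximation is good to the extent that the bubble at setting k presses on each resource R_i about as hard as the real co-runner C does.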

If this isn't landing, it's all in the paper and dissertation. So when you view
this approximation, we do have a source of error, and that source of error comes
from the difference when you replace that C in the actual degradation with the
bubble k. So there's a hypothesis. The hypothesis is if you design the right
kind of bubble, you can minimize this potential for error. And so good bubble
design is key, and in the paper, I actually outline the systematic principles
for designing a bubble. If your bubble stress test software application has
all of these properties, it should be a very good bubble, very small error.
And so it's monotonic curves. The details of this are in the paper and I show
how I achieve each of these three things in the paper.

But let's look at how well Bubble-Up works. And I have some backup slides that
show even more how well it works. So this is actually very important to
understand the next graphs coming: the experimental scenario. We
have 500 machines. And each machine is running -- each machine has six cores,
right. Search renderer is configured to use three cores and each machine is
running search renderer. So we have 500 machines half loaded with search
renderer, right. And we have three cores available for the co-runner.

So we have 500 jobs ready to run, and it's an even, randomly selected mixture
of the 17 Google workloads, and basically we allow co-location that is steered
by Bubble-Up. So as opposed to having each machine only run search renderer,
which is kind of the current approach, we'll allow co-location, but only the
co-locations that Bubble-Up says are okay.
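
The steering decision can be sketched roughly as follows. This is a minimal
illustration, not the actual Bubble-Up implementation: the sensitivity curve
values are made up, and the names `SENSITIVITY_CURVE`, `predicted_qos`, and
`allow_colocation` are hypothetical. It assumes the profile of the
latency-sensitive application is a curve from bubble pressure to retained
performance, and that each candidate co-runner has a single pressure score.

```python
from bisect import bisect_left

# Hypothetical sensitivity curve for a latency-sensitive application:
# (bubble pressure score, fraction of solo performance retained).
# These numbers are invented for illustration, not measured data.
SENSITIVITY_CURVE = [(0, 1.00), (1, 0.995), (2, 0.98), (3, 0.95), (4, 0.90), (5, 0.82)]


def predicted_qos(curve, pressure):
    """Linearly interpolate the sensitivity curve at a co-runner's pressure score."""
    scores = [score for score, _ in curve]
    if pressure <= scores[0]:
        return curve[0][1]
    if pressure >= scores[-1]:
        return curve[-1][1]
    hi = bisect_left(scores, pressure)
    (s0, q0), (s1, q1) = curve[hi - 1], curve[hi]
    return q0 + (q1 - q0) * (pressure - s0) / (s1 - s0)


def allow_colocation(curve, pressure, qos_target):
    """Admit the co-runner only if the predicted QoS meets the target."""
    return predicted_qos(curve, pressure) >= qos_target


if __name__ == "__main__":
    for score in (1, 2.5, 4):
        print(score, predicted_qos(SENSITIVITY_CURVE, score),
              allow_colocation(SENSITIVITY_CURVE, score, qos_target=0.99))
```

Interpolating between profiled points is a design choice of this sketch; any
monotonic lookup over the curve would serve the same purpose.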

So our baseline is basically those 500 machines half loaded. So we're at 50
percent utilization. Now, say we tell Bubble-Up: please allow all of the
co-locations that don't degrade our program by more than one percent. Those
are the ones that essentially don't cause contention; I'll allow a one percent
leeway. This is how much utilization improves as you change the QoS policy,
that line that tells you how much you have to guarantee. So at 99 percent,
we're already experiencing significant gains. And at 90 percent, we're beyond
80 percent utilization. So we have a significant improvement in utilization.
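
The utilization numbers follow from core counting in the scenario described:
500 machines, six cores each, three used by search renderer and three open to
a co-runner. The sketch below assumes that every admitted co-location fills
all three spare cores on its machine and that utilization means busy cores
over total cores; the function names are illustrative.

```python
MACHINES = 500
CORES_PER_MACHINE = 6
CORES_FOR_SEARCH = 3     # search renderer's share on every machine
CORES_FOR_CORUNNER = 3   # cores an admitted co-runner is assumed to occupy


def utilization(colocated_machines):
    """Fraction of all cores that are busy when this many machines host a co-runner."""
    busy = MACHINES * CORES_FOR_SEARCH + colocated_machines * CORES_FOR_CORUNNER
    return busy / (MACHINES * CORES_PER_MACHINE)


def colocations_needed(target_percent):
    """Smallest number of co-located machines that reaches the target utilization."""
    total = MACHINES * CORES_PER_MACHINE
    busy_needed = -(-target_percent * total // 100)          # ceiling division
    extra = max(0, busy_needed - MACHINES * CORES_FOR_SEARCH)
    return -(-extra // CORES_FOR_CORUNNER)                    # ceiling division


if __name__ == "__main__":
    print(utilization(0))          # baseline: only search renderer running
    print(colocations_needed(80))  # machines that must accept a co-runner
```

Under these assumptions, the half-loaded baseline is 50 percent, and passing
80 percent utilization means more than 300 of the 500 machines accepted a
co-runner.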

So overall, at the 98 percent policy, we have about a 50 percent improvement
in utilization by letting Bubble-Up steer co-locations at 98 percent quality of
service. But I mentioned that there was a potential for error. And so Bubble-Up
can potentially make mistakes. It can predict that a co-location is safe, but
then it does violate the quality of service. That can happen to some degree.
But we observe that when a violation does happen, it's a very small violation,
right.

So in the worst case, we have a violation of 3 percent. So all of this blue
is correct decisions made by Bubble-Up, right. And then the other colors show
slight violations. So there are one percent violations, and the worst we get
is a 3 percent violation. It's actually a 2.2 percent violation, so it's
between 2 and 3 percent. And this is across these 17 large real world
applications. And actually, in the paper, I show that if you change how you
define your QoS policies, you can significantly reduce violations and bring
them down to zero.

So this is a huge step forward.

>>:   What if I were to just have a policy of just adding in any attempt of the
-- or randomly allocating. Do you have anything you compared against? You
provided the policy to describe when and where you could go. But I could
imagine just a random --

>> Jason Mars: Right, just randomly co-locate anything. Well, this top graph,
that's an interesting point. This top graph might be helpful for that. It
shows the number of violations that you would have, the absolute number of
violations you would have if you allowed all co-locations. So if you just
allowed everything to co-run, this is how much violation. If you let Bubble-Up
co-run, that's how many violations you have.

And on top of that, the violations are very, very small. So if you say I want
95%, in the worst case you get 92% across these workloads, which is a very
small miss. And you can change the way you define the policy: you can
incorporate a tolerance in your QoS policy to reduce the number of violations,
if your contract allows it.
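
As a rough illustration of how a tolerance trades admitted co-locations for
fewer violations, here is a small sketch. The (predicted, measured) pairs are
invented, QoS is expressed in whole percent, and the names `admit` and
`summarize` are hypothetical; it assumes a violation means the measured QoS
of an admitted co-location fell below the target.

```python
def admit(predicted_qos, target, tolerance=0):
    """Admit only if the prediction clears the target by at least the tolerance."""
    return predicted_qos >= target + tolerance


def summarize(pairs, target, tolerance=0):
    """pairs: (predicted QoS %, measured QoS %) for candidate co-locations."""
    admitted = [(p, m) for p, m in pairs if admit(p, target, tolerance)]
    shortfalls = [target - m for _, m in admitted if m < target]
    return {
        "admitted": len(admitted),
        "violations": len(shortfalls),
        "worst_shortfall": max(shortfalls, default=0),
    }


if __name__ == "__main__":
    # Made-up candidates: (predicted, measured) QoS in percent.
    candidates = [(99, 98), (96, 92), (95, 96), (97, 94), (90, 91)]
    print(summarize(candidates, target=95))               # no tolerance
    print(summarize(candidates, target=95, tolerance=3))  # stricter admission
```

With this made-up data, the stricter policy removes the violations but also
admits fewer co-locations, which is the utilization cost of the extra margin.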

In fact, companies like Google and [indiscernible] companies are not contract,
like, it's not like they lose money. It's just their latency will be 3 percent
more than they want. So this is actually a very big result for, you know, when
you have web services that you're providing, software as a service. So that's
pretty cool.

And then it's general. It applies across architectures. Here are two totally
different kinds of architectures, right. And we let Bubble-Up allow the
co-location across these two very different memory subsystems and
hierarchies. The same bubble, the same reporter, the same design, everything's
the same. You just blindly move Bubble-Up to a different architecture and it
works just as well. And to be clear, just as well doesn't mean it has as many
co-locations. Just as well means it predicts the co-locations as accurately,
because, remember, these processors have smaller caches, smaller memory
subsystems, so you won't be able to have as many co-locations at a given
quality of service level.

So Bubble-Up works quite well here.

>>:   How many cores do those machines have?

>> Jason Mars: Oh, yeah. And it's the same half-loaded policy where we do
half the cores and half the cores. So the Core 2 has four, and then the AMD one
has six and the Westmere has six. So it's six, six, and then four for the Core
2, but these only show the Core 2 and the AMD. The other results showed the
Westmere architecture.

So basically, all the machines are either half loaded or fully loaded under
this policy.

>>: Does your reporter function need to be tuned to the architecture of the
specific machine?

>> Jason Mars: No, that's the beautiful thing. It's the same reporter, the
same bubble. And you just increase that pressure and the reporter
programmatically --

>>:   Measure it on the new machine?

>> Jason Mars: Yeah, yeah. You have to run the profiling on the new machine.
On every machine, you have to run the profiling. So you profile your
applications on the given machine and then you can predict arbitrary
co-locations. So you know, that might not necessarily be necessary. It's
tough. That's a good, interesting point. But I think there's a lot more
interesting kind of cool things you can do with this kind of approach and I'm
really exploring future work.

So I've presented these, you know, how we integrate an awareness of execution
environments, these three levels. And now I'll just talk about what's next.
What are the kinds of things, what are the areas of research that I'm planning
to jump right into.

So the first little vision, which has many components associated with it, is
to have the warehouse scale computer be an autonomous sentient entity, where
it's continuously self-aware of all the execution environments within the
warehouse scale computer and dynamically responding and optimizing. You can
imagine having automated learning and reaction to historical data, where you
can derive policies from observing anomalies. So you learn, for instance: oh,
when this job has to make an RPC this far away, most of the time that job over
there is going to suffer something.

So you can find these little patterns and then derive automatically policies
from them. And you can imagine using machine learning to assist with that.

So here, you think of the data center as a robot that's continuously
self-optimizing. And you can also think you might want to express policies to
that robot, like you might want to tell the robot how to deal with certain
situations. So how do you express that, how do you program this robot? How do
you design a language in which you can express things like responses and
sensing and execution environments? It's really a language for performance
anomalies in warehouse scale computers.

So I wrote a grant on that, and it was funded by Google. There's a lot of
excitement about this particular idea.

Another idea I want to go into is the notion of accelerating warehouse scale
computers, right. In our phones, we have IP blocks. We have accelerators in
our phones for different common workloads that we run all the time. At the
scale of warehouse scale computers, we have the same environment. If Samsung
can build an SOC, why shouldn't Google or a Microsoft or a, you know, any of
these companies? So what you'd have to do is find the common algorithms, what
are the common things. There's a lot of machine learning that happens in
real time, right.

That should be something that we can build an IP block for, that we can
accelerate.
And you've got to get a taxonomy of these different types of workloads that are
amenable to acceleration and then we can start designing custom hardware for
these different approaches and continue to argue to Google or Microsoft or any
of these companies that really, this is the right way to go, because you can
save significantly.

So just two more. I really think there's a lot of opportunity, and there's
emerging momentum, in integrating software with architecture: vertical
integration, and hoisting complexity, where we let the micro architecture
expose things to the software platform and take advantage of that. I think one
interesting, you know, philosophy of doing that is in the hybrid architecture
design. It's Transmeta-like designs, where you have a software component to
the chip and there's binary translation going on. People are actually starting
to look at that commercially, and I think that we don't really know what they
can do yet. And I think they can do a lot. And I want to realize a lot of this
potential and just demonstrate the greatness of it.

Software, okay. So and then finally, I think that we don't have enough
research. We have a lot of research in the hardware community on accelerators
and SOCs and so forth, but we don't have enough software systems. We have
some, but we don't have enough attention being paid to how do you build the
stack on top of -- how do you abstract specialized types of specialized cores?
How do you abstract it so you make them specialized, but then you generalize
them as much as possible, right. What are the right kinds of systems that we
need for that? So that's another promising area that piques my interest.

And with that, I'll take questions.

>>: Thank you.

>> Jason Mars: Thanks.

>>: So you have sort of the commoditization of these data center components
and I guess the assumption there is that there's a market for such components
so the question is, if there's only three companies or whatever, four maybe in
the world that build these, how many are there [indiscernible] and where is the
market for this?

>> Jason Mars: So that's a good point. So those few companies have scale,
right. They have volume. So a lot of, like chip architecture companies like
Intel pay a lot of attention to them. However, I would look at the trend more
broadly.

I would look at the fact that, for the kind of applications that we're using
computers for, a lot of that computation needs to live in that kind of
environment, and you know, we can imagine maybe we would have a lot of smaller
companies that have their own little data centers, or we have a few big
companies that have their infrastructures.

I think regardless of how that looks, I think we'll still continue to have more
and more of our workloads run in the cloud. But it's an interesting point.
That's always a question of how much complexity and cost you want to invest
in something. You always have to think about whether it makes sense
economically, right. Like what are the economics of what you're proposing?
Does having an accelerator make sense if there's only, like, three of them
that you'll need? So it's a great point.

>>: So a little curious, you really focused a lot on the CPU and its systems.
What about other components? If you take the exact same system and I change
from standard spinning drives to SSDs, how different are the dynamics that
causes for these applications?

>> Jason Mars: So the scope of this work really deals from memory up through
the software, through the cores, through the software stack. That's really how
I focused this, and there are a lot of applications that never really leave
memory. Like, you know, web search is one where you shard the index over some
number of machines and the query gets sharded out, sent out and then the search
never leaves memory and never goes to disk and then it comes back with the
result.

But there are applications where you go to disk all the time, like mail, right.
Like where you click on some random email, it can hit a disk, right, over there
somewhere in the cluster.

And it may not even be the same machine as the query. So then the
bottlenecks change completely, right. The bottleneck becomes I/O and it
becomes network, how congested the network is. So my work doesn't go to that
level, but I think a lot of the principles that I kind of expose here apply,
right. Like you could imagine having techniques for sensing how contended the
network is, right.

There might also be -- so I think in the network space, there might be more
opportunities in hardware, like you can have something that asks specific
questions. But yeah, yeah, you can apply a lot of the same principles. You may
be able to have something like Bubble-Up for, you know, network, or I/O to
disk. You can contend for bandwidth to disk and so forth.

You can imagine the same kind of techniques; many of them translate over.
But it's very interesting.

>>: I think that the [indiscernible] are less predictable than your memory
controller, because running, say, two applications on the same machine is
different than when, you know, there's a thousand applications on the network
and you're getting random traffic, random bursts of traffic. The disks might
also be, just because of how they physically behave, a little less
predictable. So building these kinds of models might not be as easy.

>> Jason Mars:   Yeah, yeah.

>>: The next would be to bring it up a level and do it across the whole data
center.

>> Jason Mars: Yeah, yeah. I think that's a great point. Actually, I know
there are a lot of researchers doing a lot of interesting work in the network
in this exact same space. So yeah, I hope to collaborate with a few. Thank
you.

				