

									 National Conference on Role of Cloud Computing Environment in Green Communication 2012

         Performance of Cache Fair Thread Scheduling for multi core processors using
                                  Wait free data structure

        A.S. Radhamani, Research Scholar, Department of Computer Science and Engineering,
                          Manonmaniam Sundaranar University, Tirunelveli.
        E. Baburaj, Professor, Department of Computer Science and Engineering, Sun College of
                  Engineering and Technology,

Abstract- As multi-core processors with tens or hundreds of cores become common, system optimization issues once faced only by the High-Performance Computing (HPC) community become more widely relevant. To satisfy this requirement, one can leverage multi-core architectures to parallelize traffic monitoring so as to improve information processing capabilities over traditional uni-processor architectures. This paper proposes an effective scheduling framework for multi-core processors that strikes a balance between control over the system and an effective network traffic control mechanism for high-performance computing. The proposed Cache Fair Thread Scheduling (CFTS) uses information supplied by the user to guide thread scheduling and, where necessary, gives the programmer fine control over thread placement. Cloud computing has recently received considerable attention as a promising approach for delivering network traffic services by improving the utilization of data centre resources. The primary goal of the scheduling framework is to improve application throughput and overall system utilization in cloud applications. A further aim of the framework is to improve fairness so that each thread continues to make good forward progress. The experimental results show that the parallel CFTS not only increases the processing rate but also maintains good stability.

Key Words: Cache Fair Thread Scheduling, multi-core, cloud computing

I. INTRODUCTION

Explicit parallel architectures require the specification of parallel tasks along with their interactions. For network traffic analysis this is a challenge for several reasons. First, packet capture applications are memory bound, but memory bandwidth does not seem to increase as fast as the number of cores available [1]. Second, balancing the traffic among different processing units is challenging, as it is not possible to predict the nature of the incoming traffic. Exploiting the parallelism with general-purpose operating systems is even more difficult, as they have not been designed for accelerating packet capture. During the last three decades, memory access has always been one of the worst cases of scalability, and several solutions to this problem have been proposed in [2]. With the advent of Symmetric Multiprocessor Systems (SMP), multiple processors are connected to the same memory bus, thereby causing processors to compete for the same memory bandwidth. Integrating the memory controller inside the main processor is another approach for increasing the memory bandwidth. The main advantage of this architecture is fairly obvious: multiple memory modules can be attached to each
processor, thus increasing bandwidth. In shared memory multiprocessors, such as SMP and NUMA (Non Uniform Memory Access), a cache coherence protocol [2] must be used in order to guarantee synchronization among processors. A multi-core processor is a processing system composed of two or more individual processors, called cores, integrated onto a single chip package. As a result, the inter-core bandwidth of multi-core processors can be many times greater than that of SMP systems.

In this paper, a performance analysis for multi-core processors based on real-time schedulers is proposed. The approach also monitors the application behavior at run-time, analyzes the collected information, and optimizes multi-core architectures for different types of schedulers. Wait-free data structures are then applied selectively to make multi-core parallelization easier and more manageable. Based on this, the parallel network traffic controller can run in a 1-way 2-stage pipelined fashion on a multi-core processor, which not only increases the processing speed significantly but also performs well in stability. The remainder of this paper is organized as follows. After related work in Section II, Section III describes the problem for multi-core systems using the Cache Fair Thread Scheduling algorithm, and Section IV gives the results, followed by the conclusion.

II. RELATED WORK

As cloud computing is a relatively new concept, it is still at an early stage of research. Most of the published works focus on a general description of the cloud, such as its definition, advantages, challenges, and future [4]-[6]. In detail, security is a very popular and important research field in cloud computing. Some research focuses on data confidentiality and integrity in cloud computing, and some analyzes security problems in cloud computing from different points of view, such as network, servers, storage, system management and application layer [7]. Besides security, another hot topic in cloud computing is virtualization technology. A secure Private Virtual Infrastructure (PVI) and an architecture that cryptographically secures each virtual machine are proposed in [8]. A virtual machine image management system is introduced in [9], and a real-time protection solution for virtual machines is given in [10]. A Run-Time Monitor (RTM), which is system software that monitors the characteristics of applications at run-time, analyzes the collected information, and optimizes resources on a cloud node consisting of multi-core processors, is described in [18]. A characterization of scientific and transactional applications in Cloud infrastructures (IaaS), identifying the best virtual machine configuration in terms of the optimal processor allocation for executing parallel and distributed applications, is proposed in [19]. The effect of heterogeneous data on the scheduling mechanisms of the cloud technologies and a comparison of the performance of the cloud technologies under virtual and non-virtual hardware platforms are given in [20]. The number of cores that fit on a single chip is growing at an exponential rate, while off-chip main memory bandwidth is growing at a linear rate at best. This disparity between core count and off-chip bandwidth causes per-core memory bandwidth to decrease as process technology advances. An analytic model to study the tradeoffs of utilizing increased chip area for larger caches versus more cores is introduced in [21]. That study also describes
constructing many-core architectures well suited for the emerging application space of cloud computing, where many independent applications are consolidated onto a single chip.

To make CFTS suitable for cloud computing, parallelization technology on multi-core processors is required. One research trend in multi-core parallelization is wait-free data structures. A general introduction to wait-free data structures is given in [11].

III. PROBLEM STATEMENT

This section first describes CFTS in detail by comparing it with a static scheduling algorithm (deadline monotonic) for multi-core systems running cloud applications. The scheduling primitives must support a wide variety of parallelization requirements. Moreover, some applications need different scheduling strategies for different program phases. In deadline monotonic scheduling, the time that a program takes to run depends not only on its own computation requirements but also on those of its co-runners. Therefore, the scheduler must select not only the task to be launched (e.g., a critical task) but also the appropriate core. The core must be selected according to the computational requirements of the tasks already running on each core. This selection is therefore regarded as time consuming for cloud applications. Also, as the number of tasks (traffic) increases, the static scheduling system employed by a traditional OS can no longer guarantee optimal execution of tasks. The proposed CFTS therefore reduces the effects of unequal CPU cache sharing that occur on these processors and cause unfair CPU sharing, priority inversion, and inadequate CPU accounting. CFTS attempts to minimize the effect of thread-level data dependencies and to maintain efficient execution of all threads in the processor without excessive overhead for scheduling computation. The proposed thread scheduling is performed in parallel with CPU operation by utilizing resources, thereby minimizing the amount of time spent in the OS. In order to schedule the CFTS threads effectively, the scheduler must have information regarding the current state of all threads currently executing in the processor. To gain this knowledge, wait-free data structures are implemented to store threads and maintain information about their current status. Therefore, CFTS parallelized by applying wait-free design principles can be used for the allocation and management of shared network resources among different classes of traffic streams with a wide range of performance requirements and traffic characteristics in high-speed packet-switched network architectures. Since the proposed CFTS maximizes throughput, these metrics can then be used for efficient bandwidth management and traffic control in order to achieve high utilization of network resources while maintaining the desired level of service for cloud computing by CFTS scheduling in multi-core systems.

A. Description of Cache Fair Thread scheduling algorithm

CFTS is suitable for cloud computing because of its idea of bandwidth borrowing. It can not only control the bandwidth of all users, guaranteeing that all users are given different levels of basic service according to their payment, but also make more effective use of free resources and provide a better user experience. In the cloud, there could be different kinds of leaf users, e.g. 2 Mbps or 5 Mbps, corresponding to different service levels. Because CFTS is a dynamic traffic control mechanism, classes can be
added or removed dynamically. This makes CFTS scalable enough for cloud computing.

On real hardware, it is possible to run only a single task at once, so while that one task runs, the other tasks that are waiting for the CPU are at a disadvantage: the current task gets an unfair amount of CPU time. In CFTS this fairness imbalance is expressed and tracked via the per-task p->wait_runtime value (in nanoseconds). "wait_runtime" is the amount of time the task should now run on the CPU for it to become completely fair and balanced. CFTS's task picking logic is based on this p->wait_runtime value and is thus very simple: it always tries to run the task with the largest p->wait_runtime value. CFTS therefore always tries to split up CPU time between runnable tasks as close to 'ideal multitasking hardware' as possible. This algorithm redistributes CPU time to threads to account for unequal cache sharing: if a thread's performance decreases due to unequal cache sharing it gets more time, and vice versa. The challenge in implementing this algorithm is determining how a thread's performance is affected by unequal cache sharing using limited information from the hardware. The cache-fair scheduling algorithm does not establish a new CPU sharing policy but helps enforce existing policies. The key part of our algorithm is correctly computing the adjustment to the thread's CPU quantum [13]. The following four steps are used to compute the cache-fair scheduling adjustment.
1. Determine a thread's fair L2 cache miss rate: the miss rate that the thread would experience under equal cache sharing.
2. Compute the thread's fair CPI rate: the cycles per instruction under the fair cache miss rate.
3. Estimate the fair number of instructions: the number of instructions the thread would have completed under the existing scheduling policy if it ran at its fair CPI rate (divide the number of cycles by the fair CPI). Then measure the actual number of instructions completed.
4. Estimate how many CPU cycles to give or take away to compensate for the difference between the actual and the fair number of instructions. Adjust the thread's CPU quantum accordingly.
The algorithm works in two phases:
1. Searching phase: The scheduler computes the fair L2 cache miss rate for each thread.
2. Calibration phase: A single calibration consists of computing the adjustment to the thread's CPU quantum and then selecting a thread from the best-effort class whose CPU quantum is adjusted to offset the adjustment to the cache-fair thread's quantum. Calibrations are repeated periodically.

The challenge in implementing this algorithm is that, in order to correctly compute adjustments to the CPU quanta, one needs to determine a thread's fair CPI ratio using only limited information from hardware counters [14]. This algorithm reduces L2 contention by avoiding the simultaneous scheduling of problematic threads while still ensuring real-time constraints.

B. Description of Static Algorithm (Deadline Monotonic)

Meeting hard deadlines implies constraints upon the way in which system resources are allocated at runtime. This includes both physical and logical resources. Conventionally, resource allocation is performed by scheduling algorithms whose purpose is to interleave the executions of processes in the system to achieve a pre-determined goal. For hard real-time systems the obvious goal is that no deadline is missed. One scheduling method that has
been proposed for hard real-time systems is a type of deadline monotonic algorithm [15]. This is a static priority based algorithm for periodic processes in which the priority of a process is related to its period. The algorithm has several useful properties, including a schedulability test that is sufficient and necessary, but the constraints that it imposes on the process system are severe: processes must be periodic, independent and have deadline equal to period. The processes to be scheduled are characterized by the following relationship:
 Computation time ≤ deadline ≤ period
Based on this, each core is characterized by:
1) The frequency of each core, f_j, given in cycles per unit time. With dynamic voltage scaling (DVS), f_j can vary from f_j^min to f_j^max, where 0 < f_j^min < f_j^max. The speed of the core, S_j, follows directly from the frequency (the cycle time is the inverse of the frequency).
2) The specific architecture of a core, A(core_j), which includes the type of the core, its speed in GHz, I/O, and local cache and/or memory in bytes.
Tasks: Consider a parallel application, T = {t_1, t_2, ..., t_n}, where t_i is a task. Each task is characterized by:
1) The computational cycles, c_i, that it needs to complete. (The assumption here is that c_i is known a priori.)
2) The specific core architecture type, A(t_i), that it needs to complete its execution.
3) The deadline, d_i, before which the task has to complete its execution.

The application, T, also has a deadline, D, which is met if and only if the deadlines of all its tasks are met. Here, the deadline can be larger than the minimum execution time and represents the time that the user is willing to tolerate because of the performance-energy trade-offs. The number of computational cycles required by t_i to execute on core_j is a finite positive number, denoted by c_ij. The execution time of t_i under a constant speed S_ij, given in cycles per second, is
        t_ij = c_ij / S_ij.

C. Description of Wait free data structures

A wait-free data structure is a lock-free data structure with the additional property that every thread accessing the data structure can complete its operation within a bounded number of steps, regardless of the behaviour of other threads. Each thread is guaranteed to make progress itself or to help a cooperating thread make progress [12]. This property means that high-priority threads accessing the data structure never have to wait for low-priority threads to complete their operations on the data structure, and every thread will always be able to make progress when it is scheduled to run by the OS. For real-time or semi-real-time systems this can be an essential property, as the indefinite wait periods of blocking data structures, or of lock-free data structures that are not wait-free, do not allow their use within time-limited operations. A wait-free data structure has the maximum potential for true concurrent access, without the possibility of busy waits.

IV. RESULTS AND DISCUSSION

In the proposed work, the times taken to complete work segments under the Cache Fair Thread Scheduler with Wait Free data structures (CFTS-WF) and under the Cache Fair Thread Scheduler (CFTS) are compared. This quantity is referred to as completion time. When running with a static scheduler, the difference between completion times is larger, but when running with the cache fair scheduler, it is significantly smaller. Figure 1 shows normalized completion times
with the static scheduler, and Figure 2 shows normalized completion times with the Cache Fair Thread scheduler.

[Chart "Static Scheduler": normalized completion time per benchmark (coop, imrec, ood, plros, game, zip, wpro, vlsi); series DM WF and DM]
Figure: 1 Performance Variability with Static scheduler

[Chart "Cache Fair Thread scheduler"]
Figure: 2 Performance variability with Cache fair Thread Scheduler

Hence CFTS seeks to maximize the use of concurrency by mapping independent tasks onto different threads, so as to minimize the total completion time by ensuring that processes are available to execute the tasks on the critical path as soon as such tasks become executable. It should also seek to minimize interaction among processes by mapping tasks with a high degree of mutual interaction onto the same process. It is further designed to test whether the wait-free based CFTS is also competent for cloud applications.

V. CONCLUSION

A significant body of the performance modeling research literature has focused on various aspects of the parallel computer scheduling problem and the allocation of computing resources among the parallel jobs submitted for execution. Several classes of scheduling strategies have been proposed for such computing environments, each differing in the way the parallel resources are shared among the jobs. These include the class of space-sharing strategies that share the processors in space by partitioning them among different parallel jobs, the class of time-sharing strategies that share the processors by rotating them among a set of jobs in time, and the class of scheduling strategies that combine both space-sharing and time-sharing. In this paper, a wait-free based parallel CFTS that is effective and stable for multi-core systems is implemented. Based on the algorithms for accessing data structures and the usage of a wait-free FIFO, the parallel CFTS can run in a pipelined fashion. The experimental analysis and evaluation results both indicate that parallel CFTS is more suitable for multi-core applications due to its excellent performance in both line rate and stability. In future, improvements of CFTS-WF for higher line rates may be implemented. Moreover, a parallel network application based on a multi-core processor is cheaper and more scalable than special hardware, and its more effective usage in cloud computing remains to be explored.
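
As an illustration of the four-step quantum adjustment and the calibration phase described in Section III.A, the following minimal sketch may be helpful. It is not the scheduler evaluated in this paper. The names ThreadSample, fair_cpi, quantum_adjustment and calibrate are hypothetical, the linear CPI model (a base CPI plus a fixed penalty per L2 miss) is an assumption, and converting the instruction deficit into cycles at the observed CPI is likewise an assumption.

```python
from dataclasses import dataclass


@dataclass
class ThreadSample:
    """Counters for one thread over one sampling window (hypothetical layout)."""
    cycles: float          # CPU cycles the thread ran during the window
    instructions: float    # instructions it actually completed
    fair_miss_rate: float  # step 1: estimated L2 misses per instruction under equal sharing


def fair_cpi(base_cpi, miss_penalty, fair_miss_rate):
    # Step 2 (assumed linear model): each L2 miss adds a fixed stall penalty.
    return base_cpi + miss_penalty * fair_miss_rate


def quantum_adjustment(sample, base_cpi, miss_penalty):
    cpi_fair = fair_cpi(base_cpi, miss_penalty, sample.fair_miss_rate)
    # Step 3: instructions the thread would have completed at its fair CPI.
    fair_instructions = sample.cycles / cpi_fair
    # A positive deficit means the thread was slowed by unequal cache sharing.
    deficit = fair_instructions - sample.instructions
    # Step 4 (assumption): convert the deficit into cycles at the observed CPI.
    actual_cpi = sample.cycles / sample.instructions
    return deficit * actual_cpi


def calibrate(fair_quantum, best_effort_quantum, adjustment):
    # Calibration: cycles granted to the cache-fair thread are offset by
    # shrinking (or growing) the quantum of a best-effort thread.
    return fair_quantum + adjustment, best_effort_quantum - adjustment


sample = ThreadSample(cycles=1_000_000, instructions=400_000, fair_miss_rate=0.01)
adjustment = quantum_adjustment(sample, base_cpi=1.0, miss_penalty=100.0)
new_quanta = calibrate(10_000_000, 10_000_000, adjustment)
```

In this example the thread retires 400,000 instructions where its fair share would be 500,000, so its quantum grows and the best-effort thread's quantum shrinks by the same number of cycles.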

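The wait-free FIFO mentioned in the conclusion, which connects the two pipeline stages, can be pictured as a single-producer single-consumer ring buffer in the spirit of [12]. The sketch below is only structural and is not the queue used in this paper. The class name SpscRing and its methods are hypothetical. Each operation finishes in a fixed number of steps with no loop and no lock, but a real implementation would need atomic loads and stores with suitable memory ordering, which plain Python does not express.

```python
class SpscRing:
    """Bounded FIFO for exactly one producer and one consumer (illustrative)."""

    def __init__(self, capacity):
        # One slot is kept empty so that "full" and "empty" can be told apart.
        self._buf = [None] * (capacity + 1)
        self._head = 0  # next slot to read; written only by the consumer
        self._tail = 0  # next slot to write; written only by the producer

    def try_push(self, item):
        nxt = (self._tail + 1) % len(self._buf)
        if nxt == self._head:
            return False           # queue full: report failure instead of waiting
        self._buf[self._tail] = item
        self._tail = nxt           # publish the slot to the consumer
        return True

    def try_pop(self):
        if self._head == self._tail:
            return False, None     # queue empty: report failure instead of waiting
        item = self._buf[self._head]
        self._buf[self._head] = None
        self._head = (self._head + 1) % len(self._buf)
        return True, item


# Stage 1 (producer) hands packets to stage 2 (consumer) through the ring.
ring = SpscRing(capacity=2)
pushed = [ring.try_push(p) for p in ("pkt0", "pkt1", "pkt2")]
popped = [ring.try_pop() for _ in range(3)]
```

Because the producer writes only the tail index and the consumer writes only the head index, neither stage has to wait for the other to finish an operation. A full or empty queue is reported to the caller, which decides what to do next.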

              REFERENCES

   1. K. Asanovic and others, The landscape of parallel computing research: A view from Berkeley. Technical Report UCB/EECS-2006-183, EECS Department, University of California, Berkeley, December (2006).
   2. T. Tartalja and V. Milutinovich, The Cache Coherence Problem in Shared-Memory Multiprocessors: Software Solutions, ISBN: 978-0-8186-7096-1, (1996).
   3. C. Leiserson and I. Mirman, How to Survive the Multi-core Software Revolution, Cilk Arts, (2009).
   4. R. Buyya, "Market-Oriented Cloud Computing: Vision, Hype, and Reality of Delivering Computing as the 5th Utility", Proc. of 9th IEEE/ACM International Symposium on Cluster Computing and the Grid (CCGRID'09), Shanghai, China, May, 2009, pp. 1, doi: 10.1109/CCGRID.2009.97.
   5. M. Armbrust et al., "Above the Clouds: A Berkeley View of Cloud Computing", Department of Electrical Engineering and Computer Sciences, University of California at Berkeley, Report No. UCB/EECS-2009-28, CA, USA, 2009.
   6. J. Heiser and M. Nicolett, "Assessing the Security Risks of Cloud Computing", Gartner Inc., Stamford, CT, 2008.
   7. M. Yildiz, J. Abawajy, T. Ercan, and A. Bernoth, "A Layered security Approach for Cloud Computing Infrastructure", Proc. of the 10th International Symposium on Pervasive Systems, Algorithms, and Networks, 2009, pp.763-767, doi: 10.1109/I-SPAN.2009.157.
   8. F.J. Krautheim, "Private Virtual Infrastructure for Cloud Computing", in HotCloud'09, San Diego, CA, USA, June, 2009.
   9. J. Wei, X. Zhang, G. Ammons, V. Bala, and P. Ning, "Managing security of virtual machine images in a cloud environment", Proc. of the ACM workshop on Cloud computing security 2009, Chicago, IL, US, 2009, pp.91-96, doi: 10.1145/1655008.1655021.
   10. M. Christodorescu, R. Sailer, D. L. Schales, D. Sgandurra and D. Zamboni, "Cloud Security Is Not (Just) Virtualization Security", Proc. of the ACM workshop on Cloud computing security 2009, Chicago, IL, US, 2009, pp.97-102, doi: 10.1145/1655008.1655022.
   11. K. Fraser and T. Harris, "Concurrent programming without locks", ACM Transactions on Computer Systems, vol. 25 (2), May 2007.
   12. J. Giacomoni, T. Moseley, and M. Vachharajani, "FastForward for efficient pipeline parallelism: A cache-optimized concurrent lock-free queue", Proc. of PPoPP'08, New York, NY, USA, February 2008, pp.43-52.
   13. Alexandra Fedorova, Margo Seltzer and Michael D. Smith, Harvard University, Sun Microsystems, "Cache-Fair Thread Scheduling for Multicore Processors".
   14. S. Kim, D. Chandra and Y. Solihin, "Fair Cache Sharing and Partitioning in a Chip Multiprocessor Architecture", in Intl. Conference on Parallel Architectures and Compilation Techniques (PACT), 2004.
   15. Ishfaq Ahmad, Sanjay Ranka, Samee Ullah Khan, "Using Game Theory for Scheduling Tasks on Multi-Core Processors for Simultaneous Optimization of Performance and Energy", 2008.
   16. Zheng Li, Nenghai Yu, Zhuo Hao, "A Novel Parallel Traffic Control Mechanism for Cloud Computing", 2nd IEEE International Conference on Cloud Computing Technology and Science, 2010.
   17. Lizhe Wang, Jie Tao, Gregor von Laszewski, Holger Marten, "Multicores in Cloud Computing: Research Challenges for Applications", Journal of Computers, vol. 5, no. 6, June 2010.
   18. Mikyung Kang, Dong-In Kang, Stephen P. Crago, Gyung-Leen Park and Junghoon Lee, "Design and Development of a Run-Time Monitor for Multi-Core Architectures in Cloud Computing", Sensors 2011, 11, 3595-3610.
   19. Denis R. Ogura, Edson T. Midorikawa, "Characterization of Scientific and Transactional Applications under Multi-core Architectures on Cloud Computing Environment", IEEE International Conference on Computational Science and Engineering, pp. 314-320, 2010.
   20. Ekanayake, J.; Gunarathne, T.; Qiu, J., "Cloud Technologies for Bioinformatics Applications", IEEE Transactions on Parallel and Distributed Systems, Volume: 22, pp. 998-1011.
   21. Agarwal, Anant; Miller, Jason; Beckmann, Nathan; Wentzlaff, David, "Core Count vs Cache Size for Manycore Architectures in the Cloud", CSAIL Technical Reports, Volume 6568/2011, pp.39-50.

Department of CSE, Sun College of Engineering and Technology
