This paper covers the following topics: CPU, power consumption, processors, multicore, multiprocessors, multithreading.
Author manuscript, published in "48th IEEE Conference on Decision and Control (CDC'09), Shanghai : China (2009)"
hal-00404053, version 1 - 6 Apr 2011

Energy Consumption Reduction with Low Computational Needs in Multicore Systems with Energy-Performance Tradeoff

Sylvain Durand and Nicolas Marchand

S. Durand is with NeCS Project-Team, INRIA - GIPSA-lab - CNRS, Grenoble, France, sylvain.durand@inrialpes.fr
N. Marchand is with NeCS Project-Team, INRIA - GIPSA-lab - CNRS, Grenoble, France, nicolas.marchand@gipsa-lab.inpg.fr

Abstract: An electronic device with two supply voltage levels is attractive because both the clock frequency and the supply voltage level can be reduced (respecting certain rules) in order to decrease the energy consumption. In a previous paper we proposed a robust control architecture to deal with this power-performance tradeoff. We are now interested in extending this principle to several devices that work together, since they are all supplied with the same voltage and clock frequency. An intuitive multicore control strategy, which duplicates the whole monocore architecture once per device, is compared with a second strategy in which the duplication is reduced as much as possible. It appears that the proposal clearly lowers the computational needs of the controller while achieving the same reduction of the energy consumption.

I. INTRODUCTION

An energy-performance tradeoff is required in many embedded electronic systems. Three sources of power consumption exist in CMOS circuits [4]. They can be sorted into a dynamic consumption, due to the switching of electrical gates, and a static consumption, due to short circuit and leakage currents:

$$
\begin{aligned}
P &= P_{\text{switching}} + P_{\text{short circuit}} + P_{\text{leakage}} \\
P &= K_{dyn}\, f_{clk}\, V_{dd}^{2} + K_{sc}\, f_{clk}\, V_{dd} + K_{leak}\, V_{dd}
\end{aligned}
\qquad (1)
$$

The consumption can therefore be reduced by decreasing V_dd, i.e. the supply voltage, or f_clk, i.e. the clock frequency. However, decreasing only the frequency decreases the power consumption and slows the running task down, but the total energy consumption remains unchanged [12]. The voltage hence has to be reduced in order to decrease the energy consumption. Furthermore, the supply voltage is the dominant term, especially because the dynamic power is the most important part in (1). In other words, decreasing the voltage decreases the energy consumption almost quadratically. Unfortunately, this drop also decreases the computational speed (because of the propagation delay of transistors), so controlling the supply voltage is a power-delay tradeoff: the power consumption decreases while the delay increases. That is why the supply voltage and the clock frequency have to be controlled together to guarantee the critical path (the longest electrical path a signal can travel from one point of the circuit to another). Clearly, the clock frequency must be decreased before the supply voltage is decreased and, conversely, the supply voltage must be increased before the clock frequency is increased.

A good energy-performance tradeoff can be achieved with an approach commonly used in embedded systems: Dynamic Voltage and Frequency Scaling (DVFS). This method consists in adapting the voltage and the frequency to the computational load, and it leads to an important reduction of the energy consumption (depending on the application) [10]. Furthermore, it seems that most applications can run with a reduced voltage [2], [3]. Several rules are thus known to minimize the energy consumption. Firstly, each task has to be considered independently and its execution time has to fit its deadline. Moreover, selecting suitable voltage levels leads to a drastic energy reduction even if the number of levels is very small [7]. Finally, the supply voltage has to be reduced as much as possible and the clock frequency adapted to the computational load [11].

Based on these rules, we proposed in [5] a robust strategy to control the clock frequency and the supply voltage level of an electronic device. The proposal minimizes the energy consumption while guaranteeing a good computational performance. We are now interested in extending this principle to several devices that work together (in the same voltage and frequency domain) but where each device has to deal with a different load. In the following section, we first recall the monocore system architecture and summarize its control strategy. In section III, the multicore architecture is presented and two control strategies are detailed: a first intuitive one, which duplicates the monocore principle once per device, and a second one, which considerably reduces the computational needs. Finally, the two controllers are compared in section IV in terms of energy consumption and control computational needs.

II. MONOCORE SYSTEM PRINCIPLE

The system architecture with only one device to control is shown on Figure 1.

Fig. 1. Monocore system architecture

The Device is the system to control. It usually runs at nominal supply voltage and constant clock frequency, but these quantities now vary dynamically in order to reduce the energy consumption. This is made possible by introducing a closed loop with a controller that monitors the activity of the device (its computational speed ω) and adapts the supply voltage and the clock frequency to the computational load ref provided by the operating system for each task.

The Oscillator and the Vdd-hopping are the two actuators used in some DVFS systems. They respectively provide the clock frequency and the supply voltage to the device:
• The oscillator could be a ring oscillator [6].
• The Vdd-hopping principle is described in [1]. Two voltage levels are available (V_low and V_high) and either one can be selected (with a certain transition time and dynamics that depend on the internal controller of the Vdd-hopping) through the V_level input signal: V_level = level_low requests the low voltage and V_level = level_high requests the high voltage.

The Controller has to provide the control signals to the actuators. It can be divided into two parts, as depicted on Figure 2:
• The computational speed controller (CSC) provides the computational speed set point ω_sp. For each task T_i, the operating system provides the computational load (i.e. the number of instructions C_i) and the time before which the task has to be executed (i.e. the deadline N_i). From this task information, a fast predictive control law calculates the best speed set point in order to minimize the time spent at the penalizing high voltage (and so the energy consumption) while guaranteeing the computational performance.
• Then the frequency and voltage level controller (FVC) makes the measured speed ω track the desired one ω_sp by adapting the frequency f and the voltage level V_level.

Fig. 2. Monocore controller architecture: a computational speed controller (CSC) plus a frequency and voltage level controller (FVC)

The whole monocore controller (CSC + FVC) leads to a robust control (see [5] for further details): for a given test bench, the device runs at the penalizing high supply voltage during only 30% of the time and an energy consumption reduction of about 20% is achieved. We propose next to adapt this principle to a system with several devices.

III. MULTICORE SYSTEM PRINCIPLE

The system architecture with several devices to control is shown on Figure 3. This system is not so different from the monocore one (presented in section II and shown on Figure 1). The bases remain the same, with a controller that sends the frequency f and the voltage level V_level to the actuators, i.e. a ring oscillator and a Vdd-hopping, which respectively provide the clock frequency and the supply voltage to the electronic devices. The main difference is that there are now N devices to control, which means as many references ref^N given by the operating system (the number of instructions and the deadline for each task) and as many measured computational speeds ω^N as devices. The controller therefore has to control the whole system, but the devices do not work independently since they are all supplied with the same voltage V_dd and the same clock frequency f_clk. The only degree of freedom allowed is to trigger a device with a ratio of the clock f_clk, because in practice it is possible to add one or two NOPs (i.e. No OPeration) between each instruction so that the device runs two or three times slower. For this reason, the energy controller now also has to provide the frequency ratios ρ_N.

Fig. 3. Multicore system architecture

Notations: ρ_j (lower case index) denotes the signal ρ of the device j, whereas ρ_N (upper case index) means that there are N signals ρ, one for each device.

In the two following subsections we detail two control strategies: a first intuitive one, which duplicates the monocore principle once per device, and a second one, which tries to minimize the computational needs of the controller.

A. Multicore control based on full duplication of the monocore control strategy

A first way to control several devices is to duplicate the monocore control strategy (detailed in section II) once per device. The resulting multicore architecture is presented on Figure 4 and can be divided into three steps:
1) First, the computational speed controller (CSC) calculates the speed set points ω_sp^N for all the devices. The set point ω_sp^j is calculated independently for each device j, using the task information C_i^j and L_i^j (given by the operating system) and the measured speed ω^j.
2) Then the frequency and voltage level controller (FVC) independently calculates the frequencies f^N and the voltage levels V_level^N usually required to control a single device.
3) Finally, a frequency ratio controller compares the calculated frequencies f^N to deduce the critical device c, i.e. the device which needs the maximal frequency to fit its load. The frequency f and the voltage level V_level sent to the actuators are those of the critical device, i.e. f^c and V_level^c, and the frequency ratios ρ_N are obtained as the ratio between the frequency of the current device f^j and that of the critical device f^c.

Fig. 4. Multicore control architecture based on full duplication of the monocore control strategy: the computational speed controller (CSC) and the frequency and voltage level controller (FVC) are duplicated once per device, and a frequency ratio controller calculates the critical frequency and voltage level to deduce the frequency ratios ρ_N

This intuitive strategy guarantees that the tasks are correctly performed for all devices because each device is independently controlled using the monocore strategy. Indeed, the monocore strategy works for one device, and the frequency and voltage level decision is focused on the critical one, i.e. the device which has to treat the task with the highest computational needs. All the non-critical tasks are thus executed with the critical voltage level and a frequency lower than or equal to the critical frequency. Moreover, a non-critical device can become the critical one when its task requires more and more computation.

An improvement can be made for the non-critical devices. If a device runs at the high level, it is forced to the maximal frequency in order to spend the shortest possible time at the penalizing high supply voltage (see [5] for further details). A non-critical device, which a priori could run at V_low, will hence have its frequency forced anyway when the critical device needs to run at V_high. For this reason, we propose to force only the frequency of the critical device. However, in practice the critical device is not yet known when the frequencies are calculated, i.e. in step 2, because the frequency ratio controller only determines it in step 3. A solution consists in using the device which was critical during the previous sampling period, under the assumption that the critical device does not often change.

This intuitive duplication of the whole monocore principle reduces the energy consumption of several devices working together while guaranteeing their computational performance. Nevertheless, the control computational needs are multiplied by the number of devices, and the number of variables seriously increases too. That is why we propose next to duplicate only some parts of the monocore control strategy.

B. Multicore control based on partial duplication of the monocore control strategy

This second strategy tries to minimize the control computational needs by not duplicating the whole monocore control strategy. The frequency ratios ρ_N have to be calculated, so some parts necessarily have to be duplicated in order to obtain the N signals. The aim is to repeat as little code as possible. The best solution would be to use the references ref^N (given by the operating system) to deduce the ratios without duplicating any part of the monocore strategy, but these signals are not informative enough. Indeed, the critical task cannot be known only from the number of instructions and the deadline, because the computational load which was already executed is needed too. Therefore we propose to duplicate the computational speed controller (which seems to have to be repeated anyway). The multicore architecture of Figure 5 is thus proposed:
1) First, the computational speed controller (CSC) provides the speed set points ω_sp^N, from which the frequency ratios ρ_N can be obtained since they carry information on the remaining computational load.
2) Then the frequency ratio controller compares all the speed set points ω_sp^N to deduce the critical task c, i.e. the task which needs the maximal speed to fit its deadline. The speed set point ω_sp and the measured speed ω sent to the FVC are those of the critical task, i.e. ω_sp^c and ω^c, and the frequency ratios ρ_N are obtained as the ratio between the speed set point of the current device ω_sp^j and that of the critical task ω_sp^c.
3) Finally, the frequency and voltage level controller (FVC) calculates the frequency f and the voltage level V_level to send to the actuators only for the critical device, i.e. the device which has to compute the critical task.

With this proposal, only the CSC is repeated and no longer the FVC. We hence expect a reduction of the control computational needs without impacting the gain on the energy consumption.

Fig. 5. Multicore control architecture based on partial duplication of the monocore control strategy: only the computational speed controller (CSC) is duplicated once per device, and a frequency ratio controller then calculates the critical speed set point, which is used by the frequency and voltage level controller (FVC), and deduces the frequency ratios ρ_N

Though the devices are not all independently controlled using the monocore strategy, the computational performance is still guaranteed for each device. With this second architecture, the monocore control strategy only guarantees that the critical task will fit its deadline, since it is only applied to the critical device. The frequency ratios for the non-critical devices are then calculated from the computational load of the task of each device, which is adjusted by the CSC. All the non-critical tasks are thus executed by their deadline anyway, or a task becomes the critical one when its computational needs become the most important.

C. Discrete values of the frequency ratios

The control algorithms proposed in both previous subsections were developed with ideal continuous frequency ratios ρ_N. However, as explained in the introduction of the multicore principle, a device can be triggered with a ratio of the clock frequency f_clk by adding NOPs between instructions so that the device runs slower. This is why the frequency ratios can only take discrete values, which correspond to the number of NOPs, i.e. ρ̃ = {1; 1/2; 1/3} for 0, 1 or 2 NOPs respectively added between each instruction (the discrete frequency ratios are written ρ̃ and the continuous ones ρ). In order to implement this behavior, we first have to calculate the continuous ratios ρ_N (i.e. ρ_j = f_cor^j / f_cor^c for the multicore control strategy based on full duplication of the monocore control strategy and ρ_j = ω_sp^j / ω_sp^c for the one based on partial duplication). Then, for each device j, the discrete frequency ratio ρ̃_j is chosen as the discrete value just above the continuous ratio ρ_j, as described by the following algorithm:

$$
\tilde{\rho}_j =
\begin{cases}
1 & \text{if } \tfrac{1}{2} < \rho_j \le 1 \\
\tfrac{1}{2} & \text{if } \tfrac{1}{3} < \rho_j \le \tfrac{1}{2} \\
\tfrac{1}{3} & \text{if } 0 < \rho_j \le \tfrac{1}{3} \\
0 & \text{otherwise}
\end{cases}
\qquad (2)
$$

This discrete ratio behavior leads to a less energy-efficient system because, due to (2), the frequencies of the non-critical devices are higher than required, contrary to the continuous case where these frequencies correspond exactly to the desired ones. Moreover, the control computational needs increase slightly because of the added code required to calculate the discrete ratios ρ̃_N.

IV. PERFORMANCE EVALUATION

This section presents some simulation results. The benchmark test is the same for all the simulations: four devices, each with a different reference (number of instructions and deadline, shown on Figure 6), have to be controlled:
device 1 → three tasks to execute: the first task starts with 5 instructions to do in 0.5µs, then a 75 instruction task has to be executed in 2.5µs and the last one has to compute 10 instructions in 1µs.
device 2 → three tasks also: a 15 instruction task to execute in 1.25µs, a task with 50 instructions to do in 2.25µs and then 5 instructions to execute in 0.5µs.
device 3 → a single task of 40 instructions to do in 4µs.
device 4 → three tasks again: 10 instructions to compute in 0.75µs, a task with 20 instructions to do in 0.75µs and a last 40 instruction task to execute in 2.5µs.

Fig. 6. References used for the simulations: the number of instructions, the deadline and the laxity (the remaining available time to execute the task) for each device

The results are quantified in terms of energy consumption and computational needs:
Energy consumption of the system: The energy consumption is calculated in order to estimate the reduction achieved by our proposal. Relation (1) is used, and a fraction of this power consumption is added to account for the Vdd-hopping principle: 20% more during the voltage transition time and 3% more during the steady state [8]. Finally, an integration over the whole running time gives the total energy consumption.
Computational needs of the controller: The control laws are compared in terms of computational needs, i.e. the number of instructions required to calculate the computational speed set points, the frequencies, the voltage levels and the frequency ratios. To do that, we use the Lightspeed Matlab toolbox proposed by T. Minka [9], which provides a number of flops for each instruction.

Moreover, the strategies are compared with a system using the intuitive control strategy (duplicating the whole monocore control strategy) but without Dynamic Voltage Scaling (DVS): in this case the measured speed tracks the average speed set point and the supply voltage is fixed to the penalizing high voltage, i.e. V_level = level_high.

First, the simulation results for both control strategies (with ideal continuous frequency ratios) are shown on Figures 7 and 8. The top plots show the average speed set point (as a guideline), the speed set point ω_sp, the measured speed ω and the critical speed ω^c (as a guideline) for each device. One can verify that ω = ω^c when the device is the critical one (highlighted by the gray areas on the plots). Moreover, the supply voltage V_dd (which is the same for all the devices because of the multicore architecture) is shown on the bottom plot. Note that the calculated frequency f, the clock frequency f_clk and the voltage level V_level are not plotted because they do not provide relevant information: the frequencies are proportional to the speed and the level can be deduced from the voltage.

Fig. 7. Simulation results of the multicore controller based on full duplication of the monocore control strategy (with ideal continuous frequency ratios): energy consumption of 3.976 · 10^-5 J and computational needs of 5.8 · 10^5 flops, that is 82.2% of the energy consumption and 94% of the computational needs compared to a controller without DVS

Fig. 8. Simulation results of the multicore controller based on partial duplication of the monocore control strategy (with ideal continuous frequency ratios): energy consumption of 3.98 · 10^-5 J and computational needs of 3.8 · 10^5 flops, that is 82.4% of the energy consumption and 62% of the computational needs compared to a controller without DVS

In both cases, the system runs at low voltage during more than 50% of the simulation time and a reduction of the energy consumption of about 20% is achieved in comparison with a system without DVS. The differences between the two control strategies appear during the voltage transitions and come from the choice of the critical device:
A) For the multicore control strategy based on full duplication of the monocore control strategy, one can see on Figure 7 that the measured speed ω is continuous for all the devices. This is because the ratios ρ_N are obtained from the frequencies f^N independently calculated for each device.
B) For the strategy based on partial duplication, one can see on Figure 8 a discontinuity of the measured speed ω as soon as the critical device changes, such as at time 2.35µs on device 2. Indeed, the frequency ratios ρ_N are obtained from the speed set points ω_sp^N, which are switching variables by construction (see [5] for further details). The speed set point value of a device can thus change suddenly, and so can the ratios. Nevertheless, the critical frequency, and so the critical speed, remains continuous.

While the energy consumption is very similar for both strategies, the computational needs are considerably reduced for the second one, with a drop of 35% of the number of flops. For this reason, it would be the strategy to use.

Finally, we compare the simulation results of the control strategy with low computational cost, on the one hand when the frequency ratios are the ideal continuous variables ρ_N and on the other hand when they are the discrete variables ρ̃_N described by the algorithm (2). One can immediately remark that the results, respectively shown on Figures 8 and 9, are quite similar. The main difference is that the measured speed ω does not track the speed set point ω_sp as well in the discrete case as in the continuous case. However, the algorithm ensures that the speed is at least as high as the desired one, so the computational load is correctly computed. This is why this principle is interesting: it only leads to an increase of less than 1% of the energy consumption and 11% of the control computational needs in comparison with the continuous frequency ratio case (which could not be implemented in practice anyway).

Fig. 9. Simulation results of the multicore controller based on partial duplication of the monocore control strategy (with discrete frequency ratios): energy consumption of 4 · 10^-5 J and computational needs of 4.3 · 10^5 flops, that is an increase of 1% of the energy consumption and 11% of the computational needs compared to the controller with ideal continuous frequency ratios

V. CONCLUSIONS AND FUTURE WORKS

This paper proposes architectures to control several devices which work together, since they are all supplied with the same voltage V_dd and the same clock frequency f_clk (or a ratio of this clock). The multicore control strategies are based on the monocore control strategy described in [5], where a fast predictive control technique gives a computational speed set point to track in order to minimize the energy consumption while guaranteeing the computational performance.

While the first multicore control strategy intuitively duplicates the whole monocore architecture once per device, the second strategy, which is the contribution of this paper, tries to minimize the duplication as much as possible in order to decrease the control computational needs. Both architectures lead to a similar gain in energy consumption (compared to a system without DVS mechanism), but an important reduction of the number of flops is achieved with the second one. We finally propose to use discrete frequency ratios, which are the only way to implement our controller in practice.

The next step in this research is to test these control strategies in practice.

VI. ACKNOWLEDGMENTS

This research has been supported by the NeCS Project-Team (INRIA, GIPSA-lab, CNRS) in the context of the ARAVIS project. ARAVIS is a Minalogic project gathering STMicroelectronics with academic partners from different fields, namely TIMA and CEA-LETI for micro-electronics and INRIA for operating system and control. The aim of the project is to overcome the barrier of subscale technologies (45nm and smaller).

REFERENCES

[1] C. Albea, C. Canudas de Wit, and F. Gordillo. Control and stability analysis for the Vdd-hopping mechanism. In Proceedings of the IEEE Conference on Control and Applications, 2009.
[2] T. Burd and R. Brodersen. Processor design for portable systems. In The Journal of VLSI Signal Processing, volume 13, pages 203–221, 1996.
[3] T. Burd, T. Pering, A. Stratakos, and R. Brodersen. A dynamic voltage scaled microprocessor system. In IEEE International Solid-State Circuits Conference Digest of Technical Papers, volume 35, pages 1571–1580, 2000.
[4] A. Chandrakasan and R. Brodersen. Minimizing power consumption in digital CMOS circuits. In Proceedings of the IEEE, volume 83, pages 498–523, 1995.
[5] S. Durand and N. Marchand. Fast predictive control of micro controller's energy-performance tradeoff. In Proceedings of the 3rd IEEE Multi-conference on Systems and Control - 18th IEEE International Conference on Control Applications, 2009.
[6] S. Fairbanks and S. Moore. Analog micropipeline rings for high precision timing. In Proceedings of the International Symposium on Advanced Research in Asynchronous Circuits and Systems, pages 41–50, 2004.
[7] T. Ishihara and H. Yasuura. Voltage scheduling problem for dynamically variable voltage processors. In Proceedings of the International Symposium on Low Power Electronics and Design, pages 197–202, 1998.
[8] S. Miermont, P. Vivet, and M. Renaudin. A power supply selector for energy- and area-efficient local dynamic voltage scaling. In PATMOS'07: 17th International Workshop on Power and Timing Modeling, Optimization and Simulation, pages 556–565, 2007.
[9] T. Minka. The Lightspeed Matlab toolbox v2.2. http://research.microsoft.com/~minka/software/lightspeed/.
[10] T. Pering, T. Burd, and R. Brodersen. Voltage scheduling in the lpARM microprocessor system. In Proceedings of the International Symposium on Low Power Electronics and Design (ISLPED), pages 96–101, 2000.
[11] J. Pouwelse, K. Langendoen, and H. Sips. Dynamic voltage scaling on a low-power microprocessor. In Proceedings of the 7th Annual International Conference on Mobile Computing and Networking, pages 251–259, 2001.
[12] A. Varma, B. Ganesh, M. Sen, S. Choudhury, L. Srinivasan, and J. Bruce. A control-theoretic approach to dynamic voltage scheduling. In Proceedings of the International Conference on Compilers, Architecture and Synthesis for Embedded Systems, pages 255–266, 2003.