Chapter 7

Introduction to Asynchronous
Distributed Processing

Asynchronous distributed processing poses different requirements on the infrastructure than synchronous processing does. It needs fast distributed storage of huge capacity that is accessible from acquisition tools, processing tools, and client tools (either indirectly, through a server that can access the storage, or by letting clients access it directly). Even when the latency of the processing is not an issue, users care very much about the overall speed of processing. To create an efficient processing infrastructure, we must not only apply an efficient job-distribution model and scheduling, but also take into account and optimize the location of data in the distributed storage infrastructure together with the location of sources, processing capacity, and clients. For example, if the data is available on one site, it makes sense to utilize the processing capacity available on that site. On the other hand, if we know that significant processing capacity will become available at a defined time, it may be reasonable to migrate the data as close as possible to that processing infrastructure.

7.1 Objectives of Asynchronous Distributed Processing
The goal of our effort is to build a distributed asynchronous processing system that minimizes the overall processing time regardless of latency and utilizes powerful networks of computing and storage resources called Grids. This aim comprises several subtasks:

   • designing a scheme for efficient distributed processing that scales close to linearly
     with respect to the number of nodes involved in the processing,
   • proposing suitable job scheduling that incorporates distributed processing capacity
     and storage capacity,
   • optimizing the location of data with respect to processing capacity and vice versa.

    The target system must be designed to provide processing at least at real-time speed even when applying advanced transformations to the multimedia material. As a real-world example, we may need to convert a video in DV format into raw format or into common MPEG-4 based formats used for streaming or downloading (e. g. RealMedia, DivX) while applying high-quality de-interlacing and de-noising filters; current high-end single-processor computers are able to process such data only at one third of real-time speed.

7.2 State of the Art
A number of both open-source and closed-source tools are available to process multimedia content in a centralized way, and some of them also work in a distributed way. To the author's knowledge, with the only exception discussed below, none of them has been integrated into the Grid environment and none of them supports distributed storage (unless it is emulated by the operating system as a local filesystem). The most important tools and approaches are listed in this section.

7.2.1 Grid and Distributed Storage Infrastructure
There are numerous projects aimed at building computational Grid infrastructure for various purposes in the U.S.A., Europe, and Japan. These projects range from special-purpose Grids built for selected applications to general infrastructure projects. As part of the Grid infrastructure, these projects provide large computational power in the form of Linux PC clusters, which are being rapidly expanded each year because of the cost-efficiency of this solution; it is this type of resource we target. In the Czech Republic, the Grid activities are covered by the META Center project [78].

7.2.2 Video Processing Tools
Currently, many tools are available for multimedia transcoding (lists of freely available tools can be found e. g. in [69] and [70]; descriptions of the most important commercial tools can be found in [20])—some of them open source and some closed source. The vast majority of them do not allow distributed encoding, while some allow distributed processing in homogeneous environments with some centralized shared storage capacity.
    Due to its highly modular architecture, the transcode [55] tool supports transcoding between almost all common video distribution formats, except for those for which no open-source, freely available implementation of an encoding library exists. It also allows the advanced processing that is needed, such as high-quality de-interlacing and down-sampling. Simple distribution of the computation is available, based on PVM [89] and a shared filesystem supported by the underlying operating system (typically NFS). The transcode tool can also be used as a video and sound acquisition tool on Linux using the Video4Linux interface.
    Another tool available for general multimedia transcoding is MEncoder, which is part of the MPlayer software suite [81].
    However, none of the above-mentioned tools supports transcoding to the RealMedia format, which is currently one of the most popular formats for video streaming delivery. The company producing the RealMedia encoder decided to provide the source code of all its applications for research and development purposes under the Helix Community Project [72]. Thus it is possible to explore the integration of this format into an asynchronous encoding environment.

7.2.3 Automated Distributed Video Processing
Shortly after our Distributed Encoding Environment, another system based on similar ideas was published, called “A Fully Automated Fault-tolerant System for Distributed Video Processing and Offsite Replication” [44]. Its overall architecture is similar to ours: it parallelizes the encoding in a distributed computing environment by splitting the encoding into smaller chunks that are encoded in parallel.
    As the system doesn't use a distributed file system with replica support, it handles data replication using the Condor-related tools Stork [46] and DiskRouter [45]. Furthermore, the claimed fault tolerance is understood and handled only on the job-scheduling level, and the system actually demonstrates the fault tolerance of the Condor-G [51] scheduling system on which it is based.
Chapter 8

Distributed Encoding

In recent years there has been growing demand for creating video archives available on the Internet, ranging from archives of university lectures [90, 67] and archives of medical videos and recordings of scientific experiments to business and entertainment applications. Building these media libraries requires huge processing and storage capacities.
    In this chapter, we describe a system called Distributed Encoding Environment (DEE) [39] that is designed to utilize Grid computing and storage infrastructure. The chapter is organized as follows: in Section 8.1 we propose the architecture of the DEE, in Section 8.2 suitable data and processor scheduling algorithms are analyzed, and Section 8.3 outlines the prototype implementation and evaluates its performance.

8.1 Model of Distributed Processing
There are two possible approaches to building a distributed video processing system, differing in the granularity of the parallelization:

   1. parallelization on the level of the compression algorithm, i. e. fine-grained parallelization of the compression algorithm itself,
   2. parallelization on the data level, i. e. coarse-grained parallelization of the whole encoding process by splitting the material and encoding the resulting parts in parallel.

While the former option is suitable for semi-asynchronous processing such as live streaming, it adds significant overhead and almost prevents reasonable linear scalability on a distributed processing infrastructure without single shared memory, for several reasons. It usually involves substantial synchronization among the distributed processes—e. g. I-frames need to be handled before processing of P- and B-frames can occur. It also requires movement of the source data to the processing node just before the calculation, as the data is not available well in advance, and transfer of the resulting data back. Since the source data is usually not available in advance in this case, it is hard to schedule data movements. Furthermore, the data movements in this model require low-latency transfer for efficient processing, and thus it is impossible to utilize a distributed storage infrastructure1. A third problem is that fine-grained parallelization requires modifications to all source and target codecs in use, which is very hard, as it might comprise tens of different algorithms to parallelize.
   1 Theoretically, it would be possible to utilize some highly experimental and not very affordable storage systems, such as data circulating in optical networks.

    We have opted for the latter approach for the following reasons: since asynchronous processing relaxes the latency constraint, we may assume that the source data is completely acquired before the processing; moreover, our target is to build a system that works faster than real time, so we can suppose the whole material is available in advance anyway. Compared to parallelization on the compression-algorithm level, parallelization on the data level is codec-independent, and thus the same architecture and implementation can be used for many input/output formats. Furthermore, it can be used with target formats for which there is no open-source codec implementation; the only condition is that there is an efficient way of merging the resulting chunks together.
    The proposed workflow for the distribution of the processing (Figure 8.1) looks as follows: the source data is split into chunks, the chunks are encoded in parallel, and the resulting data is merged back into the target data. The goal is then to minimize the completion time of the last finishing job of the parallel phase. Although we have relaxed the latency requirements posed on the asynchronous processing, and the initial and final phases both add to the overall latency of the processing, we require that these two phases be much faster than the parallel phase, thus making the processing effective from the user's point of view. The source chunks for the parallel phase are stored in distributed storage (possibly in multiple copies for performance and reliability purposes) so that they are effectively accessible by the distributed processing nodes.
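As an illustration, the split/encode/merge workflow can be sketched in a few lines of Python. All names here are hypothetical stand-ins for the real DEE components; a lowercase transform plays the role of the per-chunk encoder.

```python
from concurrent.futures import ThreadPoolExecutor

def encode_chunk(chunk):
    # Hypothetical stand-in for transcoding one chunk
    # (in DEE this would invoke an external encoder).
    return chunk.lower()

def distributed_encode(source, chunk_size, workers=4):
    # 1. Split the source material into uniform chunks.
    chunks = [source[i:i + chunk_size]
              for i in range(0, len(source), chunk_size)]
    # 2. Encode all chunks in parallel (the dominant phase).
    with ThreadPoolExecutor(max_workers=workers) as pool:
        encoded = list(pool.map(encode_chunk, chunks))
    # 3. Merge the results back into the target data.
    return "".join(encoded)
```

The initial and final phases are sequential, which is why the text requires them to be much faster than the parallel middle phase.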

    [Figure 8.1: Source Data → Data Chunking → parallel Chunk Processing → Chunk Merging]

      Figure 8.1: Workflow in the Distributed Encoding Environment model of processing distribution.

   As the source data is complete before the processing, we may split the parallel processing into uniform chunks, which makes it possible to create a scheduling algorithm belonging to the PO class, as shown in Section 8.2.

8.1.1 Conventions Used
The overview of the infrastructure model used throughout this chapter is given in Figure 8.2, comprising data sources, storage depots, processing nodes, and the network infrastructure with links and active elements (routers/switches). In order to maintain consistent notation throughout this chapter, we also define a number of symbols below.

    [Figure 8.2: Model of target infrastructure. Legend: data source, storage depot, router/switch, processing node.]
Definition 8.1 (Data transcoding) Transformation of (multimedia) data from a source format to a target format is called transcoding. □

Definition 8.2 (Data prefetch) Data prefetch2 is an act of moving data closer to the processing infrastructure during the time period between the moment the job is scheduled and the moment the job is run. □

   There is also a number of symbols and variables used, some of which are provided with a deeper explanation where appropriate:

   t          time
   t0         now
   p          processing node
   P          set of processing nodes
   d          the depot where the data to be processed are stored
   D          set of depots that store the data to be processed (all depots unless indicated otherwise)
   Du         set of depots scheduled/used for actually accessing the data to be processed in task u
   u          (type of) processing task
   U          set of processing tasks (all the tasks have the same length)
   Up         set of tasks scheduled to processor p
   lu         length of processing task u; units [Mb]
  2 In the Grid community, the term “data stage-in” is often used as an equivalent to data prefetch.
  3 This information can theoretically be obtained from most current advanced schedulers. However, there are a few issues that make it partially theoretical functionality only:
   • the existence of priority jobs in scheduling systems (a priority job can delay the availability of the processor),

   t^{sched}_{p_free}   information from the job scheduler on the time at which processor p will become available3; units [s]
   s_{p,u}    processing performance4 of processor p on (type of) task u; units [Mb·s^{-1}]
   s′_{p,u}   resulting material production performance of processor p on (type of) task u; units [Mb·s^{-1}]
   b_{D,p}(t)   download capacity (bandwidth) from depot set D to processor p at time t, as discussed in Section 8.2.2; units [Mb·s^{-1}]
   b_{p,D}(t)   upload capacity (bandwidth) from processor p to depot set D at time t, as discussed in Section 8.2.2; units [Mb·s^{-1}]

8.2 Scheduling Algorithms
The work described in this section is primarily motivated by the need for efficient job scheduling across a geographically distributed computing-cluster infrastructure and distributed storage systems for distributed processing of large data sets. Such a scheduling system must take into account not only the processing power of each computing node (which is not uniform, contrary to the assumption of most scheduling algorithms), but also the estimated end-to-end network throughput between the location of the data in the distributed storage system and the processing nodes.

8.2.1 Use Cases and Scenarios
There is a number of scenarios that can be covered by our approach; the following list includes the ones we consider the most important:

    • Scheduling the processing on the best hosts. The “best” host need not be the fastest one in terms of available processing power; it is the one on which the calculation finishes in the shortest time. To select the best node for the processing, we need to sort the hosts according to the estimated completion times of the processing and then use a processor scheduling algorithm to schedule the tasks.
    • Selection of the best depots containing the source data with respect to processing capacity. Taking into account the available bandwidth between the data depots containing the source data and the processors that are about to process the data, we need to schedule which processor will use which depot.
    • Prefetch decision support. Some evaluation criteria are needed to decide whether a data prefetch is appropriate or not. The minimum condition states that the prefetch must accelerate the processing, i. e. it must decrease the overall processing time.
    • Upload distribution support. If the data processing is to happen in a near enough future that we can predict which computing resources will be used for the processing, it may be useful to upload the resulting data back into the distributed storage with respect to the location of these computing resources.
   • the existence of preemptive jobs, which can preempt already running jobs,
   • users are not required to specify the expected run time of their jobs (and thus most users don't bother), which then defaults to the maximum run time available in the respective queue,
   • non-trivial known problems with this functionality in well-known scheduling systems (e. g. PBSPro [88]).
Nevertheless, there is considerable effort in current Grid computing to estimate job run times [63], and thus we assume this functionality will be available very soon.
   4 We assume that the processing performance of the processor is constant in time and that the processor is either available (free) or unavailable (busy) for the purpose of job scheduling. When algorithms assume uniform tasks, we denote by sp the processing performance of processor p.

8.2.2 Components of the Model
There are two basic components of the model: the Completion Time Estimate (CTE), used for finding the “best” host for data processing, and the Network Traffic Prediction Service (NTPS), used for predicting the available end-to-end bandwidth between a data storage depot and a processor. Furthermore, some auxiliary functionality such as proximity functions, prefetch decision support, and upload optimizations is provided in this section.
Definition 8.3 (Completion Time Estimate) The Completion Time Estimate (CTE) is an estimate of the time when the processing finishes if it uses specified computing resources while the data are processed directly from/to specified storage resources. Networking resources, defined by the location of computing and storage and by the network topology, are used as well. □

   Because prediction of network traffic can be very complex, depending on the requirements on the prediction as discussed below, we define the NTPS as a general interface for traffic prediction.

Definition 8.4 (Network Traffic Prediction Service) The Network Traffic Prediction Service (NTPS) is a service capable of estimating the network bandwidth available for data transmission between two hosts in the network in an end-to-end way using a specified stack of network protocols. □

CTE – Completion Time Estimation

In general, the CTE can be obtained by solving the following equations for the job u, given the location of the data in the depots D_u, using processor p, with the resulting data of length l′_u uploaded into the depot set D′_u:

    \int_{t^{sched}_{p\_free}}^{CTE_d(p,D_u,u)} \min\{s_{p,u},\, b_{D_u,p}(t)\}\,dt = l_u        (8.1)

    \int_{t^{sched}_{p\_free}}^{CTE_u(p,D'_u,u)} r(t)\,dt = l'_u        (8.2)

CTE_d(p, D_u, u) is the estimated completion time of the download-and-processing phase and CTE_u(p, D′_u, u) is the estimated completion time of the upload phase; z(t) = s′_{p,u} − b_{p,D′_u}(t); the amount of locally stored data—accumulated when production is faster than the network bandwidth available for transport—is Z(t) = \max\{\int_{t^{sched}_{p\_free}}^{t} z(\tau)\,d\tau,\ 0\}; and the resulting upload rate is

    r(t) = \begin{cases} \min\{s'_{p,u},\, b_{p,D'_u}(t)\} & Z(t) = 0 \\ b_{p,D'_u}(t) & Z(t) > 0 \end{cases}

    Because CTE_d(p, D_u, u) < CTE_u(p, D′_u, u),

    CTE(p, D_u, u) = CTE_u(p, D′_u, u)        (8.3)

This model also presumes that the uploading into the storage infrastructure takes place in parallel with the downloading; otherwise the lower bound of the integral in (8.2) needs to be modified accordingly (e. g. when the uploading happens just after the processing finishes, the lower bound in (8.2) would be CTE_d(p, D_u, u)).
   If we assume that b_{p,D′_u}(t) and r(t) are constant on the interval [t^{sched}_{p\_free}, CTE(p, D_u, u)] (which can be justified e. g. when the job duration is less than the time resolution of the network traffic prediction service) and if we assume the uploading phase starts just after the processing finishes, we can use the simplified model

    CTE(p, D_u, u) = t^{sched}_{p\_free} + \frac{l_u}{\min\{s_{p,u},\, b_{D_u,p}(t^{sched}_{p\_free})\}} + \frac{l'_u}{r(t^{sched}_{p\_free})}        (8.4)

    To simplify the model even further, we can assume that the uploading into the infrastructure is not the bottleneck, since l′_u ≪ l_u (which is typical for video processing applications converting from a raw video format to compressed formats) while b_{D_u,p} ≈ b_{p,D′_u}, or that the uploading phase takes only negligible time even if the uploading occurs after the processing. Thus we obtain the formula that will be used further on for the sake of simplicity:

    CTE(p, D_u, u) = t^{sched}_{p\_free} + \frac{l_u}{\min\{s_{p,u},\, b_{D_u,p}(t^{sched}_{p\_free})\}}        (8.5)

In case the presumption of neglecting the uploading phase is not valid, the model and the resulting algorithm can easily be extended to support it.
    Such a function allows us to find the most suitable processors for the processing. To avoid synchronous overloading of the processing infrastructure, we suggest using one of two well-known approaches: either randomizing the set of processing nodes and pre-selecting some subset, or pre-selecting the subset manually. For the given subset we calculate the CTE estimates and launch a greedy scheduling algorithm starting with the processor with the lowest CTE.
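As a sketch, the simplified completion-time estimate and the host selection described above could look as follows in Python. The parameter names (t_free for t^{sched}_{p_free}, s_pu, b_dp) are illustrative, not part of DEE.

```python
def cte(t_free, l_u, s_pu, b_dp):
    # Simplified completion-time estimate: the queue wait plus the
    # download-and-process phase, whose rate is capped either by the
    # processor speed or by the available bandwidth.
    return t_free + l_u / min(s_pu, b_dp)

def best_processor(candidates, l_u):
    # candidates: iterable of (name, t_free, s_pu, b_dp) tuples.
    # Returns the candidate with the lowest estimated completion time,
    # i.e. the starting point of the greedy scheduling pass.
    return min(candidates, key=lambda c: cte(c[1], l_u, c[2], c[3]))
```

Note that a slower processor with a short queue and good bandwidth can beat a faster but busier one, which is exactly the point of sorting by CTE rather than by raw speed.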

Available bandwidth estimate
Let us assume we have some kind of Network Traffic Prediction Service (NTPS) that provides us with an estimate of the available network throughput between node A and node B at time t: NTPS(A, B, t). To obtain a realistic estimate of the available TCP bandwidth, at least the following parameters need to be evaluated: the minimum line capacity on the path, the round-trip time (RTT), and the packet loss rate, as all of these are important for the performance of TCP, which underlies our applications. There are several possible models of NTPS with different interactions with our job scheduling model, as shown below.
    The main difficulty arises when the traffic generated by “our” application has regular patterns in its nature and thus is included as a part of the NTPS prediction itself. In such a scenario, we need to differentiate between the predicted traffic generated by “our” application and the predicted background traffic. Moving from the most complex to simpler models, we will show the interactions with our scheduling system for each of them.

NTPS Model #1 Let us assume the most complete NTPS model with the following properties:
  1. the NTPS can predict the cumulative available TCP bandwidth in an N:1 fashion when N hosts are sending data to a single host in parallel,
  2. “our” application tells the NTPS which traffic in the NTPS measurements has been generated by it, so that the NTPS can identify it,
  3. the NTPS can provide a prediction of the “background” traffic for “our” application by subtracting the predicted traffic of “our” application from the overall prediction,
  4. the NTPS performs in-advance bandwidth allocations in time for scheduled jobs and projects these allocations into the available bandwidth predictions,
  5. the NTPS can compare reserved vs. actual traffic by “our” application and can utilize it to keep statistical information (which can e. g. be used automatically for adjusting the reservation if the application regularly overestimates the bandwidth needed in its allocation requests),

   6. the NTPS can estimate the available bandwidth in an end-to-end way; that means it can decrease the reported available bandwidth correspondingly if the bottleneck is in the storage depot itself or in the processing node itself.
Under such conditions, what we need from the prediction service is the total bandwidth available between processor p and depot set D,

    b_{D,p}(t) = NTPS(D, p, t)        (8.6)

   The application then allocates the bandwidth per processor, b^{sched}_{D',p}(t), with the NTPS service by adding depots from depot set D to depot set D′ while

    \exists d \in D .\ b_{D' \cup \{d\},p}(t) > b_{D',p}(t) \ \wedge\ b_{D',p}(t) < s_{p,u}: \quad D' := D' \cup \{d\};\ D := D - \{d\}        (8.7)

The process is repeated until the total reserved bandwidth is larger than s_{p,u} or until no other depot in D is available that increases b^{sched}_{D',p}(t); then the reservation is made: b^{sched}_{D',p}(t) := b_{D',p}(t).
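The depot-adding reservation loop of Model #1 can be sketched as follows. Here `bandwidth` is a hypothetical stand-in for the NTPS N:1 prediction b_{D′,p}(t) over a depot set; the function name and interface are illustrative.

```python
def allocate_depots(depots, s_pu, bandwidth):
    # bandwidth(depot_set) is a hypothetical NTPS query returning the
    # aggregate N:1 bandwidth from the given depot set to the processor.
    chosen, remaining = [], list(depots)
    while bandwidth(chosen) < s_pu:
        # Depots whose addition increases the aggregate bandwidth.
        gains = [d for d in remaining
                 if bandwidth(chosen + [d]) > bandwidth(chosen)]
        if not gains:
            break  # no remaining depot improves the reservation
        chosen.append(gains[0])
        remaining.remove(gains[0])
    return chosen  # the reservation is then made for this depot set
```

The loop stops as soon as the aggregate bandwidth reaches the processor's consumption rate s_{p,u}, so no bandwidth is reserved beyond what the processor can use.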

NTPS Model #2 If the NTPS is unable to perform the prediction in an N:1 fashion (relaxing condition 1), it is nearly impossible to use multiple depots to feed one processor, as that requires detailed knowledge of the network topology. Thus the formulas above become

    b_{D,p}(t) = NTPS(d, p, t) \quad where\ D = \{d\},        (8.8)

    b^{sched}_{d,p}(t) = \min\{s_{p,u},\, b_{d,p}(t)\}        (8.9)

NTPS Model #3 If we assume the same behavior as above (Model #2) except that bandwidth allocations are unavailable (relaxing conditions 1 and 4), the model gets more complicated again, even if we use a single depot per processor only:

    b_{D,p}(t) = NTPS(d, p, t) - \sum_{p'} b^{sched}_{d,p'}(t) \quad where\ D = \{d\},        (8.10)

where p′ runs over all processors that share some link with the previously scheduled processing—i. e. (d, p) and (d, p′) share at least one link—the term being omitted when creating the first estimate. Again, this requires detailed knowledge of the network topology and thus it can hardly be used.

NTPS Model #4 If the network traffic forecasting service is capable of including “our” jobs in its estimate but is unable to isolate “our” traffic from its predictions, the criterion becomes

    NTPS(D, p, t) > 0 \quad or \quad NTPS(d, p, t) > 0        (8.11)

as we are watching whether there is still some spare bandwidth available, i. e. whether congestion (including “our” traffic) is imminent or not.

NTPS Model #5 If the NTPS certainly doesn't include “our” traffic in its prediction (e. g. because the traffic generated by “our” application is neither regular nor predictable), the criterion becomes

    b_{D,p}(t) = NTPS(D, p, t) \quad or \quad b_{D,p}(t) = NTPS(d, p, t) \quad where\ D = \{d\},        (8.12)

Proximity Function
The auxiliary proximity function is an approximative static replacement of the NTPS that allows the scheduling system to assess the “closeness” of processors and storage depots as a static time average when no dynamic NTPS is available. Similarly to the NTPS function, the proximity functions must take into account the maximum achievable end-to-end throughput, depending on the transport protocol used between the data depots and the computing nodes. For the TCP transport protocol, it is based on the estimated5 or measured average TCP throughput, while for UDP it might be just the limiting capacity and possibly also the loss of the network between storage and processor. The proximity functions can be prototyped as follows:

     PX(p)          . . . returns the vector of depots close to processor p (in non-increasing order)
    PXinv(d)        . . . returns the vector of processors close to single depot d (in non-increasing order)
    PXinv(D)        . . . returns the vector of processors close to depot set D (in non-increasing order)
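A minimal static proximity table might be prototyped as follows, assuming a hypothetical table of measured average throughputs; the names AVG_THROUGHPUT, px, and px_inv are illustrative.

```python
# Hypothetical static averages of measured TCP throughput, in Mb/s.
AVG_THROUGHPUT = {
    ("p1", "d1"): 80.0, ("p1", "d2"): 20.0,
    ("p2", "d1"): 10.0, ("p2", "d2"): 90.0,
}

def px(p):
    # Vector of depots close to processor p, in non-increasing order.
    depots = {d for (_, d) in AVG_THROUGHPUT}
    return sorted(depots, key=lambda d: -AVG_THROUGHPUT[(p, d)])

def px_inv(d):
    # Vector of processors close to depot d, in non-increasing order.
    procs = {p for (p, _) in AVG_THROUGHPUT}
    return sorted(procs, key=lambda p: -AVG_THROUGHPUT[(p, d)])
```

Such a table is only a time-averaged approximation; the dynamic NTPS should be preferred whenever it is available.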

Prefetch Evaluation
First, it makes no sense to perform the prefetch if the processing power is the bottleneck, so the prefetch makes sense only if

    s_{p,u} > b_{D,p}(t)        (8.13)

where D is the depot set on which the data to be processed is located. Thus it is meaningful to perform a prefetch from depot set D to depot set D′ if the following condition is met:

    t^{sched}_{p\_free} + \frac{l_u}{\min\{s_{p,u},\, b_{D,p}(t^{sched}_{p\_free})\}} > t' + \frac{l_u}{\min\{s_{p,u},\, b_{D',p}(t')\}}        (8.14)

    t' = \max\{t^{sched}_{p\_free},\ t_0 + \Delta t^{prefetch}_{D \to D'}\}        (8.15)

It is also necessary to find out whether there is some available depot which is “closer” than the current ones. The minimal condition for attempting a prefetch is that for any p ∈ P

    \exists d' \in \bigcup_i \{PX_i(p)\} - D \ .\ b_{d',p}(t) > 0,        (8.16)

where PX_i(p) is the i-th element of the vector PX(p). If we want to maintain the number of copies of the data and just let the data “flow” through the storage infrastructure, the condition becomes

    \exists d' \in \bigcup_i \{PX_i(p)\} - D \ \wedge\ \exists d \in D \ .\ b_{d',p}(t) > b_{d,p}(t).        (8.17)

The simplified condition for the case when only a single data copy is used is

    PX_0(p) \neq d        (8.18)
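The prefetch decision combines the bottleneck condition (8.13) with the completion-time comparison (8.14)/(8.15); a sketch with illustrative parameter names follows (b_cur and b_new stand for the bandwidths from the current and the candidate depot sets).

```python
def prefetch_pays_off(t_free, t0, dt_prefetch, l_u, s_pu, b_cur, b_new):
    # Prefetch only makes sense when bandwidth, not processing power,
    # is the bottleneck (condition 8.13), and when the job finishes
    # earlier even after waiting for the data movement (8.14, 8.15).
    t_new = max(t_free, t0 + dt_prefetch)
    finish_current = t_free + l_u / min(s_pu, b_cur)
    finish_after = t_new + l_u / min(s_pu, b_new)
    return s_pu > b_cur and finish_after < finish_current
```

If the data movement completes before the processor becomes free, the prefetch is effectively hidden in the queue wait and almost always pays off.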
   5 The maximum TCP throughput is limited not only by the network capacity, but also by the round-trip time, packet loss, and maximum segment size (which is the network maximum transmission unit minus the TCP and IP headers, together 40 bytes). Based on analytical models of standard TCP congestion control [53, 15], the TCP throughput is proportional to \frac{MSS}{RTT\sqrt{loss}}. Such estimates are implemented e. g. in the Network Diagnostic Tool [85]. More elaborate models are also available [56].
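The footnote's analytical estimate can be sketched as follows. This is a simplified Mathis-style formula with the constant C = √(3/2) for standard congestion control; it is an approximation used for illustration, not the exact model from [53, 15].

```python
from math import sqrt

def tcp_throughput_mbps(mss_bytes, rtt_s, loss):
    # Simplified Mathis-style estimate: throughput is proportional
    # to MSS / (RTT * sqrt(loss)), with C = sqrt(3/2) for standard
    # TCP congestion control (an approximation only).
    c = sqrt(1.5)
    bits_per_second = (mss_bytes * 8 / rtt_s) * c / sqrt(loss)
    return bits_per_second / 1e6
```

For example, with MSS = 1460 bytes, RTT = 100 ms, and 0.01 % loss, the estimate comes out at roughly 14 Mb/s, which illustrates why RTT and loss, not raw capacity, often dominate wide-area transfers.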

Upload Optimizations
If we know at the time of uploading the data from the source nodes into the data storage depots that there is some pool of processors P we want to use, and assuming that the storage infrastructure can perform auto-replication and prefetching, we can upload the data to the following set of depots:

    \bigcup_{p \in P} \{PX_0(p)\}        (8.19)

   Our model consists of two stages. During the first stage, the processor scheduling algorithm assigns tasks to the processors, and in the second stage, the storage scheduling algorithm assigns tasks to depots. We assume two models of storage scheduling: the first one is 1–to–1, where the data of one task is stored in a single depot only; the second model is 1–to–n, where the data of one task is replicated to n depots that can be accessed simultaneously. Last but not least, it is important to keep in mind that these models handle uniform tasks only, as discussed in Section 8.1.

8.2.3 Processor scheduling
We are not using an online algorithm; instead, we use the abstraction that all the tasks are
known at the time of scheduling. For a real online algorithm, this might not hold. Further-
more, for the purpose of this algorithm we do not consider the available network capacity
between storage depots and processing nodes, and the only measure of speed is s_{p,u},
which is denoted as s_p because of the uniform job size.

Input: set of processors P, set of tasks U, task length l, speed of each processor s_p.
Output: sets U_p that contain tasks assigned to processor p, and for ∀u ∈ U the scheduled
     time to start the task, t_u, is computed.
Goal: minimize.
Measure: maximum processor running time.

    The common processor scheduling problem, which takes uniform processors and tasks of
different sizes, belongs to the NPO class. In our case we have a uniform task size and processors
with different speeds, which is the Qm | pj = 1 | Cmax class of problems [47].
    Let t_p^sched_free be the time when processor p becomes free (meaning there is no task
scheduled to processor p at time t_p^sched_free). We use the greedy algorithm shown in
Figure 8.3 for assigning tasks to processors. It is easy to see that the complexity of the
algorithm is O(|P||U|).

    1   foreach p ∈ P do
    2     U_p := ∅;
    3     t_p^sched_free := 0;
    4   od
    5   foreach u ∈ U do
    6     p : ∀p1 ∈ P. t_{p1}^sched_free + l_u/s_{p1} ≥ t_p^sched_free + l_u/s_p;
    7     U_p := U_p ∪ {u};
    8     t_u := t_p^sched_free;
    9     t_p^sched_free := t_p^sched_free + l_u/s_p;
    10  od

            F IGURE 8.3: PS Algorithm: Greedy algorithm for processor scheduling
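The PS algorithm of Figure 8.3 translates almost directly into code. The sketch below is an illustrative Python rendering (names are ours, not from the text); it additionally keeps the processors in a heap keyed by t_p^sched_free + l/s_p, which replaces the linear scan of line 6 and lowers the complexity from O(|P||U|) to O(|U| log |P|):

```python
import heapq

def ps_schedule(speeds, tasks, l):
    """Greedy PS: assign uniform tasks of length l to processors with
    speeds s_p. Returns (assignment, start_times): assignment[p] is the
    list of tasks given to processor p, start_times[u] the start of u."""
    assignment = {p: [] for p in speeds}
    start = {}
    # Heap entries: (finish time if one more task were added, free time, p).
    heap = [(l / s, 0.0, p) for p, s in speeds.items()]
    heapq.heapify(heap)
    for u in tasks:
        _, t_free, p = heapq.heappop(heap)   # processor minimizing line-6 term
        assignment[p].append(u)
        start[u] = t_free
        t_free += l / speeds[p]              # processor busy until here
        heapq.heappush(heap, (t_free + l / speeds[p], t_free, p))
    return assignment, start
```

For two processors of speeds 2.0 and 1.0 and three tasks of length 2.0, the fast processor receives two tasks and both finish at time 2.0, which is the optimum makespan.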

Theorem 8.1 The processor scheduling algorithm "PS", belonging to the PO class, provides the
optimum solution for tasks of uniform size.
P ROOF We need to prove that the greedy processor scheduling algorithm PS belonging to
the PO class (Figure 8.3) returns the optimum solution. Since all the tasks are uniform and
no task precedence is allowed, no permutation of tasks inside U_p results in a better or
worse solution. Moreover, let u1 ∈ U_{p1} and u2 ∈ U_{p2} for two processors p1 ≠ p2; we
can see that {U_{p1} − u1} ∪ {u2}, {U_{p2} − u2} ∪ {u1} does not give a better solution,
because the tasks are of uniform size. Let u1 ∈ U_{p1} and U_{p2} where p1 ≠ p2; we show
that {U_{p1} − u1}, {U_{p2} ∪ u1} does not give a better solution. Since we can do any
permutation of tasks in any U_p, let u1 be the task whose t_{u1} is highest in U_{p1}, i. e. it
is the last scheduled task in U_{p1}. Let u2 ∈ U_{p2} be the task whose t_{u2} is highest in
U_{p2}.

   • In the case of t_{u2} + l_u/s_{p2} + l_u/s_{p2} > t_{u1} + l_u/s_{p1}, a worse solution is
     found.

   • In the case of t_{u2} + l_u/s_{p2} + l_u/s_{p2} < t_{u1} + l_u/s_{p1}, a better solution
     would be found, but we show that this is impossible: since ∀p ∈ P. t_p^sched_free +
     l_u/s_p ≥ t_{u1} + l_u/s_{p1} holds (line 6 in Figure 8.3), substituting t_{p2}^sched_free =
     t_{u2} + l_u/s_{p2} gives t_{u2} + l_u/s_{p2} + l_u/s_{p2} ≥ t_{u1} + l_u/s_{p1}. But this
     is a contradiction with t_{u2} + l_u/s_{p2} + l_u/s_{p2} < t_{u1} + l_u/s_{p1}.

8.2.4 Storage scheduling problem, 1–to–1 model
Input: set of all depots D, set of processors P, set of tasks U, speed of each processor s_p,
     transfer speed between processor and depot b_{d,p}(t), for ∀p ∈ P the scheduled U_p,
     and for ∀u ∈ U the scheduled t_u.
Output: sets P_d that contain tasks assigned to depot d.
Goal: maximize.
Measure: Σ_{u∈U} f(b_{d,p}(t_u) − s_p),
     where d ∈ D is such that u ∈ P_d, p ∈ P is such that u ∈ U_p, and
     f(x) = 1 if x ≥ 0, 0 if x < 0.
Theorem 8.2 The 1–to–1 storage scheduling problem is NPO-complete.
P ROOF We use Karp's reduction from the Bin-Packing problem. Let I = {a1, a2, . . . , an} be
a finite set of rational numbers with ai ∈ (0, 1] for i = 1, . . . , n; we search for a minimal
partition {B1, B2, . . . , Bk} of I such that

                                  Σ_{ai ∈ Bj} ai ≤ 1

for j = 1, . . . , k. Let s_{p1} = a1, s_{p2} = a2, . . . , s_{pn} = an for the 1–to–1 storage
scheduling problem. Let the processors and depots be interconnected via the complete graph.
Let

                                  ∀d ∈ D : b_{d,p}(t) ≤ 1

for any time t. We search for a minimum set of depots D so that

                                  Σ_{u∈U} f(b_{d,p}(t_u) − s_p),

where d ∈ D is such that u ∈ P_d, p ∈ P is such that u ∈ U_p, and f(x) = 1 if x ≥ 0,
0 if x < 0, is maximal. That means we run the algorithm for 1–to–1 storage scheduling up
to (k + 1) times. The depots correspond to the partition {B1, B2, . . . , Bk}.

   We use the First Fit Decreasing approximation algorithm known from the Bin-Packing
problem; First Fit Decreasing is a 2-approximation scheme [4].
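For reference, First Fit Decreasing itself is straightforward. A minimal sketch with item sizes and bin capacity normalized to 1, as in the proof above; the small tolerance constant is our addition to absorb floating-point error:

```python
def first_fit_decreasing(items, capacity=1.0):
    """First Fit Decreasing for Bin-Packing: sort items in decreasing
    order and place each into the first bin it fits; open a new bin
    when none fits."""
    bins = []     # remaining capacity of each open bin
    packing = []  # items placed in each bin
    for a in sorted(items, reverse=True):
        for i, free in enumerate(bins):
            if a <= free + 1e-12:   # tolerance for float rounding
                bins[i] -= a
                packing[i].append(a)
                break
        else:
            bins.append(capacity - a)
            packing.append([a])
    return packing
```

For the item set {0.7, 0.5, 0.5, 0.5, 0.4, 0.2, 0.2} the sketch opens four bins, which is also the optimum for that instance.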

    1   foreach d ∈ D do
    2     P_d := ∅;
    3   od
    4   foreach p ∈ P do
    5     t_p^sched_free := PBS(p);
    6     ProcTime(p) := l_u / min{s_p, b_{D,p}(t_p^sched_free)};
    7   od
    8   foreach u ∈ U do
    9     p : ∀p1 ∈ P. t_{p1}^sched_free + ProcTime(p1) ≥ t_p^sched_free + ProcTime(p);
    10    d : ∀d1 ∈ D. b_{d1,p}(t_p^sched_free) ≤ b_{d,p}(t_p^sched_free);
    11    sched_depot(u, d); /* P_d := P_d ∪ {u} */
    12    sched_job(p, u);   /* t_p^sched_free := t_p^sched_free + ProcTime(p) */
    13  od

                      F IGURE 8.4: 1-DS Algorithm: 1–to–1 task scheduling

   The algorithm for processor and task scheduling is shown in Figure 8.4. The function
sched_job(p, u) tells the cluster's resource scheduling system (e. g. PBSPro) to allocate
processor time starting at t_p^sched_free and to mark the particular processor busy. The
function sched_depot(u, d) changes the network and depot conditions b_{d,p}(t_p^sched_free)
so that the data transfer from depot d to processor p will utilize the network and depot
capacity.
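Under the simplifying assumption of static (time-independent) bandwidths, the selection loop of 1-DS (lines 8–12 of Figure 8.4) can be sketched as follows; all names are illustrative, and the PBS(p) start-time estimates are folded into proc_free:

```python
def one_ds_schedule(tasks, proc_free, proc_time, bandwidth):
    """1-DS selection loop: for each task pick the processor with the
    earliest estimated completion, then the single depot with the
    highest bandwidth towards that processor. Mutates proc_free."""
    assignment = {}  # task -> (processor, depot)
    for u in tasks:
        p = min(proc_free, key=lambda q: proc_free[q] + proc_time[q])
        d = max(bandwidth[p], key=bandwidth[p].get)  # best depot for p
        assignment[u] = (p, d)
        proc_free[p] += proc_time[p]  # sched_job: processor busy longer
    return assignment
```

With two processors (per-task times 1.0 and 2.0) and per-processor depot bandwidths, the first two tasks go to the faster processor via its fastest depot and the third to the other processor.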

8.2.5 Storage scheduling problem, 1–to–n model
Input: set of depots D, set of processors P, set of tasks U, speed of each processor s_p,
     transfer speed between processor and depot b_{d,p}(t), for ∀p ∈ P the scheduled U_p,
     and for ∀u ∈ U the scheduled t_u.

Output: sets P_d that contain tasks assigned to depot d.
Goal: maximize.
Measure: Σ_{u∈U} f( Σ_{d∈D_u} b_{d,p}(t_u) − s_p ),
     where D_u = {d | u ∈ P_d}, p ∈ P is such that u ∈ U_p, and
     f(x) = 1 if x ≥ 0, 0 if x < 0.

   The n-DS algorithm in Figure 8.5 is a modified version of the algorithm for 1–to–1 task
scheduling that uses replicas of the data on different data storage depots to find the optimum
solution. Transfers to processors can be done from multiple sources. It is important to keep
in mind that we work in the complete graph to achieve PO-class complexity; thus line 14 in
Figure 8.5 can work with b_d separately instead of b_D, since there is no shared link, the
data flows are independent, and b_d is therefore additive.

Theorem 8.3 The 1–to–n storage scheduling belongs to the PO class if and only if depots and
processors are interconnected via the complete graph.
P ROOF To prove that 1–to–n storage scheduling belongs to the PO class, we need to show
that the greedy algorithm n-DS returns the optimum solution. Indeed, using replicas allows us to
8.3. PROTOTYPE IMPLEMENTATION                                                                    67

    1   foreach u ∈ U do
    2     D_u := ∅;
    3   od
    4   foreach d ∈ D do
    5     P_d := ∅;
    6   od
    7   foreach p ∈ P do
    8     t_p^sched_free := PBS(p);
    9     ProcTime(p) := l_u / min{s_p, b_{D,p}(t_p^sched_free)};
    10  od
    11  foreach u ∈ U do
    12    p : ∀p1 ∈ P. t_{p1}^sched_free + ProcTime(p1) ≥ t_p^sched_free + ProcTime(p);
    13    while (s_p > b_{Du,p} ∧ D − D_u ≠ ∅) do
    14      d : ∀d1 ∈ D. b_{d1,p}(t_p^sched_free) ≤ b_{d,p}(t_p^sched_free);
    15      sched_depot(u, d); /* P_d := P_d ∪ {u}, D_u := D_u ∪ {d} */
    16    od
    17    sched_job(p, u); /* t_p^sched_free := t_p^sched_free + l_u / min{s_p, b_{Du,p}(t_p^sched_free)} */
    18  od

                      F IGURE 8.5: n-DS Algorithm: 1–to–n task scheduling

utilize all the depots to their maximum, which means that no better solution can be found.
In the case of non-complete graphs, some network conditions can prevent utilizing some
depots to the maximum extent when the First Fit Decreasing algorithm is used (e. g. if the
second fastest processor is connected only to the fastest depot, while the fastest processor
is connected to all depots and is fast enough to utilize the fastest depot to its maximum,
then the second fastest processor has no access to a free depot).
    Current largely over-provisioned high-speed networks can be seen, from the schedul-
ing point of view, as providing a logical (virtual) complete graph. The network capacity
grows enormously towards the network core, and the limitations usually lie in the so-called
"last mile" of the network (the last or very few network links before the end-node is
reached); the limiting network capacity for each depot and processor can therefore usually
be found in the "last mile", or more specifically the last link, which is available entirely for
utilization by traffic from/to the depot. Therefore the depots and processors cannot block
each other as discussed below and thus can be utilized to the maximum extent.
Theorem 8.4 The 1–to–n storage scheduling is NPO-complete if depots and processors are inter-
connected via a general (non-complete) graph.
P ROOF The proof uses the same reduction as the proof of Theorem 8.2, while the network
conditions restrict the use of replicas. Thus there may exist a depot that is not utilized to
the maximum while, at the same time, a processor exists that is not utilized to the maximum
extent either, and using more replicas does not improve performance any more.
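The replica-adding inner loop of n-DS, and the reason the complete-graph assumption matters, can be sketched as follows, again under the simplifying assumption of static bandwidths; since there are no shared links, per-depot bandwidths simply add up:

```python
def nds_pick_depots(s_p, bandwidth, free_depots):
    """Inner loop of n-DS for one task on a processor of speed s_p:
    keep adding the fastest remaining depot until the aggregate
    transfer rate saturates the processor or the depots run out.
    'bandwidth' maps depot -> static bandwidth towards p
    (an illustrative simplification of b_{d,p}(t))."""
    chosen, aggregate = [], 0.0
    remaining = set(free_depots)
    while aggregate < s_p and remaining:
        d = max(remaining, key=lambda q: bandwidth[q])  # fastest free depot
        remaining.remove(d)
        chosen.append(d)
        aggregate += bandwidth[d]
    return chosen, aggregate
```

A processor of speed 100 with depots of bandwidths 60, 30, and 25 needs all three replicas; a processor of speed 80 is saturated after two.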

8.3 Prototype Implementation
8.3.1 Technical Background
Instead of building our own Grid infrastructure for testing, development, and pilot appli-
cations, we have decided to use the powerful Grid infrastructure made available in the
Czech Republic by the META Center project [78]. The META Center project was enhanced
during the year 2003 by a new project called Distributed Data Storage (DiDaS) [30], incor-
porating new distributed storage based on the Internet Backplane Protocol (IBP) [5]. Such
a storage infrastructure can be efficiently used for implementation and deployment of the
DEE system. However, the scheduling system in use is not capable of scheduling jobs with
respect to the location of the data, and there is neither data location optimization nor
prefetch functionality; since our data-intensive application requires at least some of these
for optimal performance, we had to enhance the underlying infrastructure.
    As a part of the Grid infrastructure, the META Center provides large computational
power in the form of IA32 Linux PC clusters that are being rapidly expanded each year
because of the cost-efficiency of this solution.
    As follows from the discussion in Section 8.2, we need a globally accessible distributed
data storage for transient storage of source, intermediate, and target data, which provides
high enough performance to supply data to processing and supports data replicas. The
filesystem shared across these PC clusters is based either on a rather slow, globally ac-
cessible AFS filesystem supporting read-only, administrator-controlled replicas (several
terabytes of storage are available), or on a somewhat faster site-local NFS, which doesn't
support data replicas, has its own problems such as broken support for sharing files larger
than 2 GB, and has only a few tens of gigabytes of capacity available. Therefore we need a
different means of storage for processing large volumes of data.
    The IBP uses a soft-consistency model with time-limited allocations, thus operating in
a best-effort mode. The basic atomic unit of the IBP is a byte array, providing an abstraction
independent of the physical device the data is stored on. An IBP depot (server), which
provides a set of byte arrays for storage, is the basic building block of the IBP infrastructure
offering disk capacity. By mid-2004 the IBP data depots were present in all cluster locations
as well as distributed across other locations in the Czech academic network, providing a
total capacity of over 14 TB.
    The PBSPro [88] scheduling system is used for job scheduling across the whole META -
Center cluster infrastructure. The PBSPro supports queue-based scheduling as well as
properties that can be used for constraining where a job may be run based on user re-
quirements. These properties are static and defined on a per-node basis. Under ideal cir-
cumstances the PBSPro is capable of advance reservations and of estimating when a
specified node will be available for scheduling new jobs, unless a priority job is submitted.
The latter feature requires cooperation from the users that submit their jobs, as the PBSPro
needs an estimate of processing time provided by the job owner for each job; otherwise the
maximum time for the specified queue is used, and this results in an unrealistic estimate
of when a specified processing node will become available.
    For video processing we use the transcode tool [55]. As discussed in Section 7.2.2,
this tool is unable to directly produce the RealMedia format, which is one of the few
streaming formats with strong multi-platform support and which is also the format of
choice for our pilot applications, the CESNET video archive and the Masaryk University
lecture archive. Therefore we also need to use Helix Producer [72] to create the required
target format. The Helix Producer needs raw video with PCM sound as its input file, and
as this is rarely the format of the input video, we use transcode to pre-process data for
the Helix Producer.

8.3.2 Architecture
As shown in Figure 8.6, the architecture of the Distributed Encoding Environment com-
prises several components and component groups:

User interface The basic functionality of the user interface is to let the user provide input
     information about the job. The basic information is usually the input file (media), job
      [Figure: component diagram showing the User Interface, Job Preparation, Local
      Scheduler, Job Submission, Job Logging, and Job Monitoring modules together with
      the Media Processing Tools group (Splitter, Processor, Merger).]

       F IGURE 8.6: Distributed Encoding Environment architecture and components.

      chunk size, target file/media, and target format and its parameters. If multiple pro-
      cessing infrastructures are available, the user may also select which one will be used
      for processing. When the job monitoring module is available, the user interface may
      provide visualization of the job progress and inform the user about problems encountered.
Job preparation This module steers the preparation of the parallel job. It invokes the media
     analyzer to find the source format and its parameters, splits the job into chunks (auto-
     matically, or based on a user specification if provided), prepares job files, and invokes
     the job submission procedure to pass the jobs to the local job scheduling system on
     the computing facilities.
Local scheduler interface The local scheduler interface group provides one mandatory
     and one optional module: the mandatory job submission interface to send the jobs for
     computation on computing facilities, and optionally also a job monitoring interface so
     that the user can monitor the overall job status using the user interface. Should the
     system support the proposed scheduling model (Section 8.3.4), the job submission
     interface should provide not only "write only" access for job submission, but should
     also be capable of reporting when a specified resource will be available for scheduling
     according to the local scheduler's t_p^sched_free (with all the limitations discussed above).

Job logging facility The job logging facility takes care of keeping a permanent track of the
      job status, results, and especially error conditions. This component can be used only
      if the job monitoring interface is used.
Media processing tools The media processing group contains four modules: the media ana-
    lyzer for source media/format analysis, the media splitter for splitting the source media
    into the job chunks, the media processor, which is the actual processing tool, and the
    media merger for merging the resulting media chunks.
Distributed storage Any distributed storage system that the media processing tools can
      interact with. It is desirable to have a system that can use data replicas for optimizing
      performance and that supports data location hinting for the user to be capable of spec-
      ifying data location, so that the advanced functionality of the scheduling model can be
      used.

    The workflow in this architecture is as follows: a user specifies his jobs using the
user interface. The job preparation module analyzes the source data using the media analyzer,
and if an unsupported format is found, the user is notified and the processing is termi-
nated. When a supported format is found, the source media is split into smaller chunks,
based either on the chunk size or degree of parallelism specified by the user, or even
using predefined split points provided by the user. The splitting can be done through the job
submission interface or locally; the local splitting is sometimes desired since it avoids the
job enqueuing latency, which can be quite long on heavily loaded systems.
    The job preparation module then creates a job specification for each individual media
chunk and sends it via the job submission interface to the local job submission system on
the computing resource using the scheduling mechanism described in Section 8.3.4. It also
prepares the last job, which merges the resulting chunks into the target file or media. This
job is executed only if all the chunk-processing jobs finish successfully.
    If the job monitoring interface is present, all the jobs are monitored throughout their
lifetime and the results are gathered by the user interface. Also, if any error situation is
found, the user is notified both using the user interface and using the job logging facility.

Security Considerations. The Grid environment provides strong security via the Grid Secu-
rity Infrastructure (GSI) [22] based on X.509 certificates. Other projects have developed
enhanced security architectures based on GSI, and each computing Grid usually has some
security infrastructure for authentication, authorization, and accounting (AAA) readily
available. Because DEE is designed for the Grid environment, it relies on the Grid infra-
structure for the AAA functionality.

8.3.3 Access to IBP Infrastructure
A general purpose abstraction library called libxio [30] has been developed that pro-
vides an interface closely resembling standard UN*X I/O interface allowing developers to
easily add IBP capabilities into their applications providing access to both local files and
files stored in IBP infrastructure represented by IBP URI.
    The IBP URI has the following format:
lors://host:port/local_path/file?bs=number&duration=number \
   &copies=number&threads=number&timeout=number&servers=number \
   &size=number
where the host parameter is a specification of an L-Bone server (IBP directory server) to be
used, the port is a specification of the L-Bone server port (default is 6767), the bs is a spec-
ification of the block-size for transfer in megabytes (default value is 10 MB), the duration
specifies the allocation duration in seconds (default is 3600 s), the requested number of
replicas is specified by the copies (defaulting to 1), the threads specifies the number of
threads (concurrent TCP streams) to be used (default is 1), the timeout parameter is a
specification of the timeout in seconds (defaulting to 100 s), the servers parameter speci-
fies the number of different IBP depots to be used (default is 1), and the size specifies the
projected size of the file to ensure that the IBP depot has enough free storage. It is also
possible to override the default values using environment variables. If the given filename
doesn't start with the lors:// prefix, the local_path/file is accessed as a local file
instead.
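A hypothetical sketch of how such a URI can be decomposed with its documented defaults applied; the helper name and the returned layout are ours, and the real libxio parser may behave differently:

```python
from urllib.parse import urlsplit, parse_qs

# Documented defaults: bs in MB, duration/timeout in seconds;
# size has no default and stays None when absent.
DEFAULTS = {"bs": 10, "duration": 3600, "copies": 1,
            "threads": 1, "timeout": 100, "servers": 1, "size": None}

def parse_lors_uri(uri):
    """Split a lors:// URI into L-Bone host/port, local path, and
    options, applying the documented default values."""
    parts = urlsplit(uri)
    if parts.scheme != "lors":
        return None  # no lors:// prefix: treat as a plain local file
    opts = dict(DEFAULTS)
    for key, values in parse_qs(parts.query).items():
        if key in opts:
            opts[key] = int(values[0])
    return {"host": parts.hostname, "port": parts.port or 6767,
            "path": parts.path, **opts}
```

For example, `parse_lors_uri("lors://lbone.example.org:6767/tmp/video.xnd?bs=4&copies=3")` (an invented host and path) yields bs = 4 and copies = 3 with all remaining options at their defaults.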
    When writing a file into the IBP infrastructure, the local_path/file specifies the
local file where a serialized XML representation of the file in IBP will be stored. At least an
L-Bone server must be specified when writing a file into IBP. In our experience the file with
the serialized representation (meta-data) occupies approximately 1/10th of the actual data
size in IBP on average, but it varies to a large extent depending on the block size.

    When a file stored in IBP is read, the local_path/file specifies the local file contain-
ing a serialized XML representation of the IBP file. The user can also use a short-form URI
lors:///local_path/file, as the serving depots are already specified in the local XML
file.
    The transcode program has been modified so that it can load and store files from/to
IBP depots. As the transcode has very modular internals and some file operations are
implemented inside libraries for certain file formats, it is necessary to patch such libraries
as well. Currently we have patched the libquicktime [77] library, as QuickTime
MOV files are common products of editing software (e. g., AVID Xpress provides fast
codecs for producing DV video wrapped in a QuickTime envelope, which can be processed
by transcode when libquicktime is modified to recognize its FourCC identification
as common DV format to be processed using the libdv library [76]).

8.3.4 Scheduling Model
For the DEE prototype implementation, we have implemented a version of the N–to–1
scheduling algorithm shown in Figure 8.7, based on the processor and storage scheduling
analyzed in Section 8.2. The prototype implementation neglects the uploading overhead,
as the prototype pilot applications encode large volumes of data into significantly (typically
at least one order of magnitude) smaller data. In case it is impossible to saturate a processor
from the available data replicas, it also supports the prefetch functionality.

    1   foreach u ∈ U do
    2     D_u := ∅;
    3   od
    4   foreach d ∈ D do
    5     P_d := ∅;
    6   od
    7   foreach p ∈ P do
    8     t_p^sched_free := sched_free(p);
    9     /* sched_free(p) returns an estimate when processor p becomes free. */
    10    ProcTime(p) := l_u / min{s_{p,u}, b_{D,p}(t_p^sched_free)};
    11    /* assuming uniform size tasks u ∈ U */
    12  od
    13  foreach u ∈ U do
    14    p : ∀p1 ∈ P. t_{p1}^sched_free + ProcTime(p1) ≥ t_p^sched_free + ProcTime(p);
    15    while (s_{p,u} > b_{Du,p} ∧ D − D_u ≠ ∅) do
    16      d : ∀d1 ∈ D. b_{d1,p}(t_p^sched_free) ≤ b_{d,p}(t_p^sched_free);
    17      sched_depot(u, d); /* P_d := P_d ∪ {u}; D_u := D_u ∪ {d}; */
    18    od
    19    if s_{p,u} > b_{Du,p} ∧ D − D_u = ∅
    20       then prefetch(p, u); fi
    21    sched_job(p, u);
    22    t_p^sched_free := sched_free(p);
    23  od

        F IGURE 8.7: Simplified job scheduling algorithm with multiple storage depots per
              processor used for downloading (i. e. N–to–1 data transfer) and neglecting
              the uploading overhead.

8.3.5 Distributed Encoding Environment
An input video (typically in DV format with an AVI or QuickTime envelope, produced
by AVID or Adobe video editing software) first gets uploaded into the IBP infrastructure
from the source (editor's) workstation. The video is then taken from IBP by the IBP-enabled
transcode, re-multiplexed to ensure proper interleaving of audio and video streams (a
necessary prerequisite for correct splitting), split into smaller chunks which are uploaded
directly back to IBP, and encoded on many cluster worker nodes in parallel (see Figure 8.8).
At this stage the DEE system uses the PBSPro system to submit the parallel jobs on worker
nodes.

      [Figure: workflow diagram. The editing computer uploads DV to IBP. A single node
      downloads the DV, re-multiplexes and splits it, and uploads the DV chunks back to
      IBP. Many nodes download the chunks, transcode them to raw AVI, encode them to
      RealMedia, and upload the RM chunks. A single node downloads the RM chunks,
      joins them, uploads the result, and removes the chunks. The final RM file is uploaded
      to the streaming server.]

       F IGURE 8.8: Example DEE workflow for transcoding from video in DV to
             RealMedia format.

    The processing phase is somewhat more complicated when the target format is RealMedia,
the primary format for our pilot applications. During the parallel processing phase, each
video chunk is first transcoded to raw video with PCM sound, since this format is required
by the Linux version of the Helix Producer. All required transformations are performed at
this step, typically including high-quality deinterlacing and resizing, audio resampling,
optionally audio normalization, and possibly noise reduction (if the original file is very
noisy due to a low light level or to a low-quality camera). This parallel phase uses the
storage capacity local to each worker node to create and process the raw video file. The
raw file is then fed into the Helix Producer, and the resulting chunk of RealMedia video is
stored back into the IBP. As a final step the RealMedia chunks are joined, the complete file
is stored in the IBP, and the individual RealMedia chunks are removed.
    When the Helix Producer is replaced by various transcode output modules, or when
the raw video is piped into another encoding program, the DEE system can also be used
for producing other video formats: DivX versions 4 and 5, MPEG-1, MPEG-2, etc. Also,
when no intermediate (raw) file is needed in the encoding process, the system can directly
transcode the data from the IBP while simultaneously uploading results back into the IBP.
    Since the PBSPro scheduling system has no direct support for dynamic properties such
as the location of files in the IBP infrastructure, and there is currently also no network
traffic prediction service available in the META Center infrastructure, we have defined
static properties for the computing nodes that allow us to assess the proximity of the
computing nodes to the IBP storage depots. Computing nodes which have the same
processing characteristics and share the same network connection are given the same
static attributes. We measure a static average estimate of the bandwidth available between
each such set of nodes and each storage depot. The DEE system then uses these properties
together with the locations of the files to be transcoded and gives hints to PBSPro as to
where the jobs should be run, according to the algorithm shown in Figure 8.7. It can also
initiate replication of the data in the IBP infrastructure when the processing power of the
node is higher than the network capacity from the IBP depots that keep the data.
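The proximity hinting described above amounts to ranking the static node-set properties by their measured bandwidth towards the depots holding the file, and flagging replication when even the best set cannot feed the nodes fast enough. A minimal sketch with invented property names and bandwidth values:

```python
def choose_node_set(file_depots, node_sets, bw, node_speed):
    """Rank static node-set properties by the measured aggregate
    bandwidth bw[(node_set, depot)] towards the depots holding the
    file; flag replication when the chosen set's processing speed
    exceeds that bandwidth. All tables are illustrative assumptions."""
    def aggregate(ns):
        return sum(bw.get((ns, d), 0.0) for d in file_depots)
    best = max(node_sets, key=aggregate)
    need_replica = node_speed[best] > aggregate(best)
    return best, need_replica
```

With depot 'd1' reachable at 80 units from 'clusterA' and 20 from 'clusterB', and 'clusterA' processing at 100 units, the hint picks 'clusterA' and requests a replica because the node outruns the network.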

8.3.6 Performance Evaluation
It follows from the workflow, that scalability of the distributed processing is limited by the
following factors:

   • processing overhead comprising re-multiplexing, splitting, and final merging,

   • job startup latency due to job scheduling and submission system used,
   • minimum job chunk size for the parallel processing.

    While the first two factors are discussed in the detailed evaluation below, it is
worth noting several points about the minimum job chunk size for the parallel process-
ing phase, which depends mainly on the input media format. The main problem is that
in many common formats used for multimedia distribution over the network, not all
video frames are independent. The independent frames, called I-frames (or key-frames),
are typically followed by several P-frames, which describe differences relative to preceding
frames, and B-frames, which describe differences relative to both preceding and following
reference frames. As the P- and B-frames are meaningless without the I-frames, the chunk
size is limited by the maximum I-frame distance in the source format, which can be as large
as several hundred frames. Fortunately, the common source formats used for editing and
post-production like DV use at most inter-field compression and no inter-frame compression
(thus all the frames are I-frames), so that all the frames are independent and can be used as
split points. This, however, doesn't hold for media in MPEG-4 based formats such as DivX,
XviD, and Windows Media files.
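The split-point constraint above can be made concrete with a short sketch. This is illustrative code under stated assumptions, not part of DEE: it assumes the keyframe positions of the stream are already known and simply picks chunk boundaries that fall on keyframes while respecting a minimum chunk length.

```python
# Illustrative sketch (not DEE code): chunk boundaries for parallel
# transcoding must fall on keyframes (I-frames), since P- and B-frames
# cannot be decoded without their reference I-frame. Given the keyframe
# positions of a stream, pick split points so that every chunk is at
# least `min_chunk` frames long (the last chunk may be longer).

def split_points(keyframes, total_frames, min_chunk):
    """Return chunk start frames; every start is a keyframe and each
    chunk is at least min_chunk frames long."""
    starts = [0]
    for k in keyframes:
        if k - starts[-1] >= min_chunk and total_frames - k >= min_chunk:
            starts.append(k)
    return starts

# For DV every frame is an I-frame, so any frame can be a split point:
print(split_points(range(0, 6911), 6911, 1000))
# -> [0, 1000, 2000, 3000, 4000, 5000]
```

With an MPEG-4 style stream, `keyframes` would instead be sparse (one entry per GOP start), directly limiting how small the chunks can be.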
    The evaluation used the following transformations:

DV to RealMedia with remultiplexing For processing, we used DV data in an AVI enve-
     lope, approximately 1 GB in size, with 6911 DV frames, which corresponds to 4:36 of
     PAL video time-line at 25 frames per second. The DV audio and video data is not
     properly interleaved, and thus re-multiplexing is required before splitting the video
     into chunks. The data was stored in a single copy in the DiDaS IBP infrastructure, re-
     lying on automatic distribution of the data across the IBP. We also deployed storage
8.3. PROTOTYPE IMPLEMENTATION                                                                                 74

                                 Working nodes configuration
                    Processor                       2 × Pentium IV @ 3.0 GHz
                    RAM                             2 GB + 2 GB swap
                    Maximum observed cache size     1.8 GB
                    OS kernel                       Linux 2.4.27 SMP
                    Shared FS                       NFSv3
                    Local scratch disk              Software RAID 0 on 2 disks
                    Local scratch disk FS           XFS
                    NIC                             Intel PRO/1000
                    NIC status                      1 Gbps full-duplex

                            TABLE 8.1: Dedicated testbed configuration.

       optimization, and it turned out that neither the storage location in IBP nor the networking
       is the bottleneck of the processing, as the performance evaluation results are very close.
       During the processing, the video was down-scaled6 from 720 × 576 to 384 × 288 using an
       algorithm with Lanczos convolution7 and de-interlaced using a high-quality cubic-blend
       de-interlace filter, the audio was re-sampled from 48 kHz to 44 kHz, and the result was
       finally converted through the raw format to RealMedia as shown in Figure 8.8.
DV to RealMedia without remultiplexing The same source video, filters, target format,
     and IBP storage have been used as above, but this time, the source video was prop-
     erly multiplexed and thus no remultiplexing was needed before splitting.

    The evaluation was performed in two environments: a dedicated infrastructure, to allow
isolation of the experiment, and a shared infrastructure, to verify the behavior on a real-world
Grid computing infrastructure.

Evaluation Using Dedicated Infrastructure
For the evaluation of DEE performance, we used 4 dual-processor nodes that are part of the
META Center infrastructure and were dedicated for testing. The configuration of the nodes is
summarized in Table 8.1. For job scheduling, the META Center PBSPro system was used
with a dedicated job queue that sent jobs to the dedicated nodes only, so that the jobs
of other users did not interfere with the evaluation.
    The overall results of performance acceleration while changing the degree of parallelism
are shown in Figure 8.9. Detailed execution profiles for different parallelization degrees
without remultiplexing are shown in Figure A.2 on page 95 in Appendix A, while the cor-
responding profiles with remultiplexing are shown in Figure A.3 on page 96. It follows
from the plot that there is some unavoidable processing overhead, which is discussed in
the detailed analysis below. It also turns out that the run with degree of parallelism equal
to one performs better than expected. This is due to the fact that while the dual-processor
nodes were always fully occupied for degrees of parallelism higher than 1, for degree 1 the
processing phase occupies the whole node and the internal threads of the transcode process
can actually utilize more than one processor8 . The fitted curve also shows an average latency of
   6 Conversion from 720 × 576 to 384 × 288 also changes the aspect ratio, as a 4:3 ratio is needed for proper
viewing on a PC monitor with square pixels, while the original video with a 5:4 aspect ratio is designed for
display on TV sets with rectangular pixels.
   7 More details on sinc-based Lanczos filter functions can be found e.g. at http://www.worldserver.com/
   8 The transcode process comprises at least two computationally intensive threads. The first, called
tcdecode, reads the multimedia stream from the medium and decodes it to the raw format for further
processing by the second thread, which processes the resulting raw video with the filters specified. Usually,
there is also a third thread which encodes the video into the target format, but in our case it is just a
pass-through since the output is raw video. The second thread can even be split into multiple threads which
share the input using a frame-buffer when filters that support concurrency are applied (so called _M_ thread
filters). Thus on an SMP machine with one transcode process, it is common to see one of the transcode
threads consume 99% CPU and tcdecode consume approximately another 30% CPU in the parallel phase.
With an estimate of 3 minutes for the splitting and merging phases, the measured value with a single process
only should be approximately 23 minutes, which agrees with the fitted curve in Figure 8.9.
        F IGURE 8.9: Acceleration of DEE performance with respect to the degree of
                     parallelism. Results for the set without remultiplexing are shown in
                     green, while the set with remultiplexing is shown in red; 95% confidence
                     intervals are shown for both sets. Fitted curves: y = 1/(0.0498*x) + 3.05
                     (RMS = 0.030) with remultiplexing and y = 1/(0.0520*x) + 2.52
                     (RMS = 0.040) without it. [Plot omitted: time in minutes versus degree
                     of parallelism, 0-11.]

the processing to be approximately 3 minutes with remultiplexing and 2.5 minutes with-
out it. The detailed profiles reveal all the phases described in Figure 8.8. The file needs to
be re-multiplexed first, as the input file doesn't have the proper video and audio interleaving
needed for the splitting phase; immediately afterwards it is split into chunks, uploaded
back into IBP, and the individual processing jobs are spawned. Then there is a latency induced
by the PBSPro scheduler, during which no jobs run until they are started by PBSPro. The
distributed parallel phase follows when the number of processes reaches the desired de-
gree of parallelism. The obvious step at the startup of this phase for 8 processes occurs because
PBSPro started the first 4 jobs at once and the other 4 jobs only after some delay, despite
the fact that all the computing resources were available and idle. In the final phase, separated
again by PBSPro scheduling latency, the merging job is run.
    The overhead of the first step can be reduced by using a properly interleaved source
media file, so that re-multiplexing can be avoided, which approximately halves the initial
overhead.
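The scaling behavior above follows a simple fixed-overhead model, t(n) = t_par/n + t_ovh: the parallel transcoding work divided by the degree of parallelism n, plus a constant overhead for splitting, scheduling, and merging. The short sketch below only evaluates the fitted curves reported for Figure 8.9; the function names are illustrative.

```python
# Fixed-overhead scaling model behind the fitted curves of Figure 8.9:
# total time (minutes) = 1/(rate * n) + overhead, where n is the degree
# of parallelism. Coefficients are the reported fits; names are ours.

def predicted_time(n, rate, overhead):
    """Predicted total processing time in minutes for parallelism n."""
    return 1.0 / (rate * n) + overhead

def with_remux(n):
    return predicted_time(n, 0.0498, 3.05)

def without_remux(n):
    return predicted_time(n, 0.0520, 2.52)

print(round(with_remux(1), 1))  # -> 23.1, matching the single-process estimate
print(round(with_remux(8), 1))  # -> 5.6
```

The constant term dominates for large n, which is why the curves flatten out: no degree of parallelism can push the total time below the roughly 2.5-3 minute serial overhead.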

Evaluation Using Shared Production Infrastructure
The same evaluation process using the same input data has also been performed with
the META Center infrastructure shared by all users. The performance profile for 8 parallel
processes is shown in Figure A.1 in Appendix A. It turns out to be similar to the dedicated
infrastructure except for two things. First, the processing takes longer since less power-
ful processing nodes were chosen by the scheduling algorithm as the more powerful ones
were busy. Second, the parallel phase of the computation ends more step-wise as there
were pronounced differences among the individual processing nodes and thus the calcu-
lation took different times on different nodes.