Q111 - GridPP

					GridPP Quarterly Report
Area         Tier-1
Quarter      Q1 11
Reported by Andrew Sansum

GridPP no.   Tier-1 no.            Description                         Source            Target
3.1.1        3.1.1                 Availability of LFC service
3.1.2        3.1.2                 Availability of LHCb LFC service
3.1.3        3.2.1                 Availability of WMS service
                                   Availability of LHCb, ALICE and                       98%?
3.1.5        3.3.1, 3.3.2, 3.3.3   CMS VO boxes
                                   Availability of R-GMA Registry                        99%
3.1.6        3.4.1                 service
3.1.7        3.5.1                 Availability of RGMA service
3.1.8        3.6.1                 Availability of CE service                            99%
3.1.9        3.10.3                SAM availability of FTS service                       99%
                                   SAM availability of MyProxy                           99%
3.1.10       3.11.1                service
3.1.11       3.12.1                Availability of UI service                            95%
                                   SAM availability of site BDII                         99%
3.1.12       3.13.1                service
                                   SAM availability of toplevel BDII                     99%
3.1.13       3.13.2                service
                                   3D service for ATLAS and LHCb
3.1.15       ?                     availability
                                                                    Extractable from
                                   WLCG Service Availability
                                   Target (set lower by WLCG than h/LCG/MB/availabilit
3.2.1        1.2.2                 MoU)                             y/site_reliability.pdf
                                   Meet WLCG MoU target             callout.xls. Total     95%
                                   response time for operational    number of pageouts,
                                   problems (2 hours in prime shift number missed
                                   and 12-48 hours outside prime    pageouts(2h day, 2h
3.2.2        1.2.3                 shift)                           night)
                                   Fraction of WLCG MoU                                    100%
3.2.3        1.2.7                 commitment for CPU
                                   Fraction of available T1 KSI2K   Schedule.
3.2.4        ?                     used in quarter                  Used/Available         20-95%
3.2.5        1.4.1                 Number of Security Incidents                            2/year
                                   % Time [weighted by resource                            1%
3.2.6        1.4.2                 share] on VO blacklists
                       Number of Incidents reaching                             <2 pa
                       level 3 in the disaster
3.2.7    1.4.3         management system
                       Number of Incidents reaching                             0
                       level 4 in the disaster
3.2.8    1.4.4         management system
                       % met of normalised UB
3.2.9    1.6.3         allocation for CPU
3.2.10   1.6.4         Job Efficiency (CPU/Wall)                                85%
                                                          Metric 0.106 or
                                                          0.107 and GOC
                                                          accounting for LCG.
3.2.11   1.6.5         Farm Occupancy                     Wall clock time
                       Percentage of GRIDPP3 Staff in                           93%
3.2.12   2.1.3         Post
                       Quarterly milestones/metrics                             100%
                       report to GRIDPP available
3.2.13   2.1.4         within 1 month of quarter end
                       Number of GGUS tickets
                       handled (assigned only from Q3
3.2.14   4.1.1         2009).
                                                                                3 or less
                       Number of GGUS tickets not                               missed per
                       responded to within 2 working                            month
3.2.15   4.1.2         hours (any time for alarm tickets)
                       Fraction of WLCG MoU               Tier1 quarterly       100%
3.4.1    1.2.9         commitment for Tape                report
                       Fraction of WLCG MoU                                     100%
3.4.2    1.2.8         commitment for Disk
                       % met of UB Allocation for Tape                          100%
3.4.3    1.6.1
3.4.4    1.6.2         % met of UB Allocation for Disk
                       Fraction of available T1 Disk
3.4.5    ?             used in quarter                                          20-95%
3.4.6    3.10.1        Data imported to Tier-1 via FTS.                         nth
                       Data exported from Tier-1 via                            <1500TB/mo
3.4.7    3.10.2        FTS.                                                     nth
                                                                                99% each

3.4.8    5.1.1-5.1.4   CASTOR SAM tests: LHC VOs
                                                                                <= 6
                                                                                h level
                                                                                severe or
3.4.9    5.1.3         CASTOR Incidents reported                                higher;
                 Number of File-system         1 per month
3.4.13   7.1.2   corruptions per month
                                               <1 in 500
                 Number of Damaged GRIDPP
3.4.14   7.3.1   Tapes, leading to data loss
3.4.15   7.3.4   Reliability (Robot up-time)   >99%
                 Actual data volume for LHC
3.4.16   7.3.5   (tape-based view)
3.4.17   ?       Data rates to tape (MB/s)     >200
Q1 11  Q4 10  Q3 10  Q2 10   Q1 10   Q4 09   Q3 09
99%    100%   N/A    100%    99%     100%    100%

99%    100%   N/A    100%    100%    92%     100%

100%   100%   99%    100%    100%    100%    100%

100%   100%   100%   100%    100%    100%    100%

       N/A    N/A    N/A     N/A     N/A     N/A

       100%   100%   100%    100%    100%    100%

99%    100%   99%    100%    90%     94.3%   84%
99%    99%    99%    100%    99%     98.5%   99%
100%   100%   99%    100%    100%    100%    100%

100%   100%   100%   100%    100%    100%    100%
100%   100%   100%   99%     100%    100%    100%

100%   100%   100%   100%    100%    100%    100%

100%   100%   100%   100%    100%    100%    N/A

97%    99%    99%    97%     90%     93.7%   86%

99%    99%    100%   95%     100%    95%     98%

113%   100%   100%   100%    100%    100%    100%

63%    68%    70%    33%     60%     38%     38%

0      0      0      0       0       0       0
0        0        0       0        0        1        1

0        0        0       0        1        2        0

101%     98%      98%     93%      91%      101%     84%

89%      85%      90%     82%      85%      80%      75%
77%      81%      80%     43%      69%      60%      63%

98%      104%     104%    104%     104%     106%     104%

10-May   19-Jan   5-Nov   22-Jul   27-May   20-Nov   20-Nov

16       19       15      12       13       13

1        3        2       0.5      2        5

100%     100%     100%    89%      89%      50%      100%

121%     101%     100%    100%     93%      78%      100%

99%      102%     95%     90%      78%      53%      106%

114%     105%     129%    154%     95%      98%      79%

53%      55%      46%     49%      49%      41%      51%

399      385      240     351      172      228      123

370      506      478     662      275      216      213

99%      99%      99%     100%     92%      93%      83%

0.00     0.33     0.33    0.33     0.33     1.17     2.00

0.7            3.0            0.3            1.0            0.3           3             0

100%           99%            100%           100%           100%          100%          97%
2792.3         2613.0         2240.0         2007.0         1704.7        1400.3        1213.5

         125            180            140            155            90            90            90
Q2 09    Q2 08     Q1 08
99%      100%


99%      100%

96.9%    Not met

N/A      N/A

100%     99%

91%      99%       97%
98.28%   99.83%    99.20%
99.25%   100.00%

99%      100%      100%

98.38%   100%      100%


91%      99%       96%?


100%     100%      152%

52%      47%       54%

0        0         0
0            0              0

0            0              0

92%          94%            49%

88%          80%            90%
67%          50%            85%

102%         67.60%         84%

20-Aug       N/A for June   N/A for March

52 assigned  9 assigned, 14 updated  17 assigned, 10 updated

100%         109%           100%

100%         100%           103%

105.0%       119%           131%

87.3%        77%            91%

48%          54%            60%

207          147TB          367.66 TB

355          169 TB         382.19 TB

92.97%       96.00%

      1.00         1.00     6

0            1 CMS tape   0

100%         100%         100%
1064         769TB        654TB

Close to target
Not OK
Not yet able to be measured


[2010Q4]Now decommissioned

[2010Q4]We do not monitor 12 hour response - we only record how we
meet our own target response time (2 hours day or night).
[2010Q4]We are now gathering this data and will have a metric next quarter
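
The ratio metrics above, such as 3.2.4 (Used/Available) and 3.2.10 (Job Efficiency, CPU/Wall), can be checked against their targets mechanically. The following is a minimal sketch only: the function names and the sample hour counts are hypothetical and are not taken from the Tier-1 accounting system. Only the targets (85% for job efficiency, 20-95% for utilisation) come from the table.

```python
def ratio_percent(numerator: float, denominator: float) -> float:
    """Return numerator/denominator expressed as a percentage."""
    if denominator <= 0:
        raise ValueError("denominator must be positive")
    return 100.0 * numerator / denominator


def meets_minimum(value: float, minimum: float) -> bool:
    """True when a metric reaches a single-valued target such as 85%."""
    return value >= minimum


def within_band(value: float, low: float, high: float) -> bool:
    """True when a metric lies inside a target band such as 20-95%."""
    return low <= value <= high


if __name__ == "__main__":
    # Hypothetical sample figures, for illustration only.
    cpu_hours, wall_hours = 890.0, 1000.0
    used, available = 630.0, 1000.0

    efficiency = ratio_percent(cpu_hours, wall_hours)
    utilisation = ratio_percent(used, available)

    print(f"Job efficiency {efficiency:.0f}%: target met = {meets_minimum(efficiency, 85.0)}")
    print(f"Utilisation {utilisation:.0f}%: in band = {within_band(utilisation, 20.0, 95.0)}")
```

Keeping the band check separate from the minimum check mirrors the two kinds of target that appear in the Target column.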
Based on Milestones 1.3d-2

Milestone no.   Description                                     Owner           Due date
5.1.1           Castor Gen Instance in production               Bonny Strong      1-Feb-08
                Trial Tier-1 Blog to improve information flow   Derek Ross       12-Feb-08
1.5.1           to Tier-2s
                Have successfully taken part in CCRC08(1)       Derek Ross        1-Mar-08
1.3.1           (with 3 experiments)
7.4.1           Upgrade RAL firewall to 10G operation           Robin Tasker     18-Mar-08
                Experiment requirements for second CCRC08       Matt Hodges      21-Apr-08
4.1.1           run understood.
1.2.2           On-Call Service Tested (WLCG-02)                Andrew Sansum    31-Jan-08
1.2.3           On-Call System Available (WLCG-07-03)           Andrew Sansum    31-Mar-08
                Tier-1 able to meet 2008 WLCG MoU               John Gordon       1-Apr-08
1.2.4           resource commitment
                Disc and CPU (only) 2009 Capacity               Andrew Sansum    26-May-08
1.2.5           Procurements Started
6.3.2           2008 Disk Tender Started                        Martin Bly       31-May-08
7.2.1           2008 CPU Tender Started                         Martin Bly       31-May-08
                Have successfully taken part in CCRC08 with     Derek Ross        1-Jun-08
                4 experiments
                Set up CASTOR Gen instance for small            Bonny Strong      1-Jun-08
5.1.4           experiments
                Assign experiment coordinator (depends on       Andrew Sansum    30-Jun-08
                PPD recruitments, or existing staff if to be
4.1.3           done prior to second phase of CCRC08).
                Experiment requirements for first data-taking   Derek Ross       30-Jun-08

                2008 purchasing plan available                  Andrew Sansum     31-Jul-08

                Ensure well-defined experiment contacts in      Matt Hodges      30-Jun-08
                place (at Tier-1 and experiment ends).

5.1.9           CASTOR certification testbed ready to use       Chris Kruk        1-Aug-08
                FY08/09 Tape media capacity procurement         Tim Folkes        1-Aug-08
7.3.5           started
                FY08/09 Tape robot capacity procurement         David Corney      1-Aug-08
7.3.6           tender out
        Ready for 2008 running                         Andrew Sansum   31-Aug-08

        Disaster and Business Continuity Plan          Andrew Sansum   30-Apr-08

        Recruitment of additional GRIDPP3 posts        Andrew Sansum   30-Mar-08

        R89 Available for Installation                 Martin Bly       1-Dec-08

        LHC Monitoring infrastructure operational at   Robin Tasker     1-Sep-08

        DCache Service Ends                            Derek Ross      31-Dec-08

         Review of overall effectiveness of experiment Matt Hodges        31-Dec-08
         support in conjunction with the experiments.
         Expect to do this via the User Board.
         2008 Disk Hardware Received                 Martin Bly            1-Sep-08

         2008 CPU Hardware Received                  Martin Bly            1-Sep-08

         Provide site dashboard for experiments.     Gareth Smith          3-Sep-09

         2009 Disk Tender Started                    David Corney         1-May-09

         2009 CPU Tender Started                     David Corney         1-May-09

         Experiment requirements for 2009 running.   Catalin Condurache   1-May-09

         2008 Disk hardware Accepted and bill paid   Martin Bly           31-Aug-09

         2008 CPU hardware Accepted and bill paid    Martin Bly           31-Aug-09

5.4.1    General ADS Service Ends                    Andrew Sansum        31-Mar-11
         Migration to 64bit                          Martin Bly           31-Jul-09

         2009 Disk hardware Accepted and bill paid   Martin Bly           31/6/2010

         2009 CPU hardware Accepted and bill paid    Martin Bly           31/02/2010

         2010 Disk Tender Started                    Martin Bly             2-Jan-10
         2010 CPU Tender Started                     Martin Bly             2-Jan-10
         Atlas centre phased out other than for      Martin Bly            31-Mar-10
         emergency backup servers
1.2.9    2011 capacity procurement Started           Andrew Sansum          1-Jun-10
7.1.3    2010 Disk hardware Accepted and bill paid   Martin Bly             1-Dec-10
7.2.5    2010 CPU hardware Accepted and bill paid    Martin Bly             1-Dec-10
         Tier1 fully operational in R89              Martin Bly            31-Mar-09

1.2.1    On Call Service Specified (WLCG-01)         Andrew Sansum         31-Dec-07
3.2.3    Resilient national WMS Service in place     Catalin Condurache     31-Jul-08
         Disaster Plan fully implemented             Andrew Sansum         30-Jan-09

         Ready for 2009 running                      Andrew Sansum          1-Aug-09

         Tier-1 able to meet 2009 WLCG MoU           Andrew Sansum         31-Aug-09
         resource commitment

         2010 capacity procurements Started           Andrew Sansum     1-Jun-09

         Tier-1 able to meet 2010 WLCG MoU           Andrew Sansum          1-Jun-10
1.2.8    resource commitment
         Tier-1 able to meet 2011 WLCG MoU           Andrew Sansum          1-Apr-11
1.2.10   resource commitment

         Migration plan Agreed by GridPP              Martin Bly

         Complete strategic review of Tier-1 objectives David Corney
         Order Placed for 10,000 slot tape robot      Tim Folkes       10/1/2009
         Tape robot received                          Tim Folkes        1-Dec-08

         GRIDPP Migrated to new tape robot            Tim Folkes       31-Mar-09

         R89 Migration document                       Andrew Sansum    30-Oct-09
Not yet due

Date complete











 1-Nov-08 Tier-1 Report: Monday 17th November


14-Dec-09 General Incident Response Document version 1.5. System exercised to level 3 (of 4 levels) with PMB involvement.










 30-Jul-09 We have been unable to
           capture the experiments'
           requirements in a formal
           document as they are
           only partially defined and
12-Feb-10 continue to evolve.


 31-Mar-11 Shutdown announced and access disabled.




 31-Jan-11 PMB Minutes
 11-Mar-11 PMB Minutes

 14-Dec-10 DM system is fully
           operational as presented
           at Cambridge GRIDPP


  1-Feb-10 Closed partially
           completed. GRIDPP did
           not intend to fully meet
           the Tape MoU
           commitment owing to the
           reduced tape
           requirements caused by
           the changed LHC




Proposed to be Rescheduled
Likely to be late
Proposed Milestone change
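
Spreadsheet exports of this kind can carry dates in mixed forms: D-Mon-YY (1-Feb-08), slash-separated (10/1/2009), or a bare day serial when a cell loses its date format. The sketch below shows one way to normalise such cells to D-Mon-YY. It rests on two assumptions that the sheet itself does not confirm: that serials use the 1900 date system (the Excel default on Windows), and that slash-separated dates are month/day/year. The function names are hypothetical.

```python
from datetime import date, datetime, timedelta

# Assumption: 1900 date system. Using 30-Dec-1899 as day 0 gives correct
# results for every serial after the fictitious 29-Feb-1900 (serial 60).
EXCEL_DAY_ZERO = date(1899, 12, 30)


def from_excel_serial(serial: int) -> date:
    """Convert a spreadsheet day serial to a calendar date."""
    return EXCEL_DAY_ZERO + timedelta(days=serial)


def normalise(cell: str) -> str:
    """Return the cell as D-Mon-YY; raises ValueError for impossible dates."""
    cell = cell.strip()
    if cell.isdigit():
        parsed = from_excel_serial(int(cell))
    elif "/" in cell:
        # Assumption: month/day/year ordering.
        parsed = datetime.strptime(cell, "%m/%d/%Y").date()
    else:
        parsed = datetime.strptime(cell, "%d-%b-%y").date()
    return f"{parsed.day}-{parsed:%b-%y}"


if __name__ == "__main__":
    for raw in ("1-Feb-08", "1/31/2011", "40633"):
        print(raw, "->", normalise(raw))
```

Entries that are not real calendar dates under these assumptions (for example a 31st of February) raise `ValueError` rather than being silently adjusted, so they can be reviewed by hand.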



Tape media purchase and robot purchase defined separately in 7.3

Failed, as we did not get Alice running during CCRC08 phase 2

Waiting for disk servers from Tier-1

5-Feb-09 Matt Hodges assigned to this role.
22-jul-08 Still pending on PPD recruitment of experiment support staff. I’d
like to see who the team are before nominating a coordinator.
22-jul-08 Have some information from Atlas, but it is not supposed to be
used outside Atlas as it may change. Am talking to CMS UK to get
information out of central CMS. Have received information from LHCb.
04-jul JG to raise with GDB and agree new date
22-apr-08 GRIDPP will not have a formal budget until end of July. AS will
keep GRIDPP closely informed and involved in the developing purchasing
2-jul-08 Expected date should be 31 July 2008. GRIDPP financial plan not yet
agreed with STFC but getting closer.
22-jul-08 GRIDPP have made a proposal to STFC. We therefore have a draft
financial plan and could start to construct a spend plan. I may be able to hit
the expected date for this as all relevant info is now available. Main constraint
is a little effort to collate input

22-jul-08 Will try and do Tier-1 side this week; suggest changing expected
date to August 1
2-jul-08 On agenda for quarterly UB meeting on 2008-06-24.

New milestone at jul-08
[2008Q3]Waiting for GRIDPP approval of spend plan

Added 4-jul-08
[2008Q3]The service was ready for data taking. 22-jul-08. CASTOR upgrade
to 2.1.7 is looking doubtful. Otherwise assumes that Grid team upgrades get
done and disk capacity deployed is as per plan. (Ok but are they going to
[2009Q3]Major incident contingency plans presented to the GRIDPP review
[2009Q2]The work is almost complete but a few contingency plans are
[2009Q1]A disaster management system has been designed and some
contingency plans produced. The system has been tested. Further work is
required on operational procedures and contingency plans. This will be
completed in June 2009.
[2008Q4]This has still not been completed but is planned to be in
Q1 2009. [2008Q3]Work on disaster planning and business continuity is now
underway as a high priority item. Expect to complete by the end of
November. 22-jul-08 Not a priority with me but needs progressing. Not likely to
hit August target as I am flat out on procurement/recruitment until start
August, then leave. Then two weeks left with more recruitments. Need a plan
here to progress.
[2009Q2]Just 0.5 FTE of effort remains to be recruited. Interviews have been
completed and an offer is being prepared.
[2009Q1]Further recruitment has been carried out and we are now just 1 FTE
short of the original GRIDPP plan. This missing effort is the two experiment
support staff. However STFC restrictions on recruitment in PPD make it
unlikely that this milestone will be completed before September, if ever.
[2008Q4]A further 1 FTE has been recruited and 2 FTE are outstanding.
Recruitment of the last member of the production team has been difficult -
advertised twice without success and also attempts via agency. Further
interviews expected in February. Experiment support posts have just
interviewed and an offer is expected to be made this week. It is possible we
will reach target effort in Q1 or early Q2. 30/10/08 Situation improving with
3 FTE recruited, but unlikely to have remainder in place until Q1 2009.
22-jul-08 Unlikely to hit 31 October deadline – 30 November may be possible.
[2008Q4]R89 is not available. When 22/12/08 became unachievable a new
schedule was constructed based on an expected delivery date of 9th
February. This date may be met but is now considered unlikely to be
achievable. [2008Q3]Target date for completion is 22/12/08 and our
assessment is becoming more optimistic that this is likely to be achieved.
30/7/08 Highly likely to be December 2008
[2009Q3]RAL networking had problems making the system work through the
site firewall. Progress is slowly being made but not all functionality is in place
yet. [2009Q2]Waiting on work by Dante.
[2009Q1]Installation by Dante engineers is complete by the end of April. Final
configuration to grant external access to Dante still to be done. Dante still to
commission service.
[2008Q4]Installation is underway and is likely to be completed in the next 4
weeks. [2008Q3]The MoU has been signed with Dante and the installation
schedule is in their hands. [2008Q2]The technical solution for LHCOPN
monitoring is agreed; there is now a discussion between LHC partners, the
NRENs and DANTE on the formalities, for example the need and scope for a
MoU etc. Once these arrangements are in place, DANTE (who are
[2008Q3]Tape service expected to end in December but disk service will
probably have to continue until March 2009 22-jul-08 Ongoing, LHCb and
CMS migrated, Atlas have tape files remaining only, Minos soon to begin
migration - this should allow us to shutdown and
the dCache ADS interface
[2009Q1]Review underway - waiting for late experiment response. Expect
write-up to be completed by May.

[2009Q1]Delayed pending R89 availability.
[2008Q4]Ready for delivery but held pending machine room. [2008Q3]Disk
delivery is on schedule to arrive just after R89 becomes available.
[2008Q2]needs to slip in line with Andrew's schedule: delivery to be complete
[2009Q1]Delayed pending R89 availability.
[2008Q4]Expected to be ready for delivery in late February. Held until
machine room becomes available. [2008Q3]Needs to slip in line with Andrew's
schedule: delivery to be complete ~31-Dec-08
[2009Q1] rescheduled
[2008Q4]Needs to be rescheduled. This is now on the delivery plan for the
production team; a date is expected to be fixed by the end of 2009Q1.
[2008Q3]Not likely to happen by target date. A project for the production team
- needs to be rescheduled. [2008Q2]Very remote. Project unassigned?

[2009Q1]Rescheduled to commence once we have cleared the previous
tender from the system. Leads to an early January delivery.
[2008Q4]Needs to be rescheduled in-line with LHC planning
[2009Q1]Rescheduled to commence once we have cleared the previous
tender from the system. Leads to an early January delivery.
[2008Q4]Needs to be rescheduled in-line with LHC planning
[2009Q1]rescheduled to 1st May to reflect last possible date we can accept
changes to requirements prior to STEP09.
[2008Q4]Needs to be rescheduled in-line with LHC planning. [2-jul-08]Very
remote, and Derek's (subject to clarification as above)? (previously Matt
[2009Q3]Problems with acceptance have delayed this. Not likely to be
complete before December 2009.
[2009Q1]Rescheduled to reflect WLCG High level milestones
[2008Q4]Delay in the R89 schedule has caused this to be late.
[2008Q3]Expected to be March 2009 [Q2]On track but very tight.
[2009Q1]Rescheduled to reflect WLCG High Level Milestones
[2008Q4]Delay in the R89 schedule has caused this to be late.
[2008Q3]Expected to be March 2009 [Q2]On track but very tight.
[2010Q1]Situation reviewed. Service remains available (and free) over GRIDPP3 but propose moving PP data off (or deleting it mor
[2009Q3]90% of capacity is now in SL5. Although the SL4 service has not yet
been closed, this milestone is effectively complete.
[2009Q2]WLCG has not been ready to migrate (and still has some problems
to resolve); the new proposed date is 30-Sep-09, updated from 31-Jul-09. A
test service is already available and an upgrade plan exists which will allow us
to meet the September deadline.
[2009Q1]Software release was not available until recently. Task rescheduled
to match WLCG requirements. [2008Q4]This remains a low priority item.
[2008Q3]This is not in our strategy for this year - needs a change to the
project. [Q2]On track but very tight.
[2010Q1]Ongoing problems but expected to pass acceptance 5th November
[2009Q4]Expect to complete this in April
[2009Q2]In light of plan for phased delivery this cannot be completed until
June 2010
[2009Q1]Propose we change this to be 31-Feb10

[2009Q2]Not likely to be complete until April/May following a February
delivery. Delay was caused by late agreement of financial plan with STFC.
[2009Q1]Propose we change this to be 31-Feb10

[2009Q4]Needs to be started by 1 May 2010 for December delivery
[2009Q2]We need new dates here according to GRIDPP plan
[2009Q4]Needs to be started by 1 May 2010 for December delivery
[2009Q2]We need new dates here according to GRIDPP plan
We have not wished to migrate the remaining critical servers to the UPS room
until the UPS issue was resolved. Only a few services remain in ATLAS

[2010Q4]Hardware delivered - acceptance tests running
[2010Q4]Hardware delivered - acceptance tests running
[2009Q1]Expected by July 2009
[2008Q4]Unlikely to be achieved owing to late delivery of
R89. [2008Q3]Continues to be possible - but increasingly tight. [2008Q2]This is
very tight and may still be possible. Depends critically on availability of R89.

[2009Q1]Work plan indicates that this will be completed in June.
[2008Q4]This is unlikely to be completed for some months after the plan is

[2009Q3]This milestone WAS successfully completed at the end of
September following certification of the Tier-1 during STEP and the end of our
development cycle at the end of September. Subsequently after the end of
the quarter new operational problems emerged, however all the work required
in the project plan for this milestone was completed. Propose closing it as
[2009Q1]Rescheduled to meet new LHC schedule
[2008Q4]Needs to be rescheduled owing to LHC slippage
[2009Q4]Able to meet disk commitments - however GRIDPP never planned
to meet tape MoU commitment and we will be unable to do so until T10KB
service is operational. Should be able to meet 2010 commitment.
[2009Q3]We are unable to meet our disk commitment owing to hardware
problems with the disk. GRIDPP have chosen not to meet our tape
commitment as current use does not justify the expenditure on tape that was
originally planned.
[2009Q1]Rescheduled to match WLCG High Level Milestone
[2008Q4]The disk and CPU capacity will not be available in time, however
with the slippage in the LHC schedule and substantial under utilisation of
existing resources this is unlikely to be a problem in practice.
WLCG changed this deadline to be June. We are on track
[2011Q1]CPU was a few days late following deployment problem. GRIDPP
plan to meet tape MoU using buy on demand
[2010Q4]Owing to late availability of T10KC hardware we are unlikely to be
able to reach tape MoU commitments until August. However actual usage is
well below current available capacity and this is unlikely to be a problem.

[2008Q4]The migration plan was discussed at the PMB. [2008Q3]Plan to agree
this in first half of November. [2008Q2]Ongoing
02-jul-08 Ongoing
22-apr-08 Possible inconsistencies with PO overall migration plan. Needs to
be investigated and clarified. Keep under close observation.

[2009Q1]Delayed by R89 delays. Robot expected to be received in May 2009

[2009Q1]Delayed by R89 delays. Expected in July 2009

Document is available on request and has been distributed to the PMB
[2010Q1]Situation reviewed. Service remains available (and free) over
GRIDPP3 but propose moving PP data off (or deleting it more likely) by the
end of GRIDPP3.
[2009Q4]Major users migrated. It is not obvious there is significant benefit
from pursuing the minor users.
[2009Q3]Tape reclaim from major users has commenced.
[2009Q2]This has simply not been a priority in a period of major activity.
[2009Q1]A plan has been provided to the UB. Closure process has started.
Service is read only. Expected to terminate read access by September 2009.
22-jul-08 Agreed with AS. Also to terminate ADS service (effectively),
allowing occasional future access on case by case. Otherwise experiments
migrate into CASTOR.

Effort (FTE)
                                                                           GridPP Funded              Unfunded
Site     Work area            Name(s)                                      Month 1  Month 2  Month 3  Month 1  Month 2  Month 3
Tier-1   CPU                  Adams, Bly, Hafeez                              0.85     0.85     0.85     0.00     0.00     0.00
Tier-1   DISK                 Adams, Bly, Thorne, Hafeez                      1.70     1.70     1.70     0.00     0.00     0.00
Tier-1   CASTOR/Tape          Viljoen, Kruk, De Witt, Ketley                  2.30     2.30     2.30     0.50     0.50     0.50
Tier-1   Experiment Support   Condurache, Hodges, Dewhurst, Lahiff            1.50     1.50     1.50     0.00     0.00     0.00
Tier-1   Core                 Ross, Thorne, Bly, Pani, Ketley, Collier        4.25     4.25     4.25     0.80     0.80     0.80
Tier-1   Operations           Robinson, Norris, Patel, Sheppard               1.00     1.00     1.00     0.80     0.80     0.80
Tier-1   Network              Metcalf, Jesset                                 0.00     0.00     0.00     0.50     0.50     0.50
Tier-1   Deployment           Ross, Hodges, Condurache                        2.50     2.50     2.50     0.40     0.40     0.40
Tier-1   Management           Sansum                                          0.90     0.90     0.90     0.80     0.80     0.80
Tier-1   Production           Smith, Kelly, Idicula                           2.80     2.80     2.80     0.00     0.00     0.00
Total                                                                        17.80    17.80    17.80     3.80     3.80     3.80


The above is an estimate of what we expect to be booked. However, we still have work to do to reconcile the plan against actual bookings.
I hope to provide actual booking data by the end of the month.
Core bookings reduced by 0.6 FTE in Q4 owing to staff illness (under SSC, sickness is not charged to the project, unlike in previous years).
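
The Total row of the effort table can be cross-checked against the per-area figures. A minimal sketch, assuming one monthly figure per work area (the three months are identical in the table); the `EFFORT` dictionary and function name are hypothetical, and the values are copied from the table above. `Decimal` is used so that the two-decimal figures add exactly.

```python
from decimal import Decimal

# Work area -> (GridPP funded FTE, unfunded FTE) for a single month.
EFFORT = {
    "CPU": ("0.85", "0.00"),
    "DISK": ("1.70", "0.00"),
    "CASTOR/Tape": ("2.30", "0.50"),
    "Experiment Support": ("1.50", "0.00"),
    "Core": ("4.25", "0.80"),
    "Operations": ("1.00", "0.80"),
    "Network": ("0.00", "0.50"),
    "Deployment": ("2.50", "0.40"),
    "Management": ("0.90", "0.80"),
    "Production": ("2.80", "0.00"),
}


def monthly_totals(effort):
    """Return (funded, unfunded) FTE totals across all work areas."""
    funded = sum(Decimal(f) for f, _ in effort.values())
    unfunded = sum(Decimal(u) for _, u in effort.values())
    return funded, unfunded


if __name__ == "__main__":
    funded, unfunded = monthly_totals(EFFORT)
    print(f"Funded: {funded} FTE, unfunded: {unfunded} FTE")
```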
Summary Sheet

                                    Non Capacity  Tier-1 Misc  Tier-1 CPU  Tier-1 Disk  Tape Infrastructure  OPN  Tape Capacity  Total
Spent on FY09 Commit
Spent on FY10 Commit
Total Spent to Date
Total Expected Future Spend
Outturn Forecast For FY10
Total GRIDPP Budgeted Spend in FY
GRIDPP Budget - Outturn
STFC Allocated FY10 Spend
STFC FY10 Allocated - Outturn

FY10 Commitments expected in FY11

Finance Period

                                                                                                       STFC Allocation


                                                                                                       Allocation - Plan               £77,000

Progress over last Quarter
Work area: General
Successes:
  This quarter has been focused on preparatory work for the coming period of data taking.

  Tier-1 metrics remain generally very good (although lower-level problems continue at a higher rate than we would like).

  The issues with transformer TX2 were finally resolved. This has removed the major risk that a significant single point of failure in the machine room electrical supply would lead to an extended outage.

  Problems with the UPS power supply leading to instability in our Oracle RAID arrays were also resolved by the addition of isolating transformers.

  Work on a gap analysis searching for other key areas of weakness is ...
Problems/Issues:
  Main area of concern is the recent loss of 4 staff (Pani, Helier, Thorne, Hodges). Derek Ross will transfer out of the group in Q2. Risk of further losses remains high.

Work area: Management
Successes:
  Fixed term staff contracts were approved for renewal until March 2015. However, two fixed term staff resigned before this issue was resolved (it is probable that the initial 1 year renewal was a contributory factor in their decision to leave). Nevertheless, staff effort remained close to the GRIDPP4 planned level of 18 FTE.

  Work on recruitments has commenced and is expected to be a major activity in the next quarter.

  All planned purchases were completed (except the tape drives - see problems). Work took place to track and meet outturn forecast. New working systems are in place now to extract and manage information coming out of the SSC finance system for staff effort and hardware ...
Problems/Issues:
  Staff effort booking will fall below planned levels in 2011Q1 owing to staff redeployment and departures.

  The planned tape drive purchase (£207K) failed (by 4 days) to be delivered in FY10. When this purchase order was raised (4th February) it was known to carry some risk; however, the indicated lead times (20 days) and previous track record of the supplier suggested that risk of late delivery was small. Unfortunately new purchasing systems within SSC, Oracle and subsequent logistics delays (by Oracle) led to substantial unexpected delays and a missed financial ...

Work area: CPU
Successes:
  2010 CPU hardware has been commissioned.
Problems/Issues:
  We were 7 days late meeting our MoU commitment for CPU when problems were encountered late in the deployment process. The difficulty was that the Tier-1 had run out of addresses in its existing subnets and had encountered problems using the non-contiguous address space we had available to allocate to the new ...

Work area: Disk
Successes:
  The 2010 disk has been commissioned and was deployed into ...

  Disk servers were upgraded to a 64-bit O/S. This has several benefits but was required in order to resolve issues with GridFTP checksum ...

  Changes were made to the disk servers' TCP tuning in order to improve wide area network transfer rates.
Problems/Issues:
  Following several disk server filesystem failures it became necessary to remove the SL08 generation of disk servers from production last quarter. Testing is ...

Work area: (unlabelled)
Successes:
  A minor upgrade to CASTOR 2.1.10-0 was completed successfully.

  An upgrade to SRM 2.10 was completed (on 3 instances) but remains postponed on the 4th instance while we assess SRM database load.

  HEP access to the ADS service was terminated.

Work area: Grid Deployment
Successes:
  Whole node scheduling was deployed for testing.

  The RGMA registry was decommissioned.
Problems/Issues:
  Both the CEs and BDIIs continued to be operationally problematic during the period (no single cause).

Work area: Core Services
Successes:
  Quattorisation work continued. Testing of virtualisation (for the service nodes) continues.

  Work on CVMFS continues with further tests carried out. RAL has been a leading player in this area internationally. The Tier-1 is providing a failover CVMFS mirror for WLCG. Both ATLAS and LHCb are now using CVMFS.

Work area: (unlabelled)
Successes:
  We continue to closely monitor callout rates and causes. There was a substantial improvement on the last quarter: callout rates fell in January (2.7 days per week) and February (1.25) against a target maximum of 3.0. March was operationally worse (3.6) but still better than December.

  A number of Fabric management metrics are now being accumulated and reviewed monthly.
Problems/Issues:
  Generally we have the impression of more operational problems with the service, although this isn't clearly supported by the measured metrics.

Work area: Network
Successes:
  Deployed commodity 10Gb networking for newest disk servers.
Problems/Issues:
  There have been a number of reliability issues with the main site network. Intermittent periods of high packet loss (5-10%) were a problem for the whole quarter. This was eventually (April) traced to a firewall configuration problem.

  There have been a few operational problems with the Tier-1's commodity network that have caused a number of operational breaks. There appears to be no single cause. Most of these problems have now been resolved.

Work area: Support
Successes:
  Work on capacity planning system now essentially complete.

  Successfully merged two ATLAS disk pools with negligible disruption to ...

General Risks
Risk                                                          Mitigating Action

Institute or area specific risks

Risk: Reduction in available staff effort below planned staffing levels leads to delays in delivery of upgrades. A number of recent (or planned) departures have already made the situation difficult. Further unexpected loss of staff remains likely. Grid team is particularly vulnerable and likely to fall below critical staffing level by end of June.
Mitigating Action: Recruitments have started. Further recruitments will commence shortly after approval from resourcing forum. Redistribute tasks within team.

Risk: Tier-1 will have to commit significant effort to meet new government rules on web compliance or withdraw services. Possibility of some increased annual costs to meet auditing requirements.
Mitigating Action: Understand new policy on compliance, highlight to organisation what impact the current project policy and scope will have on the service. Reduce number of "web units" and limit range of access.

Risk: T10KC tape drives may not be deployable in Q3 2011 as planned. We will be unable to meet our 2011 MoU commitment if the migration cannot take place in time.
Mitigating Action: Proceed to initial testing ASAP. Usage is most unlikely to exceed existing capacity until at least Q3 or even Q4 2011.

Risk: Experiment data access models are evolving. It is not clear if CASTOR will provide all required functionality to meet future experiment requirements.
Mitigating Action: Increase effort committed to investigating alternatives to CASTOR, particularly for disk based storage. Currently hampered by reducing staff effort to work in this area in the short term.

Risk: Increased use of tape subsystem may lead to contention for tape drives. We have seen the first indications of drive contention on the T10KA series of drives when LHCb archived a large data set.
Mitigating Action: Review long range tape drive capacity plan. Review migration policy and tape streaming optimisation. Likely to be hampered by reduced staff effort to progress quickly.

Objectives and Deliverables for Last Quarter
Objective/Deliverable                                    Due Date     Metric
2010 Disk hardware accepted and bill paid                12/1/2010    Accepted 31/1/2011
2010 CPU hardware accepted and bill paid                 12/1/2010    Accepted 11/3/2011
Tier-1 able to meet 2011 WLCG MoU resource commitment    12/1/2010    Disk and CPU fully deployed. Intention to meet tape requirement using buy on demand.
General ADS Service Ends                                 3/30/2011    Service terminated for HEP users on 31 March 2011

Objectives and Deliverables for Next Quarter
Objective/Deliverable                                    Due Date     Summary of Comments

Objectives and Deliverables needing Rescheduling
Objective/Deliverable                                    Old Due Date     New Due Date     Reason

New Objectives and Deliverables
Objective/Deliverable                                    Due Date     Metric

Row Labels                  Items              Area Total
CASTOR-Infrastructure                             £88,528
 CASTOR Database Servers             £51,739
 Other CASTOR Servers                £36,789
CPU                                              £389,575
 Clustervision CPU                  £187,115
 Viglen CPU                         £202,460
Disk                                             £391,947
 Streamline Disk                    £199,471
 Viglen Disk                        £192,476
FY09                                             -£13,152
 FY09                               -£13,152
Misc                                             £115,662
 Cables                                 £242
 Consultancy                             £51
 Electrical                          £12,865
 Error                                  £680
 LAN - Capacity                      £24,703
 Maintenance                          £2,664
 Miscellaneous                        £6,108
 On-Call                              £3,145
 Redhat Licences                        £952
 SAN Switch                          £12,954
 LSF Licence Maintenance             £19,740
 Oracle Maintenance                  £10,000
 LAN - 10GigE                        £15,101
 Rack                                 £2,174
 Virtualisation Licences              £4,283
Non-Capacity                                      £37,907
 Non-Capacity                        £37,907
OPN                                              £107,930
 OPN                                £107,930
Tape-Bandwidth                                    £37,791
 Maintenance                          £1,667
 Tape Servers                        £34,331
 Cleaning Cartridges                  £1,793
Tape Media                                        £68,678
 T10KC Tape Media                    £68,678
Grand Total                                    £1,224,866
