ATLAS Canada Lightpath Data Transfer Trial

Corrie Kost, Steve McDonald (TRIUMF) Bryan Caron (UofAlberta), Wade Hong (Carleton)

ATLAS CANADA TRIUMF-CERN LIGHTPATH DATA TRANSFER TRIAL FOR IGRID2002

[Map: the two 1 Gigabit optical fibre circuits, shown in different colours]

What was accomplished?
• Established a relationship with a "grid" of people for future networking projects
• Demonstrated a manually provisioned 12,000 km lightpath
• Transferred 1 TB of ATLAS Monte Carlo data to CERN (equivalent to 1500 CDs)
• Established record rates (1 CD in 8 seconds, or 1 DVD in under 60 seconds)
• Demonstrated innovative use of existing technology
• Largely used low-cost commodity software & hardware

Participants & Acknowledgements

Participants: TRIUMF, University of Alberta, Carleton, CERN, Canarie, BCNET, SURFnet

Acknowledgements: Netera, Atlas Canada, WestGrid, HEPnet Canada, Indiana University, Caltech, Extreme Networks, Intel Corporation

Brownie 2.5 TeraByte RAID array
• 16 x 160 GB IDE disks (5400 rpm, 2 MB cache)
• Hot-swap capable
• Dual Ultra160 SCSI interface to host
• Maximum transfer ~65 MB/s
• Triple hot-swap power supplies
• CAN ~$15k
• Arrives July 8th, 2002

What to do while waiting for the server to arrive
• IBM PRO6850 IntelliStation (on loan)
  – Dual 2.2 GHz Xeons
  – 2 PCI 64-bit/66 MHz slots, 4 PCI 32-bit/33 MHz slots
  – 1.5 GB RAMBUS
• Add 2 Promise Ultra100 IDE controllers and 5 disks
  – Each disk on its own IDE controller for maximum I/O
• Begin Linux software RAID performance tests: ~170/130 MB/s read/write
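For reference, sequential rates like these can be reproduced with a simple dd test; a minimal sketch, assuming the RAID device is /dev/md0 mounted at /raid0 and using a 1 GB test file (both assumptions, not from the slides):

time dd if=/dev/zero of=/raid0/testfile bs=1M count=1024   # write: 1024 MB / elapsed seconds
umount /raid0 && mount /dev/md0 /raid0                     # remount so the read is not served from the page cache
time dd if=/raid0/testfile of=/dev/null bs=1M              # read: 1024 MB / elapsed seconds

(Older dd versions only report record counts, hence the explicit time.)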

The Long Road to High Disk I/O
• IBM cluster x330s, RH7.2: disk I/O ~15 MB/s (slow??)
  – expect ~45 MB/s from any modern single drive
• Need the 2.4.18 Linux kernel to support >1 TB filesystems
• IBM cluster x330s, RH7.3: disk I/O ~3 MB/s

What is going on?
• The Red Hat-modified ServerWorks driver broke DMA on the x330s
• The x330s have ATA 100 drives, but the on-board controller is only UDMA 33
• The Promise controllers are capable of UDMA 100, but needed the latest patches for the 2.4.18 kernel before the drives were recognised at UDMA 100
• Finally drives and controller both working at UDMA 100 = 45 MB/s
• Linux software RAID0: 2 drives 90 MB/s, 3 drives 125 MB/s, 4 drives 155 MB/s, 5 drives 175 MB/s
Now we are ready to start network transfers
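Diagnosing this kind of problem typically comes down to hdparm; a hedged sketch (the device name /dev/hde is illustrative, the flags are standard hdparm options):

hdparm -i /dev/hde         # identify the drive and show which UDMA mode is active
hdparm -d1 -X69 /dev/hde   # enable DMA and request UDMA mode 5 (UDMA 100)
hdparm -tT /dev/hde        # quick cached/buffered read timing check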

So what did we set out to do?
• Demonstrate a manually provisioned “e2e” lightpath
• Transfer 1 TB of ATLAS MC data generated in Canada from TRIUMF to CERN
• Test out 10GbE technology and channel bonding (see the sketch below)
• Establish a new benchmark for high-performance disk-to-disk throughput over a large distance
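For the channel-bonding item, a minimal Linux 2.4 sketch using the bonding driver in round-robin mode (the interface names and address are placeholders, not the trial's actual configuration):

modprobe bonding mode=0 miimon=100           # mode 0 = balance-rr across both GbE circuits
ifconfig bond0 10.0.0.1 netmask 255.255.255.0 up
ifenslave bond0 eth1 eth2                    # enslave the two GbE interfaces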

Comparative Results (TRIUMF to CERN)

Transfer Program           Transferred   Average     Max Avg
wuftp 100 MbE              600 MB        3.4 Mbps
wuftp 10 GbE               6442 MB       71 Mbps
iperf                      275 MB        940 Mbps
pftp                       600 MB        532 Mbps
bbftp (10 streams)         1.4 TB        666 Mbps    710 Mbps
Tsunami (disk to disk)     0.5 TB        700 Mbps    825 Mbps
Tsunami (disk to memory)   12 GB         > 1 Gbps    1136 Mbps
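As a rough cross-check on these rates: at ~666 Mbps a 650 MB CD image takes about (650 x 8) / 666 ≈ 8 s, consistent with the "1 CD in 8 seconds" figure above, and moving 1 TB at ~700 Mbps takes roughly (8 x 10^12 b) / (7 x 10^8 b/s) ≈ 11,000 s, i.e. a little over three hours.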

What is an e2e Lightpath?
• Core design principle of CA*net 4
• Ultimately to give control of lightpath creation, teardown and routing to the end user
  – Hence, "Customer Empowered Networks"
• Provides a flexible infrastructure for emerging grid applications
• Alas, can only do things manually today

CA*net 4 Layer 1 Topology

The Chicago Loopback
• Need to test TCP/IP and Tsunami protocols over long distances; arranged an optical loop via StarLight
  – (TRIUMF-BCNET-Chicago-BCNET-TRIUMF)
  – ~91 ms RTT
• TRIUMF - CERN RTT ~200 ms, so we told Damir we really needed a double loopback
  – "No problem"
  – Loopback2 was set up a few days later (RTT = 193 ms)
  – (TRIUMF-BCNET-Chicago-BCNET-Chicago-BCNET-TRIUMF)

TRIUMF Server
• SuperMicro P4DL6 (dual 2 GHz Xeons)
• 400 MHz front-side bus
• 1 GB DDR2100 RAM
• Dual-channel Ultra160 onboard SCSI
• SysKonnect 9843 SX GbE
• 2 independent PCI buses, 6 PCI-X 64-bit/133 MHz capable slots
• 3ware 7850 RAID controller
• 2 Promise Ultra100 TX2 controllers

CERN Server
• SuperMicro P4DL6 (dual 2 GHz Xeons)
• 400 MHz front-side bus
• 1 GB DDR2100 RAM
• Dual-channel Ultra160 onboard SCSI
• SysKonnect 9843 SX GbE
• 2 independent PCI buses, 6 PCI-X 64-bit/133 MHz capable slots
• 2 3ware 7850 RAID controllers

RMC4D from HARDDATA

• 6 IDE drives on each 3ware controller
• RH7.3 on a 13th drive connected to the on-board IDE
• WD Caviar 120 GB drives with 8 MB cache

TRIUMF Backup Server
• SuperMicro P4DL6 (dual 1.8 GHz Xeons)
• Supermicro 742I-420 17" 4U chassis with 420 W power supply
• 400 MHz front-side bus
• 1 GB DDR2100 RAM
• Dual-channel Ultra160 onboard SCSI
• SysKonnect 9843 SX GbE
• 2 independent PCI buses, 6 PCI-X 64-bit/133 MHz capable slots
• 2 Promise Ultra133 TX2 controllers & 1 Promise Ultra100 TX2 controller

Back-to-back tests over 12,000km loopback using designated servers

Operating System
• Redhat 7.3 based Linux kernel 2.4.18-3
– Needed to support filesystems > 1TB

• Upgrades and patches
– Patched to 2.4.18-10
– Intel Pro/10GbE Linux driver (early stable)
– SysKonnect 9843 SX Linux driver (latest)
– Ported Sylvain Ravot's tcp tune patches

Intel 10GbE Cards
• Intel kindly loaned us 2 of their Pro/10GbE LR server adapter cards, despite the end of their alpha program
– based on Intel® 82597EX 10 Gigabit Ethernet Controller

Note length of card!

Extreme Networks
[Photos of the Extreme Networks switches at TRIUMF and at CERN]

Extreme Network Hardware

IDE Disk Arrays
[Photos: TRIUMF send host and CERN receive host]

Disk Read/Write Performance
• TRIUMF send host:
  – 1 3ware 7850 and 2 Promise Ultra100 TX2 PCI controllers
  – 12 WD 7200 rpm UDMA 100 120 GB hard drives (1.4 TB)
  – Tuned for optimal read performance (227/174 MB/s)
• CERN receive host:
  – 2 3ware 7850 64-bit/33 MHz PCI IDE controllers
  – 12 WD 7200 rpm UDMA 100 120 GB hard drives (1.4 TB)
  – Tuned for optimal write performance (295/210 MB/s)

Thunder RAID Details
/root/raidtab:

raiddev /dev/md0
    raid-level              0
    nr-raid-disks           12
    persistent-superblock   1
    chunk-size              512       # kbytes
    # 8 drives on the 3ware controller
    device    /dev/sdc
    raid-disk 0
    device    /dev/sdd
    raid-disk 1
    device    /dev/sde
    raid-disk 2
    device    /dev/sdf
    raid-disk 3
    device    /dev/sdg
    raid-disk 4
    device    /dev/sdh
    raid-disk 5
    device    /dev/sdi
    raid-disk 6
    device    /dev/sdj
    raid-disk 7
    # 4 drives on the 2 Promise controllers
    device    /dev/hde
    raid-disk 8
    device    /dev/hdg
    raid-disk 9
    device    /dev/hdi
    raid-disk 10
    device    /dev/hdk
    raid-disk 11

raidstop /dev/md0
mkraid -R /dev/md0
mkfs -t ext3 /dev/md0
mount -t ext2 /dev/md0 /raid0
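After mkraid, a quick sanity check that all 12 drives joined the stripe set (standard md /proc interfaces, not shown on the slides):

cat /proc/mdstat    # should show md0 : active raid0 with 12 member devices
df -h /raid0        # confirm the ~1.4 TB filesystem is mounted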

Black Magic
• We are novices in the art of optimizing system performance
• It is also time consuming
• We followed most conventional wisdom, much of which we don't yet fully understand

Testing Methodologies
• Began testing with a variety of bandwidth characterization tools
– pipechar, pchar, ttcp, iperf, netpipe, pathchar, etc.

• Evaluated high performance file transfer applications
– bbftp, bbcp, tsunami, pftp

• Developed scripts to automate and to scan parameter space for a number of the tools
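One of those parameter-scan scripts might look something like the sketch below (the window sizes, stream counts and 30 s duration are illustrative; cern-10g is the far-end host named later in the traceroute, and -w/-P/-t are standard iperf options):

#!/bin/sh
# scan TCP window size and parallel-stream count against the remote iperf server
for win in 256K 1M 4M 8M; do
  for streams in 1 2 4 8 10; do
    echo "window=$win streams=$streams"
    iperf -c cern-10g -w $win -P $streams -t 30 | tail -1
  done
done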

Disk I/O Black Magic
• min/max readahead on both systems
sysctl -w vm.min-readahead=127
sysctl -w vm.max-readahead=256

• bdflush on receive host
sysctl -w vm.bdflush="2 500 0 0 500 1000 60 20 0"
    or
echo 2 500 0 0 500 1000 60 20 0 > /proc/sys/vm/bdflush

• bdflush on send host
sysctl -w vm.bdflush="30 500 0 0 500 3000 60 20 0"
    or
echo 30 500 0 0 500 3000 60 20 0 > /proc/sys/vm/bdflush
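These sysctl settings do not survive a reboot; a minimal sketch of persisting them via the stock /etc/sysctl.conf mechanism (values as above, with the bdflush line chosen per host):

# /etc/sysctl.conf  (applied at boot, or immediately with: sysctl -p)
vm.min-readahead = 127
vm.max-readahead = 256
vm.bdflush = 2 500 0 0 500 1000 60 20 0    # receive host (use the 30 ... 3000 values on the send host)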

Misc. Tuning and other tips
/sbin/elvtune -r 512 /dev/sdc     (same for the other 11 disks)
/sbin/elvtune -w 1024 /dev/sdc    (same for the other 11 disks)

-r sets the max latency that the I/O scheduler will provide on each read
-w sets the max latency that the I/O scheduler will provide on each write

When the /raid disk refuses to dismount (works for kernels 2.4.11 or later):
umount -l /raid      # -l = lazy unmount; then mount & umount again

Disk I/O Black Magic
• Disk I/O elevators (minimal impact noticed)
  – /sbin/elvtune allows some control of latency vs. throughput
  – read_latency set to 512 (default 8192)
  – write_latency set to 1024 (default 16384)

• atime
  – Disable updating of the last time a file was accessed (typically done on file servers):
    mount -t ext2 -o noatime /dev/md0 /raid
  – Typically ext3 writes at ~90 MB/s while ext2 writes at ~190 MB/s; reads are minimally affected. We always used ext2.
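To make the ext2 + noatime choice persistent across reboots, an illustrative /etc/fstab entry (device and mount point as on the slides; the remaining fields are the usual defaults):

/dev/md0   /raid   ext2   defaults,noatime   0 0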

Disk I/O Black Magic
• IRQ Affinity
[root@thunder root]# more /proc/interrupts
[Output omitted: the per-CPU interrupt counts show that virtually all device interrupts - timer, keyboard, rtc, usb-ohci, the ide0-ide5 channels, the aic7xxx SCSI, the 3ware Storage Controller and the two SysKonnect SK-98xx GbE NICs - are being serviced by CPU0, with CPU1 almost idle]

Need to have PROCESS Affinity - but this requires 2.5 kernel

echo 1 > /proc/irq/18/smp_affinity                         # use CPU0
echo 2 > /proc/irq/18/smp_affinity                         # use CPU1
echo 3 > /proc/irq/18/smp_affinity                         # use either
cat /proc/irq/prof_cpu_mask > /proc/irq/18/smp_affinity    # reset to default
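In practice the point is to spread the busiest interrupts, e.g. the GbE NIC and the RAID controller, across the two CPUs; a hedged example (the IRQ numbers are placeholders and must be read from /proc/interrupts on the actual host):

echo 1 > /proc/irq/30/smp_affinity    # e.g. SysKonnect GbE interrupt -> CPU0
echo 2 > /proc/irq/22/smp_affinity    # e.g. 3ware controller interrupt -> CPU1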

TCP Black Magic
• Typically suggested TCP and net buffer tuning
sysctl -w net.ipv4.tcp_rmem="4096 4194304 4194304"
sysctl -w net.ipv4.tcp_wmem="4096 4194304 4194304"
sysctl -w net.ipv4.tcp_mem="4194304 4194304 4194304"
sysctl -w net.core.rmem_default=65535
sysctl -w net.core.rmem_max=8388608
sysctl -w net.core.wmem_default=65535
sysctl -w net.core.wmem_max=8388608
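These maxima follow from the bandwidth-delay product: a single TCP stream needs a window of roughly rate x RTT, and for 1 Gbps over the ~200 ms TRIUMF-CERN path that is about 10^9 b/s x 0.2 s = 2 x 10^8 b ≈ 25 MB. The 4-8 MB limits above therefore cap a single stream at a few hundred Mbps on this path, which is one reason the multi-stream (bbftp with 10 streams) and UDP-based (Tsunami) tools do better in the comparative results table.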

TCP Black Magic
• Sylvain Ravot's tcp tune patch parameters
sysctl -w net.ipv4.tcp_tune="115 115 0"

• Linux 2.4 retentive TCP
– Caches TCP control information for a destination for 10 minutes
– To avoid this caching:
sysctl -w net.ipv4.route.flush=1

We are live continent to continent!
• e2e lightpath up and running Friday Sept 20 21:45 CET

traceroute to cern-10g (192.168.2.2), 30 hops max, 38 byte packets
 1  cern-10g (192.168.2.2)  161.780 ms  161.760 ms  161.754 ms

BBFTP Transfer
[Traffic graphs: Vancouver ONS ons-van01, enet_15/1 and enet_15/2]

BBFTP Transfer
[Traffic graphs: Chicago ONS GigE Port 1 and GigE Port 2]

Tsunami Transfer
[Traffic graphs: Vancouver ONS ons-van01, enet_15/1 and enet_15/2]

Tsunami Transfer
[Traffic graphs: Chicago ONS GigE Port 1 and GigE Port 2]

Sunday Nite Summaries

Exceeding 1Gbit/sec …
( using tsunami)

What does it mean for TRIUMF in the long term?
• Established a relationship with a 'grid' of people for future networking projects
• Upgraded the WAN connection from 100 Mbit to 4 x 1 Gb Ethernet connections directly to BCNET:
  1. Canarie - educational/research network
  2. Westgrid - grid computing
  3. Commercial Internet
  4. Spare (research & development)
• Recognition that TRIUMF has the expertise and the network connectivity for the large-scale, high-speed data transfers necessary for upcoming scientific programs (ATLAS, WESTGRID, etc.)

Lessons Learned - 1
• Linux software RAID is faster than most conventional SCSI and IDE hardware RAID based systems
• One controller for each drive; the more disk spindles the better
• More than 2 Promise controllers per machine are possible (100/133 MHz)
• Unless programs are multi-threaded or the kernel permits process locking, dual CPUs will not give the best performance
  – A single 2.8 GHz CPU is likely to outperform dual 2.0 GHz CPUs for a single-purpose machine like our fileservers
• The more memory the better

Misc. comments
• No hardware failures - even for the 50 disks!
• Largest file transferred: 114 GB (Sep 24)
• Tar, compressing, etc. take longer than the transfer itself
• Deleting files can take a lot of time
• Low cost of project: ~$20,000, with most of that recycled


Acknowledgements
• Canarie
– Bill St. Arnaud, Rene Hatem, Damir Pobric, Thomas Tam, Jun Jian

• Atlas Canada
– Mike Vetterli, Randall Sobie, Jim Pinfold, Pekka Sinervo, Gerald Oakham, Bob Orr, Michel Lefebvre, Richard Keeler

• HEPnet Canada
– Dean Karlen

• TRIUMF
  – Renee Poutissou, Konstantin Olchanski, Mike Vetterli (SFU / Westgrid)

• BCNET
– Mike Hrybyk, Marilyn Hay, Dennis O'Reilly, Don McWilliams

Acknowledgements
• Extreme Networks
– Amyn Pirmohamed, Steven Flowers, John Casselman, Darrell Clarke, Rob Bazinet, Damaris Soellner

• Intel Corporation
– Hugues Morin, Caroline Larson, Peter Molnar, Harrison Li, Layne Flake, Jesse Brandeburg

Acknowledgements
• Indiana University
– Mark Meiss, Stephen Wallace

• Caltech
– Sylvain Ravot, Harvey Newman

• CERN
– Olivier Martin, Paolo Moroni, Martin Fluckiger, Stanley Cannon, J.P. Martin-Flatin

• SURFnet/Universiteit van Amsterdam
– Pieter de Boer, Dennis Paus, Erik Radius, Erik-Jan Bos, Leon Gommans, Bert Andree, Cees de Laat

Acknowledgements
• Yotta Yotta
– Geoff Hayward, Reg Joseph, Ying Xie, E. Siu

• BCIT
– Bill Rutherford

• Jalaam
– Loki Jorgensen

• Netera
– Gary Finley

ATLAS Canada
[Map of ATLAS Canada member institutions: Alberta, SFU, Victoria, UBC, TRIUMF, Montreal, Carleton, Toronto, York]

LHC Data Grid Hierarchy
[Diagram of the tiered LHC computing model: the online system at the experiment produces ~PByte/sec, feeding the CERN Tier 0+1 centre (~700k SI95, ~1 PB disk, tape robot) at ~100-400 MB/s; Tier 1 centres (e.g. FNAL with 200k SI95 and 600 TB, IN2P3, RAL, INFN) connect at ~2.5 Gbps; Tier 2 centres connect at ~2.5 Gbps; Tier 3 institute physics data caches (~0.25 TIPS) connect at 0.1-1 Gbps to Tier 4 workstations. CERN/outside resource ratio ~1:2; Tier0/(Tier1)/(Tier2) ~1:1:1. Physicists work on analysis "channels"; each institute has ~10 physicists working on one or more channels.]
Slide courtesy H. Newman (Caltech)

The ATLAS Experiment
[Diagram of the ATLAS detector highlighting the Canadian contributions]