PHENIX Computing Center
in Japan (CC-J)
Takashi Ichihara
(RIKEN and RIKEN BNL Research Center)
Presented on 8 February 2000 at the CHEP2000 conference, Padova, Italy
Contents
1. Overview
2. Concept of the system
3. System Requirement
4. Other requirements as a Regional Computing Center
5. Plan and current status
6. WG for constructing the CC-J (CC-J WG)
7. Current configuration of the CC-J
8. Photographs of the CC-J
9. Linux CPU farm
10. Linux NFS performance vs. kernel
11. HPSS current configuration
12. HPSS performance test
13. WAN performance test
14. Summary
Takashi Ichihara (RIKEN / RIKEN BNL
Research Center)
PHENIX CC-J : Overview
PHENIX Regional Computing Center in Japan (CC-J) at RIKEN
Scope
Principal site of computing for PHENIX simulation
The PHENIX CC-J aims to cover most of the simulation tasks of the whole
PHENIX experiment
Regional Asian computing center
Center for the analysis of RHIC spin physics
Architecture
Essentially follows the architecture of the RHIC Computing Facility
(RCF) at BNL
Construction
R&D for the CC-J started in April 1998 at RBRC
Construction began in April 1999 and extends over a three-year period
1/3 scale of the CC-J will be operational in April 2000
Concept of the CC-J System
[Figure: block diagram of the data flow between the PHENIX CC-J and the RCF at BNL. Labels recoverable from the diagram:]
PHENIX CC-J: HPSS with STK tape robot; 15 TB big disk; SMP servers; PC farms for physics analysis and simulation (10k SPECint95); duplicating facility (tape drive units to duplicate data; tapes, 50 GB/volume); DST import and simulation data export
WAN: APAN/ESnet
RCF (PHENIX): HPSS with STK tape robot; 40 TB big disk; SMP servers; duplicating facility (tape drive units to duplicate data; tapes, 50 GB/volume); raw data (20 MB/s); CRS (track reconstruction); CAS (physics)
System Requirement for the CC-J
Annual data amount
  DST               150 TB
  micro-DST          45 TB
  Simulated data     30 TB
  Total             225 TB

CPU (SPECint95)
  Simulation       8200
  Sim. reconst.    1300
  Sim. ana.         170
  Theor. model      800
  Data analysis    1000
  Total           11470

Hierarchical Storage System
  Handle data amount of 225 TB/year
  Total I/O bandwidth: 112 MB/s
  HPSS system

Data Duplication Facility
  Export/import DST, simulated data

Disk storage system
  15 TB capacity
  All RAID system
  I/O bandwidth: 520 MB/s
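The totals on this slide can be cross-checked with a few lines of Python. This is only a sketch: the dictionary names are illustrative, and the average-rate figure assumes decimal units and a 365-day year, neither of which is stated on the slide.

```python
# Figures copied from the requirement slide.
annual_data_tb = {"DST": 150, "micro-DST": 45, "Simulated data": 30}
cpu_specint95 = {
    "Simulation": 8200,
    "Sim. reconstruction": 1300,
    "Sim. analysis": 170,
    "Theor. model": 800,
    "Data analysis": 1000,
}

total_data_tb = sum(annual_data_tb.values())
total_cpu = sum(cpu_specint95.values())

# Assumption: decimal units (1 TB = 1e6 MB) and a 365-day year.
SECONDS_PER_YEAR = 365 * 24 * 3600
average_rate_mb_s = total_data_tb * 1e6 / SECONDS_PER_YEAR

print(f"Total data: {total_data_tb} TB/year")
print(f"Total CPU: {total_cpu} SPECint95")
print(f"Average sustained rate: {average_rate_mb_s:.1f} MB/s")
```

The year-averaged rate comes out far below the 112 MB/s tape I/O requirement, which suggests that the requirement is sized for peak import, export, and reprocessing traffic rather than for the average.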
Other Requirements as a Regional Computing Center
Software Environment
• The software environment of the CC-J should be compatible with the PHENIX offline
software environment at the RHIC Computing Facility (RCF) at BNL
• AFS accessibility (/afs/rhic)
• Objectivity/DB accessibility (replication to be tested soon)
Data Accessibility
• Need to exchange 225 TB/year of data with the RCF
• Most of the data exchange will be done with SD3 tape cartridges (50 GB/volume)
• Part of the data exchange will be done over the WAN
• CC-J will use the Asia-Pacific Advanced Network (APAN) for the US-Japan connection
• http://www.apan.net/
• APAN currently has 70 Mbps of bandwidth for the Japan-US connection
• 10-30% of the APAN bandwidth (7-21 Mbps) is expected to be usable for this project:
• 75-230 GB/day (27-82 TB/year) will be transferred over the WAN
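The WAN volume estimate above follows directly from the bandwidth share. A minimal sketch of the arithmetic, assuming decimal units (1 GB = 1e9 bytes), a fully used share around the clock, and a 365-day year; the function name `wan_volume` is illustrative only:

```python
def wan_volume(link_mbps, share):
    """Return (GB/day, TB/year) for a given share of a link, decimal units."""
    bytes_per_s = link_mbps * 1e6 / 8 * share
    gb_per_day = bytes_per_s * 86400 / 1e9
    tb_per_year = gb_per_day * 365 / 1e3
    return gb_per_day, tb_per_year

APAN_MBPS = 70  # Japan-US bandwidth quoted on the slide

for share in (0.10, 0.30):
    gb_day, tb_year = wan_volume(APAN_MBPS, share)
    print(f"{share:.0%} of {APAN_MBPS} Mbps: "
          f"{gb_day:.0f} GB/day, {tb_year:.0f} TB/year")
```

Under these assumptions the two ends of the range come out at roughly 76 and 227 GB/day, in line with the 75-230 GB/day quoted on the slide.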
Plan and current status of the CC-J
Planned capacity
                            Jan. 2000   Mar. 2001   Mar. 2002
CPU farm (number)                  64         200         300
CPU farm (SPECint95)             1500        5900       10700
Tape Storage size (TB)            100         100         100
Disk Storage size (TB)              2          10          15
Tape Drive (number)                 4           7          10
Tape I/O (MB/s)                    45          78         112
Disk I/O (MB/s)                   100         400         600
SUN SMP Server unit                 2           4           6
HPSS Server unit                    5           5           5

Timeline (April 1998 to April 2002)
RBRC (BNL): R&D for CC-J; CC-J front end at BNL; prototype of CPU farms; data duplication facility
RIKEN Wako: CC-J construction in Phase 1 (1/3 scale), Phase 2 (2/3 scale), and Phase 3 (full scale)
Oct. 1998: CC-J Working Group formed
Dec. 1998: CC-J review at BNL
March 1999: HPSS software/hardware installation (supplementary budget)
April 2000: CC-J starts operation at 1/3 scale
Mar. 2002: Full-scale CC-J
Also shown: PHENIX Exp. at RHIC
Working Group for the CC-J construction (CC-J WG)
The CC-J WG is the main body constructing the CC-J
Role                                 Responsibility            Member
manager                              Servers, Network, HPSS    T. Ichihara (RIKEN and RBRC)
technical manager                    HPSS                      Y. Watanabe (RIKEN and RBRC)
computer scientists                  CPU farms, HPSS           N. Hayashi (RIKEN)
                                     Batch queue system        S. Sawada (KEK)
                                     System monitor            S. Yokkaichi (Kyoto Univ.)
scientific programming coordinator   coordination              H. En'yo (Kyoto and RBRC)
                                     AFS mirroring             H. Hamagaki (CNS, U-Tokyo)
front-end BNL                        Data duplication          Y. Watanabe (RIKEN and RBRC)
                                     Software environment      Y. Goto (RBRC)
                                     Prototype CPU farms       A. Taketani (RIKEN)
Holds regular bi-weekly meetings at RIKEN Wako to discuss technical items,
project plans, etc.
A mailing list for the CC-J WG has been created (mail traffic: 1600 mails/year)
Current configuration of the CC-J
[Figure: network and hardware diagram of the PHENIX Computing Center in Japan; current configuration updated on 14 Jan. 2000. Components labelled in the diagram:]
CPU farm: 32 Pentium II (450 MHz) + 32 Pentium III (600 MHz), 256 MB memory/CPU, Red Hat 5.2 Linux, Alta cluster x 4 boxes, serial links, 100BaseT x 32
Data servers: two SUN E450 NFS servers (4 CPUs, 1 GB memory; 2 CPUs, 1 GB memory; each labelled 100 GB) with RAID disks (288 GB and 1.6 TB)
Network: Catalyst 2948G Gigabit switch; Alteon 180 Gigabit Ethernet switches #1 and #2 (L3) with jumbo frames (9 kB MTU) over 1000BaseSX; HIPPI switch EPS-1000 (HIPPI x 5); SP router (Ascend GRF); RIKEN LAN switch (10/100BaseT)
HPSS: IBM RS/6000-SP, Silver node x 5 (AIX 4.3.2), acting as HPSS server, disk movers, and tape movers; RAID disks for work space and HPSS cache (labels: 288 GB, 288 GB, 150 GB); STK tape robot (100 TB) with 4 Redwood SD3 drives; SUN ACSLS control workstation
Other: Alta cluster control workstation; experimental AFS server (DS20) on a private address; RIKEN supercomputer; Main Bldg. and Comp. Bldg.
Photographs of the PHENIX CC-J at RIKEN
[Photographs, with captions:]
StorageTek Tape Robot (100 TB [250 TB])
HPSS Server (IBM RS-6000/SP)
STK Tape Robot (100 TB [240 TB])
CPU farm of 64 CPUs
1.6 TB RAID5 disk
Two SUN E450 data servers
Uninterruptible Power Supply (UPS)
Linux CPU farms
Memory Requirement : 200-300 MB/CPU for a simulation chain
Node specification
• Motherboard: ASUS P2B
• Dual CPUs per node (currently 64 CPUs in total)
• Pentium II (450 MHz) 32 CPUs + Pentium III (600 MHz) 32 CPUs
• 512 MB memory/node (1 GB swap/node)
• 14 GB HD/node (system 4 GB, work 10 GB)
• 100BaseT Ethernet interface (DECchip Tulip)
• Linux Red Hat 5.2 (kernel 2.2.11 + nfsv3 patch)
• Portable Batch System (PBS V2.1) for batch queuing
• AFS is accessed through NFS (no AFS client is installed on the Linux PCs)
• The /afs/rhic contents are mirrored daily to a local disk file system
PC assembly (Alta cluster)
• Remote hardware-reset/power control, Remote CPU temp. monitor
• Serial port login from the next node (minicom) for maintenance (fsck etc.)
Linux NFS performance vs. kernel
NFS performance test using the bonnie benchmark with a 2 GB file
• NFS server: SUN Enterprise 450 (Solaris 2.6), 4 CPUs (400 MHz), 1 GB
memory
• NFS client: Linux RH5.2, dual Pentium III 600 MHz, 512 MB memory
Kernel          NFS Write    NFS Write    NFS Read     NFS Read
                (per char)   (block)      (per char)   (block)
2.2.11          0.6 MB/s     0.5 MB/s     4.7 MB/s     5.4 MB/s
2.2.11+nfsv3    7.1 MB/s     6.5 MB/s     6.4 MB/s     9.8 MB/s
2.2.14          1.1 MB/s     1.9 MB/s     4.7 MB/s     5.8 MB/s
2.2.14+nfsv3    5.5 MB/s     5.6 MB/s     6.2 MB/s     10.2 MB/s
NFS performance of the recent Linux kernel (2.2.14) seems to have improved
The nfsv3 patch is still useful for the recent kernel (2.2.14)
– currently we are using kernel 2.2.11 + nfsv3 patch
– the nfsv3 patch is available from http://www.fys.uio.no/~trondmy/src/
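To make the effect of the nfsv3 patch easier to read off, the ratios of patched to unpatched throughput can be computed from the table. This is a small sketch using only the numbers above; the names `results`, `columns`, and `speedup` are illustrative.

```python
# Throughput in MB/s, copied from the benchmark table:
# (write per-char, write block, read per-char, read block)
results = {
    "2.2.11": (0.6, 0.5, 4.7, 5.4),
    "2.2.11+nfsv3": (7.1, 6.5, 6.4, 9.8),
    "2.2.14": (1.1, 1.9, 4.7, 5.8),
    "2.2.14+nfsv3": (5.5, 5.6, 6.2, 10.2),
}
columns = ("write/char", "write/block", "read/char", "read/block")

def speedup(kernel):
    """Ratio of patched to unpatched throughput for each test column."""
    base = results[kernel]
    patched = results[kernel + "+nfsv3"]
    return {c: p / b for c, b, p in zip(columns, base, patched)}

for kernel in ("2.2.11", "2.2.14"):
    ratios = speedup(kernel)
    summary = ", ".join(f"{c} x{r:.1f}" for c, r in ratios.items())
    print(f"{kernel}: {summary}")
```

The write columns dominate the gain: block writes improve by a factor of 13 on kernel 2.2.11, while reads improve by less than a factor of two on either kernel.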
Current HPSS hardware configuration
• IBM RS/6000 SP
• 5 nodes (Silver nodes: four 332 MHz PowerPC 604e CPUs per node)
• Core server : 1, Disk mover : 2, Tape mover : 2
• SP switch (300 MB/s) and 1000BaseSX NIC (OEM of Alteon)
• A StorageTek Powderhorn Tape Robot
• 4 Redwood drives and 2000 SD3 cartridges (100 TB) dedicated for HPSS
• Sharing the robot with other HSM systems
• 6 drives and 3000 cartridges for other HSM systems
• Gigabit Ethernet
• Alteon ACE180 switch for Jumbo Frame ( 9 kB MTU)
• Use of the Jumbo Frame reduces the CPU utilization for transfer
• CISCO Catalyst 2948G for distribution to 100BaseT
• Cache Disk : 700 GB (total), 5 components
• 3 SSA loops (50 GB each)
• 2 FW-SCSI RAID (270 GB each)
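The capacity figures in this list can be checked against their components. A short sketch, assuming decimal units (1 TB = 1000 GB) and the 50 GB/volume SD3 cartridge capacity quoted earlier in the talk; variable names are illustrative.

```python
TAPE_GB = 50            # SD3 cartridge capacity (GB/volume)
hpss_cartridges = 2000  # cartridges dedicated to HPSS

tape_capacity_tb = hpss_cartridges * TAPE_GB / 1000

# Cache disk components (GB each): 3 SSA loops and 2 FW-SCSI RAIDs
cache_components_gb = [50, 50, 50, 270, 270]
cache_total_gb = sum(cache_components_gb)

print(f"HPSS tape capacity: {tape_capacity_tb:.0f} TB")
print(f"HPSS cache disk: {cache_total_gb} GB in "
      f"{len(cache_components_gb)} components")
```

The cache components add up to 690 GB, which the slide rounds to 700 GB.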
Performance test of parallel ftp (pftp) of HPSS
pput from SUN-E450 : 12 MB/s for one pftp connection
• Gigabit Ethernet, Jumbo Frame (9 kB MTU)
pput from Linux : 6 MB/s for one pftp connection
• 100BaseT - G.Ether - Jumbo (defragment on a switch)
In total, about 50 MB/s pftp performance was obtained for pput
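The pput rate from the SUN E450 was measured over Gigabit Ethernet with jumbo frames. One commonly cited reason jumbo frames lower CPU load is that fewer packets have to be handled for the same throughput. The sketch below estimates that packet rate. It rests on assumptions that are not in the talk: a standard Ethernet MTU of 1500 bytes for comparison, a jumbo MTU of exactly 9000 bytes, and 40 bytes of IPv4 + TCP headers per packet with no options.

```python
TCP_IP_HEADER_BYTES = 40  # assumption: IPv4 + TCP headers without options

def packets_per_second(rate_mb_s, mtu_bytes):
    """Packets per second needed to carry rate_mb_s (decimal MB/s) of payload."""
    payload_bytes = mtu_bytes - TCP_IP_HEADER_BYTES
    return rate_mb_s * 1e6 / payload_bytes

RATE_MB_S = 12  # single pftp connection from the SUN E450

standard = packets_per_second(RATE_MB_S, 1500)  # assumed standard MTU
jumbo = packets_per_second(RATE_MB_S, 9000)     # assumed jumbo MTU

print(f"standard MTU: {standard:.0f} packets/s")
print(f"jumbo MTU:    {jumbo:.0f} packets/s")
print(f"reduction:    x{standard / jumbo:.1f}")
```

Under these assumptions the packet rate drops by roughly a factor of six; the actual CPU saving depends on the network adapter and driver and was not quantified in the talk.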
WAN performance test
RIKEN (12 Mbps) - IMnet - APAN (70 Mbps) - STAR TAP - ESnet - BNL
• Round-trip time (RTT) for RIKEN-BNL: 170 ms
• The file transfer rate is limited to 47 kB/s for an 8 kB TCP window size (Solaris default)
• A large TCP window size is necessary to obtain a high transfer rate
• RFC1323 (TCP Extensions for High Performance, May 1992) describes the
method of using a large TCP window size (> 64 kB)
TCP window size   FTP transfer rate   Theoretical limit
                  (observed)          (for 170 ms RTT)
8 kB              41 kB/s             47 kB/s
16 kB             87 kB/s             94 kB/s
32 kB             163 kB/s            188 kB/s
64 kB             288 kB/s            376 kB/s
128 kB            453 kB/s            752 kB/s
256 kB            585 kB/s            1500 kB/s
512 kB            641 kB/s            3010 kB/s
A high FTP transfer rate (641 kB/s = 5 Mbps) was obtained for a single FTP
connection using a large TCP window size (512 kB) over the Pacific Ocean
(RTT = 170 ms)
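The theoretical limit column is consistent with the usual bound that a single TCP connection can send at most one window per round trip, that is, window size divided by RTT. A minimal sketch that reproduces the column and compares it with the observed rates; the function names are illustrative, and treating 1 kB as 1000 bytes in the last helper is an assumption.

```python
RTT_S = 0.170  # RIKEN-BNL round-trip time from the slide

# TCP window size (kB) -> observed FTP rate (kB/s), copied from the table
observed_kb_s = {8: 41, 16: 87, 32: 163, 64: 288, 128: 453, 256: 585, 512: 641}

def window_limit_kb_s(window_kb, rtt_s=RTT_S):
    """Upper bound on throughput: at most one window per round trip."""
    return window_kb / rtt_s

def window_needed_kb(rate_mbps, rtt_s=RTT_S):
    """Window (kB, decimal) needed to sustain rate_mbps at the given RTT."""
    return rate_mbps * 1e6 / 8 * rtt_s / 1e3

for window_kb, observed in observed_kb_s.items():
    limit = window_limit_kb_s(window_kb)
    print(f"{window_kb:4d} kB  observed {observed:4d} kB/s  "
          f"limit {limit:7.0f} kB/s  efficiency {observed / limit:.0%}")

print(f"Window to fill a 12 Mbps link: {window_needed_kb(12):.0f} kB")
```

The observed rate tracks the bound closely for small windows and flattens for large ones. That is what one would expect once the 12 Mbps RIKEN link, rather than the window, becomes the limit: filling that link at 170 ms RTT needs a window of about 255 kB.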
Summary
Construction of the PHENIX Computing Center in Japan (CC-J) at the RIKEN
Wako campus began in April 1999 and will extend over a three-year period.
The CC-J is intended as the principal site of computing for PHENIX
simulation, a regional PHENIX Asian computing center, and a center for the
analysis of RHIC spin physics.
The CC-J will handle about 220 TB of data per year, and the total CPU
performance is planned to be 10,000 SPECint95 in 2002.
The CPU farm of 64 processors (RH5.2, kernel 2.2.11 with nfsv3 patch) is stable.
About 50 MB/s pftp performance was obtained for HPSS access.
A high FTP transfer rate (641 kB/s = 5 Mbps) was obtained for a single FTP
connection using a large TCP window size (512 kB) over the Pacific Ocean
(RTT = 170 ms).
Stress tests for the entire system were carried out successfully.
Replication of the Objectivity/DB over the WAN will be tested soon.
CC-J operation will start in April 2000.