FTB-Enabled InfiniBand Monitoring Software
Karthik Gopalakrishnan
Ohio State University
InfiniBand and FTB:
Current State and Future Plans
System Components, Libraries, Applications and Autonomics
(MPI, Parallel File Systems, Chkpt/Rstrt, etc.)
Fault Tolerance Backplane (FTB)
– User-Transparent Recovery (dynamic and adaptive reconfiguration)
– Network Fault Prevention (alternate paths using LMC, APM, etc.)
– Network Fault Monitoring (link, switch, SM, topology change)
– Network Fault Prediction (port counter, congestion, history)
FTB-IB 1.0 release done on 11/10/08
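The "Network Fault Prediction" item above names port counters and history as inputs. As a rough illustration of that idea only, the following sketch flags a port when a cumulative error counter grows faster than a threshold over a sliding window of readings. The class name, window size, and threshold are all hypothetical choices for this example; nothing here is taken from FTB-IB itself.

```python
from collections import deque


class PortErrorTrend:
    """Hypothetical helper: watches one cumulative port error counter.

    Assumes the counter only increases between reads (no reset handling).
    """

    def __init__(self, window=5, max_errors_per_window=10):
        self.max_errors = max_errors_per_window
        # Keep window + 1 readings so the difference spans `window` intervals.
        self.samples = deque(maxlen=window + 1)

    def add_sample(self, counter_value):
        """Record one reading of the cumulative error counter."""
        self.samples.append(counter_value)

    def errors_in_window(self):
        """Errors accumulated between the oldest and newest retained readings."""
        if len(self.samples) < 2:
            return 0
        return self.samples[-1] - self.samples[0]

    def at_risk(self):
        """True when the recent error growth exceeds the threshold."""
        return self.errors_in_window() > self.max_errors


# Example: successive reads of an error counter on one port.
trend = PortErrorTrend(window=3, max_errors_per_window=10)
for reading in [0, 1, 2, 4, 20]:
    trend.add_sample(reading)

quiet = PortErrorTrend(window=3, max_errors_per_window=10)
for reading in [0, 1, 2]:
    quiet.add_sample(reading)
```

A real predictor would presumably combine several counters and congestion data; the single-counter threshold is used here only because it is the simplest instance of "history-based" prediction.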
Fault Tolerant InfiniBand Component
Monitored Events
– FTB_IB_ADAPTER_AVAILABLE
– FTB_IB_ADAPTER_UNAVAILABLE
– FTB_IB_ADAPTER_INFO
– FTB_IB_PORT_INFO
– FTB_IB_EVENT_PORT_ACTIVE
– FTB_IB_EVENT_PORT_ERR
– FTB_IB_EVENT_LID_CHANGE
– FTB_IB_EVENT_CLIENT_REREGISTER
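Four of the monitored event names share their suffix with asynchronous event types defined by libibverbs (IBV_EVENT_PORT_ACTIVE, IBV_EVENT_PORT_ERR, IBV_EVENT_LID_CHANGE, IBV_EVENT_CLIENT_REREGISTER). The sketch below shows one plausible way such a translation table could look. The IBV_* names are real libibverbs enum values, but the pairing is an assumption based on the matching names, and the helper function is hypothetical.

```python
# Assumed correspondence between libibverbs async event names and the
# FTB-IB event names listed above (pairing inferred from matching suffixes).
VERBS_TO_FTB_IB = {
    "IBV_EVENT_PORT_ACTIVE": "FTB_IB_EVENT_PORT_ACTIVE",
    "IBV_EVENT_PORT_ERR": "FTB_IB_EVENT_PORT_ERR",
    "IBV_EVENT_LID_CHANGE": "FTB_IB_EVENT_LID_CHANGE",
    "IBV_EVENT_CLIENT_REREGISTER": "FTB_IB_EVENT_CLIENT_REREGISTER",
}


def to_ftb_event(verbs_event_name):
    """Return the FTB-IB event name for a verbs event, or None if unmapped."""
    return VERBS_TO_FTB_IB.get(verbs_event_name)


mapped = to_ftb_event("IBV_EVENT_PORT_ERR")
# QP-related events have no entry in this table.
unmapped = to_ftb_event("IBV_EVENT_QP_FATAL")
```

The adapter-level events (FTB_IB_ADAPTER_AVAILABLE, FTB_IB_ADAPTER_UNAVAILABLE, FTB_IB_ADAPTER_INFO) and FTB_IB_PORT_INFO have no same-named verbs event, so they are left out of the table rather than guessed.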
Fault Tolerant InfiniBand Component
Components shown: FTB Agent, FTB-Enabled Component, FTB-IB, IB HCA
Events shown in the animation: Port Down, Adapter Unavailable
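The slides above show events such as Port Down and Adapter Unavailable in a stack made of the IB HCA, FTB-IB, the FTB Agent, and an FTB-enabled component. The following toy publish/subscribe sketch illustrates one reading of that flow. It does not use the real FTB API: the classes, method names, and the device name "mlx4_0" are hypothetical, and mapping "Port Down" to FTB_IB_EVENT_PORT_ERR is an assumption.

```python
class ToyAgent:
    """Stand-in for an FTB agent: routes published events to subscribers."""

    def __init__(self):
        self._subscribers = {}  # event name -> list of callbacks

    def subscribe(self, event_name, callback):
        self._subscribers.setdefault(event_name, []).append(callback)

    def publish(self, event_name, payload):
        for callback in self._subscribers.get(event_name, []):
            callback(event_name, payload)


class ToyFtbIb:
    """Stand-in for FTB-IB: turns HCA state changes into published events."""

    def __init__(self, agent):
        self.agent = agent

    def on_port_down(self, hca, port):
        # Assumption: a port going down is reported as a port error event.
        self.agent.publish("FTB_IB_EVENT_PORT_ERR", {"hca": hca, "port": port})

    def on_adapter_removed(self, hca):
        self.agent.publish("FTB_IB_ADAPTER_UNAVAILABLE", {"hca": hca})


# An FTB-enabled component (for example an MPI library) records what it hears.
received = []


def component_handler(event_name, payload):
    received.append((event_name, payload))


agent = ToyAgent()
agent.subscribe("FTB_IB_EVENT_PORT_ERR", component_handler)
agent.subscribe("FTB_IB_ADAPTER_UNAVAILABLE", component_handler)

monitor = ToyFtbIb(agent)
monitor.on_port_down("mlx4_0", 1)
monitor.on_adapter_removed("mlx4_0")
```

Keeping the monitor unaware of who consumes its events is the point of routing everything through the agent: subscribers can be added or removed without touching the monitoring code.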
Fault Tolerant InfiniBand Component
• Future Plans
– Library-based design to support affiliated events related to QP, SRQ and CQ errors
– Network-Fault Prediction
– Network-Fault Prevention
– User-Transparent Recovery
• Availability
– FTB-IB 1.0 release can be downloaded from
http://nowlab.cse.ohio-state.edu/projects/ftb-ib/
– Also available from the CIFTS Software page at
http://www.mcs.anl.gov/research/cifts/software/index.php