2
3
Background
• Project: Spam’n’Beans
• What is it?
– An E-Mail content analysis system
– Benefits E-Mail servers by offloading
expensive E-Mail analysis to our system
– Exhibits Fault-Tolerant, Real-Time and High-
Performance qualities
4
Background
• What makes it interesting?
– E-Mail analysis is a real-world issue
• Unsolicited Commercial E-Mail (UCE) / SPAM
accounts for 60-80% of all E-Mail traffic
• E-Mail viruses pose a security risk
– Such analysis is often too expensive for high-
volume mail servers
– Few products yet exist to address this need:
• Amavisd-new http://www.ijs.si/software/amavisd/
• Postini http://www.postini.com/
5
www.Postini.com
6
Development Environment
• Language: Java
• Middleware: EJB (JBoss)
• Database: PostgreSQL
• Content Filtering: SpamAssassin
• Operating System: GNU/Linux
• Test Data: Public Corpus
– Database of real-world E-Mail
– Made available by SpamAssassin
7
High-Level Overview
• A client’s MTA (Mail Transport Agent) uses the
Spam’n’Beans client to send incoming E-Mail
messages to a cluster of replicas for filtering
• Each replica runs all necessary content filters as
daemon processes
• replica-side middleware accepts incoming E-Mail
from clients and feeds it to the appropriate
daemon processes over a local socket
connection
8
High-Level Overview
9
Baseline Architecture
10
Fault-Tolerance Goals
• E-Mail processing servers are replicated to guarantee
availability of service despite faults on any one replica
– System will continue to be available despite up to N-1 faults (N is
the number of replicas)
– Clients will continue to retry when no replicas are active
• State, stored in a remote database, consists of:
– Replica state and statistics
– Client authentication information
– Message state and statistics
– Client requests are idempotent
• Non-Replicated Components
– Replication Manager / Fault Detector
– Database backend
– EJB Nameserver 12
Fault-Tolerance Elements
• Replication Manager
– A process that starts/stops replicas and manage list of
available replicas
• Fault Detector
– A dedicated thread monitors each replica
• Fault Recovery
– Monitor thread will re-start replica automatically as
needed
• Fault Injector
– A separate script used during testing
– Forcefully kills a random replica every S seconds
13
FT-Baseline Architecture
14
Fault Detection
• Client application receives exception and
reports it to the Replication Manager
– From EJB (Remote Exception)
– From server application (Fatal Exception,
Non-Fatal Exception)
• Periodic ping by Fault Detector
– K failures initiates replica re-start
15
Client-Side Fail-Over
• Notify Replication Manager of replica
failure
• Request another replica
– Retry if none are available
• Connect to new replica and re-issue
original request
16
Fail-Over Measurements
ms
17
Message #
Real-Time Baseline
• Bounded fail-over achieved by:
– Removing replicas from the pool when
• Client disables replica use after receiving
exception
• Fault Detector identifies unresponsive replica
– Only choosing live replicas on fail-over
19
Bounded Fail-Over
Measurements
ms
• Fail-over now
bounded by
600 ms
• Fail-over time
reduced by 1
order of
magnitude
20
Message #
Performance Strategy
• Clustering
– Any middle-tier replica can handle any
request
– All replicas handle requests in parallel
• Load Balancing
– Minimize response latency
– Adjusts to
• Static system resources
• Dynamic system utilization
21
Load Balancer Implementation
• Load Balancer on golden machine
– Maintains list of all live replicas and their associated
load
• Replica load is updated by Fault Detector ping
• Clients request replicas from Load Balancer
– Every M messages
• Load balancing strategies:
– Round-Robin
– Priority (inversely proportional to relative CPU load)
22
Round Robin Performance
ms
23
Message #
Priority Based Performance
ms
24
Message #
Other Features
• Multi-threaded administrative console
• Run-time replica management
– Individual replicas can be added/removed as
needed
• Run-time selection of load balancing
strategy
• Optimization for transient failures
– Don’t restart a replica until it has been
unreachable for K pings
– Verify client-reported errors
26
Insights from Measurements
• System bottleneck is CPU-intensive E-Mail
analysis
• Message processing time is highly
correlated with message size
• Increases in system load cause
temporary increases in jitter and delay
27
Fixed Big Message (~90KB)
ms
28
Message #
Variable Sized Messages
ms
29
Message #
Fixed Small Messages (~0.4KB)
ms
30
Message #
Open Issues
• Multiple simultaneous replica connections
• Increase throughput
– Experiment with other load-balancing strategies
– Add automatic capacity scaling
– Enqueue client requests
• Add virus checking (via ClamAV)
• Remove single points of failure
• Enhance administrative consoles
– Add graphical/web interface
31
Conclusions
• What did we learn?
– Tradeoffs between fault-tolerance, real-time,
and performance can be difficult to manage
• What did we accomplish?
– We built a working system with fault-
tolerance, real-time and high-performance
attributes to solve a real-world problem
• What would we do differently now?
– Start with better architecture definition
– Adhere to “KISS” principle 32
Q&A
Any questions?
33