Handling Faults in Sensor Networks

					      Confidence

Integrity Group Presentations
           3/1/2007
                         Problem Description
Faults are Common in WSS deployments!
  Faults are anything that decreases the quality (i.e. usability) or quantity of data.
   This means we have to address both environmental-sensor and
   “network”-sensor faults, e.g.
    – Faulty sensors
    – Faulty radios
    – Questionable environmental context

Fault tolerance is a common technique for addressing faults in distributed and
   Internet systems.
    – Replication & Redundancy – Yeah right
    – Diversity, i.e. multi-modal sensing => Requires a detailed understanding of
        the environment, which is often the very reason we are deploying the system
    – Furthermore, especially for rapid deployments, we want all of the data

Problem: a system for real-time fault detection and remediation
• Users should validate questionable data or fix sensor/hardware faults as
   they occur
NOT AFTER THE FACT

•   Need an on-line system to monitor data quality and suggest actions a user
    can take to fix faulty data or validate questionable data
                                Confidence
Confidence
1)    classifies faulty (network and environmental) sensors
2)    “labels” sensors with actions a user can take to remediate these faults

Actions: Anything increasing the quality or quantity of data; e.g.
     –    Replace or recalibrate sensor
     –    Replace node
     –    Extract physical samples


We cannot completely remove the human from the loop
Most faults impacting sensor networks require some kind of human
      intervention, be it calibrating a sensor, or moving a tree that has fallen on
      a node.
However, relying solely on a human to manually monitor and administer a
      large number of nodes and sensors is not feasible either.

GOAL: Enable users to support larger scale systems by automating key
    administration tasks instead of designing completely autonomous
    networks
                         System Constraints
•   Continually learn and adapt to its environment.
                    => NO DISTINCT TRAINING and OPERATIONAL PHASE
                    => NO STATIC THRESHOLDS (Like Sympathy)
                    => MUST OPERATE ONLINE
    Knowledge-based systems, static thresholds, and decision trees are
    relatively simple, transparent, and scale to large amounts of data.

    However, thresholds are difficult to assign, and must usually be set for
    each deployment.

    Even if we are able to set the threshold accurately, threshold-based
    systems are not designed to adapt to dynamic environments or to
    incorporate user interaction and knowledge gained during the course of the
    deployment.

•   Operate in the absence of ground truth
                   => NO SUPERVISED LEARNING
•   Up to about 40% of the data can be faulty (formalize this how?)
•   Have a transparent decision making process
                            Confidence

Confidence: A system to detect faulty data, and suggest a small
  number of actions a user can take to fix faults or validate
  questionable data. Applies to both network and sensor data.

Each data point from a sensor is translated into a feature vector and
  mapped into the pre-defined feature space; data quantity features
  are similar to system metrics collected by Sympathy and data quality
  features are selected based on domain experience with sensors
  (e.g. gradient of sensor data).
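
As a rough illustration of this mapping, the sketch below turns a short window of readings into a two-feature vector and computes its distance from the origin. The window handling, the sampling period `period_s`, and the function names are assumptions made for this example; the slides do not specify how the features are computed.

```python
import math
from statistics import pstdev


def feature_vector(window, period_s=1.0):
    """Map a short window of readings to <std dev, |gradient|>.

    Both features grow as the data looks worse, so larger values
    push the point farther from the origin.
    """
    if len(window) < 2:
        raise ValueError("need at least two samples")
    std_dev = pstdev(window)
    # Assumed definition: absolute change per second between the
    # two most recent samples.
    gradient = abs(window[-1] - window[-2]) / period_s
    return (std_dev, gradient)


def distance_from_origin(features):
    """Euclidean distance of a feature vector from the origin."""
    return math.sqrt(sum(f * f for f in features))
```

A flat window such as `[1.0, 1.0, 1.0, 1.0]` maps to the origin, while a window with a sudden jump lands far from it.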

A simple outlier detection algorithm flags points that are far away from
   the origin as faulty.

Similar vectors are grouped together using a simple on-line clustering
   algorithm.

Confidence learns effective diagnoses by recording those actions that
  result in points moving from a faulty location in the feature space to
  a good location
  Motivation
System Model
Feature Space
  Diagnoses
  Evaluation
                            System Model
• Sensor is an entity that periodically returns
  data to the sink
     – Network sensors return system metrics
     – Environmental sensors return sensor data
[Figure: sensor data travels through the network cloud to the sink, which
 calculates features (e.g. standard deviation, gradient) over samples t1, t2]
• Feature vector F = <F1, F2, F3, F4> describes the state of the sensor
• F is inserted into the N-dimensional feature space defined by the N features
                           System Intuition

We reduce the problem of identifying and diagnosing system faults to identifying
  the correct feature space in which faults appear anomalous.

    Points anomalously far from the origin in the feature space are faulty




Actions that fix one point in a cluster are assumed to fix all the points in
   that cluster
                        Feature Selection

• Features should be chosen so that sensor quality generally decreases
  as the feature value increases.
   – Direct mapping not necessary
• Using this simple constraint, points that are farther from the origin
  are more likely to be faulty.
• We have chosen features, based on deployment experience and
  sensor domain knowledge, that are useful in describing the general
  quality of the sensor.

• Environmental Features
e.g. standard deviation: std dev of data within a short window of
   samples. No definite threshold is known, but as the standard
   deviation increases, the hardware is more likely to require action.

• Network Features largely taken from Sympathy
                       Outlier Detection

Points that are anomalously far away from the origin are considered
   faulty
Assume the distance of good points from the origin follows a normal
   distribution (reasonable)
Update the distribution parameters using an EWMA, and only if the point is good
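
One way this rule could look in code is sketched below. The smoothing factor `alpha`, the cutoff of `k` standard deviations, and the initial mean and standard deviation are all assumptions for illustration; the slides do not say how "anomalously far" is quantified.

```python
import math


class OutlierDetector:
    """Tracks the distance of good points from the origin with an EWMA."""

    def __init__(self, mean, std, alpha=0.1, k=3.0):
        self.mean = mean        # EWMA of good-point distances
        self.var = std * std    # EWMA variance of good-point distances
        self.alpha = alpha      # assumed smoothing factor
        self.k = k              # assumed cutoff, in standard deviations

    def is_faulty(self, distance):
        # One-sided test: only points far from the origin are suspicious.
        return distance > self.mean + self.k * math.sqrt(self.var)

    def observe(self, distance):
        """Classify a point, then update the model only if it is good."""
        faulty = self.is_faulty(distance)
        if not faulty:
            delta = distance - self.mean
            self.mean += self.alpha * delta
            self.var = (1 - self.alpha) * (self.var + self.alpha * delta * delta)
        return faulty
```

Skipping the update for faulty points keeps a burst of bad data from dragging the notion of "normal" toward the fault.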
      System State: Updated upon Point Arrival

Cluster: Fc, Pc, Ac
 Fc = <F1, F2..FN>: Vector of N features representing the cluster center
 Pc: Array of the most recent p points
 Ac: Vector of actions associated with the cluster

Point: Fp, Fd
 Fp = <F1, F2..FN>: Vector of N features representing the point
 Fd: Distance of the feature vector from the origin


Feature Space: m, s
  m: Mean Fd over all points
  s: Standard deviation of Fd over all points
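
The state listed above could be held in a few small records, sketched here as Python dataclasses. The field names mirror the slide's symbols; the default window size of 10 points and the initial values of `m` and `s` are placeholders chosen for this example.

```python
import math
from collections import deque
from dataclasses import dataclass, field


@dataclass
class Point:
    fp: tuple                      # Fp: the N feature values
    fd: float = field(init=False)  # Fd: distance from the origin

    def __post_init__(self):
        self.fd = math.sqrt(sum(f * f for f in self.fp))


@dataclass
class Cluster:
    fc: list                       # Fc: cluster center
    # Pc: most recent p points (p = 10 is an assumed value)
    pc: deque = field(default_factory=lambda: deque(maxlen=10))
    ac: list = field(default_factory=list)  # Ac: associated actions


@dataclass
class FeatureSpace:
    m: float = 0.0                 # mean Fd
    s: float = 1.0                 # standard deviation of Fd
```

Using a bounded `deque` for `pc` means older points fall out automatically as new ones arrive.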
                        Making K-means On-line
  • Only the cluster center closest to the point is updated
      – Clusters and points are represented by feature vectors in cluster space
      – Closest cluster => Minimum Euclidean distance from the point in the
        feature space
  • Points will not necessarily be grouped with their optimal clusters.
  => This is acceptable, because Confidence’s priority is to accurately
     classify the most recent points.

[Figure: example 2-D feature space, represented by features F1 and F2,
 showing a point <F1,F2>. An EWMA is used to update the cluster center.]
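
A minimal sketch of the on-line update described on this slide: find the closest center by Euclidean distance and move only that center toward the point with an EWMA. The smoothing factor `alpha` and the function names are assumptions for illustration.

```python
import math


def nearest_cluster(centers, point):
    """Index of the center with minimum Euclidean distance to the point."""
    return min(range(len(centers)), key=lambda i: math.dist(centers[i], point))


def online_update(centers, point, alpha=0.2):
    """Move only the closest center toward the point using an EWMA."""
    i = nearest_cluster(centers, point)
    centers[i] = [(1 - alpha) * c + alpha * x for c, x in zip(centers[i], point)]
    return i
```

Because each point touches a single center, the cost per arrival is one pass over the centers, which suits a sink that must keep up with incoming data.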
                    Bootstrapping Clusters with Actions
Assumption: Actions that fix one point in a cluster fix all the points in
    that cluster
User bootstraps clusters with actions based on domain knowledge
Incoming points are classified using their distance from the origin
If faulty: Confidence identifies the closest cluster to that point, and
    notifies user of actions associated with that cluster
    Cluster center is updated
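[Figure: 2-D feature space with axes "Distance from LDR" and "Distance from
 NLDR", with LDR and NLDR marked; regions are labeled with the actions N/A,
 Re-calibrate Sensor, and Physical Sample]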
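
The bootstrapping step might be sketched as follows. The seed centers, the fault threshold, and the smoothing factor are invented for illustration, and updating the center only for faulty points is one reading of the slide, not a confirmed detail.

```python
import math

# Hypothetical seed clusters in <distance from LDR, distance from NLDR>,
# each labeled by the user with actions from domain knowledge.
clusters = [
    {"center": [0.0, 0.0], "actions": []},  # N/A
    {"center": [4.0, 0.5], "actions": ["Re-calibrate sensor"]},
    {"center": [4.0, 4.0], "actions": ["Extract physical sample"]},
]


def suggest_actions(clusters, point, threshold, alpha=0.2):
    """Return the actions of the closest cluster if the point is faulty."""
    distance = math.sqrt(sum(x * x for x in point))
    if distance <= threshold:
        return []  # good point: nothing to suggest
    nearest = min(clusters, key=lambda c: math.dist(c["center"], point))
    # EWMA update of the chosen cluster center (alpha is an assumed value).
    nearest["center"] = [
        (1 - alpha) * c + alpha * x for c, x in zip(nearest["center"], point)
    ]
    return list(nearest["actions"])
```

For example, a point far along the LDR axis only would be matched to the cluster labeled with re-calibration.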
                      Learning New Actions

Incoming points are classified using their distance from the origin
If faulty: Confidence identifies the closest cluster to that point, and
    notifies user of actions associated with that cluster
    Cluster center is updated
    ….
User notifies Confidence when they take an action
If the next point from the sensor is no longer in a faulty region, the action
    is associated with the cluster and positively validated
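
The validation loop could be tracked with a small bookkeeping class like the one below. The class, its method names, and the choice to silently drop an action that did not help are assumptions; the slides only describe the positive case.

```python
class ActionLearner:
    """Associates user actions with clusters once they are seen to work."""

    def __init__(self):
        self.pending = {}    # sensor_id -> (cluster_id, action)
        self.validated = {}  # cluster_id -> {action: success count}

    def user_took_action(self, sensor_id, cluster_id, action):
        """Record that the user acted on a sensor flagged in this cluster."""
        self.pending[sensor_id] = (cluster_id, action)

    def next_point(self, sensor_id, faulty):
        """Check the first point that arrives after a reported action."""
        entry = self.pending.pop(sensor_id, None)
        if entry is None or faulty:
            return None  # no action pending, or the action did not help
        cluster_id, action = entry
        counts = self.validated.setdefault(cluster_id, {})
        counts[action] = counts.get(action, 0) + 1
        return action
```

Keeping a success count per action, rather than a plain set, would let the system rank suggestions for a cluster by how often they have worked.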
                 Non-Linearity in Features

• As a feature value increases, the significance of a given change decreases
   – E.g. it is much more significant to not hear from a node between
     30 and 60 seconds than it is to not hear from the node between
     1030 and 1060 seconds.
   – So, the distance between a cluster at 30 seconds and another at
     60 seconds should not be equivalent to the distance between a
     cluster at 1030 seconds and another at 1060 seconds.
• Take log2 of most features when mapping to cluster space
   – log2 30 ≈ 4.9
   – log2 60 ≈ 5.9
   – log2 1030 ≈ log2 1060 ≈ 10.0
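
The scaling can be checked with a few lines of Python. Clamping values of 1 or less to zero is an assumption added here so the function stays defined and non-negative; the slides do not say how small values are handled.

```python
import math


def scale_feature(value):
    """Map a raw feature value into cluster space with log2 scaling."""
    # Assumed handling: values <= 1 map to 0 to avoid negative or
    # undefined logarithms.
    return math.log2(value) if value > 1 else 0.0


# The 30 s -> 60 s gap is a full unit in cluster space, while the
# 1030 s -> 1060 s gap nearly vanishes.
gap_short = scale_feature(60) - scale_feature(30)
gap_long = scale_feature(1060) - scale_feature(1030)
```

Doubling a value always adds exactly one unit after scaling, which is why the first gap is 1 regardless of where the doubling starts.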
                    Goal of Evaluation

Demonstrate that Confidence’s approach of suggesting
  broad actions using clustering can help users detect and
  remediate faults more quickly and accurately than
  previous approaches.

• Evaluate the detection latency and number of false
  positives in two scenarios: 1) single fault
  injection; and 2) multiple fault injection.
• Evaluate the impact our choice of parameters (EWMA,
  number of clusters, feature scaling) has on system
  performance.
• Discuss our experiences in deploying Confidence in
  several test and real-world deployments
                    Features

Environmental Feature    Potential Action
Std Deviation            Check sensor connection
Gradient                 Check sensor
Distance LDR             Check sensor calibration
Distance NLDR            Extract physical sample

Network Feature          Potential Action
Liveness                 Check node
Congestion               Check congestion
Sensor Data Time         Check sensorboard

				