Taint-Enhanced Policy Enforcement by wuxiangyu


          RAMSES (Regeneration And iMmunity SErviceS):
        A Cognitive Immune System

                         Self Regenerative Systems
                             18 December 2007

           Mark Cornwell                         R. Sekar
            James Just                      Stony Brook University
             Nathan Li
           Robert Schrag
           Global InfoTek, Inc

    Overview
    Efficient content-based taint identification
    Syntax and taint-aware policies
    Memory attack detection and response
    Testing
    Red Team suggestions
    Questions

    Demo

RAMSES Attack Context                                        Incoming
 Attack target: “program” mediating                      (Untrusted input)
  access to protected resources/services
 Attack approach: use maliciously crafted
  input to exert unintended control
  over protected resource operations
 Resource or service uses:
      Well-defined       APIs to access
            OS resources
            Command interpreters
            Database servers
                                                         Outgoing requests
            Transaction servers,
            ……                                              operations)
      Internal      interfaces
            Data   structures and functions within program
              Used   by program components to talk to each other

Example 1: SquirrelMail Command Injection

    $send_to_list =              Interface        sendto=“nobody; rm -rf *”

    $command = “gpg -r           Program
    $send_to_list 2>&1”                           $command=“gpg -r
                                                  nobody; rm -rf * 2>&1”

    popen($command)          “Output” Interface   Attack: Removes all
                                                  removable files in web
                                                  server document tree

Example 2: phpBB SQL Injection
                                                      UNION SELECT
                                     Interface        ord(substring(user_password,1,1))
                                                      FROM phpbb_users
                                                      WHERE user_id = 3”

      $sql = “SELECT p.post_id                        $sql= “SELECT p.post_id FROM
       FROM POSTS_TABLE              Program          POSTS_TABLE WHERE
        WHERE p.topic_id =                            p.topic_id = -1 UNION SELECT
              $topic_id”                              ord(substring(user_password,1
                                                      ,1)) FROM phpbb_users
                                                      WHERE user_id = 3”

    sql_query($sql)              “Output” Interface   Attack: Steal another
                                                      user's password

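Attack Space of Interest (CVE 2006)
  [Pie chart of CVE 2006 vulnerabilities by category. Legible labels: Format string,
   SQL injection, Cross-site, Directory, Others, with a group label
   “Generalized Injection Attacks”. Legible values: 1%, 24%, 18%, 4%.]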
Detection Approach
  Attack: use maliciously crafted
                                                 Input Interface
   input to exert unintended                     (Untrusted input)
   control over output operations
  Detect “exertion of control”
           Based on “taint:” degree to             Program
            which output depends on input
  Detect           if control is intended:
           Requires   policies (or training)
                                                “Output” Interface:
             Application-independent            (Security-sensitive
             policies are preferable                operations)

RAMSES Goals and Approach
                                                                   Input Interface
   Taint analysis: develop efficient and                          (Untrusted input)
    non-invasive alternatives
      Analyze observed inputs and outputs
            Needs no modifications to program
            Language-neutral                                          Program
      Leverage learning to speed up analysis
   Attack detection: develop framework to detect
    a wide range of attacks, while minimizing         “Output” Interface
    policy development effort and FP/FNs
      “Structure-aware policies:” leverage interplay
       between taint and structural changes to output requests
      Use Address-Space Randomization (ASR) for memory corruption
              ASR: efficient, in-band, “positive” tainting for pointer-valued data
   Immunization: filter out future attack instances
      Output filters: drop output requests that violate taint-based policies
      Input filters: “Project” policies on outputs to those on inputs
            Relies on learning relationships between input and output fields
            Network-deployable
     Efficient Content-Based Taint

 Develop  efficient algorithms for inferring flow of
   input data into outputs
     Compare   input and output values
     Allow for parts of input to flow into parts of output
     Tolerate some changes to input
            Changes such as space removal, quoting, escaping,
            case-folding are common in string-based interfaces
     Based     on approximate substring matching
 Leverage        learning to speed up taint inference
     Even the “efficient” content-matching algorithms
      are too expensive to run on every input/output
     Same learning techniques can be used for detecting
      attacks using anomaly detection
Weighted Substring Edit Distance Algorithm
          Maintain a matrix D[i][j] of minimum edit
           distance between p[1..i] and s[1..j]
          D[i][j] = min{D[i-1][j-1]+ SubstCost(p[i],s[j]),
                          D[i-1][j] + DeleteCost(p[i]),
                          D[i][j-1] + InsertCost(s[j])}
          D[0][j] = 0 (No cost for omitting any prefix of s)
          D[i][0] = DeleteCost(p[1])+…+DeleteCost(p[i])
          Matches can be reconstructed from the D matrix
          Quadratic time and space complexity
              Uses O(|p|*|s|) memory and time
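
The recurrence above can be sketched in Python as follows. This is a minimal illustration, not the RAMSES implementation: the function name, the default unit costs, and the extra `start` table (used to report where the match begins instead of reconstructing it from D) are assumptions made for the example.

```python
from typing import Callable, Tuple

def substring_edit_distance(
    p: str,
    s: str,
    subst_cost: Callable[[str, str], float] = lambda a, b: 0.0 if a == b else 1.0,
    delete_cost: Callable[[str], float] = lambda a: 1.0,
    insert_cost: Callable[[str], float] = lambda b: 1.0,
) -> Tuple[float, int, int]:
    """Return (cost, start, end) such that s[start:end] is a substring of s closest to p."""
    m, n = len(p), len(s)
    # D[i][j]: minimum cost of matching p[:i] against some substring of s ending at j.
    # Row 0 is all zeros: omitting any prefix of s is free.
    D = [[0.0] * (n + 1) for _ in range(m + 1)]
    # start[i][j]: index in s where that best-matching substring begins (assumed helper table).
    start = [list(range(n + 1)) for _ in range(m + 1)]
    for i in range(1, m + 1):
        # Column 0: every character of p[:i] must be deleted.
        D[i][0] = D[i - 1][0] + delete_cost(p[i - 1])
        for j in range(1, n + 1):
            candidates = (
                (D[i - 1][j - 1] + subst_cost(p[i - 1], s[j - 1]), start[i - 1][j - 1]),
                (D[i - 1][j] + delete_cost(p[i - 1]), start[i - 1][j]),
                (D[i][j - 1] + insert_cost(s[j - 1]), start[i][j - 1]),
            )
            D[i][j], start[i][j] = min(candidates, key=lambda c: c[0])
    # The best match may end anywhere in s: omitting a suffix of s is free as well.
    end = min(range(n + 1), key=lambda j: D[m][j])
    return D[m][end], start[m][end], end
```

Passing a `subst_cost` that returns 0 for characters differing only in case is one way to tolerate case-folding; the same idea applies to quoting or escaping, at the price of tuning the weights.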

Improving performance
    Quadratic complexity algorithms can be
     too expensive for large s, e.g., HTML outputs
          Storage requirements are even more problematic
    Solution: Use linear-time coarse filtering algorithm
      Approximate   D by FD, defined on substrings of s of length |p|
      Let P (and S) denote a multiset of characters in p (resp., s)
      FD(p, s) = min(|P-S|, |S-P|)
            Slide   a window of size |p| over s, compute FD incrementally
      Prove:        D(p, r) < t ⇒ FD(p, r) < t for all substrings r of s
Result: O(|p|^2) space and time complexity in practice
 Implementation results
      Typically 30x improvement in speed
      200x to 1000x reduction in space
      Preliminary performance measurements: ~40MB/sec
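
The sliding-window filter can be sketched as below. It is an illustration under stated assumptions: the function name and the threshold parameter `t` are hypothetical, and because every window has the same length as p, |P-S| equals |S-P|, so only one of the two multiset differences is tracked. Windows that pass the filter would then be handed to the quadratic algorithm.

```python
from collections import Counter
from typing import List

def coarse_filter(p: str, s: str, t: int) -> List[int]:
    """Return start offsets of length-|p| windows of s whose FD to p is below t."""
    m = len(p)
    if m == 0 or len(s) < m:
        return []
    need = Counter(p)        # multiset P
    window = Counter(s[:m])  # multiset S for the current window
    # fd = |P - S|: characters of p (with multiplicity) missing from the window.
    fd = sum((need - window).values())
    hits = [0] if fd < t else []
    for start in range(1, len(s) - m + 1):
        out_ch, in_ch = s[start - 1], s[start + m - 1]
        if out_ch != in_ch:
            # Dropping out_ch grows the deficit if p needed that copy.
            if window[out_ch] <= need[out_ch]:
                fd += 1
            window[out_ch] -= 1
            # Adding in_ch shrinks the deficit if p still needed one more copy.
            if window[in_ch] < need[in_ch]:
                fd -= 1
            window[in_ch] += 1
        if fd < t:
            hits.append(start)
    return hits
```

Each window update touches two counters, so the scan is linear in |s| regardless of |p|.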

Efficient online operation
 Weighted  edit-distance algorithms are still too
   expensive if applied to every input/output
     Need         to run for every input parameter and output
 Key        idea:
     Use        learning to construct a classifier for outputs
            Each   class consists of similarly tainted outputs
                taint identified quickly, once the class is known
     Classifying        strings is difficult
            Our technique operates on parse trees of output
            For ease of development, generality, and tolerance to
             syntax errors, we use a “rough” parser
            Classifier is a decision tree that inspects parse tree
             nodes in an order that leads to good decisions

Decision Tree Construction
 Examines  the nodes of syntax tree in some order
 The order of examination is a function of the set
  of syntax trees
     Chooses  nodes that are present in all candidate
      syntax trees
     Avoids tests on tainted data, as they can vary
     Avoids tests that don’t provide significant degree of
            “similar-valued” fields will be collected together and
             generalized, instead of storing individual values
     Incorporates       a notion of “suitability” for each field
           or subtree in the syntax tree
            Takes   into account approximations made in parsing

 Example of a Decision Tree
1. SELECT * FROM phpbb_config
2. SELECT u.*,s.* FROM phpbb_sessions s,phpbb_users u WHERE
    s.session_id='[a3523d78160efdafe63d8db1ce5cb0ba]' AND u.user_id=s.session_user_id
3. SELECT * FROM phpbb_themes WHERE themes_id=1
4. SELECT c.cat_id,c.cat_title,c.cat_order FROM phpbb_categories c,phpbb_forums f WHERE
               f.cat_id=c.cat_id GROUP BY c.cat_id,c.cat_title,c.cat_order ORDER BY c.cat_order
5. SELECT * FROM phpbb_forums ORDER BY cat_id,forum_order

switch (1) {
  case ROOT : switch (1.1) {
    case CMD : switch (1.1.2) {
      case c FINAL {@1.1.1:SELECT
                    @1.1.3:. cat_id,c.cat_title,c.cat_order FROM phpbb_categories
                     c,phpbb_forums f WHERE f.cat_id=c.cat_id GROUP BY
                     c.cat_id,c.cat_title,c.cat_order ORDER BY c.cat_order }
      case u FINAL {@1.1.1:SELECT
                    @1.1.3:. *,s.* FROM phpbb_sessions s,phpbb_users u WHERE
                     s.session_id='[a3523d78160efdafe63d8db1ce5cb0ba]' AND
                     u.user_id=s.session_user_id }
      case * FINAL {@1.1.1:SELECT
                    @1.1.3:FROM phpbb_?????? }

Implementation Status and Next Steps
   “Rough” parsers implemented for
      HTML/XML

      Shell-like      languages (including Perl/PHP)
      SQL

   Preliminary performance measurements
      Construction  of decision trees: ~3MB/sec
      Classification only: ~15MB/sec
            Significant   improvements expected with some performance tuning
   Next steps
      Develop      better clustering/classification algorithms based on
           tree edit-distance
            Current   algorithm is based entirely on a top-down traversal, and
             fails to exploit similarities among subtrees

           Syntax and taint-aware

Overview of Policies
   Leverage structure+taint to simplify/generalize policy
      Policy    structure mirrors that of parse trees
            And-Or   “trees” with cycles
      Can     specify constraints on values (using regular expressions)
           and taint associated with a parse tree node


           NAME = “script”                      OR

                                PARAM                     ELEM_BODY

      PARAM_NAME=“src”                      PARAM_VALUE

   Most attacks detected using one basic policy
      Controlling “commands” vs command parameters
      Controlling pointers vs data
Controlling “commands” Vs “parameters”
 Observation: parameters don’t alter syntactic structure of
  victim’s requests
 Policy: Structure of parse tree for victim’s request should
  not be controlled by untrusted input (“tainted data”)
 Alternate formulation: tainted data shouldn’t span multiple
  “fields” or “tokens” in victim’s request
           root                                                   root

           cmd                                  cmd                             cmd

name       param     param         name   param param     separator      name   param   param

 gpg        -r     sekar@abc.com   gpg     -r    nobody       ;           rm     -rf      *

Policy prohibiting structure changes
   Define “structure change” without using a reference
      Avoids       need for training and associated FP issues
   Policy 1
      Tainted       data cannot span multiple nodes
              for binary data, it should not span multiple fields
   Policy 2
      Tainted       data cannot straddle multiple subtrees
            Tainted    data spans two adjacent subtrees, and at least one of
               them is not fully tainted
                 Tainted data “overflowed” beyond the end of one subtree and
                  resulted in a second subtree
   Both policies can be further refined to constrain the node
    types and children subtrees of the nodes
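
Policy 1 can be illustrated on the gpg example with a short Python sketch. Everything here is an assumption for illustration: the tokenizer is far rougher than the Flex/Bison parsers described elsewhere in this deck (it only knows ';' and '|' as separators), and taint is located by exact substring match rather than by approximate matching.

```python
import re
from typing import List, Tuple

Token = Tuple[int, int, str]  # (start, end, text), end exclusive

def rough_tokens(command: str) -> List[Token]:
    """Very rough shell tokenizer: ';' and '|' are separators, other tokens split on whitespace."""
    return [(m.start(), m.end(), m.group())
            for m in re.finditer(r"[;|]|[^\s;|]+", command)]

def taint_span(command: str, tainted_input: str) -> Tuple[int, int]:
    """Exact-match stand-in for taint inference: locate the input inside the command."""
    start = command.index(tainted_input)
    return start, start + len(tainted_input)

def spanned_tokens(tokens: List[Token], taint: Tuple[int, int]) -> List[str]:
    """Tokens overlapping the tainted interval [taint[0], taint[1])."""
    t_start, t_end = taint
    return [text for start, end, text in tokens if start < t_end and t_start < end]

def violates_policy_1(command: str, taint: Tuple[int, int]) -> bool:
    """Policy 1: tainted data must not span more than one token (node)."""
    return len(spanned_tokens(rough_tokens(command), taint)) > 1

benign = "gpg -r sekar@abc.com 2>&1"
attack = "gpg -r nobody; rm -rf * 2>&1"
print(violates_policy_1(benign, taint_span(benign, "sekar@abc.com")))    # False
print(violates_policy_1(attack, taint_span(attack, "nobody; rm -rf *")))  # True
```

No reference command is needed: the check looks only at how the tainted interval lines up with token boundaries in the request being examined.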

Commands Vs parameters: Example 2
 Memory          corruption attack overflowing stack buffer
          For binary data, we talk about message fields rather
           than parse trees
             Stack     Return    Stack     Return    Stack
            frame 1   Address   frame 2   Address   frame 2
          Violation: tainted data spans multiple stack “fields”
 Heap   overflows involve tainted data spanning
   across multiple heap blocks

Attacks Detected by “No structure change” Policy
 Various forms of script or command injection
 SQL injection

 XPath injection

 Format string attacks

 HTTP response splitting

 Log injection

 Stack overflow and heap overflow

Application-specific policies
 Not all attacks have the flavor of “command
 Develop application-specific policies to detect
  such attacks
     Policy 3: Cross-site scripting: no tainted scripts in
      HTML data
     Policy 4: Path traversal: tainted file names cannot
      access data outside of a certain document tree
 Other        examples
     Policy     5: No tainted CMD_NAME or CMD_SEPARATOR
           nodes in shell or SQL commands
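
As an illustration of Policy 4, the following Python sketch checks whether a tainted file name resolves outside a document tree. The function name and the example paths are hypothetical, and the check assumes the tainted name is joined onto the document root before the file is opened.

```python
import os

def violates_policy_4(doc_root: str, tainted_name: str) -> bool:
    """True if a tainted file name resolves to a path outside doc_root."""
    root = os.path.realpath(doc_root)
    # realpath collapses ".." components and resolves symlinks that exist.
    target = os.path.realpath(os.path.join(root, tainted_name))
    return os.path.commonpath([root, target]) != root

print(violates_policy_4("/var/www/html", "images/logo.png"))      # False
print(violates_policy_4("/var/www/html", "../../../etc/passwd"))  # True
```

An absolute tainted name is also flagged, because `os.path.join` discards the root when the second component is absolute.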

Implementation status
   Four test applications
      phpBB
      SquirrelMail
      WebGoat   (J2EE)
   Detects following attacks without FPs
      Command   injection (Policies 1, 2, 5)
      SQL injection (1, 2, 5)
      XSS (3)
      HTTP Response splitting (2)
      Path traversal (4)
      Memory corruption detected using ASR

   Should be able to detect many other attacks easily
      XPATH   injection (1,2), Format-string (1, 2), Log injection (1,2)

           Memory Attack Discussion

Memory Error Based Remote Attack
    Attacker’s         goal:
           Overwrite   target of interest to take over instruction
    Attacker’s         approach:
           Propagate   attacker controlled input to target of interest
           Violate certain structural constraints in the
            propagation process

Stack Frame Structural Violation
                         A’s stack frame
                      Function arguments
             High        Return address
                      Previous stack frame
                    Exception Registration Record
                          Local variables
                         B’s stack frame
                      Function arguments
                      Return address( to A)
                      Previous stack frame
                         Local variables
                         C’s stack frame
                      Function arguments
             Low      Return address (to B)
      EBP              Previous stack frame
      FS:0          Exception Registration Record
      ESP                Local variables

   Heap Block Structural Violation

                             Size                     Previous Size
                                     Flags            Unused          Tag Index


                       Windows Free Heap Block Header Structure

      Happens when removing free block from double-linked list:

      Ability to write 4 bytes into any address, usually well known address, like
       function pointer, return address, SEH etc.
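
The 4-byte write can be simulated in Python. This is a sketch under simplifying assumptions, not the Windows heap code: memory is a flat `bytearray`, and the free-list entry is assumed to hold a 32-bit forward link (Flink) at offset 0 and a backward link (Blink) at offset 4.

```python
import struct

def unlink(memory: bytearray, entry: int) -> None:
    """Remove a free-list entry from a doubly linked list (simplified 32-bit layout)."""
    flink, blink = struct.unpack_from("<II", memory, entry)
    struct.pack_into("<I", memory, blink, flink)      # entry->Blink->Flink = entry->Flink
    struct.pack_into("<I", memory, flink + 4, blink)  # entry->Flink->Blink = entry->Blink

# If an overflow lets the attacker choose both links, the first store
# writes an attacker-chosen value (Flink) to an attacker-chosen address (Blink).
mem = bytearray(64)
struct.pack_into("<II", mem, 16, 0x20, 0x08)  # Flink = value to write, Blink = address to write
unlink(mem, 16)
print(hex(struct.unpack_from("<I", mem, 0x08)[0]))  # 0x20
```

The second store also lands at Flink + 4, which is why the chosen value must itself point to writable memory.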

ASLR and Crash Analysis
       ASLR randomizes the addresses of targets of interest
       Memory attack using the original address will miss
        and cause crash (exception).
       Crash analysis tracks back to vulnerability, which
        enables accurate signature generation
          Structural information usually retrievable at
           runtime, thanks to enhanced debugging technology
          Crash analysis aided by JIT (Just-In-Time Tracing)
            JIT   triggered at certain events:
              “Suspicious”    network inputs, e.g. sensitive JMP
            Attach/detach  JIT monitor at event of interest
            Memory can be dumped at the right granularity, with log
             info ranging from a few KB to 2GB

Crash Root Cause Analysis
                         Root Cause Analysis

                           Exception Record/Context,
                      Faulting thread/Instructions/Registers
                       Stack trace/Heap/Module/Symbols

              Stack Corruption                        Heap Corruption

        Read                                                        Write
                          Access Violation
   Access Violation                                            Access Violation
                           Bad Dereference
       Bad EIP
                          (Corrupted Local
  (Corrupted Return                                        (Address to write,
  Address or SEH)                                           Value to write )

Stack-based Overflow Analysis
      “Target” driven analysis
            The  goal of attack string is to overwrite target of interest
             on stack, e.g., return address, SEH handler.
            Start matching target values from crash dump to input, like
             EIP, EBP and SEH handler
               More   efficient than pattern match in the whole address space
            If any targets are matched in input, expand in both
             directions to find LCS
            A match usually indicates the input size needed to overflow
             certain targets
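
A minimal Python sketch of the target-driven matching follows. The function name, the argument layout, and the returned fields are assumptions for illustration; it looks only at the first occurrence of the target value in the input and expands while input and dumped stack bytes agree.

```python
from typing import Dict, Optional

def correlate_target(net_input: bytes, stack: bytes, target_off: int,
                     target_len: int = 4) -> Optional[Dict[str, int]]:
    """Match an overwritten target (e.g. saved return address) in the dump against the input."""
    target = stack[target_off:target_off + target_len]
    pos = net_input.find(target)
    if pos < 0:
        return None  # target value did not come verbatim from this input
    # Expand to the left while the bytes before the target also agree.
    left = 0
    while (left < pos and left < target_off
           and net_input[pos - left - 1] == stack[target_off - left - 1]):
        left += 1
    # Expand to the right past the target.
    right = target_len
    while (pos + right < len(net_input) and target_off + right < len(stack)
           and net_input[pos + right] == stack[target_off + right]):
        right += 1
    return {
        "input_offset": pos,                    # where the target value sits in the input
        "match_start": pos - left,              # start of the common run in the input
        "match_len": left + right,              # length of the common run
        "bytes_to_target": left + target_len,   # copied bytes needed to cover the target
    }
```

Starting from a handful of target values keeps the search small compared with matching the input against the whole address space.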

SEH Overflow and Analysis
   A unique approach for Windows exploits
      SEH stands for Structured Exception Handler
      Windows puts the EXCEPTION_REGISTRATION_RECORD chain on the stack,
       with the SEH in each record.
   More reliable and powerful than overwriting the return address
      More JMP addresses to use (pop/pop/ret)
      An exception (accidental/intentional) is desired
      Can bypass /GS buffer check
   SEH crash analysis:
      Catch   the first exception as well as the second one (caused by
      Locate the SEH chain head from first dump, usually overwritten
       by input
      Usually first exception is enough, second exception can be used
       for confirmation

Heap Overflow Analysis
   How to analyze heap overflow attack?
      Exploit happens when free blocks are unlinked
            Multiple ways to trigger
      Write Access Violation with ASR
            when overwriting at an invalid address
      Overwrite a 4-byte value at an arbitrary address
            Targets of interest include return address, SEH, PEB and UEF
      Exploit contains the pair: (Address To Write, Value to Write)
            Appears in the overflowed heap blocks
            Usually contained in registers
            Should be provided from input by attacker
            Match found in synthetic heap exploits

      The value pairs need to be at a fixed offset
            For a given heap overflow vulnerability
            To enable overwriting the right address with the desired value

Case Studies
Vulnerability                                          Exploit
IIS ISAPI Extension synthetic stack buffer overflow    Overwrite return address
IIS ISAPI Extension synthetic stack buffer overflow    Overwrite Structured Exception Handler
IIS w3who.dll stack buffer                             Overwrite Structured Exception Handler
Microsoft RPC DCOM Interface stack buffer overflow     Overwrite return address and Structured
(CVE-2003-0352)                                        Exception Handler
Synthetic Heap Overflow                                Overwrite function pointer inside PEB

Case Study: RPC DCOM
            Step   1: Exception Analysis
              ExceptionCode: c0000005 (Access violation)
            Attempt to read from address 0018759f
            PROCESS_NAME: svchost.exe
            FAULTING_THREAD: 00000290

            Step   2: Target – Input correlation:
            StackBase: 0x6c0000, StackLimit: 0x6bc000,Size =0x4000
            Begin analyze on Target Overwrite and Input Correlation:
            Analyze crash EIP:
                Find EIP pattern at socket input:
                Bytes size to overwrite EIP= 128
            Analyze crash EIP done!
            Analyze SEH:
                Find SEH byte at socket input:
                Bytes size to overwrite SEH handler= 1588
            Analyze SEH done!

Signature Generation
     Signature generation:
        Signature captures the vulnerability characteristics

              Minimum      size to overwrite certain target(s)

            Use   contexts to reduce false positives:
              Using incoming input calling stack
                Stack offset can uniquely identify the context

              Using incoming input semantic context:
                Message format like HTTP url/parameter
                Binary message field
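
One way to picture such a signature is a length threshold tied to a semantic context. The Python sketch below is an assumption-laden illustration: the class name, the URL path, and the parameter name are hypothetical, and the threshold is whatever minimum overwrite size the crash analysis reported.

```python
from dataclasses import dataclass
from urllib.parse import parse_qsl, urlsplit

@dataclass(frozen=True)
class LengthSignature:
    path: str          # semantic context: URL path of the vulnerable handler
    param: str         # message field that carries the overflowing data
    min_overflow: int  # minimum size needed to overwrite the target

    def matches(self, url: str) -> bool:
        """True if the request supplies the field in this context with enough bytes to overflow."""
        parts = urlsplit(url)
        if parts.path != self.path:
            return False
        return any(
            name == self.param and len(value.encode()) >= self.min_overflow
            for name, value in parse_qsl(parts.query, keep_blank_values=True)
        )

sig = LengthSignature(path="/app/handler", param="name", min_overflow=128)
print(sig.matches("http://host/app/handler?name=" + "A" * 200))  # True
print(sig.matches("http://host/app/handler?name=alice"))         # False
```

Tying the length test to a path and field is what keeps long but harmless values elsewhere in the request from triggering the filter.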

    Components & Implementation
                                                    Crash Monitor:
                                                    * Catch interested
                                                      exception only
                               1                     •Snapshots for a
                         Crash(Exception)              given period
                                                       * Self healer          Uses

                                                           2   Generate
         Protected Application                        Crash Dump*              Debug
                                    5                                         Engine
                                        Signature          4   Analyze
                                  3 Provide
                                                     Crash Analyzer          Uses
                                                  •Fault type detection   Infrastructure:
                                                   •Security oriented      Save Crash Dump
                                                         analysis         Extract Relevant Info
   * Crash Dump provides the same interface                                  Search/Match
  as LIVE process, so Crash Analyzer actually         •Feedback               Disassemble
does NOT have to work on saved crash dump file.


 Test Attacks & Applications
Attack                           Vulnerability              Target App     App Lang   Exploited Lang   Targets
phpBB SQL Injection              CAN-2003-0486              phpBB          PHP        SQL              Database
SquirrelMail Command Injection   CAN-2003-0990              SquirrelMail   PHP        cmd/shell        Server
SquirrelMail XSS Attack          CAN-2002-1341              SquirrelMail   PHP        JavaScript       3rd party clients
PHP XML-RPC                      CAN-2005-1921              PHP Library    PHP        XML
HTTP Splitting                   CR LF escapes              WebGoat        Java       HTTP Request     Server
HTTP Splitting Cache Poisoning   tainted expiration field   WebGoat        Java       HTTP Request     Server page cache
Path Based Access Control        tainted file open          WebGoat        Java       file path        Server
Xpath injection                  tainted xpath string       WebGoat        Java       Xpath Library    Server
JSON injection                   flawed architecture        WebGoat        Java       JSON             Server Application
XML injection                    flawed architecture        WebGoat        Java       XML              Server Application

       Baseline Applications                                        Many “sub-languages”
       • phpBB (php)                                                SQL, XML, JavaScript,
       • squirrelMail (php)                                         HTML, HTTP, JSON, shell,
                                                                    cmd, path
       • WebGoat (java)
       • hMailServer (C++)

   Possible Testbed Configurations
                                                                                                       Protected System
                           Protected System

                  Web                                                                                            Mail
                              Web                           Mail
                 Server                                                                                         Server
                                           SQL             Server

Attacker                                                                       Attacker

    Baseline testbed setup                                          Protect Mail server exposed as a
                                                                                             Protected System
               Web                                                                                            files
                           Web                          Mail                         Web
              Server                                                                            Web
                                                       Server                       Server                                Mail
                                       SQL                                                      Apps
              Apache)                                                               (IIS/
                                     Database                                      Apache)

Attacker                                                            Attacker
                                      Protected System

  Protect just mail server in context of                              Can extend protected system to
  Web service.                                                        include Mail Server
Traffic Generation
    Purpose
     Coverage      of legitimate structural variation in
           monitored structures
            SQL,   command strings, call parameters
     Stress        of log complexity for practicality
            Multiple   users, multiple sessions
     Performance          measurements
            Program  performance metrics
            Quantify performance impact

Traffic Generation to Web Sites
   Approaches
      Simple      Record/Playback (basic)
              with minor substitutions (cookies, ips)
              shell scripts, netcat, MaxQ (Jython-based)
          Custom DOM/Ajax scripting (learning)
            Can access dynamically generated browser content after(during)
            client side script eval
            Automated site crawls of URLS

            Automated form contents (site specific metadata)

          COTS tools
              Load testing and metrics

           Red Team Suggestions

Suggested Red Team ROEs
 Initial telecons held in Fall
 Claim: RAMSES will defeat most generalized
  injection attacks on protected applications
 Red Team should target our current and planned
  applications rather than new ones (unless new
  application, sample attacks and complete traffic
  generator can be provided to RAMSES far enough in
  advance for learning and testing)
     Remote  network access to the targeted application
     Attack designated application suite

 Required instrumentation yet to be determined
 Red Team exercise start 15 April or later
 ……
   RAMSES Project Schedule
                       CY06      CY07                    CY08          CY09

Baseline Tasks
                       Q3 Q4 Q1 Q2 Q3 Q4 Q1 Q2 Q3 Q4 Q1 Q2 Q3
1. Refine RAMSES
2. Design RAMSES
3. Develop
4. Integrate System
5. Analyze & Test
6. Coordinate & Rept
       Prototypes                  1           2         3
Optional Tasks
                                                             Red Team Exercise
O.3 Cross-Area Exper

                              Today: 11 September 2007
           Next Steps

 Develop  input filters from output policies
 Extend memory error analyzer

 Demonstrate RAMSES on more applications and
  attack types
     Native   C/C++ app (most likely app is hMail server)

 Integrate components
 Performance and false positive testing

 Red Team exercise



Tokenizing and Parsing
   Focus on “rough” parsing that reveals approximate
    structure, but not necessarily all the details
      Accurate parsers are time-consuming to write
      More important: may not gracefully handle errors (common in
      HTML) or language extensions and variations (different shells,
      different flavors of SQL)
    Implemented using Flex/Bison
          Currently done for SQL and shell command languages
            Parse into a sequence of statements, each statement consisting of
             a “command name” and “parameters”
            Incorporates a notion of confidence to deal with complex
             language features, e.g., variable substitutions in shell
      Modest  effort for adding additional languages, but
       substantially simplifies subsequent learning tasks
      Don’t anticipate significant additions to this language list
       (other than HTML/XML)

Taint inference Vs Taint-tracking
   Disadvantages of learning
      False  negatives if inputs transformed before use
         Low likelihood for most web apps
      False positives due to coincidence
         Mitigated using statistical information
      Plan to evaluate these experimentally

   Benefits of learning
      Low   performance overhead
      Some significant implicit flows handled without incurring high
       false positives
      Can address multi-step attacks where tainted data is
       first stored in a file/database before use
         More generally, in dealing with information flow that crosses
          module boundaries

Attack Coverage 2004
  [Pie chart of CVE vulnerabilities (Ver. 20040901) by category. Legible labels:
   Memory errors (stack-smashing, heap overflow, integer overflow, data attacks),
   Tempfile, Config errors, Other logic errors, DoS, Format string, SQL injection,
   Command injection, Cross-site scripting, Directory traversal, with a group
   label “Injection Attacks”. Legible values: 4%, 3%, 27%, 9%, 2%, 15%, 10%.]
RAMSES System Concept
                                                                 Protected System                   Components
                                                                                                      Event Collector
                   Firewall (e.g. mod_security)                                                    • parse/decode/normalize

                                                                                                     HTTP requests,
                                                        Web              Web                         parameters, cookies, …

                                                       Server            App
                                                                                        Database      Attack Detector
                                                         (IIS/           (PHP/
                                                                                        (MySQL)    • Address-space
                                                       Apache)            ASP)
                                                                                                   • Taint-based policies,

                                                                  RAMSES Interceptors                 Filter Generator
                                                                                                   • Output filter
                                                       Network          OS         Application     • Input filter
                                                        DLLs           DLLs          DLLs

     Key research problems
           Learn                  taint propagation
                Identify                         tainted components in output, generate filtering criteria
           Learn                  input/output transformation
                Use              transformation to project output filters to input

Advantages of RAMSES Filters
 Filters       easily sharable
     Complements         Application Community focus on end
           user applications
 Filters       are human readable
     Filter    generation algorithms can be enhanced to
           address privacy concerns wrt sharing

Filter types
           Filter Criteria                        Filter Location
     Correlative filters                Input filter
                                            Easier to deploy but harder to
        Equality-based filter
        Structure-based filter
                                         Output filter (precedes sensitive
        Statistical filter               operation)
     Causal filters                        Easier to synthesize than input
        Filtering criteria derived
                                             filter, but deployment needs
                                             deeper instrumentation
         from attack detection
                                            May be too late for some attacks
         criteria (policy or
                                             (memory corruption)

      Note: All filters evaluated using large number of
      benign samples and 1 attack sample

