Netflow and Botnets


   Steven M. Bellovin
  Columbia University



          smb           1
                  Hypothesis
• Most hosts are either clients or servers
  – P2P traffic is an exception
• Bots talk to other bots and thus to the command
  and control node
• By looking for unusual traffic flows – client-to-
  client traffic that isn’t P2P – we can find bots



               Methodology
• Use Netflow data to identify clients and
  servers
• Classify nodes as clients or servers
• Build a traffic matrix from the data to see
  which clients talk to which other clients
• Exclude P2P traffic, which is generally
  identifiable based on flow size
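
The steps above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the project's actual tooling: the function name, the (src, dst, octets) tuple layout, the roles dictionary, and the looks_like_p2p predicate are all hypothetical, and the 1 MB cutoff in the usage lines is an arbitrary placeholder, since the slides do not say which flow sizes indicate P2P.

```python
from collections import defaultdict

def client_to_client_matrix(flows, roles, looks_like_p2p):
    """Sum octets for each (src, dst) pair where both ends are clients.

    flows: iterable of (src_addr, dst_addr, octets) tuples (assumed layout)
    roles: dict mapping address -> "client" or "server", produced by an
           earlier classification step
    looks_like_p2p: caller-supplied predicate on flow size (hypothetical)
    """
    matrix = defaultdict(int)
    for src, dst, octets in flows:
        if roles.get(src) != "client" or roles.get(dst) != "client":
            continue  # keep only client-to-client traffic
        if looks_like_p2p(octets):
            continue  # drop flows the size heuristic attributes to P2P
        matrix[(src, dst)] += octets
    return dict(matrix)

# Toy data; the size threshold is a placeholder, not a measured value.
roles = {"a": "client", "b": "client", "c": "client", "s": "server"}
flows = [("a", "b", 100), ("a", "s", 500), ("a", "b", 50), ("c", "a", 5_000_000)]
matrix = client_to_client_matrix(flows, roles, lambda octets: octets >= 1_000_000)
```

Whatever pairs remain in the matrix are the client-to-client conversations that the hypothesis treats as candidates for a closer look.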


                   Netflow
• Originally from Cisco; now implemented by
  most router vendors
  – Its IETF successor, IPFIX, is a “Proposed Standard”
• Records “flow information” – src/dst pairs
  (addresses and port numbers), length, timing,
  etc. – for “connections” through a given router
• Intended for accounting and for traffic
  engineering
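
To make "flow information" concrete, a single record can be pictured as a small structure like the one below. The field names are assumptions chosen for readability; they do not reproduce the actual Netflow export format, which has more fields and version-specific layouts.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class FlowRecord:
    """One unidirectional flow as seen by a router (hypothetical field names)."""
    src_addr: str
    dst_addr: str
    src_port: int
    dst_port: int
    protocol: int   # IP protocol number, e.g. 6 for TCP
    packets: int
    octets: int
    start_ms: int   # time the first packet of the flow was seen
    end_ms: int     # time the last packet of the flow was seen

    def duration_ms(self) -> int:
        """Elapsed time between the first and last observed packet."""
        return self.end_ms - self.start_ms

# Example record using documentation-only addresses.
example = FlowRecord("192.0.2.10", "198.51.100.5", 51000, 80, 6, 12, 4200, 1000, 1850)
```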

        Problems with Netflow
• Flows are unidirectional; need two records for a
  complete picture
  – This is a consequence of Internet topology; most
    inter-ISP connections follow asymmetric paths
• Routers often deliver sampled data; can miss
  flow start/end packets
• Does not give unambiguous indication of
  client versus server
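
One way to cope with unidirectional records is to pair them under a direction-independent key, as in the sketch below. It assumes flows arrive as (src_addr, src_port, dst_addr, dst_port, octets) tuples and, for brevity, ignores the protocol field; both function names are hypothetical. A conversation seen in only one direction may reflect an asymmetric path or sampling loss, so it is flagged rather than discarded.

```python
from collections import defaultdict

def pair_flows(flows):
    """Group unidirectional flows into bidirectional conversations.

    The key is the two (addr, port) endpoints in sorted order, so both
    directions of a conversation land on the same entry.
    Returns {(a, b): [octets a->b, octets b->a]} with a <= b.
    """
    conversations = defaultdict(lambda: [0, 0])
    for src, sport, dst, dport, octets in flows:
        a, b = (src, sport), (dst, dport)
        if a <= b:
            conversations[(a, b)][0] += octets
        else:
            conversations[(b, a)][1] += octets
    return dict(conversations)

def one_sided(conversations):
    """Return the keys for which only one direction was observed."""
    return [key for key, (fwd, rev) in conversations.items() if fwd == 0 or rev == 0]

flows = [
    ("10.0.0.1", 50000, "10.0.0.2", 80, 500),
    ("10.0.0.2", 80, "10.0.0.1", 50000, 9000),
    ("10.0.0.3", 51000, "10.0.0.2", 80, 300),   # reverse direction never seen
]
conversations = pair_flows(flows)
missing = one_sided(conversations)
```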

                      Strategy
• Build tools at Columbia
  – Easy access to machines and data
• Use existing archive of CU Netflow data
  – Unclear if there are botnets present; get classification
    right first
• Get other Netflow archives (e.g., from
  predict.org)
• Bring nominally working code to AT&T to
  experiment with large-scale datasets
• Compare with previous results from AT&T as a
  check on correctness
            Node Classification
• Must use heuristics
  – Flag field in Netflow data doesn’t show client vs.
    server
  – Timestamp not useful because of sampling
• Current strategy: look at port number
  distribution
  – Clients usually use ephemeral ports in the 48K-64K
    range (49152-65535)
• Considering using node degree
  – But – problems with low-activity hosts?
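
A port-distribution heuristic of this kind might look like the sketch below. It is a guess at the general shape, not the classifier used in this work: the function name and the 0.8 threshold are assumptions, and the default port range simply restates the 48K-64K figure from the slide. Hosts that fall between the two thresholds are reported as ambiguous rather than forced into a class.

```python
def classify_by_ports(local_ports, low=48 * 1024, high=64 * 1024 - 1, threshold=0.8):
    """Guess a host's role from the ports it used on its own side of each flow.

    local_ports: list of port numbers observed for this host
    low, high:   bounds of the range treated as client (ephemeral) ports
    threshold:   arbitrary cutoff on the fraction of ephemeral ports (assumption)
    """
    if not local_ports:
        return "unknown"  # low-activity host: nothing to go on
    ephemeral = sum(1 for port in local_ports if low <= port <= high)
    fraction = ephemeral / len(local_ports)
    if fraction >= threshold:
        return "client"
    if fraction <= 1 - threshold:
        return "server"
    return "ambiguous"

client_guess = classify_by_ports([50000, 51000, 60000, 52000, 49152])
server_guess = classify_by_ports([80, 80, 80, 443])
mixed_guess = classify_by_ports([80, 50000])
```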

          Classification is Hard
• Simple heuristics have not been satisfactory
• Building visualization tools to help us
  understand the data
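
As a rough idea of the data behind a port-number-by-volume view, the sketch below totals octets per port bucket for one host. The function name, the (port, octets) input layout, and the 1024-port bucket width are assumptions made for illustration; the resulting counts could be handed to any plotting tool.

```python
from collections import Counter

def volume_by_port(flows, bucket=1024):
    """Total octets per port bucket for a single host.

    flows:  iterable of (port, octets) pairs (assumed layout)
    bucket: width of each port bucket; 1024 is an arbitrary choice
    Returns a Counter mapping the first port of each bucket to total octets.
    """
    totals = Counter()
    for port, octets in flows:
        totals[(port // bucket) * bucket] += octets
    return totals

host_flows = [(80, 1000), (443, 500), (50000, 200), (50100, 300)]
totals = volume_by_port(host_flows)
```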




[Figure] Client: Port Number by Volume

[Figure] Client: Port Number Scatter Plot

[Figure] Server: Port Number by Volume

[Figure] Server: Port Number Scatter Plot

[Figure] Ambiguous Host

[Figure] Ambiguous Host Scatter Plot
      Is this the sort of host we’re looking for?

               Current Status
• Have basic tools built
• Working with visualization tools to understand
  the data
• Next steps:
  – Refine classification algorithms
  – Confirm analysis of bots in sample data
  – Try tools on larger dataset

