ABSTRACT
CHAUDHARY, UMANG KAMALAKAR. Flow Classification Using Clustering And Associative
Rule Mining. (Under the direction of Dr. Michael Devetsikiotis.)
Traffic classification has become a crucial domain of research due to rise in applications that
are either encrypted or tend to change port consecutively. The challenge of flow classification
is to determine the application without any information of the payload. In this paper, our goal
is to achieve a robust and reliable flow classification using data mining techniques. We propose
a classification model which not only classifies flow traffic, but also performs behavior pattern
profiling. The classification is implemented using clustering algorithms, and association rules
are derived by using Apriori algorithms. We are able to find association between flow parame-
ters for various applications, therefore making the algorithm independent of the characterized
applications. The rule mining helps us to depict various behavior patterns for an application,
and those behavior patterns are then fed back to refine the classification model.
Flow Classification Using Clustering And Associative Rule Mining
by
Umang Kamalakar Chaudhary
A thesis submitted to the Graduate Faculty of
North Carolina State University
in partial fulfillment of the
requirements for the Degree of
Master of Science
Computer Engineering
Raleigh, North Carolina
2010
APPROVED BY:
Dr. Yannis Viniotis Dr. Do Young Eun
Dr. Michael Devetsikiotis
Chair of Advisory Committee
DEDICATION
To my beloved parents and my sister
ii
BIOGRAPHY
Umang Chaudhary received his Bachelor of Engineering degree from University of Mumbai,
India. He is currently pursing Master of Sciences in Computer Engineering in Network Traffic
Classification in North Carolina State University in Raleigh. He worked as research associate
for Network Performance Research Group for accurate classification of NetFlow traffic using
Machine Learning algorithms. His experience with ITng services helped him gain valuable
insight of industrial technologies.
iii
ACKNOWLEDGEMENTS
I would like to thank my parents, Kamalakar and Lata, for their support and commitment in
helping me succeed. Also, to my sister, Bhavna, who constantly reminds me that life teaches
you more than school work.
I am grateful for the advice and guidance of my advisor, Dr. Michael Devetsikiotis, and for
that of my committee members, Dr. Yannis Viniotis and Dr. Do Young Eun. Also, Ioannis
Papapanagiotou, who consistently supported me and shared the vision of this research work.
A special thanks to all of the members of the ITng Services at Centennial Campus who gave
me direction and advice for this project.
iv
TABLE OF CONTENTS
List of Tables . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . vii
List of Figures . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . viii
Chapter 1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.1 Motive and Goals . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.2 Contribution . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2
1.3 Thesis Organisation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
Chapter 2 Background and Related Work . . . . . . . . . . . . . . . . . . . . . . . 4
2.1 Packet Classification and NetFlow Classification . . . . . . . . . . . . . . . . . . 4
2.2 Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
2.2.1 Port-based Classification . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
2.2.2 Payload-based Classification . . . . . . . . . . . . . . . . . . . . . . . . . . 8
2.2.3 Transport-layer based classification . . . . . . . . . . . . . . . . . . . . . . 9
2.2.4 Machine Learning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10
2.2.5 Graph . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12
2.2.6 Statistical Classification . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13
Chapter 3 Classification Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14
3.1 Classification Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14
3.1.1 Processing and Flow Data . . . . . . . . . . . . . . . . . . . . . . . . . . . 14
3.1.2 Clustering . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16
3.1.3 Transductive Classification . . . . . . . . . . . . . . . . . . . . . . . . . . 19
3.1.4 Association-Rule Classification . . . . . . . . . . . . . . . . . . . . . . . . 19
Chapter 4 Analysis and Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22
4.1 Traces . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22
4.2 Classification Metrics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22
4.3 Analysis and Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24
4.3.1 K-Mean Clustering . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24
4.3.2 Model-based Clustering . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27
4.3.3 Rule based Classification . . . . . . . . . . . . . . . . . . . . . . . . . . . 30
4.4 Cluster Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33
4.5 Behavior Pattern for Application . . . . . . . . . . . . . . . . . . . . . . . . . . . 34
Chapter 5 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36
5.1 Summary of Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36
5.2 Future Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 37
References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 38
v
Appendix . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 40
Appendix A Application Behavior Profiling . . . . . . . . . . . . . . . . . . . . . . . 41
A.1 DNS behavior Pattern . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41
A.2 Mail behavior Pattern . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 42
A.3 HTTP behavior Pattern . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43
A.4 IRC behavior Pattern . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 44
A.5 SMTP behavior Pattern . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 45
vi
LIST OF TABLES
Table 2.1 NetFlow version 5 format . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6
Table 2.2 Comparison between payload and NetFlow . . . . . . . . . . . . . . . . . . 7
Table 3.1 Distribution of application class in traces . . . . . . . . . . . . . . . . . . . 15
Table 3.2 Parameterization of Covariance Matrix in Mclust Package . . . . . . . . . 18
Table 4.1 Evaluation Metrics for Machine Learning Classification Techniques . . . . 23
Table 4.2 Precision & Recall for Tranductive K-Mean and Model-based classification 30
Table 4.3 Accuracy for Association Rule classifier based on K-Mean and Model-
based clustering for WAND and CRAWDAD Traces . . . . . . . . . . . . . 32
Table 4.4 Association Rules from Apriori Association Algorithm . . . . . . . . . . . . 35
vii
LIST OF FIGURES
Figure 2.1 Various Network Traffic Classification Methods . . . . . . . . . . . . . . . 8
Figure 2.2 Application behavior patterns based on transport layer attributes . . . . . 9
Figure 2.3 Application behavior patterns based on transport layer attributes . . . . . 10
Figure 2.4 Graph-based classification for HTTP-TG application . . . . . . . . . . . . 12
Figure 3.1 Classification Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15
Figure K-Mean Transductive Classifier Precision Graph for WAND Traces . . . .
4.1 25
Figure K-Mean Transductive Classifier Recall Graph for WAND Traces . . . . .
4.2 25
Figure K-Mean Transductive Classifier Accuracy Graph for WAND Traces . . .
4.3 26
Figure Bayesian Information Criterion distribution for parameterization . . . . .
4.4 27
Figure Model-based Transductive Classifier Precision Graph for WAND Traces .
4.5 28
Figure Model-based Transductive Classifier Recall Graph for WAND Traces . . .
4.6 29
Figure K-Mean and Rule-based Classification Accuracy Graph for WAND Traces
4.7 31
Figure Model-based clustering and Rule based classification Accuracy for WAND
4.8
and Crawdad traces . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32
Figure 4.9 Distribution of flow attributes in scatter plot matrix . . . . . . . . . . . . 33
viii
Chapter 1
Introduction
New applications create a vast diversity of Internet traffic. As the Internet continues to expand,
optimum provisioning and managing has become a severe challenge [21]. These applications
are finding innovative ways to tensely use network resources.
1.1 Motive and Goals
In recent times, the network operators need a methodology in which applications can be pro-
filed and classified by taking into account their recent behavior patterns. Traffic classification
and analysis will help Internet Service Providers (ISPs) to plan resources for applications and
optimize their business goals [17]. Moreover, pattern recognition and behavior profiling has
become a necessity in the discovery of malicious traffic, and Denial of Service attacks [20].
A lot of research has been done in the field of network traffic classification using various
techniques and models. The classification can be done using payload attributes, transport layer
information, application behavior, statistical data modeling, and machine learning techniques.
Packet-based techniques uses session information, content signatures and behavior signa-
tures for accurate classification. The accuracy rate is usually very high even for peer-to-peer
applications [15]. Packet identification faces limitations due to encryption, string search opti-
mization, storage and processing overhead to support massive amounts of packet analysis.
Flow based techniques use header information and connection pattern between source and
destination IPs to deduce classification models. Though due to the non-existence of the payload
information, a classifier based on netflow traces have been long regarded as a challenging process.
In this work, we applied clustering and data mining techniques to overcome this issue. In the
following a brief explanation of the classification processes is given.
Port-based classification is a conventional method to classify traffic in the network by consid-
ering port number of the payload traffic. The standard list of port numbers for each application
1
is assigned by Internet Assigned Numbers Authority (IANA) [9]. Due to dynamic port allo-
cation and use of unregistered ports, port based classification has reduced effectiveness for
peer-to-peer applications.
Payload-based classification has packets processed to match particular signatures of the
application of interest. The approach needs to remain up to date with signatures of the ever
evolving applications. There is need of vigorous regular expression matching and string search
optimization for identifying unique signature values of various applications [15].
Transport-layer heuristics analyze transport layer parameters to interpret flow connection
patterns of various network applications. The method overcomes processing and memory draw-
back of payload-based analysis [11]. In [12], the host related characteristics are utilized for
classification on social, functional and application level.
Statistical-based classification utilize statistical properties to build empirical models based
on parameters such as byte size, flow/packet duration, inter-arrival packet timing. Distinct
traffic flow statistics properties have been observed for network applications [4].
Graph-based Classification combines network-wide behavior and flow-level characteristics
using Traffic Dispersion Graphs (TDG). This technique provides almost 90% accuracy for peer-
to-peer traffic with backbone traces achieving 95% of precision [10].
Machine Learning based Classification utilizes powerful data mining, information heuristics
techniques for describing structural patterns in data for classification purpose. The Machine
Learning Model computes patterns from training datasets. These patterns form basis for classifi-
cation of the test dataset [16]. One of the main drawbacks of the above classification techniques
is that they cannot profile applications with changing behaviors, as well as there is a need
constant profiler updates for any future application
1.2 Contribution
In the proposed classification model, we combine traffic clustering and application behavior
profiling. Such a technique increases the classification accuracy and is independent of the format
of the applications. This is achieved by creating association rules between flows that belong
to the same applications. Association rules are calculated without any human intervention
therefore making the algorithm independent of the investigated application.
Due to the need for an efficient and practical approach, we developed a two step netflow
classification algorithm. The proposed process, first utilizes a clustering algorithm to cluster
and classify the data, and secondly implements association rule mining technique for labeling
flow datasets. More specifically, our contribution can be summarized as follows:
• We achieve high classification precision, recall and overall accuracy. The model reaches
overall accuracy of 94% with almost perfect HTTP precision.
2
• We are able to profile behavior patterns for each network applications using Apriori as-
sociation rule algorithm. With rule evaluation parameters such as lift, confidence and
support, we are able to model and characterize interaction of the flow attributes.
• Due to creation of the association rules, the process is independent of the dataset. There-
fore, for the unknown or future applications that algorithm instantly defines each pattern.
1.3 Thesis Organisation
In the next chapter, we describe various network traffic classification techniques. In the third
chapter, we propose our classification model using Apriori algorithm and clustering. In the
fourth chapter, we analyze K-Mean and Model based clustering driven by apriori classification
and discuss their results. Finally, conclusions are provided at the last chapter.
3
Chapter 2
Background and Related Work
This chapter provides an overview of the related work in the network traffic classification,
addresses payload and flow classification, gives a brief review of the types of classification
models, algorithms and techniques.
2.1 Packet Classification and NetFlow Classification
Packet Classification
Packet contain valuable information of the serviced application in the network. Due to spec-
ification of the application protocols, there are particular signalling information or messages
necessary to initial communication between hosts. Thus, there are certain string patterns or
signatures present in the packet that map back to the application.
Packet based classification searches for characteristics signatures of applications in the
packet payload. The classification model utilize stateful reconstruction of session and appli-
cation data strings from each packet’s content. There is an advantage that duplicate data
can be identified but there is a need to reassemble packets in real time. In the peer-to-peer
applications, the client payload contains signaling and peer information which form basis to
application signatures. Moreover, the location and length of signatures in the packets plays a
vital role for classification. This method provided low false negative value for frequently used
peer-to-peer applications [22].
Although payload based classification achieves high accuracy, the pattern matching in the
packet payload is quite CPU intensive. There are large requirements of processing power and
memory allocation for huge packet traces. As the applications evolves, the signalling behavior
changes and hence signatures needs to be update with current application patterns. In addition,
there are various implementation of a particular application which do not comply with avaliable
documentation. Encrypted payload cannot be scrutinized for payload signature matching [8].
4
NetFlow Classification
Flow based classification utilizes packet header information between source and destination
hosts. Cisco’s NetFlow collects valuable flow statistics and has become widely used in traffic
monitoring tools. According to NetFlow version 5, flow is defined as a unidirectional sequence
of packets between a particular pair of source and destination IP addresses.
A new flow is decided when any one of the following key fields change,
• Source IP address
• Destination IP address
• Source port number
• Destination port number
• Layer 3 protocol type
• ToS byte
• Input logical interface (ifIndex)
NetFlow are advantages for customer applications due to the following reasons [3]:
• Network Monitoring The real time network monitoring capability allows analysis of traffic
patterns in individual network devices such as routers and switches as well as on a network
wide basis. This assists in troubleshooting and provide rapid solutions.
• Application Monitoring and Profiling NetFlow helps view detailed time based application
usage over the network. The information can be used for planning new services and
applications. Moreover, network and application resources can be allocated according to
customer base.
• User Monitoring and Profiling On the user side, NetFlow provides detailed information
of user utilization of network and application resources. This assists in planning user
network, allocate access, network and backbone resources as well as avoid security threats.
• Network Planning NetFlow has provisions for collecting data for a long period of time for
understanding the need of growth and upgradation in the network. NetFlow minimizes
cost by maximizing performance, capacity and reliability. Valuable information is gained
by identifying unwanted WAN traffic, validation bandwidth and Quality of Service.
5
Table 2.1: NetFlow version 5 format
Bytes Contents Description
0-3 srcaddr Source IP address
4-7 dstaddr Destination IP address
8-11 nexthop IP address of next hop router
12-13 input SNMP index of input interface
14-15 output SNMP index of output interface
16-19 dPkts Packets in the flow
20-23 dOctets Total number of Layer 3 bytes in the packets of the flow
24-27 First SysUptime at start of flow
28-31 Last SysUptime at the time the last packet of the flow was received
32-33 srcport TCP/UDP source port number or equivalent
34-35 dstport TCP/UDP destination port number or equivalent
36 pad1 Unused (zero) bytes
37 tcp flags Cumulative OR of TCP flags
38 prot IP protocol type (for example, TCP = 6; UDP = 17)
39 tos IP type of service (ToS)
40-41 src as Autonomous system number of the source, either origin or peer
42-43 dst as Autonomous system number of the destination, either origin or peer
44 src mask Source address prefix mask bits
45 dst mask Destination address prefix mask bits
46-47 pad2 Unused (zero) bytes
• NetFlow Data Ware Housing and Data Mining NetFlow data can be hub for retrieval and
analysis in support of applications according to user behavior. In addition, the Market
Researchers access gain crucial information of ”who”, ”what”, ”where” and ”how long”
information relevant to the service providers.
Table-2.1 is reproduced from [3] for illustration of NetFlow Version 5 packet format. NetFlow
flows records are forwarded to the collector in form of UDP/IP packets. It can be heavy in
computation for high speed network enivornment [3]. NetFlow features provide considerable
revelance for IP traffic under test. From symmetic uncertainity of all features, source and
destination IPs, source and destination ports, and TCP flags show high relevance. The reason
is that IP address consist of social behavior of various applications such as HTTP servers, Mail
servers [24]. Hence, in our analysis, we consider high relevance flow attributes - source IP,
source port, destination IP and destination port.
6
Table 2.2: Comparison between payload and NetFlow
Attributes Payload NetFlow
Computation Cost High Low
Storage Overhead High Low
Encryption Fails Success
Privacy Issues Diffcult Easy
Packet vs NetFlow
We provide an evaluation of advantages and disadvantages of payload and NetFlow methods
as follows:
Processing at network speed for payload methods is high due to system memory bus bot-
tlenecks. With increase in volume of packets, there is risk to packet loss at high utilizations of
a monitored link. In NetFlow, header information of all packets communicated between two
hosts accurately depicts IP traffic characteristics.
Storage requirements increases significantly for payload methods if packet size increases.
On the other hand, flow sampling could reduce storage overhead.
Encryption causes signature matching in payload methods to fail. Alternatively, header
information remains unencrypted hence flow statistics can be computed.
Privacy Issues, due to RIAA litigations, has made providers reluctant to allow payload
analysis. Flow methodology provides an better solution to many privacy and legal issues rised
due to deep packet inspection.
2.2 Related Work
A number of techniques are used for accurate network application classification using port infor-
mation, transport layer attributes, machine learning alogrithms or statistical models. Figure-
2.1 has shown the hierarchy of the various classification methods. We will briefly explain each
technique in the following sections.
2.2.1 Port-based Classification
IANA assigns port numbers for Internet applications for performing their operations. Port
based classification inspects the port numbers of the traffic to classify applications. Many
applications deviate from standard port numbers to avoid detection and bypass firewall. Due
to rise of peer-to-peer applications, dynamic port allocation does provide accurate classification
results.
7
Traffic Classification
Methods
Transport Machine
Port Payload layer
Graph Statistical
Learning
Unsupervised Supervised
Learning Learning
Figure 2.1: Various Network Traffic Classification Methods
For byte based accuracy, the accuray was observed to be no better than 70% . For flow
accuracy, the value of accuracy fell to 30-50% . Hence, port numbers has been less reliable due
to deviatation from conventional port numbers [14].
2.2.2 Payload-based Classification
Payload-based classification identifies charateristics bit strings in packet payload that depict
controlling signals for a particular application. This method are applied majorly for peer-to-
peer application and intrusion detection traffic due to their variant nature. Signatures are
derived from frequent bit patterns, string position, fixed or variable offset and even behavioral
signatures [15].
For peer-to-peer applications-, the overall accuracy is as high as 97.39%. Moreover, combi-
nation of port and payload based classification increases accuracy to almost 79% [18]. Although
payload classification provide more reliable and accurate results than port based classification,
it imposes processing and storage overhead constraints. The generation of application signa-
tures is a tedious process with intense string searching and pattern matching. Signatures needs
to updated frequently as new trends in the applications are evolved. Hence, payload classifica-
tion does provide considerable accuracy but at the cost of processing, storage and computation
constraints.
8
2.2.3 Transport-layer based classification
Port-based had limitations of reliability and low accuracy whereas payload-based had large
processing and storage requirements. To overcome these limitations, transport-layer heuristics
identifies unique behavior patterns of applications during transferring and signalling phase.
They utilize only transport layer header information for flow traffic. From Figure-2.2 reproduced
from [12], we can see the interaction between hosts at various social, functional and application
level defines heuristics for classification.
Figure 2.2: Application behavior patterns based on transport layer attributes
From [12], we can observe that there are more that 90% flows with high accuracy values of
95%. Futhermore, both accuracy and completeness exceeds 80%. It overcomes limitations of
payload and port based methods by utilizing flow based and behavior pattern recognition. But
this technique is capable of identifying types of applications such as peer-to-peer, ftp, mail and
9
not distinct applications such as eMule, BitTorrent. In addition, the behavior of hosts are more
prominent at the edge of the network and hence limiting its use in the backbone networks.
2.2.4 Machine Learning
Machine learning (ML) techniques has two phases- training and testing. In the training phase,
the heuristics is built for the classification model. The training dataset analyzes characterisitics
of flow datasets to build the classification model. The model is then subjected to test dataset
for classification of applications.
Figure 2.3: Application behavior patterns based on transport layer attributes
Figure-2.3, reproduced from [16], provides us understanding of network traffic classification
by ML-based model. The ML-based classification model working can be illustrated as follows
[16]:
1. Real Time Traffic Capture − Network traffic is mirrored to a particular port and then
collected in a memory unit. The collection can take place at backbone of the network
or at the edge of the network. The point of probe for network identification is crucial in
10
some classification methods. For instance, classification based on transport layer attribute
determinng host behaviors become more effective at the edge of the network.
2. Flow Statistics Processing − The captured packets are converted to flows. Various
statistics related to flows such as mean packet inter-arrival time, medialn packet length
or flow duration are calculated. NetFlow becomes a convenient tool for capturing and
processing network traffic in such class model.
3. Data Sampling − There are large data sets captured for a particular session of captured
traffic trace. Processing and memory power required for these operation is extremely high.
Data sampling provides us means to narrow down searches in large datasets without losing
valuable information.
4. Training Process − A fraction of dataset is used for training the classification algorithm.
The training datasets helps to understand characteristics of flow traffic and enhances
classification for testing dataset. Training data set continously updates the model in real
time to keep itself updated with changing behavior patterns.
5. Classification Model − Classification model consist of various clustering, data mining,
and information heuristics techniques. Classification learning involes machine learning
from training phase to build a set of classification rules to classify unseen traffic. During
training, behavior patterns are analyzed in various application classes. During testing,
the traffic flow matching such patterns and charateristics are classified accordingly.
6. Classified Output − Flows are labeled with application class according to the classi-
fication model. There can be clustering of flow sets with similar characteristics. Any
association between feature can also be presented with a probabilistic estimatation.
There are two types of ML techniques − Supervised learning and Unsupervised learning.
Supervised learning requires learning phase to map the ML classes to application. For this
reason, this ML is more popular for application identification for a particular group of class.
Unsupervised learning relies on automatic discovery of classes through clustering of datasets.
The clusters needs to be labeled to a particular application through direct inspection by a
human expert [16].
There are number of algorithms such as EM algorithms, Naive Bayes, k-NN used for machine
learning technique of network traffic classification. We observe high accuracy of 81.51 % for
unsupervised classifier using Naive Bayes. Moreover, there was higher accuracy of 90.51%
achieved for AutoClass unsupervised learning technique. Due to use of flow statistics, storage
and processing overheads is low. The limitation of encryption and privacy issues is overcomed
because only packet header attributes are utilized. Futhermore, new patterns or classes are
11
recognized due to unsupervised clustering techniques. But there is need of human intervention
for accurately mapping groups clusters to application class [6].
2.2.5 Graph
In [10], network traffic is analyzed in an innovative by looking at flows in form of Traffic Disper-
sion Graphs (TDG). The graph based peer-to-peer detection (Graphtion) causes partitioning
traffic flows into clusters depending on flow attributes. Figure-2.4, reproduced from [10], shows
the graph-based interaction of various hosts for client-server for HTTP application.
Figure 2.4: Graph-based classification for HTTP-TG application
In this method, there is around 90% precision and recall for peer-to-peer detection. For a
variant traffic or uniform traffic mixtures, the number of clusters(k) is different. The limitation
12
of the method is that there is estimation of number of clusters required for K-Mean algorithm.
2.2.6 Statistical Classification
The basis of this method is that applications has unique statistical characteristics such as bytes,
duration, packet interarrival times, flow duration. These attributes help distinguish different
applications by finding relationship between traffic class and its observed statistical properties
[16]. As noted in [19], the connection characteristics are analysed for constructing empherical
models using bytes, duration, arrival periodicity for a number of TCP connection.
The statistical methods observe accuracy values of 80-90% for internet chat systems [4]. But
there are limitation due to its dependency on link utilization and network performance. Critical
attributes as distribution of interarrival times are heavily dependent on link utilization and
speed. There is a need to understand traffic patterns for large datasets and multidimensional
spaces of flows and packet attributes.
13
Chapter 3
Classification Model
In this chapter, we are going to understand the classification model for achieving accurate flow
classification by using clustering and associative rule mining techniques. The main motive is
to combine unsupervised learning techniques with associative rule mining to achieve finer and
more accurate flow classification model.
3.1 Classification Model
The main part of the algorithm consists of unsupervised learning techniques with associative
rule mining, to achieve finer accurate flow classification. The methodology that was followed
is depicted in Fig.3.1 and was determined such that its implementation is similar to the one
developed in [16].
3.1.1 Processing and Flow Data
Processing
Libtrace is used to process trace file in legacy ATM format to pcap format [5]. In order to
convert the pcap traces to flows, we used conversation option of tshark tool [23]. The output
flows are conversation between two hosts with IP, port, duration and packet size information.
The flow output from tshark is parsed and stored in the database. The following parameters
attributes of the flow are stored in the database.
1. Source IP
2. Source Port
3. Destination IP
4. Destination Port
14
Machine Learning
Classification Process
Association
Rule
Classification
Processing
Transductive Classified
&
Classification Output
Flow Data
Clustering
Figure 3.1: Classification Model
Table 3.1: Distribution of application class in traces
Application Class No. of Flows Flow Proportion Port Number
HTTP 24568 9.3% 80, 443
DNS 9284 3.5% 53
SMTP 2197 0.83% 25
MAIL 1464 0.55% 110
IRC 553 0.21% 113
5. Source Byte and Source Packets
6. Destination Byte and Destination Packets
7. Total Byte and Total Packets
8. Protocol
Trace Statistics and Methodology
Port-based classification is used for defining application class of a particular flow. Since the
traces are captured for year 2001, peer-to-peer applications were not prevalent leading to well
behaved network traffic with standard port numbers. We summarize traffic class distribution
in our traces as follows.
As the classes are not evenly distributed, the training sets will not present accurate results
due to large HTTP traffic. Hence we randomly select 100 flows from each class to form train
15
data set for the classification model. This allows a fair training for the classification model with
no redundancy and equal application class distribution.
3.1.2 Clustering
We implement unsupervised learning techniques for classification. Due to ever changing appli-
cation patterns, we choice clustering technique to discover natural clusters in the data using
internalized heuristics. They define cluster flows with similar patterns or properties according
to evaluation metrics such as Euclidean distance, entropy or Bayesian information number.
There are two clustering methods evaluated in this work - K-Mean and Model-based clustering
methods.
K-Mean Clustering
K-Mean Cluster method uses attributes of datasets to cluster objects in to k-partitions. The
sum of squared distances from each data point to cluster center is minimized iteratively. The
squared error function is given as,
k
E= | xj − µi |2
i=1 xj ∈Si
where there are k clusters Si , i = 1,2,...,k and µi is mean value of all the points xj ∈ Si .
The algorithm randomly partitions dataset into k-clusters. For each k clusters, the mean
values is calculated. The intra-cluster variance or squared error function is calculated for each
data with respect to all mean values. The clusters are recomputed in a way to reduce square
error value iteratively. The algorithm converges when data point no longer switch clusters.
In our work, R function kmean is used for k-means analysis. There are two arguments of
the functions - number of clusters (k) and number of iteration (iter.max). After number of
iteration, we were able to empirically conclude the value of iteration to be 10. The value of k
was increased to understand the growth in accuracy, precision and recall of each application
class.
K-Mean method is extremely fast and with quick convergence. But there is need to estimate
number of clusters (k). The algorithm is more effective for spherical clusters naturally present
in the dataset. Also, due to random selection of clusters, the results will not be same for each
run of K-Mean analysis.
16
Model-based Clustering
Model-based clustering is used for grouping flow data and based on a strong framework to
derive knowledge from data. This clustering technique is built on an assumption that data were
generated by a particular model and tries to find the original model from data. With number
data model recovered, the number of clusters and its characteristics can retrieved. In general
model-based clustering, the model parameter is selected by maximum likelihood criterion that
maximizes the log-likelihood of generating data. Model-based clustering is more flexible. The
adaptive nature is advantageous for various distributions of data such as Bernoulli, Gaussian
with non-spherical variance, or member of different family [7].
In MCLUST package, Bayesian Information Criterion (BIC) provide comprehensive strate-
gies for parameterization and estimation of number of clusters. Maximum Likelihood is used as
a criterion for estimating the model. The prediction of number of clusters is achieved by using
Bayesian Information Criterion.
A normal or Gaussian mixture model is assumed,
n G
τk φk (xi |µk , Σk )
i=1 k=1
where,
x = data
G = number of components
−π
φk (x|µk , Σk ) = (2π) 2 exp{− 1 (xi − µk )T Σ− 1(xi − µk )}
2 k
τk = probability that an observation belongs to the kth component(τk ≥ 0; sumG τk = 1)
k=1
n = number of data points
Σ = covariance matrix
The classification likelihood with parameterized normal distribution assumed for each class:
n
φli (xi |µli , Σli )
i=1
where li = labels indication a unique classification of each observation; li = k if xi belongs
to the kth component.
Ellipsoidal distribution centered at means µk is observed for both component and cluster
models. Geometric features are determined by covariances Σk . Each covariance matrix is
paratermized by eigen value decomposition in the form
T
Σk = λk dk Ak Dk
17
Table 3.2: Parameterization of Covariance Matrix in Mclust Package
Identifier Model HC EM Distribution Volume Shape Orientation
√ √
E (univariate) equal
√ √
V (univariate) variable
√ √
EII λI Spherical equal equal NA
√ √
VII λk I Spherical equal equal NA
√ √
EEI λA Spherical variable equal NA
√
VEI λk A Diagonal equal equal coordinate axes
√
EVI λAk Diagonal equal variable coordinate axes
√
VVI λk Ak Diagonal variable variable coordinate axes
√ √
EEE λDADT Ellipsoidal equal equal equal
T √
EEV λDk ADk Ellipsoidal variable equal variable
T √
VEV λk Dk ADk Ellipsoidal variable equal variable
T √ √
VVV λk Dk Ak Dk Ellipsoidal variable variable variable
where,
Dk = orthogonal matrix of eigenvectors
Ak = diagonal matrix whose elements are proportional to the eigenvalues of Σk
λk = scalar
In the model, principal component orientation for Σk depends on Dk . Also, the shape of
the density contours is determined by Ak . Volume of the corresponding ellipsoid is specified by
λk m which is proportional to λd | Ak | where d represents data dimension. The characteristics
k
of the geometric features − orientation, volume and shape are estimated from data. These can
vary between clusters, or restricted to the same for all clusters. In one dimension dataset, equal
variance and varying variance is represented by E and V respectively. For multiple dimensions,
geometric characteristics of the model such as volume, shape and orientation is taken into
consideration. We can see from Table-3.2 that there are 9 models available in the Mclust.
The volume, shape and orientation represents clustering of flow data. In our case, VEV
model represents volumes of all clusters as varying (V), shapes of all clusters as equal (E) and
orientation of all clusters as varying (V). BIC is an approximation to the Bayes factor. It adds
a penalty term on the log-likelihood depending on number of parameters.
BIC(k, Σ) ≡ 2 ∗ p(x|Θ, k, Σ) − v(k, Σ) ∗ log(n)
where, k is the number of clusters, p is the conditional probability; π stands for different
variance-covariance structures and v(k, Σ) is the number of free parameters in a model with k
clusters and covariance structure Σ; Θ represents maximum likelihood estimate of the parame-
ters in the model (k, Θ) and n is the number of observations. BIC for each model is evaluated,
18
and model with the highest BIC is the best model for the flow dataset.
3.1.3 Transductive Classification
Due to clustering, we have grouped data sets with similar properties. These clusters need to
map to application class using a function or algorithm. Transductive classifier can be applied
to any clustering technique for classification. Consider train set is represented by X m . There
are l number of classes already labeled for the train set where i = 1,2,3, .. ,l. Now, X n is
test set in the model. We feed complete train and test set X m+n to clustering method. The
clustering methods partitions in to k-subsets. To find the class of a particular cluster, we need
to find most common labels in that cluster according to train set labels. After clustering, the
transductive classifier algorithm can be built as follows,
• Label cluster with class i if the most common label in the cluster is i, i ∈ 1, 2, .. l.
• Select uniformly at random any value of i for tie between two or more labels in a cluster
or when no labels present.
The transductive classifier searches for labels from training datasets in the cluster. It labels
application class corresponding to the most common label in the cluster. The remaining data
points have high probability of having similar properties as they are clustered together depend-
ing on algorithm heuristics. Thus, transductive classifier maps application class to clusters
depending on train dataset labels.
3.1.4 Association-Rule Classification
Association Rule Classification unit provides a second level of heuristics to achieve finer clas-
sification. Association rules find regularities between flow parameters with different measures
of interestingness for applications from transductive classifier output. Apriori Algorithm is
used for association rule learning [1]. The derived rules are traced back in the main trace and
identified flows attain high accuracy, as will be shown in the next section. Moreover, the rule
association also helps predict IPs and ports used for servicing an application in the future. The
rules heuristics applied to flow data causes accurate classification and thus making classification
method finer due to association rule mining techniques.
The association rule mining technique help we derive valuable relations within the data
points. In our case, such association will help us recognize pattern in a particular application.
The goal of the data mining and classification models is to build a heuristic for predictive
recognizing the occurrence of application class from the data points.
The association rules composes of two item sets called an antecedent and consequent. An-
tecedent is the preceding event and consequent is an event associated and followed after an-
19
tecedent. In other words, an event occurring due to antecedent is followed by the consequent
is depicted by the association rules for a particular class. The rules are represented as,
Antecedent => Consequent
Each rule is associated with three parameters, [1].
1. Support is the percentage of transactions that the rule can be applied to (the percentage
of transactions, in which it is correct).
2. Confidence is the number of cases in which the rule is correct relative to the number of
cases in which it is applicable (and thus is equivalent to an estimate of the conditional
probability of the consequent of the rule given its antecedent).
3. Lift is the ratio of the probability that antecedent and consequent occur together, to the
multiple of the two individual probabilities for antecedent and consequent. There are
conditions where both support and confidence is high, and still result in to invalid rule.
Therefore Lift indicates the strength of a rule over random occurrence of antecedent and
consequent, given their individual support. It provides information about improvement
and increase in probability of consequent for a given antecedent. In the other words, Lift
is given as,
Rules with lower lift and confidence values are filtered out. Confidence is measured for
certainty of the rule. It measures event’s item set that matches antecedent of the implication
in the association rule and also matches consequent. The association rule indicates an affinity
between antecedent and consequent with evaluation parameters such as Support, Confidence
and Lift.
Apriori Algorithm
Apriori Algorithm are based on the concept of prefix tree. There are two steps for this algorithm
− frequent item sets and rule generation. In frequent item sets, the set of items maintaining
least support are filtered. In the second phase, there will be rule generation from the computed
frequent item set. Apriori gains high efficiency by observing no superset of an infrequent item
set.
Algorithm 1 has Apriori property stating that any subset of frequent item set must be
frequent. Join operation has a set of candidate k-item sets by joining Lk+1 with itself to find
Lk . Lastly, frequent item set are sets of item with minimum support. This represented by Li
for i-item set.
20
Algorithm 1 Apriori Algorithm
Ck : Candidate item set of size k
Lk : frequent item set of size k
Lk = {frequent item s}
for k = 1; Lk = φ ; k++ do
Ck+1 = candidates generated from Lk
for transaction t in database do
Increment count of all candidates in Ck+1 that are contained in t
Lk+1 = candidates in Ck+1 with minimum support
end forreturn ∪k Lk
end for
21
Chapter 4
Analysis and Results
In this chapter, we evaluate our classification model for two clustering methods - K-Mean and
Model-based clustering with second level of classification heuristics using Apriori algorithm.
We perform cluster analysis and application profiling for understanding behavior patterns with
probabilistic prediction for each classified flow.
4.1 Traces
Waikato Internet Traffic Storage (WITS) is an open community for collecting data traces for
traffic measurement and identification research. Auckland VI traces for 2001-06-11 are used
for analysis and testing of classification model. TCP, UDP and ICMP traffic and packets are
zeroed for length more 64 bytes. Anonymised IP are having one to one mapping for IP in the
range 10.0.0.0/8. The direction of the traffic coming towards the University Network [2].
Community Resource for Archiving Wireless Data At Dartmouth (CRAWDAD) provide us
with wireless traces for research community. Packet headers from every wireless packet sniffed
in four of the campus buildings. The buildings were among the most popular wireless locations
in 2001, and included libraries, dormitories, academic departments and social areas. The MAC
address were sanitized by randomizing bottom six hex digits. IP address are anomalized using
prefix-preserving IP sanitizer as described in [25]. In this technique, the two IP addresses
sharing the same k-bit prefix result in to a sanitized address with same k-bit prefix [13].
4.2 Classification Metrics
The classification model is evaluated by the conventional machine learning metrics such as
Precision, Recall and Accuracy [6]. The data set from classification algorithms can be placed
in combination of the following categories - Positive and Negative. The following metrics are
22
Table 4.1: Evaluation Metrics for Machine Learning Classification Techniques
Classified as X ¯
X
X TP FN
¯
X FP TN
used as evaluation parameters for machine learning classification algorithms.
Table-4.1 explains relationship of True Positive, False Positive, False Negative and True
¯
Negative. In rows, X is represented as positive presence and X is represented as negative
¯
presence. In columns, X is represented as True value and X is represented as False value.
1. True Positive(TP): Total percentage of members (in our case flows) classified as Class A
belong to Class A.
2. False Positive(FP): Total percentage of members of Class A but does not belong to Class
A.
3. False Negative(FN): Total percentage of members of Class A incorrectly classified as not
belonging to Class A.
4. True Negative(TN): Total percentage of members which do not belong to Class A are
classified not a part of Class A. It can also be given as (100% - FP).
These metrics will help us form the overall accuracy, precision and recall of the classification
model. We define these terms as follows:
Precision is given as the ratio of True Positives to True and False Positives. This value
helps us understand the classification capability to identify objects correctly.
T rueP ositive
P recision =
T rueP ositive + F alseP ositive
23
Recall is given as the ratio of True Positives to True and False Negatives. This value helps us
understand the classification capability to determine misclassified members are something else.
T rueP ositive
Recall =
T rueP ositive + F alseN egative
Overall Accuracy is given as the ratio of all True Positives to all True and False Positives for
all classes. n
i=1 (T rueP ositive)
OverallAccuracy = n
i=1 (T rueP ositive + F alseP ositive)
4.3 Analysis and Results
We have utilized two clustering algorithms- Kmean and Model-based clustering algorithms. For
Kmean clustering, k indicates the number of clusters and varied from 50 to 500 with a step
of 50. For Model-based classification, there is no need estimation of number of clusters in the
dataset. Each algorithm clusters are passed through transductive classifier and finally precision,
recall and accuracy is calculated and plotted. The following section will have analysis of each
algorithm with their respective performance graphs.
4.3.1 K-Mean Clustering
Precision Graph
The values of precision increases considerably as number of cluster increases as seen fromFig-
ure 4.1. Value of precision is in the vicinity of 60% for all applications for K 400 for all application
class.
Accuracy Graph
KMean Transductive Classifier Accuracy Graph
100
90
80
70
60
Accuracy (%)
50 `
40
30
20
10
0
50 60 70 100 150 200
250 300 350 400 450 500
Cluster Size (k)
Figure 4.3: K-Mean Transductive Classifier Accuracy Graph for WAND Traces
From Figure 4.3, the overall accuracy is given by the aggregate of precision for all the
application class. We observe high accuracy of 92% for k= 500 and low accuracy of 64% for
k=50. As the have been growth of precision for all the classes, this has been well reflected with
high overall accuracy.
The growth in the precision and recall of the cluster is due to fine segmentation of the
flow clusters due to increase in cluster number. As there is increase in number of clusters, the
K-Mean iteration for each k cluster are more tightly clustered by reducing within-cluster sum
of squares. This results in to clusters of particular application class with unwanted flows from
other class being eliminated. We can see very evidently with increase in precision for all the
application class.
There is need to estimate number of clusters(k) for each clustering run. As we know same
26
application class observe different behavior, the value of k cannot simply be equal to number of
application class. Hence it is difficult to estimate k value for each new dataset as well as maintain
high accuracy and precision. Moreover, large flow sets will have limitation of processing power
due to high iteration required for accurate results.
4.3.2 Model-based Clustering
Figure 4.4: Bayesian Information Criterion distribution for parameterization
Mclust uses Bayesian Information Criterion to estimate models and number of clusters.
From Fig.4.4, we can see that BIC value for each of the nine models is calculated. The model
with the highest value will be chosen as the best estimate of the input data. In our case,
BIC is strongest for number of components equal to 9 corresponding to VEV model. VEV
model represents volumes of all clusters as varying (V), shapes of all clusters as equal (E) and
orientation of all clusters as varying (V). BIC is an approximation to the Bayes factor. It adds
a penalty term on the log-likelihood depending on number of parameters. The best model
estimated is VEV model. VEV indicates volume and orientation is variable (V) and shape is
equal (E) resulting into Ellipsoidal distribution. The number of clusters is very close to number
of applications, as opposed to K-Mean technique.
27
Flow sets have source and destination IP and port numbers and hence can be described as
multidimensional dataset. Hence , we can observe that geometric features come in to consid-
eration for data model. From the model, we can estimate that flow points will be ellipsoidally
placed in space.
Precision Graph
Modelbased Transductive Classifier Precision Graph
100
90
80
70
60
Precision (%)
50 `
40
30
20
10
0
http mail smtp dns irc
Application
Figure 4.5: Model-based Transductive Classifier Precision Graph for WAND Traces
By narrowing down a particular data model for test data, Mclust accurately estimates the
clustering parameters using BIC . VEV indicates volume and orientation is variable(V) and
shape is equal(E) with Ellipsoidal distribution.
Model-based hierarchical clustering is used for this particular ellipsoidal distributed data
model. Each flow behaves as an individual cluster and pairs are merged as it progresses up
the hierarchy. This is bottom-top approach as individual flows are considered as initialization
point. Thus, flows with similar properties are clustered together as it moves up the hierarchy.
As compared to K-Mean, this clustering algorithm does not give considerable precision
especially for Mail and DNS. The lowest value of 50% for DNS is observed however we observe
28
strong precision for 100% for HTTP, IRC and SMTP flows.
Recall Graph
Modelbased Transductive Classifier Recall Graph
100
90
80
70
60
Recall (%)
50 `
40
30
20
10
0
http mail smtp dns irc
Application
Figure 4.6: Model-based Transductive Classifier Recall Graph for WAND Traces
False positive rejection has been prominent for SMTP application. SMTP applications are
used for mail relaying, user-client application and delivery only protocol. Hence, SMTP flows
are communicating with mail server on set of port numbers resulting into strong clustering.
This can be reinforced due to high precision value for SMTP. The recall is consistent for HTTP,
DNS, IRC, Mail The rejection is strong of order of 80% for most the applications.
Overall Accuracy
Cluster Size Overall Accuracy
9.0 81.5
The overall accuracy is higher for this particular clustering classifier. As the number of
clusters is estimated, we are able to narrow down clusters to a deterministic vale of 9.0 and
hence we achieve good accuracy of 81.5%.
29
Table 4.2: Precision & Recall for Tranductive K-Mean and Model-based classification
Application K-Mean Precision Model Precision K-Mean Recall Model Recall
http 79.58 100 65.63 81.5
mail 86.68 73.02 97.5 78.5
smtp 76.1 100 76.1 100
dns 76.12 55.28 84.67 78.5
irc 75.42 100 75.88 69
The package has features of combining model-based hierarchical clustering, expectation
and maximization for Gaussian mixture models with parameter estimation using BIC. The
probability density function is calculated for observation belonging to each class. A covariance
matrix (i,j) is derived with probabilistic value of ith flow belonging to the jth class. Hence, due
to the data modeling and parameter estimation, the clustering algorithm converge to a finer
flow classification [7]. Moreover, the number of cluster is predicted by BIC and is also very
close to the actual number of application class.
The gain in accuracy is because model-based clustering techniques assumes data belongs to
a particular model. There is an accurate estimation of model leading enhanced clustering. As
opposed to K-Mean, there is no need of human intervention for predicting cluster size for each
flow data.
4.3.3 Rule based Classification
Fig.4.7 shows accuracy for K-Mean clustering followed by Apriori Algorithm. There are weak
association in K-Mean clusters because there is fall in accuracy for all application class.
SMTP observed lower accuracy for all the range of cluster size. SMTP have user-client
application and messaging services running for a particular host. The messaging services varies
in destination port numbers. Moreover, relaying of services for mail services results in varied
host IPs. K-Mean clustering does not accurately converge to the mean values for this application
class. One of the reason is that mean values are randomly selected for initialization of the
algorithm.
On the other hand, HTTP and DNS have gained high accuracy due to client-server protocol
behavior. HTTP servers and DNS servers are set of host application that are constantly being
communicated by user hosts. These IP are clustered together due to few servers to many client
communication patterns.
The goal of the second level of association rule classification is to identify flows and segregate
them from flows of other application class. K-Mean clustering does provide an accurate estimate
of the flow data corresponding to their application class. Hence, the association for K-Mean
30
KMean and Rulebased Classification Accuracy Graph
http mail smtp dns irc
100
90
80
70
Accuracy (%)
60
50
40
30
20
10
0
250 300 350 400 450 500
Cluster Size
Figure 4.7: K-Mean and Rule-based Classification Accuracy Graph for WAND Traces
clusters map to flows with other application class indicating poor false positive rejection.
Fig.4.8 refers to model-based clustering with apriori algorithm for association rule classifi-
cation. Due to accurate model estimation capability, the nature of the cluster is predicted with
Bayesian Information Criterion. The estimation leads to appropriate choice of distribution,
leading to an enhanced clustering result.
Association rule based on flow attributes identify accurately flows having similar application
behavior. Due to strong data evaluation capability of Model-based clustering, the association
rule heuristics have high confidence value for application class.
We tested our classification model with other traces to nullify dependency of particular
network or data traces. Crawdad data traces does observe accuracy on similar range as WAND
traces. WAND traces average accuracy comes up to 94% and Crawdad comes around 93%. As
the two values are very comparable, we achieved an accurate classification model with clustering
and rule association mining.
31
Modelbased and Rulebased Classification Accuracy Graph
WAND Crawdad
100
90
80
70
60
Accuracy (%)
`
50
40
30
20
10
0
http mail snmp
dns irc
Application
Figure 4.8: Model-based clustering and Rule based classification Accuracy for WAND and
Crawdad traces
Table 4.3: Accuracy for Association Rule classifier based on K-Mean and Model-based cluster-
ing for WAND and CRAWDAD Traces
Application K-Mean Accuracy WAND Trace CRAWDAD Trace
http 61.48 100 100
mail 42.94 94.05 87.82
smtp 11.28 89.34 94.87
dns 61.67 92.11 88.15
irc 70.5 96.74 95.26
32
Figure 4.9: Distribution of flow attributes in scatter plot matrix
4.4 Cluster Analysis
In Fig.4.9, the distribution of IP and ports for source and destination is depicted for Model-
based clustering method. The graph has diagonal blocks mentioned with the parameters in
that row and column. In order to consider source and destination IP distribution, we need to
go at plot interesting source IP row/column and destination IP row/column.
• For HTTP (blue) application, large number of source IPs direct traffic too few destination
IPs. HTTP behavior is prominent because a number of source hosts are communicating
with small number of destination hosts i.e. HTTP servers.
• In DNS application, source hosts communicate with a set of DNS servers. Each host
interacts with different domains such as www, .org and .net for resolving IP addresses.
The DNS behavior is prominent because large group of source hosts communicate with
small repeated set of destination IPs.
• In IRC application, a group of hosts communicates with other group of hosts. There is
roughly one-to-one correspondence between source and destination IP. This behavior is
prominent because IRC (cross) traffic has set of source IPs communicate with approxi-
mately equal number of destination IPs.
33
• Moreover, source IP and destination port distribution show that destination port numbers
are in the range of 30000 for SMTP (green) traffic. DNS destination ports are randomly
distributed where as HTTP and MAIL have destination ports number in lower range of
1000.
4.5 Behavior Pattern for Application
The association rule mining technique help we derive valuable relations within the data points.
In our case, such association will help us recognize pattern in a particular application. The
goal data mining and classification models is build a heuristics for predictive for recognizing
the occurrence of application class from the data points.
Rules with lower lift and confidence values are filtered out. Confidence is measured for
certainty of the rule. It measures event’s item set that matches antecedent of the implication in
the association rule and also matches consequent. Table I clearly mentions flow parameters in
antecedent and consequent format for building association rules. The association rule indicates
an affinity between antecedent and consequent with evaluation parameters such as Support,
Confidence and Lift.
For example, for HTTP, destination IP with index 1 and destination port number equal to
1273 will have a flow from source IP of index 233 with confidence (noted as C) of 1 and lift
of 81.5. When we trace back this flow, we are accurately able to classify this flow as HTTP.
Moreover, in the future, if we observe any particular flow with this association then we would
be certain to classify it as HTTP with confidence value of 1.
We can observe that each application are profiled according to their respective behavior.
Every rule confidence value tells us that flow with this particular association will be belonging
to that corresponding application class. Thus, we understand the following for each application
behavior profiling: It is used for resolving host names with IP address. Domain Server services
queries for any particular host. Hence, we observe centralized architecture with client-server
communication pattern. In Table-4.4, we can clearly observe that source and destination IPs
are associated with each other. There are host pairs (200,46), (187,32) and (19,69) representing
communication between host and domain server for DNS traffic. Various user messages are ser-
viced by using mail application. There is a mail server servicing, routing and relaying messages
between two parties. The communication behavior of relaying and messaging is prominently
observed in Mail and SMTP application. In IRC application, different destination port number
with set IPs communicating with each other is very well depicted.
Hence, we are able to gain higher accuracy by using apriori algorithm after transductive
classifier. The design helps us find valuable association rules for classified data with estimation
of confidence, lift and support.
34
Table 4.4: Association Rules from Apriori Association Algorithm
Traffic Rules Confidence Lift
DNS dip=16 => dport=56322 1 21.8
dport=4207 => dip=43 1 17.7
dstip=19 => srcip=69 0.02 35.5
srcip=69 => dstip=19 0.02 35.5
srcip=200 => dstip=46 0.1 9.7
dstip=46 => srcip=200 0.1 9.7
srcip=187 => dstip=32 0.2 3.5
dstip=32 => srcip=187 0.2 3.5
Mail dip=8 => dport=53 1 6.7
dip=16 => dport=53 1 6.7
srcip=226 => dstip=49 1 53.75
dstip=49 => srcip=226 1 53.75
dstip=8 => dstport=53 1 6.7
srcip=200 => dstip=51 1 4.57
dstip=51 => srcip=200 1 4.57
srcip=95 => dstip=30 0.4 8.95
dstip=13 => dstport=53 0.4 6.71
HTTP dip=1 & dport=1273 => srcip=233 1 81.5
{} => dip=1 0.7 1
srcip=139 => dstip=1 1 1.4
srcip=9 => dstip=1 1 1.4
srcip=41 => dstip=1 1 1.4
dstport=2006 => dstip=1 1 1.4
dstip=41 => srcip=164 40.75
srcip=233 & dstip=1 => dstport=1273 1 81.5
IRC sip=162 => dport=37273 1 69
{} => dip=16 0.7 1
dstip=3 & dstport=39938 => srcip=18 1 46
srcip=168 & dstport=37321 => dstip=3 1 4.4
srcip=168 & dstip=3 => dstport=37321 1 69
dstip=3 & dstport=37321 => srcip=168 1 69
srcip=162 & dstport=37273 => dstip=16 1 1.3
srcip=162 & dstip=16 => dstport=37273 1 69
SMTP sip=262 & dport=4868 => dip=28 1 2.8
dip=28 & dport=4868 => sip=262 1 15.3
srcip=196 => dstip=3 1 5.7
dstport=4706 => srcip=261 1 22.2
dstport=4706 => dstip=28 1 2.8
srcip=277 => dstip=43 1 7.1
srcip=241 => dstip=3 1 5.7
srcip=220 => dstip=43 1 7.1
35
Chapter 5
Conclusion
5.1 Summary of Results
In this work, we have presented a classification model that achieves high flow classification
accuracy with application behavior profiling. Association rules derived from transductive clas-
sification are fed back to refine classification model. This leads to a finer classification model
with much better accuracy. The rule heuristics are produced without any human intervention
and become independent of dataset. In addition, our model is able to detect new behavior
patterns for next generation applications. We summarize our results in the following points:
• We built two level classification models with unsupervised learning techniques. Clustering
transductive classification was used for first level and association rule mining for second
level of classification. Model-based clustering and Apriori Algorithm form basis of the
classification model.
• We achieve accuracy value of around 94%. The precision value for all application class
remains high, and independent of trace sets.
• Each application has behavior profiling due to use of Apriori Algorithm. The behavior
patterns tells us various flow attributes so that we can identify and classify flows in the
future.
• Each association rule is providing with confidence value to gain accurate results. More-
over, support provides information of the number of flows present in the dataset. In
addition, Lift defines strength of the relation between antecedent and consequent.
• Due to utilization of flow parameters, the processing and computational power is less.
Large flow data sets can be analyzed with accurate results and application behavior
patterns.
36
5.2 Future Work
In the future work, we are planning to capture trace files for reviewing our classification model.
We plan to analyze wide variety of applications with different behavior patterns and connection
norms. With accurate classification, we want to build a prediction model for application class.
This will enhance the capability of classifying as well as monitoring of particular traffic in the
future.
37
REFERENCES
[1] C. Borgelt and R. Kruse. Induction of association rules: Apriori implementation. In Comp-
stat: Proceedings in Computational Statistics: 15th Symposium Held in Berlin, Germany,
2002, page 395. Physica Verlag, 2002.
[2] WAND Trace Catalogue. http://www.wand.net.nz/wits/catalogue.php.
[3] IOS Cisco. Netflow white papers, 2006.
[4] C. Dewes, A. Wichmann, and A. Feldmann. An analysis of Internet chat systems. In
Proceedings of the 3rd ACM SIGCOMM conference on Internet measurement, pages 51–
64. ACM, 2003.
[5] Libtrace Documentation. http://www.wand.net.nz/trac/libtrace.
[6] J. Erman, A. Mahanti, and M. Arlitt. Internet traffic identification using machine learning.
In Proceedings of IEEE GlobeCom. Citeseer, 2006.
[7] C. Fraley and A.E. Raftery. MCLUST version 3 for R: Normal mixture modeling and
model-based clustering. Technical report, Citeseer, 2006.
[8] P. Haffner, S. Sen, O. Spatscheck, and D. Wang. ACAS: automated construction of ap-
plication signatures. In Proceedings of the 2005 ACM SIGCOMM workshop on Mining
network data, page 202. ACM, 2005.
[9] IANA. Internet Assigned Numbers Authority (IANA).
http://www.iana.org/assignments/port-numbers.
[10] M. Iliofotou, H. Kim, M. Faloutsos, M. Mitzenmacher, P. Pappu, and G. Varghese. Graph-
based P2P traffic classification at the internet backbone. In IEEE Global Internet Sympo-
sium. Citeseer, 2009.
[11] T. Karagiannis, A. Broido, and M. Faloutsos. Transport layer identification of P2P traffic.
In Proceedings of the 4th ACM SIGCOMM conference on Internet measurement, pages
121–134. ACM, 2004.
[12] T. Karagiannis, K. Papagiannaki, and M. Faloutsos. BLINC: multilevel traffic classification
in the dark. ACM SIGCOMM Computer Communication Review, 35(4):240, 2005.
[13] David Kotz, Tristan Henderson, Ilya Abyzov, and Jihwang Yeo. CRAW-
DAD trace dartmouth/campus/tcpdump/fall01 (v. 2004-11-09). Downloaded from
http://crawdad.cs.dartmouth.edu/dartmouth/campus/tcpdump/fall01, November 2004.
[14] A. Madhukar and C. Williamson. A longitudinal study of P2P traffic classification. In 14th
IEEE International Symposium on Modeling, Analysis, and Simulation of Computer and
Telecommunication Systems, 2006. MASCOTS 2006, pages 179–188, 2006.
38
[15] A.W. Moore and K. Papagiannaki. Toward the accurate identification of network applica-
tions. Passive and Active Network Measurement, pages 41–54, 2005.
[16] TTT Nguyen and G. A rmitage. A survey of techniques for internet traffic classification
using machine learning. IEEE Communications Surveys & Tutorials, 10(4):56–76, 2008.
[17] I. Papapanagiotou and M. Devetsikiotis. Aggregation Design Methodologies for Triple Play
Services. In IEEE CCNC 2010, Las Vegas, USA, pages 1–5, 2010.
[18] B.C. Park, Y.J. Won, M.S. Kim, and J.W. Hong. Towards automated application signa-
ture generation for traffic identification. In IEEE Network Operations and Management
Symposium, 2008. NOMS 2008, pages 160–167, 2008.
[19] V. Paxson. Empirically derived analytic models of wide-area TCP connections. IEEE/ACM
Transactions on Networking (TON), 2(4):336, 1994.
[20] V. Paxson. Bro: A system for detecting network intruders in real-time. Comput. Networks,
31(23):2435–2463, 1999.
[21] M. Roughan, S. Sen, O. Spatscheck, and N. Duffield. Class-of-service mapping for QoS: a
statistical signature-based approach to IP traffic classification. In Proceedings of the 4th
ACM SIGCOMM conference on Internet measurement, pages 135–148. ACM, 2004.
[22] S. Sen, O. Spatscheck, and D. Wang. Accurate, scalable in-network identification of p2p
traffic using application signatures. In Proceedings of the 13th international conference on
World Wide Web, page 521. ACM, 2004.
[23] tshark Documentation. http://www.wireshark.org/docs/man-pages/tshark.html.
[24] J. Wang, Z. Ge, H. Jiang, S. Jin, and A.W. Moore. Lightweight application classification
for network management, July 2008.
[25] J. Xu, J. Fan, M. Ammar, and S.B. Moon. On the design and performance of prefix-
preserving IP traffic trace anonymization. In Proceedings of the 1st ACM SIGCOMM
Workshop on Internet Measurement, page 266. ACM, 2001.
39
APPENDIX
40
Appendix A
Application Behavior Profiling
A.1 DNS behavior Pattern
no rules support confidence lift
1 srcip=180 => dstip=6 0.010 1 71
2 dstip=6 => srcip=180 0.010 0.75 71
3 dstip=16 => dstport=56322 0.01 1 21.84
4 dstport=4207 => dstip=43 0.010 1 17.75
5 srcip=283 => dstip=22 0.014 1 28.4
6 dstip=19 => srcip=69 0.028 1 35.5
7 srcip=69 => dstip=19 0.028 1 35.5
8 dstport=3391 => dstip=39 0.028 1 35.5
9 dstip=39 => dstport=3391 0.028 1 35.5
10 dstip=13 => dstport=49406 0.031 1 15.77
11 dstip=12 => dstport=49406 0.031 1 15.77
12 dstip=3 => dstport=56322 0.035 1 21.84
13 dstport=56322 => dstip=3 0.035 0.769 21.84
14 dstport=1025 => dstip=31 0.07 1 13.52
15 dstip=31 => dstport=1025 0.07 1 13.52
16 srcip=200 => dstip=46 0.10 1 9.79
17 dstip=46 => srcip=200 0.10 1 9.79
18 srcip=187 => dstip=32 0.28 1 3.506
19 dstip=32 => srcip=187 0.28 1 3.50
41
A.2 Mail behavior Pattern
No rules support confidence lift
1 srcip=226 => dstip=49 0.013 1 53.75
2 dstip=49 => srcip=226 0.013 0.75 53.75
3 dstip=8 => dstport=53 0.013 1 6.71
4 dstip=16 => dstport=53 0.03 1 6.75
5 srcip=95 => dstip=30 0.04 1 8.95
6 dstip=13 => dstport=53 0.04 1 6.71
7 srcip=246 => dstip=24 0.05 1 19.54
8 dstip=24 => srcip=246 0.051 1 19.54
9 dstip=25 => srcip=112 0.05 1 4.3
10 srcip=59 => dstip=21 0.05 1 17.91
11 dstip=21 => srcip=59 0.055 1 17.91
12 srcip=253 => dstip=30 0.065 1 8.95
13 dstip=52 => srcip=364 0.093 1 10.23
14 srcip=364 => dstip=52 0.093 0.952 10.23
15 dstip=19 => srcip=112 0.181 1 4.3
16 srcip=112 => dstip=19 0.18 0.78 4.3
17 srcip=200 => dstip=51 0.21 1 4.574
18 dstip=51 => srcip=200 0.218 1 4.574
42
A.3 HTTP behavior Pattern
No rules support confidence lift
1 {} => dstip=1 0.711 0.711 1
2 srcip=139 => dstip=1 0.012 1 1.40
3 srcip=9 => dstip=1 0.012 1 1.405
4 srcip=41 => dstip=1 0.012 1 1.405
5 dstport=2006 => dstip=1 0.012 1 1.405
6 dstip=41 => srcip=164 0.01 1 40.75
7 dstip=55 => srcip=164 0.012 1 40.75
8 srcip=11 => dstip=1 0.012 1 1.405
9 dstip=53 => srcip=242 0.012 1 81.5
10 srcip=242 => dstip=53 0.012 1 81.5
11 srcip=6 => dstip=1 0.012 1 1.405
12 srcip=134 => dstip=2 0.01 1 54.3
13 srcip=439 => dstip=1 0.012 1 1.405
14 srcip=407 => dstip=1 0.0127 1 1.405
15 dstport=1273 => srcip=233 0.012 1 81.5
16 srcip=233 => dstport=1273 0.012 1 81.5
17 dstport=1273 => dstip=1 0.012 1 1.405
18 srcip=424 => dstip=1 0.012 1 1.405
19 srcip=233 => dstip=1 0.012 1 1.405
20 srcip=409 => dstip=1 0.0122 1 1.405
21 srcip=2 => dstip=1 0.018 1 1.405
22 srcip=50 => dstip=1 0.018 1 1.405
23 srcip=190 => dstip=1 0.018 1 1.405
24 srcip=226 => dstip=49 0.01 1 13.58
25 srcip=389 => dstip=1 0.0181 1 1.405
26 srcip=382 => dstip=1 0.011 1 1.405
27 srcip=408 => dstip=1 0.024 1 1.405
28 srcip=338 => dstip=1 0.024 1 1.405
29 srcip=75 => dstip=45 0.024 1 32.6
30 dstip=45 => srcip=75 0.024 0.8 32.6
31 srcip=344 => dstip=1 0.024 1 1.405
32 srcip=444 => dstip=1 0.030 1 1.405
33 srcip=155 => dstip=1 0.024 0.8 1.124
34 srcip=233 & dstport=1273 => dstip=1 0.012 1 1.405
35 dstip=1 & dstport=1273 => srcip=233 0.012 1 81.5
36 srcip=233 & dstip=1 => dstport=1273 0.012 1 81.5
43
A.4 IRC behavior Pattern
No rules support confidence lift
1 {} => dstip=16 0.71 0.71 1
2 srcip=301 => dstip=16 0.014 1 1.39
3 srcip=247 => dstip=16 0.014 1 1.39
4 srcip=49 => dstip=16 0.014 1 1.39
5 srcip=457 => dstip=16 0.014 1 1.39
6 dstport=39938 => srcip=18 0.014 1 46
7 dstport=39938 => dstip=3 0.0144 1 4.41
8 srcip=99 => dstip=16 0.014 1 1.39
9 srcip=168 => dstport=37321 0.014 1 69
10 dstport=37321 => srcip=168 0.014 1 69
11 srcip=168 => dstip=3 0.014 1 4.45
12 dstport=37321 => dstip=3 0.014 1 4.45
13 srcip=162 => dstport=37273 0.014 1 69
14 dstport=37273 => srcip=162 0.014 1 69
15 srcip=162 => dstip=16 0.014 1 1.39
16 dstport=37273 => dstip=16 0.01 1 1.39
17 srcip=292 => dstip=3 0.014 1 4.45
18 srcip=428 => dstip=16 0.014 1 1.39
19 srcip=316 => dstip=16 0.014 1 1.39
20 srcip=209 => dstip=16 0.014] 1 1.39
21 srcip=183 => dstip=16 0.021 1 1.393
22 srcip=18 => dstip=3 0.021 1 4.45
23 srcip=322 => dstip=16 0.028 1 1.39
24 srcip=305 => dstip=16 0.028 1 1.39
25 srcip=18 & dstport=39938 => dstip=3 0.014 1 4.45
26 dstip=3 & dstport=39938 => srcip=18 0.014 1 46
27 srcip=168 & dstport=37321 => dstip=3 0.014 1 4.45
28 srcip=168 & dstip=3 => dstport=37321 0.0144 1 69
29 dstip=3 & dstport=37321 => srcip=168 0.014 1 69
30 srcip=162 & dstport=37273 => dstip=16 0.014 1 1.39
31 srcip=162 & dstip=16 => dstport=37273 0.014 1 69
32 dstip=16 & dstport=37273 => srcip=162 0.014 1 69
44
A.5 SMTP behavior Pattern
No rules support confidence lift
1 srcip=196 => dstip=3 0.01 1 5.71
2 dstport=4706 => srcip=261 0.01 1 22.22
3 dstport=4706 => dstip=28 0.01 1 2.89
4 srcip=277 => dstip=43 0.01 1 7.14
5 srcip=241 => dstip=3 0.01 1 5.714
6 srcip=220 => dstip=43 0.01 1 7.14
7 srcip=404 => dstip=16 0.01 1 5
8 srcip=240 => dstip=16 0.01 1 5
9 srcip=211 => dstport=38173 0.01 1 100
10 dstport=38173 => srcip=211 0.01 1 100
11 srcip=211 => dstip=16 0.01 1 5
12 dstport=38173 => dstip=16 0.01 1 5
13 dstport=38156 => srcip=235 0.01 1 66.66
14 dstport=38156 => dstip=16 0.01 1 5
15 srcip=245 => dstip=43 0.01 1 7.142
16 srcip=149 => dstip=43 0.01 1 7.142
35 srcip=237 => dstip=16 0.015 0.75 3.75
36 srcip=261 & dstport=4706 => dstip=28 0.01 1 2.89
37 dstip=28 & dstport=4706 => srcip=261 0.01 1 22.22
38 srcip=211 & dstport=38173 => dstip=16 0.01 1 5
39 srcip=211 & dstip=16 => dstport=38173 0.01 1 100
40 dstip=16 & dstport=38173 => srcip=211 0.01 1 100
41 srcip=235 & dstport=38156 => dstip=16 0.01 1 5
42 dstip=16 & dstport=38156 => srcip=235 0.01 1 66.66
43 srcip=235 & dstip=16 => dstport=38156 0.01 1 100
44 srcip=348 & dstport=2151 => dstip=43 0.01 1 7.14
45 dstip=43 & dstport=2151 => srcip=348 0.01 1 50
46 srcip=205 & dstport=4837 => dstip=15 0.01 1 25
47 dstip=15 & dstport=4837 => srcip=205 0.01 1 50
45