A novel approach for practical real-time, machine learning based IP traffic classification

posted on 2024-07-11, 17:13 authored by Thuy T. T. Nguyen
Today's Internet does not guarantee any bounds on packet delay, loss or jitter for traffic traversing its networks. Uncontrolled networks can easily lead to bad user experiences for those emerging applications that have more stringent Quality of Service (QoS) requirements. This suggests there is a vital need for an effective QoS-enabled network architecture, in which the network equipment is capable of classifying Internet traffic into different classes for different QoS treatments. Beyond technology, there are other issues related to a practical QoS solution for the Internet, including the challenges of minimising the deployment cost of QoS technologies and simplifying users' experiences. Like other services, the Internet is expected to be user-friendly, simple and easy to understand, stable and available on request, predictable and transparent, and not requiring users to understand its underlying architecture in order to use the service. With an awareness of these issues, my thesis focuses on the automation of the QoS control process, particularly by means of an automated, real-time IP traffic classification (IPTC) mechanism. Traditional techniques for the identification of Internet applications are based either on the use of well-known registered port numbers or on payload-based protocol reconstruction. However, applications can use unregistered ports or encryption to obfuscate packet contents; and governments may impose privacy regulations that constrain the ability of third parties to lawfully inspect packet payloads. Newer approaches, on the other hand, classify traffic by learning and recognising statistical patterns in externally observable attributes of the traffic (such as packet lengths and inter-packet arrival times). State-of-the-art approaches look closely at the application of Machine Learning (ML), a powerful technique for data mining and knowledge discovery, to the classification of IP traffic.
However, before I began publishing my work no ML-based approach to IPTC properly considered the constraints of being deployed in real-time operational networks. Most publications on the use of ML algorithms for classifying IP traffic have relied on bi-directional, full-flow statistics (from start until finish or time-out), while assuming that flows have an explicit direction implied by the first packet captured, or a known client-server relationship. Some other studies have tried classification using the first few packets of a flow. In contrast, most if not all real-world scenarios require a classification decision well before a flow has finished, using statistics derived from a small number of recent packets rather than from the entire flow. Classifiers may also have missed an arbitrary number of packets from the start of a flow, and be unsure of the direction in which the flow started. To overcome these problems, I propose and evaluate novel modifications to the current ML-based approaches. My goal is to achieve classification by using statistics derived from only the most recent N packets of a flow (for some small value of N). Because a target application's short-term traffic statistics vary within the lifetime of a single flow, I propose training the ML classifier on a set of multiple short sub-flows, each 'sub-flow' being a collection of N consecutive packets extracted from full-flow samples of the target application's traffic. The sub-flows are picked from regions of the application's flow that have noticeably different statistical characteristics. I further augment the training set by synthesising a complementary version of every sub-flow in the reverse direction, since most Internet applications exhibit asymmetric traffic characteristics in the client-to-server and server-to-client directions.
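The sub-flow and mirroring steps described above can be sketched in Python. This is only an illustration under stated assumptions, not the thesis's implementation: the `Packet` type, the three features, the hand-picked `starts` positions and all function names are hypothetical, and the abstract does not say which statistics are actually computed.

```python
from dataclasses import dataclass, replace
from statistics import mean
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class Packet:
    """Hypothetical packet record; only externally observable attributes."""
    timestamp: float  # capture time in seconds
    length: int       # packet length in bytes
    forward: bool     # True if sent in the direction assumed to be "forward"


def sub_flow(flow: List[Packet], start: int, n: int) -> List[Packet]:
    """Return N consecutive packets of a full flow, beginning at `start`."""
    if start < 0 or start + n > len(flow):
        raise ValueError("sub-flow does not fit inside the flow")
    return flow[start:start + n]


def mirror(packets: List[Packet]) -> List[Packet]:
    """Synthesise the reverse-direction twin by swapping packet directions."""
    return [replace(p, forward=not p.forward) for p in packets]


def features(packets: List[Packet]) -> Dict[str, float]:
    """Assumed feature set: per-direction mean length and mean inter-arrival gap."""
    fwd = [p.length for p in packets if p.forward]
    bwd = [p.length for p in packets if not p.forward]
    gaps = [b.timestamp - a.timestamp for a, b in zip(packets, packets[1:])]
    return {
        "fwd_mean_len": mean(fwd) if fwd else 0.0,
        "bwd_mean_len": mean(bwd) if bwd else 0.0,
        "mean_gap": mean(gaps) if gaps else 0.0,
    }


def training_instances(
    flow: List[Packet], starts: List[int], n: int, label: str
) -> List[Tuple[Dict[str, float], str]]:
    """Emit one instance per sub-flow plus one for its mirrored twin."""
    instances = []
    for start in starts:
        sf = sub_flow(flow, start, n)
        instances.append((features(sf), label))
        instances.append((features(mirror(sf)), label))
    return instances
```

Because each sub-flow and its mirrored twin carry the same label, a classifier trained on these instances does not need to know which end of the flow sent the first captured packet.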
Finally, I propose a novel use of unsupervised ML algorithms for the automated selection of appropriate sub-flow pairs when examples of traffic are given from applications that we wish to classify. I combine my proposals into a training approach that I call Synthetic Sub-flow Pairs with the assistance of Clustering Techniques (SSP-ACT). I demonstrate my optimisation as applied to the Naive Bayes and C4.5 Decision Tree ML algorithms, for the identification of an online game, Wolfenstein Enemy Territory (ET), and VoIP traffic. My experiments showed that for ET, when trained using SSP-ACT and classifying with a small sliding window of 25 packets (roughly 0.5 seconds in real time), the Naive Bayes classifier achieved 98.9% median Recall and 87% median Precision, and the C4.5 Decision Tree classifier achieved 99.3% median Recall and 97% median Precision. My results also confirmed that classification performance is maintained even when the classification is initiated at an arbitrary point within a flow and is independent of the direction of the first packet captured. For VoIP, when trained using SSP-ACT and classifying on a sliding window of 25 packets (approximately 0.25 seconds in real time when there is voice traffic in both directions), the Naive Bayes classifier achieved 100% median Recall and 95.4% median Precision, and the C4.5 Decision Tree classifier achieved 95.7% median Recall and 99.2% Precision. I also study the impact of packet loss on SSP-ACT's performance, with 5% synthetic, random and independent packet loss. For Wolfenstein Enemy Territory traffic, 5% packet loss degraded the Recall and Precision of both the Naive Bayes and C4.5 Decision Tree classifiers by less than 0.5%. For VoIP traffic, 5% packet loss caused no noticeable degradation in the Naive Bayes classifier's Recall and Precision. However, it degraded the C4.5 Decision Tree classifier's Recall and Precision by 8.5% and 0.1% respectively.
Despite this degradation, median Recall and Precision of the C4.5 Decision Tree classifier still remained above 87% and 99% for all the tested positions of the sliding window. Deeper investigation of the sensitivity of the Naive Bayes and C4.5 Decision Tree classifiers with regard to packet loss is left for future research. This work can also be extended in future to other loss rates and loss models. I also demonstrate that SSP-ACT is effective in identifying both ET and VoIP traffic concurrently, by using a single common classifier or two separate classifiers in parallel, one for each application. My results reveal that using a common classifier provides better Precision and Recall, with a trade-off in the classification speed. It also has several pros and cons compared to the latter option of using two separate classifiers. How SSP-ACT could scale to classify a larger number of applications simultaneously is a question that requires further study. My results show that SSP-ACT is a significant improvement over the previously published state of the art for IP traffic classification. My present work has focused on IPTC of an online game and VoIP, and revealed a potential solution to the accurate and timely classification of traffic belonging to other Internet applications.
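The sliding classification window and the Recall and Precision figures quoted above can be illustrated with a short Python sketch. It assumes the standard definitions (Recall = TP / (TP + FN), Precision = TP / (TP + FP)) and one class label per window position; the function names and the `step` parameter are hypothetical and do not come from the thesis.

```python
from typing import Iterator, Sequence, Tuple, TypeVar

T = TypeVar("T")


def sliding_windows(
    packets: Sequence[T], n: int = 25, step: int = 1
) -> Iterator[Sequence[T]]:
    """Yield every window of N consecutive packets, advancing by `step`."""
    for start in range(0, len(packets) - n + 1, step):
        yield packets[start:start + n]


def recall_precision(
    actual: Sequence[str], predicted: Sequence[str], target: str
) -> Tuple[float, float]:
    """Recall and Precision for one target class over per-window decisions."""
    pairs = list(zip(actual, predicted))
    tp = sum(a == target and p == target for a, p in pairs)  # true positives
    fn = sum(a == target and p != target for a, p in pairs)  # missed target windows
    fp = sum(a != target and p == target for a, p in pairs)  # false alarms
    recall = tp / (tp + fn) if tp + fn else 0.0
    precision = tp / (tp + fp) if tp + fp else 0.0
    return recall, precision
```

Each window would be turned into a feature vector and passed to the trained classifier; comparing the resulting labels with the ground truth gives one Recall and Precision pair per evaluation run.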

Thesis type

  • Thesis (PhD)

Thesis note

Dissertation submitted in accordance with the requirements for the degree of Doctor of Philosophy, Swinburne University of Technology, 2009.

Copyright statement

Copyright © 2009 Thi Thu Thuy Nguyen.

Supervisors

Grenville J. Armitage

Language

eng
