
P-Gram: Positional N-Gram for the Clustering of Machine-Generated Messages

journal contribution
posted on 2024-07-26, 14:52 authored by Jiaojiao Jiang, Steve Versteeg, Jun Han, Md Arafat Hossain, Jean-Guy Schneider, Christopher Leckie, Zeinab Farahmandpour
An IT system generates messages for other systems or users to consume, through direct interaction or as system logs. Automatically identifying the types of these machine-generated messages has many applications, such as intrusion detection and system behavior discovery. Among the various heuristic methods for automatically identifying message types, clustering methods based on keyword extraction have been quite effective. However, these methods still suffer from keyword misidentification: some keyword occurrences are wrongly identified as payload, and some strings in the payload are wrongly identified as keyword occurrences, leading to the misidentification of message types. In this paper, we propose a new machine language processing (MLP) approach, called $P$-gram, specifically designed for identifying keywords in, and subsequently clustering, machine-generated messages. First, we introduce a novel concept and technique, the positional $n$-gram, for message keyword extraction. By associating the position as metadata with each $n$-gram, we can more accurately discern which $n$-grams are keywords of a message and which $n$-grams are parts of the payload information. Then, the positional keywords are used as features to cluster the messages, and an entropy-based positional weighting method is devised to measure the importance, or weight, of the positional keywords to each message. Finally, a general centroid clustering method, $K$-Medoids, is used to leverage the importance of the keywords and cluster messages into groups reflecting their types. We evaluate our method on a range of machine-generated (text and binary) messages from real-world systems and show that it achieves higher accuracy than the current state-of-the-art tools.
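
The idea of attaching a position to each n-gram can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not give the paper's exact definitions, so the function names, the use of character offsets as positions, and the per-offset Shannon entropy are all assumptions made for illustration. The K-Medoids clustering step is not shown.

```python
from collections import Counter
import math


def positional_ngrams(message, n=3):
    """Return (offset, n-gram) pairs for every character n-gram in message.

    Using the character offset as the "position" is an assumption.
    """
    return [(i, message[i:i + n]) for i in range(len(message) - n + 1)]


def positional_frequencies(messages, n=3):
    """Fraction of messages in which each (offset, n-gram) pair occurs."""
    counts = Counter()
    for msg in messages:
        # set() so that a pair is counted at most once per message
        counts.update(set(positional_ngrams(msg, n)))
    return {key: c / len(messages) for key, c in counts.items()}


def position_entropy(messages, n=3):
    """Shannon entropy (in bits) of the n-grams observed at each offset.

    Low entropy suggests a stable, keyword-like field; high entropy
    suggests variable payload. This weighting is a hypothetical stand-in
    for the entropy-based method named in the abstract.
    """
    by_pos = {}
    for msg in messages:
        for pos, gram in positional_ngrams(msg, n):
            by_pos.setdefault(pos, Counter())[gram] += 1
    entropy = {}
    for pos, grams in by_pos.items():
        total = sum(grams.values())
        entropy[pos] = -sum(
            (c / total) * math.log2(c / total) for c in grams.values()
        )
    return entropy


# Toy messages, invented for this example
messages = ["GET /a", "GET /b", "PUT /a", "GET /c"]
freq = positional_frequencies(messages)
ent = position_entropy(messages)
print(freq[(0, "GET")])  # share of messages starting with "GET"
print(ent[0], ent[3])    # entropy at offset 0 versus offset 3
```

In this toy input the n-grams at offset 0 vary less than those at offset 3, so offset 0 has the lower entropy, which is the kind of signal that could separate keyword positions from payload positions.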

Funding

Australian Research Council

Virtual environments for improved enterprise software deployment : Australian Research Council (ARC) | LP150100892

History

Available versions

PDF (Published version)

ISSN

2169-3536

Journal title

IEEE Access

Volume

7

Pagination

12 pp

Publisher

Institute of Electrical and Electronics Engineers (IEEE)

Copyright statement

Copyright © 2019 the authors. This work is licensed under a Creative Commons Attribution 3.0 License. For more information, see http://creativecommons.org/licenses/by/3.0/

Language

eng
