CTI-ANN: Self-Training-Based Annotation With Tailored Augmentation for Cyber Threat Intelligence Posts

Research output: Contribution to journal › Article › peer-review

Abstract

Cyber threat intelligence (CTI) leverages real-time information on emerging threats to strengthen defense and risk management. Social media platforms like X and Reddit are valuable CTI sources, yet detecting relevant posts is hindered by costly manual annotation and severe class imbalance. We propose CTI-ANN, a theoretically grounded self-training framework that begins with a small labeled seed set and iteratively generates pseudolabels for large-scale unlabeled data. To address class imbalance, we introduce CyberAugment, a suite of domain-specific augmentation methods supported by formal proofs of semantic preservation and imbalance reduction. In experiments on the X dataset, CyberAugment improves accuracy by up to 12.2% over baselines and consistently outperforms existing methods. When applied to both X and Reddit data, CTI-ANN demonstrates robust cross-platform performance, improving model accuracy significantly after five iterations. Our contributions include novel CTI-specific augmentation, an integrated self-training pipeline with comprehensive scalability analysis, formal theoretical analysis, and the release of our CTI-annotated datasets from both X and Reddit.
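
The self-training idea summarized in the abstract (a small labeled seed set, then iterative pseudolabeling of unlabeled posts) can be illustrated with a minimal, generic sketch. This is not the CTI-ANN implementation: the toy naive Bayes classifier, the confidence threshold of 0.8, the function names, and the example posts are all assumptions made for illustration, and CyberAugment is not modeled here. Only the five-iteration cap echoes the abstract.

```python
import math
from collections import Counter


def train(labeled):
    """Fit a tiny multinomial naive Bayes model from (text, label) pairs."""
    word_counts = {0: Counter(), 1: Counter()}
    class_counts = Counter()
    for text, label in labeled:
        class_counts[label] += 1
        word_counts[label].update(text.lower().split())
    return word_counts, class_counts


def predict_proba(model, text):
    """Return the estimated P(label = 1 | text) using add-one smoothing."""
    word_counts, class_counts = model
    vocab = set(word_counts[0]) | set(word_counts[1])
    total = sum(class_counts.values())
    log_scores = {}
    for label in (0, 1):
        n_words = sum(word_counts[label].values())
        score = math.log((class_counts[label] + 1) / (total + 2))
        for tok in text.lower().split():
            score += math.log(
                (word_counts[label][tok] + 1) / (n_words + len(vocab) + 1)
            )
        log_scores[label] = score
    # Normalize in a numerically stable way.
    top = max(log_scores.values())
    exp = {k: math.exp(v - top) for k, v in log_scores.items()}
    return exp[1] / (exp[0] + exp[1])


def self_train(seed, unlabeled, threshold=0.8, max_iters=5):
    """Generic self-training loop (hypothetical threshold and iteration cap)."""
    labeled = list(seed)
    pool = list(unlabeled)
    for _ in range(max_iters):
        model = train(labeled)
        confident, remaining = [], []
        for text in pool:
            p = predict_proba(model, text)
            if p >= threshold:
                confident.append((text, 1))   # pseudolabel: CTI-relevant
            elif p <= 1 - threshold:
                confident.append((text, 0))   # pseudolabel: not relevant
            else:
                remaining.append(text)        # too uncertain, keep unlabeled
        if not confident:
            break                             # nothing new to learn from
        labeled.extend(confident)
        pool = remaining
    return train(labeled), labeled, pool


# Made-up example posts, for illustration only.
seed = [
    ("new ransomware exploit targets unpatched servers", 1),
    ("critical vulnerability exploit released for router firmware", 1),
    ("great coffee and sunny weather this morning", 0),
    ("watching the football game with friends tonight", 0),
]
unlabeled = [
    "ransomware exploit hits hospital servers",
    "sunny weather for the football game",
    "new phone arrived today",
]
model, labeled, pool = self_train(seed, unlabeled)
```

Accepting only high-confidence predictions as pseudolabels is one common way to limit the noise that self-training can feed back into the model; posts that never cross the threshold simply stay in the unlabeled pool.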

Original language: English
Journal: IEEE Transactions on Industrial Informatics
State: Accepted/In press - 2025

Keywords

  • Augmentation
  • cyber threat intelligence (CTI)
  • domain-specific augmentation
  • pseudolabeling
  • self-training
  • social media mining
