Building a Machine Learning Threat Detection System with AWS Lambda and SageMaker

Traditional signature-based security systems struggle against sophisticated cyber threats. This guide demonstrates how to build a production-ready, machine learning-powered threat detection system using AWS Lambda for preprocessing and SageMaker for inference.

The Problem Statement

Modern cyber threats evolve rapidly, requiring intelligent systems that can identify attack patterns in real-time. Organizations need scalable, cost-effective solutions that can process network logs and detect multiple attack vectors including SQL injection, DoS attacks, port scanning, brute force attempts, and data exfiltration.

Architecture Overview

Our threat detection pipeline processes network logs through automated stages:

Network Logs → Lambda Preprocessing → Feature Extraction → 
SageMaker Batch Transform → Threat Classification → SNS Alerts

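The system extracts 36 engineered features from raw network logs and classifies each record into one of six traffic categories: normal traffic plus five attack types. On the generated validation set described in Phase 3, the model reached 100% accuracy.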

Technologies Used

AWS Lambda: Serverless preprocessing and feature extraction
Amazon SageMaker: ML model training and batch inference
Amazon S3: Data lake architecture
Amazon SNS: Real-time threat alerting
Python: XGBoost, Pandas, Scikit-learn
Machine Learning: 36 engineered features for threat detection

Phase 1: Development Environment Setup

Set up the Python environment with proper dependency management:

mkdir cyber-threat-detection-sagemaker
cd cyber-threat-detection-sagemaker

python -m venv .venv
# Windows PowerShell (on macOS/Linux, run: source .venv/bin/activate)
.\.venv\Scripts\Activate.ps1

pip install --only-binary=:all: numpy pandas scikit-learn
pip install boto3 xgboost

Configure AWS CLI credentials:

aws configure
aws sts get-caller-identity

Phase 2: Advanced Feature Engineering

The Lambda preprocessing function transforms raw logs into 36 threat indicators across five categories:

Feature Categories

Network Behavior Analysis (12 features)

IP geolocation risk assessment
Internal vs external connection detection
Protocol anomaly detection
Traffic volume patterns
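
As a minimal sketch of the internal vs external check, the snippet below uses Python's standard ipaddress module. The feature name is_external_connection comes from the feature importance list later in this post, but the exact rule (either endpoint outside private or loopback ranges) is an assumption, not necessarily what the deployed Lambda does.

```python
import ipaddress


def is_external_connection(src_ip: str, dst_ip: str) -> int:
    """Return 1 if either endpoint lies outside private/loopback ranges, else 0.

    Assumed rule for illustration; raises ValueError on malformed addresses.
    """
    def is_internal(ip: str) -> bool:
        addr = ipaddress.ip_address(ip)
        return addr.is_private or addr.is_loopback

    return int(not (is_internal(src_ip) and is_internal(dst_ip)))
```

A record between 10.0.0.5 and 192.168.1.10 would be flagged as internal, while any connection touching a public address such as 8.8.8.8 would be flagged as external.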

Temporal Pattern Analysis (4 features)

Business hours detection
Weekend activity monitoring
Time-based attack patterns
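
The temporal features can be sketched with the standard datetime module. The four feature names and the 09:00 to 17:00 business window below are hypothetical choices for illustration; the post only states that four temporal features exist.

```python
from datetime import datetime


def temporal_features(timestamp: str) -> dict:
    """Derive time-based indicators from an ISO 8601 timestamp string."""
    ts = datetime.fromisoformat(timestamp)
    is_weekday = ts.weekday() < 5  # Monday=0 ... Sunday=6
    return {
        "hour_of_day": ts.hour,
        # Assumed business window: weekdays, 09:00 up to (not including) 17:00
        "is_business_hours": int(is_weekday and 9 <= ts.hour < 17),
        "is_weekend": int(not is_weekday),
        # Assumed "night" window: 22:00 to 05:59
        "is_night": int(ts.hour < 6 or ts.hour >= 22),
    }
```

Timestamps are interpreted as given; a real pipeline would need to decide on a time zone before applying the business-hours rule.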

Port and Service Analysis (7 features)

High-risk port identification
Service categorization
Port usage patterns
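
A rough sketch of port-based features follows. is_well_known_port is named in the feature importance list later in this post; the other feature names and the contents of HIGH_RISK_PORTS are illustrative assumptions rather than the set used in the deployed function.

```python
# Illustrative set of commonly targeted ports (FTP, Telnet, RPC, NetBIOS,
# SMB, MSSQL, MySQL, RDP); the real list is an assumption here.
HIGH_RISK_PORTS = {21, 23, 135, 139, 445, 1433, 3306, 3389}


def port_features(dst_port: int) -> dict:
    """Categorize a destination port into simple binary indicators."""
    return {
        "is_well_known_port": int(0 <= dst_port <= 1023),  # IANA system ports
        "is_high_risk_port": int(dst_port in HIGH_RISK_PORTS),
        "is_ephemeral_port": int(dst_port >= 49152),  # IANA dynamic range
    }
```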

Attack Pattern Detection (8 features)

SQL injection pattern matching
XSS attempt detection
DoS attack indicators
Suspicious user agent identification
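
Pattern-based indicators like these are typically implemented with regular expressions. The sketch below is deliberately simple: the patterns, the tool names in SUSPICIOUS_AGENTS, and the feature names are all assumptions for illustration, and patterns this loose would produce false positives on real traffic.

```python
import re

# Simplified, illustrative patterns; not the patterns from the deployed code.
SQLI_PATTERN = re.compile(
    r"(\bunion\b.+\bselect\b|\bor\b\s+1\s*=\s*1|--|;\s*drop\s+table)",
    re.IGNORECASE,
)
XSS_PATTERN = re.compile(r"(<\s*script|javascript:|onerror\s*=)", re.IGNORECASE)
SUSPICIOUS_AGENTS = ("sqlmap", "nikto", "nmap", "masscan")


def attack_pattern_features(payload: str, user_agent: str) -> dict:
    """Flag request payloads and user agents that match known attack markers."""
    ua = user_agent.lower()
    return {
        "has_sql_injection_pattern": int(bool(SQLI_PATTERN.search(payload))),
        "has_xss_pattern": int(bool(XSS_PATTERN.search(payload))),
        "is_suspicious_user_agent": int(any(tool in ua for tool in SUSPICIOUS_AGENTS)),
    }
```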

Traffic Characteristics (5 features)

Bytes ratio analysis
Upload/download patterns
Connection frequency monitoring
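
Ratio features need guarding against division by zero, which is also one way infinite values end up in a training set. The following sketch shows one possible approach; the field names (bytes_sent, bytes_received, duration_s) and the +1 smoothing in the ratio are assumptions.

```python
def traffic_features(bytes_sent: int, bytes_received: int, duration_s: float) -> dict:
    """Compute volume ratios with guards so no value is ever inf or NaN."""
    total = bytes_sent + bytes_received
    return {
        # +1 in the denominator avoids division by zero for one-way traffic
        "bytes_ratio": bytes_sent / (bytes_received + 1),
        "upload_fraction": bytes_sent / total if total else 0.0,
        "bytes_per_second": total / duration_s if duration_s > 0 else 0.0,
    }
```

A large bytes_ratio combined with a high upload_fraction is the kind of signal that would point toward data exfiltration.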

Phase 3: Realistic Threat Data Generation

The generated dataset contains 2,500 network log records across six traffic categories:

Threat Type	Percentage	Count	Key Characteristics
Normal Traffic	60.0%	1,500	Regular browsing, DNS queries
SQL Injection	6.0%	150	Malicious payloads, suspicious agents
DoS/DDoS	8.0%	200	High frequency, small packets
Port Scanning	12.0%	300	Sequential ports, short duration
Brute Force	10.0%	250	Repeated login attempts
Data Exfiltration	4.0%	100	Large outbound transfers

Phase 4: Machine Learning Model Training

The XGBoost training pipeline implements enterprise-grade practices:

Class imbalance handling with automatic weight computation
Robust data validation for infinite/NaN values
Comprehensive per-class metrics and AUC scores
Feature importance analysis for interpretability
SageMaker-compatible interface
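
The post does not say how the class weights are computed. One common choice is the "balanced" heuristic, weight = n_samples / (n_classes * class_count), which is the formula scikit-learn uses for class_weight="balanced". The sketch below implements that heuristic and a simple finite-value filter using only the standard library; both function names are hypothetical.

```python
import math
from collections import Counter


def balanced_class_weights(labels):
    """Weight each class by n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples, n_classes = len(labels), len(counts)
    return {cls: n_samples / (n_classes * n) for cls, n in counts.items()}


def drop_non_finite_rows(rows):
    """Keep only feature rows in which every value is finite (no NaN or inf)."""
    return [row for row in rows if all(math.isfinite(v) for v in row)]
```

With the class counts from the Phase 3 table, this heuristic would give rare classes such as data exfiltration a much larger weight than normal traffic, so that errors on rare threats cost more during training.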

Model Performance

Training Accuracy: 99.96%
Validation Accuracy: 100%
Validation AUC: 1.0000
Per-Class Detection: Perfect precision/recall across all threat types
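
For readers who want to reproduce per-class numbers on their own data, precision and recall can be computed directly from true and predicted labels. This is a generic sketch, not the evaluation code from the training script, and the function name is hypothetical.

```python
def per_class_precision_recall(y_true, y_pred):
    """Return {class: (precision, recall)} from paired label sequences."""
    report = {}
    for cls in sorted(set(y_true) | set(y_pred)):
        pairs = list(zip(y_true, y_pred))
        tp = sum(t == cls and p == cls for t, p in pairs)  # true positives
        fp = sum(t != cls and p == cls for t, p in pairs)  # false positives
        fn = sum(t == cls and p != cls for t, p in pairs)  # false negatives
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        report[cls] = (precision, recall)
    return report
```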

Top 5 Critical Features

is_external_connection (29.94%)
ip_geolocation_risk (12.38%)
is_well_known_port (10.62%)
potential_dos (9.69%)
protocol_numeric (9.49%)

Phase 5: AWS Production Deployment

The deployment script executes six automated steps to create the complete infrastructure:

Step 1: S3 Bucket Creation

Three buckets with account-specific naming (Account ID: xxxxxxxxxxxx):

cyber-threat-raw-data-xxxxxxxxxxxx – Network log ingestion
cyber-threat-processed-data-xxxxxxxxxxxx – Feature storage
cyber-threat-model-artifacts-xxxxxxxxxxxx – Model and results
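
The account-specific naming can be expressed as a small helper. The bucket prefixes are taken from the list above; the function name, the dictionary keys, and the 12-digit validation are assumptions added for illustration.

```python
def bucket_names(account_id: str) -> dict:
    """Build the three bucket names for a given 12-digit AWS account ID."""
    if not (account_id.isdigit() and len(account_id) == 12):
        raise ValueError("AWS account IDs are 12-digit numbers")
    return {
        "raw": f"cyber-threat-raw-data-{account_id}",
        "processed": f"cyber-threat-processed-data-{account_id}",
        "artifacts": f"cyber-threat-model-artifacts-{account_id}",
    }
```

Appending the account ID keeps the names globally unique, which S3 requires. The ID itself can be read from the output of aws sts get-caller-identity.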

Step 2: IAM Role Configuration

Creates CyberThreatDetectionSageMakerRole with:

Trust policies for SageMaker and Lambda services
S3 read/write permissions for all threat detection buckets
CloudWatch logging capabilities
SageMaker training and inference permissions
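
A trust policy that lets both services assume the role might look like the sketch below. The helper name is hypothetical, and the actual deployment script may structure the policy differently, for example with one statement per service.

```python
import json


def build_trust_policy() -> str:
    """Return a JSON trust policy allowing SageMaker and Lambda to assume the role."""
    policy = {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Principal": {
                    "Service": ["sagemaker.amazonaws.com", "lambda.amazonaws.com"]
                },
                "Action": "sts:AssumeRole",
            }
        ],
    }
    return json.dumps(policy)
```

The resulting string is what boto3's iam.create_role expects in its AssumeRolePolicyDocument parameter.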

Step 3: Training Data Upload

Uploads datasets to S3:

Raw network logs to raw-data/ folder
Processed features to train/ folder

Step 4: Lambda Function Deployment

Deploys cyber-threat-detector-sagemaker:

Python 3.9 runtime with preprocessing code
300-second timeout, 256MB memory
Integrated with SageMaker for batch inference
Automatic updates for existing functions

Step 5: SageMaker Model Registration

Registers trained XGBoost model:

Model name: cyber-threat-detector-xxxxxxxxxx
Ready for batch transform jobs
Configured for threat classification on batched feature files
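
Once the model is registered, a batch transform job can be described as a request dictionary and passed to boto3's create_transform_job. The sketch below only builds the request, so it runs without AWS credentials. The helper name, the ml.m5.large default, and the CSV line-splitting settings are assumptions, not values confirmed by the deployment script.

```python
def build_transform_job_request(job_name, model_name, input_s3_uri,
                                output_s3_uri, instance_type="ml.m5.large"):
    """Assemble keyword arguments for sagemaker.create_transform_job."""
    return {
        "TransformJobName": job_name,
        "ModelName": model_name,
        "TransformInput": {
            "DataSource": {
                "S3DataSource": {"S3DataType": "S3Prefix", "S3Uri": input_s3_uri}
            },
            "ContentType": "text/csv",
            "SplitType": "Line",  # treat each CSV line as one record
        },
        "TransformOutput": {"S3OutputPath": output_s3_uri, "AssembleWith": "Line"},
        "TransformResources": {"InstanceType": instance_type, "InstanceCount": 1},
    }
```

With credentials configured, the job would be submitted with boto3.client("sagemaker").create_transform_job(**request).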

Step 6: SNS Alert Configuration

Sets up cyber-threat-alerts topic:

Real-time threat notifications
Email/SMS subscription support
Integration with Lambda for automated alerting
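
An alert can be assembled as the keyword arguments for SNS publish. The sketch below is a hypothetical message format: the field names in the JSON body and the subject prefix are assumptions, since the post does not show the alert payload.

```python
import json


def build_alert(topic_arn: str, threat_type: str, source_ip_hash: str,
                confidence: float) -> dict:
    """Assemble keyword arguments for sns.publish for one detected threat."""
    message = json.dumps(
        {
            "threat_type": threat_type,
            "source_ip_hash": source_ip_hash,
            "confidence": round(confidence, 4),
        },
        indent=2,
    )
    return {
        "TopicArn": topic_arn,
        # SNS limits subjects to 100 characters, so truncate defensively
        "Subject": f"[Threat Alert] {threat_type}"[:100],
        "Message": message,
    }
```

Inside the Lambda function this would be sent with boto3.client("sns").publish(**alert).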

Phase 6: Production Testing and Validation

Deployed Resources Summary

Component	Resource Name	Purpose
Lambda Function	`cyber-threat-detector-sagemaker`	Preprocessing & SageMaker integration
SageMaker Model	`cyber-threat-detector-xxxxxxxxxx`	XGBoost threat classification
S3 Raw Data	`cyber-threat-raw-data-xxxxxxxxxxxx`	Network log ingestion
S3 Processed	`cyber-threat-processed-data-xxxxxxxxxxxx`	Feature storage
S3 Artifacts	`cyber-threat-model-artifacts-xxxxxxxxxxxx`	Model & results storage
SNS Topic	`cyber-threat-alerts`	Threat notifications
IAM Role	`CyberThreatDetectionSageMakerRole`	Security & permissions

Real-Time Threat Detection Flow

The system processes network logs automatically:

Upload: CSV files uploaded to raw-data/ folder
Trigger: S3 event invokes Lambda function
Feature Extraction: 36 features extracted from each log
Batch Transform: SageMaker processes features
Classification: Model identifies threat patterns
Alerting: SNS sends notifications for detected threats
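
The trigger step depends on reading the bucket and object key out of the S3 event that Lambda receives. A minimal sketch of that parsing is shown below. The event structure is the standard S3 notification format, but the function name and the raw-data/ and .csv filters are assumptions based on the flow described above.

```python
from urllib.parse import unquote_plus


def extract_s3_objects(event: dict) -> list:
    """Return (bucket, key) pairs for CSV uploads under raw-data/."""
    objects = []
    for record in event.get("Records", []):
        bucket = record["s3"]["bucket"]["name"]
        # Object keys arrive URL-encoded in S3 events (spaces become '+')
        key = unquote_plus(record["s3"]["object"]["key"])
        if key.startswith("raw-data/") and key.endswith(".csv"):
            objects.append((bucket, key))
    return objects
```

A Lambda handler would call this first and then run feature extraction on each returned object.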

Testing the Pipeline

Upload test network logs to verify end-to-end processing:

aws s3 cp test_network_logs.csv s3://cyber-threat-raw-data-xxxxxxxxxxxx/raw-data/

Subscribe to Threat Alerts

Configure email notifications:

aws sns subscribe \
  --topic-arn arn:aws:sns:us-east-1:xxxxxxxxxxxx:cyber-threat-alerts \
  --protocol email \
  --notification-endpoint your-email@company.com

Production Pipeline in Action

Automatic Processing Workflow

When network logs are uploaded to S3, the system automatically:

Triggers the Lambda function via an S3 event
Extracts 36 features from the raw logs
Submits a batch transform job to SageMaker
Classifies threats with confidence scores
Sends SNS alerts for detected threats

Performance Metrics

Processing Speed: 2,500+ records in ~5 minutes
Feature Extraction: 36 features per network log entry
Detection Latency: Sub-minute threat identification
Cost Efficiency: Pay-per-use serverless architecture
Scalability: Auto-scales with traffic volume

Architecture Benefits

Serverless Advantages

Cost-effective pay-per-use model
Automatic scaling with traffic spikes
High availability with AWS-managed infrastructure
Sub-minute threat detection capabilities

Security Features

IP address hashing for privacy
S3 server-side encryption at rest
Least-privilege IAM access controls
CloudTrail audit logging enabled
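
The post does not specify how IP addresses are hashed. One reasonable approach is a keyed hash (HMAC-SHA256), because an unkeyed hash of an IPv4 address can be reversed by enumerating the whole address space. The sketch below assumes a secret key supplied from outside the code, for example from an environment variable; the function name is hypothetical.

```python
import hashlib
import hmac


def hash_ip(ip: str, secret_key: bytes) -> str:
    """Return a keyed SHA-256 digest of an IP address as a hex string."""
    return hmac.new(secret_key, ip.encode("utf-8"), hashlib.sha256).hexdigest()
```

The same IP always maps to the same digest under a given key, so per-source aggregation still works on the hashed values.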

System Capabilities

Processing: 2,500+ records in ~5 minutes
Accuracy: 100% validation performance across all threat types
Features: 36 sophisticated threat indicators extracted per log
Cost: ~$10-50 monthly depending on usage volume

Key Achievements

Complete AWS infrastructure deployed with Lambda and SageMaker integration
Advanced ML pipeline with 36 engineered features
Real-time processing via serverless Lambda functions
SageMaker batch transform for scalable inference
Perfect model accuracy with 100% validation performance
Production testing verified with SNS alerting
Cost-effective serverless architecture (~$10-50/month)

Future Enhancements

Real-time inference endpoints for immediate threat detection
CloudWatch dashboards for monitoring and metrics
Kinesis integration for streaming log processing
SIEM integration (Splunk, QRadar) for enterprise security
Automated response workflows for threat remediation
Multi-region deployment for high availability

Conclusion

This threat detection pipeline uses AWS Lambda for preprocessing and SageMaker for inference. On the generated validation set, the model achieved 100% accuracy in separating five attack types from normal traffic, while the infrastructure stays cost-effective and scalable.

The complete serverless architecture processes 2,500+ records in ~5 minutes with automatic alerting through SNS, providing organizations with real-time threat visibility at minimal operational cost.

Complete source code and deployment scripts: [GitHub Repository Link]

Faaez Nafiu's Cloud Projects
