Traditional signature-based security systems struggle against sophisticated cyber threats. This guide demonstrates how to build a production-ready, machine learning-powered threat detection system using AWS Lambda for preprocessing and SageMaker for inference.

The Problem Statement

Modern cyber threats evolve rapidly, so detection systems need to recognize attack patterns in real time. Organizations need scalable, cost-effective solutions that can process network logs and detect multiple attack vectors, including SQL injection, DoS attacks, port scanning, brute force attempts, and data exfiltration.

Architecture Overview

Our threat detection pipeline processes network logs through automated stages:

Network Logs → Lambda Preprocessing → Feature Extraction → 
SageMaker Batch Transform → Threat Classification → SNS Alerts

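The system extracts 36 engineered features from raw network logs and classifies each record into one of six traffic categories (normal traffic plus five attack types). On the generated dataset described in Phase 3, it reaches 100% validation accuracy.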

Technologies Used

  • AWS Lambda: Serverless preprocessing and feature extraction
  • Amazon SageMaker: ML model training and batch inference
  • Amazon S3: Data lake architecture
  • Amazon SNS: Real-time threat alerting
  • Python: XGBoost, Pandas, Scikit-learn
  • Machine Learning: 36 engineered features for threat detection

Phase 1: Development Environment Setup

Set up the Python environment with proper dependency management:

mkdir cyber-threat-detection-sagemaker
cd cyber-threat-detection-sagemaker

python -m venv .venv
.\.venv\Scripts\Activate.ps1

pip install --only-binary=:all: numpy pandas scikit-learn
pip install boto3 xgboost

Configure AWS CLI credentials:

aws configure
aws sts get-caller-identity

Phase 2: Advanced Feature Engineering

The Lambda preprocessing function transforms raw logs into 36 threat indicators across five categories:

Feature Categories

  1. Network Behavior Analysis (12 features)
     • IP geolocation risk assessment
     • Internal vs external connection detection
     • Protocol anomaly detection
     • Traffic volume patterns
  2. Temporal Pattern Analysis (4 features)
     • Business hours detection
     • Weekend activity monitoring
     • Time-based attack patterns
  3. Port and Service Analysis (7 features)
     • High-risk port identification
     • Service categorization
     • Port usage patterns
  4. Attack Pattern Detection (8 features)
     • SQL injection pattern matching
     • XSS attempt detection
     • DoS attack indicators
     • Suspicious user agent identification
  5. Traffic Characteristics (5 features)
     • Bytes ratio analysis
     • Upload/download patterns
     • Connection frequency monitoring
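
As a rough illustration of how a few of these indicators might be derived, the sketch below computes a handful of features from a single log record. Only is_external_connection and is_well_known_port are names taken from this article; the input field names (src_ip, dst_port, timestamp, payload, bytes_sent, bytes_received), the other feature names, the regex, the high-risk port list, and the 9-to-5 business-hours window are assumptions made for the example, not the actual Lambda code.

```python
import ipaddress
import re
from datetime import datetime

# Hypothetical pattern; a real detector would need a much broader rule set.
SQLI_PATTERN = re.compile(
    r"(\bunion\b.+\bselect\b|\bor\b\s+1\s*=\s*1|;\s*drop\b)", re.IGNORECASE
)
# Assumed list of ports treated as high risk (FTP, Telnet, SMB, RDP).
HIGH_RISK_PORTS = {21, 23, 445, 3389}


def extract_features(record: dict) -> dict:
    """Turn one raw log record (a dict of strings) into numeric features."""
    src = ipaddress.ip_address(record["src_ip"])
    ts = datetime.fromisoformat(record["timestamp"])
    port = int(record["dst_port"])
    sent = float(record["bytes_sent"])
    received = float(record["bytes_received"])
    return {
        # Network behavior: source address outside private ranges
        "is_external_connection": int(not src.is_private),
        # Port and service analysis
        "is_well_known_port": int(port < 1024),
        "is_high_risk_port": int(port in HIGH_RISK_PORTS),
        # Temporal patterns (Monday is weekday 0)
        "is_business_hours": int(ts.weekday() < 5 and 9 <= ts.hour < 17),
        "is_weekend": int(ts.weekday() >= 5),
        # Attack pattern detection
        "has_sqli_pattern": int(bool(SQLI_PATTERN.search(record.get("payload", "")))),
        # Traffic characteristics; +1 avoids division by zero
        "bytes_ratio": sent / (received + 1.0),
    }
```

Each feature is emitted as a number so the resulting rows can be written straight to CSV for the model.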
Phase 3: Realistic Threat Data Generation

The generated dataset contains 2,500 network log records across six traffic categories:

| Threat Type | Percentage | Count | Key Characteristics |
| --- | --- | --- | --- |
| Normal Traffic | 60.6% | 1,500 | Regular browsing, DNS queries |
| SQL Injection | 5.9% | 150 | Malicious payloads, suspicious agents |
| DoS/DDoS | 7.9% | 200 | High frequency, small packets |
| Port Scanning | 12.0% | 300 | Sequential ports, short duration |
| Brute Force | 9.8% | 250 | Repeated login attempts |
| Data Exfiltration | 3.7% | 100 | Large outbound transfers |

Phase 4: Machine Learning Model Training

The XGBoost training pipeline applies the following practices:

  • Class imbalance handling with automatic weight computation
  • Robust data validation for infinite/NaN values
  • Comprehensive per-class metrics and AUC scores
  • Feature importance analysis for interpretability
  • SageMaker-compatible interface
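
The first two practices can be sketched in plain Python. This is a minimal illustration that assumes "automatic weight computation" means the common balanced heuristic (the same formula scikit-learn uses for class_weight="balanced"); the function names are hypothetical and the actual training script may differ.

```python
import math
from collections import Counter


def balanced_class_weights(labels):
    """Balanced heuristic: weight_c = n_samples / (n_classes * count_c)."""
    counts = Counter(labels)
    n_samples, n_classes = len(labels), len(counts)
    return {c: n_samples / (n_classes * cnt) for c, cnt in counts.items()}


def sample_weights(labels):
    """Expand class weights into one weight per training row."""
    weights = balanced_class_weights(labels)
    return [weights[y] for y in labels]


def drop_non_finite(rows):
    """Keep only feature rows in which every value is finite (no NaN or inf)."""
    return [row for row in rows if all(math.isfinite(v) for v in row)]
```

The per-row weights could then be passed as sample_weight when fitting an XGBoost classifier, so that rare classes such as data exfiltration count as much as normal traffic during training.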

Model Performance

  • Training Accuracy: 99.96%
  • Validation Accuracy: 100%
  • Validation AUC: 1.0000
  • Per-Class Detection: Perfect precision/recall across all threat types
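
For readers who want to reproduce the per-class figures, precision and recall can be computed directly from true and predicted labels. The helper below is a dependency-free sketch with a hypothetical name, not the evaluation code used to produce the numbers above.

```python
def per_class_precision_recall(y_true, y_pred):
    """Return {class: {"precision": p, "recall": r}} for every class seen."""
    report = {}
    for c in sorted(set(y_true) | set(y_pred)):
        pairs = list(zip(y_true, y_pred))
        tp = sum(t == c and p == c for t, p in pairs)  # correctly flagged as c
        fp = sum(t != c and p == c for t, p in pairs)  # wrongly flagged as c
        fn = sum(t == c and p != c for t, p in pairs)  # missed instances of c
        report[c] = {
            "precision": tp / (tp + fp) if tp + fp else 0.0,
            "recall": tp / (tp + fn) if tp + fn else 0.0,
        }
    return report
```

"Perfect precision/recall" means both values equal 1.0 for every class, that is, no false positives and no missed detections on the validation split.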

Top 5 Critical Features

  1. is_external_connection (29.94%)
  2. ip_geolocation_risk (12.38%)
  3. is_well_known_port (10.62%)
  4. potential_dos (9.69%)
  5. protocol_numeric (9.49%)

Phase 5: AWS Production Deployment

The deployment script executes six automated steps to create the complete infrastructure:

Step 1: S3 Bucket Creation

Three buckets with account-specific naming (Account ID: xxxxxxxxxxxx):

  • cyber-threat-raw-data-xxxxxxxxxxxx – Network log ingestion
  • cyber-threat-processed-data-xxxxxxxxxxxx – Feature storage
  • cyber-threat-model-artifacts-xxxxxxxxxxxx – Model and results
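
A minimal sketch of this step is shown below, assuming the deployment script uses boto3. The function names are hypothetical; the S3 client is passed in as a parameter so the naming logic can be exercised without AWS access.

```python
BUCKET_PREFIXES = (
    "cyber-threat-raw-data",
    "cyber-threat-processed-data",
    "cyber-threat-model-artifacts",
)


def bucket_names(account_id):
    """Append the account ID so the names are unique per account."""
    return [f"{prefix}-{account_id}" for prefix in BUCKET_PREFIXES]


def create_buckets(s3_client, account_id, region="us-east-1"):
    """Create the three buckets; s3_client is expected to be a boto3 S3 client."""
    created = []
    for name in bucket_names(account_id):
        kwargs = {"Bucket": name}
        # Outside us-east-1, S3 requires an explicit location constraint.
        if region != "us-east-1":
            kwargs["CreateBucketConfiguration"] = {"LocationConstraint": region}
        s3_client.create_bucket(**kwargs)
        created.append(name)
    return created
```

With credentials configured, this could be called as create_buckets(boto3.client("s3"), account_id), where account_id comes from the aws sts get-caller-identity output.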

Step 2: IAM Role Configuration

Creates CyberThreatDetectionSageMakerRole with:

  • Trust policies for SageMaker and Lambda services
  • S3 read/write permissions for all threat detection buckets
  • CloudWatch logging capabilities
  • SageMaker training and inference permissions
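
The trust and S3 policies for such a role might look like the sketch below. The exact statements in the deployment script are not shown in this article, so the action list and function names here are assumptions.

```python
import json


def trust_policy():
    """Allow both SageMaker and Lambda to assume the role."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Principal": {
                    "Service": ["sagemaker.amazonaws.com", "lambda.amazonaws.com"]
                },
                "Action": "sts:AssumeRole",
            }
        ],
    }


def s3_access_policy(bucket_names):
    """Read/write access limited to the named buckets (assumed action list)."""
    resources = []
    for name in bucket_names:
        # Bucket ARN covers ListBucket; the /* ARN covers object operations.
        resources += [f"arn:aws:s3:::{name}", f"arn:aws:s3:::{name}/*"]
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Action": ["s3:GetObject", "s3:PutObject", "s3:ListBucket"],
                "Resource": resources,
            }
        ],
    }


def to_policy_document(policy):
    """IAM APIs take policy documents as JSON strings."""
    return json.dumps(policy)
```

The JSON strings would be passed to the IAM client's create_role (as AssumeRolePolicyDocument) and put_role_policy (as PolicyDocument). Scoping Resource to the three buckets, rather than using a wildcard, is what keeps the role least-privilege.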

Step 3: Training Data Upload

Uploads datasets to S3:

  • Raw network logs to raw-data/ folder
  • Processed features to train/ folder

Step 4: Lambda Function Deployment

Deploys cyber-threat-detector-sagemaker:

  • Python 3.9 runtime with preprocessing code
  • 300-second timeout, 256 MB memory
  • Integrated with SageMaker for batch inference
  • Automatic updates for existing functions
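
The create-or-update behavior could be implemented roughly as follows with a boto3 Lambda client. The handler string "lambda_function.lambda_handler" and the function name deploy_function are assumptions; the runtime, timeout, and memory values are the ones listed above.

```python
def deploy_function(lambda_client, zip_bytes, role_arn,
                    name="cyber-threat-detector-sagemaker"):
    """Create the function if it is missing, otherwise update its code.

    lambda_client is expected to be a boto3 Lambda client; zip_bytes is the
    deployment package read from disk as bytes.
    """
    try:
        lambda_client.get_function(FunctionName=name)
    except lambda_client.exceptions.ResourceNotFoundException:
        lambda_client.create_function(
            FunctionName=name,
            Runtime="python3.9",
            Role=role_arn,
            Handler="lambda_function.lambda_handler",  # assumed module.function
            Code={"ZipFile": zip_bytes},
            Timeout=300,      # seconds
            MemorySize=256,   # MB
        )
        return "created"
    # Function already exists: push the new code package only.
    lambda_client.update_function_code(FunctionName=name, ZipFile=zip_bytes)
    return "updated"
```

Checking for the function first makes the deployment script safe to re-run after code changes.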

Step 5: SageMaker Model Registration

Registers trained XGBoost model:

  • Model name: cyber-threat-detector-xxxxxxxxxx
  • Ready for batch transform jobs
  • Configured for real-time threat classification
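
Registration amounts to one create_model call on the SageMaker client. The sketch below only builds the request; the helper name is hypothetical, and the container image URI and artifact location are placeholders the caller must supply.

```python
def model_request(model_name, image_uri, model_data_url, role_arn):
    """Build keyword arguments for the SageMaker create_model API.

    image_uri is the URI of an XGBoost inference container image and
    model_data_url is the S3 location of the trained model archive.
    """
    # SageMaker expects the model artifact as a gzipped tarball in S3.
    if not (model_data_url.startswith("s3://") and model_data_url.endswith(".tar.gz")):
        raise ValueError("model_data_url must be an s3:// URI ending in .tar.gz")
    return {
        "ModelName": model_name,
        "PrimaryContainer": {"Image": image_uri, "ModelDataUrl": model_data_url},
        "ExecutionRoleArn": role_arn,
    }
```

The result would be passed as sagemaker_client.create_model(**model_request(...)), after which the model name can be referenced by batch transform jobs.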

Step 6: SNS Alert Configuration

Sets up cyber-threat-alerts topic:

  • Real-time threat notifications
  • Email/SMS subscription support
  • Integration with Lambda for automated alerting

Phase 6: Production Testing and Validation

Deployed Resources Summary

| Component | Resource Name | Purpose |
| --- | --- | --- |
| Lambda Function | cyber-threat-detector-sagemaker | Preprocessing & SageMaker integration |
| SageMaker Model | cyber-threat-detector-xxxxxxxxxx | XGBoost threat classification |
| S3 Raw Data | cyber-threat-raw-data-xxxxxxxxxxxx | Network log ingestion |
| S3 Processed | cyber-threat-processed-data-xxxxxxxxxxxx | Feature storage |
| S3 Artifacts | cyber-threat-model-artifacts-xxxxxxxxxxxx | Model & results storage |
| SNS Topic | cyber-threat-alerts | Threat notifications |
| IAM Role | CyberThreatDetectionSageMakerRole | Security & permissions |

Real-Time Threat Detection Flow

The system processes network logs automatically:

  1. Upload: CSV files uploaded to raw-data/ folder
  2. Trigger: S3 event invokes Lambda function
  3. Feature Extraction: 36 features extracted from each log
  4. Batch Transform: SageMaker processes features
  5. Classification: Model identifies threat patterns
  6. Alerting: SNS sends notifications for detected threats
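
Steps 1 and 2 hinge on reading the S3 event that Lambda receives. The sketch below shows one way to pull the uploaded CSV objects out of that event; the function name and the .csv filter are assumptions, while the raw-data/ prefix comes from the flow above.

```python
from urllib.parse import unquote_plus


def csv_objects_from_event(event, prefix="raw-data/"):
    """Return (bucket, key) pairs for CSV uploads under the given prefix."""
    found = []
    for record in event.get("Records", []):
        bucket = record["s3"]["bucket"]["name"]
        # Object keys arrive URL-encoded in S3 event notifications.
        key = unquote_plus(record["s3"]["object"]["key"])
        if key.startswith(prefix) and key.lower().endswith(".csv"):
            found.append((bucket, key))
    return found
```

A handler would loop over the returned pairs, download each object, run feature extraction, and write the feature rows to the processed-data bucket before starting inference.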

Testing the Pipeline

Upload test network logs to verify end-to-end processing:

aws s3 cp test_network_logs.csv s3://cyber-threat-raw-data-xxxxxxxxxxxx/raw-data/

Subscribe to Threat Alerts

Configure email notifications:

aws sns subscribe \
  --topic-arn arn:aws:sns:us-east-1:xxxxxxxxxxxx:cyber-threat-alerts \
  --protocol email \
  --notification-endpoint your-email@company.com

Production Pipeline in Action

Automatic Processing Workflow

When network logs are uploaded to S3, the system automatically:

  1. Triggers the Lambda function via an S3 event
  2. Extracts 36 features from the raw logs
  3. Submits a batch transform job to SageMaker
  4. Classifies threats with confidence scores
  5. Sends SNS alerts for detected threats
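
Steps 3 to 5 can be sketched as a request builder for the SageMaker create_transform_job API plus a small alerting helper. Several details are assumptions: the ml.m5.large instance type, the class order in CLASS_NAMES (index 0 taken to be normal traffic), the 0.5 confidence threshold, the message format, and that the model output has already been parsed into one list of class probabilities per record.

```python
CLASS_NAMES = [  # assumed label order; index 0 is treated as benign
    "Normal Traffic", "SQL Injection", "DoS/DDoS",
    "Port Scanning", "Brute Force", "Data Exfiltration",
]


def transform_job_request(job_name, model_name, input_uri, output_uri,
                          instance_type="ml.m5.large"):
    """Build keyword arguments for sagemaker_client.create_transform_job."""
    return {
        "TransformJobName": job_name,
        "ModelName": model_name,
        "TransformInput": {
            "DataSource": {
                "S3DataSource": {"S3DataType": "S3Prefix", "S3Uri": input_uri}
            },
            "ContentType": "text/csv",
            "SplitType": "Line",  # one feature row per line
        },
        "TransformOutput": {"S3OutputPath": output_uri},
        "TransformResources": {"InstanceType": instance_type, "InstanceCount": 1},
    }


def detected_threats(probability_rows, min_confidence=0.5):
    """Pick the most likely class per row and keep confident non-normal ones."""
    threats = []
    for i, row in enumerate(probability_rows):
        best = max(range(len(row)), key=row.__getitem__)
        if best != 0 and row[best] >= min_confidence:
            threats.append(
                {"row": i, "threat": CLASS_NAMES[best], "confidence": row[best]}
            )
    return threats


def publish_alerts(sns_client, topic_arn, threats):
    """Send one summary message; returns False when there is nothing to report."""
    if not threats:
        return False
    lines = [
        f"row {t['row']}: {t['threat']} ({t['confidence']:.0%})" for t in threats
    ]
    sns_client.publish(
        TopicArn=topic_arn,
        Subject="Cyber threat alert",
        Message="\n".join(lines),
    )
    return True
```

Sending a single summary per batch, rather than one message per record, keeps subscribers from being flooded when a large log file contains many detections.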

Performance Metrics

  • Processing Speed: 2,500+ records in ~5 minutes
  • Feature Extraction: 36 features per network log entry
  • Detection Latency: Sub-minute threat identification
  • Cost Efficiency: Pay-per-use serverless architecture
  • Scalability: Auto-scales with traffic volume

Architecture Benefits

Serverless Advantages

  • Cost-effective pay-per-use model
  • Automatic scaling with traffic spikes
  • High availability with AWS-managed infrastructure
  • Sub-minute threat detection capabilities

Security Features

  • IP address hashing for privacy
  • S3 server-side encryption at rest
  • Least-privilege IAM access controls
  • CloudTrail audit logging enabled

System Capabilities

  • Processing: 2,500+ records in ~5 minutes
  • Accuracy: 100% validation performance across all threat types
  • Features: 36 sophisticated threat indicators extracted per log
  • Cost: ~$10-50 monthly depending on usage volume

Key Achievements

  • Complete AWS infrastructure deployed with Lambda and SageMaker integration
  • Advanced ML pipeline with 36 engineered features
  • Real-time processing via serverless Lambda functions
  • SageMaker batch transform for scalable inference
  • Perfect model accuracy with 100% validation performance
  • Production testing verified with SNS alerting
  • Cost-effective serverless architecture (~$10-50/month)

Future Enhancements

  • Real-time inference endpoints for immediate threat detection
  • CloudWatch dashboards for monitoring and metrics
  • Kinesis integration for streaming log processing
  • SIEM integration (Splunk, QRadar) for enterprise security
  • Automated response workflows for threat remediation
  • Multi-region deployment for high availability

Conclusion

This production-ready threat detection pipeline demonstrates enterprise-level security capabilities using AWS Lambda for preprocessing and SageMaker for inference. The system reaches 100% validation accuracy across six traffic categories (five attack types plus normal traffic) while maintaining cost-effective, scalable infrastructure.

The complete serverless architecture processes 2,500+ records in ~5 minutes with automatic alerting through SNS, providing organizations with real-time threat visibility at minimal operational cost.


Complete source code and deployment scripts: [GitHub Repository Link]
