Traditional signature-based security systems struggle against sophisticated cyber threats. This guide demonstrates how to build a production-ready, machine learning-powered threat detection system using AWS Lambda for preprocessing and SageMaker for inference.
The Problem Statement
Modern cyber threats evolve rapidly, requiring intelligent systems that can identify attack patterns in real-time. Organizations need scalable, cost-effective solutions that can process network logs and detect multiple attack vectors including SQL injection, DoS attacks, port scanning, brute force attempts, and data exfiltration.
Architecture Overview
Our threat detection pipeline processes network logs through automated stages:
Network Logs → Lambda Preprocessing → Feature Extraction →
SageMaker Batch Transform → Threat Classification → SNS Alerts

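The system extracts 36 engineered features from each raw network log record and classifies it into one of six categories (normal traffic plus five attack types), reaching 100% validation accuracy on the generated dataset described in Phase 3.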
Technologies Used
- AWS Lambda: Serverless preprocessing and feature extraction
- Amazon SageMaker: ML model training and batch inference
- Amazon S3: Data lake architecture
- Amazon SNS: Real-time threat alerting
- Python: XGBoost, Pandas, Scikit-learn
- Machine Learning: 36 engineered features for threat detection
Phase 1: Development Environment Setup
Set up the Python environment with proper dependency management:
mkdir cyber-threat-detection-sagemaker
cd cyber-threat-detection-sagemaker
python -m venv .venv
.\.venv\Scripts\Activate.ps1
pip install --only-binary=:all: numpy pandas scikit-learn
pip install boto3 xgboost


Configure AWS CLI credentials:
aws configure
aws sts get-caller-identity


Phase 2: Advanced Feature Engineering
The Lambda preprocessing function transforms raw logs into 36 threat indicators across five categories:
Feature Categories
- Network Behavior Analysis (12 features)
  - IP geolocation risk assessment
  - Internal vs external connection detection
  - Protocol anomaly detection
  - Traffic volume patterns
- Temporal Pattern Analysis (4 features)
  - Business hours detection
  - Weekend activity monitoring
  - Time-based attack patterns
- Port and Service Analysis (7 features)
  - High-risk port identification
  - Service categorization
  - Port usage patterns
- Attack Pattern Detection (8 features)
  - SQL injection pattern matching
  - XSS attempt detection
  - DoS attack indicators
  - Suspicious user agent identification
- Traffic Characteristics (5 features)
  - Bytes ratio analysis
  - Upload/download patterns
  - Connection frequency monitoring
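To make the categories concrete, here is a minimal sketch of how a few of these features might be derived from a single log record. The raw field names (`timestamp`, `dst_ip`, `dst_port`, `bytes_sent`, `bytes_received`, `payload`), the regular expression, the high-risk port set, and the business-hours window are all assumptions for illustration; only `is_external_connection` and `is_well_known_port` are feature names taken from this guide.

```python
import ipaddress
import re
from datetime import datetime

# Assumed, simplified SQL injection signature; a real detector would use more patterns.
SQLI_PATTERN = re.compile(r"union\s+select|\bor\s+1\s*=\s*1|drop\s+table", re.IGNORECASE)
# Assumed examples of high-risk ports (Telnet, SMB, RDP).
HIGH_RISK_PORTS = {23, 445, 3389}


def extract_features(log: dict) -> dict:
    """Derive a handful of illustrative features from one raw log record."""
    ts = datetime.fromisoformat(log["timestamp"])
    dst_ip = ipaddress.ip_address(log["dst_ip"])
    return {
        # Network behavior: does the connection leave private address space?
        "is_external_connection": int(not dst_ip.is_private),
        # Temporal patterns (assumed window: Monday to Friday, 09:00 to 17:00)
        "is_business_hours": int(ts.weekday() < 5 and 9 <= ts.hour < 17),
        "is_weekend": int(ts.weekday() >= 5),
        # Port and service analysis
        "is_well_known_port": int(log["dst_port"] < 1024),
        "is_high_risk_port": int(log["dst_port"] in HIGH_RISK_PORTS),
        # Attack pattern detection
        "has_sqli_pattern": int(bool(SQLI_PATTERN.search(log.get("payload", "")))),
        # Traffic characteristics: the +1 avoids division by zero
        "bytes_ratio": log["bytes_sent"] / (log["bytes_received"] + 1),
    }


sample = {
    "timestamp": "2024-03-09T02:15:00",  # a Saturday, outside business hours
    "dst_ip": "8.8.8.8",
    "dst_port": 443,
    "bytes_sent": 500,
    "bytes_received": 99,
    "payload": "id=1 UNION SELECT password FROM users",
}
features = extract_features(sample)
print(features)
```

Encoding each indicator as a 0/1 integer or a plain ratio keeps the output directly usable as a numeric feature row for XGBoost.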



Phase 3: Realistic Threat Data Generation
The generated dataset contains 2,500 network log records across six categories (normal traffic plus five attack types):
| Threat Type | Percentage | Count | Key Characteristics |
|---|---|---|---|
| Normal Traffic | 60.6% | 1,500 | Regular browsing, DNS queries |
| SQL Injection | 5.9% | 150 | Malicious payloads, suspicious agents |
| DoS/DDoS | 7.9% | 200 | High frequency, small packets |
| Port Scanning | 12.0% | 300 | Sequential ports, short duration |
| Brute Force | 9.8% | 250 | Repeated login attempts |
| Data Exfiltration | 3.7% | 100 | Large outbound transfers |
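A dataset with these class counts could be generated along the following lines. This is only a sketch: the class counts come from the table above, but the field names, label strings, and value ranges are assumptions chosen to mirror the "Key Characteristics" column, not the generator actually used.

```python
import random

# Class counts taken from the table above; label strings are hypothetical.
CLASS_COUNTS = {
    "normal": 1500,
    "sql_injection": 150,
    "dos": 200,
    "port_scan": 300,
    "brute_force": 250,
    "data_exfiltration": 100,
}


def make_record(label: str, rng: random.Random) -> dict:
    """Build one synthetic log record; all value ranges are illustrative assumptions."""
    record = {
        "label": label,
        "dst_port": rng.choice([80, 443, 53]),
        "bytes_sent": rng.randint(200, 5_000),
        "duration_s": round(rng.uniform(0.1, 30.0), 2),
    }
    if label == "dos":
        record["bytes_sent"] = rng.randint(40, 100)  # small packets
    elif label == "port_scan":
        record["dst_port"] = rng.randint(1, 1024)
        record["duration_s"] = round(rng.uniform(0.01, 0.5), 2)  # short duration
    elif label == "data_exfiltration":
        record["bytes_sent"] = rng.randint(1_000_000, 50_000_000)  # large outbound transfer
    return record


def generate_dataset(seed: int = 42) -> list:
    """Generate and shuffle records for every class using a seeded RNG."""
    rng = random.Random(seed)
    records = [
        make_record(label, rng)
        for label, count in CLASS_COUNTS.items()
        for _ in range(count)
    ]
    rng.shuffle(records)
    return records


dataset = generate_dataset()
print(len(dataset))  # 2500
```

Seeding the generator makes the dataset reproducible, which helps when comparing model runs.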


Phase 4: Machine Learning Model Training
The XGBoost training pipeline implements enterprise-grade practices:
- Class imbalance handling with automatic weight computation
- Robust data validation for infinite/NaN values
- Comprehensive per-class metrics and AUC scores
- Feature importance analysis for interpretability
- SageMaker-compatible interface
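The first two practices can be sketched in a few lines of plain Python. The guide does not show how the weights are computed, so the example below assumes the common "balanced" heuristic (the same formula scikit-learn uses for `class_weight="balanced"`); the function names and label strings are hypothetical.

```python
import math
from collections import Counter


def balanced_class_weights(labels):
    """weight = n_samples / (n_classes * class_count), so rare classes weigh more."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {cls: n_samples / (n_classes * count) for cls, count in counts.items()}


def find_invalid_rows(rows):
    """Return the indices of feature rows that contain NaN or infinite values."""
    return [i for i, row in enumerate(rows) if any(not math.isfinite(v) for v in row)]


# Labels mirroring the class counts of the 2,500-record dataset.
labels = (
    ["normal"] * 1500 + ["sql_injection"] * 150 + ["dos"] * 200
    + ["port_scan"] * 300 + ["brute_force"] * 250 + ["data_exfiltration"] * 100
)
weights = balanced_class_weights(labels)
print(round(weights["normal"], 4))             # 0.2778
print(round(weights["data_exfiltration"], 4))  # 4.1667

rows = [[1.0, 2.0], [float("nan"), 0.5], [3.0, float("inf")]]
print(find_invalid_rows(rows))  # [1, 2]
```

Per-record weights derived this way can then be passed to the trainer as sample weights, so that the 100 data exfiltration records are not drowned out by the 1,500 normal ones.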
Model Performance
- Training Accuracy: 99.96%
- Validation Accuracy: 100%
- Validation AUC: 1.0000
- Per-Class Detection: Perfect precision/recall across all threat types
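For readers unfamiliar with per-class metrics, the sketch below shows how precision and recall are computed for each class from true and predicted labels. The toy labels are made up for illustration and are not results from this model; "perfect" per-class detection means every class scores 1.0 on both values.

```python
def per_class_metrics(y_true, y_pred):
    """Compute precision and recall for every class seen in either list."""
    metrics = {}
    for cls in sorted(set(y_true) | set(y_pred)):
        pairs = list(zip(y_true, y_pred))
        tp = sum(1 for t, p in pairs if t == cls and p == cls)
        fp = sum(1 for t, p in pairs if t != cls and p == cls)
        fn = sum(1 for t, p in pairs if t == cls and p != cls)
        metrics[cls] = {
            # Precision: of the records flagged as this class, how many were right?
            "precision": tp / (tp + fp) if tp + fp else 0.0,
            # Recall: of the records truly in this class, how many were found?
            "recall": tp / (tp + fn) if tp + fn else 0.0,
        }
    return metrics


# Toy example: one normal record is misclassified as DoS.
y_true = ["normal", "normal", "dos", "dos", "port_scan"]
y_pred = ["normal", "dos", "dos", "dos", "port_scan"]
report = per_class_metrics(y_true, y_pred)
for cls, scores in report.items():
    print(cls, scores)
```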
Top 5 Critical Features
- is_external_connection (29.94%)
- ip_geolocation_risk (12.38%)
- is_well_known_port (10.62%)
- potential_dos (9.69%)
- protocol_numeric (9.49%)
Phase 5: AWS Production Deployment
The deployment script executes six automated steps to create the complete infrastructure:
Step 1: S3 Bucket Creation
Three buckets with account-specific naming (Account ID: xxxxxxxxxxxx):
- cyber-threat-raw-data-xxxxxxxxxxxx: Network log ingestion
- cyber-threat-processed-data-xxxxxxxxxxxx: Feature storage
- cyber-threat-model-artifacts-xxxxxxxxxxxx: Model and results

Step 2: IAM Role Configuration
Creates CyberThreatDetectionSageMakerRole with:
- Trust policies for SageMaker and Lambda services
- S3 read/write permissions for all threat detection buckets
- CloudWatch logging capabilities
- SageMaker training and inference permissions
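As a rough illustration, the trust policy and the S3 permissions for such a role might be built as plain policy documents like this. The helper names and the exact set of S3 actions are assumptions; the deployment script's real policies are not shown in this guide.

```python
import json

ROLE_NAME = "CyberThreatDetectionSageMakerRole"
BUCKET_KINDS = ("raw-data", "processed-data", "model-artifacts")


def build_trust_policy() -> dict:
    """Allow both SageMaker and Lambda to assume the role."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Principal": {
                    "Service": ["sagemaker.amazonaws.com", "lambda.amazonaws.com"]
                },
                "Action": "sts:AssumeRole",
            }
        ],
    }


def build_s3_policy(account_id: str) -> dict:
    """Grant read/write access limited to the three threat detection buckets."""
    resources = []
    for kind in BUCKET_KINDS:
        bucket_arn = f"arn:aws:s3:::cyber-threat-{kind}-{account_id}"
        resources += [bucket_arn, f"{bucket_arn}/*"]
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                # Assumed minimal action set for read/write access.
                "Action": ["s3:GetObject", "s3:PutObject", "s3:ListBucket"],
                "Resource": resources,
            }
        ],
    }


trust_policy = build_trust_policy()
s3_policy = build_s3_policy("123456789012")  # placeholder account ID
print(json.dumps(trust_policy, indent=2))
```

With boto3, the first document would typically be passed as `AssumeRolePolicyDocument=json.dumps(trust_policy)` to `iam.create_role`, and the second attached with `iam.put_role_policy`.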

Step 3: Training Data Upload
Uploads datasets to S3:
- Raw network logs to the raw-data/ folder
- Processed features to the train/ folder
Step 4: Lambda Function Deployment
Deploys cyber-threat-detector-sagemaker:
- Python 3.9 runtime with preprocessing code
- 300-second timeout, 256MB memory
- Integrated with SageMaker for batch inference
- Automatic updates for existing functions
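The "create, or update if it already exists" behavior can be sketched as follows. The function takes a boto3 Lambda client as a parameter (for example `boto3.client("lambda")`); the handler path `lambda_function.lambda_handler` is an assumption, while the runtime, timeout, and memory settings come from the list above.

```python
FUNCTION_NAME = "cyber-threat-detector-sagemaker"


def deploy_function(lambda_client, zip_bytes: bytes, role_arn: str) -> str:
    """Create the function, or update its code if it already exists."""
    config = {
        "FunctionName": FUNCTION_NAME,
        "Runtime": "python3.9",
        "Role": role_arn,
        "Handler": "lambda_function.lambda_handler",  # assumed module and handler name
        "Timeout": 300,      # seconds
        "MemorySize": 256,   # MB
    }
    try:
        lambda_client.create_function(Code={"ZipFile": zip_bytes}, **config)
        return "created"
    except lambda_client.exceptions.ResourceConflictException:
        # The function already exists, so only push the new code package.
        lambda_client.update_function_code(FunctionName=FUNCTION_NAME, ZipFile=zip_bytes)
        return "updated"
```

Catching the conflict error rather than checking for existence first keeps the deployment to a single call in the common case.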

Step 5: SageMaker Model Registration
Registers trained XGBoost model:

- Model name: cyber-threat-detector-xxxxxxxxxx
- Ready for batch transform jobs
- Configured for real-time threat classification
Step 6: SNS Alert Configuration
Sets up cyber-threat-alerts topic:
- Real-time threat notifications
- Email/SMS subscription support
- Integration with Lambda for automated alerting
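One possible shape for the alerting step is sketched below. The prediction format (a list of dicts with `label` and `confidence` keys), the 0.8 confidence threshold, and the message layout are assumptions; `sns_client` is expected to be a boto3 SNS client such as `boto3.client("sns")`.

```python
import json
from collections import Counter


def build_alert(predictions, source_key, min_confidence=0.8):
    """Summarize confident non-normal predictions, or return None if there are none."""
    threats = [
        p for p in predictions
        if p["label"] != "normal" and p["confidence"] >= min_confidence
    ]
    if not threats:
        return None
    counts = Counter(p["label"] for p in threats)
    subject = f"Cyber threat alert: {len(threats)} suspicious record(s)"
    message = json.dumps(
        {"source": source_key, "threat_counts": dict(counts)},
        indent=2,
        sort_keys=True,
    )
    # SNS limits email subjects to 100 characters.
    return {"Subject": subject[:100], "Message": message}


def send_alert(sns_client, topic_arn, alert):
    """Publish a prepared alert to the SNS topic."""
    return sns_client.publish(
        TopicArn=topic_arn,
        Subject=alert["Subject"],
        Message=alert["Message"],
    )
```

Aggregating per file rather than publishing one message per record keeps a noisy upload from flooding subscribers.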

Phase 6: Production Testing and Validation
Deployed Resources Summary
| Component | Resource Name | Purpose |
|---|---|---|
| Lambda Function | cyber-threat-detector-sagemaker | Preprocessing & SageMaker integration |
| SageMaker Model | cyber-threat-detector-xxxxxxxxxx | XGBoost threat classification |
| S3 Raw Data | cyber-threat-raw-data-xxxxxxxxxxxx | Network log ingestion |
| S3 Processed | cyber-threat-processed-data-xxxxxxxxxxxx | Feature storage |
| S3 Artifacts | cyber-threat-model-artifacts-xxxxxxxxxxxx | Model & results storage |
| SNS Topic | cyber-threat-alerts | Threat notifications |
| IAM Role | CyberThreatDetectionSageMakerRole | Security & permissions |
Real-Time Threat Detection Flow
The system processes network logs automatically:
- Upload: CSV files uploaded to the raw-data/ folder
- Trigger: S3 event invokes Lambda function
- Feature Extraction: 36 features extracted from each log
- Batch Transform: SageMaker processes features
- Classification: Model identifies threat patterns
- Alerting: SNS sends notifications for detected threats
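The trigger step of this flow can be illustrated with a minimal handler that reads the bucket and object key out of the S3 notification event. The event layout follows the standard S3 notification structure; the `.csv` filter and the shape of the return value are assumptions, and the later stages are left as a comment.

```python
from urllib.parse import unquote_plus


def parse_s3_event(event: dict) -> list:
    """Return (bucket, key) pairs for every object in an S3 notification event."""
    objects = []
    for record in event.get("Records", []):
        bucket = record["s3"]["bucket"]["name"]
        # Object keys arrive URL-encoded (for example, spaces become '+').
        key = unquote_plus(record["s3"]["object"]["key"])
        objects.append((bucket, key))
    return objects


def lambda_handler(event, context):
    """Entry point: pick out the uploaded CSV files that need processing."""
    csv_objects = [(b, k) for b, k in parse_s3_event(event) if k.endswith(".csv")]
    # Feature extraction and the batch transform submission would follow here.
    return {
        "statusCode": 200,
        "queued": [f"s3://{bucket}/{key}" for bucket, key in csv_objects],
    }


example_event = {
    "Records": [
        {"s3": {"bucket": {"name": "example-bucket"},
                "object": {"key": "raw-data/network+logs.csv"}}},
        {"s3": {"bucket": {"name": "example-bucket"},
                "object": {"key": "raw-data/readme.txt"}}},
    ]
}
result = lambda_handler(example_event, None)
print(result["queued"])  # ['s3://example-bucket/raw-data/network logs.csv']
```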
Testing the Pipeline
Upload test network logs to verify end-to-end processing:
aws s3 cp test_network_logs.csv s3://cyber-threat-raw-data-xxxxxxxxxxxx/raw-data/
Subscribe to Threat Alerts
Configure email notifications:
aws sns subscribe \
--topic-arn arn:aws:sns:us-east-1:xxxxxxxxxxxx:cyber-threat-alerts \
--protocol email \
--notification-endpoint your-email@company.com
Production Pipeline in Action
Automatic Processing Workflow
When network logs are uploaded to S3, the system automatically:
- Triggers Lambda function via S3 event
- Extracts 36 features from raw logs
- Submits batch transform job to SageMaker
- Model classifies threats with confidence scores
- SNS sends alerts for detected threats
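The batch transform submission in this workflow might look like the sketch below, which builds the request for SageMaker's `create_transform_job` API. The job name prefix, the `ml.m5.large` instance type, and the single-instance setting are assumptions; the input and output locations are passed in by the caller.

```python
from datetime import datetime, timezone


def build_transform_request(model_name, input_s3_uri, output_s3_uri, now=None):
    """Build keyword arguments for the SageMaker create_transform_job call."""
    now = now or datetime.now(timezone.utc)
    return {
        # Job names must be unique, so a UTC timestamp is appended (assumed naming scheme).
        "TransformJobName": f"threat-detect-{now:%Y%m%d-%H%M%S}",
        "ModelName": model_name,
        "TransformInput": {
            "DataSource": {
                "S3DataSource": {"S3DataType": "S3Prefix", "S3Uri": input_s3_uri}
            },
            "ContentType": "text/csv",
            "SplitType": "Line",  # one feature row per line
        },
        "TransformOutput": {"S3OutputPath": output_s3_uri, "AssembleWith": "Line"},
        # Assumed instance settings; size these to the actual log volume.
        "TransformResources": {"InstanceType": "ml.m5.large", "InstanceCount": 1},
    }


request = build_transform_request(
    model_name="cyber-threat-detector-example",  # placeholder model name
    input_s3_uri="s3://example-processed-bucket/features/batch.csv",
    output_s3_uri="s3://example-artifacts-bucket/predictions/",
    now=datetime(2024, 1, 2, 3, 4, 5, tzinfo=timezone.utc),
)
print(request["TransformJobName"])  # threat-detect-20240102-030405
```

Inside the Lambda function, the request would then be submitted with `boto3.client("sagemaker").create_transform_job(**request)`. Because batch transform jobs run asynchronously, the alerting step has to be driven by job completion rather than by the original upload.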
Performance Metrics
- Processing Speed: 2,500+ records in ~5 minutes
- Feature Extraction: 36 features per network log entry
- Detection Latency: Sub-minute threat identification
- Cost Efficiency: Pay-per-use serverless architecture
- Scalability: Auto-scales with traffic volume
Architecture Benefits
Serverless Advantages
- Cost-effective pay-per-use model
- Automatic scaling with traffic spikes
- High availability with AWS-managed infrastructure
- Sub-minute threat detection capabilities
Security Features
- IP address hashing for privacy
- S3 server-side encryption at rest
- Least-privilege IAM access controls
- CloudTrail audit logging enabled
System Capabilities
- Processing: 2,500+ records in ~5 minutes
- Accuracy: 100% validation performance across all threat types
- Features: 36 sophisticated threat indicators extracted per log
- Cost: ~$10-50 monthly depending on usage volume
Key Achievements
- Complete AWS infrastructure deployed with Lambda and SageMaker integration
- Advanced ML pipeline with 36 engineered features
- Real-time processing via serverless Lambda functions
- SageMaker batch transform for scalable inference
- Perfect model accuracy with 100% validation performance
- Production testing verified with SNS alerting
- Cost-effective serverless architecture (~$10-50/month)
Future Enhancements
- Real-time inference endpoints for immediate threat detection
- CloudWatch dashboards for monitoring and metrics
- Kinesis integration for streaming log processing
- SIEM integration (Splunk, QRadar) for enterprise security
- Automated response workflows for threat remediation
- Multi-region deployment for high availability
Conclusion
This production-ready threat detection pipeline demonstrates enterprise-level security capabilities using AWS Lambda for preprocessing and SageMaker for inference. The system achieves 100% validation accuracy across six traffic classes (five attack types plus normal traffic) while maintaining cost-effective, scalable infrastructure.
The complete serverless architecture processes 2,500+ records in ~5 minutes with automatic alerting through SNS, providing organizations with real-time threat visibility at minimal operational cost.

Complete source code and deployment scripts: [GitHub Repository Link]