Logging & Troubleshooting¶
This guide helps you monitor PyBMP operations, understand log outputs, and troubleshoot common issues.
Log Configuration¶
Log Levels¶
PyBMP supports multiple log levels for different verbosity:
Level | Description | Use Case |
---|---|---|
DEBUG |
Very detailed information | Development and deep debugging |
INFO |
General information about operation | Normal production monitoring |
WARNING |
Warning messages about potential issues | Production with enhanced monitoring |
ERROR |
Error conditions that don't stop operation | Production (default) |
CRITICAL |
Critical errors that may stop operation | Production alerts |
Setting Log Level¶
# Environment variable
export LOG_LEVEL=DEBUG
# Docker run
docker run -e LOG_LEVEL=DEBUG ...
# Docker Compose
environment:
LOG_LEVEL: DEBUG
Log Format¶
PyBMP uses structured logging with the following format:
2024-01-15 10:30:45,123 - module_name - INFO - Message content
TIMESTAMP - LOGGER_NAME - LEVEL - MESSAGE
Log Sources¶
Main Application Logs¶
Monitor the main application container:
# Docker Compose
docker-compose logs -f bmp-collector
# Docker container
docker logs -f bmp-collector
# Follow logs with timestamps
docker-compose logs -f -t bmp-collector
Database Logs¶
Monitor TimescaleDB operations:
# Database container logs
docker-compose logs -f timescaledb
# PostgreSQL logs inside container
docker exec -it bmp-timescaledb tail -f /var/log/postgresql/postgresql-15-main.log
Understanding Log Messages¶
Normal Operation Logs¶
Startup Sequence¶
2024-01-15 10:30:45 - src.main - INFO - Starting BMP Collector...
2024-01-15 10:30:45 - src.database.connection - INFO - Database connection established
2024-01-15 10:30:46 - src.bmp.server - INFO - BMP server starting on 0.0.0.0:11019
2024-01-15 10:30:46 - src.bmp.server - INFO - BMP server listening for connections
Router Connection¶
2024-01-15 10:31:20 - src.bmp.server - INFO - New connection from 192.168.1.10:51234
2024-01-15 10:31:21 - src.bmp.processor - INFO - Router 192.168.1.10 session started
2024-01-15 10:31:21 - src.bmp.processor - INFO - Processing initiation message from 192.168.1.10
Route Processing¶
2024-01-15 10:31:25 - src.bmp.processor - INFO - Processed 150 route updates from 192.168.1.10
2024-01-15 10:31:30 - src.bmp.processor - DEBUG - Route announcement: 10.0.0.0/8 via 192.168.1.1
2024-01-15 10:31:35 - src.bmp.processor - DEBUG - Route withdrawal: 172.16.0.0/12
Warning Messages¶
Buffer Management¶
2024-01-15 10:35:00 - src.bmp.server - WARNING - High buffer usage for connection 192.168.1.10 (85%)
2024-01-15 10:35:30 - src.bmp.processor - WARNING - Batch processing delayed, queue size: 500
Database Performance¶
2024-01-15 10:40:00 - src.database.connection - WARNING - Database query slow: 1.5s
2024-01-15 10:40:15 - src.database.connection - WARNING - Connection pool near capacity: 18/20
Error Messages¶
Connection Issues¶
2024-01-15 10:45:00 - src.bmp.server - ERROR - Connection lost from 192.168.1.10: [Errno 104] Connection reset by peer
2024-01-15 10:45:10 - src.database.connection - ERROR - Database connection failed: connection timeout
Data Processing Errors¶
2024-01-15 10:50:00 - src.bmp.parser - ERROR - Invalid BMP message format from 192.168.1.10
2024-01-15 10:50:05 - src.bmp.bgp_parser - ERROR - Malformed BGP update in BMP message
Monitoring with Prometheus Metrics¶
Key Metrics to Monitor¶
Access metrics at http://localhost:9090/metrics
:
Connection Metrics¶
# Active connections
bmp_active_connections{router_ip="192.168.1.10"} 1
# Total messages received
bmp_messages_received_total{router_ip="192.168.1.10",message_type="route_monitoring"} 1500
# Message processing rate
bmp_message_processing_rate{router_ip="192.168.1.10"} 25.5
Database Metrics¶
# Database queries executed
database_queries_total{operation="insert"} 5000
# Database connection pool usage
database_pool_connections{state="active"} 8
database_pool_connections{state="idle"} 12
# Query execution time
database_query_duration_seconds{operation="insert",quantile="0.95"} 0.05
System Metrics¶
# Memory usage
process_resident_memory_bytes 256000000
# CPU usage
process_cpu_seconds_total 120.5
# File descriptors
process_open_fds 45
Setting Up Alerts¶
Example Prometheus alerting rules:
groups:
- name: pybmp_alerts
rules:
- alert: BMPHighConnectionLoss
expr: rate(bmp_connections_lost_total[5m]) > 0.1
for: 2m
labels:
severity: warning
annotations:
summary: "High BMP connection loss rate"
description: "BMP connection loss rate is {{ $value }} connections/sec"
- alert: BMPDatabaseSlow
expr: database_query_duration_seconds{quantile="0.95"} > 1.0
for: 5m
labels:
severity: critical
annotations:
summary: "Database queries are slow"
description: "95th percentile query time is {{ $value }} seconds"
- alert: BMPHighMemoryUsage
expr: process_resident_memory_bytes > 1000000000
for: 10m
labels:
severity: warning
annotations:
summary: "High memory usage"
description: "PyBMP using {{ $value | humanize1024 }}B of memory"
Common Issues and Solutions¶
1. Router Connection Issues¶
Symptoms¶
ERROR - Connection lost from 192.168.1.10: [Errno 104] Connection reset by peer
WARNING - No data received from 192.168.1.10 for 60 seconds
Possible Causes & Solutions¶
Network Connectivity:
# Test connectivity from PyBMP host to router
telnet 192.168.1.10 179 # BGP port should be open
ping 192.168.1.10
# Check firewall rules
iptables -L -n | grep 11019
ufw status
Router Configuration:
# Verify router BMP configuration
# Check if router can reach PyBMP
show ip route <PYBMP-IP>
show bmp summary
PyBMP Configuration:
# Check if PyBMP is listening on correct interface
netstat -tulpn | grep 11019
ss -tulpn | grep 11019
# Verify container networking (if using Docker)
docker port bmp-collector
2. Database Performance Issues¶
Symptoms¶
WARNING - Database query slow: 2.5s
ERROR - Database connection timeout
WARNING - Connection pool near capacity: 20/20
Solutions¶
Optimize Database:
-- Update table statistics
ANALYZE routes;
ANALYZE route_history;
-- Check for missing indexes
SELECT schemaname, tablename, attname, n_distinct, correlation
FROM pg_stats
WHERE schemaname = 'public' AND tablename = 'routes';
-- Add custom indexes for frequent queries
CREATE INDEX CONCURRENTLY idx_routes_recent_by_router
ON routes(router_ip, time DESC)
WHERE time > NOW() - INTERVAL '7 days';
Tune Database Configuration:
-- postgresql.conf
max_connections = 200
shared_buffers = 2GB
work_mem = 8MB
effective_cache_size = 8GB
checkpoint_timeout = 10min
wal_buffers = 16MB
Adjust Connection Pool:
# Increase database pool size
DB_POOL_SIZE=30
DB_MAX_OVERFLOW=60
# Adjust batch processing
BATCH_SIZE=500
BATCH_TIMEOUT_SECONDS=10
3. High Memory Usage¶
Symptoms¶
Solutions¶
Monitor Memory Usage:
# Container memory usage
docker stats bmp-collector
# Detailed memory breakdown
docker exec bmp-collector cat /proc/meminfo
docker exec bmp-collector ps aux --sort=-%mem
Optimize Configuration:
# Reduce buffer sizes
BMP_BUFFER_SIZE=32768
BMP_MAX_CONNECTIONS=50
# Limit batch sizes
BATCH_SIZE=500
WORKER_THREADS=2
Database Memory:
-- Check database cache hit ratio
SELECT datname,
blks_read,
blks_hit,
round((blks_hit::float/(blks_read+blks_hit+1)*100)::numeric, 2) as cache_hit_ratio
FROM pg_stat_database
WHERE datname = 'bmp_collector';
4. Message Processing Delays¶
Symptoms¶
WARNING - Batch processing delayed, queue size: 1000
INFO - Processing lag: 30 seconds behind real-time
Solutions¶
Increase Processing Capacity:
# More worker threads
WORKER_THREADS=8
# Larger batch sizes (if database can handle)
BATCH_SIZE=2000
# Shorter timeout for faster processing
BATCH_TIMEOUT_SECONDS=2
Database Optimization:
-- Partition old data
SELECT drop_chunks('routes', INTERVAL '30 days');
-- Update chunk interval for better performance
SELECT set_chunk_time_interval('routes', INTERVAL '6 hours');
5. Log File Issues¶
Log Rotation¶
# Setup logrotate for Docker logs
cat > /etc/logrotate.d/docker-pybmp << EOF
/var/lib/docker/containers/*/*.log {
daily
rotate 7
compress
delaycompress
missingok
notifempty
create 0644 root root
postrotate
docker kill --signal=USR1 $(docker ps -q)
endscript
}
EOF
Disk Space Monitoring¶
# Monitor log directory size
du -sh /var/lib/docker/containers/*/
# Clean old logs
docker system prune -f
docker container prune -f
Debugging Tools¶
Enable Debug Logging¶
# Temporarily enable debug logging
docker exec bmp-collector kill -USR1 1
# Or restart with debug logging
docker-compose down
docker-compose up -d -e LOG_LEVEL=DEBUG
Database Query Analysis¶
-- Show slow queries
SELECT query, mean_time, calls, total_time
FROM pg_stat_statements
WHERE mean_time > 100
ORDER BY mean_time DESC;
-- Show current running queries
SELECT pid, now() - pg_stat_activity.query_start AS duration, query
FROM pg_stat_activity
WHERE (now() - pg_stat_activity.query_start) > interval '5 minutes';
Network Packet Capture¶
# Capture BMP traffic for analysis
tcpdump -i any -w bmp_capture.pcap port 11019
# Analyze with tshark
tshark -r bmp_capture.pcap -Y "tcp.port == 11019"
Resource Monitoring¶
# Real-time monitoring
watch -n 1 'docker stats --no-stream bmp-collector'
# Continuous monitoring with timestamps
while true; do
echo "$(date): $(docker stats --no-stream --format 'table {{.Container}}\t{{.CPUPerc}}\t{{.MemUsage}}' bmp-collector)"
sleep 60
done
Performance Tuning¶
Application Tuning¶
# Optimize for high-volume environments
BMP_BUFFER_SIZE=131072
BMP_MAX_CONNECTIONS=200
BATCH_SIZE=5000
BATCH_TIMEOUT_SECONDS=1
WORKER_THREADS=16
Database Tuning¶
-- Optimize for time-series workload
SET maintenance_work_mem = '1GB';
SET checkpoint_timeout = '15min';
SET wal_compression = on;
SET log_min_duration_statement = 1000;
Container Resource Limits¶
# docker-compose.yml
services:
bmp-collector:
deploy:
resources:
limits:
memory: 2G
cpus: '2.0'
reservations:
memory: 1G
cpus: '1.0'
Getting Help¶
When reporting issues, include:
- Log snippets with timestamps
- Configuration (sanitized)
- System information (OS, Docker version, hardware)
- Network topology and router models
- Metrics data if available