Monitoring
Part of the Data Center Technician training program
Overview
Master the DCIM (Data Center Infrastructure Management) systems, environmental monitoring platforms, and alerting technologies that keep data centers running 24/7. This capstone module covers sensor networks, network management, automated response, and the operational procedures that ensure continuous facility uptime.
Sub-topics
DCIM Systems & Platform Overview
Study DCIM software platforms (Netcool, Sunbird, Nlyte, Intelix), asset management workflows, capacity planning tools, power and cooling visualization dashboards, and integration with IT service management (ITSM) systems. Learn about data center mapping software and real-time 3D visualization.
Lessons
DCIM Platform Types
Asset tracking: Serial numbers, locations, warranties. Capacity planning: Power, cooling, space. Monitoring: Real-time dashboards. Reporting: Compliance, efficiency.
Asset Management
Barcode/RFID tagging. Checkout/check-in workflows. Warranty tracking. Lifecycle management. Integration with procurement.
Capacity Planning Tools
Predictive analytics. Trending reports. What-if scenarios. Automated alerts. Integration with monitoring systems.
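A trending report of this kind can be sketched as a straight-line fit over monthly power readings. This is a minimal illustration, not how any particular DCIM product forecasts: the function names, the sample history, and the 160 kW capacity limit are all hypothetical.

```python
def linear_fit(xs, ys):
    """Ordinary least-squares fit; returns (slope, intercept)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def months_until_capacity(monthly_kw, capacity_kw):
    """Months after the last sample until the trend line reaches capacity.

    Returns None when the trend is flat or declining.
    """
    xs = list(range(len(monthly_kw)))
    slope, intercept = linear_fit(xs, monthly_kw)
    if slope <= 0:
        return None
    crossing = (capacity_kw - intercept) / slope
    return max(0.0, crossing - xs[-1])


# Hypothetical six months of average IT load in kW.
history = [100.0, 104.0, 108.0, 112.0, 116.0, 120.0]
print(months_until_capacity(history, capacity_kw=160.0))  # 10.0
```

A linear trend is the simplest possible model. Real load growth is often stepwise, so a result like this is better treated as a prompt for a what-if review than as a firm date.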
Dashboard Design
Key metrics: PUE, WUE, CUE. Real-time status. Drill-down capability. Mobile access. Custom views for different roles.
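As a rough sketch, the three dashboard metrics can be computed from annual totals as shown here. The input figures are invented for illustration. The CUE function uses the common definition of carbon emissions per unit of IT energy.

```python
def pue(total_facility_kwh, it_equipment_kwh):
    """Power Usage Effectiveness: total facility energy per unit of IT energy."""
    return total_facility_kwh / it_equipment_kwh


def wue(annual_water_liters, annual_it_kwh):
    """Water Usage Effectiveness in liters per kWh of IT energy."""
    return annual_water_liters / annual_it_kwh


def cue(annual_co2_kg, annual_it_kwh):
    """Carbon Usage Effectiveness in kg CO2e per kWh of IT energy."""
    return annual_co2_kg / annual_it_kwh


# Hypothetical annual figures for a small facility.
it_kwh = 1_000_000.0
print(round(pue(1_500_000.0, it_kwh), 2))  # 1.5
print(round(wue(1_800_000.0, it_kwh), 2))  # 1.8
print(round(cue(600_000.0, it_kwh), 2))    # 0.6
```

A PUE of 1.0 would mean that every kilowatt-hour entering the facility reaches IT equipment, so lower values indicate less overhead for cooling and power distribution.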
ITSM Integration
Change management. Incident tracking. Problem management. Knowledge base. Workflow automation.
Practical Exercises
- Configure asset tracking for 100 devices
- Create capacity plan for 6-month expansion
- Design executive dashboard with key metrics
- Integrate DCIM with ticketing system
- Generate compliance report for audit
Key Formulas
PUE = Total Facility Power / IT Equipment Power
WUE = Annual Water Usage / Annual IT Energy
CUE = Total CO2 Emissions from Data Center Energy / IT Equipment Energy
Safety Checklist
- Verify all assets are tagged properly
- Check integration with monitoring system
- Test report generation functionality
- Review user access permissions
- Backup configuration regularly
Environmental Sensor Networks
Deploy and configure temperature sensors, humidity probes, water leak detectors, smoke detectors, air pressure differentials, and vibration sensors throughout the facility. Learn sensor placement strategies, calibration procedures, data aggregation, and false alarm reduction techniques.
Lessons
Sensor Placement Strategy
Temperature: At rack inlet, exhaust, and under-floor. Humidity: Same locations. Leak: Under floors, around plumbing. Smoke: Near CRAC units, in corridors.
Calibration Procedures
Temperature: Two-point check in an ice bath (0°C) and boiling water (100°C at sea-level pressure; lower at altitude). Humidity: Saturated salt solutions. Frequency: Annually or per manufacturer.
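The two reference points can be turned into a linear correction for later readings. The sketch below assumes the sensor error is linear between the reference points. The function name and the raw readings are hypothetical, and `ref_high` should be set to the actual boiling point at the site's pressure.

```python
def two_point_correction(raw_low, raw_high, ref_low=0.0, ref_high=100.0):
    """Build a function that maps raw sensor readings to corrected values.

    raw_low / raw_high are the sensor's readings in the ice bath and in
    boiling water; ref_low / ref_high are the true reference temperatures.
    """
    if raw_high == raw_low:
        raise ValueError("reference readings must differ")
    gain = (ref_high - ref_low) / (raw_high - raw_low)

    def correct(raw):
        return ref_low + gain * (raw - raw_low)

    return correct


# Hypothetical sensor that reads 0.5 °C in ice and 99.5 °C in boiling water.
correct = two_point_correction(raw_low=0.5, raw_high=99.5)
print(round(correct(25.25), 2))  # 25.0
```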
Data Aggregation
Edge computing: Local processing. Cloud: Centralized storage. Database: Time-series optimized. API: Integration with other systems.
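When sizing a time-series store, it helps to estimate how many days of raw samples fit in a given capacity. The sketch below assumes a fixed sample size and interval with no compression or downsampling. The sensor count, sample size, and storage figure are illustrative only.

```python
SECONDS_PER_DAY = 86_400


def retention_days(storage_bytes, sensors, bytes_per_sample, sample_interval_s):
    """Days of raw data that fit in storage at a constant ingest rate."""
    data_rate = sensors * bytes_per_sample / sample_interval_s  # bytes per second
    return storage_bytes / (data_rate * SECONDS_PER_DAY)


# Hypothetical: 500 sensors, 16-byte samples every 10 s, 50 GB of storage.
print(round(retention_days(50 * 10**9, 500, 16, 10.0), 1))  # 723.4
```

Time-series databases usually compress data and can downsample older records, so actual retention is often longer than this raw estimate.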
False Alarm Reduction
Filtering: Statistical analysis. Hysteresis: Deadband settings. Correlation: Multiple sensor confirmation. Machine learning: Pattern recognition.
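Hysteresis is easiest to see in code: the alarm trips at one threshold and clears only at a lower one, so a reading hovering near the limit does not flap. The class name and the 27 °C / 25 °C thresholds below are illustrative assumptions, not recommended setpoints.

```python
class HysteresisAlarm:
    """High-temperature alarm with a deadband between trip and clear levels."""

    def __init__(self, trip_at, clear_at):
        if clear_at >= trip_at:
            raise ValueError("clear_at must be below trip_at")
        self.trip_at = trip_at
        self.clear_at = clear_at
        self.active = False

    def update(self, value):
        """Feed one reading and return whether the alarm is active."""
        if not self.active and value >= self.trip_at:
            self.active = True
        elif self.active and value <= self.clear_at:
            self.active = False
        return self.active


alarm = HysteresisAlarm(trip_at=27.0, clear_at=25.0)
readings = [26.5, 27.1, 26.8, 27.0, 25.5, 24.9, 26.0]
print([alarm.update(r) for r in readings])
# [False, True, True, True, True, False, False]
```

Without the deadband, the readings 27.1, 26.8, 27.0 would raise, clear, and raise the alarm again within three samples.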
Network Topology
Wired: Ethernet with PoE. Wireless: Battery-powered, mesh network. Redundancy: Dual paths. Security: Encryption, authentication.
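The benefit of dual paths can be estimated with a simple parallel-availability calculation. This sketch assumes that path failures are independent, which shared switches or shared power can violate. The 99% per-path figure is hypothetical.

```python
def parallel_availability(per_path_availability, paths):
    """Probability that at least one of several independent paths is up."""
    return 1.0 - (1.0 - per_path_availability) ** paths


# Hypothetical sensor link that is up 99% of the time on each path.
print(round(parallel_availability(0.99, 1), 6))  # 0.99
print(round(parallel_availability(0.99, 2), 6))  # 0.9999
```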
Practical Exercises
- Install temperature sensors at 20 rack locations
- Perform humidity sensor calibration
- Configure sensor network for redundancy
- Set up data aggregation to time-series database
- Tune alarm thresholds to reduce false positives
Key Formulas
Sensor Density = Number of Sensors / Facility Area
Network Reliability = 1 - (Path Failure Probability)^Number of Redundant Paths
Data Retention (days) = Storage Capacity / (Data Rate per Second × 86400)
Safety Checklist
- Verify sensor calibration before deployment
- Check network connectivity for all sensors
- Test alarm notification for each sensor type
- Document sensor locations in asset system
- Review sensor data quality weekly
Network Management & SNMP
Configure SNMP monitoring for network switches, PDUs, UPS systems, CRAC units, and environmental controllers. Set up network management servers, MIB (Management Information Base) configuration, trap monitoring, and alert escalation procedures for critical infrastructure devices.
Lessons
SNMP Versions
SNMPv1: Basic, plaintext community strings. SNMPv2c: Adds bulk retrieval and better error handling, but still uses community strings. SNMPv3: Adds authentication and encryption. Use SNMPv3 wherever devices support it.
MIB Configuration
Standard MIBs: IF-MIB (interfaces), UPS-MIB (RFC 1628). Vendor MIBs: Specific to device. Load MIBs on management station. Verify OIDs are responding.
Trap Monitoring
Configure trap destinations. Set trap severity levels. Filter traps to reduce noise. Log all traps for analysis.
Polling Intervals
Critical devices: 1-5 minutes. Non-critical: 15-30 minutes. Balance data granularity with network load. Use bulk operations.
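The trade-off between granularity and network load can be estimated before choosing an interval. In this sketch the device counts and the 150 bytes per polled OID are rough assumptions, since real packet sizes depend on the SNMP version and on how many OIDs are batched per request.

```python
def polling_rate(devices, oids_per_device, interval_s):
    """OID values requested per second across all polled devices."""
    return devices * oids_per_device / interval_s


def link_utilization(bytes_per_second, link_speed_bps):
    """Fraction of link capacity consumed by the given byte rate."""
    return bytes_per_second * 8 / link_speed_bps


BYTES_PER_OID = 150  # assumed request-plus-response overhead per OID

rate = polling_rate(devices=300, oids_per_device=40, interval_s=60)
util = link_utilization(rate * BYTES_PER_OID, link_speed_bps=1_000_000_000)
print(rate)           # 200.0
print(f"{util:.4%}")  # 0.0240%
```

Even at short intervals the bandwidth is usually small. The tighter limit is often the CPU of the polled device, which is one reason to prefer bulk operations.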
Alert Escalation
Level 1: Email notification. Level 2: SMS/text. Level 3: Phone call. Level 4: External notification. Include troubleshooting steps.
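One way to model the four levels is as a list of delays after which each notification channel fires if the alert is still unacknowledged. The delays below (0, 10, 20, and 30 minutes) are hypothetical, and each site sets its own.

```python
# (minutes without acknowledgment, escalation level); delays are assumptions.
ESCALATION_LEVELS = [
    (0, "Level 1: email"),
    (10, "Level 2: SMS"),
    (20, "Level 3: phone call"),
    (30, "Level 4: external notification"),
]


def levels_triggered(minutes_unacknowledged):
    """Return every escalation level that should have fired by now."""
    return [
        name
        for delay, name in ESCALATION_LEVELS
        if minutes_unacknowledged >= delay
    ]


print(levels_triggered(12))
# ['Level 1: email', 'Level 2: SMS']
```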
Practical Exercises
- Configure SNMPv3 on 10 network switches
- Load vendor MIBs for PDU monitoring
- Set up SNMP trap receiver and test notifications
- Create monitoring dashboard with polled data
- Configure escalation for critical alerts
Key Formulas
Polling Load (OIDs/s) = (Device Count × OIDs per Device) / Polling Interval (s)
Trap Processing Load = Trap Rate × Average Processing Time
Network Utilization = (Data Rate in bytes/s × 8) / Link Speed in bits/s
Safety Checklist
- Use SNMPv3 with strong authentication
- Verify all devices are responding to polls
- Test trap notifications for critical alerts
- Review alert logs weekly
- Rotate SNMPv3 credentials, and community strings on any legacy SNMPv1/v2c devices, periodically
Alert Management & Incident Response
Design alarm escalation matrices, automate response playbooks (runbooks), implement shift-based on-call procedures, and practice incident response for power failures, cooling losses, and environmental emergencies. Learn about post-incident analysis and continuous improvement processes.
Lessons
Escalation Matrices
Severity levels: Critical, High, Medium, Low. Response times: 15 min, 1 hour, 4 hours, 24 hours. Notification methods: Email, SMS, phone. Roles and responsibilities.
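The severity levels and response times above map naturally onto a lookup table. The sketch below uses only the Python standard library. The function names and the sample timestamps are illustrative.

```python
from datetime import datetime, timedelta

# Response targets taken from the four severity levels listed above.
RESPONSE_TARGETS = {
    "critical": timedelta(minutes=15),
    "high": timedelta(hours=1),
    "medium": timedelta(hours=4),
    "low": timedelta(hours=24),
}


def response_deadline(severity, raised_at):
    """Latest time by which someone must respond to the alert."""
    return raised_at + RESPONSE_TARGETS[severity.lower()]


def is_breached(severity, raised_at, now):
    """True when the response target has already been missed."""
    return now > response_deadline(severity, raised_at)


raised = datetime(2024, 1, 1, 8, 0)
print(response_deadline("High", raised))                             # 2024-01-01 09:00:00
print(is_breached("Critical", raised, datetime(2024, 1, 1, 8, 20)))  # True
```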
Automated Runbooks
Scripted responses: Email alerts, SNMP sets. Integration: DCIM, ticketing systems. Approval workflows. Rollback procedures.
On-Call Procedures
Rotation schedule: Weekly, bi-weekly. Backup: Secondary on-call. Tools: Mobile apps, web interfaces. Handoff procedures.
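A weekly rotation with a secondary on-call can be generated mechanically. In this sketch the names and start date are placeholders, and the next person in the list serves as backup, which is one possible convention among several.

```python
from datetime import date, timedelta


def weekly_rotation(engineers, start, weeks):
    """Return (week_start, primary, secondary) tuples for each week.

    The secondary is the next engineer in the list, so this needs at
    least two people.
    """
    if len(engineers) < 2:
        raise ValueError("need at least two engineers for a backup")
    schedule = []
    for week in range(weeks):
        primary = engineers[week % len(engineers)]
        secondary = engineers[(week + 1) % len(engineers)]
        schedule.append((start + timedelta(weeks=week), primary, secondary))
    return schedule


for week_start, primary, secondary in weekly_rotation(
    ["Avery", "Blake", "Casey"], date(2024, 1, 1), weeks=4
):
    print(week_start, primary, secondary)
```

With this convention each week's secondary becomes the following week's primary, which gives the handoff some continuity.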
Incident Response Process
Detect: Alarms, monitoring. Respond: Acknowledge, assess. Resolve: Execute runbook. Document: Incident report. Review: Post-mortem.
Post-Incident Analysis
Root cause analysis. Timeline documentation. Action items. Preventive measures. Knowledge base update.
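Post-incident reviews often track mean time to repair (MTTR) and availability over a reporting period. The sketch below computes both from a list of outage durations. The incident durations are invented, and the period is one non-leap year.

```python
def mttr_hours(downtimes_h):
    """Mean time to repair: total downtime divided by number of incidents."""
    return sum(downtimes_h) / len(downtimes_h)


def availability(total_hours, downtime_hours):
    """Fraction of the period during which the service was up."""
    return (total_hours - downtime_hours) / total_hours


# Hypothetical outages over one year, in hours.
incidents = [0.5, 2.0, 1.5]
print(round(mttr_hours(incidents), 2))                 # 1.33
print(f"{availability(8760.0, sum(incidents)):.3%}")   # 99.954%
```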
Practical Exercises
- Create escalation matrix for 4 severity levels
- Write runbook for UPS battery failure
- Configure on-call schedule in monitoring tool
- Simulate power failure incident response
- Conduct post-mortem analysis for test incident
Key Formulas
MTTR = Total Downtime / Number of Incidents
Availability = (Total Time - Downtime) / Total Time
Response Time = Acknowledgment Time - First Alert Time
Safety Checklist
- Verify all contacts are current in escalation list
- Test notification methods monthly
- Review runbooks annually
- Update contact information quarterly
- Conduct incident response drills quarterly
Learning Objectives
- Deploy and configure a complete DCIM monitoring platform with asset and capacity management
- Install and calibrate environmental sensors following best practices for coverage and accuracy
- Configure SNMP monitoring for all critical infrastructure devices including UPS, PDUs, and cooling
- Design and test alarm escalation procedures for power, cooling, and environmental emergencies
- Execute incident response playbooks for simulated facility emergencies
- Perform capacity trending analysis to predict future power, cooling, and space requirements