Module 4
1 week (8 hours)

Monitoring

Part of the Data Center Technician training program

Overview

Master the DCIM (Data Center Infrastructure Management) systems, environmental monitoring platforms, and alerting technologies that keep data centers running 24/7. This capstone module covers sensor networks, network management, automated response, and the operational procedures that ensure continuous facility uptime.

Sub-topics

DCIM Systems & Platform Overview

2 hours

Study DCIM software platforms (Netcool, Sunbird, Nlyte, Intelix), asset management workflows, capacity planning tools, power and cooling visualization dashboards, and integration with IT service management (ITSM) systems. Learn about data center mapping software and real-time 3D visualization.

Lessons

DCIM Platform Types

Asset tracking: Serial numbers, locations, warranties. Capacity planning: Power, cooling, space. Monitoring: Real-time dashboards. Reporting: Compliance, efficiency.

Asset Management

Barcode/RFID tagging. Checkout/check-in workflows. Warranty tracking. Lifecycle management. Integration with procurement.

Capacity Planning Tools

Predictive analytics. Trending reports. What-if scenarios. Automated alerts. Integration with monitoring systems.
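
As a rough illustration of the trending and what-if analysis listed above, the sketch below fits a straight line to monthly peak power readings and estimates how long until the trend reaches a capacity limit. The readings, the 100 kW limit, and the function names are hypothetical; real DCIM tools use their own forecasting models, and a linear trend is only a first approximation.

```python
def linear_fit(xs, ys):
    """Ordinary least-squares fit; returns (slope, intercept)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def months_until_limit(readings_kw, limit_kw):
    """Months after the last reading until the trend reaches limit_kw.

    Returns None when the load is flat or falling.
    """
    months = list(range(len(readings_kw)))
    slope, intercept = linear_fit(months, readings_kw)
    if slope <= 0:
        return None
    crossing = (limit_kw - intercept) / slope
    return max(0.0, crossing - months[-1])


# Hypothetical monthly peak load (kW) for one power zone.
history = [60.0, 62.0, 64.0, 66.0, 68.0, 70.0]
print(months_until_limit(history, limit_kw=100.0))
```

Changing the limit or appending projected readings to the history gives a simple what-if scenario.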

Dashboard Design

Key metrics: PUE, WUE, CUE. Real-time status. Drill-down capability. Mobile access. Custom views for different roles.

ITSM Integration

Change management. Incident tracking. Problem management. Knowledge base. Workflow automation.

Practical Exercises

  • Configure asset tracking for 100 devices
  • Create capacity plan for 6-month expansion
  • Design executive dashboard with key metrics
  • Integrate DCIM with ticketing system
  • Generate compliance report for audit

Key Formulas

PUE = Total Facility Power / IT Equipment Power
WUE = Annual Water Usage / Annual IT Energy (L/kWh)
CUE = Total CO2 Emissions from Facility Energy / Annual IT Energy (kg CO2e/kWh)
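
A minimal sketch of the PUE and WUE calculations, using made-up figures for a facility with a steady 1,000 kW IT load. The function names and input values are assumptions for illustration only.

```python
def pue(total_facility_kw, it_equipment_kw):
    """Power Usage Effectiveness: total facility power over IT power."""
    if it_equipment_kw <= 0:
        raise ValueError("IT equipment power must be positive")
    return total_facility_kw / it_equipment_kw


def wue(annual_water_liters, annual_it_kwh):
    """Water Usage Effectiveness in liters per kWh of IT energy."""
    if annual_it_kwh <= 0:
        raise ValueError("annual IT energy must be positive")
    return annual_water_liters / annual_it_kwh


# Hypothetical facility: 1,500 kW total draw, 1,000 kW of it for IT equipment.
print(pue(1500.0, 1000.0))

# 1,000 kW running all year is 8,760,000 kWh; water use is a made-up figure.
print(wue(15_768_000.0, 8_760_000.0))
```

A PUE of 1.0 would mean every watt reaches IT equipment, so values closer to 1.0 indicate less overhead for cooling and power distribution.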

Safety Checklist

  • Verify all assets are tagged properly
  • Check integration with monitoring system
  • Test report generation functionality
  • Review user access permissions
  • Backup configuration regularly

Environmental Sensor Networks

2 hours

Deploy and configure temperature sensors, humidity probes, water leak detectors, smoke detectors, air pressure differentials, and vibration sensors throughout the facility. Learn sensor placement strategies, calibration procedures, data aggregation, and false alarm reduction techniques.

Lessons

Sensor Placement Strategy

Temperature: At rack inlet, exhaust, and under-floor. Humidity: Same locations. Leak: Under floors, around plumbing. Smoke: Near CRAC units, in corridors.

Calibration Procedures

Temperature: Two-point check in an ice bath (0°C) and boiling water (100°C at sea-level pressure; lower at altitude). Humidity: Saturated salt solutions with known reference humidity. Frequency: Annually or per manufacturer.
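
The two reference points can be turned into a correction for raw readings. The sketch below assumes the sensor responds linearly between the two points, which is an approximation; the raw readings of 0.5 and 99.5 and the function name are hypothetical.

```python
def two_point_correction(raw_low, raw_high, ref_low=0.0, ref_high=100.0):
    """Build a function that maps raw readings onto the reference scale.

    raw_low / raw_high are what the sensor reported at the two reference
    points (for example, ice bath and boiling water).
    """
    if raw_high == raw_low:
        raise ValueError("reference readings must differ")
    gain = (ref_high - ref_low) / (raw_high - raw_low)

    def correct(raw):
        return ref_low + (raw - raw_low) * gain

    return correct


# Hypothetical probe: reads 0.5 in the ice bath and 99.5 in boiling water.
correct = two_point_correction(raw_low=0.5, raw_high=99.5)
print(correct(24.0))
```

If the boiling point at the site differs from 100°C, pass the actual value as ref_high.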

Data Aggregation

Edge computing: Local processing. Cloud: Centralized storage. Database: Time-series optimized. API: Integration with other systems.
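
One common form of local processing is downsampling raw samples before they are sent to central storage. The sketch below averages readings into fixed-width time buckets; the sample data, the 60-second bucket width, and the function name are assumptions, and a real time-series database would normally handle this with its own rollup features.

```python
from collections import defaultdict


def downsample(samples, bucket_s=60):
    """Average (timestamp_s, value) samples into fixed-width time buckets.

    Returns a dict keyed by bucket start time, in ascending order.
    """
    buckets = defaultdict(list)
    for ts, value in samples:
        buckets[ts - ts % bucket_s].append(value)
    return {start: sum(vals) / len(vals) for start, vals in sorted(buckets.items())}


# Hypothetical rack-inlet temperatures (seconds since start, degrees C).
samples = [(0, 22.0), (30, 23.0), (60, 24.0), (90, 26.0)]
print(downsample(samples))
```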

False Alarm Reduction

Filtering: Statistical analysis. Hysteresis: Deadband settings. Correlation: Multiple sensor confirmation. Machine learning: Pattern recognition.
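
Hysteresis is the simplest of these techniques to show in code. In the sketch below the alarm trips at one threshold and clears only at a lower one, so a reading that hovers around the trip point does not generate repeated alarms. The 27°C and 25°C thresholds and the class name are illustrative assumptions, not recommended settings.

```python
class HysteresisAlarm:
    """High-temperature alarm with a deadband between trip and clear points."""

    def __init__(self, trip_at, clear_at):
        if clear_at >= trip_at:
            raise ValueError("clear_at must be below trip_at")
        self.trip_at = trip_at
        self.clear_at = clear_at
        self.active = False

    def update(self, value):
        """Feed one reading; returns True while the alarm is active."""
        if not self.active and value >= self.trip_at:
            self.active = True
        elif self.active and value <= self.clear_at:
            self.active = False
        return self.active


alarm = HysteresisAlarm(trip_at=27.0, clear_at=25.0)
readings = [26.5, 27.1, 26.8, 27.2, 26.0, 24.9, 26.9]
states = [alarm.update(r) for r in readings]
print(states)  # [False, True, True, True, True, False, False]
```

Without the deadband, the same readings would raise the alarm twice (at 27.1 and again at 27.2) instead of once.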

Network Topology

Wired: Ethernet with PoE. Wireless: Battery-powered, mesh network. Redundancy: Dual paths. Security: Encryption, authentication.

Practical Exercises

  • Install temperature sensors at 20 rack locations
  • Perform humidity sensor calibration
  • Configure sensor network for redundancy
  • Set up data aggregation to time-series database
  • Tune alarm thresholds to reduce false positives

Key Formulas

Sensor Density = Number of Sensors / Facility Area
Network Reliability = 1 - (Path Failure Rate)^Number of Redundant Paths
Data Retention (days) = Storage Capacity / (Data Rate per Second × 86400)
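
A short worked example of the sensor density and data retention formulas. The sensor count, sample size, sampling interval, and storage size are hypothetical, and the calculation ignores database overhead and compression.

```python
SECONDS_PER_DAY = 86_400


def sensor_density(sensor_count, area_m2):
    """Sensors per square meter of facility floor area."""
    return sensor_count / area_m2


def retention_days(storage_bytes, sensors, bytes_per_sample, sample_interval_s):
    """Days of raw data that fit in storage_bytes at a steady sample rate."""
    data_rate = sensors * bytes_per_sample / sample_interval_s  # bytes per second
    return storage_bytes / (data_rate * SECONDS_PER_DAY)


# Hypothetical deployment: 500 sensors in 2,000 square meters,
# 16 bytes per sample, one sample every 10 seconds, 100 GB of storage.
print(sensor_density(500, 2000.0))
print(retention_days(100e9, sensors=500, bytes_per_sample=16, sample_interval_s=10))
```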

Safety Checklist

  • Verify sensor calibration before deployment
  • Check network connectivity for all sensors
  • Test alarm notification for each sensor type
  • Document sensor locations in asset system
  • Review sensor data quality weekly

Network Management & SNMP

2 hours

Configure SNMP monitoring for network switches, PDUs, UPS systems, CRAC units, and environmental controllers. Set up network management servers, MIB (Management Information Base) configuration, trap monitoring, and alert escalation procedures for critical infrastructure devices.

Lessons

SNMP Versions

SNMPv1: Basic, plaintext community strings. SNMPv2c: Better error handling and bulk retrieval, still community strings. SNMPv3: Encryption and authentication. Use SNMPv3 wherever devices support it.

MIB Configuration

Standard MIBs: IF-MIB, UPS-MIB. Vendor MIBs: Specific to device. Load MIBs on management station. Verify OIDs are responding.

Trap Monitoring

Configure trap destinations. Set trap severity levels. Filter traps to reduce noise. Log all traps for analysis.
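
The "log everything, forward selectively" idea can be sketched without any SNMP library. The example below treats traps as plain dictionaries, logs all of them, and forwards only those at or above a minimum severity that are not repeats from the same device within a suppression window. The field names, severity labels, device names, and 60-second window are assumptions; the OIDs are the standard linkDown and linkUp notification OIDs.

```python
SEVERITY_RANK = {"info": 0, "warning": 1, "critical": 2}


def filter_traps(traps, min_severity="warning", suppress_s=60):
    """Return (log, forwarded).

    Every trap is logged. A trap is forwarded only if it meets min_severity
    and the same (source, oid) was not forwarded in the last suppress_s seconds.
    """
    log, forwarded, last_forwarded = [], [], {}
    for trap in traps:
        log.append(trap)
        if SEVERITY_RANK[trap["severity"]] < SEVERITY_RANK[min_severity]:
            continue
        key = (trap["source"], trap["oid"])
        previous = last_forwarded.get(key)
        if previous is not None and trap["time"] - previous < suppress_s:
            continue
        last_forwarded[key] = trap["time"]
        forwarded.append(trap)
    return log, forwarded


LINK_DOWN = "1.3.6.1.6.3.1.1.5.3"
LINK_UP = "1.3.6.1.6.3.1.1.5.4"

# Hypothetical trap stream; "time" is seconds since the start of the capture.
traps = [
    {"time": 0, "source": "sw-01", "oid": LINK_DOWN, "severity": "critical"},
    {"time": 20, "source": "sw-01", "oid": LINK_DOWN, "severity": "critical"},
    {"time": 30, "source": "sw-02", "oid": LINK_UP, "severity": "info"},
    {"time": 90, "source": "sw-01", "oid": LINK_DOWN, "severity": "critical"},
]
log, forwarded = filter_traps(traps)
print(len(log), len(forwarded))
```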

Polling Intervals

Critical devices: 1-5 minutes. Non-critical: 15-30 minutes. Balance data granularity with network load. Use bulk operations (GETBULK in SNMPv2c and SNMPv3) to reduce round trips.

Alert Escalation

Level 1: Email notification. Level 2: SMS/text. Level 3: Phone call. Level 4: External notification. Include troubleshooting steps.

Practical Exercises

  • Configure SNMPv3 on 10 network switches
  • Load vendor MIBs for PDU monitoring
  • Set up SNMP trap receiver and test notifications
  • Create monitoring dashboard with polled data
  • Configure escalation for critical alerts

Key Formulas

Polling Load per Device = (Total OIDs × Poll Frequency) / Device Count
Trap Processing Load = Trap Rate × Average Processing Time
Network Utilization = (Data Rate in Bytes per Second × 8) / Link Speed in Bits per Second
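
A rough sizing sketch for the polling load and network utilization formulas. It reads "Number of OIDs" as the total across all polled devices and "Poll Frequency" as polls per second; the device count, OIDs per device, and the 100 bytes per OID exchange are hypothetical figures chosen only to make the arithmetic concrete.

```python
def polling_load(total_oids, poll_interval_s, device_count):
    """Per-device polling load in OID reads per second."""
    poll_frequency = 1.0 / poll_interval_s  # polls per second
    return (total_oids * poll_frequency) / device_count


def network_utilization(data_rate_bytes_per_s, link_speed_bits_per_s):
    """Fraction of the link consumed by the given byte rate."""
    return (data_rate_bytes_per_s * 8) / link_speed_bits_per_s


# Hypothetical estate: 200 devices, 50 OIDs each, polled every 60 seconds.
devices, oids_each, interval_s = 200, 50, 60
total_oids = devices * oids_each
print(polling_load(total_oids, interval_s, devices))

# Assume roughly 100 bytes on the wire per OID request and response.
bytes_per_s = total_oids / interval_s * 100
print(network_utilization(bytes_per_s, 1_000_000_000))
```

Halving the polling interval doubles both figures, which is the trade-off between data granularity and network load.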

Safety Checklist

  • Use SNMPv3 with strong authentication
  • Verify all devices are responding to polls
  • Test trap notifications for critical alerts
  • Review alert logs weekly
  • Rotate SNMPv3 credentials (and any remaining v1/v2c community strings) periodically

Alert Management & Incident Response

2 hours

Design alarm escalation matrices, automate response playbooks (runbooks), implement shift-based on-call procedures, and practice incident response for power failures, cooling losses, and environmental emergencies. Learn about post-incident analysis and continuous improvement processes.

Lessons

Escalation Matrices

Severity levels: Critical, High, Medium, Low. Response times: 15 min, 1 hour, 4 hours, 24 hours. Notification methods: Email, SMS, phone. Roles and responsibilities.
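
An escalation matrix is often easiest to reason about as a lookup table. The sketch below pairs the four severity levels with the four response times in the order listed above; which notification methods apply to which severity is an assumption made for illustration, as are the function names.

```python
from datetime import datetime, timedelta

# Severity -> target response time and notification methods (methods assumed).
ESCALATION_MATRIX = {
    "critical": {"respond_within": timedelta(minutes=15), "notify": ["phone", "sms", "email"]},
    "high": {"respond_within": timedelta(hours=1), "notify": ["sms", "email"]},
    "medium": {"respond_within": timedelta(hours=4), "notify": ["email"]},
    "low": {"respond_within": timedelta(hours=24), "notify": ["email"]},
}


def response_deadline(severity, raised_at):
    """Latest time by which someone should have responded to the alert."""
    return raised_at + ESCALATION_MATRIX[severity]["respond_within"]


def is_overdue(severity, raised_at, now):
    """True when the response target for this severity has been missed."""
    return now > response_deadline(severity, raised_at)


raised = datetime(2024, 1, 1, 12, 0)
check = datetime(2024, 1, 1, 12, 30)
print(is_overdue("critical", raised, check))  # True: 15-minute target missed
print(is_overdue("high", raised, check))      # False: still within 1 hour
```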

Automated Runbooks

Scripted responses: Email alerts, SNMP sets. Integration: DCIM, ticketing systems. Approval workflows. Rollback procedures.

On-Call Procedures

Rotation schedule: Weekly, bi-weekly. Backup: Secondary on-call. Tools: Mobile apps, web interfaces. Handoff procedures.

Incident Response Process

Detect: Alarms, monitoring. Respond: Acknowledge, assess. Resolve: Execute runbook. Document: Incident report. Review: Post-mortem.

Post-Incident Analysis

Root cause analysis. Timeline documentation. Action items. Preventive measures. Knowledge base update.

Practical Exercises

  • Create escalation matrix for 4 severity levels
  • Write runbook for UPS battery failure
  • Configure on-call schedule in monitoring tool
  • Simulate power failure incident response
  • Conduct post-mortem analysis for test incident

Key Formulas

MTTR = Total Downtime / Number of Incidents
Availability = (Total Time - Downtime) / Total Time
Response Time = First Alert Time - Incident Detection Time
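
A brief worked example of the MTTR and availability formulas, using a hypothetical year with three incidents totalling 4.38 hours of downtime. The figures and function names are assumptions for illustration.

```python
HOURS_PER_YEAR = 8760.0


def mttr(total_downtime_h, incident_count):
    """Mean time to repair, in hours per incident."""
    if incident_count <= 0:
        raise ValueError("incident_count must be positive")
    return total_downtime_h / incident_count


def availability(total_time_h, downtime_h):
    """Fraction of the period during which the service was up."""
    return (total_time_h - downtime_h) / total_time_h


# Hypothetical year: three incidents, 4.38 hours of downtime in total.
print(mttr(4.38, 3))
print(availability(HOURS_PER_YEAR, 4.38))
```

With these inputs the availability works out to about 99.95 percent.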

Safety Checklist

  • Verify all contacts are current in escalation list
  • Test notification methods monthly
  • Review runbooks annually
  • Update contact information quarterly
  • Conduct incident response drills quarterly

Learning Objectives

  • Deploy and configure a complete DCIM monitoring platform with asset and capacity management
  • Install and calibrate environmental sensors following best practices for coverage and accuracy
  • Configure SNMP monitoring for all critical infrastructure devices including UPS, PDUs, and cooling
  • Design and test alarm escalation procedures for power, cooling, and environmental emergencies
  • Execute incident response playbooks for simulated facility emergencies
  • Perform capacity trending analysis to predict future power, cooling, and space requirements
