Every data centre operator runs with safety margins. Thermal ceilings 10-15 degrees below equipment limits. Power caps at 80% of rated capacity. Redundancy ratios calculated for worst-case simultaneous failure. These margins exist for good reason — exceeding a thermal limit or tripping a breaker is expensive and sometimes dangerous.
But the margins are static, and operating conditions are not.
At 3am on a Tuesday in January, a facility in Northern Europe runs at 40% compute utilisation with outside air at -2C. The thermal ceiling is set at 85F because that’s the safe limit in August when outside air is 32C and every rack is at peak load. The power cap is at 80% because that’s the safe limit with both utility feeds active and the BESS at minimum charge. The redundancy ratio assumes N+1 on every cooling unit because maintenance windows are unpredictable.
The gap between the static limit and the actual safe limit at any given moment — that’s grey space. Real capacity the operator owns, has paid for, and isn’t using.
This is the problem that led me to build a working prototype of what I’m calling ARMP (Adaptive Resource Management Protocol) — a coordination layer that sits above device-level protocols like Redfish and BACnet. Existing protocols tell you what individual equipment is doing. ARMP lets grey space (power, cooling, physical plant) and white space (compute, workloads, GPUs) communicate constraints, capacity, and intent to each other bidirectionally. The cooling system knows a training job is about to spike. The workload scheduler knows cooling in Zone A is constrained. The facility operates as one system rather than two that share a building.
A protocol like this would ultimately need to come from a consortium — OCP, IEC, or a dedicated working group — not a single vendor. ARMP is a placeholder name for a working implementation of the concept.
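To make the idea concrete, here is a minimal sketch of the kind of messages the two sides might exchange. The class names, fields, and units are my own illustrations of the concept, not an existing Redfish, BACnet, or ARMP schema.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class CapacityAdvertisement:
    """Grey space -> white space: what the facility can actually support right now."""
    zone: str
    thermal_headroom_f: float   # degrees F between current temperature and the current soft limit
    power_headroom_kw: float    # kW between current draw and the current safe draw
    valid_until: datetime       # conditions change, so the offer expires


@dataclass
class WorkloadIntent:
    """White space -> grey space: what compute is about to do."""
    rack: str
    expected_power_delta_kw: float  # e.g. a training job ramping up
    start: datetime
    duration: timedelta
```

The expiry field is doing the real work: an offer of grey space is only meaningful while the conditions that produced it still hold.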

How much capacity is stranded
For the ARMP prototype, we instrumented a simulated data centre zone with 20 servers across 4 racks, served by 2 CRAC units rated at 50kW each. We continuously tracked the gap between static and dynamic limits across the thermal, power, and cooling dimensions.
At any given moment, 15-30% of thermal capacity sat unused as headroom. Power headroom was wider, at 20-40% depending on workload mix. Cooling was tightest, with capacity typically only 10-20% above actual demand.

This capacity could be sold, used for burst compute, or traded for efficiency. It sits idle because the control system doesn’t know it exists.
Why static margins persist
Dynamic constraint management is hard to do safely. If you raise the thermal ceiling from 85F to 90F because conditions allow it, you need certainty that you can bring it back down before conditions change. A workload spike, an outside air temperature swing, a cooling unit going into maintenance, a utility feed dropping — any of these can happen in minutes. The control system needs to detect the change, recalculate the safe operating envelope, and adjust constraints before equipment exceeds real limits.
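A sketch of that loop, under the assumption that any failure to re-validate the envelope falls straight back to the static limit. The three hook functions are placeholders for BMS integration points, not real APIs:

```python
import time

STATIC_CEILING_F = 85.0   # the commissioned worst-case limit from the example above
CYCLE_S = 30              # assumed re-validation period; must outpace how fast conditions change


def run_dynamic_ceiling(read_conditions, compute_dynamic_ceiling, apply_ceiling):
    """Raise the ceiling above the static value only while it can be re-validated.

    The three callables are placeholder hooks into a BMS. Any failure to read
    conditions or recompute the envelope reverts to the static margin.
    """
    while True:
        try:
            conditions = read_conditions()                 # outside air, load, feed status, maintenance
            ceiling = compute_dynamic_ceiling(conditions)  # safe limit for *current* conditions
        except Exception:
            ceiling = STATIC_CEILING_F                     # cannot validate -> fall back to static
        apply_ceiling(ceiling)
        time.sleep(CYCLE_S)
```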
Traditional BMS and DCIM systems don’t do this. They set thresholds at commissioning, maybe adjust seasonally, and leave them. The engineer who set the threshold optimised for the worst case they could imagine, not for current conditions.
Dynamic constraint management in practice
In the ARMP prototype, the constraint engine runs continuously alongside the multi-agent reinforcement learning (MARL) agents. It maintains a real-time model of the safe operating envelope across six dimensions:
- Thermal: per zone and per rack
- Power: per PDU, per circuit, per feed
- Cooling: per CRAC, per CDU, per zone
- UPS: battery state, load level, efficiency curve
- Workload: SLA commitments, migration feasibility
- Physical plant: outside air, humidity, utility status
Each dimension has three boundaries:
- Hard limit: equipment damage or safety compromise. Never exceeded.
- Soft limit: operational risk increases. Can be exceeded temporarily with monitoring.
- Current operating point: where the system is right now.
Grey space is the distance between current operating point and soft limit. The constraint engine calculates it continuously and exposes it to the MARL agents as part of their observation space.
A cooling agent doesn’t just see “temperature is 74F.” It sees: temperature is 74F, soft limit is 82F given current conditions, grey space is 8F, and based on thermal trend and workload schedule, grey space will shrink to 4F in the next 20 minutes. That observation lets the agent make efficiency decisions — reduce fan speed, save power — that a static threshold cannot support.
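Here is one way that observation could be represented. The structure and the 95F hard limit are assumptions for illustration; the 74F, 82F, 8F, and 4F figures come from the example above.

```python
from dataclasses import dataclass


@dataclass
class ConstraintObservation:
    """One dimension of the envelope, as an agent might see it. Illustrative only."""
    current: float               # current operating point
    soft_limit: float            # operational-risk boundary; exceedable briefly with monitoring
    hard_limit: float            # equipment/safety boundary; never exceeded
    projected_grey_space: float  # forecast headroom at the end of the horizon
    horizon_min: float           # forecast horizon in minutes

    @property
    def grey_space(self) -> float:
        # Grey space: distance from the current operating point to the soft limit.
        return self.soft_limit - self.current


# The cooling example from the text; the 95F hard limit is an assumed figure.
zone_a_thermal = ConstraintObservation(
    current=74.0, soft_limit=82.0, hard_limit=95.0,
    projected_grey_space=4.0, horizon_min=20.0,
)
assert zone_a_thermal.grey_space == 8.0
```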
Thermal event prevention
In our end-to-end demo, the system prevented a thermal event that static controls would have missed. A training job on Rack-001 drove utilisation from 45% to 85%. Static controls would have waited for the 85F alarm and then reacted. The MARL system detected the temperature trend at 77F (WARNING), predicted it would reach CRITICAL in 9.5 minutes, and proactively migrated two workloads to Rack-004 while dropping the CRAC setpoint by 2F.

Total response time: under 30 seconds from detection to action. Temperature peaked at 77.4F and recovered to 73.6F. CRITICAL was never reached.
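The prediction step can be as simple as extrapolating the recent temperature trend. The sketch below fits a straight line to recent readings and estimates minutes until a threshold is crossed; the 84F CRITICAL value and the readings are illustrative rather than demo data, and a production predictor would also weigh the workload schedule.

```python
def minutes_to_threshold(samples, threshold_f):
    """Estimate minutes until a threshold is crossed by fitting a straight line
    to recent (minute, temperature_F) samples. A deliberate simplification."""
    n = len(samples)
    mean_t = sum(t for t, _ in samples) / n
    mean_v = sum(v for _, v in samples) / n
    slope = (
        sum((t - mean_t) * (v - mean_v) for t, v in samples)
        / sum((t - mean_t) ** 2 for t, _ in samples)
    )
    if slope <= 0:
        return None  # not trending towards the threshold
    _, latest_v = samples[-1]
    return (threshold_f - latest_v) / slope


# Inlet temperature climbing after a job ramps from 45% to 85% utilisation.
readings = [(0, 74.2), (1, 74.9), (2, 75.8), (3, 76.4), (4, 77.0)]
print(minutes_to_threshold(readings, threshold_f=84.0))  # roughly 10 minutes at this trend
```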
Energy result: 16.7% savings over the demo period, projected to $290/week per zone at scale. Most of that comes from cooling optimisation — running CRAC units closer to actual demand rather than at the static worst-case setpoint.

The commercial shift
Grey space management changes the conversation between operations and commercial teams. Instead of “we have X MW of capacity and it’s Y% sold,” the conversation becomes “we have X MW of static capacity, but at current conditions we have X+Z MW of dynamic capacity available for the next N hours.”
Z is sellable. Burst capacity for cloud tenants. Scheduling flexibility for AI training jobs. Demand response capacity for utility grid programmes. It exists right now in every facility — the question is whether the control system can see it, and whether the operator trusts it enough to use it.
Building that trust is the harder problem. The technology works. Operational confidence takes longer.
This is part of a series on building autonomous data centre management systems. The views are my own.