Rochelle
Indirect Labor Recovery
A projected annual recovery of roughly $97,000 from an indirect labor operation, surfaced when 22,167 instance records revealed three distinct populations hiding inside one daily average.
Twenty-two thousand instance records. Three populations. One mislabeled signal.
The Problem
The site was a regional cold-chain warehouse in the Midwest — a high-throughput facility where Material Handling Equipment runs the floor and the batteries that power that equipment have to be swapped through a dedicated changing station several times a shift. The site had two associates whose full-time job was running that station. On December 1, both positions were eliminated. Direct labor associates absorbed the work, untrained on the changeover equipment.
The metric that ran the site’s labor reporting picked it up immediately. Monthly hours logged under the battery-change job code jumped from 274 to 483 — a 76% increase. On-Standard% (the share of total hours measured against an engineered standard) dropped from 77.2% to 74.9%. Effectiveness%, which combines On-Standard% with labor variance, fell from 87.5% to 83.3%. DLMTPP — direct labor minutes per throughput pallet, the network-wide site comparison metric — inflated.
The daily aggregate data told a clean story. The average time per battery change had moved from 5.83 minutes pre-December to 8.66 minutes post-December, a 48% increase. The interpretation all but wrote itself: direct labor associates didn’t know how to change a battery efficiently. Train them, and the average comes back down and the hours follow. The original project charter projected $34,000 in annual recovery on that logic, with a target of 250 hours per month.
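The two headline percentages can be checked directly from the before and after figures quoted above. This is a minimal sketch of that arithmetic; the helper name `pct_change` is illustrative and not part of the site's reporting tools.

```python
def pct_change(before: float, after: float) -> float:
    """Relative change from `before` to `after`, expressed in percent."""
    return (after - before) / before * 100.0

# Monthly hours logged under the battery-change job code (from the text).
hours_change = pct_change(274, 483)

# Daily-average minutes per battery change, pre vs. post December (from the text).
avg_time_change = pct_change(5.83, 8.66)

print(f"monthly hours: +{hours_change:.1f}%")
print(f"average time per change: +{avg_time_change:.1f}%")
```

Both results land on the rounded figures in the narrative (about 76% and about 48%). They describe the blended average only, not any one group of scans.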
Before I committed to that interpretation, I asked for the underlying instance-level data — every individual battery-change scan over the prior seven months. The dashboard had been showing me a daily average. I wanted to see the population it was an average of.
The Reframe
The instance-level data — 22,167 individual battery-change records over seven months — didn’t look like a single distribution. It looked like three.
A spike at the front: 38.8% of scans came in under two minutes. A real battery change can’t be completed that quickly, so these weren’t real changes. They were associates scanning into the job code as an idle or transition activity, padding the time between productive work.
A normal-ish cluster in the middle: 44.7% of scans landed between 2 and 15 minutes, with a mean of 7.6 minutes after the December 1 transition and 6.8 minutes before. The legitimate population had slowed down by 12%, not 48%.
A long right tail past 15 minutes: 16.5% of scans ran over the 15-minute mark across the seven months — but the share doubled across the December 1 transition, from 10% pre to 20% post. With the worst stretching past 200 minutes, these weren’t slow changes either. They were associates logged into the job code while doing other activities, taking extended breaks, or transitioning between tasks. Seventeen percent of instances across the full dataset, twenty percent in the post-December subset. Roughly fifty-eight percent of the monthly hours either way. The actual cost driver.
The daily average had been blending three populations and treating them as one. Misuse pulled the average down. Extreme outliers pulled it up. Legitimate changes barely moved. The 48% rise in the daily average wasn’t a 48% rise in legitimate change time. It was a 100% rise in the rate of extreme-outlier scans — 10% pre-December to 20% post — driven by the dedicated associates no longer being there to enforce scan discipline at the station.
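The blending effect can be sketched in a few lines. The 2-minute and 15-minute cutoffs come from the text; the function names and the sample durations are illustrative assumptions, and the data is synthetic, chosen only so that the extreme-outlier share doubles from 10% to 20% while the legitimate changes do not move at all.

```python
from statistics import mean

MISUSE_BELOW_MIN = 2.0    # scans shorter than this are treated as misuse
EXTREME_ABOVE_MIN = 15.0  # scans longer than this are treated as extreme outliers

def classify(minutes: float) -> str:
    """Assign one scan to a population by its duration."""
    if minutes < MISUSE_BELOW_MIN:
        return "misuse"
    if minutes > EXTREME_ABOVE_MIN:
        return "extreme"
    return "legitimate"

def summarize(durations):
    """Return per-population stats plus the blended mean a dashboard would show."""
    groups = {"misuse": [], "legitimate": [], "extreme": []}
    for d in durations:
        groups[classify(d)].append(d)
    total_minutes = sum(durations)
    segments = {
        name: {
            "share_of_scans": len(vals) / len(durations),
            "share_of_minutes": sum(vals) / total_minutes,
            "mean_minutes": mean(vals) if vals else 0.0,
        }
        for name, vals in groups.items()
    }
    return segments, total_minutes / len(durations)

# SYNTHETIC durations in minutes, not site data.
pre = [1.0] * 4 + [7.0] * 5 + [60.0] * 1   # 10% extreme
post = [1.0] * 4 + [7.0] * 4 + [60.0] * 2  # 20% extreme

pre_segments, pre_blended = summarize(pre)
post_segments, post_blended = summarize(post)

print("blended mean:", pre_blended, "->", post_blended)
print("legitimate mean:",
      pre_segments["legitimate"]["mean_minutes"], "->",
      post_segments["legitimate"]["mean_minutes"])
```

In this toy sample the blended mean rises sharply while the legitimate mean stays flat, which is the same shape of distortion the instance-level data exposed.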
One more pattern the data carried: misuse had already been running at a 44% rate before the dedicated associates were eliminated. The scan-discipline problem wasn’t new. It had been there all along, masked by two people whose presence at the station kept everyone else honest.
The Approach
The reframe pointed to a different intervention structure than the training-gap interpretation had suggested. If the cost driver was extreme outliers — not slow legitimate changes — then a training program targeting everyone would deliver almost none of the savings. The intervention had to be calibrated to the population that owned the cost.
I built the project as a DMAIC engagement and structured the improvement work around three levers, prioritized by their share of recoverable hours.
Lever 1 — eliminate extreme outliers. A 15-minute scan cap with automated supervisor alerts when any battery-change instance exceeded the cap. Daily outlier reports that named the associate, the shift, and the duration. Coaching for the top 10 offenders, who together accounted for 40% of the extreme-outlier hours across the entire site. The intervention was behavioral, not technical — make the activity visible, make accountability immediate, and the time abuse contracts.
Lever 2 — enforce scan discipline. A minimum-duration threshold flagging any instance under two minutes for supervisor review. Shift-level scorecards on the misuse rate, with the two shifts running 50%+ misuse rates getting targeted attention first.
Lever 3 — targeted training. The ~15 associates whose legitimate change time exceeded 8.5 minutes (the upper tail of the legitimate population) received hands-on training on the changeover equipment. Not the full population — the modest skills gap was real, but it lived in a minority of the workforce.
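The flagging rules behind Levers 1 and 2 can be sketched as follows. This is not the monitoring tool itself: the `Scan` record, the field names, and the sample rows are hypothetical, and only the 15-minute cap and the 2-minute floor come from the text.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Optional

CAP_MIN = 15.0    # Lever 1: scans above the cap go on the outlier report
FLOOR_MIN = 2.0   # Lever 2: scans below the floor go to supervisor review

@dataclass(frozen=True)
class Scan:
    associate: str   # hypothetical associate ID
    shift: str
    minutes: float

def flag(scan: Scan) -> Optional[str]:
    """Return the rule a scan trips, or None if it looks like a normal change."""
    if scan.minutes > CAP_MIN:
        return "over_cap"
    if scan.minutes < FLOOR_MIN:
        return "under_floor"
    return None

def outlier_report(scans):
    """Over-cap scans, longest first, for the daily supervisor report."""
    over = [s for s in scans if flag(s) == "over_cap"]
    return sorted(over, key=lambda s: s.minutes, reverse=True)

def misuse_rate_by_shift(scans):
    """Share of each shift's scans that fall under the floor."""
    total, short = defaultdict(int), defaultdict(int)
    for s in scans:
        total[s.shift] += 1
        if flag(s) == "under_floor":
            short[s.shift] += 1
    return {shift: short[shift] / total[shift] for shift in total}

# Hypothetical sample rows, not site data.
scans = [
    Scan("A101", "1st", 6.5), Scan("A102", "1st", 1.2), Scan("A102", "1st", 7.9),
    Scan("A103", "2nd", 42.0), Scan("A101", "2nd", 0.8),
    Scan("A104", "2nd", 118.0), Scan("A105", "2nd", 1.5),
]

report = outlier_report(scans)
rates = misuse_rate_by_shift(scans)
for s in report:
    print(s.associate, s.shift, s.minutes)
print(rates)
```

Keeping the two rules in one `flag` function means every scan lands in exactly one bucket, so the outlier report and the shift scorecard cannot double-count an instance.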
The sensitivity analysis ran three scenarios (easy, achievable, and aggressive) for each lever, with the financial impact valued at the direct-labor wage rate. The achievable scenario projected approximately $97,000 in annual recovery, with Lever 1 alone delivering 90% of the total. The original charter had projected $34,000 across an undifferentiated training program. The reframe nearly tripled the recoverable value.
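The structure of that sensitivity model can be sketched as a small grid of levers by scenarios. Every number below is a placeholder assumption, including the wage rate and the hours recovered per lever; none of them are the site's figures, and the sketch is not expected to reproduce the $97,000 result.

```python
WAGE_PER_HOUR = 20.0    # PLACEHOLDER wage rate, not the site's actual rate
MONTHS_PER_YEAR = 12

# PLACEHOLDER monthly hours recovered per lever under each scenario.
MONTHLY_HOURS_RECOVERED = {
    "lever_1_extreme_outliers": {"easy": 100.0, "achievable": 150.0, "aggressive": 200.0},
    "lever_2_scan_discipline":  {"easy": 5.0,   "achievable": 10.0,  "aggressive": 15.0},
    "lever_3_targeted_training": {"easy": 2.0,  "achievable": 5.0,   "aggressive": 8.0},
}

def annual_recovery(scenario: str):
    """Annual dollars recovered per lever, plus the total, for one scenario."""
    by_lever = {
        lever: hours[scenario] * MONTHS_PER_YEAR * WAGE_PER_HOUR
        for lever, hours in MONTHLY_HOURS_RECOVERED.items()
    }
    return by_lever, sum(by_lever.values())

for scenario in ("easy", "achievable", "aggressive"):
    by_lever, total = annual_recovery(scenario)
    lever_1_share = by_lever["lever_1_extreme_outliers"] / total
    print(f"{scenario}: ${total:,.0f} per year, Lever 1 share {lever_1_share:.0%}")

achievable_by_lever, achievable_total = annual_recovery("achievable")
```

Laying the model out this way makes the prioritization visible: when one lever dominates every scenario, it is the one to deploy and field-test first.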
LLM-assisted development compressed the cycle from formulation to deployable analysis. The DMAIC framework, the Pareto cuts, the sensitivity model, the trimodal validation, and the lever prioritization were mine; the syntax for the statistical analysis and the daily monitoring tool was generated under direction.
One average. Three populations.
The Result
The model projected approximately $97,000 in annual labor recovery from the full three-lever implementation, with Lever 1 alone delivering 90% of that. The original training-gap charter had projected $34,000 from a different mechanism entirely.
Within the first two and a half weeks of the daily monitoring tool being in supervisor use, the extreme outlier rate dropped from 17.9% to 12.1%, a reduction of nearly six percentage points that was statistically significant under the Mann-Whitney test. Daily extreme hours fell from 8.2 to 5.1, a 38% reduction. The annualized labor recovery at this partial-implementation stage came to roughly $29,000, about 30% of the full projection, from a single lever.
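For readers unfamiliar with the test, the following is a minimal standard-library sketch of a two-sided Mann-Whitney U test using the normal approximation, without tie or continuity correction. The text does not say which unit was compared, so daily extreme-outlier rates are assumed here, and the sample values are synthetic. In practice a library routine such as `scipy.stats.mannwhitneyu` would be the more robust choice, especially for small samples.

```python
import math

def average_ranks(values):
    """1-based ranks of `values`, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=values.__getitem__)
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1          # mean of ranks i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def mann_whitney(x, y):
    """U statistic for `x` and a two-sided p-value (normal approximation)."""
    n1, n2 = len(x), len(y)
    ranks = average_ranks(list(x) + list(y))
    u1 = sum(ranks[:n1]) - n1 * (n1 + 1) / 2
    mu = n1 * n2 / 2
    sigma = math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12)   # no tie correction
    z = (u1 - mu) / sigma
    p_two_sided = math.erfc(abs(z) / math.sqrt(2))
    return u1, p_two_sided

# SYNTHETIC daily extreme-outlier rates in percent, not site data.
before = [18.5, 17.2, 19.1, 16.8, 18.0, 17.6]
after = [12.4, 11.8, 13.0, 12.9, 11.5, 12.2]

u, p = mann_whitney(before, after)
print(f"U = {u}, two-sided p = {p:.4f}")
```

A rank-based test suits this data because scan durations are heavily right-skewed, so a comparison of means would be dominated by a handful of very long instances.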
The model also predicted a substitution effect. When the 15-minute cap pressured extreme-outlier behavior down, associates would shift their pattern toward quick scans under two minutes — Lever 1 displacing into Lever 2’s territory. The data confirmed it: the misuse rate ticked up from 36.7% to 39.9% during the same period. The next intervention (scan-discipline enforcement) is what closes that gap.
Lever 2 and Lever 3 have not yet been activated. The full ~$97,000 target remains a model projection. But the first lever’s prediction held under field test, and the substitution effect predicted at the modeling stage showed up in the data exactly where the model said it would.
The Reflection
The most expensive metric in operations is the one that blends populations and presents itself as a single signal.
The daily average for battery-change time wasn’t mathematically wrong. It was correct. It was also useless. It mixed scans that weren’t real battery changes with extreme-outlier scans that weren’t really battery changes either, and presented the result as the truth about how long the people actually changing batteries were taking. The average reported a 48% slowdown. The population the interpretation was about had slowed by 12%. The 36-percentage-point gap was the cost of an aggregation nobody had questioned.
Every diagnostic since starts the same way: open the histogram first. If the population is one shape, the average is informative. If the population is two shapes or three, the average is the most expensive number in the room.