The Limits of RL and Digital Twins in Manufacturing – and What Actually Works

Feb 03, 2026

In theory, the promise of AI-native factories is incredible and, in some ways, feels obvious. If a factory were end-to-end observable and automated, you could treat it like a controllable system rather than a collection of disconnected machines and people. In that world, reinforcement learning (RL) becomes an attractive approach: observe the system's state, take an action, receive a reward, and repeat.

RL paired with digital twins looks like an appealing alternative to rules-based systems, which are hard to maintain in manufacturing: there are too many interacting variables and too much uncertainty for static optimization. Rather than encoding logic by hand, the idea goes, the system could learn to adapt over time.

From a distance, a factory floor looks like a giant system that should be learnable. But in practice, much of what matters is still invisible to software and beyond its direct control.

Photos I took during different factory visits.

At some point in the future, I hope AI-native factories are the norm. But for most plants, that future is still a long way off. The US alone has hundreds of thousands of manufacturing plants, and the vast majority are legacy facilities.

If we want to improve productivity in any meaningful way, we have to optimize what already exists. And for that, RL and digital twins applied to factory-wide operations are the wrong starting point.

The First Wall: You Don’t Actually Have a State

RL assumes you can observe the state of the system you’re trying to control. In most factories, that’s not the case.

Most plants still run on old equipment with limited telemetry. Timestamps are inconsistent. Important variables like material quality, tool wear, or process hacks live in operators’ heads instead of in databases.

Retrofitting sensors is theoretically possible, but it is expensive and requires a tremendous amount of manual work. Unfortunately, there's no such thing as a clean data stream. You have to integrate PLCs, historians, MES, spreadsheets, and handwritten logs. What emerges is a stitched-together, incomplete approximation of what's actually happening on the floor.
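To make the stitching problem concrete, here is a minimal Python sketch of an as-of join: for each variable, take the most recent reading at or before a query time, and treat the variable as unknown if that reading is older than a per-source tolerance. It assumes every source has already been mapped onto one shared clock, which in a real plant is often the hard part. The source names, sampling rates, tolerances, and values are all hypothetical.

```python
from bisect import bisect_right
from dataclasses import dataclass
from typing import Optional


@dataclass
class Reading:
    ts: float     # seconds on a shared clock (assumes source clocks were already reconciled)
    value: float


def asof_lookup(readings: list[Reading], ts: float, tolerance: float) -> Optional[float]:
    """Latest reading at or before `ts`, or None if there is none or it is older than `tolerance`."""
    times = [r.ts for r in readings]  # assumes readings are sorted by timestamp
    i = bisect_right(times, ts)
    if i == 0:
        return None
    latest = readings[i - 1]
    return latest.value if ts - latest.ts <= tolerance else None


def stitch_state(ts: float, sources: dict[str, list[Reading]],
                 tolerances: dict[str, float]) -> dict[str, Optional[float]]:
    """Best-effort snapshot of the floor at `ts`; None marks a variable we cannot see."""
    return {name: asof_lookup(rs, ts, tolerances[name]) for name, rs in sources.items()}


# Hypothetical sources: a PLC tag sampled every 10 s, a historian tag whose
# 60 s sample has not arrived yet, and a coolant check that still lives in a
# handwritten log (so nothing has been digitized).
SOURCES = {
    "spindle_temp_c": [Reading(0.0, 41.0), Reading(10.0, 43.5), Reading(20.0, 47.2)],
    "vibration_mm_s": [Reading(0.0, 2.1)],
    "coolant_conc_pct": [],
}
TOLERANCES = {"spindle_temp_c": 15.0, "vibration_mm_s": 60.0, "coolant_conc_pct": 8 * 3600.0}

print(stitch_state(25.0, SOURCES, TOLERANCES))
# {'spindle_temp_c': 47.2, 'vibration_mm_s': 2.1, 'coolant_conc_pct': None}
print(stitch_state(70.0, SOURCES, TOLERANCES))
# {'spindle_temp_c': None, 'vibration_mm_s': None, 'coolant_conc_pct': None}
```

The `None` entries are the point: even this toy snapshot is only a partial state, and it gets less complete as sources go stale.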

This is why so many digital twins end up being partial and laggy, or just wrong altogether. And without a meaningful state, you can’t learn a meaningful policy.

The Second Wall: You Don’t Control the Environment

RL also assumes that actions are executed and feedback is clean. If only it worked that way in manufacturing!

Factories are socio-technical systems made up of many processes layered on top of each other: material handling, machining, quality, maintenance, scheduling, labor allocation, and more. Recommendations are sometimes ignored, schedules slip, and exceptions get made.

This makes it hard to implement systems that actually take action. Alerts, suggestions, and recommendations already exist in many cases, but telling someone what to do is very different from the system being able to do it.
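A toy example, with made-up numbers, of why the advice/action gap matters for learning. Assume two process settings with fixed, hypothetical yields, and a system that recommends `B` twice while operators follow it only once. If outcomes are credited to what was recommended rather than what was executed, that single override halves the apparent benefit of `B`.

```python
from statistics import mean

# Hypothetical ground truth: the yield you get when an action is actually executed.
TRUE_YIELD = {"A": 0.90, "B": 0.95}

# (recommended, executed) pairs; operators overrode one "B" recommendation.
RUNS = [("A", "A"), ("A", "A"), ("B", "B"), ("B", "A")]


def lift(runs, label_by):
    """Mean yield of runs labeled B minus runs labeled A.

    `label_by` chooses which label the learner trusts ("recommended" or "executed");
    the outcome itself always depends on what was actually executed.
    """
    by_label = {"A": [], "B": []}
    for recommended, executed in runs:
        label = recommended if label_by == "recommended" else executed
        by_label[label].append(TRUE_YIELD[executed])
    return mean(by_label["B"]) - mean(by_label["A"])


print(round(lift(RUNS, "recommended"), 3))  # 0.025
print(round(lift(RUNS, "executed"), 3))     # 0.05
```

With a 50% follow-through rate, the recommendation-labeled estimate is half the executed-labeled one. In this toy setup the measured effect scales with the compliance rate, so if compliance drifts, so does the signal the system is trying to learn from.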

I recently spoke about this with my friend Pavel Konov, who previously worked as an ML engineer at Toyota Research Institute and was an advisor to Hadrian. He drew a useful analogy to DevOps:

“In the same way DevOps moved from tools like Nagios to PagerDuty and now to cloud-native alerting and workflow systems, I think there’s incredible room for improvement, and AI data scientists actually can help with that. Right now if you look at any given dashboard in a factory – likely some brittle, primitive VBA script in Excel or similar – you would get thousands of alerts that engineers and operators just ignore. There are a lot of reasons for this: many false alarms, poor alerting configurability, drifting baseline values, etc. This is just like problem #3 of 60 legacy telemetry problems.”
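Two of the problems Pavel lists, false alarms and drifting baselines, can be illustrated with a short Python sketch. It compares a static threshold against a rule that measures each sample against the mean of the previous few samples and only fires after several consecutive exceedances. This is one simple technique chosen for illustration, not the approach Pavel describes or any particular product's behavior; the window, margin, count, and signal are hypothetical.

```python
from collections import deque


def fixed_threshold_alerts(values, threshold):
    """Indices where a static threshold fires: one alert per sample above the line."""
    return [i for i, v in enumerate(values) if v > threshold]


def baseline_alerts(values, window=3, margin=5.0, consecutive=2):
    """Indices where an alert fires against a rolling baseline.

    The baseline is the mean of the previous `window` samples, so slow drift moves the
    baseline instead of tripping the alert. An alert fires once the value has exceeded
    baseline + margin for `consecutive` samples in a row, which ignores one-sample
    spikes and produces one alert per excursion instead of one per sample.
    """
    history = deque(maxlen=window)
    streak = 0
    alerts = []
    for i, v in enumerate(values):
        if len(history) == window:
            baseline = sum(history) / window
            streak = streak + 1 if v > baseline + margin else 0
            if streak == consecutive:
                alerts.append(i)
        history.append(v)
    return alerts


# Hypothetical sensor: slow upward drift, a one-sample spike at index 4,
# and a sustained jump starting at index 8.
SIGNAL = [50, 51, 52, 53, 64, 55, 56, 57, 68, 69, 70, 61]

print(fixed_threshold_alerts(SIGNAL, 55))  # [4, 6, 7, 8, 9, 10, 11]
print(baseline_alerts(SIGNAL))             # [9]
```

On this drifting signal the static threshold fires seven times, including on the one-sample spike, while the baseline-relative rule fires once, on the sustained jump. The tradeoff is that a rolling baseline absorbs any shift that lasts longer than its window, so deciding what counts as "normal" still takes domain judgment.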

Closing the gap between advice and action starts with the signal layer, which itself still needs work. Even when industrial data can be collected, it often arrives through unreliable connectivity and lacks the context needed to be useful. Turning raw telemetry into meaningful signals requires significant processing and domain-specific modeling, which in turn depends on data science and engineering talent that is scarce in legacy industrial environments.

And without this, there is no stable mapping between observation and outcome for a system to learn from.

What Actually Works

There are domains where RL can work in industrial settings. These are usually in parts of the system that already behave like environments: single processes, controlled machines, and software-defined workflows. In these cases, the action space is bounded. Feedback loops are tight and fast. And humans don’t need to be involved.
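The simplest setting that fits this description is a multi-armed bandit, the one-state special case of RL: a handful of discrete actions, a reward that arrives right after each action, and no human in the loop. Below is a minimal epsilon-greedy sketch in Python for a single process choosing among a few setpoints. The setpoints, the response curve, and the noise level are hypothetical stand-ins for a real machine.

```python
import random

SETPOINTS = [180, 190, 200, 210, 220]  # hypothetical discrete setpoints, e.g. a zone temperature


def simulated_yield(setpoint, rng):
    """Stand-in for one fast, measurable run of a single process (hypothetical response curve)."""
    return 0.95 - 0.00005 * (setpoint - 200) ** 2 + rng.uniform(-0.002, 0.002)


def epsilon_greedy(rounds=300, epsilon=0.1, seed=0):
    """Learn which setpoint gives the best average yield by acting and observing."""
    rng = random.Random(seed)
    counts = {s: 0 for s in SETPOINTS}
    means = {s: 0.0 for s in SETPOINTS}
    for t in range(rounds):
        if t < len(SETPOINTS):
            s = SETPOINTS[t]                          # try every setpoint once first
        elif rng.random() < epsilon:
            s = rng.choice(SETPOINTS)                 # explore
        else:
            s = max(SETPOINTS, key=means.__getitem__)  # exploit the best estimate so far
        reward = simulated_yield(s, rng)              # reward is observed right after acting
        counts[s] += 1
        means[s] += (reward - means[s]) / counts[s]   # incremental running mean
    return counts, means


counts, means = epsilon_greedy()
print(max(means, key=means.get))  # 200
```

Because the noise here is small and bounded relative to the gap between setpoints, the learner reliably settles on the best one. Bounded actions and fast, clean feedback are what make that possible, and they are what factory-wide problems usually lack.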

These opportunities lie in process-level rather than factory-level optimization. Generally speaking, the narrower the scope, the better the results.

This is why the near-term goal shouldn’t be to make existing factories fully autonomous – it should be to make specific decisions better. That might be in the form of improved process control, cell-level optimization, smarter dispatch systems, better decision simulation, or even capturing the heuristics that operators already use inside workflows.
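Capturing operator heuristics can be as simple as writing them down as code. The Python sketch below encodes a hypothetical dispatch rule of thumb: rush anything that is nearly due, otherwise avoid a material changeover, otherwise go by due date. The job IDs, materials, and four-hour urgency window are made up for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Job:
    job_id: str
    material: str
    due_hour: float  # hours on the same clock as `now_hour`


def next_job(queue: list[Job], loaded_material: str, now_hour: float,
             urgent_within: float = 4.0) -> Optional[Job]:
    """A hypothetical operator rule of thumb, written down so it can be reviewed and tested.

    1. If anything is due within `urgent_within` hours, run the most urgent job.
    2. Otherwise stay on the loaded material to avoid a changeover.
    3. Otherwise take the earliest due date.
    """
    if not queue:
        return None
    urgent = [j for j in queue if j.due_hour - now_hour <= urgent_within]
    same_material = [j for j in queue if j.material == loaded_material]
    candidates = urgent or same_material or queue
    return min(candidates, key=lambda j: j.due_hour)


QUEUE = [Job("J1", "6061-Al", 30), Job("J2", "304-SS", 12), Job("J3", "6061-Al", 20)]
print(next_job(QUEUE, "6061-Al", now_hour=0).job_id)   # J3: no rush, so avoid the changeover
print(next_job(QUEUE, "6061-Al", now_hour=10).job_id)  # J2: due in 2 hours, worth the changeover
```

Once the rule is explicit, it can be reviewed, tested, and argued about, and it gives any later optimization a baseline to beat.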

A useful way to find the right scope is to ask a small set of concrete questions:

What exact decision does the system control?
Is that decision automatic or just advisory?
How quickly does the reward show up?
Can the system be deployed without re-architecting the plant?
Can it demonstrate lift on a single process?

These questions help determine what can actually become a learning system. They force clarity around where intelligence can be applied and whether it’s feasible to do so in practice.
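One way to keep these questions from staying abstract is to turn each into a field someone has to fill in. The Python sketch below is a hypothetical checklist, not a standard framework; the field names, example candidates, and 24-hour reward cutoff are illustrative assumptions.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    name: str
    decision: str               # the exact decision the system controls ("" if nobody can say)
    automatic: bool             # True if executed without a human in the loop, False if advisory
    reward_delay_hours: float   # how long until the outcome is measurable
    needs_rearchitecture: bool  # would deployment require re-architecting the plant?
    single_process_pilot: bool  # can lift be demonstrated on a single process?


def scoping_gaps(c: Candidate, max_reward_delay_hours: float = 24.0) -> list[str]:
    """Return the scoping questions this candidate fails; an empty list means it looks learnable."""
    gaps = []
    if not c.decision.strip():
        gaps.append("no concrete decision")
    if not c.automatic:
        gaps.append("advisory only")
    if c.reward_delay_hours > max_reward_delay_hours:
        gaps.append("slow reward")
    if c.needs_rearchitecture:
        gaps.append("requires re-architecting the plant")
    if not c.single_process_pilot:
        gaps.append("no single-process pilot")
    return gaps


FURNACE = Candidate("furnace zone control", "zone temperature setpoint", True, 0.5, False, True)
TWIN = Candidate("factory-wide digital twin", "", False, 720.0, True, False)

print(scoping_gaps(FURNACE))  # []
print(scoping_gaps(TWIN))
# ['no concrete decision', 'advisory only', 'slow reward', 'requires re-architecting the plant', 'no single-process pilot']
```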

The Path Ahead

The future of industrial AI will be built by systems that control specific functions, where state is observable and actions can be measured. Over time, this will evolve (and already is evolving) into orchestration layers that allow these systems to work together: coordinating multiple control loops, managing tradeoffs between them, and turning localized intelligence into plant-level outcomes.

Only once that foundation exists at scale does it make sense to look further ahead and attempt systems that learn the entire factory.

Mixture of Experts
