The Metropolitan Transportation Authority is looking for a vendor to build an AI system that can detect when a person, animal, or object enters the subway tracks before a train arrives. Before that system goes anywhere near the full network, the MTA plans to test it at exactly two stations, one underground and one elevated, for two years. The estimated cost of that pilot alone: $10 million to $50 million.
Read that again. Two stations. Two years. Tens of millions of dollars. And that is just to find out whether the system is good enough to expand.
If an agency running a 100-year-old subway system is willing to spend that much time and money proving an AI system works before trusting it with people's lives, what does it say about the rest of us who deploy AI into production after a two-week trial and a confident vendor demo?
The Problem the MTA Is Actually Solving
This is not AI for AI's sake. Track intrusions caused roughly 6% of all subway delays last year, and the MTA logged 1,297 unauthorized track entries, a 22% jump from 1,062 in 2019. Those incidents range from someone reaching for a dropped phone to far more serious situations. The agency has tried to solve this before. Between 2014 and 2019, it tested CCTV cameras paired with lasers and video analytics, laser scanners with visual and infrared verification, thermal cameras, and microwave scanners. None of them made the cut.
Jamie Torres-Springer, the MTA official overseeing the effort, summed up why in plain terms: the technology "didn't work to do it in a precise enough way that we could manage how we respond to it." That sentence should be printed and taped to the wall of every organization currently rushing an AI tool into production. Precision is not nice to have. It is the entire point.
The False Positive Trap Is Universal
Here is what makes this story relevant well beyond transit. Vancouver's SkyTrain system already runs a similar detection setup, and according to reporting on the MTA's plans, it is sometimes triggered by birds or debris, which then requires a worker to physically inspect the track before service resumes. Celeste Kirkland, a union safety director quoted in coverage of the MTA's plans, put it bluntly: "We have rats all through the system. Would they want a train to stop mistakenly because a rat jumped... onto the tracks?"
Swap "rat on the tracks" for "false fraud alert," "false malware detection," or "chatbot that escalates a routine billing question to a human at 2am," and you are looking at the exact same failure mode that shows up in enterprise AI every week. A system that cries wolf does not just waste time. It trains the humans around it to stop trusting it, and once that trust is gone, you have spent your AI budget building something people quietly route around. Shadow AI gets all the attention as an "uncontrolled tool" problem. This is its mirror image: a sanctioned tool nobody believes in anymore.
Why the MTA's Approach Is the Right Lesson, Not the Punchline
It would be easy to read this story as "government moves slowly, what's new." I would push back on that read. A two-year, two-station pilot with a defined budget range and a built-in evaluation period is not bureaucratic foot-dragging. It is exactly the lifecycle discipline NIST CSF 2.0 asks organizations to apply to AI: validate in a controlled environment, observe how the system performs against real conditions, and only then make the call to expand it.
Most enterprises skip straight from "the vendor's demo looked great" to "we are live in three departments." There is no instrumented pilot. There is no predefined threshold for what counts as success or failure. There is no plan for what happens if the false positive rate is too high to be useful. The MTA is building all three of those things into its plan before a single sensor goes up. That is pre-production validation done right, and it scales to organizations far smaller than a transit authority serving millions of riders a day.
This also matters more, not less, for organizations in regulated environments. Financial services, healthcare, and government contractors do not get to explain a bad rollout as "we were moving fast." Examiners and auditors will ask the same question New York is answering in advance: how did you know this was ready before you trusted it?
What This Looks Like for an Organization That Does Not Have $50 Million
You do not need New York's budget to borrow New York's discipline. You need a structure that forces the same questions to get answered before going live, not after.
- Define your "two stations" before you start. Pick a narrow, well-bounded environment to pilot in (one team, one process, one location) and resist the pressure to roll out everywhere at once just because the demo went well.
- Decide what success looks like in writing, before the pilot begins. What false positive rate is acceptable? What does "precise enough" mean for your use case? If you cannot answer that question before you start, you will not be able to answer it honestly when the results come in.
- Build the off-ramp into the plan. A pilot with no path to "this did not work, here is what we do instead" is not a pilot. It is a soft launch with extra steps.
- Measure the human cost of false positives, not just technical accuracy. Every false alert has a downstream cost: someone has to investigate it, explain it, or quietly start ignoring the system that generated it. That cost belongs in your evaluation criteria from day one.
- Treat the pilot as evidence, not theater. Document what you tested, what you found, and why you made the call you made. That record is exactly the kind of operational evidence ISO 42001 readiness, and increasingly your own board, will expect to see.
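To make the checklist above concrete, here is a minimal Python sketch of what "success criteria in writing" plus a built-in off-ramp could look like. Everything in it is an illustrative assumption rather than anything the MTA or any framework prescribes: the names PilotCriteria, PilotResults, and evaluate are hypothetical, the thresholds are made up, and "false-alert share" means the share of alerts that turned out to be false, which is often easier to count in a pilot than a textbook false positive rate.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PilotCriteria:
    """Thresholds agreed in writing before the pilot starts (hypothetical)."""
    max_false_alert_share: float      # largest acceptable share of alerts that are false
    min_detection_rate: float         # smallest acceptable share of real events caught
    max_triage_hours_per_week: float  # staff time allowed for chasing false alerts


@dataclass(frozen=True)
class PilotResults:
    """Counts collected while the pilot runs."""
    true_alerts: int
    false_alerts: int
    missed_events: int
    minutes_per_false_alert: float
    weeks_observed: int


def evaluate(criteria: PilotCriteria, results: PilotResults) -> dict:
    """Compare observed results with the written criteria and pick a path."""
    if results.weeks_observed <= 0:
        raise ValueError("weeks_observed must be positive")

    total_alerts = results.true_alerts + results.false_alerts
    real_events = results.true_alerts + results.missed_events

    # Share of all alerts that were false alarms.
    false_alert_share = results.false_alerts / total_alerts if total_alerts else 0.0
    # Share of real events the system caught; with no real events observed
    # there is no evidence of detection, so this counts as 0.0.
    detection_rate = results.true_alerts / real_events if real_events else 0.0
    # Human cost: hours per week spent investigating false alerts.
    triage_hours_per_week = (
        results.false_alerts * results.minutes_per_false_alert / 60
    ) / results.weeks_observed

    failed = []
    if false_alert_share > criteria.max_false_alert_share:
        failed.append("false_alert_share")
    if detection_rate < criteria.min_detection_rate:
        failed.append("detection_rate")
    if triage_hours_per_week > criteria.max_triage_hours_per_week:
        failed.append("triage_hours_per_week")

    return {
        "false_alert_share": false_alert_share,
        "detection_rate": detection_rate,
        "triage_hours_per_week": triage_hours_per_week,
        "failed": failed,
        # The off-ramp is a named outcome, not an afterthought.
        "decision": "expand" if not failed else "off-ramp",
    }


if __name__ == "__main__":
    # Made-up numbers for illustration only.
    criteria = PilotCriteria(
        max_false_alert_share=0.20,
        min_detection_rate=0.85,
        max_triage_hours_per_week=2.0,
    )
    results = PilotResults(
        true_alerts=45,
        false_alerts=15,
        missed_events=5,
        minutes_per_false_alert=30.0,
        weeks_observed=10,
    )
    print(evaluate(criteria, results))
```

With these made-up numbers, 15 of 60 alerts are false, a 25% share against a 20% limit, so the sketch returns "off-ramp" even though detection and triage time are within bounds. The point is less the arithmetic than the order of operations: the criteria object exists before the results do.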
The Bottom Line
New York is not testing AI for two years because it does not trust technology. It is testing AI for two years because it understands what is actually at stake when a system that watches subway tracks gets it wrong, in either direction. That is the same math every enterprise leader needs to run before putting AI in front of customers, employees, or critical processes. The question is not whether you can afford to pilot properly. It is whether you can afford not to.
If your organization is moving faster on AI deployment than your testing and validation process can keep up with, that gap is where the expensive surprises live. The SamurAI helps organizations build pre-production validation programs grounded in NIST CSF 2.0, so AI gets proven before it gets trusted with the things that matter.



