Why all tests pass but code still breaks in production

There’s a morning I’ll probably remember for a long time.

We had just finished a new payment feature — the kind of feature you’re not allowed to get wrong, because there’s real money on the other end. Before deploying, I ran the entire test suite one more time, just to be safe. Everything passed. CI passed. Every check we had said “looks good.”

And still, a bug slipped through.

I’m not telling this story to confess a mistake. I’m telling it because it taught me something about testing that no course had ever made clear, and because the way the bug was discovered, isolated, and handled says a lot about what a good process actually protects you from. I’ll come back to that part later.

First, the bug itself.

It wasn’t an obvious red-screen failure. The only symptom the client reported was a single line: they couldn’t add a card, even though that same card worked fine everywhere else. The controller caught the error and returned a generic message, so from the outside, there was no clear signal of what was going wrong. So we did what we always do: investigate, trace, fix.

Digging in, it turned out the failure happened at the member creation step on the payment gateway — meaning we never even got to the actual charge. And it had nothing to do with the card itself.

What made me sit with this longer wasn’t the bug.

It was the question: if all tests passed, what exactly were they testing?

This article is the answer I arrived at, with real code from that project (cleaned up and anonymized). On this blog, a colleague of mine once wrote about using AI to code from a management perspective, with a metaphor I’ll borrow later: AI is a great co-pilot, but a terrible captain. This is the same lesson, told from inside the terminal — this time with code, and a real situation to examine.

The trap of “everything passed”

Let me show you a typical test.

This feature runs as a background job: call the payment service, capture the payment, then create an invoice. The test looks like this:

// Mock the gateway service and script its return value by hand.
$mockGmoService = $this->createMock(PaymentGatewayService::class);
$mockGmoService->method('directCapturePayment')->willReturn([
    'success' => true,
    'entry' => ['success' => true, 'data' => ['accessID' => 'ACC123']],
    'exec'  => ['success' => true, 'data' => ['tranID'   => 'TRN123']],
]);

// Run the job against the mocks (setup of $job and $mockPaymentService omitted).
$job->handle($mockGmoService, $mockPaymentService);

// Assert that exactly one invoice exists and that it is marked as paid.
$this->assertDatabaseCount('ad_invoices', 1);
$invoice = AdInvoice::first();
$this->assertEquals(INVOICE_STATUS_PAID, $invoice->status);

At a glance, it looks fine. It creates a mock service, tells it “whenever directCapturePayment is called, return this,” runs the job, and checks that an invoice is created correctly.

The test passes. It will always pass.

[Figure: terminal output showing all unit tests passing (green) for GmoMemberServiceTest. Caption: All tests are green, but none of them validate the real behavior.]

That’s the problem.

Look closely at the mock. I’m manually declaring that directCapturePayment returns a structure with success, entry, and exec, with tranID nested under exec.data. But where did that structure come from? From my memory of the service, at the moment I wrote the test. The mock never checks the real service. It just replays the scenario I defined.

In other words: the mock is both the question and the answer.

This test doesn’t verify whether the job calls the service correctly — it verifies whether the job handles the exact array I just made up. It’s a mirror. I look into it, see my assumptions reflected back, and say “yep, that’s correct.”

And this isn’t an isolated case. Across that layer, almost all tests follow the same pattern — mock a service, define its output, assert against that output:

PHP

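// Script the mocked member service: for this member and card sequence, return this exact payload.
$this->gmoMemberServiceMock
    ->shouldReceive('searchCard')
    ->with($memberId, $cardSeq)
    ->andReturn([
        'success' => true,
        'data' => [['cardSeq' => '1', 'cardNo' => '411111******1111']],
    ]);

// Call the service under test, then assert on the payload scripted above.
$result = $this->paymentService->searchCard($memberId, $cardSeq);
$this->assertTrue($result['success']);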

Same mechanism. The success/data shape is defined by me inside the test. What the real service actually returns — the test neither knows nor cares.

Two green islands, and a gap between them

That’s the first piece. The second is what actually caused the failure.

The payment service — the one being mocked — has its own test suite. And that suite also passes. At some point, the real output of that service changed: some keys were renamed, the nesting structure shifted slightly. The developer who made the change did the reasonable thing: updated the service tests to match the new output. Those tests passed again.

You can probably see it now.

On the service side: code changed, tests updated, green. On the caller side (the payment job): still mocking the old structure, because the mock is a frozen snapshot of old assumptions. Tests are still green.

Two sides, two green islands.

But nothing in between.

No test checks whether “what the real service returns” still matches “what the caller expects.” That alignment — what testing calls a contract — drifted silently, without a single test turning red.

That’s how “everything passed” and “it broke in production” can both be true at the same time.

The tests did pass. They just didn’t test the thing that mattered.

A green test is not proof that the system is correct — it’s proof that the code matches the assumptions you encoded in the test. If those assumptions are wrong, the test will happily confirm the wrong thing for you.
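Here is the gap reduced to a minimal, framework-free sketch. Every name in it (realGatewayCapture, extractTranId, the transactionId key) is hypothetical: the functions stand in for the real service and the job, and the renamed key is an illustrative assumption, not the actual change from our project.

```php
<?php
declare(strict_types=1);

// Hypothetical stand-in for the real service AFTER its output changed:
// 'tranID' was renamed to 'transactionId' (illustrative assumption).
function realGatewayCapture(): array
{
    return [
        'success' => true,
        'exec' => ['success' => true, 'data' => ['transactionId' => 'TRN123']],
    ];
}

// Hypothetical caller logic: still reads the old key.
function extractTranId(array $response): ?string
{
    return $response['exec']['data']['tranID'] ?? null;
}

// Island 1: the service's own test, updated to the new shape. Green.
$serviceTestPasses = realGatewayCapture()['exec']['data']['transactionId'] === 'TRN123';

// Island 2: the caller's test, fed a mock frozen at the old shape. Green.
$staleMock = [
    'success' => true,
    'exec' => ['success' => true, 'data' => ['tranID' => 'TRN123']],
];
$callerTestPasses = extractTranId($staleMock) === 'TRN123';

// The gap: feed the caller what the service really returns now.
$worksTogether = extractTranId(realGatewayCapture()) !== null;

var_dump($serviceTestPasses, $callerTestPasses, $worksTogether);
```

The first two values are true and the third is false: each island is green on its own, and the only check that fails is the one neither test suite contains.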

The same problem, one layer higher

The most interesting part wasn’t even in the code.

To properly test this payment flow, you need a completely new account — because only a new account triggers member creation on the payment gateway, which is exactly the path we needed to validate. The problem was: our testing process had no step that verified whether the account used for testing was actually new, or just an old account with its card removed, which looks new on the surface.

Without verification, “we tested it and saw no issue” and “we actually tested the right scenario” are two completely different things — even though they look identical from the outside.
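One way to close that gap is to turn the assumption into an explicit check before the scenario starts. The sketch below assumes the account record stores the gateway member ID once a member has been created; the field names and helper functions are hypothetical, not our real schema.

```php
<?php
declare(strict_types=1);

// Hypothetical account records; 'gateway_member_id' and 'cards' are
// illustrative field names, not the project's real schema.
$freshAccount = ['id' => 1, 'gateway_member_id' => null, 'cards' => []];
$oldAccountCardRemoved = ['id' => 2, 'gateway_member_id' => 'M-0002', 'cards' => []];

// Looks new on the surface: no cards on file.
function looksNew(array $account): bool
{
    return $account['cards'] === [];
}

// Actually new: no member was ever created on the gateway, so adding a card
// goes through the member-creation path we want to exercise.
function willTriggerMemberCreation(array $account): bool
{
    return $account['gateway_member_id'] === null;
}

// Guard to run before the manual or automated scenario starts.
function assertValidTestAccount(array $account): void
{
    if (!willTriggerMemberCreation($account)) {
        throw new RuntimeException(
            "Account {$account['id']} already has a gateway member; it will skip member creation."
        );
    }
}

assertValidTestAccount($freshAccount); // passes silently
```

Calling assertValidTestAccount($oldAccountCardRemoved) throws, even though looksNew() returns true for the same account. That difference is exactly what our process was not checking.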

And this is where it clicked for me: it’s the same problem as the mock, just at a different layer.

The mock assumes the service still returns the old shape, without verification. The testing process assumes the account is new, also without verification.

In both cases, an unverified assumption passes through every checkpoint without being challenged. The green color in the terminal, and a “seems fine” test run, are the same thing at their core: reassurance without validation.

This is no longer just a problem of tests or processes. It’s the same failure repeating across layers: unverified assumptions passing through every control point. When they drift away from reality, the system still reports green — just green according to what it believes, not what is actually happening.

Of course we traced where it slipped through — you have to understand it to fix it. But I didn’t want to spend too long asking “who let it through,” because that question rarely changes anything next time. When a bug passes through every checkpoint at once, it’s usually not that one checkpoint failed — it’s that the system is missing a layer that should have been there. Fixing individual gates won’t help; adding the missing layer will.

The safety net that caught the fall

I said a bug made it all the way to production, and you might expect a disaster.

No customers were affected.

Not because we got lucky, but because of a decision the team made before writing a single line of code.

Since this was the first release of a payment feature, we assumed there would be unknown issues, even if all tests passed. So instead of enabling it for everyone, we rolled it out in layers: we deployed the backend to production and pointed the staging frontend at it for testing, while real users on the app and web couldn’t reach the new payment flow yet.

The bug surfaced exactly in that buffer zone.

It hit the net, not the users.
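In our case the gate was simply which frontend pointed at the new backend flow. The same idea can be sketched as a server-side allowlist; this is an illustration under that assumption, and the channel names, constant, and function are all hypothetical.

```php
<?php
declare(strict_types=1);

// Hypothetical allowlist of client channels that may reach the new flow.
// In the rollout described above, only the staging frontend would be listed.
const NEW_PAYMENT_FLOW_CHANNELS = ['staging-web'];

// Returns true only for channels explicitly opted in; anything else is denied.
function canUseNewPaymentFlow(string $channel): bool
{
    return in_array($channel, NEW_PAYMENT_FLOW_CHANNELS, true);
}

foreach (['staging-web', 'production-web', 'mobile-app'] as $channel) {
    echo $channel, ': ', canUseNewPaymentFlow($channel) ? 'new flow' : 'hidden', PHP_EOL;
}
```

The point of denying by default is that a bug in the new flow can only be reached through a channel someone deliberately opened.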

As for why the final test was done by the client instead of us: to fully run this flow, you need a real card transaction. There’s no good reason to charge a developer’s personal card. This was agreed from the start: production testing would be done by the client, on their own data, on their own schedule. A cautious release is a two-sided agreement, built upfront — not a last-minute reaction.

Looking back, the lesson isn’t “our tests were bad and we almost caused a disaster.”

It’s this: we never treated passing tests as sufficient proof of correctness, so we put a real-world safety net in place. That healthy skepticism — that your verification tools might be lying — is what turns a potentially serious incident into a calm log review.

Defense in depth is not overkill. It exists for exactly the days when your first line of defense — your tests — betrays you.

When AI protects the wrong thing

Most of these tests were written with AI assistance. It was fast, clean. Give it a service, and it produces a full test file: well-named methods, decent coverage. It feels like having an extra teammate.

After finding the root cause, I went back to the AI and asked it to rewrite these tests to import the real service and validate real behavior instead of mocking static outputs.

It refused.

Not with an error — with reasoning. It argued that mocking external services is best practice, that unit tests should stay isolated, that they shouldn’t touch real systems.

Here’s the key point: it wasn’t wrong.

According to textbooks, that’s correct. Mocking external services for fast, isolated unit tests is standard practice.

But it’s blind to something no book can encode for you: context.

The boundary it insisted on mocking wasn’t some harmless dependency — it was where real money flows. And that “correct unit testing practice” became the exact gap where an integration bug slipped through. The rule “unit tests shouldn’t touch real systems” works in nine out of ten cases. This was the tenth, where you need an additional layer that goes all the way to where mocks can’t reach.

AI doesn’t know which one you’re in.

It can’t distinguish between “safe to mock” and “mocking here loses money.” It applies the rule correctly, in the wrong place — and then confidently defends it. The same confidence that made me trust the green tests in the first place.

Recognizing when the rule doesn’t apply is a human job.

That’s not something you hand over to a co-pilot.

(We also look at the limitations of AI, and at how developers should approach AI suggestions without blindly accepting them, in our article “How AI has changed the way Developers learn new technologies”.)

So what does “fixing it” actually look like?

To be clear: the lesson is not “don’t mock,” and definitely not “don’t use AI to write tests.” Mocking is useful — you shouldn’t hit a real payment gateway every time you run tests.

The problem isn’t mocking.

It’s stopping there.

In the same project, there are tests I actually trust. They look like this:

PHP

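// The gateway is still mocked; the job itself runs for real.
$job->handle($mockGmoService, $mockPaymentService);

// Side effects in the database: one invoice, marked paid, with a payment timestamp.
$this->assertDatabaseCount('ad_invoices', 1);
$invoice = AdInvoice::first();
$this->assertEquals(INVOICE_STATUS_PAID, $invoice->status);
$this->assertNotNull($invoice->paid_at);

// Reload the purchase from the database and check the billing date was recorded.
$this->purchase->refresh();
$this->assertNotNull($this->purchase->last_billing_date);

// Check that the success mail was sent (requires Mail::fake() in the test setup).
Mail::assertSent(AdPaymentSuccessMail::class);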

The difference is in the assertions.

The gateway is still mocked here, but this test doesn’t stop at echoing the mock’s return value. It runs against a real database (via the RefreshDatabase trait) and checks side effects: was the invoice written, is the status PAID, was the payment date set, was the success email dispatched (Mail::assertSent checks this against Laravel’s mail fake). These are things the mock’s return value can’t supply on its own. If invoice logic breaks, this test turns red.

From this, I ended up with a few principles for myself:

Mock at the boundary, but have at least one layer of tests that crosses that boundary for real. For a payment integration, that usually means at least one test running against the provider’s sandbox — enough to catch when the response contract changes, instead of relying on an outdated mock structure. If a mock is a mirror, a contract test is a window — it lets you see the real world instead of your own assumptions.
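A minimal sketch of what such a contract check could look like. It assumes the contract is the set of key paths the job reads from the capture response; the constant, the helper, and the sandbox payload are all illustrative. The sandbox payload is hard-coded here, with a hypothetical renamed key, so the example runs offline. In a real suite it would come from an actual sandbox call.

```php
<?php
declare(strict_types=1);

// The paths the caller relies on, written down once. The keys mirror the
// mock shown earlier; treating them as "the contract" is an assumption.
const CAPTURE_CONTRACT = ['success', 'entry.data.accessID', 'exec.data.tranID'];

// Returns the contract paths that are absent from $response.
function missingPaths(array $response, array $paths): array
{
    $missing = [];
    foreach ($paths as $path) {
        $node = $response;
        foreach (explode('.', $path) as $key) {
            if (!is_array($node) || !array_key_exists($key, $node)) {
                $missing[] = $path;
                continue 2; // move on to the next path
            }
            $node = $node[$key];
        }
    }
    return $missing;
}

// The fixture used by the mocked tests.
$mockFixture = [
    'success' => true,
    'entry' => ['success' => true, 'data' => ['accessID' => 'ACC123']],
    'exec'  => ['success' => true, 'data' => ['tranID' => 'TRN123']],
];

// Placeholder for a response captured from the provider's sandbox.
$sandboxResponse = [
    'success' => true,
    'entry' => ['success' => true, 'data' => ['accessID' => 'ACC999']],
    'exec'  => ['success' => true, 'data' => ['transactionId' => 'TRN999']],
];

var_dump(missingPaths($mockFixture, CAPTURE_CONTRACT));     // empty: the mock satisfies the contract
var_dump(missingPaths($sandboxResponse, CAPTURE_CONTRACT)); // lists 'exec.data.tranID'
```

Checking the mock fixture and the sandbox response against the same list is what ties the two sides together: when the provider's shape drifts, the second check turns red even though every mocked test stays green.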

Be careful with tests that re-implement logic. In this project, there were tests that duplicated the pricing formula inside the test and asserted against it, instead of calling the actual pricing function. Those tests always pass, even if production is wrong, because they write the question and the answer at the same time. A real test calls the logic; it doesn’t reenact it.
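The difference fits in a few lines. The pricing function below is hypothetical and contains a deliberate bug; it is not the project's formula.

```php
<?php
declare(strict_types=1);

// Hypothetical production function with a deliberate bug:
// it adds the discount instead of subtracting it.
function priceWithDiscount(int $basePrice, int $discount): int
{
    return $basePrice + $discount; // bug: should be $basePrice - $discount
}

// Test A reenacts the formula inside the test and never calls production.
$reenacted = 1000 - 200;
$testAPasses = ($reenacted === 800);

// Test B calls the real function and compares against a hand-computed value.
$testBPasses = (priceWithDiscount(1000, 200) === 800);

var_dump($testAPasses, $testBPasses);
```

Test A reports true and Test B reports false. Only the test that calls the function notices that production is wrong.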

And the hardest one, because it’s not technical: don’t let the feeling of “green” replace understanding what your tests actually validate. A clean green screen is comfortable. It makes you stop asking questions. But the question “if I intentionally break this code, will this test fail?” is worth more than a hundred green checks.
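That question can also be asked mechanically. The sketch below is a hand-rolled version of the idea (mutation testing tools automate it); all names are hypothetical, and the string 'PAID' stands in for the project's status constant.

```php
<?php
declare(strict_types=1);

// Hypothetical logic under test, plus a deliberately broken variant.
$markPaid   = fn(array $invoice): array => ['status' => 'PAID'] + $invoice;
$brokenPaid = fn(array $invoice): array => $invoice; // forgets to set the status

// A test that inspects what the code did to its input.
$realTest = fn(callable $impl): bool => $impl(['status' => 'PENDING'])['status'] === 'PAID';

// A mirror test: it only checks a value the test itself supplied.
$mirrorTest = function (callable $impl): bool {
    $canned = ['status' => 'PAID'];
    return $canned['status'] === 'PAID'; // $impl is never exercised
};

// A test earns trust only if it passes on the real code and fails on the broken one.
$detectsBreakage = fn(callable $test): bool => $test($markPaid) && !$test($brokenPaid);

var_dump($detectsBreakage($realTest), $detectsBreakage($mirrorTest));
```

The first value is true and the second is false: the mirror test stays green no matter what the code does, which is the signal that it validates nothing.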

Still about co-pilots and captains

I still let AI write tests. It’s still fast, still useful. What changed isn’t the tool — it’s that I no longer confuse “AI finished writing” with “this is done,” or “tests are green” with “the code is correct.”

AI is a good co-pilot. It writes faster than I do, makes fewer typos, remembers syntax better.

But it doesn’t know which boundary carries money. It doesn’t know this is the first release and needs extra caution. It doesn’t know the mock it just wrote drifted from reality last week.

Those are not things you delegate.

A green test is an invitation to rest.

Sometimes you deserve that rest.

But before you take it, ask one question: is this green telling me the system is correct, or just that it matches what I believed from the start?

Those are not the same.

And sometimes, the distance between them is exactly one payment flow.

(This article focused more on the problem than on “how to properly write tests with AI.” That deserves its own piece — and honestly, it’s something I’m still figuring out myself. So I’ll leave it for the next article, as a public promise to sit down and work it out properly. See you there.)


Thinh Nguyen

Senior Software Engineer · Linnoedge Inc.
