When AI Fails in Tax - The Netherlands Scandal Explained

On January 15, 2021, the entire Dutch government resigned.

The cause was not a budget crisis or a political scandal in the usual sense. It was an algorithm. For nearly a decade, the Dutch Tax and Customs Administration had been using a risk classification system to flag suspected fraud in childcare benefit claims. The system targeted tens of thousands of innocent families, many of them working-class and immigrant. The recovery actions that followed pushed people into bankruptcy, broke up marriages, and led to over a thousand children being placed in foster care.

The technical story is about an algorithm. The real story is about everyone around it.

Quick answer

The Netherlands tax AI scandal, known in Dutch as the toeslagenaffaire, is the most cited example of AI failure in tax administration. Between roughly 2013 and 2019, the Dutch Tax and Customs Administration used a machine learning risk system that wrongly accused around 26,000 families of childcare benefits fraud. The system used sensitive data points, including nationality, as risk indicators. In January 2021, the entire Dutch cabinet resigned over it. It is the case every tax administration in Europe now studies.

What Actually Happened in the Netherlands

The Dutch childcare benefits system pays subsidies to working parents. It is administered by the Belastingdienst, the Dutch tax authority. Like most modern benefits systems, it relies on a risk model to flag claims for review.

The Belastingdienst's risk classification model, together with an internal fraud register known as FSV (Fraude Signalering Voorziening, or fraud signaling facility), did several things wrong at the same time.

It used nationality as a risk factor. Having a second nationality or a "non-Dutch sounding" surname increased your risk score. The Dutch Data Protection Authority later ruled this discriminatory.
It was trained on old data. Patterns from years earlier were used to flag families whose situations were entirely different.
It treated a flag as proof. Once flagged, families were not investigated. They were assumed guilty.
It had no working appeal path. Affected parents were required to pay back tens of thousands of euros immediately. Appeals took years. Many families could not survive the wait.

The consequences were brutal. The report of the parliamentary inquiry, titled Ongekend onrecht (Unprecedented Injustice), confirmed that around 26,000 families were wrongly accused. Roughly 1,675 children from affected families were taken into state care after the recovery actions caused financial collapse at home. Several suicides have been linked to the scandal.

The third Rutte cabinet resigned on January 15, 2021. The compensation scheme that followed offered each affected family a minimum of 30,000 euros. The total bill is still being calculated.

Why the Algorithm Was the Wrong Place to Start

It is tempting to blame the math. The math was the smallest part of the problem.

Every serious review of the case has reached the same conclusion. The technical model was crude, but the real failures were organizational.

The training data was indefensible. Sensitive variables like nationality should never have been allowed into a model that affected someone's right to a benefit. There were no formal bias tests. No external review.
Senior management trusted the system. Officials inside the Belastingdienst treated flagged families as confirmed fraudsters. The model became a shield against accountability.
There was no human in the loop where it mattered. Decisions that destroyed lives were processed at scale, with no real review at the level of the individual case.
The administration could not explain its own decisions. When families asked why they had been flagged, no clear answer existed.

That last point is the one every tax administration should sit with.

The Black Box Problem in Court

When an AI system flags a case for audit or denies a refund, the affected person can appeal. Eventually, the appeal lands in court. The administration must then explain why the system produced that decision.

This is where many modern AI systems break down.

A simple rule-based model can be explained in a paragraph. A more complex machine learning model, especially one built bottom-up from large amounts of data, is much harder to translate into language a judge can use.

In tax, "we cannot explain it" is a losing argument. The standard the courts want is defensibility. You must be able to show that the decision was based on reasonable, lawful, non-discriminatory logic. If you cannot, you lose.

This is why some administrations, including the Canada Revenue Agency, are reluctant to disclose when machine learning was used in case selection. The legal exposure is real.
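To make "defensible" concrete, here is a minimal sketch of a rule-based flag that records a plain-language reason for every rule it fires. The field names, rules, and thresholds are hypothetical and chosen only for illustration. They are not taken from any real administration's system.

```python
from dataclasses import dataclass


@dataclass
class Claim:
    declared_hours: float    # childcare hours claimed per month (hypothetical field)
    contracted_hours: float  # hours in the childcare contract (hypothetical field)
    has_invoice: bool        # whether a provider invoice is on file


# Each rule pairs a test with the sentence a caseworker or judge would read.
RULES = [
    (lambda c: c.declared_hours > c.contracted_hours,
     "Declared hours exceed the contracted hours."),
    (lambda c: not c.has_invoice,
     "No invoice from the childcare provider is on file."),
]


def review_reasons(claim: Claim) -> list[str]:
    """Return the plain-language reason for every rule the claim triggers."""
    return [reason for test, reason in RULES if test(claim)]


claim = Claim(declared_hours=180, contracted_hours=160, has_invoice=True)
for reason in review_reasons(claim):
    print("Flag for review:", reason)
```

Because the reasons are stored alongside the flag, the same sentences can go into the case file, the letter to the taxpayer, and the court submission.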

The Random Forest Trap

Random forests are one of the most popular machine learning algorithms in use today. They were introduced by Leo Breiman in 2001 and have been a workhorse in fraud detection, audit case selection, and risk scoring ever since.

A random forest works by building many decision trees (often thousands) and combining their predictions. The result is usually much more accurate than any single tree could be on its own.

Here is the catch. A random forest with 10,000 trees has no single explanation. You can show a court a million data points. You cannot show them if-then logic they can read.

In a low-stakes setting, this trade-off is fine. In tax enforcement, where every decision is appealable, it is dangerous. This is one of the reasons the IRS Discriminant Index Function, which I covered in the previous post, is much simpler than a random forest. Its simplicity is exactly what has kept it defensible for over 50 years.
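The explanation gap can be sketched with a toy ensemble. The code below is not a trained random forest. It is a majority vote over randomly generated one-split rules, with made-up feature values, and it only illustrates how the number of rules behind one decision grows with the number of trees.

```python
import random


def make_stump(rng, n_features):
    """One hypothetical one-split tree: 'if feature[i] > threshold, vote to flag'."""
    return (rng.randrange(n_features), rng.random())


def forest_votes(stumps, x):
    """Return every rule that voted to flag the case x."""
    return [(i, t) for (i, t) in stumps if x[i] > t]


def predict(stumps, x):
    """Flag the case when a strict majority of rules vote to flag it."""
    return len(forest_votes(stumps, x)) * 2 > len(stumps)


rng = random.Random(0)   # fixed seed so the run is repeatable
case = [0.9, 0.2, 0.7]   # made-up feature values for one claim
for n_trees in (1, 100, 10_000):
    stumps = [make_stump(rng, len(case)) for _ in range(n_trees)]
    voted = forest_votes(stumps, case)
    print(f"{n_trees} trees: flagged={predict(stumps, case)}, "
          f"rules behind the decision={len(voted)}")
```

With one rule, the explanation is one sentence. With thousands, the honest explanation is a list of thousands of rules, which is the form a court cannot use.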

A Quieter Failure: The HMRC Chatbot Story

The Dutch case is the most extreme example. There are quieter ones worth knowing.

In the workshop, the presenter described what happened when the UK's HMRC deployed a chatbot that did not work well. According to his account, a third party scraped HMRC's public guidance, built their own chatbot using the same content, and many taxpayers started using the third-party version instead.

The public details of this specific case are limited, so I will not overstate it. But the principle is universal. If your administration ships a poor-quality AI tool, taxpayers will work around you. They will use private alternatives. They may receive bad guidance. Some of them will be penalized for following it.

The lesson is not to avoid chatbots. It is to either ship one that works or not ship one at all.

The US EITC Audit Disparity

In 2023, a research team led by Hadi Elzayn at Stanford published a study showing that Black taxpayers in the United States were audited at between 2.9 and 4.7 times the rate of non-Black taxpayers, largely driven by the IRS's audit selection process for the Earned Income Tax Credit. The IRS later acknowledged the disparity and committed to changes.

This was not a single dramatic incident like the toeslagenaffaire. It was a slow, systemic disparity, invisible until researchers measured it. That makes it almost more important to study. Most AI failures in tax administration will not look like the Netherlands. They will look like patterns no one is checking for.
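Checking for this kind of pattern does not need sophisticated tooling. The sketch below computes audit rates per group and the ratio between them. The records are invented, and the group label is assumed to be available for evaluation purposes only, not as a model input.

```python
from collections import Counter


def audit_rates(records):
    """Audit rate per group from (group, was_audited) pairs."""
    totals, audited = Counter(), Counter()
    for group, was_audited in records:
        totals[group] += 1
        audited[group] += int(was_audited)
    return {group: audited[group] / totals[group] for group in totals}


def rate_ratio(rates, group, reference):
    """How many times more often `group` is audited than `reference`."""
    return rates[group] / rates[reference]


# Hypothetical evaluation sample: 1,000 returns per group.
records = (
    [("A", True)] * 30 + [("A", False)] * 970
    + [("B", True)] * 10 + [("B", False)] * 990
)
rates = audit_rates(records)
print(f"Group A is audited {rate_ratio(rates, 'A', 'B'):.1f}x as often as group B")
```

A ratio well above 1 does not prove discrimination on its own, but it is the signal that should trigger a closer look at how cases are being selected.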

The Pattern: It Is Rarely the Algorithm

Look across all three cases. The Netherlands. The HMRC chatbot story. The US EITC findings.

What they have in common is not the math. It is the absence of three things:

A clear path for the affected person to challenge the decision
An honest review of training data before the system went into production
A senior owner accountable for what the system does in the field
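The first item on that list can be expressed as a gate in the case workflow. This is a minimal sketch with hypothetical status names and functions: a flag only opens a case, a caseworker's finding is required to confirm it, and recovery is blocked while an appeal is open.

```python
from dataclasses import dataclass
from enum import Enum, auto


class Status(Enum):
    FLAGGED = auto()    # the model raised a signal, nothing more
    CONFIRMED = auto()  # a caseworker found evidence of fraud
    CLEARED = auto()    # a caseworker found no fraud


@dataclass
class Case:
    status: Status = Status.FLAGGED
    appeal_open: bool = False


def human_review(case: Case, fraud_found: bool) -> None:
    """Only a caseworker's finding moves a case out of FLAGGED."""
    case.status = Status.CONFIRMED if fraud_found else Status.CLEARED


def may_start_recovery(case: Case) -> bool:
    """Recovery needs a confirmed finding and no pending appeal."""
    return case.status is Status.CONFIRMED and not case.appeal_open


case = Case()
print(may_start_recovery(case))  # a bare flag is not enough
human_review(case, fraud_found=True)
case.appeal_open = True
print(may_start_recovery(case))  # enforcement pauses during the appeal
```

The design choice is that enforcement is opt-in: nothing in the code path lets a model score alone trigger a recovery action.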

The presenter in the workshop put it bluntly. "It has less to do with the algorithms. Everything to do with the people managing them."

What This Means for Your Administration

If you are responsible for AI in a tax administration, the practical takeaways from these cases are short.

Audit your training data for sensitive variables. Nationality, ethnicity, gender, and proxies for them must not be in your model unless you have a clear, lawful, justified reason. Document the justification.
Build a real appeal path before you deploy. Not a form on a website. A real human review, with a clear timeline and a real stop on enforcement during the review.
Use the simplest model that works. A logistic regression you can explain beats a random forest you cannot. Save the complex models for low-stakes work.
Have a kill switch. Every production AI system should have a clear, documented procedure to turn it off if something goes wrong. Test the procedure.
Disclose your use cases publicly. The Brazilian tax authority (RFB) is currently held up as a good model for this. Public disclosure forces scrutiny in advance, which is much cheaper than scrutiny after a scandal.
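The first step, auditing training data for sensitive variables, can start as a simple pre-deployment check. The sketch below assumes a hypothetical feature list and a hypothetical list of proxy variables. A name-based check like this cannot detect statistical proxies, so it is a first gate, not a bias test.

```python
SENSITIVE = {"nationality", "ethnicity", "gender"}
# Hypothetical proxy list; a real one needs legal and domain review.
PROXIES = {"surname", "birthplace", "postcode"}


def unjustified_features(features, justifications):
    """Return sensitive or proxy features that lack a documented justification."""
    risky = (SENSITIVE | PROXIES) & set(features)
    return sorted(f for f in risky if not justifications.get(f, "").strip())


features = ["income", "declared_hours", "nationality", "postcode"]
justifications = {"postcode": "Needed to match the claim to a registered provider."}

blocked = unjustified_features(features, justifications)
if blocked:
    print("Do not deploy. Undocumented sensitive features:", blocked)
```

Running a check like this in the release process turns "document the justification" from a policy statement into a condition the model cannot ship without.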

The Netherlands case did not have to happen. Every one of those five steps could have prevented it. None of them required new technology. All of them required someone in a position of authority to insist on them before the system went live.

That is still the cheapest, most effective AI governance step you can take.

Frequently Asked Questions

What is the Netherlands tax AI scandal?

The Netherlands tax AI scandal, also known as the toeslagenaffaire, refers to a years-long failure by the Dutch Tax and Customs Administration. It used a machine learning risk system to flag suspected fraud in childcare benefits claims. Around 26,000 families were wrongly accused, leading to bankruptcies, family breakups, and the resignation of the entire Dutch government in January 2021.

Why did the Dutch algorithm fail?

The system used nationality and other sensitive variables as risk factors, was trained on outdated data, treated a flag as proof of fraud, and provided no real path for affected families to appeal. The Dutch Data Protection Authority found the methods discriminatory.

Why is AI explainability so important in tax?

Tax decisions are legally appealable. If a system flags a case for audit or denies a refund, the administration must be able to explain why in court. Algorithms that cannot be explained in plain language are difficult to defend, regardless of how accurate they are.

What is a random forest, and why is it risky for tax administrations?

A random forest is a machine learning algorithm that combines many decision trees, often thousands, to make a prediction. It is often highly accurate but very hard to explain. In tax administration, where every decision can be challenged, this lack of explainability creates real legal risk.

Are there examples of AI failure outside of the Netherlands?

Yes. Stanford research published in 2023 showed that Black taxpayers in the United States were audited at 2.9 to 4.7 times the rate of non-Black taxpayers under the IRS's Earned Income Tax Credit audit selection. The IRS acknowledged the disparity and committed to changes.

How can a tax administration reduce the risk of AI failure?

Five practical steps. Audit training data for sensitive variables. Build a real appeal path. Use the simplest model that works. Have a tested kill switch. Disclose your AI use cases publicly.

References

Ongekend onrecht (Unprecedented Injustice). Final report of the Dutch parliamentary inquiry into the childcare benefits affair. Tweede Kamer der Staten-Generaal, December 2020. The official parliamentary record of the scandal.
Autoriteit Persoonsgegevens (Dutch Data Protection Authority). Investigation report on the Belastingdienst's use of nationality in childcare benefit processing. July 2020. In December 2021 the DPA imposed a fine of 2.75 million euros on the tax administration for unlawful and discriminatory data processing.
Algemene Rekenkamer (Netherlands Court of Audit). Reports on the FSV (Fraude Signalering Voorziening) risk classification system.
NRC Handelsblad and Trouw. Investigative reporting from 2018 onward that brought the scandal into the public eye.
Breiman, L. (2001). Random Forests. Machine Learning, 45 (1), pp. 5-32. The foundational paper that introduced the random forest algorithm.
Elzayn, H., Smith, E., Hertz, T., Ramesh, A., Goldin, J., Ho, D.E., and Fisher, R. (2023). Measuring and Mitigating Racial Disparities in Tax Audits. Stanford Institute for Economic Policy Research. The study documenting EITC audit disparities by race.
Internal Revenue Service. Letter from Commissioner Daniel Werfel to the Senate Finance Committee, May 2023. Acknowledging racial disparities in IRS audit selection.
European Union. Regulation (EU) 2024/1689 (the EU AI Act). Annex III lists high-risk AI use cases, including systems used by public authorities to evaluate eligibility for essential public benefits. The obligations attached to these systems address many of the failures seen in the Netherlands case.
Brazil. Receita Federal do Brasil. Published AI policy for tax administration. Presented at the United Nations Committee of Experts on International Cooperation in Tax Matters in 2025.
International Monetary Fund. Technical Notes on AI in Revenue Administration. Available through the IMF eLibrary.