Public sector urged to put data provenance at AI core

Butterfly Data has urged public sector organisations to put data provenance at the centre of AI development, arguing that the issue goes beyond conventional data quality work.

Maja Strawinska, a data scientist at Butterfly Data, said teams often assume cleaner data will resolve concerns about fairness, accuracy and governance. But even a well-structured dataset may be unsuitable for AI if organisations cannot explain where it came from, why it was collected, and whether it can lawfully and appropriately be reused.

She drew a distinction between clean data and trustworthy data, particularly in the public sector, where automated systems can affect access to services and care. In that context, the history of a dataset matters as much as its format or completeness.

At the centre of her argument was a simple question: can organisations explain the origin of the information they use? That means knowing who collected it, under what conditions, for what purpose, and whether those circumstances create risks for its current use.

Source matters

To illustrate the point, Strawinska compared data provenance with the farm-to-table approach in food. In both cases, trust depends not only on the end product, but on a clear account of the supply chain behind it.

This carries particular weight in the public sector, where many datasets have built up over time through legacy systems and older processes. Data may later be migrated, standardised and corrected, but those technical improvements do not answer questions about the original basis for its collection or the terms under which it may now be used.

“The important question we need to ask is simple: where did this data actually come from?” Strawinska said.

The issue also extends to compliance and oversight. Provenance, she argued, should not be treated as a narrow technical matter, but as a core part of responsible AI, with direct relevance to data protection obligations and growing scrutiny from regulators and oversight bodies.

Her comments reflect a wider shift in AI governance, especially in government and public services, where organisations are increasingly expected to explain not just what a model does, but the foundations on which it was built. In that context, a data audit trail becomes part of the case for deploying an AI system at all.

Limits of cleaning

Strawinska said standard data quality work has clear value, from removing duplicates to standardising formats, but it cannot solve every problem. Data collected without valid consent, or for a different purpose, does not become appropriate for a new use simply because it has been cleaned and validated.

She used the analogy of food grown in contaminated soil to make that point. A vegetable can be washed and prepared, she said, but still remain unsafe because of what happened at source. The same logic applies to datasets whose origins create legal, ethical or representational problems.

This may be especially acute for public bodies managing information gathered over decades. Some data was collected before current data protection standards took shape, creating a challenge for organisations seeking to apply modern AI methods to old records.

Bias earlier

Strawinska also raised concerns about the point at which bias enters a system. Debate about AI bias often focuses on model outputs and fairness testing, but distortions may begin much earlier, during the collection and assembly of training data.

If a dataset over-represents certain demographics, regions or periods, the resulting model may reflect those imbalances. She cited examples such as systems trained mainly on urban data that may perform poorly in rural settings, or systems built on data from an unusual period of demand that may fail when conditions return to normal.

For public services, she argued, these limitations should be identified before deployment rather than after the fact. Provenance helps organisations assess how representative a dataset really is and where its gaps may lie.
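The kind of representativeness assessment described above can be sketched in a few lines. The function and figures below are purely illustrative (not from Butterfly Data): it flags groups whose share of a training sample deviates from a reference population by more than a chosen tolerance, echoing the urban/rural example.

```python
def representation_gaps(sample_counts, population_shares, tolerance=0.05):
    """Return groups whose share of the sample deviates from the reference
    population by more than `tolerance` (absolute difference in proportions).
    Positive gap = over-represented, negative = under-represented."""
    total = sum(sample_counts.values())
    gaps = {}
    for group, pop_share in population_shares.items():
        sample_share = sample_counts.get(group, 0) / total
        if abs(sample_share - pop_share) > tolerance:
            gaps[group] = round(sample_share - pop_share, 3)
    return gaps

# Hypothetical figures: urban-heavy training data drawn from a
# population that is 60% urban, 40% rural.
gaps = representation_gaps({"urban": 900, "rural": 100},
                           {"urban": 0.6, "rural": 0.4})
print(gaps)  # {'urban': 0.3, 'rural': -0.3}
```

A model trained on such a sample would, as Strawinska warns, likely perform worse in the under-represented rural settings; surfacing the gap before deployment is the point of the check.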

Audit trail

The challenge grows as AI systems become larger and draw on more sources. The more data that is used, Strawinska said, the harder it becomes to maintain a clear account of who handled it, how it changed, and whether those changes introduced distortions.

She argued that organisations that build provenance tracking into projects from the outset will be better placed when they face auditors, oversight committees and public scrutiny. In the public sector, the ability to explain those decisions is closely tied to trust in how AI is used.

“Data provenance – the ability to trace where data came from, who handled it and how it has changed – is often seen as a niche technical topic. It isn’t. It is at the heart of what responsible AI requires,” Strawinska said.
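Strawinska's definition — tracing where data came from, who handled it and how it has changed — maps naturally onto a structured audit trail. The sketch below is a minimal, hypothetical illustration of that idea (the class and field names are ours, not a Butterfly Data product): each dataset carries its origin, its collection basis, and an append-only log of handling steps.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone

@dataclass
class ProvenanceRecord:
    """One entry in a dataset's audit trail: who did what, and why."""
    actor: str      # who handled the data
    action: str     # what changed, e.g. "standardise date formats"
    purpose: str    # why the step was taken
    timestamp: str = field(
        default_factory=lambda: datetime.now(timezone.utc).isoformat())

@dataclass
class Dataset:
    name: str
    origin: str             # where the data actually came from
    collection_basis: str   # purpose and lawful basis at collection time
    trail: list = field(default_factory=list)

    def record(self, actor, action, purpose):
        self.trail.append(ProvenanceRecord(actor, action, purpose))

    def lineage(self):
        """Human-readable account of the dataset's history."""
        lines = [f"{self.name}: from {self.origin} ({self.collection_basis})"]
        lines += [f"- {r.actor}: {r.action} ({r.purpose})" for r in self.trail]
        return "\n".join(lines)

ds = Dataset("service_records", "legacy case-management system",
             "collected 2009-2015 for service delivery")
ds.record("migration team", "standardise date formats", "system migration")
ds.record("data science team", "remove duplicates", "AI training preparation")
print(ds.lineage())
```

The lineage report is exactly what an oversight committee would ask for: not just the current state of the data, but the original basis for its collection and every transformation since.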
