Key Learnings for building trust in data and AI

Digital transformation is a hot topic, and it has been one since I started at Neal Analytics close to ten years ago. During this time, I have observed that one of the biggest hurdles business leaders face with their digital transformation initiatives is making data part of their organization’s daily processes. The challenge lies in answering the question: Is this data something I can trust to help me make the decisions I need to make?

When a company is spending millions of dollars recruiting talent, maintaining IT infrastructure, and migrating its data and apps to the cloud, the expectation is that the data will be there when you need it and that it will be accurate and timely. Unfortunately, many end users end up skeptical of the data anyway, often over concerns like stale refreshes, odd values, or privacy.

Those investments to collect business data, process it at scale, and feed business intelligence or advanced analytics are meaningless if the resulting insights aren’t operationalized in a way that can positively and meaningfully impact how you run your business. If the data doesn’t change what you’re doing or going to do, it’s worthless… nothing more than a pat on the back.

Therefore, a key success factor for those transformation initiatives is to ensure as much value as possible can be extracted from this data.

At Neal, we have identified the two high-level elements paramount to extracting that business value:

  • Select a data architecture that enables users to discover and understand data.
  • Ensure users trust the data by sharing its quality state in real time.

In a nutshell, to succeed, we must build an end-to-end data strategy that removes (or at least minimizes) any hesitation or uncertainty about data relevance and accuracy. Collecting and storing data in a vacuum isn’t worthwhile; the investment only pays off if you are also investing to keep that data visible, understood, and accurate.

The data catalog

A data catalog (a concept closely related to a data dictionary or master data) is the documentation and metadata of your data.

This information helps remove the mystery around your data. At its simplest, it tells you what a field inside a data set means, but it can go well beyond the metadata or telemetry that a tool like SQL Server exposes for reporting.

For instance, this catalog could record (see the sketch after this list):

  • The last updated date
  • Whether the data is restricted or not
  • If the data contains PII
  • Who wrote the calculated field to create this column of data
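
To make this concrete, here is a minimal sketch of what a single catalog entry could look like, expressed as a Python record. The field names, types, and values are illustrative assumptions, not a standard catalog schema:

    # A minimal sketch of one catalog entry, expressed as a Python record.
    # All field names, types, and values are illustrative assumptions,
    # not a standard catalog schema.
    from dataclasses import dataclass
    from datetime import date
    from typing import Optional

    @dataclass
    class CatalogEntry:
        dataset: str              # data set the field belongs to
        column: str               # the field being documented
        description: str          # tribal knowledge: what the field means in business terms
        last_updated: date        # telemetry: when the data was last refreshed
        restricted: bool          # whether access to the data is restricted
        contains_pii: bool        # whether the data contains PII
        calculation_author: Optional[str] = None  # who wrote the calculated field, if any

    entry = CatalogEntry(
        dataset="sales_daily",
        column="net_revenue",
        description="Gross revenue minus returns and discounts, in USD.",
        last_updated=date(2021, 6, 1),
        restricted=True,
        contains_pii=False,
        calculation_author="jane.doe",
    )

In practice, records like this would live in a catalog tool or database, with the telemetry fields populated automatically and the business-context fields maintained by people.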

Therefore, we must collect telemetry and metadata from the tooling, plus tribal (tacit, often undocumented) knowledge: the human-based inputs that provide business context you cannot get from any tooling or software. In essence, we must fully document the data.

One crucial element to highlight is that a data catalog remains valuable only if you keep it maintained. The only useful data catalogs are treated as living documents: an ongoing practice, not a set-and-forget artifact, often owned by dedicated roles such as Data Stewards.

Similarly, any ML model derived from this data will need to be regularly monitored and retrained so that model drift is caught and corrected before it erodes results, especially in times of crisis when underlying patterns shift quickly. To achieve this, you need to implement a robust MLOps process. One key part of that process relevant to this topic is surfacing the current model state and performance to end users. Just as with surfacing data status, the goal is to avoid the age-old pushback of “I wasn’t sure if I could trust the prediction/data of the model/report, so I just went with my gut.”
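
As one hedged illustration of what “surfacing the model state” could mean, the sketch below translates a drop in a model’s recent performance against its baseline into a simple Red/Yellow/Green state. The thresholds, the score used, and the function name are assumptions for illustration, not a prescription:

    # A minimal sketch of one MLOps check: translate a drop in a model's
    # recent performance (versus its baseline) into a Red/Yellow/Green state.
    # The thresholds and the validation-style score are illustrative assumptions.
    def model_status(baseline_score: float, recent_score: float,
                     yellow_drop: float = 0.05, red_drop: float = 0.15) -> str:
        """Map relative performance degradation to a simple traffic-light state."""
        drop = (baseline_score - recent_score) / baseline_score
        if drop >= red_drop:
            return "Red"     # drifted badly: do not trust predictions, retrain now
        if drop >= yellow_drop:
            return "Yellow"  # degrading: usable with judgment, schedule retraining
        return "Green"       # performing as expected

    # e.g., a forecast model whose validation score slipped from 0.90 to 0.82
    print(model_status(baseline_score=0.90, recent_score=0.82))  # -> Yellow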

Ensure users trust the data 

Whether it’s a predictive model or BI/reporting data, you still need to give end users a level of confidence in the source data. Otherwise, the insights and potential actions derived from the data will be disregarded; or worse, users will use them blindly and shirk responsibility. 

To improve trust and, therefore, adoption, three key things must happen: 

  • Proactively address objections 
  • Involve users early on 
  • Instrument your data

First, you should proactively address the end user’s likely objections, such as data timeliness, outliers or odd values, and privacy issues. These objections can be addressed by providing data or context that lets users know you have already evaluated them and are giving the OK. A data catalog is one way to do this, as are model and dataset portfolio views that pull live status data from tooling like Azure Data Factory, DevOps, or MLflow to show users a simple Red/Yellow/Green on whether a dataset or model is safe to use. Neal Analytics has built systems like this; that’s a topic for another day, but enquire if you’re interested in learning more.
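
To illustrate the idea (not any particular product), here is a minimal sketch of such a status check based purely on data freshness. In a real portfolio view, the last-refresh timestamp would come from the pipeline tooling’s run history; here it is passed in directly, and the thresholds are assumptions:

    # A minimal sketch of a freshness-based dataset status. In a real portfolio
    # view, last_refresh would come from pipeline tooling (e.g., run history);
    # here it is passed in directly, and the thresholds are assumptions.
    from datetime import datetime, timedelta

    def dataset_status(last_refresh: datetime, now: datetime,
                       expected_interval: timedelta) -> str:
        """Green if refreshed on schedule, Yellow if late, Red if stale."""
        age = now - last_refresh
        if age <= expected_interval:
            return "Green"
        if age <= 2 * expected_interval:
            return "Yellow"
        return "Red"

    # A daily feed that last refreshed 27 hours ago is flagged Yellow.
    print(dataset_status(last_refresh=datetime(2021, 6, 1, 6, 0),
                         now=datetime(2021, 6, 2, 9, 0),
                         expected_interval=timedelta(days=1)))  # -> Yellow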

In our experience, adoption issues with clients are mostly driven by a lack of trust or ownership of the result (or both!). Therefore, you should involve end users early in the project to increase their sense of ownership. You should also require strong accountability from the project team for the end model, dashboard, or dataset, to demonstrate that it was built correctly and can be trusted.

Finally, you must instrument your end-to-end solution, from the raw data to the ML models, so you can reliably monitor the health of the overall solution pipeline. Users should be able to see whether, under current conditions, the solution is as trustworthy and precise as it should be, or whether they need to apply their own judgment before acting on the insights presented. As mentioned earlier, Red/Yellow/Green KPIs indicating the data or model state are a simple option; alternatively, it may be more appropriate to let end users in on the details, such as last refresh, known errors, or model performance metrics. Only when end users are convinced that the data in front of them is trustworthy will they be willing to trust the end results, be it a BI dashboard, an ML-powered forecast, or any other data-backed solution.
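
One simple way to picture this end-to-end instrumentation is to roll the per-component states up into a single pipeline health KPI by taking the worst status among the parts. This sketch is an assumption about how such a roll-up could work, and the component names are hypothetical:

    # A minimal sketch of rolling per-component states up into one end-to-end
    # pipeline health KPI: overall health is the worst of the parts.
    # The component names are hypothetical.
    SEVERITY = {"Green": 0, "Yellow": 1, "Red": 2}

    def pipeline_health(component_states: dict) -> str:
        """Return the worst Red/Yellow/Green status across all components."""
        return max(component_states.values(), key=SEVERITY.__getitem__)

    states = {
        "raw_data_ingest": "Green",
        "feature_pipeline": "Green",
        "forecast_model": "Yellow",  # e.g., a degrading performance metric
    }
    print(pipeline_health(states))   # -> Yellow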

Conclusion 

User trust in data and ML-based business processes is certainly possible, but it should not be expected without a concerted effort. With the right tools, methodology, and ongoing management processes, it is achievable without draconian tactics like forcing use or tying compliance to compensation. We have successfully deployed hundreds of data-powered solutions, from simple business dashboards to advanced deep-learning models, and through this approach we ensure our customers extract authentic business value. There’s nothing worse than investing in your data and getting nothing out of it but meaningless platitudes.