Managing AI risk through data and AI governance
Alex Lipinski
Enterprise GenAI success depends on layers of governance and accountability, from the upstream raw inputs of secure, clean, and trusted data to the downstream outputs of an unbiased, accountable, and guarded model.
Data governance is the set of standards and controls that determine whether, and which, data can be trusted and used. Under a good data governance framework, the entire data lifecycle is observed and controlled, from creation through retention to deletion, ensuring that data is accurate, secure, and private.
AI governance concerns how a language model interacts with data and produces outputs. It addresses the accuracy, trainability, and ethical accountability of model responses, as well as the dynamic role models play in filtering prompts and limiting sensitive outputs.
Your enterprise AI needs both...
Takeaways
- Data and AI governance share similar concerns, such as the accuracy and availability of handled information, but operate at different stages of an end-to-end AI workflow.
- The work of governance does not stop upstream with data inputs.
- Data enrichment is essential for GenAI to make accurate use of data through context.
- Input and output guardrails are an essential component of AI governance that ensure sensitive information is clipped out of prompts and answers.
Data needs governance.
AI needs data.
Why does governance matter for AI and the data pipeline?
Data and AI governance occur throughout the data pipeline, at ingestion, transformation, and consumption. And failure in one tends to cause failure in the other.
Poor data controls and quality governance are a major catalyst for the garbage-in, garbage-out behavior of AI that we’ve harped on so much, or for LLMs leaking sensitive information in responses, and that is a serious danger to an organization. If the movement, transformation, and consumption of data is weak, if access rules are inconsistent, or if content is formatted for AI consumption without validation, GenAI will compound mistakes at great scale.
In the age of AI, data governance failure leads to massive compliance problems, data silos, and poor data quality, which in turn degrade model performance. Much of the root of AI hallucination and bias lies in poor initial data management practices.
What does good data governance look like for AI?
As the fuel of the AI present and future, data has become immensely valuable and extremely powerful. As such, data governance has shifted from an IT-department-only problem to an all-hands-on-deck enterprise initiative.
A document processing pipeline that starts with AI-enabled extraction or intelligent document processing (IDP) improves data governance by extracting, indexing, metadata tagging, and validating content at the earliest stage of the data lifecycle. 100% accurate extraction, classification, and tagging are absolutely necessary to ensure that content later served by an LLM is free of sensitive information and of the hallucination triggers that stem from contradictory information, outdated information, or data gaps in a model’s training set.
Furthermore, modern IDP platforms enrich data through semantic layering, an essential step that helps a downstream GenAI model interpret the data correctly.
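As an illustration, early-stage tagging and validation might look like the sketch below. All names here (`ExtractedRecord`, `validate_and_tag`, the PII markers) are hypothetical, not any specific IDP product’s API:

```python
from dataclasses import dataclass, field

@dataclass
class ExtractedRecord:
    """A unit of content produced by document extraction (hypothetical schema)."""
    text: str
    source: str
    tags: dict = field(default_factory=dict)

# Illustrative sensitive-field markers; a real pipeline would use trained classifiers
PII_MARKERS = ("ssn", "credit card", "date of birth")

def validate_and_tag(record: ExtractedRecord) -> ExtractedRecord:
    """Tag content at ingestion and block sensitive fields before they
    reach an LLM-facing index."""
    lowered = record.text.lower()
    record.tags["contains_pii"] = any(m in lowered for m in PII_MARKERS)
    record.tags["source"] = record.source
    if record.tags["contains_pii"]:
        raise ValueError(f"Sensitive content in {record.source}; route to review")
    return record
```

The point is where the check happens: content is classified and gated at extraction time, not after it has already been indexed for retrieval.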
Why is a semantic layer important?
Here is an example of why semantic layering, or data enrichment, is highly beneficial to improving AI results:
Imagine: You’re a data analyst for Netflix. You prompt the enterprise model (Claude, ChatGPT, etc.) connected to your corporate applications for a summary of the current lost customer rate.
Without semantic layering, the model might return numbers for all cancelled subscriptions, account deletions, and expired free trials. You report the response, an alarmingly high number, to the board. Chaos ensues.
But cancelled subscriptions and account deletions aren’t mutually exclusive. And free trial users were never paying customers.
With a semantic layer, data is extracted from reports and enriched with a tag, e.g., “lost customer,” that is applied only when the loss relates to a paid cancellation. Enriched data provides greater accuracy and clarity through much-needed context.
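A minimal sketch of that enrichment rule, assuming churn events have already been extracted into records (the field names and events are invented for illustration):

```python
# Hypothetical churn events extracted from subscription reports
events = [
    {"account": "a1", "type": "paid_cancellation"},
    {"account": "a2", "type": "account_deletion"},  # free account, never paid
    {"account": "a3", "type": "trial_expired"},     # never a paying customer
]

def enrich(event: dict) -> dict:
    """Semantic-layer rule: the 'lost customer' tag applies only to paid cancellations."""
    event = dict(event)
    event["lost_customer"] = event["type"] == "paid_cancellation"
    return event

enriched = [enrich(e) for e in events]
lost_customers = sum(e["lost_customer"] for e in enriched)  # 1, not 3
```

A model querying the enriched records now counts one lost customer instead of three churn-like events, and the board meeting stays calm.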
More best practices for data governance
- Role/attribute/purpose-based access controls that gate information from unauthorized users.
- Replacing sensitive data with non-sensitive tokens.
- Audit trails detailing time of data creation, interaction, and ownership.
- Automated data disposal and retention plans.
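The tokenization practice above can be sketched as follows. This is a toy in-memory vault for illustration; a production system would back it with a secured token store gated by the same access controls:

```python
import secrets

_vault: dict[str, str] = {}  # token -> original value (toy stand-in for a secure store)

def tokenize(value: str) -> str:
    """Replace a sensitive value with a random, non-sensitive token."""
    token = f"tok_{secrets.token_hex(8)}"
    _vault[token] = value
    return token

def detokenize(token: str) -> str:
    """Recover the original value; access should be gated by role-based controls."""
    return _vault[token]
```

Downstream systems, including an LLM index, only ever see the token, so a leaked response exposes nothing sensitive.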
Why is AI governance important?
If data governance is about telling a model what is true, AI governance is about telling it how to explore data safely. AI governance improves AI results and safety by setting guardrails and by governing who interacts with and trains an LLM, and how.
Your AI guardrails are the validation and control layers that sit between an end user and a model, enforcing behavior and policy on every prompt. Input guardrails use filters and classifiers to screen prompts for sensitive data or malicious instructions, blocking or redacting content that violates policy. Output guardrails do the same on the response side, triggering when the model pulls and attempts to serve sensitive or misaligned content, even if it was never prompted to do so. If data inaccuracies do bleed into your model, AI governance protocols are in place to raise red flags, deny the model access to misaligned or sensitive data, or prevent retraining on faulty information.
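A minimal sketch of that input/output pairing, assuming a simple regex-based policy (real guardrails would use trained classifiers and far broader PII and injection patterns):

```python
import re

# Illustrative policy: one PII pattern and one blocked-instruction phrase
SSN_PATTERN = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")
BLOCKED_PHRASES = ("ignore previous instructions",)

def input_guardrail(prompt: str) -> str:
    """Screen the prompt: block malicious instructions, redact sensitive data."""
    if any(phrase in prompt.lower() for phrase in BLOCKED_PHRASES):
        raise PermissionError("Prompt violates policy")
    return SSN_PATTERN.sub("[REDACTED]", prompt)

def output_guardrail(response: str) -> str:
    """Redact sensitive data the model tries to serve, even unprompted."""
    return SSN_PATTERN.sub("[REDACTED]", response)
```

Both checks run on every exchange, so sensitive content is caught whether it arrives in the prompt or surfaces in the answer.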
More contributors to AI governance success
- Actively trying to break the model to identify bias and faulty data (i.e., red-teaming and evasion testing).
- Instilling policy-as-code to automate continuous checks against data.
- Real-time checks on incoming data that looks significantly different from the training baseline, to prevent model drift.
- Establishing a governance committee that limits model tweaking to a dedicated group and tests observed outputs.
- Human-in-the-loop (HITL) review for 100% oversight of AI outputs in high-stakes situations.
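The drift check in the list above could be sketched with a simple z-score comparison. This is a stand-in for production techniques such as PSI or Kolmogorov-Smirnov tests, and the threshold is an assumption:

```python
from statistics import mean, stdev

def drift_alert(baseline: list[float], incoming: list[float],
                z_threshold: float = 3.0) -> bool:
    """Flag an incoming batch whose mean departs significantly from the baseline."""
    mu, sigma = mean(baseline), stdev(baseline)
    if sigma == 0:
        return mean(incoming) != mu  # any shift from a constant baseline is drift
    z_score = abs(mean(incoming) - mu) / sigma
    return z_score > z_threshold
```

When the alert fires, the governance response is to hold the batch for review rather than let it silently reshape the model.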
Governance accompanies AI adoption
Data governance and AI governance are not separate problems. Poor data classification and weak controls feed broken AI outputs, and weak AI governance means those broken outputs go unchecked, a massive risk. In a recent study on data breaches, IBM found that 87% of surveyed organizations reported having no governance policies or processes to mitigate AI risk. The speed of adoption and the desire to do things better/faster/stronger cannot outpace setting the standards for how to do it right. Manage risk with clean capture, accurate extraction, enrichment, and monitored AI training and guardrails.