This blog was co-written with Sundaresh Sankaran.
The Artificial Intelligence (AI) era is here. To prevent harm, ensure proper governance, and secure data, we need to be able to trust our AI output. We must demonstrate that it operates fairly, responsibly, and efficiently. As builders of AI, you need to not only trust your AI but also convince others to trust it, including:
- AI regulators: Various agencies propose, adopt, and enforce a growing number of AI standards and regulations across global markets. Penalties for non-compliance can run into millions of dollars.
- Consumers and end users: The ability to transparently explain how your AI solutions are used and maintained not only informs and educates your consumers, it shows them you care. It also projects confidence and capability. If your AI is transparent and trustworthy, AI governance becomes a competitive advantage.
- Employees: Top talent demands governed and trustworthy AI. No one wants to build tooling that hurts others.
Data is central to AI, so the first step in building trust in your AI and AI Governance is to establish trust in your data. Organizations face privacy, compliance, and data robustness issues in trusting their data.
Organizations have a duty to protect private and sensitive information in their custody. This includes Personally Identifiable Information (PII): any data, or combination of data values, that can be used to identify a specific individual, either directly or indirectly. PII includes names, email addresses, Social Security numbers, and physical addresses, as well as sensitive data such as health records and financial information. Failure to protect PII can lead to fines and loss of trust.
Unfortunately, you can’t simply remove all personal characteristics from your data. Interesting relationships between individual characteristics and outcomes might add value to your analysis. For example, in health care, symptoms of a heart attack differ between men and women. Removing sex or gender from your data hides a key insight that saves lives.
To retain this information, organizations often anonymize or mask identifying characteristics. But these practices can be weak forms of anonymization, because a combination of traits like sex, race, ZIP code, and age, when used together, can still identify an individual.
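This re-identification risk can be illustrated with a simple group-size count in the spirit of k-anonymity: any record whose combination of quasi-identifiers is unique in the data set can potentially be traced back to one person. A minimal sketch, using made-up records and field names:

```python
from collections import Counter

def risky_records(records, quasi_identifiers):
    """Return records whose quasi-identifier combination is unique
    in the data set (group size 1), i.e. potentially re-identifiable."""
    keys = [tuple(r[q] for q in quasi_identifiers) for r in records]
    counts = Counter(keys)
    return [r for r, k in zip(records, keys) if counts[k] == 1]

# Toy "anonymized" data: names removed, but quasi-identifiers remain.
people = [
    {"sex": "F", "zip": "27513", "age": 34},
    {"sex": "F", "zip": "27513", "age": 34},
    {"sex": "M", "zip": "27513", "age": 61},  # unique combination
]

print(risky_records(people, ["sex", "zip", "age"]))
```

The third record is the only one with its combination of sex, ZIP code, and age, so it is flagged; stronger anonymization would generalize or suppress values until every group reaches a safe size.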
A machine learning model generalizes better, and is therefore more accurate, when its training data represents the wider population. Accuracy is important because incorrect predictions by your model can result in tangible costs, including lost revenue, increased operational expenses, or adverse impacts on users.
You need data for testing in addition to training a model. Test data should enable rigorous tests of a trained model against edge cases and potential scenarios so that you can adjust your model accordingly. Robust test data should include previously unseen values, extreme values, and data quality issues that can be expected in real data, such as missing values. For example, when you buy a car, you hope the auto maker tested that model in all conditions—snow, rain, heat, sharp turns, sudden stops, city streets, and highways—to ensure reliability. Similarly, you need robust test data to rigorously stress your models.
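One simple way to build such a stress-test set is to derive it from clean data by deliberately injecting missing and extreme values. A minimal sketch, with made-up rows, field names, and injection rates:

```python
import random

def stress_test_rows(rows, missing_rate=0.2, extreme_factor=100):
    """Derive a stress-test copy of tabular rows by randomly blanking
    out values and inflating some numeric values to extremes."""
    random.seed(42)  # fixed seed so the illustration is reproducible
    out = []
    for row in rows:
        new = dict(row)
        for key, value in row.items():
            r = random.random()
            if r < missing_rate:
                new[key] = None                    # inject a missing value
            elif isinstance(value, (int, float)) and r < missing_rate + 0.1:
                new[key] = value * extreme_factor  # inject an extreme value
        out.append(new)
    return out

clean = [{"income": 52000, "age": 41}, {"income": 70500, "age": 29}]
stressed = stress_test_rows(clean)
print(stressed)  # some fields become None or unusually large
```

A model that handles the stressed copy gracefully is far more likely to survive the messy data it will meet in production.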
Synthetic data can mimic patterns found in real-world data while reducing privacy concerns by minimizing data leakage risks. This reduces the risk of penalties or loss of trust while enabling model development and research. It is also helpful when representative or comprehensive real data isn’t available, because it’s fast and easy to generate what you need.
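At its simplest, synthetic generation fits a statistical model to real data and then samples brand-new records from it. The deliberately naive sketch below fits a normal distribution to one numeric column (production tools use far richer models that capture correlations across columns); the column values are made up for illustration:

```python
import random
import statistics

def fit_and_sample(column_values, n):
    """Naive synthetic generator: fit a normal distribution to one
    numeric column, then sample n new, artificial values from it."""
    mu = statistics.mean(column_values)
    sigma = statistics.stdev(column_values)
    random.seed(0)  # fixed seed so the illustration is reproducible
    return [random.gauss(mu, sigma) for _ in range(n)]

real_incomes = [48000, 52000, 61000, 58000, 70500, 45500]
synthetic = fit_and_sample(real_incomes, 4)
print(synthetic)  # new values following the same rough distribution
```

None of the sampled values belongs to a real person, yet they follow the same overall distribution, which is the core trade-off synthetic data offers: statistical usefulness without direct disclosure.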
Using a representative training data set to build your model or AI system means more accuracy in production. Using a robust testing data set consisting of edge cases, extreme values, and unseen conditions can help you understand and address where your model or AI fails before it’s used to make business decisions. Synthetic data can be used to generate cost-effective data sets that protect sensitive information yet still mimic real-world patterns.
To help organizations and data science teams generate high-quality synthetic data for any scenario, SAS recently released SAS Data Maker. SAS Data Maker is a low-code/no-code tool for generating high-quality synthetic data that mirrors real-world data sets. It lets you augment existing data or create entirely new data sets, reducing the cost of data acquisition, protecting sensitive information, and accelerating AI and analytics development. SAS Data Maker is available globally in the Microsoft Marketplace with consumption-based pricing, so teams can start generating trusted, representative, and production-ready data fast. To learn more about SAS Data Maker, check out this short demo.
In short, synthetically generated data can make your data more representative and robust while also protecting it, which leads to improved security and more accurate modeling of your target population. Better security and better models mean better AI governance. This all means stakeholders and users can trust your AI systems.