People are starting to compile resolutions for the new year, focusing on evolving their own habits and goals. At SAS, we’ve also looked toward 2026 to gather predictions on how AI in the public sector might evolve over the next 12 months. 

Prediction: By 2026, governments will utilize large language models (LLMs) to generate synthetic, unstructured text data for research, training and testing purposes. In fact, we’re already starting to use them this way.

Why LLMs are well-suited for synthetic text generation

LLMs are ubiquitous now. When I’m gaming online with my buddy from sixth grade (we’re now well into our careers) and he gets stuck, he simply asks an LLM such as ChatGPT for a hint. The ability of LLMs to “understand” and generate human-like text frankly amazes me.

By analyzing and learning from vast collections of text, such as books, websites and articles, LLMs excel at many of the tasks we give them, particularly summarizing and comparing and contrasting. It was straightforward to incorporate an interactive chat component into all of the public sector text analytics demos I’ve built over the past couple of years.

The versatility of LLMs got me thinking about the different creative ways they can be used. My colleague Jerry and I set out to address a challenge that could benefit many of our public sector customers: helping them understand the benefits of synthetic text data.

What synthetic data is – and why governments need it

Synthetic data is algorithmically generated data that mimics real data without containing any information from real-world sources. There are two types: structured (e.g., spreadsheets) and unstructured (e.g., text documents, images or videos). Jerry and I work in the public sector, where a surprising amount of relevant data is publicly available – but not all of the data we need.

For example, some notable gaps include email data, which can be used to demonstrate insider risk capabilities, as well as intelligence data. In health care, doctors’ notes are in short supply. For police narratives, only one public dataset exists – from Dallas – and it lacks accompanying structured data. In these domains, synthetic data can stand in for real, sensitive personal data or classified information in research, analysis, demonstration and model training, particularly where rare events are involved.

Overcoming LLM limitations with hybrid techniques

LLM-generated synthetic data also creates the potential for more diverse datasets, which can help reduce bias. The challenge we encountered was that when asked to create a diverse set of analysis data on its own, the LLM fell short on creativity.

To stimulate that creativity, we combined applicable, publicly available data with randomized, rules-based heuristics, steering the LLMs to “hallucinate” in the direction we wanted.

For example, to simulate an email that involved a component of insider risk, we asked an LLM to create an email by giving it randomized snippets pulled from the infamous Enron dataset, one of which mentioned chairs. The LLM subsequently generated an email story about how an insider leaked a seating chart and how to execute damage control. This was just one of hundreds of examples we reviewed when determining how well this approach worked.
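As a rough illustration of this seeding approach (the function name, snippet text and prompt wording below are hypothetical stand-ins, not our actual pipeline), the randomization step might look like:

```python
import random

def build_seeded_prompt(corpus_snippets, scenario, n_snippets=3, seed=None):
    """Assemble a prompt that steers an LLM toward a target scenario by
    seeding it with randomized snippets drawn from a real public corpus."""
    rng = random.Random(seed)
    snippets = rng.sample(corpus_snippets, k=min(n_snippets, len(corpus_snippets)))
    context = "\n---\n".join(snippets)
    return (
        f"Using the style and incidental details of the excerpts below, "
        f"write a fictional workplace email that illustrates: {scenario}.\n\n"
        f"Excerpts:\n{context}"
    )

# Hypothetical snippets standing in for lines pulled from the Enron corpus.
snippets = [
    "Please confirm the chairs for the 3rd floor conference room.",
    "Forwarding the Q3 trading summary per your request.",
    "The seating chart for the offsite is attached.",
]
prompt = build_seeded_prompt(snippets, "an insider leaking internal documents", seed=42)
print(prompt)
```

The assembled prompt would then be sent to an LLM of choice; varying the random seed and snippet pool is what pushes each generated email in a different, yet plausible, direction.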

Building realistic insider risk scenarios with synthetic text

To make our insider risk solution demos more realistic, we generated synthetic text data to simulate workplace email communications. Our solution relies heavily on linguistics-based text analytics to detect insider risks such as workplace violence, suicide and espionage. However, our technical demonstrations lacked realistic text data as the examples available in the public domain were insufficient.

By utilizing LLMs to generate synthetic text, we were able to produce domain-specific examples without disclosing sensitive or proprietary information. This approach accelerates experimentation by reducing reliance on data collection, enabling us to showcase complex scenarios and rare event types that are difficult to capture in actual datasets.

Why governments will ramp up synthetic text generation in 2026

Our early experiments suggest that by 2026, governments will actively generate large-scale synthetic text datasets that replicate the complexity of confidential documents without compromising privacy or security, and will use them to support research, training and testing. Soundness of the synthetic data can be assured by guardrails. Fortunately, at SAS, we have those in the form of advanced text analytics capabilities that have been in use for years, so we aren’t simply relying on one LLM to validate the output of another – an increasingly expensive cycle to get involved in.
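To make the guardrail idea concrete, here is a minimal, generic sketch of a rules-based screen that flags synthetic output containing patterns resembling real personal data. This is purely illustrative and is not SAS software; the pattern names and sample texts are invented for the example.

```python
import re

# Illustrative guardrail: screen generated text for patterns that suggest
# real personal data leaked into the synthetic output.
PII_PATTERNS = {
    "ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "email": re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.]+\b"),
    "phone": re.compile(r"\b\d{3}[-.]\d{3}[-.]\d{4}\b"),
}

def guardrail_flags(text):
    """Return the names of PII-like patterns found in a synthetic document."""
    return [name for name, pat in PII_PATTERNS.items() if pat.search(text)]

clean = "The analyst forwarded the seating chart to an unknown recipient."
risky = "Contact John at john.doe@enron.com or 713-555-0142."
print(guardrail_flags(clean))   # []
print(guardrail_flags(risky))   # ['email', 'phone']
```

A production guardrail would go well beyond regular expressions – linguistic rules, entity extraction and statistical checks – but the principle is the same: documents that trip a flag are rejected or regenerated before they enter a training or testing set.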

With these guardrails in place, LLMs will be pivotal in creating synthetic unstructured datasets, aligning with predictions that by 2030, synthetic data may surpass real data in AI development, especially for applications requiring unstructured text.

The benefits: Privacy protection, better training and faster innovation

LLM-generated synthetic unstructured text offers a variety of benefits for the public sector. Most importantly, synthetic data would enable government agencies to develop and train AI models without risking the exposure of sensitive personal information, supporting regulatory compliance and reducing privacy risks. For example, it could help train both humans and AI models to accurately interpret medical notes by generating large volumes of realistic but fictitious notes to practice on. Agencies could use fake incident reports, cables, adverse event records, emails, call transcripts and legal contracts to train and test everything from disaster and emergency response protocols to fraud, waste and abuse detection.

Synthetic data accelerates digital transformation by enabling secure data sharing, breaking down silos, and fostering collaboration across agencies.

Looking ahead

LLM-generated synthetic unstructured text data isn’t just a workaround; it’s set to fundamentally reshape research, training and testing across industries in 2026. With the proper analytic guardrails in place, organizations will be able to generate realistic, scalable and privacy-preserving unstructured synthetic data. However, success will depend on continued investment in data quality, governance and collaboration between humans and AI. As noted above, we succeed when we bring our own creativity to solving AI-related challenges with AI-enabled solutions.



