The world of data and AI is evolving at breakneck speed, and 2025 is shaping up to be a year of breakthroughs and significant challenges.

From AI model hallucinations to the role of synthetic data in innovation, industry leaders are grappling with complex issues that will shape the future of technology.

I recently discussed these issues with a few experts who had a lot to say about the latest tech trends and asked them to share their thoughts:

  • Jorge Silva, a modeling expert: Jorge is the guy you want to talk to about AI models – from cutting-edge training techniques to the next big breakthroughs in machine learning.
  • Josh Griffin, a coding guru: Josh is your guy if it involves code. He discusses the latest programming trends, tools, and languages reshaping the tech world.
  • Harry Keen, a synthetic data visionary: Harry’s pushing boundaries with synthetic data. As you’ll read, Keen breaks down how it’s changing AI training, boosting privacy, and speeding up data-driven innovation.

Whether you’re a professional in technology or just curious about where things are headed, settle in for a deep dive into 2025 tech trends.

Foundation models are often criticized for hallucination. Do you believe their perceived importance will decline as this issue becomes more prominent?

Jorge Silva, Advanced Analytics R&D, SAS: Hallucination in large language models (LLMs) and other foundation models refers to generating content that is factually wrong and/or fabricated and not grounded in reality. While it has recently come to prominence in LLMs, hallucination can occur even in far simpler machine learning and statistical models. When queried with test data outside of the bounds covered by its training data (the “support” of the model), any predictive model can yield wildly incorrect predictions. This is known as extrapolation.
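Extrapolation is easy to demonstrate. The following toy sketch (illustrative only, not from SAS; it uses NumPy) fits a polynomial regression to data on [0, 1] and then queries it far outside that training support:

```python
import numpy as np

# Train on noisy samples of a smooth function over [0, 1] (the model's "support")
rng = np.random.default_rng(0)
x_train = np.linspace(0.0, 1.0, 50)
y_train = np.sin(2 * np.pi * x_train) + rng.normal(0.0, 0.05, x_train.size)

# A degree-9 polynomial fits the training region well...
coeffs = np.polyfit(x_train, y_train, deg=9)

# ...but queried outside [0, 1], the prediction diverges wildly
in_support = np.polyval(coeffs, 0.5)   # inside the training range
outside = np.polyval(coeffs, 3.0)      # far outside it

print(f"prediction at x=0.5: {in_support:.3f}")
print(f"prediction at x=3.0: {outside:.3e}")
```

Inside the support, the prediction tracks the true function closely; outside it, the polynomial's high-order terms dominate and the output is meaningless, which is the tabular analogue of an LLM confidently answering a prompt far from anything it was trained on.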

For this reason, claims of “zero-hallucination” AI should be taken with skepticism. With a sufficiently anomalous prompt and without stringent external constraints such as prompt preprocessing and response postprocessing, any neural-based foundation model will hallucinate to some degree. Moreover, imposing overly strict guardrails can neuter the model to the point where the responses become bland and uninteresting (e.g., overly relying on “I don’t know” responses).

It should be noted that hallucination is not always harmful. In certain applications, such as artistic imagery generation and drug discovery, hallucinations can be harnessed to produce entirely novel breakthroughs. In this sense, they are a mechanism for creativity, much as they are in humans. Check out this New York Times article for a more in-depth discussion.

Hallucinations cannot be entirely removed, but they can often be detected and mitigated. Some foundation models, such as diffusion models, allow hallucinations to be detected by monitoring the variance of the final steps in the inference process. For a deeper dive, read about techniques such as mode interpolation, which can remove as much as 95% of hallucinations in such models.


Will advancements in fine-tuning and architecture mitigate these concerns?

Silva: Interestingly, it has been shown that fine-tuning by itself is not always sufficient to reduce hallucinations and, in some cases, can even exacerbate them. On the other hand, strategic fine-tuning combined with in-context learning and Retrieval Augmented Generation (RAG) can be more effective, though sometimes rigid.
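The core mechanic of RAG is simple: retrieve passages relevant to the query and prepend them to the prompt so the model answers from evidence rather than memory. Here is a minimal, self-contained sketch (illustrative only; the bag-of-words "embedding" and the tiny corpus are stand-ins for the dense vector stores and document collections real systems use):

```python
from collections import Counter
import math

# Tiny corpus of grounding documents (hypothetical examples)
DOCS = [
    "SAS Viya supports Python and R integration.",
    "Retrieval Augmented Generation grounds answers in retrieved text.",
    "Diffusion models generate images by iterative denoising.",
]

def embed(text):
    # Stand-in embedding: bag-of-words counts (real systems use dense vectors)
    return Counter(text.lower().split())

def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def retrieve(query, k=1):
    q = embed(query)
    return sorted(DOCS, key=lambda d: cosine(q, embed(d)), reverse=True)[:k]

def rag_prompt(query):
    # Retrieved passages are prepended so the model answers from evidence,
    # not from its parameters -- the core idea behind grounding responses
    context = "\n".join(retrieve(query))
    return f"Context:\n{context}\n\nQuestion: {query}\nAnswer using only the context."

print(rag_prompt("What does retrieval augmented generation do?"))
```

The assembled prompt would then be sent to the LLM; the "sometimes rigid" behavior Silva mentions comes from the model being constrained to whatever the retriever happened to surface.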

Far more interesting and promising is the application of rules-based Reinforcement Learning to enforce reasoning in foundation models, as evidenced by the spectacular success of recent models such as DeepSeek-R1. In short, it is fair to say that hallucinations are not hampering the growing importance of AI and foundation models – far from it.

Some argue that SAS performance relies heavily on individual programmers’ coding skills. How can we ensure performance claims reflect the platform’s capabilities, not developer expertise?

Josh Griffin, Advanced Analytics R&D, SAS: In my experience, in most cases the above statement is true on the surface, with one significant caveat. Without this caveat, I would agree that, as with creative genius, the ability to write performant code relies on strategically hiring rare unicorn-type individuals with a keen understanding of mathematics, software, modern hardware, and the targeted analytics being developed. Indeed, this was the consensus we seemed to be converging toward at SAS in 2018, as the number of customer-reported performance issues relative to best-of-breed open-source tools mounted.

Before I explain the caveat, I would like to provide a real-world analogy that might help explain why the above is both true and not true, held in a superposition state, to borrow from quantum terminology. When I was fifteen, I had a few close calls on my paper route and decided to spend a portion of my profits on karate lessons. After three years, I proudly received my black belt.

As they say, a proud look goes before a fall, and mere weeks later, I was jumped on my way to my dishwashing job and beaten soundly, ending with bits of teeth like gravel swimming in my mouth. I was horrified to have spent so much money, time, and sweat in an endeavor that obviously didn’t work.

From that point on, I believed nature always beats nurture. I lost faith that any amount of training might allow David to defeat Goliath in the real world. After that, I saw martial arts as a form of dance, which I still loved; I was no longer under the illusion that it was practical at all.

Five years later, in graduate school, I signed up for Brazilian jiu-jitsu (BJJ) for fun. In my karate school, we practiced choreographed moves that were too deadly to use on each other. In BJJ, most moves are simple submissions that can be applied endlessly, in a friendly environment, at near-full effort. We would spend the first hour doing the choreographed “dance” training I was familiar with from karate (my partner does A, I do B, they do C, and … D checkmate, I win).

However, the hours that followed were all-out war, where we wrestled each other for real. To my shock, nothing we had just learned worked on the first day. Pulling off any of the submission moves would take much practice, even when one’s opponent was actively helping you.

And I had an epiphany. Theory and intuition soon become a house of cards unless each building block has been thoroughly tested in comprehensive real-life situations. The flip side also seemed true: daily, rapid experimentation will ultimately guide one toward rock-solid theory and intuition. A truth is not a truth unless it can be forgotten and found repeatedly in multiple contexts.

The style of learning that has made BJJ predominant in the MMA world can be generalized to all endeavors. Indeed, this concept exists in many popularized modalities, from John Boyd’s Observe, Orient, Decide and Act (OODA) loop to the Shewhart Plan, Do, Check, Act (PDCA) cycle to the scientific method itself. All advocate rapidly interweaving experimentation with proposed theory, hypotheses, and intuition.

Griffin: In response to customer concerns in 2018, SAS pioneered a system of performance development that, very much like BJJ, OODA and PDCA, helps teams rapidly improve targeted software in a systematic, finite-time way, where practicing this protocol over time will reliably beat nature. I believe this development system can be mathematically proven (in an if-and-only-if sort of way) to monotonically and rapidly improve the performance of any analytic to which it is applied until it is state-of-the-art.

Further, rigorous application of this system has the second-order effect of creating unicorn individuals in-house, much as BJJ schools tend to produce great fighters. It would not behoove us to describe in detail how it works; however, the results of applying this system can be seen in the growing frequency of postings where some of the same routines that caused customers consternation back in 2018 are now the ones SAS beats its metaphorical chest about.

The beauty of this is that SAS can respond to customers’ concerns with fast action applied directly to the product that concerns them most, at the deepest of levels, because SAS (unlike many competitors) owns and understands a code base it has carefully grown and tested for multiple decades. More recent competitors have not had time to do so and must build on third-party software that they are unlikely to know, understand, or be able to extend similarly.

So, in closing, to revisit the original question: while our competitors must “rely heavily on individual programmers’ coding skills,” we at SAS now rely on a tried-and-true system that regenerates the talent it needs and continues to drive huge advances throughout SAS’ code base.

Skepticism exists about synthetic data generation, specifically the concern that building models on artificial data could lead to unreliable outcomes. Are there scenarios where synthetic data can still provide significant value despite these concerns?

Harry Keen, Product Evangelist, SAS: Synthetic data (when used to replicate a dataset exactly, in a truly privacy-preserving manner) will have slightly different statistical properties compared to the real data. Therefore, a model trained on it will behave slightly differently from a model trained on the real data.

This is an unavoidable fact: if you want privacy, you need to introduce this slight statistical difference. The degree of difference can be tuned up or down with differential privacy and will depend on the size and characteristics of the real dataset. However, a small difference will always exist.
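The privacy/accuracy dial Keen describes is exactly what the epsilon parameter of differential privacy controls. A minimal sketch (illustrative only, using the classic Laplace mechanism on a simple count query; it is not Hazy's or SAS' implementation) makes the trade-off concrete:

```python
import numpy as np

rng = np.random.default_rng(42)

def dp_count(true_count, epsilon, sensitivity=1.0):
    # Laplace mechanism: noise scale = sensitivity / epsilon.
    # Smaller epsilon -> stronger privacy guarantee -> larger expected error.
    return true_count + rng.laplace(loc=0.0, scale=sensitivity / epsilon)

true_count = 1000
for eps in (0.1, 1.0, 10.0):
    samples = [dp_count(true_count, eps) for _ in range(5000)]
    mae = np.mean([abs(s - true_count) for s in samples])
    print(f"epsilon={eps}: mean absolute error ~ {mae:.1f}")
```

Turning epsilon down buys more privacy at the cost of more noise, which is the "small difference will always exist" point in quantitative form: at epsilon = 0, the output carries no information about the real data at all.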

How can this still be valuable?

Keen: The huge problem data leaders are trying to solve with the privacy benefits of synthetic data is the “time to data” within an organization. They face a difficult trade-off: do I wait several months for the use of real data to be approved, extracted and sanitized, or do I give my data scientists on-demand access to safe synthetic versions that may not be a perfect match to the real data but are absolutely going to be close enough to build, test, learn and iterate on modeling and analytics approaches? This problem increases tenfold when the organization tries to work with external third parties, and such work is sometimes totally blocked unless sensible data privacy measures are in place.

Our work at Hazy has shown that customers can derive actionable insights and collaborate effectively with third parties without ever needing to touch the real data. Moreover, armed with a solid business case built on proofs derived from synthetic data, data leaders can then accelerate access to real data to validate their results if necessary.

When analytics leaders are stuck in the real data access trap, they have several other options. They can de-prioritize the project, wait for the real data, or consider other data privacy technologies such as masking, anonymization, homomorphic encryption, query-based differential privacy, secure enclaves, etc. Synthetic data outperforms all of these options by being quicker and producing more statistically accurate data. It doesn’t require the end user or organization to change anything about their analytics workflows: users can get their hands on the synthetic data without limitations and use it as a drop-in replacement for real data.

Synthetic data technology also offers the opportunity to tune and refine the proportions of various classes in the synthetic output. This gives the user the ability to amplify the outlier signal and rebalance imbalanced datasets. These augmented synthetic datasets can be used to train models that are better at detecting those outliers, having seen more examples of them in training. The models are also fairer, because the limitations of real-world data collection and labeling have not compromised their training data.
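The rebalancing idea can be sketched with a toy SMOTE-style oversampler that interpolates between minority-class points (a deliberately simple stand-in for the full generative models that products like Hazy use; all names and numbers here are hypothetical):

```python
import numpy as np

rng = np.random.default_rng(7)

def oversample_minority(X_minority, n_new):
    """Synthesize n_new minority-class examples by interpolating between
    random pairs of existing minority points (SMOTE-style augmentation)."""
    idx_a = rng.integers(0, len(X_minority), n_new)
    idx_b = rng.integers(0, len(X_minority), n_new)
    t = rng.random((n_new, 1))  # interpolation weight per new sample
    return X_minority[idx_a] + t * (X_minority[idx_b] - X_minority[idx_a])

# Imbalanced toy setting: only 5 minority (e.g., fraud) samples in 2 features
X_minority = rng.normal(loc=5.0, scale=0.5, size=(5, 2))
X_new = oversample_minority(X_minority, n_new=95)

print(X_new.shape)  # 95 synthetic minority examples to help balance the classes
```

A downstream classifier trained on the augmented set sees far more minority examples, which is the mechanism behind the improved outlier detection described above; full synthetic data generators achieve the same effect while also modeling the joint distribution rather than simply interpolating.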

In summary, synthetic data isn’t perfect for every use case. There may be scenarios where only the actual data will do. However, it is valuable in an organization’s toolbox to speed up data access and build more robust models.

Synthetic data’s privacy and data augmentation capabilities allow on-demand data access that enables internal and external collaboration and accelerates the time to actionable insights. They also unlock the ability to manipulate the signal in the synthetic output, meaning users can build more robust models with less bias.

If you liked this story, read the Insights article What is synthetic data? And how can you use it to fuel AI breakthroughs?



