Better AI with Less Data: How Domain-Specific Data Can Outperform Large Datasets

Only 15% of all AI projects succeed in production, and surveys show that the average ROI of enterprise AI implementations is a meagre 1.3%[1]. These statistics are sobering, and they raise the question of why so many organizations continue to pour resources (money, work hours, and compute) into data collection and model development without a clear path to ROI.

Don’t get me wrong. We at Solix know the transformational potential AI can bring when done right. But here is how our thesis on successful AI differs: at its core, we believe more data isn’t always better. The key is having the right datasets, of high quality and in just the right quantity. Pour unlimited low-quality data (and a lot of cash) into a project without a clear strategy or relevance, and you simply end up with diminishing returns. A plot of model accuracy against training set size typically rises steeply at first and then flattens; past that point, even doubling the data may buy only a few percentage points of additional accuracy.

Law of Diminishing Returns in AI

I like to think of AI in terms of classical economics. The law of diminishing marginal utility says that the utility of each additional unit decreases as consumption increases, until a point of equilibrium is reached where any further increase yields zero or even negative marginal utility.

AI is very similar. In the early stages of training a model, each additional data point dramatically increases accuracy. As data volumes grow, this effect lessens, and more data no longer necessarily adds new information about how to model the problem best.

For instance, when you train an image classification model, increasing the number of labelled images from 100 to 1,000 might significantly improve accuracy. Going from 50,000 to 100,000 images, however, is unlikely to double it. If the model’s capacity is limited, throwing too much data at it can even hurt performance slightly, as the model may overfit to the noise instead of the signal. AI and machine-learning models have their “sweet spots,” beyond which additional volume yields only marginal gains; depending on model complexity, some models reach this plateau sooner than others serving more complex use cases.
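
To make that plateau concrete, below is a minimal sketch (assuming scikit-learn is installed) that trains a simple classifier on progressively larger slices of a synthetic dataset and reports held-out accuracy. The dataset and model are illustrative stand-ins, not the image-classification setup above; the point is the flattening curve, not the specific numbers.

```python
# Sketch: accuracy vs. training-set size on a synthetic task.
# The flattening of the scores illustrates diminishing returns from more data.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import learning_curve

# Synthetic stand-in for a labelled classification dataset.
X, y = make_classification(n_samples=20_000, n_features=40,
                           n_informative=10, random_state=0)

train_sizes, _, test_scores = learning_curve(
    LogisticRegression(max_iter=1000),
    X, y,
    train_sizes=np.linspace(0.05, 1.0, 8),  # 5% .. 100% of the training folds
    cv=5,
    scoring="accuracy",
)

for n, score in zip(train_sizes, test_scores.mean(axis=1)):
    print(f"{n:>6} training examples -> {score:.3f} mean CV accuracy")
```

Early increments in training size typically move accuracy the most; later increments move it by fractions of a point, which is exactly the sweet-spot behaviour described above.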

While the problem of managing “too much data” is rare in practice, wastefully collecting random data can still prove costly. Beyond sheer volume, what’s in the data matters far more.

Quality vs Quantity: Domain-specific Data Wins!

Applied to AI, the popular phrase “garbage in, garbage out” becomes “garbage in, garbage squared”: noisy, irrelevant, or unrepresentative data won’t just fail to produce useful insights, it can actively mislead. In practice, a clean, labeled, domain-specific dataset often outperforms a generic corpus.

Context-rich data beats volume. Even with significantly smaller volumes, a model trained on data that directly reflects the task will generally outperform one trained broadly on data scraped off the internet. Enterprises looking to solve a singular problem with AI may be better off building a “small language model” on domain-focused data, improving domain-specific accuracy and ROI. Enterprises building custom models must ask themselves, “Does this data truly represent the domain and the problem that needs to be solved?” If not, refining existing datasets may deliver more value than simply adding more data.

Defining Your Scope: How to Decide What Data You Need

Every AI project should begin by comprehensively defining its scope and success metrics. The data you need depends on:

  • Use Case/Problem Complexity: How complex is the problem you are trying to solve? A simple logistic regression might need a sample of 1,000 to 10,000 examples, while applications like open-domain question answering or an autonomous taxi service like Waymo’s require datasets numbering in the millions.
  • Model Capacity & Type: Are you fine-tuning a small, domain-specific language model or building the next big transformer-based LLM? Small language models (SLMs) focused on a domain can be highly accurate provided the training data is of high quality; a larger model requires significantly more data.
  • Associated Business Risks & ROI: Are you in a highly regulated industry? Have you secured sensitive data and PII? Does your AI model have adequate access controls to prevent unauthorized access? What are the potential losses if your model makes mistakes? For industries like healthcare and financial services, hold out additional validation data to guard against model hallucinations while ensuring compliance with applicable regulations.

Getting More Value from Less Data

With technology advancing, AI teams now have newer tools and techniques that let them outperform brute-force data collection. Here are some methods that can amplify the value of the datasets you already possess:

  • Create a semantic layer with structured context: Knowing what data you own is essential to the success of any AI project. Many organizations, large or small, have collected vast amounts of data over the years, often with little to no clear business context. Adding a semantic layer helps you surface dark data and lets AI and machine-learning models interpret data more intelligently: instead of just parsing flat tables, your model can understand relationships between datasets, business logic, and constraints (a minimal sketch follows this list).
  • Active Learning and Intelligent Data Classification: Let your model decide what data to label next. Active learning focuses labeling effort on the most informative samples, usually where the model is least confident. Combined with smart data classification, you can cluster and organize data by relevance, novelty, and sensitivity, concentrating your labeling effort and streamlining what, when, and why datasets are labeled so that each annotation adds value (see the uncertainty-sampling sketch below).
  • Transfer Learning: In most cases, training a language model from scratch is impractical and highly resource-intensive. Starting from a commercially available model and fine-tuning it for your business needs reduces the amount of labeled data required to reach production-quality performance (see the sketch below).
  • Synthetic Data Generation: For niche use cases, gathering relevant datasets can be challenging. Organizations can instead generate synthetic datasets that replicate the characteristics of the original data pertinent to their domain, helping kickstart early prototypes or supplement rare edge cases to gain initial stakeholder approval (a simple example follows the list).
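
As a rough illustration of the semantic-layer idea, the sketch below attaches business meaning, units, sensitivity flags, and relationships to raw columns and renders them as plain-text context a model (or an LLM prompt) can consume. The table, column names, and definitions are hypothetical.

```python
# Minimal sketch of a semantic layer: hypothetical column names, business
# definitions, and relationships attached to a raw table so downstream
# models get context, not just flat values.
from dataclasses import dataclass, field

@dataclass
class ColumnContext:
    name: str
    business_meaning: str
    unit: str | None = None
    sensitive: bool = False

@dataclass
class DatasetContext:
    table: str
    description: str
    columns: list[ColumnContext] = field(default_factory=list)
    relationships: list[str] = field(default_factory=list)

    def to_prompt_context(self) -> str:
        """Render the semantic layer as plain text a model can consume."""
        lines = [f"Table {self.table}: {self.description}"]
        lines += [f"- {c.name}: {c.business_meaning}"
                  + (f" (unit: {c.unit})" if c.unit else "")
                  + (" [sensitive]" if c.sensitive else "")
                  for c in self.columns]
        lines += [f"Relationship: {r}" for r in self.relationships]
        return "\n".join(lines)

orders = DatasetContext(
    table="orders",
    description="One row per customer order, post-2019 ERP migration.",
    columns=[
        ColumnContext("ord_amt", "Order value before tax", unit="USD"),
        ColumnContext("cust_id", "Customer identifier", sensitive=True),
    ],
    relationships=["orders.cust_id joins to customers.id"],
)
print(orders.to_prompt_context())
```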
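
The uncertainty-sampling loop behind active learning can be sketched in a few lines with scikit-learn: train on a small seed set, score the unlabeled pool, and route the least-confident examples to labeling. The dataset, seed size, and batch size here are arbitrary illustrations.

```python
# Sketch: pool-based active learning with least-confident (uncertainty) sampling.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=5_000, n_features=20, random_state=0)

# Start with a small randomly labeled seed set.
labeled = np.zeros(len(X), dtype=bool)
seed = np.random.default_rng(0).choice(len(X), size=50, replace=False)
labeled[seed] = True

model = LogisticRegression(max_iter=1000)
for round_ in range(5):
    model.fit(X[labeled], y[labeled])
    pool_idx = np.flatnonzero(~labeled)
    proba = model.predict_proba(X[pool_idx])
    # Uncertainty = 1 - confidence in the predicted class.
    uncertainty = 1.0 - proba.max(axis=1)
    pick = pool_idx[np.argsort(uncertainty)[-100:]]  # most uncertain batch
    labeled[pick] = True  # in a real project, these go to human annotators
    print(f"round {round_}: {labeled.sum()} labeled examples")
```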
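
For transfer learning, one lightweight pattern (assuming the open-source sentence-transformers library and a small pretrained encoder such as all-MiniLM-L6-v2) is to keep the pretrained encoder frozen and train only a small classifier on a handful of domain-specific labelled examples. The tickets and labels below are made up for illustration.

```python
# Sketch: frozen pretrained encoder + small classifier on tiny domain data.
from sentence_transformers import SentenceTransformer
from sklearn.linear_model import LogisticRegression

# Tiny, illustrative domain dataset (support tickets -> category).
texts = [
    "Invoice 4821 was charged twice this month",
    "Cannot reset my password from the login page",
    "Requesting a refund for the duplicate charge",
    "The mobile app crashes when I open settings",
]
labels = ["billing", "account", "billing", "technical"]

encoder = SentenceTransformer("all-MiniLM-L6-v2")   # pretrained, left frozen
features = encoder.encode(texts)                    # dense sentence embeddings

clf = LogisticRegression(max_iter=1000).fit(features, labels)
print(clf.predict(encoder.encode(["I was billed two times for one order"])))
```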
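
And for synthetic data, the simplest version is to sample new rows from distributions fitted to the real data’s column statistics; more sophisticated approaches use dedicated generators or generative models. The column names and statistics below are hypothetical.

```python
# Sketch: synthetic tabular rows sampled from per-column distributions
# whose parameters would be estimated from a small real dataset.
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)

# Pretend these were estimated from a small, real domain dataset.
real_stats = {
    "order_value_usd": {"mean": 182.0, "std": 65.0},
    "items_per_order": {"lam": 3.2},
    "region": {"values": ["NA", "EU", "APAC"], "probs": [0.5, 0.3, 0.2]},
}

n = 1_000
synthetic = pd.DataFrame({
    "order_value_usd": rng.normal(real_stats["order_value_usd"]["mean"],
                                  real_stats["order_value_usd"]["std"], n).clip(min=1),
    "items_per_order": rng.poisson(real_stats["items_per_order"]["lam"], n) + 1,
    "region": rng.choice(real_stats["region"]["values"], size=n,
                         p=real_stats["region"]["probs"]),
})
print(synthetic.head())
```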

Closing Thoughts

It’s not about more data, it’s about having access to the right data!

As emphasized throughout this blog, the quality of your data matters far more than the quantity. Focus on developing business- and use-case-ready data products that are clean, labeled, and domain-specific. A data strategy for AI should center on use-case complexity, compute needs, model selection, and business success metrics; defining these up front gives enterprises a clear roadmap to AI success.

Another important aspect to consider is the overall compliance practices followed throughout the enterprise. Having the right compliance and data governance guardrails is almost as crucial as everything else discussed above. Since compliance and data governance for AI are highly complex, they deserve a separate discussion that I plan to cover in my next blog, so stay tuned!

At Solix, we empower data-driven enterprises to maximize their data assets. With the Solix Enterprise AI suite, we provide comprehensive solutions for staging data, developing domain-specific, business-ready data products, and enabling AI-powered governance at scale.

Solix Intelligent Data Classification, a key part of the Solix EAI suite, is an intelligent semantic layer that allows you to define business rules, enrich metadata, enhance context, and rediscover data. Using Solix IDC, enterprises can automatically tag datasets with AI-augmented metadata and classify them based on relevance, sensitivity, and compliance needs.

If you found this interesting, please contact us to schedule a session to learn more about how Solix can help enhance your existing data strategy.


[1] https://www.equalexperts.com/blog/tech-focus/ive-spent-1million-on-data-scientists-why-arent-i-seeing-a-return-on-my-investment/#:~:text=using%20cutting,generate%20a%20profit%20at%20all