Summary Bullets:

• Synthetic unstructured data, or text, can be used to train and fine-tune large language models (LLMs) used in customer support applications or chatbot conversations.
• The application of synthetic data, both tabular and unstructured, will continue to grow, driven by a need for additional training data as well as concerns over data privacy.
On October 1, 2024, MOSTLY AI announced that its platform can help enterprises create synthetic text, a timely new capability given growing enterprise interest in using GenAI to extract insights from unstructured data. Over the past several years, much of the conversation around synthetic data has focused on using GenAI to create synthetic tabular data. Tabular data is structured data that can be neatly organized into rows and columns, for example information that can be arranged in an Excel file. The logical next step is to use GenAI to create text-based information that can be used to customize LLMs.
Synthetic data is information created by GenAI technology that is statistically similar to actual data. It is an attractive and increasingly popular option for organizations that need more data than they have readily available to train machine learning models, or that don’t want to use actual data to train models because of privacy concerns. Synthetic tabular data is already being used to train models, test software quality, and support staging and demo environments. Similarly, synthetic unstructured data, or text, can be used to train and fine-tune LLMs used in customer support applications or chatbot conversations. And while there is always the option of manually creating data, the process is time-consuming and resource-intensive, making synthetic data an appealing alternative.
With MOSTLY AI’s new capability, customers use a combination of proprietary models from MOSTLY AI and open-source GenAI models from Hugging Face to fine-tune an LLM and create statistically accurate synthetic text. The quality of the output data is enhanced by the use of structured data. The resulting synthetic text can then be used to customize GenAI-driven applications.
MOSTLY AI is well positioned to help organizations with their synthetic unstructured data needs. The Vienna, Austria-based company was founded in 2017 and is a well-known player in the synthetic data market. It has received $31 million in funding from European venture capitalists. MOSTLY AI designed its platform with ease of use in mind, making it accessible to those who aren’t data scientists or data engineers. For those who want to experiment with the technology and aren’t ready to commit to an enterprise license, which includes SLAs related to customer support, the company also offers a free tier of services.
There are, of course, challenges when it comes to working with synthetic data, the most notable of which is quality. Different techniques and platforms produce data of varying accuracy. Organizations will need to evaluate their synthetic data and take advantage of quality assurance reports. One best practice is to train one model using actual data and another using synthetic data, test both models on actual data withheld from training, and compare the results. Furthermore, even synthetic data may not be fully anonymous, a challenge users should be aware of. To tackle this problem, organizations should seek out platforms that offer tools for evaluating results, including how outliers are handled.
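The comparison of a model trained on actual data with one trained on synthetic data can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not MOSTLY AI's method: the toy data set, the per-class Gaussian sampler standing in for a synthetic data generator, the nearest-centroid classifier, and all function names are hypothetical choices made for the example.

```python
import random
import statistics


def make_real_data(n, rng):
    """Toy stand-in for 'actual' data: one numeric feature, two classes."""
    rows = []
    for _ in range(n):
        label = rng.choice([0, 1])
        x = rng.gauss(0.0 if label == 0 else 3.0, 1.0)
        rows.append((x, label))
    return rows


def synthesize(rows, n, rng):
    """Hypothetical generator: sample from per-class Gaussians fitted to rows.

    A real platform would use a far richer model; this only mimics the idea
    of producing new records that are statistically similar to the originals.
    """
    by_label = {0: [], 1: []}
    for x, label in rows:
        by_label[label].append(x)
    params = {k: (statistics.mean(v), statistics.stdev(v)) for k, v in by_label.items()}
    p_one = len(by_label[1]) / len(rows)
    out = []
    for _ in range(n):
        label = 1 if rng.random() < p_one else 0
        mu, sigma = params[label]
        out.append((rng.gauss(mu, sigma), label))
    return out


def train_centroids(rows):
    """Nearest-centroid classifier: remember the mean feature value per class."""
    return {c: statistics.mean(x for x, y in rows if y == c) for c in (0, 1)}


def accuracy(model, rows):
    """Share of rows whose nearest class centroid matches the true label."""
    hits = sum(
        1 for x, y in rows
        if min(model, key=lambda c: abs(x - model[c])) == y
    )
    return hits / len(rows)


def compare(seed=0):
    """Train on real vs. synthetic data, then score both on a real holdout."""
    rng = random.Random(seed)
    real = make_real_data(1000, rng)
    train_real, holdout = real[:800], real[800:]  # holdout is never trained on
    train_synth = synthesize(train_real, 800, rng)
    acc_real = accuracy(train_centroids(train_real), holdout)
    acc_synth = accuracy(train_centroids(train_synth), holdout)
    return acc_real, acc_synth


if __name__ == "__main__":
    acc_real, acc_synth = compare()
    print(f"trained on real data:      {acc_real:.3f}")
    print(f"trained on synthetic data: {acc_synth:.3f}")
```

A small gap between the two scores suggests the synthetic data preserves the signal the model needs; a large gap is a cue to revisit how the synthetic data was generated. This check measures utility only and says nothing about whether the synthetic records are anonymous.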
The application of synthetic data, both tabular and unstructured, will continue to grow, driven by the need for additional training data as well as concerns over data privacy. Though some organizations remain wary of using synthetic data, new tools are chipping away at the remaining obstacles, making it a more attractive and attainable option. Evolving regulatory requirements will drive further momentum. However, there is still a significant need for education in this area, since most organizations are only just getting started with synthetic data.
