
Desire to harness potential of generative AI drives rising interest in data as an asset

Investors are targeting AI developers and buying proprietary data sets to build new AI businesses. Here we explain how to manage risk in this fast-evolving space.

When MIT named generative AI as one of its breakthrough technologies of 2023, it couldn’t have predicted that less than a year later the technology would be the top investment priority for global businesses. More than 70% of respondents to KPMG’s most recent CEO survey said they were already investing heavily in generative AI as a source of future competitive advantage, with more than half expecting a return on their outlay within three to five years.

That investment is not only coming from corporates – financial sponsors are also targeting companies for the data they hold in order to build new AI-driven businesses. Looking ahead to 2024, we expect a rising proportion of deals across all sectors to have proprietary data as a strategic driver.

This is not necessarily about building large, general-purpose AI models from scratch, which in most cases will be prohibitive for a variety of reasons, including cost. Instead, many investors are looking to harness proprietary data to develop smaller, more specialised AI systems, or more commonly to customise pre-trained AI tools.

There are many ways to do this. At a high level, the two most common involve fine-tuning (taking the model weights of a pre-trained AI system and refining them through further training on a smaller dataset) or retrieval-augmented generation (effectively connecting the AI model to a proprietary database from which it can "look up" relevant information and incorporate it into its outputs). The latter is the more common approach, particularly where the pre-trained system is a closed commercial model such as GPT-4, Claude or Gemini. In both cases, a target's proprietary data can provide a valuable competitive advantage.
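
To make the retrieval-augmented approach concrete, the sketch below is a minimal, purely illustrative Python example (assuming scikit-learn is available). The "proprietary" documents, the query and the retrieval method are hypothetical stand-ins: retrieval here is a simple TF-IDF similarity search, and the resulting augmented prompt would then be passed to whichever pre-trained model is being customised. It is not a production pipeline, merely an illustration of the "look up" mechanism described above.

```python
# Minimal retrieval-augmented generation (RAG) sketch.
# Hypothetical data and names throughout; retrieval is a simple TF-IDF
# similarity search over a small in-memory "proprietary" corpus.

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Stand-in for a target company's proprietary dataset.
documents = [
    "Policy 12: customer data may be retained for a maximum of 24 months.",
    "Policy 7: scraped third-party content must not be used for model training.",
    "Policy 3: all personal data is pseudonymised before analytics use.",
]

vectoriser = TfidfVectorizer()
doc_vectors = vectoriser.fit_transform(documents)

def retrieve(query: str, k: int = 2) -> list[str]:
    """Return the k documents most similar to the query."""
    query_vector = vectoriser.transform([query])
    scores = cosine_similarity(query_vector, doc_vectors)[0]
    top = scores.argsort()[::-1][:k]
    return [documents[i] for i in top]

question = "How long can customer data be kept?"
context = "\n".join(retrieve(question))

# The pre-trained model never ingests the proprietary data as training data;
# it only sees the retrieved snippets at inference time.
prompt = (
    "Answer using only the context below.\n\n"
    f"Context:\n{context}\n\n"
    f"Question: {question}"
)
print(prompt)  # this prompt would be sent to the chosen foundation model
```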

Due diligence of generative AI deals starts with the use case

From a due diligence perspective, the risk assessment for deals involving the development or deployment of generative AI starts with the use case. Due diligence needs to be more strategic and forward-looking, and carried out with a sophisticated technical understanding of how any new or customised model will be developed, as well as of its potential commercial applications.

There are a host of risks associated with the selection of data on which to train generative AI models, including from a data privacy, IP and regulatory perspective. When acquiring data for building or customising AI, it is important to ascertain how that data was collected, where it has been stored and how it has been processed.

If any of the data includes personal data (i.e. information relating to an identified or identifiable individual), this could raise potentially significant privacy law compliance issues. Some of the key principles underpinning these obligations include fairness and bias (broadly speaking, not using personal data in a way that could have an unjustified adverse effect on individuals), transparency (informing individuals that personal data concerning them is being collected, used, consulted or otherwise processed), data minimisation (any use of personal data being adequate, relevant and limited to the purposes for which it was collected), data security and privacy by design.

Although data privacy compliance is typically not within the scope of a legal due diligence exercise (as this would require an on-the-ground audit, beyond a desktop review of policies and procedures), a purchaser should seek to identify any potential indicators of deficiencies in the target's approach to data privacy, as well as understand how those might impact its ability to comply with personal data obligations when using the target's data to develop AI models. As an example, it will be significantly more challenging – if not technically impossible – to meet data privacy requirements when using personal data that has been scraped from the internet without permission. A risk assessment will then be required.

The regulatory environment is only moving in one direction – in recent months we have seen high-level political focus on the privacy impacts of AI through initiatives such as the draft EU AI Act, the UK AI Safety Summit and President Biden's Executive Order on Safe, Secure and Trustworthy AI, which calls for Congress to pass data privacy legislation in the U.S.

Due diligence is also needed to assess IP infringement risk, particularly the possibility that copyright or other IP rights would be infringed: (a) by the purchaser using the data to train an AI model, or (b) by a subsequent user of the AI in generating an output. This risk would arise where the target does not own the IP rights in the data or does not have the necessary rights to use the data to train an AI model or, critically, to allow users of the AI model to use outputs generated by it. It chiefly arises in relation to data that the target may have scraped from the internet, the scenario that underpins many of the recent infringement allegations brought against AI developers. If the target's data includes scraped information, the purchaser would need to conduct a risk/benefit analysis as to whether to use that data in its AI development.

The nature and extent of any IP infringement risk is a complex, jurisdiction-specific question, and one that varies depending on a multitude of factors associated with how the model is trained, what it is used for, how the outputs from the model are deployed, any guardrails around its use, and relevant contract terms. Due diligence in this area requires a detailed assessment of likely future risks based on current or historical data collection practices and assumptions around future use cases.

Beyond scraped data, IP infringement risk may also arise in relation to data the target has in-licensed from a third party. Any due diligence exercise should assess the terms of those licences against anticipated AI use cases: for instance, they are unlikely expressly to permit use of the data for AI development. Indeed, they may contain specific restrictions that would prohibit some of the technical steps involved in developing AI models, such as consolidating that data with other datasets. The ownership and licensing provisions relating to improvements and derivative works require particularly careful consideration, as they may provide the data licensor with an argument to assert ownership.

More broadly, due diligence needs to be informed by, and test the target's approach to, the evolution of AI-related laws around the world. The proposed EU AI Act, for example – which is due to be adopted in early 2024 – takes a use case-based approach that will apply a separate set of rules to distinct types of AI use. Investors will want to consider (and test whether the target has considered) which of those categories its proposed AI model development would fall into.

IP ownership is another area of focus. Crudely speaking, an AI system is composed of the model (the weights representing the learned values derived from the training process), the source code for the system, and the data used to train the model (albeit the training data is not stored by the model and is therefore not an ongoing part of the system as such). As a result, any buyer looking to protect the commercial value of the acquired data that it will use to develop or customise an AI model will want to understand how that data is currently protected from an IP perspective. In addition, in many countries, the output generated by AI is not protected by copyright, which may diminish the value of content generated by AI. These will be jurisdiction-specific questions, will be specific to the type of data (e.g. images, text, video and audio) and may not be straightforward.

No AI investment is without risk, and mitigating potential issues may not be possible using traditional deal protections such as warranties.

For instance, if the investor is made aware through the due diligence process that certain parts of the target's data have been scraped, that knowledge is likely to preclude warranty claims that the scraping breached laws or infringed IP rights. It may be possible to obtain an indemnity from the seller for specific instances of non-compliance, but this is likely to be strongly resisted. Certainly, the seller would be unlikely to agree to extend any indemnity to losses flowing from the purchaser's future use of the data to build new AI models.

On that basis, investors are likely to need to rely on non-contractual protections as they use the acquired data to build or customise AI. There are steps that can be taken at any stage of the AI lifecycle to mitigate the risks arising from developing AI models. The appropriate risk mitigants always depend on the particular use case, but broadly they are likely to relate to the design of the use case (both for the training process and for the ultimate intended use of the AI model), internal governance controls to reinforce that use case, operational and infosec safeguards, and contractual protections.

Contractual protections vital where pre-trained models are being customised

The latter will be relevant where the investor is taking a pre-trained foundation model and customising it using the acquired data. Here, the investor will have a choice between licensing a closed AI model on commercial terms or using an open-source AI model subject to the relevant open-source AI licence. The question of whether to work with closed or open-source models is central to the broader philosophical questions around AI safety, but that debate notwithstanding, any use of open-source AI should be undertaken with extreme caution. “Open source” has a range of connotations in an AI context, both in relation to what is actually made available on an open-source basis (typically only the weights for the model) and to the applicable licence terms (many of which restrict commercial usage).

Contractual protections will also be important where the purchaser ultimately seeks to commercialise the developed AI model through arrangements with its customers. It may be possible to agree an allocation to the customer of some of the risks associated with developing the AI model, which is an area of rapidly evolving market practice.

Mitigating risk in data-driven deals with an AI use case in mind requires a three-pronged approach. First, the target's data should be assessed against the proposed use case for developing or customising a future AI model, which can be adapted to mitigate risks as appropriate. Second, appropriate governance can help reduce any residual risk during development of the model and once it is in use. Finally, operational guardrails such as data encryption, model filters to prevent infringing outputs, and ongoing monitoring of training and performance provide an additional layer of protection.
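
As a purely illustrative example of one such guardrail, an output filter in its simplest form might look like the following Python sketch. The protected passages, similarity threshold and blocking behaviour are hypothetical placeholders rather than a recommended implementation.

```python
# Illustrative output filter: block model responses that closely reproduce
# known protected text. All values below are hypothetical placeholders.

from difflib import SequenceMatcher

PROTECTED_PASSAGES = [
    "example licensed passage that must not be reproduced verbatim",
]

def is_permitted(output: str, threshold: float = 0.85) -> bool:
    """Reject outputs that closely match any protected passage."""
    for passage in PROTECTED_PASSAGES:
        similarity = SequenceMatcher(None, output.lower(), passage.lower()).ratio()
        if similarity >= threshold:
            return False
    return True

candidate = "Example licensed passage that must not be reproduced verbatim."
if not is_permitted(candidate):
    candidate = "[response withheld: possible reproduction of licensed content]"
print(candidate)
```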

