
Court approves use of predictive coding in large disclosure exercise

14 April 2016

In Pyrrho Investments Ltd & ors v MWB Property Ltd & ors [2016] EWHC 256 (Ch) a Chancery Master has approved the use of “predictive coding” in a large disclosure process involving millions of electronically stored documents. Although it does not completely replace manual review (which is still needed, for example, to train the software and to check relevance or privilege), such technology can greatly reduce the cost of, and time taken for, large disclosure exercises. This article discusses the meaning and effect of “predictive coding” and the factors that were taken into account in this case in approving its use.

What is predictive coding? 

Predictive coding is a process whereby software is trained to assess the likely relevance of large quantities of electronically stored information (ESI). The parties agree a predictive coding protocol, including the definition of the data set, sample size, batches, control set, reviewers, confidence level and margin of error. The criteria for defining the data set will include who held the documents (custodians) and the date range, and perhaps also whether the documents contain any of the chosen keywords. Certain types of document that have no text, or insufficient text, will be excluded (they will have to be considered manually). The resulting documents are “cleaned up” by removing repeated content (eg email headers or disclaimers) and words that will not be indexed (eg because they are not useful in assessing relevance).
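To illustrate how a confidence level and margin of error relate to sample size, the following is a minimal sketch using the standard formula for estimating a proportion, with a finite population correction. It is not taken from the protocol in this case, whose parameters are not set out in this article; the function name and the 95% and 2% figures are assumptions chosen purely for illustration.

```python
import math
from statistics import NormalDist


def sample_size(population, confidence=0.95, margin=0.02, p=0.5):
    """Documents to sample so that an estimated proportion (eg of relevant
    documents) falls within +/- margin at the given confidence level.

    p=0.5 is the most cautious assumption about the true proportion.
    """
    # z-score for a two-sided interval at the chosen confidence level
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    # sample size for an effectively unlimited population
    n0 = z * z * p * (1 - p) / (margin * margin)
    # finite population correction (small effect when population is large)
    n = n0 / (1 + (n0 - 1) / population)
    return math.ceil(n)


# Illustrative only: a 3.1 million document set, 95% confidence, 2% margin
print(sample_size(3_100_000, confidence=0.95, margin=0.02))
```

On these assumptions the required sample is in the low thousands, and it grows quickly as the margin of error is tightened, which is why these two figures are worth agreeing in the protocol.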

A representative sample (eg up to 4,000 documents) of the “included” documents is then used to “train” the software. A person who would otherwise be making the decisions as to relevance for the whole document set (ie a lawyer involved in the litigation) considers and makes a decision for each of the documents in the sample, and each such document is categorised accordingly. It is essential that the criteria for relevance are consistently applied at this stage. The best practice would be for a single, senior lawyer who has mastered the issues in the case to consider the whole sample. Where documents would, for some reason, not be good examples, they should be deselected so that the software does not learn from them. The software analyses all of the documents for common concepts and language used. Based on the training that it has received, the software then assigns a likely relevance score to each individual document in the whole document set.
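The train-then-score idea can be sketched with a toy word-frequency model. This is an assumption-laden illustration only: commercial predictive coding software uses its own, more sophisticated methods, and the function names and example documents below are hypothetical.

```python
import math
from collections import Counter


def tokenize(text):
    # Lower-case words only; anything containing digits or punctuation is ignored
    return [w for w in text.lower().split() if w.isalpha()]


def train(labelled):
    """labelled: list of (text, is_relevant) pairs coded by the reviewing lawyer."""
    counts = {True: Counter(), False: Counter()}
    docs = {True: 0, False: 0}
    for text, relevant in labelled:
        counts[relevant].update(tokenize(text))
        docs[relevant] += 1
    vocab = set(counts[True]) | set(counts[False])
    return counts, docs, vocab


def relevance_score(model, text):
    """Return a score between 0 and 1; higher means more likely relevant."""
    counts, docs, vocab = model
    total = {c: sum(counts[c].values()) for c in (True, False)}
    # Start from the (smoothed) balance of relevant to irrelevant training documents
    log_odds = math.log((docs[True] + 1) / (docs[False] + 1))
    for word in tokenize(text):
        if word not in vocab:
            continue  # words never seen in training carry no evidence
        p_rel = (counts[True][word] + 1) / (total[True] + len(vocab))
        p_not = (counts[False][word] + 1) / (total[False] + len(vocab))
        log_odds += math.log(p_rel / p_not)
    return 1 / (1 + math.exp(-log_odds))


# Hypothetical training sample coded by a single senior lawyer
sample = [
    ("lease payment approved by directors", True),
    ("payment to directors under lease", True),
    ("lunch menu for friday", False),
    ("office party friday", False),
]
model = train(sample)
print(relevance_score(model, "directors approved payment"))
print(relevance_score(model, "friday lunch"))
```

The sketch also shows why consistency in the training sample matters: every coding decision shifts the word weights that are later applied to the whole document set.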

The results of this categorisation exercise are then validated through a number of quality control exercises. These are based on statistical sampling. The samples are randomly (and blindly) selected and then reviewed by a human for relevance. The software creates a report of software decisions overturned by humans. The overturns are themselves reviewed by a senior reviewer. Where the human decision is adjudged correct by the senior reviewer, it is fed back into the system for further learning: the software analyses the correctly overturned documents just as the originals were analysed. Where the human decision is not correct, the document is removed from the overturns. Where the relevance of the original document was incorrectly assessed at the first stage, that assessment is changed and all the documents depending on it will have to be re-assessed.
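The overturn review described above can be sketched as a simple triage step. The record layout and names below are hypothetical and are not drawn from any particular review platform.

```python
from dataclasses import dataclass


@dataclass
class Check:
    doc_id: str
    software_relevant: bool   # the software's categorisation
    reviewer_relevant: bool   # the blind human review of the sampled document
    senior_relevant: bool     # senior reviewer's ruling (only used for overturns)


def triage_overturns(checks):
    """Split a quality control sample into confirmed and rejected overturns."""
    # An overturn is any sampled document where the human disagreed with the software
    overturns = [c for c in checks if c.software_relevant != c.reviewer_relevant]
    # Confirmed overturns are fed back to the software for further learning
    confirmed = [c for c in overturns if c.senior_relevant == c.reviewer_relevant]
    # Rejected overturns are removed from the overturn report
    rejected = [c for c in overturns if c.senior_relevant != c.reviewer_relevant]
    rate = len(confirmed) / len(checks) if checks else 0.0
    return confirmed, rejected, rate


sample = [
    Check("DOC-1", True, True, True),     # human agrees with software
    Check("DOC-2", True, False, False),   # overturn upheld by senior reviewer
    Check("DOC-3", False, True, False),   # overturn rejected by senior reviewer
    Check("DOC-4", False, False, False),  # human agrees with software
]
confirmed, rejected, rate = triage_overturns(sample)
print([c.doc_id for c in confirmed], [c.doc_id for c in rejected], rate)
```

The confirmed overturn rate across the sample gives one rough indication of how well the software is performing at that stage of the review.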

Although the number of documents that have to be manually reviewed in the predictive coding process may be high in absolute numbers, it is only a small proportion of the total number. Thus – whatever the cost per document of manual review – provided that the exercise is large enough to absorb the up-front costs of engaging a suitable technology partner, the overall costs of a predictive coding review should be considerably lower.

Large disclosure exercise – over 17.6 million documents

To give an idea of the scale of the disclosure exercise in the MWB case, the total number of electronic files restored from the back-up tapes of the second claimant was originally more than 17.6 million. After de-duplication, 3.1 million documents remained. The bulk of the relevant documents were controlled by the second claimant, which held back-up tapes storing email accounts used by the second to fifth defendants (who were directors of the second claimant).
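The article does not say how de-duplication was carried out in this case. One common approach, sketched below purely as an assumption, is to hash each document's normalised content and keep only the first document for each hash; the function name and sample data are hypothetical.

```python
import hashlib


def dedupe(documents):
    """documents: iterable of (doc_id, text). Keeps the first copy of each text."""
    seen = set()
    unique = []
    for doc_id, text in documents:
        # Normalise case and whitespace so trivially different copies match
        normalised = " ".join(text.lower().split())
        digest = hashlib.sha256(normalised.encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append((doc_id, text))
    return unique


restored = [
    ("tape1/0001", "Board minutes attached."),
    ("tape2/0417", "Board  minutes attached."),  # same email restored from another tape
    ("tape2/0418", "Please see revised lease."),
]
print(len(dedupe(restored)))  # 2 of the 3 restored files remain
```

On the figures in this case, de-duplication reduced more than 17.6 million restored files to 3.1 million documents, removing roughly four in every five.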

Why did the court agree to predictive coding?

The Master commended the parties on their attempts to agree an approach to disclosure, including the proposal to use predictive coding. He considered that use of predictive coding would further the overriding objective in CPR Part 1. The Master observed that experience in other jurisdictions is that predictive coding can be useful, and there is no evidence to show that predictive coding is less accurate than human review alone or keyword searches and manual review combined. Greater consistency can be achieved by using the computer to apply the approach of a senior lawyer towards an initial sample of documents than by using several lower-grade fee earners. Nothing in the CPR or Practice Directions prohibits the use of such software.

The Master noted that over three million documents had to be considered for relevance and possible disclosure. The cost of a manual search would be several million pounds and would therefore be unreasonable, at least in circumstances where a suitable automated alternative exists at lower cost. The cost of using predictive coding would depend on various factors, but would be between GBP 181,988 and GBP 469,049 (plus monthly hosting fees between GBP 15,717 and GBP 20,820) – far less expensive than the full manual alternative (although some degree of manual review might also be necessary after the software has been applied). The value of the claims is in the tens of millions of pounds and use of the software is therefore proportionate.
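As a rough worked example of these figures, the sketch below adds hosting fees to the one-off cost range. The 12-month hosting period is an assumption made for illustration only; the article does not state how long hosting would be needed.

```python
def predictive_coding_cost(one_off, monthly_hosting, months):
    """Total cost in GBP: one-off fee plus hosting for the given number of months."""
    return one_off + monthly_hosting * months


HOSTING_MONTHS = 12  # assumption for illustration; not a figure from the judgment

low = predictive_coding_cost(181_988, 15_717, HOSTING_MONTHS)
high = predictive_coding_cost(469_049, 20_820, HOSTING_MONTHS)
print(f"GBP {low:,} to GBP {high:,}")  # GBP 370,592 to GBP 718,889
```

Even at the top of that illustrative range, the total remains well below a manual review costing several million pounds.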

The trial is not scheduled until June 2017, leaving plenty of time to consider other disclosure methods if predictive coding should prove unsatisfactory. Finally, he noted that the parties had agreed on use of the software and how to use it.

Master Matthews acknowledged that the suitability of predictive coding would have to be assessed on a case-by-case basis.

An English judgment supporting the use of technology in the disclosure process (as envisaged by Practice Direction 31B) is welcome.

Although this may be the first published judgment on the subject, it is not the first time that the court has approved the use of predictive coding. In 2009, when acting for a major financial institution, Allen & Overy used predictive technology successfully (with the blessing of the court and the opposing party) in a Commercial Court case. It is likely that it has also been used in other cases that have not resulted in formal public judgments.

Master Matthews is correct to acknowledge that predictive coding is not suitable for every case – for example, in cases where there are vast amounts of manuscript documents (perhaps lab notebooks in some IP cases), or where there are large quantities of spreadsheets (it is not very good with strings of numerical characters), or where the documents have very short strings of words (perhaps instant messaging and chat data). It is, however, very effective with emails.

Using predictive coding is, in many ways, a leap of faith for lawyers and their clients. However, it need not be seen as just determining which documents fall to be disclosed. It could also be a very useful tool to assist with prioritisation of review sets; to get the documents most likely to be relevant in front of the senior team as early as possible; or to quality control the results of a human review. Once the software has been applied, documents are not simply sent over to the opposition: they still need to be considered, eg for privilege. There is still a useful role for the so-called “lower-grade fee-earners”. Their reviews and sampling will be smarter if predictive coding technology is applied to the data set and used to prioritise it ahead of review.

Allen & Overy’s document management and review system, Ringtail Caseroom, already provides a predictive coding facility, in addition to powerful concept analysis and clustering tools. Allen & Overy also uses predictive coding technology via e-disclosure service providers if appropriate.

Further information

This case summary is part of the Allen & Overy Litigation and Dispute Resolution Review, a monthly publication. For more information please contact Sarah Garvey, or tel +44 20 3088 3710.