Predictive Analytics and Artificial Intelligence… Science Fiction or E-Discovery Truth?
Predictive Analytics, Machine Learning, Predictive Coding, Automated Coding or whatever the buzzword of the moment may be… non-linear computer-assisted or automated review is the new “it” topic in the E-Discovery space and is not likely to disappear any time soon. The industry would have you believe that this cutting-edge breakthrough is game changing and will revolutionize the world as you know it. In many ways it has changed the landscape of the E-Discovery space, rendering the tried and true Electronic Discovery Reference Model (EDRM) inadequate to encompass the nuances of the changing workflows that have emerged as a result. But somewhere lost in the discussion of cutting-edge algorithms and lambda calculus is one important fact: none of this is novel, and it is being used all around us every day in medical diagnostics, customer relationship management (CRM) tools, insurance, telecommunications and even your personal credit score.
Predictive Analytics Takes Out Some of the Guesswork
Predictive analytics as applied to E-Discovery may not be the silver bullet some companies have been touting, but if adapted and properly managed, as it has been in other industries, it could rein in the cost, time and labor that have rendered litigation almost cost prohibitive. The seminal text on predictive analytics (then called Exploratory Data Analysis) was published over three decades ago[1], and although it has had multiple applications since its first publication, the core tenets have remained the same. Instead of following data collection with an artificially imposed model based on case assumptions or best guesses and then trudging through the output manually, page by page, the process begins with analyzing a sample of the data with the goal of inferring which model or algorithm is most appropriate. That model is then refined with new parameters as more of the data is analyzed, or its rules are adjusted, until the model reaches its maximum accuracy and can run with minimal or no supervision across the body of data.
Machine Learning: It’s No Sci-Fi Fantasy
Once the concepts of predictive analytics were applied to computer-based analysis, the ambiguous creature known as “machine learning” was born. Although it sounds more like science fiction than science, machine learning is no enigma. The best-known explanation of what machine learning actually is comes from a text published in 1997: “A computer program is said to learn from experience (E) with respect to some class of tasks (T) and performance measure (P), if its performance at tasks in (T), as measured by (P), improves with experience (E).”[2] In layman’s terms, machine learning techniques emulate human cognition and learn from training examples to better perform future tasks.
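To make the E, T and P of that definition concrete, here is a minimal sketch in Python. Everything in it is an assumption made for illustration: the sample documents, the coding decisions and the simple word-weight scoring are invented and do not represent any vendor's algorithm. The task (T) is deciding whether a document is responsive, the experience (E) is a growing set of reviewer-coded examples, and the performance measure (P) is accuracy on documents the program has not seen.

```python
from collections import Counter

# Hypothetical reviewer-coded examples: (document text, responsive?)
TRAINING = [
    ("merger pricing agreement", True),
    ("lunch menu friday", False),
    ("pricing discount competitor", True),
    ("holiday party schedule", False),
]
# Hypothetical held-out documents used only to measure performance (P).
HELD_OUT = [
    ("competitor pricing call", True),
    ("friday party lunch", False),
    ("discount competitor draft", True),
    ("schedule holiday menu", False),
]

def train(examples):
    """Experience (E): add 1 to a word's weight for each responsive
    document it appears in, subtract 1 for each non-responsive one."""
    weights = Counter()
    for text, responsive in examples:
        for word in text.split():
            weights[word] += 1 if responsive else -1
    return weights

def predict(weights, text):
    """Task (T): call a document responsive if its word weights sum above zero."""
    return sum(weights[word] for word in text.split()) > 0

def accuracy(weights, examples):
    """Performance (P): share of documents coded the same way as the reviewer."""
    correct = sum(predict(weights, text) == label for text, label in examples)
    return correct / len(examples)

# Measure P after 0, 2 and 4 training examples.
learning_curve = [accuracy(train(TRAINING[:n]), HELD_OUT) for n in (0, 2, 4)]
print(learning_curve)  # prints [0.5, 0.75, 1.0]
```

With this toy data the accuracy climbs as more coded examples are supplied, which is all the definition requires. Real document sets are far noisier, and improvement is rarely this tidy.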
The Categories of Machine Learning: Unsupervised, Supervised, Semi-Supervised, and Reinforcement
The method by which the computer has an “experience” in the context of this definition is one of the nuances that separate the dozen cutting-edge Electronically Stored Information (ESI) providers; the other major differentiator lies in what workflow is used and how it integrates with the human aspect of the review. Generally, machine learning systems are broken into four categories: Unsupervised, Supervised, Semi-Supervised, and Reinforcement Learning.
- Unsupervised Learning produces a data grouping prior to application based on statistical similarities or frequency of certain concepts or phrases (concept clustering, latent semantic indexing). This can be helpful in guiding the priority of a review but it can also misdirect a user if an outlier word receives priority disproportionate to its relevance to the case.
- Supervised Learning is at the other end of the spectrum and is based on coding decisions made by a human, which are used to “train” the algorithm; the algorithm then infers how data ought to be coded or classified. Iterative review of the algorithm’s inferences refines the model to maximal accuracy.
- Semi-Supervised Learning (SSL) begins in the same manner as supervised learning, but once manual review of the algorithm’s inferences has refined the model to maximal accuracy, the remainder of the corpus is analyzed in an automated fashion, with random sampling to check accuracy.
- SSL can be either transductive or inductive, with the former prioritizing statistically similar documents and the latter inferring general rules to be applied to the corpus of test data in training the algorithm.
- Reinforcement Learning is similar to the traditional SSL model, but rather than data analysis followed by testing of the results, this model modifies the application of the algorithm to the entire body of data throughout the review. Absent prioritization, this could prove risky if a key set of documents sits toward the end of the body of documents.[3]
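As a rough illustration of the unsupervised category, the sketch below groups documents purely by how many terms they share, with no human coding involved. It is a deliberately simple greedy grouping based on Jaccard similarity, not latent semantic indexing or any provider's concept clustering; the sample documents, the 0.5 threshold and the function names are all assumptions chosen for the example.

```python
def jaccard(a, b):
    """Similarity of two term sets: shared terms divided by all distinct terms."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def cluster(documents, threshold=0.5):
    """Greedy grouping: each document joins the first cluster whose first
    member is similar enough, otherwise it starts a new cluster.
    Returns lists of document indices."""
    term_sets = [set(doc.lower().split()) for doc in documents]
    clusters = []
    for index, terms in enumerate(term_sets):
        for members in clusters:
            if jaccard(terms, term_sets[members[0]]) >= threshold:
                members.append(index)
                break
        else:
            # No existing cluster was similar enough.
            clusters.append([index])
    return clusters

# Hypothetical documents, invented for illustration.
documents = [
    "merger pricing agreement draft",
    "pricing agreement merger terms",
    "holiday party lunch menu",
    "lunch menu holiday schedule",
    "quarterly pricing report",
]
groups = cluster(documents)
print(groups)  # prints [[0, 1], [2, 3], [4]]
```

Note that the last document mentions "pricing" yet lands in its own group because it shares too few terms with the merger documents. That is the kind of behavior a reviewer has to keep an eye on: the grouping reflects word statistics, not relevance to the case.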
Technology Expedites Legal Review, but Machines Can’t Do it All
The most exhaustive analysis of various machine-assisted review algorithms, as opposed to traditional linear manual human review, has been done by the Text Retrieval Conference (TREC), sponsored by the National Institute of Standards and Technology (NIST). TREC results have been inconclusive about which search methods work best. But in analyzing the results, methods combining human review with automated review have had the most consistently strong results and have the potential, with the correct workflow, to far surpass traditional linear review. The results for TREC 2010 should shed more light on this matter.
Based upon the findings of TREC, it is clear that this technology is not an “easy button” that will allow review to run on autopilot. However, under the correct guidance, the potential to outperform traditional linear review is undeniable. When applying machine learning it is useful to bear in mind the no free lunch theorem,[4] which states that “a general-purpose universal optimization strategy is theoretically impossible, and the only way one strategy can outperform another is if it is specialized to the specific problem under consideration”[5]. In terms of electronic discovery, this means that the law firm, the legal service providers and the reviewing attorneys, together with the company managing the review, need to play an active part in determining which technology or model best fits the case and the data set at hand, and must remain actively involved throughout the process.
Cut Costs and Maximize Human Analysis with Predictive Analytics & E-Discovery
Predictive analytics may not be a one-stop solution, but it has fundamentally changed how document review and discovery are handled. Clear lines of communication and collaboration between all service providers and the firm need to start as early as possible, and E, D, R & M can no longer be considered discrete steps because the lines have blurred. End-to-end solutions that embrace technological advancements and leverage them to maximize human analysis and reduce the cost and time necessary for large, complex data reviews are the next evolution in this space. Running an algorithm trained by a team of expert attorneys reviewing training data costs very little relative to the cost of exponentially more human time. If the algorithm fails, the feedback loop of the human reviewer can refine it to the desired precision and recall. But when an algorithm succeeds in finding a satisfactory solution in an acceptable amount of time, a small investment has yielded a big payoff.
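For readers unfamiliar with the two measures mentioned above, precision is the share of documents the algorithm flagged that really are responsive, and recall is the share of all responsive documents that the algorithm managed to flag. A minimal Python sketch follows; the document IDs and the function name are hypothetical and exist only to show the arithmetic.

```python
def precision_recall(predicted, actual):
    """Return (precision, recall) for two sets of document IDs.

    predicted: documents the algorithm flagged as responsive
    actual:    documents a human reviewer coded as responsive
    """
    true_positives = len(predicted & actual)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(actual) if actual else 0.0
    return precision, recall

# Hypothetical review: reviewers found documents 1-4 responsive,
# while the algorithm flagged documents 2-6.
actual_responsive = {1, 2, 3, 4}
predicted_responsive = {2, 3, 4, 5, 6}

precision, recall = precision_recall(predicted_responsive, actual_responsive)
print(precision, recall)  # prints 0.6 0.75
```

Here three of the five flagged documents are truly responsive (precision 0.6), and three of the four responsive documents were found (recall 0.75). In practice the reviewer-coded set would usually be a random sample rather than the whole corpus, so both figures are estimates.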
Share your thoughts…
Have you deployed predictive analytics in an E-Discovery project? What kind of results did you see?
[1] Tukey, John (1977), Exploratory Data Analysis, Addison-Wesley.
[2] Mitchell, T. (1997). Machine Learning, McGraw Hill.
[3] Jenkins, Johnathan (2008), What can information technology do for law?, Harvard Journal of Law and Technology, Vol 21, Number 2 Spring 2008.
[4] No Free Lunch (NFL) Theorem : For any pair of algorithms a1 and a2
J. Kupcinski (September 14, 2011) #
This is an excellent description of the current e-discovery landscape! I work for a large government contractor and e-discovery services have been included in a number of recent RFPs/RFIs; however, many KOs don’t understand the complexity and diversity of models involved in choosing the right fit for their Agency. I’ll be sharing this article with my colleagues.
Cat Casey (September 14, 2011) #
John- The crossover between government contracting and the E-Discovery space seems to increase by the hour. I am pleased that this overview has offered you and some of your colleagues clarity with regards to AI/Predictive Analytics.
George Socha (September 19, 2011) #
Cat – I think you are interpreting the EDRM diagram too narrowly. As we have noted for years, the model is conceptual, non-linear, and iterative. If you take a look at our Analysis Guide (edrm.net/133) and our Search Guide (edrm.net/61), you should see the very nuance you criticize the EDRM framework for lacking. Of course, if you do not find the nuance you are looking for, please help us add it; for that, you can start at Joining EDRM (www.edrm.net/45). Thanks, George.
Cat Casey (September 20, 2011) #
George – Thank you for taking time to read and analyze the piece. You are correct. The EDRM, in the context I was writing, more accurately ought to have been described as the “traditional understanding and application of the EDRM”. The dynamic nature of ESI and the E-Discovery market has certainly necessitated that the EDRM encompass iterative search technology (as the Analysis Guide depicts clearly) as well as a nonlinear approach to the concepts.
That being said, technologies that completely bypass review, or move analysis ahead of review to the point of collection/processing, or eliminate traditional Boolean search complicate the current representation of the EDRM. On the whole the EDRM is an invaluable rubric to present the concepts of E-Discovery, and the piece was meant to analyze the impact of Machine learning, AI and the spaces in between more than to critique the EDRM. I am pleased to join the EDRM and look forward to further discussion – Cat