Predictive Analytics, Machine Learning, Predictive Coding, Automated Coding, or whatever the buzzword of the moment may be… non-linear computer-assisted or automated review is the new "it" topic in the E-Discovery space and not likely to disappear any time soon. The industry would have you believe that this cutting-edge breakthrough is a game changer that will revolutionize the world as you know it. In many ways it has changed the landscape of the E-Discovery space, rendering the tried-and-true Electronic Discovery Reference Model (EDRM) inadequate to encompass the nuances of the workflows that have emerged as a result. But somewhere, lost in the discussion of cutting-edge algorithms and lambda calculus, is one important fact: none of this is novel. It is already in use all around us every day, in medical diagnostics, customer relationship management (CRM) tools, insurance, telecommunications, and even your personal credit score.
Predictive Analytics Takes Out Some of the Guesswork
Predictive analytics as applied to E-Discovery may not be the silver bullet some companies have been touting, but if adapted and properly managed as it has been in other industries, it could rein in the cost, time, and labor that have rendered litigation almost cost-prohibitive. The seminal text on predictive analytics (then called Exploratory Data Analysis) was published over three decades ago, and although the field has found many applications since, its core tenets have remained the same. Instead of following data collection with an artificially imposed model based on case assumptions or best guesses and then trudging through the output page by page, the process begins by analyzing a sample of the data with the goal of inferring which model or algorithm is most appropriate. That model is then refined with new parameters as more of the data is analyzed, until it reaches maximum accuracy and can run with minimal or no supervision across the full body of data.
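To make the "sample first, then pick the model" idea concrete, here is a minimal sketch using scikit-learn (an assumption for illustration; the providers in this space use their own proprietary tooling, and the generated sample stands in for a coded subset of a real document population). Candidate models are scored on the sample, and the best performer is carried forward rather than imposed up front:

```python
# Sketch: infer which model fits a coded sample best before committing to it.
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier

# Stand-in for a labeled sample drawn from the full data set.
X_sample, y_sample = make_classification(n_samples=200, random_state=0)

candidates = {
    "logistic regression": LogisticRegression(max_iter=1000),
    "naive bayes": GaussianNB(),
    "decision tree": DecisionTreeClassifier(random_state=0),
}

# Cross-validate each candidate on the sample and report mean accuracy.
for name, model in candidates.items():
    score = cross_val_score(model, X_sample, y_sample, cv=5).mean()
    print(f"{name}: mean accuracy {score:.2f}")
```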
Machine Learning: It’s No Sci-Fi Fantasy
Once the concepts of predictive analytics were applied to computer-based analysis, the ambiguous creature known as "machine learning" was born. Although it sounds more like science fiction than science, machine learning is no enigma. The best-known definition comes from Tom Mitchell's 1997 textbook Machine Learning: "A computer program is said to learn from experience E with respect to some class of tasks T and performance measure P, if its performance at tasks in T, as measured by P, improves with experience E." In layman's terms, machine learning techniques emulate human cognition and learn from training examples to better perform future tasks. In the document review context, the experience E is a set of attorney coding decisions, the task T is classifying documents as responsive or non-responsive, and the performance measure P is how often the program's classifications match the attorneys'.
The Categories of Machine Learning: Unsupervised, Supervised, Semi-Supervised, and Reinforcement
The method by which the computer has an "experience," in the context of this definition, is one of the nuances that separate the dozen or so cutting-edge Electronically Stored Information (ESI) providers; the other major differentiator lies in the workflow used and how it integrates with the human side of the review. Machine learning systems generally fall into four categories, the first two of which are illustrated in the sketch following this list: Unsupervised, Supervised, Semi-Supervised, and Reinforcement Learning.
- Unsupervised Learning groups the data before any human coding, based on statistical similarities or the frequency of certain concepts or phrases (concept clustering, latent semantic indexing). This can help set the priority of a review, but it can also misdirect a user if an outlier word receives priority disproportionate to its relevance to the case.
- Supervised Learning sits at the other end of the spectrum: coding decisions made by a human are used to "train" the algorithm, which then infers how the remaining data ought to be coded or classified. Iterative review of the algorithm's inferences refines the model toward maximal accuracy.
- Semi-Supervised Learning (SSL) begins in the same manner as supervised learning, but once manual review of the algorithm's inferences has refined the model to maximal accuracy, the remainder of the corpus is analyzed in an automated fashion, with random sampling to verify accuracy.
- SSL can be either transductive or inductive, with the former prioritizing statistically similar documents and the latter inferring general rules from the training data to be applied across the corpus.
- Reinforcement Learning is similar to the traditional SSL model, but rather than data analysis followed by testing of the results, this model modifies how the algorithm is applied to the entire body of data continuously throughout the review. Absent prioritization, this could prove risky if a key set of documents sits toward the end of the document population.
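The sketch below illustrates the unsupervised and supervised categories side by side, again assuming scikit-learn for illustration; the document texts and coding labels are purely hypothetical, and real providers operate on vastly larger corpora with proprietary implementations:

```python
# Sketch: unsupervised clustering vs. supervised coding on toy documents.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans
from sklearn.linear_model import LogisticRegression

documents = [
    "merger agreement draft attached for your review",
    "lunch menu for the friday offsite",
    "quarterly earnings restatement discussion",
    "fantasy football league standings update",
]

vectorizer = TfidfVectorizer()
vectors = vectorizer.fit_transform(documents)

# Unsupervised: group documents by statistical similarity, no labels required.
clusters = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(vectors)
print("cluster assignments:", clusters)

# Supervised: train on attorney coding decisions, then infer codes for new docs.
attorney_codes = [1, 0, 1, 0]  # 1 = responsive, 0 = non-responsive
model = LogisticRegression().fit(vectors, attorney_codes)
new_docs = ["board meeting on the proposed merger", "office picnic signup sheet"]
print("inferred codes:", model.predict(vectorizer.transform(new_docs)))
```

The key difference is visible in the inputs: the clustering step sees only the documents themselves, while the classifier also consumes the attorneys' decisions and projects them onto unseen material.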
Technology Expedites Legal Review, but Machines Can't Do It All
The most exhaustive analysis of various machine-assisted review algorithms, as measured against traditional linear manual review, has been done by the Text REtrieval Conference (TREC), sponsored by the National Institute of Standards and Technology (NIST). TREC results have been inconclusive about which search methods work best, but methods combining human review with automated review have shown the most consistently strong results and, with the correct workflow, have the potential to far surpass traditional linear review. The results from TREC 2010 should shed more light on this matter.
Based upon the findings of TREC, it is clear that this technology is not an "easy button" that will allow review to run on autopilot. Under the correct guidance, however, its potential to outperform traditional linear review is undeniable. When applying machine learning, it is useful to bear in mind the No Free Lunch theorem, which holds that "a general-purpose universal optimization strategy is theoretically impossible, and the only way one strategy can outperform another is if it is specialized to the specific problem under consideration." In terms of electronic discovery, this means that the law firm, the legal service providers, and the reviewing attorneys, in conjunction with the company managing the matter, must play an active part in determining which technology or model best fits the case and the data set, and must remain actively involved throughout the process.
Cut Costs and Maximize Human Analysis with Predictive Analytics & E-Discovery
Predictive analytics may not be a one-stop solution, but it has fundamentally changed how document review and discovery are handled. Clear lines of communication and collaboration between all service providers and the firm need to start as early as possible, and the E, D, R, and M phases can no longer be considered discrete, separate steps, because the lines between them have blurred. End-to-end solutions that embrace these technological advancements and leverage them to maximize human analysis while reducing the cost and time of complex, large-data reviews are the next evolution in this space. Running an algorithm trained by a team of expert attorneys on sample data costs very little relative to the cost of exponentially more human review time. If the algorithm falls short, the human reviewers' feedback loop can refine it to the desired precision and recall; when it succeeds in finding a satisfactory solution in an acceptable amount of time, a small investment has yielded a big payoff.
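A minimal sketch of that feedback check, again assuming scikit-learn, might look like the following; the threshold values are hypothetical and would be set per matter, and the sample codes are illustrative:

```python
# Sketch: validate model codes against attorney decisions on a review sample.
from sklearn.metrics import precision_score, recall_score

TARGET_PRECISION = 0.85  # illustrative thresholds, chosen per matter
TARGET_RECALL = 0.75

def meets_targets(attorney_codes, model_codes):
    """Return True once the model's precision and recall both hit target."""
    precision = precision_score(attorney_codes, model_codes)
    recall = recall_score(attorney_codes, model_codes)
    print(f"precision={precision:.2f} recall={recall:.2f}")
    return precision >= TARGET_PRECISION and recall >= TARGET_RECALL

# Example: attorneys coded docs 1, 3, 4 responsive; the model flagged 1 and 3.
# Precision is perfect but recall falls short, so another training round runs.
print(meets_targets([1, 0, 1, 1, 0], [1, 0, 1, 0, 0]))
```

Each time the check fails, the newest attorney decisions are folded back into the training set and the model is retrained, which is the feedback loop described above.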
Share your thoughts…
Have you deployed predictive analytics in an E-Discovery project? What kind of results did you see?
No Free Lunch (NFL) Theorem: For any pair of algorithms a1 and a2, their performance averaged over all possible problems is identical (Wolpert & Macready, 1997).
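For the formally inclined, the statement from Wolpert and Macready's 1997 paper "No Free Lunch Theorems for Optimization" can be written as follows, where $d^{y}_{m}$ denotes the sequence of cost values produced after $m$ evaluations of an objective function $f$:

```latex
\sum_{f} P\left(d^{y}_{m} \mid f, m, a_{1}\right)
  = \sum_{f} P\left(d^{y}_{m} \mid f, m, a_{2}\right)
```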