Gregory L Fordham
(Last Updated December 2013)
The emergence of digital evidence and the widespread implementation of e-discovery has brought both benefit and bane to the legal profession. In many respects, digital evidence has proven to be a better truth detector than its paper counterpart. At the same time, the technical nature and volume at which digital evidence exists makes time tested discovery techniques impractical. In fact, so significant are the technological differences between paper and digital evidence that even the handling procedures require considerable overhaul.
Clearly with the volume of digital evidence in many modern litigations, it simply is not practical to take a “boots on the ground” approach to document review and analysis. Certainly the volumes of data make it commercially impractical to use anything other than computerized techniques. Moreover, many other fields of human activity have demonstrated that the weak link in the chain is often the human element. As a result, automation and statistical sampling techniques are often used in other fields not only for economic reasons but for increased accuracy reasons as well.
In fact, the weakness in the once desired human element of document review and analysis is even reflected in Practice Point 1 of the Sedona Conference’s best practice commentary on search and retrieval methods in electronic discovery. Practice Point 1 states that,
In many settings involving electronically stored information, reliance solely on a manual search process for the purpose of finding responsive documents may be infeasible or unwarranted. In such cases, the use of automated search methods should be viewed as reasonable, valuable, and even necessary.
For all of the above reasons, computerized search and document review techniques have become widespread. Furthermore, their use will likely continue to become more prevalent. Practitioners who have not used them in the past will be forced to implement these technologies and techniques as digital evidence and e-discovery force them to forego the traditional “boots on the ground” approach.
In the sections that follow, the author examines various computerized search methods as well as other digital litigation strategies with which litigators should be familiar and ready to employ in their cases as needed.
In examining computer search technology, it is best to first understand the search problem. The most often cited example illustrating the search problem is the 1985 Blair and Maron Study. The Blair and Maron Study is best known for its examination of a litigation matter involving a Bay Area Rapid Transit (BART) System vehicle that failed to stop at the end of the line.
The litigation team involved attorneys and paralegals experienced in complex litigation and document management. While the case clearly occurred prior to the ESI of today, it did involve a computerized document management system with full text retrieval capability.
The litigation team believed that it had been able to find more than 75 percent of the relevant documents. The study, however, revealed that their actual recall was only about 20 percent. Further analysis revealed that linguistic issues were a significant contributor to the low recall rate.
Blair and Maron found that the words used by the two sides to refer to the relevant issues were entirely different. For example, defendants referred to the accident as “the unfortunate accident”. Plaintiffs, on the other hand, referred to it as a “disaster”. Third parties like witnesses or vendors used terms like the “event”, “incident”, “situation”, “problem” or “difficulty”. In the end, the linguistic differences were far greater than the legal team realized, and this underestimate adversely affected their work.
In more recent times other groups have also studied document review success rates and even compared different methods. The Text Retrieval Conference (TREC) sponsored by the National Institute of Standards and Technology (NIST) has studied document retrieval issues at many conferences.
Document retrieval effectiveness is typically studied from two perspectives. The first is precision while the second is recall. Precision measures what proportion of the retrieved documents are actually relevant to the subject of the search. Recall measures what proportion of the relevant documents in the population were actually retrieved.
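To make the two measures concrete, the following is a minimal sketch in Python showing how precision and recall would be computed once the retrieved set and the set of truly relevant documents are known. The document identifiers are hypothetical.

```python
# Minimal sketch: computing precision and recall for a document retrieval run.
# The document identifiers are hypothetical; in practice these would be the IDs
# of documents returned by a search and those judged relevant by reviewers.

retrieved = {"DOC-001", "DOC-002", "DOC-003", "DOC-004", "DOC-005"}
relevant  = {"DOC-002", "DOC-004", "DOC-006", "DOC-007"}

true_positives = retrieved & relevant               # relevant documents actually found
precision = len(true_positives) / len(retrieved)    # how much of what was found is relevant
recall = len(true_positives) / len(relevant)        # how much of the relevant material was found

print(f"Precision: {precision:.1%}")   # 40.0%
print(f"Recall:    {recall:.1%}")      # 50.0%
```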
In 2009 the TREC conference compared results of manual document review with technology assisted review. Technology assisted review can encompass all kinds of search retrieval technologies.
The results revealed that technology assisted review outperformed manual reviewers by a significant amount. More specifically, manual reviewers had average recall rates of 59.3 percent while technology assisted reviewers had rates of 76.7 percent. With respect to precision, manual reviewers had average rates of 31.7 percent while technology assisted reviewers had average rates of 84.7 percent.
Results such as the 2009 TREC analysis suggest that technology assisted review is superior to manual review. So, even without consideration of the economic aspects of technology assisted review, there are quality and performance reasons for adopting it as well, and these tend to demonstrate clearly the superiority of technology assisted review over the old manual review approach.
Finding the documents is not the only problem. Another significant problem is interpreting them. In other words, whether a document is responsive or relevant is often a subjective determination and can depend on the reviewer making that determination.
The consistency of document disposition between different reviewers or methods can also be measured using overlap. Overlap is the number of documents that receive identical dispositions from different reviewers. In other words, it is the intersection of the document populations selected by different reviewers.
Several different studies have found the overlap percentages between document reviewers performing manual review range between 15 and 49 percent. Even at the higher percentage this means that there are significant differences between manual reviewers. So, computerized search not only hopes to bring greater economy but also consistency and repeatability to the retrieval problem.
The problems with computerized search are not limited to linguistics or even disposition. There are technology issues as well. The primary obstacle is that document text must be “readable” by the search technology. When documents are not in machine readable text they must be converted to machine readable text. This critical element is sometimes overlooked by litigators, particularly in the digital world where ESI is concerned.
The challenge for machine readable text is not just the difference between images and text. Sometimes it also involves document format. For example, a compressed archive (zip file) may contain machine readable text documents but can the search engine detect a compressed archive and open it such that its contents are readable?
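As a simple illustration of the archive problem, the sketch below uses Python's standard zipfile module to open a compressed archive and expose its text members so that a search engine can read them. The file names and the indexing hand-off are hypothetical, and a real collection would also need converters for Office formats, PDFs, nested archives and the like.

```python
# Sketch: making the contents of a compressed archive available to a search engine.
# Assumes the archive members of interest are plain-text files.

import zipfile
from pathlib import Path

def extract_searchable_text(path: Path) -> dict:
    """Return {member_name: text} for text-like members of a zip archive."""
    texts = {}
    if zipfile.is_zipfile(path):
        with zipfile.ZipFile(path) as archive:
            for member in archive.namelist():
                if member.lower().endswith((".txt", ".csv", ".htm", ".html")):
                    raw = archive.read(member)
                    texts[member] = raw.decode("utf-8", errors="replace")
    return texts

# Hypothetical usage:
# for name, text in extract_searchable_text(Path("custodian_files.zip")).items():
#     index_document(name, text)   # hand off to whatever indexing routine is in use
```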
Similarly, the document could be in a format like XML that, while storing the data in plain text, places it in different internal document containers. A search engine, particularly one using proximity locators, may need to be able to interpret the document in its final presentation form in order to assess whether or not it meets the search criteria. Clearly, it is imperative for the users of computer search technology to understand the capabilities of their search engine and ensure that it is actually capable of performing the task it has been asked to perform.
When searching for a solution to these various problems the question then becomes what technology should be used. There are actually many different technologies that can be used to perform technology assisted review. The following sections discuss keyword search, context search and predictive coding.
Keyword search is probably the best known and easiest to implement of the computerized search technologies. Keyword search tools generally come in two flavors: indexed and non-indexed.
An indexed keyword search tool relies on the index that it creates in order to locate documents containing the search terms. Non-indexed keyword search tools scan through the documents with each search iteration to determine whether or not the terms exist.
An indexed search tool will need time to construct the index, but once it has been built searches can be run very quickly. Non-indexed search tools do not need to build an index, but for each search iteration they must traverse the entire document population. Considering that keyword searches will often require multiple iterations, it is usually best to use indexed search tools.
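The difference between the two flavors can be illustrated with a minimal sketch; the tiny document set is hypothetical, and real tools index far more than whitespace-separated words.

```python
# Sketch: the difference between a non-indexed (scan) search and an indexed search.
# The "documents" dictionary stands in for a real document collection.

from collections import defaultdict

documents = {
    "doc1": "the brakes failed before the end of the line",
    "doc2": "routine maintenance was performed on the vehicle",
    "doc3": "the incident report described the brake failure",
}

# Non-indexed: every query re-reads every document.
def scan_search(term: str) -> list:
    return [doc_id for doc_id, text in documents.items() if term in text.split()]

# Indexed: build the index once, then each query is a dictionary lookup.
index = defaultdict(set)
for doc_id, text in documents.items():
    for word in text.split():
        index[word].add(doc_id)

def indexed_search(term: str) -> set:
    return index.get(term, set())

print(scan_search("brakes"))     # ['doc1']
print(indexed_search("brakes"))  # {'doc1'}
```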
Keyword search has the highest precision rate of the computerized search technologies. It also tends to have the lowest recall rates of the computerized search methods. Thus, it will accurately find documents containing the search criteria, but those documents may not actually have any relevance to what the user was looking for.
There is a caveat to the recall rate of keyword searches, however. With keyword searches, recall can be a function of the keyword search. In other words, better constructed keyword searches will provide higher recall rates than poorly constructed keyword searches.
Single term searches will have the lowest recall rates while more complex searches such as those with Boolean connectors and proximity locators will produce higher recall rates. The Boolean connectors are not only a means to provide additional discriminators to reduce false positives but they also allow consideration of linguistic differences in the search criteria. Other features like stemming and wildcards can also improve the recall rate of a keyword search.
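As an illustration only, the sketch below approximates a wildcard (stem) search and a proximity search using regular expressions; commercial search engines implement these operators natively and with their own query syntax, so this is simply a way to see what the operators do.

```python
# Sketch: approximating stemming/wildcard and proximity operators with regular
# expressions, as a more sophisticated alternative to single-term matching.

import re

text = ("The unfortunate accident occurred when the vehicle failed to stop. "
        "Witnesses described the braking problem in detail.")

# Wildcard/stem: brak* matches brake, braking, brakes, etc.
stem_hits = re.findall(r"\bbrak\w*\b", text, flags=re.IGNORECASE)

# Proximity: "vehicle" within roughly five words of "stop" (in either order).
proximity = re.search(
    r"\b(vehicle)\W+(?:\w+\W+){0,5}?(stop)\b|\b(stop)\W+(?:\w+\W+){0,5}?(vehicle)\b",
    text, flags=re.IGNORECASE)

print(stem_hits)        # ['braking']
print(bool(proximity))  # True
```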
Thus, the problems with keyword searches are often the linguistic issues as highlighted by the Blair and Maron study discussed previously. These limitations can be overcome, however, with more sophisticated use of keyword search technology.
Perhaps the best model of more sophisticated keyword search technology is Google’s search engine. It is not just simple keywords. In addition, it adds in features like stemming, fuzziness for misspellings and even relevancy ranking when only some of the search terms are present.
If there has been a failure of keyword search in legal applications, it has come from two causes. The first is the single term, fire and forget approach. In other words, the search is overly broad, built from single word search terms, and makes inadequate use of more sophisticated criteria such as stemming, wildcards, Boolean connectors and proximity locators. In addition, the single terms lack adequate linguistic analysis to determine whether there are alternative terms that should be used.
The second failure is a lack of testing of search term performance in an effort to improve it, or to assess through sampling or other means that false negatives have been avoided and false positives minimized. Indexed search engines permit fast iterations of various search criteria. When the results are exported to tabular reports identifying attributes like the file name, location, date stamps, search term and about 100 characters on either side of the term for context, keyword search users are able to review and assess the effectiveness of the criteria and make adjustments based on what they learn.
Both of these failures, overly simplistic search terms and untested search terms, are tremendously inefficient at finding the documents of interest. The inefficiency is even further compounded when the results are then subjected to full scale manual document review. While keyword search users may think that developing the terms is simple, there is a big price to pay with what comes next. Thus, the best approach to keyword search combines more sophisticated search terms with testing of those terms in order to validate that the results are actually what is desired.
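A minimal sketch of such a testing workflow appears below. It assumes the document text has already been extracted to plain-text files, and the search terms, paths and 100-character context window are illustrative only.

```python
# Sketch: exporting keyword hits with surrounding context to a tabular report so
# that search terms can be tested and refined. File paths and terms are hypothetical.

import csv
import re
from pathlib import Path

SEARCH_TERMS = ["accident", "incident", "disaster"]
CONTEXT = 100  # characters of context on either side of the hit

def report_hits(files: list, out_csv: Path) -> None:
    with out_csv.open("w", newline="", encoding="utf-8") as fh:
        writer = csv.writer(fh)
        writer.writerow(["file", "term", "offset", "context"])
        for path in files:
            text = path.read_text(encoding="utf-8", errors="replace")
            for term in SEARCH_TERMS:
                for match in re.finditer(re.escape(term), text, re.IGNORECASE):
                    start = max(match.start() - CONTEXT, 0)
                    end = match.end() + CONTEXT
                    writer.writerow([path.name, term, match.start(),
                                     text[start:end].replace("\n", " ")])

# Hypothetical usage:
# report_hits(list(Path("extracted_text").glob("*.txt")), Path("hit_report.csv"))
```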
Keyword search methods can be further bolstered with statistical sampling. In other words, after searches are run and the populations divided into their related groupings, samples can be taken and reviewed as a quality assurance measure to confirm that the results are as expected. This could be especially useful when documents are cleared after privilege review. Since keyword search has such good precision results, the documents not containing the search terms could be reviewed using acceptance sampling in order to confirm the validity of the results.
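For illustration, the sketch below uses the familiar normal-approximation formula for estimating a proportion to size such a quality assurance sample. Real acceptance sampling plans may use different statistical models, and the confidence and margin figures shown are examples only.

```python
# Sketch: a simplified sample-size calculation (normal approximation for a
# proportion) for checking a set of documents that a keyword search classified
# as non-responsive. The 95% confidence / 5% margin figures are illustrative.

import math

def sample_size(confidence_z: float = 1.96, margin: float = 0.05,
                expected_rate: float = 0.5) -> int:
    """Documents to sample to estimate an error rate within +/- margin."""
    n = (confidence_z ** 2) * expected_rate * (1 - expected_rate) / margin ** 2
    return math.ceil(n)

print(sample_size())              # 385 documents at 95% confidence, +/-5% margin
print(sample_size(margin=0.02))   # 2401 documents at 95% confidence, +/-2% margin
```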
For some, the silver bullet for the shortcomings of keyword search has been context search. Simply stated, context search adds context to the search criteria. For example, when one searches for jaguars, is one searching for football teams and their players, automobiles or large cats in the wild?
In essence, context search improves the keyword search solution by addressing the problems present in keyword searches related to synonymy and polysemy. Synonymy is a common linguistic issue where different words are used to express the same concept. Polysemy is where the same word can have different meanings such as the jaguar example mentioned above.
Context search solves both problems by building on the traditional keyword index search model. More specifically, the index process of a context search engine collects more data, commonly called vector data, about a document’s terms, such as their location within the document (headers, titles, page or paragraph text) and their relation to other terms. The additional data, along with synonym expansion, fuzzy logic and stemming technology, are used in an algorithm to identify and usually rank the documents having the best match to the likely search objectives.
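The general flavor of a vector-based approach can be sketched with term weighting and cosine similarity, as below. The sketch assumes the scikit-learn library is available, and the documents, query and ranking are purely illustrative; commercial context engines layer synonym libraries, positional data and proprietary algorithms on top of this basic idea.

```python
# Sketch of a vector-based ("concept") retrieval step, assuming scikit-learn is
# available. Documents and the query are hypothetical.

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

documents = [
    "The Jaguar sedan was recalled for a braking defect.",
    "The jaguar is a large cat native to the Americas.",
    "The Jacksonville Jaguars signed a new quarterback.",
]
query = ["jaguar braking defect recall"]

vectorizer = TfidfVectorizer(stop_words="english")
doc_vectors = vectorizer.fit_transform(documents)    # weight terms in each document
query_vector = vectorizer.transform(query)           # weight the query the same way

scores = cosine_similarity(query_vector, doc_vectors)[0]
for score, doc in sorted(zip(scores, documents), reverse=True):
    print(f"{score:.2f}  {doc}")                     # ranked list, best match first
```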
So, the context search technology bundles numerous advanced techniques in a simple user interface. Remarkably, its simple black box approach can also be its shortcoming. After all, the additional components, such as the extra data captured during the indexing process, the synonym library and the matching algorithm, are usually closely guarded secrets. Furthermore, they cannot be modified by the user.
As mentioned previously, concepts like relevancy and responsiveness can be quite subjective. What might be responsive or relevant to one person is non-responsive or irrelevant to another. Yet, that determination of responsiveness or relevancy is controlled by the algorithm used in the context search engine.
In the final analysis, context search is a sophisticated keyword search. The real difference is the extent to which features like synonym usage, relevancy ranking and concept discrimination have been programmatically scripted into the search engine functionality. The same capability is available with traditional keyword search tools but the implementation is totally dependent on the user’s skill to incorporate those features in the keyword search query.
Since the user cannot modify the algorithms used by the context search tool, it is possible for sophisticated users of sophisticated keyword search tools to even exceed the capability of context tools. After all, relevance and significance are subjective determinations.
What context tools offer is a sophisticated capability for less sophisticated users. What they do not offer is a silver bullet solution. They have limits. Furthermore, users are likely not to know how to assess those limits since their machinery is hidden “under the hood”. Of course, sometimes this can be useful when negotiating search protocols, since one party may be more motivated to impede the search process than to promote it. In those cases, a context search engine may make it more difficult for such an opponent to game the search process.
The latest search solution is predictive coding. Predictive coding is not a search tool in the more traditional sense, since it is not really term based. Rather, predictive coding employs statistical concepts to measure a document’s contents. Those measures are then captured in a baseline set of documents commonly called the training set. Those baselines are then applied against the scores of other documents in the population to determine similar or dissimilar documents.
The overall process is comprised of several steps, ranging from selecting and scoring a baseline training set of documents, to applying those baselines against the remainder of the population, to validating the accuracy of the results.
Statistical techniques are heavily employed throughout the process. Statistical theory is used when determining the number of documents to be used in the baseline set of documents, again when applying the baseline documents to the population, and finally again when assessing the accuracy of the results.
The following table illustrates how predictive coding works. After the training documents are manually reviewed and scored the document text is read by the computer and a fingerprint of sorts is developed. The fingerprint ignores “junk” words also known as “stop” words like “the”, “an”, “is”, “on”, “a”, “this”, “that” and others and only considers the more significant terms. Depending on the application, users may be able to edit the list of “junk”/”stop” words.
After eliminating the “junk” words the documents are scored by cataloging the specific words and the count of their occurrence in each document. The scoring of a document in terms of its significant words and their counts provides a kind of fingerprint for that document. Once the scores or fingerprints of all of the training set documents are determined, statistical values can be calculated and used to evaluate other documents in the population. The score of the training set documents is compared to the score of each document in the population, and a difference is computed that quantifies how different each population document is from the training set documents.
| TRAINING DOCUMENT | | | POPULATION DOCUMENT | |
| --- | --- | --- | --- | --- |
| TERM | COUNT | | COUNT | TERM |
| Word 1 | 3 | | 1 | Word 1 |
| Word 2 | 2 | | | |
| Word 3 | 4 | | 3 | Word 3 |
| Word 4 | 1 | | | |
| Word 5 | 3 | | 5 | Word 5 |
| | | | 3 | Word 6 |
| | | | 4 | Word 7 |
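Using the counts from the table above, the sketch below shows one simple way the two fingerprints could be compared, here with a cosine similarity. Actual predictive coding engines use more elaborate statistical models, so this is illustrative only.

```python
# Sketch: comparing the "fingerprints" from the table above. Each document is
# reduced to counts of its significant terms; a simple cosine similarity then
# quantifies how close a population document is to the training document.

import math

training   = {"Word 1": 3, "Word 2": 2, "Word 3": 4, "Word 4": 1, "Word 5": 3}
population = {"Word 1": 1, "Word 3": 3, "Word 5": 5, "Word 6": 3, "Word 7": 4}

def cosine(a: dict, b: dict) -> float:
    shared = set(a) & set(b)
    dot = sum(a[t] * b[t] for t in shared)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b)

score = cosine(training, population)
print(f"Similarity: {score:.2f}")   # values closer to 1.0 resemble the training set
```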
Predictive coding cannot be used for all kinds of document review. There are some document types that it cannot handle or will not handle well.
Clearly, predictive coding is not simply application of a magic technology. It is not a simple push of a button. Rather, it is a process that may incorporate some automated technology; yet, much of the process involves manual review for the evaluation of the baseline documents and verification of the result sets.
So, there is considerable overhead when using predictive coding technology. Consequently, it is probably best suited for extremely large document sets and may have a much smaller payoff for smaller and even more garden variety cases.
The extra overhead is also not its only drawback. Indeed, it is very technical and there will be a cost associated with the expertise necessary to pull it off.
Besides economy and reliability, another of the attractive attributes of predictive coding and other technology assisted review techniques is repeatability. Yet predictive coding can have variability as a result of the statistical techniques actually employed when applying the baseline documents to the population. Indeed, there are quite a number of different statistical, machine learning models that can be employed.
While any one of these models should produce repeatable results within itself, different vendors could be using different models, and thus different results could be produced by different products.
The science at the heart of predictive coding technology is often quite old. Its more recent appearance can be attributed to the computing power required to run the calculations; those calculations simply were not practical until more recent times and more powerful computers.
In more recent times, the technology has found applications other than document review for litigation. For example, the technology used by many spam filters is the same kind of feature identification technology used in predictive coding for document review in litigation.
Predictive coding could have value in litigation situations other than traditional document review. In trade secrets cases, for example, it is always of interest whether sensitive documents have appeared in the wrong hands. Traditional ways of spotting such documents have involved the same hash calculations used to identify duplicate documents in a population. All too often, however, sensitive documents are altered once they migrate to the new employer. The changes could be small, such as a header or the inclusion of additional data; yet these changes negate hash based comparisons.
Other technologies such as fuzzy hashing have been developed to solve the problem in trade secret cases where sensitive documents and data have been changed. Predictive coding provides another option. Baseline document sets could be developed of original owner documents and then the same predictive coding methods used to compare those baseline sets to document populations in order to find near duplicates or sensitive documents with slight changes.
Like the other technology assisted review methods, predictive coding is not without its issues. First, one must find the training documents, and a suitable number must be selected. In addition, considerable effort can be expended evaluating those documents since they are key to the coding of the remaining population.
Second, there are two competing aspects to any statistical approach: confidence and precision. To have high confidence often means accepting a broader precision range. To have both high confidence and narrow precision, one must have larger samples or more homogeneous populations. Achieving more homogeneous populations can require separating the document population into similar document types. For example, it may require separating e-mail messages from other text documents.
In addition, it could also require developing baseline sets that are more focused on specific issues. A quantum claim, for example, has three parts: liability, causation and quantum. Similarly, in a construction case there could be multiple causes of claims.
In statistical theory, stratification provides a means to bring higher confidence and greater precision while using smaller sample sizes than if one large sample had been taken. Since the overall goal with predictive coding is both economy and reliability, developing separate baseline sets would be overall more effective than having a single baseline set.
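A minimal sketch of proportionally allocating a sample across such strata appears below. The population counts and total sample size are hypothetical, and other allocation schemes (for example, allocating by expected variability) are equally possible.

```python
# Sketch: proportionally allocating a sample across strata (e.g., e-mail versus
# other documents, or separate claim issues) so that each more homogeneous group
# is sampled separately. The counts and the total sample size are illustrative.

strata = {"e-mail": 600_000, "office documents": 300_000, "other text": 100_000}
total_sample = 1_200

population = sum(strata.values())
allocation = {name: round(total_sample * count / population)
              for name, count in strata.items()}

for name, n in allocation.items():
    print(f"{name}: sample {n} of {strata[name]:,}")
```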
Third, manual review is still a component of the process, both in developing the baseline training sets and in assessing the final results. Since manual review is inherently unreliable and inconsistent, the reliance on manual effort to identify baseline sets and evaluate performance could introduce a significant defect into the methodology.
Fourth, the methodology is considerably complex once statistical theory is factored into the equation. The added complexity presents a considerable target for an opponent. If the opponent is successful in finding a soft underbelly, all that may remain is a scientifically quantifiable measure of incompetence.
The bottom line is that predictive coding is an accepted methodology for document review and disposition. But then it relies on scientific principles that have been accepted in all kinds of other disciplines, like DNA and fingerprint analysis and even document deduping with algorithms like the MD5 hash. All of these methods rely on statistics to develop probabilities that are persuasive even though not absolute.
Whatever it is or ultimately may be, predictive coding is not an “easy” button or silver bullet. It offers increased economy and reliability over manual review, but it is also like taking a swim in deep water with strong currents. It really brings to life the old saying, “swimming with the sharks”.
The various search technologies are not the only ones that should be incorporated as part of a modern litigation strategy. Indeed, every phase of the process is subject to technology and its benefits. The following sections review various subjects that should also be incorporated in the litigation strategy.
Litigators should preserve electronic evidence in a forensically sound manner. This means capturing the data in a fashion that would not alter the potential evidence including its metadata. It also means capturing the full spectrum of data, which includes active and deleted data.
While a party may not ultimately have to produce the data it preserves, a party must still preserve relevant evidence even if it is not accessible. The new rules do not alter any of the prior statutory and common law duties.
It is not just the active data that needs to be preserved. Deleted data and all forms of metadata should be preserved as well. Consequently, the preservation effort is best focused at the storage media such as hard drives and backup tapes and not the data itself. If the media is preserved, everything on it will also be preserved.
Fortunately, preservation can be done rather economically. It is not necessary to analyze the data during preservation. Rather, it only needs to be preserved for future analysis.
Forensic preservation has other practical advantages. For example if the thirteen offending Plaintiffs in Pension Committee had performed forensic preservations, such as hard drive imaging, they could have returned to those images when questions about the adequacy of the collection and production process surfaced. After all, once the media is preserved any concerns related to the initial search and harvesting can be revisited for another bite at the apple.
The same concept applies to the backup tapes. If the tapes are preserved, they can be reviewed if needed. Yet Judge Scheindlin goes to great lengths to discourage backup tape preservation unless they are the sole source of relevant information. But how can one know for sure without examining them? And examining them is where the real costs will be incurred.
While backup tapes may contain highly redundant data, the chance that they contain only duplicative data found elsewhere is highly remote. After all, their very existence, along with the cycle in which they are used, is premised on the belief that there is something different within the data population worthy of protection.
Furthermore, preservation of a backup tape can be accomplished by simply taking possession of it. So, if it meets the general criteria of applicable time periods and relevant media, why be so exclusive?
In fact, preservation is actually economical. It is the analysis that is expensive. So, why not preserve broadly and produce narrowly?
Yet by not performing forensic preservation and by not preserving backup tapes the parties in Pension Committee were subjected to expensive and distracting motion practice about sanctions, which is exactly what Judge Scheindlin bemoaned.
Preservation is often quite simple and not as expensive or time consuming as many might believe. The important part is to get the data preserved and then the more expensive and time consuming part, the analysis, can be performed in accordance with an appropriate plan.
This is a good place to remember that the preservation phase is not the place to cut corners. Preservation is typically not that expansive or expensive. So, there is really not a lot to be gained by cutting corners during the preservation phase. Also, everything that comes later depends on how well the preservation phase was performed. Once it has passed, the data can never get any better.
During the preservation phase every case should be treated as if it will end up in court. It is easier to regard the computer as evidence from the start and ease up on the subsequent evidentiary analysis if it is determined that there is no substance to the issue. The opposite approach, however, rarely works. So, it is best not to start working with the computer data in a casual manner and only then realize that there is a problem. By that time it is often too late to start treating the data as if it were evidence. Instead, treat it as evidence from the start and practice good preservation methods.
The particular techniques that should be employed will depend on whether the data to be preserved is located on a read-only device, a read-write device, within specialty applications or on archival and disaster recovery systems.
Volume reduction is a significant goal for both increased economy and reliability. Smaller data sizes mean less review as well as a greater chance that consistent disposition efforts can be employed.
One method of volume reduction is the removal of known files, also called de-NISTing. The de-NISTing name comes from the use of a database of known software files published by the National Institute of Standards and Technology (NIST) called the National Software Reference Library (NSRL).
The NSRL is, “a repository of known software, file profiles, and file signatures for use by law enforcement and other organizations in computer forensics investigations.” The data from which the NSRL is created is obtained through purchase or by donation from the original software publishers. The database contains information like the application name, the file name and its signature (digital hash). E-discovery vendors use the database to identify those files in their own population of documents that can be excluded from further consideration.
For example, the file types of interest in a particular matter may include text files and spreadsheets. Within a particular spreadsheet application there will be text files related to licensing and installation as well as spreadsheet files themselves that are part of a demonstration or tutorial library contained within the application. In addition there may be PDF files that are reference manuals about the application’s operation and usage. With the NSRL, these files can be identified and excluded from further consideration in an e-discovery project.
While most of the files contained within an application are usually program executables and other software files that would never be selected for consideration in most e-discovery projects, a small number of files could meet the selection parameters. When these few files are multiplied by the number of software applications on a custodian’s computer hard drive, the number of excluded files can be significant. So, the ability to identify and remove the known files from an e-discovery data population is another discriminator that can be used to evaluate a vendor’s rate for overall lowest cost.
De-NISTing is accomplished by matching files based on their digital fingerprints, typically their MD5 or SHA1 hash values. While a litigation database may be able to perform the matching after the documents are processed and the database loaded, a vendor could perform the matching before the documents are ever extracted from the preserved media.
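The mechanics can be sketched as below. The known-hash set stands in for values drawn from the NSRL, which is distributed as large data sets that would be loaded from disk, and the loader shown in the usage comment is hypothetical.

```python
# Sketch: excluding known files by hash ("de-NISTing"). The known_hashes set
# stands in for values drawn from the NIST NSRL.

import hashlib
from pathlib import Path

def sha1_of(path: Path) -> str:
    digest = hashlib.sha1()
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest().upper()

def denist(paths: list, known_hashes: set) -> list:
    """Return only those files whose hash is NOT in the known-file set."""
    return [p for p in paths if sha1_of(p) not in known_hashes]

# Hypothetical usage:
# known = load_nsrl_hashes("NSRLFile.txt")   # assumed loader, not shown here
# files = [p for p in Path("collection").rglob("*") if p.is_file()]
# review_set = denist(files, known)
```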
Traditionally, litigators like to cast a wide net during discovery. In the era of digital evidence, however, there can be unintended consequences to casting that wide net.
After all, a cavalier cast in the ocean of digital evidence could just as easily return the trophy catch as it could return useless salvage.
When fishing the ocean of digital evidence, multi-stage discovery is a technique that can be used to reduce risk and increase efficiency. Under this technique there are two basic approaches.
The first is to collect the low hanging fruit before proceeding to the harder to reach and likely more expensive digital evidence. In fact, the hope is that there is never a need to harvest the harder to reach fruit.
Clearly the first approach is intended to satisfy the accessible versus inaccessible requirements of Rule 26(b)(2)(B).
The second approach is to recognize that not all the trees in the forest even have fruit to be harvested. So, it is not even a matter of low hanging versus hard to reach. It is simply a matter of likely benefit.
Clearly this approach is intended to address the situation where there may be numerous sources of digital evidence. In the case of backup tapes, for example, some may be more likely to have items of interest than others. A similar situation could occur in large populations of other media types like computer hard drives or external storage media.
The prospect of multi-stage or multi-tier discovery has authority in the 2006 changes to the FRCP. In the comments to Rule 26(b)(2) the Committee noted that,
“A party may have a large amount of information on sources or in forms that may be responsive to discovery requests, but would require recovery, restoration, or translation before it could be located, retrieved, reviewed, or produced. At the same time, more easily accessed sources–whether computer-based, paper or human–may yield all the information that is reasonably useful for the action. Lawyers sophisticated in these problems are developing a two-tier practice in which they first sort through the information that can be provided from easily accessed sources and then determine whether it is necessary to search the difficult-to-access sources.”
Although litigators are learning the various culling techniques for sifting through the catch once it is landed, like de-duping and keyword search techniques, the practice of landing the entire catch in one cast of the net and then pursuing full scale evidence processing requires a bigger net, a bigger boat and a bigger crew along with their associated costs.
Certainly larger volumes offer greater processing efficiency than smaller volumes. Even if the marginal cost of processing smaller catches is larger than a single large catch, the increased marginal costs could be offset should less overall processing prove necessary.
An iterative casting approach can have other benefits as well. Many smaller casts allow the litigator to prototype its processing method and prove its discovery plan before proceeding to full scale production. Such an approach could eliminate the need for re-processing the catch if shortcomings are detected before considerable budget is invested.
It should be understood that the multi-stage approach is intended for discovery and not for the preservation stage of a matter. While production can proceed in stages, it is still important for preservation to be broad based and done at once.
Since the data never gets any better, it is important to conduct the preservation as soon as possible. Simply continuing to use a computer can destroy potential evidence.
It is also important that, in most cases, the preservation effort focus on the media. If the media is preserved then everything that could potentially be needed later has been captured.
In fact, it is a broad based and media based preservation effort that so easily permits a multi-stage discovery approach. After all, once the data is preserved discovery can proceed at its own pace.
In order to keep preservation costs low, however, it is important to recognize the difference between preservation and subsequent production or analysis. The reality is that preservation need not be costly or overly burdensome.
Analytics is often a good first step in volume reduction. It involves selection or omission of documents based on basic attributes like file type, date ranges, and sender or recipient in the case of e-mail messages.
A litigation support database will likely not be able to provide this information until it is fully loaded with all of the data. A vendor, on the other hand, may be able to read the file system of the preserved media and provide this functionality without having to process any documents at all. So, a vendor can leverage this feature even during the processing phase in order to reduce not just the document review effort but the data processing effort as well.
When matched with a multi-stage discovery strategy analytics can help users plan their discovery for the most economy and efficiency and even avoid full scale production of all of the preserved media.
So, vendors with an analytic capability can provide significant cost savings advantages when trying to reduce document review and the costs associated with full scale production.
Analytics are not only useful for filtering and volume reduction but also for identifying and targeting documents suitable for searching versus those that need to be searched but must be prepped prior to searching.
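A minimal sketch of this kind of file-system analytics appears below. The collection path is hypothetical, and a real implementation would read dates and custodian information from a forensic image rather than a live file system.

```python
# Sketch: simple "analytics" over a collection -- tallying file types and date
# ranges directly from the file system before any heavier processing.

from collections import Counter
from datetime import datetime, timezone
from pathlib import Path

def inventory(root: Path) -> None:
    types = Counter()
    earliest, latest = None, None
    for path in root.rglob("*"):
        if not path.is_file():
            continue
        types[path.suffix.lower() or "(no extension)"] += 1
        modified = datetime.fromtimestamp(path.stat().st_mtime, tz=timezone.utc)
        earliest = modified if earliest is None else min(earliest, modified)
        latest = modified if latest is None else max(latest, modified)
    for ext, count in types.most_common(10):
        print(f"{ext:16} {count}")
    print(f"Date range: {earliest} to {latest}")

# inventory(Path("preserved_media"))   # hypothetical mount point of an image
```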
For years now, most e-discovery has been performed by converting native electronic documents (digital evidence) into a different format such as Tagged Image File Format (TIFF) or Portable Document Format (PDF).
There are several drawbacks to this approach, however. First, there is a cost associated with transforming the original digital evidence into a different format. Second, the transformation also results in data loss.
The conversion, therefore, is like preparing for trial with only a picture of the murder weapon and no means to obtain fingerprint, ballistic, chemical analysis, sales records, ownership, registration and other relevant analyses. Of course, one party to the dispute would welcome such a scenario.
Despite the common practice, others have recognized the superiority of native format evidence. In United States v Davey, 543 F.2d 996 (2d Cir. 1976), the increased efficiency and accuracy of native data was persuasive in compelling its production.
A similar result occurred in In re Air Crash Disaster at Detroit Metropolitan Airport on August 16, 1987, 130 F.R.D. 634 (E.D. Mich. 1989). In that case the production of nine track tapes was believed more economical and efficient than other formats.
More recently, in the case of Williams v Sprint, 230 F.R.D. 640 (D. Kan. 2005), the dispute involved metadata contained in native format documents. The Court ruled that, “When party is ordered to disclose electronic documents as they are maintained in ordinary course of business, i.e. as ‘active file’ or in ‘native format,’ producing party should produce electronic documents with their metadata intact. . . .”
Finally, the 2006 changes to the FRCP identify the format as an item to be determined during the 26(f) planning conference. In addition, Rule 34(b) was amended to permit the requesting party to specify the format in which it wants the data produced.
Despite these examples, resistance to native format evidence continues. The Williams v Sprint case, as well as others, describe the emerging standard for e-discovery production as TIFF or PDF format unless “the requesting party can demonstrate particularized need”.
Fortunately, the particularized need is easily demonstrated. First, the value of native format metadata is increasingly recognized. In Williams v Sprint the access to metadata was the essence of the discovery dispute.
Second, compare the increased cost of production caused by the conversion process. For this calculation assume 10 gigabytes of producible data.
The storage media on which to hold the 10 gigabytes costs less than $100. Similarly, the labor to copy the data onto it is also less than $100, assuming drag and drop.
If something more forensically sound were desired in order to preserve the file system date and time stamps of each file, the labor cost might be two or three times the drag and drop cost.
By comparison consider the conversion cost for TIFF or PDF. Most vendors charge between $1,000 and $2,500 per gigabyte to process the data. So, the total conversion cost would be between $10,000 and $25,000 for TIFF or PDF versus between $200 and $500 for native format.
Even the claim that much of the native format metadata is useless and should not be produced is a red herring. If a gigabyte contains the equivalent of 250,000 TIFF or PDF pages then a 10 gigabyte production would have the equivalent of 2.5 million TIFF or PDF pages. If so, would they all be useful?
Proponents of the conversion process advance numerous justifications such as a need for bates numbering individual pages, document security and visibility. None of these justifications are weighty.
The native evidence can have a Bates number prefixed or suffixed to the file name. Similarly, TIFF and PDF images can be altered too; the only real security is provided by knowing the digital fingerprint of the file, whether native or converted. Finally, there are numerous software tools that can view hundreds of native formats. So, review does not have to be limited to TIFF or PDF.
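A minimal sketch of both ideas, Bates-style file naming and a hash manifest for the produced natives, appears below. The prefix, paths and manifest layout are illustrative only.

```python
# Sketch: producing native files with a Bates-style prefix on the file name and
# recording each file's hash in a manifest so that later alteration can be
# detected. Paths and the Bates prefix are hypothetical.

import csv
import hashlib
import shutil
from pathlib import Path

def produce_native(files: list, out_dir: Path, prefix: str = "ABC") -> None:
    out_dir.mkdir(parents=True, exist_ok=True)
    with (out_dir / "manifest.csv").open("w", newline="") as fh:
        writer = csv.writer(fh)
        writer.writerow(["bates", "original_name", "md5"])
        for number, path in enumerate(sorted(files), start=1):
            bates = f"{prefix}{number:07d}"
            target = out_dir / f"{bates}_{path.name}"
            shutil.copy2(path, target)    # copy2 also preserves file timestamps
            md5 = hashlib.md5(target.read_bytes()).hexdigest()
            writer.writerow([bates, path.name, md5])

# Hypothetical usage:
# produce_native([p for p in Path("responsive").glob("*") if p.is_file()],
#                Path("production_001"))
```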
In the final analysis, there is an economic answer for the litigator who thinks that the smoking gun resides in the opposing side’s computerized data but who has been reluctant to try e-discovery because of cost concerns. The solution: go native.
Another method of volume reduction is deduplication. This can be a rather complex subject with significant consequences for determining volume reduction. Furthermore, the ultimate success may depend on other facets such as production capability.
For example, unless the parties have agreed to produce only the unique document instances, how can deduplication reduce the review effort if the deduplicated results cannot be exploded back to all instances in the population?
One of the ways that deduplication can be performed without providing exploded production is when there are multiple instances of the same data for the custodian. This can occur with backup tapes for example where the same data is captured in each period’s full backup.
In that instance, there may be multiple copies of the same custodian’s data and clearly only one need be produced. Some vendors refer to deduplication within the same custodian as vertical deduplication.
The duplication issue becomes more complicated, however, when multiple custodians have the same document. For example, when both sender and recipient have the same e-mail is there any need to review both even though there may be a need to produce both? Some vendors refer to deduplication across custodians as horizontal deduplication.
While horizontal deduplication will provide a smaller review set than vertical deduplication it can still have limitations, particularly when compound documents like e-mails are involved. For example consider the case where an e-mail is sent to a distribution list. After sending the e-mail the sender realizes that the list was incomplete and then sends the same e-mail to some other recipients that were not previously included. If all things about the e-mail were identical except the recipient list and the sent date and time, is there really any need to review it separately? More than likely this e-mail will not be excluded through either horizontal or vertical deduplication using any number of deduplication methods because of the different recipient list and sent date and time.
In fact, other facets not visible to the normal user would also be different and also prevent its exclusion through vertical or horizontal deduplication. Indeed, the hidden metadata of a compound document, like e-mail, provides several challenges for deduplication. For example, consider the case where an e-mail is sent from Atlanta to recipients in Los Angeles and New York. Also, consider that all three custodians are significant to the case and their e-mails are collected and reviewed.
Although the content of the e-mail will be identical for all three parties, they could not be deduplicated through vertical or horizontal deduplication. As a result, all three would appear in the review population.
The problem is that even though the message and all of the other visible parts of the e-mail are the same, the message headers will be different for each. In addition to other data, the message header captures time and date stamps for each server the message passes through as it winds its way across the internet from sender to recipient.
On the sender’s side the message will not have traveled the internet, so it will not have any server date and time stamps. While the recipients’ copies will have date and time stamps for the servers, those stamps will likely differ, since the path from Atlanta to Los Angeles is likely different than the path from Atlanta to New York. As a result, when each of the messages is hashed, each will have a different value.
With granular deduplication this problem can be solved. Under granular deduplication only the fields of interest are considered when computing the hash value. Thus, for determining the smallest review population the calculation can be limited to just the message body.
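A minimal sketch of the idea appears below: only the normalized message body contributes to the fingerprint, so copies that differ only in transport headers or recipient lists collapse to a single review item. Real granular deduplication would let the user choose which fields participate.

```python
# Sketch: "granular" deduplication of e-mail by hashing only the fields of
# interest (here just the normalized message body).

import hashlib

def body_fingerprint(message_body: str) -> str:
    normalized = " ".join(message_body.split()).lower()   # collapse whitespace
    return hashlib.md5(normalized.encode("utf-8")).hexdigest()

sender_copy    = "Please review the attached schedule\r\nbefore Friday."
recipient_copy = "Please review the attached schedule before Friday."

print(body_fingerprint(sender_copy) == body_fingerprint(recipient_copy))  # True
```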
The prospect for granular deduplication can occur in other compound documents besides e-mail such as office documents. Spreadsheets, for example, can capture metadata about the last viewed or printed event. This update would make a change to the file that would make it different from other instances even though the rest of the document is unchanged.
When performing a privilege review, for example, what difference would the last print date or viewed date matter? If the deduplication test could be limited to content other than this kind of metadata the review population could be further narrowed.
Clearly, there are differences between vertical, horizontal and granular deduplication. Understanding exactly what kind of deduplication the vendor will perform and understanding its consequence on the review population is essential to determining lowest cost.
Even understanding the particular method used for deduplication can be significant. If the method is based on one of the accepted algorithms like MD5 or SHA1, then fine; but what if the vendor has its own algorithm? If so, what is it? How does it work? How effective is it? Is it even meaningful? Will it survive challenge in the event a production problem arises?
As good as it is, deduplication is not without some practical challenges. For example, once the reduced population has been reviewed what will be produced? Will it only be the unique document versions or will all instances be produced? Unless the parties have agreed to only produce the uniques then all instances will be produced. If so, how will the deduped population be exploded back to all instances for production? Even if only the unique instances are produced will a list of all the duplicate documents and their locations be produced and how will that list be prepared?
If the deduplication has been achieved through vertical deduplication this is probably not an issue if all that has happened is that the same document has been removed from repetitive storage media like backup tapes. If the deduplication has been horizontal or granular the production problem and the explosion of the reviewed uniques to the total population is more evident.
The problem is easily solved with software logic, however. It is simple enough to locate matching documents in the global population and code them identically for either production or withholding. Of course, the problem can be complicated when the explosion involves redacted documents, although it is merely a complication. The solution is similar to that for non-redacted documents but with a few more twists.
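A minimal sketch of that logic appears below; the hash values, document identifiers and disposition labels are hypothetical.

```python
# Sketch: "exploding" review decisions made on unique documents back to every
# instance in the population by matching on hash value.

reviewed = {                     # one decision per unique hash
    "a1b2c3": "produce",
    "d4e5f6": "withhold-privileged",
}

population = [                   # every instance, with its hash
    {"doc_id": "CUST01-0001", "hash": "a1b2c3"},
    {"doc_id": "CUST02-0417", "hash": "a1b2c3"},
    {"doc_id": "CUST02-0988", "hash": "d4e5f6"},
]

for doc in population:
    doc["disposition"] = reviewed.get(doc["hash"], "needs review")
    print(doc["doc_id"], doc["disposition"])
```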
The production issue is not the only challenge, however. Another problem area that must be solved in order for the deduplication promise to have maximum benefit, involves privileged documents when they exist in compound documents like e-mail. After all it is possible that some components of a compound document could be duplicates that will be eliminated if the smallest review population is to be derived.
For example, consider the situation where an attorney sends an e-mail instructing the client to perform the tasks listed in the attachment for the current litigation. Such a communication would likely be withheld as a privileged document.
Also, consider that the attachment is separated from the e-mail and saved to the recipient’s hard drive. The deduplication process includes the stored version in the review population while eliminating the e-mail attachment version from the review population as a duplicate.
When the e-mail is reviewed it will likely be marked for privilege and withheld. When separated from the e-mail the attachment is likely to have lost its significance and could be marked for production. When the actual production is performed, however, the production logic could be to withhold all components of a compound document if any element is marked as privileged. If so, the e-mail and its attachment would not be produced; however, what happens to the attachment version that has been saved to the custodian’s hard drive that has been marked for production?
This issue, too, can be solved with software coding logic. For example, the actual production of documents could be prohibited for documents that also exist within compound documents like e-mails. Their production could be deferred until all elements of the compound documents in which they also exist have been reviewed. If any component of a compound document in which they exist is marked as privileged, then the standalone copy is withheld as well.
Of course, the above are just some possible solutions. There are likely many others. In any event, they clearly illustrate that the e-discovery challenge is complex and the nuances need to be understood when evaluating vendor prices. Certainly, the problems could be simplified by avoiding issues like volume reduction but that would also result in significant costs to the client. Equally bad is that the greater volume also increases the chance of error.
Concept clustering is a means to gain greater efficiency by grouping like documents. The efficiency from the grouping can occur in several different ways.
First, clustering can group documents of similar subjects. Once grouped, reviewers may not need to review all documents in the group before making a determination about the significance of the documents. Rather, after looking at only a few documents in the group a reviewer can dismiss all of the documents in the group.
The technology used for clustering is similar to predictive coding, although a baseline set of documents may not need to be created. Rather, the statistical technology simply reads the document contents and groups those having similar content.
The second approach to clustering is grouping like kinds of documents. In other words, if all invoices were grouped together, a manual reviewer may be able to develop a rhythm for reviewing and dispositioning each document by looking at the content at a particular location. If all of the documents in a group are of the same kind, a reviewer may be able to iterate through them much faster than if they are documents of all kinds.
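For illustration, the sketch below groups a handful of hypothetical documents into two clusters using term weighting and k-means, assuming the scikit-learn library is available. Commercial clustering tools use their own, generally proprietary, algorithms.

```python
# Sketch of clustering similar documents. The documents and the number of
# clusters are illustrative only.

from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

documents = [
    "Invoice 1001 for track maintenance services",
    "Invoice 1002 for signal repair services",
    "Meeting minutes discussing the braking incident",
    "Meeting minutes on the vehicle overrun investigation",
]

vectors = TfidfVectorizer(stop_words="english").fit_transform(documents)
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(vectors)

for label, doc in sorted(zip(labels, documents)):
    print(label, doc)   # documents sharing a label landed in the same cluster
```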
The 2006 changes to the Federal Rules of Civil Procedure encouraged greater planning by the parties in order to avoid disputes and promote swift and economic justice. One of these planning opportunities is the discovery conference and its product the discovery plan.
The discovery plan is essentially an agreement between the parties about how the discovery will be conducted. If the e-discovery project can be analogized to building a house, the development of the discovery plan can be analogized to drafting the contract. The contract should be rather detailed and include detailed instructions and specifications.
While developing a detailed contract and specification will take time, there are many benefits to this approach. Just consider the home building analogy. The desired product could be described on the back of a napkin. The result could be presented in a "plan" view or an "architectural" view. While these other techniques are simpler, it is widely accepted that a detailed engineered drawing package and contract are the best approach because they will resolve many issues on the front end and thereby avoid many disputes on the back end. In addition, the more detailed plan provides a more efficient baseline on which to base material ordering requirements and the scheduling of subcontractors. As a result of the more detailed design and better plan, the project is more likely to proceed on schedule and on budget.
Rule 1 of the Federal Rules of Civil Procedure describes the primary constraint as the "just, speedy and inexpensive determination of every action and proceeding." When properly constructed, discovery plans can make this goal a reality, particularly when they incorporate features designed for these purposes.
For example, while discovery plans will inevitably address features like preservation that are basic requirements for every party in every litigation, they can also incorporate multi-stage discovery, analytics and prototyping prior to full scale production in order to more accurately target the real catch when casting the big net. While some may argue that such an approach is not speedy, the counter argument is that excessive motion practice is not speedy either. So, while avoiding the more detailed planning on the front end may get the case started faster it does not guarantee that it will end any sooner nor more economically. In fact, just the opposite will be more likely.
By contrast, a well designed discovery plan can eliminate many of the issues about which parties frequently squabble. In addition, a well designed discovery plan can eliminate the waste of processing a lot of needless junk when the case will likely turn on only a few hundred exhibits. Even if parties were to squabble early in the process when designing their discovery plan, that squabbling at least occurs before they have committed resources to sifting, processing and reviewing untold thousands of documents having no real significance.
In the end, one of the desired outcomes of the 2006 amendments was to force parties to better plan their discovery in order to deliver the "just, speedy and inexpensive determination of every action and proceeding." Those who would resist the discipline associated with designing a discovery plan on the basis of the additional time that it might take have probably forgotten the children's tale about the tortoise and the hare. Perhaps even more significant is that they do not appreciate the complexity and sophistication of e-discovery and that, like many other disciplines involving the development of complex products and services, careful front end planning is always the best way to deliver on-schedule and on-budget results.
Clients as well as their counsel should carefully consider the benefits of advanced planning and detailed discovery plan development. Of course, many advocates are not interested in the goal espoused in Rule 1 or even the seriousness of the certification in 26(g). Instead, their goal is to use the economic aspects (or perhaps more accurately stated the uneconomic aspects) of e-discovery tactically in order to achieve settlements that are not merit based. Consequently, at least one party to a litigation may be eager to avoid a disciplined approach in order to draw its prey into a quagmire. Remarkably, however, such tactics can easily backfire and leave the plotter caught in his own fly paper.
While one can look to Rule 26(f)(3) to learn some of the recommended subjects that could be included in a discovery plan, there are actually many more. The following are nine subjects that should be included in any e-discovery plan.
For more information about discovery plans and protocols see, Eleven Steps to Designing an E-Discovery Plan and Protocol: A Systems Engineering Approach.
Digital litigation involves many challenges. The solution to these challenges does not lie in the procedures of the past. Rather, the solution involves using technology to solve technology caused problems.
While it is natural for practitioners to search for a silver bullet or “easy” button, there just is not one. The reality is that digital discovery requires blending a lot of different technologies and techniques for an optimal solution to render swift and economic justice. After all, regardless of how large the original document population the number of those that will be needed at trial is probably less than a few hundred. The issue is how to find those few hundred documents.
The solution is not a matter of simply iterating through the original population in order to find the few documents that will be needed at trial, as has often been done in the past. The population of documents is not uniform and indistinguishable except for their content. On the contrary, there are many facets about the population of documents that can be used to differentiate them and narrow their numbers to those of interest and the final trial exhibits. The different technologies provide the means to differentiate those facets and find the final documents of interest.
Perhaps the best example of choices is in technology assisted review. Between keyword search, context search and predictive coding, litigators have several choices. While many want to dismiss keyword searches in favor of predictive coding that decision may not be reasonable. If there is a problem with keyword search it is simply that it is subject to misuse like any other technology. When properly used keyword search can be very effective.
At the other end is predictive coding. While it brings considerable science to the document retrieval problem, it also brings considerable overhead. As a result, it may not be well suited for garden variety cases. In addition, its added complexity likely means it is more subject to abuse than even keyword searches.
A kind of middle ground is context search. It offers sophisticated capability in a black box format. So, it may be the best way to bring sophisticated capability to less capable users.
Regardless of the method selected, they all are subject to human limitations, particularly with respect to the determinations of responsiveness and relevance.
Document retrieval is not the only technology that litigators need to know how to use. Indeed, there are quite a number of different technologies. All of these can be used to solve the overall problem faced by litigators in the digital age: how to deliver swift and economic justice.
The best answer is not a single silver bullet or “easy” button but likely a blend of all of these methodologies to deliver an optimal solution.