Summary of data mining

Defeating terrorism requires a more nimble intelligence apparatus that operates more actively within the United States and makes use of advanced information technology. Data-mining and automated data-analysis techniques are powerful tools for intelligence and law enforcement officials fighting terrorism. But these tools also generate controversy and concern. They make analysis of data, including private data, easier and more powerful, which can make private data more useful and attractive to the government. Data mining and data analysis are simply too valuable to prohibit, but they should not be embraced without guidelines and controls for their use. Policymakers must acquire an understanding of data-mining and automated data-analysis tools so that they can craft policy that encourages responsible use and sets parameters for that use.

This report builds on a series of roundtable discussions held by CSIS. It provides a basic description of how data-mining techniques work, how they can be used for counterterrorism, and their privacy implications. It also identifies where informed policy development is necessary to address privacy and other issues.

One of the first problems with “data mining” is that there are varying understandings of what the term means. “Data mining” actually has a relatively narrow meaning: it is a process that uses algorithms to discover predictive patterns in data sets. “Automated data analysis” applies models to data to predict behavior, assess risk, determine associations, or perform other types of analysis. The models used for automated data analysis can be based on patterns (discovered through data mining or by other methods) or can be subject-based, starting with a specific known subject.

There are a number of common misconceptions about these techniques. For example, data mining and data analysis do not increase access to private data.
Data mining and data analysis certainly can make private data more useful, but they can only operate on data that is already accessible. Another myth is that data mining and data analysis require masses of data in one large database. In fact, data mining and analysis can be conducted using a number of databases of varying sizes.

Although these techniques are powerful, it is a mistake to view data mining and automated data analysis as complete solutions to security problems. Their strength is as tools to assist analysts and investigators. They can automate some functions that analysts would otherwise have to perform manually, they can help prioritize attention and focus an inquiry, and they can even do some early analysis and sorting of masses of data. But in the complex world of counterterrorism, they are not likely to be useful as the sole source for a conclusion or decision. When these techniques are used as more than an analytical tool, the potential for harm to individuals is far more significant.

Automated data-analysis techniques can be useful for counterterrorism in a number of ways. One initial benefit of the data-analysis process is to assist in the important task of accurate identification. Technologies that use large collections of identity information can help resolve whether two records represent the same or different people. Accurate identification is not only critical for determining whether a person is of interest in a terrorism-related investigation; it also makes the government better at determining when someone is not of interest, thereby reducing the chance that the government will inconvenience that person.

Subject-based “link analysis” uses public records or other large collections of data to find links between a subject (a suspect, an address, or another piece of relevant information) and other people, places, or things.
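Link analysis of this kind can be pictured as traversal of a graph whose nodes are people, addresses, and phone numbers drawn from records, and whose edges connect entities that appear together in a record. A minimal sketch in Python; all records, names, and the two-hop limit are hypothetical:

```python
from collections import deque

# Toy record set: each edge links two entities that appear in the
# same record (a shared phone number, address, etc.). All data is fictional.
links = {
    "subject A": ["555-0100", "12 Elm St"],
    "555-0100": ["subject A", "person B"],
    "12 Elm St": ["subject A", "person C"],
    "person B": ["555-0100"],
    "person C": ["12 Elm St"],
}

def linked_entities(start, max_hops=2):
    """Breadth-first search: everything reachable from `start` within max_hops."""
    seen = {start: 0}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if seen[node] == max_hops:
            continue  # do not expand beyond the hop limit
        for neighbor in links.get(node, []):
            if neighbor not in seen:
                seen[neighbor] = seen[node] + 1
                queue.append(neighbor)
    seen.pop(start)  # report only the discovered entities, not the subject
    return seen

print(linked_entities("subject A"))
# {'555-0100': 1, '12 Elm St': 1, 'person B': 2, 'person C': 2}
```

The hop limit matters in practice: without one, nearly everyone in a large record set is "linked" to everyone else within a few steps, which is one source of the false-positive problem discussed below.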
This technique is already being used for, among other things, background investigations and as an investigative tool in national security and law enforcement investigations.

Pattern-based analysis may also have potential counterterrorism uses. Pattern-based queries take a predictive model or pattern of behavior and search for that pattern in data sets. If the models can be perfected, pattern-based searches could provide clues to “sleeper” cells made up of people who have never engaged in activity that would link them to known terrorists.

The potential benefits for counterterrorism are significant. But when the government can analyze private data so much more effectively, that data can become more attractive, and the government’s power to affect the lives of individuals can increase. There is significant public unease about whether protections for privacy are adequate to address the negative consequences of increased government use of private data. These concerns are heightened because there is so little understanding of how the government might use these data-analysis tools, and because there is typically little public debate or discussion before the tools are adopted. This lack of transparency not only makes the government’s decisions less informed; it also increases public fear and misunderstanding about uses of these techniques.

Perhaps the most significant concern with data mining and automated data analysis is that the government might get it wrong, and innocent people will be stigmatized and inconvenienced. This is the problem of “false positives”: a process incorrectly reports that it has found what it is looking for. With these tools, a false positive could mean that, because of bad data or an imperfect search model, a person is incorrectly identified as having a terrorist connection. But even if results are accurate, government mechanisms are currently inadequate for controlling the use of those results.
If they are not controlled, private data can be used improperly. There are no clear guidelines today for who may see private data, for what reasons, how long it may be retained, or to whom it may be disseminated.

A related concern is “mission creep”: the tendency to expand the use of a controversial technique beyond its original purposes. Use of controversial tools may be deemed acceptable given the potential harm of catastrophic terrorism, but there will then be a great temptation to expand their use to address other law enforcement or societal concerns, ranging from the serious to the trivial.
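The scale of the false-positive problem described above follows from simple arithmetic: when true matches are rare in a large population, even a highly accurate search flags mostly innocent people. A sketch with purely illustrative numbers (none are drawn from any real system):

```python
# Hypothetical screening scenario: all four figures are illustrative.
population = 300_000_000       # records screened
true_targets = 3_000           # actual persons of interest
sensitivity = 0.99             # fraction of true targets the model flags
false_positive_rate = 0.001    # fraction of innocents incorrectly flagged

true_hits = true_targets * sensitivity                          # 2,970
false_hits = (population - true_targets) * false_positive_rate  # 299,997
precision = true_hits / (true_hits + false_hits)

print(f"flagged: {true_hits + false_hits:,.0f}, "
      f"of whom only {precision:.1%} are true matches")
# flagged: 302,967, of whom only 1.0% are true matches
```

Under these assumptions, roughly 99 of every 100 people flagged would be innocent, which is why the report treats these tools as aids to analysts rather than as a basis for direct government action against individuals.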

One important avenue for addressing many of these challenges to privacy and liberties, at least in part, is technology. Some privacy-protecting technology is already available, and much more is being researched. Researchers are looking at methods to perfect search models and cleanse data to reduce false positives; “anonymizing” technology designed to mask or selectively reveal identifying data, so that the government can conduct searches and share data without knowing the names and identities of Americans; audit technology to “watch the watchers” by recording activity in databases and networks to provide effective oversight; and rule-processing or permissioning technology that ensures data can be retrieved only in a manner consistent with privacy safeguards and other rules.

Although this technology can address some of the risks of using data-mining and automated data-analysis techniques, it will not be adequate on its own. Policy action is needed to ensure that controls and protections accompany use of these powerful tools. The policy issues that require attention include:

- Research on data mining and automated data analysis. Data-mining and automated data-analysis tools have great potential for counterterrorism, but to realize that potential fully, more research is needed. The government should support this research. A government policy for this research should take into account the context in which these tools may eventually be deployed. This means research on privacy-protecting technology, and even some analysis of privacy policy issues, should be included.

- Clarity about use of data mining and automated data analysis. One of the principal reasons for public concern about these tools is that there appears to be no consistent policy guiding decisions about when and how to use them. Policies for data-mining and automated data-analysis techniques should set forth standards and a process for decision-making on the type of data-analysis technique to use (subject-based or pattern-based, for example) and the data that will be accessed. They should mandate inquiries into data accuracy and the level of errors the analysis is expected to generate, and they should require the government to put a mechanism for correcting errors in place before operations begin.

- Use of search results. There should also be a consistent policy on what action can be taken based on search results. When automated data-analysis results are used only to further analysis and investigation, and not as the sole basis for detention or some other government action, there are fewer possible negative consequences for individuals. Guidance is therefore necessary on the circumstances, if any, under which results can be used as the basis for action.

- Controls on the use of identifying information. Currently no clear guidance exists for government entities and employees about how to handle private data, and this lack of direction can lead to mistakes and inconsistent use of data. Perhaps the most important step to address privacy concerns with the use of data mining and automated data analysis is for the executive branch to implement clear guidelines for federal employees on how they may access, use, retain, and disseminate private data.
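As one illustration of the “anonymizing” technology mentioned above, one common approach is keyed pseudonymization: identifiers are replaced with keyed hashes, so analysts can tell that two records refer to the same person without seeing a name. This is a simplified sketch only; the key-custody arrangement is hypothetical, and real systems require far stronger protections:

```python
import hashlib
import hmac

# Hypothetical: in a real deployment the key would be held by a separate
# custodian, so analysts cannot reverse tokens back to identities on their own.
SECRET_KEY = b"held-by-a-separate-custodian"

def pseudonymize(identifier: str) -> str:
    """Replace an identifier with a keyed hash: the same input always yields
    the same token, so records match without the underlying name being seen."""
    digest = hmac.new(SECRET_KEY, identifier.encode("utf-8"), hashlib.sha256)
    return digest.hexdigest()[:16]

# Two agencies can confirm their records concern the same person
# without either one disclosing the name to the other.
token_a = pseudonymize("Jane Q. Public")   # names are fictional
token_b = pseudonymize("Jane Q. Public")
token_c = pseudonymize("John Doe")

print(token_a == token_b)  # True:  same person, tokens match
print(token_a == token_c)  # False: different people
```

Notice that the token is deterministic, which is exactly what makes matching possible, but also why audit and permissioning controls of the kind described above are still needed: whoever holds the key can re-identify anyone.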
