According to Declan McCullagh, a just-released U.S. National Research Council report entitled Protecting Individual Privacy in the Struggle Against Terrorists: A Framework for Assessment concludes that automated identification of terrorists through data mining or any other mechanism “is neither feasible as an objective nor desirable as a goal of technology development efforts.”
I haven’t had the time to read through the 352-page report. The committee that wrote it includes Stanford professor William Perry, former MIT president Charles Vest, and Microsoft researcher Cynthia Dwork. Such a crew undoubtedly realizes that any data mining technique yields false positives. The big questions are whether data mining techniques are more effective than the alternatives, and whether using them is consistent with law and policy.
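The false-positive concern is really a base-rate problem, and a back-of-the-envelope Bayes calculation makes it concrete. The numbers below are invented purely for illustration — they are not from the report — but the qualitative conclusion holds for any plausible values:

```python
# Base-rate illustration (all numbers are made up): even a remarkably
# accurate classifier drowns in false positives when the target
# population is a tiny fraction of everyone screened.

def posterior(prevalence, sensitivity, false_positive_rate):
    """P(terrorist | flagged), by Bayes' rule."""
    true_pos = prevalence * sensitivity
    false_pos = (1 - prevalence) * false_positive_rate
    return true_pos / (true_pos + false_pos)

# Suppose 1,000 terrorists in a population of 300 million, and a classifier
# that catches 99% of them while flagging only 0.1% of innocents.
p = posterior(prevalence=1000 / 300_000_000,
              sensitivity=0.99,
              false_positive_rate=0.001)
print(f"{p:.5f}")  # roughly 0.003 -- over 99% of flagged people are innocent
```

Under these (generous) assumptions, fewer than 1 in 300 people flagged would actually be a terrorist, which is why the report's call for objective evaluation matters so much.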
Based on McCullagh’s summary, the report seems to mainly call for oversight and objective evaluation. Nothing controversial there. And, as he wryly notes, Americans may have watched too many episodes of 24 to have a realistic sense of what data mining can and can’t do.
Still, I think we’d be naive to give up entirely on machine learning approaches to fight crime and improve national security. As with all science, we need to subject hypotheses to rigorous, objective testing. But remember, low-tech approaches have false positives too. There is no moral superiority in being a Luddite.
18 replies on “NRC Report: Data Mining won’t find the Terrorists”
The problem is that you can invest millions in ill-fated machine learning projects, whereas those same millions, invested in trained staff or better data visualization tools, could save lives.
So it is not as simple as saying that you have to do it carefully. By choosing to do machine learning, you also choose to ignore other techniques which might be more productive.
How many billions have been invested in AI since the 70s? How many broken promises?
Of course, this report will probably have the effect of generating even larger investments in machine learning for tracking terrorists.
Do not get me wrong. I am convinced that computers are smarter and better than human beings at many tasks, including several tasks involved in tracking terrorists. However, the bad guys also have AI on their side (think about spammers).
It could very well be that more machine learning is not the way to go.
Sure, I’d like to see resources allocated optimally. I don’t know whether investing in machine learning would provide the best bang (or bang prevention) per buck. But it’s also possible to invest millions in staff training or tool building without yielding any return.
I understand that AI and machine learning have a bad rap for overpromising and underdelivering in the past. And much of my work at Endeca is about building better tools to leverage human intelligence, rather than over-relying on the artificial sort.
But I still don’t like seeing overly simplistic attacks on technology that seem more motivated by ideology than by informed analysis. I can’t comment on the report itself, but McCullagh’s piece reads a bit too editorially for my taste.
I will likely purchase and read the report, as I work in that space. Based on a quick glance, I think there is a lot that can be taken from it and applied, especially related to transparency in aggregating and ranking data.
“Counterterrorism programs should provide meaningful redress to any individuals inappropriately harmed by their operation.”
Imagine if the banking industry, which uses data mining and algorithms to identify potential home buyers, had to provide redress to any individuals it inadvertently harmed. I’m sure there were a significant number of false positives among people who could have met their mortgage payments.
No decision process is foolproof, and it’s legitimate to ask how we handle mistakes. If the process is transparent, then at least it’s easier to assign responsibility. Even so, fairness may require that we compensate individuals who were innocent victims of an inherently imperfect process.
That said, I think the key word here is “inappropriately” as opposed to “inadvertently”. We expect our government to act in good faith. As long as it does so, people will accept the rare accidents as tragic but forgivable.
What I don’t understand about applying data mining to counter-terrorism is the fact that you have very little, if any, real training data.
Doesn’t machine learning operate only under the assumption that similar patterns of behavior will happen again and again, and that through repetitive exposure to positive training examples you can discover the features or contexts under which those patterns operate?
Online shopping has millions and millions of patterns to learn from, so machine learning makes sense. But terrorism only has.. what.. dozens? How can one reasonably construct any kind of machine learning system with only a few dozen training examples?
Forget the issue with false positives. How do you even train it, to begin with? Especially when terrorists probably aren’t going to repeat their behaviors, the same way that online shoppers do? Aren’t terrorists purposely going to be trying to change their own methods, all the time? So even if you could learn something by training on a few dozen examples, there is no guarantee that the next instance of terrorism is going to “behave” anything like any of those previous instances. Right?
This is something that has always confused me about machine learning, as applied to counter-terrorism. Am I missing something fundamental here? Does someone want to explain it to me?
I’d say that’s true for many machine learning techniques, e.g., support vector machines. But it’s not necessarily true for unsupervised techniques like anomaly detection.
Anomaly detection algorithms still produce false positives, because they tend to underestimate the variation in the signals they measure. But it’s an approach that can work in principle.
Besides, doesn’t human intelligence operate only under the assumption that similar patterns of behavior will happen again and again, and that through repetitive exposure to positive training examples you can discover the features or contexts under which those patterns operate?
I would agree with Jeremy that there isn’t enough “specific” training data. But I would say that there are enough broader categories of data that allow the analyst to narrow down the haystacks to hay bales.
Counterterrorism is obviously harder than cyber attacks. I can use data mining to quickly analyze firewall logs and determine patterns, frequencies, and attacker locations.
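As a minimal sketch of that kind of log mining, here is a frequency count over a few invented firewall log lines (the format and IPs are hypothetical, just to show the idea):

```python
# Count hits per source IP in synthetic firewall log lines to surface
# the most active attacker -- the simplest form of log "data mining".
from collections import Counter

log_lines = [
    "DROP src=203.0.113.7 dst=10.0.0.1 port=22",
    "DROP src=203.0.113.7 dst=10.0.0.1 port=23",
    "DROP src=198.51.100.4 dst=10.0.0.2 port=80",
    "DROP src=203.0.113.7 dst=10.0.0.3 port=22",
]

# Second whitespace-separated field is "src=<ip>"; strip the prefix.
hits = Counter(line.split()[1].removeprefix("src=") for line in log_lines)
print(hits.most_common(1))  # [('203.0.113.7', 3)]
```

Cyber data has this luxury: millions of labeled, repetitive events in a fixed format, which is exactly what counterterrorism data lacks.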
As far as analyzing intelligence, a good reference to read is http://zz.gd/20fec8 which focuses on human biases/thinking about thinking/etc.
For the benefit of recent readers: https://thenoisychannel.com/2008/07/11/psychology-of-intelligence-analysis/
But that is still making the assumption that terrorist activities are characterized by anomalies. Like buying a one-way ticket is more suspicious than buying a round-trip ticket. And that the terrorist with box-cutters would actually buy the one-way ticket, instead of just buying the round-trip.
So to me that’s the real question: Do we truly believe that critical terrorist behavior is characterized by anomalies? Let’s not worry again (for the moment at least) about the false positives. How well do we do with the false negatives?
I think that, in retrospect, we do see a trail of anomalous data. The problem is whether the anomalies are even detectable in principle if we don’t know what we’re looking for.
There is a danger that terrorist activities are all what Nassim Nicholas Taleb calls “black swans”: outliers that are predictable only in retrospect. It’s actually a problem I’m trying to get my head around.
More about black swans here: http://www.frankvoisin.com/?p=128
Do we really see a trail of anomalous data? Or, even in retrospect, do we really only see the anomaly at the last second, the moment the event actually occurs?
With the caveat that I know very little about such intelligence matters, it seems to me that a lot of behaviors are not really anomalous at all, as they’re happening. Farmer buys lots of fertilizer. Not an anomaly — farmers buy fertilizer all the time. Friend of farmer rents a yellow moving van. Not an anomaly — people rent moving vans all the time…even people that are friends of farmers.
Suddenly, farmer and friend pack the moving van with fertilizer..and..kablooey.
Even in hindsight, the trail itself is not very anomalous.. until you get to the very day, the very act itself. But by then it’s too late for any sort of data mining.
Similarly with the box-cutter fellow. He’s just going to buy the round-trip ticket. Nothing anomalous there at all. Nor is buying box-cutters. Nor is even buying box-cutters, and then flying. (Think of all the people moving to new jobs around the country.. packing their belongings before flying. I’ll bet lots of people buy box-cutters and then fly.)
So nothing about that is an anomaly, until the act itself. Even in hindsight.
Right? I’m still trying to wrap my head around it, too, and am thinking out loud right now.
At the level you’re describing, probably not.
But imagine that you’re gathering data by whatever means available in places where terrorist activities are likely to be planned. To get a sense of what means are available, take a look at http://en.wikipedia.org/wiki/List_of_intelligence_gathering_disciplines
If you do a decent job of data gathering, then you’ll have too much data to analyze manually. You need some way to focus your attention on the most likely data of interest.
That’s where my intuition tells me that machine learning will be useful.
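In that triage role, even something far short of "finding terrorists" helps: score records on weak indicators and rank them so analysts look at the most interesting ones first. The indicators and weights below are entirely invented for illustration:

```python
# Toy "haystack to hay bales" triage: score records on weak indicators
# and rank them for human review. Indicators and weights are made up.

records = [
    {"id": 1, "one_way_ticket": True,  "cash_purchase": False, "watchlist_contact": False},
    {"id": 2, "one_way_ticket": False, "cash_purchase": False, "watchlist_contact": False},
    {"id": 3, "one_way_ticket": True,  "cash_purchase": True,  "watchlist_contact": True},
]

WEIGHTS = {"one_way_ticket": 1.0, "cash_purchase": 1.5, "watchlist_contact": 3.0}

def score(record):
    """Sum the weights of the indicators a record triggers."""
    return sum(w for key, w in WEIGHTS.items() if record[key])

# Rank by score; an analyst reviews the top of the list first.
ranked = sorted(records, key=score, reverse=True)
print([r["id"] for r in ranked])  # record 3 sorts first
```

The point is not that any single indicator is anomalous — as Jeremy notes, most aren't — but that ranking shifts the question from "find the terrorist" to "where should a scarce analyst look first."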
[…] the benefit of readers using RSS, I just wanted to point people to great discussion going on in the comment thread for this […]
This is a great dialog.
For counterterrorism (CT), I would say that much of their activity does not consist of outliers, especially as they learn denial and deception techniques (e.g., moving to England or Germany for easier access to the US). But there are certain activities that could be indicators (travel to certain countries and known associations).
Unfortunately a lot of people assume that all of the data mining occurs in a vacuum outside of other intelligence sources. Based on that additional intelligence, the goal is to narrow the haystacks down to hay bales that can be looked at using other investigative methods.
In my opinion, you would be hard pressed to find any analyst in the IC who would say they have a big red Easy button that says “Find Terrorist” on their desktop. It just isn’t that simple but data mining techniques do help.
One other good resource is the book Analyzing Intelligence: Origins, Obstacles, and Innovations (Paperback), available at http://zz.gd/9c0e0e. I had a chance to listen to the author speak on Tuesday, and he was a very engaging speaker.
FWIW – you don’t have to buy it; as with (all?) other NAP reports, it can be read for free online at: http://www.nap.edu/catalog.php?record_id=12452
Thank you, Christina, for directing Noisy Channel followers to our site. We appreciate the ongoing dialogue, and we wanted to inform visitors that we offer several reports which can be read for free online. Along with our recent publication “Privacy In The Struggle Against Terrorists: A Framework for Assessment”,
NAP also offers over 4,000 books free online, as well as several podcasts and over 1,900 PDFs, all of which can be found at http://www.nap.edu.
Zenneia, thank you for providing this service–and Christina, thanks for letting us know about it. It’s great to see the National Academies taking the lead in information sharing, and I hope that other government agencies follow your example.