In the data science community, we’re witnessing the beginnings of an infodemic — where more data becomes a liability rather than an asset. We’re continuously moving towards ever more data-hungry and more computationally expensive state-of-the-art AI models. And that is going to result in some detrimental and perhaps counter-intuitive side-effects (I’ll get to those shortly).
To avoid serious downsides, the data science community has to start working with some self-imposed constraints: specifically, more limited data and compute resources.
A minimal-data practice will enable several AI-driven industries — including cyber security, which is my own area of focus — to become more efficient, accessible, independent, and disruptive.
When data becomes a curse rather than a blessing
Before we go any further, let me explain the problem with our reliance on increasingly data-hungry AI algorithms. In simplistic terms, AI-powered models “learn” without being explicitly programmed to do so, through a trial-and-error process that relies on an amassed slate of samples. In theory, the more data points you have (even if many of them seem indistinguishable to the naked eye), the more accurate and robust your AI-powered models should be.
In search of higher accuracy and low false-positive rates, industries like cyber security — which was once optimistic about its ability to leverage the unprecedented amount of data that followed from enterprise digital transformation — are now encountering a whole new set of challenges:
1. AI has a compute addiction. The growing fear is that new advancements in experimental AI research, which frequently require formidable datasets supported by an appropriate compute infrastructure, might be stalled by compute and memory constraints, not to mention the financial and environmental costs of higher compute needs.
While we may reach several more AI milestones with this data-heavy approach, over time we’ll see progress slow. The data science community’s tendency to aim for data-“insatiable” and compute-draining state-of-the-art models in certain domains (e.g., NLP and its dominant large-scale language models) should serve as a warning sign. OpenAI analyses suggest that the community keeps getting more efficient at reaching goals that have already been achieved, but that dramatic new AI achievements require a few orders of magnitude more compute. MIT researchers estimated that “three years of algorithmic improvement is equivalent to a 10 times increase in computing power.” Furthermore, building an adequate AI model that will withstand concept drift over time and overcome “underspecification” usually requires multiple rounds of training and tuning, which means even more compute resources.
If pushing the AI envelope means consuming even more specialized resources at greater cost, then, yes, the leading tech giants will keep paying the price to stay in the lead, but most academic institutions will find it difficult to take part in this “high risk, high reward” competition. These institutions will most likely either embrace resource-efficient technologies or pursue adjacent fields of research. The significant compute barrier might also have an unwarranted cooling effect on academic researchers themselves, who might choose to self-restrain or refrain entirely from pursuing revolutionary AI-powered advancements.
2. Big data can mean more spurious noise. Even if you assume you have properly defined and designed an AI model’s objective and architecture and that you have gleaned, curated, and adequately prepared enough relevant data, you have no assurance the model will yield beneficial and actionable results. During the training process, as additional data points are consumed, the model might still identify misleading spurious correlations between different variables. These variables might be associated in what seems to be a statistically significant manner, but are not causally related and so don’t serve as useful indicators for prediction purposes.
I see this in the cyber security field: The industry feels compelled to take as many features as possible into account, in the hope of generating better detection and discovery mechanisms, security baselines, and authentication processes, but spurious correlations can overshadow the hidden correlations that actually matter (the short sketch after point 4 below shows how easily such false signals arise).
3. We’re still only making linear progress. The fact that large-scale, data-hungry models perform very well under specific circumstances, by mimicking human-generated content or surpassing some human detection and recognition capabilities, might be misleading. It might keep data practitioners from realizing that some of the current efforts in applicative AI research are only extending existing AI-based capabilities in a linear progression rather than producing real leapfrog advancements — in the way organizations secure their systems and networks, for example.
Unsupervised deep learning models fed on large datasets have yielded remarkable results over the years — especially through transfer learning and generative adversarial networks (GANs). But even in light of progress in neuro-symbolic AI research, AI-powered models are still far from demonstrating human-like intuition, imagination, top-down reasoning, or artificial general intelligence (AGI) that could be applied broadly and effectively to fundamentally different problems — such as handling varying, unscripted, and evolving security tasks while facing dynamic and sophisticated adversaries.
4. Privacy concerns are expanding. Last but not least, collecting, storing, and using extensive volumes of data (including user-generated data) — which is especially relevant for cyber security applications — raises a plethora of privacy, legal, and regulatory concerns and considerations. Arguments that cyber security-related data points don’t carry or constitute personally identifiable information (PII) are being refuted these days, as the strong binding between personal identities and digital attributes is extending the legal definition of PII to include, for example, even an IP address.
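To make the spurious-correlation problem from point 2 concrete, here is a minimal, illustrative sketch (a toy example of my own, not drawn from any production system): with thousands of candidate features and a label that is pure noise, a naive significance test will still flag dozens of features as “predictive.”

```python
# Illustrative sketch: with enough random features and a finite sample, some
# features will look "significantly" correlated with the label purely by chance.
import numpy as np
from scipy.stats import pearsonr

rng = np.random.default_rng(0)
n_samples, n_features = 500, 2000          # many features, modest sample size
X = rng.normal(size=(n_samples, n_features))
y = rng.integers(0, 2, size=n_samples)     # label is pure noise: no real signal

# Count features whose correlation with y clears a naive p < 0.01 threshold.
spurious = sum(
    1 for j in range(n_features)
    if pearsonr(X[:, j], y)[1] < 0.01
)
print(f"{spurious} of {n_features} features look 'significant' by chance")
# Roughly 1% of the features (about 20 here) pass, despite zero causal link.
```

The more features a model is allowed to consume, the more of these chance “signals” it has to sift through, which is exactly how spurious correlations crowd out the ones that matter.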
How I learned to stop worrying and enjoy data scarcity
To overcome these challenges, specifically in my own area of cyber security, we first and foremost have to align expectations.
The unexpected emergence of Covid-19 has highlighted how difficult it is for AI models to adapt effectively to unseen, and perhaps unforeseeable, circumstances and edge cases (such as a global transition to remote work), especially in cyberspace, where many datasets are naturally anomalous or characterized by high variance. The pandemic only underscored the importance of clearly and precisely articulating a model’s objective and adequately preparing its training data. These tasks are usually as important and labor-intensive as accumulating additional samples, or even as choosing and honing the model’s architecture.
These days, the cyber security industry is required to go through yet another recalibration phase as it comes to terms with its inability to cope with the “data overdose,” or infodemic, that has been plaguing the cyber realm. The following approaches can serve as guiding principles to accelerate this recalibration process, and they’re valid for other areas of AI, too, not just cyber security.
Algorithmic efficacy as top priority. Taking stock of the plateauing of Moore’s law, companies and AI researchers are working to ramp up algorithmic efficacy by testing innovative methods and technologies, some of which are still at a nascent stage of deployment. These approaches, which are currently applicable only to specific tasks, range from the application of Switch Transformers to the refinement of few-shot, one-shot, and less-than-one-shot learning methods.
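To give a flavor of what “less data” can look like in practice, here is a minimal, hypothetical sketch of the few-shot idea: instead of training on a massive corpus, new samples are classified by their distance to class prototypes built from just a handful of labeled examples (a simplified, prototypical-network-style approach; the embedding space and data below are invented purely for illustration).

```python
# Minimal sketch of the few-shot idea (illustrative, not a production method):
# classify new points by distance to class "prototypes" computed from only a
# handful of labeled examples per class, instead of training on a huge corpus.
import numpy as np

def fit_prototypes(support_x: np.ndarray, support_y: np.ndarray) -> dict:
    """Compute one prototype (mean embedding) per class from a few samples."""
    return {c: support_x[support_y == c].mean(axis=0) for c in np.unique(support_y)}

def predict(prototypes: dict, queries: np.ndarray) -> np.ndarray:
    """Assign each query to the class whose prototype is nearest (Euclidean)."""
    classes = list(prototypes)
    dists = np.stack([np.linalg.norm(queries - prototypes[c], axis=1) for c in classes])
    return np.array(classes)[dists.argmin(axis=0)]

# Example: 5 labeled samples per class ("5-shot") in a toy 2-D embedding space.
rng = np.random.default_rng(1)
support_x = np.vstack([rng.normal(0, 1, (5, 2)), rng.normal(4, 1, (5, 2))])
support_y = np.array([0] * 5 + [1] * 5)
protos = fit_prototypes(support_x, support_y)
print(predict(protos, np.array([[0.2, -0.1], [3.8, 4.2]])))  # -> [0 1]
```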
Human augmentation-first approach. By limiting AI models to augmenting the security professional’s workflows and allowing human and artificial intelligence to work in tandem, these models could be applied to very narrow, well-defined security applications, which by their nature require less training data. These AI guardrails could take the form of human intervention or of rule-based algorithms that hard-code human judgment. It is no coincidence that a growing number of security vendors favor AI-driven solutions that augment the human-in-the-loop instead of replacing human judgment altogether.
Regulators could also look favorably on this approach, since they look for human accountability, oversight, and fail-safe mechanisms, especially when it comes to automated, complex, and “black box” processes. Some vendors are trying to find middle ground by introducing active learning or reinforcement learning methodologies, which leverage human input and expertise to enrich the underlying models themselves. In parallel, researchers are working on enhancing and refining human-machine interaction by teaching AI models when to defer a decision to human experts.
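As a rough illustration of that deferral pattern (a hypothetical sketch, not any vendor’s actual logic), the model below only issues an automated verdict when its confidence is high, and routes ambiguous cases to a human analyst:

```python
# Illustrative human-in-the-loop triage: act automatically only when confident,
# defer borderline cases to a human analyst. Thresholds are invented for the sketch.
from dataclasses import dataclass

@dataclass
class Decision:
    label: str          # e.g. "malicious" / "benign" / "needs_review"
    confidence: float   # model's estimated probability that the event is malicious
    deferred: bool      # True if routed to a human for review

def triage(score: float, defer_band: tuple[float, float] = (0.3, 0.7)) -> Decision:
    """score: hypothetical model probability that an event is malicious."""
    low, high = defer_band
    if low <= score <= high:                      # too uncertain: defer to a human
        return Decision("needs_review", score, deferred=True)
    label = "malicious" if score > high else "benign"
    return Decision(label, score, deferred=False)

print(triage(0.95))  # confident -> automated verdict
print(triage(0.55))  # ambiguous -> deferred to a human analyst
```

The design choice here is the point: the narrower the band of cases the model is allowed to decide on its own, the less training data it needs to do that job well, and the clearer the accountability trail becomes.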
Leveraging hardware improvements. It’s not yet clear whether dedicated, highly optimized chip architectures and processors alongside new programming technologies and frameworks, or even completely different computerized systems, will be able to accommodate the ever-growing demand for AI computation. Tailor-made for AI applications, some of these new technological foundations, which closely bind and align specialized hardware and software, are more capable than ever of performing unimaginable volumes of parallel computations, matrix multiplications, and graph processing.
Additionally, purpose-built cloud instances for AI computation, federated learning schemes, and frontier technologies (neuromorphic chips, quantum computing, etc.) might also play a key role in this effort. In any case, these advancements alone are not likely to curb the need for algorithmic optimization that might “outpace gains from hardware efficiency.” Still, they could prove to be critical, as the ongoing semiconductor battle for AI dominance has yet to produce a clear winner.
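For readers unfamiliar with the federated learning schemes mentioned above, here is a simplified, FedAvg-style sketch (a toy linear model on synthetic data, purely for illustration): each site trains locally on its own data, and only model weights, never raw records, are shared with the aggregating server.

```python
# Sketch of the federated-learning idea (simplified FedAvg-style aggregation):
# each client trains locally; only weights are shared and averaged centrally.
import numpy as np

def local_update(weights: np.ndarray, X: np.ndarray, y: np.ndarray,
                 lr: float = 0.1, epochs: int = 5) -> np.ndarray:
    """One client's local training: a few gradient-descent steps on a linear model."""
    w = weights.copy()
    for _ in range(epochs):
        grad = X.T @ (X @ w - y) / len(y)
        w -= lr * grad
    return w

def federated_average(client_weights, client_sizes) -> np.ndarray:
    """Server step: weight each client's model by its dataset size and average."""
    total = sum(client_sizes)
    return sum(w * (n / total) for w, n in zip(client_weights, client_sizes))

rng = np.random.default_rng(2)
true_w = np.array([1.5, -2.0])
clients = []
for _ in range(3):  # three "sites", each keeping its raw data local
    X = rng.normal(size=(40, 2))
    clients.append((X, X @ true_w + rng.normal(scale=0.1, size=40)))

global_w = np.zeros(2)
for _ in range(10):  # communication rounds: only weights leave each client
    updates = [local_update(global_w, X, y) for X, y in clients]
    global_w = federated_average(updates, [len(y) for _, y in clients])
print(global_w)  # approaches true_w without ever pooling raw data centrally
```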
The merits of data discipline
Up to now, conventional wisdom in data science has usually dictated that when it comes to data, the more you have, the better. But we’re now beginning to see that the downsides of data-hungry AI models might, over time, outweigh their undisputed advantages.
Enterprises, cyber security vendors, and other data practitioners have multiple incentives to be more disciplined in the way they collect, store, and consume data. As I’ve illustrated here, one incentive that should be top of mind is the ability to elevate the accuracy and sensitivity of AI models while alleviating privacy concerns. Organizations that embrace this approach, which relies on data dearth rather than data abundance, and exercise self-restraint, may be better equipped to drive more actionable and cost-effective AI-driven innovation over the long haul.