The $150M Typo
On the morning of 28th February 2017, an engineer was troubleshooting an issue with the billing system for Amazon’s popular S3 cloud storage service. While attempting to take a few servers offline for debugging, he mistyped an entry, resulting in a cascading failure that affected multiple S3 sub-systems. This led to widespread service disruption across several high-profile websites and services that rely on Amazon’s cloud infrastructure, including Netflix, Airbnb and Slack.
It took over four hours for Amazon to get the systems back to normal. While it is challenging to calculate the exact loss caused by the outage, some estimates put it north of $150 million. It is a classic example of human error having unintended and disastrous consequences. This raises the question: could AI have helped avoid the issue?
AI to the Rescue
Machine Learning and Artificial Intelligence have the potential to solve a wide range of problems in various fields. From customer service to healthcare to traffic management – the possibilities are endless. Today, apps are a critical component of any business – in a lot of cases, the apps are the business! So in this blog, we will explore how Artificial Intelligence can help improve the performance and availability of business-critical applications and the infrastructure they run on.
The complexity of infrastructure required to run today’s modern applications puts IT Operations teams under immense pressure to run at the pace of business. And today, we are starting to see AI coming to the rescue of IT Ops teams to help them sleep better at night. While there is no perfect framework for implementing AI for IT Operations, multiple use cases are starting to emerge which are real and implementable.
One of the key areas where AI is already making an impact in IT Operations is the automation of routine tasks such as monitoring, troubleshooting, server provisioning, application deployment and system updates, which frees up IT staff to focus on more strategic initiatives. AI can also analyze data on IT resource usage to help IT teams optimize capacity and avoid over-provisioning. Of course, this requires robust process runbooks that the automation can execute.
One of the most valuable use cases for AI is to analyze historical data to identify patterns and trends that can help predict future issues. This can help IT Operations teams proactively address potential problems before they become complex issues that impact business operations. By analyzing data from sensors and other sources, AI can predict when IT equipment is likely to fail, allowing IT teams to perform maintenance proactively.
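To make the idea of predicting issues from historical data concrete, here is a minimal sketch in Python. It fits a straight line to daily disk-usage readings and estimates how many days remain before the disk fills up. The function name, the sample data and the choice of a simple linear trend are assumptions made for illustration; real AIOps tools typically use far richer models and data.

```python
from typing import Optional, Sequence


def days_until_full(usage_pct: Sequence[float],
                    capacity_pct: float = 100.0) -> Optional[float]:
    """Fit a least-squares line to daily usage samples and extrapolate.

    Returns the estimated number of days after the last sample until usage
    reaches capacity_pct, or None if usage is flat or shrinking.
    (Hypothetical helper, not taken from any specific product.)
    """
    n = len(usage_pct)
    if n < 2:
        raise ValueError("need at least two samples to fit a trend")

    days = range(n)  # day 0 is the oldest sample
    mean_x = sum(days) / n
    mean_y = sum(usage_pct) / n

    # Ordinary least-squares slope and intercept.
    sxx = sum((x - mean_x) ** 2 for x in days)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(days, usage_pct))
    slope = sxy / sxx
    if slope <= 0:
        return None  # no growth, so no predicted exhaustion

    intercept = mean_y - slope * mean_x
    day_full = (capacity_pct - intercept) / slope
    return max(0.0, day_full - (n - 1))


# Illustrative data: disk usage (%) sampled once a day for five days.
history = [70.0, 72.0, 74.0, 76.0, 78.0]
remaining = days_until_full(history)
print(f"Estimated days until the disk is full: {remaining:.1f}")
```

With usage growing by two percentage points a day, the estimate gives the team more than a week to add capacity or clean up before users are affected.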
AI can help identify and prioritize incidents and provide real-time insights into their root cause, which helps IT teams reduce downtime and improve service quality. AI-based tools can detect anomalous behavior in IT systems that could indicate a security breach or other issue, and they can analyze large amounts of data to pinpoint the root cause, allowing IT teams to address issues quickly and efficiently.
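As a rough illustration of what detecting anomalous behavior can look like at its simplest, the Python sketch below flags any metric reading that sits far outside the range of the readings just before it. The function name, window size, threshold and latency figures are all assumptions for the example; commercial tools use considerably more sophisticated models.

```python
from statistics import mean, stdev
from typing import List, Sequence


def find_anomalies(samples: Sequence[float],
                   window: int = 5,
                   threshold: float = 3.0) -> List[int]:
    """Return indexes of samples that deviate sharply from the preceding window.

    A sample is flagged when it lies more than `threshold` standard
    deviations from the mean of the `window` samples before it.
    (Hypothetical helper for illustration only.)
    """
    anomalies = []
    for i in range(window, len(samples)):
        baseline = samples[i - window:i]
        mu = mean(baseline)
        sigma = stdev(baseline)
        if sigma == 0:
            # Perfectly flat baseline: treat any change as a deviation.
            if samples[i] != mu:
                anomalies.append(i)
            continue
        if abs(samples[i] - mu) / sigma > threshold:
            anomalies.append(i)
    return anomalies


# Illustrative data: response time in milliseconds, with one obvious spike.
latency_ms = [101, 99, 100, 102, 98, 100, 101, 250, 99, 100]
flagged = find_anomalies(latency_ms)
print("Anomalous sample indexes:", flagged)
```

Here only the 250 ms reading is flagged. Note that a spike also distorts the baseline for the readings that follow it, which is one reason production systems rely on more robust statistics than a plain mean and standard deviation.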
With the volume and complexity of cyber attacks growing rapidly, traditional approaches to security are no longer sufficient. As such, AI is one of the critical tools in the arsenal of organizations seeking to defend themselves from sophisticated, machine-scale attacks. AI has a wide range of applications in the field of security, such as Threat Detection and Prevention, Malware Detection and Analysis, User Behaviour Analytics, Vulnerability Management, Fraud Detection and Security Response Automation.
The Sky is the Limit… or is it?
Overall, AI is expected to have a significant impact on IT operations, helping organizations reduce costs, improve efficiency and enhance security.
As AI continues to evolve, it is likely that we will see even more innovative applications for IT Ops. To stay ahead of the curve, IT Ops teams should consider adopting AI-powered tools and integrating them into their workflows. The ultimate goal of AI is to help IT operations teams move from a reactive posture to a proactive one, where they can anticipate and mitigate issues before they occur. By automating routine tasks, AIOps enables IT teams to focus on strategic projects that can drive innovation and improve business outcomes.
However, there are certain tasks that are unlikely to be fully replaced by AI. These include areas like Strategic Decision Making, Creativity and Innovation, Complex Problem Solving, and Ethics and Values, which depend on human creativity, critical thinking and empathy.
And what better way to conclude this blog than with a poem about AI in IT Ops by ChatGPT:
In IT Ops where data flows,
AI is changing how it goes,
With Algorithms smart and swift,
Tasks are automated, a time-saving gift
No more tedious, manual work,
AI can monitor and detect a quirk
Predictive Analytics, a new tool,
for IT Ops it’s really cool
Through machine learning, it can adapt,
to changes in patterns with an apt,
Eye for detail, it can detect,
Issues before they become a wreck
But amidst all the AI’s automation,
there is still a need for human conversation
To strategize and make decisions,
For IT Ops, it’s still in the vision
AI is a tool, a powerful one,
But without human judgment, it’s just a pun
So let’s embrace the AI revolution,
And use it to solve IT Ops’ resolution.
– by Anand Patil, Senior Director, Systems Engineering, Cisco India & SAARC