originally posted on LinkedIn
This is the play-by-play of my investigation into an abnormal Amazon Web Services (AWS) bill: how I traced back to the root cause, how I learned that Amazon itself was partially to blame, and the resulting outcome. There are no brilliant deductions or magic bullets here -- smart cloud administration usually boils down to (1) the availability of relevant, explorable data, (2) simple proactive alarms, and (3) the patience to wade through Google's increasingly irrelevant search results for answers.
Setting the Stage
I run a modest web empire with very predictable month-to-month costs and web traffic. This array of sites and services has run entirely on AWS since 2015, mostly because the cloud was cool back then and I needed to justify the cost of my first 3 AWS certs.
Halfway through the month of December 2023, I received a CloudWatch alarm projecting a 172% increase in my monthly bill. I did what all busy cloud administrators wish they could do: I turned off the alarm and resolved to figure it all out after Christmas!
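The post doesn't detail exactly how that alarm was configured, but the simplest version of a proactive billing alarm is a CloudWatch alarm on the AWS/Billing EstimatedCharges metric. Here's a minimal sketch with boto3; the alarm name, threshold, and SNS topic ARN are placeholders, not my actual setup:

```python
import boto3

# Minimal billing-alarm sketch (placeholder values throughout).
# Billing metrics are only published to us-east-1, and the account must have
# "Receive Billing Alerts" enabled in the billing preferences.
cloudwatch = boto3.client("cloudwatch", region_name="us-east-1")

cloudwatch.put_metric_alarm(
    AlarmName="monthly-bill-over-50-usd",
    Namespace="AWS/Billing",
    MetricName="EstimatedCharges",
    Dimensions=[{"Name": "Currency", "Value": "USD"}],
    Statistic="Maximum",
    Period=21600,              # billing metrics update roughly every 6 hours
    EvaluationPeriods=1,
    Threshold=50.0,            # alert when estimated charges pass $50
    ComparisonOperator="GreaterThanThreshold",
    AlarmActions=["arn:aws:sns:us-east-1:123456789012:billing-alerts"],
)
```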
Finding the Root Cause
My investigation began in earnest on December 28, using the AWS Cost Explorer dashboard. The basic view of my cost data showed the spike occurring in the cryptically-named "EC2-Other" category, which is like a pu pu platter of miscellaneous charges related to the Elastic Compute Cloud (EC2) service. I had to filter the graph on "EC2-Other" and group by "Usage Type" to get a more detailed breakdown of what was actually in this category.
Applying Filter and Group By criteria will make Cost Explorer data more useful.
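The same filter-and-group exercise can be done programmatically with the Cost Explorer API, which is handy when the console graph gets cramped. A sketch with boto3 follows; the date range is illustrative, and the exact "EC2 - Other" service string is an assumption here, so confirm it with get_dimension_values() if the filter comes back empty:

```python
import boto3

ce = boto3.client("ce")  # Cost Explorer

# Break the "EC2-Other" category down by usage type, day by day.
response = ce.get_cost_and_usage(
    TimePeriod={"Start": "2023-12-01", "End": "2023-12-28"},
    Granularity="DAILY",
    Metrics=["UnblendedCost"],
    Filter={"Dimensions": {"Key": "SERVICE", "Values": ["EC2 - Other"]}},
    GroupBy=[{"Type": "DIMENSION", "Key": "USAGE_TYPE"}],
)

for day in response["ResultsByTime"]:
    for group in day["Groups"]:
        usage_type = group["Keys"][0]
        cost = float(group["Metrics"]["UnblendedCost"]["Amount"])
        if cost > 0.01:  # skip the fractions of a penny
            print(day["TimePeriod"]["Start"], usage_type, round(cost, 2))
```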
The detailed breakdown showed me that only two subcategories of activity were abnormal: DataTransfer-Regional-Bytes, which measures data traffic between Availability Zones (AZs) in a Region, and CPUCredits:t3, which is the Uber surge charge you get slapped with when your EC2 instance is working too hard. It made sense that my server would have high CPU utilization to handle the spike in data transfer, but I knew for certain that all of my web empire was in a single AZ, so there should have been no new cross-AZ traffic.
Step 2 in the investigation was to look at the CloudWatch metrics for each server in my cloud architecture to see which ones were working too hard. The culprit jumped out immediately -- while my servers usually hovered under 10% CPU use, one server's energy levels matched those of my six-year-old on each successive day of Winter Break.
"Why don't you go outside and run another lap around the house before bedtime?"
Step 3 in the investigation was to log into the stressed-out server and examine the access logs for unusual requests. I used the tried-and-true log analyzer WebLog Expert, which has served me well over the past 20 years and immediately pinpointed what was causing the extra web traffic.
Misbehaving web crawlers are so 1998.
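WebLog Expert is a desktop GUI, but the same first pass can be done in a few lines of Python against the raw access log. A sketch assuming the common Apache/nginx "combined" log format, where the user agent is the last quoted field; the log path is a placeholder:

```python
from collections import Counter

# Hypothetical path -- adjust for your web server's access log location.
LOG_PATH = "/var/log/httpd/access_log"

agents = Counter()
with open(LOG_PATH, encoding="utf-8", errors="replace") as log:
    for line in log:
        parts = line.split('"')
        if len(parts) >= 6:
            agents[parts[-2]] += 1   # the user-agent string

# The noisiest crawlers float straight to the top.
for agent, hits in agents.most_common(10):
    print(f"{hits:8d}  {agent}")
```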
According to its developer page, "Amazonbot is Amazon's web crawler used to improve our services, such as enabling Alexa to answer even more questions for customers." While I concede that Alexa needs all the help it can get in this regard, this charity case is not worth $20 more in cloud spend.
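That same developer page says Amazonbot honors standard robots.txt directives, so a User-agent: Amazonbot rule is the obvious lever to reach for. One way to sanity-check such a rule with Python's standard library; the domain and path below are placeholders, and this only verifies that the rule would apply to Amazonbot, not how the crawler actually behaves:

```python
from urllib.robotparser import RobotFileParser

# Placeholder domain -- substitute the affected site.
robots = RobotFileParser("https://www.example.com/robots.txt")
robots.read()

# False means the robots.txt rules would tell Amazonbot to stay out.
print(robots.can_fetch("Amazonbot", "https://www.example.com/archive/page"))
```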
Putting the Pieces Together
With all of my investigative steps documented, I was able to do some research and figure out the root cause.
After Actions
So far, so good!
Lessons Learned