Monday, January 08, 2024

Cloud Troubleshooting Day

originally posted on LinkedIn

This is the play-by-play of my investigation into an abnormal Amazon Web Services (AWS) bill: how I traced it back to the root cause, how I learned that Amazon itself was partially to blame, and the resulting outcome. There are no brilliant deductions or magic bullets here -- smart cloud administration usually boils down to (1) the availability of relevant, explorable data, (2) simple proactive alarms, and (3) the patience to wade through Google's increasingly irrelevant search results for answers.

Setting the Stage

I run a modest web empire with very predictable month-to-month costs and web traffic. This array of sites and services has run entirely on AWS since 2015, mostly because the cloud was cool back then and I needed to justify the cost of my first 3 AWS certs.

Halfway through the month of December 2023, I received a CloudWatch alarm projecting a 172% increase in my monthly bill. I did what all busy cloud administrators wish they could do: I turned off the alarm and resolved to figure it all out after Christmas!

Finding the Root Cause

My investigation began in earnest on December 28, using the AWS Cost Explorer dashboard. The basic view of my cost data showed the spike occurring in the cryptically-named "EC2-Other" category, which is like a pu pu platter of miscellaneous charges related to the Elastic Compute Cloud (EC2) service. I had to filter the graph on "EC2-Other" and group by "Usage Type" to get a more detailed breakdown of what was actually in this category.

Applying Filter and Group By criteria will make Cost Explorer data more useful.
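
The same slice of data is available through the Cost Explorer API if you would rather script it than click around. Here is a minimal boto3 sketch with a placeholder date range; it assumes the service dimension is spelled "EC2 - Other", since the API's spelling differs slightly from the dashboard's label:

    import boto3

    # The Cost Explorer API is served out of us-east-1 regardless of where your resources run.
    ce = boto3.client("ce", region_name="us-east-1")

    # Same view as the dashboard: filter to the "EC2 - Other" service,
    # then group daily costs by usage type.
    response = ce.get_cost_and_usage(
        TimePeriod={"Start": "2023-12-01", "End": "2023-12-28"},  # placeholder range
        Granularity="DAILY",
        Metrics=["UnblendedCost"],
        Filter={"Dimensions": {"Key": "SERVICE", "Values": ["EC2 - Other"]}},
        GroupBy=[{"Type": "DIMENSION", "Key": "USAGE_TYPE"}],
    )

    for day in response["ResultsByTime"]:
        for group in day["Groups"]:
            cost = group["Metrics"]["UnblendedCost"]["Amount"]
            print(day["TimePeriod"]["Start"], group["Keys"][0], cost)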

The detailed breakdown showed me that only two subcategories of activity were abnormal: DataTransfer-Regional-Bytes, which measures data traffic between Availability Zones (AZs) within a Region, and CPUCredits:t3, which is the Uber surge charge you get slapped with when your burstable EC2 instance works too hard for too long. It made sense that my server would have high CPU utilization to handle the spike in traffic, but I knew for certain that all of my web empire was in a single AZ, so there should have been no cross-AZ data transfer at all.

Step 2 in the investigation was to look at the CloudWatch CPU metrics for each server in my cloud architecture to see which ones were working too hard. The culprit jumped out immediately -- while my servers usually hovered under 10% CPU use, one server's energy levels matched those of my six-year-old on each successive day of Winter Break.

"Why don't you go outside and run another lap around the house before bedtime?"

Step 3 in the investigation was to log into the stressed-out server and examine the access logs for unusual requests. I used the tried-and-true log analyzer WebLog Expert, which has served me well over the past 20 years, and immediately found the source of the extra web traffic.

Misbehaving web crawlers are so 1998.
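
WebLog Expert did the heavy lifting for me, but if you don't have a log analyzer handy, a quick tally of user agents in the web server's access log tells much the same story. A rough sketch, assuming Apache's combined log format and a hypothetical log path:

    import re
    from collections import Counter

    # The combined log format puts the user agent in the last quoted field of each line.
    user_agent_pattern = re.compile(r'"([^"]*)"$')

    counts = Counter()
    with open("/var/log/apache2/access.log") as log:  # hypothetical path
        for line in log:
            match = user_agent_pattern.search(line.rstrip())
            if match:
                counts[match.group(1)] += 1

    # Amazonbot identifies itself by name in its user agent string.
    for agent, hits in counts.most_common(10):
        flag = "  <-- crawler?" if "Amazonbot" in agent else ""
        print(f"{hits:8d}  {agent}{flag}")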

According to its developer page, "Amazonbot is Amazon's web crawler used to improve our services, such as enabling Alexa to answer even more questions for customers." While I concede that Alexa needs all the help it can get in this regard, this charity case is not worth $20 more in cloud spend.

Putting the Pieces Together

With all of my investigative steps documented, I was able to do some research and figure out the root cause.

  1. In November, the Amazonbot discovered an instance of MediaWiki running on one of my servers (a book wiki for the Wars of Light and Shadow series by fantasy author Janny Wurts) and decided to index it.
  2. This is educated conjecture based on my logs, but the Amazonbot seems to fail at recognizing that certain URLs represent the same page. For example, it may think that https://test.com/?sessionId=12 and https://test.com/?sessionId=34 are completely different pages even though the number at the end of the URL is just used to identify different visitors (see the sketch after this list). This apparently caused it to build up a backlog of wiki pages it thought it hadn't visited yet, and the number of requests skyrocketed in December. In other words, the Amazonbot is playing SessionID Go: Gotta collect 'em all! and isn't responsibly throttling the resulting requests.
  3. Most importantly, the Amazonbot itself is running in AWS but it's in a different AZ than my servers. So I'm getting charged once for the extra CPU utilization needed to handle the bogus requests (CPUCredits:t3), then charged again to move the requested data into another AZ (DataTransfer-Regional-Bytes).
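
To make point 2 concrete, here is the sort of canonicalization a well-behaved crawler would do before deciding whether a URL is really a new page. This is purely illustrative, using the made-up sessionId example from above rather than my wiki's actual URL scheme:

    from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode

    # Query parameters that identify a visitor, not a page (hypothetical list).
    TRACKING_PARAMS = {"sessionId"}

    def canonicalize(url: str) -> str:
        """Drop visitor-tracking parameters so equivalent URLs compare equal."""
        parts = urlsplit(url)
        kept = [(k, v) for k, v in parse_qsl(parts.query) if k not in TRACKING_PARAMS]
        return urlunsplit((parts.scheme, parts.netloc, parts.path, urlencode(kept), ""))

    # Both of these should count as one already-visited page, not two new ones.
    print(canonicalize("https://test.com/?sessionId=12"))  # https://test.com/
    print(canonicalize("https://test.com/?sessionId=34"))  # https://test.com/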

After Actions

  1. I added a "Disallow" rule to my robots.txt file, which is effectively a polite way to tell the Amazonbot to pound sand the next time it wants to visit my server (see the snippet after this list). If the bot continues to visit, I can rudely block it in my Security Groups instead.
  2. I reactivated my billing alarm so I can stay ahead of the next impending crisis.
  3. I should probably contact the Amazonbot team and let them know about this problem, but it appears that others have already done so and it doesn't seem like there's been any response.
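
For the curious, the robots.txt rule from item 1 amounts to two lines. Amazon's developer page for Amazonbot indicates that the crawler honors standard Disallow rules addressed to its user agent; this version fences it off from the entire site:

    User-agent: Amazonbot
    Disallow: /

A narrower Disallow path would also work if you only want to wall off the part of the site that's being hammered.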

So far, so good!

Lessons Learned

  1. Explore Before You Deduce: Gather as much data as you can before you commit too hard to any one debugging path. Making sense of what's in front of you is a different skillset than figuring out what's going wrong, and jumping to a likely root cause too soon might dissuade you from considering other possibilities.
  2. Use CloudWatch billing alarms: You don't need a vast array of alarms hooked into the innards of your servers' performance to stay financially responsible. A simple "Warn me when my bill goes over $X" alarm is sufficient.
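
For reference, that single alarm is only a handful of lines through boto3. A sketch with a placeholder threshold and SNS topic; note that the AWS/Billing metric only exists in us-east-1 and requires billing alerts to be enabled in the account's billing preferences first:

    import boto3

    # Billing metrics are only published in us-east-1.
    cloudwatch = boto3.client("cloudwatch", region_name="us-east-1")

    cloudwatch.put_metric_alarm(
        AlarmName="monthly-bill-over-threshold",
        Namespace="AWS/Billing",
        MetricName="EstimatedCharges",
        Dimensions=[{"Name": "Currency", "Value": "USD"}],
        Statistic="Maximum",
        Period=21600,               # billing data only updates a few times a day
        EvaluationPeriods=1,
        Threshold=50.0,             # "warn me when my bill goes over $X" -- placeholder X
        ComparisonOperator="GreaterThanThreshold",
        AlarmActions=["arn:aws:sns:us-east-1:123456789012:billing-alerts"],  # placeholder topic
    )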
