It is often lamented that “time manipulations are hard.” However, this can be hard to appreciate considering how insidious these calculations are until you find a bug relating to one yourself.
One common failure mode in microservice architectures is when the system DoSes itself. It is not a surprising concept, but if you haven’t worked with microservices you might have trouble coming up with a concrete example of conditions that would cause this to happen.
The following story explains how both issues came together to allow one user on our platform to bring our system to a halt in one of our development environments. It’s an interesting story because the configuration of one user was enough to use the system’s throughput against itself and overwhelm some of our core services causing system usability to slow down for everyone on the platform.
Setting up the problem
It all started when we noticed the system was painfully slow for users. We proceeded to go through our playbook for checking system health. From there we saw that the service that sends our emails had hundreds of thousands of events queued for processing and that service wasn’t having any errors. Instead it was handling the events it had slowly. The traffic from the email sending service was actually creating significant load on the rest of the system by looking up data from core services. This caused overall system responsiveness to suffer.
This directed our attention one degree upstream to the service generating the events that our email service was handling. To confirm this service was the problem, we turned it off and observed that responsiveness returned to the rest of the system.
Looking deeper into the service itself, we could isolate the events to a fairly small block of code in the problem service. One of the features of the service is scheduling daily email reminders. The bulk of this logic occurs in a single function which after handling one of these events tries to schedule the next one by calculating the timestamp for the next day at 8am.
How the logic works
The general logic for the
calculate_next_event_time function isn’t anything
exceptional, but there are a few steps as follows:
- Start with
nowin UTC, the
timezoneof the user to be notified (e.g. ‘America/New_York’) and a
deltarepresenting “the next day at 8am”.
nowdown to the day and call this
midnight_utc. For example, if now is 16:00 May 5, 2019, then make this 00:00 May 5, 2019 and call this our
- Take the timezone offset between the local user time and UTC and call this
offset. For example, if the user is in New York City, this would be -5 hours.
- Subtract the
midnight_utcthis should give us
midnighttoday in local user time. Call this
- Add the
deltafrom step 1 to the time from step 4 will give us 8am on the following day in local user time.
datetime objects to UTC timestamps, we used the following helper:
from datetime import datetime EPOCH = datetime(1970, 1, 1) def to_utc_timestamp(dt): return (dt.replace(tzinfo=None) - EPOCH).total_seconds()
calculate_next_event_time function looked like this:
from dateutil.relativedelta import relativedelta from datetime import datetime def calculate_next_event_time( now, # e.g. datetime.utcnow(), current machine time in UTC timezone, # e.g.: gettz("Europe/Madrid"), any user-specific timezone ): # Step 1 one_day_eight_hours_later = relativedelta(days=1, hours=8) # Step 2 midnight_utc = datetime.combine(now, time(0)).replace(tzinfo=None) # Step 3 offset = timezone.utcoffset(datetime.utcnow()) # Step 4 midnight_local_time = midnight_utc - offset # Step 5 return to_utc_timestamp(midnight_local_time + one_day_eight_hours_later)
How it breaks
See the problem? Hopefully it isn’t too obvious because it was hard for us to see without plugging in test values.
Let’s break it down. Say that now is 6pm, and the user is in an extreme
Etc/GMT-14, then we have the following initial arguments:
In : now = datetime(2019, 3, 13, hour=18, minute=0, second=0) In : timezone = gettz("Etc/GMT-14") In : to_utc_timestamp(now) Out: 1552500000.0
When we plug this into the
calculate_next_event_time function we get:
In : to_utc_timestamp(now) == calculate_next_event_time(now, gettz("Etc/GMT-14"), relativedelta(days=1, hours=8)) Out: True
The function actually computes a timestamp equal to now! What’s worse is that it will compute the same timestamp for the next 6 hours until the day ends, e.g.:
In : now = datetime(2019, 3, 13, hour=20, minute=0, second=0) In : to_utc_timestamp(now) > calculate_next_event_time(now, gettz("Etc/GMT-14"), relativedelta(days=1, hours=8)) Out: True
Over the 6 hours between 6pm and midnight the system will generate a reminder that is already expired every time the daemon wakes up. Growing unchecked for a few days or weeks or months, this queue for reminders, which was implemented with a relational DB table, swelled to monstrous proportions until even simple queries on it actually spiked the CPU and impacted all services using the same database in addition to flooding the system with email events.
Finding the logical flaw
So where’s the break in the logic? As written, a simplified version of
reminder_time = now - (hours and minutes of now) - (UTC offset of local user) + (1 day 8 hours)
The problem with our code is that it is possible for
(hours and minutes of
now) + (UTC offset of local user) to be greater than
(1 day 8 hours). In the
example above, the problem started at 6pm, or 18 hours after midnight, and the
user was in a timezone offset of UTC+14. Together this gives us a total of 32
hours, or 1 day 8 hours before now.
So the problem with the function as written is that for extreme time zones it may calculate 8am in a time zone on a day that is already in the past.
Fixing the problem in the short and long terms
This is a case of the code being written in the way that seemed simplest in the moment without taking full advantage of the libraries available. This is OK for a first pass of code, and it’s great that we caught this in testing. After finding a problem with this logic, it was time to simplify it and take better advantage of the tools available to us.
An alternate, simpler algorithm might be: “starting from now in user local
time, go to the next calendar day at 8am.” In fact
python supports this
rather nicely with:
def calculate_next_event_time(now, timezone): local_now = now.astimezone(timezone) delta = relativedelta(days=+1, hour=8, minute=0, second=0, microsecond=0) reminder_time = local_now + delta return reminder_time.timestamp()
The new code has numerous simplifications over the original:
- Our original code was working with two time zones: the user timezone and UTC. Now we only deal with one timezone, reducing the risk of time arithmetic errors.
- The new code takes advantage of absolute deltas in its use of
relativedelta. It might be a little misleading for a function called
relativedeltasupports both absolute and relative deltas within the same invocation. The
relativedeltain the snippet says give me the next calendar day (relative) at 8am on that day (absolute).
- The new code uses the built-in
datetime.timestampfunction. It might seem obvious to use the built-in time function, but when this code was written this codebase was still in Python 2.7 and
datetime.timestampwas only added in Python 3.3.
The third bullet point is my favorite because it shows how the context that code exists in changes even when the code does not. When Globality was founded in early 2015, we were still shy (for no good reason) of Python 3. Even with our initial resistance, we eventually realized that it didn’t align with our value of innovation to stay stuck on Python 2.7 forever and we made the upgrade. When we did the upgrade we couldn’t go over every line of code in our codebase to rewrite it, and so we only uncovered our custom and now redundant logic when taking a closer look at this function.
In fact we will go further than just improving the code. This example of failure gives us ideas of how to prioritize future work for system resilience. The configuration of one user shouldn’t be able to bring down the whole system and neither should a poorly behaving service. These are themes we’ve know we needed to prioritize for a while, but this example shows us that heavy user load isn’t the only source of these types of issues and we need to start preparing for them now.
Above I walked through a situation that brought together two common problems in software systems: one with time zones and one with self-inflicted DoS. Hopefully this example sheds some light on how these common problems arise and helps you know where to look when performance suddenly takes a hit for no clear reason.