
We're responding to a post by Netflix explaining their downtime. That post is missing the single most important fact: that they need to be able to fail over across regions. The rest of their explanation is just second-order noise.

Anyone who reads HN can see that a minimum uptime strategy for Amazon is to fail over across regions. Each time there is a major AWS outage, we hear about HN readers whose service was affected even though they spanned availability zones within a single region. But to date, Amazon's regions have operated independently.

That observation does not depend on knowledge of Quantcast (which is, incidentally, far more than a write-only system) or of the other production systems I've built in the last 35 years.

(I'll follow up by email about your support questions.)



A little transparency can make life easier. Try this:

"Don't panic. You are using a backup datacenter. Some very recent queue or account changes may be missing, and some changes you make tonight may be lost. We are working nonstop to resolve this and appreciate your patience."

When stuck, just change the requirements.


Yeah, I'll bet $100 that there will be a global AWS outage (on the scale of the April 2011 or June 2012 outages) in the next 3 years, affecting at least two regions simultaneously for at least one hour.

(I'll assume it will be a routing problem or a software problem.)


Actually, I'd argue the best strategy is to have an alternate cloud provider in each "region".



