Amazon S3 and CloudFlare: How Simple Typos Can Cause a World of Pain
Some people excel at spelling and grammar, and others fail miserably. In fact, we have become very dependent on our word processors, text editors, and email programs to underline, highlight, and autocorrect grammar and misspellings. Developers rely on similar technology in their programming tools to optimize syntax, add comments, and ensure every parenthesis, semicolon, and required punctuation mark is correct before testing code. Unfortunately, however well something is written, its meaning, intention, and instructions can still fail miserably if it contains a typo or can be interpreted in multiple ways. If the written words are application or operating system commands, a single mistyped yet valid character can cause a myriad of problems, from unintentional vulnerabilities to data outages.
In the last few weeks, both CloudFlare and Amazon S3 suffered significant issues due to simple, but valid, typos based on application programming and user commands. While human mistakes will always be present, these two incidents emphasize the need for vulnerability assessment, penetration testing, application control, least privilege management, and command line filtering. These disciplines are a part of every security team’s best practice intentions but sometimes they highlight the need for process refinement as well.
Let’s take a look at what happened in both of these cases, and what we can learn from them.
On Tuesday morning, February 28th, Amazon’s S3 team was working on a cloud-based billing storage solution. An incorrectly typed command during routine debugging caused a five-hour outage across multiple servers and services within Amazon Web Services (AWS) East Coast operations. The command was supposed to affect only a few systems, but because of a typo it affected a much larger group of instances than intended. The result was an outage in which systems went offline one after another, and bringing them back into normal operation required a great deal of manual work.
As stated by Amazon, many of these instances had not been rebooted in quite some time. While no security or operations technology is perfect, and command verification procedures were obviously not present, command line filtering and least privilege could definitely have prevented the outage experienced by Amazon. It is known (based on the Amazon public statement of the outage) that the command issued with a typo was routine and a part of their playbook, but malformed.
How Least Privilege and Command Filtering Could Have Helped
Technology like PowerBroker for Unix & Linux (PBUL) could have mitigated the risk. PBUL has a policy language that can elevate commands via least privilege and inspect all the options and switches (including what is embedded in scripts). It could potentially have identified the command and denied its execution if it was malformed, called other, inappropriate commands or targets (such as more than the intended servers), or was not considered a typical maintenance command assigned to the user for execution.
In command elevation solutions like PowerBroker, the premise is simple. Users are assigned the commands they are allowed to execute, those commands can run elevated without the need for sudo or root, and their contents can be checked for potentially malicious activity. All commands typed, scripts executed, and screen output are logged for future auditing and forensics. Therefore, for Amazon, the incorrectly formatted command could have been denied because of the typo itself, or because it called additional scripts or commands outside the normal functional parameters the policy would have specified. In addition, in debugging mode, the command should probably never have been allowed to execute with privileges against the production environment. Typos and potential issues with inappropriate commands (whether intentional or a mistake) can be mitigated with PowerBroker for Unix & Linux.
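The idea of checking a command against an assigned policy before it runs can be illustrated with a minimal Python sketch. This is not PBUL's policy language or Amazon's tooling: the command name `remove-capacity`, its flags, and the limit of five servers are all hypothetical, chosen only to show how an allowlist with argument checks might reject a mistyped but otherwise valid command.

```python
import shlex

# Hypothetical policy: every name and limit here is illustrative only.
POLICY = {
    "remove-capacity": {                      # assumed maintenance command
        "allowed_flags": {"--subsystem", "--count"},
        "max_count": 5,                       # most servers one run may touch
    },
}


def check_command(command_line: str) -> tuple[bool, str]:
    """Return (allowed, reason) for a command line checked against POLICY."""
    try:
        tokens = shlex.split(command_line)
    except ValueError as exc:                 # for example, unbalanced quotes
        return False, f"malformed command: {exc}"
    if not tokens:
        return False, "empty command"

    name, args = tokens[0], tokens[1:]
    rule = POLICY.get(name)
    if rule is None:
        return False, f"command not assigned to this user: {name}"

    # This sketch expects arguments as flag/value pairs.
    if len(args) % 2 != 0:
        return False, "flag without a value"
    options = dict(zip(args[0::2], args[1::2]))

    unknown = set(options) - rule["allowed_flags"]
    if unknown:
        return False, f"unexpected flags: {sorted(unknown)}"

    count = options.get("--count", "")
    if not count.isdecimal():
        return False, "--count must be a whole number"
    if int(count) > rule["max_count"]:
        return False, f"--count {count} exceeds limit of {rule['max_count']}"

    return True, "ok"
```

Under this assumed policy, `check_command("remove-capacity --subsystem billing --count 3")` is allowed, while the same command with `--count 30` (one extra keystroke) is denied before it can touch anything. A real deployment would also log the decision and the full command for auditing.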
On February 23rd, The Hacker News reported a serious vulnerability affecting millions of sites hosted by CloudFlare due to a simple typo in checking the end of a buffer. According to CTO John Graham-Cumming, the coding fault of the now dubbed ‘Cloudbleed’ vulnerability was that “reaching the end of a buffer was checked using the equality operator and a pointer was able to step past the end of the buffer.” Simply put, “Had the check been done using >= instead of == jumping over the buffer end would have been caught.”
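The difference between the two checks can be seen in a simplified model. The Python sketch below is not CloudFlare's parser code: it uses integer indexes in place of pointers, and the function names and step sizes are assumptions made for illustration. It only shows why an equality test misses a cursor that advances by more than one position at a time.

```python
def equality_check(pos: int, end: int) -> bool:
    """Flawed end test: true only when the cursor lands exactly on the end."""
    return pos == end


def bounds_check(pos: int, end: int) -> bool:
    """Safer end test: true once the cursor is at or past the end."""
    return pos >= end


def positions_read(buffer_len: int, steps: list[int], at_end) -> list[int]:
    """Return every index the scanner reads before the end test stops it."""
    read = []
    pos = 0
    for step in steps:
        if at_end(pos, buffer_len):
            break
        read.append(pos)      # models reading memory at this position
        pos += step           # a step of 2 can jump over the end
    return read


def overruns(buffer_len: int, steps: list[int], at_end) -> list[int]:
    """Return only the indexes read at or beyond the end of the buffer."""
    return [i for i in positions_read(buffer_len, steps, at_end)
            if i >= buffer_len]
```

With a buffer of length 4 and steps `[1, 1, 1, 2, 1, 1]`, the cursor moves from index 3 straight to index 5. The equality check never sees index 4, so `overruns(4, steps, equality_check)` reports reads at indexes 5 and 6, while `overruns(4, steps, bounds_check)` reports none.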
In essence, a simple but valid coding mistake caused the issue, which was originally reported by a Google security researcher, Tavis Ormandy. This is the third time in three weeks that Google’s Security Team has reported on vulnerabilities, from Microsoft to CloudFlare, including publishing proof-of-concept code for a zero-day vulnerability after 90 days of notification. While Ormandy was not actively looking for flaws at CloudFlare, he stumbled across the anomaly after seeing corrupt web pages being returned from certain HTTP requests made through CloudFlare services. This led him to determine that there was a security risk and to notify CloudFlare.
The primary question, however, is still hanging out there. How did this typo get past DAST tools, vulnerability assessments, and even web application scans? The CloudFlare team knew that they had legacy tools that could cause security problems and had been working on a new HTML parser to mitigate some of the risks of existing code. It was just a matter of time and business priorities: mitigating known risks versus disrupting business services. With that in mind, I would commend CloudFlare, since it took only three days from notification of the vulnerability to nearly full remediation of their services and correction of the afflicted code. I would, however, stress that if a security team knows of a problem, their work should be prioritized in the development backlog to remediate the threat.
How Scanning and Web Application Assessments Could Have Helped
To mitigate the risks of potential developer-introduced typos, testing of all applications and operating systems should be conducted on a regular basis from development to production. Using vulnerability assessment solutions for code review, identifying published vulnerabilities via scanning, and conducting regular web application assessments can help prioritize the work for security teams based on identified risks. If development teams are not testing applications and their hosts from development to production, they absolutely should introduce security testing into the process. Simple faults in configuration (like MongoDB) could lead to a world of pain, and basic coding mistakes could land your organization on the front page of the newspaper.
In the end, we are just human. We make mistakes. Computers and software take our commands explicitly and if they are flawed, or the underlying code has flaws, we can introduce risk or outages. BeyondTrust solutions could be your proverbial spelling and grammar checker for the command line and scripts. If you are facing use cases like this, contact us today.