I frequently write and speak about fuzzing with customers, colleagues, and students. I usually start by asking whether anyone is familiar with fuzzing, and the most common answer I hear is some variation of "isn't that just throwing random data at a program?"
That bit about random testing is the most persistent piece of misinformation.
One of the best definitions I've seen comes from H.D. Moore:1
Fuzzing is the process of sending intentionally invalid data to a product in the hopes of triggering an error condition or fault.
—H.D. Moore
The basic premise of fuzzing is very simple. You create invalid, malformed, unexpected, or otherwise troublesome inputs and send them to a piece of software (the target) to see if anything fails.
The three components of any fuzzer are the poet, the courier, and the oracle. The poet figures out which test cases to create, the courier delivers the test cases, and the oracle decides if the target has failed.
Fuzzing allegedly began when Barton Miller, a professor at the University of Wisconsin, attempted to run some normal command lines over a 1200-baud modem connection during a severe thunderstorm.2 Because of the electrical noise, the command arguments got mangled, and the command programs often crashed. Miller subsequently studied the phenomenon methodically and was surprised to see how often software failed in the face of unexpected or badly formed input.
In this example, the poet was the electrical storm, introducing random anomalies into otherwise valid inputs. The courier was the command line itself, and the oracle was Miller’s observation of crashes.
To see a much more sophisticated poet, courier, and oracle, let’s take a look at how the Defensics fuzzer located the Heartbleed vulnerability in 2014. Heartbleed was an outrageously dangerous vulnerability in the OpenSSL open source software component, which at the time was used in about two-thirds of the world’s web servers.
The poet in this case was Defensics’ generational (model-based) approach to creating test cases. Generational fuzzing means that the fuzzer knows exactly how valid inputs should look and creates test cases that are mostly correct but messed up in some way. In this case, the test cases were heartbeat messages in the TLS protocol. The Defensics TLS Server Test Suite knows how TLS conversations should go, so it can have a valid conversation with the target up to a certain point, then deliver a badly formed message.
The courier was simply the Defensics TLS Server Test Suite’s built-in networking capabilities. The test cases were delivered over TCP sockets to the target, an OpenSSL server program.
The oracle was especially talented. Defensics can detect a variety of software failure modes by default, but it used a special capability called SafeGuard to find the Heartbleed vulnerability. SafeGuard checks on the health of the target while tests are running. One of the SafeGuard checks, Amplification, monitors the ratio of incoming and outgoing message sizes. The Amplification check raised an alert while the Defensics R&D team was fuzzing OpenSSL. When the team looked closer, they could see the server was spewing out the contents of its memory, including private keys and user passwords. Based on the seriousness of the vulnerability, the Defensics engineers coordinated with the OpenSSL team to get the bug fixed.
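To make the idea concrete, here is a minimal sketch of an amplification-style oracle check. This is a generic illustration of the technique, not Defensics code, and the 10x threshold is an arbitrary value chosen for the example.

```python
def amplification_alert(request: bytes, response: bytes, max_ratio: float = 10.0) -> bool:
    """Flag a response that is suspiciously larger than the request that produced it.

    A benign heartbeat response should be roughly the size of the request;
    a response many times larger suggests the target is leaking memory.
    The 10x threshold is an arbitrary illustrative value.
    """
    if not request:
        return len(response) > 0
    return len(response) / len(request) > max_ratio
```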
You can write a fuzzer in about five lines of Python, something that uses /dev/random as a source of test cases and passes those test cases to a piece of target software. Such a fuzzer might find you some vulnerabilities, as happened with Miller’s electrical storm. However, using other techniques to create test cases will give you more results in less time.
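A minimal version might look something like the sketch below; the target command name is a placeholder, and the fixed 1 KB test-case size is just an assumption for illustration.

```python
import subprocess

# A bare-bones random fuzzer: the poet is /dev/random, the courier is stdin,
# and the oracle is "did the process die from a signal?"
with open("/dev/random", "rb") as rng:
    while True:
        test_case = rng.read(1024)                                    # the poet
        result = subprocess.run(["./some_target"], input=test_case)   # the courier
        if result.returncode < 0:                                     # the oracle: killed by a signal
            with open("crash.bin", "wb") as out:
                out.write(test_case)
            print("crash! test case saved to crash.bin")
            break
```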
Fuzzing is an infinite space problem. For any piece of software, you can create an infinite number of malformed inputs. To get useful results in a reasonable amount of time, the trick is to select inputs that are most likely to cause failures in the target software.
Let’s say you want to fuzz a piece of target software that parses a JPEG image file. Every JPEG file begins with the two bytes FF D8. The very first thing the target will do is read the first two bytes and see if they are FF D8. If not, the target software will stop processing the file.
If you generate random files and use them as fuzzing test cases, then statistically speaking, only 1 in every 65,536 test cases will have the correct file header and be able to progress in the target software to a point where some vulnerability might be triggered. You might be testing how well the target software handles the first two bytes of a file, but you will be very unlikely to test any of the parsing and processing code for the rest of the file format.
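The arithmetic is simple: two bytes can take 2^16 = 65,536 different values, and only one of those values is FF D8. The sketch below makes the point with a hypothetical parse_jpeg target whose very first check is the header test; almost no purely random test case survives it.

```python
import os

JPEG_SOI = b"\xff\xd8"  # the Start Of Image marker that opens every JPEG file

def parse_jpeg(data: bytes) -> None:
    # Hypothetical target: the first check rejects anything without the marker.
    if not data.startswith(JPEG_SOI):
        return  # processing stops here for 65,535 out of every 65,536 random inputs
    ...         # the interesting (and potentially vulnerable) parsing lives here

# Purely random test cases almost never make it past the first check.
random_cases = [os.urandom(256) for _ in range(100_000)]
survivors = sum(case.startswith(JPEG_SOI) for case in random_cases)
print(f"{survivors} of {len(random_cases)} random test cases reached the real parser")
```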
Another way to think about random fuzzing is the Infinite Monkey Theorem. Yes, monkeys randomly hitting keys on typewriters will eventually produce the works of Shakespeare, but it will take a nonsensically long time.
Random fuzzing will eventually produce inputs that are effective in revealing vulnerabilities, but it will take a very long time. The quality of the test cases is low.
A better way of generating test cases is to start with valid inputs, then introduce anomalies to create test cases. In essence, this is what happened with Miller’s electrical storm.
The fuzzer can flip bits, truncate data, repeat data, and perform other mangling on the valid inputs to produce test cases. This type of fuzzing is also called mutational, because the valid inputs are mutated into test cases.
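A minimal mutation engine might look something like this sketch; the particular mutations and how they are chosen are illustrative, not a recommendation.

```python
import random

def mutate(valid_input: bytes) -> bytes:
    """Turn a valid input into a test case by applying one simple mutation."""
    data = bytearray(valid_input)
    mutation = random.choice(["bit_flip", "truncate", "repeat"])
    if data and mutation == "bit_flip":
        pos = random.randrange(len(data))
        data[pos] ^= 1 << random.randrange(8)        # flip a single bit
    elif data and mutation == "truncate":
        data = data[: random.randrange(len(data))]   # chop off the tail
    elif data and mutation == "repeat":
        pos = random.randrange(len(data))
        data[pos:pos] = data[pos : pos + 16]         # duplicate a small chunk in place
    return bytes(data)
```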
In the JPEG fuzzing example, creating test cases from valid files means that test cases are more likely to travel down interesting paths in the target software. Most test cases will have the magic FF D8 bytes at the beginning, and most of the rest of the structure of the file will also be valid, which means that any anomalies that are added are likely to do a good job in testing the target software.
An interesting variation on this technique is coverage-guided mutation fuzzing, in which the execution path in the target software is used to inform the mutations that are performed to create new test cases. If the execution path for a specific test case ends up in a new place, a new group of mutations are based on that test case in hopes of reaching more of the target software.
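The loop itself is straightforward; in the sketch below, run_with_coverage stands in for whatever instrumentation the fuzzer uses to observe execution paths and is assumed rather than implemented.

```python
import random

def coverage_guided_fuzz(seeds, run_with_coverage, mutate, iterations=100_000):
    """Sketch of a coverage-guided mutation loop.

    run_with_coverage(test_case) is assumed to execute the target and return
    the set of code locations (edges or blocks) that the execution touched.
    """
    corpus = list(seeds)
    seen = set()
    for _ in range(iterations):
        test_case = mutate(random.choice(corpus))
        covered = run_with_coverage(test_case)
        if covered - seen:            # the test case reached somewhere new
            seen |= covered
            corpus.append(test_case)  # keep it as a base for future mutations
    return corpus
```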
Coverage-guided mutation fuzzing has its limits. It is not well suited to network protocol testing, except in the simplest cases, and it might still have trouble reaching format or protocol features that are not represented in the seed inputs. It also requires that the target be instrumented to report code coverage, which usually means access to the source code of the target software.
A generational, or model-based, fuzzer knows everything about the file format or protocol you are testing. It knows what every message should look like, the type of every field, and the meanings of fields like checksums, lengths, or session identifiers. Consequently, a generational fuzzer can produce very high quality test cases with specific anomalies in specific locations, which means it can drive the target software into a wide variety of states to probe for vulnerabilities.
Suppose every message in the format you are testing has a header with a checksum field that is the CRC32 of the message payload. A mutational (template-based) fuzzer might take a valid input with a correct checksum, introduce an anomaly in the payload, but fail to update the checksum. One of the first things the target software will do is verify the checksum; if that check fails, the target will stop processing.
With a generational fuzzer, no matter how weird the payload becomes, the fuzzer will correctly calculate the checksum, which means its test cases will get deeper into the target code and have a good chance of uncovering more vulnerabilities.
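A sketch of that fix-up step, using Python's zlib.crc32 and an assumed message layout (a 4-byte big-endian checksum followed by the payload), might look like this:

```python
import struct
import zlib

def build_message(payload: bytes) -> bytes:
    """Assemble a message under the assumed layout: 4-byte CRC32 header + payload.

    Because the checksum is recomputed after the payload has been mangled,
    even a badly malformed payload still passes the target's integrity check
    and reaches the deeper parsing code.
    """
    checksum = zlib.crc32(payload) & 0xFFFFFFFF
    return struct.pack(">I", checksum) + payload

anomalous_payload = b"\xff" * 65_536                  # an obviously malformed payload
test_case = build_message(anomalous_payload)          # ...with a checksum that still verifies
```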
Also, as we saw with Heartbleed, a generational fuzzer can exchange valid messages in a network protocol, such as the TLS handshaking messages, before delivering an anomalous message, such as a badly formed TLS heartbeat message.
Building a data model can be hard work, but fortunately, Defensics already has well over 250 test suites for a wide variety of network protocols and file formats. If you can’t find what you need, the Defensics SDK makes it possible to build a test suite of your own by defining a data model.
Fuzzing is an excellent method for locating vulnerabilities. For this reason, attackers love fuzzing. Even if you are not fuzzing the software you are making, attackers very well could be. When attackers locate vulnerabilities with fuzzing, they perform additional research to see how those vulnerabilities can be exploited.
The best defense is finding and fixing vulnerabilities before attackers get a chance to find them. Integrating fuzzing into your product life cycle enables you to locate vulnerabilities in the attack surface of your product and, by fixing them, to lower the overall risk for you and your customers.
We’ve only looked at a sliver of the big world of fuzzing. Anything that takes input can be fuzzed; if you start researching, you will find tools for fuzzing everything from individual function calls to web form inputs to radio protocol packets.
Keep in mind the poet, the courier, and the oracle: how test cases are created, how they are delivered to the target, and how failures are detected.
I will close by giving you two words that apply no matter what kind of security testing you want to perform: automate and integrate. Testing must be part of your normal build and test pipelines so that it runs consistently and reliably. Results should be integrated into your normal issue-tracking system so that you have a systematic approach to driving down risk.
Next time people tell you that fuzzing means random inputs, don’t be shy. Go ahead and explain what fuzzing really means. They’ll thank you later.
1 From the foreword to Fuzzing: Brute Force Vulnerability Discovery by Michael Sutton, Adam Greene, and Pedram Amini. Ironically, while H.D. Moore got it perfectly right in the foreword, the back cover of the book gets it completely wrong: “To ‘fuzz,’ you attach a program’s inputs to a source of random data.”
2 As told by Miller himself in his foreword to the first edition of Fuzzing for Software Security Testing and Quality Assurance by Ari Takanen, Jared D. DeMott, and Charlie Miller.