As a three-time Computer Security Awareness Week CTF finalist, I was very happy when NYU Poly’s Brendan Dolan-Gavitt invited me to give a talk about fcd at this edition’s new SOS workshop, where authors of open-source software come to talk about their projects. This motivated me to finish some features that had been in the works for a while, which I’d like to describe here. Additionally, the talk gave me an opportunity to present fcd to a broader audience. In doing so, I realized that I’ve often written about what fcd does, but not about what I want it to do.

The goals of fcd

Of course, the broad goal of a decompiler is to produce analyzable source code from a binary program. Most decompilers specialize for a given architecture or compiler. This is essentially the current state of fcd: it handles x86_64 ELF programs, and not much else. However, this is not what I want this project to be about.

I like to use cake mix as an analogy for the compilation process. If compiling is like pouring cake mix and other ingredients into a bowl to finally bake it into a delicious cake, then decompiling is about taking the cake and trying to get the cake mix back. Compilation is a very transformative process in which a significant amount of information is destroyed. In fact, since the fastest code is code that does not run, and the smallest code is code that does not exist, we usually evaluate compilers by how much code they are able to destroy. As a result, a final executable is usually fast, efficient, and devoid of information that would be useful to recover its original structure.

The problems of filling these gaps are what fcd wants to be great at. Decompiling is a process that can only exist on top of a disassembling process, and after a little more than a year working on fcd, it is my firm conclusion that disassembly-related problems are generally easier than decompilation problems. I’m shoving a lot of necessities under that rug:

  • Parsing executables;
  • Lifting machine code to IR (fcd’s approach, which I like to refer to as codegen on a budget, is about as cheap as codegen comes);
  • Parsing symbols.

The problems that remain once you’ve taken these away are those that I want fcd to be great at solving. They include:

  • Recovering function parameters;
  • Recovering types;
  • Producing good, C-like output.

As a result, I feel that features like executable parsers are less and less relevant to include in the core C++ code of fcd. I will probably find myself writing more Python extension points and leverage them to provide new inputs to fcd.

DIY executable parsers

Fcd has supported Python optimization passes for almost a year now. What hasn’t been as widely publicized is that fcd can now also accept Python scripts to parse executables. These scripts need to implement a very simple interface:

  • an init(data) function, where data is a byte string containing the executable’s data. The function is called before any other member of the module is used;
  • an executableType variable that contains an arbitrary string identifying the type of the executable;
  • an entryPoints global variable, typed as a list of (virtualAddress, name) tuples;
  • a getStubTarget(jumpTarget) function that accepts the memory location that an import stub function jumps to, and returns a (library name?, import name) tuple (where the library name can be None if it is unknown, which is the case in executable formats that don’t support two-level namespacing, like ELF);
  • a mapAddress(virtualAddress) function that accepts a virtual address and returns the offset in init’s data parameter that contains the information at this address.
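
Before looking at a real format, here is a minimal sketch of a module that satisfies this interface for a headerless "flat" binary. It is only an illustration: the load address, the "flat.start" name and the "Flat binary" type string are assumptions made up for this example, and returning None for an unmapped address or an unknown stub simply mirrors what the PE script does.

```python
# Hypothetical parser for a headerless "flat" binary image.
# LOAD_ADDRESS and the entry point name are assumptions made for this sketch.
LOAD_ADDRESS = 0x10000

executableType = "Flat binary"
entryPoints = []
_imageSize = 0

def init(data):
    """Called first, with the raw bytes of the executable."""
    global _imageSize
    _imageSize = len(data)
    del entryPoints[:]
    # Assume that execution starts at the first byte of the image.
    entryPoints.append((LOAD_ADDRESS, "flat.start"))

def getStubTarget(jumpTarget):
    """A flat binary has no import table, so no stub ever resolves."""
    return None

def mapAddress(virtualAddress):
    """Translate a virtual address to an offset in init's data, or None."""
    offset = virtualAddress - LOAD_ADDRESS
    if 0 <= offset < _imageSize:
        return offset
    return None

if __name__ == "__main__":
    init(b"\x90\x90\xc3")
    print(entryPoints)
    print(mapAddress(0x10002))  # 2
    print(mapAddress(0x10003))  # None
```

Everything that is specific to the format lives in init; the other members only answer questions about the state that it computed.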

This interface has been used to implement Portable Executable parsing, using Ero Carrera’s very good pefile Python module, in about 60 lines.

import pefile
import bisect

stubs = {}
sectionStart = []
sectionInfo = {}

executableType = "Portable Executable"
entryPoints = []

def init(data):
	global stubs
	global sectionStart
	global sectionInfo
	global executableType
	global entryPoints

	pe = pefile.PE(data=data)
	machineType = pefile.MACHINE_TYPE[pe.FILE_HEADER.Machine]
	executableType = "Portable Executable %s" % machineType[len("IMAGE_FILE_MACHINE_"):]
	
	imageBase = pe.OPTIONAL_HEADER.ImageBase
	for section in pe.sections:
		virtualAddress = imageBase + section.VirtualAddress
		bisect.insort(sectionStart, virtualAddress)
		sectionInfo[virtualAddress] = (section.PointerToRawData, section.SizeOfRawData)
	
	# pefile only sets DIRECTORY_ENTRY_IMPORT when the file has imports
	for entry in getattr(pe, "DIRECTORY_ENTRY_IMPORT", []):
		for imp in entry.imports:
			if imp.name:
				stubs[imp.address] = (entry.dll, imp.name)
			else:
				# make up some name based on the ordinal
				stubs[imp.address] = (entry.dll, "%s:%i" % (entry.dll, imp.ordinal))
	
	entry = (imageBase + pe.OPTIONAL_HEADER.AddressOfEntryPoint, "pe.start")
	entryPoints.append(entry)
	
	if hasattr(pe, "DIRECTORY_ENTRY_EXPORT"):
		for export in pe.DIRECTORY_ENTRY_EXPORT.symbols:
			exportTuple = (imageBase + export.address, export.name)
			entryPoints.append(exportTuple)

def getStubTarget(target):
	if target in stubs:
		return stubs[target]
	return None

def mapAddress(address):
	sectionIndex = bisect.bisect_right(sectionStart, address)
	if sectionIndex:
		sectionMaybeStart = sectionStart[sectionIndex-1]
		thisSectionInfo = sectionInfo[sectionMaybeStart]
		pointerOffset = address - sectionMaybeStart
		# an offset equal to the raw size is already one byte past the section
		if pointerOffset < thisSectionInfo[1]:
			return thisSectionInfo[0] + pointerOffset
	return None
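
The section lookup in mapAddress can be tried in isolation, without pefile. The sketch below uses a made-up section table (the addresses, file offsets and sizes are assumptions for illustration only) and treats an offset equal to the raw size as falling outside the section.

```python
import bisect

# Made-up section table: start address -> (file offset, raw size).
section_info = {
    0x401000: (0x400, 0x200),
    0x402000: (0x600, 0x100),
}
section_starts = sorted(section_info)

def map_address(address):
    """Return the file offset backing a virtual address, or None."""
    # bisect_right counts how many section starts are <= address,
    # so the candidate section sits just before that index.
    index = bisect.bisect_right(section_starts, address)
    if index == 0:
        return None  # below the first section
    start = section_starts[index - 1]
    file_offset, raw_size = section_info[start]
    delta = address - start
    if delta < raw_size:
        return file_offset + delta
    return None  # in the gap after the section's raw data

print(hex(map_address(0x401010)))  # 0x410
print(map_address(0x401200))       # None
print(map_address(0x400000))       # None
```

Keeping the start addresses sorted is what makes the binary search valid; the dictionary is only there to get from a start address to the rest of the section's details.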

This script was not announced or widely distributed because fcd doesn’t support Windows executables very well. The main reason is that MSVC++ will very frequently use a custom calling convention for functions that are not externally linked, which is rather poorly supported at the moment. However, a similar Mach-O parser could be implemented and used on x86_64 executables, since Clang consistently uses the x86_64 System V ABI calling convention everywhere on macOS.

Using headers as poor man’s symbols

Over the course of the week, fcd finally gained the ability to reference in-executable functions from headers. This means that if you know the signature of a function contained in an executable, you can use a header file to better inform fcd. The declaration for that function has to be annotated with the special FCD_ADDRESS attribute macro. For instance, for the example program from the original announcement post, you could add this declaration to a header:

int main(int argc, const char** argv) FCD_ADDRESS(0x040045e);

Assuming that the address of the main function is indeed 0x040045e, fcd will “recover” the prototype as uint32_t main(uint32_t argc, uint8_t** argv). As you can see, some information is lost in translation: the signedness of the integer and the constness of the argv parameter. This is because signedness and constness are concepts that do not exist in LLVM’s type system, and no effort has been made yet to carry that information over.
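
To see why the round trip is lossy, here is a toy model, not fcd's actual code. It assumes a small table of C integer types, lowers them to LLVM-style type names (using the typed-pointer spelling such as i8**), and then names them again with fixed-width C types. The tables and function names are hypothetical.

```python
# Toy lowering table: signed and unsigned C integers share one LLVM type.
C_TO_LLVM = {
    "char": "i8", "signed char": "i8", "unsigned char": "i8",
    "short": "i16", "unsigned short": "i16",
    "int": "i32", "unsigned int": "i32",
    "long long": "i64", "unsigned long long": "i64",
}

# Going back, the only thing left to name a type after is its width.
LLVM_TO_STDINT = {
    "i8": "uint8_t", "i16": "uint16_t", "i32": "uint32_t", "i64": "uint64_t",
}

def lower(c_type):
    """Lower a simple C type spelling to an LLVM-like type name."""
    stars = c_type.count("*")
    base = c_type.replace("*", "")
    # const has no counterpart in the lowered type, so it is dropped here.
    base = " ".join(word for word in base.split() if word != "const")
    return C_TO_LLVM[base] + "*" * stars

def raise_type(llvm_type):
    """Name an LLVM-like type with a fixed-width C type."""
    stars = llvm_type.count("*")
    return LLVM_TO_STDINT[llvm_type.rstrip("*")] + "*" * stars

for c_type in ("int", "const char**"):
    print(c_type, "->", lower(c_type), "->", raise_type(lower(c_type)))
```

Since int and unsigned int both lower to i32, nothing in the lowered type says which one the declaration used, and const disappears in the same step.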

Under the hood, FCD_ADDRESS has the following definition:

#define FCD_ADDRESS(x) __attribute__((annotate("fcd.virtualaddress:" #x)))

Fcd parses the stringified argument (#x) with strtoull in base 0, meaning that you can write the address in decimal, in octal (leading 0) or in hexadecimal (leading 0x), though I expect that base 16 will be the most popular. Of course, this also means that the address has to be an integer literal, and not, say, a C++ constant expression.
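
As a rough illustration of that parsing step, the sketch below pulls addresses out of annotation strings and mimics the base detection that strtoull performs with base 0. It is an approximation for well-formed literals only: the real strtoull stops at the first invalid character instead of raising an error, and the helper names here are hypothetical.

```python
PREFIX = "fcd.virtualaddress:"

def parse_base0(text):
    """Approximate strtoull(text, NULL, 0) for well-formed unsigned literals."""
    text = text.strip()
    if text.lower().startswith("0x"):
        return int(text[2:], 16)   # 0x prefix: hexadecimal
    if len(text) > 1 and text.startswith("0"):
        return int(text, 8)        # leading 0: octal
    return int(text, 10)           # anything else: decimal

def addresses_from_annotations(annotations):
    """Collect the addresses carried by fcd.virtualaddress annotations."""
    return [
        parse_base0(annotation[len(PREFIX):])
        for annotation in annotations
        if annotation.startswith(PREFIX)
    ]

print(hex(parse_base0("0x040045e")))  # 0x40045e
print(parse_base0("0755"))            # 493
print(addresses_from_annotations(
    ["fcd.virtualaddress:0x040045e", "unrelated"]))  # [4195422]
```

Note that Python's own int(text, 0) is not a drop-in replacement, because it rejects C-style octal literals such as 0755.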

Eli Friedman was kind enough to answer my question on the cfe-dev mailing list about which attribute could be used to carry that kind of information. It turns out that annotate has no meaning of its own to the compiler, and can be used to convey just about any information that you can serialize to a string. It can be specified multiple times, so I expect to use it again in the future to specify more information.

(Incidentally, in the strange case where the same function exists in two places in an executable, you can use multiple FCD_ADDRESS attributes to tell fcd that this prototype applies to multiple addresses.)

Fcd will recognize any function with the FCD_ADDRESS attribute as an entry point. This means that you can now use it to specify additional entry points instead of the -e command-line switch (unless you’re doing partial disassembly, in which case you still need the switch).

This change is powered by a new EntryPointProvider interface. It is currently implemented by two classes (Executable and HeaderDeclarations), but it would be a fair candidate for Pythonization. I will probably get to it once I come up with an acceptable design for an interface that provides full-on symbol information: there are already enough ways to specify new entry points, and any new way to do that will almost certainly carry actual symbol info. Headers already do, but I’m thinking about PDB/DWARF symbol parsers, for instance.

Thank you CSAW!

It was a blast this year again. I thought that last year was going to be my last time, but I’m very happy to have been proved wrong! It was a great experience to talk about fcd and gather feedback.

The presentation was recorded and the link might eventually find its way to this page.