Parsing headers for fun and profit
One of the many challenges of decompiling programs is figuring out the parameters of functions that don’t have a visible body. There are two main situations in which this can happen:
- the function is called indirectly;
- the function immediately jumps indirectly to somewhere else.
Of note, dynamic linker stubs create a lot of functions that immediately jump indirectly to somewhere else. Now, the great thing about dynamic linker stubs is that they often reference the name of an external function to call, and it’s quite possible that headers for the library are available.
Dynamic linkage primer
Since dynamic linkers are complicated beasts, I have no intention of diving very deep into their inner workings. However, for the sake of this post, it’s useful to put out some basics about linkage (specifically targeting on x86_64 Linux ELF executables using ld.so
). The basic idea is that when you pull in a function from a shared object in your program, the compiler creates a stub function. For instance, objdump
might show this for a stub to puts
:
0000000000400480 <puts@plt>:
400480: jmp QWORD PTR [rip+0x2008a2]
400486: push 0x5
40048b: jmp 400470
This function is what your program actually calls when it tries to call puts
. Here, @plt
stands for Procedure Linkage Table, and the address QWORD PTR [rip+0x2008a2]
points right into it. On a call to puts
, execution is transferred to the code pointed to by this entry in the PLT.
To save time on startup, the PLT is populated lazily. Initially, the entry at QWORD PTR [rip+0x2008a2]
actually points right back to 0x400486
, which is the instruction just after the initial jump. This means that the first time that you call puts
, the rest of the stub is executed. We can interpret the two instructions as some pseudo-function call to the routine at 0x400470
passing 0x5 as a parameter, where 0x400470
goes back into ld.so
. From there, the dynamic linker reads the metadata associated to the fifth entry, finding out that the import we’re interested with is named puts
. Then, it iterates dynamic libraries in order, until it finds one that exports that name. Ld.so
then writes back the address of that import in the PLT and transfers execution to it. That way, the next time you call puts
, you won’t need to look it up again.
Stub symbols generally don’t have their name stored in symbol tables. Tools that show names like puts@plt
“make them up” by parsing metadata just like ld.so
would have.
Of course, this means that fcd can do it too, and it does.
Unfortunately, in most cases, the function name doesn’t say a lot. For instance, exit
doesn’t say that the function accepts an integer and does not return. And since the function’s body is in a different library, it’s probably not possible (or at least, not practical) to get the implementation of that function.
Just put a compiler in your decompiler
Up until recently, fcd had a hard-coded list of approximately 50 glibc functions with their number of parameters, whether they returned a value, and if they were variadic. The parameter and return information did not include actual types. I added entries to that list as I tested programs that used new functions. This approach doesn’t scale very well.
Fortunately, there is an authoritative source of function signatures on almost every system out there: header files. We could solve this problem rather elegantly, and allow extensibility, by parsing .h files and using the information to determine function parameters.
As it turns out, fcd already links against LLVM and requires Clang to lift machine code to LLVM IR. Being that it’s already this close to link against Clang, it’s not a huge step to take.
The great thing about Clang, of course, is that it’s an actual compiler. This solution is not some half-working, in-house C parser that explodes at the slightest hint of a macro: it’s an industrial-grade and proven compiler that actually knows what it’s doing. It’s also extremely convenient that Clang is the only compiler (to my knowledge) that will happily parse Linux, Darwin (iOS, macOS) and Windows headers. Even though fcd only supports ELF executables at the moment, it’s good that Clang will not be a limitation in the foreseeable future.
Initially, I tried to integrate Clang to fcd by using libclang
. While it worked for the task of parsing headers, it had a number of downsides:
Libclang
statically links LLVM too, so fcd’s address space had two copies of LLVM living side-by-side. This is not a huge problem sincelibclang
doesn’t leak any LLVM object out, so there’s no chance of mixup between the two, but it still feels clumsy.Libclang
only exposes a small subset of the possible attributes that a function can receive. While most attributes have a larger impact on compilation, some of them provide very useful insight for decompilation as well. For instance, a call to anoreturn
function (one of the attributes thatlibclang
doesn’t expose) terminates a basic block just like a return instruction, but if a decompiler doesn’t know that, it will think that the function continues beyond the call, which will (at best) make a huge mess.- It’s good that
libclang
doesn’t leak any of its internal LLVM details when you have two instances of LLVM side by side, but it also means that it can’t be used to take aclang::FunctionDecl
(which is what we essentially get out oflibclang
) and extract allvm::FunctionType
out of it (which is what fcd needs).
Because of that, fcd links against the Clang static libraries. They solve these problems at the cost of an inscrutable memory ownership model. (Well, they almost solve these problems; some massaging still required to get a FunctionType
out of a FunctionDecl
.) If this is of any interest, the current implementation lives in fcd/header_decls.cpp.
Passing headers to fcd
To support this new feature, fcd gains two new command-line options:
-I
to add an include directory in the search path (can be specified multiple times);--header
to#include
a specific header.
While nothing prevents you from writing your own header file and including it, right now, fcd will only use the information for function stubs. Allowing users to somehow pass their knowledge of the program that they’re decompiling to fcd is definitely on the radar, though.
Here’s a small program that will write “Hello World!” to a file whose path is passed as the first parameter:
Without header information, fcd does a rather poor job at figuring out what’s going on:
However, with header information, we get something that actually makes sense:
Interestingly, we can see that the compiler promoted the call to fputs
into a call to fwrite
.
Of course, this example highlights that fcd doesn’t do a great job with string literals (and that it doesn’t know about main
’s signature). However, getting better type information is an obvious first step in in determining what should be displayed as a string literal.
Passing foreign headers to fcd
Anyone who’s been watching fcd is probably aware that my main development environment is macOS. However, while macOS standard library headers are generally source-compatible with Linux standard library headers, they are certainly not equivalent. Unfortunately, you can’t easily “just include” your own machine’s headers if you’re decompiling a program that targets a different platform. For instance, on macOS, fopen
boils down to _fopen
(it does not on Linux), and on the other side of the fence, putc
becomes _IO_putc
(it does not on macOS).
Fortunately, getting Linux headers on macOS (or any other platform) is quite easy. For instance, since I know that the program is an x86_64 Linux program using glibc, it’s possible to just head to a package repository and download the package for libc-devel
. After decompressing the .deb and then the data.tar.gz archive, fcd can be pointed to the right header location with the -I
parameter. For instance, I would use this invocation of fcd on macOS:
$ fcd \
-I /tmp/libc6-dev_2.24-2_amd64/data/usr/include \
-I /tmp/libc6-dev_2.24-2_amd64/data/usr/include/x86_64-linux-gnu \
--header stdio.h \
hello
This command has the same result as running fcd on a machine where these headers are actually installed. (Fcd configures Clang to target the platform of the executable that is being decompiled.)
In a future where fcd supports Mach-O programs, Apple provides downloads to its Libc source (this links points to the macOS 10.11.5 release), which could probably fulfill a similar role on non-Apple platforms.
The Windows situation is a bit more complicated, as the Windows SDK installer is a .exe program that is more involved than a self-extracting archive. Having not looked at the SDK license, it is possible that this use violates the EULA, too. That bridge will be crossed in time.
Looking forward
This new feature unblocks a lot of type information that can be used for inference, and this will probably be my next focus. It’ll also be interesting to add C++ support here, as the feature introduced a slight regression: the hard-coded list contained mangled C++ names, which are no longer available. C++ has its own set of challenges, notably virtual dispatch and templates.
Additionally, using headers to specify information about functions found in the program itself is a great point of interest. One option that could be explored is adding a new attribute, in fcd only, that specifies the virtual address of a function. However, it is unclear how feasible adding a new attribute is for out-of-tree Clang consumers.
Finally, at the moment, fcd has its own code to take a function prototype and figure out which locations will be used to pass parameter. It is known that LLVM can do that translation too, and it can certainly do it better. However, the specifics are nebulous. This will hopefully be investigated at some point in the future.