Generated code should be avoided. This article explains why.
There are two ways to write code: write code so simple there are obviously no bugs in it, or write code so complex that there are no obvious bugs in it. – Tony Hoare
Source code should be readable by humans and analyzable by computers. For this, it is important that it is clearly structured, with few branching points and loops (computer scientists call that low cyclomatic complexity), and that it holds few surprises.
One surprise is if the source code cannot be found because it is generated dynamically at compile time. Generated code is harder to understand and analyze. Its analysis depends on the target platform and its configuration, or requires a meta-discussion about the behaviour of the code generator across all platforms.
Another surprise is if the dynamic code generation follows rules
different from the programming language. A well-known example is this
definition of a maximum:
#define DNS_PP_MAX(a, b) (((a) > (b))? (a) : (b))
In this macro, either a or b will be evaluated twice, which can
lead to unexpected side effects. Good C programmers know about this,
but the use of this macro can be hidden in other macros:
#define dns_p_calcsize(n) \
    (offsetof(struct dns_packet, data) + DNS_PP_MAX(12, (n)))
Here, n will be evaluated twice if it is 12 or larger; otherwise it will
be evaluated only once. Checking these things in an audit is tedious.
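The double evaluation is easy to reproduce. The following is a minimal sketch in C++ (the macro itself is valid in both C and C++). The helpers `next_value` and `count_macro_evaluations` are hypothetical names introduced only for this illustration.

```cpp
#include <cassert>

// Same definition as in the text: the selected argument appears twice
// in the expansion, once in the comparison and once as the result.
#define DNS_PP_MAX(a, b) (((a) > (b)) ? (a) : (b))

// Hypothetical helper with a visible side effect: it counts its own calls.
static int g_calls = 0;
static int next_value(int v) {
    ++g_calls;
    return v;
}

// Returns how often the macro evaluated its second argument for a given n.
int count_macro_evaluations(int n) {
    g_calls = 0;
    int result = DNS_PP_MAX(12, next_value(n));
    assert(result == (n > 12 ? n : 12));
    return g_calls;
}
```

With this sketch, `count_macro_evaluations(20)` reports two evaluations and `count_macro_evaluations(5)` reports one, even though the call site looks like an ordinary function call.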
In this article, code generation means dynamic code generation at compile time. If a source code file is generated once and committed to the repository, then that file becomes a normal part of the source similar to a handwritten file. Statically generated code can be hard to read, depending on the generator, but it is probably not hard to analyze automatically.
In GnuPG, dynamically generated code comes from three sources:
Preprocessor definitions in config.h.
*.in template files with variable substitution (check configure.ac for their uses). Usually these are the Makefiles, Windows resource files (containing version numbers), and library configuration files (containing installation paths).
BUILT_SOURCES variables in Makefile.am.
GnuPG has close to 300 possible preprocessor definitions in
config.h. Some are flags to enable or disable pieces of code, some
control the behaviours of system headers (_GNU_SOURCE,
_POSIX_PTHREAD_SEMANTICS), others are global string constants to
configure default paths etc.
Many of these flags have been introduced to support a diversity of platforms and platform versions with varying support for standard library and compiler features. Without continuous integration tests and a well-defined set of target platforms, these options will only accumulate, and never be removed, even if the platform for which they were introduced is long abandoned.
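To illustrate why such flags complicate analysis, here is a minimal sketch. The flag `HAVE_HYPOTHETICAL_MEMCHR_PATH` and the function `bounded_length` are invented for this example and do not appear in GnuPG.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>

// Hypothetical flag in the style of a generated config.h (not a real GnuPG
// definition). Whether it is defined decides which body the compiler sees.
// #define HAVE_HYPOTHETICAL_MEMCHR_PATH 1

// Length of s, but never more than max.
std::size_t bounded_length(const char* s, std::size_t max) {
    assert(s != nullptr);
#ifdef HAVE_HYPOTHETICAL_MEMCHR_PATH
    // Variant A: compiled only when the flag is set.
    const void* end = std::memchr(s, '\0', max);
    return end ? static_cast<std::size_t>(static_cast<const char*>(end) - s)
               : max;
#else
    // Variant B: portable fallback, compiled in every other configuration.
    std::size_t n = 0;
    while (n < max && s[n] != '\0') {
        ++n;
    }
    return n;
#endif
}
```

A reviewer or static analyzer working on one configuration only ever checks one of the two bodies. With N independent flags of this kind, there can be up to 2^N variants of the program to consider.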
Here is a table with the number of preprocessor macros per package:
Besides Makefiles, GnuPG uses library configuration scripts and Windows resource files that contain some variable substitution. The library configuration scripts provide functionality similar to pkg-config, so they contain the installation paths for the shared libraries, which are only known at build time. Similarly, the Windows resource files contain some package and version information that should not be duplicated, but derived from the global configuration of the source.
These files usually don’t contain program code, or the program code itself is not dynamic but only some variables are set. As such, they are not further discussed here.
In the past, the main include files have also been generated in this way, because they contained version information. By now, the public header files for GnuPG-related libraries are generated by scripts, because the generation process has accumulated more and more features, and simple variable substitution was not deemed sufficient anymore.
The following files are generated at build time:
src/err-sources.h src/err-codes.h src/code-to-errno.h src/code-from-errno.h src/err-sources-sym.h src/err-codes-sym.h src/errnos-sym.h
These files are generated by awk scripts that read a list of error codes from a data file. I wrote this after reading “How to write shared libraries” by Ulrich Drepper. In section 4.2.1, Drepper suggests a method to optimize link time for string tables by reducing the number of relocations. In hindsight, this was an ill-advised micro-optimization.
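The general idea behind such a string table can be sketched as follows. This is an illustration of the technique under stated assumptions, not the generated libgpg-error code: the messages, the names `msg_text`, `msg_offset` and `message_for`, and the offsets are all made up for the example.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical messages. One contiguous character array holds every
// message, with '\0' separating the entries.
static const char msg_text[] =
    "Success\0"
    "General error\0"
    "Out of memory";

// Start offset of each message inside msg_text. Plain integers need no
// relocation at load time, unlike an array of `const char*` pointers.
static const unsigned short msg_offset[] = {0, 8, 22};

// Looks up the message for a (hypothetical) numeric code.
const char* message_for(std::size_t code) {
    const std::size_t count = sizeof(msg_offset) / sizeof(msg_offset[0]);
    assert(code < count);
    return msg_text + msg_offset[code];
}
```

Here `message_for(1)` points at "General error". Keeping the offsets in sync with the text by hand is error-prone, which is why this layout tends to require a generator in the first place.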
These files dynamically adjust the exported symbols depending on the
Windows platform (Windows 32 vs Windows CE), and provide a
substitution for the global
errno variable on Windows CE, which does
not have one.
This file is generated by a helper program that
inspects the binary format of a static pthread mutex initializer for
the target platform, and generates a binary compatible definition for
gpg-error.h. The libgpg-error version of the lock initializer
includes a version number, which in principle would allow libgpg-error
to be more binary compatible than libc and
For once, gpg-error does not do this at build time for every platform,
but keeps a copy of the data for every supported target system in
src/syscfg, leading to a high maintenance burden and frustrated
contributors (see T2370, GnuPG
These files are generated because they contain other generated material. In particular, they contain copies of the error codes and sources, Windows-specific support, and the pthread-compatible lock initializer.
These files again provide static string tables with a minimum number of relocations, similar to gpg-error above.
Similarly to gpg-error, the public header file for libassuan is composed of generic and platform-specific parts, because it wraps a large number of POSIX and Windows interfaces, mainly for socket communication. There are 12 snippets that can be included, depending on platform.
libksba uses gnulib to provide GNU-compatible replacements for missing or incompatible POSIX functions on the target system. For alloca, it can even provide a replacement for the system header file.
asn1-parse.c is a standard yacc-generated parser, and the file
asn1-tables.c is generated by a helper program. Both files are
generated by the maintainer and included in the distribution, so I
don’t consider them to be dynamically generated code. They are
included here for completeness.
In NeoPG, we don’t use generated code.
The string tables have been de-optimized to simple switch statements.
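As a rough sketch of what such a switch-based lookup can look like (the enum `ErrorCode`, its values and the function `to_string` are hypothetical, not NeoPG's actual definitions):

```cpp
#include <cassert>

// Hypothetical error codes for illustration only.
enum class ErrorCode { Success, General, OutOfMemory };

// Maps an error code to its message with an ordinary switch statement.
const char* to_string(ErrorCode code) {
    switch (code) {
        case ErrorCode::Success:     return "Success";
        case ErrorCode::General:     return "General error";
        case ErrorCode::OutOfMemory: return "Out of memory";
    }
    // Reached only for values outside the enum.
    assert(false && "unhandled ErrorCode");
    return "Unknown error";
}
```

Everything is visible in one handwritten file, and because the switch has no default label, common compilers can warn when a newly added enum value is not handled.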
Macros are often unnecessary in C++ because it supports template metaprogramming (which is type-safe) and a more complete standard library.
In fact, template metaprogramming is powerful enough to support domain-specific languages within the language. PEGTL, which I plan to use as a parser generator, does not require manual compile-time code generation. Metaprogramming can be hard to understand for humans, but it is a part of the language and static analysis tools can be expected to support it.
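For a simple case such as a maximum, a function template is enough to replace the macro. The sketch below is an illustration, not code from NeoPG. The names `max_of`, `next_value` and `count_template_evaluations` are hypothetical.

```cpp
#include <cassert>

// A type-safe replacement for a MAX-style macro. Function arguments are
// evaluated exactly once before the call, so side effects are not repeated.
template <typename T>
constexpr const T& max_of(const T& a, const T& b) {
    return (a > b) ? a : b;
}

// Hypothetical helper with a side effect, used to count evaluations.
static int g_calls = 0;
static int next_value(int v) {
    ++g_calls;
    return v;
}

// Returns how often the argument expression was evaluated for a given n.
int count_template_evaluations(int n) {
    g_calls = 0;
    int result = max_of(12, next_value(n));
    assert(result == (n > 12 ? n : 12));
    return g_calls;
}

// The function is constexpr, so it remains usable in compile-time contexts.
static_assert(max_of(3, 7) == 7, "max_of must return the larger value");
```

Regardless of `n`, the argument is evaluated once. A call with mismatched types such as `max_of(1, 2.0)` is rejected by the compiler instead of being silently converted.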
Platform specific code is replaced by a simpler architecture that does
not rely on specific POSIX features as much as GnuPG, and by using a
more complete standard library. For example, C++11 has
std::thread, so there is no need for a home-grown extensive platform
support library. I will write more about removing libgpg-error, npth
and libassuan from the code base in a later blog post.
So far, NeoPG only sets a couple of compile-time flags:
-D_DARWIN_C_SOURCE to work around a bug in Xcode 9.1
It also sets a global version number in src/config.h. Of course, there are also a bunch of legacy defines for existing code.
At this point, I don’t know if it will stay this minimal, because NeoPG currently only supports GNU/Linux and macOS. Support for Windows, Android and iOS is still missing. So, we will have to look back at this issue some time in the future.