No generated code in NeoPG

Generated code should be avoided. This article explains why.

Source code should be obvious

There are two ways to write code: write code so simple there are obviously no bugs in it, or write code so complex that there are no obvious bugs in it. – Tony Hoare

Source code should be readable by humans and analyzable by computers. For this, it is important that it is clearly structured, without many branching points and loops (computer scientists call that low cyclomatic complexity), and that there are few surprises.

One surprise is when the source code cannot be found, because it is generated dynamically at compile time. Generated code is harder to understand and analyze. Its analysis depends on the target platform and its configuration, or requires a meta-discussion about the behaviour of the code generator across all platforms.

Another surprise is if the dynamic code generation follows rules different from the programming language. A well-known example is this definition for a maximum in gnupg/dirmngr/dns.h (f6acd04264):

#define DNS_PP_MAX(a, b) (((a) > (b))? (a) : (b))

In this macro, either a or b will be evaluated twice, which can lead to unexpected side effects. Good C programmers know about this, but the use of this macro can be hidden in other macros:

#define dns_p_calcsize(n) \
  (offsetof(struct dns_packet, data) + DNS_PP_MAX(12, (n)))

Here n will be evaluated twice if it is larger than 12; otherwise it will be evaluated only once. Checking these things in an audit is tedious and error-prone.
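
The double evaluation is easy to make visible. The following is a minimal sketch, not code from GnuPG: the macro is copied from the text above, while `calls`, `next_size` and `evaluations_in_macro` are hypothetical names introduced here to count how often the argument expression runs.

```cpp
// Macro as quoted above from gnupg/dirmngr/dns.h.
#define DNS_PP_MAX(a, b) (((a) > (b)) ? (a) : (b))

// Hypothetical counter, only used to observe evaluations.
static int calls = 0;

// Hypothetical helper with a side effect: each call is counted.
static int next_size(int value) {
    ++calls;
    return value;
}

// Returns how many times the argument expression ran inside the macro.
int evaluations_in_macro(int value) {
    calls = 0;
    int result = DNS_PP_MAX(12, next_size(value));
    (void)result;  // only the side effect matters here
    return calls;
}
```

With these definitions, `evaluations_in_macro(20)` reports two calls and `evaluations_in_macro(5)` reports one, so the number of side effects depends on the data.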

Static code generation is OK

In this article, code generation means dynamic code generation at compile time. If a source code file is generated once and committed to the repository, then that file becomes a normal part of the source similar to a handwritten file. Statically generated code can be hard to read, depending on the generator, but it is probably not hard to analyze automatically.

Generated Code in GnuPG

In GnuPG, dynamically generated code comes from three sources:

  • Autoconf writes a config.h file which usually contains preprocessor definitions, but can also contain other things.
  • Autoconf can write any number of other files generated from *.in template files with variable substitution (check for uses of AC_CONFIG_FILES in configure.ac). Usually these are the Makefiles, Windows resource files (containing version numbers), and library configuration files (containing installation paths).
  • Some files are generated by Makefile rules at build time. These can be found by looking for BUILT_SOURCES variables in Makefile.am.

config.h

GnuPG has close to 300 possible preprocessor definitions in config.h. Some are flags to enable or disable pieces of code, some control the behaviours of system headers (_GNU_SOURCE, _POSIX_PTHREAD_SEMANTICS), others are global string constants to configure default paths etc.

Many of these flags have been introduced to support a diversity of platforms and platform versions with varying support for standard library and compiler features. Without continuous integration tests and a well-defined set of target platforms, these options will only accumulate, and never be removed, even if the platform for which they were introduced is long abandoned.
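
To illustrate why each flag makes analysis harder, here is a small sketch of a flag-guarded function. `HAVE_FAST_COPY` and `copy_strategy` are made-up names, not actual GnuPG definitions; the point is only that the code that gets compiled depends on a definition living in a generated file.

```cpp
// In a real build this line would come from the generated config.h.
// HAVE_FAST_COPY is a hypothetical flag, left undefined here.
// #define HAVE_FAST_COPY 1

// Reports which implementation the preprocessor selected.
const char* copy_strategy() {
#ifdef HAVE_FAST_COPY
    return "native";
#else
    return "portable fallback";
#endif
}
```

Someone reading only this file cannot tell which branch ends up in the binary without also knowing what the configure run produced for the platform in question.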

Here is a table with the number of preprocessor macros per package:

package        preprocessor macros
gnupg          292
libgcrypt      198
libgpg-error   84
libassuan      73
npth           55
libksba        51

AC_CONFIG_FILES

Besides Makefiles, GnuPG uses library configuration scripts and Windows resource files that contain some variable substitution. The library configuration scripts provide functionality similar to pkg-config, so they contain the installation paths for the shared libraries, which are only known at build time. Similarly, the Windows resource files contain some package and version information that should not be duplicated, but derived from the global configuration of the source.

These files usually don’t contain program code, or the program code itself is not dynamic but only some variables are set. As such, they are not further discussed here.

In the past, the main include files have also been generated in this way, because they contained version information. By now, the public header files for GnuPG-related libraries are generated by scripts, because the generation process has accumulated more and more features, and simple variable substitution was not deemed sufficient anymore.

BUILT_SOURCES

The following files are generated at build time:

libgpg-error

src/err-sources.h
src/err-codes.h
src/code-to-errno.h
src/code-from-errno.h
src/err-sources-sym.h
src/err-codes-sym.h
src/errnos-sym.h

These files are generated by awk scripts that read a list of error codes from a data file. I wrote this after reading “How to write shared libraries” by Ulrich Drepper. In section 4.2.1, Drepper suggests a method to optimize link time for string tables by reducing the number of relocations [1]. In hindsight, this was an ill-advised micro-optimization.
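
For readers unfamiliar with the technique, the idea is roughly this: an array of `const char*` pointers needs one relocation per entry in a shared library, whereas one long string plus an array of integer offsets can avoid them. The following is a hand-written sketch of that layout, not the output of the awk scripts; the names `msgstr`, `msgidx` and `error_message` and the three messages are chosen for illustration.

```cpp
#include <cstddef>

// All messages in one array, separated by NUL bytes.
static const char msgstr[] =
    "Success\0"
    "General error\0"
    "Unknown packet";

// Byte offset of each message within msgstr.
static const unsigned short msgidx[] = {0, 8, 22};

// Look up a message by index; returns nullptr when out of range.
const char* error_message(std::size_t code) {
    if (code >= sizeof(msgidx) / sizeof(msgidx[0])) {
        return nullptr;
    }
    return msgstr + msgidx[code];
}
```

The offsets have to be kept in sync with the string, either by hand or by a generator, and that is where the generated headers come from.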

src/gpg-error.def
src/mkw32errmap.map.c

These files dynamically adjust the exported symbols depending on the Windows platform (Windows 32 vs Windows CE), and provide a substitution for the global errno variable on Windows CE, which does not have one.

lock-obj-pub.native.h

This file is generated by a helper program gen-posix-lock-obj which inspects the binary format of a static pthread mutex initializer for the target platform, and generates a binary compatible definition for gpg-error.h. The libgpg-error version of the lock initializer includes a version number, which in principle would allow libgpg-error to be more binary compatible than libc and pthread themselves.

However, gpg-error does not do this at build time for every platform. Instead, it keeps a copy of the data for every supported target system in src/syscfg, which leads to a high maintenance burden and frustrated contributors (see T2370, GnuPG Dev, Debian #869609).

src/gpg-error.h
src/gpgrt.h

These files are generated, because they contain other generated material. In particular, they contain copies of the error codes and sources, Windows specific support, and the pthread-compatible lock initializer. Thus, they need to be generated as well.

GnuPG

common/audit-events.h
common/status-codes.h

These files again provide static string tables with a minimum number of relocations, similar to gpg-error above.

libassuan

src/assuan.h

Similarly to gpg-error, the public header file for libassuan is composed of generic and platform-specific parts, because it wraps a large number of POSIX and Windows interfaces, mainly for socket communication. There are 12 snippets that can be included, depending on the platform.

libksba

libksba/gl/alloca.h

libksba uses gnulib to provide GNU-compatible replacements for missing or incompatible POSIX functions on the target system. For alloca, it can even provide a replacement for the system header file.

src/asn1-parse.c
src/asn1-tables.c

While asn1-parse.c is a standard yacc-generated parser, the file asn1-tables.c is generated by a helper program. Both files are generated by the maintainer and included in the distribution, so I don’t consider them to be dynamically generated code. They are included here for completeness.

NeoPG does not rely on generated code

In NeoPG, we don’t use generated code.

The string tables have been de-optimized to simple switch statements.
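
A sketch of what such a de-optimized table can look like follows. The enum and function names are hypothetical and not taken from the NeoPG source.

```cpp
// Hypothetical error codes for illustration.
enum class ErrorCode { Success, General, UnknownPacket };

// Plain switch: every string is visible next to its code.
const char* error_string(ErrorCode code) {
    switch (code) {
        case ErrorCode::Success: return "Success";
        case ErrorCode::General: return "General error";
        case ErrorCode::UnknownPacket: return "Unknown packet";
    }
    return "Unknown error";  // reached only for values outside the enum
}
```

Nothing has to be generated here, and a compiler with the usual warnings enabled can point out a missing case when a new code is added to the enum.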

Macros are often unnecessary in C++ because it supports template metaprogramming (which is type-safe) and has a more complete standard library.
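
As a sketch of this point, a function template can replace a macro like DNS_PP_MAX while evaluating each argument exactly once. `max_of`, `evaluations`, `counted` and `evaluations_in_template` are hypothetical names; in practice std::max from the standard library already covers this case.

```cpp
// Type-safe replacement for a MAX macro: both arguments must have the
// same type T, and each argument expression is evaluated once.
template <typename T>
constexpr T max_of(T a, T b) {
    return (a > b) ? a : b;
}

// Usable in constant expressions, checked by the compiler.
static_assert(max_of(12, 20) == 20, "max_of picks the larger value");

// Hypothetical counter to observe argument evaluation.
static int evaluations = 0;

// Hypothetical helper with a side effect: each call is counted.
static int counted(int value) {
    ++evaluations;
    return value;
}

// Returns how many times the argument expression ran.
int evaluations_in_template(int value) {
    evaluations = 0;
    int result = max_of(12, counted(value));
    (void)result;  // only the side effect matters here
    return evaluations;
}
```

Unlike the macro, the count is one regardless of whether the argument is larger or smaller than 12.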

In fact, template metaprogramming is powerful enough to support domain specific languages within the language. PEGTL, which I plan to use as a parser generator, does not require manual compile time code generation. Metaprogramming can be hard to understand for humans, but it is a part of the language and static analysis tools can be expected to support it.

Platform specific code is replaced by a simpler architecture that does not rely on specific POSIX features as much as GnuPG, and by using a more complete standard library. For example, C++11 has std::mutex and std::thread, so there is no need for a home-grown extensive platform support library. I will write more about removing libgpg-error, npth and libassuan from the code base in a later blog post.
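
To give an impression of what the standard library provides, here is a minimal sketch using std::thread and std::mutex. It is not NeoPG code; `count_in_parallel` and its parameters are hypothetical.

```cpp
#include <mutex>
#include <thread>
#include <vector>

// Runs `workers` threads that each add `steps` to a shared counter.
int count_in_parallel(int workers, int steps) {
    int counter = 0;
    std::mutex counter_mutex;
    std::vector<std::thread> threads;

    for (int i = 0; i < workers; ++i) {
        threads.emplace_back([&counter, &counter_mutex, steps] {
            for (int j = 0; j < steps; ++j) {
                // lock_guard releases the mutex at the end of each iteration.
                std::lock_guard<std::mutex> lock(counter_mutex);
                ++counter;
            }
        });
    }
    // Wait for all workers before reading the result.
    for (std::thread& t : threads) {
        t.join();
    }
    return counter;
}
```

The same source is meant to compile on any platform with a conforming C++11 implementation, although some toolchains need a threading flag such as -pthread when linking.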

So far, NeoPG only sets a couple of compile time flags:

  • Any flag necessary to enable C++11 support in the target compiler.
  • -D_DARWIN_C_SOURCE to work around a bug in Xcode 9.1.

It also sets a global version number in src/config.h. Of course, there are also a bunch of legacy defines for existing code.

At this point I don’t know if it will stay this minimal, because NeoPG currently only supports GNU/Linux and macOS. Support for Windows, Android and iOS is still missing. So we will have to revisit this issue at some point in the future.


  1. In Appendix B, Drepper proposes a method to create such optimized string tables without awk, using only the preprocessor and macros. The code is still difficult to read and understand, and it is still a micro-optimization, but it is better than using awk. The code was not in the original draft of the paper that I read at the time.
