The Art of Logging
All developers have attempted to debug their programs by printing lifecycle and state information to the console. This technique, sometimes known as printf debugging, can be a far more powerful tool than one might first expect. At its core lies the concept of logging, where developers record relevant information about the state of a running program to a log. Logging is vital to both developers and operators, and it is important to understand how and why to use it from both perspectives.
Logging Fundamentals
Logging is far more than just printing to stderr, however. Typical logging systems are divided into a set of logging levels that generally define the audience and semantics of a log event. For example, Apache Log4j 2 divides logging levels into the following set:
- Fatal: error messages that indicate that some subsystem or the entire program cannot continue execution and will terminate.
- Error: error messages regarding a problem that should be handled by a human. These are generally useful for operators to alert on.
- Warn: warning messages regarding potential problems that may need to be handled by a human. This level is often misused and ignored as a result.
- Info: informative messages about the state of a program. These types of messages tend to be related to the lifecycle of a program and can be viewed as a way to debug the macro state of the program.
- Debug: debugging information about internal states of the program. These messages are usually only helpful to the developers maintaining a program.
- Trace: messages tracing the execution flow of a program. These messages are usually very low level, simply mirroring the micro state of a program, and generally don’t offer more information than a debugger would.
Some logging systems define other levels, but most categorize their log messages into similar buckets with similar use cases. Each level can be selectively enabled or disabled, though disabling one level generally disables all levels below it as well. For example, if a logging configuration is set to the WARN level, then only warning, error, and fatal messages are shown.
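To make the threshold concrete, here is a minimal Java sketch, assuming Log4j 2 is on the classpath and the configuration sets the level to WARN:

import org.apache.logging.log4j.LogManager;
import org.apache.logging.log4j.Logger;

public class LevelDemo {
    private static final Logger logger = LogManager.getLogger(LevelDemo.class);

    public static void main(String[] args) {
        logger.trace("suppressed at WARN");
        logger.debug("suppressed at WARN");
        logger.info("suppressed at WARN");
        logger.warn("shown");  // WARN and above pass the threshold
        logger.error("shown");
        logger.fatal("shown");
    }
}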
By simply adding severity information to log messages, we have already surpassed the functionality offered by printf, but we’ve only scratched the surface. Any non-trivial program is composed of modules or subsystems of some kind, so it is useful to extend this configurable filtering to subsystems as well. In Java programs, subsystems tend to be separated by packages and classes, but the important concept here is that of a named logger. By naming the loggers used in a program, each subsystem can be independently configured to output only the logs that are desired. For example, suppose a third-party library is misusing the warning level and causing the operations team to be concerned about the health of your application. After verifying that every warning message under the logger name prefix com.example.subsystem is not a real warning, we can set a higher level threshold for that set of loggers specifically, without disabling warnings globally or modifying the third-party library’s source code. This works because logger names form a hierarchy: com.example is the parent of com.example.subsystem, which allows an entire subsystem to be configured with a single setting.
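For illustration, the threshold can even be adjusted at runtime with the Configurator class from the log4j-core module; the logger name below is the hypothetical one from the example above:

import org.apache.logging.log4j.Level;
import org.apache.logging.log4j.core.config.Configurator;

// Raise the threshold for the noisy third-party subsystem only; child
// loggers such as com.example.subsystem.Foo inherit the new level.
Configurator.setLevel("com.example.subsystem", Level.ERROR);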
At this point, we have a rather powerful abstraction over log event filtering using both a level and a name, but we can do better! An additional piece of metadata can be attached to log events: markers. A marker is a simple text string that marks some cross-cutting concern of a particular log message, and it can be used to route specific log messages to different logging systems. For example, suppose a log message is marked with the ALERT marker. The logging configuration could define a filter for that marker which routes these messages, regardless of level or logger name, to a particular destination. This might be an alerts channel in Slack or an alerts mailing list.
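In Log4j 2, markers are obtained from the MarkerManager. A minimal sketch, where the ALERT marker and the message are illustrative:

import org.apache.logging.log4j.LogManager;
import org.apache.logging.log4j.Logger;
import org.apache.logging.log4j.Marker;
import org.apache.logging.log4j.MarkerManager;

public class AlertDemo {
    private static final Logger logger = LogManager.getLogger(AlertDemo.class);
    private static final Marker ALERT = MarkerManager.getMarker("ALERT");

    public static void main(String[] args) {
        // A marker filter in the configuration can route this event to a
        // dedicated appender regardless of its level or logger name.
        logger.error(ALERT, "Replication lag exceeded threshold on {}", "db-01");
    }
}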
Some programming languages, such as Java, lack built-in string interpolation, so logging a message that contains values from local variables would normally require string concatenation. If that log message is never displayed, the concatenation was wasted CPU effort. Little things like this can add up to a significant performance overhead, so we can certainly do better! Enter the parameterized log message, which is quite similar in spirit to a parameterized SQL query. In Log4j and many other logging systems, parameters are specified by {} placeholders in the log message and provided as additional arguments to the logging method. For example:
logger.debug("User {} logged in", user.getName());
The placeholder is filled in only if debug logging is enabled, so the full string is never computed unless absolutely necessary. This technique is mostly relevant to languages like Java. In the Scala API for Log4j, for example, string interpolation is a built-in language feature, and macros are used behind the scenes to avoid rendering the interpolated string when logging is disabled. Example:
logger.debug(s"User ${user.getName} logged in")
One more related API that is handy to know: a lambda function can be provided instead of a string in order to defer the code needed to assemble a log message until it is actually enabled. For example, suppose we wish to fetch some additional metadata from a database for a debug log message. This overhead might be unacceptable most of the time, but we may wish to selectively enable it once in a while. The entire body of the function can be encapsulated in a lambda and passed to the logger. This is generally cleaner than surrounding the code with if checks for the relevant log level or other noisy techniques.
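A sketch of both styles in Java, where fetchMetadataFromDatabase is a hypothetical expensive call standing in for a real query:

// Noisy: an explicit guard around the expensive work.
if (logger.isDebugEnabled()) {
    logger.debug("Order metadata: {}", fetchMetadataFromDatabase());
}

// Cleaner: the lambda is invoked only when DEBUG is enabled.
logger.debug("Order metadata: {}", () -> fetchMetadataFromDatabase());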
There are far more features that could be covered regarding how to use a logging API from the developer’s point of view, but they are mostly conveniences: thread-local context data automatically included in every log message, structured log messages, generic event logging, and others. Far more information about these features is available in the Log4j manual.
Where Do Log Events Go?
Now that we’ve established a general framework for writing and filtering log events, what can we do with them? The simplest way to handle log events is to print each log message to the console, separated by new lines. Since quite a bit of context would be lost this way, we generally include additional information from the log event such as the timestamp, log level, marker (if defined), logger name, and, for multithreaded programs, thread name. All the fields we wish to output should be configurable, and in fact there are several other fields available to provide context about the log message. The output could also use a structured format such as JSON, which is more easily parsed than line-oriented log messages, though most log aggregation and search tools can extract log event information from all sorts of formats.
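For example, a plausible Log4j 2 PatternLayout pattern covering these fields might look like the following; the exact pattern is an illustration, not a recommendation:

%d{ISO8601} [%t] %-5level %marker %logger{36} - %msg%n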
In some use cases, writing log events to stderr is acceptable. For example, during development, the developer may wish to view log events while running the program in the console. Alternatively, perhaps you’re using an orchestration framework such as Apache Mesos to execute all your applications. Such a framework can be configured to watch the stderr streams of all running applications in order to collect log messages in a central location.
For many use cases, however, simply printing to a console that nobody looks at is not a valid strategy. Any website that requires a reasonable SLA and has more than, say, a few hundred users generally requires multiple servers to distribute load. As the number of nodes increases, it simply becomes infeasible to watch the console. In fact, each node may be executing multiple applications, so without a program like tmux, we’d have to redirect the stderr of each program to a file anyway. With that in mind, we can directly configure the logging framework to output log events to a file instead of stderr. Each file can be monitored with a program such as tail to continually watch for new log events being appended. This style of logging is pervasive in typical GNU/Linux and BSD systems, where many running services output log information to directories under /var/log/. However, if log files are not periodically rotated and old ones deleted, the server’s disk space can eventually fill up with log information! This job is typically handled by a program such as logrotate, though Log4j has a rolling file appender which provides similar functionality.
Simply outputting to a log file can be a good strategy for operators who are still stuck in the “do it by hand” mindset, but we can do better! Our main goal here should be to collect logs from all our servers into a central, searchable location. One such way to accomplish this is by using a product such as ELK, Fluentd, or Graylog. These tools offer more than just log aggregation; they offer ways to filter, sort, search, and alert based on the contents of the logs. However, by relying on log files, we’re also relying on the stability of the individual servers. Obtaining logs in disaster scenarios is generally more difficult but also far more important, so let’s improve on that.
Apache Flume is a project for collecting and aggregating large volumes of log events in a distributed computing environment. This is very useful in cluster scenarios, such as running dozens or hundreds of Apache Hadoop or Apache Spark nodes. Individual nodes can pass log events along to a Flume agent, and each agent is responsible for reliably delivering the log events elsewhere. Combined with the Log4j Flume appender, this can easily be used in a distributed environment to collect all log events in a central log aggregator. That aggregator may be something complex like Logstash or Graylog, or something as simple as a single master log file.
Now that we have our logs all in one place, we can really step up our operations game. We can set up alerts based on log level thresholds, message counts, message frequency, and triggers based on any metadata contained within. If we want to get really fancy, we can train machine learning models via Spark or Apache Mahout, combined with other exported metrics data, to attempt to predict failures in our services. Such techniques can also improve observability across clusters and microservices more generally. Combined with scripts to automatically scale or restart services, operations teams can become more proactive in maintaining their systems.
There are dozens more frameworks, libraries, and tools that could be covered here. Logging is something all developers do whether they’re using the proper tools or not, so it’s a great idea to get familiar with the tools and concepts in order to improve the metadata created by applications. Developers should work closely with operators (devops) to find a good balance of logging verbosity and observability. Managing logs is a complex topic that many people tend to overlook, but having a good logging architecture in place can save the day during a production issue. As a final note for those using the Java platform: Apache Log4j 2 is the premier logging library for Java, Scala, Groovy, Kotlin, and any other JVM language. It is common for logging to add noticeable overhead to applications, and the typical solution is to simply disable logging, but this removes all the advantages of logging in the first place! Instead, take a look at the numbers and see how Log4j can be used with very minimal overhead, even in high frequency trading applications.