The Art of Logging
All developers have attempted to debug their programs by printing lifecycle and state information to the console. This technique, sometimes known as printf debugging, can be a far more powerful tool than one might first expect. At its core lies the concept of logging, where developers record relevant information about the state of a running program to a log. Logging is vital to both developers and operators, and it is important to understand how and why to use it from both perspectives.
Logging Fundamentals
Logging is far more than just printing to stderr, however. Typical logging systems are divided into a set of logging levels that generally define the audience and semantics of a log event. For example, Apache Log4j 2 divides logging levels into the following set:
- Fatal: error messages that indicate that some subsystem or the entire program cannot continue execution and will terminate.
- Error: error messages regarding a problem that should be handled by a human. These are generally useful for operators to alert on.
- Warn: warning messages regarding potential problems that may need to be handled by a human. This level is often misused and ignored as a result.
- Info: informative messages about the state of a program. These types of messages tend to be related to the lifecycle of a program and can be viewed as a way to debug the macro state of the program.
- Debug: debugging information about internal states of the program. These messages are usually only helpful to the developers maintaining a program.
- Trace: messages tracing the execution flow of a program. These messages are usually very low level, simply mirroring the micro state of a program, and generally don’t offer more information than a debugger would.
Some logging systems define other levels, but most categorize their log messages into similar buckets with similar use cases. Each level can be selectively enabled or disabled, though disabling one level generally disables all levels below it as well. For example, if a logging configuration is set to the WARN level, then only warning, error, and fatal messages are shown.
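To make the threshold concrete, here is a minimal Java sketch, assuming Log4j 2 is on the classpath and the configuration sets the level to WARN:

import org.apache.logging.log4j.LogManager;
import org.apache.logging.log4j.Logger;

public class LevelDemo {
    private static final Logger logger = LogManager.getLogger(LevelDemo.class);

    public static void main(String[] args) {
        logger.trace("suppressed at WARN");
        logger.debug("suppressed at WARN");
        logger.info("suppressed at WARN");
        logger.warn("shown");  // WARN and above pass the threshold
        logger.error("shown");
        logger.fatal("shown");
    }
}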
By simply adding severity information to log messages, we have already surpassed the functionality offered by printf, but we’ve only scratched the surface. Any non-trivial program is composed of modules or subsystems of some kind, so it is useful to extend this configurable filtering to subsystems as well. In Java programs, subsystems tend to be separated by packages and classes, but the important concept here is that of a named logger. By naming the loggers used in a program, each subsystem can be independently configured to output only the logs that are desired. For example, suppose a third-party library is misusing the warning level and causing the operations team to be concerned about the health of your application. After verifying that every warning message under the logger name prefix com.example.subsystem is not a real warning, we can set a higher level threshold for that set of loggers specifically, without disabling warnings globally or modifying the third-party library’s source code. This works because logger names form a hierarchy: com.example is the parent of com.example.subsystem, which allows an entire subsystem to be configured with a single setting.
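For illustration, the threshold can even be adjusted at runtime with the Configurator class from the log4j-core module; the logger name below is the hypothetical one from the example above:

import org.apache.logging.log4j.Level;
import org.apache.logging.log4j.core.config.Configurator;

// Raise the threshold for the noisy third-party subsystem only; child
// loggers such as com.example.subsystem.Foo inherit the new level.
Configurator.setLevel("com.example.subsystem", Level.ERROR);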
At this point, we have a rather powerful abstraction over log event filtering using both a level and a name, but we can do better! An additional piece of metadata can be attached to log events: markers. A marker is a simple text string that marks some cross-cutting concern of a particular log message, and it can be used to route specific log messages to different logging systems. For example, suppose a log message is marked with the ALERT marker. The logging configuration could define a filter for that marker which routes these messages, regardless of level or logger name, to a particular destination. This might be an alerts channel in Slack or an alerts mailing list.
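In Log4j 2, markers are obtained from the MarkerManager. A minimal sketch, where the ALERT marker and the message are illustrative:

import org.apache.logging.log4j.LogManager;
import org.apache.logging.log4j.Logger;
import org.apache.logging.log4j.Marker;
import org.apache.logging.log4j.MarkerManager;

public class AlertDemo {
    private static final Logger logger = LogManager.getLogger(AlertDemo.class);
    private static final Marker ALERT = MarkerManager.getMarker("ALERT");

    public static void main(String[] args) {
        // A marker filter in the configuration can route this event to a
        // dedicated appender regardless of its level or logger name.
        logger.error(ALERT, "Replication lag exceeded threshold on {}", "db-01");
    }
}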
Some programming languages, such as Java, lack built-in string interpolation, so logging a message that contains values from local variables would normally require string concatenation. If that log message is never displayed, the concatenation was wasted CPU effort. Little things like this can add up to a significant performance overhead, so we can certainly do better! Enter the parameterized log message, which is quite similar in spirit to a parameterized SQL query. In Log4j and many other logging systems, parameters are specified by {} placeholders in the log message and provided as additional arguments to the logging method. For example:
logger.debug("User {} logged in", user.getName());
The placeholder is filled in only if debug logging is enabled, so the full string is never computed unless absolutely necessary. This technique is mostly relevant to languages like Java. In the Scala API for Log4j, for example, string interpolation is a built-in language feature, and macros are used behind the scenes to avoid rendering the interpolated string when logging is disabled. Example:
logger.debug(s"User ${user.getName} logged in")
One more related API that is handy to know: a lambda function can be provided instead of a string in order to defer the code needed to assemble a log message until it is actually enabled. For example, suppose we wish to fetch some additional metadata from a database for a debug log message. This overhead might be unacceptable most of the time, but we may wish to selectively enable it once in a while. The entire body of the function can be encapsulated in a lambda and passed to the logger. This is generally cleaner than surrounding the code with if checks for the relevant log level or other noisy techniques.
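A sketch of both styles in Java, where fetchMetadataFromDatabase is a hypothetical expensive call standing in for a real query:

// Noisy: an explicit guard around the expensive work.
if (logger.isDebugEnabled()) {
    logger.debug("Order metadata: {}", fetchMetadataFromDatabase());
}

// Cleaner: the lambda is invoked only when DEBUG is enabled.
logger.debug("Order metadata: {}", () -> fetchMetadataFromDatabase());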
There are far more features that could be covered regarding how to use a logging API from the developer’s point of view, but they are mostly conveniences: thread-local context data automatically included in every log message, structured log messages, generic event logging, and others. Far more information about these features is available in the Log4j manual.
Where Do Log Events Go?
Now that we’ve established a general framework for writing and filtering log events, what can we do with them? The simplest way to handle log events is to print each log message to the console, separated by new lines. Since quite a bit of context would be lost this way, we generally include additional information from the log event such as the timestamp, log level, marker (if defined), logger name, and, for multithreaded programs, thread name. All the fields we wish to output should be configurable, and in fact there are several other fields available to provide context about the log message. The output could also use a structured format such as JSON, which is more easily parsed than line-oriented log messages, though most log aggregation and search tools can extract log event information from all sorts of formats.
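For example, a plausible Log4j 2 PatternLayout pattern covering these fields might look like the following; the exact pattern is an illustration, not a recommendation:

%d{ISO8601} [%t] %-5level %marker %logger{36} - %msg%n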
In some use cases, writing log events to stderr is acceptable. For example, during development, the developer may wish to view log events while running the program in the console. Alternatively, perhaps you’re using an orchestration framework such as Apache Mesos to execute all your applications. Such a framework can be configured to watch the stderr streams of all running applications in order to collect log messages in a central location.
For many use cases, however, simply printing to a console that nobody looks at is not a valid strategy. Any website that requires a reasonable SLA and has more than, say, a few hundred users generally requires multiple servers to distribute load. As the number of nodes increases, it simply becomes infeasible to watch the console. In fact, each node may be executing multiple applications, so without a program like tmux, we’d have to redirect the stderr of each program to a file anyway. With that in mind, we can directly configure the logging framework to output log events to a file instead of stderr. Each file can be monitored with a program such as tail to continually watch for new log events being appended. This style of logging is pervasive in typical GNU/Linux and BSD systems, where many running services output log information to directories under /var/log/. However, if log files are not periodically rotated and old ones deleted, the server’s disk space can eventually fill up with log information! This job is typically handled by a program such as logrotate, though Log4j has a rolling file appender which provides similar functionality.
Simply outputting to a log file can be a good strategy for operators who are still stuck in the “do it by hand” mindset, but we can do better! Our main goal here should be to collect logs from all our servers into a central, searchable location. One such way to accomplish this is by using a product such as ELK, Fluentd, or Graylog. These tools offer more than just log aggregation; they offer ways to filter, sort, search, and alert based on the contents of the logs. However, by relying on log files, we’re also relying on the stability of the individual servers. Obtaining logs in disaster scenarios is generally more difficult but also far more important, so let’s improve on that.
Apache Flume is a project for collecting and aggregating large volumes of log events in a distributed computing environment. This is very useful in cluster scenarios, such as running dozens or hundreds of Apache Hadoop or Apache Spark nodes. Individual nodes can pass log events along to a Flume agent, and each agent is responsible for reliably delivering the log events elsewhere. Combined with the Log4j Flume appender, this can easily be used in a distributed environment to collect all log events in a central log aggregator. That aggregator may be something complex like Logstash or Graylog, or something as simple as a single master log file.
Now that we have our logs all in one place, we can really step up our operations game. We can set up alerts based on log level thresholds, message counts, message frequency, and triggers based on any metadata contained within. If we want to get really fancy, we can train machine learning models via Spark or Apache Mahout, combined with other exported metrics data, to attempt to predict failures in our services. Such techniques can also improve observability across clusters and microservices more generally. Combined with scripts to automatically scale or restart services, operations teams can become more proactive in maintaining their systems.
There are dozens more frameworks, libraries, and tools that could be covered here. Logging is something all developers do whether they’re using the proper tools or not, so it’s a great idea to get familiar with the tools and concepts in order to improve the metadata created by applications. Developers should work closely with operators (devops) to find a good balance of logging verbosity and observability. Managing logs is a complex topic that many people tend to overlook, but having a good logging architecture in place can save the day during a production issue. As a final note for those using the Java platform: Apache Log4j 2 is the premier logging library for Java, Scala, Groovy, Kotlin, and any other JVM language. It is common for logging to add noticeable overhead to applications, and the typical solution is to simply disable logging, but this removes all the advantages of logging in the first place! Instead, take a look at the numbers and see how Log4j can be used with very minimal overhead, even in high frequency trading applications.