Lessons learned from working on large scale, cross-project initiatives in OpenStack

I have been involved with OpenStack development since just before the Folsom summit in 2012. Over the course of that time, I have participated in innumerable discussions about 3 big features tied to OpenStack logging: translating log messages, adding request IDs to log messages, and adding unique message IDs to log messages. We have had various degrees of success with the design, implementation, and ongoing maintenance of all three features, and the reasons for success or failure in each case provide helpful insight into how to approach changes with large community and product scope that should be considered before our next discussion at the summit/forum in Boston in 2017.

Log Translations

Many software projects support translation of user-facing messages, such as appear on web pages, GUIs, and console apps. Early on, in 2010, message translation in OpenStack was extended to include log messages. The work started with a patch in 2010-12, and appears to have continued through 2011-02. It wasn’t clear from my research whether the final implementation was released as part of Bexar or if it had to wait for Cactus, but it was around that time. Strings were marked up for translation using a single translation domain so that all messages, including exceptions, command output, and log messages of any level, were stored in the same message catalog.

By the Folsom summit in 2012, the amount of work involved in managing translations had increased to the point that we had new discussions of what really needed to be translated. There was quick consensus about translating user-facing messages like errors and UI text, but some disagreement over whether translating log messages made sense. Having logs translated was of special interest to operators in China, for example, but others on the mailing list pointed out that translating log messages would result in splitting the online knowledgebase of search results for help in understanding those log messages.

At the Icehouse summit in Hong Kong in 2013, we considered a proposal to use separate translation domains for log messages of different levels. The idea was that by splitting the messages into multiple domains, the translation teams could prioritize translations for the messages of interest to their user community. Everyone would provide translation for user-facing messages, but translation teams would not be required to provide translations for the log messages. The proposal was adopted, and we made the necessary changes in oslo.i18n to support translating log messages of different levels and in oslo.log to allow logging in multiple languages simultaneously. The changes to update the translation markup for all of the log messages in the application code started rolling out during the same cycle, though not necessarily to all of the applications at the same time.

At the PTG in Febrary of 2017 the i18n team (responsible for the translation work) decided to drop all support for log message translations. The number of messages, and amount of change in the text of those messages, was too much for the translators to keep up with. After some broader discussion of the change found little or no support among operators for keeping log translations, the community decided to stop translating logs and to remove the translation markup from log messages. The translation features of oslo.i18n will be kept in place for the time being to allow project teams to remove the markup at their own pace.

Request ID Logging

Debugging OpenStack operations that span multiple services can be challenging, because a failure in one part of the service chain can trigger a different failure later, and because the distributed nature of the services makes it more complicated to find all of the relevant log messages to understand the sequence of events that led to the failure. Each call to an OpenStack service creates a “request ID” (usually a random UUID, but some services extend the value with more information). This ID is intended to be used to tie together log messages and notifications for a given operation. The earliest mention about adding request ID information to log files is from 2012-06, although I wouldn’t be  surprised to learn of earlier discussions. The consensus from that discussion appears to have been to create a separate request ID for each service, but to use a common “transaction ID” to represent the chain of all service calls (nova to glance, for example). There was more discussion of the feature in a Nova session at the Icehouse summit.

I lost the thread for the implementation work at this point, but the specification for returning request IDs through the REST APIs to the client, discussed in early 2015, shows that the transaction ID work was never fully implemented, so even if services were logging their own request ID it was still difficult to relate the messages from two different services to understand which were part of a single operation. The 7 month long discussion of that specification is another interesting, but tangential, example of the amount of work involved in finding agreement between all of our project teams when making a “global” change in a consistent way. It was one of the earliest examples of our community trying to work together and design a solution once before implementing it several times (in all of the clients). The amount of time involved is disheartening, but I think we’ve learned and improved so that new initiatives do not always take so long now.

After the spec was approved, the team working on the client library changes was able to make progress. Today, as far as I can tell, 4 of the starter-kit:compute services (nova, neutron, cinder, and glance) support exposing the request ID through the REST client libraries. However, at the most recent summit in Barcelona in 2016-10 the notes from the operator session on logging still include a request to improve support for tracing requests from one service to the next, and it is on the schedule to be discussed in Boston. So, while the effort has been partly successful, the work is clearly not complete.

Log Message IDs

Many commercial (closed-source) enterprise applications, and even some open source projects, include unique identifiers for every user and log message. The IDs are used as reference points in documentation and bug reports, either instead of or in addition to the text of the log message. Using an ID instead of the text of the message means that knowledge bases continue to be useful if the message text changes slightly over time. Message IDs also mean that online searches can be language independent, to some degree, although translating the results of a search may still require someone to be able to read English if there are no results in their primary language. The first reference I found to someone proposing adding message IDs to OpenStack was in 2012-05, after the Folsom summit.

Discussion continued in several venues, including an Oslo blueprint from 2013-11, a Glance spec in 2014-10 that was discussed at the Kilo summit in Paris, and a cross-project spec from 2015-04. The logging working group, started in early 2015, covered the spec many times as we tried to clarify and refine it.

Those discussions, particularly for the spec, focused on implementation details such as what a message ID looks like, where they should go in the log line, and how to manage the documentation for the messages. The most recent form of the spec proposes to define unique IDs for every message using a combination of the project name and a numerical value. The IDs are meant to be used both in the log file and in the error messages returned through the REST API. It describes a plan to identify all of the error messages and add codes to them, but leaves out information about writing the documentation those codes would enable, training reviewers about message IDs, and several other key implementation aspects.

The topic has stayed at the top of operator’s wish-list for logging improvements with little or no progress, and will be revived in Boston.

What happened?

Log translations were implemented, then as the amount of effort involved in maintaining them increased we changed the implementation before removing the feature completely. The small translation team could not keep up with the rate of change from the much larger set of contributors adding and changing log messages as they worked on other features, so it did not make sense to continue to ask the rest of the contributor base to expend the effort needed to understand the policies about which log messages should be translated or to enforce those policies during code reviews.

Request IDs for the current request within a given service can now be included in log messages for applications using oslo.log and oslo.context. Users of some of the client libraries can access the external request IDs, but not all clients support this and apparently not all of those that do are having the IDs logged by the caller. The team presenting the original cross-project spec went through a heroic amount of work to get as far as they did, but it looks like this case suffered from two problems. First, although no one was outright opposed to the idea, there was not a groundswell of support for it. Likely as a result of being one of the earliest cross-project specs to be discussed, it suffered from a certain amount of wary interest on the part of many teams, who did not see it as a clear community priority. Another problem was that the plan documented in the spec was not sufficiently detailed to allow the work to be undertaken in parallel, or to know when the project could be called “done.” The team driving the work undertook to complete it all themselves, without trying to attract collaborators. Contributions then dropped off over time, as people lost interest or were shifted to other priorities. The work would ultimately have been simpler if OpenStack had a single SDK for communicating between services, or at least shared more of the base client code between projects, instead of having each project manage its own completely independent library, but even with our current code base it should have been possible to complete the work in 1 or 2 release cycles.

We continue to look for a reasonable way to implement a message ID system. The proposals so far have focused almost exclusively on the message and log line format, while ignoring the planning needed to actually make such a significant change. There is general confusion about where the IDs will actually be used (log messages, API responses, or both?), and about some of the terminology and references used in the spec. The focus on syslog, in particular, ignores both the flexibility and limitations of our current logging code. Due to the disconnect between the proposal, the current code, and the contributor community, the spec is untenable. The experience of adding, then removing, log message translations provides sufficient justification for rejecting the current proposals. Even if we have contributors standing by today to add message IDs, we do not have the reviewer bandwidth to deal with them all. We do not have plans for tools to ensure that reviewers are not burdened in the future with ensuring that IDs are globally unique. And we do not have a commitment to build the reference documentation for all of the codes. If we are going to add message IDs of some sort, we need a new design that does not depend on a large up-front effort before deployers see any benefit and that identifies contributors ready to build the tools needed to ease the burden on the rest of the community.

What did we learn?

What common patterns for successful initiatives can we derive from these 3 examples?

  1. Successful cross-project efforts do not introduce a lot of maintenance work to be performed every release cycle.

The log message translation effort failed because the burden of the work was not balanced across the contributor community. It was very easy for an average contributor to generate a lot of work for the translators, without even realizing it.

  1. Successful cross-project efforts build tools to enable work to be distributed over time and contributors.

The request ID logging effort has not been completed because it required a lot of similar, but not identical, changes in many projects. Without sufficient shared code, tools could not be used to generate the needed patches mechanically.

  1. Successful cross-project efforts include clear documentation of all phases of the plan, including the criteria for calling the work “done” or at least what choice will need to be made to create the rest of the plan.

The request ID logging effort did not document all phases of the needed work, and now that the work has slowed down or stopped it will be more challenging to resume it.

The move to Python 3 is an example of a very-long term initiative where later phases could not be described in detail until earlier phases were complete — we need to be running on Python 3 before we talk seriously about dropping Python 2.

  1. Successful cross-project efforts combine the work of people skilled at cross-project collaboration with contributors from the project teams involved.

The log message ID effort has been struggling to gain traction because the proposal does not address the reality of making a change of that scope.

  1. Successful cross-project efforts avoid disrupting the normal flow of everyone else’s work by not introducing a large number of patches for review in a given project simultaneously.

Another reason the contributor community has hesitated to agree to the log message ID proposal is the number of patches involved in adding the message IDs all up front. Even if those changes are done in a single patch, a lot of the rest of the work in the project may need to be rebased to account for the logging changes.

  1. The success of a cross-project effort is directly related to the degree of parallelization of the implementation work.

The log message ID effort hasn’t been able to start successfully because there is such a significan lead time between doing the up-front work of assigning message IDs and seeing useful documentation.

  1. Serialized implementation, with strong a need for steps to be performed in order across multiple projects, take more time and effort.

With the request ID logging project, we saw that coordinating the reviews for landing patches in order across many repositories requires more communication among groups of people for whom a given initiative is not their top (or only) priority. Without a common base for the client libraries, they each had to be updated individually before the applications using those libraries could be updated.

  1. The more time an initiative takes to produce consumable output, the less likely it is to be completed.

Sometimes a delay between doing the setup and producing usable output is unavoidable. That was true in the log message translation case. The log message ID proposal is set up the same way, and that has proven to be a source of friction.

Partial Timeline

2010-12-03

i18n support in nova

2011-02-03

Bexar release

2012-04-05

Essex release

2012-04-16

Folsom summit, San Francisco

i18n in OpenStack

2012-05-01

i18n of log message (proposal to stick to English-only in log messages)

2012-06-12

request and transaction IDs across all projects

2012-09-27

Folsom release

2013-11-05

Icehouse summit, Hong Kong

nova discussion of tracking request ids

introduced log translation using separate domains

2013-11-28

adding request-id to API response for neutron

swift’s approach

2014-01-27

log message translations blueprint introduced

2014-04-03

Some Thoughts on Log Message ID Generation Blueprint

https://blueprints.launchpad.net/oslo/+spec/log-messages-id

https://blueprints.launchpad.net/oslo-incubator/+spec/log-messages-id

2014-04-17

Icehouse release

2014-05-13

Juno summit, Atlanta

Operator meetup notes prior to the summit

proposes returning request ids to users

point out that changing log messages every release makes it hard to build useful tools that last across release boundaries (anything looking for specific wording)

2014-06-20

log translation for oslo libraries

2014-10-16

Juno release

2014-11-04

Kilo summit, Paris

glance logging, supportability, error codes

2015-03-04

first logging working group meeting

discussed error codes proposal in glance

2015-03-05

log working group priorities

2015-03-10

cross-project meeting discussion of returning request IDs to caller

https://review.openstack.org/#/c/156508

2015-04-08

HTTP error codes thread forked to discuss message IDs

OPS thread on same topic

logging working group meeting

request ID spec mentioned

2015-04-10

Erring is Caring

announcement of first draft of spec for log message error codes

2015-04-15

2015-04-30

Kilo release

2015-05-19

Liberty summit, Vancouver

2015-05-20

2015-05-27

2015-06-23

2016-06-30

2015-07-07

discussion results in a major rework of the spec

2015-07-14

2015-07-28

2015-07-29

returning request id spec back on track

2015-08-04

2015-08-11

2015-08-18

2015-09-15

2015-10-15

Liberty release

2015-10-26

Mitaka summit, Tokyo  

2015-11-11

2016-02-29

2016-03-02

2016-04-07

Mitaka release

2016-04-25

Newton summit, Austin

2016-06-17

2016-10-06

Newton release

2017-03-15

Ops meetup, Milan

log structure came up, including unique numbers for errors

Pike Forum (planning)