https://www.infoq.com/articles/software-engineering-google/

The book Software Engineering at Google, curated by Titus Winters, Tom Manshreck, and Hyrum Wright, provides insights into the practices and tools used at Google to develop and maintain software with respect to time, scale, and the tradeoffs that all engineers make in development. It also explores the engineering values and the culture based on them, emphasizing the main differences between programming and software engineering. InfoQ readers can download the book for free: Software Engineering at Google.

InfoQ interviewed Titus Winters, Tom Manshreck, and Hyrum Wright about software engineering at Google.

InfoQ: What made you decide to write this book?

Titus Winters: After the public launch of Abseil, I tried to take a breather. It had been a few years since I’d really reconnected with the schools and friends I had before Google, so my wife and I did some travel and outreach in Southern California. When technical topics came up, a lot of the people I talked to seemed to be in exactly the same place, technically, that they were in the early 2000s. Schools still had the same curriculum. Companies were still ignoring the importance of unit tests. At the same time, Google was consciously pushing to be less of an “island” with respect to the tech of the rest of the industry, but I was seeing a lot of topics where I genuinely believed that the Google approach made more sense. So I pitched the idea of a Software Engineering version of the SRE book, and my VP at the time (Melody Meckfessel, now at Observable) gave us the go-ahead.

Tom Manshreck: I had the advantage of having worked in publishing before as a managing editor, so I knew what I was getting into. Although I had 15 years of experience in the software industry, once I joined Google I needed to learn much more about actual software development. I wanted to consolidate all of this knowledge about engineering practices into a coherent whole, so this idea had been in my head for a long time. When Titus approached me with the thesis for this book, I realized this was the opportunity to write all this knowledge down. And given Google’s unique position and scale, I knew that we had something unique to say.

Hyrum Wright: Basically, Titus roped me into it. :) More seriously, it felt like we were having many of the same conversations with various people outside of Google about the way we do things “on the inside.” Google has published various academic articles, and our colleagues have given a number of talks at industry venues about our approach to software engineering, but there wasn’t a good, approachable, referenceable resource for folks in the industry. The book is our attempt to put the lessons we’ve learned down in a digestible format so others can learn from both our successes and our failures. The goal isn’t to tell people they have to follow our exact path, but to help them see what path we’ve taken and where the pitfalls might be as they find their own way.

InfoQ: For whom is this book intended?

Winters: We were aiming for this to provide the “why,” along with our understanding of best practices at Google scale. I think it’s probably most useful for people involved in policy and decision making for a software organization, whether that is the people making the decisions or the people proposing changes and improvements. I hope it is also useful in university classes discussing software engineering.
A formal classroom environment is a great place to focus on the theory and the reasoning, before people make it into the industry and rely too much on “this is how we’ve always done it.”

Manshreck: Usually you write a book for a specific audience, but in this case we have a few. People unfamiliar with software engineering itself (i.e., students who have no work experience) will find a wealth of information about how software development “really works.” People already within the industry will see how a development team adjusts to a growing organization. And even veteran industry professionals will find information about “how Google does it” to compare and contrast with their own practices. As we stated in the preface, we don’t mean to imply (or even want) to tell others how to do things. But we do want to share what we’ve learned, and especially to share how we dealt with mistakes along the way.

Wright: Different audiences are going to have different takeaways. Higher-level decision makers will be interested in the chapters that discuss the background of the decisions we’ve made (for example, why a build system like Bazel is ultimately more scalable for your organization). But we hope it’s useful for the folks “down in the trenches” as well.

InfoQ: How would you define software engineering?

Winters: I think one of the oldest definitions is still one of the best. Software engineering is “the multi-person development of multi-version programs,” and that quote (from the 1970s) still captures the main themes of time (multi-version) and scale, especially process and communication scaling (multi-person). It is one thing to write a program to solve your problem. It’s another thing entirely to collaborate with a 10-person team to solve your problem in a way that keeps working a decade later, after everyone on the team has moved on to new things.

InfoQ: How can we build a great software team?

Manshreck: Brian Fitzpatrick’s chapters illustrate this well. The skills required for developing good software are not the same skills that were required (at one point) to mass-produce automobiles, etc. We need engineers to respond creatively and to continually learn, not to do one thing over and over. If they don’t have creative freedom, they will not be able to evolve with the industry as it, too, rapidly changes. To foster that creativity, we have to allow people to be human, and to foster a team climate of trust, humility, and respect. Trust to do the right thing. Humility to realize you can’t do it alone and can make mistakes. And respect for the members of the team, rather than reliance on a few individuals. You simply can’t make a good software team with a few “rockstar” engineers who play by their own rules. That may work in the short term, but in the long term it will fail. You need to allow people to continually evolve and contribute to the organization, and they need to be part of a team.

InfoQ: The book mentions that psychological safety is essential for learning. What can we do to establish psychological safety in large groups?

Winters: From what I’ve experienced, one of the most critical things is for leaders to admit their own fallibility. Normalize making mistakes; get people out of the false idea that perfection is expected (or attainable). Once we stop treating mistakes as failures and instead look at them as chances to learn, your team is going to accelerate a lot.
And, counter-intuitively, when you make it clear that it is OK to make novel mistakes, you’ll wind up making fewer mistakes in the long term. That’s certainly been the case for my teams.

InfoQ: What does Google do to build multicultural capacity in their organization?

Winters: It varies across the organization, in that there are Google-wide initiatives to build multicultural capacity as well as initiatives that take place at the team level. The “Engineering for Equity” chapter examines the importance of multiculturalism and embracing diversity by looking at how unchecked bias can present itself in software engineering and ultimately negatively impact our users. Building with a diverse team is, in our belief, critical to making sure that the needs of a more diverse user base are met. We see that historically: first-generation airbags were terribly dangerous for anyone who wasn’t built like the people on the engineering teams designing those safety systems. Crash test dummies, for instance, were built for the average man, and the results were bad for women and children. In other words, we’re not just working to build for everyone, we’re working to build with everyone. It takes a lot of institutional support and local energy to really build multicultural capacity in an organization. We need allies, training, and support structures. And even with those things, we are asking ourselves how we can continue to do more and do better, as a company and as engineers. At one point in the chapter we say, “The path to equity is long and complex.” We still have work to do to close the gap between where we are and where we want to be, but we are improving. And the more we improve, the better things will be for our users and the world, because products designed with equity and inclusion in mind are simply better for everyone.

InfoQ: What steps does a typical code review have?

Manshreck: A code review at Google requires two types of approval: an LGTM from an engineer (any engineer) confirming that the code is correct, and an approval from an OWNER of the codebase directory confirming that the code belongs there and that they are willing to maintain it (important!). But that’s it (and often those two types of approval are wrapped into one person). The code review process at Google is also somewhat lightweight, but frequent. We encourage smaller changes that a reviewer can review quickly. This maintains developer velocity.
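That ownership is expressed in OWNERS files checked in alongside the code. As a rough, hypothetical illustration (the directory and usernames are invented, and the syntax shown follows the convention used in Google’s open source projects such as Chromium; the internal format differs in its details):

    # search/indexing/OWNERS
    # Anyone listed here can approve changes in this
    # directory and its subdirectories.
    alice@example.com
    bob@example.com
    # Narrower ownership for an especially sensitive file:
    per-file config.h=charlie@example.com

An LGTM from any engineer, plus an approval from one of these owners, is enough to submit a change touching this directory.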
InfoQ: What benefits do you get from doing code reviews?

Winters: Modern Code Review: A Case Study at Google (published at ICSE a few years back) identified what code review is really doing for us in practice. A lot of organizations seem to think that code review primarily serves to spot bugs. A lot of people (especially from groups that are under-represented in tech) also find that code review is used as a gatekeeping mechanism, and it’s certainly a place where there is a real risk of bias creeping in. That said, in our experience code review is mostly useful for education, for ensuring that things meet our standards and follow best practices, and for ensuring that the “team” understands and is willing to own and maintain that code going forward. Correctness and performance are also on the list, but there are other steps in the software workflow that may be better suited to those sorts of issues. So to my mind, it’s mostly the communication aspects that are the benefit: can others understand this, what can the author (and the reviewer) learn from the code in question, and so on.

Manshreck: As Titus mentioned, one insight I would like to emphasize is that a code review is not a purely technical interaction. It is a social interaction. As the code review chapter notes, the reason we do code reviews is not primarily to catch bugs (though that is still important). The more important aspects involve knowledge sharing and team building. At Google, a code review is in some ways (ideally) a low-level game of ping-pong, where each interaction is an opportunity to learn. It is also a significant way we learn about new ways to do things. Finally, a code review is a means to enforce the notion that the code we work on is a team effort. It’s not “my” code or “your” code; it’s “our” code.

Wright: Code review is really an opportunity for learning. On a psychologically safe team, members can use the code review process to raise concerns about the code they are going to have to maintain together. Code review can be used to help instruct both the reviewer and the reviewee, but it’s only part of a holistic review process (design reviews, API reviews, etc.). Code review often focuses on the “how” questions (“is this code doing what it claims to do?”), but the “why” question is just as important: “why are we writing this code at all?” The “why” is an important part of keeping bad design decisions and technical debt out of a codebase.

InfoQ: What functions does the Google review tool Critique provide?

Manshreck: Critique was custom-built to reflect our process, so it’s very good at subtly enforcing our ideas of what a code review should entail. The real function that Critique provides is under the hood; the interface itself is relatively streamlined. Because Critique is the one interface that all engineers use to send code for review and to submit that code, we can hook all kinds of tools into that common process. Critique is where we get tests to run (often automatically) or perform static analysis on the code in question, for example.

InfoQ: What makes static analysis tools work well at Google?

Wright: There are a couple of main pieces that make this work well. The first is that we surface the results of static analysis at the right time in the development workflow. Running static analysis tools on code that is already committed to the repository can be useful, but developers are much more likely to apply suggestions flagged by static analysis when they are already editing the code. A system like Tricorder integrates static analysis into the code review pipeline so that engineers see suggestions when they are most looking for them: at review time. The other important part is that Tricorder makes it really easy for subject-matter experts to build static analyses and integrate them with the overall platform. The compiler and language experts who build the tools don’t have to worry about scalably running them across our codebase; they can focus on what they know best. Providing a robust and scalable platform lets these experts write the best tools they can.

InfoQ: What benefits does static code analysis bring?

Wright: Static analysis helps “shift left” various bug classes from runtime to compile time or review time. I don’t know anybody who would rather be paged in the middle of the night than get a compile-time error or warning. The sooner bugs are caught, the easier they are to fix, and static analysis can help detect some classes of bugs much sooner in the development lifecycle.

Several years ago, we introduced compile-time annotations to help detect threading violations in C++ code. For example, an engineer can flag a variable as requiring a given lock to be held during access, and the compiler will do some static analysis to determine whether the lock is actually being held. This technique doesn’t catch all the bugs (most engineers are more clever than the compiler), but it is an additional tool in bug prevention. One team was experiencing intermittent crashes in production and, after much work, narrowed the problem down to a race condition, but couldn’t reproduce it reliably. Separately, they decided to use these new annotations across their codebase, and were surprised when their code no longer compiled. After fixing the static analysis failures, they discovered that their production crashes had disappeared: static analysis had moved their failures from production to compile time, making them easier to find and fix.
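Annotations along these lines are available in open source today through Clang’s -Wthread-safety analysis and Abseil’s thread-annotation macros. A minimal sketch of the pattern Wright describes, assuming Abseil and a Clang build (the Counter class here is illustrative, not Google code):

    // counter.h: compile with clang++ -Wthread-safety
    #include "absl/base/thread_annotations.h"
    #include "absl/synchronization/mutex.h"

    class Counter {
     public:
      void Increment() {
        absl::MutexLock lock(&mu_);  // OK: mu_ is held while writing count_
        ++count_;
      }

      int UnsafeGet() const {
        // Compile-time warning: reading count_ requires holding mu_.
        return count_;
      }

     private:
      mutable absl::Mutex mu_;
      int count_ ABSL_GUARDED_BY(mu_) = 0;  // access checked by the compiler
    };

Fixing the warning means either taking the lock inside UnsafeGet() or annotating it with ABSL_EXCLUSIVE_LOCKS_REQUIRED(mu_), which pushes the locking obligation out to callers and lets the compiler check them too.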
There are lots of types of bugs and defects that computers are simply better equipped to identify. It’s cheaper and more consistent to identify these things early in development (at compile time or code review time) than to let them get all the way to production.

InfoQ: What branching strategy is used at Google, and what are the pros and cons of that strategy?

Winters: There’s almost no branching here, at least not development branches. That is, the branches that do exist are release branches, and they tend to be short-lived and not merged back into trunk. This turns out to be really important: nobody has a choice of what version to depend upon, nor where to commit their changes. There are no discussions about which dev branch gets to merge next. Instead, in a shared codebase with tens of thousands of engineers, we keep trunk working pretty reliably; there’s only a single version that matters. This makes more work for infrastructure teams, since they don’t have the option of putting out a breaking change. Instead, they have to make changes incrementally, and often go update their users directly to prepare for a change. It’s a different model, but it’s a lot less chaotic, and we think it’s more honest about the accounting when it comes to the cost of changing things.

InfoQ: What are the challenges of dependency management? What makes it so hard?

Winters: In the book we define “dependency management” as “the management of networks of libraries, packages, and dependencies that we don’t control.” Most of the time this comes up when dealing with external libraries, open source software, that sort of thing. But some organizations are so loosely coupled that their in-house development is split into many disconnected repositories, and they’ll inherently run into the same (very hard) dependency management issues. The biggest guidance in dependency management is to prefer version control problems over dependency management ones (which is part of why we like the monorepo concept). Most of the difficulty comes from the fact that you don’t control external dependencies. You can’t assume the provider of those dependencies has the same resources or priorities that you do. Many things that happen to work today can stop working in the future because of incompatible versions, diamond dependencies, and unsatisfiable dependencies. The industry seems to have settled on semantic versioning (semver) as a way of identifying compatibility, but semver is (at best) a highly compressed human estimate of how compatible any given change is likely to be. It works OK in simple cases, but larger networks of dependencies become both bug-prone and overconstrained, which is a drag. In practice, we’re hoping that the industry moves toward more use of continuous integration and evidence of correctness (that is: run the tests) rather than these half-measures. But most developers don’t have the resources that Google does, or the longevity to make these issues so critical. It’s just a really hard problem, trying to coordinate the smooth evolution of software when you can’t see across the dependency gaps. I think that has a lot to do with why Google tends to build its own things so often.

InfoQ: How do you manage dependencies at Google?

Winters: We try to minimize the amount of external dependence: the (vast) majority of our code is written in-house. For the bits that we need to pull in from the open source world or from licensed partners, all of that code lives in our monorepo in the same fashion as our own code, but in a clear subdirectory (third_party) that identifies code where there may be ownership or licensing issues. Most of our difficulty in managing that code is finding the necessary incentives to keep the number of versions bounded and the checked-in version up to date. That’s particularly challenging when there are popular open source projects that don’t promise any sort of compatibility between versions. Trying to update such a thing at scale, when it is used by 10,000 projects, is not fun.
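In practice, that layout looks something like the following hypothetical slice (package and file names invented for illustration; the point is that each imported package sits at a single checked-in version, paired with its license and ownership metadata):

    third_party/
      zlib/
        LICENSE   # license of record for the imported code
        OWNERS    # who is responsible for this import
        BUILD     # build rules wiring the package into the monorepo
        ...       # upstream sources, at one checked-in version

Keeping one checked-in version per package is what makes “which version do we depend on?” a non-question inside the monorepo.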
About the Book Authors

Titus Winters is a Senior Staff Software Engineer at Google, where he has worked since 2010. At Google, he is the library lead for Google’s C++ codebase: 250 million lines of code that are edited by 12,000 distinct engineers in a month. He served for several years as the chair of the subcommittee for the design of the C++ standard library. For the last 10 years, Titus and his teams have been organizing, maintaining, and evolving the foundational components of Google’s C++ codebase using modern automation and tooling. Along the way he has started several Google projects that are believed to be among the top 10 largest refactorings in human history. That unique scale and perspective have informed all of his thinking on the care and feeding of software systems.

Tom Manshreck has been a Staff Technical Writer within Software Engineering at Google since 2005, responsible for developing and maintaining many of Google’s core programming guides in infrastructure and language. Since 2011, he has been a member of Google’s C++ Library Team, developing Google’s C++ documentation set, launching (with Titus Winters) Google’s C++ training classes, and documenting Abseil, Google’s open source C++ code. Tom holds a BS in Political Science and a BS in History from the Massachusetts Institute of Technology. Before Google, Tom worked as a Managing Editor at Pearson/Prentice Hall and at various startups.

Hyrum Wright is a Staff Software Engineer at Google, where he leads the Google-wide Code Health team and has also worked on C++ libraries and large-scale change tooling. He has submitted more changes to Google’s codebase than any other engineer. Hyrum has a BS from Brigham Young University and an MS and PhD from the University of Texas at Austin.