Tour of our 250k line Clojure codebase

At Red Planet Labs we’ve been quietly developing a new kind of developer tool for many years. Our tool reduces the cost of building large-scale end-to-end applications by multiple orders of magnitude, and Clojure is a big reason why we’ve been able to tackle such an ambitious project with a small team.

Our codebase consists of 250k lines of Clojure, split evenly between source and test code. It’s one of the largest Clojure codebases in the world. In this post I’ll give a tour of how we organize our code so that a project of this size stays comprehensible to a team, the development and testing techniques we use that leverage Clojure’s unique qualities, and the key libraries we rely on.

Custom language within a language

One of the coolest parts of our codebase is the new general-purpose language at its foundation. Though the semantics of the language are substantially different from Clojure’s, it’s defined entirely within Clojure, using macros to express the differing behavior. It compiles directly to bytecode using the ASM library. The rest of our system is built using both this language and vanilla Clojure, interoperating seamlessly.

One of the striking capabilities our language has that vanilla Clojure does not is first-class continuations. The way in which our language expresses continuations makes it extremely good at async, parallel, and reactive programming. All of these are foundational to the large-scale distributed infrastructure we’re building.

That you can build an entirely new language with radically different semantics within Clojure demonstrates how powerful Clojure is. There’s a lot you get "for free" when building a language this way: lexing, parsing, datatypes, namespaces, immutable data structures, and the entire library ecosystem of Clojure and the JVM. Ultimately our new language is Clojure since it’s defined within Clojure, so it benefits from seamless interoperability with both Clojure and the JVM.

The vast majority of applications are not going to need to develop a full language like we have. But there are plenty of use cases where a focused DSL is appropriate, and we have examples of that too. The ability when using Clojure to customize how code itself is interpreted, via macros and meta-programming, is an incredibly powerful capability.

Type/schema checking

Central to any codebase is the data that is created, managed, and manipulated. We find it imperative to carefully and clearly document the data flying around the system. At the same time, type or schema annotations add overhead, so it’s important to be thoughtful and not overdo it.

We use the Schema library for defining datatypes within our codebase. It’s easy to use and we like the flexibility to define schema constraints beyond just types: e.g. arbitrary predicates, enums, and unions. Our codebase contains about 600 type definitions, most of which are annotated using Schema.
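
As a small illustration of the kinds of constraints Schema can express beyond plain types, here’s a sketch (the names and schemas are illustrative, not from our codebase):

```clojure
(ns example.schemas
  (:require [schema.core :as s]))

;; An arbitrary predicate: a port must be an integer in the valid range
(def Port (s/pred #(and (integer? %) (<= 1 % 65535)) 'valid-port?))

;; An enum of allowed states
(def TaskState (s/enum :pending :running :done))

;; A union, dispatched by a cheap precondition on each branch
(def Identifier (s/cond-pre s/Int s/Keyword))

(s/validate Port 8080)          ;; returns 8080
(s/validate TaskState :running) ;; returns :running
;; (s/validate Port 70000)      ;; throws with a descriptive error
```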

Around Schema we have a helper called "defrecord+" that defines constructor functions which also perform validation (e.g. for a type Foo it generates "->valid-Foo" and "map->valid-Foo"). These functions throw a descriptive exception if the schema check fails.
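
A sketch of the idea, using a hypothetical Endpoint type (the real "defrecord+" is an internal macro; this shows the equivalent hand-written form):

```clojure
(ns example.records
  (:require [schema.core :as s]))

;; A Schema-annotated record plus a validating constructor,
;; approximating what "defrecord+" generates. Names are illustrative.
(s/defrecord Endpoint
  [host :- s/Str
   port :- (s/pred #(and (integer? %) (<= 1 % 65535)) 'valid-port?)])

(defn ->valid-Endpoint [host port]
  (s/validate Endpoint (->Endpoint host port)))

(->valid-Endpoint "localhost" 8080)   ;; ok, returns the record
;; (->valid-Endpoint "localhost" -1)  ;; throws a descriptive exception
```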

There’s no static type checking in Clojure, and static type checks wouldn’t be able to verify all the kinds of constraints we define using Schema anyway (e.g. the value of a number being within a certain range). We’ve found we only need to insert schema checking in two kinds of places:

  • Construction of types, for which our auto-generated "valid" constructor functions remove all the ceremony. Detecting an error when creating a record is much better than when using it later on, since at creation time you have the context needed to debug the problem.
  • A few strategic spots throughout the codebase where lots of different types flow.

We only occasionally annotate the types of function args and return values. We find instead that being consistent about how we name things is good enough for understanding the code. We do have about 500 assertions throughout our codebase, though these are generally about higher-level properties rather than simple type checks.

The approach we’ve taken for schema definition and enforcement is lightweight, comprehensive, and doesn’t get in our way. The lack of static typing in Clojure scares a lot of programmers who have never used Clojure, and all we can say is that with a little bit of thought in how you organize your code it’s not an issue at all. And doing things dynamically means we can enforce stronger constraints than possible with static type systems.

Multi-module repository setup

Our codebase exists in a single git repo with four modules to split up the implementation:

  • "core", which contains the definition of our compiler and the corresponding abstractions for parallel programming
  • "distributed", which implements those parallel programming abstractions as a distributed cluster
  • "rpl-specter", an internal fork of Specter which adds a ton of functionality
  • "webui", which implements the front end of our product

We use Leiningen and deps.edn for our build. The ability to specify local targets as dependencies in deps.edn files is key to our multi-module setup, and the basic organization of our source tree looks like:

project.clj
deps.edn
rpl-specter/project.clj
rpl-specter/deps.edn
core/project.clj
core/deps.edn
distributed/project.clj
distributed/deps.edn
webui/project.clj
webui/deps.edn

Here’s an excerpt from our deps.edn file for "distributed":

{:deps {rpl/core {:local/root "../core"
                  :deps/manifest :deps}
        ...
        }
  ...
  }

This setup lets us develop within any one of the modules and automatically see any source changes in the other modules without having to make explicit Maven dependencies.

Loading the entire codebase for running tests or loading a REPL is pretty slow (largely from compilation of code using our custom language), so we use AOT compilation heavily to speed up development. Since we spend most of our time developing in “distributed”, we AOT compile “core” to speed things up.

Polymorphic data with Specter

Specter is a library we developed for supercharging our ability to work with data structures, especially nested and recursive data. Specter is based around the concept of “paths” into data structures, where a path can “navigate” to any number of values starting from the root of a data structure. The path can include traversals, views, and filters, and they’re deeply composable.
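
For readers unfamiliar with Specter, here’s a minimal sketch of composable paths on illustrative data:

```clojure
(ns example.specter
  (:require [com.rpl.specter :refer [select transform ALL MAP-VALS]]))

(def data {:a [1 2 3] :b [4 5 6]})

;; Navigate to every number nested inside every map value
(select [MAP-VALS ALL] data)
;; => [1 2 3 4 5 6]

;; The same path composes with a filter and works for transformation:
;; increment only the even numbers, preserving the structure
(transform [MAP-VALS ALL even?] inc data)
;; => {:a [1 3 3] :b [5 5 7]}
```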

Our compiler compiles code into an abstract representation with a distinct record type for each kind of operation possible in our language. There are a variety of attributes every operation type must expose in a uniform way. For example, one of these attributes is “needed fields”, the fields in the closure of that operation that it requires to do its work. A typical way to express this polymorphic behavior would be to use an interface or protocol, like so:

(defprotocol NeededFields
  (needed-fields [this]))

The problem with this approach is that it only covers querying. Some phases of our compiler must rewrite the fields throughout the abstract representation (e.g. uniquing vars to remove shadowing), and this protocol doesn’t support that. A (set-needed-fields [this fields]) method could be added to the protocol, but that doesn’t cleanly fit datatypes which have a fixed number of input fields. It also doesn’t compose well for nested manipulation.

Instead, we use Specter’s "protocol paths" feature to organize the common attributes of our varying compiler types. Here’s an excerpt from our compiler:

(defprotocolpath NeededFields [])

(defrecord+ OperationInput
  [fields :- [(s/pred opvar?)]
   apply? :- Boolean
   ])

(defrecord+ Invoke
  [op    :- (s/cond-pre (s/pred opvar?) IFn RFn)
   input :- OperationInput])

(extend-protocolpath NeededFields Invoke
  (multi-path [:op opvar?] [:input :fields ALL]))

(defrecord+ VarAnnotation
  [var :- (s/pred opvar?)
   options :- {s/Keyword Object}])

(extend-protocolpath NeededFields VarAnnotation
  :var)

(defrecord+ Producer
  [producer :- (s/cond-pre (s/pred opvar?) PFn)])

(extend-protocolpath NeededFields Producer
  [:producer opvar?])

"Invoke", for instance, is the type that represents calling another function. The :op field could be a static function or a var reference to a function in the closure. The other path navigates to all the fields used as arguments to the function invocation.

This structure is extremely flexible and allows for modifications to be expressed just as easily as queries by integrating directly with Specter. For instance, we can append a "-foo" suffix to all the needed fields in a sequence of operations like so:

(setval [ALL NeededFields NAME END] "-foo" ops)

If we want the unique set of fields used in a sequence of ops, the code is:

(set (select [ALL NeededFields] ops))

Protocol paths are a way to make the data itself polymorphic and able to integrate with the supercharged abilities of Specter. They greatly reduce the number of manipulation helper functions that would be required otherwise and make the codebase far more comprehensible.

Organizing complex subsystems with Component

The daemons comprising the distributed system we’re building consist of dozens of subsystems that build on top of and depend on one another. The subsystems need to be started in a particular order, and in tests they must be torn down in a particular order. Additionally, within tests we need the ability to inject mocks for some subsystems or disable some subsystems altogether.

We use the Component library to organize our subsystems in a way that manages lifecycle and gives us the flexibility to inject alternate dependencies or disable subsystems. Internally, we built a "defrcomponent" helper to unify field and dependency declarations. For example, from our codebase:

(defrcomponent AdminUiWebserver
  {:init      [port]
   :deps      [metastore
               service-handler
               cluster-retriever]
   :generated [^org.eclipse.jetty.server.Server jetty-instance]}

  component/Lifecycle
  ...
  )

This automatically retrieves the fields "metastore", "service-handler", and "cluster-retriever" from the system map it’s started in and makes them available in the closure of the component’s implementation. It expects one field, "port", in the constructor of the component, and on startup it generates another field, "jetty-instance", into its internal closure.

We also extended the component lifecycle paradigm with "start-async" and "stop-async" protocol methods. Some components do part of their initialization/teardown on other threads, and it was important for the rest of our system (especially deterministic simulation, described below) for those to be doable in a non-blocking way.
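
A sketch of what such an async lifecycle extension might look like (the actual protocol is internal; all names here are assumptions):

```clojure
(ns example.async-lifecycle)

;; Hypothetical async lifecycle protocol: start/stop return promises
;; that are delivered when the work actually completes, so callers
;; never block. Names are illustrative, not from the real codebase.
(defprotocol AsyncLifecycle
  (start-async [component]
    "Begin starting; returns a promise delivered when startup completes.")
  (stop-async [component]
    "Begin stopping; returns a promise delivered when teardown completes."))

;; A component doing part of its startup on another thread delivers
;; the promise from that thread instead of blocking the caller.
(defrecord BackgroundWorker []
  AsyncLifecycle
  (start-async [this]
    (let [started (promise)]
      (future
        ;; ... expensive initialization on another thread ...
        (deliver started this))
      started))
  (stop-async [this]
    ;; Nothing async to do here; deliver immediately
    (doto (promise) (deliver this))))
```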

Our test infrastructure builds upon Component for doing dependency injection. For instance, from our test code:

(sc/with-simulated-cluster
  [{:ticker (rcomponent/noop-component)}
   {:keys [cluster-manager
           executor-service-factory
           metastore]
    :as   full-system}]
  ...
  )

That first map is a dependency injection map, and this code disables the “ticker” component. The “ticker” causes simulation tests to advance time occasionally, and since this test wants to control time explicitly it disables it. That dependency injection map can be used to override or disable any component in the system, providing the flexibility necessary for writing tests.

Using with-redefs for testing

Clojure provides the macro "with-redefs" that can redefine any function executed within the scope of that form, including on other threads. We have found this to be an invaluable feature for writing tests.

Sometimes we use with-redefs to mock specific behavior in the dependencies of what we’re testing so we can test that functionality in isolation. Other times we use it to inject failures to test fault-tolerance.

The most interesting usage of with-redefs in our codebase, and one of our most common, is using it alongside no-op functions we insert into our source code. These functions effectively provide a structured event log that can be dynamically tapped in an à la carte way depending on what a test is interested in.

Here’s one example (out of hundreds in our codebase) of how we use this pattern. One part of our system executes user-specified work in a distributed way and needs to: 1) retry the work if it fails, and 2) checkpoint its progress to a durable, replicated store after a threshold amount of work has succeeded. One of the tests for this injects a failure the first time work is attempted and then verifies the system retries the work.

The source function that executes the work is called "process-data!", and here is an excerpt from that function:

(when (and success? retry?)
  (retry-succeeded)
  (inform-of-progress! manager))

"retry-succeeded" is a no-op function defined as (defn retry-succeeded [] ).

In a totally separate function called "checkpoint-state!", the no-op function "durable-state-checkpointed" is called after it finishes replicating and writing to disk the progress information. In our test code, we have:

(deftest retry-user-work-simulated-integration-test
  (let [checkpoints     (volatile! 0)
        retry-successes (volatile! 0)]
    (with-redefs [manager/durable-state-checkpointed
                  (fn [] (vswap! checkpoints inc))

                  manager/retry-succeeded
                  (fn [] (vswap! retry-successes inc))]
      ...
      )))

Then in the body of the test, we check the correct internal events happen at the correct moments.

Best of all, since this à la carte event log approach is based on no-op functions, it adds basically no overhead when the code runs in production. We have found this approach to be an incredibly powerful testing technique that utilizes Clojure’s design in a unique way.

Macro usage

We have about 400 macros defined throughout our codebase, 70% of which are part of source code and 30% of which are for test code only. We have found the common advice for macros, like "don’t use a macro when you can use a function", to be wise guidance. That we have 400 macros doing things regular functions cannot demonstrates the extent to which we build abstractions beyond what’s possible in a typical language without a powerful macro system.

About 100 of our macros are simple "with-" style macros which open a resource at the start and ensure the resource is cleaned up when the form exits. We use these macros for things like managing file lifecycles, managing log levels, scoping configurations, and managing complex system lifecycles.
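
The general shape of these macros is simple; here’s a sketch using an illustrative resource (a temp file), not one of our actual macros:

```clojure
(ns example.with-macros)

;; Generic shape of a "with-" style macro: acquire a resource, bind it,
;; and guarantee cleanup when the form exits, even on exceptions.
(defmacro with-temp-file [[sym] & body]
  `(let [~sym (java.io.File/createTempFile "scratch" ".tmp")]
     (try
       ~@body
       (finally
         (.delete ~sym)))))

;; Usage: the file exists only within the form's dynamic extent
(with-temp-file [f]
  (spit f "hello")
  (slurp f))
;; => "hello"
```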

About 60 of our macros define abstractions of our custom language. In all of these, the interpretation of the forms within is different from vanilla Clojure’s.

Many of our macros are utility macros, like "letlocals" which lets us more easily mix variable binding with side effects. We use it heavily in test code like so:

(letlocals
  (bind a (mk-a-thing))
  (do-something! a)
  (bind b (mk-another-thing))
  (is (= (foo b) (bar a))))

This code expands to:

(let [a (mk-a-thing)
      _ (do-something! a)
      b (mk-another-thing)]
  (is (= (foo b) (bar a))))

The rest of the macros are a mix of internal abstractions, like a state machine DSL we built, and various idiosyncratic implementation details where the macro removes code duplication that can’t be removed otherwise.

Macros are a language feature that can be abused to produce terribly confusing code, or they can be leveraged to produce fantastically elegant code. Like anything else in software development, the result you end up with is determined by the skill of those using it. At Red Planet Labs we can’t imagine building software systems without macros in our toolbox.

Deterministic simulation

As we wrote about previously, we have the ability to write 100% reproducible distributed systems tests by running our whole system on a single thread and randomizing the order in which entities execute events starting from a random seed. Simulation is a major, codebase-spanning capability that heavily utilizes the aforementioned techniques of dependency injection and redefs. For example:

  • Any part of the system that in production would be a unique thread is coded in terms of executor services. To get an executor service for that particular part of the system, it requests one from an "executor service factory". In production, this returns new threads. In simulation, however, we override that component to provide executor services from our single-threaded, globally managed source.
  • Much of our system relies on time (e.g. timeouts), so time is abstracted away from our implementation. Any part of the system that is interested in time consults a "time source" dependency. In production this is the system clock, but in simulation the component is overridden with a "simulated time source" that can be explicitly controlled within our simulation tests.
  • Promises are used quite a bit throughout the codebase to manage asynchronous, non-blocking behavior. Simulation uses with-redefs to layer additional functionality into promises that is useful for stepping through simulations.
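
As a sketch of the time-source idea from the second point above (names are illustrative, not from the actual codebase):

```clojure
(ns example.time-source)

;; Hypothetical "time source" abstraction: production consults the
;; system clock, while simulation substitutes explicitly controlled time.
(defprotocol TimeSource
  (current-millis [this]))

(defrecord SystemTimeSource []
  TimeSource
  (current-millis [_] (System/currentTimeMillis)))

(defrecord SimulatedTimeSource [now]
  TimeSource
  (current-millis [_] @now))

(defn advance!
  "Move simulated time forward by the given number of milliseconds."
  [ts millis]
  (swap! (:now ts) + millis))

;; A simulation test can now step time deterministically:
(let [ts (->SimulatedTimeSource (atom 0))]
  (advance! ts 5000)
  (current-millis ts))
;; => 5000
```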

Front end

Our product provides a UI to let users see what they have running on a cluster, the current status of operations like scaling, and telemetry showing what’s going on in their applications.

The front end is a web-based single page app coded in ClojureScript. The ClojureScript ecosystem has many mature, well-designed libraries that make development efficient and fun.

Reviewing the libraries and their advantages could be a blog post in itself, but briefly: we use re-frame because its data-oriented state management and event handling models are easy to reason about and inspect. We use reitit for frontend routing; we like how its data-oriented design allows us to associate arbitrary data with each route, which in turn lets us do neat things like dispatch re-frame events on route changes. We use shadow-cljs to compile the project, in part because it dramatically simplifies the process of using JavaScript libraries and dealing with externs.

We use uPlot for displaying time-series data. Our API backend is served using a Jetty server, and we use Compojure to define backend routes.

Defining our front end in the same language as the rest of our codebase is a huge win, especially the ease of shuttling data back and forth between Clojure and ClojureScript. The immutable style emphasized by Clojure is just as beneficial in front-end code as back-end code, so being able to leverage that consistently benefits our productivity and the robustness of our product greatly.

Libraries

Here are many of the external libraries we use in our codebase, a mixture of Clojure, ClojureScript, Java, and JavaScript libraries:

  • ASM: used for bytecode generation
  • Compojure: used for defining routes in web server
  • Component: used for defining subsystems with well-defined lifecycles
  • Jetty: used to serve data to our front end
  • Loom: used for representing graph data structures, especially within our compiler
  • Netty: used for asynchronous network communication
  • Nippy: used for serialization
  • Potemkin: used for a few utilities, especially "import-namespace", "import-vars", and "def-map-type"
  • reitit: used for front-end routing
  • re-frame: used to build our web code
  • RocksDB: used for some durable indexing tasks
  • Schema: used for defining types with rich schemas
  • shadow-cljs: used for compiling front-end code
  • SnakeYAML: used for parsing YAML
  • Thrift: used to help power some of the CLI of our product
  • uPlot: used to display time series graphs in our front end

Conclusion

Clojure has been fantastic for developing our product. It’s enabled us to build powerful abstractions not possible in other languages, remove all ceremony whatsoever, and utilize powerful testing techniques. Plus, multiple members of our team started with no Clojure or functional-programming experience and were able to get up to speed quickly.

If you’re interested in working with us to help define the future of software development, we’re hiring! We work on hard problems pushing what’s possible with compilers, databases, and distributed systems. Our team is fully distributed and we’re open to hiring anywhere in the world.

8 thoughts on “Tour of our 250k line Clojure codebase”

  1. Thanks for the post, interesting read.

    “Polymorphic data with Specter” – this looks very neat, it reminds me of lens (https://hackage.haskell.org/package/lens) in its purpose. I’m curious if you started out with the specter library, or you hit a breaking point and decided to refactor the codebase to using this? How did the library come about?

    “And doing things dynamically means we can enforce stronger constraints than possible with static type systems.” – are there particular static type systems you have in mind? The example given of a number being in a certain range is certainly possible in a number of type systems (anything with a degree of dependent types) – idris, agda, and haskell come to mind.

    “Multi-module repository setup” – could you elaborate a bit on the monorepo setup you have? In particular, what do your build process and CI look like? I know this can be a serious pain point for larger projects, and I haven’t seen a standard clojure solution for monorepo builds.

    “We use shadow-cljs to compile the project, in part because it dramatically simplifies the process of using JavaScript libraries and dealing with externs.” – I take it this is in comparison to figwheel/figwheel-main?

  2. Early on in development I realized Specter was necessary if the project was going to be feasible. I was hitting many of the same complexity issues around manipulating data I’d experienced in previous projects and saw clearly that my data manipulation needs were going to be far greater.

    I’m thinking mostly of type systems on the JVM, but there’s fundamentally a limit to what can be statically checked without adding overhead/complexity somewhere else.

    We use Jenkins for CI and divide up the tests into many build steps so they can be run in parallel. Each build step is run with “lein test :selector”, where “:selector” selects a subset of the tests to run. We also have one additional build step which runs an end-to-end integration test which launches via a custom shell script (it has to set up many daemons, etc.).

    I don’t think our setup is ideal. deps.edn doesn’t understand anything but Clojure files, and since we have a little bit of Java source, all of our project.clj files have to point to the Java source directories in their dependent projects, e.g. “distributed” includes “../core/src/java” in its “:java-source-paths”. It works well enough at our current stage, but I wouldn’t be surprised if we end up needing to rework our build in the future.

    I think figwheel is great also. I’m not strongly opinionated on figwheel vs. shadow-cljs.

  3. So is the defrcomponent just documentation of which parameters are for init, deps, etc.? Or are there runtime checks that prevent component creation in case init params are not provided?

    1. “:init” turns into the arguments for the constructor of the record. In the example in the post the component would be constructed as “(->AdminUIWebServer port)”. The :deps are assoc’d into the record on startup from the system map, and an exception is thrown if any of those keys aren’t present when starting the system map. The “:generated” fields are a convenience – they’re assoc’d into the component during “start”, and anywhere else in the definition of the component they can be referred to by that symbol instead of having to do (:jetty-instance this).

      1. Is the defrcomponent macro open-sourced/gist available? I think it’s solving the problem we are facing too. I’ll give the implementation a shot. Our devs often conflate deps and init, this could really streamline the way component is used throughout our codebase

        I think you all have been using component at scale so, whenever possible could you look at and give insights on what’s worked for you at scale https://www.reddit.com/r/Clojure/comments/nggl9p/how_to_pass_components_across_functions/

  4. If you find yourselves wanting more fine-grained mocking behavior than `with-redefs` can provide, my organization has been relying a lot on https://github.com/nubank/mockfn, which we wrote to port some of the nice mocking capabilities of midje to clojure.test.

    It can do things like only mock behavior when the arguments match a certain structure or assert that a function was called N times.

    Regardless, really nice article! It is always interesting to get a feel for the choices made in larger Clojure code-bases
