A tale of an event-driven platform

OpenMarket – October 1, 2020

by Julio Cesar Monroy

Not so long ago, my team was tasked with creating new software that was capable of allowing our customers to model end-user interactions (known as conversations) in a very simple and flexible way.

In a nutshell, the requirements of the system can be summed up as follows:

  • Interactions are triggered by external actions, a user sending a text message, a broadcast to announce a new product, etc.
  • A conversation likely requires more information that should be retrieved from an external source, like variables for personalization.
  • The conversation lifespan could range from a few minutes to several days.

 

A fairly complex system that requires a high degree of integration with other internal and external tools – how do we solve the problem?

Enter event-driven architecture (EDA)

Per requirements, it was very clear that the system’s nature was asynchronous: reacting to incoming external actions, processing them, and, potentially, waiting for another external action to happen.

Our main goal was to architect the system in a way that embraced that asynchronous nature while maintaining a high degree of resiliency, and be flexible enough to accommodate future enhancements without requiring any major re-work. We found the event-driven paradigm to really fit our needs.

An event-driven architecture is modeled after events: an event is an action that has occurred, like a change of state or an incoming external action, and is significant enough that your system should process it.

The main ideas of an event-driven architecture include: there are event producers and event consumers, a common communication pattern is the fire-and-forget style, producers do not know anything about consumers and similarly, consumers do not know anything about producers. In other words, a highly decoupled system.

A great companion to an EDA is stream processing. To keep it simple, a stream can be seen as an unbounded list of events that can be consumed and processed in near realtime.

Kafka, Apache Pulsar, and NATS Streaming are projects worth checking out as the foundation for event streaming.

Some of the nice benefits we got by following an event-driven architecture:

  • Highly decoupled services
  • High scalability
  • Events are immutable, with the proper handling, you can reconstruct the state of your system, at any point in time, just by replaying the events.
  • A single event can have multiple consumers, and more consumers can be added in the future without any change in the underlying system.

Events as first-class citizens

One of the first tasks we had to do was to come up with a list of the core events.  For that, we used a technique called event storming. The basic idea is to decompose the business process into a set of domain events. Once we had this list of the initial events, we decided to base our events implementation on the CloudEvents specification.

Our events are represented as a JSON object and serialized to text format. We considered the idea of using binary formats, such as Apache Avro or Google’s proto buffers. However, at the end, we settled with a plain text format for simplicity.

A simple event definition

{
   "specversion" : "1.0",
   "type" : "my.type",
    "source" : "my-awesome-producer",
    "subject" : "123",
    "id" : "A234-1234-1234",
    "time" : "2018-04-05T17:31:00Z",
    "data" : { //custom attributes }
}

Events are organized into channels (aka topics). Each channel represents a certain domain and all events related to that domain land in the same channel.  Consumers show “interest” in certain events by subscribing to the proper channel and defining a filter by event type (to only consume the relevant events).

As of this writing, there are around 9 channels and 40 events distributed among those channels.  ~15 microservices act as producers/consumers. At some point all events are dumped into a long term, searchable storage where we can perform queries for troubleshooting or for analytical purposes. If required, we can also go back in time and replay the events.

Testing in a distributed and eventual consistency world

Our amazing QA team uses several tools for different purposes, from KarateDSL to test each microservice in isolation to Gatling for performance testing. However, our system operates under the eventual consistency model, which poses a challenge for integration testing. How do we solve it?

In a simple and easy way, we treat the system as a black box: our testing scripts interact with the platform by injecting events.  We know beforehand which are the side effects (in this case, which other events should be produced as a result of the processing). After the relevant events are injected, the scripts query the long-term event storage and compare the available events against a predefined list. If something is wrong then we know exactly which event(s) are missing/have wrong data and we can take a deep dive into our other tooling to discover the problem.

We also run the same testing in our production environment constantly – an effort to spot potential issues before our customers do. From time to time, we take samples of our event stream and run analytics over them. The results give us a better idea of our performance baseline and we can adjust our monitoring as needed.

Conclusion

Working with an asynchronous distributed system that operates under the eventual consistency model is not a walk in the park. Many of the rules that apply in the synchronous communication world are no longer valid (or are more difficult to apply), and, there is a very good chance that bad choices will hit you back harder. However, if you take the time to really understand what event-driven architecture is all about, follow best practices (there is a lot of literature out there), and, most important, an event-driven solution is a good fit for your problem, then, the pros far outweigh the cons.

Overall, we have been very happy with our choice to architect the system using an event-driven architecture.  It has allowed us to iterate the product quickly: the events first approach combined with domain-driven design techniques gave us some powerful tooling to better understand the business processes and to model the interactions. On the technical side, adding new features is (almost always) just a matter of adding events and new producers/consumers. Sometimes, the required events are already there, meaning just adding another consumer will do the trick.

See all tech blog posts

Related Content