When designing software systems, it is important to be able to assess the performance of the system to determine whether it meets the non-functional requirements. The usual way to do this is to build a simple version and run load tests against it. In other words, you don’t know how your system will perform until you have spent development time building a simple version for performance testing. Whilst this works, the feedback loop is slow and development time is extended. What you want is to be able to analyze the performance of your system at the design level, before building anything.
In this post, I present a simple model you can use to analyze the performance of your system designs on paper, as many times as you like, before actually building anything. I call it the writer/reader (WR) model for analyzing software architectures. With this approach, you can evaluate your designs and get almost immediate feedback on how well they meet the non-functional requirements of the system. Non-functional requirements usually relate to the performance of the system under certain circumstances. For example, if you were to design an image upload system, the non-functional requirements would specify the maximum and minimum size of image allowed, and the number of uploads expected in a given period (day, hour, etc.).
WR Modeling
The writer/reader model is a very simple way to model software designs. It removes a lot of detail but keeps enough information to gauge the performance of an architecture. You will be able to reach many conclusions on paper without having to build a prototype. Usually, a software architecture is made up of components that interact with each other to satisfy the requirements of the system. The component making a request to another component is the writer, and the component that receives and responds to that request is the reader. For example, in a blogging application, we could identify three components:
- The website that normal users interact with.
- The API server that the website interacts with.
- The database that the API server interacts with.
Roughly, we can have a diagram like this:
In the WR model, the direction of the arrow shows which component is the writer and which is the reader. Hence, the website component writes to the API server and the API server writes to the database. The write here is in the sense of writing requests. In this model I assume that the API server never writes requests to the website; it only responds to requests it reads from the website. A similar relationship exists between the database and the API server. If, say, the API server were also able to write requests to the website, we would have an arrow going from the API server to the website too.
In the image above, component A can write to B and vice versa. However, in the blogging software example, we assume that at no point in the lifetime of the application will the API server ever make an application request to the website.
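One way to make the writer/reader relationship concrete is to treat the model as a small directed graph. The sketch below is only an illustration: the component labels and the helper names (`readers_of`, `writers_to`, `is_bidirectional`) are made up for this example and are not part of any library.

```python
# WR model of the blogging example: each edge points from writer to reader.
# The component names are illustrative labels, not real services.
WR_EDGES = [
    ("website", "api_server"),
    ("api_server", "database"),
]


def readers_of(component, edges):
    """Components that `component` writes requests to."""
    return [reader for writer, reader in edges if writer == component]


def writers_to(component, edges):
    """Components that write requests to `component`."""
    return [writer for writer, reader in edges if reader == component]


def is_bidirectional(a, b, edges):
    """True when a writes to b and b also writes to a."""
    return (a, b) in edges and (b, a) in edges


print(readers_of("website", WR_EDGES))    # ['api_server']
print(writers_to("database", WR_EDGES))   # ['api_server']
print(is_bidirectional("website", "api_server", WR_EDGES))  # False
```

Adding the edge `("api_server", "website")` would model the two-way case, where both components write requests to each other.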
Let’s look at a software problem that we’re going to analyze using the WR approach.
Problem Statement
Let’s build a system to constantly track the location of fleets of vehicles and show a real-time movement feed on a dashboard. It should satisfy the following functional requirements:
- Constantly ingest location data sent from GPS tracking devices mounted on the vehicles.
- Provide near real-time movement feed on a dashboard.
- Provide statistics of the data collected on a dashboard.
Aside from these functional requirements, it should also satisfy the following non-functional requirements.
- It should be able to handle a minimum of 10,000 and a maximum of 100,000 GPS devices concurrently.
- The GPS devices will be sending location data at an interval of 2 seconds.
- The near real-time movement dashboard should be refreshed every 5 seconds with the updated location of the vehicles.
- The statistics dashboard should be updated at least once every 15 minutes.
- Both dashboards should be able to handle a minimum of 100 and a maximum of 1000 users concurrently.
First Solution
Below is the WR model diagram of the proposed solution. The arrows show strictly which component writes to which. The numbers on the arrows give the request write rate: /s means request writes per second and /m means request writes per minute.
From the diagram, it should be clear which components are the Request Writers and Readers.
GPS Devices → Tracker API
First, let’s focus on the GPS devices and the Tracker API. We will not concern ourselves with the protocols by which the devices send the location data; instead, we assume that the data reaches our APIs reliably. As stated in the non-functional requirements, 10,000 to 100,000 devices are expected to send location data every 2 seconds. This is depicted in the diagram as 5,000/s to 50,000/s, where /s means request writes per second. To handle this load, the Tracker API is designed to scale horizontally, with an auto-scaling load balancer in front of several small API servers.
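The conversion from "N clients, one request every T seconds" to a write rate is simple enough to check in a couple of lines. This is a minimal sketch; `write_rate` is a hypothetical helper name, and the dashboard lines assume that each dashboard user triggers one request per 5-second refresh, which the requirements imply but do not state outright.

```python
def write_rate(clients, interval_s):
    """Request writes per second when each client writes once per interval."""
    return clients / interval_s


# GPS devices: 10,000 to 100,000 devices, one write every 2 seconds.
print(write_rate(10_000, 2))    # 5000.0
print(write_rate(100_000, 2))   # 50000.0

# Assumption: each dashboard user causes one request per 5-second refresh.
print(write_rate(100, 5))       # 20.0
print(write_rate(1_000, 5))     # 200.0
```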
Let’s assume that each API server has been benchmarked to handle 100/s. To meet the minimum requirement of 5,000/s, the service will then need to start with 50 API servers already running. With autoscaling, as the write rate increases, the number of servers is increased automatically to meet the load, up to 500 servers at 50,000/s under the same assumption. This is crucial to ensure that the API is up most of the time without manual intervention.
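The server count follows directly from the write rate and the per-server benchmark. The sketch below uses the 100/s figure assumed in the text; `servers_needed` is a hypothetical helper, and it ignores any headroom you might want to keep for spikes or failed instances.

```python
import math


def servers_needed(rate_per_s, per_server_capacity):
    """Smallest number of servers whose combined capacity covers the rate."""
    return math.ceil(rate_per_s / per_server_capacity)


PER_SERVER = 100  # assumed benchmark: requests per second per API server

print(servers_needed(5_000, PER_SERVER))    # 50
print(servers_needed(50_000, PER_SERVER))   # 500
```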
Tracker API → Main DB
The Tracker API is supposed to do two things: 1) validate the data it receives from the devices, and 2) write valid data to the Main DB. It is therefore safe to assume that the Tracker API will write requests to the Main DB at the same rate it receives them from the devices, that is, 5,000/s to 50,000/s. This means that the Main DB must be able to handle that load. If the Main DB can’t keep up with the Tracker API, the Tracker API won’t be able to keep up with the GPS devices.
Auto-scaling will keep adding more API servers to handle the load, but since the Main DB can’t keep up, the extra servers would only be holding the backlog whilst waiting for the Main DB to respond. This means that the application would be spending more money than it should, backlogged requests would pile up, and eventually the system would be hours or even days behind. The main lesson is that the API is only as fast as the database, regardless of how fast the API itself is designed to be.
Hence the question is whether the chosen DB can handle that write load. To gauge this, one might need to refer to the documented system limits of the database in question.
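To get a feel for how quickly such a backlog grows, here is a rough sketch that assumes constant rates. The database capacity of 4,000/s is a made-up number used only for illustration, not a benchmark of any real database, and both function names are hypothetical.

```python
def backlog_after(arrival_rate, db_capacity, seconds):
    """Requests still waiting after `seconds`, assuming constant rates."""
    return max(0, arrival_rate - db_capacity) * seconds


def lag_seconds(backlog, db_capacity):
    """Time the DB would need to drain the backlog if arrivals stopped."""
    return backlog / db_capacity


ARRIVALS = 5_000      # writes per second from the Tracker API (minimum load)
DB_CAPACITY = 4_000   # hypothetical writes per second the DB can sustain

one_hour = backlog_after(ARRIVALS, DB_CAPACITY, 3_600)
one_day = backlog_after(ARRIVALS, DB_CAPACITY, 86_400)

print(one_hour, lag_seconds(one_hour, DB_CAPACITY))  # 3600000 900.0
print(one_day, lag_seconds(one_day, DB_CAPACITY))    # 86400000 21600.0
```

Under these assumed numbers, a shortfall of only 1,000/s leaves the system 15 minutes behind after an hour and 6 hours behind after a day, no matter how many API servers are added in front of the database.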
In general, you need to be aware of the limits of the third party solutions used as part of your system.
In our current design, we chose to use an SQL database. The simple answer to the question is no, it won’t be able to handle that amount of load. It’s simply too much, and you would be pushing the limits of the database in a way that slows down your whole design. If the Main DB can’t handle that load, the Replica DB, which has to apply the same writes, can’t handle it either. Let’s re-evaluate our choice and look at a second solution.
Before looking at the second solution, let’s quickly walk through the other components. There is an ETL (extract, transform, load) process that loads data from the Replica DB every 15 minutes, computes the needed statistics, and stores them in the statistics database. The pressure on the statistics database is minimal. However, since the Replica DB isn’t able to handle the pressure from the Main DB, data retrieval will be slow and the data retrieved will be old. This affects the Statistics API, which would start showing statistics for very old data. Similarly, the Location API won’t be able to return near real-time locations of vehicles. It is clear from our analysis that the critical part of the system centers around the Tracker API → Main DB link.
Second Solution
Below is the WR model diagram of the second proposed solution. As before, the arrows show strictly which component writes to which, and the numbers on the arrows give the request write rate: /s means request writes per second and /m means request writes per minute.
Compared to the first solution, this solution uses a Redis DB in place of the Main DB and the Replica DB. The ETL process now does two things:
- Compute statistics every 5 minutes. It loads the data directly from the Redis DB.
- Flush 5-minute old data into the Archive DB, which uses a NoSQL database.
The use of Redis solves our initial concern that the SQL database would not be able to keep up with the Tracker API. The limits of our system are well within the grasp of Redis. However, we had to shorten the ETL interval from 15 minutes to 5 minutes so that it can both compute the statistics and flush old data to the Archive DB. We need to flush the data because Redis is a memory-based database and we don’t want to keep all that data in memory. Suppose that we can fit each tracking record into, say, 64 bytes. Then every second we can expect 5,000 x 64 to 50,000 x 64 bytes, that is, about 0.00032 GB to 0.0032 GB (0.32 MB to 3.2 MB) of data per second. That amounts to about 0.096 GB to 0.96 GB every 5 minutes, and it keeps growing for as long as the devices keep sending. Therefore it becomes necessary to flush 5-minute old data somewhere else. A disk-based NoSQL database is a good choice because the data size is huge: roughly 10 TB to 100 TB of raw location data per year at these rates. It might be prudent to define a retention period.
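The data-volume arithmetic is easy to get wrong by a factor of ten, so it is worth checking with a few lines of code. This sketch assumes the 64-byte record size from the text, counts raw payload only (no per-key or index overhead in the store), and uses decimal units (1 MB = 10^6 bytes). The function name `bytes_for` is hypothetical.

```python
BYTES_PER_RECORD = 64                  # assumed size of one location record
SECONDS_PER_YEAR = 365 * 24 * 3_600


def bytes_for(rate_per_s, window_s, record_bytes=BYTES_PER_RECORD):
    """Raw payload bytes written during a window at a constant write rate."""
    return rate_per_s * record_bytes * window_s


for rate in (5_000, 50_000):
    per_second = bytes_for(rate, 1)
    per_5_min = bytes_for(rate, 300)
    per_year = bytes_for(rate, SECONDS_PER_YEAR)
    print(
        f"{rate}/s: {per_second / 1e6:.2f} MB per second, "
        f"{per_5_min / 1e6:.0f} MB per 5 minutes, "
        f"{per_year / 1e12:.1f} TB per year"
    )
```

Changing `BYTES_PER_RECORD` or the window length makes it easy to see how the memory needed between flushes scales with the record size and the ETL interval.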
This is not a perfect design, but it’s way better than the first solution. I’m excited to hear what you come up with, so just email me or comment and let’s have a chat about it.
Conclusion
You have seen that by employing our simple WR modeling technique, we were able to immediately identify flaws in our initial design, and propose and validate an upgraded second design.