How to Choose the Right Data Movement: Real-time or Batch?
We all want a “zero wait” infrastructure. This has spurred many organizations to push all data through a real-time infrastructure. It’s important to recognize that “zero wait” means that the information is in ready form when a user needs it. If the user needs information that includes averages, sums, or comparisons, the data set must already have been fully processed (e.g., cleaned, combined, and augmented). Building the data infrastructure with this in mind is very important.
The popular point of view is that real-time processing is the “modern” solution and that batch processing is the “archaic” way. However, real-time processing has also been around for a long time, and each mode of processing exists for a different purpose.
One trade-off between real-time and batch processing is high throughput versus low latency. Choosing one mode over the other can be somewhat counterintuitive for the broader team, so it is important to determine the throughput and latency requirements independently of each other. A great example of throughput versus latency is the question, “What is the best way to get from Boston to San Francisco?” You might answer, “By plane.” That would be true for transporting a small group of people, because flying gives the lowest latency. But would a plane be the best way to move a million people at once? How would you get the highest throughput?
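To put rough numbers on the travel analogy, here is a minimal Python sketch. Everything in it is an assumption made for illustration: the `Mode` class, the "convoy" alternative, and the capacity and trip-time figures are hypothetical, and the model assumes trips run back to back with no return leg. It only shows that the mode with the lowest latency is not necessarily the one with the highest throughput.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class Mode:
    """A hypothetical way of moving people (or data) from A to B."""
    name: str
    capacity: int      # people carried per trip (assumed figure)
    trip_hours: float  # duration of one trip, i.e. the latency (assumed figure)

    def throughput_per_hour(self) -> float:
        # People delivered per hour when trips run back to back.
        return self.capacity / self.trip_hours

    def hours_to_move(self, total: int) -> float:
        # Simplified model: one vehicle, consecutive trips, return leg ignored.
        trips = math.ceil(total / self.capacity)
        return trips * self.trip_hours


# Illustrative numbers only, not real travel data.
plane = Mode("plane", capacity=200, trip_hours=6.0)
convoy = Mode("convoy", capacity=50_000, trip_hours=72.0)

for total in (150, 1_000_000):
    for mode in (plane, convoy):
        print(f"{mode.name:>6}: {total:>9,} people in {mode.hours_to_move(total):>8,.0f} hours")
```

Under these assumed figures, the plane delivers a small group in 6 hours against 72 for the convoy, but moving a million people takes 30,000 hours by plane against 1,440 by convoy. The same reasoning applies when comparing a real-time path with a batch path: the two requirements need to be evaluated separately.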
Real-time processing is very good for collecting input continuously and responding immediately to a user, but it is not the solution for all data movement. It’s not even necessarily the fastest mode of processing. When deciding whether data should be moved in real time or in batch, it is important to define the nature of the business need and the method of data acquisition.