What happened at the Open Source Data Infrastructure Chicago meetup?
On August 9, the folks at Discover hosted the Chicago Open Source Data Infrastructure meetup - the first in Chicago of the meetups we're organizing globally - co-located with the DevOpsDays Chicago conference, which a few colleagues and I were in town for.
It was a fun evening, with talks by my colleague Dewan Ahmed and by Ehfaj Khan, Expert Application Engineer at Discover Financial Services (DFS).
Ehfaj compared batch processing, a technique for processing large amounts of data offline, with event (or real-time) processing.
A key challenge with batch processing is that while the system can identify data sets that are not yet fit for processing, or even corrupted, sending them back to the source system for correction delays business processes. Unstructured data (and data inconsistency and data integrity issues) and batch processing are no friends. And debugging can be painful and time-consuming when you're processing large amounts of data.
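A toy sketch of that trade-off (my illustration, not from the talk - the record shape and function names are hypothetical): in a batch, one bad record can hold up the whole data set, while an event stream only sends the bad record back.

```python
# Toy illustration of batch vs. event processing (hypothetical names).

def process_batch(records):
    """Batch: collect everything first, then process in one offline pass.
    One unfit record sends the whole batch back to the source system."""
    bad = [r for r in records if "amount" not in r]
    if bad:
        # The entire batch is rejected for correction, delaying
        # every good record along with the bad one.
        return {"processed": 0, "rejected": len(bad)}
    return {"processed": len(records), "rejected": 0}

def process_events(records):
    """Event processing: act on each record as it arrives.
    A bad record only delays itself, not the rest of the stream."""
    processed, rejected = 0, 0
    for r in records:
        if "amount" in r:
            processed += 1   # act on it immediately (live feedback loop)
        else:
            rejected += 1    # send just this one back for correction
    return {"processed": processed, "rejected": rejected}

records = [{"amount": 10}, {"broken": True}, {"amount": 25}]
print(process_batch(records))   # -> {'processed': 0, 'rejected': 1}
print(process_events(records))  # -> {'processed': 2, 'rejected': 1}
```

The same three records: the batch run processes none of them because of the single bad record, while the event run still gets the two good ones through.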
Data is becoming the backbone and nervous system of any organization, and organizations are pushing to modernize their data infrastructure at an increasing rate. With speed a key element of business success, that's where real-time data comes in!
Event processing provides you with a live feedback loop, so that you can act on your data as it comes in. Real-time visibility into your business processes and a better response to your target customers make for a better customer experience. Event processing also significantly enhances an organization's visibility into its operations, enabling it to make faster and better decisions.
But whether event processing is the answer for your business depends, of course. As with the successful adoption of any new technology, tool, or framework, it depends on your use case and the business problem you're trying to solve.
Dewan talked about Infrastructure-as-Code (IaC) for data. Or, as he put it: "we build apps to move data" - so of course IaC applies to data as well.
Your streaming platform, relational database, NoSQL database, networking, monitoring and security need the same "-ilities" that the rest of your system needs:
- Reproducibility
- Repeatability
- Disposability (pets vs cattle)
- Consistency
- Ability to incorporate design changes
Here too, it's important to first investigate internally whether IaC is right for you and whether you're set up to create a pilot project, before (maybe) rolling it out wherever it makes sense.
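The core idea behind IaC that gives you those "-ilities" - declare the desired state, let a tool reconcile reality toward it - can be sketched in a few lines. This is a hypothetical illustration only; Dewan's demo uses Terraform, which does this with providers, state files, and plans:

```python
# Minimal sketch of the declarative idea behind IaC (hypothetical code,
# not Terraform): diff desired state against actual state into actions.

desired = {"topic-a": {"partitions": 3}, "topic-b": {"partitions": 6}}
actual  = {"topic-a": {"partitions": 3}, "topic-c": {"partitions": 1}}

def plan(desired, actual):
    """Compute the actions that would make actual match desired,
    roughly what `terraform plan` shows before an apply."""
    actions = []
    for name, cfg in desired.items():
        if name not in actual:
            actions.append(("create", name, cfg))
        elif actual[name] != cfg:
            actions.append(("update", name, cfg))
    for name in actual:
        if name not in desired:
            actions.append(("delete", name))
    return actions

print(plan(desired, actual))
# -> [('create', 'topic-b', {'partitions': 6}), ('delete', 'topic-c')]
```

Because the declared state is the source of truth, re-running the plan is repeatable and reproducible, and throwing infrastructure away and recreating it (cattle, not pets) is just another reconciliation.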
To showcase how orchestration with Terraform would work, Dewan shared a demo in which he uses Apache Kafka MirrorMaker to replicate data across data centers, from a (Kafka) source cluster to a target cluster.
| Data center 1 | AKMM | Data center 2 |
|---|---|---|
| topic-a | Replication flow | topic-b; dc1.topic-a |
| Apache Kafka source | | Apache Kafka target |
You can check out his demo here.
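The `dc1.topic-a` name in the table comes from MirrorMaker 2's default replication policy, which prefixes a replicated topic with the alias of the cluster it came from, so consumers on the target can tell remote topics from local ones. A simplified sketch of that naming convention (my own helper functions, not MirrorMaker's API):

```python
# Simplified sketch of MirrorMaker 2's default replication policy:
# a replicated topic is named "<source-cluster-alias>.<topic>".

def remote_topic(source_alias: str, topic: str) -> str:
    """Name a topic as it appears on the target cluster."""
    return f"{source_alias}.{topic}"

def source_of(remote: str) -> str:
    """Recover the source cluster alias from a replicated topic name."""
    alias, _, _ = remote.partition(".")
    return alias

print(remote_topic("dc1", "topic-a"))  # -> dc1.topic-a
print(source_of("dc1.topic-a"))        # -> dc1
```

That prefix is also what prevents replication loops: a cluster can recognize topics that originated elsewhere and avoid mirroring them back.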
If you're in the Chicago area, make sure to join the Chicago OSDI Meetup Group.