In today's data-driven era, it's critical to design data platforms that help businesses foster innovation and compete in the market. Selecting a robust, future-proof set of tools and architecture requires a balancing act between purely technological concerns and the wider constraints of the project. These constraints include challenges regarding regulations, existing workforce skills, talent acquisition, agreed timelines, and your company's established processes.
A modern data platform is the set of technologies, configuration and implementation details that allows data to be stored and moved across the company systems to satisfy business needs. The SOFT methodology introduces an approach, based on four pillars, to define future-proof and effective data platforms that are scalable, observable, fast and trustworthy. Its aim is to enable data architects and decision makers to evaluate both current implementations and future architectures across a common set of criteria.
- Scalable: the ability of a solution to handle variability in load volumes, use cases, development platforms and cost.
- Observable: the availability of options for service monitoring and data asset discovery.
- Fast: the time needed to deliver data, develop pipelines and recover from problems.
- Trustworthy: the ability to provide secure, auditable and consistent results.
The rest of this post examines each pillar in detail, providing a set of questions to address during an evaluation.
Data volumes and use cases are growing every day at an unprecedented pace, so we need to find technological solutions that support the current set of requirements and have the ability to scale in multiple directions.
Design a solution with space for growth. Defining the current need and forecasting future growth can help us understand if and when we'll hit the limits of a certain architecture. The modern cloud enables very deep vertical scaling by creating bigger servers, but high availability and quick failover are also important considerations. Technologies that support horizontal scaling, splitting the load across nodes, usually offer a wider variety of options for sizing up and down depending on needs, but may incur consistency tradeoffs.
In addition to raw scaling, we also need to consider the automation options and the time to scale. We should be able to up/down size a data platform both manually if a peak in traffic is forecasted, and automatically when certain monitoring thresholds are exceeded. The time to scale is also crucial: keeping it at a minimum enables both better elasticity and less waste of resources.
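As a minimal sketch of threshold-driven automatic scaling, the logic below converges a cluster towards a target node count based on a monitored metric. All names, thresholds and limits are illustrative assumptions, not settings from any specific platform:

```python
# Hypothetical threshold-based autoscaling rule: the metric names,
# CPU thresholds and node limits are illustrative assumptions.
from dataclasses import dataclass

@dataclass
class ScalingPolicy:
    scale_up_cpu: float = 0.80    # add a node above 80% average CPU
    scale_down_cpu: float = 0.30  # remove a node below 30% average CPU
    min_nodes: int = 3            # keep enough nodes for high availability
    max_nodes: int = 12           # cap resource spend

def desired_nodes(current_nodes: int, avg_cpu: float, policy: ScalingPolicy) -> int:
    """Return the node count the platform should converge to."""
    if avg_cpu > policy.scale_up_cpu:
        return min(current_nodes + 1, policy.max_nodes)
    if avg_cpu < policy.scale_down_cpu:
        return max(current_nodes - 1, policy.min_nodes)
    return current_nodes

print(desired_nodes(3, 0.92, ScalingPolicy()))  # 4
```

The same rule serves both directions: a forecasted traffic peak can be handled by raising `min_nodes` manually, while the automatic path reacts to exceeded monitoring thresholds.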
When considering technology scaling, we need a broad evaluation of the tooling: not only the primary instance, but also high availability, the possibility of read-only replicas, and the robustness and speed of backup and restore processes.
Business cases scaling
Every business has a constantly growing collection of data and use cases, therefore a data platform needs both a low barrier to entry for new use cases and enough capacity to support them.
Having great integrations available is crucial for the solution to support more use cases and new technologies. Proper, scalable security management and use-case isolation are needed to comply with data protection regulations. We also need the ability to keep our data islands separate.
Beyond that, providing interfaces to explore the datasets already available promotes data re-use across use cases and accelerates innovation.
No matter how much automation is in place, data platforms need humans to build, test, and monitor new pipelines. When selecting a data solution we need to evaluate the skills and experience of the current team, and the likely cost of growing it in the geographies where the company operates.
Selecting well-adopted, open source data solutions is associated with bigger talent pools and more enthusiastic developers. Managed solutions offering pre-defined integrations and extended functionality can take some of the burden away from the team, allowing people to focus on building rather than maintaining.
The last part of scalability is money. A technically sound data solution can't be adopted if it isn't financially sustainable. Understanding the dollar-to-scale ratio is very important, as is mapping out future changes in the architecture that could raise costs significantly.
In other words, we need to look at the derivative of the cost, aiming for solutions whose cost grows at most linearly (and ideally sublinearly) with the amount of data.
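A toy way to make that check concrete: compare the cost per unit of data at a few volume points. If the unit cost keeps rising as volume grows, the architecture scales super-linearly and will eventually become financially unsustainable. The price points below are purely illustrative:

```python
# Toy check of how platform cost grows with data volume.
# All price points are made-up, illustrative numbers.

def scales_at_most_linearly(volumes: list[float], costs: list[float]) -> bool:
    """True if the cost per unit of data does not increase as volume grows."""
    unit_costs = [cost / volume for volume, cost in zip(volumes, costs)]
    return all(later <= earlier * 1.01  # small tolerance for noise
               for earlier, later in zip(unit_costs, unit_costs[1:]))

# Hypothetical pricing observed at 1, 10 and 100 TB:
print(scales_at_most_linearly([1, 10, 100], [100, 900, 8500]))   # True: sublinear
print(scales_at_most_linearly([1, 10, 100], [100, 1500, 40000])) # False: super-linear
```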
Questions to ask:
- What options are there to scale this technology, vertically and horizontally?
- How easy is it to add new technologies?
- How can we manage data security?
- What is the current experience of the team?
- How big is the talent pool?
- How complex is the management vs the development?
- How easy is it to add/extract new data, or integrate with other technologies?
- What will scaling cost? Does it grow linearly with the data volume?
The risk of not evaluating financial scalability is building a perfect system that works now but can't cope with the future success of our business or use case.
The old days of checking the batch status at 8AM are gone: modern data platforms and new streaming data pipelines are "live systems" where monitoring, understanding errors and applying fixes promptly is a requirement to provide successful outcomes.
Checking the data pipeline's end status on a dashboard is not enough; we need methods to easily define:
- metrics and log integration
- aggregations and relevant KPIs (Key Performance Indicators)
- alert thresholds
- automatic notifications
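The last two items on that list can be sketched as a tiny threshold evaluator that turns aggregated KPIs into alert messages. The metric names and limits here are assumptions for illustration, not values from any particular monitoring tool:

```python
# Minimal sketch of alert-threshold evaluation over aggregated KPIs.
# Metric names and threshold values are illustrative assumptions.

THRESHOLDS = {
    "consumer_lag_seconds": 300,  # data older than 5 minutes
    "error_rate_percent": 1.0,    # more than 1% failed records
}

def evaluate(metrics: dict) -> list[str]:
    """Return one alert message per KPI that exceeded its threshold."""
    return [
        f"{name}={value} exceeds threshold {THRESHOLDS[name]}"
        for name, value in metrics.items()
        if name in THRESHOLDS and value > THRESHOLDS[name]
    ]

alerts = evaluate({"consumer_lag_seconds": 480, "error_rate_percent": 0.2})
print(alerts)  # one alert, for the lag KPI only
```

In a real platform the returned messages would feed a notification channel (paging, chat, email) rather than being printed.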
Relying on a human watching a screen is not a great use of resources and doesn't scale. Automating platform observability and selecting tools that enable accurate external monitoring allows companies to centralize management efforts.
Recreate the bird's eye view
With new pipelines and technologies being added all the time, it's hard to keep an inventory of all your data assets and how they integrate with each other. A future-proof data solution needs to automatically harvest, consolidate, and expose data assets and the links between them.
Being able to trace where a certain data point came from is critical. We should be able to establish what transformations were performed, where the data exists, and who or what can interact with it at any point in the pipeline. This data lineage information helps us comply with security and privacy requirements, such as impact assessments and GDPR queries.
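A lineage record answering those questions could be as simple as the structure below. The field names and the example pipeline steps are hypothetical, chosen only to show the shape of the information a platform should capture:

```python
# Sketch of recording lineage for a dataset as it moves through a pipeline.
# Field names and pipeline steps are hypothetical illustrations.
from dataclasses import dataclass, field

@dataclass
class LineageRecord:
    source: str
    transformations: list = field(default_factory=list)
    locations: list = field(default_factory=list)

    def apply(self, transformation: str, destination: str) -> None:
        """Record one pipeline step: what was done, and where the result lives."""
        self.transformations.append(transformation)
        self.locations.append(destination)

record = LineageRecord(source="orders_db.orders", locations=["orders_db"])
record.apply("mask_pii", "staging.orders_clean")
record.apply("daily_aggregate", "warehouse.orders_daily")
print(record.transformations)  # ['mask_pii', 'daily_aggregate']
```

With such records harvested automatically, a GDPR query or impact assessment becomes a lookup over lineage data rather than an archaeology exercise.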
Obtaining a queryable global map of the data assets provides additional benefits regarding data-reusability: by exposing the assets already present in a company, we can avoid repeated data efforts, and promote data re-usage across departments for faster innovation.
History replay and data versioning
With continuously evolving systems, the ability to replay parts of the history provides ways to create baselines and compare results. These can help us detect regressions or errors in new developments and evaluate the impact of changes to the architecture.
Easily spinning up new "prod-like" development environments enables faster development iteration, safer hypothesis validation, and more accurate testing.
A "data versioning" capability allows us to compare the results of data manipulation across development stages; adding "metrics versioning" containing execution statistics enables better (and possibly automatic) handling of performance regressions.
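One lightweight way to implement data versioning is to fingerprint a dataset at each stage, so identical results in different environments yield identical versions. The hashing scheme below is an illustrative sketch, not a reference to any specific tool:

```python
# Sketch of "data versioning": fingerprint a dataset so results can be
# compared across development stages. The scheme is an illustrative assumption.
import hashlib
import json

def dataset_version(rows: list[dict]) -> str:
    """Stable fingerprint: the same rows (in any order) yield the same version."""
    canonical = json.dumps(
        sorted(rows, key=lambda row: json.dumps(row, sort_keys=True)),
        sort_keys=True,
    )
    return hashlib.sha256(canonical.encode()).hexdigest()[:12]

dev  = [{"id": 1, "total": 10.0}, {"id": 2, "total": 5.5}]
prod = [{"id": 2, "total": 5.5}, {"id": 1, "total": 10.0}]
print(dataset_version(dev) == dataset_version(prod))  # True: same data, same version
```

Comparing such fingerprints between a baseline run and a replayed run makes regressions in new developments immediately visible.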
Questions to ask:
- What's happening now in my data platform?
- Is everything working ok? Are there any delays?
- What data assets do I have across my company?
- How is my data transformed across the company?
- Can I replay part of the pipelines in the event of processing errors?
- How is a change performing against a baseline?
From micro-batching to streaming, the time to data delivery is trending towards real time. The old days of reporting on yesterday's data are gone; we need to be able to report in near real time on what's currently happening in our business, based on fresh data.
Time to develop
Delivering data in near-real time is useless if developing and testing pipelines takes weeks of work. The toolchain should be self-service and user-friendly. To achieve this, it is crucial to identify the target developers and their skills, and to select a set of technologies that will allow this group to work quickly, effectively, and happily. Once the use case is defined, it is also worth checking for existing solutions that can help by removing part of the problem's complexity.
Investing time in automation reduces the friction and delay in deploying data pipelines. People shouldn't waste time clicking around to move a prototype to production; automated checks and promotion pipelines should take care of this part of the artifact journey.
Time to deliver
Selecting a data architecture that enables data streaming is key to building future-proof pipelines. The ability to transition from batch to streaming also lets us improve existing pipelines iteratively, rather than requiring a big-bang approach.
An integral part of the Time To Deliver(y) is also the Time To Execute: the performance of your chosen platform needs to be evaluated against target latency figures.
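As a toy illustration of evaluating Time To Execute against a target, the snippet below timestamps an event at emission and checks the end-to-end latency after a processing step. The event shape and the 2-second target are assumptions for the example:

```python
# Sketch of measuring Time To Execute against a target latency figure.
# The event payload and the 2-second target are illustrative assumptions.
import time

TARGET_LATENCY_SECONDS = 2.0

def process(event: dict) -> dict:
    # Placeholder transformation standing in for a real pipeline step.
    return {"order_id": event["order_id"], "total": event["qty"] * event["price"]}

event = {"order_id": 42, "qty": 3, "price": 9.99, "emitted_at": time.time()}
result = process(event)
latency = time.time() - event["emitted_at"]
print(f"within target: {latency <= TARGET_LATENCY_SECONDS}")
```

Collecting this latency per event (rather than per batch) is what lets a platform claim near-real-time delivery with evidence.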
Time to recover
Finally, it's crucially important to define acceptable Time To Recover thresholds when data pipelines run into problems. To achieve this, take the time to understand, test and verify what the selected toolchain has to offer in this space. Especially when dealing with stateful transformations, it is crucial to navigate the options regarding checkpoints, exactly-once delivery, and replay of events. The events replay in particular can be handy to verify new developments, run A/B tests, and quickly recover from problematic situations.
Questions to ask:
- How much delay is there between the source event and the related insight?
- How good is the technology performance for the task?
- How fast can we create new pipelines?
- How fast can we promote a change to production?
- How fast can I recover from a crash?
Data is among a company's most valuable assets, and building trusted data platforms is key to ensuring that quality input is available to all stakeholders.
The first aspect of trust is related to security. As briefly mentioned in the Observable section, the toolchain should allow us to define and continuously monitor which actors can access a certain data point and what privileges they have. Moreover, platforms should provide enough logging and monitoring capabilities to detect and report, in real time, any inappropriate access to data. Providing a way to evaluate the implications of security changes (impact assessment) adds an extra level of checking before a change is made.
Regulations define what is the correct usage of data, and which attributes need to be masked or removed. To build future-proof data platforms we need the ability to apply different obfuscation processes depending on the roles/privileges of the data receiver.
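A minimal sketch of role-dependent obfuscation: the same record is rendered differently depending on the receiver's privileges. The roles, fields and masking rules below are illustrative assumptions, not a reference to any specific compliance framework:

```python
# Sketch of role-dependent data masking. Roles, fields and masking
# rules are illustrative assumptions.

MASKED_FIELDS_BY_ROLE = {
    "analyst": {"email", "phone"},  # analysts see no contact details
    "support": {"phone"},           # support sees email but not phone
    "admin": set(),                 # admins see everything
}

def mask_record(record: dict, role: str) -> dict:
    """Return a copy of the record with fields masked according to the role."""
    hidden = MASKED_FIELDS_BY_ROLE.get(role, set(record))  # unknown role: hide all
    return {key: ("***" if key in hidden else value)
            for key, value in record.items()}

row = {"name": "Alice", "email": "alice@example.com", "phone": "555-0100"}
print(mask_record(row, "analyst"))  # {'name': 'Alice', 'email': '***', 'phone': '***'}
```

Defaulting unknown roles to "hide everything" is the safe failure mode for this kind of rule.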
We've covered a lot about automation in this article. For security, whilst many checks can be performed in code, we might still want to retain manual gates requiring human evaluation and approval, allowing companies to comply with the relevant security regulations.
Vendor ecosystem evaluation
To build trustworthy future-proof data platforms, we need trust in the vendors or projects providing the tools, and their ability to continue to evolve with new features during the lifetime of the tool.
Therefore a wide assessment of the company or open source project is required: consider the tool's adoption, any existing community and its growth patterns, and the available support options. Taken together, these topics can help you understand whether a current tech choice will still be a good one in future years.
Data locality and cloud options
Companies might need to define where in the world the data is stored and manipulated. Being forced to use consistent locations across the entire data pipeline might reduce the list of technologies or vendors available. The choice can further be refined by internal policies regarding the adoption of cloud, managed services or multi-cloud strategies.
From a pure data point of view, "trustworthy" means that the generated results can be trusted.
The results need to be fresh, correct, repeatable and consistent:
- Fresh: the results are a representation of the most recent and accurate data.
- Correct: the KPIs and transformations should follow the definitions. A data transformation/KPI should be defined once across the company, providing a unique source of truth.
- Repeatable: the workflow could be run again with the same input and provide the same output.
- Consistent: performance is resilient to errors and consistent over time, giving stakeholders confidence that they will receive the data in a timely manner.
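The "Repeatable" property can be verified mechanically: run the same transformation twice on the same input and check that the outputs match. The KPI below (revenue per country) is an illustrative example, not from any specific pipeline:

```python
# Sketch of verifying repeatability: the same input must always
# produce the same output. The KPI and data are illustrative.

def revenue_per_country(orders: list[dict]) -> dict:
    """Deterministic KPI: total order amount per country."""
    totals: dict = {}
    for order in orders:
        totals[order["country"]] = totals.get(order["country"], 0.0) + order["amount"]
    return totals

orders = [
    {"country": "FI", "amount": 10.0},
    {"country": "DE", "amount": 4.5},
    {"country": "FI", "amount": 2.5},
]
first, second = revenue_per_country(orders), revenue_per_country(orders)
print(first == second)  # True: same input, same output
```

Checks like this are cheap to run in CI, turning "repeatable" from a promise into an enforced property.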
Questions to ask:
- How can I secure my data? Can I mask it, and apply filters to rows/columns?
- Do I trust the vendor or project providing the tool?
- Can I use the tool in a specific datacenter/region?
- Can I precisely locate my data at any stage of my flow? Both from a technical and also geographical point of view?
- Can I trust the data? Is it correct, are results repeatable, on time, and consistent?
- How many times do we have a particular KPI defined across the company?
Deploy SOFT in your own organization
Whether you are looking to define your next data platform, or evaluating the existing ones, the SOFT framework provides a comprehensive set of guidelines to help you assess your options. By using the list of questions as a baseline for the evaluation, you can properly compare different solutions and make a better informed decision about the perfect fit for your data needs.
Put the SOFT framework to use, and let us know your opinion!
To get the latest news about Aiven and our services, plus a bit of extra around all things open source, subscribe to our monthly newsletter! Daily news about Aiven is available on our LinkedIn and Twitter feeds.
If you just want to find out about our service updates, follow our changelog.