Pitfalls to avoid in Modern Data Platforms

Prajwal Kumar
7 min read · Jan 25, 2021
Bumpy road ahead in the Data Cloud.

Businesses all around the world, in every industry, continue to generate more data than ever before. To become data-driven and leverage Artificial Intelligence (AI) for successful business outcomes, enterprises invest a lot of time and money in building data platforms to process, store and derive insights from the ever-growing volume, velocity and variety of data.

Unfortunately, many organizations find the results middling. CXOs, particularly in large, complex organizations, have struggled to capitalize on the potential of modern data platforms to derive new value for their business and to meet data and AI democratization goals. Organizations that went through the journey of building a Hadoop based Data Lake in the last decade are seeing far more success in adopting modern Cloud based Data Platforms than those taking the plunge straight from on-premise Data Warehouses to the Cloud. This is simply because public cloud based Data Platforms are a natural evolution of the Hadoop philosophy, and organizations with prior experience running on-premise Hadoop based Data Lakes can reuse that experience in their data estate modernization journey. Irrespective of where you are in the journey, here are the most important aspects to focus on to avoid failure.

Polyglot Persistence

Data platforms should be developed using multiple physical data storage technologies. Using the right data store for the right kind of data is critical to meeting non-functional requirements for data consumption. A data platform built only on an object/file store (S3, ADLS, HDFS, etc.) is unlikely to meet the needs of streaming/fast data for real-time analytics, or of hot data that needs to be served via RESTful APIs to customer-facing, latency-sensitive web/mobile applications.

A Modern Data Platform must therefore be able to support varied data storage and consumption use-cases such as streaming, batch, graph, search, cache, key/value etc., all carefully orchestrated via a common metadata plane so as to avoid a proliferation of data puddles.
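As an illustration, here is a minimal sketch of polyglot persistence in Python, assuming a hypothetical S3 bucket as the batch landing zone and a Redis instance as the low-latency serving store (the bucket name, key layout and event shape are made up for the example):

```python
import json
import boto3      # AWS SDK; pip install boto3
import redis      # Redis client; pip install redis

# Hypothetical store locations -- replace with your own bucket/host.
RAW_BUCKET = "my-data-platform-raw"
CACHE_HOST = "localhost"

s3 = boto3.client("s3")
cache = redis.Redis(host=CACHE_HOST, port=6379)

def persist_order_event(event: dict) -> None:
    """Write one event to two purpose-built stores.

    - Object store (S3): cheap, durable landing zone for batch BI/ML workloads.
    - Key/value store (Redis): millisecond lookups for a customer-facing API.
    """
    payload = json.dumps(event).encode("utf-8")

    # Append-style write into the raw zone, partitioned by date for batch jobs.
    s3.put_object(
        Bucket=RAW_BUCKET,
        Key=f"orders/dt={event['order_date']}/{event['order_id']}.json",
        Body=payload,
    )

    # Latest-state write keyed by order id for low-latency serving.
    cache.set(f"order:{event['order_id']}", payload)

persist_order_event(
    {"order_id": "o-1001", "order_date": "2021-01-25", "status": "SHIPPED"}
)
```

The same event lands in each store in the shape that store is best at serving; a common metadata plane is what keeps these copies discoverable rather than turning into data puddles.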

Spoiler alert: Advances in database technology may in the future lead to convergence and the creation of multi-model databases that support multiple data storage patterns in a single system. As of today, such systems (like the recently announced Oracle Database 21c) are unlikely to match the performance characteristics of special-purpose databases that excel at solving specific use-cases.

Architecture Simplification

A very important component in a Polyglot Data Platform is either a Data Lake or a Data Warehouse, or both. These systems are typically used for large-scale batch Data Science (AI) and Business Intelligence (BI) queries. Technology advances in the last few years have led to the creation of a new paradigm called the Lakehouse, underpinned by technologies such as Delta Lake, Apache Iceberg and Apache Hudi. Lakehouses combine the best elements of Data Lakes and Data Warehouses: enabled by open storage formats such as Apache Parquet and Apache ORC, they offer first-class support for machine learning and data science, and provide state-of-the-art SQL performance for BI workloads while supporting ACID transactions, data versioning and schema management.

Organizations must leverage the Lakehouse paradigm to simplify their data estate, reduce complexity and lower the total cost of ownership rather than maintain separate systems for Data Lakes and Data Warehouses. Commercial offerings such as Snowflake and Databricks are great choices for companies to kickstart their Lakehouse journey.
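As a small, hedged illustration of what those Lakehouse properties look like in practice, the sketch below uses PySpark with the open-source Delta Lake format; it assumes the delta-spark package is available on the Spark classpath and uses a hypothetical local table path:

```python
from pyspark.sql import SparkSession

# Delta Lake requires the delta-spark package / jars to be installed.
spark = (
    SparkSession.builder
    .appName("lakehouse-sketch")
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
    .config("spark.sql.catalog.spark_catalog",
            "org.apache.spark.sql.delta.catalog.DeltaCatalog")
    .getOrCreate()
)

path = "/tmp/lakehouse/orders"  # hypothetical table location

# ACID write on open Parquet-based storage: either all rows land or none do.
df = spark.createDataFrame(
    [("o-1001", "SHIPPED"), ("o-1002", "PENDING")], ["order_id", "status"]
)
df.write.format("delta").mode("overwrite").save(path)

# A second transactional write to the same table.
df2 = spark.createDataFrame([("o-1003", "PENDING")], ["order_id", "status"])
df2.write.format("delta").mode("append").save(path)

# Data versioning ("time travel"): read the table as of an earlier version.
v0 = spark.read.format("delta").option("versionAsOf", 0).load(path)
v0.show()
```

The same table then serves both SQL/BI queries and DataFrame-based machine learning, which is the simplification the Lakehouse paradigm promises.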

Metadata Management

Inadequate investment in a unified metadata strategy is probably the single biggest reason for the failure of many Data Platform projects. A Data Platform must have all of its metadata in one place, managed with scalable solutions: metadata in this day and age is itself big data, and no longer something that can be managed in spreadsheets or in a relational database under someone's desk. Solutions supporting effective metadata management aligned with FAIR principles must span data catalogs, business glossaries, data models, data lineage, AI models, BI dashboards, Identity & Access Management systems, business rules, data usage metrics, data pipeline observability data (metrics, logs, traces), semantic connections and more.

A key problem with traditional metadata management approaches is that they are not easily connectable, and so they result in more metadata silos. These silos do little to make data more valuable, usable, searchable or discoverable, thereby undermining data democratization goals. Modern Data Platforms must treat metadata as a form of data that can be connected, searched, analyzed and maintained in the same way as all other types of data. This can be achieved by leveraging semantics and knowledge graphs. Most large tech companies have resorted to building their own in-house (now open-sourced) solutions to tackle the metadata problem. Examples include Metacat from Netflix, DataHub from LinkedIn and Amundsen from Lyft. Commercial offerings in this space include data.world and Atlan.
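To make the "metadata as connected data" idea concrete, here is a toy sketch using a plain graph library; a real platform would delegate this to a catalog such as DataHub or Amundsen, and all node names below are hypothetical:

```python
import networkx as nx  # pip install networkx

# A tiny metadata knowledge graph: datasets, owners and dashboards all live
# in one connected, queryable structure rather than in separate silos.
g = nx.DiGraph()

g.add_node("table:orders_raw", type="dataset", domain="sales")
g.add_node("table:orders_curated", type="dataset", domain="sales")
g.add_node("dashboard:revenue", type="bi_dashboard")
g.add_node("person:jane.doe", type="data_steward")

g.add_edge("table:orders_raw", "table:orders_curated", relation="lineage")
g.add_edge("table:orders_curated", "dashboard:revenue", relation="feeds")
g.add_edge("person:jane.doe", "table:orders_curated", relation="owns")

# "What breaks downstream if orders_raw is late?" becomes a graph traversal,
# not a hunt through disconnected spreadsheets.
impacted = nx.descendants(g, "table:orders_raw")
print(impacted)  # {'table:orders_curated', 'dashboard:revenue'}
```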

DataOps Foundation

DataOps is an agile, process-oriented methodology used by data teams to reduce the time to insight. Successful implementation of DataOps requires foundational tools, capabilities and infrastructure to support automation, testing and orchestration, as well as collaboration and communication among all stakeholders. Key areas organizations should focus on are listed below:

  • Data Pipeline Orchestration tools to author, schedule and monitor data pipelines with support for complex dependency management, event-driven execution, horizontal scaling, idempotency, and audit, balance, control (ABC) capabilities. Examples include Prefect and Astronomer (see the sketch after this list).
  • Observability tools to track data health so as to ensure freshness and proactively identify data downtime, assess its impact, and notify those who need to know. Examples include Monte Carlo Data and Databand.
  • Performance Management (complementing Observability) tools to profile, monitor, analyze and optimize data pipelines by providing deep insights and intelligence to resolve coding inefficiencies, remove resource contention issues, and reduce performance bottlenecks. Examples include Unravel Data and Pepperdata.
  • Cost Control & Compliance tools to help rein in Cloud costs and implement a policy-as-code approach for Cloud governance by providing automated right sizing, scaling, policy compliance, forecasting, and charge back capabilities. Examples include Cloudability and CloudBolt.
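As a minimal orchestration sketch, the DAG below is written in Airflow style (Astronomer is managed Apache Airflow; Prefect's decorator-based API would look different), and the pipeline name and task bodies are hypothetical placeholders:

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator

def extract():
    print("pull raw orders from the source system")

def transform():
    print("clean and conform orders into the curated zone")

def publish():
    print("publish curated orders and emit quality/ABC metrics")

with DAG(
    dag_id="orders_daily",            # hypothetical pipeline name
    start_date=datetime(2021, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    t_extract = PythonOperator(task_id="extract", python_callable=extract)
    t_transform = PythonOperator(task_id="transform", python_callable=transform)
    t_publish = PythonOperator(task_id="publish", python_callable=publish)

    # Explicit dependency management: extract -> transform -> publish.
    t_extract >> t_transform >> t_publish
```

Observability, performance management and cost tooling then hook into the metrics, logs and run history such an orchestrator produces.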

Apart from specific DataOps tooling, organizations must invest in creating processes and product portals to support seamless SDLC (with CI/CD automation), self-service, and automated infrastructure provisioning and management capabilities.

Data Governance & Security

Organizations must adopt a ‘do-it-right-the-first-time’ approach to Data Governance. Often, data teams start ingesting data into the Data Platform without setting up the right governance processes, structures or standards. This leads to tech debt that becomes expensive to correct once the platform is in production with several active data consumers. Before initiating the build-out of data pipelines, it is important to set up:

  • data councils to drive domain-driven data ownership and stewardship;
  • digital data contracts between data publishers and consumers to describe the data being exchanged;
  • documentation of the critical data elements that are vital to the successful operation of the organization;
  • data quality standards, coupled with a robust remediation mechanism, to enable trust in data; and
  • business glossaries for data discovery purposes.

The business ultimately needs to own data governance, and it is therefore the responsibility of IT to establish the right set of processes and tools to drive self-service data governance and data lifecycle management.
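As an example of what the digital data contracts mentioned above might look like in code, here is a minimal sketch using pydantic to pin down the schema a publisher promises its consumers; the field names and contract version are hypothetical:

```python
from datetime import date
from pydantic import BaseModel, ValidationError  # pip install pydantic

CONTRACT_VERSION = "1.0.0"  # hypothetical; versioned in source control

class OrderRecord(BaseModel):
    """Data contract between the 'orders' publisher and its consumers.

    Schema, types and required fields are agreed up front, so downstream
    teams can trust what lands in the platform.
    """
    order_id: str
    customer_id: str
    order_date: date
    amount_usd: float
    status: str  # e.g. PENDING / SHIPPED / CANCELLED

# Publishers validate outgoing records against the contract before loading...
good = OrderRecord(order_id="o-1001", customer_id="c-42",
                   order_date="2021-01-25", amount_usd=99.5, status="SHIPPED")

# ...and a breaking record fails loudly instead of silently polluting the lake.
try:
    OrderRecord(order_id="o-1002", customer_id="c-43",
                order_date="not-a-date", amount_usd="free", status="PENDING")
except ValidationError as exc:
    print(f"contract violation:\n{exc}")
```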

Data is the oxygen of the modern age. It ought to be protected.

Piecemeal, coarse-grained data protection is not effective for a modern Data Platform. A single system for scalable data policy management across multiple data stores is a must-have to enable fine-grained yet rapid access to data. Careful thought must also be given to setting up data encryption, tokenization and masking capabilities to avoid data breaches. The system must also be designed to quickly enable compliance with data privacy regulations such as GDPR and HIPAA, while giving businesses detailed and continuous visibility into their data assets, with SIEM tooling integration for further insights.
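A simplified sketch of fine-grained, policy-as-code column protection is shown below. A real platform would enforce this centrally in the query/serving layer and source keys from a secrets manager; the columns, roles and policies here are hypothetical:

```python
import hashlib
import hmac

# Hypothetical key; in practice sourced from a KMS or secrets manager.
TOKENIZATION_KEY = b"replace-with-a-managed-secret"

# Policy-as-code: which columns get which treatment, for which roles.
COLUMN_POLICIES = {
    "email":       {"analyst": "mask",     "fraud_team": "clear"},
    "card_number": {"analyst": "tokenize", "fraud_team": "tokenize"},
}

def tokenize(value: str) -> str:
    """Deterministic keyed token: joinable across tables, not reversible
    without the key."""
    return hmac.new(TOKENIZATION_KEY, value.encode(), hashlib.sha256).hexdigest()[:16]

def mask(value: str) -> str:
    """Coarse redaction for columns a role should not see at all."""
    return "***"

def apply_policies(row: dict, role: str) -> dict:
    out = dict(row)
    for column, treatments in COLUMN_POLICIES.items():
        action = treatments.get(role, "mask")  # default-deny for unknown roles
        if column in out and action != "clear":
            out[column] = tokenize(out[column]) if action == "tokenize" else mask(out[column])
    return out

row = {"order_id": "o-1001", "email": "a@example.com", "card_number": "4111111111111111"}
print(apply_policies(row, role="analyst"))
```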

Operating Model

Data platforms must be built by decentralized teams (aligned vertically, by domain) of data engineers storing, processing and serving the datasets for their domain in easily accessible open formats. This ensures greater agility, improved time to market, a focus on domain ownership, and reduced friction between teams.

Organizing teams into groups of hyper-specialized data engineers broken down by the zones or layers (horizontal alignment) of the data platform architecture is a key failure mode in the data platform operating model. It severely impacts time to market and costs, reduces agility, dilutes domain knowledge (increasing the number of software coding errors) and increases friction between teams. Decomposing the platform architecture by physical zones also increases data duplication and is best avoided in favor of domain-driven decomposition and logical tagging of datasets.

Conclusion

The Data Warehousing era (pre-2010) saw the evolution of database systems to support BI workloads and the adoption of multi-dimensional OLAP systems to meet growing data and performance requirements. Those systems were expensive and provided limited support for the scale, velocity and variety of data in the SMAC (social, mobile, analytics, cloud) era. The last decade saw rapid adoption of Hadoop in organizations to support data storage and processing at scale at a low cost per byte. Though a few organizations succeeded, many failed. The complexity of deploying and operating Hadoop based Data Platforms, the co-location of storage and compute (which made multi-tenancy hard and software upgrades complex), and the severe shortage of good data engineering talent were some of the key reasons for the downfall of Hadoop.

Public Cloud has well and truly arrived. Hyperscalers (Amazon, Microsoft, Google) with their disaggregated storage and compute architectures, on-demand capacity, aggressive pricing, PaaS/SaaS experiences and marketing prowess have ensured widespread adoption across industries. While the public cloud has commoditized hardware, very little has changed in the software experience. This is where niche players like Snowflake, Confluent, Databricks etc. (and the surrounding ecosystem) have excelled and helped fill the gaps. Whatever the choice of PaaS/SaaS offering, the points covered in this blog are worth considering in order to succeed in the Data Platform journey.
