Yelp created a solution to sanitize data from a corrupted Apache Cassandra cluster using its data streaming architecture. The team explored several options to address the corruption but ultimately had to migrate the data into a new cluster, removing corrupted records in the process.
Yelp uses Apache Cassandra as the data store for many parts of its platform. The company tends to run many smaller Cassandra clusters for specific use cases based on data, traffic, and business requirements. Initially, Cassandra clusters were hosted directly on EC2, but more recently, they transitioned most of them to Kubernetes using a dedicated operator.
The team discovered that one of the Cassandra clusters running on EC2 was affected by data corruption that regular maintenance tools could not address. The situation worsened over time, further degrading cluster health.
Muhammad Junaid Muzammil, a software engineer at Yelp, explains the reasons for opting to rebuild the corrupted Cassandra cluster:
Since the corruption was widespread, removing SSTables and running repairs wasn’t an option, as it would have led to data loss. Also, based on corruption size estimates and recent data value, we opted not to restore the cluster to the last corruption free backed up state.
The team opted for a design inspired by sortation systems used in manufacturing to prevent defective products from reaching the end of the production line. They created a data pipeline using their PaaStorm streaming processor and the Cassandra Source connector, which relies on the Change Data Capture (CDC) feature available in Cassandra since version 3.8.
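CDC is switched on per node (`cdc_enabled: true` in cassandra.yaml) and opted into per table; a minimal, illustrative CQL statement (the keyspace and table names are hypothetical):

```sql
-- Opt an existing table into change data capture (Cassandra 3.8+),
-- so its mutations are retained in raw CDC commit-log segments
ALTER TABLE my_keyspace.my_table WITH cdc = true;
```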
High-Level View of Data Corruption Mitigation Pipeline (Source: Rebuilding a Cassandra cluster using Yelp’s Data Pipeline)
The Data Infrastructure team created a new Cassandra cluster on Kubernetes, benefiting from many hardware and software upgrades. The data pipeline used a Stream SQL processor to define data sanitization criteria, splitting the data into valid and malformed streams. Using the Cassandra Sink Connector, the pipeline fed the sanitized stream into the new Cassandra cluster, while the malformed stream was used to further analyze the severity of the corruption.
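The splitting step can be pictured as a predicate routing each record to one of two streams. The sketch below is a plain-Python analogue of that Stream SQL stage, not Yelp's actual code; the record fields and validity criterion are assumptions for illustration:

```python
def is_valid(record: dict) -> bool:
    """Placeholder sanitization criterion: required fields present and decodable."""
    return record.get("id") is not None and isinstance(record.get("payload"), str)

def split_streams(records):
    """Route each record to the valid or malformed stream, sortation-style."""
    valid, malformed = [], []
    for record in records:
        (valid if is_valid(record) else malformed).append(record)
    return valid, malformed

records = [
    {"id": 1, "payload": "ok"},
    {"id": None, "payload": "missing id"},
    {"id": 2, "payload": b"\xff"},  # undecodable bytes stand in for corruption
]
good, bad = split_streams(records)
```

Only the `good` stream would be written to the new cluster; the `bad` stream is retained for analysis.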
The team validated the overall migration using statistical sampling, inspecting a small subset of the data by comparing rows imported into the new cluster against the old one.
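A sampling check of this kind amounts to reading a random subset of keys from both clusters and measuring the match rate. A minimal sketch, with in-memory dicts standing in for the two clusters' read paths (the function and data are illustrative, not Yelp's implementation):

```python
import random

def sample_and_compare(read_old, read_new, keys, sample_size, seed=0):
    """Spot-check a random sample of keys; return the fraction that match."""
    rng = random.Random(seed)  # fixed seed keeps the check reproducible
    sample = rng.sample(keys, sample_size)
    matches = sum(1 for k in sample if read_old(k) == read_new(k))
    return matches / sample_size

# Dicts stand in for the old and new clusters.
old_cluster = {i: f"row-{i}" for i in range(10_000)}
new_cluster = dict(old_cluster)
match_rate = sample_and_compare(old_cluster.get, new_cluster.get,
                                list(old_cluster), sample_size=500)
```

A match rate below 1.0 would flag rows worth investigating before cutover.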
Before switching the traffic to the new cluster, the team created a setup where read requests were sent to both clusters and the returned data was compared. Analyzing the logged results, they estimated that 0.009% of the data in the old cluster was corrupted. Finally, traffic was seamlessly switched to the new cluster, and the corrupted one was torn down.
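The dual-read setup can be sketched as a shadow read: the old cluster remains authoritative while every response is compared against the new cluster and mismatches are logged. The function and data below are a hypothetical illustration of the pattern, not Yelp's code:

```python
def shadow_read(key, read_old, read_new, mismatch_log):
    """Serve from the old cluster while comparing against the new one."""
    old_val, new_val = read_old(key), read_new(key)
    if old_val != new_val:
        mismatch_log.append(key)
    return old_val  # old cluster stays authoritative until cutover

old = {i: f"v{i}" for i in range(100_000)}
new = dict(old)
for k in (7, 42, 99_999):      # pretend these rows were corrupted in the old cluster
    new[k] = f"sanitized-{k}"

mismatches = []
for key in old:
    shadow_read(key, old.get, new.get, mismatches)
corruption_estimate = len(mismatches) / len(old)  # fraction of mismatching reads
```

Aggregating the mismatch log over real traffic is what yields a corruption estimate like the 0.009% the team reported.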
Data Validation Approach For Read Requests (Source: Rebuilding a Cassandra cluster using Yelp’s Data Pipeline)