
Silent error inspection

Data block checksum and physical disk inspection

The storage media in a physical disk can suffer silent corruption during operation, such as sector damage in an HDD or chip failure in an SSD. The result is data that cannot be read or that contains bit-flips (particularly common in HDDs). To discover and address this problem proactively, ABS records a checksum whenever data is written to the persistent tier, at the cost of some storage space and performance overhead. On every read, the system also reads the checksum and verifies that the data matches it. When ABS detects a mismatch, it immediately marks the data block as abnormal and excludes the affected data shard so that incorrect data is never returned. At the same time, it starts data recovery to repair the abnormal shard.
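The write-then-verify flow described above can be sketched as follows. This is a minimal illustration, not ABS code: the `BlockStore` class, its methods, and the use of CRC-32 are all assumptions made for the example, since the checksum algorithm ABS uses is not stated here.

```python
import zlib


class ChecksumMismatch(Exception):
    """Raised when stored data no longer matches its recorded checksum."""


class BlockStore:
    """Toy in-memory stand-in for a persistent tier (hypothetical)."""

    def __init__(self):
        self._blocks = {}      # block_id -> (bytearray data, int checksum)
        self.abnormal = set()  # blocks excluded from reads until repaired

    def write(self, block_id, data):
        # Record the checksum at write time, alongside the data.
        self._blocks[block_id] = (bytearray(data), zlib.crc32(data))
        self.abnormal.discard(block_id)

    def read(self, block_id):
        # Read data and checksum together and verify before returning.
        data, checksum = self._blocks[block_id]
        if zlib.crc32(data) != checksum:
            self.abnormal.add(block_id)  # mark the block as abnormal
            raise ChecksumMismatch(block_id)
        return bytes(data)

    def corrupt(self, block_id, offset=0):
        """Flip one bit to simulate silent corruption (demo helper only)."""
        self._blocks[block_id][0][offset] ^= 0x01
```

In this sketch a caller that catches `ChecksumMismatch` would read from another shard and schedule a repair of the abnormal one; that recovery step is left out here.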

For infrequently accessed data, the Chunk service also periodically reads the data and its checksums from the storage media and verifies them automatically. If an anomaly is detected, the system immediately notifies Meta so that the problem can be resolved, ensuring continuous data reliability.
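A periodic check of this kind might look like the sketch below. The function name, the `cold_after_s` threshold, and the `report` callback (standing in for the notification to Meta) are hypothetical; the actual scheduling and reporting interfaces of the Chunk service are not described in this document.

```python
import time
import zlib


def scrub_cold_blocks(blocks, last_access, cold_after_s, now=None, report=print):
    """Verify blocks that have not been read recently.

    blocks:      block_id -> (data bytes, recorded CRC-32 checksum)
    last_access: block_id -> time of last read, in epoch seconds
    report:      called with each mismatching block_id (assumed hook)
    Returns the list of block ids whose data fails verification.
    """
    now = time.time() if now is None else now
    bad = []
    for block_id, (data, checksum) in blocks.items():
        if now - last_access.get(block_id, 0.0) < cold_after_s:
            continue  # recently read, so the read path already verified it
        if zlib.crc32(data) != checksum:
            bad.append(block_id)
            report(block_id)
    return bad
```

A scheduler would call this at a fixed interval, ideally rate-limited so that the scan does not compete with foreground I/O.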

Shard consistency inspection

As a distributed storage system, ABS sends requests over the network to different shards when it updates a data block. Because of network conditions and the nature of distributed systems, shard updates may not be fully synchronized. In extreme cases, such as a sudden power outage of the whole cluster, this can leave shard data inconsistent. ABS detects and fixes such issues promptly through data verification during I/O, but if the data is not accessed after the cluster recovers, the inconsistency between shards may persist for a long time and reduce data security.

To prevent such risks, ABS incorporates a proactive inspection mechanism that periodically checks and repairs the consistency of all cold data shards, ensuring that data remains complete and secure during storage.
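One simple way to picture a consistency inspection is a majority comparison across the shards of a block. The sketch below assumes fully replicated shards and a strict-majority rule, and the function name is hypothetical; the document does not specify how ABS decides which shard is authoritative, and a real system may rely on version numbers or erasure-coding parity instead.

```python
import hashlib
from collections import Counter


def repair_replicas(replicas):
    """Bring divergent replicas in line with the majority content.

    replicas: shard_name -> bytes, modified in place.
    Returns the sorted names of the shards that were repaired.
    Raises ValueError when no strict majority exists.
    """
    if not replicas:
        return []
    digests = {name: hashlib.sha256(data).hexdigest()
               for name, data in replicas.items()}
    winner, votes = Counter(digests.values()).most_common(1)[0]
    if votes * 2 <= len(replicas):
        # Without a strict majority this sketch cannot pick a source of truth.
        raise ValueError("no strict majority among replicas")
    good = next(data for name, data in replicas.items()
                if digests[name] == winner)
    stale = sorted(name for name in replicas if digests[name] != winner)
    for name in stale:
        replicas[name] = good  # overwrite the inconsistent shard
    return stale
```

Comparing digests rather than raw data keeps the comparison cheap when shards live on different nodes, since only the digests need to cross the network before a repair is decided.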