Ahrefs 30TB and 15TB SSDs Failure Rate Statistics 2023 Q3&Q4

Efim Mirochnik
Ahrefs
3 min read · Jan 11, 2024


It’s time to share new AFR (Annualized Failure Rate) statistics for the large SSDs we use in production. In the second half of last year, we received and started using a significant number of 30.72TB SSDs in addition to our existing fleet of 15.36TB drives. The new 30.72TB models are the Samsung PM1733a and Kioxia CM6. These 30TB drives make up about 14% of all our large drives and provide almost a quarter of their total storage capacity. We’ll keep the statistics for both capacities together.
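For readers new to the metric, here is a minimal sketch of how an annualized failure rate is commonly computed: failures divided by drive-years of service, expressed as a percentage. The article does not state the exact method behind its graphs, so the formula, the `FleetPeriod` structure, and the numbers below are illustrative assumptions rather than Ahrefs data.

```python
from dataclasses import dataclass


@dataclass
class FleetPeriod:
    """One drive model observed over a reporting period (hypothetical structure)."""
    model: str
    drive_days: float  # sum, over all drives, of the days each spent in production
    failures: int      # drives that failed during the period


def annualized_failure_rate(drive_days: float, failures: int) -> float:
    """Return AFR in percent, i.e. failures per 100 drive-years of service."""
    if drive_days <= 0:
        raise ValueError("drive_days must be positive")
    drive_years = drive_days / 365.0
    return failures / drive_years * 100.0


# Made-up numbers purely for illustration: 1000 drives running for a 92-day quarter.
example = FleetPeriod(model="example-ssd", drive_days=1000 * 92, failures=2)
afr = annualized_failure_rate(example.drive_days, example.failures)
print(f"{example.model}: {afr:.2f}% AFR")
```

Counting drive-days rather than drives keeps the rate comparable between models that joined the fleet at different times, such as four-month-old 30TB drives and two-year-old 15TB drives.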

AFR for 2023 Q3
AFR for 2023 Q4
AFR from 2021 Q4 through 2023 Q4

The new 30.72TB drives have been in production for about four months, and we have already seen failures of both the Samsung PM1733a and Kioxia CM6 models. One Kioxia CM6 drive failed to respond when first inserted into a server, a rare DOA (dead on arrival) case among our SSDs. A few other 30TB drives failed during normal operation. Unfortunately, these premium 30TB drives are no exception and fail just like their smaller 15TB counterparts.

Samsung PM1733a 30.72TB SSD

2023 Q4 is the first quarter without any failures of the Western Digital SN840 model in its two years in production. Over the long term, these drives have the second-worst AFR after the Micron 7450, which we discuss below. So the absence of failures is a pleasant surprise, as we didn’t change our usage pattern for these drives.

We use two batches of Kioxia CD6 drives with significantly different average ages. The older, two-year-old drives in the Singapore batch showed zero failures during the last two quarters. Drives in the US batch are still failing, but the Singapore result may be a sign that the AFR for the US batch will eventually come closer to 0 as well. Overall, both batches of Kioxia CD6 drives show a downward trend.

By the way, the AFR graph is now interactive. Playing with it may help you isolate and visualize the data better, for example to see the trend for only the two batches of CD6 drives more clearly.

Combined AFR by age in production per drive series, from 2021 Q4 through 2023 Q4

Micron 7450 high AFR resolution

The last AFR review revealed that Micron 7450 SSDs had a higher-than-expected Annualized Failure Rate. Micron investigated and analyzed tens of failed drives. The vendor identified a combination of our specific load pattern and the drive firmware as the cause of the higher failure rate, and quickly provided updated firmware to address the issue.

According to the documentation, the new firmware required a server reboot to activate. However, we were reluctant to reboot the cluster with Micron 7450 drives. Micron then confirmed that we could switch to the new firmware version without rebooting servers by using an undocumented option, and it worked well. So, kudos to Micron: their 7450 drives support the AWOR (activate without reset) feature for firmware upgrades, even though the documentation doesn’t mention it.

About 60% of the analyzed Micron 7450 drives failed due to the load pattern and firmware issue described above, so all other failure causes account for the remaining 40%. Without this issue, the drives would have had an AFR of about 0.38% over 19 months in production. Such a failure rate would be unremarkable and in line with the AFR of other vendors’ data center-grade SSDs. After we switched to the new firmware, we saw just a single Micron 7450 failure in the following two months, compared to about one drive per week before. We expect to see a more positive trend for this model on the statistics graph in the future.
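The adjustment above is simple proportional arithmetic, sketched below. It assumes that the analyzed sample is representative of all Micron 7450 failures and that removing one cause scales the AFR linearly. The function names are hypothetical, and the implied overall figure is a back-of-the-envelope inference from the two numbers quoted above, not a value reported in this article.

```python
def afr_excluding_cause(observed_afr_pct: float, cause_share: float) -> float:
    """AFR that would remain if a cause responsible for `cause_share` of failures vanished."""
    if not 0.0 <= cause_share <= 1.0:
        raise ValueError("cause_share must be between 0 and 1")
    return observed_afr_pct * (1.0 - cause_share)


def implied_observed_afr(adjusted_afr_pct: float, cause_share: float) -> float:
    """Invert the adjustment: recover the overall AFR from the adjusted one."""
    if not 0.0 <= cause_share < 1.0:
        raise ValueError("cause_share must be in [0, 1)")
    return adjusted_afr_pct / (1.0 - cause_share)


FIRMWARE_SHARE = 0.60    # share of analyzed failures attributed to the firmware issue
ADJUSTED_AFR_PCT = 0.38  # AFR quoted for all remaining failure causes

# Inferred under the assumptions stated above; not a figure from the article.
implied = implied_observed_afr(ADJUSTED_AFR_PCT, FIRMWARE_SHARE)
print(f"implied overall AFR: about {implied:.2f}%")
```

Under these assumptions the two quoted figures correspond to an overall AFR of a little under 1% for the model, which helps explain why the issue stood out against other data center-grade SSDs.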
