NHacker Next
Durability: NVMe Disks (evanjones.ca)
macdice 1301 days ago [-]
To use reported AWUPF as a way to turn off MySQL double writes safely, I guess you need to be using direct IO (otherwise Linux's 4096 hardcoded buffer size kicks in) and you need to know that nothing in the IO stack can ever split up your writes in a way that could overwrite InnoDB pages non-atomically. How can we know that that is true of the whole IO stack? Is it even possible today, without something like the O_ATOMIC proposal?
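For concreteness, here is a hedged sketch of how a reported AWUPF value translates into an atomic-write size (the 0-based encoding is my reading of the NVMe Identify Controller structure; the 512-byte LBA size is an assumption you'd confirm per namespace):

```python
def awupf_bytes(awupf_field, lba_size=512):
    """Convert the AWUPF (Atomic Write Unit Power Fail) field of the
    NVMe Identify Controller data into bytes. The field is 0-based:
    a value of N means N+1 logical blocks are written atomically
    even across a power failure."""
    return (awupf_field + 1) * lba_size
```

So an AWUPF of 31 with 512-byte LBAs would cover a 16 kB InnoDB page, while an AWUPF of 7 covers only 4 kB.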
helltone 1301 days ago [-]
Can I have a reference for that hardcoded buffer size please?
macdice 1301 days ago [-]
See the man pages for mkfs.xfs and mkfs.ext4. On other Unix systems the block size is/was more variable, but on Linux you can't exceed the 4kB memory page size, so it is not possible to match the database block size (typically 16kB for InnoDB, 8kB for PostgreSQL). (The exception is ZFS, which has a completely different architecture and can do atomic writes.) Note that I am not saying buffered IO would necessarily be atomic at that level on (say) ancestral SGI XFS or FreeBSD UFS with a block size matching the database's (I don't know if it might in some circumstances split writes on sector boundaries for some reason)! Just that it's already a non-starter when the block size is too small.
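To illustrate the size mismatch, a minimal sketch (statvfs and mmap.PAGESIZE are standard Python; the 16 kB figure is just InnoDB's typical default, not a guarantee):

```python
import mmap
import os

INNODB_PAGE = 16 * 1024  # typical InnoDB page size

def buffered_io_atomicity_limit(path="."):
    """Upper bound on buffered-I/O write granularity on Linux:
    capped by min(filesystem block size, memory page size)."""
    fs_block = os.statvfs(path).f_bsize
    return min(fs_block, mmap.PAGESIZE)
```

On a typical x86 Linux box this returns 4096, well short of `INNODB_PAGE`.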
p_l 1301 days ago [-]
direct i/o forces 4k buffers too :/
ncmncm 1301 days ago [-]
I had an Intel 6000-series 1/2 TB NVMe. After just 2 years of light, fitful use in my laptop, my BIOS warned me it was about to fail. I spent the night doing backups, and it totally failed the next day. I count myself lucky, but I did not replace it with an Intel product.
wtallis 1301 days ago [-]
That was kind of a trashy product. Intel doesn't take their client/consumer SSD business very seriously these days, and that was only halfway an Intel SSD. The controller was Silicon Motion's first NVMe SSD controller, the drive had numerous firmware bugs, and it was probably also a bit of a dumping ground for low-grade flash from Intel's first generation of 3D NAND. The successor to that drive switched from three-bit-per-cell TLC NAND to four-bit-per-cell QLC NAND and still managed to be a superior product in almost every way, because the updated controller and firmware were so much more mature.
ncmncm 1301 days ago [-]
Thank you for the clue. All the reviews I had read claimed Intel and Samsung SSDs were bulletproof, and everybody else's were trash.

For $300, it should have been a better product. Only 6 months later the competition was getting only $100 for comparable drives.

bserge 1301 days ago [-]
Intel used to have good SSDs, but they seem to have given up on the consumer market. Samsung and Crucial are the only ones I'd trust today, based on my experience.

Although looking at Samsung's smartphones and laptops, they might be skimping on quality, as well :/

syshum 1301 days ago [-]
Samsung Flash storage I have always had great success with

Intel I would only use Enterprise Class gear, and even then would look to other vendors for storage

Intel trades on their name, and puts out a lot of crap at the lower end, which they get away with because "No one ever got fired for buying Intel"

A lot of these big names do this....

willis936 1301 days ago [-]
It's funny to see Intel tracing IBM's steps. It's tempting to think they're sick of being successful. I think the real answer is ignorance of the IBM/Compaq story, lack of wisdom, and, the key ingredient: hubris.
Shared404 1301 days ago [-]
Seconding Samsung. I've only had good experiences with them in the past few years.
srtjstjsj 1301 days ago [-]
Sad. I have a 9yr old Intel ssd that still benchmarks like new.
swalsh 1301 days ago [-]
Was it NVMe though? The standard is only 9 years old, so that's an impressive lifespan for one of the first drives.
srtjstjsj 1300 days ago [-]
No, it was SATA SSD.
eptcyka 1301 days ago [-]
Sadly, I don't think it's even possible to buy consumer SLC drives these days.
bleepblorp 1301 days ago [-]
'Consumer product' is a synonym for 'cheaply built crap sold at premium prices' these days.

Much as how the introduction of SMR allowed for the prices of real hard drives to increase by 25% or more, the development of new data-destroying technologies for consumer SSDs has pushed the effective price for non-garbage SSD storage up to the $0.75-$1 USD per GB level of enterprise SSDs.

wtallis 1301 days ago [-]
That is ridiculously inaccurate. Consumer SSDs that are absolute overkill on performance and endurance are only half the price of the $0.75 per GB you claim as a price floor for good drives. Realistically, there's no reason for a consumer to spend even 20¢/GB, and there are tons of good drives well below that price which will not eat your data and will outlast the useful lifetime of several other components in your machine. There are reputable, well-behaved SATA SSDs at 10¢/GB.
bleepblorp 1301 days ago [-]
The entire point of this post is that consumer SSDs provide no data loss protection in the event of an unexpected hard reset or power loss. Data security is only available in far more costly enterprise SSDs.
wtallis 1301 days ago [-]
You've missed out on the substance of the discussion too, then.

Expensive enterprise SSDs make crash-proof data protection automatic. Consumer SSDs require the host system's software to explicitly flush the write cache when necessary. This tradeoff works extremely well in practice, and consumer SSDs don't have serious data loss problems for consumer workloads. Enterprise workloads that are much more paranoid about syncing every single transaction cannot safely use consumer SSDs without unacceptable performance loss, but that certainly doesn't mean that consumer SSDs are playing fast and loose with your data safety.

wazoox 1301 days ago [-]
At some point my company bought quite a lot of Intel SSDs, and many failed after relatively light use. We're using Samsung ones now, and they're obviously much more reliable.
CreRecombinase 1301 days ago [-]
That's a bummer. I have an Intel 700 series I got several years ago that I've been torturing on and off for years. Still no problems. I think it was their first NVMe/PCIe drive too. Hell, the Haswell-era i7 and the god-knows-when-era Quadro feel much more "long in the tooth".
vbezhenar 1301 days ago [-]
Are there any consumer M.2 drives with guaranteed data protection (with capacitor or battery to flush all volatile data at power loss)? I did not find that kind of information in specs.
PragmaticPulp 1301 days ago [-]
I doubt any consumer M.2 drives will have power loss protection. The additional cost and PCB space required for the capacitors is a non-starter in consumer drives where every cent counts.

Server-grade storage will specify if power loss protection is included on the board. You can usually identify the capacitors as well. For example, they are the yellow rectangles near the edge of this Intel SSD: https://ark.intel.com/content/www/us/en/ark/products/96932/i...

Note that the Intel drive with PLP is 110mm long, which is 30mm longer than most consumer M.2 SSDs due to the additional capacitors. This won't fit on certain consumer motherboards.

kbenson 1301 days ago [-]
> This won't fit on certain consumer motherboards.

From diving into this recently in speccing out a budget system for myself, it looks to be mostly the lower end budget chipsets and motherboards that don't support it. At least that's what I've seen of the regular sized ATX boards, but it may be a different story for Micro-ATX or Mini-ITX.

The relevant info is always in the specs, and will look something like:

  M.2 Slots
    2242/2260/2280/22110 M-key
    2242/2260/2280 M-key
For each of those, the first two digits are the width and the rest denote the length. So 2280 is 22mm x 80mm, and 22110 is 22mm x 110mm.
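That decoding rule is mechanical enough to sketch (the function name is made up, purely illustrative):

```python
def m2_dimensions(code):
    """Decode an M.2 size code: the first two digits are the width
    in mm, the remaining digits are the length in mm."""
    return int(code[:2]), int(code[2:])
```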
wtallis 1301 days ago [-]
Consumer SSDs are very nearly defined as those that do not include the extra hardware necessary for such guarantees; the only other defining feature of a consumer SSD is the presence of low-power idle states. I'm not aware of any current or recent flash-based SSDs marketed at consumers that feature the same kind of power loss protection as enterprise SSDs.

Intel's Optane SSDs don't need power loss protection capacitors because they don't cache writes—their 3D XPoint memory is more or less fast enough to not require it. These drives correctly advertise that they do not have a volatile write cache. I have not encountered a consumer NVMe drive that lies about having a volatile write cache, but I have not attempted to test how they behave when the host requests that the write cache be disabled.
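For reference, that write-cache advertisement lives in the Identify Controller data; a sketch of checking it (byte offset 525 for the VWC field is my reading of the NVMe spec — verify against the spec revision you target):

```python
def has_volatile_write_cache(id_ctrl):
    """Check bit 0 of the VWC field (byte 525) of a raw NVMe Identify
    Controller structure; 1 means a volatile write cache is present,
    so flush commands are meaningful for durability."""
    return bool(id_ctrl[525] & 0x01)
```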

loosescrews 1301 days ago [-]
Intel's "Intel® Optane™ Memory H10 with Solid State Storage" is an Optane/QLC flash hybrid drive with an NVMe interface and capacities up to 1TB. It claims to offer enhanced power loss data protection:

https://www.intel.com/content/www/us/en/products/memory-stor...

I found this review which makes it sound pretty reasonable:

https://www.tomshardware.com/reviews/intel-h10-qlc-flash-opt...

wtallis 1301 days ago [-]
The Optane half of the H10 doesn't need power loss protection capacitors, and the QLC half of the drive doesn't have room for them. Intel has stated that their caching software for that drive will sometimes send writes directly to the QLC half without buffering them on the Optane half, so the package as a whole does not offer enterprise-grade power loss protection and shouldn't be listed as such.
marcan_42 1301 days ago [-]
Consumer Crucial/Micron drives used to have enough capacitors on board to, at least in my understanding, never corrupt data on power loss (not sure if they flushed the cache, but I think they at least tried, and guaranteed sectors wouldn't become unreadable/corrupted due to being partially written). They dropped this in later models, though I think they still claim to guarantee against corruption/partial writes?

I stopped trusting Crucial when they released the P1 though. That QLC drive is terrible on every metric. Performance falls off a steep cliff once you do any writes at all, and often never quite recovers. Tail latencies are in the 1 second range, which is insane for an SSD.

ec109685 1301 days ago [-]
These folks combine Intel 3D XPoint + QLC to deliver a two-tiered cheap storage solution: https://vastdata.com/
fomine3 1298 days ago [-]
SLC cache looks like another consumer feature that isn't supported on enterprise drives, maybe due to inconsistent performance.
nh2 1301 days ago [-]
Not "consumer", but the only M.2 SSD with 80 mm length (M.2 2280) I found that has power loss protection is this one:

https://www.kingston.com/en/ssd/dc1000b-data-center-boot-ssd

wtallis 1301 days ago [-]
Rather interestingly, despite obviously having a bank of large capacitors and conspicuous advertising of power loss protection, the firmware on that drive identifies it to the host system as having a volatile write cache.
nh2 1301 days ago [-]
Last weekend my ThinkPad running ZFS on Linux on a Samsung non-power-loss-protected consumer SSD lost various files when it ran out of power (including the entire Thunderbird profile because its SQLite DB was corrupted).

I'm planning to replace it with the SSD linked above.

I was already aware of the power loss issue on consumer SSDs (from the 2013 piece http://lkcl.net/reports/ssd_analysis.html), but I wanted to see it happen for real. I guess I did now.

wtallis 1301 days ago [-]
You'll probably experience a lot more unplanned power loss in your laptop by giving it an SSD that idles at 1.2W instead of one that idles in the mW range. And there's a real chance that it won't even solve your problem.
throwaway8941 1301 days ago [-]
This does not answer your question, but wouldn't it be easier to get a cheap UPS with enough battery life to get your machine through an additional 5 minutes of uptime?

Although I've had problems with hardware hanging so completely that it does not respond to the reset button, and the only option is to cut power, so a UPS does not provide full protection.

smolder 1301 days ago [-]
UPSs are never really cheap, since when converting AC to DC to AC you lose a significant amount of power efficiency for your equipment. The funny part is the server PSU then converts back to DC power yet again. My understanding is that well optimized data centers distribute conditioned, battery backed DC power directly to devices to avoid those double conversion losses.
zrm 1301 days ago [-]
A small UPS will often have the equipment running directly on AC input power and only switch to the battery if the input power fails.
smolder 1301 days ago [-]
Oh, thanks for the correction. It seems I forgot that type existed or just never knew. Now I'm curious about the relative merits. I'd guess the transition is not as smooth and presents some kind of a risk that's unacceptable for critical infra.
agapon 1301 days ago [-]
The terms that describe those types are Line-interactive and On-line UPS.

E.g., https://blog.tripplite.com/line-interactive-vs-on-line-ups-s...

tuatoru 1301 days ago [-]
It'd be good if consumer UPSes had DC output for this reason. If only laptops could be standardized in terms of voltage requirements.

A lot of them are pretty similar; 19V - 21V seems quite common.

Until recently phones were standardized on 5V. Now, it's a mess again. :-(

smolder 1300 days ago [-]
I vaguely recall seeing someone modify their UPS for this in their "homelab", such that they were running a 12v cable modem, router and WAP off of the UPS battery pack directly somehow. They had dramatically longer run time from a full charge. (Well more than double, I think.)
yencabulator 1300 days ago [-]
Ooh, imagine a NUC or such small enough machine running off of USB-C Power Delivery. 65 Watt power budget easily. I guess a Raspberry Pi would count as that?

Then a battery pack could provide that, as DC.

fulafel 1301 days ago [-]
The PSU will let you know in advance that power is going out. I think the interesting question is which SSDs have FTLs that can safely "park" in this time to a consistent state.

Is there a standard for PSUs about how long they provide power, that SSDs could design to?

wtallis 1301 days ago [-]
> Is there a standard for PSUs about how long they provide power, that SSDs could design to?

Yes. The ATX power supply spec has timing requirements for the PWR_OK signal. As described on Wikipedia:

> The ATX specification requires that the power-good signal ("PWR_OK") [...] remain high for 16 ms after loss of AC power, and fall (to less than 0.4 V) at least 1 ms before the power rails fall out of specification (to 95% of their nominal value).

So power supplies are expected to continue operating through roughly a single missing cycle of AC power, but they only have to give the system 1ms of warning when power is going out. This signal would have to be delivered to the SSD by software in the form of a shutdown notification (a write to a particular NVMe register).

And now that I think about it, that shutdown notification mechanism probably warrants inclusion in the article.
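For the curious, that notification is a write to the SHN field of the NVMe Controller Configuration (CC) register; a hedged bit-twiddling sketch (the bits 15:14 position and value encodings are my reading of the spec — real code lives in the NVMe driver):

```python
CC_SHN_SHIFT = 14
CC_SHN_NORMAL = 0b01  # normal shutdown: controller flushes caches
CC_SHN_ABRUPT = 0b10  # abrupt shutdown notification

def with_shutdown_notification(cc, shn=CC_SHN_NORMAL):
    """Return the 32-bit CC register value with the SHN field
    (bits 15:14) replaced by the given shutdown notification code."""
    return (cc & ~(0b11 << CC_SHN_SHIFT)) | ((shn & 0b11) << CC_SHN_SHIFT)
```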

fulafel 1301 days ago [-]
I see, didn't even know that the original ATX spec made any promises, I guess 1 ms is "something", strictly speaking :)

I was hoping for a post-1995, more usable spec since we know that current PSUs often do provide power longer than 1 ms.

A single AC cycle would be 16-20 milliseconds which could already be doable if the SSD got the signal right away without OS involvement.

wtallis 1301 days ago [-]
I'm not sure that bit actually comes from the original ATX spec; Wikipedia seems to indicate it showed up in ATX 2.31 in February 2008. It does also appear to be consistent with the very recent Intel ATX12VO spec.
dannyw 1301 days ago [-]
Actually the Crucial MX500 SSD series had extra capacitors for power loss. Unfortunately they removed it for M.2.
wtallis 1301 days ago [-]
That wasn't the same thing. Crucial SSDs always had volatile write caches. What their MX series and at least some of the NVMe drives offer is the guarantee that data already on the flash will not be corrupted by a write that's in progress when the power fails. But that doesn't affect the semantics of writes that may still be pending in the volatile write cache.

This potential failure mode is possible when storing more than one bit per flash memory cell, and using a multi-step process to program the cell voltage, and mapping the low-order bits of a cell's value to different LBAs than the high order bit. Drives need to have the capability to either complete or safely abort an in-progress cell program process so that the value in the high-order bit(s) isn't corrupted by an incomplete programming of the low-order bit(s). And this power failure problem isn't the only reason why SSDs need to be careful about leaving cells in a partially-programmed state.

srtjstjsj 1301 days ago [-]
I had an old (2015-era), presumably cheap M.2 drive in a "refurbished" Dell Inspiron laptop that one day, after under a year, just refused to boot or be recognized as having block storage at all (couldn't reformat it, internally or externally with a USB adapter). (I bought a cheap replacement that has worked fine since.)

Could that have been caused by a sudden power loss that corrupted a key section of the drive?

sedatk 1301 days ago [-]
Assuming NVMe drives are mostly used with journaling file systems, why is this important?
throwaway8941 1301 days ago [-]
Journaling cannot guarantee data or filesystem integrity if your hardware is lying to you. If you send flush to an SSD and it reports "ok, your data is on the persistent storage", while actually keeping it in DRAM buffers (to get higher numbers on benchmarks), and your power goes down, shit ensues. This is surprisingly common behavior.
agar 1301 days ago [-]
Wow, this jogged a memory from when Brad Fitzpatrick (bradfitz on HN) had to write a utility to ensure the hard drives running LiveJournal didn't lie about successfully completing fsync().[1] IIRC, the behavior caused fairly serious database corruption after a power outage.

Went back and found the link. To my surprise, it was 15 years ago. To my greater surprise, the original post, the Slashdot article, and the utility all remain available.

And hard drives (or their NVMe successors) still lie.

[1] https://brad.livejournal.com/2116715.html

wtallis 1301 days ago [-]
> This is surprisingly common behavior.

Anecdotally, consumer NVMe SSDs actually tend to not lie about it. Every time I've benchmarked a consumer NVMe SSD under Windows both with and without the "Write Cache Buffer Flushing" option, it has a profound impact on the measured performance of the SSD. I have not observed a comparable performance impact for SATA SSDs, so I suspect Microsoft's description of what that option does is inaccurate for at least one type of drive, though it is at least possible that ignoring flushes is extremely common for consumer SATA SSDs but uncommon for consumer NVMe SSDs.

throwaway8941 1301 days ago [-]
Sure, although the proper way to test it would be to write a lot of data to the drive, issue an fsync, and cut power in the middle of the operation. Rinse and repeat a (few) hundred times for each drive.

There's a guy on btrfs' LKML (also the author of [0]) who is diligent enough to do these tests on much of the hardware he gets, and his experience does not sound good for consumer drives.

[0]: https://github.com/Zygo/bees/

wtallis 1301 days ago [-]
> although the proper way to test it would be to write a lot of data to the drive, issue an fsync, and cut power in the middle of the operation. Rinse and repeat a (few) hundred times for each drive.

This isn't quite right. You have to ensure that the drive returned completion of a flush command to the OS before the plug was pulled, or else the NVMe spec does allow the drive to return old data after power is restored. Without confirming receipt of a completion queue entry for a flush command (or equivalent), this test as described is mainly checking whether the drive has a volatile write cache—and there are much easier ways to check that.

tobias3 1301 days ago [-]
Here is a post from him: https://lore.kernel.org/linux-btrfs/20190623204523.GC11831@h...

TLDR: Very few drives don't implement flush correctly. Notice that he mainly uses hard disks, not SSDs/NVMe. Failure often occurs when two (usually rare) things occur at once. E.g. remapping an unreadable sector while power-cycling.

RealStickman_ 1301 days ago [-]
Does he share the results of his tests anywhere?
sedatk 1301 days ago [-]
But as long as you write the journal entry first and the device guarantees flushes for writes in the order they are queued, there should be no inconsistent state at all?
wtallis 1301 days ago [-]
> and the device guarantees flushes for writes in the order they are queued

NVMe does not require such a guarantee, nor does it provide a way for drives to signal such a guarantee.

(Part of the reason is that NVMe devices have multiple queues, and the standard tries to avoid imposing unnecessary timing or synchronization requirements between commands that aren't submitted to the same queue.)

sedatk 1301 days ago [-]
I see. Then it makes sense. Thanks.
jabl 1301 days ago [-]
Assuming NVMe queuing works like SATA or SCSI queuing (which I believe it does), then basically queue entries are unordered [1]; the device is free to process them in any order. If you (as in, the person implementing a block layer or file system in an OS kernel, or some fancy kernel-bypass stuff) want requests A and B to be ordered before request C, then you must do something like

1. Issue A and B.

2. Wait for A and B to complete.

3. Issue a FLUSH operation (to ensure that A and B are written from the drive cache to persistent storage), and wait for it to complete.

4. Issue C with FUA (force unit access) bit set.

5. Wait for C to complete.

Alternatively, if the device doesn't support FUA, for writing C you must instead do

4b. Issue C.

5b. Wait for C to complete.

6b. Issue FLUSH, and wait for the FLUSH to complete.

Now, like wtallis already said, NVME additionally has multiple queues per device, but these are independent from each other. If you somehow want ordering between different queues, you must implement that in higher level software.

[1] The SCSI spec has an optional feature to enable ordered tags. But apparently almost no devices ever implemented it, and AFAIK Linux and Windows never use that feature either.

arielweisberg 1301 days ago [-]
Out-of-the-box configuration for most journaling filesystems only journals metadata, not data.

Journaling data cuts write throughput in half and it’s not necessary most of the time.

throwaway8941 1301 days ago [-]
It's worth noting that log-structured and copy-on-write filesystems (which I've seen described as two types of journaling), like btrfs and F2FS, log data as part of their normal operation without any performance loss, so you always get a consistent view of the filesystem (barring bugs in the FS code or fsync-is-not-really-an-fsync treachery from your hardware).
sedatk 1301 days ago [-]
As long as writes are flushed sequentially, I see no problem with metadata-only journaling, but apparently NVMe drives don't provide such guarantees.
arielweisberg 1300 days ago [-]
You don't get a guarantee of write ordering from the disk (pretty much any kind), and not from the OS IO scheduler either.

Journaling filesystems can still implement atomic appends with only metadata journaling.

Updating in place is generally not atomic because of the way writeback works for buffered IO.

If you use unbuffered IO you bypass the OS scheduler but still have the disk reordering things if you don’t use write barriers and they still don’t guarantee atomicity for regular writes.

questionfor 1301 days ago [-]
What's the most reliable non-NVMe SSD for consumers?
Shared404 1301 days ago [-]
Good question.

I've found Samsung to be quite a good brand, as well as HP surprisingly. YMMV with HP though.

stjohnswarts 1301 days ago [-]
I assume the people who make NVMe drives are much more knowledgeable than me on the topic. I just use drives as they come and don't obsess over how to get 5-10% more life out of them with tweaks. Your best bet is to buy a drive from a reputable brand, not one you haven't heard of or at least haven't researched. As far as data: back up early, back up often.
wtallis 1301 days ago [-]
This article is not about a drive's total write endurance or lifespan, but about short-term durability of writes across events like unexpected power failures or system crashes.