So a few milliseconds in total. Big whoop. OpenBSD's ahci(4) driver stalls 5-6 seconds for each SATA device on the bus, just to make sure the device really is there and that the link speed is correct...or something. My two OpenBSD machines, which incidentally have 3 SATA drives each, spend almost 20 seconds on that one segment alone during kernel startup.
Colin (and others?) have been working to reduce the boot time of FreeBSD for a while now. At this point shaving off a few milliseconds is probably a nice win for them.
I mean, I get it. But shaving 4.5ms does seem to fall into the realm of not being most people's problem?
Note that I want to stress this is no reason for the folks focusing on it to stop. And kudos on the improvement. The criticism, if any, here is that statistics once again rears its head to make something feel bigger by using a percentage.
Sorta. You have to rely on this on a repeated basis for it to matter. And if total startup time is still roughly 10x this, why push for solutions that require that many kernel startups?
That is silly. I already said I do not intend this as a criticism of the folks working on it. Kudos on the improvement.
That said... for most people, you are better off learning to use threads or processes. It is almost certainly less effort than spinning up a VM. Not to mention everything else you will, by necessity, have to do if you are using VMs for everything. Unless you find ways to do shared network connections across VMs and such, but... I'm curious how we can make things closer to threads, without eventually bringing all of the dangers of threads along with it?
Why make "for most people" points when we're already talking about FreeBSD?
There's a sort of uni-/light-kernel race quietly happening right now since the industry doesn't really fully trust process isolation within an OS for running completely untrusted code. At least not as much as it trusts VT-X/AMD-V. In that race, kernel startup time is just like JVM startup time, or Julia startup time, or Python startup time, all of which are things that people work on shaving milliseconds from.
Ouch. That feels even harsher to the idea than what I'm saying. :D
To your point, my "for most people" is aimed at this forum. Not the folks doing the work. It is incredibly cool that the folks doing this are also on the forum, but I think it is safe to say they are not most of the folks on here. Is that not the case?
Say the post had been "How Netflix uses FreeBSD to achieve network latencies less than 10 microseconds" or something. How helpful would it be to comment about how "for most people" this doesn't matter?
Did anybody who read the tweet think it mattered to them when it didn't? It mentions Firecracker explicitly. How many people on HN do you think upvote this because they themselves are running FreeBSD on Firecracker, versus the number who upvoted because it's just interesting in and of itself?
Maybe? Having worked in way too many teams that were convinced we had to aim at some crazy tech because someone else saw "50% increase in throughput" on some tech stack, I am happy to be the one saying to keep perspective.
Though, I will note that you picked a framing with absolute timing here. That is the root criticism here. If the headline had been "team saved 2ms off 28ms startup routine," it would still be neat and worth talking about. It probably would not have impressed as many folks, though. After all, if they save 1% off training time on a standard ML workload, that is way, way bigger. Heck, 0.07% is probably more time.
I'm reminded of a fun discussion from a gamedev, I think it was Jonathan Blow, on how he thought he was smarter than the id folks because he found that they were not using hashes on some asset storage. He mused that whether or not he was smarter was not at all relevant, as it just didn't matter. And you could even make some argument that not building the extra stuff to make sure the faster way worked was the wiser choice.
If we didn't have you here we might have done the terrible mistake of thinking this was interesting. I'm so glad you were here to show us how useless this effort is, protecting us from wasting our time, protecting us from our lack of judgment. Thank you so much for your service and foresight.
I mean... sorta? Yes, it is the point. But, the goal is to make it so that you can do this scale up/down at the VM level. We have been able to do this at a thread or even process level for a long long time, at this point.
4.5ms on what hardware, in what scenario? Would I like to save 5ms off the startup time of my VMs? You betcha. Does that 5ms turn into 200ms on a low-power device? Probably.
How would you even discern this delay from all the other delays happening before you're done booting, when there are so many other natural variances of a couple of milliseconds here and there? Every "on metal" boot is unique, no two are exactly as fast, and certainly never within 1.98 milliseconds of each other, even in the case of a VM with lots of spare resources on the host. You're painting too pretty a picture of this.
Right, but again pointing this at most people on this forum, that answer is probably the same. Very few of us are in a situation where this can save seconds a year, much less seconds a month.
For the folks this could potentially save time, I'm curious if it is better than taking an alternative route. Would be delighted to see an analysis showing impact.
And again, kudos to the folks for possibly speeding this up. I'm assuming qsort or some such will be faster, but that isn't a given. For that matter, if it is being sorted so that searches are faster later, then the number of searches should be considered, as you could probably switch to a sentinel scan to save the sort time, and then you are down to how many scans you are doing times the number of elements you are scanning against.
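To make that tradeoff concrete, here is a toy C sketch (my own illustration, not the FreeBSD code): sorting only pays off if the O(n log n) up-front cost is amortized over enough later lookups; with only a handful of lookups, a plain scan of the unsorted list (no sentinel trick here, just the simple version) can win outright.

    /* Toy illustration of the sort-vs-scan tradeoff -- not the FreeBSD code. */
    #include <stdio.h>
    #include <stdlib.h>

    static int cmp_int(const void *a, const void *b)
    {
        int x = *(const int *)a, y = *(const int *)b;
        return (x > y) - (x < y);
    }

    /* Option 1: sort once up front, then binary-search each lookup. */
    static int lookup_sorted(int *v, size_t n, int key)
    {
        return bsearch(&key, v, n, sizeof *v, cmp_int) != NULL;
    }

    /* Option 2: skip the sort entirely, linear-scan each lookup. */
    static int lookup_scan(const int *v, size_t n, int key)
    {
        for (size_t i = 0; i < n; i++)
            if (v[i] == key)
                return 1;
        return 0;
    }

    int main(void)
    {
        enum { N = 1000 };
        int v[N];
        for (int i = 0; i < N; i++)
            v[i] = rand();

        /* Few lookups: the scan avoids paying for the sort at all.
           Many lookups: sort once, then bsearch each time. */
        printf("scan:   %d\n", lookup_scan(v, N, v[42]));
        qsort(v, N, sizeof *v, cmp_int);
        printf("sorted: %d\n", lookup_sorted(v, N, v[42]));
        return 0;
    }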
Agreed, but my math also says that if this is your pain point, you'd save even more time by not firing off more VMs? That is, skip the startup time entirely by just not doing it that often.
I don't start up VMs multiple times per hour, much less per minute, so I don't presume to know what tradeoffs the people using Firecracker are making when deciding how often to start up VMs.
Fair. I still feel fine pushing back on this. The amount of other resources getting wasted to support this sort of workload is kind of easy to imagine.
I will put back the perspective that they were not hunting for 2ms improvements. They were originally chopping whole seconds off of startup. The progress they made is laudable. And how they found it and acted on it is great.
Ha, I thought I saw 4.5ms from another post. Not sure where I got that number. :(
Realizing this is why someone said the number was more like 2ms. I don't think that changes much of my view here. Again, kudos on making it faster, if they do.
Colin's focus has been on speeding up EC2 boot time. You pay per second of time on EC2. A few milliseconds at scale probably adds up to a decent amount of savings - easily imaginable that it's enough to cover the work it took to find the time savings.
Yes, you pay per second. But, per other notes, this is slated to save at most 2ms. Realistically, it won't save that much, as they are still doing a sort, not completely skipping it, but let's assume qsort does get this down to 0 and they somehow saved it all. That is trivially 500 boots before you see a second. Which... is a lot of reboots. If you have processes that are spawning off 500 instances, you would almost certainly see better savings by condensing those down to fewer instances.
So, realistically, assuming that Colin is a decently paid software engineer, I find it doubtful that the savings from this particular change will ever add up to anything even close to what it costs to have just one person assigned to it.
Now, every other change they found up to this point may help move that needle, but at this point, though 7% is a lot of percent, they are well into the territory where the savings they are finding are very unlikely to be worth the search.
Edit: I saw that qsort does indeed get it down to basically zero at the scale being discussed here. I'd love to see more of the concrete numbers.
Missing the point, which is to shave off cold start times for things like Lambdas, where shorter start times means you can more aggressively shut them down when idle, which means you can pack more onto the same machine.
See, the binpacking that this implies for lambda instances is insane to me. Certainly sounds impressive and great, and they may even pull it off. And, certainly, for the Lambda team, this at least sounds like it makes sense. I'll even root for them to hope it works out.
It just reminds me of several failed attempts to move to JVM-based technologies to replace Perl, because "just think of the benefits if we get JVM startup fast enough that each request can be its own JVM." They even had papers showing resource advantages that you would gain. It turned out, many times, that competing against how cheap Perl processes were was not something you wanted to do. At least, not if you wanted to succeed.
A realization here is that while reaching cold start times that allow for individual requests would be awesome, you don't need to reach that to benefit from attacking the cold start time:
Every millisecond you shave off means you can afford to run closer to max capacity before scaling up while maintaining whatever you see as the acceptable level of risk that users will sometimes wait.
Of course if the customer stack is slow to start, the lower bound you can get to might still be awful, but you can only address the part of the stack you control.
I think my argument is easily seen as this, but that is not the intent of my "argument." As you say, and if I'm not mistaken, in this case they only "cared" about this because it was easy and "in the way," as it were. I don't think it was a waste of time to pick up the penny in the road when you are already there and walking.
What I am talking towards, almost certainly needlessly, is that it would be a waste of most people's time to profile anything that is already down to ms timing in the hopes of finding a speedup. In this case, it sounds like they were basically doing final touches on some speedups they already did and sanity testing/questioning the results. In doing so, they saw a few last changes they can make.
So, to summarize, I do not at all intend this as a criticism of this particular change. Kudos to the team for all the work they did to get to the point where this was 7% of the remaining time, and they might as well pick up the extra gains while they are there.
I've spent plenty of time optimizing for milliseconds (for very high hourly rates, not on staff, so my work has a very clear cost and ROI) and I operate probably at least a dozen layers above the OS.
Modern CPUs execute a few million instructions per millisecond.
I think people who operate at a level where microsecond and nanoseconds matter may see it as dismissive to question gains of milliseconds.
I'm sure that happens. And I'm sure they do. I'm also equally sure that the vast majority of the interactions I have are with people who are not doing this.
You can think of my "argument" here as reminding people that shaving 200 grams off your bicycle is almost certainly not worth it for anyone casually reading this forum. Yes, there are people for whom that is not the case. At large, that number of people is small enough to count.
And I can't remember if it was this thread or another, but I didn't really intend an argument. Just a conversation. I thought I led off with kudos on the improvement. Criticism would be of the headline, if there is any real criticism to care about.
I can't really agree. I think it's a sad indictment of this field that we so easily will throw away significant efficiencies - even if it's only a few milliseconds here or there.
Software today is shit. It's overly expensive, wastes tons of time and energy, and it's buggy as hell. The reason that it's cheaper to do things this way is because, unlike other engineering fields, software has no liability - we externalize 99.99% of our fuckups as a field.
Software's slow? The bar is so low users won't notice.
Software crashes a lot? Again, the bar is so low users will put up with almost anything.
Software's unsafe garbage and there's a breach? Well it's the users whose data was stolen, the company isn't liable for a thing.
That's to say that if we're going to look at this situation in the abstract, which I think you're doing (since my initial interpretation was in the concrete), then I'm going to say that, abstractly, this field has a depressingly low bar for itself when we throw away the kind of quality that we do.
But... this is precisely not a significant efficiency. At best, you can contrive an architecture where it is one. But, those are almost certainly aspirational at best, and come with their own host of problems that are now harder to reason about, as we are throwing out all of the gains we had to get here.
I agree with you, largely, in the abstract. But I'm failing to see how these fall into that? By definition, small ms optimizations of system startup are... small ms optimizations of system startup. Laudable if you can do them when and where you can. But this is like trying to save your way to a larger bank account. At large, you do that by making more, not saving more.
A 7% improvement from a trivial change is an insane thing to question, honestly. It is absolutely significant. Whether it is valuable is a judgment, but I believe that software is of higher value (and that as a field we should strive to produce higher quality software) when it is designed to be efficient.
> At best, you can contrive an architecture
FaaS is hardly contrived and people have been shaving milliseconds off of AWS Lambda cold starts since it was released.
> But I'm failing to see how these fall into that?
> The improvement in speed from Example 2 to Example 2a is only about 12%, and many people would pronounce that insignificant. The conventional wisdom shared by many of today's software engineers calls for ignoring efficiency in the small; but I believe this is simply an overreaction to the abuses they see being practiced by pennywise-and-pound-foolish programmers, who can't debug or maintain their "optimized" programs.
> In established engineering disciplines a 12% improvement, easily obtained, is never considered marginal; and I believe the same viewpoint should prevail in software engineering.
You have two framings here that I realize are not mine.
First, I am not arguing to not make the change. It was identified, make it. As you say, it is a trivial one to do. Do it.
Second, thinking of percentage improvements has to be done on the total time. Otherwise, why is it not written in much more tightly tuned assembly? After all, I am willing to wager that they still don't have the code running within 7% of the absolute optimum speed it could be going. Heck, this is a program that is post-processing a list. They already have a way that could get the 40us completely gone. Why stop there?
If I am to argue that anything here would have been a "waste" and shouldn't have been done, it would be the search for a 2ms improvement in startup. But, again, that isn't what they were doing. They were shaving off seconds from startup and happened to find this 2ms improvement. It made headlines because people love pointing out poor algorithmic choice. And it is fun to muse on.
This would be like framing your wealth as just the money you have in your wallet as you go to the store. On the walk, you happen to see a $5 bill on the ground. You only have $20 with you, so picking it up is a huge % increase in your wealth. Of course you should pick it up. I'd argue that, absolutely considered, there is really no reason to ever not pick it up. My "framing" is that going around looking for $5 bills to pick up from the ground is a waste of most people's time. (If you'd rather, you can use gold mining. There was a great story on the radio recently about people who still pan for gold. It isn't a completely barren endeavor, but it is pretty close.)
Look, from a utilitarian perspective, let's say we want to optimize the total amount of time globally.
Let's say FreeBSD dev spends 2 hours fixing a problem that gives a 2ms improvement to the boot time.
Let's assume conservatively that FreeBSD gets booted 1000 times every day globally. That's 2 seconds per day. Scale that to 10 years and you break even.
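A quick back-of-the-envelope check of that break-even claim, using the same toy numbers (nothing measured, just the arithmetic):

    /* Toy break-even calculation using the numbers above:
       2 hours of developer time vs. 2 ms saved per boot, 1000 boots/day. */
    #include <stdio.h>

    int main(void)
    {
        double dev_cost_s    = 2.0 * 3600.0;    /* 2 hours of dev time, in seconds */
        double saved_per_day = 1000.0 * 0.002;  /* 1000 boots/day * 2 ms saved     */
        double days          = dev_cost_s / saved_per_day;
        printf("break-even after %.0f days (~%.1f years)\n", days, days / 365.0);
        return 0;
    }

That prints roughly 3600 days, i.e. the "about 10 years" figure above.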
Nobody cares about "your bicycle" or your internal CRUD app that has maybe 30 users, but FreeBSD is widely used and any minor improvements on it can be presumed to be worth the time fixing.
Maybe you have an axe to grind on the topic of premature optimization, but this FreeBSD fix doesn't seem to be the showcase of developers wasting their time doing meaningless optimizations.
This is silly, though. For many reasons. First, I have no axe to grind. They found a trivial change that can be made and made it. Probably even simplified the code, as they used a better search that was introduced later in the codebase. Kudos on that!
Second, it also clearly took them more than 2 hours. They've been working at making it faster for many days at this point. So, it will take them a long long time to realize this particular gain.
Finally, my only two "criticisms" this entire time were on calling this a 7% improvement and claiming that searching for this would be a waste of most teams' time. Consider: from that headline, most folks would assume that they can get their EC2 instance that is running FreeBSD 7% faster. But they clearly won't be able to. They won't even get a Lambda invocation 7% faster. Nothing anyone anywhere will ever see will be 7% faster. They may see something 2ms faster, and that is, itself, awesome. And my claim that this would be a waste of time to search for is irrelevant to this team. They weren't looking for this particular change to make, of course, so that "criticism" isn't relevant to them. At all.
Just to be clear, I mean 'enough saved across all pay-per-second ec2 users', not a given specific account or user, which, yeah, it's probably minimal. The scale of lambda and ec2 is...enough to make any small change a very large number in aggregate.
Right, but this is akin to summing up all of the time saved by every typist learning to type an extra 5 words a minute. Certainly you can frame this in such a way that it is impressive, but will anyone notice?
I think the focus here was for firecracker VMs used by lambda? If you're paying 2ms on every invocation of every function that'll add up. OTOH it seems like SnapStart is a more general fix.
If you boot OpenBSD thousands of times per day it adds up. I can imagine this being the case for people running big server farms or doing virtualization stuff.
I had it set up like this: some devices plugged into the PC directly, a KVM plugged into the PC, a hub plugged into the PC, the hub on the monitor plugged into that hub, and a bunch of things plugged into the hubs.
Half of the devices were just powered by USB and not using any data, but only needed to be ON when the PC is ON.
The way it worked:
- FreeBSD would scan all USB devices
- get held up on some for no reason
- find a hub
- scan everything on a hub
- find another hub
- scan everything on that hub as well
Anytime during the scan it could run into a device that takes longer (no clue which or why, only happened on FreeBSD and never on Linux or Windows)
I'm sure Windows and Linux did the same thing, but both simply continued booting while FreeBSD waited and waited and waited and waited.
Bias light on a monitor would sometimes completely stop the process until unplugged.
USB stack on FreeBSD is forever cursed, I still remember kernel panics when you unplug a keyboard or a mouse.
> I'm sure Windows and Linux did the same thing, but both simply continued booting while FreeBSD waited and waited and waited and waited.
My understanding is FreeBSD will by default wait until it has found all the fixed disks before it figures out which one to mount as root. This could be a USB drive, so it needs to enumerate all the USB devices first. If you don't need that, adding hw.usb.no_boot_wait=1 to /boot/loader.conf.local or thereabouts will skip that wait and speed up the boot process a lot (especially on hardware where USB is slow to finish enumeration).
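For reference, a minimal sketch of what that looks like, using the tunable named above (double-check your own setup; don't set it if your root filesystem could be on a USB device):

    # /boot/loader.conf.local -- sketch based on the tunable mentioned above.
    # Skips waiting for USB enumeration before picking the root device.
    # Do NOT set this if root might live on a USB drive.
    hw.usb.no_boot_wait="1"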
It adds 2ms to the boot time on modern hardware, and the list it is sorting has grown by about 2 orders of magnitude since it was introduced. Seems reasonable to me that it was unnoticed.
I am currently at work fighting the same class of problem. Bad pattern was used a few times, seemed to work. So it was used twenty times, still no problems. Now it’s used 100 times and it was 20% of response time. Got that down to half in a few months, now it’s a long slog to halve it again. It should never have been allowed to get past 3% of request startup time.
Likely because none have the required proficiency. I certainly don't. "So? It's open source. Stop complaining and fix it." is never a good response to a bug report. I know this quirky behavior has been brought up a few times by other users than me on the openbsd-bugs@ mailing list during the 10 or so years since I first observed it.
I don't know, but I don't think there's an actual case that Linux/FreeBSD/etc. happens to have a workaround for. I think it's just OpenBSD tilting at a windmill.
It's not impossible that it's some archaic and long-since-invalid legacy thing along the lines of "let's wait 5 seconds here just in case the drive hasn't managed to spin up, despite the computer having been powered on for 15 seconds already, because that happened once in 2003 on someone's 1989 HPPA system so we need to support that case." I'm not joking, this is really what I imagine it being about.
There were (or still are?) SCSI drives that would not spin up until told to over the bus. I think the idea is to not do them all at once since the motor draws a lot of power.
I'm fairly sure I've run into drives that would not respond on the bus while spinning up.
And if it happens with SCSI drives, it may happen with other types.
>"There were (or still are?) SCSI drives that would not spin up until told to over the bus"
Surely they'd spin up as soon as the BIOS started probing the bus to init the hardware, long before the kernel was running, or they'd be infamous for being "the HDDs you cannot boot an OS from"...
In my 25 years of using PCs I have not once come across a drive that did not spin up as soon as the computer was powered on. But whatever the case is, Linux and FreeBSD never had this behavior. Waiting some arbitrary amount of time isn't an appropriate solution (to what I insist is just an imagined problem), it's just a poor bandaid.
Spinning up only upon request is common behavior for SCSI drives. Per the Ultrastar DC HC550 manual I have handy,
After power on, the drive enters the Active Wait state. The Drive will not spin up its spindle motor after power on until it receives a NOTIFY (Enable Spinup) primitive on either port to enter the Active state. If a NOTIFY (Enable Spinup) primitive is received prior to receiving a StartStop Unit command with the Start bit set to one, spin up will begin immediately. For SAS, this is analogous to auto-spinup function in legacy SCSI. This provision allows the system to control the power spikes typically incurred with multiple drives powering on (and spinning up) simultaneously.
If a StartStop command with the Start bit set to one is received prior to receiving a NOTIFY (Enable Spinup), the drive will not start its spindle motor until Notify (Enable Spinup) is received on either port. Successful receipt of a NOTIFY (Enable Spinup) is a prerequisite to spin up.
Code in the SCSI controller boot ROM, if enabled, does typically handle this before the OS starts, often with an option to stagger spin-up for the reason mentioned above.
For what it's worth, to speed up boot, I typically disable the OPROM on any storage controller not hosting a boot device.
Moreover, as BIOS, UEFI, Power Mac, …, all require different, incompatible firmware bits, enabling controller firmware isn't an option for many otherwise compatible controllers.
Regardless, possible spin up delays in no way justify introducing explicit delays in device enumeration; if the OS needs to ensure a drive is spun up, multiple mechanisms exist to allow it to do so without introducing artificial delays (TEST UNIT READY, non-immediate START STOP UNIT, simply paying attention to additional sense information provided by commands known to fail before spin up and retrying as appropriate, etc.).
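As a rough sketch of that "poll instead of sleep" idea, here is what a bounded retry loop around TEST UNIT READY might look like; scsi_test_unit_ready() and msleep_ms() are hypothetical stand-ins for whatever command-issuing and sleep primitives a real driver has, not actual kernel APIs:

    /* Sketch: wait for a drive to become ready by polling TEST UNIT READY
       with a bounded retry loop, instead of sleeping a fixed 5-6 seconds.
       scsi_test_unit_ready() and msleep_ms() are hypothetical stand-ins. */
    #include <stdbool.h>

    bool scsi_test_unit_ready(int target);  /* hypothetical: true once the unit is ready */
    void msleep_ms(int ms);                 /* hypothetical: sleep for ms milliseconds   */

    static bool wait_for_ready(int target, int timeout_ms)
    {
        int waited = 0;
        while (waited < timeout_ms) {
            if (scsi_test_unit_ready(target))
                return true;        /* ready: return immediately, no fixed delay */
            msleep_ms(50);          /* short poll interval                       */
            waited += 50;
        }
        return false;               /* give up once the timeout is exhausted     */
    }

The point is that a ready drive costs essentially nothing, and only a genuinely slow or broken drive ever burns the full timeout.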
IIRC, explicit multi-second delays between controller initialization and device enumeration were traditionally introduced because some (broken) devices ignored commands entirely — and were therefore effectively invisible — for a significant period of time after a bus reset.
With that said, introducing a multi-second delay to handle rare broken devices is sufficiently strange default behavior that I assume something else I'm not aware of is going on.
> Surely they'd spin up as soon as the BIOS started probing the bus to init the hardware, long before the kernel was running, or they'd be infamous for being "the HDDs you cannot boot an OS from"...
Depends on the market they sell into; it might not be a big deal for enterprise drives that you don't expect to boot from, especially if you're attaching them to a controller that you also don't expect to boot from. I had some giant quad-processor Pentium II beast after it was retired, and the BIOS could not boot from the fancy drive array at all; and even if it could, the drives were on a staggered startup system, so it would have had to wait forever for that.
Contrived case. Such a storage solution would still not be something that ahci(4) on OpenBSD - a basic SATA controller driver - could wrangle, no matter how many seconds it patiently waited.
Could be anything, but I wouldn't be surprised if it were some other bug leading the driver to conclude there are 65535 devices attached and that it needs to probe each of them.
Well, for one, I'd imagine you could do all ports at once. Or start doing it and allow the rest of the drivers to initialize their stuff while it is waiting.
At least looking at my dmesg on Linux, all SATA drives get ready at roughly the same time: the AHCI driver gets initialized 0.6 s into boot, and 3.2 seconds into boot they are all up. It looks parallel, too, as the ordering does not match the log order.
The OS should save the serial numbers of all devices and, once they are all found, continue.
We should make it declarative, or whatever you want to call it.
Also, why can't the freaking boot process be optimized after the first boot?
BSD is essentially sorting the SAME thing every boot; THAT is ridiculous.
Sort once, save the order (a list, sysinit000000000000), boot fast next time.
A hardware change, some other change, or a failed boot can trigger the full sorting again, for safety.
You know what you're booting into, so sort it once, then run from the saved order next time. How many times do you change the hardware in a computer? And if you do, you can just restart with a GRUB flag, a toggle switch in the control panel before restart, etc.
The code to manage "do I sort or do I cache" is probably worse than the code to just sort it unconditionally.
And you really want to do this automatically, with no manual anything, because (among other reasons) if you need to swap a part on a broken machine, you do not want that to needlessly break the software. So you're sorting regardless, and so you might as well just be unconditional about it.
On my machine the TPM is checking the state of the machine for security reasons, so if this runs ANYWAY, then why not use it for one more useful thing... (it runs before the machine even thinks about booting anything)
You can define a shutdown flag for hardware changes: shutdown -rF XD
Some things are hot-swappable, and for that there has to be logic (in the kernel) to know you have new hardware...
If you have a total hardware failure, that can be detected too, on the next boot...
That's a nice way to add fuckloads of useless code while wasting memory and turning a now-read-only process into a read-write one (the kernel has no place to write a cache before userspace initializes and sets up mounts and filesystems).
Windows XP had a tool called BootVis[1] that would try to optimize the startup sequence. Supposedly newer versions of Windows do this sort of thing automatically.
I suspect much of the delay comes from the nature of plug-and-play. The OS has to poll every bus and every port to find every device, then query those devices to determine what they are, then load a driver and actually configure them. It simply can't rely on the hardware being exactly the same from one boot to the next.
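As a toy sketch of why that kind of serial enumeration is sensitive to a single slow device (this is my own illustration; probe_port() is a hypothetical stand-in, not a real USB or PnP API):

    /* Toy sketch: serial, recursive plug-and-play enumeration.  If any one
       device is slow to answer its probe, everything behind it waits.
       probe_port() is a hypothetical stand-in, not a real USB stack. */
    #include <stddef.h>

    struct device {
        int    is_hub;   /* nonzero if this device is itself a hub      */
        size_t nports;   /* number of downstream ports, if it is a hub  */
    };

    struct device *probe_port(struct device *hub, size_t port);  /* hypothetical: may block */

    static void enumerate(struct device *hub)
    {
        for (size_t i = 0; i < hub->nports; i++) {
            struct device *dev = probe_port(hub, i);  /* serial: one slow probe stalls the rest */
            if (dev == NULL)
                continue;                             /* empty port, move on                    */
            if (dev->is_hub)
                enumerate(dev);                       /* recurse into nested hubs               */
        }
    }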
I'm saying you can have two pathways (the decision of which route to take is almost zero cost):
First: when you boot correctly, you can save the information that you booted correctly.
Second: if you do not find the information about a safe boot, you go the old route, the quicksort route.
(YES, you can do that in a way where you do not get boot loops from reading that the boot was OK even when it was not. Just don't be lazy like the Linux kernel developers were when I was telling them this exact thing a few years back.)