> O(n^2) is the sweet spot of badly scaling algorithms: fast enough to make it into production, but slow enough to make things fall down once it gets there.
Lesson here is to profile your software at some point.
I'm playing a game (Fenyx Rising) and after launching it I always wonder why the "Checking for additional content" screen takes 20-30 seconds. I'm pretty sure it should be just a single request.
I doubt that GTA Online released the game with 60K+ items in their online store.
The runtime for the release day store inventory count may have been okay, but I don't think that Rockstar kept profiling their game every time that they modified their store inventory.
The amount of pain Rockstar inflicted on their 90K+ users shows that Rockstar didn't care that their game took more than 3 minutes to start up for the majority of their users.
As someone who worked on a smaller, but still millions-of-users live-service game, profiling of average resource usage and the loading screens was done periodically even if there were no code changes to the area of the game in question.
Given that our team was over an order of magnitude smaller than Rockstar, I would be very surprised if they did not have anyone even casually browsing a profiler every 3 months or something, though I think at their scale (LinkedIn claims 5k employees) they can probably have a team or two where everyone's entire job description is performance maintenance.
We definitely profile "dev startup" a lot, it directly impacts team speed and ability to work on the codebase. Dev startup usually includes a majority of "user startup" too, so it helps -- but it's also really easy to overlook stuff like "massive mods/mod dirs" and the like.
Speaking from experience... Once, I was so annoyed at a "startup UI"-related dialog taking way (wayyyy) too long to load. I dropped an expletive into Slack and, not much later, I was guided toward fixing my own ~2 year old mistake. We were scanning some "user-generated files" in a way that worked fine but scaled horribly -- the operation went from multiple seconds down to milliseconds. Ugh.
About the regexp one: I once wrote a regexp for CSS. It wasn't complete, but you would be able to pinpoint a syntactic error. I hooked it up to a text field, and started entering CSS. All went fine, until I hit a limit, and adding a single character froze Chrome for more than a minute (at which point I killed it).
I don't think it was accidentally quadratic. More likely, it was exponential.
Probably because it's MUCH easier to code bubblesort without making mistakes that cause it to not terminate or some such. Especially if they are writing the bootloader in assembly.
For something mission critical like a bootloader that's more valuable than turning O(n^2) into O(n log n). People running systems like BSD largely don't care how long the system takes to boot, once it's booted the system runs for years.
The funny thing is that, in my experience, bubble sort is actually pretty hard to write, because the finicky details of the swap step don't make sense (in a "wait, this can't possibly be right, there's no way you'd want to write it like that" kind of way). Better than even odds that if you have someone write a "bubble sort" without reference, you get an accidentally insertion sort variant.
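(For anyone who wants the two side by side: a minimal, untested sketch in C. The helper names and the int-array framing are just for illustration.)

```c
#include <stddef.h>

static void swap_ints(int *a, int *b) { int t = *a; *a = *b; *b = t; }

/* Bubble sort: repeatedly sweep the array, swapping adjacent
 * out-of-order pairs until nothing moves past the sorted tail. */
void bubble_sort(int *a, size_t n)
{
    for (size_t pass = 0; pass + 1 < n; pass++)
        for (size_t i = 0; i + 1 < n - pass; i++)
            if (a[i] > a[i + 1])
                swap_ints(&a[i], &a[i + 1]);
}

/* Insertion sort: grow a sorted prefix by sliding each new element
 * back to where it belongs.  Near-linear on almost-sorted input. */
void insertion_sort(int *a, size_t n)
{
    for (size_t i = 1; i < n; i++) {
        int key = a[i];
        size_t j = i;
        while (j > 0 && a[j - 1] > key) {
            a[j] = a[j - 1];
            j--;
        }
        a[j] = key;
    }
}
```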
Sure. In this case, the smarter people wrote the sort in ~1995 and it was good enough for nearly 30 years, but now someone has to step up and be smart again.
You can't always rely on smarter people to be there to do things for you. And you also can't rely on the standard library to have been written by smarter people anyway, and even if so, to have been written by smarter people in order to handle the situation you find yourself in. There's lots of ways to sort, and lots of good reasons to choose ways that are less than optimal for the use cases they end up being used in.
You're defending the claim that implementing qsort is too hard. And yes, the people who wrote the standard library definitely are smarter than the people who put in a bubble sort because quicksort was too hard.
This is just a moronic defense of the venerated FreeBSD developers, it’s on a level equal to organized religion. The FreeBSD developers are fine developers and this was dumb, that’s why they replaced it.
And in this day and age there really is no argument for any user-space C environment to exist where the qsort standard library function is not available. And even if there was, it would still be smarter to just copy and paste the battle-tested code from the C library than write another implementation. Otherwise, that's how you end up with bubble sort, because doing it right is too hard.
Copying a convenient-looking one out of a CS book is how you end up with a bubble sort. Approximately nobody comes up with a bubble sort from scratch; it's obviously, gratuitously bad in a way that there's no practical reason to think of on your own. The sorts that people come up with without reference are usually insertion and selection sorts—those two were discovered before bubble sort.
I mean yeah, bubble sort is basically insertion sort but weirdly pessimized in a way that makes negative sense. Giving it a catchy name has probably been a net harm.
Plenty of children come up with bubble sort as a sorting algorithm. It’s intuitive to go down the list swapping any pairs that happen to be in the wrong order.
It's also very intuitive to pick out a misplaced item and put it in the right position. In the context of physical objects (e.g. playing cards), it's even more intuitive to come up with a (two-way) insertion sort than a bubble sort.
Nah, it’s not. Correctly implementing algorithms is hard; it’s easy to create incorrect behavior on edge cases and performance pitfalls. I’m sure you knew about the Java issue from a while back, for example.
People who think this way haven't written boot code... I suppose you're gonna link in the C runtime too, and its assumption of being a "process" under some "operating system"... oh wait.
Compared to the rest of the task, writing a sort is pretty darn trivial.
This is true if we're talking about the first-stage BIOS boot that needs to fit in 512 bytes, but there aren't any particular constraints on kernel size at the point in question. Link in anything you want, including qsort.
O(n^2) algorithms often cause performance issues. The main cases I have seen in business logic are: (A) offset pagination (select ... offset n) and then paginate over all entries, and (B) read a text value, append something, store, repeat.
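Case (B) has a classic in-memory cousin in C: building a string with repeated strcat(), which rescans the whole buffer on every append. A rough sketch of the trap and the usual fix (tracking the end yourself); the buffer size and N are arbitrary:

```c
#include <stdio.h>
#include <string.h>

#define N 50000

/* Quadratic: strcat() walks from the start of dst on every call,
 * so the total work is about 1 + 2 + ... + N ~ N^2 / 2 reads. */
static void build_slow(char *dst)
{
    dst[0] = '\0';
    for (int i = 0; i < N; i++)
        strcat(dst, "x");
}

/* Linear: remember where the end is and append there directly. */
static void build_fast(char *dst)
{
    size_t len = 0;
    for (int i = 0; i < N; i++)
        dst[len++] = 'x';
    dst[len] = '\0';
}

int main(void)
{
    static char buf[N + 1];
    build_slow(buf);
    build_fast(buf);
    printf("%zu\n", strlen(buf));
    return 0;
}
```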
Ummm.. Sure, but what's happening here is that FreeBSD stores a list in a file not in the desired order, then sorts it before use.
It seems to me that any prolonged discussion about which sorting algorithm should be used is sort of skipping the elephant in the room: why is the list not sorted to begin with?
Without that basic knowledge it isn't very productive to focus on implementation details, no matter how fun the sorting exercise is. Deleted code is fast code.
I was talking about "exponential algorithms" in general, and about business logic. I know FreeBSD isn't business logic, but low-level kernel code. I don't know the details of the FreeBSD problem.
For each page, you have a select statement. The first without offset, then with offset 10 limit 10, then offset 20 limit 10, offset 30 limit 10, and so on. The database engine will, for each query, read all entries up to offset + limit. This is a staircase pattern: first it reads 10, then 20, then 30, and so on. So the total number of entries read is 10 + 20 + 30 + ..., which is roughly n^2 / (2 * page size), i.e. O(n^2).
One could argue that the database engine should be "smarter", but it's not. Note that data could be added or removed in the meantime, so the database engine can't really cache the result. See also https://stackoverflow.com/questions/70519518/is-there-any-be...
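To make the contrast concrete, here is a rough sketch against SQLite's C API with a hypothetical items(id INTEGER PRIMARY KEY) table (schema and names are made up; assumes positive ids). The offset variant re-scans everything before the offset on every page; the keyset variant only ever touches the rows it returns:

```c
#include <sqlite3.h>

/* O(n^2)-ish overall: every page re-reads all rows before the offset. */
void page_by_offset(sqlite3 *db, int page_size)
{
    const char *sql = "SELECT id FROM items ORDER BY id LIMIT ? OFFSET ?";
    for (long long offset = 0; ; offset += page_size) {
        sqlite3_stmt *st;
        if (sqlite3_prepare_v2(db, sql, -1, &st, NULL) != SQLITE_OK)
            return;
        sqlite3_bind_int64(st, 1, page_size);
        sqlite3_bind_int64(st, 2, offset);
        int rows = 0;
        while (sqlite3_step(st) == SQLITE_ROW)
            rows++;                          /* consume the page */
        sqlite3_finalize(st);
        if (rows < page_size)                /* short page: we're done */
            break;
    }
}

/* O(n) overall: resume from the last id seen ("keyset" pagination). */
void page_by_keyset(sqlite3 *db, int page_size)
{
    const char *sql = "SELECT id FROM items WHERE id > ? ORDER BY id LIMIT ?";
    long long last_id = 0;                   /* assumes ids start at 1 */
    for (;;) {
        sqlite3_stmt *st;
        if (sqlite3_prepare_v2(db, sql, -1, &st, NULL) != SQLITE_OK)
            return;
        sqlite3_bind_int64(st, 1, last_id);
        sqlite3_bind_int64(st, 2, page_size);
        int rows = 0;
        while (sqlite3_step(st) == SQLITE_ROW) {
            last_id = sqlite3_column_int64(st, 0);
            rows++;
        }
        sqlite3_finalize(st);
        if (rows < page_size)
            break;
    }
}
```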
That's pagination, not indexed page scanning. Both have their place, but they're not the same. Pagination is way better at handling updates between page loads and generally more complicated to implement, since you're now doing head/tail index cursor tracking. Flat, boring offset/limit is amazingly simple for the happy lazy path, which is probably fine for most apps.
If you need to iterate over all records, just do it? Why do you need offset?
Otherwise, using offset is usually an OK idea, because users will very rarely inspect page #2153. They're interested in page 1, sometimes page 2. limit/offset works fine for those cases, and it'll work for page 2153 for those who visit it once in a decade. Using ids makes the logic to track prev/next/page numbers incredibly complex, and generally you don't need it.
> If you need to iterate over all records, just do it?
Who is "you" here?
Usually what happens is that party A builds a REST API (or other connectionless protocol) for fetching a list of some entity-type; and they limit the number of items that can be fetched in one request (because they have a billion such entities, and they don't want to even try to imagine what kind of system would be necessary to generate and stream back a 5TB JSON response); which implies pagination, to get at "the rest of" the entities.
Then party B, a user of this API, decides that they want to suck party A's whole billion-entity database out through the straw of that REST API, by scraping their way through each page.
> it'll work for page 2153 for those who visit it once in a decade
To be looking at page 2153, the user probably first visited every page before 2153. Which means they didn't do one O(N) query (which would by itself be fine); but rather, they made O(N) requests that each did an O(N) query.
That is usually a non-issue. The cost of the DB operations is usually much more relevant than that.
When people do actually care about full enumeration and uniqueness of the items they are querying, "pagination" itself tends to be too messy a concept.
What a "page" means when enough things appear between them that they push stuff a page away? Are those things reordered? Do people expect to see the new things before they get to the old?
A "page" is a very ill-defined thing that can only exist if your stuff is mostly static. Queries are very different on dynamic content.
You're overthinking (overquestioning) it. When a user hits "next", they want to see the next N posts from where they're at, in the order they chose before, that's it.
Since there's still no evidence of a mess, I believe you're projecting it from an overthought angle that isn't real.
Off Topic: "Laying out icons on a grid should be an inherently linear operation"
It doesn't seem to be mentioned in the HN thread, but the cause here is probably the same O(n^2) thing: sorting. Laying out icons is only linear if the icons are read in the order they're placed. It's been a long time since I used Windows regularly, but my memory is that the default placement is by creation time. So if they're read off disk by filename (or some sort of hash), they'd need to be sorted by timestamp.
Even if the disk sorts directory entries by file name and you want to show them sorted by file name, chances are you have to sort.
Reasons? Firstly, you’d have to know the entries are sorted. For that, you need an API that tells you that or hard-code information about file systems in your code. They may exist, but I’m not aware of any file system API that provides that information.
Secondly, the file system may not return the names sorted in the locale you want to sort them in.
Thirdly, the sorting code used in the file system may contain a bug. Once file systems are out there, you can't fix them (this happened in one of Apple's file systems; HFS, IIRC).
Lastly, modern GUIs tend to sort file names containing numbers non-alphabetically, so that, for example “file 2.jpg” gets sorted before “file 12.jpg”.
So, I think it’s easier to always sort. I would pick an algorithm that works well when items are mostly sorted at the start, though.
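For the "file 2.jpg" before "file 12.jpg" point, the comparator usually ends up looking something like this; a small sketch assuming glibc's strverscmp() (a GNU extension) is available:

```c
#define _GNU_SOURCE          /* for strverscmp() */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* qsort comparator: "natural" order, so "file 2.jpg" < "file 12.jpg". */
static int natural_cmp(const void *a, const void *b)
{
    const char *sa = *(const char *const *)a;
    const char *sb = *(const char *const *)b;
    return strverscmp(sa, sb);
}

int main(void)
{
    const char *names[] = { "file 12.jpg", "file 2.jpg", "file 1.jpg" };
    size_t n = sizeof names / sizeof names[0];

    qsort(names, n, sizeof names[0], natural_cmp);

    for (size_t i = 0; i < n; i++)
        puts(names[i]);      /* file 1.jpg, file 2.jpg, file 12.jpg */
    return 0;
}
```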
This made me recall modern AI and the issue with quadratic complexity in its transformers. Ooof! A breakthrough here would be a true Breakthrough™ with remarkably larger context sizes. Like it would barely even be a limit anymore and be transformative (har har) to what they can be used for.
True but I think the real cause of this is surely that C makes it too hard to use a sorting library that someone competent has written. I would not be surprised if the author was fully aware of the N^2 complexity but opted for a simpler implementation anyway.
(More realistically, people below are discussing that in the kernel environment the set of standard or third-party libraries available may be unavoidably limited.)
There is a quirk (almost a bug?) in FreeBSD's qsort where it will switch to insertion sort for the whole array under certain conditions, which we hit in production given how our data was arranged.
I like how someone felt the need to write out insertion sort in some 4 line code golf challenge in the midst of qsort. This right here is why no one wants to deal with C anymore.
Did a bit of digging and found that there used to be a comment for why it was done, but it got removed [0] when they switched to the implementation from Bentley & McIlroy's "engineering a sort function" [1] around 1992.
qsort() is pretty slow if you're sorting something like long[]. In that case, radix sort goes 5x faster and vectorized quicksort goes 10x faster. If you're sorting int[] then vectorized quicksort goes 25x faster than qsort(). Nothing goes faster. The issue is it's a big ugly C++ monstrosity that's a huge pain to compile.
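For anyone curious what the radix-sort alternative looks like, a minimal, untested LSD sketch for uint64_t (signed values would need the sign bit flipped first). It is a plain illustration of why no comparator calls are needed, not the vectorized code being described:

```c
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* LSD radix sort: 8 passes over 8-bit digits, each pass a stable
 * counting sort.  Linear in n, and no comparison callbacks at all. */
void radix_sort_u64(uint64_t *a, size_t n)
{
    uint64_t *tmp = malloc(n * sizeof *tmp);
    if (tmp == NULL)
        return;                    /* caller could fall back to qsort() */

    for (int shift = 0; shift < 64; shift += 8) {
        size_t count[256] = { 0 };

        for (size_t i = 0; i < n; i++)        /* histogram this digit */
            count[(a[i] >> shift) & 0xff]++;

        size_t pos = 0;                       /* exclusive prefix sums */
        for (int b = 0; b < 256; b++) {
            size_t c = count[b];
            count[b] = pos;
            pos += c;
        }

        for (size_t i = 0; i < n; i++)        /* stable scatter */
            tmp[count[(a[i] >> shift) & 0xff]++] = a[i];

        memcpy(a, tmp, n * sizeof *a);
    }
    free(tmp);
}
```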
That's fair if the constant factor is relevant, but if bubble sort is terminating in any reasonable timescale then the difference between qsort, C++ std::sort, and a custom implementation is really not a factor.
A lot of languages now have common tooling (cargo for rust, pip for python, etc) which makes it easier to find and incorporate the libraries you want. Apparently there are tools like https://conan.io/ but they're not as widely-adopted.
C's build system is similarly non-uniform. Many packages use Makefiles, others use different build mechanisms.
C has no universally-agreed error handling mechanism or convention. Whether exceptions or golang's error interface, you can generally assume that a random package in most languages will handle errors the way you expect. In C it's a lot more varied.
Similarly memory allocation - sometimes in a larger application you want to customize how malloc and friends work. How and whether you can do that for a C package is non-uniform.
Mind you, the C standard library has a qsort() function which will have sensible big-O behavior on pretty much any platform. I suspect this specific problem has more to do with this being kernel-mode code, which has a lot of special conditions.
I am always amazed by arguments that say that not having a language package manager like cargo or pip makes it hard.
Really? Is it really considered hard to link a library without them? Am I so old that somehow I grew up in a world where linking a library was not considered black magic?
The first issue is actually downloading the dependencies; doing this manually quickly becomes infeasible for any non-trivial project.
The second issue is keeping everything updated, and making sure that all packages are compatible with all other packages. Doing this manually is also not easy.
With C specifically, you need to wrangle different build systems, and once you have them built and "installed", you need to figure out which linker and compiler flags are needed to consume the libraries.
If you are working on a small project with maybe a few dependencies you can do this by hand, but when you get to say, 15 dependencies, it quickly becomes very difficult.
You can use the system package manager on some systems to install libraries I guess (assuming it has the packages and versions that you need), in this case manually managing things could be a lot easier, but you still should be using pkg-config for portability purposes.
But none of that supports the assertion that C makes it hard to use good libraries. You can even use libraries not written in C if you want.
If the argument is really "it's impossible to make a good library in C", that's different. I'd very much disagree with that, but it would be to the point.
I'm saying "it is harder to consume good libraries in C, because it is harder to find them & harder to build them; and once you have done both, you find that good library A and good library B work in very different ways, so you have to do more work to adapt".
And I haven't mentioned the lack of a strong universal string type, the way many libraries typedef their own family of different kinds of integer, the way one library will require you to free() a returned structure and another will require you to call my_library_free()...
It all adds up to additional friction.
You don't have to agree! Maybe I am out of date, I haven't really dealt with this since the mid 2000's. I'd be thrilled to hear this isn't an issue any more.
It's not really a matter of whether or not I agree. I was just trying to understand what the assertion was!
I was baffled by the notion because I couldn't think of anything inherent in the language that made it hard to use good libraries. Now I understand that's not really what the assertion was.
The original assertion was about difficulty of using C libraries in the kernel or bootloader. In the bootloader you're the OS. There's no file system, no dynamic linker, and no processes. There's no guarantee some third party library will work in that environment. It might but it's work to make sure it does or adapt it if it doesn't.
Let's say you want to develop a CLI tool in C for crawling a website's sitemap.xml as advertised by the website's robots.txt.
How would you approach this development in C?
With e.g. Java, Javascript, PHP, and Python it's clear to me.
There's no standard build system. Think about how you add a dependency in Rust, Go, JavaScript, even Python.
Now do it in C/C++. Absolute nightmare.
Look at how many header-only C++ libraries there are out there. They're making compilation time drastically worse purely to avoid the need to faff with a shitty build system. It's a good trade-off too. I use header-only libraries where possible. They're convenient and popular.
Actually vcpkg seems to be going some way to fixing that but I have yet to convince my co-workers to use it.
> to avoid the need to faff with a shitty build system.
Then maybe don't use a shitty build system?
It's true, C is not trying to be a programming environment or tech stack. It's a language, that's it. Whether or not that's desirable depends on what you're trying to do, it's not something that is good or bad in some absolute sense.
You have your choice of build systems, so pick one that meets your needs.
Vcpkg isn't for me, either, because it doesn't solve any problem I have. If it does for you, awesome!
Then build them. I'm not seeing the issue here, to be honest, so I'm not sure what I should be addressing.
If the issue is that you don't like how the dependency has arranged to build (I'm not sure why you'd actually care, but just in case...), then port the makefile (or whatever) to your preferred system.
Or, another guess, is the issue that you want to build all your dependencies as if they were an integral part of your own project? If that's the case, I would argue that you're doing it wrong and another tech stack would make you happier.
With what? The whole point of this line of discussion is that building stuff in C is suboptimal, often requiring the end user to juggle n different build systems by hand to get all the dependencies installed.
Since we're talking FreeBSD here, thirty years ago we had Modula-3 whose build system is lightyears ahead of anything make based.
And now we're back at: this is why C is more difficult, for no good reason, than the alternatives. For small projects that you're only building locally, your approach may work. Once you expect any sort of cross-platform compatibility and/or start distributing your project to others, this approach falls apart and fast. Instead of focusing on development you'll be focusing on playing whack-a-mole with incompatibilities between the different shells (or versions of make, etc.) offered on different operating systems.
Again the problem is that no package/dependency management for C/C++ means that C is balkanized and more difficult to deal with compared to other languages. Using third party libraries in C is far more difficult and error prone than it ought to be.
Pretty much every language that provides more comprehensive dependency management also provides easy enough access to allow you to NIH/DIY it if you need/want to. For instance contrast this with something like rust where cargo is standard across all supported platforms and provides proper dependency management. You can still call the rust compiler (rustc) and whatever linker yourself if you so desire.
Honestly if you lack the vision to see why that sucks I really suggest you try a language with good package management - e.g. Go or Rust. Maybe it will make it easier to see all the problems with the traditional C approach if you can see how it should work.
C and C++ dependency installation is "sub-optimal" but it is certainly well understood. The fact that there are many cross platform, open source projects written in C and C++ proves this. If you can't deal with makefiles and scripts, maybe stick to Go and Rust?
> C and C++ dependency installation is "sub-optimal" but it is certainly well understood. The fact that there are many cross platform, open source projects written in C and C++ proves this.
Nope. It proves that people have found work arounds up to and including cargo culting things. Any large enough project is going to rely on something else to generate the makefiles, and if you depend on anything large enough you're gonna get stuck having to build and/or install whatever other makefile generators are required. Simply put it's an archaic mess.
> If you can't deal with makefiles and scripts, maybe stick to Go and Rust?
To be clear, it's not that I can't deal with makefiles; it's that I don't find it a good use of my time*. Take, for instance, the FreeBSD ports tree. make(1) is its Achilles heel and the main reason it's obscenely slow to deal with (even compared to something like portage). Besides, doing anything by hand is inherently error prone, be it bounds-checking array access or cobbling together dependency management.
* And, sure, in years past I wrote a drop-in replacement for automake/autoconf in perl whose big (speed) advantage was that it didn't spawn a new shell for each check. It was a neat hack, but that's all anything papering over the make(1) interface is.
C can’t be used to implement a good library for many problems due to being inexpressive. For example, you can’t write an efficient, generic vector data structure, but sort functions are also only fast due to the compiler’s smartness — passing a function pointer would be strictly slower.
Though this has not much relevance here as it is about assembly.
C arrays suck on the best of days and qsort requires that you understand function pointers and types which are some of the hairiest C syntax in common use. The C Clockwise Spiral rule is truly special.
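For reference, the hairy bit being complained about is roughly this; a minimal sketch of the standard qsort(3) call with a comparator:

```c
#include <stdio.h>
#include <stdlib.h>

/* qsort hands the comparator two `const void *`s; casting them back to
 * the real element type is entirely on you. */
static int cmp_int(const void *a, const void *b)
{
    int x = *(const int *)a;
    int y = *(const int *)b;
    return (x > y) - (x < y);   /* avoids the overflow `x - y` can hit */
}

int main(void)
{
    int v[] = { 42, 7, 19, 3, 23 };
    size_t n = sizeof v / sizeof v[0];

    /* Element count, element size and the comparison function are all
     * passed explicitly; nothing is inferred from the array's type. */
    qsort(v, n, sizeof v[0], cmp_int);

    for (size_t i = 0; i < n; i++)
        printf("%d ", v[i]);
    putchar('\n');
    return 0;
}
```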
It's easy to lose sight of the climb once you're at the top.
This actually made me laugh out loud. Yes, if your problem with C is that it doesn't need a package management mechanism like some other languages, then C is clearly not for you. But C is very far from the only language like this.
It's a bit like criticizing a fish for having no legs.
I sense that we have some sort of real miscommunication going on here, because the only response I can think of to
> I have no idea why you would think that it doesn't need one.
Is that I have no idea why anyone would think that it does need one.
Perhaps the disconnect is that you are wishing C addresses different use cases than it addresses? That you wish it were a different language? If so, that's fine. Use a more appropriate language for your task. I just find it odd if the criticism of C is that it isn't a different kind of language.
That's suboptimal for many reasons. Package names are not consistent. Installing multiple versions is usually impossible. Huge pain for library authors. Doesn't integrate with build systems usually (even basic pkg-config support is iffy). The command is OS-specific. You can't usually choose the linking method. Difficult to bundle dependencies. Usually out of date. Etc. etc.
A lot of these comments are like "C is hard if you don't know C and it scares you".
Not to mention kernel mode doesn't want a gigantic library package manager to pull in leftpad() from the internet. As mentioned, the kernel libraries on FreeBSD have a qsort, but they didn't in the original commit from 3 decades ago or whatever.
Annoyance #1: External symbols are expected to be unique within 31 characters according to the standard, so you're limited to a few namespace "levels", at most.
curl_easy_option_by_name() is already up to 24 characters and there's only two "levels" in curl_easy_*.
Annoyance #2: There's no formal registrar for LIBNAME. This isn't a big deal for popular libraries, but it's a pain having to keep a locally-modified copy of a less popular dependency just because it shares its name with another less popular dependency.
Annoyance #3: LIBNAME_actual_function_name() is a pain to read and using either the preprocessor or static inlines to locally alias function names for the sake of readability is silly.
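Concretely, the local aliasing being called silly looks something like this (libfoo and the function names are hypothetical):

```c
/* Hypothetical library with prefixed names. */
int  libfoo_context_create(void);
void libfoo_context_destroy(int ctx);

/* Option 1: preprocessor aliases, purely textual. */
#define ctx_create  libfoo_context_create
#define ctx_destroy libfoo_context_destroy

/* Option 2: static inline wrappers, type-checked but more boilerplate. */
static inline int  create_ctx(void)     { return libfoo_context_create(); }
static inline void destroy_ctx(int ctx) { libfoo_context_destroy(ctx); }
```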
@1: The limits are considered obsolete since C89 and implementations are encouraged to avoid them whenever possible. I think that, the same way we disregard non-ASCII character sets, non-two's-complement encodings, and similar, we are safe in assuming a sane implementation will be able to handle longer names. And being honest, if a given implementation isn't capable of more than 31 significant characters, then it would have problems with namespaces too.
@2: Agree, although I don't recall this ever happening to me.
@3: Is it? How is libname::actual_function_name() much better?
I actually like to use libname__actual_function_name(), as it further separates "namespace" from function name (unless we need compatibility with C++, as IIRC it reserves all double underscores, not only at the beginning).
> @1: The limits are considered obsolete since C89 and implementations are encouraged to avoid them whenever possible.
This is still the case in C11, Section 5.2.4.1. Did this change in the most recent standard?
> @2: Agree, although I don't recall this ever happening to me.
It happened to me once. I ran across a library from 1994 and another from the 2010s which shared a simple name like "libamc". I'll comb through my records later to figure out the actual name.
> @3: Is it? How is libname::actual_function_name() much better?
It's not, but I wasn't thinking of C++ specifically. (I don't know C++. I've somehow managed to avoid it in many years of writing C.)
I was thinking more like the file-local namespace clobbering offered by Python e.g., from LIBNAME import actual_function_name.
> This is still the case in C11, Section 5.2.4.1. Did this change in the most recent standard?
Well, no, it's still marked "just" obsolete. For it to be deprecated or removed, there would need to be somebody who cares about it enough to put in some work. But since it doesn't affect vendors at all (they can just ignore it) and users don't complain, it's just a forgotten "law" - still law, but a dead one.
> I was thinking more like the file-local namespace clobbering offered by Python e.g., from LIBNAME import actual_function_name.
Oh, that's... way more than just namespaces. Way more. That would require more fundamental changes and additions. C++ just added modules in C++20 (not sure how well those will catch on), but I don't think something like that is to be expected in C for feasible future.
They used "_Generic" as a keyword but it doesn't really do that.
Suppose I need to define a copy assignment operator for the library's sort function to use. Is there a good way to overload it? Can the library know what the size of each element is based on its type without having to pass it in as a parameter?
You can pass function pointers to the library, but that quickly becomes awful.
Or, you get one function that takes all of the arguments and have to define and pass in a bunch of function pointers and type size parameters that are each an opportunity for bugs or UB in order to sort a simple array of integers.
If my type needs a custom assignment operator, I need each library I use to take that as an argument. One expects the function pointer to take the arguments in the order (src, dst), another as (dst, src), a third specifies the return value as int instead of void, a fourth takes the source argument as "void *" instead of "const void *" in case you want to implement move semantics and a fifth doesn't support custom assignment operators at all.
It's no surprise that people prefer to avoid this.
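To make that friction concrete, here is a sketch of what such a hypothetical "generic" sort API forces on the caller just to sort plain ints (lib_sort and the callback shapes are invented for illustration):

```c
#include <stdlib.h>
#include <string.h>

typedef int  (*cmp_fn)(const void *a, const void *b);
typedef void (*copy_fn)(void *dst, const void *src);   /* or is it (src, dst)? */

/* Hypothetical "generic" sort: element size, comparator and copy routine
 * all have to be supplied by the caller; nothing comes from the type.
 * (Implemented here as a simple insertion sort through the callbacks.) */
static void lib_sort(void *base, size_t nmemb, size_t size,
                     cmp_fn cmp, copy_fn copy)
{
    char *a = base;
    void *key = malloc(size);
    if (key == NULL)
        return;
    for (size_t i = 1; i < nmemb; i++) {
        copy(key, a + i * size);
        size_t j = i;
        while (j > 0 && cmp(a + (j - 1) * size, key) > 0) {
            copy(a + j * size, a + (j - 1) * size);
            j--;
        }
        copy(a + j * size, key);
    }
    free(key);
}

/* Caller side, just to sort plain ints: */
static int int_cmp(const void *a, const void *b)
{
    int x = *(const int *)a, y = *(const int *)b;
    return (x > y) - (x < y);
}
static void int_copy(void *dst, const void *src) { memcpy(dst, src, sizeof(int)); }

void sort_ints(int *v, size_t n) { lib_sort(v, n, sizeof *v, int_cmp, int_copy); }
```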
Having said that, this specific sort is somewhere deep in kernel boot land, and kernel code can't really use the standard library. I am not sure if there is a standard kernel sort.
> C makes it too hard to use a sorting library that someone competent has written.
Rust makes it easy to create generic data structures via generics. It would be trivial to swap between different underlying sorting algos as long as it was given a type-compatible function for how to compare the contained items.
So a few milliseconds in total. Big whoop. OpenBSD's ahci(4) driver stalls 5-6 seconds for each SATA device on the bus, just to make sure the device really is there and that the link speed is correct...or something. My two OpenBSD machines, which incidentally have 3 SATA drives each, spend almost 20 seconds on that one segment alone during kernel startup.
Colin (and others?) have been working to reduce the boot time of FreeBSD for a while now. At this point shaving off a few milliseconds is probably a nice win for them.
I mean, I get it. But shaving 4.5ms does seem to fall into the realm of "not most people's problem"?
Note that I want to stress that that is no reason for some folks to stop focusing on it. And kudos on the improvement. Criticism, if any, here is that statistics once again rears its head to make something feel bigger by using a percent.
Sorta. They have to rely on this on a repeated basis such that it matters. And there, if startup time is still 10x this, roughly, why push for solutions that require that many kernel startups?
That is silly. I already said I do not intend this as a criticism of the folks working on it. Kudos on the improvement.
That said... for most people, you are better off learning to use threads or processes. It is almost certainly less effort than spinning up a VM. Not to mention everything else you will, by necessity, have to do if you are using VMs for everything. Unless you find ways to do shared network connections across VMs and such, but... I'm curious how we can make things closer to threads, without eventually bringing all of the dangers of threads along with it?
Why make "for most people" points when we're already talking about FreeBSD?
There's a sort of uni-/light-kernel race quietly happening right now since the industry doesn't really fully trust process isolation within an OS for running completely untrusted code. At least not as much as it trusts VT-X/AMD-V. In that race, kernel startup time is just like JVM startup time, or Julia startup time, or Python startup time, all of which are things that people work on shaving milliseconds from.
Ouch. That feels even harsher to the idea than what I'm saying. :D
To your point, my "for most people" is aimed at this forum, not the folks doing the work. It is incredibly cool that the folks doing this are also on the forum, but I think it is safe to say they are not most of the folks on here. Is that not the case?
Say the post had been "How Netflix uses FreeBSD to achieve network latencies less than 10 microseconds" or something. How helpful would it be to comment about how "for most people" this doesn't matter?
Did anybody who read the tweet think it mattered to them when it didn't? It mentions Firecracker explicitly. How many people on HN do you think upvote this because they themselves are running FreeBSD on Firecracker, versus the number who upvoted because it's just interesting in and of itself?
Maybe? Having worked in way too many teams that were convinced we had to aim at some crazy tech because someone else saw "50% increase in throughput" on some tech stack; I am happy to be the one saying to keep perspective.
Though, I will note that you picked a framing with absolute timing here. That is the root criticism here. If the headline had been "team saved 2ms off 28ms startup routine," it would still be neat and worth talking about. It would probably not have impressed as many folks, though. After all, if they save 1% off training time on a standard ML workload, that is way way bigger. Heck, .07% is probably more time.
I'm reminded of a fun discussion from a gamedev, I think it was Jonathan Blow, on how they thought they were smarter than the ID folks because they found that they were not using hashes on some asset storage. He mused that whether or not he was smarter was not at all relevant to that, as it just didn't matter. And you could even make some argument that not building the extra stuff to make sure the faster way worked was the wiser choice.
If we didn't have you here we might have done the terrible mistake of thinking this was interesting. I'm so glad you were here to show us how useless this effort is, protecting us from wasting our time, protecting us from our lack of judgment. Thank you so much for your service and foresight.
I mean... sorta? Yes, it is the point. But, the goal is to make it so that you can do this scale up/down at the VM level. We have been able to do this at a thread or even process level for a long long time, at this point.
4.5ms on what hardware, in what scenario? Would I like to save 5ms off the startup time of my VMs? You betcha. Does that 5ms turn into 200ms on a low power device? Probably
How would you even discern this delay from all the other delays happening before you're done booting, when there are so many other natural variances of a couple of milliseconds here and there? Every "on metal" boot is unique, no two are exactly the same speed, and certainly never within 1.98 milliseconds of each other, even in the case of a VM with lots of spare resources on the host. You're painting too pretty a picture of this.
Right, but again pointing this at most people on this forum, that answer is probably the same. Very few of us are in a situation where this can save seconds a year, much less seconds a month.
For the folks this could potentially save time, I'm curious if it is better than taking an alternative route. Would be delighted to see an analysis showing impact.
And again, kudos to the folks for possibly speeding this up. I'm assuming qsort or some such will be faster, but that isn't a given. For that matter, if it is being sorted so that searches are faster later, then the number of searches should be considered, as you could probably switch to sentinel scanning to save the sort time, and then you are down to how many scans you are doing times the number of elements you are looking against.
Agreed, but my math also says that if this is your pain point, you'd save even more time by not firing off more VMs? That is, skip the startup time entirely by just not doing it that often.
I don't startup VMs multiple times per hour, much less per minute, so I don't assume to know what tradeoffs the people using firecracker are making when deciding how often to startup VMs.
Fair. I still feel fine pushing back on this. The amount of other resources getting wasted to support this sort of workload is kind of easy to imagine.
I will put back the perspective that they were not hunting for 2ms increases. They were originally chopping off at seconds on startup. The progress they made is laudable. And how they found it and acted on it is great.
Ha, I thought I saw 4.5ms from another post. Not sure where I got that number. :(
Realizing this is why someone said the number was more like 2ms. I don't think that changes much of my view here. Again, kudos on making it faster, if they do.
Colin's focus has been on speeding up EC2 boot time. You pay per second of time on EC2. A few milliseconds at scale probably adds up to a decent amount of savings - it's easily imaginable that it's enough to cover the work it took to find the time savings.
Yes, you pay per second. But, per other notes, this is slated to save at most 2ms. Realistically, it won't save that much, as they are still doing a sort, not completely skipping it, but let's assume qsort does get this down to 0 and they somehow saved it all. That is trivially 500 boots before you see a second. Which... is a lot of reboots. If you have processes that are spawning off 500 instances, you would almost certainly see better savings by condensing those down to fewer instances.
So, realistically, assuming that Colin is a decently paid software engineer, I find it doubtable the savings from this particular change will ever add up to mean anything even close to what it costs to have just one person assigned to it.
Now, every other change they found up to this point may help sway that needle, but at this point, though 7% is a lot of percent, they are well into the area where savings they are finding are very unlikely to ever be worth finding.
Edit: I saw that qsort does indeed get it into basically zero at this range of discussion. I'd love to see more of the concrete numbers.
Missing the point, which is to shave off cold start times for things like Lambdas, where shorter start times means you can more aggressively shut them down when idle, which means you can pack more onto the same machine.
See, the binpacking that this implies for lambda instances is insane to me. Certainly sounds impressive and great, and they may even pull it off. And, certainly, for the Lambda team, this at least sounds like it makes sense. I'll even root for them to hope it works out.
It just reminds me of several failed attempts to move to JVM based technologies to replace perl, because "just think of the benefits if we get JVM startup so that each request is its own JVM?" They even had papers showing resource advantages that you would gain. Turned out, many times, that competing against how cheap perl processes were was not something you wanted to do. At least, not if you wanted to succeed.
A realization here is that while reaching cold start times that allow for individual requests would be awesome, you don't need to reach that to benefit from attacking the cold start time:
Every millisecond you shave off means you can afford to run closer to max capacity before scaling up while maintaining whatever you see as the acceptable level of risk that users will sometimes wait.
Of course if the customer stack is slow to start, the lower bound you can get to might still be awful, but you can only address the part of the stack you control.
I think my argument is easily seen as this, but that is not my intent with "argument." As you say, and if I'm not mistaken, in this case they only "cared" about this because it was easy and "in the way," as it were. I don't think it was a waste of time to pick up the penny in the road as you are already there and walking.
What I am talking towards, almost certainly needlessly, is that it would be a waste of most people's time to profile anything that is already down to ms timing in the hopes of finding a speedup. In this case, it sounds like they were basically doing final touches on some speedups they already did and sanity testing/questioning the results. In doing so, they saw a few last changes they can make.
So, to summarize, I do not at all intend this as a criticism of this particular change. Kudos to the team for all the work they did to get so that this was 7% of the remaining time, and might as well pick up the extra gains while there.
I've spent plenty of time optimizing for milliseconds (for very high hourly rates, not on staff, so my work has a very clear cost and ROI) and I operate probably at least a dozen layers above the OS.
Modern CPUs execute a few million instructions per millisecond.
I think people who operate at a level where microsecond and nanoseconds matter may see it as dismissive to question gains of milliseconds.
I'm sure that happens. And I'm sure they do. I'm also equally sure in thinking that the vast majority of the interactions I have, are with people that are not doing this.
You can think of my "argument" here as reminding people that shaving 200 grams off your bicycle is almost certainly not worth it for anyone that is casually reading this forum. Yes, there are people for whom that is not the case. At large, the number of such people is countable.
And I can't remember if it was this thread or another, but I didn't really intend an argument. Just a conversation. I thought I led off with kudos on the improvement. Criticism would be of the headline, if there is any real criticism to care about.
I can't really agree. I think it's a sad indictment of this field that we so easily will throw away significant efficiencies - even if it's only a few milliseconds here or there.
Software today is shit. It's overly expensive, wastes tons of time and energy, and it's buggy as hell. The reason that it's cheaper to do things this way is because, unlike other engineering fields, software has no liability - we externalize 99.99% of our fuckups as a field.
Software's slow? The bar is so low users won't notice.
Software crashes a lot? Again, the bar is so low users will put up with almost anything.
Software's unsafe garbage and there's a breach? Well it's the users whose data was stolen, the company isn't liable for a thing.
That's to say that if we're going to look at this situation in the abstract, which I think you're doing (since my initial interpretation was in the concrete), then I'm going to say that, abstractly, this field has a depressingly low bar for itself when we throw away the kind of quality that we do.
But... this is precisely not a significant efficiency. At best, you can contrive an architecture where it is one. But, those are almost certainly aspirational at best, and come with their own host of problems that are now harder to reason about, as we are throwing out all of the gains we had to get here.
I agree with you, largely, in the abstract. But I'm failing to see how these fall into that? By definition, small ms optimizations of system startup are... small ms optimizations of system startup. Laudable if you can do them when and where you can. But this is like trying to save your way to a larger bank account. At large, you do that by making more, not saving more.
A 7% improvement from a trivial change is an insane thing to question, honestly. It is absolutely significant. Whether it is valuable is a judgment, but I believe that software is of higher value (and that as a field we should strive to produce higher quality software) when it is designed to be efficient.
> At best, you can contrive an architecture
FaaS is hardly contrived and people have been shaving milliseconds off of AWS Lambda cold starts since it was released.
> But I'm failing to see how these fall into that?
> The improvement in speed from Example 2 to Example 2a is only about 12%, and many people would pronounce that insignificant. The conventional wisdom shared by many of today's software engineers calls for ignoring efficiency in the small; but I believe this is simply an overreaction to the abuses they see being practiced by pennywise-and-pound-foolish programmers, who can't debug or maintain their "optimized" programs.
> In established engineering disciplines a 12 % improvement, easily obtained, is never considered marginal; and I believe the same viewpoint should prevail in software engineering.
You have two framings here that I realize are not mine.
First, I am not arguing to not make the change. It was identified, make it. As you say, it is a trivial one to do. Do it.
Second, thinking of percentage improvements has to be done on the total time. Otherwise, why is it not written in much more tightly tuned assembly? After all, I am willing to wager that they still don't have the code running within 7% of the absolute optimum speed it could be going. Heck, this is a program that is post-processing a list. They already have a way that could get the 40us completely gone. Why stop there?
If I am to argue that anything here would have been a "waste" and shouldn't have been done, it would be the search for a 2ms improvement in startup. But, again, that isn't what they were doing. They were shaving off seconds from startup and happened to find this 2ms improvement. It made headlines because people love pointing out poor algorithmic choice. And it is fun to muse on.
This would be like framing your wealth as just the money you have in your wallet as you go to the store. On the walk, you happen to see a $5 bill on the ground. You only have $20 with you, so picking it up is a huge % increase in your wealth. Of course you should pick it up. I'd argue that, absolutely considering it, there is really no reason to ever not pick it up. My "framing" is that going around looking for $5 bills to pick up from the ground is a waste of most people's time. (If you'd rather, you can use gold mining. There was a great story on the radio recently about people who still pan for gold. It isn't a completely barren endeavor, but it is pretty close.)
Look, from a utilitarian perspective, let's say we want to optimize the total amount of time globally.
Let's say FreeBSD dev spends 2 hours fixing a problem that gives a 2ms improvement to the boot time.
Let's assume conservatively that FreeBSD gets booted 1000 times every day globally. That's 2 seconds per day. Scale that to 10 years and you break even.
Nobody cares about "your bicycle" or your internal CRUD app that has maybe 30 users, but FreeBSD is widely used and any minor improvements on it can be presumed to be worth the time fixing.
Maybe you have an axe to grind on the topic of premature optimization, but this FreeBSD fix doesn't seem to be the showcase of developers wasting their time doing meaningless optimizations.
This is silly, though. For many reasons. First, I have no axe to grind. They found a trivial change that can be made and made it. Probably even simplified the code, as they used a better search that was introduced later in the codebase. Kudos on that!
Second, it also clearly took them more than 2 hours. They've been working at making it faster for many days at this point. So, it will take them a long long time to realize this particular gain.
Finally, my only two "criticisms" this entire time were on calling this a 7% improvement and claiming that searching for this would be a waste of most teams' time. Consider: from that headline, most folks would assume that they can get their EC2 instance that is running FreeBSD 7% faster. But they clearly won't be able to. They won't even get a lambda invocation 7% faster. Nothing anyone anywhere will ever see will be 7%. They may see something 2ms faster, and that is, itself, awesome. And my claim that this would be a waste of time to search for is irrelevant for this team. They weren't looking for this particular change to make, of course, so this "criticism" isn't relevant to them. At all.
Just to be clear, I mean 'enough saved across all pay-per-second ec2 users', not a given specific account or user, which, yeah, it's probably minimal. The scale of lambda and ec2 is...enough to make any small change a very large number in aggregate.
Right, but this is akin to summing up all of the time saved by every typist learning to type an extra 5 words a minute. Certainly you can frame this in such a way that it is impressive, but will anyone notice?
I think the focus here was for firecracker VMs used by lambda? If you're paying 2ms on every invocation of every function that'll add up. OTOH it seems like SnapStart is a more general fix.
If you boot OpenBSD thousands of times per day it adds up. I can imagine this being the case for people running big server farms or doing virtualization stuff.
I had it set up like this: some devices plugged into the PC directly, a KVM plugged into the PC, a hub plugged into the PC, the hub on the monitor plugged into that hub, and a bunch of things plugged into the hubs.
Half of the devices were just powered by USB and not using any data, but only needed to be ON when the PC is ON.
The way it worked:
- FreeBSD would scan all USB devices
- get held up on some for no reason
- find a hub
- scan everything on a hub
- find another hub
- scan everything on that hub as well
Anytime during the scan it could run into a device that takes longer (no clue which or why, only happened on FreeBSD and never on Linux or Windows)
I'm sure windows and linux did the same thing, but both happily continued booting while FreeBSD waited and waited and waited and waited.
Bias light on a monitor would sometimes completely stop the process until unplugged.
USB stack on FreeBSD is forever cursed, I still remember kernel panics when you unplug a keyboard or a mouse.
> I'm sure windows and linux did the same thing, but both happily continued booting while FreeBSD waited and waited and waited and waited.
My understanding is FreeBSD will by default wait until it has found all the fixed disks before it figures out which one to mount as root. This could be a USB drive, so it needs to enumerate all the USB devices first. If you don't need that, adding hw.usb.no_boot_wait=1 to /boot/loader.conf.local or thereabouts will skip that wait and speed up the boot process a lot (especially on hardware where USB is slow to finish enumeration).
It adds 2ms to the boot time on modern hardware, and the list it is sorting has grown by about 2 orders of magnitude since it was introduced. Seems reasonable to me that it was unnoticed.
I am currently at work fighting the same class of problem. Bad pattern was used a few times, seemed to work. So it was used twenty times, still no problems. Now it’s used 100 times and it was 20% of response time. Got that down to half in a few months, now it’s a long slog to halve it again. It should never have been allowed to get past 3% of request startup time.
Likely because none have the required proficiency. I certainly don't. "So? It's open source. Stop complaining and fix it." is never a good response to a bug report. I know this quirky behavior has been brought up a few times by other users than me on the openbsd-bugs@ mailing list during the 10 or so years since I first observed it.
I don't know, but I don't think there's an actual case that Linux/FreeBSD/etc. happens to have a workaround for. I think it's just OpenBSD charging towards a windmill.
It's not impossible that it's some archaic and long-since-invalid legacy thing along the lines of "let's wait 5 seconds here just in case the drive hasn't managed to spin up, despite the computer having been powered on for 15 seconds already, because that happened once in 2003 on someone's 1989 HPPA system so we need to support that case". I'm not joking, this is really what I imagine it being about.
There were (or still are?) SCSI drives that would not spin up until told to over the bus. I think the idea is to not do them all at once since the motor draws a lot of power.
I'm fairly sure I've run into drives that would not respond on the bus while spinning up.
And if it happens with SCSI drives, it may happen with other types.
>"There were (or still are?) SCSI drives that would not spin up until told to over the bus"
Surely they'd spin up as soon as the BIOS started probing the bus to init the hardware, long before the kernel was running, or they'd be infamous for being "the HDDs you cannot boot an OS from"...
In my 25 years of using PCs I have not once come across a drive that did not spin up as soon as the computer was powered on. But whatever the case is, Linux and FreeBSD never had this behavior. Waiting some arbitrary amount of time isn't an appropriate solution (to what I insist is just an imagined problem), it's just a poor bandaid.
Spinning up only upon request is common behavior for SCSI drives. Per the Ultrastar DC HC550 manual I have handy,
> After power on, the drive enters the Active Wait state. The Drive will not spin up its spindle motor after power on until it receives a NOTIFY (Enable Spinup) primitive on either port to enter the Active state. If a NOTIFY (Enable Spinup) primitive is received prior to receiving a StartStop Unit command with the Start bit set to one, spin up will begin immediately. For SAS, this is analogous to auto-spinup function in legacy SCSI. This provision allows the system to control the power spikes typically incurred with multiple drives powering on (and spinning up) simultaneously.
> If a StartStop command with the Start bit set to one is received prior to receiving a NOTIFY (Enable Spinup), the drive will not start its spindle motor until Notify (Enable Spinup) is received on either port. Successful receipt of a NOTIFY (Enable Spinup) is a prerequisite to spin up.
Code in the SCSI controller boot ROM, if enabled, does typically handle this before the OS starts, often with an option to stagger spin-up for the reason mentioned above.
For what it's worth, to speed up boot, I typically disable the OPROM on any storage controller not hosting a boot device.
Moreover, as BIOS, UEFI, Power Mac, …, all require different, incompatible firmware bits, enabling controller firmware isn't an option for many otherwise compatible controllers.
Regardless, possible spin up delays in no way justify introducing explicit delays in device enumeration; if the OS needs to ensure a drive is spun up, multiple mechanisms exist to allow it to do so without introducing artificial delays (TEST UNIT READY, non-immediate START STOP UNIT, simply paying attention to additional sense information provided by commands known to fail before spin up and retrying as appropriate, etc.).
IIRC, explicit multi-second delays between controller initialization and device enumeration were traditionally introduced because some (broken) devices ignored commands entirely — and were therefore effectively invisible — for a significant period of time after a bus reset.
With that said, introducing a multi-second delay to handle rare broken devices is sufficiently strange default behavior that I assume something else I'm not aware of is going on.
> Surely they'd spin up as soon as the BIOS started probing the bus to init the hardware, long before the kernel was running, or they'd be infamous for being "the HDDs you cannot boot an OS from"...
Depends on the market they sell into, might not be a big deal for enterprise drives that you don't expect to boot from; especially if you're attaching them to a controller that you also don't expect to boot from. I had some giant quad processor pentium II beast after it was retired, and the BIOS could not boot from the fancy drive array at all; and even if it could, the drives were on a staggered startup system, so it would have had to wait forever for that.
Contrived case. Such a storage solution would still not be something that ahci(4) on OpenBSD - a basic SATA controller driver - could wrangle, no matter how many seconds it patiently waited.
Could be anything, but I wouldn't be surprised if it were some other bug leading the driver to conclude there are 65535 devices attached and it needs to probe each of them.
Well, for one, I'd imagine you can do all ports at once. Or start doing it and allow the rest of the drivers to initialize their stuff while it is waiting.
At least looking at my dmesg on Linux, all SATA drives get ready at roughly the same time: 0.6 s into boot the AHCI driver gets initialized, and 3.2 seconds into boot they are all up. It looks parallel too, as the device ordering does not match the log order.
The OS should save the serial numbers of all devices and, once they are all found, continue.
We should make it declarative, or whatever you want to call it.
Also, why can't the freaking boot process be optimized after the first boot?
BSD is essentially sorting the SAME thing every boot; THAT is ridiculous.
Sort once, save the order (list, sysinit000000000000), boot fast next time.
A hardware change, some other change, or a failed boot can trigger the sorting again for safety.
You know what you're booting into, so sort it once, then run it from the saved order next time. How many times do you change the hardware on a computer? And if you do, you can just restart with a grub flag, toggle a switch in the control panel before restart, etc.
The code to manage "do I sort or do I cache" is probably worse than the code to just sort it unconditionally.
And you really want to do this automatically, with no manual anything, because (among other reasons) if you need to swap a part on a broken machine, you do not want that to needlessly break the software. So you're sorting regardless, and so you might as well just be unconditional about it.
On my machine the TPM is checking the state of the machine for security reasons anyway, so if that runs regardless, why not use it for one more useful thing? ( It runs before the machine even thinks about booting anything. )
You can define a shutdown flag for a hardware change: shutdown -rF XD.
Some things are hot-swappable, and for those there already has to be logic (in the kernel) to know you have new hardware...
And if you have a total hardware failure, that can be detected too, on the next boot...
That's a nice way to add a fuckload of useless code while wasting memory and turning a currently read-only process into a read-write one (the kernel has nowhere to write a cache before userspace initializes and sets up mounts and filesystems).
Windows XP had a tool called BootVis[1] that would try to optimize the startup sequence. Supposedly newer versions of windows do this sort of thing automatically.
I suspect much of the delay comes from the nature of plug-and-play. The OS has to poll every bus and every port to find every device, then query those devices to determine what they are, then load a driver and actually configure them. It simply can't rely on the hardware being exactly the same from one boot to the next.
I'm saying you can have two pathways ( the decision of which route to take is almost zero cost ):
first - when you boot correctly, you save the information that you booted correctly;
second - if you don't find that information about a safe boot, you go the old route, the quicksort route.
( YES, you can do that in a way that doesn't create boot loops from reading "boot is OK" when it isn't. Just don't be lazy like the Linux kernel developers were when I told them this exact thing a few years back. )
I think the “Firecracker” referenced in this Tweet is Amazon's “microVM” environment[0]. VMs running in this environment are sometimes booted to service scheduled/batch jobs, but are, I think, more often launched in direct response to (and with the intent of serving) an incoming HTTP request (and then stick around for a few minutes afterwards in case there are any subsequent requests). Network latency from the continental US to the nearest AWS datacentre is typically <30ms, so even after you add the response time of the application running in the VM, an extra 2ms boot time still probably makes up >2% of the total cold response time.
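Spelling out the arithmetic behind that last claim (numbers rounded, and treating the 2 ms as part of the total):

```latex
\frac{2\ \mathrm{ms}}{T_{\mathrm{total}}} > 2\% \iff T_{\mathrm{total}} < 100\ \mathrm{ms}
```

so with roughly 30 ms of network latency, the claim holds whenever the VM boot plus application response adds less than about 70 ms.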
So wouldn't this fall under the category of "booting the kernel too much"? Why not find a way to keep already-booted kernels laying around and recycle them, rather than rebooting them afresh on every request? This pattern has already played out in other situations and been solved this way, such as with object pools or thread pools.
All that sounds like it adds much more complexity than swapping out bubble sort for qsort, and the other various optimizations that have been going in to reduce boot time. Plus optimizing common code improves boot time for everyone running FreeBSD - routers, firewalls, etc, whereas pooling just helps that one use case. If you can make it startup up fast enough that you don't need any of that extra machinery, why wouldn't you?
This is the "AWS Lambda cold start" use case: you're not doing this every time, but you are doing it on scale-up, initial deploy, etc. Depending on complexity, those costs start to hit your p99 as well as your p100.
Yes, Firecracker supports snapshotting for that reason. But you still have the famous "cold start" lag. Not to mention that, for some use cases, you'll potentially want to add restarts for security reasons (Firecracker is designed to be able to execute malicious code).
Having a microVM handle a rare endpoint and then exit is often cheaper than leaving an EC2 instance running constantly. An EC2 instance also has extra administration needs. At some scales the savings can be significant.
Shaving ms off the boot of that VM is a performance boost of that lambda (or whatever the cloud vendor calls it).
Maybe you're calling the serverless function from another serverless function in the same datacenter (or more simply, maybe you're instantiating a local firecracker VM to run some user-defined function within the context of your currently executing program), in which case the latency would be more on the order of 5-10ms.
But the question is: is this a solution for people running software on other people's computers, or a solution for running my own code on my own computer only?
Does it matter? Running kernel pieces in userspace is useful in both directions (rather like docker), and in both cases we'd like it to run as quickly as possible.
For my particular case it's about something similar in purpose to sandboxing, but with providing the compartment (ie a process subtree) with an alternative kernel to talk to, to minimise the attack surface between that container and the host kernel.
Yeah, and Amdahl's law comes directly into play here. At 7% of the total boot time, this bubble sort may be the slowest part of the kernel boot. But even if you somehow optimized the sort away entirely, you could still only save 1.97 ms. And once you do that, the next slowest thing probably doesn't take more than ~1 ms to run. Eventually you get to a state that resists these kinds of targeted optimizations, where there are no bottlenecks and the only improvements can come from new algorithms and data structures, i.e. "uniformly slow code."
Of course, calling anything that takes place in less than a blink of an eye "slow" is overstating things... lol :-)
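To put rough numbers on the Amdahl's-law point: 1.97 ms at 7% implies a total boot of about 28 ms, so even an infinitely fast sort caps the overall gain:

```latex
S_{\mathrm{overall}} = \frac{1}{(1 - p) + p/s}
\;\xrightarrow{\;s \to \infty\;}\; \frac{1}{1 - p}
= \frac{1}{1 - 0.07} \approx 1.075
```

i.e. about a 7.5% faster boot, or roughly 2 ms off ~28 ms.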
If only I could get my desktop environment to come up in less than 28ms :-)
Exactly.
It is not a problem of making this or that slower or faster, or of which one is slower or faster.
BUT
we use one solution across multiple problems instead of providing a solution tailored to the specific problem. If you have a KNOWN VM on KNOWN hardware, why do you need all those hardware initializations, checks, and the rest? Just provide the list of things it has to have, and once that list runs through without problems, continue. Also, why would Amazon/GCP/Azure care? They make money from you running inefficient solutions...
The main point of running unixy stuff is exactly that I can mold my solution to a specific problem EASILY. That's why it is so common in the embedded market ( routers, kiosks, drones, car infotainment, TVs... )
Sure but Firecracker is used for AWS lambdas, so you're going to be booting those extremely frequently. (Now I wonder if Amazon is using FreeBSD for AWS lambda ..?)
Do you think those devices boot every time you send them an input, or just when they're plugged in? Unless you're turning the TV on and off thousands of times per day this would be unnoticeable.
Certain systems on a vehicle make sense to power off. Do you really need the unit controlling your backup camera to be booted into standby while the car is turned off?
No, but I'm not taking thousands of six-inch family vacations per day, I'd walk if I wanted to travel a distance for which two milliseconds was noticeable.
I think you're mixing (bad) analogies to make some type of point but I'm not sure what it is.
The original comment you replied to mentioned cars in the context of booting devices often, so I simply pointed out that yes, actually it makes sense to boot up some devices when they go to receive input and not to leave them in standby as you suggested. In a vehicle that may sit off/idle for several weeks, it's best not to drain the battery unnecessarily.
Do we? And... do we know that we can make it significantly faster? To wit, would the code for a smarter sort be bigger, such that that itself causes problems? Or, even with a faster sort, is it still 1.6 ms?
You may lack the imagination to come up with a use for some improvement off the cuff, but that doesn't mean everyone else (some of whom are more motivated) does as well.
Why do we even need to waste any consciousness-seconds on watching electric circuits initialize? Computers should just start working.
Probably true. My question was "how often is this run?"
I assume it's not just once when I first turn on my laptop from fully off. Is it once every time I restart a Docker container? (Assuming FreeBSD images.)
Is it something that runs a couple times a day, a dozen times a day, a thousand times a day? Fifty times every time I open an Electron app?
I suspect this doesn't matter and is more a curiosity, but I wouldn't be shocked to learn otherwise.
Years (to be a little hyperbolic). FreeBSD still doesn't have a good pid 1 story. It still has a very simple /sbin/init that just spawns whatever shell script happens to live in /etc/rc. That shell script essentially does a topological sort of service dependencies and then starts each one individually and serially.
So you might be shocked that 7% is so high, but the reason it is so high is that cperciva has been driving down the overall boot time to the tune of ~99%+ faster than two major releases ago. Bubblesort used to be <1% of total boot time because the rest of boot was so slow. Now that many of the gratuitous sleeps are gone, the next thing to look at is stupid algorithms like this.
I've seen this referred to as the "BBBB problem", as in: "nobody types BBBB into the search box, so it makes no sense to optimize the algorithm so it doesn't have worst-case complexity with repeated characters".
That is, until someone searches for e.g. git conflict markers.
At a previous job, I debugged a close relative of that exact problem: sorting list views was very slow when you wanted to sort on a column with very few distinct values in a large table. The culprit was of course a quick sort degenerating to n squared performance.
Had this problem in xBase (Clipper with NTX indexes): sorting by date but there were many null (empty) dates in that particular column. Added another date column (that was never null) to the index just to provide entropy.
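A minimal C sketch of that kind of fix, translated out of xBase: when the primary key has very few distinct values, tiebreak on a column that is always unique, so a comparison-based sort sees (mostly) distinct keys. The struct and field names here are invented for illustration.

```c
/* Hypothetical analogue of the "add entropy to the key" fix: compare on
 * the sparse date column first, then tiebreak on a unique id column.
 * Struct and field names are made up for illustration. */
#include <stdlib.h>

struct row {
    long date;   /* 0 means "no date"; many duplicates */
    long id;     /* unique, never null */
};

static int cmp_rows(const void *a, const void *b)
{
    const struct row *x = a, *y = b;

    if (x->date != y->date)
        return (x->date > y->date) - (x->date < y->date);
    /* tiebreak on the always-unique column */
    return (x->id > y->id) - (x->id < y->id);
}

/* usage: qsort(rows, nrows, sizeof(rows[0]), cmp_rows); */
```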
At a glance, it looks like 13.1-RELEASE booted 3x faster than 11.1-RELEASE, and 14.0-CURRENT on Firecracker boots 45x faster than 11.1-RELEASE on a c5.xlarge.
When I was doing my PhD in hydrology modeling, I found a bubble sort in one of our pre-processing programs, which calculated the topographic flow network. This worked okay when the watersheds being modeled were small or the grid resolution was coarse. I, however, was modeling at a much finer resolution (3m^2), resulting in runtimes of more than an hour. I replaced the bubble sort with merge sort from libc and my runtimes fell to less than 5 minutes. Fun times.
Tangential, but in my Java class recently I was teaching the students some basic Big O stuff. I wrote a few versions of a program that found the "biggest delta" between two numbers in a list. The first version was O(n^2) time, the next version was just O(n).
I showed them how long it took to run the O(n^2) on 100000 elements, about two minutes on my beefy server. Then I showed them the O(n) version running on my comparatively weaker laptop, and it took about ten seconds to run a million elements. The students were genuinely kind of impressed, and I stressed the point to say that "these are the kind of numbers that make it important to take theory seriously."
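A minimal sketch of that exercise (the classroom version was Java; this is the same idea in C, with invented function names): the quadratic version compares every pair, while the linear one observes that the biggest delta is just max minus min.

```c
/* Two ways to find the largest difference between any two numbers in an
 * array: a naive all-pairs scan vs. a single pass that tracks min and max. */
#include <stddef.h>

/* O(n^2): compare every pair. */
long biggest_delta_quadratic(const long *a, size_t n)
{
    long best = 0;
    for (size_t i = 0; i < n; i++)
        for (size_t j = 0; j < n; j++)
            if (a[i] - a[j] > best)
                best = a[i] - a[j];
    return best;
}

/* O(n): the largest delta is simply max - min. */
long biggest_delta_linear(const long *a, size_t n)
{
    if (n == 0)
        return 0;
    long lo = a[0], hi = a[0];
    for (size_t i = 1; i < n; i++) {
        if (a[i] < lo) lo = a[i];
        if (a[i] > hi) hi = a[i];
    }
    return hi - lo;
}
```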
I would love to see a detailed analysis of various OS boot sequences and times, it is kind of insane that modern computers and phones can take minutes to boot; what ever are they doing?
Mostly just questionable hardware design and spoiled programmers who want to do too much in early boot. There is no technical reason why you couldn't have a modern desktop booting in a couple of milliseconds. It's a tradeoff between being fast and being compatible with the rest of the world.
At the end of the day, boot times are determined not by technical reasons but by how much the users are willing to tolerate. If the market demanded low boot times, the engineers would deliver.
But seriously, there is a router that I shall not name here that spends 10 seconds sleeping to avoid a race condition after initializing some irrelevant service. And of course the entire boot process depends on this service.
> not by technical reasons but by how much the users are willing to tolerate.
Realizing this one clause is what helped me understand so much, including economics and traffic (induced demand). I call it the "misery constant."
Number of lanes doesn't matter: People will adjust their lifestyles until they arrive at the highest level of misery they can tolerate. Add a lane and wait for equilibrium to reestablish at exactly the same level of congestion as before.
Add a 10,000% (more??) increase in CPU and disk speed from the 128K Macintosh to the MacBook Pro, and oh look, it takes basically the same amount of time to boot or to launch an application as it did then, as the developers adjusted their wastefulness until they achieved the maximum level of delay the customers were willing to bear.
> Mostly just questionable hardware design and spoiled programmers who want to do too much in early boot. There is no technical reason why you couldn't have a modern desktop booting in a couple of milliseconds. It's a tradeoff between being fast and being compatible with the rest of the world.
You wouldn't even be able to initialize the CPU & DRAM in a "couple of milliseconds".
Loading the BIOS from flash alone would eat that budget a few times over.
> there is a router that I shall not name here that spends 10 seconds sleeping to avoid a race condition after initializing some irrelevant service
Reliability is an issue here. There's all sorts of bits and pieces where it's miserably difficult to get a signal or event that something has come up, or there is such a mechanism and it's unreliable. Especially when there's third-party closed drivers or firmware involved. So somebody ends up giving up and putting the delay in, testing it a few hundred times, and declaring it done.
But yes, lots of systems could come up faster if enough people cared.
PCI enumeration turns out to be surprisingly slow, especially when virtualized. Access to PCI config space is done by writing an address to one register and reading the data at that address from another register, repeat for every word in the PCI config space for every device. Each of those register reads and writes takes a VMEXIT, so goes through the host kernel and into qemu and back. Oh yeah, and it used to be done twice, once by SeaBIOS and once by the kernel (the SeaBIOS stuff has been optimized since.)
If you're dealing with modern systems (which would hopefully include virtual machines), the PCI config space can be memory mapped, and then you don't need to read it through the PCI config registers.
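A sketch of the legacy "write an address, read the data" mechanism described above, expressed as x86 port I/O. The register layout is the standard 0xCF8/0xCFC scheme, but the function name is invented, running this from userspace on Linux would additionally need iopl(2)/ioperm(2) privileges, and modern code would use the memory-mapped (ECAM) path or sysfs instead.

```c
/* Legacy PCI configuration access: write an address to CONFIG_ADDRESS
 * (0xCF8), read the data from CONFIG_DATA (0xCFC). Under virtualization,
 * each of these port accesses is a VM exit. */
#include <stdint.h>
#include <sys/io.h>   /* outl/inl on Linux x86; needs I/O privileges */

static uint32_t pci_cfg_read32(uint8_t bus, uint8_t dev, uint8_t fn, uint8_t off)
{
    uint32_t addr = (1u << 31)            /* enable bit */
                  | ((uint32_t)bus << 16)
                  | ((uint32_t)dev << 11)
                  | ((uint32_t)fn  << 8)
                  | (off & 0xFC);         /* dword-aligned register offset */

    outl(addr, 0xCF8);                    /* CONFIG_ADDRESS */
    return inl(0xCFC);                    /* CONFIG_DATA */
}

/* e.g. vendor/device ID of bus 0, device 0, function 0:
 *   uint32_t id = pci_cfg_read32(0, 0, 0, 0x00);
 * Repeat for every register of every device and the round trips add up. */
```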
I'm basically a 5yo when it comes to hardware - so my apologies for the dumb question - but I wonder naïvely, could this enumeration be completely cached by default? Most computers have most of the important stuff either soldered down (laptops) or at least not swapped regularly. I think I'd sign up immediately for a system where I had to hold a key at startup the first time I installed a new GPU or something, but otherwise it could boot into Windows in 3 seconds. Since most devices that are likely to change are hot-pluggable, I would imagine that those don't need to be checked on at boot time. I'd love to know where I'm wrong on this, though.
Perhaps this is part of what Windows does -- I know that it basically keeps a hibernated image of the non-logged-in RAM state, for use when you shut down (and it throws it out if you do an explicit Restart, for troubleshooting reasons)
> On the subset of Linux distros that use systemd, rather.
Worth noting that Fedora, RHEL/CentOS/et al, Debian, Ubuntu, Arch and derivatives use systemd, so odds are pretty high if you use "Linux" you also get systemd
That's true; I think the only major user-facing Linux distros without systemd (or with non-systemd versions) these days are Gentoo, Void, Alpine, and Android (which may or may not count). Embedded stuff is probably more of a mixture (openwrt doesn't use systemd; yocto and buildroot can use it, but optionally), and firecracker feels closer to embedded than desktop, but both ends are interesting. I dunno, I mostly just object to conflating "Linux" with systemd even if it is the most popular option (but bear in mind that I have strong opinions about "Linux" vs "GNU/Linux", so YMMV;]).
I'd just like to interject for a moment. What you're referring to as Linux, is in fact, systemd/Linux, or as I've recently taken to calling it, systemd plus Linux.
I mean, I would unironically be fine with people saying systemd/Linux when that's what they're talking about; it's a subset of GNU/Linux and points at what we're talking about pretty easily and specifically...
Android seems to be the most popular Linux distro on the planet, though? Does it use systemd? I was under the impression that it doesn't, but as it often is, I might be wrong.
Technically Android runs the Linux kernel, but I wouldn't consider it a "Linux distro" in anywhere near the same sense as I would an Ubuntu or Fedora, etc. The vast majority of Android users don't even know they're running the Linux kernel, which is not true of "distros."
I would imagine the vast majority of OpenWRT users don't know they're running the Linux kernel. It was installed by the manufacturer or the operator. Assuming that's true, does it make OpenWRT not a distro?
I understand your usage of the word "distro". I think it's useful, but also quite a bit misleading (for example, Wikipedia[1] lists Android as a distro).
Perhaps there ought to be another word for distributions of Linux that require user to take active part in either replacing a manufacturers' choice or setting it up from scratch on an otherwise "empty" system. But I don't know what it would be.
> I would imagine the vast majority of OpenWRT users don't know they're running the Linux kernel. It was installed by the manufacturer or the operator. Assuming that's true, does it make OpenWRT not a distro?
My gut is to not consider OpenWRT a distro since it's usually just the software part of a network appliance and many/most users don't know they're running it, although I concede that it does qualify in most senses as a distro: downloadable image, a package manager, is a complete product that uses the linux kernel, etc.
I fully concede that my opinion of "what makes a distro" is subjective and arbitrary. It's very much the Linux distro version of Stewart's test for obscenity[1] and is a pretty unhelpful standard whose value is quite low :-)
Could be worse. Some network routers take up to 30 minutes to cold boot from what I hear, but they essentially never cold boot because they are on UPS.
Makes me realize how spoiled I am by standard libraries nowadays.
Outside of toy projects, interviews, and teaching them, I've never actually written a sort function. I never saw the point; the sort function built into the standard library of whatever language I'm using is probably faster and/or more space efficient than whatever thing I whip up in a few seconds.
Of course, that's a privileged opinion that I am afforded by the fact that my entire career has been "Java and higher" languages.
In my first computer science class in college, after finishing the unit on Red-Black Trees, the professor said something like "Now you should never implement a Red-Black Tree again. If you need a balanced binary tree, use std::map or whatever your standard library provides. Same goes for all the other sorting and data structures we've done in this class."
Relatedly:
Q: When is Python faster than C?
A: When the C programmer used a linked list or some O(n^2) algorithm because they didn't have a standard library full of easy to use well implemented data structures and algorithms.
That's actually something I've been saying for quite a while when people bring up microbenchmarks complaining about how a language like Haskell is slow.
Like, yes, no question: if you try to implement a tight loop in Haskell vs C, C is going to come out a lot faster.
However, no one writes Haskell like that! People rely pretty heavily on standard libraries (or even third party libraries), and on the optimizations afforded to them by the "rich" language of Haskell.
C might be capable of being faster, but that doesn't imply that a C program will inherently be faster than a similar Haskell program, especially if the programs are relatively large.
Even though I kind of mostly agree that you almost never should implement your own rb tree, this advice is also kind of disempowering and I worry that it is part of a mindset that accepts the status quo when it is so often the case that we could have better things if we just got our hands a little dirty.
> A: When the C programmer used a linked list or some O(n^2) algorithm because they didn't have a standard library full of easy to use well implemented data structures and algorithms.
I don't think so, regarding the linked list. In almost every case you'll end up with at least as much pointer indirection in Python. But yeah, if you have enough data that complexity starts mattering a worse algorithm will cause more problems.
I was mostly joking, but worst-cases include sorted and reverse-sorted lists. I've seen those happen accidentally, in practice, resulting in accidentally-quadratic runtime. This is why, when I taught that class, I made students compare sorting algorithms on real world data sets.
Sure. In this case though the list doesn't change from boot to boot, or for that matter from system to system; if we hit a worst-case we would notice before the release shipped.
Also, the worst case would be "systems take 2 ms extra time to boot" so it's not exactly a disaster.
> though the list doesn't change from boot to boot, or for that matter from system to system; if we hit a worst-case we would notice before the release shipped.
Wait what - why don't you just sort it at build time then? I'm probably misunderstanding what you are saying or maybe just a premature optimization.
e: Nevermind, read further in the thread and saw you answered [0]
We might end up doing that. A quicksort takes it from 2 ms down to 40 us though, and doing the sorting at build time would take a lot of linker magic.
(The SYSINITs are records in separate object files which are marked as going into a particular ELF section, so we don't actually have a complete list of them until we finish linking the kernel.)
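A rough illustration of the linker-set idea being described; the macro, section name, and types here are invented and this is not FreeBSD's actual SYSINIT machinery. Each translation unit drops a pointer into a named ELF section, the final link concatenates them, and only then does the complete (unsorted) list exist.

```c
/* Linker-set sketch: objects registered from anywhere end up in one ELF
 * section; GNU ld provides __start_/__stop_ symbols for sections whose
 * names are valid C identifiers. Names here are invented. */
#include <stdio.h>

struct init_entry {
    int order;                  /* sort key, analogous to subsystem/order */
    void (*fn)(void);
};

#define REGISTER_INIT(name, ord, func)                                  \
    static struct init_entry name##_entry = { (ord), (func) };          \
    static struct init_entry *name##_ptr                                \
        __attribute__((used, section("my_init_set"))) = &name##_entry;

extern struct init_entry *__start_my_init_set[];
extern struct init_entry *__stop_my_init_set[];

static void hello(void) { puts("hello from init"); }
REGISTER_INIT(hello_init, 10, hello)

int main(void)
{
    /* The full list only exists after linking, which is why sorting it at
     * build time needs linker tricks; here we just run in section order. */
    for (struct init_entry **p = __start_my_init_set; p < __stop_my_init_set; p++)
        (*p)->fn();
    return 0;
}
```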
Isn't the point of quicksort that the randomization of the pivot selection means that you're extremely unlikely to run into a pathological case for the specific pivot chosen?
And perhaps even more unlikely if you permute the input sequence before sorting it.
I'm basing this off of my understanding of Skiena's discussion of quicksort in The Algorithm Design Manual.
Not to mention that QuickSort isn't a single algorithm but rather a class of algorithms (as the choice of pivot is left unspecified), and there are methods to choose the pivot that give O(n log n) worst case or O(n log n) expected running time (e.g. median of medians or soft heaps for the former, random choice for the latter).
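A minimal random-pivot quicksort in C, just to illustrate the point about pivot choice; it is not any particular libc's or kernel's implementation. Note that inputs with many equal keys still want a three-way partition, which random pivots alone don't give you.

```c
/* Random-pivot quicksort on ints: pathological inputs become a matter of
 * bad luck rather than of accidental or adversarial ordering. Recursion
 * depth is O(log n) in expectation. Illustrative only. */
#include <stdlib.h>

static void swap_int(int *a, int *b) { int t = *a; *a = *b; *b = t; }

/* Sorts a[lo..hi] inclusive. */
static void quicksort_rand(int *a, long lo, long hi)
{
    if (lo >= hi)
        return;

    /* pick a pivot uniformly at random and park it at the end */
    long p = lo + rand() % (hi - lo + 1);
    swap_int(&a[p], &a[hi]);
    int pivot = a[hi];

    /* Lomuto partition around the pivot */
    long i = lo;
    for (long j = lo; j < hi; j++)
        if (a[j] < pivot)
            swap_int(&a[i++], &a[j]);
    swap_int(&a[i], &a[hi]);

    quicksort_rand(a, lo, i - 1);
    quicksort_rand(a, i + 1, hi);
}

/* usage: quicksort_rand(arr, 0, (long)n - 1); */
```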
Most of them? If the instructor doesn't call out the difference and point out that all terms of a O() expression have invisible but important constants tacked on, it's easy to accidentally conflate scaling complexity with actual total speed.
They're not intentionally avoiding it, they just get caught up in the mathematical angle and only mention its (dis)connection to real execution time in a passing comment, if at all. Or they might even mention that there are constants in play... and then say that of course, since they're constants, we can ignore them for the purposes of O() notation. It's the difference between learning from PhDs and engineers, I suppose.
This is talking about Firecracker[0] which is for microvm deployments so I would think it is the aggregate effect of losing that 7% across a large number of spin-ups that would make a difference. Occasional reboot on your personal machines obviously doesn't make a dent.
This is for FreeBSD Firecracker. This means this is used for AWS Lambda and Fargate. A whole bunch of 2ms wins is how one would improve cold start time.
BSD is sometimes used in IoT-like devices, of which there are 3 billion. If I randomly assume 0.5% of those are running a fork of FreeBSD, that's 15 million devices. I'm also going to randomly assume that devices reboot once a month on average.
Many years ago I worked on a piece of software that simulated networks. As networks got bigger, it would run out of memory. It turned out it did something like allocate a full N×N connectivity matrix.
And almost the entire matrix was empty, because most things in a network aren't connected to each other. I spent a few minutes replacing that with a hash table keyed on the (i, j) node pair.
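An illustrative sketch of that replacement in C (the original code was not C, and all names and sizes here are made up): store only the connected (i, j) pairs in a hash set instead of a dense matrix, so memory scales with the number of edges rather than n².

```c
/* Sparse connectivity via a fixed-size open-addressing hash set of
 * (i, j) pairs, instead of an n x n matrix that is almost all zero.
 * No resizing; the table is assumed large enough to stay sparsely loaded. */
#include <stdint.h>
#include <stdlib.h>

#define TABLE_SIZE (1u << 20)              /* power of two, assumed sufficient */

static uint64_t      keys[TABLE_SIZE];
static unsigned char used[TABLE_SIZE];

static uint64_t pair_key(uint32_t i, uint32_t j)
{
    return ((uint64_t)i << 32) | j;
}

static size_t slot_for(uint64_t key)
{
    /* multiplicative hash + linear probing */
    size_t h = (size_t)(key * 0x9E3779B97F4A7C15ull) & (TABLE_SIZE - 1);
    while (used[h] && keys[h] != key)
        h = (h + 1) & (TABLE_SIZE - 1);
    return h;
}

static void connect_nodes(uint32_t i, uint32_t j)
{
    size_t h = slot_for(pair_key(i, j));
    keys[h] = pair_key(i, j);
    used[h] = 1;
}

static int is_connected(uint32_t i, uint32_t j)
{
    return used[slot_for(pair_key(i, j))];
}
```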
This is from a year old tweet so it may not be correct anymore, but:
“FreeBSD/Firecracker now boots to a login prompt in 782 ms. 170 ms of this is spent generating a self-signed TLS certificate for sendmail, because of course it is.”
The notion that the system mails things to certain local accounts is still quite strong in FreeBSD. In fairness, it's a notion that has a long history in Unices in general; so it isn't FreeBSD Think.
On the gripping hand, it's mad that it's still Sendmail, although that is going to change. It's 2023, and the fact that a FreeBSD system still has the traditional ~2000 lines of indistinguishable-from-line-noise "DO NOT EDIT THIS" stuff in /etc/mail/sendmail.cf, complete with whole bunches of rules for UUCP, is rather amazing.
I wonder how many small FreeBSD systems out there are spending their time double-bouncing cron-job mail that cannot be delivered. (-:
I suspect Firecracker is set up to be as quick to initialize as possible, but still, that a vanilla build (i.e. not an optimized fork) can do that... My phone takes two orders of magnitude more!
This is one of the cases where C’s low-level nature bites.
Both C++ and Rust have pretty nice sorting algorithms in the standard library. C does not (qsort doesn't really count as a nice sorting algorithm, which is why so many C programmers roll their own).
I could imagine that qsort is not an option in FreeBSD for example because it is recursive and there is no stack at that level, or there is a risk of stack overflow. If that's the case, shell sort should be fine I guess.
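For reference, a shell sort of the kind being suggested is tiny, in-place, and non-recursive, so there are no stack-depth concerns. This is just an illustrative sketch, not what FreeBSD actually adopted.

```c
/* Minimal non-recursive shell sort: in-place, no allocation, no recursion,
 * and far better than bubble sort in practice. Illustrative only. */
#include <stddef.h>

static void shell_sort(int *a, size_t n)
{
    for (size_t gap = n / 2; gap > 0; gap /= 2) {
        for (size_t i = gap; i < n; i++) {
            int tmp = a[i];
            size_t j = i;
            /* gapped insertion sort */
            while (j >= gap && a[j - gap] > tmp) {
                a[j] = a[j - gap];
                j -= gap;
            }
            a[j] = tmp;
        }
    }
}
```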
Can't use the standard library in the kernel, though. That's why there are special implementations of a lot of things you take for granted in kernel source, things like "print".
Rust's `sort_unstable` is available in `core` and does not require the standard library or the `alloc` crate, so, if my understanding is correct, even `no_std` programs like the kernel could use it?
The most relevant element of what's going on here is that Rust's standard library is deliberately split three ways: Core has stuff that should "just work" on your target platform; Alloc adds stuff that further requires a memory allocator, which is something target hardware doesn't come with but any non-trivial system is likely to build; then Std adds stuff which requires an operating system, such as files and threading.
C is technically also split: so-called "freestanding" C implementations don't provide features like stdio, but they do have e.g. math functions. C++ attempts something similar, but it's a bit hit-and-miss, as a truly "freestanding" C++ doesn't entirely make sense, as systems that try to use C++ for very low-level work often discover.
The next less relevant observation is that Rust chooses to call these sort and sort_unstable - the one that's guaranteed to be available doesn't get the easier name, that's reserved for the one that definitely doesn't have unexpected behaviour. If you don't know what an "unstable" sort is, then sort_unstable's behaviour might in some cases be astonishing. Some languages make the opposite choice, which means programmers may use sort, not realising it's actually an unstable sort because the entire concept was never explained to them.
This is one of many examples of Rust's safety achieved not via some sophisticated technology but just by thinking about why people screw up. Every Rust programmer who without thinking writes sort gets a stable sort, if that was too slow, they can choose sort_unstable, but doing so prompts them to consider, wait, is an unstable sort correct here?
Finally the least relevant but hopefully still interesting observation is that it is quite possible to have a stable sort which doesn't require allocation, but the performance is not so hot. It seems pretty unlikely that you must have a stable sort and you can't afford an allocator and yet you don't mind the performance hit. As I understand it though some C++ libraries do offer this.
Looks like the code to bubble-sort the SYSINIT objects was committed a few months BEFORE this kernel qsort code was committed (late Aug 1995 vs early Nov 1995).
So that's why the sort-SYSINIT objects code didn't use qsort.
I'd also guess that there were fewer SYSINIT objects to sort back in 1995 than today.
I see a lot of discussion here about quicksort as an alternative.
If I'm understanding correctly, we're talking about sorting integers. Would something like radix sort not be more appropriate? Or is there a reason to still prefer a more generic sorting algorithm?
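A sketch of what that could look like, if the keys really are small integers: an LSD radix sort on 32-bit keys, which is O(n) but needs an O(n) scratch buffer; whether that trade-off is worth it for a small boot-time list is exactly the question. (In practice the SYSINIT entries sort on a pair of integer keys; this sketch just sorts single 32-bit values for illustration.)

```c
/* LSD radix sort on 32-bit unsigned keys, processing one byte per pass.
 * O(n) time, O(n) extra space for the scratch buffer. Illustrative only. */
#include <stdint.h>
#include <string.h>

/* Sorts keys[0..n-1]; tmp must have room for n elements. */
static void radix_sort_u32(uint32_t *keys, uint32_t *tmp, size_t n)
{
    for (unsigned shift = 0; shift < 32; shift += 8) {
        size_t count[256] = { 0 };

        for (size_t i = 0; i < n; i++)              /* histogram this byte */
            count[(keys[i] >> shift) & 0xFF]++;

        size_t pos = 0;                             /* prefix sums -> start offsets */
        for (unsigned b = 0; b < 256; b++) {
            size_t c = count[b];
            count[b] = pos;
            pos += c;
        }

        for (size_t i = 0; i < n; i++)              /* stable scatter pass */
            tmp[count[(keys[i] >> shift) & 0xFF]++] = keys[i];

        memcpy(keys, tmp, n * sizeof(*keys));
    }
}
```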
At the risk of tempting the systemd flame wars back to life, is there data on how systemd performs? I know faster boot time was a major goal of the project[1].
My personal/corporate, n=~500 experience is that it boots faster, shuts down slower (technically because it is doing it "properly"), and if you have any problem during shutdown it is an absolute PITA to debug why.
We did manage to throw away a few thousand lines of fixed init scripts after moving to systemd and got rid of a few other things, so it has been positive, but definitely some bad along with a lot of good. The journald logging format is also kinda fucked and causes some nasty edge cases.
Except for systemd-resolved; that utterly radioactive piece of shit can go fuck itself.
Long story short: there is no useful indexing in the journald DB format, which means looking for, say, "the last message an app emitted" means looking through every file.
As long as it is mmapped in or sitting on tmpfs it is reasonably fast, but if you have, say, the last 1 GB of logs stored on spinning rust on a NAS... you either get a 3-4 second wait until it finds those few log lines, or a buffer cache littered with that GB of logs, pushing out actually useful stuff.
It literally has worse performance than grepping the entire /var/log, last time I checked.
And it seems that, in its entirety, it was "Lennart wanted to make a binary log format" rather than something actually useful... Just having an SQLite DB instead would, on top of being faster, be far more useful (the ability to run SQL queries over logs on the local node with no extra fuss would be awesome).
Naïve question here: Sorting algorithms aside, is there a reason why the ordering of these even needs to be (re)determined at runtime every boot as opposed to the order being recomputed when the kernel is updated, and written to a file?
Gah! I was just looking at this code last night because RTEMS borrows parts of the FreeBSD codebase for drivers and networking. The sorting happens in mi_startup, in case anyone wants to look.
Fascinating. It’s just plain old bubble sort. Why did they choose to do that? I don’t get the benefits of it. If they are going for simplicity, why not selection or insertion sort?
> O(n^2) is the sweet spot of badly scaling algorithms: fast enough to make it into production, but slow enough to make things fall down once it gets there.
https://randomascii.wordpress.com/2021/02/16/arranging-invis...