FreeBSD spends 7% of its boot time running a bubblesort on its SYSINITs (twitter.com/cperciva)
381 points by gslin on May 19, 2023 | 358 comments


Dawson's law strikes again!

> O(n^2) is the sweet spot of badly scaling algorithms: fast enough to make it into production, but slow enough to make things fall down once it gets there.

https://randomascii.wordpress.com/2021/02/16/arranging-invis...


GTA online was struck too

How I cut GTA Online loading times by 70% https://nee.lv/2021/02/28/How-I-cut-GTA-Online-loading-times...


Lesson here is to profile your software at some point.

I'm playing a game (Fenyx Rising), and after launching it I always wonder why the "Checking for additional content" screen takes 20-30 seconds. I'm pretty sure it should be just a single request.


I doubt that GTA Online launched with 60K+ items in its online store.

The runtime for the release day store inventory count may have been okay, but I don't think that Rockstar kept profiling their game every time that they modified their store inventory.

The amount of pain Rockstar inflicted on their 90K+ users shows that Rockstar didn't care that their game took more than 3 minutes to start up for the majority of their users.


As someone who worked on a smaller, but still millions-of-users live-service game, profiling of average resource usage and the loading screens was done periodically even if there were no code changes to the area of the game in question.

Given that our team was over an order of magnitude smaller than Rockstar, I would be very surprised if they did not have anyone even casually browsing a profiler every 3 months or something, though I think at their scale (LinkedIn claims 5k employees) they can probably have a team or two where everyone's entire job description is performance maintenance.


Video games are probably some of the most profiled code there is (from experience).

But the profiling is all done on the game loop; I've never heard of any teams profiling the startup...


We definitely profile "dev startup" a lot, it directly impacts team speed and ability to work on the codebase. Dev startup usually includes a majority of "user startup" too, so it helps -- but it's also really easy to overlook stuff like "massive mods/mod dirs" and the like.

Speaking from experience... Once, I was so annoyed at a "startup UI"-related dialog taking way (wayyyy) too long to load. I dropped an expletive into Slack and, not much later, I was guided toward fixing my own ~2 year old mistake. We were scanning some "user-generated files" in a way that worked fine but scaled horribly -- the operation went from multiple seconds down to milliseconds. Ugh.


Every few months or so I come back to that article, always in awe.


I was happy to see that, since the last time I saw it, R* patched the bug and gave him a $10k bounty for it.


For a collection of similar stories:

https://accidentallyquadratic.tumblr.com/


the funniest comment ever posted on HN was something like:

"everytime this blog is linked, I end up reading the whole thing"


At least that's probably only O(n) time;)


Not if every single one of its posts gets linked.


About the regexp one: I once wrote a regexp for CSS. It wasn't complete, but you would be able to pinpoint a syntactic error. I hooked it up to a text field, and started entering CSS. All went fine, until I hit a limit, and adding a single character froze Chrome for more than a minute (at which point I killed it).

I don't think it was accidentally quadratic. More likely, it was exponential.


I submitted some posts to this but it doesn't seem like the author has updated it. I probably run into one of these once a month!


Probably because it's MUCH easier to code bubblesort without making mistakes that cause it to not terminate or some such. Especially if they are writing the bootloader in assembly.

For something mission critical like a bootloader that's more valuable than turning O(n^2) into O(n log n). People running systems like BSD largely don't care how long the system takes to boot, once it's booted the system runs for years.


The funny thing is that, in my experience, bubble sort is actually pretty hard to write, because the finicky details of the swap step don't make sense (in a "wait, this can't possibly be right, there's no way you'd want to write it like that" kind of way). Better than even odds that if you have someone write a "bubble sort" without reference, you get an accidentally insertion sort variant.
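For reference, a minimal sketch of the canonical adjacent-swap bubble sort in C (my illustration, not code from the thread). The counterintuitive part is that you only ever compare and swap neighbors, and the sorted region grows from the end:

  #include <stddef.h>

  /* Canonical bubble sort: repeatedly swap adjacent out-of-order pairs.
     After pass i, the largest i+1 elements are in their final positions
     at the end of the array, so the inner bound shrinks each pass. */
  void bubble_sort(int *a, size_t n)
  {
      for (size_t i = 0; i + 1 < n; i++) {
          int swapped = 0;
          for (size_t j = 0; j + 1 < n - i; j++) {
              if (a[j] > a[j + 1]) {           /* compare neighbors only */
                  int tmp = a[j];
                  a[j] = a[j + 1];
                  a[j + 1] = tmp;
                  swapped = 1;
              }
          }
          if (!swapped)                        /* already sorted: stop early */
              break;
      }
  }

If you instead compare a[i] against a[j] across the whole tail of the array, you've written a different (exchange-style) sort without noticing, which is the kind of accident described above.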


The easiest sort to implement in assembly is probably selection sort. Bubble sort is actually messier :)

Source: written sorts in assembly more times than I would care to count.


You shouldn’t write a sort algorithm because much smarter people have already done so and their work is included in the standard library.


Sure. In this case, the smarter people wrote the sort in ~1995 and it was good enough for nearly 30 years, but now someone has to step up and be smart again.

You can't always rely on smarter people to be there to do things for you. And you also can't rely on the standard library to have been written by smarter people anyway, and even if so, to have been written by smarter people in order to handle the situation you find yourself in. There's lots of ways to sort, and lots of good reasons to choose ways that are less than optimal for the use cases they end up being used in.


You’re defending the claim that implementing qsort is too hard; the people who wrote the standard library are definitely smarter than people who put in a bubble sort because quicksort was too hard.

This is just a moronic defense of the venerated FreeBSD developers, it’s on a level equal to organized religion. The FreeBSD developers are fine developers and this was dumb, that’s why they replaced it.

And in this day and age there really is no argument for any user space c environment to exist where the qsort standard library function is not available. And even if there was, it would still be smarter to just copy and paste the battletested code from the c library than write another implementation. Because that’s how you end up with bubblesort because doing it right is too hard.


The “standard library” often isn’t available in early boot scenarios.


True, but there are plenty of CS books about algorithms and data structures.


Copying a convenient-looking one out of a CS book is how you end up with a bubble sort. Approximately nobody comes up with a bubble sort from scratch; it's obviously, gratuitously bad in a way that there's no practical reason to think of on your own. The sorts that people come up with without reference are usually insertion and selection sorts—those two were discovered before bubble sort.


Bubble sort occupies a weird space where people assume it's dumb and simple so it'll be the least code, but even insertion sort tends to beat it.


I mean yeah, bubble sort is basically insertion sort but weirdly pessimized in a way that makes negative sense. Giving it a catchy name has probably been a net harm.


Plenty of children come up with bubble sort as a sorting algorithm. It’s intuitive to go down the list swapping any pairs that happen to be in the wrong order.


It's also very intuitive to pick out a misplaced item and put it in the right position. In the context of physical objects (e.g. playing cards), it's even more intuitive to come up with a (two-way) insertion sort than a bubble sort.


Only if they never managed to get beyond the first chapter.


Lifting sample code out of an academic text, now there's a winning strategy for rock solid stability, correctness, and performance!


It is, at least in the parts of the world where being an engineer actually requires a degree, and not just because someone likes the word.

The real ones, the ones that write USENIX papers.


Converting those into code is not always trivial.


Depends on the quality of the degree.

Besides I would assume anyone getting ready for their leetcode assignments goes through them anyway.


Nah, it’s not. Correctly implementing algorithms is hard; it’s easy to create incorrect behavior on edge cases and performance pitfalls. I’m sure you knew about the Java issue from a while back, for example.


The standard library can be statically linked...


People who think this way haven’t written boot code... I suppose you’re gonna link the C runtime too, and its assumption of being a “process” under some “operating system”... oh wait.

Compared to the rest of the task, writing a sort is pretty darn trivial.


This is true if we're talking about the first stage bios boot that needs to fit in 512 bytes, but there aren't any particular restraints on kernel size at the point in question. Link in anything you want, including qsort.


O(n^2) algorithms often cause performance issues. The main cases I have seen in business logic are: (A) offset pagination (select ... offset n) and then paginate over all entries, and (B) read a text value, append something, store, repeat.
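Case (B) is the classic quadratic-append pattern; a minimal C sketch of the difference (function and buffer names are made up, and dst is assumed to be large enough):

  #include <string.h>

  /* Quadratic: strcat() rescans the whole destination on every append,
     so appending n pieces touches roughly n*n/2 characters in total. */
  void build_slow(char *dst, const char *const *pieces, size_t n)
  {
      dst[0] = '\0';
      for (size_t i = 0; i < n; i++)
          strcat(dst, pieces[i]);
  }

  /* Linear: remember where the string currently ends instead of
     rescanning it, so each piece is copied exactly once. */
  void build_fast(char *dst, const char *const *pieces, size_t n)
  {
      char *end = dst;
      for (size_t i = 0; i < n; i++) {
          size_t len = strlen(pieces[i]);
          memcpy(end, pieces[i], len);
          end += len;
      }
      *end = '\0';
  }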


Ummm.. Sure, but what's happening here is that FreeBSD stores a list in a file not in the desired order, then sorts it before use.

It seems to me that any prolonged discussion about which sorting algorithm should be used is sort of skipping the elephant in the room. Why is the list not sorted to begin with?

Without that basic knowledge it isn't very productive to focus on implementation details, no matter how fun the sorting exercise is. Deleted code is fast code.


I was talking about "exponential algorithms" in general, and about business logic. I know FreeBSD isn't business logic, but low-level kernel code. I don't know the details of the FreeBSD problem.


How does select with offset lead to O(n²)?


For each page, you have a select statement. The first without offset, then with offset 10 limit 10, then offset 20 limit 10, offset 30 limit 10 and so on. The database engine will, for each query, read all entries up to the offset + limit. This is a staircase pattern: first it reads 10, then 20, then 30, and so on. So the sum of read entries is n * n / 2, which is O(n^2).

One could argue that the database engine should be "smarter", but it's not. Note that data could be added or removed in the meantime, so the database engine can't really cache the result. See also https://stackoverflow.com/questions/70519518/is-there-any-be...


Sorry, I've missed the "and then paginate over all entries" part. Anyway thank you for the detailed explanation.


So, how to do offset w/o using OFFSET?


Include the last id (which should be indexed) of the previous page in the next where filter.

https://use-the-index-luke.com/sql/partial-results/fetch-nex...
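A minimal sketch of that in C against SQLite (the table, columns, and page size are assumptions for illustration):

  #include <stdio.h>
  #include <sqlite3.h>

  /* Fetch one page of rows whose id is greater than last_id, in id order.
     Each query does an index seek plus the rows it actually returns,
     instead of re-reading everything before the OFFSET. */
  int fetch_page(sqlite3 *db, sqlite3_int64 last_id, int page_size,
                 sqlite3_int64 *new_last_id)
  {
      const char *sql =
          "SELECT id, title FROM posts WHERE id > ? ORDER BY id LIMIT ?";
      sqlite3_stmt *stmt;
      int rows = 0;

      if (sqlite3_prepare_v2(db, sql, -1, &stmt, NULL) != SQLITE_OK)
          return -1;
      sqlite3_bind_int64(stmt, 1, last_id);
      sqlite3_bind_int(stmt, 2, page_size);

      while (sqlite3_step(stmt) == SQLITE_ROW) {
          *new_last_id = sqlite3_column_int64(stmt, 0);
          printf("%lld  %s\n", (long long)*new_last_id,
                 (const char *)sqlite3_column_text(stmt, 1));
          rows++;
      }
      sqlite3_finalize(stmt);
      return rows;   /* fewer than page_size rows means we reached the end */
  }

The caller starts with last_id = 0 and keeps calling with the returned id until fewer than page_size rows come back.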


That's pagination, not indexed page scanning. Both have their place but they're not the same. Pagination is way better at handling updates between page loads and generally more complicated to implement, as you're now doing head/tail cursor tracking. Flat, boring offset/limit is amazingly simple for the happy lazy path, which is probably fine for most apps.


Make a query whose parameters exclude the previous page results altogether. I learned about this from here: https://www.citusdata.com/blog/2016/03/30/five-ways-to-pagin...


If you need to iterate over all records, just do it. Why do you need offset?

Otherwise, using offset is usually an OK idea, because users will very rarely inspect page #2153. They're interested in page 1, sometimes page 2. limit/offset works fine for those cases, and it'll work for page 2153 for those who visit it once in a decade. Using ids makes the logic to track prev/next/page number incredibly complex, and generally you don't need it.


> If you need to iterate over all records, just do it?

Who is "you" here?

Usually what happens is that party A builds a REST API (or other connectionless protocol) for fetching a list of some entity-type; and they limit the number of items that can be fetched in one request (because they have a billion such entities, and they don't want to even try to imagine what kind of system would be necessary to generate and stream back a 5TB JSON response); which implies pagination, to get at "the rest of" the entities.

Then party B, a user of this API, decides that they want to suck party A's whole billion-entity database out through the straw of that REST API, by scraping their way through each page.

> it'll work for page 2153 for those who visit it once in a decade

To be looking at page 2153, the user probably first visited every page before 2153. Which means they didn't do one O(N) query (which would by itself be fine); but rather, they made O(N) requests that each did an O(N) query.


I regularly change 3 to 2 in /new/345689 links when bored with today’s content.

> Using ids makes the logic to track prev/next/page number incredibly complex, and generally you don't need it.

When it’s a public site, users may post so fast that this “next” can show some previous page. Paging via ids is a must there.


> “next” can show some previous page

That is usually a non-issue. The cost in DB operations is usually much more relevant than it.

When people do actually care about full enumeration and uniqueness of the items they are querying, "pagination" itself tends to be too messy a concept.


> The cost in DB operations is usually much more relevant than it.

As a result, a user (all of them) hits “next” again until a page looks like containing new posts. It’s multiple requests wasted.

Anyway, what exactly becomes messy?


What does a "page" mean when enough new things appear in between that they push stuff a page away? Are those things reordered? Do people expect to see the new things before they get to the old?

A "page" is a very ill-defined thing that can only exist if your stuff is mostly static. Queries are very different on dynamic content.


You’re overthinking (overquestioning) it. When a user hits “next”, they want to see the next N posts from where they are, in the order they chose before; that’s it.

Since there’s no evidence of a mess still, I believe you’re projecting it from an overthought side that isn’t real.


Use a sorted index (e.g. Btree), and use those values to quickly find the start of the next page of results.

For good performance this also requires that your search criteria (if any) be compatible with this index.


Off Topic: "Laying out icons on a grid should be an inherently linear operation"

It doesn't seem to be mentioned in the HN thread, but the cause here is probably the same O(n^2) thing: sorting. Laying out icons is only linear if the icons are read in the order that they're placed. It's been a long time since I used Windows regularly, but my memory is that the default placement is by creation time. So if they're read off disk by filename (or some sort of hash), they'd need to be sorted by timestamp.


Even if the disk sorts directory entries by file name and you want to show them sorted by file name, chances are you have to sort.

Reasons? Firstly, you’d have to know the entries are sorted. For that, you need an API that tells you that or hard-code information about file systems in your code. They may exist, but I’m not aware of any file system API that provides that information.

Secondly, the file system may not return the names sorted in the locale you want to sort them in.

Thirdly, the sorting code used in the file system may contain a bug. Once file systems are out there, you can’t fix them (happened in one of Apple’s file systems. HFS, IIRC)

Lastly, modern GUIs tend to sort file names containing numbers non-alphabetically, so that, for example “file 2.jpg” gets sorted before “file 12.jpg”.

So, I think it’s easier to always sort. I would pick an algorithm that works well when items are mostly sorted at the start, though.
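A rough sketch of that kind of "natural" comparison in C (ASCII-only, no locale handling, and leading zeros treated as ties, purely for illustration):

  #include <ctype.h>
  #include <string.h>

  /* Compare strings so embedded numbers sort numerically:
     "file 2.jpg" comes before "file 12.jpg". */
  int natural_cmp(const char *a, const char *b)
  {
      while (*a && *b) {
          if (isdigit((unsigned char)*a) && isdigit((unsigned char)*b)) {
              /* measure both digit runs */
              size_t ra = 0, rb = 0;
              while (isdigit((unsigned char)a[ra])) ra++;
              while (isdigit((unsigned char)b[rb])) rb++;
              /* skip leading zeros, compare by length, then lexically */
              const char *pa = a, *pb = b;
              size_t la = ra, lb = rb;
              while (la > 1 && *pa == '0') { pa++; la--; }
              while (lb > 1 && *pb == '0') { pb++; lb--; }
              if (la != lb) return la < lb ? -1 : 1;
              int c = memcmp(pa, pb, la);
              if (c != 0) return c;
              a += ra; b += rb;      /* equal value: skip past both runs */
          } else {
              if (*a != *b)
                  return (unsigned char)*a - (unsigned char)*b;
              a++; b++;
          }
      }
      return (unsigned char)*a - (unsigned char)*b;
  }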


you have indexes............


This made me recall modern AI and the issue with quadratic complexity in its transformers. Ooof! A breakthrough here would be a true Breakthrough™ with remarkably larger context sizes. Like it would barely even be a limit anymore and be transformative (har har) to what they can be used for.


Thanks, that's a fun read although I don't understand much of it. I do understand the gist.


Discoveries and analysis like this blog post and the parent show the difference between programmers and engineers.


True but I think the real cause of this is surely that C makes it too hard to use a sorting library that someone competent has written. I would not be surprised if the author was fully aware of the N^2 complexity but opted for a simpler implementation anyway.


qsort() not good enough for you?

(More realistically, below people are discussing that in the kernel environment the set of standard or third party library available may be unavoidably limited)


There is a quirk (almost a bug?) in FreeBSD's qsort where it will switch to insertion sort for the whole array under certain conditions, which we hit in production given how our data was arranged.

(I think this was the check) https://github.com/freebsd/freebsd-src/blob/main/lib/libc/st...
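As I understand the heuristic, the partitioning pass counts swaps and, if it did none, the whole current range gets insertion-sorted. A simplified, self-contained illustration of that idea (my own sketch of the heuristic, not the actual FreeBSD code):

  #include <stddef.h>

  /* Simplified quicksort with the "no swaps during partitioning ->
     insertion-sort the whole range" fallback described above. Inputs
     that repeatedly trigger that branch degrade toward O(n^2). */
  static void sort_range(int *a, size_t n)
  {
      if (n < 2)
          return;
      int pivot = a[n / 2];
      size_t i = 0, j = n - 1;
      int swap_cnt = 0;
      for (;;) {
          while (a[i] < pivot) i++;
          while (a[j] > pivot) j--;
          if (i >= j)
              break;
          int t = a[i]; a[i] = a[j]; a[j] = t;
          swap_cnt = 1;
          i++; j--;
      }
      if (swap_cnt == 0) {              /* "looks sorted": insertion sort */
          for (size_t k = 1; k < n; k++)
              for (size_t m = k; m > 0 && a[m - 1] > a[m]; m--) {
                  int t = a[m]; a[m] = a[m - 1]; a[m - 1] = t;
              }
          return;
      }
      sort_range(a, j + 1);
      sort_range(a + j + 1, n - j - 1);
  }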


I like how someone felt the need to write out insertion sort in some 4 line code golf challenge in the midst of qsort. This right here is why no one wants to deal with C anymore.


Did a bit of digging and found that there used to be a comment for why it was done, but it got removed [0] when they switched to the implementation from Bentley & McIlroy's "engineering a sort function" [1] around 1992.

[0]: https://github.com/weiss/original-bsd/commit/d3fcf71e0db57cb...

[1]: https://cs.fit.edu/~pkc/classes/writing/papers/bentley93engi...


qsort() is pretty slow if you're sorting something like long[]. In that case, radix sort goes 5x faster and vectorized quicksort goes 10x faster. If you're sorting int[] then vectorized quicksort goes 25x faster than qsort(). Nothing goes faster. The issue is it's a big ugly c++ monstrosity that's a huge pain to compile.


That's fair if the constant factor is relevant, but if bubble sort is terminating in any reasonable timescale then the difference between qsort, C++ std::sort, and a custom implementation is really not a factor.


People who don't compute much data don't need computer science.


But if you're comparing to bubblesort.....


No. In most languages sorting a container is `foo.sort()` or something similar. `qsort()` is much more faff.

I mean, clearly it wasn't good enough otherwise they would have used it, no? Perhaps integrating it was a huge faff.
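For concreteness, the libc route looks roughly like this; the comparator boilerplate is the "faff" in question (a minimal sketch):

  #include <stdlib.h>

  /* qsort() needs an explicit comparator taking void pointers. */
  static int cmp_int(const void *pa, const void *pb)
  {
      int a = *(const int *)pa;
      int b = *(const int *)pb;
      return (a > b) - (a < b);   /* avoids overflow of a - b */
  }

  int main(void)
  {
      int v[] = { 3, 1, 4, 1, 5, 9, 2, 6 };
      qsort(v, sizeof v / sizeof v[0], sizeof v[0], cmp_int);
      return 0;
  }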


Nonetheless, qsort() is available in libkern.


Not at the time that the bubble-sort code was used though. see https://news.ycombinator.com/item?id=36005209


You mean written. It's been available for many of the years that it has been used. (-:


> that C makes it too hard to use a sorting library that someone competent has written.

This makes no sense to me. What about C makes it hard to use a good library?


I think it comes down to ecosystem fragmentation.

A lot of languages now have common tooling (cargo for rust, pip for python, etc) which makes it easier to find and incorporate the libraries you want. Apparently there are tools like https://conan.io/ but they're not as widely-adopted.

C's build system is similarly non-uniform. Many packages use Makefiles, others use different build mechanisms.

C has no universally-agreed error handling mechanism or convention. Whether exceptions or golang's error interface, you can generally assume that a random package in most languages will handle errors the way you expect. In C it's a lot more varied.

Similarly memory allocation - sometimes in a larger application you want to customize how malloc and friends work. How and whether you can do that for a C package is non-uniform.

Mind you, the C standard library has a sort function, qsort(), which will have sensible big-O behavior on pretty much any platform. I suspect this specific problem is more to do with this being kernel-mode code, which has a lot of special conditions.


I am always amazed by arguments that say that not having a language package manager like cargo or pip makes it hard.

Really? Is it really considered hard to link a library without them? Am I so old that somehow I grew up in a world where linking a library was not considered black magic?


There is a lot more going on here.

The first issue is actually downloading the dependencies, doing this manually quickly becomes infeasible for any non-trivial project.

The second issue is keeping everything updated, and making sure that all packages are compatible with all other packages. Doing this manually is also not easy.

With C specifically, you need to wrangle different build systems, and once you have them built and "installed", you need to figure out which linker and compiler flags are needed to consume the libraries.

If you are working on a small project with maybe a few dependencies you can do this by hand, but when you get to say, 15 dependencies, it quickly becomes very difficult.

You can use the system package manager on some systems to install libraries I guess (assuming it has the packages and versions that you need), in this case manually managing things could be a lot easier, but you still should be using pkg-config for portability purposes.


Okay, but this is FreeBSD. All they have to do is import whatever code into src and use it.


But none of that supports the assertion that C makes it hard to use good libraries. You can even use libraries not written in C if you want.

If the argument is really "it's impossible to make a good library in C", that's different. I'd very much disagree with that, but it would be to the point.


I'm saying "it is harder to consume good libraries in C, because it is harder to find them & harder to build them; and once you have done both, you find that good library A and good library B work in very different ways, so you have to do more work to adapt".

And I haven't mentioned the lack of a strong universal string type, the way many libraries typedef their own family of different kinds of integer, the way one library will require you to free() a returned structure and another will require you to call my_library_free()...

It all adds up to additional friction.

You don't have to agree! Maybe I am out of date, I haven't really dealt with this since the mid 2000's. I'd be thrilled to hear this isn't an issue any more.


> You don't have to agree!

It's not really a matter of whether or not I agree. I was just trying to understand what the assertion was!

I was baffled by the notion because I couldn't think of anything inherent in the language that made it hard to use good libraries. Now I understand that's not really what the assertion was.


Although I usually rant about C, using libraries is surely not a problem specifically in the world of UNIX package managers.


The original assertion was about difficulty of using C libraries in the kernel or bootloader. In the bootloader you're the OS. There's no file system, no dynamic linker, and no processes. There's no guarantee some third party library will work in that environment. It might but it's work to make sure it does or adapt it if it doesn't.


Let's say you want to develop a CLI tool in C for crawling a website's sitemap.xml as advertised by the website's robots.txt. How would you approach this development in C?

With e.g. Java, Javascript, PHP, and Python it's clear to me.

With C, I don't know.


libcurl and expat (or one of a zillion other XML libraries)?
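A minimal libcurl sketch of the fetching half (URL and error handling simplified; the expat parsing would be fed from the same write callback):

  #include <stdio.h>
  #include <stdlib.h>
  #include <string.h>
  #include <curl/curl.h>

  struct buf { char *data; size_t len; };

  /* libcurl write callback: append each received chunk to a buffer */
  static size_t on_data(char *ptr, size_t size, size_t nmemb, void *userdata)
  {
      struct buf *b = userdata;
      size_t n = size * nmemb;
      char *p = realloc(b->data, b->len + n + 1);
      if (!p)
          return 0;                 /* tells libcurl to abort the transfer */
      b->data = p;
      memcpy(b->data + b->len, ptr, n);
      b->len += n;
      b->data[b->len] = '\0';
      return n;
  }

  int main(void)
  {
      struct buf body = { NULL, 0 };
      curl_global_init(CURL_GLOBAL_DEFAULT);
      CURL *h = curl_easy_init();
      if (!h)
          return 1;
      curl_easy_setopt(h, CURLOPT_URL, "https://example.com/robots.txt");
      curl_easy_setopt(h, CURLOPT_FOLLOWLOCATION, 1L);
      curl_easy_setopt(h, CURLOPT_WRITEFUNCTION, on_data);
      curl_easy_setopt(h, CURLOPT_WRITEDATA, &body);
      CURLcode rc = curl_easy_perform(h);
      if (rc == CURLE_OK)
          printf("%s", body.data ? body.data : "");
      curl_easy_cleanup(h);
      curl_global_cleanup();
      free(body.data);
      return rc == CURLE_OK ? 0 : 1;
  }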


> cargo for rust, pip for python, etc

GOOD! I love that C lacks this pollution. Means that code is written for-purpose and tuned for-purpose.


... except in this case (and many others), where it was written for-purpose and then never tuned.


Is the purpose opening up new vulnerabilities?


Custom memory allocation is pretty optional and a lot of the time could be handled with a single buffer.

Outside of that you don't need to deal with exceptions in a sorting library, and you can happily make it a single .c and .h


There's no standard build system. Think about how you add a dependency in Rust, Go, JavaScript, even Python.

Now do it in C/C++. Absolute nightmare.

Look at how many header-only C++ libraries there are out there. They're making compilation time drastically worse purely to avoid the need to faff with a shitty build system. It's a good trade-off too. I use header-only libraries where possible. They're convenient and popular.

Actually vcpkg seems to be going some way to fixing that but I have yet to convince my co-workers to use it.


> to avoid the need to faff with a shitty build system.

Then maybe don't use a shitty build system?

It's true, C is not trying to be a programming environment or tech stack. It's a language, that's it. Whether or not that's desirable depends on what you're trying to do, it's not something that is good or bad in some absolute sense.

You have your choice of build systems, so pick one that meets your needs.

Vcpkg isn't for me, either, because it doesn't solve any problem I have. If it does for you, awesome!


> You have your choice of build systems, so pick one that meets your needs.

And can I pick the one that my dependencies use too? Didn't think so.


Your dependencies have already been built. You're just linking to them. So yes, you can.


And what if the dependencies haven't been built yet?


Then build them. I'm not seeing the issue here, to be honest, so I'm not sure what I should be addressing.

If the issue is that you don't like how the dependency has arranged to build (I'm not sure why you'd actually care, but just in case...), then port the makefile (or whatever) to your preferred system.

Or, another guess, is the issue that you want to build all your dependencies as if they were an integral part of your own project? If that's the case, I would argue that you're doing it wrong and another tech stack would make you happier.


Build them and install them in /usr/local, just like we did 30 years ago.


With what? The whole point of this line of discussion is that building stuff in C is suboptimal, often requiring the end user to juggle n different build systems by hand to get all the dependencies installed.

Since we're talking FreeBSD here, thirty years ago we had Modula-3 whose build system is lightyears ahead of anything make based.


You already answered your own question: with whatever tools that dependency requires, often ./configure && make && make install

In the event you find yourself doing this constantly, write a shell script.


And now we're back at: this is why C is more difficult, for no good reason, than the alternatives. For small projects that you're only building locally, your approach may work. Once you expect any sort of cross-platform compatibility and/or start distributing your project to others, this approach falls apart and fast. Instead of focusing on development you'll be focusing on playing whack-a-mole with incompatibilities between the different shells (or versions of make, etc.) offered on different operating systems.

Again the problem is that no package/dependency management for C/C++ means that C is balkanized and more difficult to deal with compared to other languages. Using third party libraries in C is far more difficult and error prone than it ought to be.

Pretty much every language that provides more comprehensive dependency management also provides easy enough access to allow you to NIH/DIY it if you need/want to. For instance contrast this with something like rust where cargo is standard across all supported platforms and provides proper dependency management. You can still call the rust compiler (rustc) and whatever linker yourself if you so desire.


Honestly if you lack the vision to see why that sucks I really suggest you try a language with good package management - e.g. Go or Rust. Maybe it will make it easier to see all the problems with the traditional C approach if you can see how it should work.


C and C++ dependency installation is "sub-optimal" but it is certainly well understood. The fact that there are many cross platform, open source projects written in C and C++ proves this. If you can't deal with makefiles and scripts, maybe stick to Go and Rust?


  C and C++ dependency installation is "sub-optimal" but it is certainly well
  understood. The fact that there are many cross platform, open source projects
  written in C and C++ proves this.
Nope. It proves that people have found workarounds, up to and including cargo-culting things. Any large enough project is going to rely on something else to generate the makefiles, and if you depend on anything large enough you're gonna get stuck having to build and/or install whatever other makefile generators are required. Simply put, it's an archaic mess.

  If you can't deal with makefiles and scripts, maybe stick to Go and Rust?
To be clear, it's not that I can't deal with makefiles, it's that I don't find it a good use of my time*. Take, for instance, the FreeBSD ports tree. make(1) is its Achilles heel and the main reason it's obscenely slow to deal with (even compared to something like portage). Besides, doing anything by hand is inherently error prone, be it bounds-checking array access or cobbling together dependency management.

* And, sure, in years past I wrote a drop-in replacement for automake/autoconf in perl whose big (speed) advantage was that it didn't spawn a new shell for each check. It was a neat hack, but that's all anything papering over the make(1) interface is.


> If you can't deal with makefiles and scripts, maybe stick to Go and Rust?

Many people will in fact do just this.


C can’t be used to implement a good library for many problems due to being inexpressive. For example, you can’t write an efficient, generic vector data structure, but sort functions are also only fast due to the compiler’s smartness — passing a function pointer would be strictly slower.

Though this has not much relevance here as it is about assembly.


C arrays suck on the best of days and qsort requires that you understand function pointers and types which are some of the hairiest C syntax in common use. The C Clockwise Spiral rule is truly special.

It's easy to lose sight of the climb once you're at the top.


I'd suspect kernel devs know about function pointers.


I'd expect kernel devs to carry lots of bad habits and poorly calibrated intuition from their noob days. Example: "for loops are fine."


What’s wrong with for loops?


They encourage accidentally quadratic behavior if they are easier than calling sort().


Well, we could debate this, but it's all irrelevant to the assertion that C makes it hard to use good libraries.


Oh, right, I forgot that C's library/package management situation sucks so hard that makes the awful syntax look like a comparatively small problem.


This actually made me laugh out loud. Yes, if your problem with C is that it doesn't need a package management mechanism like some other languages, then C is clearly not for you. But C is very far from the only language like this.

It's a bit like criticizing a fish for having no legs.


> if your problem with C is that it doesn't need a package management mechanism like some other languages

The problem isn't that it doesn't need one, it's that it doesn't have one. I have no idea why you would think that it doesn't need one.

Well there is vcpkg now anyway so it finally does have one.


I sense that we have some sort of real miscommunication going on here, because the only response I can think of to

> I have no idea why you would think that it doesn't need one.

Is that I have no idea why anyone would think that it does need one.

Perhaps the disconnect is that you are wishing C addresses different use cases than it addresses? That you wish it were a different language? If so, that's fine. Use a more appropriate language for your task. I just find it odd if the criticism of C is that it isn't a different kind of language.


> Is that I have no idea why anyone would think that it does need one.

For the same reason any language needs one. What is it about C that you think excludes it from the basic requirement of "using third party libraries"?


C package management works great on my system.

  # dnf install foo-devel


That's suboptimal for many reasons. Package names are not consistent. Installing multiple versions is usually impossible. Huge pain for library authors. Doesn't integrate with build systems usually (even basic pkg-config support is iffy). The command is OS-specific. You can't usually choose the linking method. Difficult to bundle dependencies. Usually out of date. Etc. etc.


it works quite well in practice, and unlike pip things don't break every 2 weeks.


A lot of these comments are like "C is hard if you don't know C and it scares you".

Not to mention kernel mode doesn't want a gigantic library package manager to pull in leftpad() from the internet. As mentioned, the kernel libraries on FreeBSD have a qsort, but they didn't in the original commit from 3 decades ago or whatever.


> This makes no sense to me. What about C makes it hard to use a good library?

It doesn't have templates/generics.


The lack of namespaces is of far greater consequence.


LIBNAME_actual_function_name()


Annoyance #1: External symbols are expected to be unique within 31 characters according to the standard, so you're limited to a few namespace "levels", at most.

curl_easy_option_by_name() is already up to 24 characters and there's only two "levels" in curl_easy_*.

Annoyance #2: There's no formal registrar for LIBNAME. This isn't a big deal for popular libraries, but it's a pain having to keep a locally-modified copy of a less popular dependency just because it shares its name with another less popular dependency.

Annoyance #3: LIBNAME_actual_function_name() is a pain to read and using either the preprocessor or static inlines to locally alias function names for the sake of readability is silly.


@1: The limits are considered obsolete since C89 and implementations are encouraged to avoid them whenever possible. I think the same way we disregard non ASCII character sets, non two's complement encodings and similar, we are safe in assuming sane implementation being able to handle longer names. And being honest, if given implementation isn't capable of more than 31 significant characters, then it would have problems with namespaces too.

@2: Agree, although I don't recall this ever happening to me.

@3: Is it? How is libname::actual_function_name() much better?

I actually like to use libname__actual_function_name(), as it further separates "namespace" from function name (unless we need compatibility with C++, as IIRC it reserves all double underscores, not only at the beginning).


> @1: The limits are considered obsolete since C89 and implementations are encouraged to avoid them whenever possible.

This is still the case in C11, Section 5.2.4.1. Did this change in the most recent standard?

> @2: Agree, although I don't recall this ever happening to me.

It happened to me once. I ran across a library from 1994 and another from the 2010s which shared a simple name like "libamc". I'll comb through my records later to figure out the actual name.

> @3: Is it? How is libname::actual_function_name() much better?

It's not, but I wasn't thinking of C++ specifically. (I don't know C++. I've somehow managed to avoid it in many years of writing C.)

I was thinking more like the file-local namespace clobbering offered by Python e.g., from LIBNAME import actual_function_name.


> This is still the case in C11, Section 5.2.4.1. Did this change in the most recent standard?

Well, no, it's still marked "just" obsolete. For it to be deprecated or removed there would need to be anybody caring about it enough to put some work. But since it doesn't affect vendors at all (they can just ignore) and users don't complain, it's just a forgotten "law" - still law, but a dead one.

> I was thinking more like the file-local namespace clobbering offered by Python e.g., from LIBNAME import actual_function_name.

Oh, that's... way more than just namespaces. Way more. That would require more fundamental changes and additions. C++ just added modules in C++20 (not sure how well those will catch on), but I don't think something like that is to be expected in C for feasible future.


C11 added (limited) generic support


They used "_Generic" as a keyword but it doesn't really do that.

Suppose I need to define a copy assignment operator for the library's sort function to use. Is there a good way to overload it? Can the library know what the size of each element is based on its type without having to pass it in as a parameter?

You can pass function pointers to the library, but that quickly becomes awful.

  /* sort for basic integer types */
  void lib_sort_u8(uint8_t *a, size_t count);
  void lib_sort_i8(int8_t *a, size_t count);
  void lib_sort_u16(uint16_t *a, size_t count);
  void lib_sort_i16(int16_t *a, size_t count);
  void lib_sort_u32(uint32_t *a, size_t count);
  void lib_sort_i32(int32_t *a, size_t count);
  void lib_sort_u64(uint64_t *a, size_t count);
  void lib_sort_i64(int64_t *a, size_t count);
  /* specify a comparator */
  void lib_sort_comp(void *start, size_t nmemb, size_t size, int (*compar)(const void *, const void *));
  /* specify a comparator and an assignment operator */
  void lib_sort_assign_comp(void *start, size_t nmemb, size_t size, void (*assign)(void *, const void *), int (*compar)(const void *, const void *));
  /* specify a comparator, an assignment operator, and ... */
Or, you get one function that takes all of the arguments and have to define and pass in a bunch of function pointers and type size parameters that are each an opportunity for bugs or UB in order to sort a simple array of integers.

If my type needs a custom assignment operator, I need each library I use to take that as an argument. One expects the function pointer to take the arguments in the order (src, dst), another as (dst, src), a third specifies the return value as int instead of void, a fourth takes the source argument as "void *" instead of "const void *" in case you want to implement move semantics and a fifth doesn't support custom assignment operators at all.

It's no surprise that people prefer to avoid this.
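What C11's _Generic does buy you is a single call site that dispatches on the pointer type; a sketch reusing the hypothetical per-type functions from the list above (the lib_sort_* names are from that example, not a real library):

  #include <stdint.h>

  /* Dispatch lib_sort(a, n) to the matching per-type function at
     compile time. Every lib_sort_* still has to be written by hand. */
  #define lib_sort(a, n) _Generic((a),        \
          int32_t *:  lib_sort_i32,           \
          uint32_t *: lib_sort_u32,           \
          int64_t *:  lib_sort_i64,           \
          uint64_t *: lib_sort_u64            \
      )((a), (n))

  /* usage: int32_t v[16]; ... lib_sort(v, 16); */

It hides the name explosion at the call site, but it doesn't generate the implementations, which is the part templates/generics actually solve.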


You can't specialize a generic algorithm for specific operations and data structures; in C++ terms, it gives you overloading, not templates.


It gives generic programming. C++ templates aren't the only possible implementation.


So where is a generic vector data structure written in plain C that is efficient (that is, doesn’t store pointers to elements).


Slight sarcasm: yes, too hard.

http://man.openbsd.org/qsort

Having said that, this specific sort is somewhere deep in kernel boot land, and kernel code can't really use the standard library. I am not sure if there is a standard kernel sort.


Of course it can. Just extract the qsort implementation to a static library pulled in by the kernel and by libc.


downvote me if you want but this is probably the best argument for rust I've ever seen


[flagged]


in what way did I misunderstand?

> C makes it too hard to use a sorting library that someone competent has written.

rust makes it easy to create generic data structures via generics. It would be trivial to swap between different underlying sorting algos as long as it was given a type compatible function for how to compare the contained items.

what exactly am I not understanding here?


The fact that this was a developer's mistake and has nothing to do with the language.


So a few milliseconds in total. Big whoop. OpenBSD's ahci(4) driver stalls 5-6 seconds for each SATA device on the bus, just to make sure the device really is there and that the link speed is correct...or something. My two OpenBSD machines, which incidentally have 3 SATA drives each, spend almost 20 seconds on that one segment alone during kernel startup.


Colin (and others?) have been working to reduce the boot time of FreeBSD for a while now. At this point shaving off a few milliseconds is probably a nice win for them.

https://twitter.com/cperciva/status/1659391725427720195?t=0y...


I mean, I get it. But shaving 4.5ms does seem to fall into the realm of not most people's problems?

Note that I want to stress that that is no reason for some folks to stop focusing on it. And kudos on the improvement. Criticism, if any, here is that statistics once again rears its head to make something feel bigger by using a percent.


> I mean, I get it. But shaving 4.5ms does seem to fall into the realm of not most peoples problems?

If whoever is fixing this depends on quickly launching instances in Firecracker, it's their problem. And that's how open source is usually done.


Sorta. They have to rely on this on a repeated basis such that it matters. And there, if startup time is still 10x this, roughly, why push for solutions that require that many kernel startups?


Why do anything, we're all going to die!!

Super fast boots of containers is a good effort, would be cool to be as fast as spawning a process or a thread!


That is silly. I already said I do not intend this as a criticism of the folks working on it. Kudos on the improvement.

That said... for most people, you are better off learning to use threads or processes. It is almost certainly less effort than spinning up a VM. Not to mention everything else you will, by necessity, have to do if you are using VMs for everything. Unless you find ways to do shared network connections across VMs and such, but... I'm curious how we can make things closer to threads, without eventually bringing all of the dangers of threads along with it?


Why make "for most people" points when we're already talking about FreeBSD?

There's a sort of uni-/light-kernel race quietly happening right now since the industry doesn't really fully trust process isolation within an OS for running completely untrusted code. At least not as much as it trusts VT-X/AMD-V. In that race, kernel startup time is just like JVM startup time, or Julia startup time, or Python startup time, all of which are things that people work on shaving milliseconds from.


Ouch. That feels even harsher to the idea than what I'm saying. :D

To your point, my "for most people" is aimed at this forum, not the folks doing the work. It is incredibly cool that the folks doing this are also on the forum, but I think it is safe to say they are not most of the folks on here. Is that not the case?


Say the post had been "How Netflix uses FreeBSD to achieve network latencies less than 10 microseconds" or something. How helpful would it be to comment about how "for most people" this doesn't matter?

Did anybody who read the tweet think it mattered to them when it didn't? It mentions Firecracker explicitly. How many people on HN do you think upvote this because they themselves are running FreeBSD on Firecracker, versus the number who upvoted because it's just interesting in and of itself?


Maybe? Having worked in way too many teams that were convinced we had to aim at some crazy tech because someone else saw "50% increase in throughput" on some tech stack; I am happy to be the one saying to keep perspective.

Though, I will note that you picked a framing with absolute timing here. That is the root criticism here. If the headline had been "team saved 2ms off a 28ms startup routine," it would still be neat and worth talking about. It would probably not have impressed as many folks, though. After all, if they save 1% off training time on a standard ML workload, that is way, way bigger. Heck, 0.07% is probably more time.

I'm reminded of a fun discussion from a gamedev, I think it was Jonathan Blow, on how they thought they were smarter than the ID folks because they found that they were not using hashes on some asset storage. He mused that whether or not he was smarter was not at all relevant to that, as it just didn't matter. And you could even make some argument that not building the extra stuff to make sure the faster way worked was the wiser choice.


If we didn't have you here we might have done the terrible mistake of thinking this was interesting. I'm so glad you were here to show us how useless this effort is, protecting us from wasting our time, protecting us from our lack of judgment. Thank you so much for your service and foresight.


Isn't fast startup the point of firecracker? It's supposed to scale down to zero when there are no users, and quickly ramp up when needed.


I mean... sorta? Yes, it is the point. But, the goal is to make it so that you can do this scale up/down at the VM level. We have been able to do this at a thread or even process level for a long long time, at this point.


why not boot a good os in firecracker instead?


4.5ms on what hardware, in what scenario? Would I like to save 5ms off the startup time of my VMs? You betcha. Does that 5ms turn into 200ms on a low power device? Probably


How would you even discern this delay from all the other delays happening before you're done booting, when there are so many other natural variances of a couple of milliseconds here and there? Every "on metal" boot is unique, no two are exactly as fast, and certainly never within 1.98 milliseconds of eachother even in the case of a VM with lots of spare resources on the host. You're painting a too pretty picture of this.


I'm not even clear that I care on my VMs. I don't start up enough for that to save me more than... a second a month or so? Maybe. Probably not even.


Great! Then this doesn't affect you!


Right, but again pointing this at most people on this forum, that answer is probably the same. Very few of us are in a situation where this can save seconds a year, much less seconds a month.

For the folks this could potentially save time, I'm curious if it is better than taking an alternative route. Would be delighted to see an analysis showing impact.

And again, kudos to the folks for possibly speeding this up. I'm assuming qsort or some such will be faster, but that isn't a given. For that matter, if it is being sorted so that searches are faster later, then the number of searches should be considered, as you could probably switch out to sentinel scanning to save the sort time, and then you are down to how many scans you are doing times the number of elements you are looking against.


My math says ~2ms (28ms*0.07 = 1.96ms)? Still, if it mattered to get it down to 28ms, it might matter to get it down to 26ms...


Agreed, but my math also says that if this is your pain point, you'd save even more time by not firing off more VMs? That is, skip the startup time entirely by just not doing it that often.


I don't startup VMs multiple times per hour, much less per minute, so I don't assume to know what tradeoffs the people using firecracker are making when deciding how often to startup VMs.


Fair. I still feel fine pushing back on this. The amount of other resources getting wasted to support this sort of workload is kind of easy to imagine.

I will put back the perspective that they were not hunting for 2ms increases. They were originally chopping off at seconds on startup. The progress they made is laudable. And how they found it and acted on it is great.


if 4.5ms is 7%, the entire boot time is 64ms?


Ha, I thought I saw 4.5ms from another post. Not sure where I got that number. :(

Realizing this is why someone said the number was more like 2ms. I don't think that changes much of my view here. Again, kudos on making it faster, if they do.


Colin's focus has been on speeding up EC2 boot time. You pay per second for time on EC2. A few milliseconds at scale probably adds up to a decent amount of savings - it's easily imaginable that it's enough to cover the work it took to find the time savings.


Yes, you pay per second. But, per other notes, this is slated to save at most 2ms. Realistically, it won't save that much, as they are still doing a sort, not completely skipping it, but let's assume qsort does get this down to 0 and they somehow saved it all. That is trivially 500 boots before you see a second. Which... is a lot of reboots. If you have processes that are spawning off 500 instances, you would almost certainly see better savings by condensing those down to fewer instances.

So, realistically, assuming that Colin is a decently paid software engineer, I find it doubtable the savings from this particular change will ever add up to mean anything even close to what it costs to have just one person assigned to it.

Now, every other change they found up to this point may help sway that needle, but at this point, though 7% is a lot of percent, they are well into the area where savings they are finding are very unlikely to ever be worth finding.

Edit: I saw that qsort does indeed get it into basically zero at this range of discussion. I'd love to see more of the concrete numbers.


Missing the point, which is to shave off cold start times for things like Lambdas, where shorter start times means you can more aggressively shut them down when idle, which means you can pack more onto the same machine.


See, the binpacking that this implies for lambda instances is insane to me. Certainly sounds impressive and great, and they may even pull it off. And, certainly, for the Lambda team, this at least sounds like it makes sense. I'll even root for them to hope it works out.

It just reminds me of several failed attempts to move to JVM based technologies to replace perl, because "just think of the benefits if we get JVM startup so that each request is its own JVM?" They even had papers showing resource advantages that you would gain. Turned out, many times, that competing against how cheap perl processes were was not something you wanted to do. At least, not if you wanted to succeed.


A realization here is that while reaching cold start times that allow for individual requests would be awesome, you don't need to reach that to benefit from attacking the cold start time:

Every millisecond you shave off means you can afford to run closer to max capacity before scaling up while maintaining whatever you see as the acceptable level of risk that users will sometimes wait.

Of course if the customer stack is slow to start, the lower bound you can get to might still be awful, but you can only address the part of the stack you control.


I think your argument is effectively "Colin's time could be better spent". That might be true, but consider that:

1. He was already profiling the system, which is almost certainly where the bulk of his eng costs are

2. This is one piece of a larger optimization story, he has made many many changes to significantly improve performance

3. This is a trivial change

4. 500 Firecracker executions is nothing - FaaS targets 10s of thousands of executions per second

Given (1) and (3) in particular the only sensible thing Colin could do other than fix this is ignore it, which seems insane.


I think my argument is easily seen as this, but that is not the intent of my "argument." As you say, and if I'm not mistaken, in this case they only "cared" about this because it was easy and "in the way," as it were. I don't think it was a waste of time to pick up the penny in the road as you are already there and walking.

What I am talking towards, almost certainly needlessly, is that it would be a waste of most people's time to profile anything that is already down to ms timing in the hopes of finding a speedup. In this case, it sounds like they were basically doing final touches on some speedups they already did and sanity testing/questioning the results. In doing so, they saw a few last changes they can make.

So, to summarize, I do not at all intend this as a criticism of this particular change. Kudos to the team for all the work they did to get so that this was 7% of the remaining time, and might as well pick up the extra gains while there.


I've spent plenty of time optimizing for milliseconds (for very high hourly rates, not on staff, so my work has a very clear cost and ROI) and I operate probably at least a dozen layers above the OS.

Modern CPUs execute a few million instructions per millisecond.

I think people who operate at a level where microsecond and nanoseconds matter may see it as dismissive to question gains of milliseconds.


I'm sure that happens. And I'm sure they do. I'm also equally sure in thinking that the vast majority of the interactions I have, are with people that are not doing this.

You can think of my "argument" here, as reminding people that shaving 200grams off your bicycle is almost certainly not worth it for anyone that is casually reading this forum. Yes, there are people for whom that is not the case. At large, that number of people are countable.

And I can't remember if it was this thread or another, but I didn't really intend an argument. Just a conversation. I thought I lead off with a kudos on the improvement. Criticism would be to the headline, if there is any real criticism to care about.


I can't really agree. I think it's a sad indictment of this field that we so easily will throw away significant efficiencies - even if it's only a few milliseconds here or there.

Software today is shit. It's overly expensive, wastes tons of time and energy, and it's buggy as hell. The reason that it's cheaper to do things this way is because, unlike other engineering fields, software has no liability - we externalize 99.99% of our fuckups as a field.

Software's slow? The bar is so low users won't notice.

Software crashes a lot? Again, the bar is so low users will put up with almost anything.

Software's unsafe garbage and there's a breach? Well it's the users whose data was stolen, the company isn't liable for a thing.

That's to say that if we're going to look at this situation in the abstract, which I think you're doing (since my initial interpretation was in the concrete), then I'm going to say that, abstractly, this field has a depressingly low bar for itself when we throw away the kind of quality that we do.


But... this is precisely not a significant efficiency. At best, you can contrive an architecture where it is one. But, those are almost certainly aspirational at best, and come with their own host of problems that are now harder to reason about, as we are throwing out all of the gains we had to get here.

I agree with you, largely, in the abstract. But I'm failing to see how these fall into that? By definition, small ms optimizations of system startup are... small ms optimizations of system startup. Laudable if you can do them when and where you can. But this is like trying to save your way to a larger bank account. At large, you do that by making more, not saving more.


> this is precisely not a significant efficiency.

A 7% improvement from a trivial change is an insane thing to question, honestly. It is absolutely significant. Whether it is valuable is a judgment, but I believe that software is of higher value (and that as a field we should strive to produce higher quality software) when it is designed to be efficient.

> At best, you can contrive an architecture

FaaS is hardly contrived and people have been shaving milliseconds off of AWS Lambda cold starts since it was released.

> But I'm failing to see how these fall into that?

Like I said, 7% is a significant win.

I like to quote Knuth on this,

https://dl.acm.org/doi/pdf/10.1145/356635.356640

> The improvement in speed from Example 2 to Example 2a is only about 12%, and many people would pronounce that insignificant. The conventional wisdom shared by many of today's software engineers calls for ignoring efficiency in the small; but I believe this is simply an overreaction to the abuses they see being practiced by pennywise-and-pound-foolish programmers, who can't debug or maintain their "optimized" programs.

> In established engineering disciplines a 12 % improvement, easily obtained, is never considered marginal; and I believe the same viewpoint should prevail in software engineering


You have two framings here that I realize are not mine.

First, I am not arguing to not make the change. It was identified, make it. As you say, it is a trivial one to do. Do it.

Second, thinking of percentage improvements has to be done on the total time. Otherwise, why is it not written in much more tightly tuned assembly? After all, I am willing to wager that they still don't have the code running within 7% of the absolute optimum speed it could reach. Heck, this is a program that is post-processing a list. They already have a way that could get the 40us completely gone. Why stop there?

If I am to argue that anything here would have been a "waste" and shouldn't have been done, it would be the search for a 2ms improvement in startup. But, again, that isn't what they were doing. They were shaving off seconds from startup and happened to find this 2ms improvement. It made headlines because people love pointing out poor algorithmic choice. And it is fun to muse on.

This would be like framing your wealth as just the money you have in your wallet as you go to the store. On the walk, you happen to see a $5 bill on the ground. You only have $20 with you, so picking it up is a huge % increase in your wealth. Of course you should pick it up. I'd argue that, absolutely considering it, there is really no reason to ever not pick it up. My "framing" is that going around looking for $5 bills to pick up from the ground is a waste of most people's time. (If you'd rather, you can use gold mining. There was a great story on the radio recently about people who still pan for gold. It isn't a completely barren endeavor, but it is pretty close.)


Look, from a utilitarian perspective, let's say we want to optimize the total amount of time globally.

Let's say FreeBSD dev spends 2 hours fixing a problem that gives a 2ms improvement to the boot time.

Let's assume conservatively that FreeBSD gets booted 1000 times every day globally. That's 2 seconds per day. Scale that to 10 years and you break even.

Nobody cares about "your bicycle" or your internal CRUD app that has maybe 30 users, but FreeBSD is widely used and any minor improvements on it can be presumed to be worth the time fixing.

Maybe you have an axe to grind on the topic of premature optimization, but this FreeBSD fix doesn't seem to be the showcase of developers wasting their time doing meaningless optimizations.


This is silly, though. For many reasons. First, I have no axe to grind. They found a trivial change that can be made and made it. Probably even simplified the code, as they used a better sort that was introduced into the codebase later. Kudos on that!

Second, it also clearly took them more than 2 hours. They've been working at making it faster for many days at this point. So, it will take them a long long time to realize this particular gain.

Finally, my only two "criticisms" this entire time were on calling this a 7% improvement and claiming that searching for this would be a waste of most teams' time. Consider: from that headline, most folks would assume that they can get their EC2 instance that is running FreeBSD 7% faster. But they clearly won't be able to. They won't even get a lambda invocation 7% faster. Nothing anyone anywhere will ever see will ever be 7%. They may see something 2ms faster, and that is, itself, awesome. And any claim that this would be a waste of time to search for is irrelevant to this team. They weren't looking for this particular change to make, of course, so this "criticism" isn't relevant to them. At all.


Just to be clear, I mean 'enough saved across all pay-per-second ec2 users', not a given specific account or user, which, yeah, it's probably minimal. The scale of lambda and ec2 is...enough to make any small change a very large number in aggregate.


Right, but this is akin to summing up all of the time saved by every typist learning to type an extra 5 words a minute. Certainly you can frame this in such a way that it is impressive, but will anyone notice?


Haha, there's a thread here on hacker news with tons of comments, so, yeah, somebody noticed! :)


I think the focus here was for firecracker VMs used by lambda? If you're paying 2ms on every invocation of every function that'll add up. OTOH it seems like SnapStart is a more general fix.


If you boot OpenBSD thousands of times per day it adds up. I can imagine this being the case for people running big server farms or doing virtualization stuff.


These are servers and I reboot them only when necessary. It's a pet peeve, but a big one.


In my experience, the time to boot everything else on a server far exceeds any OS shenanigans.

I had servers that took up to 5 minutes from start/reboot to the GRUB prompt. And they were not some 60-drive monsters but typical dual-CPU 1U servers...


At 7% of 28ms, if you boot it a thousand times per day you get a whopping 2 seconds.


no, if your SATA/SAS stalls for 20s you get 20s


Well, FreeBSD has other issues too. I had to re-cable all of my USB devices because booting was taking forever (Linux and Windows booted a lot faster).


Huh? Were they resetting under FreeBSD or something?


I had it set up like this: some devices plugged into the PC directly, a KVM plugged into the PC, a hub plugged into the PC, the hub on the monitor plugged into a hub, and a bunch of things plugged into the hubs.

Half of the devices were just powered by USB and not using any data, but only needed to be ON when the PC is ON.

The way it worked:

- FreeBSD would scan all USB devices

- get held up on some for no reason

- find a hub

- scan everything on a hub

- find another hub

- scan everything on that hub as well

At any point during the scan it could run into a device that took longer (no clue which or why; it only happened on FreeBSD and never on Linux or Windows).

I'm sure Windows and Linux did the same thing, but both simply continued booting while FreeBSD waited and waited and waited and waited.

Bias light on a monitor would sometimes completely stop the process until unplugged.

USB stack on FreeBSD is forever cursed, I still remember kernel panics when you unplug a keyboard or a mouse.


> I'm sure Windows and Linux did the same thing, but both simply continued booting while FreeBSD waited and waited and waited and waited.

My understanding is FreeBSD will by default wait until it's found all the fixed disks before it figures out which one to mount as root. This could be a USB drive, so it needs to enumerate all the USB first. Adding hw.usb.no_boot_wait=1 to /boot/loader.conf.local or thereabouts if you don't need that will skip that wait and speed up the boot process a lot (especially on hardware where USB is slow to finish enumeration).
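For reference, it's a loader tunable, so (syntax from memory; check loader.conf(5)) it would look something like:

    # /boot/loader.conf.local
    hw.usb.no_boot_wait="1"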


Huh, didn't know about that option. Yeah, by default, FreeBSD fully initializes before surrendering to the user.


IIRC most linux distros just get what they need to boot from initrd/fstab and only complain if something from there is missing


...wat? How could something that bad linger for so long?


It adds 2ms to the boot time on modern hardware, and the list it is sorting has grown by about 2 orders of magnitude since it was introduced. Seems reasonable to me that it was unnoticed.


I think the comment you are replying to was referring to the openbsd issue that apparently adds 10-20secs per boot in the GP comment.


I am currently at work fighting the same class of problem. A bad pattern was used a few times and seemed to work. So it was used twenty times, still no problems. Now it's used 100 times and it accounted for 20% of response time. Got that down to half in a few months; now it's a long slog to halve it again. It should never have been allowed to get past 3% of request startup time.


Because none of the people afflicted have done anything about it.


Likely because none have the required proficiency. I certainly don't. "So? It's open source. Stop complaining and fix it." is never a good response to a bug report. I know this quirky behavior has been brought up a few times by other users than me on the openbsd-bugs@ mailing list during the 10 or so years since I first observed it.


How do the other OS's get around this?


I don't know, but I don't think there's an actual case that Linux/FreeBSD/etc. happens to have a workaround for. I think it's just OpenBSD charging towards a windmill.

It's not impossible that it's some archaic, long-since-invalid legacy thing along the lines of "let's wait 5 seconds here just in case the drive hasn't managed to spin up, despite the computer having been powered on for 15 seconds already, because that happened once in 2003 on someone's 1989 HPPA system so we need to support that case". I'm not joking, this is really what I imagine it being about.


That's believable and maybe even reasonable.

There were (or still are?) SCSI drives that would not spin up until told to over the bus. I think the idea is to not do them all at once since the motor draws a lot of power.

I'm fairly sure I've run into drives that would not respond on the bus while spinning up.

And if it happens with SCSI drives, it may happen with other types.


>"There were (or still are?) SCSI drives that would not spin up until told to over the bus"

Surely they'd spin up as soon as the BIOS started probing the bus to init the hardware, long before the kernel was running, or they'd be infamous for being "the HDDs you cannot boot an OS from"...

In my 25 years of using PCs I have not once come across a drive that did not spin up as soon as the computer was powered on. But whatever the case is, Linux and FreeBSD never had this behavior. Waiting some arbitrary amount of time isn't an appropriate solution (to what I insist is just an imagined problem), it's just a poor bandaid.


Spinning up only upon request is common behavior for SCSI drives. Per the Ultrastar DC HC550 manual I have handy,

> After power on, the drive enters the Active Wait state. The Drive will not spin up its spindle motor after power on until it receives a NOTIFY (Enable Spinup) primitive on either port to enter the Active state. If a NOTIFY (Enable Spinup) primitive is received prior to receiving a StartStop Unit command with the Start bit set to one, spin up will begin immediately. For SAS, this is analogous to auto-spinup function in legacy SCSI. This provision allows the system to control the power spikes typically incurred with multiple drives powering on (and spinning up) simultaneously.

> If a StartStop command with the Start bit set to one is received prior to receiving a NOTIFY (Enable Spinup), the drive will not start its spindle motor until Notify (Enable Spinup) is received on either port. Successful receipt of a NOTIFY (Enable Spinup) is a prerequisite to spin up.

Code in the SCSI controller boot ROM, if enabled, does typically handle this before the OS starts, often with an option to stagger spin-up for the reason mentioned above.

For what it's worth, to speed up boot, I typically disable the OPROM on any storage controller not hosting a boot device.

Moreover, as BIOS, UEFI, Power Mac, …, all require different, incompatible firmware bits, enabling controller firmware isn't an option for many otherwise compatible controllers.

Regardless, possible spin up delays in no way justify introducing explicit delays in device enumeration; if the OS needs to ensure a drive is spun up, multiple mechanisms exist to allow it to do so without introducing artificial delays (TEST UNIT READY, non-immediate START STOP UNIT, simply paying attention to additional sense information provided by commands known to fail before spin up and retrying as appropriate, etc.).

IIRC, explicit multi-second delays between controller initialization and device enumeration were traditionally introduced because some (broken) devices ignored commands entirely — and were therefore effectively invisible — for a significant period of time after a bus reset.

With that said, introducing a multi-second delay to handle rare broken devices is sufficiently strange default behavior that I assume something else I'm not aware of is going on.


> Surely they'd spin up as soon as the BIOS started probing the bus to init the hardware, long before the kernel was running, or they'd be infamous for being "the HDDs you cannot boot an OS from"...

Depends on the market they sell into, might not be a big deal for enterprise drives that you don't expect to boot from; especially if you're attaching them to a controller that you also don't expect to boot from. I had some giant quad processor pentium II beast after it was retired, and the BIOS could not boot from the fancy drive array at all; and even if it could, the drives were on a staggered startup system, so it would have had to wait forever for that.


Contrived case. Such a storage solution would still not be something that ahci(4) on OpenBSD - a basic SATA controller driver - could wrangle, no matter how many seconds it patiently waited.


Could be anything, but I wouldn't be surprised if it were some other bug leading the driver to conclude there are 65535 devices attached, and it needs to probe each of them.


Well, for one, I'd imagine you could do all ports at once. Or start doing it and allow the rest of the drivers to initialize their stuff while it is waiting.

At least looking at my dmesg on Linux, all SATA drives get ready at roughly the same time: 0.6 s into boot the AHCI driver gets initialized, and 3.2 seconds into boot they are all up. It looks parallel too, as the ordering does not match the log order:

    [    3.192274] sd 6:0:0:0: [sdc] 500118192 512-byte logical blocks: (256 GB/238 GiB)
    [    3.192277] sd 7:0:0:0: [sdb] 3907029168 512-byte logical blocks: (2.00 TB/1.82 TiB)
    [    3.192277] sd 8:0:0:0: [sda] 5860533168 512-byte logical blocks: (3.00 TB/2.73 TiB)


The OS should save the serial numbers of all devices and, once they are all found, continue. We should make it declarative, or whatever you want to call it.

Also, why can't the freaking boot process be optimized after the first boot? BSD is essentially sorting the SAME thing every boot, THAT is ridiculous. Sort once, save the order (a list, sysinit000000000000), boot fast next time. A hardware or other change, or a failed boot, can trigger the sorting bull again for safety.

You know what you're booting into, so sort it once and then run from the saved order next time. How many times do you change the hardware in a computer? And if you do, you can just restart with a GRUB flag, a toggle switch in the control panel before restart, etc.


The code to manage "do I sort or do I cache" is probably worse than the code to just sort it unconditionally.

And you really want to do this automatically, with no manual anything, because (among other reasons) if you need to swap a part on a broken machine, you do not want that to needlessly break the software. So you're sorting regardless, and so you might as well just be unconditional about it.


On my machine the TPM is checking the state of the machine for security reasons, so if this runs ANYWAY, then why not use it for one more useful thing... (it runs before the machine even thinks about booting anything)

You can define a shutdown flag for hardware changes: shutdown -rF XD.

Some things are hot-swappable, and for those there has to be logic (in the kernel) to know you have new hardware...

If you have a total hardware failure, that can be detected too, on the next boot...


That's a nice way to add fuckloads of useless code while wasting memory and turning a now-read-only process into a read-write one (the kernel has no place to write a cache before userspace initializes and sets up mounts and filesystems).

...vs just putting quicksort in


Windows XP had a tool called BootVis[1] that would try to optimize the startup sequence. Supposedly newer versions of windows do this sort of thing automatically.

I suspect much of the delay comes from the nature of plug-and-play. The OS has to poll every bus and every port to find every device, then query those devices to determine what they are, then load a driver and actually configure them. It simply can't rely on the hardware being exactly the same from one boot to the next.

[1] https://en.wikipedia.org/wiki/BootVis


Phones take a long time to boot (12 seconds on an iPhone SE) even though their hardware stays the same.


> toggle switch in control panel before restart

Ah yes, let's brick the OS when hardware fails and has to be replaced without advance notice.


I'm not saying remove all the existing code.

I'm saying you can have two pathways (the decision of which route to take is almost zero cost):

First: when you boot correctly, you can save the information that you booted correctly.

Second: if you do not find information about a safe boot, you go the old route, the quicksort route.

(YES, you can do that in a way that doesn't produce boot loops from reading that the boot is ok even when it is not. Just don't be lazy like the Linux kernel developers were when I was telling them this exact thing a few years back.)


But why? You have a very manageable number of devices; O(n log n) over that on a modern processor is essentially zero time.


Based on the kernel boot time (28 ms) given by the author, this process takes 1.97 milliseconds.

I'd consider this less "haha dum code is slow" and more "you are booting the kernel way too much, even a 1.97 millisecond delay is noticeable".


I think the “Firecracker” referenced in this Tweet is Amazon's “microVM” environment[0]. VMs running in this environment are sometimes booted to service scheduled/batch jobs, but are, I think, more often launched in direct response to (and with the intent of serving) an incoming HTTP request (and then stick around for a few minutes afterwards in case there are any subsequent requests). Network latency from the continental US to the nearest AWS datacentre is typically <30ms, so even after you add the response time of the application running in the VM, an extra 2ms boot time still probably makes up >2% of the total cold response time.

[0]: https://firecracker-microvm.github.io/


So wouldn't this fall under the category of "booting the kernel too much"? Why not find a way to keep already-booted kernels laying around and recycle them, rather than rebooting them afresh on every request? This pattern has already played out in other situations and been solved this way, such as with object pools or thread pools.


All that sounds like it adds much more complexity than swapping out bubble sort for qsort, and the other various optimizations that have been going in to reduce boot time. Plus optimizing common code improves boot time for everyone running FreeBSD - routers, firewalls, etc, whereas pooling just helps that one use case. If you can make it start up fast enough that you don't need any of that extra machinery, why wouldn't you?


This is the "AWS Lambda cold start" use case: you're not doing this every time, but you are doing it on scale-up, initial deploy, etc. Depending on complexity, those costs start to hit your p99 as well as your p100.


Yes, Firecracker supports snapshotting for that reason. But you still have the famous "cold start" lag. Not to mention that, for some use cases, you'll potentially want to add restarts for security reasons (Firecracker is designed to be able to execute malicious code).


The problem here isn't the boot time, but the booting at all!!


Having a microVM handle a rare endpoint and then exit is often cheaper than leaving an EC2 instance running constantly. An EC2 instance also has extra administration needs. At some scales the savings can be significant.

Shaving ms off the boot of that VM is a performance boost of that lambda (or whatever the cloud vendor calls it).


Why does anything have to boot? What does booting do?


Maybe you're calling the serverless function from another serverless function in the same datacenter (or more simply, maybe you're instantiating a local firecracker VM to run some user-defined function within the context of your currently executing program), in which case the latency would be more on the order of 5-10ms.


This will also make a lot of difference once we can run lightweight, userspace kernels ("rump kernels").


> once we can run lightweight, userspace kernels ("rump kernels").

Er, didn't NetBSD accomplish that... like a decade ago, now? Or is FreeBSD looking into supporting rump kernels?


Yes, NetBSD did accomplish that, but I expect it to become more commonplace.


But the question is: is this a solution for people running software on other people's computers, or a solution for running my own code on my own computer only?


Does it matter? Running kernel pieces in userspace is useful in both directions (rather like docker), and in both cases we'd like it to run as quickly as possible.


For my particular case it's about something similar in purpose to sandboxing, but with providing the compartment (ie a process subtree) with an alternative kernel to talk to, to minimise the attack surface between that container and the host kernel.


Yeah, and Amdahl's law comes directly into play here. At 7% of the total boot time, this bubble sort may be the slowest part of the kernel boot. But, even if you optimized away the sort entirely somehow, you could still only save 1.97 ms. And once you do that, the next slowest thing probably doesn't take more than ~1ms to run. Eventually, you get to a state that resists these kinds of targeted optimizations, where there are no bottlenecks, and the only improvements can come from new algorithms and data structures, i.e. "uniformly slow code."
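To put rough numbers on it using the figures from the tweet: with the sort at ~7% of a 28 ms kernel boot, Amdahl's law caps the speedup from removing it entirely at 1 / (1 - 0.07) ≈ 1.075x, i.e. roughly 26 ms instead of 28 ms. Everything beyond that has to come from the other 93%.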

Of course, calling anything that takes place in less than a blink of an eye "slow" is overstating things... lol :-)

If only I could get my desktop environment to come up in less than 28ms :-)

https://en.wikipedia.org/wiki/Amdahl%27s_law

https://wiki.c2.com/?UniformlySlowCode


Exactly, it is not a problem of making it slower/faster, or of which is slower/faster.

BUT

we use solutions across multiple problems instead of providing a solution specifically for the problem at hand. If you have a KNOWN VM on KNOWN hardware, then why do you need all kinds of hardware and other bull.. initializations, checks....? Just provide the list of things it has to have, and once the list has been run through without problems, continue. Also, why would Amazon/GCP/Azure care? They are making money from you running inefficient solutions....


The main point of running unixy stuff is exactly that I can mold my solution to a specific problem EASILY. That's why it is so common in the embedded market (routers, kiosks, drones, car infotainment, TVs...).


Sure but Firecracker is used for AWS lambdas, so you're going to be booting those extremely frequently. (Now I wonder if Amazon is using FreeBSD for AWS lambda ..?)


AWS Lambda does not support FreeBSD.

It would be great if that changed though...


We now boot kernels so often that this is important.

My TV boots a kernel; Alexas, laptops, cars, etc.

And this might not be specifically critical, but it's a sign of people just doing something less than optimal.


Do you think those devices boot every time you send them an input, or just when they're plugged in? Unless you're turning the TV on and off thousands of times per day this would be unnoticeable.


Certain systems on a vehicle make sense to power off. Do you really need the unit controlling your backup camera to be booted into standby while the car is turned off?


No, but I'm not taking thousands of six-inch family vacations per day, I'd walk if I wanted to travel a distance for which two milliseconds was noticeable.


I think you're mixing (bad) analogies to make some type of point but I'm not sure what it is.

The original comment you replied to mentioned cars in the context of booting devices often, so I simply pointed out that yes, actually it makes sense to boot up some devices when they go to receive input and not to leave them in standby as you suggested. In a vehicle that may sit off/idle for several weeks, it's best not to drain the battery unnecessarily.


I probably boot a few embedded devices per day.

If someone did not optimize this, it might indicate other missed optimizations.


Well, a microcontroller in a microwave is a bit different from a fully fledged POSIX system, and both are "embedded devices".

But yes, I also hate it when a home router takes forever to reboot and reconnect.


Do we? And... do we know that we can make it significantly faster? To wit, will the code be bigger to do a smarter sort such that that itself causes problems? Or, even with a faster sort, is it still 1.6ms?


2.031 ms, I rounded down.


You may lack the imagination to come up with a use for some improvement off the cuff; that doesn't mean everyone else (some of whom are more motivated) does as well.

Why do we even need to waste any consciousness-seconds on watching electric circuits initialize? Computers should just start working.


Probably true. My question was "how often is this run?"

I assume it's not just once when I first turn on my laptop from fully off. Is it once every time I restart a Docker container? (Assuming FreeBSD images.)

Is it something that runs a couple times a day, a dozen times a day, a thousand times a day? Fifty times every time I open an Electron app?

I suspect this doesn't matter and is more a curiosity, but I wouldn't be shocked to learn otherwise.


So that's a very impressive number, but how long does it take to bring up userland?


Years (to be a little hyperbolic). FreeBSD still doesn't have a good pid 1 story. It still has a very simple /sbin/init that just spawns whatever shell script happens to live in /etc/rc. That shell script essentially does a topological sort of service dependencies and then starts each one individually and serially.


So you might be shocked that 7% is so high, but the reason it is so high is that cperciva has been driving down the overall boot time to the tune of ~99%+ faster than two major releases ago. Bubblesort used to be <1% of total boot time because the rest of boot was so slow. Now that many of the gratuitous sleeps are gone, the next thing to look at is stupid algorithms like this.


I've seen this referred to as the "BBBB problem", as in: "nobody types BBBB into the search box, so it makes no sense to optimize the algorithm so it doesn't have worst-case complexity with repeated characters".

That is, until someone searches for e.g. git conflict markers.


At a previous job, I debugged a close relative of that exact problem: sorting list views was very slow when you wanted to sort on a column with very few distinct values in a large table. The culprit was of course a quick sort degenerating to n squared performance.
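For anyone curious about the mechanism, here's a rough C sketch (not the code from that job) of a textbook Lomuto-partition quicksort: with <=, every element equal to the pivot lands in the left partition, so a column full of duplicate values splits (n-1, 0) at every level and the sort goes quadratic. Three-way ("fat pivot") partitioning is the usual fix.

    /* Lomuto partition with <=: on an array of (mostly) equal keys,
     * everything ends up in the left partition, so each level peels off
     * a single element -- O(n^2) comparisons and O(n) recursion depth. */
    static void quicksort_lomuto(int *a, int lo, int hi)
    {
        if (lo >= hi)
            return;
        int pivot = a[hi];
        int i = lo;
        for (int j = lo; j < hi; j++) {
            if (a[j] <= pivot) {
                int t = a[i]; a[i] = a[j]; a[j] = t;
                i++;
            }
        }
        int t = a[i]; a[i] = a[hi]; a[hi] = t;
        quicksort_lomuto(a, lo, i - 1);   /* n-1 elements when all keys are equal */
        quicksort_lomuto(a, i + 1, hi);   /* empty when all keys are equal */
    }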


Had this problem in xBase (Clipper with NTX indexes): sorting by date but there were many null (empty) dates in that particular column. Added another date column (that was never null) to the index just to provide entropy.


And pattern-defeating quicksort is one example of a solution to that


Funny that this is what trends, and not the flame chart of the different boot times of different FBSD versions: https://twitter.com/cperciva/status/1659059445685420032

At a glance looks like 13.1-RELEASE booted 3x faster than 11.1-RELEASE, and 14.0-CURRENT on Firecracker boots 45x faster than 11.1-RELEASE on a c5.xlarge


When I was doing my PhD in hydrology modeling, I found a bubble sort in one of our pre-processing programs, which calculated the topographic flow network. This worked okay when the watersheds being modeled were small or the grid resolution was coarse. I however, was modeling at much finer resolution (3m^2) resulting in runtimes of more than an hour. I replaced the bubble sort with merge sort from libc and my runtimes fell to less than 5 minutes. Fun times.


Tangential, but in my Java class recently I was teaching the students some basic Big O stuff. I wrote a few versions of a program that found the "biggest delta" of two numbers in a list. The first version was O(n^2) time, the next version was just O(n).

I showed them how long it took to run the O(n^2) version on 100000 elements: about two minutes on my beefy server. Then I showed them the O(n) version running on my comparatively weaker laptop, and it took about ten seconds to run a million elements. The students were genuinely kind of impressed, and I stressed the point to say that "these are the kind of numbers that make it important to take theory seriously."
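Roughly what the two versions look like, as a C sketch (the class used Java; this also assumes "biggest delta" means the largest difference between any two elements):

    #include <stddef.h>

    /* O(n^2): compare every pair of elements. */
    long long biggest_delta_quadratic(const int *a, size_t n)
    {
        long long best = 0;
        for (size_t i = 0; i < n; i++)
            for (size_t j = i + 1; j < n; j++) {
                long long d = (long long)a[i] - a[j];
                if (d < 0)
                    d = -d;
                if (d > best)
                    best = d;
            }
        return best;
    }

    /* O(n): the largest difference between any two elements is max - min,
     * so a single pass tracking both is enough. */
    long long biggest_delta_linear(const int *a, size_t n)
    {
        if (n == 0)
            return 0;
        int lo = a[0], hi = a[0];
        for (size_t i = 1; i < n; i++) {
            if (a[i] < lo) lo = a[i];
            if (a[i] > hi) hi = a[i];
        }
        return (long long)hi - (long long)lo;
    }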


O(n) for 1 million elements taking ten seconds is pretty long. What’s the constant factor?


I would love to see a detailed analysis of various OS boot sequences and times, it is kind of insane that modern computers and phones can take minutes to boot; what ever are they doing?


Mostly just questionable hardware design and spoiled programmers who want to do too much in early boot. There is no technical reason why you couldn't have a modern desktop booting in a couple of milliseconds. It's a tradeoff between being fast and being compatible with the rest of the world.

At the end of the day, boot times are determined not by technical reasons but by how much the users are willing to tolerate. If the market demanded low boot times, the engineers would deliver.

But seriously, there is a router that I shall not name here that spends 10 seconds sleeping to avoid a race condition after initializing some irrelevant service. And of course the entire boot process depends on this service.


> not by technical reasons but by how much the users are willing to tolerate.

Realizing this one clause is what helped me understand so much, including economics and traffic (induced demand). I call it the "misery constant."

Number of lanes doesn't matter: People will adjust their lifestyles until they arrive at the highest level of misery they can tolerate. Add a lane and wait for equilibrium to reestablish at exactly the same level of congestion as before.

Add a 10,000% (more??) increase in CPU and disk speed from the 128K Macintosh to the MacBook Pro, and oh look, it takes basically the same amount of time to boot or to launch an application as it did then, as the developers adjusted their wastefulness until they achieved the maximum level of delay the customers were willing to bear.


> Mostly just questionable hardware design and spoiled programmers who want to do too much in early boot. There is no technical reason why you couldn't have a modern desktop booting in a couple of milliseconds. It's a tradeoff between being fast and being compatible with the rest of the world.

You wouldn't even be able to initialize the CPU & DRAM in a "couple of milliseconds".

Loading the BIOS from flash alone would eat that budget a few times over.


> there is a router that I shall not name here that spends 10 seconds sleeping to avoid a race condition after initializing some irrelevant service

Reliability is an issue here. There's all sorts of bits and pieces where it's miserably difficult to get a signal or event that something has come up, or there is such a mechanism and it's unreliable. Especially when there's third-party closed drivers or firmware involved. So somebody ends up giving up and putting the delay in, testing it a few hundred times, and declaring it done.

But yes, lots of systems could come up faster if enough people cared.


PCI enumeration turns out to be surprisingly slow, especially when virtualized. Access to PCI config space is done by writing an address to one register and reading the data at that address from another register, repeat for every word in the PCI config space for every device. Each of those register reads and writes takes a VMEXIT, so goes through the host kernel and into qemu and back. Oh yeah, and it used to be done twice, once by SeaBIOS and once by the kernel (the SeaBIOS stuff has been optimized since.)
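For the curious, a minimal sketch of the legacy "mechanism #1" access being described (x86 port I/O; the header and privilege setup are assumptions, e.g. ioperm() and root on Linux). Each dword of config space costs one port write plus one port read, and under virtualization each of those traps out to the host:

    #include <stdint.h>
    #include <sys/io.h>               /* outl()/inl(); Linux x86, needs ioperm()/root */

    #define PCI_CONFIG_ADDRESS 0xCF8
    #define PCI_CONFIG_DATA    0xCFC

    /* Read one 32-bit dword from PCI config space via the legacy
     * CONFIG_ADDRESS/CONFIG_DATA register pair. */
    static uint32_t pci_cfg_read32(uint8_t bus, uint8_t dev, uint8_t fn, uint8_t off)
    {
        uint32_t addr = (1u << 31)                 /* enable bit */
                      | ((uint32_t)bus << 16)
                      | ((uint32_t)(dev & 0x1F) << 11)
                      | ((uint32_t)(fn & 0x07) << 8)
                      | (off & 0xFC);              /* dword-aligned register offset */
        outl(addr, PCI_CONFIG_ADDRESS);            /* one trap when virtualized... */
        return inl(PCI_CONFIG_DATA);               /* ...and another one here */
    }

Scanning 256 buses x 32 devices x 8 functions just to read each function's vendor ID is already 65536 of those write/read pairs, before you even touch BARs or capabilities.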


If you're dealing with modern systems (which would hopefully include virtual machines), the PCI config space can be memory mapped, and then you don't need to read it through the PCI config registers.


I'm basically a 5yo when it comes to hardware - so my apologies for the dumb question - but I wonder naïvely, could this enumeration be completely cached by default? Most computers have most of the important stuff either soldered down (laptops) or at least not swapped regularly. I think I'd sign up immediately for a system where I had to hold a key at startup the first time I installed a new GPU or something, but otherwise it could boot into Windows in 3 seconds. Since most devices that are likely to change are hot-pluggable, I would imagine that those don't need to be checked on at boot time. I'd love to know where I'm wrong on this, though.

Perhaps this is part of what Windows does -- I know that it basically keeps a hibernated image of the non-logged-in RAM state, for use when you shut down (and it throws it out if you do an explicit Restart, for troubleshooting reasons)


On linux you can use systemd-analyze and get a nice svg waterfall chart of the boot-time dependency chains.


On the subset of Linux distros that use systemd, rather. Thankfully, there are other options (https://elinux.org/Bootchart seems a decent overview)


> On the subset of Linux distros that use systemd, rather.

Worth noting that Fedora, RHEL/CentOS/et al, Debian, Ubuntu, Arch and derivatives use systemd, so odds are pretty high if you use "Linux" you also get systemd


That's true; I think the only major user-facing Linux distros without systemd (or with non-systemd versions) these days are Gentoo, Void, Alpine, and Android (which may or may not count). Embedded stuff is probably more of a mixture (openwrt doesn't use systemd, yocto and buildroot can use it but optionally), and firecracker feels closer to embedded than desktop but both ends are interesting. I dunno, I mostly just object to conflating "Linux" with systemd even if it is the most popular option (but bear in mind that I have strong opinions about "Linux" vs "GNU/Linux", so YMMV;]).


I'd just like to interject for a moment. What you're referring to as Linux, is in fact, systemd/Linux, or as I've recently taken to calling it, systemd plus Linux.


I mean, I would unironically be fine with people saying systemd/Linux when that's what they're talking about; it's a subset of GNU/Linux and points at what we're talking about pretty easily and specifically...


Android seems to be the most popular Linux distro on the planet, though? Does it use systemd? I was under the impression that it doesn't, but as it often is, I might be wrong.


Technically Android runs the Linux kernel, but I wouldn't consider it a "Linux distro" in anywhere near the same sense as I would an Ubuntu or Fedora, etc. The vast majority of Android users don't even know they're running the Linux kernel, which is not true of "distros."


I would imagine the vast majority of OpenWRT users don't know they're running the Linux kernel. It was installed by the manufacturer or the operator. Assuming that's true, does it make OpenWRT not a distro?

I understand your usage of the word "distro". I think it's useful, but also quite a bit misleading (for example, Wikipedia[1] lists Android as a distro).

Perhaps there ought to be another word for distributions of Linux that require user to take active part in either replacing a manufacturers' choice or setting it up from scratch on an otherwise "empty" system. But I don't know what it would be.

[1] https://en.wikipedia.org/wiki/List_of_Linux_distributions


> I would imagine the vast majority of OpenWRT users don't know they're running the Linux kernel. It was installed by the manufacturer or the operator. Assuming that's true, does it make OpenWRT not a distro?

My gut is to not consider OpenWRT a distro since it's usually just the software part of a network appliance and many/most users don't know they're running it, although I concede that it does qualify in most senses as a distro: downloadable image, a package manager, is a complete product that uses the linux kernel, etc.

I fully concede that my opinion of "what makes a distro" is subjective and arbitrary. It's very much the Linux distro version of Stewart's test for obscenity[1] and is a pretty unhelpful standard whose value is quite low :-)

[1]: https://en.wikipedia.org/wiki/I_know_it_when_I_see_it


So 95% of the market.


So systemd comes with its own SVG, and presumably some sort of XML composer. TIL.


Yes, hugely complicated XML composer called printf :) https://github.com/systemd/systemd/blob/main/src/analyze/ana...


The author of the tweet has given talks and written articles about his work on speeding up the FreeBSD boot. Some examples:

https://www.daemonology.net/papers/bootprofiling.pdf

https://youtu.be/10XRCiBtyhA


Large systems need to initialize a lot of DRAM at power-on but a reboot of a running Linux desktop these days should take single-digit seconds.


Could be worse. Some network routers take up to 30 minutes to cold boot from what I hear, but they essentially never cold boot because they are on UPS.


initialising hardware mostly.


Makes me realize how spoiled I am by standard libraries nowadays.

Outside of toy projects, interviews, and teaching them, I've never actually written a sort function. I never saw the point; the sort function built into the standard library of whatever language I'm using is probably faster and/or more space efficient than whatever thing I whip up in a few seconds.

Of course, that's a privileged opinion that I am afforded by the fact that my entire career has been "Java and higher" languages.


In my first computer science class in college, after finishing the unit on Red-Black Trees, the professor said something like "Now you should never implement a Red-Black Tree again. If you need a balanced binary tree, use std::map or whatever your standard library provides. Same goes for all the other sorting and data structures we've done in this class."

Relatedly:

Q: When is Python faster than C?

A: When the C programmer used a linked list or some O(n^2) algorithm because they didn't have a standard library full of easy to use well implemented data structures and algorithms.


That's actually something I've been saying for quite a while when people bring up microbenchmarks complaining about how a language like Haskell is slow.

Like, yes, no question, if you try and implement a tight loop in Haskell vs C, C is going to come out a lot faster.

However, no one writes Haskell like that! People rely pretty heavily on standard libraries (or even third party libraries), and on the optimizations afforded to them by the "rich" language of Haskell.

C might be capable of being faster, but that doesn't imply that a C program will inherently be faster than a similar Haskell program, especially if the programs are relatively large.


Even though I kind of mostly agree that you almost never should implement your own rb tree, this advice is also kind of disempowering and I worry that it is part of a mindset that accepts the status quo when it is so often the case that we could have better things if we just got our hands a little dirty.


If you want to reinvent a wheel, you had better be expecting and willing to spend significant time on your new wheel.

It's not just getting your hands dirty, you need to be strategic about where you want to spend your time.


I think better verbiage might be "99% of the time you probably don't want to write one of these from scratch...at least not for production".


> A: When the C programmer used a linked list or some O(n^2) algorithm because they didn't have a standard library full of easy to use well implemented data structures and algorithms.

I don't think so, regarding the linked list. In almost every case you'll end up with at least as much pointer indirection in Python. But yeah, if you have enough data that complexity starts mattering a worse algorithm will cause more problems.


> O(N^2) can bite hard when you're sorting over a thousand items. Time to replace the bubblesort with something faster.

... downthread ...

> I suspect quicksort will be enough though.

Somebody hasn't taken a freshman complexity theory class in a while...

(Note, this is tongue in cheek. Worst case is not average case and all that. I'm sure quicksort will do fine)


[Deleted, humour doesn't come through in text.]


I was mostly joking, but worst-cases include sorted and reverse-sorted lists. I've seen those happen accidentally, in practice, resulting in accidentally-quadratic runtime. This is why, when I taught that class, I made students compare sorting algorithms on real world data sets.


Sure. In this case though the list doesn't change from boot to boot, or for that matter from system to system; if we hit a worst-case we would notice before the release shipped.

Also, the worst case would be "systems take 2 ms extra time to boot" so it's not exactly a disaster.


> though the list doesn't change from boot to boot, or for that matter from system to system; if we hit a worst-case we would notice before the release shipped.

Wait what - why don't you just sort it at build time then? I'm probably misunderstanding what you are saying or maybe just a premature optimization.

e: Nevermind, read further in the thread and saw you answered [0]

[0]: https://twitter.com/cperciva/status/1659561673278226432


We might end up doing that. A quicksort takes it from 2 ms down to 40 us though, and doing the sorting at build time would take a lot of linker magic.

(The SYSINITs are records in separate object files which are marked as going into a particular ELF section, so we don't actually have a complete list of them until we finish linking the kernel.)
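A simplified userland illustration of the linker-set idea being described (hypothetical names, not FreeBSD's actual SYSINIT/SET_DECLARE macros; assumes a GNU-style toolchain that synthesizes __start_/__stop_ symbols for sections whose names are valid C identifiers):

    /* Each translation unit drops a record into a named section; after
     * linking, the code walks the section between the linker-provided
     * bounds, sorts it, and runs the entries in order. */
    #include <stddef.h>
    #include <stdlib.h>

    struct init_rec {
        unsigned subsystem;            /* coarse ordering bucket */
        unsigned order;                /* ordering within a subsystem */
        void   (*func)(const void *);
        const void *arg;
    };

    #define MY_SYSINIT(name, sub, ord, fn, a)                       \
        static struct init_rec name                                 \
            __attribute__((used, section("my_init_set"))) =         \
            { (sub), (ord), (fn), (a) }

    /* Synthesized by the linker for the section above. */
    extern struct init_rec __start_my_init_set[];
    extern struct init_rec __stop_my_init_set[];

    static int init_cmp(const void *pa, const void *pb)
    {
        const struct init_rec *a = pa, *b = pb;
        if (a->subsystem != b->subsystem)
            return a->subsystem < b->subsystem ? -1 : 1;
        return (a->order > b->order) - (a->order < b->order);
    }

    static void run_inits(void)
    {
        size_t n = (size_t)(__stop_my_init_set - __start_my_init_set);
        qsort(__start_my_init_set, n, sizeof(struct init_rec), init_cmp);
        for (size_t i = 0; i < n; i++)
            __start_my_init_set[i].func(__start_my_init_set[i].arg);
    }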


Any consideration of using radix sort?

It's typically the fastest. It's also a stable sort algorithm.

https://youtu.be/BeoCbJPuvSE


Isn't the point of quicksort that the randomization of the pivot selection means that you're extremely unlikely to run into a pathological case for the specific pivot chosen?

And perhaps even more unlikely if you permute the input sequence before sorting it.

I'm basing this off of my understanding of Skiena's discussion of quicksort in The Algorithm Design Manual.


Not even noon and Colin is already roasting. Reminds me of [0], also by him.

[0] https://news.ycombinator.com/item?id=35079


I haven't recovered from jetlag yet. Sorry, my snark went overboard.


What freshman complexity class would teach you to conflate how fast an algorithm is in practice with the asymptotic complexity?


Not to mention that QuickSort isn't a single algorithm but rather a class of algorithms (as the choice of pivot is left unspecified), and there are methods to choose the pivot that give O(n log n) worst case or O(n log n) expected running time (e.g. median of medians or soft heaps for the former, random choice for the latter).


Most of them? If the instructor doesn't call out the difference and point out that all terms of a O() expression have invisible but important constants tacked on, it's easy to accidentally conflate scaling complexity with actual total speed.


Why would the instructor skip that part of the definition? Giving out wrong definitions as a prank or something?


They're not intentionally avoiding it; they just get caught up in the mathematical angle and only mention its (dis)connection to real execution time in a passing comment, if at all. Or they might even mention that there are constants in play... and then say that of course, since they're constants, we can ignore them for the purposes of O() notation. It's the difference between learning from PhDs and engineers, I suppose.


He means quicksort will be enough to make further boot-time improvements from precomputing and storing the sorted list insignificant.


Back in the 1980s, my compiler (Datalight C) won file I/O benchmark after benchmark, year after year.

My trick was stupid simple. The usual file I/O buffer size was 512 bytes, I made it 16K.


P.S. I learned that trick from Shal Farley, who dramatically speeded up DECUS' Runoff program (an ancestor of Markdown) by doubling the buffer size.


Man that's crazy. Think of all that extra time you'll have if timsort was used! An extra 2ms to go on that walk while your system is booting.


This is talking about Firecracker[0] which is for microvm deployments so I would think it is the aggregate effect of losing that 7% across a large number of spin-ups that would make a difference. Occasional reboot on your personal machines obviously doesn't make a dent.

[0]: https://firecracker-microvm.github.io/


The lower the boot time, the more aggressively you can also afford to shut them down when not in use, freeing up capacity.


I'd imagine if you wanted the absolute fastest boot time you'd just use a container.


Containers are inappropriate for many workloads, such as any workload that relies on multitenancy.


This is for FreeBSD Firecracker. This means this is used for AWS Lambda and Fargate. A whole bunch of 2ms wins is how one would improve cold start time.


I wanted to figure out the cost to all users.

BSD is sometimes used in some iot-like devices, of which there are 3 billion. If I randomly assume 0.5% of those are running a fork of FreeBSD, that's 15 million devices. I'm also going to randomly assume that devices reboot once a month on average.

That all comes out to 4.2 days/year saved! Wow!

https://www.wolframalpha.com/input/?i=2ms+*+3+billion+*+0.5%...


Many years ago I worked on a piece of software that simulated networks. As networks got bigger it would run out of memory. Turned out it did something like

   CONNECTION *c = malloc(num_nodes * num_nodes * sizeof(CONNECTION));
And almost the entire matrix was empty, because most things in a network aren't connected to each other. I spent a few minutes replacing that with a hash table keyed on the (i, j) node pair.
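A minimal sketch of that kind of replacement (hypothetical names and payload; the real code surely differed): an open-addressing table keyed on the (i, j) pair, so memory scales with the number of actual connections rather than with num_nodes squared. It assumes the table is sized well above the edge count up front; a real implementation would grow and rehash when it fills.

    #include <stdint.h>
    #include <stdlib.h>

    typedef struct { double weight; } CONNECTION;    /* hypothetical payload */

    struct conn_slot  { uint64_t key; CONNECTION c; int used; };
    struct conn_table { struct conn_slot *slots; size_t cap; };  /* cap: power of two */

    static int conn_table_init(struct conn_table *t, size_t cap)
    {
        t->cap = cap;
        t->slots = calloc(cap, sizeof *t->slots);
        return t->slots ? 0 : -1;
    }

    static size_t conn_hash(uint64_t k, size_t cap)
    {
        return (size_t)((k * 0x9E3779B97F4A7C15ull) >> 32) & (cap - 1);
    }

    /* Find-or-insert the connection record for node pair (i, j). */
    static CONNECTION *conn_get(struct conn_table *t, uint32_t i, uint32_t j)
    {
        uint64_t k = ((uint64_t)i << 32) | j;
        size_t idx = conn_hash(k, t->cap);
        while (t->slots[idx].used) {
            if (t->slots[idx].key == k)
                return &t->slots[idx].c;
            idx = (idx + 1) & (t->cap - 1);          /* linear probing */
        }
        t->slots[idx].used = 1;
        t->slots[idx].key = k;
        return &t->slots[idx].c;                     /* fresh, zeroed by calloc */
    }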


But how long on average does it take to boot in wall clock time?


This is from a year old tweet so it may not be correct anymore, but:

“FreeBSD/Firecracker now boots to a login prompt in 782 ms. 170 ms of this is spent generating a self-signed TLS certificate for sendmail, because of course it is.”


FreeBSD 14 is going to replace sendmail with DragonFlyBSD’s dma.


It's 2023, I don't think any system calling itself "minimal" should have an MTA.


The notion that the system mails things to certain local accounts is still quite strong in FreeBSD. In fairness, it's a notion that has a long history in Unices in general; so it isn't FreeBSD Think.

On the gripping hand, it's mad that it's still Sendmail, although that is going to change. It's 2023, and the fact that a FreeBSD system still has the traditional ~2000 lines of indistinguishable-from-line-noise "DO NOT EDIT THIS" stuff in /etc/mail/sendmail.cf, complete with whole bunches of rules for UUCP, is rather amazing.

I wonder how many small FreeBSD systems out there are spending their time double-bouncing cron-job mail that cannot be delivered. (-:


Recently, I tested FreeBSD and it seems one can disable sendmail. See https://docs.freebsd.org/en/books/handbook/bsdinstall/#bsdin....


In that particular case, 620 ms for the entire boot process. 28 ms for the kernel boot (of which 7% or 2 ms is the bubblesort).


Looking from the outside, this is astonishing.

I suspect Firecracker is set up to be as quick to initialize as possible, but still, the fact that a vanilla build (i.e. not an optimized fork) can do that... My phone takes two orders of magnitude more!

I hope your BSDCan talk is getting recorded.


Yes, talks are being recorded, should be up on YouTube in a month or two depending on how busy the (volunteer) video editors are.


Your phone boots in 2.8 seconds? Mine is closer to 2.8 minutes.


No, closer to 62 seconds (which is 100x the total boot time, not just the kernel)


> The kernel boot (of which 7% is bubblesort) is 28 ms.

* https://twitter.com/cperciva/status/1659577625289928705


This is firecracker, so less than a second I guess?


This is one of the cases where C’s low-level nature bites.

Both C++ and Rust have pretty nice sorting algorithms in the standard library. C does not (qsort doesn't really count as a nice sorting algorithm, which is why so many C programmers roll their own).


Why do you think qsort is not nice?

I could imagine that qsort is not an option in FreeBSD for example because it is recursive and there is no stack at that level, or there is a risk of stack overflow. If that's the case, shell sort should be fine I guess.


Can't use the standard library in the kernel, though. That's why there's special implementations of a lot of things you take for granted in kernel source, things like "print".


Rust's `sort_unstable` is available in `core` and does not require the standard library or the `alloc` crate, so, if my understanding is correct, even `no_std` programs like the kernel could use it?

https://doc.rust-lang.org/core/primitive.slice.html#method.s...

(It's interesting that the stable `sort`, by contrast, is only available in `std` because it uses an allocating timsort-variant.)


Most relevant element of what's going on here is that Rust's standard library is deliberately split three ways: Core has stuff that should "just work" on your target platform, Alloc adds stuff that further requires a Memory Allocator, which is something target hardware doesn't come with, but any non-trivial system is likely to build, then Std adds stuff which requires an Operating System, such as Files, and Threading

C is technically also split, so called "freestanding" C implementations don't provide features like stdio but they do have e.g. math functions. C++ attempts something similar but it's a bit hit-and-miss as a truly "free standing" C++ doesn't entirely make sense, as systems that try to use C++ for very low level work often discover.

The next less relevant observation is that Rust chooses to call these sort and sort_unstable - the one that's guaranteed to be available doesn't get the easier name, that's reserved for the one that definitely doesn't have unexpected behaviour. If you don't know what an "unstable" sort is, then sort_unstable's behaviour might in some cases be astonishing. Some languages make the opposite choice, which means programmers may use sort, not realising it's actually an unstable sort because the entire concept was never explained to them.

This is one of many examples of Rust's safety achieved not via some sophisticated technology but just by thinking about why people screw up. Every Rust programmer who without thinking writes sort gets a stable sort, if that was too slow, they can choose sort_unstable, but doing so prompts them to consider, wait, is an unstable sort correct here?

Finally the least relevant but hopefully still interesting observation is that it is quite possible to have a stable sort which doesn't require allocation, but the performance is not so hot. It seems pretty unlikely that you must have a stable sort and you can't afford an allocator and yet you don't mind the performance hit. As I understand it though some C++ libraries do offer this.


So you're not wrong, but also, the Linux kernel uses a modified variant of alloc, so not only can it use sort_unstable, it also has access to sort https://github.com/torvalds/linux/blob/2d1bcbc6cd703e64caf8d...


As PHK points out [0] qsort was imported [1] into the kernel in 1995, it was simply never used for the boot process, but it is availible.

[0]: https://twitter.com/bsdphk/status/1659568675899228163

[1]: https://github.com/freebsd/freebsd-src/commit/d89a1b600fc071...


Looks like the code to bubble-sort the SYSINIT objects was committed a few months BEFORE this kernel qsort code was committed (late Aug 1995 vs early Nov 1995).

So that's why the sort-SYSINIT objects code didn't use qsort.

I'd also guess that there were fewer SYSINIT objects to sort back in 1995 than today.

see ~line 150 of sys/kern/init_main.c in https://github.com/freebsd/freebsd-src/commit/2b14f991e64ebe...


Seems like we are due for the cyclic interest in microkernels. I wonder what's the latest research on that front.


FWIW, qsort (despite its name) does not specify a quicksort be used.

IIRC qsort will use a merge-sort on glibc when it can do so without growing the heap.


However FreeBSD qsort(), in both libc and libkern, is quicksort, falling back to insertion sort when there are fewer than 7 items.


I see a lot of discussion here about quicksort as an alternative.

If I'm understanding correctly, we're talking about sorting integers. Would something like radix sort not be more appropriate? Or is there a reason to still prefer a more generic sorting algorithm?


One of the replies:

https://twitter.com/cperciva/status/1659577625289928705

So this bubble sort is just under 2ms.

I think it's cool to find and optimize that, he says he'll put in a qsort and that's great. But 2ms? Probably not a huge deal.


At the risk of tempting the systemd flame wars back to life, is there data on how systemd performs? I know faster boot time was a major goal of the project[1].

[1]: http://0pointer.de/blog/projects/systemd.html


My personal/corporate, n=~500 experience is that it boots faster, shuts down slower (technically because it is doing it "properly"), and if you have any problem during shutdown it is an absolute PITA to debug why.

We did manage to throw away a few thousand lines of fixed init scripts after moving to systemd and get rid of a few other things, so it has been a net positive, but definitely some bad with a lot of good. Journald's logging format is also kinda fucked, which causes some nasty edge cases.

Except for systemd-resolved; that utterly radioactive piece of shit can go fuck itself.


Do you have more details on the journald issues?


Here is 9 year old bug that is still not fixed:

https://github.com/systemd/systemd/issues/2460

Long story short: there is no useful indexing in the journald DB format, which means looking for, say, "the last message an app emitted" means looking through every file.

As long as it is mmapped in or on tmpfs it is reasonably fast, but if you, say, have the last 1GB of logs stored on spinning rust on a NAS.... you either have a 3-4 second wait till it finds those few log lines, or a buffer cache littered with that GB of logs, pushing out actually useful stuff.

It literally had worse performance than grepping the entire /var/log, last time I checked.

And it seems that, in its entirety, it was "Lennart wanted to make a binary log format" vs something actually useful... Just having an SQLite DB instead would, on top of being faster, be far more useful (the ability to SQL-query logs on the local node with no extra fuss would be awesome).


Thanks, interesting. (This is something I had actually observed, too, but never had the patience to figure out what was happening.)

I like the SQLite idea.


Naïve question here: Sorting algorithms aside, is there a reason why the ordering of these even needs to be (re)determined at runtime every boot as opposed to the order being recomputed when the kernel is updated, and written to a file?


Gah! I was just looking at this code last night because RTEMS borrows parts of the FreeBSD codebase for drivers and networking. The sorting happens in mi_startup, in case anyone wants to look.


The story behind this: “how long do you think until they find this?”

-somebody who accepted a bet


The real story

"The kernel boot (of which 7% is bubblesort) is 28 ms."


Is there no link for the actual code?


It's FreeBSD. The source for base comes with the operating system, in this particular case in /usr/src/sys/kern/init_main.c .


Starting around line 255 in init_main.c (or search for bubble). see https://github.com/freebsd/freebsd-src/blob/40b287054521f0a9...


(btw If you click on the line numbers in Github, you get a three-dots popup, click that and you can "Copy Permalink" to get a link directly to the line which has #L255 on the end, e.g. https://github.com/freebsd/freebsd-src/blob/40b287054521f0a9... )


Fascinating. It’s just plain old bubble sort. Why did they choose to do that? I don’t get the benefits of it. If they are going for simplicity, why not selection or insertion sort?



