Introducing Varnish Massive Storage Engine

damagednoob · on Dec 5, 2014

"Varnish allocate[s] some virtual memory, it tells the operating system to back this memory with space from a disk file. When it needs to send the object to a client, it simply refers to that piece of virtual memory and leaves the rest to the kernel.

If/when the kernel decides it needs to use RAM for something else, the page will get written to the backing file and the RAM page reused elsewhere.

When Varnish next time refers to the virtual memory, the operating system will find a RAM page, possibly freeing one, and read the contents in from the backing file.

And that's it. "

https://www.varnish-cache.org/trac/wiki/ArchitectNotes

I'll try and hold back the snark but I find it interesting that after attacking '1975 programming' and Squid's deficiencies, here we are 8 years later and maybe the kernel doesn't know best.

nkurz · on Dec 6, 2014

You have a strong point. I was more struck by the incongruity of the following statement: "We started out using a memory mapped file to store objects in. It had some problems associated with it and was replaced with a storage engine that relied on malloc to store content. While it usually performed better than the memory mapped files performance suffered as the content grew past the limitations imposed by physical memory."

So it looks like they gave up on the "kernel knows best" approach quite a while ago. But then they show a graph, where the older mmap() approach is more than twice as fast as the malloc() approach across the entire range of the graph. They explain this in a sentence below the graph saying "Malloc suffers quite a bit here as the swap performance on Linux is rather abysmally bad. " Well, yes. But presumably that is the case we're interested in, as a cache that has room for everything is a much easier problem.

So what are we to make of the earlier contention that malloc() trumps mmap()?

wmf · on Dec 6, 2014

Varnish was originally designed to run on FreeBSD (as PHK works on both projects). Have they changed to primarily target Linux, perhaps in response to market demand?

perbu · on Dec 6, 2014

~ 99% of all varnish instances run on Linux. FreeBSD is still important. Varnish Software doesn't currently have any customers who are planning on staying on FreeBSD. They are all migrating to Linux.

vorador · on Dec 6, 2014

I'm a linux user but I've always been interested in running my projects on FreeBSD. Could you give a couple reasons why those people are migrating?

perbu · on Dec 6, 2014

FreeBSD admins are hard to come by. Same with software; try getting something hip running on FreeBSD these days. Even better, try getting a support contract on it, it will not be trivial.

However, the quality of the sysadmins that prefer FreeBSD is pretty high, though. And FreeBSD is pretty neat.

vorador · on Dec 6, 2014

Thanks!

xxxyy · on Dec 6, 2014

> While it usually performed better than the memory mapped files performance suffered as the content grew past the limitations imposed by physical memory.

Physical memory is RAM memory. Their initial observation was:

1) If you need less memory than you have RAM installed - use malloc. Faster than file-backed mmap().

2) If you need more memory than you have RAM installed - use file-backed mmap(). Faster than malloc() + swap.

Edit: you can even see this on the graph in the article. Around time 0 it is malloc that is faster than file, then malloc sinks and file gets faster.

perbu · on Dec 6, 2014

You missed the point of the note. The point was to have the kernel balance what is on disk and what is in memory.

Varnish (w/MSE) still relies on the kernel for this. However, as you cannot atomically write a whole page through a mmap (thereby avoiding the page fault) we have to use write(). It still is read through mmap and the kernel still drop pages whenever it feels like it.

otterley · on Dec 7, 2014

But if you're reading via mmap, aren't you still relying on the kernel's naive LRU page eviction algorithm?

perbu · on Dec 7, 2014

The kernel switched to LFU in 2.6 if I'm not mistaken. But, yes, we are. There are quite a few caches with separate eviction algorithms if you follow the rabbit hole all the way down.

xxxyy · on Dec 5, 2014

I'm not sure whether it was using write() over mmap() that boosted the performance. This part however looks very interesting:

> Taking advantage of the fact that Varnish is a cache and letting it actually eliminate objects that are blocking new allocations simplifies the allocation process.

perbu · on Dec 6, 2014

mmap a file into memory. As you touch a page the CPU generates a page fault and invokes the kernel. The kernel merges the partially written page with the underlying data. As there is no way to avoid the merge you end up doing write() instead.

The page fault is a synchronous read. Horrible for performance.

xxxyy · on Dec 6, 2014

What do you mean by "The kernel merges the partially written page". A pagefault traps the execution just before any CPU write succeeds, then the kernel optionally loads the page, maps it into process' space, flushes TLBs and resumes process execution. mmap() style disk writes happen on page-level granularity, usually not before the kernel runs out of free memory frames.

I agree on the fact that using mmap() reads vs read() reads leaves the kernel with no possibility to reorder requests, as any thread trapped in a pagefault is obviously unable to generate more I/O requests. Reordering can lead to better performance. This is not the case with asynchronous read().

perbu · on Dec 6, 2014

I might have expressed myself somewhat unclear. Let me try again. varnishd tries to write to a page not in memory and gets a page fault while doing so. As the kernel has no idea of your intentions to overwrite the whole page it will read the page into memory and then return control to varnish who completes the memory write, merging the write request just made onto the preexisting page. varnish does not utilize any asynchronous reads. All IO is blocking.

xxxyy · on Dec 6, 2014

I understand now. Not enough that the kernel unnecessarily loads a page that is about to be fully rewritten, it also blocks the thread that is trapped in a pagefault, while it could potentially go and serve another connection. I can see how write() can be faster than mmap(), thanks.

cbsmith · on Dec 6, 2014

Which is exactly the thinking as to why squid cache used write()...

imanaccount247 · on Dec 6, 2014

This is exactly what I came to point out. Varnish was pushed in a very arrogant manner insulting other software that was just plain better. While they were trash talking squid and pretending the kernel could just magically know how to do everything perfectly for any arbitrary userland workload, we were consistently getting better performance from squid.

xxxyy · on Dec 6, 2014

Squid was faster? Varnish is an engineering masterpiece, do you have any data to prove this? I would be very glad to see it.

imanaccount247 · on Dec 6, 2014

>Varnish is an engineering masterpiece

No, it isn't. That's very much the point. They have spent all this time realizing that doing it wrong is bad and rewriting it to actually work.

>do you have any data to prove this?

No, that was 2 jobs ago. I'm sure you can get a hold of the source code of old varnish and test it out yourself.

perbu · on Dec 6, 2014

> No, it isn't. That's very much the point. They have spent all this time realizing that doing it wrong is bad and rewriting it to actually work.

Troll much? Nope. That is not accurate at all. Varnish, with it's current storage engines manages to power sites such as Wikipedia and NYT. This announcement was not about that, it was about enabling caching of really large datasets. Datasets that run in the hundreds of terabytes.

And we did tens of migrations from Squid -> Varnish. Usually it involved scaling down the number of servers to 1/6 whilst seeing a massive decrease in response time. So it kind of surprising that your findings are completely different and somewhat disappointing that you cannot back them up.

imanaccount247 · on Dec 6, 2014

I posted them back in 2008 or so when it was relevant. I got the standard "our propaganda benchmarks where we deliberately misconfigured squid show otherwise and that's all that matters" response followed by some irrelevant ranting about how everything that isn't freebsd or linux shouldn't exist.

perbu · on Dec 7, 2014

Where did you post them? Under what name?

xxxyy · on Dec 6, 2014

I am not talking about mmap() vs write() here. That is just ordinary propaganda. There is a lot of powerful ideas at work there. The two that I know about:

1) Compiling config file into an .so object.

2) Using innovative priority queue (heap) implementation: http://queue.acm.org/detail.cfm?id=1814327

Varnish userbase is the best proof for its quality though.

imanaccount247 · on Dec 6, 2014

>Compiling config file into an .so object.

That is neither new, nor interesting "engineering". Lots of software that needs significant customization uses the host programming language for configuration.

>Using innovative priority queue (heap) implementation

How is "we used an existing data structure and deliberately misrepresent this as some amazing discovery" an amazing feat of engineering? This supports the notion that varnish's biggest achievement is an amazing feat of marketing.

xxxyy · on Dec 6, 2014

Regarding the .so configuration: can you show other examples? I find this technique fascinating, as it combines speed, expressiveness and actually ease of use.

About the alternative heap implementation: can you point out who else has described this aggregated-heap data structure? I don't even know if it has any name, but sure as hell it is a good idea.

fiatmoney · on Dec 5, 2014

"Assumption 1. Using write() instead of implicitly writing to a memory map would lead to better performance."

I've seen this mentioned in the context of RocksDB; but contradicted by e.g. SQLite. The case for mmap has always been that one avoids the overhead of a system call & some double-copying, and in either case it just dirties the page cache and is only "really" written in periodic flushes (assuming it's not writing via direct IO). Can someone explain what the bottleneck is on the mmap side and why write() might be faster?

perbu · on Dec 6, 2014

We (Varnish Software) still use mmap, just not for writing. The problem with mmap is that you can't use it to wrote whole pages at a time as the CPU will trap and ask the OS to merge the page you're writing to.

mmaps are still great as the page cache still manages the split cache between disk and memory.

cbsmith · on Dec 6, 2014

Because most POSIX systems, including Linux, provide even almost nothing to help with hinting about page management, but write() (and the implicit copy) provides better ways to be explicit about that.

fiatmoney · on Dec 6, 2014

Could you go into more detail? mmap lets you specify some flags about access patterns; write() doesn't give you any such hints.

woadwarrior01 · on Dec 6, 2014

Correct me if I'm wrong. I think he's talking about posix_fadvise() [1], which lets you pre declare access patterns for an fd. And you're talking about madvise() [2].

[1]: http://linux.die.net/man/2/posix_fadvise [2]: http://linux.die.net/man/2/madvise

cbsmith · on Dec 7, 2014

...and madvise hardly does anything.

cbsmith · on Dec 7, 2014

You ever tried doing an append only mmap? ;-)

skrause · on Dec 5, 2014

Why does a site of a web site acceleration software take almost 10 seconds to load?

mortenlarsen · on Dec 6, 2014

Looks like it is hosted in Norway. Loaded instantly for me, but I am close by in Denmark. So it is probably the RTT that is affecting your page load time.

lsz9 · on Dec 5, 2014

Not so frequently used I guess. Reciding somewhere in the kill-zone.

cordite · on Dec 5, 2014

Took about 16 for me, felt like fast-dial up.

jallmann · on Dec 5, 2014

Regarding three tier caching between RAM/SDD/HDD: isn't this exactly what ZFS L2ARC is supposed to do? Relying on that seems closer to the Varnish philosophy of leaving as much to the underlying system as possible. Or would that encounter the same bottlenecks they are trying to solve right now? And if so -- how?

wmf · on Dec 5, 2014

Yes, the "1995 programming" solution would be to put each cached object in a file and the let the filesystem manage it. I'm sure there's a good reason why they don't do that, although it would be interesting to hear what that reason is.

Rapzid · on Dec 6, 2014

ZFS has a very advanced caching mechanism, I wonder how it would compare to standard file systems or the kernel's paging for something like this.

It's also pretty common for people to enable file system caching plugins in Wordpress, Drupal, etc. and forget to leave the system enough RAM to actually do its business. D'oh.

lifeisstillgood · on Dec 5, 2014

I am reminded of the slow programming article a while back. How rare is it for companies to say "take a year, see if it works"?

Really good performance needs doing things differently, not the same thing faster. Yet most organisations don't want to try something different or give the space to try it.

wmf · on Dec 5, 2014

You can promise that it will work ahead of time, work on it for a year, and then exercise your political skills if it doesn't.

ForFreedom · on Dec 6, 2014

What is better for caching: Varnish or Squid? [2012] http://www.quora.com/What-is-better-for-caching-Varnish-or-S...

Varnish was built for caching web apps. Squid is a forward proxy that can be configured to work as a web app caching program. So, when Varnish was designed we where able to disregard a lot of stuff that isn't needed when caching in reverse mode. On the other hand, squid has been around for ages and is a very, very mature product with a very well known set of strengths and weaknesses. Varnish is only 5 years old.

101914 · on Dec 6, 2014

So what happens if you put the "disk file" on tmpfs?

Personally, I have more than enough RAM now and I no longer need virtual memory. I do not need a "disk" or other secondary storage in order to retrieve and consume data.

I consider virtual memory a relic from an earlier era of limited computing resources, like "user accounts" designed for an era of time-limited use of prohibitively expensive, shared computers.

We now all have our own _personal_ computers and GB's of RAM, but we still have use kernels with builtin solutions designed to address issues of scarce, expensive, shared computers and scarce, expensive RAM.

perbu · on Dec 7, 2014

If you have enough memory you can just have Varnish store stuff in memory. However, there are quite a few petabyte datasets out there and there will be quite a few years before we can all stick those sets in memory.

dantiberian · on Dec 5, 2014

I'd be curious to know why they didn't just use one of the existing cache algorithms from the literature. They talk about it being close to them, but not why they chose to go their own way. I suspect it was because their algorithm had better mechanical sympathy. There's a number of good choices at http://en.wikipedia.org/wiki/Cache_algorithms#Examples, theirs sounds closest to ARC.

zzzcpan · on Dec 5, 2014

I presume that ARC cannot be used because of the patent for it.

amelius · on Dec 6, 2014

Speaking of memory mapped files: how easy is it to allocate and store various data structures into mmapped files in languages such as C, C++, Rust, etc.?

If it is difficult, then why is this the case? Are languages lacking in this respect?

xxxyy · on Dec 6, 2014

You just have to override the new operator (C++) or use a different malloc (C). All you need is to replace brk() with an appropriate mmap(). Don't know how it goes for Rust.

C++ is especially appealing, because you can embed file-backed mmap allocations in chosen classes by overriding their new operator. So you could for example create a float array class that automagically allocates itself in a file-backed mmap region by a simple new Array() call.

Edit: Python's numpy has a file-backed array: http://docs.scipy.org/doc/numpy/reference/generated/numpy.me...

amelius · on Dec 6, 2014

well, the problem is that the mmapped file is based at a different address every time you mmap it. (in realistic situations, and the general case)

thus, every pointer dereference needs a base address. how is the support for this?

xxxyy · on Dec 6, 2014

This is not an easy way to serialize objects, it is merely a way to help virtual memory manager recognize portions of memory that can be saved to disk in the first place.

.so files are loaded through mmap, and the pointers problem is solved there through the mechanism of relocations. But please don't write raw memory objects to disk, use Google Protobuf or ASN.1

wmf · on Dec 6, 2014

It mostly involves casting a pointer.

amelius · on Dec 10, 2014

It also involves offsetting the pointer with the beginning of the mmapped area. And this gets really cumbersome if you have to do it for every pointer dereference. Basically all standard libraries are immediately unusable if the language does not provide support for this operation.

For example, try storing a C++ std::map inside an mmapped file. Practically not possible AFAIK.

kolev · on Dec 5, 2014

Another payware from Varnish?

perbu · on Dec 6, 2014

I tried, for several years, to run the company by giving all the software we wrote and selling support agreements. Didn't fly.

wmf · on Dec 5, 2014

It's almost like Varnish is a company with paid developers.

kolev · on Dec 6, 2014

There are much more complex projects that add tremendously more value and they don't charge a penny. Varnish and Nginx are doing it wrongly and their flawed models will only create competitors. I can understand charging for support and scale, but charging for features is stupid.

reinhardt · on Dec 6, 2014

Also known as "product".

WordSkill · on Dec 6, 2014

A "product" with no prices listed, just the shady practice of forcing you to contact them and having them work out a "quote" for you.

I really hate it when companies feel they cannot be transparent about pricing, it is such an obvious strategy to work out how much they can squeeze out of each potential customer. How are you meant to trust a company like that?

swalberg · on Dec 6, 2014

I had contacted them for a quote a little over a month ago. Had to get on the phone with the sales guy to get any information. My favourite part is when you fill out the online form for a quote, the autoresponder says "Based on the answers you provided, we will decide whether you qualify and come back to you as soon as possible."

I don't remember the prices offhand but it was $57K US for a 3 node license at platinum support level and around $14k per node after that. He did provide that over the phone (but as I check my email, not in written form) on our first call.

kolev · on Dec 7, 2014

Oh, much worse than I thought. How can they justify this price tag? Anyway, Nginx + LuaJIT (custom-build, Tengine, or OpenResty) beats both plain Nginx caching and Varnish. I was considering using Varnish in front of Nginx with PageSpeed, but given the direction they are taking, I'm sticking with Nginx although it's taking a similar dangerous direction recently.

cthalupa · on Dec 6, 2014

There was an article here on HN just a few weeks ago how the "Contact Us" pricing model actually often works better for generating sales than having listed prices.

I can't find it now, so my comment doesn't particularly add much value, but hopefully someone else can provide the link.

perbu · on Dec 6, 2014

It is my experience that this is true. However, it is probably smart to list some minimal prices so people that are in different ballparks can adjust.

kolev · on Dec 7, 2014

It's true for some. I would never ever buy or recommend a product with no listed price. It's a psychological thing - you always think they charge you more than others. For example, with Akamai, at two very large of their customers, one was getting charged $300/hour and the other - $600/hour.