Hacker Newsnew | past | comments | ask | show | jobs | submitlogin

I haven't even suggested to roll your own "string" type. Not more than rolling any other type of array or slice. In my programs I normally do not define a "string" type. Not a central one at least. Zero-terminated strings work just fine for the quick printf() or fopen().

Instead, I might have many string-ish types. A type to hold strings in the UI (may include layout information!), a type of string slice that points into some binary buffer, a rope string type to use in my editor, a fixed-size string as part of some message payload, a string-builder string that tries to be fast without imposing a fixed length... Again, there is little point in an "optimized" generic string type for systems programming, because... generic and optimized is a contradiction.



Any length delineated string you're using, and you did say you were using length delineation, suffers from the problem of not being compatible with any other C code. There's a good reason operating system API calls tend to use 0 terminated strings.

If you want to do a quick debug printf() on it, well, you could use %.*s, but it's awkward and ugly (I speak from lots of experience). Otherwise, you gotta append the zero.

I'm not a C newbie. I've been programming C for 40 years now. I've written 2 professional C compilers, the most recent one I finished this year. When I started D, a major priority was doing strings a better way, as C ranks among the most inconvenient string processing languages :-)


Sure, I know who you are but I hold opinions too :-)

I don't care about having to provide zero-terminated strings to OS and POSIX APIs, because somehow I almost always have the zero already. Maybe I'm a magician.

Sometimes I have not, but >99% of what I give to printf is actually "text", and that pretty much always has the zero anyway. It's a C convention, you might not like it, but I don't sweat it.

If I want to "print", or rather "write", something other than a zero-terminated string, which is normally "binary data", I use... fwrite() or something analogous.

> C ranks among the most inconvenient string processing languages

I've written my share of parser and interpreters (including also a dysfunctional toy compiler with x64 assembler backend, but doesn't matter here), so I'm not entirely a stranger to this game either.

I find parsing strings in C is extremely _easy_, and I find it in fact easier than say in Python where going through a stream of characters one-by-one feels surprisingly unpythonic.

Writing a robust, human-friendly parser with good error reporting and some nice recovery attributes is on the harder side, but that has nothing to do with C strings. A string input for the average parser isn't even required, you just read char by char, frankly I don't understand what you're doing that is hard about it. It doesn't matter one bit if there's a zero at the end or not.


The inconvenience and inefficiency is apparent when building functions to do things like break up a path & filename & extension into components and reassemble them. You wind up, for each function, dealing with 0 termination or length, separately allocated or not, tracking who owns the memory, etc. There's just no satisfying set of choices. Maybe you've found an elegant solution that never does a defensive copy, never leaks memory, etc., but I never have, and I've never seen anyone else manage it, either.


I agree filepath related tasks are ugly. But there are a number of reasons for that that aren't related to zero termination. First, there is syntax & semantics of filepaths. Strings (whatever kind, just thinking about their monoidic structure) are a convenient user interface for specifying filepath constants, but they're annoying to construct from, and disassemble into, filepath components programmatically (relative to how easy I think it should be). Because of complicated syntax and especially semantics of components and paths, there are a lot of pitfalls. Filepath handling is most conveniently done in the shell, where also nobody has any illusion about it being fragile.

Second, you're talking about memory allocation, and this is arguably orthogonal to the string representations we're discussing here. Whether you make a copy or not for example totally depends on your specific situation. The same considerations arise for any array or slice type.

Third, again, you're free to make substrings using pointer + length or whatever, and this is in many cases the best solution. I could even agree that format strings should have better standardized support for explicit length, but it's really not a pain point for me. I'm only stating that zero-terminated is an acceptable default for string literals, and I want to stress this with another example: Last time you were looking at a binary using your editor or pager, how much better has your experience been thanks to NUL terminators? This argument can also extend to runtime debugging somewhat.


> memory allocation, and this is arguably orthogonal to the string representations

A substringz cannot be produced from a stringz without doing an allocation.

> you're free to make substrings using pointer + length or whatever, and this is in many cases the best solution

Right, I can. And it's an ongoing nuisance in C to do so, because it doesn't have proper abstractions to build new types with. Even worse, if I switch my stringz to length delimited, and then pass it to fopen() which wants a stringz, I have to convert my length delimited string to stringz even though it is already a stringz. Because my length delimited API has no mechanism to say it also is 0 terminated.

You wind up with two string representations in your code, and then what? Have each string function come in a pair?

Believe me, I've done this stuff, I've thought about it a lot, and there is no happy solution. It annoys me enough that C is just not a tool I want to reach for anymore. I'm just tired of ugly, buggy C string code.

The good news is there is a fix, and I've proposed it, but it gets zero traction:

https://www.digitalmars.com/articles/C-biggest-mistake.html


> You wind up with two string representations in your code, and then what? Have each string function come in a pair?

As said, I don't think this is the end of the world, and I'm likely to add a number of other string representations. While it happens rarely, I don't worry about formatting a temporary string for an API into a temporary before calling it. Because most "string" things are small and dispensable. Zero-terminated strings are the cheap plastic solution that just works for submitting string-literals to printf, and that just works to view directly in a binary. And they're compatible with length delineated in the sense that you can supply a (cheap plastic) zero-terminated string to a (more serious) length delineated API. Also the other way, many length delineated APIs are designed to work with both - supply -1 as length, and you can happily put a string literal as argument, don't even have to macro your way with sizeof then to supply the right length.

> The good news is there is a fix, and I've proposed it, but it gets zero traction

I'm aware of this and I like it ("fat pointers") but I wouldn't like it if the APIs would miss the explicit length argument because there's a size field glued to the slice.


> many length delineated APIs are designed to work with both - supply -1 as length, and you can happily put a string literal as argument, don't even have to macro your way with sizeof then to supply the right length.

I'm sorry, I just have to say "no thanks" to that. I don't really want each string function to test the length and run strlen if it isn't there.

By now, the D community has 20 years experience with length as part of the string type. Nobody wants to go back to the C way. It's probably the most unambiguously successful and undisputed feature of D. C code that gets converted to D gets scrubbed of the stringz code, and the result is cleaner and faster.

D still interfaces with C and C strings. The conversion is done as the last step before calling the C function. (There's a clever way to add a 0 that only rarely requires an allocation.) Any C strings returned get immediately converted with the slice idiom:

    string s = p[0 .. strlen(p)];
> I wouldn't like it if the APIs would miss the explicit length argument because there's a size field glued to the slice.

I bet you would like it! (Another problem with a separate length field is there's no obvious connection between it and the string - which is another source of bugs.)


> Last time you were looking at a binary using your editor or pager, how much better has your experience been thanks to NUL terminators?

Not perceptibly better. And yeah, I do look at binary dumps now and then, after all, I wrote the code that generates ELF, OMF, MachO, and MSCOFF object file formats, and librarians for them :-)


I wrote simple ELF and PE/COFF writers too, but independently of that, zero terminators are what lets you find strings in a binary. And what allows the "strings" program to function. It simply couldn't work with without those terminators.

Similarly, the text we're exchanging consists of words and sentences that are terminated using not zero bytes, but other terminators. I'm very happy that they're not length delineated.


> It simply couldn't work with without those terminators.

Yeah, it will. For a related example, I use `grep` all the time to find strings in source code. Source code is not 0 terminated. It works fine.


I use "grep -w foo" (or something like "grep '\<foo\>'"), because when I look for "foo" I don't want "bazfoobar". grep -w only works because the end of words is signaled in-band (surrounding / terminating words with whitespace).




Guidelines | FAQ | Lists | API | Security | Legal | Apply to YC | Contact

Search: