
As a bit of historical context:

"This is horribly inefficient BSD crap. Using these function only leads to other errors. Correct string handling means that you always know how long your strings are and therefore you can you memcpy (instead of strcpy).

Beside, those who are using strcat or variants deserved to be punished."

- Ulrich Drepper, around 23 years ago: https://sourceware.org/legacy-ml/libc-alpha/2000-08/msg00053...



He is not wrong, is he? If you are using null terminated strings that's the thing you need to fix.

I still support this addition. If you are doing methamphetamine with needle sharing you should stop methamphetamine, but distributing clean needles is still an improvement.


He's not wrong. The main reason to have these functions is that other implementations have them, programs are using them, and those programs otherwise have to define the functions themselves when ported to glibc.

One benefit of defining strlcpy yourself is that you can define it as a macro that expands to an open-coded call to snprintf, and then that is diagnosed by GCC; you may get static warnings about possible truncation. (I suspect GCC might not yet be analyzing strlcpy/strlcat calls, but that could change.)

The functions silently discard data in order to achieve memory safety. Historically, that has been viewed as acceptable in C coding culture. There are situations in which that is okay, like truncating some unimportant log message to "only" 1024 characters.

Truncating can cause an exploitable security hole; for example, some syntax is truncated so that its closing brace is missing, and the attacker is somehow able to complete it maliciously.

Even when arbitrary limits are acceptable, silently enforcing them in a low-level copying function may not be the best place in the program. If the truncation is caused by some excessively long input, maybe that input should be validated close to where it comes into the program, and rejected. E.g. don't let the user input some 500 character field, pretend you're saving it and then have them find out the next day that only 255 of it got saved.

Even if in my program I find it useful to have a truncating copying function, I don't necessarily want it to be silent when truncation occurs. Maybe in that particular program, I want to abort the program with a diagnostic message. I can then pass large texts in the unit and integration tests, to find the places in the program that have inflexible text handling, but are being reached by unchecked large inputs.
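As a rough sketch of that idea (the name and abort policy are hypothetical, not anything glibc provides), a truncation-intolerant copy might look like:

```c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical checked copy: like strlcpy, but aborts with a
 * diagnostic instead of silently truncating. */
static size_t copy_or_abort(char *dst, const char *src, size_t size)
{
    size_t len = strlen(src);
    if (len >= size) {
        fprintf(stderr, "fatal: %zu-byte string does not fit in %zu bytes\n",
                len, size);
        abort();
    }
    memcpy(dst, src, len + 1);  /* +1 copies the terminating NUL */
    return len;
}
```

Feeding large texts through the tests then pinpoints the call sites with inflexible text handling.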


Example:

  #include <stdio.h>
  #include <string.h>

  /* Macro version: expands to an open-coded snprintf call, which GCC
     can analyze and warn about possible truncation at each call site. */
  #define strlcpy(dst, src, size) ((size_t) snprintf(dst, size, "%s", src))

  /* Parenthesizing the name suppresses macro expansion, so this still
     defines a real function for callers that need one. */
  size_t (strlcpy)(char *dst, const char *src, size_t size)
  {
    return strlcpy(dst, src, size);
  }

  int main(void)
  {
    char littlebuf[8];
    strlcpy(littlebuf, "Supercalifragilisticexpealidocious", sizeof littlebuf);
    return 0;
  }


  strlcpy.c: In function ‘main’:
  strlcpy.c:4:63: warning: ‘%s’ directive output truncated writing 34 bytes into a region of size 8 [-Wformat-truncation=]
   #define strlcpy(dst, src, size) ((size_t) snprintf(dst, size, "%s", src))
                                                               ^
  strlcpy.c:14:22:
     strlcpy(littlebuf, "Supercalifragilisticexpealidocious", sizeof littlebuf);
                      ~
  strlcpy.c:14:3: note: in expansion of macro ‘strlcpy’
     strlcpy(littlebuf, "Supercalifragilisticexpealidocious", sizeof littlebuf);
   ^~~~~~~
  strlcpy.c:4:34: note: ‘snprintf’ output 35 bytes into a destination of size 8
   #define strlcpy(dst, src, size) ((size_t) snprintf(dst, size, "%s", src))
                                   ~^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
  strlcpy.c:14:3: note: in expansion of macro ‘strlcpy’
     strlcpy(littlebuf, "Supercalifragilisticexpealidocious", sizeof littlebuf);
     ^~~~~~~
If glibc doesn't do something in the header file such that we get similar diagnostics for its strlcpy, we can make the argument that this is detrimental to the program.


There is a hierarchy of bugs involved here. Memory safety is a much more serious class of problem. Obstinately refusing to improve the status quo because it doesn't solve all problems is just plain bad engineering. Doubly so in this case where there exist the "n" variants of string functions that are massive foot guns.


Yes, he’s wrong. To apply your metaphor: improvements to the mess that is string handling in C are still an improvement, even if they don’t solve the underlying problem.


Well, the wider problem then is using C.


Pretty much all operating system APIs use C-style zero-terminated strings. So while C may be historically responsible for the problem, not using C doesn't help much if you need to talk to OS APIs.


> not using C doesn't help much if you need to talk to OS APIs

This means cdecl, stdcall or whatever modern ABIs OSes use, not C. Many languages and runtimes can call APIs and DLLs, though you may rightfully argue that their FFI or wrappers were likely compiled from C using the same ABI flags. But ABI is no magic, just a well-defined set of conventions.

And then, no one prohibits using length-aware strings and either keeping a safety NUL at the end or copying to a null-terminated buffer just before a call. Most OS calls are I/O-bound and incomparably heavy anyway.


The problem is, a null-terminated string is a very simple concept for an ABI. A string with a length count seems simple, but there is a big step up in complexity, and you can't just wistfully imagine effortlessly passing String objects around to your ABI.

For a start, String objects are going to be different everywhere. Even in C++, one library's String object isn't going to be binary compatible with another. How is the data laid out, does it do small string optimisation, etc? Are there other internal fields?

So you won't be passing objects around. At the ABI, you'll have to pass a pointer and a length. Calling an ABI will involve unwrapping and wrapping objects to pretend you are dealing with 'your' strings. Simple C-style ABIs make memory management straightforward (n.b. but error-prone, and certainly not easy). If this new style ABI returns a 'string' (pointer and length) of some sort, you have to package it up in your own object format, and manage the memory. Will you need an extra object type to represent 'string I got from an ABI, whose memory is managed differently'?

None of these are insurmountable, but they are a complexity that is rarely thought of when people declare 'C style ABIs are terrible!'


> For a start, String objects are going to be different everywhere. Even in C++, one library's String object isn't going to be binary compatible with another. How is the data laid out, does it do small string optimisation, etc? Are there other internal fields?

I don't really think anyone expects a C ABI to have multiple implementation-defined string types. They want there to be a pointer + length string interface removing the use of null pointer style strings altogether.

> If this new style ABI returns a 'string' (pointer and length) of some sort, you have to package it up in your own object format

A C function with proper error handling (something you want for all your interface functions) normally looks something like this:

int name(T1 param_1, T2 param_2, ..., TN param_n, R1* return_1, R2* return_2, ..., RN* return_n);

Where the return int is the error code, param_1..param_n are the input parameters, and return_1..return_n are the results of the function.

When writing these kinds of functions having an extra parameter for the size of the strings either for input or output is not a huge complexity increase.
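A minimal sketch of that convention with a sized output string (the function and error names are made up for illustration):

```c
#include <stdio.h>
#include <string.h>

enum { ERR_OK = 0, ERR_TOO_SMALL = 1 };

/* Inputs first, results last, int return carries the error code.
 * The output string has an explicit capacity so the callee can
 * report "buffer too small" instead of overflowing or truncating. */
static int format_greeting(const char *name,          /* input         */
                           char *out, size_t out_cap, /* result buffer */
                           size_t *out_len)           /* result length */
{
    int n = snprintf(out, out_cap, "hello, %s", name);
    if (n < 0 || (size_t)n >= out_cap)
        return ERR_TOO_SMALL;
    *out_len = (size_t)n;
    return ERR_OK;
}
```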

> Will you need an extra object type to represent 'string I got from an ABI, whose memory is managed differently'?

The memory management system you use does not depend on whether you use null-terminated strings or a pointer + length pair. Both representations support stack, manual, managed, or GC memory. It's just about the string representation.

For example:

I use a gc language.

I call a c library which returns a string that I get ownership of.

Now I want to leverage the gc to automatically free the string at some point. What I do is tell the gc how to free it, I have to do this no matter how the string is represented.

Or take the inverse.

I send in a string to the c library, which takes ownership of it.

Now the library must know how to free the memory. Typically this is done by allocating it with a library allocator (which can be malloc) before sending it to the function. Importantly the allocator is not the same as the one we use for everything else.

What I am getting at is that if the caller and the callee are not using the same memory system, you always have to marshal between them, no matter whether you are using null-terminated strings or a pointer + length pair.
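A sketch of that caller-side marshaling, using a hypothetical library allocator pair:

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical allocator pair exported by the library: ownership may
 * only be transferred for memory obtained from lib_malloc, so both
 * sides agree on how the string is eventually freed. */
static void *lib_malloc(size_t n) { return malloc(n); }
static void  lib_free(void *p)    { free(p); }

/* Duplicate a string into library-owned memory before handing over
 * ownership. This marshaling step is needed whether the string is
 * NUL-terminated or a pointer + length pair. */
static char *dup_for_library(const char *s)
{
    size_t n = strlen(s) + 1;   /* +1 for the terminating NUL */
    char *owned = lib_malloc(n);
    if (owned)
        memcpy(owned, s, n);
    return owned;
}
```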


> pointer + length string interface

If it's a 32 bit length, that will be limiting for some 64 bit programs.

If it's a 64 bit length, it means tiny strings take up more space.

Hey, do both! Have the length be a "size_t" and then have a "compat_32" shim around every system call that takes at least one string argument.

Wee!

Imagine a parallel world in which mainstream OS kernel developers had seen the light 30 years ago and used len + data for system calls. You'd now have to support ancient binary programs that are passing strings where the length is a uint16. Oh right, I forgot! We can just screw programs that are more than five years old. All the cool users are on the latest version of everything.

> if you are not using the same memory system in the caller and the callee you have to marshal between them always. No matter if you are using null terminated strings or a pointer + length pair.

Null-terminated byte strings are always marshaled and ready to be sent literally anywhere. They have no byte order issues. No multi-byte length field whose size and endianness we have to know. If they are UTF-8, their character encoding is already marshaled also (that's the point of using UTF-8 everywhere).


>Null-terminated byte strings are always marshaled and ready to be sent literally anywhere. They have no byte order issues.

They have https://en.cppreference.com/w/c/string/wide


Why are you citing documentation about wide strings, in response to a comment about byte strings (that even mentions UTF-8)?


> don't really think anyone expects a c abi to have multiple implementation defined string types. They want there to be a pointer + length string interface removing the use of null pointer style strings alltogether

Not so simple.

32-bit or 64-bit length? Signed or unsigned? It doesn't make sense to have a signed length.

Zero-length strings are easy, but what about null strings? Are you going to design the pointer + length struct to be opaque so that callers can only ever use pointers to the struct? If you don't, you cannot represent a null string (i.e. a missing value) differently from an empty string.

How do callers free this string? You have to mandate that they use a special stringFree function, or rely on callers first freeing the pointer field and then freeing the struct.

Composite data types are a lot more work and are more error prone in C.
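To make those design questions concrete, here is one hypothetical way such a struct could answer them (null vs. empty distinguished by the pointer, one mandated free function):

```c
#include <stdbool.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical pointer + length string struct. */
typedef struct {
    char  *data;  /* NULL means "missing value"             */
    size_t len;   /* byte count, excluding any terminator   */
} str_t;

static bool str_is_null(str_t s)  { return s.data == NULL; }
static bool str_is_empty(str_t s) { return s.data != NULL && s.len == 0; }

/* One mandated free function, so callers never guess the layout. */
static void str_free(str_t *s)
{
    free(s->data);
    s->data = NULL;
    s->len = 0;
}

static str_t str_from_cstr(const char *p)
{
    str_t s = { NULL, 0 };
    if (!p)
        return s;                      /* null string (missing value) */
    s.len = strlen(p);
    s.data = malloc(s.len + 1);
    if (s.data)
        memcpy(s.data, p, s.len + 1);  /* keep a NUL for C interop */
    return s;
}
```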


We're very much in agreement.

The whole 'null pointer style strings' makes no sense, I think they want to say 'nul terminated'. But fine.

Your examples are excellent, let me add a few more:

Big endian? Little endian? Do we count characters or bytes? Who owns the bloody thing? Can they be modified in place? Are they in ROM or RAM? Automatic? Static? Can they be transmitted over a network 'as is' or do they need to be sent via some serialization mechanism? What about storing them on disk? And can they then be retrieved on different architectures?

The problem really is that C more or less requires you to really know what you're doing with your data and that's impossible in a networked world because your toy library ends up integrated into something else and then that something else gets connected to the internet and suddenly all those negative test cases that you never thought of are potential security issues. So any simplistic view of string handling will end up with a broken implementation regardless of how well it worked in its initial target environment.

C's solution is simple: take the simplest possible representation and use that, pass responsibility back to the programmer for dealing with all of the edge cases. The problem is that nobody does and even those that try tend to get it subtly wrong several times across a codebase of any magnitude.

It's a nasty little problem and it will result in security issues for decades to come. There are plenty of managed languages, I had some hope (as a seasoned C programmer) that instead of this Cambrian explosion of programming languages that we'd have some kind of convergence so that it becomes easier, not harder to pick a winner and establish some best practices. But it seems as though cooperation is rare, much more common is the mode where a defect in one language or eco system results in a completely new language that solves that one problem in some way (sometimes quite convoluted) at the expense of introducing a whole raft of new problems. Besides the fragmentation of mindshare.


It's not a hypothesis; the thing has already been implemented many times in C, C++ and other languages and used for ages, especially in networked code, because C's "there's no length" approach is a guaranteed vulnerability.


It's not a guaranteed vulnerability, it's a potential vulnerability.

Guaranteed doesn't mean "this will probably happen", it means "this will definitely happen".

The "no length approach" can probably result in a vulnerability. It won't definitely result in a vulnerability.

I mean, come on, if it was a guaranteed vulnerability, almost nothing on the internet would work, because it all has, somewhere down the line, a dependency on a nul-terminated string.

I mean, do you think that nginx (https://github.com/nginx/nginx/blob/master/src/core/ngx_stri...) is getting exploited millions of times per hour because they have a few uses for nul-terminated strings?


nginx whacks one mole at a time https://cve.circl.lu/cve/CVE-2013-2028


That CVE has absolutely nothing to do with length-up-front vs nul-terminated strings. It's also ten years old. The only thing it does is reference nginx, but that's disingenuous, unless the point you're trying to make is that nginx has the occasional security issue, which I think we're all very much aware of. But it doesn't answer the GP's point in any relevant way.


The problem there is in opportunistic bounds checking due to the loose association of an array with its length, a string being an example of an array. This vulnerability is a direct consequence of C's "there's no length" approach and shows why this approach is unsuitable for networked code.


In C a string is not an example of an array. If we can't agree on terminology for a discussion that requires extreme precision it becomes difficult to keep going.

Networked code does not as a rule use C style nul terminated strings though, in the case of fixed length buffers they will usually be accompanied either by a length field or by zeroing out the end of the string or even the whole buffer (the latter is much better and ensures you don't accidentally leak data from one session to another).

Networked code doesn't have to be written in C to begin with. Regardless of implementation there usually is a protocol spec and you adhere to that spec and if you don't then you'll find out the hard way why it matters.

This particular vulnerability has nothing at all to do with C strings but in fact has everything to do with a broken implementation of length based strings, which could result in the length being negative, which is at least one problem which C style strings do not have... (small comfort there, they have plenty of other problems, but that one they don't.).

This is the fix for that particular CVE:

https://github.com/nginx/nginx/commit/4997de8005630664ab35f2...

Which stems from integer overflow after doing arithmetic on the lengths.

It looks to me as though you just pulled the first nginx CVE that you found and posted it without looking at what the CVE was all about, without realizing that the ancestor comment was referring to the string implementation inside nginx which lives in the referenced file, whereas you are pointing to a CVE related to the parsing of HTTP chunked data requests, which resides in an entirely different file and has nothing to do with string handling to begin with.


And what do you propose? To let only 1.5 good C programmers in the world write code like in 70s?


> And what do you propose?

That you get your terminology right, back up your claims with links that actually make sense and try to understand that the software world is complex and that incremental approaches make more sense than demanding unrealistic / uneconomical changes because they are not going to happen.

> To let only 1.5 good C programmers in the world write code like in 70s?

No, I did not propose that; you just did, and clearly that's nonsense, aka a strawman, even if you didn't bother knocking it down.

C is here. It will be here decades from now. Rewriting everything is not going to happen, at least, not in the short term. C will likely still be here (and new C code will likely still be written) in 2100, and possibly long after that. This isn't ideal and it's not going to help that we can not make a clean break with the past even though we are trying.

The solution will come in many small pieces rather than as one silver bullet to cure it all, and TFA announces two such small pieces and as such is a small step in a very, very long game. The adoption of Rust and other safer languages (not inherently safe but safer, there are still plenty of footguns left) may well in the longer run give us a chance to do away with the last of the heritage from the C era. But there is a fair chance that it won't happen and that Rust's rate of adoption will be too low to solve this problem timely.

The same goes for every other managed language, they are partial solutions at best. This isn't good news and it isn't optimal, but it is the reality as far as I can determine. If you're going to do a new greenfield development I hope that you will find yourself on a platform where you won't have to use C and that you have skills and resources at your disposal that will allow you to side-step those problems entirely. But that won't do anything for the untold LOC already out there in production and that utterly dwarfs any concern I have about future development, it's the mess we made in the past that we have to deal with and we have to try hard to avoid making new messes.

Think of it as fixing a large toxic waste spill.


It's not a hypothesis, the change happened several times and is used in networking code: in putty and s2n in C and in grpc in C++ and I guess in all C++ code that uses string_view and span, it's easier to happen in C++ due to more language features.

>Rewriting everything is not going to happen, at least, not in the short term.

If you can't do a big task in one go, split it into smaller tasks and do them in sequence.


I'm sorry, I apparently lack the vocabulary or clarity of expression to get my points across to you so I'm bowing out here.


Which C compilers are those then?

Also, you keep writing 'null pointer' and 'null', there is a pretty big difference between 'null' and 'nul' and in the context of talking about language implementation details such little things matter a lot. You say a lot of stuff with great authority that simply doesn't match my experience (as a C programmer of many decades) and while I'm all open to being convinced otherwise you will have to show some references and examples.


What doesn't match your experience?


My experience as a programmer of some 40 years in C has yet to expose me to a C compiler that has length based rather than nul terminated strings as the base string type. Please point me to one in somewhat widespread use rather than an experimental implementation that uses this concept and make sure not to confuse libraries with the implementation of the language.


Since no C/C++ compiler supports it, for them the implementation is in a library.


So that means they are not part of C/C++. Which was the point. You can write software in C/C++ but that's hardly news and you can use that to create new data types that are not in the language, which also is hardly news.


People suggesting it are concerned about security, they don't intend it to be a novel invention. Bound checking predates C.


Yes it does. But that doesn't mean that you get to state a lot of stuff with certainty that upon inspection turns out to simply not be true. C programmers are - in spite of what you appear to think - also concerned about security. And whether bounds checking predates C or not has nothing to do with how this is implemented, in a library or in the compiler itself (or even in the hardware).

If you reference C you are talking about the compiler, that, and only that is the language implementation. In C that specification is so tiny that a lot of the functionality that you might expect to be present in the language is actually library stuff. K&R does a poor job for novices to split out what is the language proper and what is the library, but a good hint is that anything that requires an include file isn't part of the language itself.

The original comment to which you responded talked about the ABI, the layer between the applications and the operating system, presumably the UNIX/POSIX ABI, which is more or less cast in concrete by now and unlikely to be replaced because if you do so you introduce a breaking change: all compiled applications using that ABI will no longer work. Some versions of UNIX will occasionally do this and this is widely regarded as a great way to limit your adoption. So the problem, in a nutshell is: how do we repair the security situation that has emerged as the result of many years of bad practices in such a way that our systems continue to work without having to re-invest the untold trillions of $ that have been spent on software that we use every day. This is a hard problem. TFA is a small, and incremental step in trying to solve that problem.

Others are more pessimistic, believe that we should just take our lumps and get on with that rewrite, usually in whatever is their favorite managed (or unmanaged, in some cases) language. Yet others pursue compiler based or hardware based solutions which all introduce different degrees of incompatibility.

I'm somewhat bearish on seeing this problem resolved in my lifetime. At the same time I applaud every little step in the right direction. And I personally do not believe that replacing C's 'string type' (which it really doesn't have other than nul terminated string literals) is the way to go due to the reasons outlined above. But an incremental approach allows for fixing some known issues and allows us to back away from historical mistakes in a way that we can afford the cost and to do so without incurring the penalty of a complete rewrite (which usually comes with a whole raft of new bugs as well). So small improvements that do not address each and every grievance should be welcomed. Even if they no doubt introduce new problems at least the scope is such that you can - hopefully - deal with those without introducing new security issues.


Putty and s2n are examples of how this problem is solved; they work on POSIX, e.g. Linux: just compile them with gcc and they work.


>32bit or 64bit length? Signed or unsigned? It doesn't make sense to have a signed length.

32 bit should be enough for everyone, it's easier to type as int, and you have fewer problems with variable-sized integers on different targets. Signed length makes sense because length is a number, and numbers are signed; also, in conjunction with arrays, a -1 sentinel value is often used.

>If you don't, you cannot represent a null string (IE a missing value) differently to an empty string.

C++ can't do it either with std::string and the sky doesn't fall, because such a distinction is rarely needed; for business logic an empty string means absence of value. Actually, in languages with nullable strings, a null string and an empty string are routinely synonymous, and you often use a method like IsNullOrEmpty to check for absence of value. Anyway, you need the concept of absence for other types too, like int, so string isn't special here.

>You have to mandate that they use a special stringFree function, or rely on callers first freeing the pointer field and then freeing the struct.

pointer+length struct is a value type, see https://en.cppreference.com/w/cpp/container/span


> C++ can't do it either with std::string and sky doesn't fall, because such distinction is rarely needed and for business logic empty string means absence of value,

Incorrect. I'm literally, today, working on a project where the business logic is different depending on whether an empty string is stored in the database, or no string.

"User didn't get to fill in a preference" is very different from "user didn't indicate a preference".

In more practical terms, a missing value could mean that we use the default while an empty value could mean that we don't use it at all.


For the user, an empty text field means absence of value. Indeed, a situation requiring optional values rarely arises, but it's not only for strings; other types like int may need it too.


The end user representation of a programming construct versus the implementation details surrounding such constructs give rise to what is called a 'leaky abstraction', in this case that 'absence of value' is something entirely different than 'empty string'.

We have a way of representing absence of value for some data types but not for others, again because of implementation details. This sort of leaky abstraction often gives options for creativity but it can also lead to trouble and bugs. Some languages offer such 'optional' behavior to more datatypes and make it a part of function calling conventions, either by supplying a default or by leaving the optional parameters set to the equivalent of 'empty' or even 'undefined' if that is possible.


Pretty much all string implementations have the ability to give you a pointer and a length which you can then pass on to the foreign interface. Essentially, the API always takes a non-owning string view. C strings on the other hand require you to store that terminating NUL next to the string. This is only bearable because most string implementations are designed to deal with it, since C APIs are so popular.

For returning strings, ownership is a bigger problem than the exact representation. OS APIs typically make you provide a buffer and then fail if it was not big enough.


>Simple C-style ABIs make memory management straightforward (n.b. but error-prone, and certainly not easy).

The idea is to use C-style memory management: you provide a buffer into which the string is copied. For an example of returning a string this way, see the getenv_r function: https://man.netbsd.org/getenv.3

In C++ it's more similar to std::span.
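A sketch of that convention with a made-up lookup function (getenv_r itself is a NetBSD extension):

```c
#include <errno.h>
#include <string.h>

/* Hypothetical stand-in for a real lookup (environment, registry...). */
static const char *demo_lookup(const char *key)
{
    return strcmp(key, "GREETING") == 0 ? "hello" : NULL;
}

/* Caller-provided-buffer convention in the style of getenv_r: the
 * callee copies into the caller's buffer, failing with ERANGE when
 * it is too small, so ownership of the result is never in question. */
static int get_value_r(const char *key, char *buf, size_t bufsz)
{
    const char *val = demo_lookup(key);
    if (!val)
        return ENOENT;
    if (strlen(val) >= bufsz)
        return ERANGE;  /* report "too small" instead of truncating */
    strcpy(buf, val);
    return 0;
}
```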


> you can't just wistfully imagine effortlessly passing String objects around

To clarify, I didn’t mean it. No new style API/ABI. Only unboxing a string into (str, len) in/out-params and boxing it back from returns.


Lots of C programs define a more substantial string type for themselves (e.g. dynamic, reference-counted strings or what have you), used only internally. Time-honored tradition.


You do it like Windows does and define safe strings for the ABI, as done for the COM API, nowadays the main kind of Windows API.


I suspect null terminated strings predate C; C is just one of many languages that can use them.


The PDP-10 and PDP-11 assemblers had direct support for nul-terminated strings (ASCIZ directives, and OUTSTR in MACRO10) which Ritchie adopted as-is, not unlike Lisp’s CAR/CDR. It’s not entirely clear that other “high-level” languages at the time also used such a type.

Although later ISAs added support for it for C compatibility, older ISAs tended to only support fixed-length or length-prefixed strings; for instance the Z80 has LDIR, which is essentially a memcpy, so copying a terminated string required a manual loop.


All non-dynamic string representations give rise to situations where programmers need to combine strings that don't fit into the destination.

Dynamic strings, whether null-terminated or not, solve the problem of adding two strings together without worrying whether the destination buffer is large enough (trading that problem for DoS concerns when a malicious agent may feed a huge input to the program).
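A minimal sketch of such a dynamic string, where concatenation grows the buffer instead of overflowing a fixed destination:

```c
#include <stdlib.h>
#include <string.h>

/* Minimal dynamic string: data stays NUL-terminated for C interop,
 * but the length and capacity are tracked explicitly. */
typedef struct {
    char  *data;
    size_t len;
    size_t cap;
} dstr_t;

static int dstr_append(dstr_t *d, const char *s)
{
    size_t n = strlen(s);
    if (d->len + n + 1 > d->cap) {
        size_t cap = d->cap ? d->cap * 2 : 16;   /* geometric growth */
        while (cap < d->len + n + 1)
            cap *= 2;
        char *p = realloc(d->data, cap);
        if (!p)
            return -1;      /* out of memory; string left unchanged */
        d->data = p;
        d->cap = cap;
    }
    memcpy(d->data + d->len, s, n + 1);  /* includes the NUL */
    d->len += n;
    return 0;
}
```

A production version would also cap the growth, which is where the DoS concern above comes back in.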


Nothing prevents those operating systems from offering custom string types.


In reality, a ton of stuff does. As an example: What do you do if someone calls your new string+length API with an embedded \0 character? Your internal functions are all still written in C and using char* so they will silently truncate the string. So you need to check and reject that. Except you forgot there are also APIs (like the extended attrs APIs) that do accept embedded \0. The exceptions are all over the place, in ioctl calls passed to weird device drivers etc.


Windows internally uses string+length struct, null terminated string API is just compatibility interface on top of it.


*new operating systems

You can't change the string type without breaking all apps and services.


Even on a new OS it's going to be a compatibility problem. Implementing even partial POSIX compatibility makes porting stuff easier, but changing how strings work is going to make it significantly harder.


As a user posting from a Linux machine, I disagree. Though it seems the "don't use C" crowd often delegate the important decisions to somewheres else.

I guess the answer is "some people's C is good enough, but not yours"


If the problem is "you're using nul-terminated strings" as the GP said, then "don't use C" is a good step towards fixing that problem, no?


Perhaps, but also realistic to accept that you're using code where other people do/have and that the same logic would apply to them.


You only have to care about it at boundaries though, for the most part. Like, when calling a C API. That's easy to handle. Even C++'s std::string can do that, as the c_str method always returns a null-terminated string. That inherently kills the need for things like strcat.


The return from c_str cannot be used everywhere you would normally use a null terminated string, because the return is const.

For example, you couldn't pass it to strtok, or any other function that needs to even temporarily modify the string.


strtok is an abomination. The only reason it needs to modify the input string in the first place is to support zero-terminated output strings without having to make copies.
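A small illustration of that destructive behaviour, since strtok writes NUL bytes into the buffer it is handed:

```c
#include <string.h>

/* Counts sep-delimited tokens. strtok overwrites each separator with
 * '\0' as it goes, which is why it cannot accept a const char *.
 * (strtok_r is the reentrant variant; neither leaves the input intact.) */
static int count_tokens(char *s, const char *sep)
{
    int n = 0;
    for (char *tok = strtok(s, sep); tok != NULL; tok = strtok(NULL, sep))
        n++;
    return n;
}
```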


While this is true, passing a string to a C function that is manipulating the string would defeat the point of not using C string manipulation.


You may not know the function is doing C string manipulation, since const correctness in APIs is not a 100% thing.


If it's just incidental mutation that is a concern, rather than intentionally mutating C strings, no problem: it is common-place to defensively clone strings and other memory when passing them to untrusted interfaces. In fact, if this is your fear, you have literally no alternative but to do so, even when programming directly in C.

Then again, if there's no contract for who owns or mutates a given piece of memory, there's no safe way to use said API from any language or environment and you should probably stop using it. Failing that, you'd just have to check the source code and find out what it actually does and hope that it does not change later.

(Of course, this has no bearing on whether or not you should use C strings or C string manipulation: You shouldn't, even if you're touching unsafe APIs. It's extremely error prone at best, and also pretty inefficient in a lot of cases.)


@jchw I don't see anything you write as disagreeable. But clearly you have a strong handle on what needs to be taken care of.


Turtles all the way down isn't it? At some point, someone has to take responsibility.


Let me reframe this. What we're saying to do is stop using C string manipulation such as strcat, strcpy, etc. Particularly, I'm saying simply don't use C-style null terminated strings until you actually go to call a C ABI interface where it is necessary.

The argument against this is that you might call something that already internally does this, to your inputs directly, without making a copy. Yes, sure, that IS true, but what this betrays is the fact that you have to deal with that regardless of whether or not you add additional error-prone C string manipulation code on top of having to worry about memory ownership, mutation, etc. when passing blobs of memory to "untrusted" APIs.

It's not about passing the buck. Passing a blob of memory to an API that might do horrible things not defined by an API contract is not safe if you do strcat to construct the string or you clone it out of an std::string or you marshal it from Go or Rust. All this is about, is simply not creating a bigger mess than you already have.

Okay fine, but what if someone hates C++ and Rust and Go and Zig? No problem. There are a slew of options for C that can all handle safer, less error-prone string manipulation, including interoperability with null-terminated C strings. Like this one used in Redis:

https://github.com/antirez/sds

And on top of everything else, it's quite ergonomic, so it seems silly to not consider it.

This entire line of thinking deeply reminds me of Technology Connections' video The LED Traffic Light and the Danger of "But Sometimes!".

https://youtube.com/watch?v=GiYO1TObNz8

I think hypothetically you can construct some scenarios where not using C strings for string manipulation requires more care, but justifying error prone C string manipulation with "well, I might call something that might do something unreasonable" as if that isn't still your problem regardless of how you get there makes zero sense to me.

And besides, these hypothetical incorrect APIs would crash horrifically on the DS9K anyways.


This thread reminds me of the essay, "Some were meant for C"

https://www.humprog.org/~stephen/research/papers/kell17some-...


C is the needle, now often contaminated with the deadly RCE virus. Historically it was used to inject life into the first bytes of the twisted, self-perpetuating bootstrapping chain of an ecosystem that today dominates the planet and the space around it.


All processors are C VMs at the end of the day. They are designed for it, and it's a great language to access raw hardware and raw hardware performance.

I still fail to label C as evil.

P.S.: Don't start with all the memory management and related stuff. We have solutions for these everywhere, incl., but not limited to, GCs, Rust, etc. Their existence does not invalidate C, and we don't need to abandon it. Horses for courses.


> All processors are C VMs at the end of the day.

That would be a poor argument back in the 80s; and is increasingly wrong for modern processors. Compiler intrinsics can paper-over some of the conceptual gap, but dropping down to inline assembly can't be entirely eliminated (even if it's relegated to core libraries). Lots of C code relies on certain patterns compiling down to specific instructions, e.g. for vectorising; since C itself has no concept of such things. C is based around a 1D memory model which has no concept of cache hierarchies. C has no representation of branch prediction, out-of-order instructions, or pipelines; let alone hyperthreading or multi-core programming.

After all, if processors were "C VMs", then GCC/LLVM/etc. wouldn't be such herculean feats of engineering!


This is a subject I love to discuss.

Exactly. C is based around 1D memory and has no understanding of caches. All of your other arguments are true, too.

This is why most of these things — caches, memory hierarchies, and other modern machinery — are hidden from C (and other languages, or software in general), to trick C into thinking it's still running on a PDP-11.

All caches (L1, L2, L3, even disk caches, and various caches built in RAM) are handled by the hardware or by OS kernels themselves. Unless they provide an API to talk to, they are invisible, untouchable, and unmanageable, and this is by design (esp. the ones baked into hardware, like the Lx caches and other buffers).

Compilers are the interface perpetuating this smoke and mirrors, so as not to upset C's assumptions about the machine underneath. Even then, a compiler can only command the processor up to a certain point. You can't say "keep these in cache, and evict those." These are automagic processes.

Exactly because of these reasons, CPUs are C VMs. They work completely differently from a PDP-11, but behave like one at the uppermost level, where compilers are the topmost layer in the toolchain.

Compilers are such herculean feats of engineering because we need to trick the programs we're building into thinking they're running on much simpler hardware. In turn, the hardware tries hard to keep the management overhead handled by compilers to a bare minimum while allowing higher and higher performance.

More ponderings, and foundation of my assertion is here: https://dl.acm.org/doi/10.1145/3212477.3212479

Paper is titled: C Is Not a Low-level Language: Your computer is not a fast PDP-11.


Caches, memory hierarchies, out-of-order execution, etc. are hidden from assembly as well as C. One reason for this that isn't mentioned in your comment (or the ACM article) isn't that everyone loves C but rather that most software has to run on a variety of hardware, with differing cache sizes, power consumption targets, etc. Pushing all of that optimization and fine tuning off to the hardware means that software isn't forced to only work on the exact computer model it was designed to run on.

The author also mentions that alternative computation models would make parallel programming easier, but this neglects the numerous problems that aren't parallelizable. There's a reason why we haven't switched all of our computation to GPUs.


I don't think I completely agree with your sentiment. While we want to make software run everywhere (at least within the x86 family, regardless of the feature sets we have), we want to make sure that our software performs well, too. This is esp. important in areas where we (ab)use the hardware to the highest degree (games, science, rendering, etc.).

To enable these performance optimizations, we have taught our compilers tons of tricks, like the -march & -mtune flags. We also allow our compilers to generate reckless code with -ffast-math, or add tons of assembly or vectorization hints to libraries like Eigen.

We write benchmarks like STREAM, or other tools which measure core-to-core latency, or time code execution with different data lengths to detect cache sizes, associativity, and whatnot. We then use this information to optimize our code or compiler flags to maximize the speed of the software at hand.

If caches and other parts of the system were available to assembly, we could ask the processor for their properties, optimize directly according to their merits, and even do data-allocation tricks or prefetching without guesswork (which some architectures support via programmable external prefetching engines), instead of tuning in the dark via half-informative data sheets, undisclosed AVX frequency behavior, or techniques like running perf and looking at cache-thrashing percentages, IPC, and other numbers to make educated guesses about how a processor behaves.

Yes, not everything can be run in parallel, and I don't want to move all computation to GPUs with FP16 half-precision math, but we can at least agree that these systems are designed to look like PDP-11s from a distance, and our compilers are the topmost layer of this "emulation", doing all kinds of tricks. Trying to push for this performance in an opaque way is why we have Spectre and Meltdown, for example, where these abstractions and mirrors break down.

If our hardware were more transparent to us, we could arguably optimize our code selectively a bit more easily, if it had switches labeled "Auto / I know what I'm doing" for certain features.

Intel tried to take this to the max (do all optimization in the compiler) with Itanium. The architecture was so dense that it failed to float, it seems.


This is backwards. C was conceived as a way to do the things programmers were already doing in assembler, but with high(er) level language conveniences. In turn, the things they were doing in assembler were done to efficiently use the "VM" their code was executed on.


I have linked a paper published in ACM Queue in another comment of mine, which discusses this in depth.

The gist is, hardware and compilers are hiding all the complexity from C and other programming languages while trying to increase performance, IOW, emulating a PDP-11 while not being a PDP-11.

This is why C and its descendants are so long lived and performs very well on these systems despite the traditional memory models and simple system models they employ.

IOW, modern hardware and development tooling creates an environment akin to a PDP-11, not unlike how VMs emulate other hardware to make other OSes happy.

So, at the end of the day, processors are C VMs, anyway.


What a crazy metaphor! You're equating using zero terminated strings in C to doing drugs.


What's up with people seeing an analogy and going "you can't equate those two things"? Analogies aren't equating things.


Analogies are great since they talk about how things are the same, and just as terrible because they talk about things that are different.

But seriously, it’s sometimes hard to slice out what level of similarity is implied. Obvious things are somewhat less obvious to others sometimes.


I feel like the success rate of getting someone off of null terminated strings is probably lower than most rehabilitation programs.


We can't entirely because of the C ABI but apart from that it's as simple as not using C which is not too difficult. C is not a popular language these days.


“Apart from that” does a lot of work here: FFI layers generally talk nul-terminated strings unless otherwise specified, as do syscalls.


Yes that's what I said. You can generally wrap those layers so you aren't actually manipulating null terminated strings; just converting to/from them which is not too bad.


I don't know what you're relying on for the idea that C is not a popular language, but it is extremely popular.


Well, you will need to give up SQLite if you really feel this way, and reimplement it in a safe language.

It will also be some time before Rust has substantial penetration into Linux; you might need to find a kernel that implements the POSIX interfaces safely.

These will not be easy problems to solve.


Yeah, no…


I mean it’s a wash, on the one hand zero-terminated strings have done untold amounts of damage[0] and are impossible to extirpate once they’re in, on the other hand the Nazis were methed up (and later coked up) to their eyeballs.

[0] and not just in C itself, unexpected truncation through FFI is an issue which regularly pops up


ODBC defines a multilingual interface that can accept both null-terminated and length-bounded strings, by using an NTS sentinel value as the length for null-terminated strings.


-edit- I'm not a C programmer, nor do I have any opinion on whether the API is garbage or less bad or whatever.

They seemed useful enough to get added to the other BSDs, Solaris, Mac OS X, Irix(!), QNX, and Cygwin, as well as being used in the Linux kernel.


Distributing clean needles is useful, yes, but you should still lament why it is necessary.


The Linux kernel has better options, notably strscpy.


IMHO it's pretty simple: strings in C are 0-terminated char arrays. If the char array is not 0-terminated, it's not a string.

strncpy() can make a string into a non-string (depending on size), which is clearly bad.


That’s because strncpy does not return a nul-terminated (“C”) string, but a fixed-size nul-padded string.

That, as it turns out, is the case of most (but not all) strn* functions.

Of course strncpy adds the injury that it’s specified to allow nul-terminated inputs (it stops copying at the first nul byte, then fills the rest of the target buffer with nuls).


It also, in some situations, returns a "string" that doesn't have the null terminator, which means it is giving the caller something that literally isn't a string.


It always “returns” the same thing: a fixed-size nul-padded buffer. Call it a char array if you want, that’s always been its role and contract.


> Strings in C are 0-terminated char arrays

To be pedantic, they're pointers to char. Nothing more. Calling them arrays confuses non-C coders. The length is just an unenforced contract and has to be passed.


It's a pointer to a chunk of memory which contains an array of characters. You pass around the pointer because copying an array is expensive and wasteful.

I think (or hope) the concepts are pretty clear if you understand what a pointer is.


strncpy was a bad mistake. If you know the length and there's no null termination, you use memcpy instead.


strncpy isn't good either. But using length delimited strings is the best way to generate fixed length char strings and NUL terminated strings.


I'm surprised they didn't go with strscpy() directly

https://archive.kernel.org/oldlinux/htmldocs/kernel-api/API-...


Because strlcpy has existed in BSD since 1999: https://man.netbsd.org/strlcpy.3


HN discussion around this quote, around 12 years ago: https://news.ycombinator.com/item?id=2378013


> Correct string handling means that you always know how long your strings are

Well, I couldn't think of a stronger argument against NULL terminated strings than this. After all, NULL terminated strings make no guarantee about having a finite length. Nothing prevents you from building a memory mapped string that is being generated on demand that never ends.


Except that's a non-sequitur because you can totally keep separate string length fields.

The only NUL that C requires is the NUL following C string literals, and you can even easily define char-arrays without NUL.

    char buf[5] = "Hello";
or even

    #define DEFINE_NONZ_STRING(name, lit) char name[sizeof lit - 1] = lit "";
Can also easily build pointer + length representations, without even a runtime strlen() cost.

    struct String { const char *buf; int len; };
    #define STRING(lit) ((struct String) { (lit ""), sizeof lit - 1 })


What do you do when the strings might have more than INT_MAX characters?


What will you do on your 200th birthday?

In case you're more interested in theory than practice, I have a different answer: I use a different API.

However, I'm aware not even that could stop you, because you could still ask "what do you do when the strings might have more than SIZE_MAX characters?", which is entirely possible (as a combination of 2 or more strings).

And to answer that, we're coming back to my original answer: It doesn't happen. I'm not calling the API with such huge strings. (And no, I usually don't keep formal proofs that it couldn't happen -- there are also an infinite number of other properties that I don't verify).


INT_MAX is often far less than SIZE_MAX (the former is usually the max of a signed 32-bit integer, the latter of an unsigned 64-bit integer), so usually nothing special.


SIZE_MAX is the largest possible value of type size_t. size_t is defined as an unsigned type that is big enough to represent the size of the largest possible object (which basically means the size of the virtual address space, i.e. 2^32 on a 32-bit system and usually 2^48 on a 64-bit system, which is addressed with a uint64_t).

None of that is relevant since you're extremely unlikely to hit either limit by accident. If you really want, you can hit 32-bit limit if you're doing things that snprintf really shouldn't be used for, and likewise you can hit size_t limit if you're on a 32-bit system and joining multiple large strings.


Yes, my point is just that since all the "strn" C string-handling functions in the standard library use a size_t for the size, if you've got more than INT_MAX characters there's not necessarily any problem. INT_MAX is pretty much always going to be lower than SIZE_MAX, even on 32-bit systems, since the former is signed and the latter isn't. You just call snprintf or whatnot as usual. If you manage to have more than SIZE_MAX characters, then you have a problem. Libc probably can't solve it for you though, since SIZE_MAX has to be large enough to cover any allocation, so you have some sort of segmented architecture that the C standard library isn't expecting.


If that is ever a possible issue, you switch the implementation to use two pointers.

    struct String { const char *buf; const char *buf_end; };


Actually I was answering this question wrong because I somehow understood it in the context of snprintf()'s int return value, and I should have just replied "you can switch to size_t if you like". A start + end pointer is certainly not necessary; not sure why one would ever do this. It's more inviting of bugs compared to start pointer + length.


It is how many languages implement strings without being bound by numeric limits.

Naturally for this to work out without bugs, it cannot be exposed directly, only manipulated via a string library.


size_t is large enough to hold the size of the largest possible object in memory. In practice, on most architectures, that means it is the same size as pointers. I'm not sure if there is a case where start + end pointer can describe a valid string in memory that start pointer + size couldn't? If that was the case, that string wouldn't be an "object" by definition.


Yep, and what if I want to make an arbitrarily large array without much copying?


What's the point of the empty string literal "" ?


It's a poor man's assertion that "lit" is indeed a string literal (such that we can get the string length using sizeof) and not a variable (of pointer type, where sizeof returns the size of the pointer, not the string buffer). If you pass a variable instead of a string literal, that will be a syntax error.


Or more likely strncpy plus a forced final NUL. Return a flag on truncation instead of messing with the return code or errno.

Call it safe_strncpy and be done with it. Otherwise asprintf and snprintf exist. strlcpy is a more garbage version of snprintf.


He was a jerk, but often he had a reason for his abusiveness. Was the reason in this case valid?


The question is: Is string truncation a good solution when the strings you have are unexpectedly long? Like, it's probably ok in a lot of cases, and once you start using these functions, it's very tempting to use them almost everywhere... but truncating "Attack at dawn on Friday" to "Attack at dawn" could be a disaster as well.

On the other hand, his recommendation to always know string lengths and use memcpy didn't really become common practice over the last 20+ years either, so I'm not sure it was worth all the arguing.

At this point, I'm kind of joining the camp of "C has proven to be too bug-prone for most organizations to use safely and therefore we should all go to Rust".


The second part "and therefore we should all go to Rust" does not follow necessarily from the first. Maybe the reason not everybody is gone to Rust is that it lacks something. Maybe we will all go somewhere else.


It lacks developer ergonomics, for me personally.

Source is for humans to read, it shouldn't look like alphabet soup for the idiomatic cases.


I suspect the eventual end result is major compilers start implementing a "fat pointer" string ABI for internal translation units (decaying to char * at the edge where necessary) and people start turning that on.


> On the other hand, his recommendation to always know string lengths and use memcpy didn't really become common practice over the last 20+ years either, so I'm not sure it was worth all the arguing.

It hasn't become common practice in C. But other languages (like JavaScript or Python) have become hugely popular, and don't use null-terminated strings.


Even languages in C's niche encode strings as pointer + length, like Rust.


> On the other hand, his recommendation to always know string lengths and use memcpy didn't really become common practice over the last 20+ years either

It was the way plenty of languages from the 70s stored their strings, including such popular ones as BASIC.


It has in the sense that people allocate strings much more than using fixed-size, stack-allocated arrays.

Modern C uses things like glib's GString, which (in addition to keeping the NUL terminator) track the length and can resize the underlying memory. And people also use a lot more asprintf instead of strcpy and strcat.


> but often he had a reason for his abusiveness

There is never, ever, under any circumstances, a reason to be abusive.


Not really; he was frequently a jerk right out of the starting gates for no particular reason. That quote is the initial reply to the proposed patch, and the only "reason" I see for the insults is to satisfy Drepper's own emotional needs. It's petty and pathetic.

This is very different from e.g. Torvalds who sometimes rants a bit after someone who he feels ought to know better screwed up. I'm not saying that's brilliant either, but I can be a lot more understanding when people are abrasive out of passion for their project after something went wrong.


Well, he does actually have a point. strlcpy is a faster (well, safer) horse than strncpy, but it's still a horse. We should not use horses as the main mode of transport anymore.

"Doctor, it hurts when I strcpy — so don't do that".

He's being a jerk about it, but I would not say that he doesn't have a point.


Merely "having a point" is not "a reason for his abusiveness". I think I "have a point" for almost any HN comment I post (or at least, I'd like to think so) and have just as much "reason" to be a jerk as Drepper had. This applies to most posts on HN.


Ah, true. I think I cross-read comments here. Sorry.


Mostly no. True, the C NUL-terminated string definition is bad, but it's baked into the API. You need some semi-sane way to work with it that isn't 'everyone writes their own wrappers around memccpy' (some people will get that wrong - e.g. the Linux manpage's stpecpy wrapper just begs for misuse, and it's what most novice C programmers will see if they know enough to check manpages).

strlcpy may not be the best API, but it's sane enough and by now ubiquitous enough to deserve inclusion. Had glibc and others engaged we may have had a better API. Regardless, glibc should never have had such a long veto here.


No.


Yes.


Why?


Inefficiency probably doesn't need any comment (these functions traverse the string twice instead of once). His argument that string length should always be known is correct in theory, although not in practice.


Can you name a program that runs too slowly because it uses strlcpy?


You're looking at it wrong. strlcpy is defined to be slow in certain cases. The API requires it. Other interfaces may be slow today but can be improved in the future because they don't have a return value that is inconvenient. (Notably, memccpy today is typically a memchr followed by memcpy, since this is faster than a naive implementation. Obviously if it gets used more then it will get replaced with a single-pass, machine optimized implementation.)


As the top level comment was about knowing the length of a string: GTA Online's loading times were atrocious because of a null-terminated string.


Not really, more that the implementation of sscanf() is stupid and calls strlen(), even though an implementation of sscanf() that doesn't require that is perfectly possible.


Instead of putting up with people constantly complaining how C is bad because of zero-terminated strings, we should better educate folks that there is absolutely zero reason why one has to rely on a NUL byte in-band signal. And APIs like sscanf() shouldn't be used beyond their historic purposes and there are easier ways to program.

C doesn't really "have" zero-terminated strings other than supporting them with string literals as well as having an atrocious "strings" library for historical reasons. C has storage and gives you the means to copy data around, that's it.

(Although I fully agree that the GTA issue can be seen as a bug in the implementation of sscanf()).


People typically do not realize that it has a return value that is expensive to compute.


It's not any slower in the typical case where the destination buffer is large enough to fit the source. And if that's not the case, then we are most likely in an error case (either the caller notices the truncation and decides to abort, or ignores the truncation and things may soon go boom), and not many people care about optimizing error paths.

Furthermore, when coders don't have strlcpy() the alternatives are often even worse than strlcpy(): 1) They use strcpy() and have buffer overflows. 2) They use strncpy(), which is slower than strlcpy() in the common (non-truncating) case, and in the truncating case leaves the string unterminated (thus segfault potential). 3) They use snprintf(dst, len, "%s", src), which is strictly slower than strlcpy().


Since the error path is the largest one (the string doesn’t fit…) it makes sense to bound its execution. I would not recommend the others FWIW for exactly the reasons you mentioned.


Why would you optimize for the error case and not the common case? You've already done an unbounded amount of work copying the string in from the network or wherever. If anybody cared that much, they wouldn't let the string get that long in the first place.


It can be appropriate to bound the runtime of certain components of a system while allowing looser constraints elsewhere. For example I would perhaps not want to do an O(n) string operation on a collection of strings even though the user would be pretty upset if they can’t paste infinite input into my app.


It's only as expensive as what you pass in. Joke's on you.


qsort is also only as expensive as what I pass in. If it did a bubble sort internally I would be pretty upset though.


What? snprintf is nothing like doing an O(n^2) computation when O(n log n) was expected.


Right, it’s more like O(m) when you probably wanted O(n).


[flagged]


I don't think OP intended this quote to glorify Drepper. He is correctly regarded as a giant asshole. Very smart, but also an awful person to work with.


Back in the 00s when Ruby was hot, the Ruby community had a remarkably constructive and helpful attitude. Even when offering criticism. Many folks attributed it to its creator with the acronym, MINASWAN ("Matz is nice and so we are nice").

No community is perfect, but once you've seen how good it can be it's hard to have much patience for brilliant assholes.


I credit this more than anything for the success of Ruby. Just like I credit the 'holier than thou' attitude of the proponents of some other languages for their relative lack of success compared to where they could have been by now.

Dutch proverb, not sure if it translates or if there is a better English version: you catch more flies with sugar than with vinegar.


The English version is "you catch more flies with honey than with vinegar", which at least in English makes more sense, since in English "sugar" generally implies dry granulated sugar. You're not going to catch any flies with that. (Ironically, you'd probably catch more with the vinegar, since some would go to it for the moisture and a few would drown.)

/tangent


I'll substitute syrup then ;)


[flagged]


> Or just another do-nothing internet blowhard?

I don't know about the OP but you are crossing the line here.



