I think hacking is interesting, so I spend a fair bit of my time following exploits that happen in the news. I run across my fair share of "rewrite it in Rust" comments, and while I generally disagree, I've been content to dismiss them as mindless parroting, or perhaps even a symptom of the new synthetic user initiatives that Facebook has been spearheading. (Side note: how funny is it that the push for AI users is coming from Facebook, which is so dead you haven't been able to find a real person on there in ten years?)
But recently, the Rust cult has attacked me in my home. There I was, sitting on my couch watching a reel about arrays in C (my algorithm is cooked, I know), and out of nowhere comes a big ole paragraph comment talking about how C is such a bad language and everyone who uses it is a "C-nile dinosaur". Having recently hit the ripe old age of 24 and a half, I was fairly certain the dinosaurs went extinct at some point before the towers fell, but nonetheless, I did not appreciate being referred to in such a fashion.
There is nothing in this world I hate more than people who try to win arguments by regurgitating catchphrases, and it's disappointing how much of it I see in my feed. A couple of salient examples are "Ubuntu is spyware" (patently untrue as of 14.04, released ten years ago, yet I still see it to this day), "snaps are proprietary" (patently untrue as of always), and, last but not least, the subject of today's take: "C is unsafe". Seriously, this one has been regurgitated so much that it was picked up by Google's Gemini chatbot and used to censor output. So, without further ado.
C is not unsafe
C is a language specification, and a relatively simple one at that. Absolutely nothing in that specification requires that the resulting executables be subject to memory corruption, use-after-free bugs, or logic errors due to integer over/underflows. In fact, free() is just library code, and you're totally free to manage your own memory in any way you see fit, provided you think you can do a better job. What you mean to say is, "the current binary output of gcc and clang is unsafe". And there, I agree with you.
That may seem like a pedantic correction, but it's not. See, in clarifying what the problem actually is, we've also revealed the very obvious solution: just change the binary output that results from C(++) code, and voila, no more FBI papers about Rust. Also way fewer remote code execution CVEs, and more importantly, no more ground for these preachy Rust wheel reinventors to stand on.
So, how do we patch all this up? From an ideological standpoint, at least, it's fairly easy.
Easy Memory Safety: Pointer Tags
Fair disclosure, I'm not the only one to ever have thought of this solution. It's honestly a pretty trivial step away from the problem statement itself. I know for a fact there's a C compiler out there using pointer tags in a more intelligent way than I'm going to outline here (I've seen it in passing), so if anyone has the link handy please shoot me an email so I can reference it here. Anyway...
The idea is to generate a random 64-bit unsigned integer to identify tags applied by the compiler. For enhanced security you can probably do this per program execution, or per function call, or whatever, but let's keep it simple for now. Next, whenever you emit a memory allocation (as in, in the code for stack allocation and the code for malloc()), you do:
void* new_malloc(uint64_t bytes) {
    /* old_malloc() is either the regular malloc() (for the heap), or the
     * equivalent of:
     *     asm("subq %1, %%rsp\n\t"
     *         "subq $0x10, %%rsp\n\t"
     *         "movq %%rsp, %0"
     *         : "=r" (ptr)
     *         : "r" (bytes));
     * for the stack. */
    uint64_t* ptr = old_malloc(2 * sizeof(uint64_t) + bytes);
    ptr[0] = COMPILER_SIGNATURE;   /* tag: this allocation is live */
    ptr[1] = bytes;                /* usable size, for bounds checks */
    return &ptr[2];                /* caller only sees the region after the header */
}
Prevent Out-of-Bounds Memory Access
Given the tags we just applied ahead of our pointers, you can probably see where I'm going with this. On pointer dereference, we simply do:
/* Dereference a byte pointer; `ptr` must be the base address the allocator returned. */
uint8_t deref(const uint8_t* ptr, uint64_t offset) {
    /* The allocation must still be live... */
    assert(((const uint64_t*) ptr)[-2] == COMPILER_SIGNATURE);
    /* ...and the access must land inside it (offset is unsigned, so only the
     * upper bound needs checking). */
    assert(offset < ((const uint64_t*) ptr)[-1]);
    return ptr[offset];
}
These assertions should minimally affect program performance, and, in the ideal case, be optimized out during dead code removal wherever user-provided bounds checks already exist, reducing the performance detriment to zero. Each allocation's footprint grows by 16 bytes of header, but beyond that there is essentially no other cost.
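For a concrete (and entirely hypothetical) example of the dead-code-removal case, consider a loop that already checks its own bounds; sum_bytes and its arguments are invented for illustration:
#include <stdint.h>

uint64_t sum_bytes(const uint8_t* buf, uint64_t len) {
    uint64_t sum = 0;
    for (uint64_t i = 0; i < len; i++) {
        /* buf[i] lowers to deref(buf, i); the loop condition already proves
         * i < len, so if the optimizer can also see that len is within the
         * allocation, the injected assert folds away. */
        sum += buf[i];
    }
    return sum;
}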
Prevent Use-After-Free and Double-Free
With our new memory dereference sanity assertions, it's fairly trivial to prevent use-after-free and double-free as well:
void new_free(void* ptr) {
    uint64_t* base = ((uint64_t*) ptr) - 2;   /* step back over the header */
    /* Could assert instead, to reveal double-frees loudly. */
    if (base[0] != COMPILER_SIGNATURE) return;
    base[0] = 0x00;                           /* kill the tag: this allocation is dead */
    old_free(base);                           /* old_free() expects the header base it handed out */
}
Stack variables must also have their signatures nulled out when the function that allocated them returns. This just involves inserting a couple of:
mov qword ptr [rbp + OFFSET_THE_COMPILER_KNOWS], 0x00
before the ret
instruction at the end of the function code.
Freed pointers can now no longer be accessed, nor re-freed, period.
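To make that concrete, here's a quick hypothetical snippet exercising the helpers above (assuming the underlying allocator hasn't already reused or unmapped the header bytes by the time we poke at them):
#include <stdint.h>

void demo(void) {
    uint8_t* p = new_malloc(32);
    (void) deref(p, 0);   /* fine: tag present, 0 < 32 */
    new_free(p);          /* tag nulled out, memory handed back to old_free() */
    new_free(p);          /* silently ignored: tag already gone */
    (void) deref(p, 0);   /* assertion failure instead of a silent use-after-free */
}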
Tracking Pointer Aliasing
So, what happens when we assign a pointer to another pointer? It's pretty common to see code like:
struct node* cur = head;
while (cur != NULL) {
/* Do something with `*cur`. */
cur = cur->next;
}
Luckily, for simple assignments like this one, we don't actually need to do anything. However the pointer stored in cur->next came into being, it should have tags of its own that we can just reference like normal. Much more difficult to handle is pointer arithmetic, such as:
char* str = "abcdefg";
while (*str != '\0') {
/* Do something with `*str`. */
str++;
}
Once str is incremented, we must be able to track its offset from the original pointer value, without necessarily having access to the original symbol.
One solution is for the compiler, when it emits pointer arithmetic code, to also emit code storing the origin pointer, and use this temporary variable for dereference, turning the above into:
char* str= "abcdefg";
void* __orig_str = str;
while (deref(orig_str, (char*)str - (char*)orig_str) != '\0') {
str++;
}
This will also require the compiler to keep a map from symbols to their temporaries, which could be expensive. However, the real problem is what happens when the temporary goes out of scope. Either we store them in some kind of global heap segment and start tracking variable lifetimes (basically reinventing Rust at that point), or we store them forever and introduce a memory leak. So, this approach kind of sucks.
The other approach is to take a small memory hit and just change what a pointer means. Currently we can think of a pointer like this:
struct pointer {
uint64_t memory_segment;
};
However, when we do arithmetic on that, we're changing what the original memory segment is, which destroys information that we (now) care about. Instead, we can represent a pointer like this:
struct pointer {
uint64_t memory_segment;
uint64_t offset;
};
Now when we do pointer arithmetic, we just mess with the offset field and do our safety checks against the original memory segment. Doubling the size of all pointers might seem ridiculous, but there's actually precedent here: it's exactly what we did when we switched from 32-bit to 64-bit machines, and that alone should make it minimally disruptive. Any code that assumes the width of a pointer already doesn't work on 64-bit (a.k.a. any modern) machines, so it doesn't need to be supported. There will certainly be a performance penalty, but not a major one.
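As a rough sketch of my own (the names are invented, and this isn't pulled from any real compiler), arithmetic and dereference on the two-word representation could lower to something like this, reusing deref() from earlier:
/* p + n: arithmetic only ever touches the offset field. */
struct pointer fat_add(struct pointer p, uint64_t n) {
    p.offset += n;
    return p;
}

/* *p: rebuild the original base and run the usual tag and bounds checks. */
uint8_t fat_deref(struct pointer p) {
    const uint8_t* base = (const uint8_t*) p.memory_segment;
    return deref(base, p.offset);
}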
Pointer Assignment to Non-Null Literal Value
Technically we also have to support the following, even though no sane programmer would ever do this:
#include <stdio.h>
int main() {
char* string = 0xbeefface;
printf("%s\n", string);
}
So, what do we do? Just treat it as a null pointer with a huge offset. The pointer assignment becomes:
struct pointer string = {.memory_segment = 0, .offset = 0xbeefface};
And then the dereference just segfaults, which is what it would do in most cases anyway thanks to ASLR. If you are doing shit like this you are not worried about memory safety in the first place, so I'm not sure why you'd even bother with a safe C compiler.
Reclaiming Some Performance: Introducing the capacityof() Macro
Data structures in C frequently have a need to track the boundaries of their internal arrays, and currently they are required to do this using a separate variable. However, since we are already keeping track of this value for all pointers, exposing it will allow developers to eliminate this redundancy and reclaim some of the performance penalty from using safe C.
We would define it as follows:
inline uint64_t capacityof(struct pointer ptr) {
    /* Rebuild the base pointer so we can read the allocation header. */
    const uint64_t* base = (const uint64_t*) ptr.memory_segment;
    assert(base[-2] == COMPILER_SIGNATURE);
    /* Bytes remaining between the current offset and the end of the allocation. */
    return base[-1] - ptr.offset;
}
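As a hypothetical illustration of the redundancy this removes, compare a hand-rolled growable array before and after (the structs and field names are made up, and I'm assuming the compiler lowers capacityof() transparently for ordinary pointers):
#include <stdint.h>

/* Today: the capacity has to ride along in its own field. */
struct vec_old {
    int*     items;
    uint64_t len;
    uint64_t cap;
};

/* With tagged allocations: the header already knows, so capacityof(items)
 * replaces the cap field. */
struct vec_new {
    int*     items;
    uint64_t len;
};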
Optional: Integer Over/Underflow
While these aren't technically memory safety-related, there are a fair number of CVEs in C codebases stemming from logic errors caused by unchecked integer overflows and underflows. For those who are unfamiliar, they look like this:
#include <stdio.h>
int main() {
unsigned short a = 0xffff;
unsigned short b = 0;
// overflow; prints 0
printf("%hu\n", a + 1);
// underflow; prints 65535 (= 0xffff)
printf("%hu\n", b - 1);
}
This happens because each type has a specific maximum and minimum value determined by its storage size (a short is two bytes on most platforms, for example), and values wrap around to the other side when you exceed those limits.
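Spelled out for the unsigned short case above (a 16-bit type on typical platforms):
#include <stdio.h>

int main(void) {
    unsigned int full = 0xffff + 1;                 /* 0x10000: needs 17 bits */
    unsigned short wrapped = (unsigned short) full; /* only the low 16 bits survive */
    printf("%u -> %hu\n", full, wrapped);           /* prints "65536 -> 0" */
    return 0;
}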
In my humble opinion, this should just not happen. Nobody relies on this behavior for correct program function (nobody outside the same category as the non-null-literal-pointer-assignment crowd, anyway), and the CPU should just terminate the process when it detects overflow or underflow on an operation. However, if you must enforce it at the language level, the correct assertions to inject look something like this:
#include <assert.h>
#include <stdint.h>

/* C has no function overloading, so each signedness gets its own helper. */
static inline int64_t safe_add_i64(int64_t x, int64_t y) {
    /* Compare against the limit minus y so the check itself cannot overflow. */
    if (y > 0)
        assert(x <= INT64_MAX - y);
    else
        assert(x >= INT64_MIN - y);
    return x + y;
}

static inline uint64_t safe_add_u64(uint64_t x, uint64_t y) {
    assert(y <= UINT64_MAX - x);
    return x + y;
}

static inline int64_t safe_sub_i64(int64_t x, int64_t y) {
    if (y > 0)
        assert(x >= INT64_MIN + y);
    else
        assert(x <= INT64_MAX + y);
    return x - y;
}

static inline uint64_t safe_sub_u64(uint64_t x, uint64_t y) {
    assert(x >= y);
    return x - y;
}
This will incur a hefty performance penalty if injected globally, though, so in my opinion programmers should just inject their own asserts when dealing with integers derived from user input (or even copy these functions as needed). Or, back to the CPU doing it automatically thing...
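For instance, a minimal sketch of opting in only at the trust boundary, using the unsigned helper from above (the function and its parameters are invented; body_len stands in for a value parsed from user input):
#include <stdint.h>

uint64_t message_size(uint64_t body_len) {
    const uint64_t header_size = 16;
    /* body_len is attacker-controlled, so it gets the checked path;
     * everything else keeps the normal, unchecked arithmetic. */
    return safe_add_u64(header_size, body_len);
}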
Conclusion
Is C, the language, unsafe? No. Are current C compilers unsafe? Yes. However, that doesn't mean that we need to re-write every program. We can just tweak the compilers. Recompiling the C ecosystem is enough to neutralize all memory-corruption-related vulnerabilities ever, and it's less work than your average Gentoo user does to set up a new computer. But if we must get into it...
Epilogue: Why not rewrite in Rust?
Two reasons. Reason one, rewriting is bad. The software is already written, and it already works. Go build new software that actually accomplishes new things. Or, I don't know, go research how to cure cancer. There are literally an unlimited number of better things you could do with your time.
Reason two, Rust is a bad choice anywhere that C is a good choice. Granted, there are a lot of places where C is a bad choice and was chosen anyway, but Rust is meant to replace C++. It's a fairly high-level (as in, I can't predict what assembly it will spit out when I compile it) language with good features for userspace applications, easy ways to handle strings and HTTP requests, a good package ecosystem, and a completely opaque memory management system that's optimal for having zero control over how memory is allocated, and liking it! Its unsafe component sucks to work with, and though technically you can use it to get into the weeds, once you're sacrificing safety you'd rather just use a different language. Compared to C, it is an entirely different beast. C is very good at forcing you to feel apprehension when you're about to make your CPU suffer, and in places where C should be utilized, that is a feature. Rust, specifically because of its rich built-in feature set, is very bad at this.
You should use Rust where you would otherwise use Python or Java and enjoy your 10x speedup. Where you should not use Rust is in drivers, kernels, embedded devices, and other such places where you care very deeply about which byte goes into which memory address. In an ideal world, we would just interop these two languages. Business logic would take place in (safe) Rust, RAM plunging would happen in C, you'd link the two resulting object files together to produce a complete application, and unsafe Rust wouldn't have to exist at all. Currently, having to write Rust bindings to accomplish this is pretty annoying. I greatly enjoy that Zig goes out of its way to make this work with zero additional effort, and I hope that trend catches on. For now, though, even just spawning one process from another process to get the job done isn't the end of the world.
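For what it's worth, the C half of that split is already easy to expose today. A hypothetical sketch (all names invented): compile this to an object file, and the Rust build links against it and declares the function in an extern "C" block:
#include <stddef.h>
#include <stdint.h>

/* The byte-level work lives in C; the (safe) Rust business logic just calls it. */
void plunge_write_registers(volatile uint32_t* base, const uint32_t* values, size_t n) {
    for (size_t i = 0; i < n; i++) {
        base[i] = values[i];   /* exact control over which word lands at which address */
    }
}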