std::string— a tear down and comparison/discussion

Rico Mariani
12 min readApr 5, 2023

--

I did a series of investigations on the costs associated with some of the more popular Modern C++ features. This was actually the first one I did, but I’m publishing it second, because I’m random like that.

Here I’m going to compare old school char * strings with std::string. The idea is not to beat up on std::string, it's actually doing a lot more than char *. The real goal here is to help people understand what you get with std::string and what you pay for it and when you should use it. std::string is not like other basic string objects in many systems, it's a mutable string buffer. It's a value type. Each assignment gets a full copy, so it's more like managing strings with strdup. This means it's very important to use const std::string & as your normal way of flowing these. But even that has costs.

I’m going to explore these below. Skip to the end if you just want the conclusions.

Let’s start with the first test program. The helper functions stayed the same in all the tests and I disassembly them at the end, but they are pretty boring really. The body of main changes to illustrate different patterns.

Here we go:

#include <string>
#include <stdio.h>

__declspec(noinline)
void printString1(const std::string& str) {
printf("test: %s\n", str.c_str());
}

__declspec(noinline)
void printString2(const char *str) {
printf("test: %s\n", str);
}

int main()
{
std::string foo = "foofoofoofoo";
std::string bar = "barbarbarbarbarbarbarbarbarbar";
const char* baz = "barbarbarbarbarbarbarbarbarbar";
printString1(foo);
printString2(foo.c_str()); // use copied (small) constant
printString1(bar);
printString2(bar.c_str()); // use copied (big) constant
printString2(baz); // use constant directly
}

Now, as I said, unlike many other string implementations in other languages std::string is a mutable string value.

It is 32 bytes big:

  • the first 16 bytes are either the text of the string if it fits, or a pointer to the text if it’s >15 characters; there’s room for the null
  • then an int64 for the in-use amount; not counting null, like strlen
  • then an int64 for the maximum we can store; again, not counting the trailing null

To get the text pointer you check the length, then:

  • use the base of the object if length < 16, or,
  • use the pointer at the base if length >= 16
.------------------------------------------.
|0x00| text of the string and trailing null|
|----|-------------------------------------|
|0x10| strlen of the text |
|----|-------------------------------------|
|0x18| 15, size of internal space -1 |
`------------------------------------------/

OR

.------------------------------------------.
|0x00| pointer to stored C string |
|----|-------------------------------------|
|0x08| 8 zeros |
|----|-------------------------------------|
|0x10| strlen of the text |
|----|-------------------------------------|
|0x18| allocated size minus 1 for null |
`------------------------------------------/

Let’s have a look at the code:

int main()
{
// set up the frame, and our locals and home storage for callees
01220 push rbp
01222 mov rbp, rsp
01225 sub rsp, 70h
// the stack frame (in increasing order address order)
//
// 32 bytes [rbp-70h] home storage for our callees
// 32 bytes [rbp-50h] storage for bar
// 32 bytes [rbp-30h] storage for foo
// 8 bytes [rbp-10h] storage for stack guard
// 8 bytes [rbp-08h] padding? junk?
// --- locals above this line --- frame linkage below this line
// current rbp points to the saved rbp so you can follow the chain
// 8 bytes [rbp] saved rbp,
// 8 bytes [rbp+08h] return address is here
// no bytes stack args would be here if there were any
// make a stack guard cookie and store it
01229 mov rax, qword ptr[__security_cookie]
01230 xor rax, rsp
01233 mov qword ptr[rbp-10h], rax
// construct "foo" std::string foo = "foofoofoofoo";// clear xmm0 and smash 16 bytes of the internal storage to all nulls
// recall rpb-30 is foo
01237 xorps xmm0, xmm0
0123A movups xmmword ptr[rbp-30h], xmm0
// prepare the used size and the reserved size
// used sized 12 (stored at foo+16)
// available 15, leaves 1 for null (stored at foo+24)
0123E mov qword ptr[rbp-20h], 0Ch
01246 mov qword ptr[rbp-18h], 0Fh
// now copy the text, we we get 8 bytes and store them
// (this could have been rax rather than xmm0)
0124E movsd xmm0, mmword ptr[string "foofoofoofoo"]
01256 movsd mmword ptr[rbp-30h], xmm0
// get 4 bytes and store them
0125B mov eax, dword ptr[string "foofoofoofoo"+8h]
01261 mov dword ptr[rbp-28h], eax // foo+8
// add the null terminator
01264 mov byte ptr[rbp-24h], 0 // foo+12
// next we will construct "bar" std::string bar = "barbarbarbarbarbarbarbarbarbar";
// smash the storage that is past the pointer
// we're not using that, recall rbp-50 is bar
// this is bar+8, the character storage we're not using
01268 mov qword ptr[rbp-48h], 0
// allocate memory for the text, 32 bytes
01270 mov ecx, 20h
01275 call operator new
// store the pointer and the used and reserved sizes// store allocated address in "bar"
0127A mov qword ptr[rbp-50h], rax
// used 30 (stored at bar+16)
0127E mov qword ptr[rbp-40h], 1Eh
// reserved 31, +1 for null
01286 mov qword ptr[rbp-38h], 1Fh
// now copy the bytes of the text// get 16 bytes and store them in xmm0, a 16 byte register
0128E movups xmm0, xmmword ptr[string "bar...bar"]
// store the first 16 bytes in the string storage for bar
01295 movups xmmword ptr[rax], xmm0
// get 8 bytes and store them
01298 movsd xmm1, mmword ptr[string "bar...bar"+10h]
012A0 movsd mmword ptr[rax+10h], xmm1
// get 4 bytes and store them
012A5 mov ecx, dword ptr[string "bar...bar"+18h]
012AB mov dword ptr[rax+18h], ecx
// get 2 bytes and store them (16+8+4+2 = 30, that's all of it)
012AE movzx ecx, word ptr[string "bar...bar"+1Ch]
012B5 mov word ptr[rax+1Ch], cx
// add the null terminator
012B9 mov byte ptr[rax+1Eh], 0
// by fluke, since the string is 30 bytes plus 1 for the null we
// just saw all the possible copies, 16, 8, 4, 2, and 1.
const char* baz = "barbarbarbarbarbarbarbarbarbar";// no code is needed for this, the label 'baz' becomes an alias
// for the constant pointer to the text
printString1(foo);// direct call to printString1 passing the address of foo [rbp-30]
// as the arg
012BD lea rcx, [rbp-30h]
012C1 call printString1
printString2(foo.c_str());// extract the string pointer
012C6 lea rcx, [rbp-30h]
012CA cmp qword ptr[rbp-18h], 10h
// conditional move if size > 16 (it's forgotten the size)
012CF cmovae rcx, qword ptr[rbp-30h]
012D4 call printString2
printString1(bar);// gets the address of bar and prints it
012D9 lea rcx, [rbp-50h]
012DD call printString1
printString2(bar.c_str());/ get the address of bar which is buffer
012E2 lea rcx, [rbp-50h]
// the usual conditional move if the size is > 16
012E6 cmp qword ptr[rbp-38h], 10h
// get the pointer, note [rcx] not used (pipeline stall?)
012EB cmovae rcx, qword ptr[rbp-50h]
012F0 call printString2
printString2(baz);// note that there is no baz variable, it's using the address
// of the string constant directly 07FF6C3743360h again
012F5 lea rcx, [string "barbarbarbarbarbarbarbarbarbar"]
012FC call printString2
}
// destruction, first we check if there was an alloc,
// if the string was very small (len < 16) then no need to free
01301 mov rdx, qword ptr[rbp-38h]
01305 cmp rdx, 10h
01309 jb main+11Fh
// this will always run... because bar > 16 bytes, but the
// compiler forgot, well, really std::string is mutable so
// it can't assume
0130B inc rdx // we add 1 for the null to get the allocated size
0130E mov rcx, qword ptr[rbp-50h] // load the string pointer
01312 mov rax, rcx
// allocations >4k are handled differently, you have to find the
// block by walking backwards. So the delete has an internal check
01315 cmp rdx, 1000h
0131C jb main+11Ah
// increase allocated size, allocations of this size have padding
// which is to stay the true length is 39 bytes more...
0131E add rdx, 27h
// get the word before the allocation
01322 mov rcx, qword ptr[rcx - 8]
// subtract the new address from the base address
// after this rax = original - (original[-1])
01326 sub rax, rcx
// rax now has the "slop" between what we allocated
// subtract another 8 (why is't this a small sub?)
// after this rax = original - (original[-1]) - 8
// this presumably accounts for the mandatory pointer that had
// the true alloc address at address 01322 above
01329 add rax, 0FFFFFFFFFFFFFFF8h (-8)
// now we test (rax = original - (original[-1]) - 8) <= 31
// presumably if there is more than 31 bytes of true slop, then
// we wouldn't use an alloc of this size and the heap is corrupt.
0132D cmp rax, 1Fh
01331 jbe main+11Ah
// it seems we do this if we find that the difference of the
// true allocation pointer and the pointer we handed out is
// greater than the threshold amount
01333 call qword ptr[__imp__invalid_parameter_noinfo_noreturn]
// yes this is an internal breakpoint...
01339 int 3
// finally! the delete! rcx address, rdx length
0133A call operator delete
// reset the internal storage to empty, but actually,
// we don't have to do this since Elvis is about to leave...
0133F mov qword ptr[rbp-40h], 0
01347 mov qword ptr[rbp-38h], 0Fh
0134F mov byte ptr[rbp-50h], 0
// now destruct foo, foo is small, won't take the delete path
01353 mov rdx, qword ptr[rbp-18h]
01357 cmp rdx, 10h
// we always take this branch, it's a small string
0135B jb main+171h
// we never run this, same delete logic as before, same 4k check
0135D inc rdx
01360 mov rcx, qword ptr[rbp-30h]
01364 mov rax, rcx
01367 cmp rdx, 1000h
0136E jb main+16Ch
01370 add rdx, 27h
01374 mov rcx, qword ptr[rcx - 8]
01378 sub rax, rcx
0137B add rax, 0FFFFFFFFFFFFFFF8h
0137F cmp rax, 1Fh
01383 jbe main+16Ch
01385 call qword ptr[__imp__invalid_parameter_noinfo_noreturn]
0138B int 3
0138C call operator delete
// note that in this case we don't reset the contents of the
// string because it's already valid as is. I guess the library
// really doesn't want to leave a string in the state where it
// looks like it needs a free, even if it's known to be dead.
// we're setting up the return value now, it's returning zero.
01391 xor eax, eax
// check the stack guard cookie, if all is well, we'll exit
01393 mov rcx, qword ptr[rbp-10h]
01397 xor rcx, rsp
0139A call __security_check_cookie
// pop the locals off the stack and we're outta here
0139F add rsp, 70h
013A3 pop rbp
013A4 ret

That code was 389 bytes.

Now here’s a version that uses only normal string constants for comparison. I tried to do one constant, but then the constant inlined into printString2. I was surprised that the compiler could figure out that all the calls to printString2 had the same argument value so it flowed the constant into the body of the function. Sort of a reverse inline. Anyway, I had to have 2 constants, and 2 calls, to see the normal code.

int main()
{
const char* baz1 = "barbarbarbarbarbarbarbarbarba1";
const char* baz2 = "barbarbarbarbarbarbarbarbarba2";
printString2(baz1);
printString2(baz2);
}

And here’s the code:

int main()
{
01080 sub rsp, 28h
const char* baz1 = "barbarbarbarbarbarbarbarbarba1";
const char* baz2 = "barbarbarbarbarbarbarbarbarba2";
// as before, no local, baz1 & baz2 become an aliases for the string// two normal calls
printString2(baz1);
01084 lea rcx, [string "barbarbarbarbarbarbarbarbarba1"]
0108B call printString2
printString2(baz2);
01090 lea rcx, [string "barbarbarbarbarbarbarbarbarba2"]
01097 call printString2
}
0109C xor eax, eax
0109E add rsp, 28h
010A2 ret

Well that’s a lot simpler, it’s only 35 bytes. Still that’s not a totally fair comparison because the first sample did more than just print twice. So here is another sample using std::string that does the call with two implicit conversions from constant strings. That's exactly the same as normal construction and destruction, like if you used "foo"s for instance.

Here we go, two implicit conversions:

int main()
{
const char* baz1 = "barbarbarbarbarbarbarbarbarba1";
const char* baz2 = "barbarbarbarbarbarbarbarbarba2";
printString1(baz1);
printString1(baz2);
}

And the code:

int main()
{
// space for one string, one scratch, and the home storage
// there's 8 bytes there I don't understand, marked.
// frame diagram after "sub rsp, 48h"
// rsp : home storage 0
// rsp + 08h : home storage 1
// rsp + 10h : home storage 2
// rsp + 18h : home storage 3
// rsp + 20h : std::string cstr_ptr
// rsp + 28h : std::string 8 bytes zero
// rsp + 30h : std::string used (30)
// rsp + 38h : std::string available (31)
// rsp + 40h : I dunno why this is here? unused.
// rsp + 48h : return address
01210 sub rsp, 48h const char* baz1 = "barbarbarbarbarbarbarbarbarba1";
const char* baz2 = "barbarbarbarbarbarbarbarbarba2";
// no local, baz1 and baz2 become aliases for the strings
printString1(baz1);// normal creation of a string, including the heap alloc
01214 mov ecx, 20h
01219 mov qword ptr[rsp+28h], 0 // zero unused string space
01222 call operator new
01227 movups xmm0, xmmword ptr[string "...barba1"]
0122E mov qword ptr[rsp+20h], rax
01233 mov qword ptr[rsp+30h], 1Eh // length 30
0123C mov qword ptr[rsp+38h], 1Fh // available 31
01245 movups xmmword ptr[rax], xmm0
01248 movsd xmm1, mmword ptr[string "bar...ba1"+10h]
01250 movsd mmword ptr[rax+10h], xmm1
01255 mov ecx, dword ptr[string "bar...ba1"+18h]
0125B mov dword ptr[rax+18h], ecx
0125E movzx ecx, word ptr[string "bar...ba1"+1Ch]
01265 mov word ptr[rax+1Ch], cx
01269 lea rcx, [rsp+20h]
0126E mov byte ptr[rax+1Eh], 0
// rcx has the string pointer, we're ready to print
01272 call printString1
// and now we do destruction...
01277 mov rdx, qword ptr[rsp+38h]
0127C cmp rdx, 10h
01280 jb main+0A7h
01282 mov rcx, qword ptr[rsp+20h]
01287 inc rdx
0128A mov rax, rcx
0128D cmp rdx, 1000h
01294 jb main+0A2h
01296 mov rcx, qword ptr[rcx - 8]
0129A add rdx, 27h
0129E sub rax, rcx
012A1 add rax, 0FFFFFFFFFFFFFFF8h
012A5 cmp rax, 1Fh
012A9 jbe main+0A2h
012AB call qword ptr[__imp__invalid_parameter_noinfo_noreturn]
012B1 int 3
012B2 call operator delete
// exactly the same code only "...barba2" instead of "...barba1"
// both temporary strings go in the same place, [rsp+20h]
printString1(baz2);
012B7 mov ecx, 20h
012BC mov qword ptr[rsp+28h], 0 // clear the junks to zero
012C5 call operator new
012CA movups xmm0, xmmword ptr[string "bar...etc...barba2"]
012D1 mov qword ptr[rsp+20h], rax
012D6 mov qword ptr[rsp+30h], 1Eh
012DF mov qword ptr[rsp+38h], 1Fh
012E8 movups xmmword ptr[rax], xmm0
012EB movsd xmm1, mmword ptr[string "barbarbarbarbarbarbarbarbarba2"+10h]
012F3 movsd mmword ptr[rax+10h], xmm1
012F8 mov ecx, dword ptr[string "...barba2"+18h]
012FE mov dword ptr[rax+18h], ecx
01301 movzx ecx, word ptr[string "barbarbarbarbarbarbarbarbarba2"+1Ch]
01308 mov word ptr[rax+1Ch], cx
0130C lea rcx, [rsp+20h]
01311 mov byte ptr[rax+1Eh], 0
01315 call printString1
0131A mov rdx, qword ptr[rsp+38h]
0131F cmp rdx, 10h
01323 jb main+14Ah
01325 mov rcx, qword ptr[rsp+20h]
0132A inc rdx
0132D mov rax, rcx
01330 cmp rdx, 1000h
01337 jb main+145h
01339 mov rcx, qword ptr[rcx - 8]
0133D add rdx, 27h
01341 sub rax, rcx
01344 add rax, 0FFFFFFFFFFFFFFF8h
01348 cmp rax, 1Fh
0134C jbe main+145h
0134E call qword ptr[__imp__invalid_parameter_noinfo_noreturn]
01354 int 3
01355 call operator delete
}
0135A xor eax, eax
0135C add rsp, 48h
01360 ret

The above was basically the same code, only using std::string constants. The size was 337 bytes and so the std::string tax for two calls is 337-35 = 302 bytes, or if you like tax factor is roughly 9.6.

How do you feel about creating 9.6 times as much code? Well assuming a linear slowdown, which I think is generous, let’s look for a comparison in these single threaded Passmark numbers.

.-------------------------------------------------.
|2021 | Core i9-11900K | 3485 |
|2004 | Pentium 4 Extreme Edition 3.46 GHz | 379 |
`-------------------------------------------------'

So basically when you do something like this it’s kind of like self-downgrading to a processor from 2004. Note that 3485 / 9.6 is actually 364.6 so it’s actually worse than the above. And linear slowdown for space is probably generous… the extra branches to extract the string pointer for instance are also not free.

Let’s have a look at the helper functions for completeness.

void printString1(const std::string& str) {
printf("test: %s\n", str.c_str());
}
printString1:
printf("test: %s\n", str.c_str());
// rcx can't do conditional move so we do it the hard way
// we always have to do this length check when getting the c_str!
011F0 cmp qword ptr[rcx+18h], 10h
011F5 jb printString1+0Ah
// we either dereference rcx or we don't, we have to do this
// every time we want to get the char * out of a std::string
// it's fundamental to how it's stored, there are two
// representations!
011F7 mov rcx, qword ptr[rcx]
011FA mov rdx, rcx
011FD lea rcx, [string "test: %s\n"]
01204 jmp printf

And the simpler helper:

void printString2(const char* str) {
printf("test: %s\n", str);
}
printString2:
printf("test: %s\n", str);
// just move the registers for the call, no frame no nothing
01210 mov rdx, rcx
01213 lea rcx, [string "test: %s\n"]
0121A jmp printf

So, in conclusion, we get some maybe high level advice and observations:

  • std::string is a mutable string, use it if you need string mutation but do not use it for finished string objects
  • std::stringview does not have these problems, if pointers to substrings are needed, it has a simpler representation
  • use const char * where it makes sense, especially when you're done manipulating and lifetime is known
  • if nulls are a concern it’s easy to write a no-overhead “safe pointer” that can’t hold a null and costs nothing

The std:string overhead for finished strings as std::string compared to "old school" is about 9.6x.

--

--

Rico Mariani
Rico Mariani

Written by Rico Mariani

I’m an Architect at Microsoft; I specialize in software performance engineering and programming tools.