Why C++ compilers use name-mangling.

A concept which exists in C++ is that the application programmer can define more than one function that seems to have the same name in his or her source code, but which differ, either because they have different parameter-types, or because they are member functions of a class, i.e., ‘Methods’ of that class. This can be done again for each declared class. The first case is a common technique called ‘function overloading’. And, if the virtual methods of a derived class replace those of a base-class, then it’s called ‘function overriding’.
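To make the two terms concrete, here is a minimal sketch. All the names in it (‘Describe’, ‘Shape’, ‘Circle’, ‘CheckOverloadAndOverride’) are hypothetical, and are not taken from any real library.

```cpp
#include <cassert>
#include <string>

// Overloading: two free functions share one name, and differ only in
// their parameter type. The compiler picks one at compile-time.
std::string Describe(int)    { return "int"; }
std::string Describe(double) { return "double"; }

// Overriding: a derived class replaces a virtual method of its base class.
struct Shape {
    virtual ~Shape() = default;
    virtual std::string Name() const { return "Shape"; }
};

struct Circle : Shape {
    std::string Name() const override { return "Circle"; }
};

// Hypothetical helper that exercises both mechanisms.
void CheckOverloadAndOverride() {
    assert(Describe(1) == "int");        // chosen from the argument type
    assert(Describe(1.0) == "double");   // chosen from the argument type

    Circle c;
    const Shape& s = c;                  // base-class reference to a Circle
    assert(s.Name() == "Circle");        // chosen at run-time
}
```

Calling ‘CheckOverloadAndOverride()’ from a ‘main()’ function should pass silently, if the program was compiled without ‘NDEBUG’.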

What people might forget when programming with object-oriented semantics is that all the function definitions still result in subroutines when compiled, which in turn reside in address-ranges of RAM dedicated to code, either in ‘the code segment of a process’, or in ‘the addresses which shared libraries will be loaded to’. This differs from the actual member variables of each class-object, also known as its properties, as well as from the pointer which the object might hold to a table of its virtual methods. Those will reside on ‘the heap’, which is part of the data of the process and not of its code, if the object was allocated with ‘new’. Each method would be incapable of performing its task if, in addition to the declared parameters, it did not receive an invisible parameter, which will be its ‘this’ pointer, and which will allow it to access the properties of one object. Such a hidden ‘this’ pointer is also needed by any constructors.
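The hidden ‘this’ pointer can be imitated by hand. The following is only a sketch of the idea, not of what any particular compiler emits, and the names ‘Counter’, ‘Counter_Add’ and ‘CheckThisPointer’ are hypothetical.

```cpp
#include <cassert>

struct Counter {
    int count = 0;
    void Add(int n) { count += n; }   // 'this' is passed in implicitly
};

// Roughly the shape of the subroutine behind Counter::Add: an ordinary
// function whose first parameter is the address of one object.
void Counter_Add(Counter* self, int n) { self->count += n; }

void CheckThisPointer() {
    Counter a, b;
    a.Add(2);               // conceptually: Counter_Add(&a, 2)
    Counter_Add(&b, 2);     // the same work, with the pointer spelled out
    assert(a.count == b.count);
}
```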

Alternatively, properties of an object can reside on the stack, and therefore, in ‘the stack segment of the process’, if the object was just declared as a local variable of a function. And, if an array of objects was declared, let’s say mistakenly, and not an array of pointers to those objects, then each entry in the array will, again, need to have a size determined at compile-time, for which reason such objects will not be polymorphic. I.e., in these two cases, any ‘virtuality’ of the functions is discarded, and only the declared class of the object will be considered for resolving function-calls. Such an object ends up ‘statically bound’, in an environment which really supports ‘dynamically bound’ method-invocation.
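The difference between an array of objects and a container of pointers can be shown with a small sketch. It assumes C++14 or later (for ‘std::make_unique’), and the classes ‘Animal’ and ‘Dog’, as well as ‘CheckSlicing’, are hypothetical.

```cpp
#include <cassert>
#include <memory>
#include <string>
#include <vector>

struct Animal {
    virtual ~Animal() = default;
    virtual std::string Sound() const { return "..."; }
};

struct Dog : Animal {
    std::string Sound() const override { return "Woof"; }
};

void CheckSlicing() {
    Dog d;

    // An array of objects: every element has exactly the size and type
    // of Animal, so copying a Dog into it keeps only the Animal part.
    Animal by_value[1] = { d };
    assert(by_value[0].Sound() == "...");

    // A container of pointers: the pointed-to object is still a Dog,
    // so the call is resolved according to its real class.
    std::vector<std::unique_ptr<Animal>> by_pointer;
    by_pointer.push_back(std::make_unique<Dog>());
    assert(by_pointer[0]->Sound() == "Woof");
}
```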

First of all, when programming in C, it is not allowed to overload functions by the same name like that. According to C, a function by one name can only be defined once, as receiving the types in one parameter-list. And the only real exception to this is in the existence of ‘variadic functions,’ which are beyond the scope of this one posting. (:1)

Further, C++ functions that have the same name, are not (typically) an example of variadic functions.

This limitation ‘makes sense’, because the compiler of either language still needs to generate one subroutine, which is the machine-language version of what the function in the source-code defined. It will have a fixed expectation of what parameter list it was fed, even in the case of ‘variadic functions’. I think that what happens with variadic functions is that the machine-language code will search its parameter list on the stack, for whatever it finds, at run-time. They tend to be declared with an ellipsis, in other words with ‘...’, for the additional parameters, after the entries for any fixed parameters.
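For readers who have not seen one, a variadic function can be sketched as follows. The function ‘SumInts’ is hypothetical; the macros ‘va_start’, ‘va_arg’ and ‘va_end’ are the standard ones from ‘<cstdarg>’. The function has no way of discovering the number of extra arguments on its own, which is why the fixed parameter ‘count’ states it.

```cpp
#include <cassert>
#include <cstdarg>

// Hypothetical example: adds up 'count' values of type int, which the
// caller passes in after the one fixed parameter.
int SumInts(int count, ...) {
    va_list args;
    va_start(args, count);            // start after the last fixed parameter
    int total = 0;
    for (int i = 0; i < count; ++i) {
        total += va_arg(args, int);   // the caller must really have passed ints
    }
    va_end(args);
    return total;
}
```

A call such as ‘SumInts(3, 1, 2, 3)’ adds up the three trailing arguments. Passing fewer arguments than ‘count’ announces is not detected by the compiler.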

So, the way in which C++ resolves this problem is that it “mangles” the names of the functions in the source code, deterministically, but with a system that takes into account which parameter types they receive, and which class they may belong to, if any. The following is an example of C++ source code that demonstrates this. I have created 3 versions of the function ‘MyFunc()’, each of which has as its only defined behaviour, to return the exact data which it received as input. Obviously, this would be useless in a real program.

But what I did next was to compile this code into a shared library, and then to use the (Linux) utility ‘nm’, to list the symbols which ended up being defined in the shared library…

Source Code:

 

/*  Sample_Source.cpp
 * 
 * This snippet is designed to illustrate a capability which C++ has,
 * but which requires name-mangling...
 * 
 */

#include <cmath>
#include <complex>

 /*  If this were a regular C program, then we'd include...
  *
#include <math.h>
#include <complex.h>
  *
  */

using std::complex;

typedef complex<double> CC;

class HasMethods {
public:
	HasMethods() { }
	~HasMethods() { }
	
	CC MyFunc(CC input);
};

//  According to the given headers, there are at least 3 functions
// that I could define below. First, two free functions, aka
// global functions...

double MyFunc(double input) {
	return input;
}

CC MyFunc(CC input) {
	return input;
}

//  Next, the member function of HasMethods can be defined, aka
// the supposed main 'Method' of a HasMethods object...

CC HasMethods::MyFunc(CC input) {
	return input;
}

 

(Updated 4/12/2021, 21h30… )

(As of 3/19/2021… )

Symbols in the library:

 

dirk@Phosphene:~/Programs/Dirk_Mangle_Demo$ which recode
/usr/bin/recode
dirk@Phosphene:~/Programs/Dirk_Mangle_Demo$ recode utf8..html <Sample_Source.cpp >SampleSource.cpp.html
dirk@Phosphene:~/Programs/Dirk_Mangle_Demo$ nm libmylib.so | grep MyFunc
00000000000006d0 T _Z6MyFuncd
00000000000006e0 T _Z6MyFuncSt7complexIdE
0000000000000718 T _ZN10HasMethods6MyFuncESt7complexIdE
dirk@Phosphene:~/Programs/Dirk_Mangle_Demo$ 

 

Basically, what the reader can see above is, that three symbols exist in the resulting library, that are all named differently, but in a way derived from the name of the functions according to the source-code.
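The mangling can also be undone. The sketch below assumes GCC or Clang, whose C++ runtime offers ‘abi::__cxa_demangle()’ in ‘<cxxabi.h>’; other compilers use other schemes. The wrapper name ‘Demangle’ is my own and hypothetical.

```cpp
#include <cassert>
#include <cstdlib>
#include <cxxabi.h>
#include <string>

// Returns the readable form of a mangled symbol, or the input unchanged
// if the runtime could not demangle it.
std::string Demangle(const char* mangled) {
    int status = 0;
    char* readable = abi::__cxa_demangle(mangled, nullptr, nullptr, &status);
    if (status != 0 || readable == nullptr) {
        return mangled;
    }
    std::string result(readable);
    std::free(readable);   // the buffer was allocated with malloc()
    return result;
}
```

Fed the three symbols from the listing, this should give back ‘MyFunc(double)’, ‘MyFunc(std::complex<double>)’ and ‘HasMethods::MyFunc(std::complex<double>)’. On the command line, ‘nm -C libmylib.so’, or piping the output of ‘nm’ through ‘c++filt’, should achieve the same.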

Further, an earlier posting of mine showed that I can declare one of these functions using the ‘extern "C"’ linkage specification, which will ‘turn off’ name mangling for that one. The big caveat here is that one can only use this specification with one function of any given name, because, if it were used for more than one, the result would be more than one compiled symbol with the same name, which is actually illegal. (:3)
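As a sketch of that caveat, the two free functions from the sample could be declared like this. Only the ‘double’ version gets C linkage; which of the two receives it is an arbitrary choice made for this illustration.

```cpp
#include <cassert>
#include <complex>

typedef std::complex<double> CC;

// C linkage: exported under the plain symbol name 'MyFunc'.
extern "C" double MyFunc(double input) {
    return input;
}

// C++ linkage: still mangled, so it can coexist with the one above.
CC MyFunc(CC input) {
    return input;
}

// A second function named MyFunc with C linkage would be rejected:
// extern "C" int MyFunc(int input);   // error: conflicts with the first
```

Within C++ source code, both functions are still called as ‘MyFunc(...)’, and overload resolution continues to work. Only the exported symbol names differ.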

This last problem can be mitigated by declaring either free functions or objects with the linkage directive ‘static’. What this means is that the object will only exist for the purpose of defining the functions in the present module, and will not be exported. However, the use of this directive must not be confused with the use of a keyword by the same name, ‘for properties and methods of a class’. When ‘static’ is used there, it causes properties to exist once per class, instead of existing once per object. And, methods which have been declared ‘static’ do not receive a hidden ‘this’ pointer, for which reason they have as built-in limitation, that they may only access static properties of the same class directly.
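The two meanings of ‘static’ can be put side by side in a short sketch. Every name in it (‘s_calls’, ‘NoteCall’, ‘Widget’, ‘CheckStatic’) is hypothetical.

```cpp
#include <cassert>

// First sense: internal linkage. Neither of these two names is exported
// from the module in which they are compiled.
static int s_calls = 0;
static void NoteCall() { ++s_calls; }

class Widget {
public:
    Widget()  { ++s_live; NoteCall(); }
    ~Widget() { --s_live; }

    // Second sense: a static method. It receives no 'this' pointer.
    static int Live();

private:
    // Second sense: a property that exists once per class.
    static int s_live;
};

// The definitions outside the class do not repeat the keyword.
int Widget::s_live = 0;
int Widget::Live() { return s_live; }

void CheckStatic() {
    const int before = s_calls;
    assert(Widget::Live() == 0);
    {
        Widget a, b;
        assert(Widget::Live() == 2);   // one counter shared by both objects
    }
    assert(Widget::Live() == 0);
    assert(s_calls == before + 2);     // one call per constructed object
}
```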

Because of the historic inconsistencies that exist, in what the declaration ‘static’ meant, I try to avoid using it to declare ‘linkage’ in a C++ program, where I’ll mainly use it in the second sense named above. (:4)

Also, static linkage will not resolve any problems that will still result, if an attempt is made to incorporate two symbols by the same name into one shared library, where the final intent will be, to export both. For such reasons, objects declared with ‘static’ linkage should either have their name-mangling turned on, or, be named differently, according to the source code.

 


 

1:)

Further, there is a way in which I must second-guess the Wikipedia article about variadic functions, when (mis)used in C++. They explicitly named a usage example like so:

 


        } else if (*fmt == 'c') {
            // note automatic conversion to integral type
            int c = va_arg(args, int);
            std::cout << static_cast<char>(c) << '\n';

 

My readers might wonder why I have an issue with that. The main problem with variadic functions is, that they will decide at run-time, how many bytes to read off the stack. Type-checking is turned off by default – because when these functions are called, the compiler cannot determine what data-types they are expecting from their declarations – and a real risk exists, that the function itself could read past the part of the stack which still holds its parameters. Where the code above reads:

int c = va_arg(args, int);

My assumption would be that the entry ‘int’ not only defines what the data-type is, which ‘c’ is to receive, but also defines by how many bytes the ‘args’ pointer is supposed to be incremented. On every system I know, ‘sizeof(char) == 1’, and ‘sizeof(int) == 4’. Thus, the function above should increment its pointer by 4 bytes, expecting an object of type ‘int’, while at run-time, an object of type ‘char’ was supplied. So, unless my reader can find a reference document somewhere, according to which all variable arguments passed-in to a variadic C++ function are passed-in by reference, there will be an inconsistency in where the next parameter gets read from the stack… (:2)  And, this can also lead to ‘other problems’. AFAICT, the function-call ‘va_end(args);’ will only assure that the next function won’t be affected, if the current function-call read past its last intended parameter. AFAICT, the type-‘va_list’ object was created on the stack, so that ‘va_end(args);’ cannot be calling its destructor.

 

(Update: )

I can think of exactly one reason why I might not have to worry about this issue. It could be that, both in C and C++, when arguments are passed in to a variadic function’s (un)declared parameter list, if they are smaller than type (int), they are automatically promoted to the size of (int). If that were true, then a question it would leave unanswered is what’s to happen on 64-bit systems. There, an item of type (void *) is actually 8 bytes in size, not 4. On those systems, are all the arguments automatically promoted at least to (uint64_t)?

Well, if the compiler did this to an (int32_t) which was negative, then any (int64_t) which the function might extract, would fail to be negative. However, it could be the general behaviour of the compiler instead, to cast any integer it passes in to an (int64_t). Then, there should not be any problem with addresses – really of type (void *) – if they resided above ‘0x000000007fffffff‘ but below ‘0x0000000100000000‘. The compiler should recognize that they are already 8 bytes in size, and not attempt any conversion.

Yet, if it really was the behaviour of the C++ compiler, always to pass-in items at least 8 bytes long on a 64-bit system, the compiler would still need to be a bit more complex, in that it would also need to cast anything of type (float) to type (double). AFAIK, when the function implementation casts that back to a (32-bit) float, this can be done ‘silently’ within the API function – ‘without generating any warnings’.

 

(Another observation: )

What ‘looks the same’ between C and C++, may often not be (implemented) the same (way). And, a good example might be the implementation of:

va_start(args, fmt);

While I’m sure that the usage, again taken from Wikipedia, is correct – because it’s also stated the same way on other reference sites – at first glance, it looks impossible in both these languages. But, looks can be deceiving.

It’s every bit possible that, even though the naming of this function-call is in lower-case letters, it could be a macro in C, while in C++, the parameter ‘fmt’ could be passed-in by reference (into ‘va_start()’). Either way, this function-call has the ability to determine the address of ‘fmt’.


 

(Update 3/21/2021, 21h30: )

2:)

Actually, I can refer my readers to another blogger, who has experience with variadic functions in C++, just in case those readers really want to know… I can also paraphrase what that person wrote. For certain types, there is a “default promotion”. Apparently, the default promotion for integral types narrower than an ‘int’, such as a ‘char’ or a ‘short’, is an ‘int’, while the default promotion for a ‘float’ is a ‘double’. And this seems to be the case, regardless of whether the platform is a 32-bit or a 64-bit (platform).

When the compiler sees code that calls a variadic function, it applies this default promotion to all the arguments that are positioned where the ellipsis is, in the declared parameter list. Arguments that correspond to declared parameters are converted to the declared types instead. The function implementation needs to see to it, always to request the promoted type from ‘va_arg()’ (for example, ‘int’ and not ‘char’), because even if it fails to do so, the argument that was placed there was nevertheless promoted by the compiler.
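The promotion rule can be tried out with a minimal sketch. The function ‘ReadCharAndFloat’ and its parameter ‘unused’ are hypothetical; the point is only that the implementation asks ‘va_arg()’ for ‘int’ and ‘double’, even though the caller wrote a ‘char’ and a ‘float’.

```cpp
#include <cassert>
#include <cstdarg>

// Hypothetical example: expects one char and one float after the fixed
// parameter, and returns their sum.
double ReadCharAndFloat(int unused, ...) {
    va_list args;
    va_start(args, unused);
    int    c = va_arg(args, int);      // the char arrived promoted to int
    double f = va_arg(args, double);   // the float arrived promoted to double
    va_end(args);
    return static_cast<char>(c) + f;
}
```

A call such as ‘ReadCharAndFloat(0, 'A', 0.5f)’ passes a ‘char’ and a ‘float’ in the source code, and the sketch above still reads them back correctly.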


 

(Update 3/21/2021, 15h00: )

3:)

I am also aware that the ‘extern "C"’ specification can be followed by a pair of curly braces, around one or more declarations. However, again, this use will only avoid producing errors if the free functions thus spanned never include more than one with the same name. AFAICT, the member functions of any classes declared within such a block keep their C++ linkage, so that their names are still mangled.


 

(Update 3/22/2021, 4h25: )

4:)

I could have a reader with the morbid curiosity of wanting to know how to use the keyword ‘static’ in both senses: for properties and methods of a class, as well as to declare linkage. The answer lies in the fact that, if a method is declared ‘static’ in the class declaration, it becomes a static method of that class, which only has access to properties that have been declared the same way. Because the keyword already has that meaning there, the language forbids that it be repeated in the function definition outside the class, which is also ‘the function implementation’. However, I could modify the code above like so:

 

/*  Sample_Source.cpp
 * 
 * This snippet is designed to illustrate a capability which C++ has,
 * but which requires name-mangling...
 * 
 */

#include <cmath>
#include <complex>

 /*  If this were a regular C program, then we'd include...
  *
#include <math.h>
#include <complex.h>
  *
  */

using std::complex;

typedef complex<double> CC;

class HasMethods {
private:
	CC m_CC;

public:
	HasMethods() : m_CC(0.0, 0.0) { }
	~HasMethods() { }

	CC MyFunc(CC input);
	void SetComplexMember(CC in);
	CC GetComplexMember();
};

//  According to the given headers, there are at least 5 functions
// that I could define below. First, two free functions, aka
// global functions...

double MyFunc(double input) {
	return input;
}

CC MyFunc(CC input) {
	return input;
}

//  Next, the member functions of HasMethods can be defined...

CC HasMethods::MyFunc(CC input) {
	return input;
}

void HasMethods::SetComplexMember(CC in) {
	this->m_CC = in;
	return;
}

CC HasMethods::GetComplexMember() {
	return this->m_CC;
}


//  Invoke static linkage...

static HasMethods s_obj;

void TestStaticFunc(CC in) {
	s_obj.SetComplexMember(in);
	return;
}

 

Doing this does not require that ‘s_obj’ be declared again, and calls its default constructor, to initialize an object that’s static in two senses:

  • It becomes ‘a global variable’ – aka a static variable, and
  • Each client-program that loads the resulting, shared library, gets its own version of it, which other programs don’t see.

Additionally:

  • This still wouldn’t prevent a client program from loading two shared libraries, each of which defines an ‘s_obj’. But, AFAICT, because ‘static’ gives the object internal linkage, neither library exports the symbol, so that no link error results from this.
  • The property ‘m_CC’ does not create multiple definitions, precisely because it was not declared ‘static’ in the object-oriented sense, and therefore resides once in the data-segment of each process that loaded the library.

 


 

(Update 3/22/2021, 0h25: )

5:)

I have more trivia to share, about how non-trivial parameters can be specified in a function -prototype (=declaration) and -definition. First of all, even though passing by reference is legal (and sometimes, preferred) in C++, doing so really requires that the function declaration ask for one. Otherwise, the compiler has no indication that an argument which appears as a class-object is supposed to be passed-in thus, resulting in a parameter that’s a reference. One way to avoid this pitfall is to pass-in class-objects via explicit pointers, which means that where the arguments are given, the expression is explicit, to ask for the address of the object.

If a user commits the folly of specifying a class-object in the function declaration (which states parameters), say, because he or she was tired one night, then this object will be passed-in by copy. And the real devil here is in the fact that, if the class (of this parameter) specifies a copy constructor, the compiler will use that, in order to copy the object onto the stack. Passing class-objects by copy – i.e., so that a copy is received as the parameter – is only advisable in certain specific cases. I happened to do this in my example, because the type of object – complex numbers based on the type ‘double’ – has well-defined behaviour when copied. But I could probably make these function-calls more efficient, if I passed by reference instead.

Another caveat when passing class-objects by copy, if the class happens to be polymorphic – which means that it would have virtual functions – is that the polymorphism will be broken. Methods will be called ‘statically’, aka, according to the declared class, even though the platform supports ‘dynamic binding’.
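How often the copy constructor actually runs can be counted with a small sketch. The class ‘Tracked’ and the three functions that receive it are hypothetical, and exist only to make the copies visible.

```cpp
#include <cassert>

struct Tracked {
    static int copies;                      // counts copy-constructor calls
    Tracked() = default;
    Tracked(const Tracked&) { ++copies; }
};
int Tracked::copies = 0;

void ByValue(Tracked) {}            // receives a copy
void ByConstRef(const Tracked&) {}  // receives a reference; no copy
void ByPointer(const Tracked*) {}   // receives an address; no copy

void CheckCopies() {
    Tracked t;
    const int before = Tracked::copies;

    ByConstRef(t);
    ByPointer(&t);                  // the call site shows the '&' explicitly
    assert(Tracked::copies == before);

    ByValue(t);                     // the copy constructor runs once
    assert(Tracked::copies == before + 1);
}
```

Note that the calls ‘ByValue(t)’ and ‘ByConstRef(t)’ look identical where they are made. Only the declarations decide whether a copy takes place.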


Specifically in the case of variadic functions, the other blogger who I linked to seems to increase the emphasis. Apparently, little has really changed in how a C++ program’s variadic function executes its ‘va_arg()’ call, in comparison to how it was done in C. What this consists of is, to cast the stack pointer to a pointer to the type being read-in – a situation where a pointer-to-a-pointer apparently causes no problems. Then, the resulting pointer is dereferenced to result in the return-value (which is returned using the defined assignment operator of the class, BTW), and is incremented by the ‘sizeof()’ the dereferenced type. Apparently, if this requested type happens to be ‘a reference to a class’, the stack-pointer could get incremented by the size of the class being referenced, not the size of a reference to the class… And, this would be enough to break the methodology, if the rightmost named parameter was a reference.

 

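/*  Given how the 'va_list' constructs seem unaware of
 * references, I guess they must be macros after all, both
 * in C and in C++. Here's my best guess, at how to implement
 * two of them, assuming that all the arguments sit contiguously
 * on the stack, and that 'struc.a1' and 'struc.a2' are both of
 * type 'char *'. The "statement expression" is a GNU extension.
 * Current versions of GCC and Clang map these macros to
 * compiler built-ins instead.
 * 
 */

#define va_start(struc, rm_arg) \
    ((struc).a1 = (char *) &(rm_arg) + sizeof(rm_arg))

#define va_arg(struc, intype) ({   (struc).a2 = (struc).a1; \
                                   (struc).a1 += sizeof(intype); \
                                   *((const intype *) (struc).a2); })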

 

(Update 3/22/2021, 6h10: )

There’s an aspect to this question which can be confusing. On most CPU architectures, each time a value is ‘pushed’ onto the stack, the stack pointer is automatically decremented because, by default, the stack grows ‘downwards’. This is actually determined by the design of the CPU. Additionally, at the beginning of a subroutine’s execution, the first thing it normally does is to push the base-pointer – a specific CPU register – onto the stack, after which the newly invoked subroutine sets the base pointer to equal the current stack pointer.

This URL already explains how that works by default.

A question that posting did not answer then, was, when a function is being compiled, ‘How does the compiler obtain the addresses of specific parameters?’ And the answer I would suggest is that, when functions are compiled, usually both the number of parameters and all their sizes are known. What this means is that, in that case, the compiler can determine an offset of address more positive than the base-pointer, to correspond to whichever argument was pushed first, by the calling function. Addresses which are ‘consecutively more negative’, and thus closer to the base pointer’s, will belong to ‘arguments pushed later’.

But then, what, exactly happens when a function is being compiled, without any awareness at compile-time, of how many arguments there will be?

In that (special) case, only the parameters corresponding to the first known arguments in the list, will have known address-offsets. To my mind, this implies that the order with which the arguments get pushed, when the function is called, needs to be the reverse of the intuitive order. An unknown number of parameters would then reside at addresses, progressively more positive than the base pointer’s. ‘Progressively more positive stack-addresses’ belong to ‘arguments pushed earlier’. But, when a variadic function is being called, the compiler recognizes how many arguments it will be fed, as well as what their sizes are. It should push them then, in the reverse order, from the order they occur in the list.


 

(Update 4/12/2021, 21h30: )

A fact which I recently learned was that, if source code both dereferences and post-increments a pointer in a simple expression, a compiler warning results. Because ‘va_arg()’ can be used in source code without generating any warnings, the next guess I’d make is that it’s still a macro, but one that uses a “statement expression”. The code above has been updated to reflect this.

 

Enjoy,

Dirk
