Expressions

Objects vs. Values

In order to talk about expressions in C, we need three pieces of information:

Is it an object, or just a value?
What is its type?
What is its value?

The C language divides expressions into two major categories, which I call ‘objects’ and ‘values’. (The ANSI/ISO C Standard uses the terms ‘lvalue’ and ‘value’ respectively. Many computer scientists use the name ‘rvalue’ for the latter. Technically, an lvalue ‘names’ an object, while an rvalue is what most of us think of as an ordinary value.) An object is, in effect, a place to store values. Ordinary variables are the simplest examples of objects, but not the only ones.

A C value is actually a pair of things, because each C value carries a type as well. Values are ephemeral: they only last for the duration of a single expression, and then they vanish. If a particular value is to be of any use, it must be saved somewhere (or printed, which simply amounts to saving it in the user's brain instead of in the computer's memory). Objects, on the other hand, last as long as their object lifetime (which may be anything from a few lines of code to the entire program run).

Objects also have types, and in general, any given object can only hold values that have types compatible with that object's type. Automatic objects that have never been assigned a value will contain garbage, and C allows the garbage to be arbitrarily poisonous. The same holds for objects allocated with malloc(). For this reason, it is important to be sure that any particular object has been given a value before attempting to look at its value. Note that static objects always have an initial value, even if you did not assign one.
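As a minimal sketch of the difference (the function names here are invented for illustration), a static object reads as zero even though nothing was ever stored in it, while an automatic object has to be assigned before it is read:

```c
/* Hypothetical helper names, chosen only for this illustration. */

static int never_assigned;      /* static storage: starts out as 0 */

int read_static(void)
{
    return never_assigned;      /* safe: static objects always have a value */
}

int read_automatic(void)
{
    int local;                  /* automatic storage: garbage at this point */

    local = 7;                  /* give it a value first ... */
    return local;               /* ... and only then look at it */
}
```

Reading local before the assignment would be the mistake; nothing in the language promises what you would find there.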

A side note on garbage
Some systems always zero out ‘fresh’ memory. On these systems, your uninitialized variables may all start out as zero or NULL. This is not a property of the language; it is merely your system being helpful and/or making sure your program does not have access to sensitive information (such as passwords) that was left in memory by a previous program.

The value of an object

In C, as in many other computer languages, when you need a value, you can name an object instead. In other words, the language distinguishes between an ‘object context’ and a ‘value context’, so that when you write:

x = y;

the machine finds the value of y, and sticks that in x. In general, the value of an object is pretty obvious. Its type is determined by the type of the object, and its actual value is whatever you last stored in it. This is not the case for arrays.

The Rule

Array objects have a special, fundamental rule in C. This rule is essentially arbitrary, and simply must be memorized. It falls out from a key fact: C does not have array values. (There is one exception to this, which I will save for later.) C does have array objects -- just the values are missing. For instance, int a[5]; declares an ordinary array containing five ints. Logically, the ‘value’ of this array ought to be the five int values stored in that array -- but it is not. Instead, the ‘value’ of the array is a pointer to the first element of that array.

One Rule to ring them all
One Rule to find them
One Rule to point them all
And in the language bind them
-- apologies to J. R. R. Tolkien

Of course, in order to talk about any value, we need to know its type -- a pointer to int is quite different from a pointer to double. Hence, The Rule, which determines both for an array object:

In any value context, an object of type ‘array of T’ is converted to a value of type ‘pointer to T’, pointing to the first element of that array, i.e., the one with subscript 0.

Here T is any valid array-element type. Valid types include array types themselves; it is possible to have an object of type ‘array 3 of array 4 of float’. The Rule turns this into a value of type ‘pointer to array 4 of float’ -- not float **, but rather float (*)[4]. The Rule only applies once, because after that, you no longer have an object. In order to get The Rule to fire again, you have to turn the new value into another object.
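A short sketch may make this concrete. The array name grid and the function names are made up for this example; the types are the ones described above:

```c
#include <stddef.h>

static float grid[3][4];        /* array 3 of array 4 of float */

/* *p is an object again (an array 4 of float), so sizeof sees four floats. */
size_t floats_per_row(void)
{
    float (*p)[4] = grid;       /* The Rule: grid becomes &grid[0] */

    return sizeof *p / sizeof (*p)[0];
}

/* Stepping the pointer by one moves a whole row, landing on grid[1]. */
int second_row_matches(void)
{
    float (*p)[4] = grid;

    return p + 1 == &grid[1];
}

/* Indirection yields an object, so The Rule can fire a second time. */
int rule_fires_again(void)
{
    float (*p)[4] = grid;       /* first application */
    float *q = *p;              /* *p is an array object; The Rule applies */

    return q == &grid[0][0];
}
```

Note that nothing here ever has type float **. The first conversion gives float (*)[4], and only after the unary * turns that back into an object does the second conversion give float *.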

(Note that the size of the array, even if it is known, disappears once The Rule is finished. This is why it can often be eliminated or ignored.)

Structure and union objects have structure and union values, and pointers have pointer values. (Functions have return values, of course, but functions are not objects. In fact, attempting to find the ‘value’ of a function without calling it results in an effect similar to The Rule about arrays and pointers.)
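The function case can be sketched as follows, using a made-up function named twice. Naming the function without calling it produces a pointer to the function, much as naming an array produces a pointer to its first element:

```c
/* Hypothetical example function. */
int twice(int n)
{
    return 2 * n;
}

int call_three_ways(void)
{
    int (*fp)(int) = twice;     /* twice becomes a pointer to the function */
    int (*fq)(int) = &twice;    /* & suppresses the conversion; same pointer */

    /* All three call forms reach the same function. */
    return fp == fq && fp(3) == 6 && (*fp)(3) == 6 && twice(3) == 6;
}
```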

Note that C always passes all parameters by value. Since the ‘value’ of an array is actually a pointer to the array's first element, it follows that no C parameter has array type. Thus, when you declare a function as, e.g.:

void f(char s[]);

you are actually declaring that f() takes a parameter of type ‘pointer to char’, rather than an array (of unspecified size of) char. This ‘type rewrite’ rule will be discussed more later.
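One way to watch the rewrite happen is to declare a function with array syntax and define it with pointer syntax. The names below are invented for this sketch; the point is only that a compiler accepts the pair without complaint, because both spell the same type:

```c
#include <stddef.h>

size_t count_chars(char s[]);           /* declared with array syntax */

size_t count_chars(char *s)             /* defined with pointer syntax */
{
    size_t n = 0;

    while (*s++ != '\0')                /* s is a pointer, so it can be stepped */
        n++;
    return n;
}

size_t count_hello(void)
{
    char word[] = "hello";              /* a real array object: 6 chars */
    size_t (*fp)(char *) = count_chars; /* the parameter type really is char * */

    return fp(word);                    /* The Rule turns word into &word[0] */
}
```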

The object of an object

I keep mentioning ‘object context’ and ‘value context’, and ‘objects’ vs ‘values’. How do you know which context you have?

As noted above, in a simple ordinary assignment like x = y, the left-hand side of the assignment operator is in ‘object context’. This is how, even if the value of x is 3 before the assignment, the compiler knows to set x instead of ‘setting 3’ (which obviously makes no sense). C has quite a few operators that demand an object context, including all the assignment operators like += and the increment and decrement operators ++ and --. Since all of these need to change some object's value, they have to find the object, not its value.

There are two other places where an expression is in object context. The unary & (address-of) operator finds the address of some object. To do so, it needs the object, rather than its value. That means that if arr is an array, &arr has arr in object context, and The Rule does not apply. The last special case is the sizeof operator. It too uses object context, so that it can find the size of the object.

Knowing The Rule, it is easy to see that sizeof has object context, and that arrays and pointers are very different, by running a simple example program:

#include <stdio.h>
int main(void) {
    int a[5];
    printf("sizeof a   = %lu\n", (unsigned long)sizeof a);
    printf("sizeof &a  = %lu\n", (unsigned long)sizeof &a);
    printf("sizeof a+0 = %lu\n", (unsigned long)sizeof (a+0));
    return 0;
}

This will almost always print two different numbers for the first two lines of output. (If it prints two identical numbers, you probably have a broken compiler. It is possible that five ints happen to be exactly the same size as one ‘pointer to array 5 of int’. In that case, changing the 5 to some other constant should produce two different numbers. In the past, some broken compilers have applied The Rule to array names that follow the sizeof operator.)

The first sizeof operator finds the size of the entire array object, without applying The Rule. The second sizeof operator finds the size of the result of applying the unary & operator, i.e., the size of a pointer to one ‘array 5 of int’. The final line, though, uses the addition operator + to add nothing to the value of the array. Since the ‘value’ of the array is a pointer to its first element, the expression (a+0) first has to find the value -- a pointer to int -- and the sizeof operator then prints the size of a pointer to one int.
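The same facts explain a familiar idiom. Because sizeof sees the whole array object, dividing by the size of one element gives the element count. A small sketch (function names are mine, not part of any library):

```c
#include <stddef.h>

size_t element_count(void)
{
    int a[5];

    /* sizeof a: the whole array (no Rule); sizeof a[0]: one int */
    return sizeof a / sizeof a[0];
}

int value_context_gives_pointer(void)
{
    int a[5];

    /* (a + 0) forces value context, so its size is that of an int *;
       &a keeps a in object context and has type int (*)[5]. */
    return sizeof (a + 0) == sizeof (int *)
        && sizeof &a == sizeof (int (*)[5]);
}
```

The idiom stops working as soon as the array has been passed to a function, since by then only the pointer value is left.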

It is quite likely that the last two numbers the above program printed are identical. That is, the size of a pointer to the entire array is probably the same as the size of a pointer to the first element of the array. In fact, most modern machines really only have one, or sometimes two, sizes of pointer. A number of older machines had many different ‘flavors’ of pointer, and C compilers for those machines would use them all; on those machines, it might be easier to tell that &a and (a+0) have different types. On modern machines, however, you have to resort to another method to see this.

The easiest is simply to observe the diagnostics, or lack thereof, produced for correct and incorrect programs. A correct program such as:

#include <stdio.h>
int main(void) {
    int a[5];
    int (*p1)[5];
    int *p2;
    p1 = &a;
    p2 = a; /* same as (a+0): a is on the right hand side, in value context */
    return 0;
}

does not require any diagnostics. On the other hand, if you switch the unary & operator from the first assignment to the second, the program violates a constraint, and requires a diagnostic (and a good compiler should produce two diagnostics, one for each erroneous assignment).

Another method is to observe the effect of pointer arithmetic. That, however, is tricky to do portably; we leave this for later.
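In the meantime, here is one possible sketch of such a check. It measures distances in bytes by converting to char *, and the function names are invented for the purpose; treat it as an illustration under those assumptions rather than the definitive test:

```c
#include <stddef.h>

static int a[5];

/* &a has type int (*)[5]; adding 1 steps over the whole array. */
size_t step_from_array_pointer(void)
{
    return (size_t)((char *)(&a + 1) - (char *)&a);
}

/* a, after The Rule, has type int *; adding 1 steps over one element. */
size_t step_from_element_pointer(void)
{
    return (size_t)((char *)(a + 1) - (char *)a);
}
```

The first function returns sizeof a and the second returns sizeof a[0], even though both pointers start at the same place in memory.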

Analyzing expressions

Now that you know about types and values, and objects vs values, you are ready to write down information about each part of an expression. Earlier, you wrote down values as <type, value> pairs. Now you get to write these down as triples: <object-or-value, type, name-or-value>.

Suppose you have some declarations:

int i;
int *ip;
int aone[5];
int atwo[3][5];
int (*ap)[5];

Here the variable i has type int, ip has type ‘pointer to int’, aone has type ‘array 5 of int’, atwo has type ‘array 3 of array 5 of int’, and ap has type ‘pointer to array 5 of int’.
None of them have values yet, i.e., they all probably contain ‘garbage’. So we had better set at least some of them:

i = 42;
ip = aone;
*ip = i;
ap = atwo;

The first line is pretty simple. On the left is the variable name i. This is an:

<object, int, i>

Likewise, on the right is the integer constant 42, which is:

<value, int, 42>

The = (assignment) operator demands an object on its left hand side, and a value on its right. This is exactly what it has. It then converts the value on the right (42) to the required type if needed. In this case, 42 is already the right type, so nothing interesting happens. Finally, it assigns the value to the object -- so i becomes 42.
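When the types differ, the conversion step does real work. A small sketch, with function names invented for illustration:

```c
double store_in_double(int n)
{
    double d;

    d = n;          /* the int value is converted to double before the store */
    return d;
}

int store_in_int(double x)
{
    int i;

    i = x;          /* the fractional part is discarded on conversion to int */
    return i;
}
```

In both cases the object on the left decides the type; the value on the right is adjusted to fit before it is stored.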

The second line is not really difficult either. The left and right sides of the assignment operator are, respectively:

<object, pointer to int, ip>
<object, array 5 of int, aone>

Of course, the assignment operator needs a value on the right -- so now it is time to find the ‘value’ of aone, i.e., to apply The Rule.

The Rule drops the size of the array (5) and considers only the element type (int). The ‘array of T’ becomes a ‘pointer to T’, pointing to the first element of that array, i.e., the one with subscript 0. Thus, the <object, array 5 of int, aone> becomes a <value, pointer to int, &aone[0]>. Now the left and right sides have the right form, and once again, they also have the right types. The assignment proceeds to set ip to point to aone[0].

The third line is a little more complicated, and more interesting. The left hand side of the assignment operator is, itself, an expression. Before you can figure out what the assignment does or means, you have to work out this sub-expression.

The sub-expression consists of the prefix unary * (‘indirection’) operator, and the variable named ip. The indirection operator demands a value, and that value has to have some pointer type. But ip is not a value; it is an <object, pointer to int, ip>. Pointers are not magic after all -- they have values, just like any other ordinary variable, and the value of a pointer is just its value. We just set that value a moment ago: it points to aone[0]. Thus, this <object> becomes a <value, pointer to int, &aone[0]>.

Now the unary * operator has what it needs, a value of type ‘pointer to T’. The indirection operator simply follows this pointer -- which had better not be garbage or NULL -- to find the object to which it points. The result of the indirection is an <object> and has type T. The name of that object can be a bit problematic, but in this case, we know it is just aone[0], so the final result of this * operator is <object, int, aone[0]>. (We could also call this <object, int, *ip>.)

The right hand side of that line -- *ip = i; -- is of course just the triple <object, int, i>. The left hand side is an object; the value of the right hand side is the value of i, i.e., the pair <int, 42>; so the assignment sets *ip (i.e., aone[0]) to 42.
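The analysis of the first three lines can be checked mechanically. This sketch wraps them in a made-up function and tests the conclusions reached above:

```c
int first_three_lines(void)
{
    int i;
    int *ip;
    int aone[5];

    i = 42;         /* <object, int, i> receives <value, int, 42> */
    ip = aone;      /* The Rule: aone becomes <value, pointer to int, &aone[0]> */
    *ip = i;        /* <object, int, aone[0]> receives the value of i */

    /* ip points at the first element, and that element now holds 42. */
    return ip == &aone[0] && aone[0] == 42 && *ip == 42;
}
```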

The last line is left as an exercise.

Values and Representations

Whenever a value is stored in an object, it takes on some sort of representation. Usually the representation is just the entire set of bits in that object. There may be unused bits, or bits that do not participate in the value yet must have some particular pattern in them -- for instance, the Data General Eclipse insists that certain bits of every pointer contain a ‘protection ring’ number that only the operating system is allowed to manipulate. In the absence of such bits (or simply ignoring them), the value represented by any particular bit pattern generally depends on the type. On many machines, the type must be supplied with the instruction that uses the bits. For instance, on ordinary 80x86 CPUs, a 32-bit field in memory can be treated as either an integer or a value of type ‘float’. The bit pattern that represents 0x3ff00000 when considered as a 32-bit integer represents instead 1.875 when considered as a 32-bit ‘float’. The C compiler will always use the correct instruction and thus interpret the value correctly -- unless you trick it. There are two defined ways to ‘trick’ a compiler: pointers and unions. The latter is not described here.
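The 1.875 figure can be checked with a short sketch. It assumes that float is a 32-bit IEEE 754 single, that the header <stdint.h> provides uint32_t, and that integers and floats share a byte order, all of which hold on ordinary 80x86 systems but none of which the language guarantees. The function name is invented, and the bytes are moved with memcpy(), which is the byte-at-a-time form of the pointer trick:

```c
#include <stdint.h>
#include <string.h>

/* Assumption: sizeof(float) == sizeof(uint32_t) and float is IEEE 754. */
float bits_as_float(uint32_t bits)
{
    float f;

    memcpy(&f, &bits, sizeof f);    /* copy the representation unchanged */
    return f;                       /* same bits, now read as a float */
}
```

With those assumptions, bits_as_float(0x3ff00000) gives 1.875: the exponent field is 127 (a scale of 2 to the 0), and the top three fraction bits add 0.5 + 0.25 + 0.125 to the implied leading 1.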

The C language makes a fairly strong promise about representations: if you take the address of any ordinary object obj, convert that pointer to one of type unsigned char *, and print out sizeof obj bytes, this will print all of the representation bits in that object, along with any ‘padding’ bits. The pointer conversion and subsequent indirection are the ‘trick’ that gets the compiler to interpret the representation bits as unsigned chars.
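A minimal sketch of the trick follows. The function names are mine; the technique is just the conversion to unsigned char * and a loop over sizeof obj bytes:

```c
#include <stddef.h>
#include <stdio.h>

/* Print every byte of an object's representation in hexadecimal. */
void print_representation(const void *obj, size_t size)
{
    const unsigned char *p = obj;   /* view the object as unsigned chars */
    size_t k;

    for (k = 0; k < size; k++)
        printf("%02x ", (unsigned int)p[k]);
    printf("\n");
}

/* Copy a representation, one unsigned char at a time. */
void copy_representation(const void *obj, size_t size, unsigned char *out)
{
    const unsigned char *p = obj;
    size_t k;

    for (k = 0; k < size; k++)
        out[k] = p[k];
}

/* Copying all the bytes out and back in reproduces the value. */
int round_trip_int(int value)
{
    unsigned char bytes[sizeof(int)];
    int copy;

    copy_representation(&value, sizeof value, bytes);
    copy_representation(bytes, sizeof copy, (unsigned char *)&copy);
    return copy;
}
```

A call such as print_representation(&d, sizeof d), for some double d, shows whatever bytes your machine uses; the output is deliberately not shown here because it differs from one system to the next.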

This trick can be quite useful, but it is also quite limited. The problem is that it shows you a representation of your value, as stored in one particular object. This does not have to be the representation, even on the one machine on which you try it. For instance, some machines have a scary number of representations of the double value 0.0 (e.g., 4,503,599,627,370,496 possible ways to represent 0.0). Worse, the representation on one machine may bear little or no resemblance to that on another.

The things to remember about representations, then, are:

they are not necessarily unique (i.e., one value may have many representations);
the value they represent may depend on the instruction(s) used to access them (i.e., one representation may have many values); and
they differ from one machine to another.
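The first point is easy to see on machines with IEEE 754 floating point, which have both a positive and a negative zero. This sketch assumes such a machine (the function names are invented); elsewhere the second function may well report identical bytes:

```c
#include <string.h>

/* The two zeros are the same value as far as == is concerned. */
int zeros_compare_equal(void)
{
    double pos = 0.0;
    double neg = -0.0;

    return pos == neg;
}

/* Their stored bytes differ, though: the sign bit is set in one of them. */
int zeros_share_representation(void)
{
    double pos = 0.0;
    double neg = -0.0;

    return memcmp(&pos, &neg, sizeof pos) == 0;
}
```

This is also why comparing floating-point objects with memcmp() is not the same as comparing their values.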

The most obvious cases -- because they occur the most often -- involve byte orders (aka endianness, named after the characters in Gulliver's Travels that had wars over whether to eat the egg from the little end or the big one). Floating-point representations used to cause trouble more often, but today many machines use IEEE 754, so that it is rarer to encounter mismatches. But even pointer types can have different representations. The same Data General Eclipse has two machine-level pointer representations, called ‘byte pointers’ and ‘word pointers’. To convert a word pointer to a byte pointer, the machine must execute an instruction. This instruction shifts the word address left one bit, introducing a zero in the low-order bit, so that the resulting byte pointer points to the first (number 0) byte in the two-byte word. To convert back, the machine shifts the byte address right, discarding the byte-offset bit. (The uppermost bit in a word pointer is a special ‘indirection’ bit that is not used by the C compiler.) On the Eclipse, int * uses a word pointer, while char * uses a byte pointer. Putting the same machine-word address into these two different kinds of pointers results in two different representations.
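Byte order is simple to observe with the unsigned char trick. In this sketch (the function name is made up, and it assumes unsigned int occupies more than one byte), the result depends entirely on the machine:

```c
/* Look at the lowest-addressed byte of an unsigned int holding 1. */
int first_byte_of_one(void)
{
    unsigned int one = 1;
    const unsigned char *p = (const unsigned char *)&one;

    return p[0];    /* 1 on little-endian machines, 0 on big-endian ones */
}
```

The value stored is 1 in both cases; only the representation differs.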
