(Note: this page uses UTF-8 and CSS to get subscripts and simple math characters.)

Note that in this mathematical notation, additional leading zero digits are allowed, with no effect: 0042 has zero thousands, zero hundreds, four tens, and two ones, and is the same as 42. Normally we leave off the leading zeros, of course, as they do not add anything useful.

Internally, modern computers use binary, or base 2. Here the digit-sequence 100110 represents 1•2⁵ + 0•2⁴ + 0•2³ + 1•2² + 1•2¹ + 0•2⁰, or 32+4+2, which is 38.

As a C programmer, you need to become familiar with two more bases as well: 8 (also called “octal”), and 16 (“hexadecimal”, although Knuth argues convincingly that it should really be called “senidenary”). Bases above 10 have a notational problem: individual digits need to exceed nine. The solution used in C is that the letter A (in either upper or lower case) represents 10, B represents 11, and so on through F.

To write the number forty-two in C, you can use either base 10 as usual, or base 8, or base 16. In base 8, 42 needs five eights and two more ones, so instead of 42, we write 52. In base 16, we need two sixteens and ten more ones, so we write 2A. Of course, if you just write “52”, people will think you mean fifty-two, not forty-two; clearly we need some way to denote the base of the number, as well as the digit-sequence. In mathematics, we use subscripts: 42₁₀ = 52₈ = 2A₁₆. C uses prefixes instead: an ordinary number like 42 is decimal, a leading 0 marks octal (so 052 is forty-two), and a leading 0x or 0X marks hexadecimal (so 0x2A is forty-two).

Note that the number itself does not change just because we represented it differently, using a different base. This fact is quite crucial: the number itself is an abstraction; the various representations are concrete manifestations of that abstraction.

A useful property of binary (base 2) numbers is that each digit is always either zero or one. Zero times anything is zero, and one times any other number is the other number, so only the ones “count”, as it were, and only the position of those ones matters. Each power of two is either included, or excluded. Also, when counting up, we find that the least significant bit simply flips back and forth, and each more-significant bit flips whenever its next-less-significant bit goes from 1 to 0: 0000, 0001, 0010, 0011, 0100, 0101, 0110, 0111, 1000, 1001, 1010, .... Given some number of binary digits, the maximum possible value occurs when they are all ones (and the minimum is of course just zero). Moreover, this maximum value is simply one less than the next power of two: for instance, 11₂ is 3, one less than 4 (2²); 111₂ is 7, one less than 8 (2³); and 1111₂ is 15, one less than 16 (2⁴). In general, n one-bits represent 2ⁿ−1.

C lacks the ability to write out binary numbers directly, but octal and hexadecimal tend to suffice.

This is all well and good for non-negative integers, but what about negative integers? Over time, a number of schemes have been used to represent negative integers in binary computers. The one that is perhaps simplest conceptually is the sign-and-magnitude method, also called signed magnitude. The other two are called ones’ complement and two’s complement. All three methods are allowed in C. In all three cases, we pick one bit to represent the sign, and use the rest of the bits for the value. It does not matter which bit we choose, but for notational convenience we tend to pick the “first” bit (this causes some interesting fights later on when we look at “endianness” of machines, since which bit comes “first” depends on where you stand—at the low-order end or the high-order end—when looking over at bits and bytes).

In signed magnitude, you simply take the value of all the “value bits” to get the magnitude, and then if the sign bit is set, call the number negative. Hence the bit pattern 1 (for the sign), 011 (for the value) has magnitude 3 and represents –3. This method has two drawbacks, though: first, there are two representations of zero, an ordinary “+0” and, with the sign bit set, a “negative zero”; second, the hardware to add and subtract signed-magnitude numbers must compare signs and magnitudes and decide whether to really add or really subtract, which makes it relatively slow and complicated.

The ones’ complement method avoids the second drawback. Here, to represent a negative number, we again set the sign bit, but this time we also invert all the value bits (change ones to zeros, and vice versa). Now to represent negative-three we set the sign bit, but also flip the value bits, giving 1 (for the sign), and 100 for the value. When we go to add two numbers, we need a device called an “end-around carry” to bring any carry bit out of the sign bit back over into the value bits. Watch how this works when we add –3 and +2, and then –3 and –1, and then –3 and +4 (the dots below represent carries that are added back in):

 1100   (-3)
+0010   (+2)
 ----
 1110   (-1)

 .  .
 1100   (-3)
+1110   (-1)
 ----
 1011   (-4)

 .  .
 1100   (-3)
+0100   (+4)
 ----
 0001   (+1)

Again, there are two representations for zero: the normal, regular zero; and the one with the sign bit (and in fact every bit) set, which is a “negative zero”. This value is slightly peculiar, and on a number of machines that used ones’ complement, an attempt to use “–0” was (sometimes optionally) trapped at runtime. By simply filling all memory with all-one-bits in advance, such machines could thus catch attempts to use variables that had not yet been set to a “proper” value. (A similar trick is found on today's computers, in another form entirely.)

Here are a few more example sums, using ones’ complement arithmetic. Note that if both of the low-order value bits being added are 1s, their sum is 0 (with a carry into the next value bit), but an end-around-carry will cause the low order bit to become one again:

 . ...
 11001   ( -6)
+11011   ( -4)
 -----
 10101   (-10)

 .....
 00010   ( +2)
+11111   ( -0)
 -----
 00010   ( +2)

With five bits, the allowed value range is –15 to +15:

 .....
 10111   ( -8)
+10000   (-15)
 -----
 01000   ( +8)

The last method, and actually the most common by far today, for representing negative integers is two’s complement. (Incidentally, see Knuth for the reason for the difference in apostrophe placement in ones’ and two’s complement.) This is much like ones’ complement, except that if the sign bit is on, we flip all the other value bits and then add 1 to them, to find the value being represented. Hence, while –3 is 1100 in a four-bit ones’ complement number, it becomes 1101 in two’s complement. This extra step—“then add 1”—avoids the need for an end-around carry. Now we can add any two values, simply discarding the carry out of the sign bit, to get the sum:

 .
 11010   ( -6)
+11100   ( -4)
 -----
 10110   (-10)

These are all relatively minor points in most cases, because as programmers working in a sufficiently high-level language, we do not have to concern ourselves with the gritty details of how the machine implements numbers internally. At some point, however, the seams start to show—and it is at this point that it becomes crucial that we understand the representations, and their failings. For instance, while sign-and-magnitude and ones’ complement both have a “negative zero” that we may have to deal with, two’s complement has a different problem: it can represent one more negative value than positive.

In sign-and-magnitude, all-ones is –1, but a sign bit and all-zeros is “–0”. In ones’ complement, all-ones is “–0”. In two’s complement, all-ones is –1 again—so what is a sign bit and all-zeros? The answer is, it is a negative value that has no corresponding positive value. If we flip all the bits in 1000...00, then add 1, we add 1 to 0111...11 which gets 1000...00 again. In 16 bits, a 1 for the sign followed by all-zeros for the value represents –32768, while the largest positive value is +32767. Negating –32768 leaves the value at –32768 (unless, as hardly ever happens, the machine catches this error for you; again, it is usually more important to get the wrong answer as fast as possible).

Thus, all three representations have drawbacks: sign-and-magnitude is easy to understand, but requires a lot of extra hardware and has a “negative zero”; ones’ complement requires a little extra hardware (the end-around carry) and, as with signed magnitude, needs separate instructions for signed and unsigned and still has a “negative zero”; and two’s complement has a negative value that has no corresponding positive value. But two’s complement needs no extra hardware, can be made very fast, and can get the wrong answer as fast as possible using a simplified instruction set, so most modern computers do that.

Note, incidentally, that I have talked about computing the positive value of a negative number in terms of bit inversions (and possibly adding 1). There is a nice neat mathematical interpretation we can use instead, with the digit-sequence summation rules as above. Instead of the sign digit contributing +2ⁿ⁻¹ to an n-bit number when set, simply let it contribute –2ⁿ⁻¹, and sum the weighted digits as usual. For two's complement this gives the value directly: the four-bit pattern 1101 is –8+4+1, or –3.

C's unsigned integer types work well for fixed-point numbers, though of course, they are unsigned. If you need negative fixed-point numbers, you can simply use signed arithmetic. This will work fine for most cases, but note what happens when you multiply or divide two fixed-point numbers: each operand already includes the scale factor, so the raw product carries the scale factor twice and must be scaled back down (shifted right) by one factor, while the raw quotient loses the scale factor entirely and must be scaled back up.

Consider a typical “scientific notation” number written in decimal, which looks like 7.2345•10⁴: there is a significand, 7.2345 (often loosely called the mantissa), and an exponent, here 4; the value is the significand multiplied by ten raised to the exponent.

Scientific notation is, in effect, a decimal form of floating point. It is easy to fall into a trap here: computer arithmetic is usually printed out in decimal floating point form, making it easy to believe that the numbers were in that form all along. But in fact, most computers today use a binary floating point format.

Consider the number 4.5, but in binary floating-point. As with the fixed-point example above, 4.5 is 2² plus 2⁻¹, or 100.1 in binary. Normalized, we write this as 1.001b+2: a significand of 1.001 and an exponent of +2, the “b” marking a binary exponent (that is, multiply by 2²).

In decimal floating-point—scientific notation—it is trivial to scale by powers of ten: we just fiddle with the exponent. Similarly, in binary floating-point, it is trivial to scale by powers of two. Since 9 is twice 4.5, we just bump the exponent one. To get to 18, we just bump the exponent again. Each time, the significand stays 1.001; only the exponent changes.

Curiously enough, binary floating-point is absolutely terrible at representing numbers like one-tenth. In decimal, this is easy: 1.0e–1. In binary, however, one-tenth is one-sixteenth (0.0625) plus one thirty-second (0.03125) plus no sixty-fourths (0.015625; this takes us over) plus no 128ths (0.0078125; this still takes us over) plus one 1/256th plus one 1/512th plus no 1/1024ths, and so on, giving 1.100110011001100...b–4 as its binary representation. The sequence "1100" repeats forever, in much the same way as one-third, written in decimal, comes to 3.33333...e–1, using an endless string of "3"s. (In fact, it is easy to see that any decimal fraction that does not end with the digit 5 has this problem.) Fortunately, integer values, at least for “sufficiently small” integers, are always exactly representable in binary floating-point.

Note that if we always write our binary floating-point number in normalized form, not only is the binary-point always the second character, but also the first character is always—always—a “1”. That means we can not only omit storage for the binary point, but omit storage for the first digit as well. So, inside the computer, the number 1.1011b+4 (which represents 16+8+2+1 or 27) can be represented as just the binary digit sequence “1011” and the exponent (+4). Hence, in IEEE binary floating point, the storage is broken into two pieces: one for the mantissa (this term is now being used correctly, since the leading digit is no longer stored) and one for the exponent. Of course, we also have pesky sign issues to deal with: both the mantissa and the exponent are signed, as numbers can take forms like –1.1011b–4 as well as +1.1011b+4. The technique used here, in IEEE format, is a bit peculiar: the mantissa is stored as signed-magnitude, as with older computers and integers; but the exponent is stored in an “excess” format, in which a fixed bias is added to the exponent so that even negative exponents are stored as non-negative numbers.

Without getting into too many additional details, what this amounts to is that floating-point numbers have three pieces (instead of four): one bit for the sign of the mantissa; some fixed number of bits for the mantissa itself; and the rest for the exponent, in excess form. The “excess” amount depends on the number of bits remaining. For instance, IEEE single-precision is a 32-bit format, with 1 bit for mantissa-sign, 23 bits for mantissa, and 8 bits for exponent. Since eight unsigned bits can represent numbers from 0 to 255 inclusive, the exponent should logically be stored in “excess-128” form. For reasons I am not really prepared to explain, however, the actual excess is 127, so a stored exponent value of 1 represents 2⁻¹²⁶, a stored value of 128 represents 2¹, and the maximum normal stored value, 254, represents 2¹²⁷.

Since we use the normalized form, the 23 mantissa bits actually store 24 bits of significand value, with the implied leading “1” digit and binary point. Our numbers are always plus-or-minus (based on mantissa sign) 1.something, b, plus-or-minus something (whatever is in the adjusted exponent). Or, perhaps more understandably, if the actual mantissa value is m, and the actual (unadjusted) exponent value is e, the value represented is 1.m•2ᵉ⁻¹²⁷, negated if the mantissa sign bit is set.

If you have been paying attention, or are familiar with the issue, you should have noticed a problem by now. If the mantissa always means one-point-something, how can we represent zero? The best we can do with a minimum normal exponent (unadjusted value 1, representing 2⁻¹²⁶) is 1.0•2⁻¹²⁶: tiny, but definitely not zero. The answer is that IEEE format cheats: it reserves the two extreme exponent values, all-zeros and all-ones, for four special cases.

The first two special cases are “zero” and denormalized (also known as subnormal) numbers: a value that has all-zero bits in the mantissa and a zero exponent represents 0.0 (or –0.0 if the sign bit is set), while one that has some nonzero mantissa bits is considered to be a “denorm”. Instead of 1.m•2ᵉ⁻¹²⁷, a denorm represents 0.m•2⁻¹²⁶: the implied leading digit is a 0 rather than a 1, and the exponent stays pinned at the minimum. Denorms provide “gradual underflow”, filling in the gap between zero and the smallest normal number.

The last two special cases, with all-ones exponents, are used to represent “Infinity” (if all the mantissa bits are zero) and “Not a Number”, or “NaN” (if some mantissa bits are set). As a useful trick, if all bits are set, the exponent is by definition all ones, and the mantissa clearly has at least some nonzero bits (in fact all are nonzero), so this is a NaN. Setting memory to all-ones thus catches some uses of uninitialized variables.

Note that infinities are signed—positive infinity is greater than all other numbers, and negative infinity is less than all others—but NaNs, while they still have a sign bit, are not supposed to be considered signed. Not every implementation always gets this right. As a further complication, NaNs are split into “Quiet NaNs” and “Signaling NaNs”, with QNaNs in calculations just causing the result to be another QNaN, while SNaNs are supposed to be caught at runtime. Again, not every implementation gets this right.

Calculations whose result exceeds the maximum possible representable value are supposed to convert to +Infinity: for instance, in single precision, 1e38 times 1e1 would overflow, so should yield +Infinity. Results that are too large in the negative direction likewise overflow to –Infinity (so –1e38*1e1 = –Infinity).

With an exponent that goes only to b+127, single-precision numbers are limited to about 1e38. (The actual number, 340282346638528859811704183484516925440, or about 3.4e+38, is what you get with 1.m•2¹²⁷ when all 23 mantissa bits are set.)

Doubles reach further, to about 1.8e308, but Intel-style processors do much of their arithmetic in 80-bit extended-precision registers with a far larger exponent range. As a result, this code:

    #include <stdio.h>

    int main(void)
    {
        double a = 1.0e200, r;

        r = a * a;
        r = r / 1.0e300;
        printf("%g\n", r);
        return 0;
    }

should print "Infinity" (squaring 1e200 overflows the double range, and Infinity divided by 1.0e300 is still Infinity), but actually prints "1e+100" on many Intel machines with many C compilers, because the intermediate 1e400 fits comfortably in an extended-precision register. (Again, we see it is important to get the wrong answer—at least, wrong as far as IEEE floating-point is concerned—as fast as possible. One can, of course, argue that this is obviously the right answer. Still, the rules for both C and IEEE arithmetic dictate that the result should be Infinity.)
