Matters Computational – Ideas, Algorithms, Source Code

Bit wizardry

We give low-level functions for binary words, such as isolation of the lowest set bit or counting all set bits. Sometimes the term ‘one’ is used for a set bit and ‘zero’ for an unset bit. Where it cannot cause confusion, the term ‘bit’ is used for a set bit (as in “counting the bits of a word”).

The C-type unsigned long is abbreviated as ulong as defined in [FXT: fxttypes.h]. It is assumed that BITS_PER_LONG reflects the size of an unsigned long. It is defined in [FXT: bits/bitsperlong.h] and usually equals the machine word size: 32 on 32-bit architectures, and 64 on 64-bit machines. Further, the quantity BYTES_PER_LONG reflects the number of bytes in a machine word: it equals BITS_PER_LONG divided by eight. For some functions it is assumed that long and ulong have the same number of bits.
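The relation between these quantities can be sketched in a few lines of C. This is only an illustration, not the content of the FXT headers: the names MY_BITS_PER_LONG, MY_BYTES_PER_LONG, and count_ulong_bits() are hypothetical, and the sketch assumes 8-bit bytes, as the text does.

```c
#include <limits.h>   /* CHAR_BIT, ULONG_MAX */

typedef unsigned long ulong;   /* abbreviation used in the text */

/* Hypothetical stand-ins for the FXT macros, derived via sizeof. */
#define MY_BYTES_PER_LONG ((int)sizeof(ulong))
#define MY_BITS_PER_LONG  (MY_BYTES_PER_LONG * CHAR_BIT)

/* Count the bits of a ulong by shifting an all-ones word down to zero. */
static int count_ulong_bits(void)
{
    int n = 0;
    for (ulong x = ULONG_MAX; x != 0; x >>= 1)  ++n;
    return n;
}
```

On common platforms count_ulong_bits() agrees with MY_BITS_PER_LONG. Note that the result need not match the machine word size: some 64-bit ABIs keep unsigned long at 32 bits, which is why the text says "usually".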

Many functions will only work on machines that use two’s complement, which is used by all of the current general purpose computers (the only machines using one’s complement appear to be some successors of the UNIVAC system, see [358, entry “UNIVAC 1100/2200 series”]).
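Whether a machine uses two's complement can be checked at run time. The following is a minimal sketch (the function name is_twos_complement() is hypothetical): in two's complement the word -1 has all bits set, so its two lowest bits are both one, which is not the case in one's complement or sign-magnitude representation.

```c
/* Return 1 if negative integers are stored in two's complement. */
/* The two lowest bits of -1 are: 11 (two's complement),         */
/* 10 (one's complement), 01 (sign-magnitude).                   */
static int is_twos_complement(void)
{
    return (-1 & 3) == 3;
}
```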

The examples of assembler code are for the x86 and the AMD64 architecture. They should be simple enough to be understood by readers who know assembler for any CPU.

Trivia

Little endian versus big endian

The order in which the bytes of an integer are stored in memory can start with the least significant byte (little endian machine) or with the most significant byte (big endian machine). The hexadecimal number 0x0D0C0B0A will be stored in the following manner if memory addresses grow from left to right:

adr: z z+1 z+2 z+3
mem: 0D 0C 0B 0A // big endian
mem: 0A 0B 0C 0D // little endian

The difference becomes visible when you cast pointers. Let V be the 32-bit integer with the value above. Then the result of char c = *(char *)(&V); will be 0x0A (the value modulo 256) on a little endian machine but 0x0D (the value divided by 2^24) on a big endian machine. Though friends of big endian sometimes refer to little endian as ‘wrong endian’, the desired result of the shown pointer cast is much more often the modulo operation.
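The pointer cast can be turned into a small run-time test for the byte order. This is a sketch under stated assumptions: the function names are hypothetical, uint32_t from <stdint.h> stands in for "the 32-bit integer", and unsigned char is used instead of char so that the result does not depend on the signedness of char.

```c
#include <stdint.h>   /* uint32_t */

/* Return the byte stored at the lowest address of the 32-bit word v. */
static unsigned first_byte_in_memory(uint32_t v)
{
    return *(const unsigned char *)&v;
}

/* Return 1 on a little endian machine, 0 otherwise. */
static int is_little_endian(void)
{
    return first_byte_in_memory(0x0D0C0B0AU) == 0x0A;
}
```

On a little endian machine first_byte_in_memory(V) equals V % 256, on a big endian machine it equals V >> 24.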

Whenever words are serialized into bytes, as with transfer over a network or to a disk, one will need two code versions, one for big endian and one for little endian machines. The C-type union (with words and bytes) may also require separate treatment for big and little endian architectures.
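One common way to avoid maintaining two versions, which is not taken from the text, is to assemble the bytes with shifts instead of copying memory. Shifts operate on the value of the word, not on its layout in memory, so the same code produces the same byte stream on either kind of machine. A minimal sketch, with hypothetical function names and a byte stream in big endian order:

```c
#include <stdint.h>   /* uint32_t */

/* Write v to out[0..3], most significant byte first. */
static void store_be32(uint32_t v, unsigned char out[4])
{
    out[0] = (unsigned char)(v >> 24);
    out[1] = (unsigned char)(v >> 16);
    out[2] = (unsigned char)(v >> 8);
    out[3] = (unsigned char)(v);
}

/* Read a 32-bit word from in[0..3], most significant byte first. */
static uint32_t load_be32(const unsigned char in[4])
{
    return ((uint32_t)in[0] << 24) | ((uint32_t)in[1] << 16)
         | ((uint32_t)in[2] << 8)  |  (uint32_t)in[3];
}
```

For the word 0x0D0C0B0A the stored bytes are 0D 0C 0B 0A regardless of the byte order of the machine, and load_be32() recovers the original value.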

Size of pointer is not size of int

When programming for a 32-bit architecture (where int, long, and pointers typically all have the same size), casting pointers to integers (and back) will usually work. The same code will usually fail on 64-bit machines, where pointers are 64 bits wide but int typically remains 32 bits. If you have to cast pointers to an integer type, cast them to a type that is guaranteed to be big enough to hold a pointer. For portable code it is better to avoid casting pointers to integer types.
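If such a cast cannot be avoided, one candidate for a sufficiently big type is uintptr_t from <stdint.h>. The type is optional in the C standard but available on all mainstream platforms. The following sketch (helper names are hypothetical) shows the round trip:

```c
#include <stdint.h>   /* uintptr_t */

/* Convert a pointer to an integer wide enough to hold it. */
static uintptr_t pointer_to_integer(const void *p)
{
    return (uintptr_t)p;
}

/* Convert the integer back to a pointer. */
static void *integer_to_pointer(uintptr_t u)
{
    return (void *)u;
}
```

A pointer converted with pointer_to_integer() and back with integer_to_pointer() compares equal to the original. The same round trip through a 32-bit int would drop the upper half of a 64-bit pointer.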

Attribution

Jörg Arndt, Matters Computational – Ideas, Algorithms, Source Code

This work is licensed under the Creative Commons Attribution-Noncommercial-No Derivative Works 3.0 license.
