CPUs and pipelines, how do they work?

After the tutorial on C that I re-published recently, it is now time for another old article that is still remarkably educational and useful today.

It describes the internals of the Pentium and Pentium Pro CPUs. Ironically enough, today’s Intel CPUs are still remarkably similar to these oldies. The Atom is a superscalar two-pipeline in-order CPU, closely related to the classic Pentium. The Core architectures are out-of-order execution CPUs, which still work very similar to the Pentium Pro I described. Likewise, AMD’s x86 CPUs are all out-of-order execution CPUs as well, and work according to the same principles.

An execution pipeline is basically a conveyor belt for instructions. Instructions pass various stages, at each stage, a part of the execution of the instruction is done. Normally each stage will take 1 clock cycle (clk), that is, an instruction will stay in a stage for 1 clk, and then proceed on to the next. Basically there are 3 main operations:

  • Fetch: Load instruction from memory.
  • Decode: Set up operands (load from memory if necessary), and find out what the instruction is supposed to do. Basically prepare to dispatch operands to the correct logic unit, for execution.
  • Execute: Take the operands, dispatch to the correct logic unit, perform the desired operation, and store the result.

A simplified pipeline could have these three stages: fetch – decode – execute. The instruction passes through them in that sequence:

        Fetch -> Decode -> Execute
       +--------+--------+--------+
clk 0: | INSTR  |        |        |
       +--------+--------+--------+

       +--------+--------+--------+
clk 1: |        | INSTR  |        |
       +--------+--------+--------+

       +--------+--------+--------+
clk 2: |        |        | INSTR  |
       +--------+--------+--------+

Now if the instruction is being fetched, it cannot be decoded or executed yet. Does this mean that only 1 stage of the pipeline is active, while the rest is idle? The answer is no. Why not? Because while you execute the first instruction, you can decode the second, and fetch the third at the same time. The different stages do not share any hardware resources, so each stage can work independently of the others, at the same time:

        Fetch -> Decode -> Execute
        +--------+--------+--------+
 clk 0: | INSTR0 |        |        |
        +--------+--------+--------+

        +--------+--------+--------+
 clk 1: | INSTR1 | INSTR0 |        |
        +--------+--------+--------+

        +--------+--------+--------+
 clk 2: | INSTR2 | INSTR1 | INSTR0 |
        +--------+--------+--------+

So the stages can work in parallel on sequential instructions. This in effect reduces the execution time of a sequence of instructions. If you have for example a piece of code of 10 instructions, it doesn’t take 10 x 3 = 30 clks, but instead, it takes 2 clks for the first instruction to reach the execute stage, and from there on, at every clk a new instruction arrives at the execute stage, so you execute the 10 instructions in 10 clks. That’s a total of only 2 + 10 = 12 clks. There is a sort of ‘telescoping’, or ‘waterfall’ effect.

The different instructions in the pipeline can be dependent however. It could be that the instruction that is being decoded, requires an operand that is the result of the instruction being executed. To solve this problem, operand forwarding was introduced. This means that the decode stage does not have to wait for the result to be available. The result is forwarded back into the pipeline as a new operand, so execution can continue without “stalling” (waiting):

Consider an add instruction of the form: add destination, source1, source2

INSTR0: add r0, r1, r2
INSTR1: add r4, r0, r3  Decode -> Execute
       +--------+--------+--------+
clk 0: | INSTR2 | INSTR1 | INSTR0 |->-+
       +--------+--------+--------+   |
                                      |
                         +------<-----+ r0
                         |
       +--------+--------+--------+
clk 1: | INSTR3 | INSTR2 | INSTR1 |
       +--------+--------+--------+

Another problem arises when a jump is made. When you jump to a different location, the instructions that were already in the pipeline, are now invalid. So the pipeline needs to be flushed, and start over at the right location. With unconditional jumps, modern CPUs will spot the jumps in time, and continue fetching instructions at the new location right away, without stalling. With conditional jumps however, this is not possible. It is not clear whether the jump should be taken or not, until the condition has been tested, which at worst happens by the last instruction before the conditional jump in the pipeline. So instead of waiting, a modern CPU will try to predict whether the jump will be taken or not, judging from a hint in the code (the usual rule is: if the jump is to a lower address, it is probably a loop, so take the jump. Else, do not take it. Some CPUs even have hint bits inside the opcode, to indicate the behaviour), and from previous iterations. They store whether the jump was taken or not, on the last few occasions. With this strategy, the code can usually continue without stalling. Only when the prediction turns out to be wrong, the pipeline will have to be flushed, and execution will stall for a few clks until the first instruction has reached the execution stage again.

A superscalar CPU is a CPU which has more than one execution pipeline, working in parallel. So you can execute more than one instruction per clk. Operand forwarding can not solve all dependency problems anymore however. It could be that there are 2 instructions being executed in parallel, but the second one requires the result of the first one. The CPU will then have to put the second pipeline on hold. It can only execute the first instruction, then forward the result into the second pipeline. Then the second pipeline can continue executing on the next clk. We refer to this as a stall in the second pipeline. The penalty was 1 clk. It is the responsibility of the programmer to avoid these stalls.

What happens when the clockspeed goes up? All stages in the pipeline consist of sequences of logic gates. A gate has a certain propagation time. The time it takes to settle to a consistent state, after a change of input signals. So the sequence of gates in a pipeline stage can have at most a propagation time of 1 clk. As clockspeed goes up, the time that 1 clk takes goes down, so the maximum length of the logic sequence in a pipeline stage decreases as well. There are 2 solutions to this problem:

  • Reduce the logic required to fetch, decode and execute instructions
  • Split the pipeline up in more stages

In order to reduce the logic required to fetch and decode instructions, the instructions should be encoded in a less complex fashion. Reduced Instruction Set Computing (RISC) is a CPU design philosophy which aims for small and simple instructionsets. Complex instructions which can be replaced by a sequence of 1 or more simple instructions without significantly increasing total execution time, will not be implemented in hardware. Instructions should also be encoded in an orthogonal (uniform) manner, to reduce decoding complexity. They should all be the same size, and layout. The old paradox of “less is more” is a nice description of RISC. To give a name to the previous, more complex designs, the name Complex Instruction Set Computing (CISC) was chosen. These more complex designs date from an earlier age, when memory was small and expensive, and clock speeds were low. There was no pipelining yet, let alone superscalar designs. Code was mostly written by hand, rather than by compilers. So the designs were made to minimize the code footprint, and maximize the operations per cycle. This meant a lot of instructions with implicit operands, and performing common sequences of operations.

A few results of this philosophy are that virtually all instructions can only operate on registers, and virtually all instructions have the same number of operands. Loading registers from memory, or storing to memory is done with a few special instructions. At the time of RISC implementation, manufacturers were able to put millions of gates on a chip, but because of RISC, way less were needed than ever before. The extra gates could now be used for more registers, extra cache or other features.

Splitting the pipeline up in order to increase clockspeed is a poor mans solution. Namely, if there are interlocks because of dependent instructions for example, the penalties can increase from 1 to several stages. But, what if you cannot redesign your instructionset, because you want to keep supporting legacy code?

There are again 2 solutions:

  • Design a new RISC CPU, and support legacy code with a software emulation layer.
  • Design a new RISC CPU, and support legacy code with a hardware emulation layer.

It is interesting to see that Motorola chose the first option, when going from 68k to PowerPC. The Apple computers were powered by 68k CPUs. The new PowerPC-powered CPUs had 68k emulation built into the OS. At that time, the choice was not such a bad one. The new RISC CPU could be manufactured at much higher clockspeeds, and with more cache and more registers. This gave a leap in performance, which could make emulated legacy code perform acceptably, while native code would run at the highest possible speed.

Motorola did however release a 68060, which was designed with the second option. Which was the same choice that Intel made for their x86 family. The instructionset was not redesigned totally. They did use some ideas taken from the RISC philosophy however. In short, their CPUs now translate the complex x86 instructions internally, and then issue them to an execution pipeline which could be seen as a sort of RISC CPU. It can execute what Intel calls “micro-operations” or µOps, since the Pentium PRO (Pentium PRO, Celeron, PII and PIII all have the same P6 core). And surely, some of the instructions which would have been removed in a RISC instructionset, are now constructed from a sequence of 2 or more µOps. The Pentium has no names for the ‘atomic operations’ that it builds the instructions from… But it clearly does just that. Their pipeline is still longer than that of a true RISC CPU, but it is shorter than it would be when all instructions were to be decoded and executed directly, rather than translated to µOps. Intel managed to keep the implementation of a complex instructionset within reasonable bounds, and managed to get the clockspeed up to impressive rates. At the time of writing they have a Pentium 4 CPU out, which runs at 1.5 GHz. This makes the Pentium 4 the highest clocked CPU currently available. The high clockspeed comes at a cost however. The instruction throughput is relatively low, because the pipeline has a lot of stages, because at 1.5 GHz, the stages have to be very short. On top of that, its decoding scheme is still more complex than that of CPUs with a RISC instructionset.

But let’s look at the x86 family more closely. Because at the backend there is the RISC-like µOp-execution unit, the CPU also gets RISC-like traits. What does this mean? The translation from x86 instructions to µOps is not always an easy one. Only a part of the x86 instructionset can be considered orthogonal. But the x86 instructions have complex addressing modes. Intel decided to map register-only operations 1:1 to µOps, and add additional µOps for the complex addressing modes. Pentium PRO can handle simple addressing modes ([reg]) in 1 µOp forms as well.

A few Pentium examples:

add eax, edx

is 1 clk.

add eax, [edx]

will translate to:

mov tmp, [edx]
add eax, tmp

where tmp is a temporary internal register. Takes 2 clks.

add [eax], edx

is three operations on Pentium:

mov tmp, [eax]
add tmp, edx
mov [eax], tmp

This is exactly the style in which RISC code is written, since add would only be able to operate on registers, not on memory directly. So in short, the CPU seems to translate ordinary x86 code to RISC code. The downside to having the CPU translate your code, is that the pipeline will have to wait for the sequence of instructions to finish, before issuing the next pair of instructions. When you write the RISC-like code yourself, you are working with 1 clk operations only, and you can get 2 independent instructions through per clk. A few instructions are even more interesting on Pentium, because translating them by hand will actually make them faster.

For example:

cdq

takes 3 clks.

mov edx, eax
sar edx, 31

takes 2 clks.

stosd

takes 2 clks.

mov [esi], eax
add esi, 4

takes 1 clk.

loop label

takes 5 clks.

dec ecx
jnz label

takes 1 clk.

Basically there are a few extra rules for instructions to execute in 1 clk on Pentium. The Pentium has 2 pipelines, but they are not identical, the primary pipeline is called U, the secondary pipeline is called V. Some instructions can only be executed in U (eg. shift/rotate instructions), others only in V (eg. call/jump instructions). There are also some instructions which will not pair at all. They will only execute in U, and V will stall. The decoders also have limits. Each pipeline can decode instructions up to 7 bytes per clk. If a longer instruction is encountered, there will be a stall. Finally, the Address Generation Unit (AGU) is a stage earlier in the pipeline. What this means is that if you have an instruction that has to address memory with the result of a previous instruction, then it will have to stall for 1 clk. Namely, the operand has to be forwarded not one, but 2 stages back. This is known as the Address Generation Interlock (AGI) stall:

    add eax, ebx
    mov edx, [eax]

        Decode ->  AGU -> Execute
       +--------+--------+--------+
clk 0: | INSTR2 | INSTR1 | INSTR0 |->-+
       +--------+--------+--------+   |
                                      |
                +----------<----------+ eax
                |
       +--------+--------+--------+
clk 1: | INSTR2 | INSTR1 |        |
       +--------+--------+--------+

We say the latency of the instruction is 2 clks. The result has to be available 2 clks ahead. Most instructions have a latency of 1 clk. Operand forwarding can avoid stalls on 1 clk latency instructions, as long as there are no 2 dependent instructions in the pipeline at the same time, as we’ve seen earlier.

The Pentium PRO (and other P6 core CPUs) is a completely different beast however. Instead of translating in-place, as the Pentium does, the Pentium PRO decodes the x86 instructions into µOps and stores them in a buffer. A scheduling unit (Reservation Station) will then dispatch µOps to the units as the operands and the unit become available. This means it can execute instructions out-of-order (ooo). The results are then stored in a reorder buffer (ROB), so the result is guaranteed to be equivalent to the original sequence of instructions. In other words, the out-of-order execution is transparent to the program. Why is this interesting? A stall occuring in the early part of the pipeline will not affect the backend. The decoder might not issue new µOps during the stall, but while there are still µOps in the ooo buffer, the backend can continue execution. The ooo execution can also minimize the cost of latencies/stalls. While 1 µOp has to wait on a previous one to complete, the CPU can still dispatch other µOps which are not dependent, even if they were originally sequenced after the dependent instruction.

The Pentium PRO has 3 decoders, d0, d1 and d2. Decoder d0 can decode an instruction with a maximum length of 7 bytes, and a maximum complexity of 4 µOps per clk. Decoders d1 and d2 can decode an additional instruction of 1 µOp per clk each. So the ooo buffer can be filled with up to 6 µOps per clk.

The ooo execution pipeline has 5 different ports, port 0 through 4. Each port handles a specific class of µOps. Port 0 handles integer and floating point arithmetic, and address generation operations. Port 1 handles simple integer arithmetic instructions (not shift, multiply or divide). Port 2 handles memory load operations. Port 3 handles the calculation of memory write addresses. Port 4 handles memory write operations.

Each port can take 1 µOp per clk. This means that the CPU can dispatch a maximum of 5 (independent) µOps per clk. Since the decoders decode into the ooo buffer, they are now no longer dependent on the execution backend. This means that the decoders no longer have to stall when the execution takes more than 1 clk. In fact, the ports on a PPRO themselves are pipelined as well. Which means that even multiple-clk instructions such as multiply can be issued every clk, and they pass through the stages subsequently. With PPRO, you don’t specify execution speed of instructions by simply counting the clks, instead, you specify latency and throughput. For example, mul is specified as: latency 4 clks, throughput 1 per clk. This means that when we issue a mul, it takes 4 clks to complete in total. But you can issue 1 mul per clk. If you issue 2 subsequent muls, then the second one will follow one clk after the first. So we see the same ‘telescoping’ or ‘waterfall’ effect…

1 mul takes 4 clks
2 muls take 5 clks
3 muls take 6 clks
etc.

After execution, the µOps wait in the ROB for “retirement”. The retire stage will update all operands (register or memory) in order, at a maximum of 3 µOps per clk. It will also take care of the operand forwarding.

This rather complex and very interesting execution scheme shifts the focus from instruction scheduling to issuing µOps to the ooo backend. Scheduling instructions for dependency is now less important, because the CPU can schedule the µOps itself. There should not be so many dependencies that the ooo-buffer fills up faster than the backend can handle, but not each and every dependency results directly in a stall anymore. The ooo buffer acts like a cushion for dependencies and latencies.

You could see the CPU as a funnel for instructions… You can add up to 6 µOps per clk, then it can dispatch up to 5 µOps, and finally retire up to 3 µOps per clk. When writing code for PPRO CPUs, the main focus should be on using as little µOps as possible, rather than getting as many µOps per clk into the ooo buffer. Decoding them efficiently is important as well. Instructions that consist of more than 1 µOp should be scheduled to go into d0. Try to get both d1 and d2 to decode an instruction as well.

One part we’ve not touched yet, is the PPRO’s register renaming scheme. While on the outside it appears that the PPRO has only 8 integer registers, it has in fact 40 temporary registers inside. It uses these temporary registers internally, for a special kind of dependencies such as this one:

mov eax, [a]
mul [b]
mov [c], eax
mov eax, [d]

We saw earlier that a mul takes 4 clks. If the CPU would use the physical registers directly, then the last mov would have to wait until the previous store operation was dispatched. But instead the first three instructions get one temporary register assigned for eax, and the last instruction gets a new temporary register assigned again. The eax register has been ‘renamed’ to a temporary register. Now the load operation can be dispatched at any time, it does not have to wait for the previous instructions to finish. The CPU can rename 3 registers per clk. Each temporary register has its own dependency chain. Try to keep these chains short, so that the CPU can allocate new temporary registers early, which will reduce dependencies, and allow the CPU to dispatch more indepentent µOps. (note: the old tricks of ‘xor reg, reg’ or ‘sub reg, reg’ to set a register  to 0 will not be recognized as independent. A new independent chain will not  be created.)

Posted in Software development | Tagged , , , , , , , , , , , , , , , | Leave a comment

The myth of CMT (Cluster-based Multithreading)

The first time I heard someone use the term ‘CMT’, I was somewhat surprised. Was there a different kind of CPU multithreading technology that I somehow missed? But when I looked it up, things became quite clear. If you google the term, you’ll mainly land on AMD marketing material, explaining ‘cluster-based multithreading’ (or sometimes also ‘clustered multithreading’):

This in itself is strange, as one page you will also find is this: http://dl.acm.org/citation.cfm?id=640477.640525

Triggered by the ever increasing advancements in processor and networking technology, a cluster of PCs connected by a high-speed network has become a viable and cost-effective platform for the execution of computation intensive parallel multithreaded applications.

So apparently the term ‘cluster-based multithreading’ has been used before AMD’s CMT, and is a lot less confusing: it just speaks of conventional clustering of PCs to build a virtual supercomputer.

So CMT is just an ‘invention’ by AMD’s marketing department. They invented a term that sounds close to SMT (Simultaneous Multithreading), in an attempt to compete with Intel’s HyperThreading. Now clearly,  HyperThreading is just a marketing-term as well, but it is Intel’s term for their implementation of SMT, which is a commonly accepted term for a multithreading approach in CPU design, and has been in use long before Intel implemented HyperThreading (IBM started researching it in 1968, to give you an idea of the historical perspective here).

Now the problem I have with CMT is that people are actually buying it. They seem to think that CMT is just as valid a technology as SMT. And worse, they think that the two are closely related, or even equivalent. As a result, they are comparing CMT with SMT in benchmarks, as I found in this Anandtech review a few days ago: http://www.anandtech.com/show/5279/the-opteron-6276-a-closer-look/6

AMD claimed more than once that Clustered Multi Threading (CMT) is a much more efficient way to crunch through server applications than Simultaneous Multi Threading (SMT), aka Hyper-Threading (HTT).

Now, I have a problem with comparisons like these… Let’s compare the benchmarked systems here: http://www.anandtech.com/show/5279/the-opteron-6276-a-closer-look/2

Okay, so all systems have two CPUs. So let’s look at the CPUs themselves:

  • Opteron 6276: 8-module/16-thread, which has two Bulldozer dies of 1.2B transistors each, total 2.4B transistors
  • Opteron 6220: 4-module/8-thread, one Bulldozer die of 1.2B transistors
  • Opteron 6174: 12-core/12-thread, which has two dies of 0.9B transistors each, total 1.8B transistors
  • Xeon X5650: 6-core/12-thread, 1.17B transistors

Now, it’s obvious where things go wrong here, by just looking at the transistorcount: The Opteron 6276 is more than twice as large as the Xeon. So how can you have a fair comparison of the merits of CMT vs SMT? If you throw twice as much hardware at the problem, it’s bound to be able to handle more threads better. The chip is already at an advantage anyway, since it can handle 16 simultaneous threads, where the Xeon can only handle 12.

But if we look at the actual benchmarks, we see that the reality is different: AMD actually NEEDS those two dies to keep up with Intel’s single die. And even then, Intel’s chip excels in keeping response times short. The new CMT-based Opterons are not all that convincing compared to the smaller, older Opteron 6174 either, which can handle only 12 threads instead of 16, and just uses vanilla SMP for multithreading.

Let’s inspect things even closer… What are we benchmarking here? A series of database scenarios, with MySQL and MSSQL. This is integer code. Well, that *is* interesting. Because, what exactly was it that CMT did? Oh yes, it didn’t do anything special for integers! Each module simply has two dedicated integer cores. It is the FPU that is shared between two threads inside a module. But we are not using it here. Well, lucky AMD, best case scenario for CMT.

But let’s put that in perspective… Let’s have a simplified look at the execution resources, looking at the integer ALUs in each CPU.

The Opteron 6276 with CMT disabled has:

  • 8 modules
  • 8 threads
  • 4 ALUs per module
  • 2 ALUs per thread (the ALUs can not be shared between threads, so disabling CMT disables half the threads, and as a result also half the ALUs)
  • 16 ALUs in total

With CMT enabled, this becomes:

  • 8 modules
  • 16 threads
  • 4 ALUs per module
  • 2 ALUs per thread
  • 32 ALUs in total

So nothing happens, really. Since CMT doesn’t share the ALUs, it works exactly the same as the usual SMP approach. So you would expect the same scaling, since the execution units are dedicated per thread anyway. Enabling CMT just gives you more threads.

The Xeon X5650 with SMT disabled has:

  • 6 cores
  • 6 threads
  • 3 ALUs per core
  • 3 ALUs per thread
  • 18 ALUs in total

With SMT enabled, this becomes:

  • 6 cores
  • 12 threads
  • 3 ALUs per core
  • 3 ALUs per 2 threads, effectively ~1.5 ALUs per thread
  • 18 ALUs in total

So here the difference between CMT and SMT becomes quite clear: With single-threading, each thread has more ALUs with SMT than with CMT. With multithreading, each thread has less ALUs (effectively) than CMT.

And that’s why SMT works, and CMT doesn’t: AMD’s previous CPUs also had 3 ALUs per thread. But in order to reduce the size of the modules, AMD chose to use only 2 ALUs per thread now. It is a case of cutting off one’s nose to spite their face: CMT is struggling in single-threaded scenario’s, compared to both the previous-generation Opterons and the Xeons.

At the same time, CMT is not actually saving a lot of die-space: There are 4 ALUs in a module in total. Yes, obviously, when you have more resources for two threads inside a module, and the single-threaded performance is poor anyway, one would expect it to scale better than SMT.

But what does CMT bring, effectively? Nothing. Their chips are much larger than the competition’s, or even their own previous generation. And since the Xeon is so much better with single-threaded performance, it can stay ahead in heavy multithreaded scenario’s, despite the fact that SMT does not scale as well as CMT or SMP. But the real advantage that SMT brings is that it is a very efficient solution: it takes up very little die-space. Intel could do the same as AMD does, and put two dies in a single package. But that would result in a chip with 12 cores, running 24 threads, and it would absolutely devour AMD’s CMT in terms of performance.

So I’m not sure where AMD thinks that CMT is ‘more efficient’, since they need a much larger chip, which also consumes more power, to get the same performance as a Xeon, which is not even a high-end model. The Opteron 6276 tested by Anandtech is the top of the line. The Xeon X5650 on the other hand is a midrange model clocked at 2.66 GHz. The top model of that series is the X5690, clocked at 3.46 GHz. Which shows another advantage of smaller chips: better clockspeed scaling.

So, let’s not pretend that CMT is a valid technology, comparable to SMT. Let’s just treat it as what it is: a hollow marketing term. I don’t take CMT seriously, or people who try to use the term in a serious context, for that matter.

Posted in Hardware news | Tagged , , , , , , | 2 Comments

A tutorial on programming C

Over the years I have written various things related to programming and hardware. They were published on websites that no longer exist and/or have disappeared into oblivion. I have found a few of them, which I think may still be relevant today, so I will re-publish them here.

I will start with a tutorial on programming in C. I think programming in bare C is still relevant today, even with fancy languages based on C, such as C++, C#, Objective C and Java. It is good to have a good grasp of the basics in C, and be fluent with pointer manipulation and such. It can help you to make your code simpler and more efficient in more modern languages as well.

C: The beginning
—————-

Well, C is a relatively simple language, in that it does not have too many
language constructs. That’s because the language is relatively low-level,
it’s quite close to the machine, so to say. And the machine is a thing of
anarchy and chaos. C was developed to get some structure in this, without
sacrificing too much performance, and size.

We have:

- Data declarations
- Functions
- Operators
- Flow control statements
- Type definitions

That is what you have to work with. In short, code consists of functions,
containing (flow control) statements, with operations on data of specific
types.
That may seem a bit too much to grasp at this point, but we’ll take it one
step at a time.

Data
—-

Ofcourse we need data… like text or numbers. On data, we can perform
operations, like adding, subtracting, multiplying and such.
So first, let’s see how we can give our programs some data.

We have 2 kinds of data: initialized data, and uninitialized data. They are
declared in much the same way, except that initialized data gets a value
assigned at declaration, and uninitialized data does not. So the initial
value of an uninitialized variable is undefined.

First you give the type of your data variable, then you give the name:

char myChar;

This is an unintialized data variable. Initialized data works by assigning an
initial value to the variable:

char myChar = 'a';

Well, char is just 1 primitive data type of C. I will give you the complete
list and their sizes in memory here:

- char          1 byte integer, also used for characters.
- int           2 or 4 bytes integer, depending on the system architecture.
- float         4 bytes floating point number.
- double        8 bytes floating point number.
- (pointers)    depending on the system architecture.

As you see, some data types are dependant on the system. So to make things
easier, I will choose the popular x86 system in 32 bit mode from now on.
Note that other systems may vary.

There are also 2 ‘size modifier’ directives:

- short         2 bytes
- long          4 bytes

These directives can be prefixed to int, and will determine the size. In most
compilers, you can omit the int, and just use long and short as if they were
primitive types themselves.

So for 2 byte integers, both these declarations are correct:

short int a;
short a;

Similarly for 4 byte integers:

long int a;
long a;

And if there’s no size modifying directive for an int, the compiler will use
the default size, which is platform-dependant.

On the x86 system, ints are long, 4 bytes, and so are pointers. Pointers are a
special group of data types, which I will cover later.
(in 16 bit realmode OSes, ints used to default to short, and pointers could
be either near or far. This had to do with the segmented memory model on old
x86 processors (8086, 8088, 80186 and 80286). This legacy system is beyond
the scope of this text, since modern x86 systems use 32 bit addressing. But
when using a realmode OS such as DOS, you have to pay attention to this.)

These data types are seen as signed numbers by default. Signed means that the
number can be both positive and negative. Unsigned variables can not have
negative values.
This is interesting, because a char is only 1 byte, or 8 bits big. It can take
on 2^8, or 256 values. With signed, this would be -128 to and with 127.
With unsigned, it would be 0 to and with 255.
You can control this behavior with the signed and unsigned directives:

signed int;
unsigned int;

It is also legal to define multiple variables of the same type on one line,
even intermixing initialized and uninitialized data.
It works by simply separating all variables by comma’s, like this:

unsigned int myVar1, myVar2, myVar3 = 50, myVar4;

Functions
———

Functions are the core of any C program. They contain the actual code, and
therefore provide the functionality of the program. A function can receive
parameters, the data it will process. And a function can return a primitive
data type variable. You declare a function in the sequence of return type,
function name, and parameter list (in parentheses):

int MyFunction(int param1, unsigned char param2, signed short param3)

Functions also have the possibility to not return anything. In that case we
have the special void data type. We will see this type again later with
pointers. This data type can also indicate that we want no parameters.
So if you don’t need any return value, and no parameters, then you can do:

void MyFunction(void)

(note: there’s old-style and new-style for functions with no parameters.
Official ANSI C wants MyFunction(void), but before the ANSI C standard was
introduced, MyFunction() was used. For most compilers, both styles should
work, but some (eg Borland) may enforce the ANSI C (void) style.)

Blocks of code are always between curly braces: {}. So the code that goes
into our function is no different. The code block immediately follows our
first line which declared the function prototype.

This might also be a good time to explain how to add comments to your program.
A C comment is prefixed by /* and postfixed by */. Anything between those
symbols is considered as comment, and will not be looked at by the compiler.

A small example:

int main(void)
{
    /* Print some text to the screen, using a library function */
    puts("Hello world!");

    /* Exit function with return value */
    return 0;
}

Here we have a function calling the puts() function with a text string as
a parameter (puts() will ‘put’ the ‘s’tring on screen. We will look at these
strings later, aswell as the puts function), and then returning a signed
integer value of 0 (this is an immediate operand).
Note also that each line of code in C is delimited by a semicolon (;).

Now, to look at the calling of functions more closely…

You can use functions from your own source, but you can also import functions
from earlier compiled modules of code, or libraries. Libraries are made up of
a number of modules of code. ANSI C comes with quite a few libraries of code,
which you can use in your programs. With these libraries, you also get header
files, which include these function prototypes, among other things. We will
look at these header files more closely lateron, when we are actually going
to write a program.

Before you can call a function, the compiler needs to know how many parameters
are to be passed to the function, and what types they are. This is done via a
prototype of the function.

If a function is defined in your own source code, above the line where you
want to call it, then the compiler already knows the prototype, since it has
seen the actual function before, and you won’t have to do anything.
If a function is below your call, or imported from a code library or module,
then the compiler won’t know the function, so we have to provide a prototype
before using the function.

A prototype looks much like the first line of a function, except that the
parameter names are optional, and are usually omitted.
An example of a prototype:

int MyFunction(int x, char y, short z);

Or, omitting the names:

int MyFunction(int, char, short);

Then you’re all set to use the function lateron in your source.

Calling a function is as simple as filling in the blanks, basically. All you
have to do is fill in the variables, in the prototype, and the function will
be called, and its return value will be yielded.
You can either import functions from another library of code, or
use functions from your own source.

For example, if we have a function like this:

float sqrt(float);

which will return the square root of a float we give it, we can make a small
piece of code like this:

float x = 25, sqrtOfX;

sqrtOfX = sqrt(x);

As you can see, you can treat a function call like a number. In this example,
the parameter x will be passed to the function, the function will do its work,
and return the square root of x. And this result will be assigned to the
sqrtOfX variable.

Operators
———

So now that we know how and where to put our code, the next question ofcourse
will be: “How do I write code?”. That is not a trivial question, so we will
break code down to some subsets. Our first subset is “operations on data”.

As with most language constructs, C does not have too much operations on data.
Here’s the list:

Mathematical operators:

- +  : addition.
- -  : subtraction.
- *  : multiplication.
- /  : division.
- %  : division remainder/modulus.

Bitwise operators:

- &  : AND
- |  : OR
- ^  : XOR
- ~  : NOT
- >> : shift right
- << : shift left

They all work the same, in that you specify a target variable, then the first
operand, the operator, and then the second operand.

I will give a small example, with some data:

int destination;
int operand1 = 10;
int operand2 = 20;

destination = operand1 * operand2;

This will assign the value of the expression 10 * 20 to the destination
variable.
Well, to be more precise, the right hand side is an expression, which yields
a result. You could just write this in C:

operand1 * operand2;

This would yield the result, but it never does anything with it. In these
first examples, we will assign the result to a variable, but we will see that
there are other things we can do with expressions, such as combining them to
larger expressions. You could say that the above expression is a ‘primitive
expression’

You can also use immediate operands instead of variables:

int destination;
int operand1 = 15;

destination = operand1 / 3;

This will assign the result of 15 divided by 3 to destination.

There is also shorthand notation for the case where one of the operands is
also the destination variable. The shorthand notation works with putting the
operator directly in front of the equals-sign, and specifying only the other
operand. So:

destination = destination ^ 10;

can be written as:

destination ^= 10;

There’s another shorthand case, namely when you want to increase or decrease
the value of a variable by 1 unit. Why do I call it a unit? We’ll see that
later, when discussing pointers. For numbers, the unit is simply the number
1. These are the operators for it:

- ++ : increase
- -- : decrease

They work slightly different from the normal operators. You just prefix or
postfix them to a variable, there is no equals-sign involved. When postfixing
the operator, the value is used in an expression, and afterwards its value is
increased. When prefixing it, the value is increased first, then used in the
expression.

Some examples:

int destination;
int operand1 = 30;
int operand2 = 19;

destination = operand1 - ++operand2;

This will result in the following values:

destination = 30 - 20 = 10
operand1    = 30
operand2    = 20

Postfixing the operator:

destination = operand1 - operand2++;

Gives the following results:

destination = 30 - 19 = 21
operand1    = 30
operand2    = 20

You can also make more complex expression. The compiler should follow standard
operator precedence, and resolve expressions in brackets first. You can
recursively compose expressions, yielding the results of each ‘primitive
equation’ in the sequence the brackets and operator precedence have defined.

For example, this is possible:

int destination;
int var1 = 10;
int var2 = 20;
int var3 = 30;

destination = 27 * ((var2 + --var3)/(var1 * 2))

This will give destination the value of:

27 * ((20 + 29)/(10 * 2)) =
27 * 49/20 =
27 * 2 = 54

And finally, you can also use the results of functions in expressions:

destination = 25 + sqrt(var0);

Now, onto another kind of expressions…

Flow-control statements
———————–

Here comes the interesting part of code, namely controlling the flow of our
program, on specific conditions.

We do this with boolean expressions, a condition is either TRUE or FALSE.
Since a computer can only represent numbers, we express these boolean values
in integers (either char, short, int or long). A 0 stands for false,
everything else stands for true. But as a rule, when assigning the true value
to a variable, 1 is used.

For boolean expressions we have some operators:

- == : equals
- != : not-equals
- && : logical and
- || : logical (inclusive) or
- !  : logical not
- >  : greater than
- <  : less than
- >= : greater than or equal
- <= : less than or equal

Any boolean expression should yield the value 0 for false and 1 for true. They
work much the same as the previous expressions, namely with first operand,
operator, second operand. We will usually not store them in new variables
however, but use the result directly in a statement.

A simple example of a boolean expression would be:

short var;

var == 15;

The result of this expression is true, or 1, if var equals 15, and false, or 0
if var is any other value than 15.

Another example:

var1 && var2;

This expression results in true if, and only if both var1 and var2 are true
(non-zero).

var1 || !var2;

This expression is true when var1 is true and/or var2 is false.

Boolean expressions can also be combined:

var1 && !(var2 >= 10 || var3);

We have only a handful of flow-control statements:

- if-else
- do-while
- for
- break
- continue
- switch-case
- goto

The if-statement is simple, but effective…
‘If (this expression is true) then run this block of code’. And you can
optionally run an alternative block of code if the expression was not true,
using the else-statement after the first block of code.

Going something like this:

int var1, var2;

if (var1 == 3)
{
    var2 += 10;
}
else
{
    var1 /= 5;
}

do and while can be used to run a block of code in a loop, while (‘as long
as’) the boolean expression is true. do-while() is a special form of while(),
where the expression is checked after the code block is executed, instead of
before, with the normal while(). As a result, the code block is always run
at least once.

As an example, I shall show the code for a power function:

int mantissa = 5, exponent = 3, result = 1;

while (exponent--)
    result *= mantissa;

At the end of the loop, result will contain 5^3.
(And as you see, when there’s only 1 line of code in the loop, there’s no need
to put it in between the {} brackets).

Now for an example with do-while:

int result, var1 = 3, var2 = 7;

do
{
    result = var1 * var2;
    var1 += var2;
} while (result < 5000);

Here you can’t result before entering the loop, because result is not
initialized yet. So basically it has just the value that the last program left
there when using that piece of memory. Checking its value would be irrelevant.
We could make it an initialized value, but we know that the result is less
than 5000 the first time anyway, so this way, we save 1 check, and we save
the trouble of initializing the result value.

The for-loop is similar to a while-loop, but it has a special construct, where
you can not only specify a boolean expression which must be true to loop, but
you can also initialize some variables before entering the loop, and you
can specify some expressions which will be carried out after each loop. These
expressions are usually used to update the variables that are used for the
loop. It goes like this:

for (<initialize variables>; <boolean expression>; <update expressions>)

Or less abstract (an example which calculates faculty of x):

int i, x, y;

for (i = 1, x = 5, y = 1; i <= x; i++)
{
    y *= i;
}

In fact, it can be shortened to this:

int i, x, y;

for (i = 2, x = 5, y = 1; i <= x; y *= i++);

(Here we see that when there’s no code block following a statement, we can
just delimit the line of code, and with that the loop, with a ;)

The expressions in a for-loop are optional. For an endless loop, you can
simply omit all expressions:

for (;;)
{
    /* Put code in endless loop here */
}

Which brings us to the next question… What if we want to exit a loop such as
this endless loop, when a certain condition occurs?
That’s where the break statement comes in.

The break statement will exit the current loop, and continue the rest of the
program.
You will usually use this with an if (conditon) break; construction.

A small example:

Let’s say you’ve got a program that reacts to the user’s input (a menu or
something).
It should stay in the loop until the user chooses ‘quit’:

for (;;)
{
    /* Put code to display the options here */

    if (getUserInput() == QUIT)
        break;
    else
        doSomething();
}

Question: Do we really need to put doSomething() in an else-statement?

<Author gets some coffee, giving you time to think over the question>

Answer: In this case we don’t. Notice that when the if is true, it will only
do the break (and therefore breaks out of the loop). When the if is false, it
will do doSomething().

The continue statement is similar to the break statement, but it only exits
the current cycle of the loop and enters the next.

Going something like this:

int i, n, m;

/* user inputs n */

for (i = 0; i < 100; i++)
{
    m = i % n;

    if (m == 0)
        continue;

    n /= m;
}

We see here how we can exit the current loop on a certain condition. In this
case we exit if m is 0, to avoid a division-by-zero exception.

But, now back to our menu-example of earlier…
OK… so now you’re probably wondering why you’d want a menu with only a
quit-option.
Well… You don’t.

Now, there are two ways to add new options to the menu.
You could of course do it by adding if’s:

int userinput;

for (;;)
{
    /* Put code to display the options here */

    userinput = getUserInput();

    if (userinput == FILE)
        doFile();
    else if (userinput == EDIT)
        doEdit();
    else if (userinput == VIEW)
        doView();
    else if (userinput == QUIT)
        break;
    else doBadInput();
}

OK, this’ll work, but it’s not the ideal solution.
The compiler only sees a couple of if’s, but it isn’t able to tell whether
they have anything in common or not (compilers aren’t that smart yet.),
therefore the code it produces won’t be as efficient as it could be.

The best way to do it is to use the switch-case statement for this. The switch works
by testing one int variable against several (const) solutions.

The above example menu-example using switch ():

int userquit;

for (userquit = 0; userquit == 0;)
{
    /* Put code to display the options here */

    userinput = getUserInput();

    switch (userinput)
    {
        case EDIT:
            doEdit();
            break;
        case VIEW:
            doView();
            break;
        case QUIT:
            userquit = 1;
            break;
        default:
            doBadInput();
    }
}

Some comments on this structure: Firstly, the code looks much better :)

Secondly, for the compiler, it’s very easy to see that the cases are related
to each other, because they all depend on the same variable, by definition.
Hence the compiler can make optimizations that a normal block of if’s won’t
allow.

And thirdly: see the break at the end of each case? That is there to stop the
program entering the other cases. You can use this mechanism if you want to
make the program do the same thing for some of the cases. For example suppose
you’re on a diet that prescribes that you can only eat meat on tuesday,
thursday and sunday, and you want a program to print (when given a day-number
between 1 and 7) if that day is a meat-day?

/* day 1 is monday */
int day = getUserInt();

switch (day)
{
    case 2:
    case 4:
    case 5:
        puts("Meatday!!");
        break;
    case 1:
    case 3:
    case 6:
    case 7:
        puts("No meat");
        break;
    default:
        puts("This day doesn't exist, no meat for you, pal!");
}

Let’s say we enter day = 5. It’ll start executing from case 5: until it hits
a break.
See how good this looks? You can instantly see which cases do what…
Much better than:

if (day == 1 || day = 3 || day == 6 || day == 7)
    puts("No meat");

else if (day >1 && day <8)
    puts("Meatday!!");

else puts("The day doesn't exist");

The compiler can’t really optimize this in terms of speed, but here it just
looks better.
So there!

And now onto our last control flow statement: goto.

This statement can be used to make jumps to and fro in the code. It can be
quite useful in some cases.

The destination will be marked by a label. Labels are defined by a name,
followed by a semicolon:

myLabel:

They are useful for exiting multiple nested loops at once, where break cannot,
among other things.

That would look something like this:

int i, j;

for (i = 25; i >= 0; i--)
    for (j = 0; j <= 25; j++)
            if (i == j * j)
                    goto endOfLoop;

endOfLoop:
/* program continues here */

And that about wraps up our control-flow. Now on to the next challenge…

Arrays and pointers
——————-

Okay, here comes an interesting part of C. We will get into direct contact
with a part of the machine here, namely the memory.
The variables we used earlier were stored in memory aswell, but we didn’t get
to see where and how exactly. The compiler took care of that for us, we could
just use the variables by the name and type we had given them.
Now we are going to use linear sets of data, called arrays. And to understand
how they work exactly, we have to look at how the machine uses its memory.

So how does the machine use its memory?
You could picture it as a giant cupboard of drawers, and each drawer has a its
own unique number. In every drawer we can store one byte.
Picture it like this:

0: [  ]
1: [  ]
2: [  ]
3: [  ]
4: [  ]
5: [  ]
...

Or perhaps you prefer a horizontal look at it:

0:  1:  2:  3:  4:  5:  ...
[  ][  ][  ][  ][  ][  ]...

So to get a byte in memory, all we have to know is the number of its
container. We call this number the address.

It’s also interesting to look at how we store larger variables in memory.
This is another platform-dependant matter. An int for example is 4 bytes.
Now, there is 2 ways to store those 4 bytes in memory. The way we store multi-
byte numbers into memory, is called Endianness. There is Big Endian, and
Little Endian.

Let’s say that the 4 bytes of our int are AA, BB, CC and DD, from most
significant part to least significant part.
So our int looks like AABBCCDD.
In Big Endian, we store it like humans write numbers: from left to right, most
significant to least significant, so it will look like this:

[AA][BB][CC][DD]

Little Endian stores it from least significant to most significant instead.
So we get the reverse:

[DD][CC][BB][AA]

Nearly all systems use the Big Endian byte order these days, it became the
preferred method for new systems in the 80s. It is also the ‘network byte
order’, used on all standard networks, including ofcourse the internet.
But, bear in mind that the original 8086 dates from 1978, when Little Endian
was still used predominantly, so the x86 is one of very few systems around
today, that still uses the now largely obsolete Little Endian method.

So, in short, the address of a multibyte variable will always point to the
most significant byte in Big Endian, and it will always point to the least
significant byte in Little Endian.

Well, what is a pointer? That is simply the address or ‘reference’ of a
variable in memory. A pointer has a type, and contains an address, which is
a 32 bit number. So like a normal variable, it actually contains a number, and
it is stored in memory (storage is also affected by the Endianness of the
architecture, as with the normal multibyte variables).

Declaring a pointer of some type is as simple as giving the type, and then the
name, with an asterisk (*) prefixed to it:

int *myPointer;

You can use pointers to any of the primitive types, and also to user-defined
types, which we will see later.

But now, how do we assign an address value to it?
One option is to use the ‘address-of’ operator, which gets the address of a
variable. It is as simple as prepending ‘&’ to the variable name:

char myChar = 25;

‘&myChar’ will then resolve to the address of the char in memory.

Assigning a value to a pointer works the same as with the other variables:

char *myPointer, myChar = 25;

myPointer = &myChar;

Now myPointer contains a reference to myChar. There is also a dereferencing
operator in C, this will get the value of the variable at the address that is
being pointed to. It works by prepending a ‘*’ to the variable name. You could
say that it is the opposite of the &-operator. The &-operator turns a variable
into a pointer, and the *-operator turns a pointer into a variable.

So we could store the char that myPointer is pointing to, back into a char
like this:

char *myPointer, myChar = 25, newChar;

myPointer = &myChar;
newChar = *myPointer;

Now newChar has a value of 25;

We could also assign a value to an address by dereferencing a pointer. Take a
careful look at this:

char *myPointer, myChar;

myPointer = &myChar;
*myPointer = 25;

We store 25 at the address that myPointer is pointing to. But, pay attention
here! Look what happened… myPointer contained the address of myChar. So by
storing 25 to myPointer, we stored it into the address of myChar, and
therefore myChar now has the value of 25!

Now from pointers on to arrays…

An array in the field of programming is basically what the word means: an
array, a row.
A row of variables of the same type, more specifically, stored in one linear
piece of memory. You can picture it like this:

An array of 6 chars (1 byte), starting at address 521:

Address: 521: 522: 523: 524: 525: 526:
Index:   [ 0] [ 1] [ 2] [ 3] [ 4] [ 5]

Note that we start indexing by 0, not 1. So, the array of 6 chars has indices
0 to 5 for all the elements there. More generally speaking:

N elements of an array are indexed from 0 to (N-1)

Also note that we can get the address of each individual element by:

starting address + index

Let’s look at arrays of bigger types. For example, an array of long ints.

An array of 4 long ints (4 bytes), starting at address 64:

Address: 64:   68:   72:   76:
Index:   [   0][   1][   2][   3]

We see here that the correct addressing formula for all types is:

starting address + (index*sizeof(type))

Incidently, sizeof(type) is recognized by C. If we put:

sizeof(int)

it will evaluate to a value of 4, which is the number of bytes in an int.
More generally speaking, sizeof(x) will evaluate to the total number of bytes
of memory used by x.
This is applicable to all primitive and user-defined types, and as we will
see later, also to arrays and structures.

How do we define arrays and use arrays in C?

There’s basically 2 variations… We have statically allocated arrays and
dynamically allocated arrays.

First we will look at the statically allocated ones. They are defined much
like a single variable: type, name. But after the variable name we put the
number of elements we want, in square brackets []. Looking like this:

char myArray[256];

Now, we have set up an array of 256 (uninitialized) elements. myArray is the
pointer to the first element in the array.

To access an element in the array, you can simply use the subscript operator
([]), as it is called: name[index].
For example, we want to test whether element 5 equals 25.
You can manipulate myArray[5] like a normal char, so we can just create a
normal boolean expression with it:

if (myArray[5] == 25)
{
    /* Do something */
}

Assigning values and doing operations on array elements are also done like
we’ve seen before.

Here’s a small example that multiplies every element of an array by 23:

short numbers[36], i;

for (i = 0; i < (sizeof(numbers)/sizeof(numbers[0])); i++)
{
    numbers[i] *= 23;
}

Lets also take a closer look at this expression:

(sizeof(numbers)/sizeof(numbers[0]));

This yields the size of the array. So i will go from 0 to 36. Namely, what
happens here, is this:

sizeof(numbers) will give use the total amount of memory used for the array
numbers. This will be in bytes. But a short int is 2 bytes, so we need to
correct for that.

sizeof(numbers[0]) will give us the size of element 0 of the array. Element 0
is 1 short int, so this will give us 2 bytes (ofcourse all elements in the
array are equal size).

So if we divide the two, we get this:

(sizeof(numbers)/sizeof(numbers[0])) = 72/2 = 36

Which is our arraysize. This little trick is very useful. If for example we
would change the type of the array to long int later, this line can remain
untouched, as it would still yield the correct arraysize. And if we decide to
change the size of the array to say 50, then all we have to do is to change
the definition of the array:

short numbers[50];

and the for-loop will still loop through all elements.

It’s also possible to put initialized data into an array. In that case, the
size does not need to be specified, because the compiler can derive that from
the number of elements it needs to add.

We give a list of elements, in {} brackets, and separate each element with a
comma:

int myArray[] = { 10, -642342, 12321, 213122, 1231 };

For text, there is a special array definition. Text is stored as an array of
chars, terminated by a 0. For example:

char myText[] = {'H','e','l','l','o',0};

But, C provides a special construct for such text strings. The following is
equivalent:

char myText[] = "Hello";

The 0-terminator will be appended automatically.
This construct will also allow you to append strings together. If you put 2
or more strings after one another, they will be appended to form 1 string.
This makes it possible to split long strings up in multiple lines, or even
add comments inbetween. A few examples:

char myText[] = "Hi" " there!" " How are you doing?";

char myText[] = "Hello, "
                "how are you?";

char myText[] = "This is version "
/* edit this */ "0.1 beta"
                " of the software";

It’s also possible to initialize some elements, and specify a size. This can
be useful when for example only the first element needs to be 0 (for a zero-
terminated list, for example):

short myArray[32] = { 0 };

myArray is a pointer, in that it evaluates to a memory address, but it’s a bit
different from the ones we’ve seen earlier. Namely, this pointer is not stored
in memory, but the address is a constant rather, which you can use, but not
modify. The pointers we’ve seen earlier, are stored in memory, and act like
variables. You can also use operations on them.

For example, we want to fill an array with all powers of 7:

unsigned int myArray[256], *myPointer, i, power = 1;

myPointer = myArray;

for (i = 0; i < (sizeof(myArray)/sizeof(myArray[0])); i++)
{
    *myPointer++ = (power *= 7);
}

Now, a few notes here… First, this part:

*myPointer++

The ++ postfixed to myPointer basically does what you would expect it to do.
The operation is performed, and after that, the variable is increased.
In this case, the operation is a dereference. So we assign a value to the
address that myPointer points to, and then we increase its value by 1.
“1 what?” you may ask. The answer to that, is “1 element”.
If you want to increase the pointer by say 6 elements, then you can just add
6 to it, like this:

myPointer += 6;

Basically, you can do any operation on it, even stuff like adding 2 pointers.
Note however, that it adds 6 elements. The actual address that myPointer
contains, will be increased by 6 * sizeof(type), so in this case, that is
6 * 4 = 24. We will look into that some more, when we get to typecasts.

Now to the second part:

(power *= 7);

This is a nice shorthand notation. The statement in brackets is executed, and
its value is evaluated, and can be assigned to our array-element.
So, power gets multiplied by 7, and the new value of power is then assigned
to *myPointer.

On to dynamically allocated arrays.

Dynamically allocated arrays can be useful when you don’t know the size of
the array beforehand, because it is dependant on user input, for example.
Or, when you don’t need the array for the duration of the entire program, and
you would like your memory deallocated after you’re done with it.

For the allocation of memory, we have the following function:

void *malloc( size_t size );

Well, we see here that it returns a void *. A type we haven’t seen before.
Basically, it is a typeless pointer, and it cannot be dereferenced, since we
don’t know the type, and therefore we don’t know how large an element would
be. But, we assign its value to a pointer of the type of the array that we
need, so there is no trouble. Assigning the value of a variable of one type
to the value of another type, is called typecasting. Let’s look deeper into
that before we continue.

Typecasting
———–

This is a very simple and short subject… Typecasting a variable is
basically forcing the compiler to interpret the data of the variable as if
it were of another type. This is done by putting the desired type before the
variable (or expression), in brackets, looking like this:

(int)myVariable;

Here’s an example on how a cast can affect variables:

short i;
char j = -1;

i = (unsigned char)j;

Now, instead of what you would expect to happen, i is not -1 now, but it is
255. What happened is this: j is cast to an unsigned char. The bitpattern for
-1 is 11111111 in 2s complement notation. Now, we interpret that same bit
pattern as if it were not signed. In that case, 11111111 is equal to 255.
And that is the value assigned to i.

You could also use a cast to add a certain amount of bytes to a pointer,
instead of a certain amount of elements. A char is 1 byte, so we could cast
our pointer to a char pointer temporarily, and add the number of bytes we
want.

For example:

int *myPointer, myBytes;

(char *)myPointer += myBytes;

We temporarily change the type of myPointer to char *, then we add the number
of bytes we want, which is stored in myBytes in this case. Afterwards,
myPointer will be an int * again, since a cast is only temporary.

In most cases, a cast is implicit. For example, void * can always be cast to
other pointer types. In some cases tho, the compiler might give a warning,
or you want to change the behavior to that of another type temporarily.
That’s when you use a cast.

Now, back to the malloc() function…

There was this other new thing… the size parameter has type size_t… That’s
odd… we haven’t seen size_t in the variable types.
Well, this is because size_t is not a primitive type, but a user-defined one.
Basically, it’s just a primitive type, but given another name, so that it
makes more sense when reading the source.

When we search through the header files (we will look at header files more
closely lateron), we find that size_t is defined in a file called Stddef.h,
by the following line:

typedef unsigned int size_t;

So, size_t is basically nothing but an unsigned int.
The typedef directive is followed by the primitive type, and then a list of
new names to be used as variables of that type, separated by comma’s.
For example:

typedef unsigned int colour, serial, uint;

Now you can declare variables like this:

colour red, green, blue;
serial nr1, nr2, nr3;
uint a, b, c;

An interesting application for these typedefs is portability. You could write
a program that uses only user-defined types, and to port it from one system
to another, all you would have to do is adapting the type-definitions to the
new architecture.

For example, you need a 32 bit signed integer type, and on architecture X, you
would need a long int, and on architecture Y, you would need an __int32.
You would choose a usertype to represent the 32 bit signed integer, let’s take
sint32 in this example.

Then all you have to do to make it work correctly on architecture X, is this:

typedef long int sint32;

And to make it work on architecture Y, you would use:

typedef __int32 sint32;

Then there’s also the compound datatype in C, the data structure. Structures
contain a set of variables, and can be very useful for adding logic and
structure to your program. Namely, when you deal with real-world entities for
example, and they have certain attributes, you can group them together into
1 compound datatype.

Let’s say you want to store data about cars. For example, you want to store
model, year and colour of a car.
First you use the ‘struct’ keyword. Then you give your type a name. And then
you group some types together, and give them names, just like you would do
with separate variables.
Put them in between {} brackets.

Looking like this:

struct Car { unsigned int model; unsigned short year; unsigned int colour; };

Then when you want to define a variable of this type, you have to mention that
it is a structure, by using the struct keyword.
Looking like this:

struct Car myCar;

You can also initialize the struct, which would look like this:

struct Car myCar = { 911, 1989, 500 };

You could make a type of ‘struct car’, by using a typedef. This would save you
from typing ‘struct’ before each new variable definition.
So you would define a type like this:

typedef struct Car sCar;

And then define your variable like:

sCar myCar = { 911, 1989, 500 };

The most convenient way is to combine the struct definition with the typedef.
You don’t have to give the struct a name then, since you won’t be needing the
name of the actual struct, but only the name of the newly defined type.
Only if you would define a variable of the same type inside the struct, since
then the new name is not known until the typedef, which takes place after the
struct definition.

The entire line would look like:

typedef struct { unsigned int model; unsigned short year; unsigned int colour; } Car;

If you would reference a struct inside a struct, you would do this:

typedef struct tagSelf ( struct tagSelf next; } Self;

You use a temporary name, or ‘tag’, to be able to have the struct reference to
itself.

It is also common to write each member of the struct on a new line, to
increase readability:

typedef stuct {
    char *processor;
    unsigned int memory;
    unsigned int diskspace;
} Computer, *pComputer;

Note also that we defined 2 names here, Computer and *pComputer.
*pComputer is a dereferenced pointer, hence the *.
This automatically leads pComputer to be a pointer to a Computer struct
(the ‘p’ stands for pointer. For improved readability, sometimes variable
names are prefixed with abbreviations of their type, like this ‘p’ for
pointer. This is called Hungarian notation. The inventor was a Microsoft
programmer from Hungary, by the name of Charles Simonyi. The code looked like
some weird foreign language at first sight, and since Simonyi was Hungarian
by birth, they decided to call this convention Hungarian. To this day, all
Microsoft code uses this notation. It can be very convenient.)

So basically we combined this line:

typedef (Computer *) pComputer;

together with the definition of the Computer structure itself.

To access members in a struct, we use the dot (.) operator. The syntax is:

struct.member

This will resolve to a ‘normal’ variable, which can be used in expressions
just as usual.

An example:

Computer myComputer;

myComputer.processor = "MC68000";
myComputer.memory = 1048576;
myComputer.diskspace = 30234234;

When you would have a pointer to a struct, you could do this:

pComputer myComputer;

(*myComputer).processor = "MC7400";

But, there is a special arrow (->) operator for pointers to structs, which is
the preferred method:

pComputer myComputer;

myComputer->memory = 655360;

It is also possible to create arrays of structs, and even initialize them.
For example:

Computer myComputers[2] = { { "MC68000", 1048576, 30234234 },
                            { "Pentium", (48*1048576), (3072*1048576) } };

The sizeof operator also works on user-defined types and structures, as I said
earlier. So sizeof(Computer) will return the combined size of all the members
of the Computer struct:

sizeof(char *)
sizeof(unsigned int)
sizeof(unsigned int) +
----------------------
sizeof(Computer)

In other words:

 4
 4
 4 +
----
12

sizeof(myComputers) would return the total size of the array, which will be
2 * 12 = 24 bytes, in our example.

There is another type, very similar to the struct. That’s the union. A union
is used exactly like a struct, but with one difference: instead of having
all members, you can pick one member, and use that.
An example should clarify it:

union Number { int i; double d; };

union Number myNumber;

Now if we want to use this variable, we can use either the int:

myNumber.i = 245;

or the double:

myNumber.d = 26.32345;

You can also use the typedef-combinations like we saw earlier with the struct:

typedef union { int i; double f; } Number;

Number myNumber;

The union will always take up as much memory as necessary for the largest
member. sizeof(<union>) will also return that size.
sizeof(<union.member>) will return the size of the type of that member.

That about covers the user-defined types, now back to our malloc() call.
We now know that we can simply specify the number of bytes as an argument of
the malloc() function, and we get a void * back.

So, we can have it cast implicitly to a pointer, and we have a pointer of the
type we want, to a piece of memory of the size we want.

This can be useful to dynamically allocate variables, and keep the memory
usage under control. We allocate an array when we need it, and deallocate it
again, when we’re done with it, and save the memory for other uses and
applications.

For example, to allocate 1 car structure dynamically, we can do this:

Car *myCar;

myCar = malloc(sizeof(Car));

We can also allocate an array dynamically… simply by multiplying the size of
1 element by the number of elements we want.
For example, an array of 25 ints:

int *myArray;

myArray = malloc(25*sizeof(int));

Now, to use this array, we can simply apply the subscript operator to the
pointer, just like with the statically allocated arrays:

myArray[20] = -15;

To deallocate the memory again, we can use the free(void *) function. As you
can see, it takes a pointer as its argument. Since it’s a void *, any pointer
will be implicitly cast, so we can just feed it any type of pointer directly.

To deallocate our int-array, we simply type:

free(myArray);

and our memory is regained.

Our first program
—————–

Well, we now covered just about everything that C can do. Now, let’s look at
how we create programs with the language constructs we’ve seen. How to group
it together to a sourcefile, and how to create a binary executable from it.

A sourcefile is a simple ASCII file, and can be written with any texteditor
you like.
There are 2 types of sourcefiles in C:

- normal sourcecode, text files usually with the .c extension.
 - headers, text files usually with the .h extension.

There is no physical difference between .c and .h files. Both can contain any
form of C statements, directives and code. It’s more of a habit of the C
programmers to make the distinction, etiquette rather than syntax.

Headers are a special kind of sourcefiles, which usually come with code
libraries. They contain the necessary function prototypes, type definitions,
constants and macros for use with the library. They don’t normally contain
code.

When you use code from a library, you include its header file in the source:

#include "header.h"

The compiler will search the current directory for header.h first, and if it’s
not found there, it will continue to search the path specified by the INCLUDE
environment variable.

There is also this form:

#include <header.h>

This will not search the current directory, but will start with the INCLUDE
path immediately.
To speed up compilation, use the <> form whenever possible, to prevent the
compiler from searching too much.

Right, now for our first real program, the infamous “Hello world” example…

We use the puts() function to print a null-terminated string to the console
output stream.
Your API reference will tell you which header and which library to use.
On *nix systems, you can type “man puts”, and on most other systems, there
will be some help function implemented into the IDE, where you can search for
“puts”.
You will find that we need the stdio.h (‘standard I/O’)header file, so we write:

#include <stdio.h>

This will basically paste the code from stdio.h into your current source file
at the place of the #include statement.

A C program starts execution from the main() function, which is defined to
return an int, which will be the exit-code to the OS.
There are 2 versions:

- no arguments: int main(void)
- Commandline passed from the OS: int main(int argc, char *argv[])

argc is the argument-count, the number of commandline parameters passed to the
program.
argv is an array of null-terminated strings, of size argc.
Note that argv[0] is the program name itself, so if for example you had a
commandline like this:

myprogram one two three

Then you would get argc = 4, and:

argv[0] = "myprogram"
argv[1] = "one"
argv[2] = "two"
argv[3] = "three"

For our “Hello world” example, we could use the main(void) version, since we
do not require any commandline parameters to be passed to this program.

(Note: The main() function is called by a piece of code known as the ‘stub’.
The stub contains the raw entrypoint of the executable, and sets up the
environment for running a C program (such as parsing the commandline,
setting up standard input and output streams, and the OS environment
variables). It then calls your main() function. When main() returns, the
stub will clean up the environment again, and pass the return value of main()
back to the OS and exit.)

Actually, I already gave the code for the program as an example for the
functions.

The entire program would look like this:

#include <stdio.h>

int main(void)
{
    /* Print some text to the screen, using a library function */
    puts("Hello world!");

    /* Exit function with return value */
    return 0;
}

Save it to a text file called “hello.c”, and after discussing the compiling
process, we will make an executable out of it.

Now that we have our first source, it is time to compile it to a running
program. This is a bit problematic to explain, since each compiler has its own
commandline options and special behavior. I will just tell a bit about the
process of compiling and linking, so you will know what to look for in the
documentation of your C compiler.

There are several levels in the transition from C source code to a binary
executable.
Technically, we have these levels:

4. C source code
3. Instruction listing (assembly source)
2. Linkable code object
1. Executable image

A compiler will take us from level 4 to 3. Then an assembler will take us from
3 to 2, and finally, a linker will take us from 2 to the final level of 1.

But in practice, we find that compilers will go from 4 to 2 immediately these
days, and use a bytecode format internally, rather than a true assembly
instruction listing. They often still have the option to output an assembly
listing tho, should the programmer so require.
And compilers automatically invoke the linker these days, so that we can go
from C source code to an executable in one go.
Compiling and linking will always be separate steps tho, since you may have
some precompiled code which you want to link to the newly compiled code, such
as imported library functions, or you might want to create some precompiled
libraries, so you will not link them to a binary executable at all.

So, I can’t really help you with the commandline options for your compiler.
But now that you know what the compiling process is about, you should be able
to figure it out yourself. I can give you 2 examples tho, for both Microsoft’s
compiler and the GNU C Compiler (gcc).

Microsoft Compiler:

CL hello.c

This will compile hello.c into hello.obj, and then link in the necessary
standard C libraries. CL.EXE invokes LINK.EXE itself. It has libc.lib as a
standard library, so you do not need to specify any libraries for standard C
functions. You only need to specify libraries for non-C functions, like
Windows API or third party libraries.
CL will give the executable the same name as the source file which contains
the main() function, so in our case, we will get hello.exe as a filename.
In this case, this line is all you need to have hello.exe generated.
if you need other libraries or objects, you can simply add them to the
commandline: CL hello.c blah.obj foo.lib
So, try to execute your first program now!

GNU C compiler:

gcc -o hello hello.c

This compiler also has a C library as standard, so no need to specify it. The
-o switch specifies the name of the executable. If no name is given, then it
defaults to a.out instead.

Some example programs
———————

Here is a small program that I think will be useful. It will show you how to
do some simple interaction with the user. It will also allow you to play with
some bitwise operations, and make you get used to playing with hexadecimal
values. This might come in handy lateron, as hexadecimal numbers and bitwise
operations can be quite useful in speeding your code up.

We will see puts() here again. The puts() function puts a string onto the
standard output stream (stdout), which is normally the screen. And I say
‘normally’ here, because streams may be redirected to other devices, such as
printers for example. We will also see gets(), which can read a line from the
standard input stream (stdin), which is normally the keyboard input.
It will put this line into a char array. The user must pass a pointer to this
array to the gets() function.
The getchar() function is similar, but it returns the first character from the
stdin stream immediately, and does not require a buffer to be passed.

We also get to meet fflush(), which ‘flushes’ a file stream. The stdin and
stdout streams are considered as files aswell in C, so we can use it on these
streams aswell. In our case we flush the stdin stream, to remove any
superfluous input that we would rather ignore.

The last function we get to meet, is printf(). This is a function to print
formatted text (hence the ‘f’) onto stdout. It is quite an interesting
function, because you can pass as many arguments as you wish. It has ellipsis
(…) in its prototype, to make this possible. I will give an example of how
to use them later.
The printf takes a format string as its first argument, which will basically
define what to do with the next arguments. How to format these, to be more
specific.
It will use a %, followed by some formatting rules, at the place where you
want the respective argument to be printed.
In our case we will see %s, which will print the argument as a string.
So for example, you could do this:

char name[] = "Jopie";

printf("Hello %s!", name);

This will result in:

Hello Jopie!

We will also see %d, which will print a signed decimal number. For unsigned
numbers, there is also %u.

And lastly we will see %08X… This formatting is a bit more advanced.
The 0 indicates that we want 0s prefixed to our number. The 8 indicates that
we want to have 8 digits in total. And the X tells printf() to print the
argument as a hexadecimal number, with uppercase alphas. There is also
%x for lowercase alphas, but I prefer uppercase.

There are a lot more options with printf(), but it would be a bit much to
cover them all here. I suggest you use your reference for that instead.
I will leave you with one last example, then we go to the actual program:

unsigned int age = 16;
char name[] = "Jopie";

printf("Congratulations %s, today is your %uth birthday!");

Result of this would be:

Congratulations Jopie, today is your 16th birthday!

Note also that puts() always does a newline after printing the string, while
printf() does not. So in some cases, you might want to use printf() instead of
puts() to avoid that newline. Be careful though, if your string contains %,
printf() will see this as formatting. A workaround for this is using %c, which
prints 1 character:

printf("I can use %c in my strings with printf()", '%');

Anyway, here is the program:

#include <stdio.h>

char name[32];
int x = 1, y = 1;

/* Print menu, allow the user to choose an option, and process that option.
   Returns 0 if user chose to quit, else 1 */
int doMenu()
{
    char buffer[32];

    printf( "\nHello %s, here is the menu for today:\n\n"
            "1) Enter X\n"
            "2) Enter Y\n"
            "3) X + Y\n"
            "4) X - Y\n"
            "5) X * Y\n"
            "6) X / Y\n"
            "7) X & Y\n"
            "8) X | Y\n"
            "9) X ^ Y\n"
            "A) ~X\n"
            "B) ~Y\n"
            "C) Quit\n\n"
            "Current X = %d\n"
            "Current Y = %d\n\n"
            "What is your choice? ",
            name, x, y);

    /* remove additional input from stdin stream */
    fflush(stdin);

    /* Get a character from stdin stream and process the command */
    switch(getchar())
    {
        case '1':
            /* remove additional input from stdin stream */
            fflush(stdin);      

            puts("Please enter new value for X.");
            x = atoi(gets(buffer));
            break;
        case '2':
            /* remove additional input from stdin stream */
            fflush(stdin);      

            puts("Please enter new value for Y.");
            y = atoi(gets(buffer));
            break;
        case '3':
            printf("%d + %d = %d\n", x, y, x + y);
            break;
        case '4':
            printf("%d - %d = %d\n", x, y, x - y);
            break;
        case '5':
            printf("%d * %d = %d\n", x, y, x * y);
            break;
        case '6':
            if(y == 0)
                puts("Y = 0. Division is not defined.");
            else
                printf("%d / %d = %d\n", x, y, x / y);
            break;

        /* The following are bitwise operations, let's also print out
           the hexadecimal representations for clarity */
        case '7':
            printf("%d & %d = %d || "
                   "%08X & %08X = %08X\n",
                   x, y, x & y,
                   x, y, x & y);
            break;
        case '8':
            printf("%d | %d = %d || "
                   "%08X | %08X = %08X\n",
                   x, y, x | y,
                   x, y, x | y);
            break;
        case '9':
            printf("%d ^ %d = %d || "
                   "%08X ^ %08X = %08X\n",
                   x, y, x ^ y,
                   x, y, x ^ y);
            break;
        case 'a':
        case 'A':
            printf("~%d = %d || "
                   "~%08X = %08X\n",
                   x, ~x,
                   x, ~x);
            break;
        case 'b':
        case 'B':
            printf("~%d = %d || "
                   "~%08X = %08X\n",
                   y, ~y,
                   y, ~y);
            break;
        case 'c':
        case 'C':
            /* The user wants to leave, we return 0, to break out of the
               while() loop in main() */
            puts("Thanks, and have a nice day.");
            return 0;
        default:
            /* Since we have arrived here, apparently none of the valid
               choices were reached */
            puts("Invalid choice.");
            break;
    }

    printf("Press enter to continue.");
    getchar();

    return 1;
}

int main(void)
{
    /* Input the username */
    puts("What is your name?");
    gets(name);

    /* Keep returning to menu until user chooses to quit */
    while(doMenu());

    return 0;
}

This might be a good time to greet some people.
First of all, ofcourse my Diamond Crew people, ewald and Maybird.
And ofcourse also all my Phrozen Crew mates, you know who you are :)
And the #Win32ASM guys, hutch, llama, Iczelion, nuu, dowap (or whatever :) ,
and the rest.
Ofcourse Kalms, and last but not least, the ladies (in no particular order,
you know how women are :) .
Sara`, Tracy, blorght, MoonDawn, Nitallica, CandyII, jessca, embla, Baudie,
flipgrrrl, and yes, even taylor^ :)

X-Calibre

Posted in Software development | Tagged , , , , , , , | 1 Comment

Porting BHM3DSample to Android: Some… well… a lot of… stressful development

As you may know, I have ported my OpenGL rendering framework to iPhone a few months ago. Originally I was not all that interested in Android, since it apparently uses Java for its apps. While it has OpenGL ES support in Java, it would require me to rewrite the entire framework from C++ to Java. With the iPhone I had a similar situation where Apple wants you to write things in Objective-C. However, as it turned out, there was a way to use regular C/C++ on the iPhone, so in the end I could just use most of my OpenGL framework as-is.

Looking a bit further into Android, I found that Android not only has the SDK for Java-based applications, but also an NDK, for more low-level things. The NDK allows you to use C/C++ and even assembly. Now that’s more like it! So I checked out the NDK, and spotted an example using OpenGL ES from C++. Exactly what the doctor ordered!

Things quickly went downhill from there though… Where I was used to having some a proper IDE for the iPhone and for J2ME, the Android SDK and NDK had a distinct 1970s feel to them: just a set of headers, libraries and some commandline tools slapped together. There was not too much in the way of documentation. It’s mostly learn-by-example, apparently. It didn’t look good. What’s worse, there are a few snafu’s in the SDK and NDK. For example, the Windows version of the SDK installs itself in your Program Files by default. However, writing to Program Files requires administrator rights. But the SDK manager is not installed to have these rights by default. So if you start it, and try to download and install any packages, it will fail to write them. And the NDK has some build/make scripts, which can’t handle spaces in pathnames. So if you put the NDK in Program Files together with the SDK, it won’t work.

Google does offer an Android plugin for Eclipse though, so I figured I’d give that a try. I never liked Eclipse, as I’ve always found it to be slow, cumbersome and generally unstable. For my Java development (including J2ME) I generally use Netbeans. But well, this time I had no choice, so Eclipse it is…

Eclipse did not disappoint… It was still as slow and cumbersome as I remembered it… and from time to time it would also crash… One time it even corrupted the workspace, so it would hang the next time it tried to open it. All I could do was delete the metadata in the workspace and start over.

However, at least the Android plugin was reasonably nice. At least I had a simple tool to build and deploy Android apps in the emulator. Oh yes… the emulator. Another part of the success-story that is the Android SDK. The emulator is extremely slow, and takes up lots of memory. Even on a modern high-end machine like an Intel Core i7, it still takes ages to boot the emulator up. And it actually runs considerably slower than a real phone.

Right, well… now that I had familiarized myself with the SDK somewhat, it was time to actually develop something. It quickly became apparent that you could not get rid of Java completely: the NDK can produce libraries, but they will have to be called from a Java app through JNI. I’m no stranger to JNI myself, I have used it before to add a fullscreen option to the Java demo system used for Croissant 9 on Windows. It involved a simple DLL that set up DirectDraw, and the Java application could pass its pixelbuffer to it through a JNI call.

But JNI has always been rather obscure, and quite archaic in nature. It’s difficult to find good documentation on it, especially now that Oracle has taken over Java, and the old Java.sun.com webpages are gone. A lot of dead links. After some digging, I managed to find a reasonable overview of the JNI calls though. Looks like nothing changed in the roughly 10 years since I wrote my JNI DLL. The Java side is quite simple: just declare a function with ‘native’, and call System.loadLibrary() in the static constructor to load your native library. The system works entirely via reflection, so all linking is done dynamically at runtime.

The C++ side is not so pretty. Firstly you have to give your functions very specific long names in order to make the reflection work in the JVM. Secondly, you also need to use reflection yourself, when you want to call methods on Java objects. What makes it even more cumbersome to use is that you cannot use any of the data directly. You need to perform marshaling via the JNIEnv object in order to access strings, arrays and such in a form that C++ can use. And since the NDK has only limited functionality, you will need use Java objects and functions in C++ from time to time.

To make matters worse, it does not seem to be possible to debug the native code in the emulator. The only thing I managed to do was to output log messages in Android’s logCat, so I at least had some idea of what was going on.

And no, the horror does not stop there… No… Once I got some basic C++ code running inside a Java app, I tried to compile the actual C++ framework for OpenGL ES, which I had used on the iPhone earlier. Problem #1: My code was for OpenGL ES 2.0, but the emulator only supports 1.0. So I could not use shaders and things, and had to go back to fixed function, and using CPU-emulation for the skinning. And for some reason, the GPU emulation is disabled by default… So I first had to reconfigure my emulator to get it to work… Problem #2: There was no STL support in the NDK.

Or at least, that’s what it looked like originally. Googling for some information turned out to be a wild goose chase. People trying to port uSTL or STLPort to Android. That’s what I tried originally, but that didn’t quite work. Luckily I later found out that Google has since added its own limited STLPort to the SDK. It is just disabled by default. But apparently if you add an Application.mk to your project, you can put the following statement in there to enable STL:

APP_STL := stlport_static

Now I could finally get my code compiled. Next step was to actually make it work. The biggest problem here is that you can’t just access files. I wanted to bundle the data files in the Application Package (.apk) itself, much like I did with the iPhone version. But this was not easy… like pretty much everything else in Android development. Eventually I figured out that I had to put it in the assets-folder. And then you can open it from Java, and get an inputstream. But that is still quite useless, since I needed to access the file in C++, not in Java. So I again had to pull some nasty JNI trickery to finally get things to work.

There were a lot of other minor things that delayed the port further… but in the end I did more or less get it to work:

Okay, so the textures still don’t work… But that should just be a formality at this point. I already managed to get the BHM file loaded from the assets, so getting a JPG to load shouldn’t be much harder. There are some Java helper functions for that, which I will be using.

And here it is, running on an actual phone:

All in all, it wasn’t a very pleasant experience. Google may have this geeky/nerdy/linux/developer-like image, but the Android development tools are horribly archaic, immature and just generally unprofessional. Apple has done a much better job on the development tools for iPhone and iPad. And this coming from a guy who likes to code classic Amiga and 16-bit DOS with Hercules/CGA/EGA for fun! This was an absolute nightmare compared to that. I’d almost be inclined to say that the extra money you spend on the iPhone and getting an iPhone developer account is worth the extra money. Decent tools make your life so much easier!

Posted in OpenGL, Software development | Tagged , , , , , , , , , , , | 1 Comment

Another root exploit for linux

A few days ago, the following exploit was published: http://blog.zx2c4.com/749

Another small step in debunking the myth of linux security. What is also interesting is that this bug was introduced only recently:

In 2.6.39, the protections against unauthorized access to /proc/pid/mem were deemed sufficient, and so the prior #ifdef that prevented write support for writing to arbitrary process memory was removed.

Well, those linux kernel developers sure are geniuses when it comes to writing secure code, aren’t they? And all those eyes that are allegedly inspecting the source code all the time… well, this code was submitted in March 2011. So it took many months to find the bug, and it is now widespread. A fix is available now, but there will obviously be tons of unpatched systems out there (people with a false sense of security… after all, they’re running linux, right?)

Another interesting tidbit is this:

It turns out that su on the vast majority of distros is not compiled with PIE, disabling ASLR for the .text section of the binary!

Yes, really! Which is interesting actually. I recall when I ported some of my CPUInfo code to OS X, that I ran into a problem. It did not allow me to use the EBX register freely. This was because the default build options in OS X are to compile everything position-independent. That is probably related to this: position-independent code enables ASLR. I didn’t have the problem on linux, because I used Ubuntu, which is one of the many distros that does not force PIC. As an aside, Windows relocates code in a slightly different way. Windows calculates the addresses with the PE loader, and patches them into memory. This takes slightly more time during loading of an executable, but it saves time during execution. An interesting difference in tradeoffs between Windows and linux.

So, a few minus points for linux security again: both in the quality of the kernel code, and in the quality of the default configuration of most linux distros.

Posted in Software development, Software news | Tagged , , , , | Leave a comment

Intel Medfield vs ARM

I did an article on ARM and x86 about a year ago, since multicore ARMs were up and coming, and Windows 8 would allow them to be used in netbooks, notebooks or even desktops, as competing solutions to x86. That was how I saw things at the time: ARM moving towards x86.

But Intel has recently introduced a new generation of Atom processors, codenamed Medfield. This is showing a move in the opposite direction: x86 moving towards tablets and smartphones. The move in itself is not new, as Intel has been trying to get Atom processors in embedded and mobile devices for a while now. But the difference is that Intel is actually succeeding this time.

As I mentioned last time, x86 bears some legacy which makes it inherently larger and more complex than a more modern architecture, such as ARM. However, this extra overhead is more or less a constant factor, where the rest of the CPU design will grow larger over time as more transistors can be fitted on a single die, because of manufacturing progress. For regular desktop, workstation and server systems, the x86 overhead has become a non-issue years ago. With the amount of execution units, caches and everything you find in a modern CPU, the legacy overhead of the x86 becomes insignificant. An added factor is that Intel has always had the most advanced manufacturing. This allowed them to fit more transistors in a smaller space, using less power, which could somewhat compensate for the extra cost of x86 legacy.

For Atom however, the small scale of things meant that Intel could not quite get the upper hand over the competition yet. An important issue with early Atoms was that it consisted of the CPU and a separate chipset, where the competing ARM solutions were a System-on-a-Chip (SoC). As a result, Atoms were not small and energy-efficient enough for mobile devices such as tablets, let alone smartphones.

Then Intel introduced Oak Trail, the first SoC version of Atom. This makes it more compact and more power-efficient. It was still too powerhungry for phones, but tablets were more or less within Intel’s reach now. But since ARM solutions still had better performance and battery life, Atom-powered tablets never took off.

The new Intel Medfield is an SoC as well, but it looks a lot better than Oak Trail. Oak Trail was built on a 45nm process, but Medfield uses a 32nm process. This puts Intel in its favourite position: a step ahead of the competition, who are still on 40nm process. The result is that x86 can now hide its legacy very well. And this time we will actually be seeing this Atom SoC in a number of tablets and smartphones.

So where I was expecting ARM to invade Intel’s turf, Intel is now doing the opposite. I was not expecting Intel to close the gap just yet. But it seems that they are well on their way. And 22nm is just around the corner for Intel, which may allow Atom to take yet another step forward compared to ARM.

This newfound competition is also interesting for ARM itself. Because Intel throws HyperThreading into the mix, which ARM does not have. And Intel’s SIMD extensions may also be slightly more mature than ARM’s. So we will have to see how ARM is going to respond to this.

Posted in Hardware news, Software news, Uncategorized | Tagged , , , , , | 3 Comments

Why I don’t use linux (and why you shouldn’t either)

As you may know, I have nothing against open source software. In fact, I am both a user of FreeBSD, and a developer of open source projects. But linux never sat well with me. It’s not so much the software itself, as it is the culture. The GPL is not my idea of free software. I think the BSD license offers considerably more freedom. GPL is more of a political manifest if anything. And I am interested in software development, not politics. Aside from that, the attitude of the linux community does not appeal to me.

Now, back in late September/early October, there was some buzz when word got out that Microsoft wanted to have UEFI secure boot enabled by default for Windows 8 systems. I did not bother to blog about it at that time… But as new ARM-based devices for Windows 8 are being introduced, the issue is being recycled by the linux community. You get over-the-top articles like this one. Where, as usual, linux is trying to play the victim, and blame everything on evil Microsoft.

Excuse me? But linux is doing it to themselves. A far more balanced article that was released earlier, can be found here. The short version is that the GPL (specifically version 3) doesn’t allow any kind of binary code to be distributed without source code. This restriction means that the secure key for booting cannot be kept a secret. So the GPL is locking linux out from participating in UEFI’s trusted boot sequence, which is meant to prevent rootkits from installing on your system unnoticed (is that such an evil thing?)

Now, the simple solution would be to create a license that is compatible with UEFI, so linux too can support secure booting. But no. Pragmatic as always, the linux community feels that their license is holy, and that the rest of the world is wrong, and has to adapt to their ways (which has worked just great so far, hasn’t it?).

And this is the sort of thing that drives me away from linux. I simply don’t want to be associated with these people and their crazy ideas and conduct in any way. I don’t need their software. Choice is important, right? Well good, because I can choose alternatives, such as FreeBSD. It’s free, it’s open source, and it does everything I need it to. But the community seems nicer. It’s just one coherent project, instead of tons of distributions-on-distributions, and the focus is on developing quality software. There are clear goals, a clear vision. If you are a linux user, I suggest you check out FreeBSD. For most types of linux installations, FreeBSD will make a fine alternative, as it is a true UNIX derivative, and most software that is available for linux, is also available for FreeBSD (Apache, mysql, postgresql, KDE, Gnome, VLC, Firefox, Chromium, Thunderbird etc). Under the right circumstances, FreeBSD will even perform better. And you will no longer be involved in all the political nonsense, FUD and distro-wars of the linux community (try asking a question… no matter what the topic, one of the first answers is always going to be: “But distro X sucks. You should try distro Y!” As if that matters…).

Posted in Software news | Tagged , , , , , , | 48 Comments