September 17, 2009 (Lecture 8)

Enter: Compiled, General-Purpose, High-Level Languages

Although we'll continue playing with various new tools throughout the rest of the semester, most of them will be directly applicable to the next section of the course, which begins today: The C Programming Language.
Today, we're going to talk about what it means to "compile" a program. Then, we'll write, compile, and run a simple C program. And, we'll talk about some of the cosmetic similarities and differences as compared to Java. Next class, we'll dig in deeper, and so on.

High-Level code, Assembly, Compilers and Interpreters

So, let's do a quick review of the compilation process as you probably knew it in 15-100. When people write programs, they are doing it with a mind toward solving some real problem. They want to be concerned with the problem and its solution, not the details of the machine that is a tool for solving it.
For this reason, programmers usually write programs in English-like programming languages: Perl, C, C++, and Java, for example. Although these languages are structured in a way that is useful to a computer, they are designed to be understandable and convenient for people.
Ultimately, a program written by a programmer in one of these languages, a so-called High-Level Language, is translated into another form that is better tailored for the machine. This new form is often known as an assembly language program. This is further represented in a computer-friendly form known as object code, machine code, or, in Java, as byte code.
To help understand the difference between a high-level language and assembly, I like to think about a car with a driver and passenger. The passenger might give the driver directions:

Back out of the driveway and go right. Continue for 4 blocks. At the stop sign, make a right. Travel about 1/2 mile. You'll see a grocery store. The parking lot is on the right-hand side of the road, just past that grocery store. Park there.

The passenger provided a set of high-level instructions to the driver. These instructions were provided in a way that the driver understood and used commands that were descriptive in the context of the problem: Navigating the city en route to the grocery store.
But the car, the machine, can't understand these instructions. It requires instructions in a different language. The instructions might begin like this:

Press brake pedal: Not less than 25 lbs of pressure. Push key into keyswitch, twist forward to on position; listen for click. Twist forward again with not less than 10 lbs of pressure. Listen for motor. Release keyswitch pressure. Slide gear selector down two notches into reverse. Reduce brake pressure to 5 lbs. Roll to bottom of driveway. Rapidly increase brake pressure to not less than 25 lbs.
Etc. Etc. Etc.

The driver has to translate the high-level instructions provided by the passenger into a low-level (physical) language understood by the car. This language is composed of much smaller steps. And, it might vary slightly from car to car. For example, the appropriate actions taken by the driver will be different for a car with a "standard" transmission than for one with an "automatic" transmission. So, although the passenger's language is problem-oriented and car-independent, the driver's translation is car-oriented and car-specific.
The programming process happens in much the same way. Human programmers produce high-level directions in languages like C or Perl. If we imagine that the directions are given "as we go", this is analogous to an interpreted language like Perl, which does the translation as it goes. If, instead, we imagine that the directions are written down and translated before we get into the car, this is analogous to a language like C or Java, where programs are compiled, and completely translated into assembly, before executing. A better example here might be to consider some carefully choreographed event, such as a roller coaster or other park ride.
The languages that do the translation as they go are known as interpreted languages, and the program that does the interpretation is known as the interpreter. For example, in the case of Perl, the perl program is the interpreter that does this translation. The languages that do the translation up front are known as compiled languages, and the program that does the translation is known as the compiler. For example, in Java, the compiler is the javac program.
In the case of a compiled program, the resulting executable, which is machine specific, can run on only one type of computer, such as iMacs running OS X. This is why, when you go to the store, you have to buy different versions of the same program for different types of computers. The same high-level programs were compiled for different types of computers. In the case of interpreted languages, the programs can run in any environment that has the interpreter -- it is, after all, what is doing the translation.

The Compilation process: More Details

So, as we understand it so far, the compilation process looks roughly like this:
  "source code" --> |compiler| --> "assembly"/"object code"/"executable"
  
Let's take this to a few more levels of detail. First, let's examine the difference between "assembly" and the computer-friendly forms, "object code" and the final "executable".
Compilation proper is the translation of the high-level language source code to the lower-level language of assembly. Assembly code contains instructions that are understandable by the computer, but they are still written in a human-readable form. Strictly speaking it is this translation which is known as compilation. The translation from assembly into the machine readable form, sometimes known as "machine code" or "the object file" is technically the domain of another tool, the assembler.
Just to give you a flavor of assembly, here are two examples of assembly code: one generated from C code on an x86 Linux box and the other from the Java compiler for the Java Virtual Machine (JVM). The instructions are shown as mnemonics, short abbreviations (often just three or four letters) for the operations. They might not be meaningful to you, but they are to those who have studied assembly.
From C to x86 assembly:
        pushl   %ebp
        movl    %esp, %ebp
        subl    $152, %esp
        andl    $-16, %esp
        movl    $0, %eax
        addl    $15, %eax
        addl    $15, %eax
        shrl    $4, %eax
        sall    $4, %eax
        subl    %eax, %esp
        movl    $0, 8(%esp)
        movl    $3, 4(%esp)
        movl    $0, (%esp)
        call    fcntl
  
From Java to "javap-style" JVM assembly:
   0:   ldc     #2; //String hello
   2:   astore_1
   3:   aconst_null
   4:   astore_2
   5:   new     #3; //class java/lang/StringBuffer
   8:   dup
   9:   invokespecial   #4; //Method java/lang/StringBuffer."<init>":()V
   12:  astore_3
   13:  aload_3
   14:  aload_1
   15:  invokevirtual   #5; //Method java/lang/StringBuffer.append:(Ljava/lang/S
  
So, now our process looks like this:
  "source code" --> |compiler| --> "assembly code" --> |assembler| --> "object code"/"executable"
  
Okay. So, what's the difference between the assembly code and the object file? The object file contains "machine code". It is a different representation of the assembly code. It contains exactly the same stuff, but the encoding is different. All of the mnemonics are translated into numbers and squished together. The reason for this is that people recognize words better than numbers and separate items with spaces. Computers process numbers better than words and spaces.
I'd show you an example of the machine code, but it really is unreadable. It is just a mess of unparsable data. The important thing to remember, though, is that the assembly and machine code are the same thing, just different representations.
Okay. So, let's talk a little more about going from the "object code" to the executable. Oftentimes, programs are written in pieces. These pieces need to be combined to form the executable. Each piece is an "object". And, the process of bringing the objects together is called linking. Linking is done, no great surprise, by a tool called the linker. Sometimes, linking is done as part of the compilation process. This is known as static linking. Sometimes it is done after the program is actually running. This is known as dynamic linking.
In Java, execution begins within one .class file and the others are located and loaded as needed. This is an example of one type of dynamic linking. In C, some parts of the program will be dynamically linked, albeit by a different mechanism, but others will be statically linked.
This leaves us with our next revision of the model:
  "source code" --> |compiler| --> "assembly code" --> |assembler| --> "object code" --> |static linker| --> "executable" --> |OS load and dispatch| --> "running process"
                                                                                                                                                    /\                  |
                                                                                                                                                     |_|dynamic linker|-|
  
The only other detail in this process is somewhat particular to the C Language. It is called preprocessing. The preprocessor manipulates source code. It takes source code as input, munges it a bit, and produces source code. It does things like expand some short-hand notation into first-class code and follow programmer-provided preprocessor directives to do things such as ignore certain sections of code, perform substitutions, or include code read in from other files.

So, at this point, here's our final version of the process:

  "source code" --> |preprocessor| --> "source code" --> |compiler| --> "assembly code" --> |assembler| --> "object code" --> |static linker| --> "executable" --> |OS load and dispatch| --> "running process"
                                                                                                                                                                                          /\                 |
                                                                                                                                                                                          |_|dynamic linker|-|

A Quick Tour: HelloWorld in C

Okay. Let's take a look at the world's best-known C program, HelloWorld. For those with some experience, this version isn't perfect. We'll examine that, too.
  #include <stdio.h>

  int main (int argc, char *argv[]) {
    printf ("Hello world!\n");
  }
  
What does this program do? Well, no great surprise: It prints out "Hello world!". Just as it was in Java, main() is the entry point for a C program. And, printf() works just as it did in AWK. The format of C functions isn't unlike Java methods. main() is a function. The stuff within the ()-parentheses is its argument list, and in particular, the command-line arguments. The main() function, as declared here, returns an int. And, you probably suspect from shell scripting that this "int" probably should be a number from 0-255 indicating the program's exit status, where 0 is success.
So, what is the weirdness that appears to be the argument list for main()? Well, as in Java, the argument list for main() is going to be an array of strings. But, as we'll learn, in C, strings are actually an array of chars. So, what we actually have is an array of arrays. In effect, the []-brackets mean an array and the *-asterisk is used to make that an array of arrays. But, I can't really explain the details quite yet -- it is a bit of a complicated C-ism. But, we'll get there soon.
But, one thing I can mention is that, unlike Java, C arrays don't carry their size with them. So, the first argument to main, "argc", is short for "argument count". It contains the number of command line arguments. The second argument, the array of argument strings, is known as "argv", short for "argument vector".
What about the "#include <stdio.h>"? Anything that begins with a #-pound is a preprocessor directive. In the case of "#include", it instructs the preprocessor to slurp in the file "stdio.h" and place its contents where the directive appears, here at the top of this file. Notice the <>-brackets. These tell the preprocessor to look in the "standard" places for this file. By convention these include "/usr/include". Try doing an "ls" there -- you'll see "stdio.h".
Had we used "-quotes, it would have looked first in the local directory and then in the standard places -- we'll do this for header files that we, ourselves, write. We use <>-brackets otherwise to speed up the process by eliminating an unnecessary look in the current directory -- and the possibility of hitting an extraneous file with the same name.

Compiling and Running Our Program

We generally use the .c extension for C programs. So, we'll name our program "helloworld.c". We can then use the "C compiler", cc, to compile it. In our case, the compiler is from GNU (the Free Software Foundation), so it is called gcc, but it is still aliased so that cc also works:
  gcc helloworld.c
  
This produces an executable called "a.out". We can run this by typing "./a.out". But, this default name isn't very descriptive. So, we can use the "-o" option to set the "output file name". Now our program will be called "helloworld" and run as "./helloworld".
  gcc helloworld.c -o helloworld
  
Notice that when we do this, gcc goes through the entire "compilation process". We can make it stop after each stage by using flags. If we use the "-E" flag, the output file will contain the results of the preprocessor. By convention these files should have a ".i" extension. The "-S" flag will cause the compiler to stop after compilation-proper, without assembling, leaving assembly code. By convention these files should end in a ".s" extension. The "-c" option will cause the compiler to stop after assembling, leaving object code. By convention, these files should end with a ".o" extension.
In class, we looked at "helloworld" after the preprocessor (-E) and after the compiler-proper (-S). You might want to do this again on your own. Each time, we used the "-o" flag to generate a file with our choice of names. And, each time, we used the correct extension: "helloworld.i" and "helloworld.s".

Errors vs. Warnings

In Java, for the most part, the compiler produced one thing: Errors. If it found something it didn't like, it would not compile. In C, there are two different types of problems: "warnings" and "errors". When the compiler runs into an error, it can't figure out your code and doesn't know what you want. It, therefore, cannot produce an executable. When the compiler can do what you ask -- but suspects that it might be problematic -- it produces a warning.
For the most part, "errors" are syntax issues and "warnings" are semantic issues. For example, if in Java you tried to assign a float to an int, the compiler would generate an error. In C, this is at most a warning. The compiler may tell you that it looks wrong -- but it'll generate code that does it anyway.
gcc, by default, is quiet about most "warnings". It only mentions the ones it views as most unusual and most likely to be problems. But, for the purpose of this class, we will deduct 10 points for each warning. Given the types of programs that we'll write, warnings are unnecessary -- and often forecast observable bugs during execution.
In order to get the most help from the compiler, we'll compile with four flags: "-ansi -Wall -W -pedantic". The "-ansi" flag asks for strict, standard C -- it disallows more relaxed syntax otherwise allowed by gcc. We do this to ensure that our code is fully portable. "-Wall" and "-W" ask for all warnings. "-Wall" is short for "Warnings: all". For whatever reason, the otherwise less-thorough "-W" flag picks up a small few that "-Wall" misses. "-pedantic" sounds exactly like what it is -- it asks for even the small stuff.
Again, this semester you should compile with all four of these flags, because we will -- and each warning costs 10 points!
So, let's compile again with these flags:
  gcc -Wall -W -ansi -pedantic helloworld.c -o helloworld
  
Notice it gives us three warnings:
  helloworld.c:3: warning: unused parameter 'argc'
  helloworld.c:3: warning: unused parameter 'argv'
  helloworld.c: In function `main':
  helloworld.c:5: warning: control reaches end of non-void function
  
Each of which is easily resolved:

We remove the unused arguments from the list
We return 0 to indicate success

So, here's the corrected version:
  #include <stdio.h>

  int main () {
    printf ("Hello world!\n");

    return 0;
  }
  

Prototypes, Header Files and The #include

So, what does the "#include" line do? Let's remove it and see what happens:
  helloworld.c: In function `main':
  helloworld.c:3: warning: implicit declaration of function `printf'
  
The compiler complains of an "implicit declaration of function 'printf'". What does this mean? In C, the compiler needs to see a declaration of a function before it is called. If it doesn't, it makes one up, and this is called an "implicit declaration". Sometimes this generates this one warning -- and sometimes it generates multiple additional warnings. Depending on the compiler, it can generate multiple warnings for a few reasons. Some compilers will assume that all of the types associated with the function are "int", which is an old C default, and then complain that they don't match. Others know about certain functions a priori and will complain that this "default" doesn't match what they know.
Regardless, we can quiet these warnings by adding a declaration, known as a function prototype, at the top of the file, before using the function. This prototype is a type of forward reference, which tells the compiler about something before it is used. The prototype contains the "signature" of the function and its return type -- basically the name and the types, but not necessarily the argument names.
Check out the new version of our program below. Remember, the "char *" is just "C for string" (the "const" in front promises that printf() won't modify it) and the "..." just indicates, as you know, that this is a "variable argument function". Remember from AWK that the number of arguments following the format string is the same as the number of placeholders in the format string.
  int printf (const char *, ...);

  int main () {
    printf ("Hello world!\n");

    return 0;
  }

  
So, what do you think is in the "stdio.h" file that we included before? Well, "stdio" is short for "standard input and output" and the ".h" extension is used for "header files". Header files are files that are usually included at the top of files containing real code, hence they are called "headers". They contain function prototypes and other definitions. You'll include "stdio.h" in most of your programs -- it contains the prototypes and definitions for most of the usual I/O functions, and then some.
In class we looked at it using "less /usr/include/stdio.h" and then looked at the version of our program with the #include, after compiling with the "-E" option. We noted that the stuff from the header file was now at the top of the code -- with our code as only the last few of very many lines.

Minor Homework

Everyone was asked to type in, by hand, the "Hello World" program and get it to compile on the "andrew unix systems" -- without warnings.