What is a compiler
Posted by shabeer Ahamed F
Generally
compiling is a term which is often heard by everyone who is associated with
programming, even if remotely. A compiler is a program which converts a high
level language program/code into binary instructions (machine language) that
our computer can interpret, understand and take the appropriate steps to
execute the same.
Let
us take an example to understand what does the above definition means.
If
you ask any person who is associated with programming, that what the first
program he/she wrote was, then the obvious answer would be “Hello World”. So
let us also start with the same.
#include<stdio.h>
Void main()
{
printf("Hello, world!");
}
This
most basic program prints or displays the words "Hello,World!"
on the computer screen. But there is a problem. It is not that simple. Behind
the curtains there is lot of complex things going. Let us peep inside these
things. The hard truth is that our computer cannot understand the
commands/instructions contained in a source file (helloworld.c), because C is a
high-level language which means, it contains various characters, symbols, and
words that represent complex, numbers-based instructions for eg. printf, main,
header files etc. The only instructions a computer can execute are those
written in machine language, consisting entirely of numbers that is the
binary language in terms of 0 and 1. Before our computer can run our C program,
our compiler should convert our helloworld.c into an object file; then a
program called a linker should convert the object file into an executable file.
The
drawing below illustrates the process.
Definition
of compiler?
What
we have defined earlier is only half true. The notion of compiler as a program
which converts input (high level language) into output (assembly language or
machine code) for some processor is a very limited definition of the same. A
compiler in real context takes a string and outputs another string. This
definition covers all manner of software which converts one string to anther
such as text formatters which convert an input language into a printable
output, programs which tend to convert among various file formats or different
programming languages and also web browsers.
History
In
earlier time only machine dependent programming languages were used and hence
any program which could be run on one machine could not run on any other as it
was specific to that machine. When high level languages that is machine
independent language were first invented in the 40s and 50s no compilers had
been written. In fifties the first compiler was written by Grace Hopper. The
FORTRAN team lead by John Backus at IBM introduced the 1st complete compiler in
1957.
Development
Early
compilers were complex in their code and also had large compiling time. But
with time, several developments took place which led to advanced compiler. The
main reason behind this was the splitting of a compiler into parts. In broad
terms compiler can be distributed in 3 parts:
-> The
front end - It understands the syntax of the source language.
-> The
mid-end - Its role is to perform the high level optimizations.
-> The
back end - It produces the assembly language.
Front
end
The
front end job is to analyze the source code file and then to make an internal
picture of the program (code), called the intermediate representation or IR. It
also manages the symbol table which is a data structure which maps or links
each symbol in the program or the code with the corresponding information
(location and type are few of the information). This is done over several
phases:
-> Line
reconstruction -
This phase is required so as to convert the input character sequence into a
form which is ready for the parsing phase.
-> Lexical
analysis - This
phase is required to divide the code into small pieces known as tokens (for
example a keyword, identifier or symbol name). This phase is also called lexing
or scanning, and the corresponding software for the same is called a lexical
analyzer or scanner.
-> Preprocessing - Some languages for example C,
requires a preprocessing phase which allows the macro substitution and
sometimes conditional compilation.
-> Syntax
analysis - This
phase allows parsing the tokens so as to identify the syntactic structure of
the code. This phase is used to build a parse tree according to the rules of
the grammar of the language.
-> Grammar: The grammar of a language is needed
so as to get a meaningful outcome of the language. It doesn’t defines it
meaning, rather defines rules to get a meaning. Grammars are specified using
"productions." Few of the examples of production are as follows:
Statement
-> if (expression) statement else statement
Statement
-> while (expression) statement
-> Tree:
Basically tree is the data structure
used by the compiler internally to represent the meaning of the code. This
phase is after the parsing phase in which the grammar of the language is set
with the code.
-> Semantic
analysis - This phase
is used to add the semantic information to the parse tree and hence to build
the symbol table for the same. This phase performs semantic checks and rejects
the incorrect programs or issue warning.
-> Symbol
Table: Every
subroutine, variable has some information associated with them.
--> Variable names – information regarding type,
storage location and scope.
--> Subroutine names – information regarding locations,
argument and types. The information is associated with the help of a "hash
map", which is a table that associates a string with the corresponding
information.
Back end
In
layman language one can say that back end is associated with the generation of
the code that is machine independent.
The
main phases of the back end include the following:
-> Analysis: This phase is the base for the
optimization of the code. The typical analyses methods are data flow analysis,
dependence analysis, pointer analysis etc... The call and control flow graph
are also built during this phase.
-> Optimization: This is the intermediate phase which
converts the language representation into equivalent faster forms. The various
optimization methods are inline expansion, dead code elimination, loop
transformation, register allocation etc.
-> Code generation: This is generally the last phase in
the process as it is associated with output language that is the machine
language. This involves resource and storage decisions. Debug data if generated
is also generated during this phase.
If
we break these three levels then there are total seven levels as follows.
Out
of these few major fields are explained below:
The
Lexer (or lexical analyzer)
The
lexer is the first process in the phase of compiling. Its purpose is to
decompose the stream of input characters into discrete sets known as
"tokens".
Let
us take an example to understand it:
char
str[] = "Compiler.";
Decomposes
into:
token
1: Keyword, "char"
token
2: Identifier, "str"
token
3: Left square bracket
token
4: Right square bracket
token
5: Equals sign
token
6: String, "Compiler." (Whole strings are token)
token
7: Semicolon
From
the above example we can easily say that token is a string of characters,
categorized according to the rules that are it may be a Identifier, Number,
Comma. The role of lexer is to categorize the token according to a symbol type.
The tokens are made so as to make the processing of strings easy.
The
Parser
In
today’s world when dependency on machines is constantly increasing and lots of
complicated tasks are performed by the machines, the phase of parsing is very important.
This phase increases the capability of the computer to understand the code and
take the corresponding action according to the code.
Parsing
is the process of understanding the syntax of a language by representing the
code by data structures understood by the compiler.
Generally
there are two main methods of parsing that is Top-down parsing and Bottom-up
parsing.
-> Top-down
parsing partitions a
program top-down, programs into modules, modules into subroutines, subroutines
into blocks.
-> Bottom-up
parsing group tokens
together into terms, then expressions, statements, then blocks and subroutines.
Error
detection
The
error detection is a basically not a phase but a process which keeps on going
at the background during the various phases. It is an ongoing process occurring
throughout.
-> Lexer - detects malformed tokens.
-> Parser
- detects syntax errors.
-> Tree - detects annotation type
mismatches.
Types
of Compilers
The
compilers are classified according to the machine, input and output. Some of
the types of compilers are:
-> One-pass
compiler
-> Threaded
code compiler
->
Incremental
compiler
-> Stage
compiler
->
Just-in-time
compiler
-> A
retargetable compiler
-> A
parallelizing compiler
Compiler
Benefits
-> The main benefit of compiler is that it allows you to write the code that is not machine dependent.
-> The main benefit of compiler is that it allows you to write the code that is not machine dependent.
->
The
compiler converts a high-level language into machine code, and it also looks at
the source code to make it efficient (by collecting, reorganizing and
generating a new set of instructions to make the program run faster on the
computer.)
0 comments: