For making programming easier I will write a simplified version of C Language.There are a lot of things which are similar like C.
Changes in the simplified C :
- There are Less Keywords or reserved words as compared to C.
- We use only Basic data type (char,int,float,float,double).
- There is no include statement. So we can not include libraries.
- Simple regular expressions for the lexical units or tokens of lexical analyser
- We can define variables at the beginning of program only
- There is no need for the main function. We can directly start our program.
- There are not any dynamic semantic rules.
- We are not getting input as we did in C.
- Only static memory allocation.
- Only While loop available for iteration
- No function can be defined in the program
- CHAR (char data type)
- INT (int data type)
- FLOAT (float data type)
- CONTINUE (to get next iteration of loop)
- BREAK(to break out loop)
- WHILE (for loop declaration)
- VOID (void data type)
- Operators:
- ADD_OP (+ or -)
- MUL_OP (*)
- DIV_OP (/)
- INCR (++ or --)
- OR_OP (||)
- AND_OP (&&)
- EQU_OP (== or !=)
- REL_OP (> or < or >= or <=)
- ID are used for the name of identifiers which start with a letter and contain character or number after that.
- INT_C is an integer constant that can start with 1-9 and follow by numbers 0-9.
- FLOAT_C is a floating point constant.
- CHAR_C is any printable ASCII character.
- STRING’s are printable ASCII characters in between of two “.
- ‘(‘ ‘)’ parentheses
- ‘[‘ ‘]’ brackets
- ‘{‘ ‘}’ bracelets
- ‘;’ ‘,’ ‘.’ semi,comma,dot
- ‘=’ ‘&’ assign and reference
Main function have two parts
- Variable declaration (data type, initialization of variables)
- Body/Statements (body statements contain arithmetic expressions,for,print() etc.) Example:
int j; //defining variable at initial
float value=3.0; //initializing variable
for(j=0;j<10;j++){
j=j+10;
}
j++;
A lexical analyzer creates “tokens” by analysing character stream.A token can be anything from keyword,operators,identifiers or even constants/literals.
Symbol table is a way to store the information of identifiers so that we can access the information for an identifier anytime when it needed.
An identifier can be :
- Program
- Variables
- Subroutine parameters
- Instruction labels
- Constants
- Data type
We have to store all the information for semantic analysis.
As example:
int m;
- Identifye the keyword int and return token to parser
- Identify the identifier m
- Add it to the symbol table and return the pointer of the identifier to the parser.
- Recognize the semicolon ‘;’ and return the token the parser
- We first add the identifiers in lexical analysis with some initial information.
- Most information added in the syntax analysis,cause there is more information available.
- We process some information and sometime modify it in the semantic analysis
We can implement symbol table as a hashtable using chain implementation.
Basic Operations on a symbol table.
- Adding identifiers (insert).
- Searching identifiers(lookup).
Using the Hash function and ASCII characters of name of ID’s we generate a integer/position and return for hashtable.
hash(name) returns a position for hash table.
Insert() :-We create a new entry for the symbol table when we recognize a new identifier if it does not already exist in symbol table. Lookup() :- During syntax or semantic analysis search for an identifier and return it’s symbol table entry.
This function use for inserting a new identifier to the Symbol Table.The parameter are the name of identifier that will first undefined(UNDEF) and change afterwards and lastly the line number where it was found.For inserting a identifier we use hash() function for generating hash index. After that we search at the corresponding list.If the identifier is not found then we can create a new entry that will contain the current known information and we initialize the reference list.If the identifier is found we will add the line number to the reference list.
This function is used for searching.First we find the hash index of the input by using the hash() function, and then start searching for the corresponding linked list.If identifier was found then we will return the pointer of that identifier otherwise we will return NULL,which mean identifier was not found in list.
Syntax Analysis
Syntax analysis is a step in which we check the correctness of the syntax of our program. It combines the tokens from lexical analysis to form a syntax tree based on grammar/production rules.
Some other things that we do in this step:
- Syntax error message
- Insert information to our symbol table using the information that we get through the parsing of our program.
- Syntax analysis uses a context-free grammar, which can be expressed using the production rules,which become push-down or shift reduce automata.
- Bison uses LALR parser.
program: declarations statements;
declarations: declarations declaration | declaration ;
declaration: type names SEMI
;
type: INT
| CHAR
| FLOAT
| DOUBLE
| VOID
;
names: names COMMA variable
| names COMMA init
| variable
| init
;
variable: ID ;
init:
var_init
;
var_init : ID ASSIGN constant
;
statements:
statements statement
| statement
;
statement:
while_statement
| assigment SEMI
| CONTINUE SEMI
| BREAK SEMI
| ID INCR SEMI
| INCR ID SEMI
;
while_statement: WHILE LPAREN expression RPAREN tail
;
tail: LBRACE statements RBRACE
;
expression:
expression ADDOP expression
| expression MULOP expression
| expression DIVOP expression
| ID INCR
| INCR ID
| expression OROP expression
| expression ANDOP expression
| NOTOP expression
| expression EQUOP expression
| expression RELOP expression
| LPAREN expression RPAREN
| var_ref
| constant
| ADDOP constant %prec MINUS
;
constant:
ICONST
| FCONST
| CCONST
;
assigment: var_ref ASSIGN expression
;
var_ref: variable
| REFER variable
;
Until now the lexical analyzer (lexer) is responsible for validity of the tokens,token are supplied to the syntax analyzer.Then syntax analyzer combines these “valid” tokens to check if the syntax of the program is correct.
But parser can’t determine if:
- A token is valid
- A token was declared before being used
- A token was initialized before being used
- A operation performed on the specific token is allowed or not
These operations are accomplished by semantic analyzer. The program is correct.Semantic analysis also performs Type checking which recognizes type mismatch,identifier misuse.
Abstract syntax tree is used for representing the syntax of a programming language as a tree-like structure,which is similar to a parse tree.
Program will generate syntax tree from the tokens. Nodes are created for every production.
ast.h contain difinition of nodes and ast.c contain declation of nodes .
// declaration
int a;
int b=10;
char c;
float f=2.5
At initial stage variable are declared in program when new variable is found we add name of variable in list after that a declaration node is created in which identifier type and name of identifier is defined.
For production
- declaration : type names SEMI
- $$=new_ast_decl_node($1,names,nc)
- new_ast_decl_node () create a declaration node
- nc is stand for name count
declarations: declarations declaration | declaration ;
declaration: type { declare = 1; } names { declare = 0; } SEMI
{
int i;
$$ = new_ast_decl_node($1, names, nc);
nc = 0;
AST_Node_Decl *temp = (AST_Node_Decl*) $$;
for(i=0; i < temp->names_count; i++){
if(temp->names[i]->st_type == UNDEF){
set_type(temp->names[i]->st_name, temp->data_type, UNDEF);
}
else if(temp->names[i]->st_type == POINTER_TYPE){
set_type(temp->names[i]->st_name, POINTER_TYPE, temp->data_type);
}
else if(temp->names[i]->st_type == ARRAY_TYPE){
set_type(temp->names[i]->st_name, ARRAY_TYPE, temp->data_type);
}
}
ast_traversal($$); /* just for testing */
}
;
type: INT { $$ = INT_TYPE; }
| CHAR { $$ = CHAR_TYPE; }
| FLOAT { $$ = REAL_TYPE; }
| DOUBLE { $$ = REAL_TYPE; }
| VOID { $$ = VOID_TYPE; }
;
names: names COMMA variable
{
add_to_names($3);
}
| names COMMA init
{
add_to_names($3);
}
| variable
{
add_to_names($1);
}
| init
{
add_to_names($1);
}
;
add_to name function is used for making entry of Identifier whenever a new entry is found it is inserted in the list.
add_to_name(list_t *entry)
{
//first entry
If nc is 0{
nc=1;
names=(list_t**)malloc(1*size_of (list_t*));
names[0]=entry;
}
//general entry
else{
nc++;
names=(list_t **)realloc(names,nc*sizeof(list_t *));
names[nc-1]=entry;
}
}
Only declaration of a variable
- int a;
Semantic action for variable production
variable: ID { $$ = $1; };
Initialization of a variable during declaration
int a=10;
char c=’f’;
tem_no is associated with the identifier which is used for the generating 3 address code. When a constant value is found we create a constant node and add the type of constant in the constant node we create a constant node using AST_Node_Const .
init:
var_init { $$ = $1; }
;
var_init : ID ASSIGN constant
{
AST_Node_Const *temp = (AST_Node_Const*) $$;
$1->val = temp->val;
$1->st_type = temp->const_type;
$$ = $1;
AST_Node_Const *t=(AST_Node_Const *)$3;
AST_Node *t2=(AST_Node *)$3;
$1->tem_no=t2->tem_no;
}
;
An expression can be
-
Increment expression
-
Arithmetic expression node
-
Boolean expression node
-
Equality expression node
-
Relational expression node
-
var_ref , which is a variable with or without reference
-
An expression with parentheses ”()”.
All the arithmetic expressions are applied on two variables; they have two child (left child and right child) and an operator depending on their type which can be arithmetic,boolean,equality and relational. We create an arithmetic node which contain information of left child right child and operator type -
$$=new_ast_arithm_node(OP,$1,$3)
-
Op stands for operator $1 is left child and $3 is right child.
-
tem_no is used for creating three address code generation.
expression:
expression ADDOP expression
{
$$ = new_ast_arithm_node($2.ival, $1, $3);
AST_Node *temp=(AST_Node *)$$;
AST_Node *temp1=(AST_Node *)$1;
AST_Node *temp2=(AST_Node *)$3;
temp->tem_no=tem_no;
tem_no++;
}
| expression MULOP expression
{
$$ = new_ast_arithm_node(MUL, $1, $3);
AST_Node *temp=(AST_Node *)$$;
AST_Node *temp1=(AST_Node *)$1;
AST_Node *temp2=(AST_Node *)$3;
temp->tem_no=tem_no;
tem_no++;
}
| expression DIVOP expression
{
$$ = new_ast_arithm_node(DIV, $1, $3);
AST_Node *temp=(AST_Node *)$$;
AST_Node *temp1=(AST_Node *)$1;
AST_Node *temp2=(AST_Node *)$3;
temp->tem_no=tem_no;
tem_no++;
}
| ID INCR
{
/* increment */
if($2.ival == INC){
$$ = new_ast_incr_node($1, 0, 0);
}
else{
$$ = new_ast_incr_node($1, 1, 0);
}
}
| INCR ID
{
/* increment */
if($1.ival == INC){
$$ = new_ast_incr_node($2, 0, 1);
}
else{
$$ = new_ast_incr_node($2, 1, 1);
}
}
| expression OROP expression
{
$$ = new_ast_bool_node(OR, $1, $3);
}
| expression ANDOP expression
{
$$ = new_ast_bool_node(AND, $1, $3);
}
| NOTOP expression
{
$$ = new_ast_bool_node(NOT, $2, NULL);
}
| expression EQUOP expression
{
$$ = new_ast_equ_node($2.ival, $1, $3);
}
| expression RELOP expression
{
$$ = new_ast_rel_node($2.ival, $1, $3);
}
| LPAREN expression RPAREN
{
$$ = $2; /* just pass information */
}
| var_ref
{
$$ = $1; /* just pass information */
AST_Node_Ref *t=(AST_Node_Ref *)$1;
AST_Node *t2=(AST_Node *)$$;
t2->tem_no=(t->entry->tem_no);
}
| constant
{
$$ = $1; /* no sign */
AST_Node *t=(AST_Node *)$1;
t->tem_no=tem_no;
AST_Node *t2=(AST_Node *)$$;
t2->tem_no=tem_no;
AST_Node_Const *t3=(AST_Node_Const *)$1;
tem_no++;
}
| ADDOP constant %prec MINUS
{
/* plus sign error */
if($1.ival == ADD){
fprintf(stderr, "Error having plus as a sign!\n");
exit(1);
}
else{
AST_Node_Const *temp = (AST_Node_Const*) $2;
/* inverse value depending on the constant type */
switch(temp->const_type){
case INT_TYPE:
temp->val.ival *= -1;
break;
case REAL_TYPE:
temp->val.fval *= -1;
break;
case CHAR_TYPE:
/* sign before char error */
fprintf(stderr, "Error having sign before character constant!\n");
exit(1);
break;
}
$$ = (AST_Node*) temp;
}
}
var_ref: variable
{
$$ = new_ast_ref_node($1, 0); /* no reference */
}
| REFER variable
{
$$ = new_ast_ref_node($2, 1); /* reference */
}
;
assigment: var_ref ASSIGN expression
{
AST_Node_Ref *temp = (AST_Node_Ref*) $1;
$$ = new_ast_assign_node(temp->entry, temp->ref, $3);
AST_Node *t=(AST_Node *)$3;
}
;
int i=10;
I++;
++i;
For increment or decrement we create a increment node which contain the entry of variable and increment type 0 stand for increment and 1 stand for decrement.
statement:
while_statement
| assigment SEMI
| CONTINUE SEMI
| BREAK SEMI
| ID INCR SEMI
{
/* increment */
if($2.ival == INC){
$$ = new_ast_incr_node($1, 0, 0);
}
else{
$$ = new_ast_incr_node($1, 1, 0);
}
}
| INCR ID SEMI
{
/* increment */
if($1.ival == INC){
$$ = new_ast_incr_node($2, 0, 1);
}
else{
$$ = new_ast_incr_node($2, 1, 1);
}
}
;
There are two simple statements : continue and break and both represented by AST simple node,by only having difference in the statement type 0 stands for continue and 1 stands for break.
statement:while_statement
| assignment SEMI
| CONTINUE SEMI
{
$$ = new_ast_simple_node(0);
}
| BREAK SEMI
{
$$ = new_ast_simple_node(1);
}
| ID INCR SEMI
| INCR ID SEMI
;
In the assignment right part is an expression that will be simply passed as the assign_val node entry of the AST assign node.AST assign node store in ST entry ,assign value and ref.
Example:-
int i=5;
a=10;
a=i;
statement:
while_statement
{
$$ = $1; /* just pass information */
}
| assigment SEMI
{
$$ = $1; /* just pass information */
}
| CONTINUE SEMI
| BREAK SEMI
| ID INCR SEMI
| INCR ID SEMI
;
assignment: var_ref ASSIGN expression
{
AST_Node_Ref *temp = (AST_Node_Ref*) $1;
$$ = new_ast_assign_node(temp->entry, temp->ref, $3);
AST_Node *t=(AST_Node *)$3;
}
;
var_ref: variable
{
$$ = new_ast_ref_node($1, 0); /* no reference */
}
| REFER variable
{
$$ = new_ast_ref_node($2, 1); /* reference */
}
;
We only use While statement for loop while statement contain condition for loop execution and body during. while node contain the condition and while branch (tail)
- tail is used for production of while branch.
statement:
While_statement
{
$$ = $1; /* just pass information */
}
………………………………….
;
while_statement: WHILE LPAREN expression RPAREN tail
{
$$ = new_ast_while_node($3, $5, &loop_no);
};
tail: LBRACE statements RBRACE
{
$$ = $2; /* just pass information */
}
;