Building a Linux Shell [Part IV]

Written by MIMA | Published 2020/07/04
Tech Story Tags: programming | linux | shell | terminal | tutorial | c-tutorials | cli | hackernoon-top-story

TLDR The symbol table is a data structure used by compilers and interpreters to store variables as entries in the table. Each entry consists of a key (the variable's name) and an associated value (the value) The shell adds each variable (and its value) to the symbol table on startup. We can edit, remove, or export shell variables at will, using the proper builtin utilities (which we'll talk about later on in this series) All shells have some sort of symbol table implementation, although some shells might have different names for it.via the TL;DR App

This is part IV of a tutorial on how to build a Linux shell. You can read the previous parts of this tutorial from the following links: part I, part II, part III.
NOTE: You can download the complete source code for Part IV from this GitHub repository.

Introduction to Part IV

In this part, we're going to add symbol tables to our shell. The symbol table is a data structure that is used by compilers and interpreters to store variables as entries in the table. Each entry consists of a key (the variable's name) and an associated value (the variable's value). Keys are usually unique, that is, we can't have two entries that share the same key (i.e., can't have two variables that share the same variable name).
Typically, a Linux shell populates its symbol table on startup. After populating the symbol table, the compiler or interpreter can easily search for a variable in the table to retrieve that variable's value. We can also perform type checking, enforce scoping rules (e.g., to make a variable visible only to the function in which it was declared), and export shell variables to external commands.
To populate the symbol table, the shell reads the environment variables list, which is passed to the shell from its parent process (usually the process that logged the user in, or a child process of the login process). The shell adds each variable (and its value) to the symbol table. Afterwards, we can edit, remove, or export shell variables at will, using the proper builtin utilities (which we'll talk about later on in this series).

Why Do We Need a Symbol Table?

In short, the symbol table enables us to define shell variables, modify their values, use the values of different shell variables when we perform variable expansion, and export variables to external commands. The symbol table will also become handy when we talk about positional and special shell parameters later on in this series.
Whenever you ask the shell to echo, export, or unset the value of a shell variable, you are effectively asking the shell to access and/or modify its symbol table. All shells have some sort of symbol table implementation, although some shells might have different names for it.
For example, say you invoked the following command:
echo $PATH
Which should give you an output similar to this:
/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin
You probably know that the
echo
command had nothing to do with the output you see on the screen, except for the fact that
echo
printed the path out. It is the shell who actually understood that
$PATH
stands for a shell variable name.
It is also the shell who replaced the word
$PATH
with the actual path value, which was then passed to
echo
. The
echo
command simply echoed back the argument it was passed by the shell, which is the executable path you see printed on the screen.
So in order to be able to define, modify, unset, and export shell variables, we first need to implement our symbol table. Let's see how we can do that next.

Implementing the Symbol Table

There are different ways to implement a symbol table, the common ones being linked lists, hash tables, and binary search trees. There are pros and cons to each method, and we don't have the time, nor the space, to discuss each one in detail. For our purposes, we'll use linked lists, which are the easiest to implement, and which fair well in terms of speed of access and memory usage.
(NOTE: If you want to use the shell for anything other than learning, you should consider changing the symbol table implementation to one that uses hash tables or binary trees. You can find an example of the hash table implementation here).
Now let's get cracking on that code. In your source directory, create a subdirectory named
symtab
(invoke
mkdir symtab
from your terminal emulator). Navigate to that directory (
cd symtab
) and create a file named
symtab.h
. Add the following code to the header file you've just created:
#ifndef SYMTAB_H
#define SYMTAB_H

#include "../node.h"

#define MAX_SYMTAB	256

/* the type of a symbol table entry's value */
enum symbol_type_e
{
    SYM_STR ,
    SYM_FUNC,
};

/* the symbol table entry structure */
struct symtab_entry_s
{
    char     *name;
    enum      symbol_type_e val_type;
    char     *val;
    unsigned  int flags;
    struct    symtab_entry_s *next;
    struct    node_s *func_body;
};

/* the symbol table structure */
struct symtab_s
{
    int    level;
    struct symtab_entry_s *first, *last;
};

/* values for the flags field of struct symtab_entry_s */                       
#define FLAG_EXPORT (1 << 0) /* export entry to forked commands */

/* the symbol table stack structure */
struct symtab_stack_s
{
    int    symtab_count;
    struct symtab_s *symtab_list[MAX_SYMTAB];
    struct symtab_s *global_symtab, *local_symtab;
};


struct symtab_s       *new_symtab(int level);
struct symtab_s       *symtab_stack_push(void);
struct symtab_s       *symtab_stack_pop(void);
int rem_from_symtab(struct symtab_entry_s *entry, struct symtab_s *symtab);
struct symtab_entry_s *add_to_symtab(char *symbol);
struct symtab_entry_s *do_lookup(char *str, struct symtab_s *symtable);
struct symtab_entry_s *get_symtab_entry(char *str);
struct symtab_s       *get_local_symtab(void);
struct symtab_s       *get_global_symtab(void);
struct symtab_stack_s *get_symtab_stack(void);
void init_symtab(void);
void dump_local_symtab(void);
void free_symtab(struct symtab_s *symtab);
void symtab_entry_setval(struct symtab_entry_s *entry, char *val); 

#endif
The
symbol_type_e
enumeration defines the types of our symbol table entries. We'll use the type
SYM_STR
to represent shell variables, and
SYM_FUNC
to represent functions (we'll deal with shell functions later on in this series).
The
struct symtab_entry_s
structure represents our symbol table entries. The structure contains the following fields:
  • name
    => the name of the shell variable (or function) represented by this entry.
  • val_type
    =>
    SYM_STR
    for shell variables,
    SYM_FUNC
    for shell functions.
  • val
    => string value (for shell variables only).
  • flags
    => indicates the different properties we'll assign to variables and functions, such as the export and readonly flags (we'll deal with these later in this series).
  • next
    => pointer to the next symbol table entry (because we're implementing the table as a singly linked list).
  • func_body
    => for shell functions, the Abstract Syntax Tree, or AST, of the function body (we talked about AST's in part I of this tutorial).
The
struct symtab_s
structure represents a single symbol table. For starters, we'll use one symbol table, in which we'll define all our shell variables. Later on, when we discuss shell functions and begin working with script files, we'll need to define more symbol tables.
The zeroth symbol table will be the global table, in which we'll define our global variables (the ones that are accessible to the shell, and all functions and scripts executed by it).
Symbol tables number one and above are the local tables, in which we'll define our local variables (the ones that are only accessible to the shell function or script in which they were declared). By cascading symbol tables in this way, we are effectively implementing variable scoping.
Our
struct symtab_s
structure contains the following fields:
  • level
    => 0 for the global symbol table, 1 and above for local symbol tables.
  • first
    ,
    last
    => pointers to the first and last entries in the table's linked list, respectively.
Now to be able to cascade symbol tables as we discussed above, we need to define and implement a symbol table stack. A stack is a Last-In-First-Out, or LIFO, data structure, in which the last item added (or pushed) is the first item removed (or popped). The
struct symtab_stack_s
structure represents our symbol table stack. The structure contains the following fields:
  • symtab_count
    => the number of symbol tables currently on the stack.
  • symtab_list
    => an array of pointers to the stack's symbol tables. The zeroth item points to the global symbol table, and the
    symtab_count-1
    item points to the last (or local) symbol table. The stack can hold up to
    MAX_SYMTAB
    entries, which we defined at the beginning of the header file to be 256.
  • global_symtab
    ,
    local_symtab
    => pointers to the global and local symbol tables, respectively (for ease of access).
We'll implement the stack later in this lesson. For now, we'll start by writing the functions we'll need to work with symbol tables.

Symbol Table Functions

Create the
symtab.c
file (in the
symtab
subdirectory) and start by adding the following code:
#include <stdlib.h>
#include <stdio.h>
#include <string.h>
#include "../shell.h"
#include "../node.h"
#include "../parser.h"
#include "symtab.h"


struct symtab_stack_s symtab_stack;
int    symtab_level;


void init_symtab(void)
{
    symtab_stack.symtab_count = 1;
    symtab_level = 0;

    struct symtab_s *global_symtab = malloc(sizeof(struct symtab_s));

    if(!global_symtab)
    {
        fprintf(stderr, "fatal error: no memory for global symbol table\n");
        exit(EXIT_FAILURE);
    }

    memset(global_symtab, 0, sizeof(struct symtab_s));
    symtab_stack.global_symtab  = global_symtab;
    symtab_stack.local_symtab   = global_symtab;
    symtab_stack.symtab_list[0] = global_symtab;
    global_symtab->level        = 0;
}
First, we have two global variables:
  • symtab_stack
    => pointer to our symbol table stack (we only need one stack per shell).
  • symtab_level
    => our current level in the stack (0 if we're working with the global symbol table, non-zero otherwise).
The
init_symtab()
function initializes our symbol table stack, then allocates memory for, and initializes, our global symbol table.
Next, add the following function:
struct symtab_s *new_symtab(int level)
{
    struct symtab_s *symtab = malloc(sizeof(struct symtab_s));

    if(!symtab)
    {
        fprintf(stderr, "fatal error: no memory for new symbol table\n");
        exit(EXIT_FAILURE);
    }

    memset(symtab, 0, sizeof(struct symtab_s));
    symtab->level = level;
    return symtab;
}
We'll call the
new_symtab()
function whenever we want to create a new symbol table (for example, when we're about to execute a shell function).
Next, add the following function:
void free_symtab(struct symtab_s *symtab)
{
    if(symtab == NULL)
    {
        return;
    }

    struct symtab_entry_s *entry = symtab->first;

    while(entry)
    {
        if(entry->name)
        {
            free(entry->name);
        }

        if(entry->val)
        {
            free(entry->val);
        }

        if(entry->func_body)
        {
            free_node_tree(entry->func_body);
        }

        struct symtab_entry_s *next = entry->next;
        free(entry);
        entry = next;
    }

    free(symtab);
}
We'll call the
free_symtab()
function when we're done working with a symbol table, and we want to free the memory used by the symbol table and its entries.
Next, we'll define a debugging function:
void dump_local_symtab(void)
{
    struct symtab_s *symtab = symtab_stack.local_symtab;
    int i = 0;
    int indent = symtab->level * 4;

    fprintf(stderr, "%*sSymbol table [Level %d]:\r\n", indent, " ", symtab->level);
    fprintf(stderr, "%*s===========================\r\n", indent, " ");
    fprintf(stderr, "%*s  No               Symbol                    Val\r\n", indent, " ");
    fprintf(stderr, "%*s------ -------------------------------- ------------\r\n", indent, " ");

    struct symtab_entry_s *entry = symtab->first;

    while(entry)
    {
        fprintf(stderr, "%*s[%04d] %-32s '%s'\r\n", indent, " ",
                i++, entry->name, entry->val);
        entry = entry->next;
    }

    fprintf(stderr, "%*s------ -------------------------------- ------------\r\n", indent, " ");
}
This function prints the contents of the local symbol table. When our shell starts, the local and global symbol tables will refer to the same table. It is only when the shell is about to run a shell function or script file that we have a local table that is different from the global table. (later on in this lesson, we'll write a builtin utility that will call
dump_local_symtab()
to help us visualize the contents of our shell's global symbol table).
Now let's define some functions to help us work with symbol table entries. In the same file (
symtab.c
), add the following function:
struct symtab_entry_s *add_to_symtab(char *symbol)
{
    if(!symbol || symbol[0] == '\0')
    {
        return NULL;
    }

    struct symtab_s *st = symtab_stack.local_symtab;
    struct symtab_entry_s *entry = NULL;

    if((entry = do_lookup(symbol, st)))
    {
        return entry;
    }

    entry = malloc(sizeof(struct symtab_entry_s));

    if(!entry)
    {
        fprintf(stderr, "fatal error: no memory for new symbol table entry\n");
        exit(EXIT_FAILURE);
    }

    memset(entry, 0, sizeof(struct symtab_entry_s));
    entry->name = malloc(strlen(symbol)+1);

    if(!entry->name)
    {
        fprintf(stderr, "fatal error: no memory for new symbol table entry\n");
        exit(EXIT_FAILURE);
    }

    strcpy(entry->name, symbol);

    if(!st->first)
    {
        st->first      = entry;
        st->last       = entry;
    }
    else
    {
        st->last->next = entry;
        st->last       = entry;
    }

    return entry;
}
This function adds a new entry to the local symbol table. Remember, at the beginning of this lesson, I've said that each entry must have a unique key, which is the name we give to the shell variable or function. To ensure this uniqueness, we first check to see if an entry exists with the given name, by calling
do_lookup()
(which we'll define in a minute).
If an entry with the given name exists, we simply return the existing entry, without adding a new one. Otherwise, we add the entry, set its name, and adjust the symbol table's pointers. Lastly, we return the newly added entry.
The next function does the opposite work, i.e. it removes the symbol table entry whose key matches the given name:
int rem_from_symtab(struct symtab_entry_s *entry, struct symtab_s *symtab)
{
    int res = 0;
    if(entry->val)
    {
        free(entry->val);
    }

    if(entry->func_body)
    {
        free_node_tree(entry->func_body);
    }

    free(entry->name);

    if(symtab->first == entry)
    {
        symtab->first = symtab->first->next;

        if(symtab->last == entry)
        {
            symtab->last = NULL;
        }
        res = 1;
    }
    else
    {
        struct symtab_entry_s *e = symtab->first;
        struct symtab_entry_s *p = NULL;

        while(e && e != entry)
        {
            p = e;
            e = e->next;
        }

        if(e == entry)
        {
            p->next = entry->next;
            res = 1;
        }
    }

    free(entry);
    return res;
}
This function frees the memory used by the entry, and adjusts the linked list pointers to remove the entry from the symbol table.
To perform lookup (i.e., search for a variable with the given name), we need to define the following function in the same file:
struct symtab_entry_s *do_lookup(char *str, struct symtab_s *symtable)
{
    if(!str || !symtable)
    {
        return NULL;
    }

    struct symtab_entry_s *entry = symtable->first;

    while(entry)
    {
        if(strcmp(entry->name, str) == 0)
        {
            return entry;
        }
        entry = entry->next;
    }

    return NULL;
}
This function searches the given symbol table, starting with the first entry. If the entry's key matches the variable name we're looking for, the function returns the entry.
Otherwise, the function follows the linked list pointers to look at each entry, in turn, until we find an entry whose key matches our desired name. If no match is found, we return
NULL
.
Next, add the following function:
struct symtab_entry_s *get_symtab_entry(char *str)
{
    int i = symtab_stack.symtab_count-1;

    do
    {
        struct symtab_s *symtab = symtab_stack.symtab_list[i];
        struct symtab_entry_s *entry = do_lookup(str, symtab);

        if(entry)
        {
            return entry;
        }

    } while(--i >= 0);

    return NULL;
}
This function searches for a symbol table entry whose key matches the given name. This might seem redundant at first, as we've already defined the
do_lookup()
function to search the local symbol table. The difference here is that
get_symtab_entry()
searches the whole stack, starting with the local symbol table. At the moment, this distinction is not important, as our local and global symbol tables refer to the one and same table.
It is only when we talk about shell functions and script files that you'll appreciate what this function does (so hang tight in there!).
Lastly, add the following function:
void symtab_entry_setval(struct symtab_entry_s *entry, char *val)
{
    if(entry->val)
    {
        free(entry->val);
    }

    if(!val)
    {
        entry->val = NULL;
    }
    else
    {
        char *val2 = malloc(strlen(val)+1);

        if(val2)
        {
            strcpy(val2, val);
        }
        else
        {
            fprintf(stderr, "error: no memory for symbol table entry's value\n");
        }

        entry->val = val2;
    }
}
This function frees the memory used to store the old entry's value (if one exists). It then creates a copy of the new value and stores it in the symbol table entry.
That's it for the symbol table functions. Now let's write some functions to help us work with the symbol table stack.

Symbol Table Stack Functions

Add the following code to the same source file,
symtab.c
:
void symtab_stack_add(struct symtab_s *symtab)
{
    symtab_stack.symtab_list[symtab_stack.symtab_count++] = symtab;
    symtab_stack.local_symtab = symtab;
}


struct symtab_s *symtab_stack_push(void)
{
    struct symtab_s *st = new_symtab(++symtab_level);
    symtab_stack_add(st);
    return st;
}


struct symtab_s *symtab_stack_pop(void)
{
    if(symtab_stack.symtab_count == 0)
    {
        return NULL;
    }

    struct symtab_s *st = symtab_stack.symtab_list[symtab_stack.symtab_count-1];

    symtab_stack.symtab_list[--symtab_stack.symtab_count] = NULL;
    symtab_level--;

    if(symtab_stack.symtab_count == 0)
    {
        symtab_stack.local_symtab  = NULL;
        symtab_stack.global_symtab = NULL;
    }
    else
    {
        symtab_stack.local_symtab = symtab_stack.symtab_list[symtab_stack.symtab_count-1];
    }

    return st;
}


struct symtab_s *get_local_symtab(void)
{
    return symtab_stack.local_symtab;
}


struct symtab_s *get_global_symtab(void)
{
    return symtab_stack.global_symtab;
}


struct symtab_stack_s *get_symtab_stack(void)
{
    return &symtab_stack;
}
Here's a quick breakdown of the above functions:
  • symtab_stack_add()
    adds the given symbol table to the stack, and assigns the newly added table as the local symbol table.
  • symtab_stack_push()
    creates a new, empty symbol table and pushes it on top of the stack.
  • symtab_stack_pop()
    removes (or pops) the symbol table on top of the stack, adjusting the stack pointers as needed.
  • get_local_symtab()
    and
    get_global_symtab()
    return pointers to the local and global symbol tables, respectively.
  • get_symtab_stack()
    returns a pointer to the symbol table stack.

Initializing Our Symbol Table Stack

Remember at the beginning of this tutorial I told you that the shell needs to initialize its global symbol table and add variables from the environment list to the table? Well, in this part, we'll initialize the symbol table stack and the global symbol table. In the next part of this series, we'll read the environment variables list and add them to our symbol table.
Go ahead and create a source file named
initsh.c
in your source directory, and add the following code:
#include <string.h>
#include "shell.h"
#include "symtab/symtab.h"

extern char **environ;

void initsh(void)
{
    init_symtab();

    struct symtab_entry_s *entry;
    char **p2 = environ;

    while(*p2)
    {
        char *eq = strchr(*p2, '=');
        if(eq)
        {
            int len = eq-(*p2);
            char name[len+1];

            strncpy(name, *p2, len);
            name[len] = '\0';
            entry = add_to_symtab(name);

            if(entry)
            {
                symtab_entry_setval(entry, eq+1);
                entry->flags |= FLAG_EXPORT;
            }
        }
        else
        {
            entry = add_to_symtab(*p2);
        }
        p2++;
    }

    entry = add_to_symtab("PS1");
    symtab_entry_setval(entry, "$ ");

    entry = add_to_symtab("PS2");
    symtab_entry_setval(entry, "> ");
}
This function initializes the symbol table stack (including the global symbol table) and scans the environment list, adding each environment variable (and its value) to the table. Lastly, the function adds two variables that we'll use to store our prompt strings, PS1 and PS2 (we talked about prompt strings in part I). Don't forget to add the function prototype to your
shell.h
header file:
void initsh(void);
Next, we need to call this function from our
main()
function, before we enter the REPL loop. To do this, add the following line to
main()
, right before the loop body:
initsh();
The last thing we need to do is to update our prompt string printing functions, so that they'll use the PS1 and PS2 variables we've just added to the global symbol table in
initsh()
. We'll also write our first builtin utility,
dump
.
Let's apply these changes to our code next.

Updating the Prompt Printing Functions

Open your
prompt.c
file. Remove the line in the body of the
print_prompt1()
and replace it by the following code:
void print_prompt1(void)
{   
    struct symtab_entry_s *entry = get_symtab_entry("PS1");

    if(entry && entry->val)
    {
        fprintf(stderr, "%s", entry->val);
    }
    else
    {
        fprintf(stderr, "$ ");
    }
}
The new code checks if there is a symbol table entry with the name PS1. If there is, we use that entry's value to print the first prompt string. Otherwise, we use our default builtin value, which is
$ 
.
The code changes to the
print_prompt2()
function are similar, so I won't show it here, but you can check it at the GitHub repo link I provided at the top of this page.
Don't forget to add the following include directive at the top of the file, right after the
#include "shell.h"
line:
#include "symtab/symtab.h"

Defining our First Builtin Utility

A builtin utility (a.k.a. builtin or internal command) is a command whose code is compiled as part of the shell executable itself, i.e. the shell doesn't need to execute an external program, nor does it need to fork a new process in order to execute the command.
Many of the commands we use on daily basis, such as
cd
,
echo
,
export
, and
readonly
are in fact builtin utilities. You can read more about shell builtin utilities in this POSIX standard link.
In the course of this tutorial, we'll be adding different builtin utilities to expand our shell. We'll start by defining a structure that will help us store information about our different builtin utilities.
Open your
shell.h
header file and add the following code at the end, right before the
#endif
directive:
/* shell builtin utilities */
int dump(int argc, char **argv);

/* struct for builtin utilities */
struct builtin_s
{
    char *name;    /* utility name */
    int (*func)(int argc, char **argv); /* function to call to execute the utility */
};

/* the list of builtin utilities */
extern struct builtin_s builtins[];

/* and their count */
extern int builtins_count;
The function prototype declares our first builtin utility,
dump
. The
struct builtin_s
structure defines our builtin utilities, and has the following fields:
  • name
    => the builtin utility name, which we'll use to invoke the utility.
  • func
    => function pointer to the function that implements the builtin utility in our shell.
We'll use the
builtins[]
array to store information about our builtin utilities. The array contains
builtins_count
number of elements.
Now create a new subdirectory and name it
builtins
. This is where we'll define all of our builtin utilities. Next, create the file
builtins.c
and add the following code to it:
#include "../shell.h"

struct builtin_s builtins[] =
{   
    { "dump"    , dump       },
};

int builtins_count = sizeof(builtins)/sizeof(struct builtin_s);
Next, we'll add a builtin utility, named
dump
, which we'll use to dump or print the contents of the symbol table, so we know what's going on behind the scenes (I mainly wrote this utility so that our discussion doesn't sound too abstract and theoretical).
In the
builtins
subdirectory, create the file
dump.c
and add the following code to it:
#include "../shell.h"
#include "../symtab/symtab.h"

int dump(int argc, char **argv)
{
    dump_local_symtab();
    return 0;
}
Simple, right? This function implements our
dump
builtin utility, which prints the contents of the local symbol table.

Updating the Executor

Next, we need to tell our executor about our new builtin utility, so that when we enter the command
dump
, the shell executes our
dump()
function, instead of searching for an external command with the name dump.
To do this, we'll need to fix our
do_simple_command()
function.
In the source directory, open the
executor.c
source file and navigate to the
do_simple_command()
function's definition. Now find the two lines:
argv[argc] = NULL;
pid_t child_pid = 0;
Insert a new line between those two lines and enter the following code:
    int i = 0;
    for( ; i < builtins_count; i++)
    {
        if(strcmp(argv[0], builtins[i].name) == 0)
        {
            builtins[i].func(argc, argv);
            free_argv(argc, argv);
            return 1;
        }
    }
Now our shell will check to see if the name of the command it’s about to execute is the name of a builtin utility and, if so, it calls the function that implements the builtin utility and returns.
And that's it! Now let's compile and test our shell.

Compiling the Shell

Let’s compile our shell. Open your favorite terminal emulator, navigate to your source directory and make sure you have 14 files and 2 subdirectories in there:
Now compile the shell using the following command:
gcc -o shell executor.c initsh.c main.c node.c parser.c prompt.c scanner.c source.c builtins/builtins.c builtins/dump.c symtab/symtab.c
If everything goes well,
gcc
should not output anything, and there should be an executable file named
shell
in the current directory:
(If you downloaded the source files from the GitHub repo, you can simply run
make
from the source directory, and it will take care of compiling the shell).
Now invoke the shell by running
./shell
, and try our new builtin utility,
dump
:
As you can see from the output, our global symbol table (level 0) contains many variables (67 in the above example), which correspond to the shell’s environment variables that were passed to us from our parent process. The last two entries represent our PS1 and PS2 variables. Try entering some other commands and see how our shell fairs in parsing and executing simple commands.

What's Next

It might not be obvious right now, but what we’ve done in this part is a huge accomplishment. Our symbol table will enable us to do important things in our shell, such as performing word expansions, exporting variables to external commands, defining shell functions, and so on.
In the next part, we’ll talk about the word expansion process, and how we can add word expansion capabilities to our shell.
Stay tuned!

Written by MIMA | GNU/Linux system administrator and programmer
Published by HackerNoon on 2020/07/04