Building a Linux Shell [Part V]

This is part V of a tutorial on how to build a Linux shell. You can read the previous parts of this tutorial from the following links: , , , . part I part II part III part IV : You can download the complete source code for Part V from . NOTE this GitHub repository Introduction to Part V As we've seen in the previous parts, a simple command consists of one or more , also known as . The first word contains the name of the command we want to execute, while the rest of the words contain the command's arguments. arguments words Before the shell executes a command, it performs on the command words. is the process by which the shell takes a command word, checks it to see if it contains variable names, pathnames, commands, and arithmetic expressions, and substitutes each name/command/expression with its value. The resultant word, which is usually (but not always) longer than the original word, might be broken down into one or more subwords (or ), in a process known as . word expansion Word expansion fields field splitting In this part, we'll implement the 7 word expansions defined by POSIX, which are: , , , , , , and . There are other word expansions, such as and , which are not defined by POSIX, and which we won't be discussing here. After finishing this lesson, it would be a good exercise if you extended the shell by implementing the non-POSIX word expansions. tilde expansion parameter expansion arithmetic expansion command substitution field splitting pathname expansion quote removal brace expansion process substitution Note About this Lesson's Code The word expansion process is very complex, and we'll have to implement a host of functions in order to perform word expansion in our shell. As such, we won't be able to discuss each and every function in detail. Instead, we'll have a general overview of each word expansion and the functions we need to process that expansion. At the end of each section, you'll find a link to our , where you'll be able to read the whole function's code at your own pace. GitHub repository After we finish discussing the individual word expansion functions, we'll write the main function that will tie everything together, then we'll update our existing shell's code to make our shell use the word expansion functions we've just implemented. To make the code easier to read, I've defined all our word expansion functions in the source file , which you'll find in (the pattern matching functions are defined in , which you can read in ). wordexp.c this link pattern.c this link Now let's get started. The Word Expansion Process When the shell performs word expansion, it checks every word in the command line to see if it contains possible word expansions. Expansions can appear anywhere in the word: at the beginning, in the middle, or at the end of the word. An expansion might also include the whole word. Word expansions are preceded by a sign. The characters following the sign indicate the type of expansion to be performed by the shell. These characters are interpreted by the shell as follows: $ $ One or more digits, which indicate variable expansion of a (we'll discuss these in a later lesson in this tutorial). positional parameter One of , , , , , , , or , which indicate variable expansion of a (we'll discuss these in a later lesson in this tutorial). @ * # ? - $ ! 0 special parameter One or more alphanumeric characters and/or underscores, starting with a letter or underscore, which indicates a shell variable name. A variable name surrounded by curly brackets and . { } An arithmetic expansion, surrounded by and . ( ) A command substitution, surrounded by and . (( )) The shell performs tilde expansion, parameter expansion, command substitution, and arithmetic expansion first, which is then followed by field splitting and pathname expansion. Lastly, the shell removes any quote characters from the expanded word(s) that have been part of the original word. Working with Words When the shell performs word expansion, the process can result in zero, one, or more words. We'll use a special structure to represent those words, which we'll define in our header file : shell.h *data; len; }; { struct word_s char int * ; struct word_s next The structure contains the following fields: => the string representing this word. data => the length of the field. len data => pointer to the next word, or if this is the last word (we'll use a linked list to represent our expanded words). next NULL Of course, we'll need some functions to allocate and free our structures. For this, we'll use the following functions: struct word_s ; ; struct word_s * make_word ( *str) char void free_all_words (struct word_s *first) The first function allocates memory for the structure, creates a copy of the word string, and returns the newly allocated word. The second function frees the memory used by a list of word structures. You can read the functions' code in our . wordexp.c source file Defining Some Helper Functions As we've said earlier, word expansion is a complex process, for which we need to define a host of different functions. Before we dive into the details of word expansion, let's first define some helper functions. The following list shows the function prototypes for our helper functions, all of which we'll define in the : wordexp.c source file ; ; ; find_closing_quote( *data); find_closing_brace( *data); ; ; * char wordlist_to_str (struct word_s *word) void delete_char_at ( *str, index) char size_t int is_name ( *str) char size_t char size_t char * char substitute_str ( *s1, *s2, start, end) char char size_t size_t int substitute_word ( **pstart, **p, len, *(func)( *), add_quotes) char char size_t char char int Here's a breakdown of what these functions do: => converts a linked list of expanded words to a single string. wordlist_to_str() => removes the character at the given index from a string. delete_char_at() => checks if a string represents a valid variable name (refer back to section above). is_name() The Word Expansion Process => when a word expansion contains an opening quote character ( , , or ), we need to find the matching closing quote character that encloses the quoted string (more on quoting below). This function returns the zero-based index of the closing quote character in the word. find_closing_quote() " ' ` => similar to the above, except that it finds the matching closing brace. That is, if the opening brace is , , or , this function returns the zero-based index of the matching , , or character, respectively. Finding pairs of quotes is important for processing parameter expansion, arithmetic expansion, and command substitution. find_closing_brace() { ( [ } ) ] => substitutes the substring of , from the character at position to that at position (both positions are zero-based), with the string. This is useful when the word expansion is part of a longer word, such as , in which case we need only expand , then attach to the end of the expanded string. substitute_str() s1 start end s2 ${PATH}/ls ${PATH} /ls => a helper function that calls the other word expansion functions, which we'll define in the following sections. substitute_word() Additionally, we'll define some functions to help us work with strings. We'll define all of these functions in the : strings.c source file ; ; ; ; * char strchr_any ( * , *chars) char string char * char quote_val ( *val, add_quotes) char int int check_buffer_bounds ( *count, *len, ***buf) int int char void free_buffer ( len, **buf) int char Here's what these functions do: => similar to , except that it searches the given string for any of the given characters. strchr_any() strchr() => does the reverse of quote removal, that is, converts a string to a quoted string (this might seem trivial at first, but quoting might become complicated when we have to escape quotes and backslash characters, for example). quote_val() The and functions will allow our backend executor to support variable number of command arguments, instead of the hard limit we've set in , which was 255. check_buffer_bounds() free_buffer() Part II Now let's write the functions to handle each type of word expansion. Tilde Expansion During , the shell substitutes a tilde character (followed by an optional username) with the pathname of the user's home directory. For example, and are tilde-expanded to the current user's home directory, while is tilde-expanded to user John's home directory, and so on. The tilde character, in addition to all the following characters up to the first unquoted forward slash, are known as the (we'll deal with quoting later on in this lession). tilde expansion ~ ~/ ~john tilde prefix To perform tilde expansion, we'll define the function, which has the following prototype: tilde_expand() ; * char tilde_expand ( *s) char The function accepts a single argument: the tilde prefix we want to expand. If the expansion succeeds, the function returns an 'd string representing the tilde-expanded prefix. Otherwise, it returns . Here's a quick breakdown of what the function does in order to expand a tilde prefix: malloc NULL If the prefix is , get the value of the shell variable. If is defined and is not , return its value. Otherwise, get the current ( ) by calling , and pass the UID to to get the corresponding to the current user. The field of the password database entry contains the pathname of the home directory, which the function returns. ~ $HOME $HOME NULL User ID UID getuid() getpwuid() password database entry pw_dir If the prefix contains other characters (in addition to the leading ), we take these letters as the name of the user whose home directory we want to get. We call , passing it the username, and return the value of the field. ~ getpwnam() pw_dir If we can't retrieve the home directory (for example, if the username is invalid), we return . Otherwise, we return an 'd copy of the home directory path. NULL malloc Take a minute to read the function's code in our . tilde_expand() GitHub repository Parameter Expansion In , the shell replaces the name of a shell variable with the variable's value (hence the other name, ). Parameter Expansion is what allows the shell to execute a command such as . In this example, the shell performs parameter expansion on the variable, replacing it with the actual executable path (something like ). What the command sees is not the word, but its expanded value (which can be , of course). parameter expansion variable expansion echo $PATH $PATH /bin:/sbin:/usr/bin:/usr/sbin echo $PATH NULL To signal to the shell that we want to expand a shell variable, we precede the variable name with a sign. That is, to expand the , , and variables, we need to pass the , , and words to the shell, respectively (alternatively, we can pass these variable expansions to shell shell as , , and ). Shell variable names can contain any combination of letters, digits, and underscores. Names can contain capital or small letters although, by convention, capitalized names are reserved for standard shell variables, such as those . $ PATH USER SHELL $PATH $USER $SHELL ${PATH} ${USER} ${SHELL} defined by POSIX We can control how the shell performs parameter expansion by using a parameter expansion modifier, which tells the shell which part of the value we want expanded, as well as what to do if there is no shell variable with the given name. The table below summarizes the parameter expansion modifiers (the ones are marked by the POSIX word in the column). Most shells support other modifiers (found in the bottom half of the table), which we won't discuss here. Refer to your shell's manual page for more information on the non-POSIX modifiers. defined by POSIX Description To perform parameter (or variable) expansion, we'll define the function, which has the following prototype: var_expand() ; * char var_expand ( *orig_var_name) char The function accepts a single argument: the parameter (i.e. the variable name) we want to expand. If the expansion succeeds, the function returns an 'd string containing the expanded value. Otherwise, it returns . Here's a quick breakdown of what the function does in order to expand a variable name to get its value: malloc NULL If the variable name is surrounded by curly braces (for example, ), remove the braces, as they are not part of the variable name itself. ${PATH} If the name begins with , we need to get the variable name's length (the expansion, which is seen in the 5th row in the table above). # ${#parameter} If the variable name contains a colon, we'll use it to separate the name from the (optional) word or pattern. The word or pattern is used as indicated in the table above. Get the symbol table entry with the given variable name (we implemented the symbol table stack in ). Get the symbol table entry's value. Part IV If the value is empty or null, use the alternative word supplied in the expansion, if any. If the value isn't empty, use the value as the expanded result. To enable the shell to perform pattern matching (the and expansions), we'll need two helper functions: and . We won't discuss these functions here, but you can read their code from . ${parameter#word} ${parameter%word} match_suffix() match_prefix() this link If the expansion modifier is , we need to set the value of the symbol table entry to the value we've just expanded. ${parameter:=word} If the expansion starts with , get the length of the expanded value and use it as the final result. # Return an 'd copy of the expanded value, or its length (in string format), as appropriate. malloc Take a minute to read the function's code in our . var_expand() GitHub repository Command Substitution In , the shell forks a process to run a command, then replaces the command substitution expansion with the command's output. For example, in the following loop: command substitution for i in $(ls); do echo $i; done the shell forks a process, in which the command is run. The output of this command is the list of files in the current directory. The shell takes that output, splits it into a list of words, then feeds those words, one at a time, to the loop. In each iteration of the loop, variable is assigned the name of the next file from the list. ls $i This name is passed to the command, which outputs the name on a separate line (Practically, it would be better to execute directly, but this example is just to show how command substitution can be used in the shell). echo ls Command substitution can be written as , or . To perform command substitution, we'll define the function, which has the following prototype: $(command) `command` command_substitute() ; * char command_substitute ( *orig_cmd) char The function accepts a single argument: the command we want to execute. If the expansion succeeds, the function returns an 'd string representing the command's output. If the expansion fails, or if the command outputs nothing, the function returns . malloc NULL Here's a quick breakdown of what the function does in order to expand a command substitution: Depending on the format used, we start by removing the or the backquotes . This leaves us with the command we need to execute. $() `` Call to create a pipe. We pass the command to be executed to , and we get a pointer to a stream from which we'll read the command's output. popen() popen() FILE Call to read the command's output from the pipe. Store the string read in a buffer. fread() Remove any trailing newline characters. Close the pipe and return the buffer with the command output. Take a minute to read the function's code in our . command_substitute() GitHub repository Arithmetic Expansion Using , we can let the shell perform different arithmetic operations and use the result in performing other commands. While POSIX requires the shell to only support signed long integer arithmetic, many shells (e.g. and ) support floating-point arithmetic. arithmetic expansion ksh zsh Also, the shell is not required to support any mathematical functions, although most shells do. For our simple shell, we'll only support signed long integer arithmetic with no math function support. Arithmetic expansion is written as . $((expression)) To perform the expansion, we'll define the function, which has the following prototype: arithm_expand() ; * char arithm_expand ( *expr) char The function receives a string containing the arithmetic expression, performs the necessary calculations, and returns the result in the form of an 'd string (or in case of error, e.g. an invalid arithmetic operation). The function and its associated helper functions are complex and lengthy, but here's the main highlights: arithm_expand() malloc NULL The arithmetic expression is converted to the (RPN), which is easier to parse and calculate. An RPN consists of a series of arithmetic operations, where the operator follows (i.e. comes after) its operands. For example, the RPN of is , and that of is . Reverse Polish Notation x - y x y - 3 + 4 × (2 − 1) 3 4 2 1 − × + During the conversion, arithmetic operators are pushed on an operator stack, from which we'll pop each operator and perform its operation later on. Similarly, operands are added to their own stack. Operators are popped off the stack, one at a time, and the operator is checked. One or two operands are popped off the stack, depending on the type of operator. The rules that govern this process are those of the shunting yard algorithm, which you can read about . here The result is converted to a string, which is returned to the caller. Take a minute to read the function's code in our . arithm_expand() GitHub repository Field Splitting During , the shell takes the result(s) of parameter expansion, command substitution, and arithmetic expansion, and splits them into one or more , which we call (hence the name, ). The process depends on the value of the shell variable. is a historical term that stands for (or ) , and it originates from the time when Unix shells didn't have a builtin array type. field splitting parts fields field splitting $IFS IFS Internal Input Field Separators As a workaround, early Unix shells had to find another way of representing multi-member arrays. The shell would join the array members in a single string, separated by spaces. When the shell needs to retrieve the array members, it breaks down the string into one or more fields. The variable tells the shell where exactly to break that string. $IFS The shell interprets characters as follows (this description is ): $IFS taken from POSIX If the value of is a , , and , or if the variable is not set, any sequence of , , or characters at the beginning or end of the input shall be ignored and any sequence of those characters within the input shall delimit a field. $IFS space tab newline space tab newline If the value of is null, no field splitting shall be performed. $IFS Otherwise, the following rules shall be applied in sequence: (a) white space shall be ignored at the beginning and end of the input. (b) Each occurrence in the input of an character that is not white space, along with any adjacent white space, shall delimit a field, as described previously. (c) Non-zero-length white space shall delimit a field. $IFS $IFS $IFS $IFS $IFS To perform the expansion, we'll define the function, which has the following prototype: field_split() ; struct word_s * field_split ( *str) char Take a minute to read the function's code in our . field_split() GitHub repository Pathname Expansion During (also known as ), the shell matches one or more file names with a given pattern. The pattern can contain normal characters (which are matched to themselves), in addition to the special characters , , and , which are also known as the . pathname expansion filename globbing * ? [] globbing characters The asterisk matches any length of characters (including zero characters), the matches one character, and the brackets introduce a . The result of the expansion is the list of files whose names match the pattern. * ? regular expression (RE) bracket expression To perform the expansion, we'll define the function, which has the following prototype: pathnames_expand() ; struct word_s * pathnames_expand (struct word_s *words) This function accepts a single argument: pointer to the first word in the linked list of words we want to pathname-expand. For each word, the function does the following: Check if the word contains any globbing characters ( , , and ), by calling the helper function , which we'll define in the source file . If the word contains globbing characters, we treat it as the pattern we need to match; otherwise, we move to the next word (if any). * ? [] has_glob_chars() pattern.c Get the list of files whose names match the pattern, excluding the special names and (which point to the current directory and the parent directory, respectively). We delegate pattern matching to another helper function, , which we'll define in the same source file . . .. get_filename_matches() pattern.c Add the matched file names to the final list. Move on to the next word and loop. Return the list of the filenames that matched all the given words. Take a minute to read the function's code in our . pathnames_expand() GitHub repository Quote Removal The final step in the word expansion process is . Quoting is used to remove the special meaning of certain characters to the shell. The shell treats some characters, such as the backslash and the quotes in a special way. To inhibit this behavior, we need to quote those characters to force the shell to treat them as normal characters (you can read more about quoting in ). quote removal this link We can quote characters in one of three way: using the backslash, single quotes, or double quotes. The backslash character is used to preserve the literal meaning (that is, to escape) the character following the backslash. This is similar to the way we escape characters in the C language. Single quotes preserve the literal meaning of all characters inside the quotes, that is, the shell doesn't attempt word expansion on single-quoted strings. Double quotes as similar to single quotes, except that the shell recognizes the special meaning of the backquote, backslash, and sign. That is, the shell can perform word expansion inside double-quoted strings. $ To perform quote removal, we'll define the function, which has the following prototype: remove_quotes() ; void remove_quotes (struct word_s *wordlist) Take a minute to read the function's code in our . remove_quotes() GitHub repository Putting It All Together Now that we have our word expansion functions, its time to tie it all together. In this section, we'll write our main word expansion function, the one we'll call to perform word expansion. This function, in turn, will call the other functions to perform the various steps of word expansion. Our main function is , which we'll define in the source file : word_expand() wordexp.c ; struct word_s * word_expand ( *orig_word) char Here's what the function does in order to perform word expansion on the word we pass as its sole argument: Create a copy of the original word. We'll perform our word expansions on this copy, so that if anything goes wrong, we'll have our original word intact. Scan the word, character by character, looking for the special characters , , , , , , and . ~ " ' ` = \ $ If one of the above characters is found, call , which in turn will call the appropriate word expansion function. substitute_word() Skip any characters that has no special meaning. After finishing with the word expansion, perform field splitting by calling . field_split() Perform pathname expansion by calling . pathnames_expand() Perform quote removal by calling . remove_quotes() Returns the list of expanded words. Take a minute to read the function's code in our . word_expand() GitHub repository Updating the Scanner In of this tutorial, we wrote our function, which we used to get input tokens. So far, our function doesn't know how to handle quoted strings and escaped characters. To add this functionality, we need to update our code. Open the file and add the following code to the function, right after the statement's opening brace: Part II tokenize() tokenize() scanner.c tokenize() switch : : : add_to_buf(nc); i = find_closing_quote(src->buffer+src->curpos); (!i) { src->curpos = src->bufsize; ( , , nc); &eof_token; } (i--) { add_to_buf(next_char(src)); } ; : nc2 = next_char(src); (nc2 == ) { ; } add_to_buf(nc); (nc2 > ) { add_to_buf(nc2); } ; : add_to_buf(nc); nc = peek_char(src); (nc == || nc == ) { i = find_closing_brace(src->buffer+src->curpos+ ); (!i) { src->curpos = src->bufsize; ( , , nc); &eof_token; } (i--) { add_to_buf(next_char(src)); } } ( (nc) || nc == || nc == || nc == || nc == || nc == || nc == ) { add_to_buf(next_char(src)); } ; case '"' case '\'' case '`' if fprintf stderr "error: missing closing quote '%c'\n" return while break case '\\' if '\n' break if 0 break case '$' if '{' '(' 1 if fprintf stderr "error: missing closing brace '%c'\n" return while else if isalnum '*' '@' '#' '!' '?' '$' break Now our lexical scanner knows how to recognize and skip over quoted strings, escaped characters, and other word expansion constructs. Check out the updated lexical scanner code in . this link Updating the Executor Lastly, we need to update our backend executor so that it can: perform word expansion on the arguments of a command before executing that command. support more than 255 arguments per command (we had this limit set in ). Part II Open the file, navigate to the beginning of the function and find the following lines (I reduced the loop to to preserve space): executor.c do_simple_command() ... argc = ; max_args = ; *argv[max_args+ ]; *str; (child) { ... } argv[argc] = ; int 0 long 255 char 1 char while NULL and replace them with the following code: argc = ; targc = ; **argv = ; *str; (child) { str = child->val.str; (!w) { child = child->next_sibling; ; } (w2) { (check_buffer_bounds(&argc, &targc, &argv)) { str = ( (w2->data)+ ); (str) { (str, w2->data); argv[argc++] = str; } } w2 = w2->next; } free_all_words(w); child = child->next_sibling; } (check_buffer_bounds(&argc, &targc, &argv)) { argv[argc] = ; } int 0 int 0 char NULL char while * = ( ); struct word_s w word_expand str if continue * = ; struct word_s w2 w while if malloc strlen 1 if strcpy if NULL With this code, the executor calls on each command argument and adds the expanded word(s) to the arguments list, which we'll eventually pass on to the command. The list can grow as much as needed, thanks to our function, which allocates memory to the buffer as needed. word_expand() check_buffer_bounds() All that remains now is to free our arguments list once we've finished executing the command, before we return to the caller. We do this by calling: free_buffer(argc, argv); in three different places: after executing a builtin utility, if returns error status (which means we can't execute an external command), and after has returned. Check out the updated executor code in . fork() waitpid() this link Compiling the Shell Let’s compile our shell. Open your favorite terminal emulator, navigate to your source directory and make sure you have 19 files and 2 subdirectories in there: Now compile the shell using the following command (you need to download the source project from our ): GitHub repository make If everything goes well, should not output anything (apart from a few harmless warnings), and there should be an executable file named in the current directory: gcc shell Now invoke the shell by running , and try using our word expansion functions and checking the results: ./shell Makefile build builtins executor.c executor.h initsh.c main.c node.c node.h parser.c parser.h pattern.c prompt.c scanner.c scanner.h shell shell.h shunt.c source.c source.h strings.c symtab wordexp.c * /home/user /home/user/Downloads value value 9 $ * echo $ echo '*' $ ~ echo $ ~/Downloads echo $ echo ${A=value} $ echo $A $ $((2+7)) echo And that's it for today. Our shell can now handle all kinds of word expansions (the ones defined by POSIX, to be specific). Toy around with the shell and see what results you can get from the different types of word expansion. Compare the results with the ones you get from your default shell. What's Next We've taken a huge leap in this lesson, with lots of code, most of which we didn't have the time or space to examine in detail. You might want to spend some time reading through the code in our GitHub repository to get yourself familiar with the word expansion process. In the next part, we’ll talk about and , what they are, why we need them, and how to implement them in our shell. positional parameters special parameters Stay tuned! Previously published at https://medium.com/@mohammedisam2000/building-a-linux-shell-part-v-9cf3c0e31269