How the Lexer Works
What is Lexical Analysis?
Lexical analysis, or tokenization, is the first phase of shell command processing. Think of it as reading a sentence and identifying individual words, punctuation, and meaning - but for shell commands instead of natural language.
When you type `echo "Hello World" | grep Hello > output.txt`, the lexer doesn't see this as one long string. Instead, it breaks it down into meaningful pieces:
| Input Character Sequence | Token Type | Token Value | Meaning |
|---|---|---|---|
| echo | TOKEN_WORD | "echo" | Command to execute |
| "Hello World" | TOKEN_WORD | "Hello World" | Quoted argument |
| \| | TOKEN_PIPE | "\|" | Pipe operator |
| grep | TOKEN_WORD | "grep" | Next command |
| Hello | TOKEN_WORD | "Hello" | Argument to grep |
| > | TOKEN_REDIRECT_OUT | ">" | Output redirection |
| output.txt | TOKEN_WORD | "output.txt" | Target file |
Token Types and Recognition
Our Minishell lexer recognizes several distinct types of tokens, each serving a specific purpose in shell command processing:
- 🔤 TOKEN_WORD
Purpose: Represents commands, arguments, filenames, and any text that isn't a special operator.
Examples:
- `echo` → Command name
- `file.txt` → Filename
- `"quoted string"` → Quoted argument (quotes removed)
- `'literal text'` → Single-quoted text (quotes removed)
- `$USER` → Variable (will be expanded later)
- 🔗 TOKEN_PIPE
Symbol: `|`
Purpose: Connects commands in a pipeline, passing the output of one command as input to the next.

```bash
# Example: count lines in all .c files
find . -name "*.c" | wc -l
#                   ^
#                   TOKEN_PIPE
```

- 📥 TOKEN_REDIRECT_IN
Symbol: `<`
Purpose: Redirects input from a file to a command.

```bash
# Read input from file.txt
sort < file.txt
#    ^
#    TOKEN_REDIRECT_IN
```

- 📤 TOKEN_REDIRECT_OUT
Symbol: `>`
Purpose: Redirects command output to a file (overwrites existing content).

```bash
# Write output to file.txt
echo "Hello" > file.txt
#            ^
#            TOKEN_REDIRECT_OUT
```

- 📝 TOKEN_REDIRECT_APPEND
Symbol: `>>`
Purpose: Redirects command output to a file (appends to existing content).

```bash
# Append output to log.txt
echo "New entry" >> log.txt
#                ^
#                TOKEN_REDIRECT_APPEND
```

- 📋 TOKEN_REDIRECT_HEREDOC
Symbol: `<<`
Purpose: Starts a here-document, allowing multi-line input until a delimiter is reached.

```bash
# Here-document example
cat << EOF
#   ^
#   TOKEN_REDIRECT_HEREDOC
Line 1
Line 2
EOF
```
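Taken together, the token types above map naturally onto a small enum. The following is only a sketch; the enum name and member order are assumptions and may differ from the actual project header:

```c
// Hypothetical token-type enum matching the tokens described above
typedef enum e_token_type
{
	TOKEN_WORD,             // commands, arguments, filenames, quoted text
	TOKEN_PIPE,             // |
	TOKEN_REDIRECT_IN,      // <
	TOKEN_REDIRECT_OUT,     // >
	TOKEN_REDIRECT_APPEND,  // >>
	TOKEN_REDIRECT_HEREDOC  // <<
}	t_token_type;
```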
The Finite State Machine
The heart of our lexer is a finite state machine (FSM) that processes input character by character. Each state represents a different context for interpreting characters, and transitions between states occur based on the current character and lexer rules.
Lexer States
LEXER_NONE - Initial state, no token being built
In this state, the lexer scans for the start of new tokens. It skips whitespace and determines what type of token to begin based on the current character.
LEXER_WORD - Building a regular word token
Accumulates characters that aren't whitespace, quotes, or special operators into a word token. Transitions out when encountering delimiters.
LEXER_UNI - Inside single quotes
Treats all characters literally until the closing single quote. No variable expansion or special character interpretation occurs.
LEXER_DUO - Inside double quotes
Similar to single quotes but allows variable expansion (though this happens in later processing phases, not during lexical analysis).
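These states can be captured in a small enum alongside the lexer's working data. The struct below is a minimal sketch, assuming the fields referenced later in this article (text, pos, start, state); the real definition in the codebase may differ:

```c
// Sketch of the lexer's FSM data, assuming the fields used throughout this article
typedef enum e_lexer_state
{
	LEXER_NONE,  // scanning for the start of the next token
	LEXER_WORD,  // accumulating an unquoted word
	LEXER_UNI,   // inside single quotes (literal)
	LEXER_DUO    // inside double quotes (expansion deferred)
}	t_lexer_state;

typedef struct s_lexer
{
	const char		*text;   // the full input line
	size_t			pos;     // current character index
	size_t			start;   // index where the current token began
	t_lexer_state	state;   // current FSM state
}	t_lexer;
```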
Character-by-Character Processing
Let's trace through the lexer's operation on a concrete example to understand how state transitions and token creation work in practice.
Example: Processing echo "Hello World" | grep Hello
Initialize: State = LEXER_NONE, Position = 0
```
Input: echo "Hello World" | grep Hello
       ^ Position 0, State: LEXER_NONE
```
Characters 0-3 ('echo'):
Char 'e': Not whitespace/quote/special → State = LEXER_WORD, Start = 0
Chars 'c', 'h', 'o': Continue in LEXER_WORD state
Character 4 (space):
Whitespace encountered → End current token
Create TOKEN_WORD with value "echo"
State = LEXER_NONE
Character 5 ('\"'):
Double quote → State = LEXER_DUO
Start = 6 (position after opening quote)
Characters 6-16 ('Hello World'):
All characters including space treated literally
Continue in LEXER_DUO state
Character 17 ('\"'):
Closing quote → End token with value "Hello World"
Create TOKEN_WORD, State = LEXER_NONE
Character 19 ('|'):
Special character → Create TOKEN_PIPE immediately
No state change needed for single-char operators
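The walkthrough above corresponds to a single loop over the input. The sketch below shows the shape of that loop; it assumes the t_lexer fields sketched earlier, a `tokens` list accumulator, and hypothetical helpers (emit_word, handle_special, SPECIAL_CHARS) that are not the project's exact API:

```c
// Minimal sketch of the FSM loop described above (helper names are hypothetical)
while (lexer->text[lexer->pos] != '\0')
{
	char c = lexer->text[lexer->pos];

	if (lexer->state == LEXER_NONE)
	{
		if (c == '\'' || c == '"')
		{
			lexer->state = (c == '\'') ? LEXER_UNI : LEXER_DUO;
			lexer->start = lexer->pos + 1;      // token starts after the opening quote
		}
		else if (ft_strchr(SPECIAL_CHARS, c))
		{
			handle_special(lexer, &tokens);     // emits TOKEN_PIPE / redirections
			continue;                           // and advances pos by 1 or 2 itself
		}
		else if (c != ' ' && c != '\t')
		{
			lexer->state = LEXER_WORD;          // start accumulating a word
			lexer->start = lexer->pos;
		}
	}
	else if (lexer->state == LEXER_WORD
		&& (c == ' ' || c == '\t' || ft_strchr(SPECIAL_CHARS, c)))
	{
		emit_word(lexer, &tokens);              // create TOKEN_WORD, state back to LEXER_NONE
		continue;                               // re-examine the delimiter from LEXER_NONE
	}
	else if ((lexer->state == LEXER_UNI && c == '\'')
		|| (lexer->state == LEXER_DUO && c == '"'))
		emit_word(lexer, &tokens);              // closing quote: token is the text between quotes
	lexer->pos++;
}
// After the loop: emit a final word if one is still open,
// and report an error if the state is still LEXER_UNI or LEXER_DUO (unclosed quote).
```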
Special Character Handling
Special characters require immediate attention and create tokens as soon as they're encountered. The lexer recognizes both single-character and two-character operators.
- Single-Character Operators
These create tokens immediately upon detection:
| Character | Token Type | Action |
|---|---|---|
| \| | TOKEN_PIPE | Create token, advance position by 1 |
| > | TOKEN_REDIRECT_OUT | Check next char for '>>', else single redirect |
| < | TOKEN_REDIRECT_IN | Check next char for '<<', else single redirect |
- Two-Character Operators
The lexer uses lookahead to detect two-character operators. When it encounters '>' or '<', it checks the next character:

```c
if (lexer->text[lexer->pos] == '>')
{
	if (lexer->text[lexer->pos + 1] == '>')
	{
		type = TOKEN_REDIRECT_APPEND;  // ">>"
		special = true;                // Advance by 2 positions
	}
	else
		type = TOKEN_REDIRECT_OUT;     // ">"
}
```

The `special` flag indicates whether to advance by 1 or 2 positions, ensuring two-character operators are consumed completely.
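Putting the table and the lookahead together, a complete operator handler might look like the sketch below. The function and helper names (handle_special, add_operator_token) are illustrative assumptions, not the project's real API:

```c
// Hypothetical handler for |, <, >, << and >> (helper names are assumptions)
static bool	handle_special(t_lexer *lexer, t_list **tokens)
{
	char			c;
	t_token_type	type;
	bool			two_chars;

	c = lexer->text[lexer->pos];
	two_chars = false;
	if (c == '|')
		type = TOKEN_PIPE;
	else if (c == '>' && lexer->text[lexer->pos + 1] == '>')
	{
		type = TOKEN_REDIRECT_APPEND;   // ">>"
		two_chars = true;
	}
	else if (c == '>')
		type = TOKEN_REDIRECT_OUT;      // ">"
	else if (c == '<' && lexer->text[lexer->pos + 1] == '<')
	{
		type = TOKEN_REDIRECT_HEREDOC;  // "<<"
		two_chars = true;
	}
	else
		type = TOKEN_REDIRECT_IN;       // "<"
	lexer->pos += (two_chars) ? 2 : 1;  // consume the full operator
	return (add_operator_token(tokens, type, two_chars));
}
```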
Quote Processing Deep Dive
Quote handling is one of the most complex aspects of shell lexical analysis. Our lexer supports both single quotes (literal) and double quotes (with potential variable expansion in later phases).
Single Quote Behaviour
Single quotes preserve everything literally. No special characters, variables, or escape sequences are interpreted within single quotes.
Single Quote Processing
Encounter opening single quote → State = LEXER_UNI
Set start position to character after opening quote
Consume all characters literally until closing quote
Create TOKEN_WORD with content between quotes
Return to LEXER_NONE state
Double Quote Behaviour
Double quotes allow variable expansion and some special character interpretation, though the actual expansion happens in later processing phases. During lexical analysis, the content is captured as-is.
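In both quote states, the token's value is simply the text between the quotes. A minimal sketch of that extraction, assuming libft's ft_substr and the lexer fields used above:

```c
// On the closing quote: extract everything between the quotes (sketch)
// lexer->start was set to the index just after the opening quote
char	*value;

value = ft_substr(lexer->text, lexer->start, lexer->pos - lexer->start);
// "Hello World" -> value is Hello World (quotes are delimiters, not content)
// In LEXER_DUO the value may still contain $VARs; expansion happens later.
lexer->state = LEXER_NONE;  // back to scanning for the next token
```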
Quote Edge Cases
Several edge cases require careful handling in quote processing:
- Empty Quotes

```bash
echo ""   # Creates TOKEN_WORD with empty string value
echo ''   # Creates TOKEN_WORD with empty string value
```

Empty quotes are valid and create word tokens with empty string values.

- Adjacent Quotes

```bash
echo "Hello"'World'   # Two separate tokens: "Hello" and "World"
echo 'Part1'"Part2"   # Two separate tokens that get joined later
```

Adjacent quoted strings create separate tokens that are joined in post-processing.

- Unclosed Quotes

```bash
echo "unclosed quote   # ERROR: EINVAL returned
echo 'missing end      # ERROR: EINVAL returned
```

Unclosed quotes result in a lexer error and command rejection.
Word Token Joining
After initial tokenization, the lexer performs a post-processing step to join adjacent word tokens. This handles cases where quoted strings appear next to regular words or other quoted strings.
Why join tokens? Consider the input `hello"world"test`. During lexical analysis, this becomes three separate tokens:
TOKEN_WORD ("hello")
TOKEN_WORD ("world")
TOKEN_WORD ("test")
But semantically, this should be treated as a single argument: `helloworldtest`. The joining process combines these into one TOKEN_WORD.
Joining Examples
Input: `file"name".txt`
Initial tokens: ["file", "name", ".txt"]
After joining: ["filename.txt"]

Input: `echo "Hello "'World'`
Initial tokens: ["echo", "Hello ", "World"]
After joining: ["echo", "Hello World"]

Input: `command arg1 | next`
Tokens: ["command", "arg1", "|", "next"]
No joining: the pipe breaks word adjacency
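One way to implement the joining pass is to walk the token list once and merge consecutive TOKEN_WORD nodes that were adjacent in the source text. The sketch below assumes an `adjacent` flag recorded during tokenization (meaning "no whitespace or operator separated this token from the previous one") plus libft's ft_strjoin and ft_lstdelone; the real project may track adjacency differently:

```c
// Sketch of the post-processing join pass (the "adjacent" flag and
// free_token_content helper are assumptions for illustration)
static void	join_word_tokens(t_list **tokens)
{
	t_list	*node;
	t_list	*next;
	t_token	*cur;
	t_token	*nxt;
	char	*joined;

	node = *tokens;
	while (node != NULL && node->next != NULL)
	{
		cur = node->content;
		nxt = node->next->content;
		if (cur->type == TOKEN_WORD && nxt->type == TOKEN_WORD && nxt->adjacent)
		{
			joined = ft_strjoin(cur->value, nxt->value);  // "file" + "name" -> "filename"
			free(cur->value);
			cur->value = joined;
			next = node->next->next;                      // unlink and free the merged node
			ft_lstdelone(node->next, free_token_content);
			node->next = next;
		}
		else
			node = node->next;
	}
}
```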
Error Handling and Edge Cases
Robust lexical analysis requires careful handling of error conditions and edge cases. Our lexer implements several safety mechanisms to ensure reliable operation.
- Memory Management
The lexer carefully manages memory allocation and clean-up:
Token Creation: Each token allocates memory for its value string
List Management: Tokens are stored in a linked list with proper clean-up functions
Error Recovery: On allocation failure, partial tokens are cleaned up
ENOMEM Handling: Memory exhaustion sets errno and returns gracefully
```c
// Error handling in token creation
t_token	*token = create_token(value, type);

if (token == NULL)
{
	free(value);     // Clean up partial allocation
	errno = ENOMEM;  // Set error code
	return (false);  // Indicate failure
}
```
- Input Validation
The lexer validates input before processing:
Null Checks: Handles NULL lexer or text pointers
Empty Input: Returns NULL for empty strings gracefully
Quote Matching: Detects and reports unclosed quotes
```c
t_list	*run_lexer(t_lexer *lexer)
{
	// Input validation
	if (lexer == NULL || lexer->text == NULL || lexer->text[0] == '\0')
		return (NULL);

	// ... processing ...

	// Quote validation at end
	if (lexer->state == LEXER_UNI || lexer->state == LEXER_DUO)
	{
		errno = EINVAL;  // Invalid input
		return (NULL);
	}
}
```
- State Consistency
The lexer maintains consistent internal state throughout processing:
Position Tracking: Current position never exceeds string length
Start Position: Token start position properly updated on state changes
State Transitions: Only valid state transitions are allowed
Performance Considerations
While correctness is paramount, the lexer is also designed with performance in mind. Several optimizations ensure efficient processing of shell input.
- Single-Pass Processing
The lexer processes input in a single pass, character by character, without backtracking. This provides O(n) time complexity where n is the input length.
- Minimal Memory Allocation
Memory allocation is minimized through:
- Using `ft_substr()` to extract token values efficiently
- Reusing the lexer structure across multiple tokenization calls
- Avoiding unnecessary string copying during state transitions
- Efficient Special Character Detection
Special characters are detected using `ft_strchr(SPECIAL_CHARS, c)`, which is effectively a constant-time check because the set of special characters is a short, fixed string.
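For example, a predicate built on ft_strchr might look like the following sketch (SPECIAL_CHARS is assumed here to be a short macro such as "|<>"):

```c
// Assumed macro: the operator characters the lexer treats specially
#define SPECIAL_CHARS "|<>"

// True if c starts an operator token; ft_strchr scans a 3-character
// constant string, so the cost is effectively constant per call
static bool	is_special_char(char c)
{
	return (c != '\0' && ft_strchr(SPECIAL_CHARS, c) != NULL);
}
```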
Integration with the Shell Pipeline
The lexer is the first component in the shell's processing pipeline, but it doesn't work in isolation. Understanding how it integrates with other components is crucial for the overall shell architecture.
Shell Processing Pipeline
Input Collection: Shell reads command line from user
Lexical Analysis: Lexer converts input to token stream
Parsing: Parser builds command structure from tokens
Expansion: Variables and wildcards are expanded
Execution: Commands are executed with proper I/O setup
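In code, one iteration of that pipeline reads roughly like the sketch below. Apart from run_lexer() (shown earlier), every name here (process_line, t_cmd, parse_tokens, expand_commands, execute_commands) is a hypothetical placeholder for the project's real entry points:

```c
// Hedged sketch of one pass through the shell pipeline (placeholder names)
void	process_line(char *line)
{
	t_lexer	lexer;
	t_list	*tokens;
	t_cmd	*commands;

	lexer = (t_lexer){.text = line, .pos = 0, .start = 0, .state = LEXER_NONE};
	tokens = run_lexer(&lexer);        // 2. lexical analysis
	if (tokens == NULL)
		return ;                       // empty input or lexer error (errno set)
	commands = parse_tokens(tokens);   // 3. parsing: build command structures
	expand_commands(commands);         // 4. expansion: $VARs and wildcards
	execute_commands(commands);        // 5. execution with pipes/redirections set up
}
```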
Token Stream Format
The lexer outputs a linked list of tokens that the parser can process sequentially. Each token contains:
- Type: which kind of token it is (word, pipe, or one of the redirections)
- Value: the original text of the token
This simple but powerful structure allows the parser to make decisions based on token type while having access to the original text for commands, arguments, and filenames.
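In C this is typically a small struct stored as the content of each list node. A sketch consistent with the fields used throughout this article (the struct name is an assumption):

```c
// Sketch of the token structure carried by each t_list node
typedef struct s_token
{
	t_token_type	type;    // TOKEN_WORD, TOKEN_PIPE, TOKEN_REDIRECT_*, ...
	char			*value;  // the original text ("echo", "Hello World", ">", ...)
}	t_token;
```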
Testing and Validation
Comprehensive testing is essential for a reliable lexer. Our testing strategy covers both normal operation and edge cases.
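As an illustration, a minimal hand-rolled test for the earlier example might look like the sketch below. It assumes the t_lexer/t_token definitions sketched above and only checks the first two tokens:

```c
// Tiny illustrative test (assumes the sketched t_lexer, t_token and run_lexer)
#include <assert.h>
#include <string.h>

static void	test_basic_pipeline(void)
{
	t_lexer	lexer;
	t_list	*tokens;
	t_token	*tok;

	lexer = (t_lexer){.text = "echo \"Hello World\" | grep Hello",
		.pos = 0, .start = 0, .state = LEXER_NONE};
	tokens = run_lexer(&lexer);
	assert(tokens != NULL);
	tok = tokens->content;
	assert(tok->type == TOKEN_WORD && strcmp(tok->value, "echo") == 0);
	tok = tokens->next->content;
	assert(tok->type == TOKEN_WORD && strcmp(tok->value, "Hello World") == 0);
}
```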
Common Pitfalls and Solutions
Learning from common mistakes can save significant debugging time. Here are frequent issues encountered when implementing shell lexers:
- Incorrect Quote Handling
Problem: Treating quotes as part of the token value.
Solution: Set token start position after opening quote, end before closing quote. Quotes are delimiters, not content.
```
❌ Wrong: "hello" → TOKEN_WORD("\"hello\"")
✅ Right: "hello" → TOKEN_WORD("hello")
```
- Missing Word Joining
Problem: Adjacent quoted strings treated as separate arguments.
Solution: Implement post-processing to join adjacent word tokens.
```
❌ Wrong: file"name" → ["file", "name"] (2 arguments)
✅ Right: file"name" → ["filename"] (1 argument)
```
- Operator Precedence Issues
Problem: Not recognizing two-character operators like '>>'.
Solution: Use lookahead to match the longest possible operator first.
```
❌ Wrong: >> → [">"][">"] (two redirections)
✅ Right: >> → [">>"] (one append redirection)
```
- Memory Leaks
Problem: Forgetting to free token values and list nodes.
Solution: Implement proper clean-up functions and use them consistently.
```c
// Proper clean-up
void	free_token(t_token *token)
{
	if (token)
	{
		free(token->value);  // Free the string
		free(token);         // Free the structure
	}
}
```
Conclusion: The Foundation of Shell Processing
The lexer might seem like a simple component - after all, it just breaks text into pieces. But as we've seen, it's the critical foundation that enables everything else in the shell to work correctly.
Key Takeaways:
State machines provide clarity: The FSM approach makes complex parsing logic manageable and debuggable
Details matter: Quote handling, operator recognition, and edge cases require careful attention
Clean interfaces enable composition: Well-defined token types create clear boundaries between shell components
Error handling is crucial: Robust error detection prevents issues from propagating to later stages
With a solid understanding of lexical analysis, you're ready to tackle the next challenge: building a parser that can take your token stream and create meaningful command structures for execution.