Minishell Help

How the Lexer Works

What is Lexical Analysis?

Lexical analysis, or tokenization, is the first phase of shell command processing. Think of it as reading a sentence and identifying individual words, punctuation, and meaning - but for shell commands instead of natural language.

When you type echo "Hello World" | grep Hello > output.txt, the lexer doesn't see this as one long string. Instead, it breaks it down into meaningful pieces:

Input Character Sequence   Token Type           Token Value      Meaning
------------------------   ----------           -----------      -------
echo                       TOKEN_WORD           "echo"           Command to execute
"Hello World"              TOKEN_WORD           "Hello World"    Quoted argument
|                          TOKEN_PIPE           "|"              Pipe operator
grep                       TOKEN_WORD           "grep"           Next command
Hello                      TOKEN_WORD           "Hello"          Argument to grep
>                          TOKEN_REDIRECT_OUT   ">"              Output redirection
output.txt                 TOKEN_WORD           "output.txt"     Target file

Token Types and Recognition

Our Minishell lexer recognizes several distinct types of tokens, each serving a specific purpose in shell command processing:

🔤 TOKEN_WORD

Purpose: Represents commands, arguments, filenames, and any text that isn't a special operator.

Examples:

  • echo → Command name

  • file.txt → Filename

  • "quoted string" → Quoted argument (quotes removed)

  • 'literal text' → Single-quoted text (quotes removed)

  • $USER → Variable (will be expanded later)

🔗 TOKEN_PIPE

Symbol: |
Purpose: Connects commands in a pipeline, passing the output of one command as input to the next.

# Example: count lines in all .c files
find . -name "*.c" | wc -l
#                  ^
#                  TOKEN_PIPE

📥 TOKEN_REDIRECT_IN

Symbol: <
Purpose: Redirects input from a file to a command.

# Read input from file.txt
sort < file.txt
#    ^
#    TOKEN_REDIRECT_IN

📤 TOKEN_REDIRECT_OUT

Symbol: >
Purpose: Redirects command output to a file (overwrites existing content).

# Write output to file.txt
echo "Hello" > file.txt
#            ^
#            TOKEN_REDIRECT_OUT

📝 TOKEN_REDIRECT_APPEND

Symbol: >>
Purpose: Redirects command output to a file (appends to existing content).

# Append output to log.txt
echo "New entry" >> log.txt
#                ^^
#                TOKEN_REDIRECT_APPEND

📋 TOKEN_REDIRECT_HEREDOC

Symbol: <<
Purpose: Starts a here-document, allowing multi-line input until a delimiter is reached.

# Here-document example
cat << EOF
#   ^^
#   TOKEN_REDIRECT_HEREDOC
Line 1
Line 2
EOF

The Finite State Machine

The heart of our lexer is a finite state machine (FSM) that processes input character by character. Each state represents a different context for interpreting characters, and transitions between states occur based on the current character and lexer rules. A short code sketch after the state list below shows one way to express these states.

Lexer States

  1. LEXER_NONE - Initial state, no token being built

    In this state, the lexer scans for the start of new tokens. It skips whitespace and determines what type of token to begin based on the current character.

  2. LEXER_WORD - Building a regular word token

    Accumulates characters that aren't whitespace, quotes, or special operators into a word token. Transitions out when encountering delimiters.

  3. LEXER_UNI - Inside single quotes

    Treats all characters literally until the closing single quote. No variable expansion or special character interpretation occurs.

  4. LEXER_DUO - Inside double quotes

    Similar to single quotes but allows variable expansion (though this happens in later processing phases, not during lexical analysis).
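
To make the state set concrete, here is a minimal sketch of these states as a C enum, together with an illustrative dispatch out of LEXER_NONE. The enum names follow this page; the helper function is an illustration, not the project's actual source.

typedef enum e_lexer_state
{
	LEXER_NONE,	// scanning between tokens
	LEXER_WORD,	// building a regular word
	LEXER_UNI,	// inside single quotes
	LEXER_DUO	// inside double quotes
}	t_lexer_state;

// From LEXER_NONE, decide which state the current character opens.
// Special characters (|, <, >) create tokens immediately and are
// handled separately; see "Special Character Handling" below.
static t_lexer_state	state_from_none(char c)
{
	if (c == '\'')
		return (LEXER_UNI);
	if (c == '"')
		return (LEXER_DUO);
	if (c == ' ' || c == '\t')
		return (LEXER_NONE);
	return (LEXER_WORD);
}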

Character-by-Character Processing

Let's trace through the lexer's operation on a concrete example to understand how state transitions and token creation work in practice.

Example: Processing echo "Hello World" | grep Hello

  1. Initialize: State = LEXER_NONE, Position = 0

    Input: echo "Hello World" | grep Hello
           ^
           Position 0, State: LEXER_NONE
  2. Characters 0-3 ('echo'):

    • Char 'e': Not whitespace/quote/special → State = LEXER_WORD, Start = 0

    • Chars 'c', 'h', 'o': Continue in LEXER_WORD state

  3. Character 4 (space):

    • Whitespace encountered → End current token

    • Create TOKEN_WORD with value "echo"

    • State = LEXER_NONE

  4. Character 5 ('"'):

    • Double quote → State = LEXER_DUO

    • Start = 6 (position after opening quote)

  5. Characters 6-16 ('Hello World'):

    • All characters including space treated literally

    • Continue in LEXER_DUO state

  6. Character 17 ('"'):

    • Closing quote → End token with value "Hello World"

    • Create TOKEN_WORD, State = LEXER_NONE

  7. Character 19 ('|'), after skipping the space at position 18:

    • Special character → Create TOKEN_PIPE immediately

    • No state change needed for single-char operators

Position:  0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20
Input:     e  c  h  o     "  H  e  l  l  o     W  o  r  l  d  "     |
State:     W  W  W  W  N  D  D  D  D  D  D  D  D  D  D  D  D  N  N  N  N

Legend: N = LEXER_NONE, W = LEXER_WORD, D = LEXER_DUO

Tokens created:
Position  4: TOKEN_WORD ("echo")
Position 17: TOKEN_WORD ("Hello World")
Position 19: TOKEN_PIPE ("|")
Position 24: TOKEN_WORD ("grep")
Position 30: TOKEN_WORD ("Hello")

Special Character Handling

Special characters require immediate attention and create tokens as soon as they're encountered. The lexer recognizes both single-character and two-character operators.

Single-Character Operators

These create tokens immediately upon detection:

Character   Token Type           Action
---------   ----------           ------
|           TOKEN_PIPE           Create token, advance position by 1
>           TOKEN_REDIRECT_OUT   Check next char for '>>', else single redirect
<           TOKEN_REDIRECT_IN    Check next char for '<<', else single redirect

Two-Character Operators

The lexer uses lookahead to detect two-character operators. When it encounters '>' or '<', it checks the next character:

if (lexer->text[lexer->pos] == '>')
{
	if (lexer->text[lexer->pos + 1] == '>')
	{
		type = TOKEN_REDIRECT_APPEND;	// ">>"
		special = true;			// Advance by 2 positions
	}
	else
		type = TOKEN_REDIRECT_OUT;	// ">"
}

The special flag indicates whether to advance by 1 or 2 positions, ensuring two-character operators are consumed completely.
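
A minimal sketch of how the flag might translate into a position update (the field names follow this page; the project's actual code may differ):

// Advance past the operator that was just consumed:
// one position for | < >, two for >> and <<
if (special)
	lexer->pos += 2;
else
	lexer->pos += 1;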

Quote Processing Deep Dive

Quote handling is one of the most complex aspects of shell lexical analysis. Our lexer supports both single quotes (literal) and double quotes (with potential variable expansion in later phases).

Single Quote Behaviour

Single quotes preserve everything literally. No special characters, variables, or escape sequences are interpreted within single quotes.

# Examples of single quote behaviour
echo 'Hello $USER'         # Outputs: Hello $USER
echo 'pipe | redirect >'   # Outputs: pipe | redirect >
echo 'multi
line text'                 # Preserves newlines literally

Single Quote Processing

  1. Encounter opening single quote → State = LEXER_UNI

  2. Set start position to character after opening quote

  3. Consume all characters literally until closing quote

  4. Create TOKEN_WORD with content between quotes

  5. Return to LEXER_NONE state (see the sketch after this list)
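
A simplified sketch of steps 3-5, assuming the field names used earlier on this page (add_token is a hypothetical helper that appends to the token list):

// Closing single quote reached: emit the content between the quotes
if (lexer->state == LEXER_UNI && lexer->text[lexer->pos] == '\'')
{
	value = ft_substr(lexer->text, lexer->start,
			lexer->pos - lexer->start);
	add_token(&tokens, create_token(value, TOKEN_WORD));
	lexer->state = LEXER_NONE;	// back to scanning
}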

Double Quote Behaviour

Double quotes allow variable expansion and some special character interpretation, though the actual expansion happens in later processing phases. During lexical analysis, the content is captured as-is.

# Examples of double quote behaviour
echo "Hello $USER"           # $USER will be expanded later
echo "Current dir: $(pwd)"   # Command substitution preserved
echo "Tab:	Space: "     # Whitespace preserved literally

Quote Edge Cases

Several edge cases require careful handling in quote processing:

Empty Quotes
echo "" # Creates TOKEN_WORD with empty string value echo '' # Creates TOKEN_WORD with empty string value

Empty quotes are valid and create word tokens with empty string values.

Adjacent Quotes
echo "Hello"'World' # Two separate tokens: "Hello" and "World" echo 'Part1'\"Part2\" # Two separate tokens that get joined later

Adjacent quoted strings create separate tokens that are joined in post-processing.

Unclosed Quotes
echo "unclosed quote # ERROR: EINVAL returned echo 'missing end # ERROR: EINVAL returned

Unclosed quotes result in a lexer error and command rejection.

Word Token Joining

After initial tokenization, the lexer performs a post-processing step to join adjacent word tokens. This handles cases where quoted strings appear next to regular words or other quoted strings.

Why join tokens? Consider the input hello"world"test. During lexical analysis, this becomes three separate tokens:

  • TOKEN_WORD ("hello")

  • TOKEN_WORD ("world")

  • TOKEN_WORD ("test")

But semantically, this should be treated as a single argument: helloworldtest. The joining process combines these into one TOKEN_WORD.

// Simplified joining logic
static void	try_join(t_list *token)
{
	t_token	*current = token->content;
	t_token	*next = token->next->content;

	// Only join adjacent word tokens
	if (current->type != TOKEN_WORD || next->type != TOKEN_WORD)
		return ;
	// Create new combined token
	t_token	*new_token = create_token(
			ft_strjoin(current->value, next->value), TOKEN_WORD);
	// Replace current token and remove next
	token->content = new_token;
	// ... clean-up and continue joining
}

Joining Examples

  1. Input: file"name".txt

    Initial tokens: ["file", "name", ".txt"]

    After joining: ["filename.txt"]

  2. Input: echo "Hello "'World'

    Initial tokens: ["echo", "Hello ", "World"]

    After joining: ["echo", "Hello World"]

  3. Input: command arg1 | next

    Tokens: ["command", "arg1", "|", "next"]

    No joining: Pipe breaks word adjacency

Error Handling and Edge Cases

Robust lexical analysis requires careful handling of error conditions and edge cases. Our lexer implements several safety mechanisms to ensure reliable operation.

Memory Management

The lexer carefully manages memory allocation and clean-up:

  • Token Creation: Each token allocates memory for its value string

  • List Management: Tokens are stored in a linked list with proper clean-up functions

  • Error Recovery: On allocation failure, partial tokens are cleaned up

  • ENOMEM Handling: Memory exhaustion sets errno and returns gracefully

// Error handling in token creation
t_token	*token = create_token(value, type);

if (token == NULL)
{
	free(value);    // Clean up partial allocation
	errno = ENOMEM; // Set error code
	return (false); // Indicate failure
}

Input Validation

The lexer validates input before processing:

  • Null Checks: Handles NULL lexer or text pointers

  • Empty Input: Returns NULL for empty strings gracefully

  • Quote Matching: Detects and reports unclosed quotes

t_list	*run_lexer(t_lexer *lexer)
{
	// Input validation
	if (lexer == NULL || lexer->text == NULL || lexer->text[0] == '\0')
		return (NULL);
	// ... processing ...
	// Quote validation at end
	if (lexer->state == LEXER_UNI || lexer->state == LEXER_DUO)
	{
		errno = EINVAL; // Invalid input: unclosed quote
		return (NULL);
	}
}

State Consistency

The lexer maintains consistent internal state throughout processing (the sketch after this list restates these invariants as runtime checks):

  • Position Tracking: Current position never exceeds string length

  • Start Position: Token start position properly updated on state changes

  • State Transitions: Only valid state transitions are allowed
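
These invariants are guarantees of the implementation rather than literal code, but restated as hypothetical assertions (the start field name is an assumption based on the trace above) they would look like this:

#include <assert.h>

// Hypothetical checks; not present in the actual sources
assert(lexer->pos <= ft_strlen(lexer->text)); // position never overruns
assert(lexer->start <= lexer->pos);           // start never passes pos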

Performance Considerations

While correctness is paramount, the lexer is also designed with performance in mind. Several optimizations ensure efficient processing of shell input.

Single-Pass Processing

The lexer processes input in a single pass, character by character, without backtracking. This provides O(n) time complexity where n is the input length.
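
In outline, the single pass might look like the loop below; the handle_* helpers are hypothetical names standing in for the per-state logic described above.

// One forward-only sweep over the input: O(n) in the input length
while (lexer->text[lexer->pos] != '\0')
{
	if (lexer->state == LEXER_NONE)
		handle_none(lexer, &tokens);	// skip blanks, open tokens
	else if (lexer->state == LEXER_WORD)
		handle_word(lexer, &tokens);	// grow or close a word
	else
		handle_quote(lexer, &tokens);	// LEXER_UNI / LEXER_DUO
	lexer->pos++;
}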

Minimal Memory Allocation

Memory allocation is minimized through:

  • Using ft_substr() to extract token values efficiently

  • Reusing lexer structure across multiple tokenization calls

  • Avoiding unnecessary string copying during state transitions

Efficient Special Character Detection

Special characters are detected using ft_strchr(SPECIAL_CHARS, c), which scans the SPECIAL_CHARS string; since that set contains only a handful of characters, the lookup is effectively constant time.
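
As a sketch, assuming a SPECIAL_CHARS macro like the one below (the project's actual definition may differ):

#include <stdbool.h>

#define SPECIAL_CHARS "|<>"

// True when c begins an operator token such as |, < or >.
// The '\0' guard matters because strchr-style functions also
// match the terminating null byte.
static bool	is_special(char c)
{
	return (c != '\0' && ft_strchr(SPECIAL_CHARS, c) != NULL);
}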

Integration with the Shell Pipeline

The lexer is the first component in the shell's processing pipeline, but it doesn't work in isolation. Understanding how it integrates with other components is crucial for the overall shell architecture; a sketch of the driving loop follows the list below.

Shell Processing Pipeline

  1. Input Collection: Shell reads command line from user

  2. Lexical Analysis: Lexer converts input to token stream

  3. Parsing: Parser builds command structure from tokens

  4. Expansion: Variables and wildcards are expanded

  5. Execution: Commands are executed with proper I/O setup
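
A minimal sketch of that flow as a driver loop. readline and run_lexer appear in this project; lexer_init, run_parser, expand_all, and execute are hypothetical names for the later stages, and declarations are omitted for brevity.

// Illustrative read-eval loop; error handling and clean-up omitted
while (1)
{
	line = readline("minishell$ ");      // 1. input collection
	if (line == NULL)
		break ;                          // EOF (Ctrl-D)
	lexer = lexer_init(line);            // hypothetical constructor
	tokens = run_lexer(lexer);           // 2. lexical analysis
	if (tokens != NULL)
	{
		commands = run_parser(tokens);   // 3. parsing
		expand_all(commands);            // 4. expansion
		execute(commands);               // 5. execution
	}
	free(line);
}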

Token Stream Format

The lexer outputs a linked list of tokens that the parser can process sequentially. Each token contains:

typedef struct s_token
{
	char			*value;	// Token content (malloc'd string)
	t_token_type	type;	// Token classification
}	t_token;

This simple but powerful structure allows the parser to make decisions based on token type while having access to the original text for commands, arguments, and filenames.
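
For example, a consumer can walk the stream like this (print_tokens is an illustrative debug helper, not part of the project; t_list is the libft list node with content and next fields):

#include <stdio.h>

// Debug helper: print every token's type and value in order
void	print_tokens(t_list *tokens)
{
	t_token	*token;

	while (tokens != NULL)
	{
		token = (t_token *)tokens->content;
		printf("[%d] \"%s\"\n", token->type, token->value);
		tokens = tokens->next;
	}
}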

Testing and Validation

Comprehensive testing is essential for a reliable lexer. Our testing strategy covers both normal operation and edge cases.

# Simple commands
echo hello
ls -la
cat file.txt
# Expected tokens: [echo][hello], [ls][-la], [cat][file.txt]

# Various quote scenarios
echo "hello world"
echo 'literal $USER'
echo "mixed 'quotes'"
echo ""
echo ''
# Test both quote types and nesting

# All operator types
echo hello | grep h
cat < input.txt
echo output > file.txt
echo append >> log.txt
cat << EOF
# Verify correct token types generated

# Complex scenarios
echo"no spaces"between'quotes'
command|pipe>redirect
echo "unclosed quote
multiple    spaces    between    words
# Test error handling and boundary conditions

Common Pitfalls and Solutions

Learning from common mistakes can save significant debugging time. Here are frequent issues encountered when implementing shell lexers:

Incorrect Quote Handling

Problem: Treating quotes as part of the token value.

Solution: Set token start position after opening quote, end before closing quote. Quotes are delimiters, not content.

❌ Wrong: "hello" → TOKEN_WORD("\"hello\"")
✅ Right: "hello" → TOKEN_WORD("hello")

Missing Word Joining

Problem: Adjacent quoted strings treated as separate arguments.

Solution: Implement post-processing to join adjacent word tokens.

❌ Wrong: file"name" → ["file", "name"] (2 arguments)
✅ Right: file"name" → ["filename"] (1 argument)

Operator Precedence Issues

Problem: Not recognizing two-character operators like '>>'.

Solution: Use lookahead to match the longest possible operator first.

❌ Wrong: >> → [">"][">"] (two redirections)
✅ Right: >> → [">>"] (one append redirection)

Memory Leaks

Problem: Forgetting to free token values and list nodes.

Solution: Implement proper clean-up functions and use them consistently.

// Proper clean-up
void	free_token(t_token *token)
{
	if (token)
	{
		free(token->value); // Free the string
		free(token);        // Free the structure
	}
}

Conclusion: The Foundation of Shell Processing

The lexer might seem like a simple component - after all, it just breaks text into pieces. But as we've seen, it's the critical foundation that enables everything else in the shell to work correctly.

Key Takeaways:

  • State machines provide clarity: The FSM approach makes complex parsing logic manageable and debuggable

  • Details matter: Quote handling, operator recognition, and edge cases require careful attention

  • Clean interfaces enable composition: Well-defined token types create clear boundaries between shell components

  • Error handling is crucial: Robust error detection prevents issues from propagating to later stages

With a solid understanding of lexical analysis, you're ready to tackle the next challenge: building a parser that can take your token stream and create meaningful command structures for execution.

Last modified: 16 June 2025