TurExp (version 3)

A Turing Machine equivalent of regular expressions. A nondeterministic esoteric programming language, nondeterministic in the computer science sense of always choosing the right option when there's a choice.

The language runs on a nondeterministic Turing Machine, or a simulator of one. The machine has a tape, extending infinitely in both directions, with 1112064 possible symbols (one per Unicode character, but see below) and a current position that's pointing to a symbol on the tape (not between symbols).

Whitespace in the code is ignored except in quoted mode and after a backslash. There is no built-in support for comments, though no-ops such as ([]"comment goes here")? can be used instead.

program → ((?v3))? compiler-directive* instruction
A program is an instruction (typically a sequence instruction). Programs can optionally start with (?v3) to indicate that they use TurExp version 3. (This is not necessary to use version 3–specific features; it's mostly there for the possibility of a version 4 or higher.)

Contents

Character encoding

Version 2 was documented as having 256 symbols, and the documentation said "The code, input, and output are assumed to be in an 8-bit superset of ASCII (though implementations may choose to treat input and output as numbers instead)". However, the actual implementation used the platform-default character encoding and actually seems to support Unicode just fine, except that characters outside Latin-1 can't be entered as hex codes.

Version 3 fully supports Unicode. There are 1112064 possible symbols, one for each Unicode character except the surrogate code points (d800-dfff), which TurExp doesn't consider to be valid characters.

In version 3, the (?u) and (?-u) compiler directives determine the encoding of the source file. (?u) indicates that everything from that point on is in UTF-8; (?-u) indicates that everything from that point on is Latin-1. Whichever one is in effect at the end of the program determines the encoding for input and output.

However, no implementation actually supports these directives, because the only version 3 interpreter is written in JavaScript, and its input and output are strings rather than binary, so it just ignores these directives. They're still specified in case there are any non-JavaScript implementations at some point, and recognized for compatibility with such implementations.

Input and output

Before the program starts, a block of text is read as input. This text is put onto the tape before the program starts, and the current position is set to the first character of this text. The rest of the tape is filled with null characters (\00).

When the program ends, the interpreter outputs whether the program got to the end (accepted) or if all branches failed (rejected). If the program accepted, the interpreter looks to the left and the right of the current position for the nearest null character, and outputs everything on the tape between those characters. If the current position is on a null character, then it starts looking one character to the left. This means that, e.g., if the tape has \00foo\00 on it (and the rest is filled with null characters), and the current position is anywhere between the f and the \00 directly after the word foo, inclusive, the program will output foo; otherwise, it will output nothing.

If the program forks into multiple copies and more than one complete successfully, one of them will arbitrarily chosen to be the output. In the current interpreter, the same output will always be chosen for the same program/input combination, but that's not guaranteed, and the output chosen can differ between interpreters and versions of an interpreter. If the "Show all results" checkbox is checked (or the -a option is given to the Haskell interpreter), then the interpreter will continue running until all copies of the program have succeeded or failed.

There is no way to do input while the program is running, and no good way to do output before the program finishes. For debugging purposes, in version 3, you can check "Trace" to show every command that's executed, and you can print the current state using the (?t) instruction, which is a no-op that's always traced even if "Trace" isn't checked. You can put text in the parentheses after the ?t, which will also be printed.

Complex instructions

instructionsimple-instruction
Described below. Read and/or write symbols and/or move the tape head.
instructioncompiler-directive
Described below. Changes the current mode. (Technically not an instruction, but can be used where an instruction can, except before *, +, or ?.)
instruction( instructions )
Executes the specified instructions as a sequence. () is a no-op.
instruction[ instructions ]
Forks the machine into as many copies as their are instructions, and each copy executes a different instruction from the brackets. [] causes the current machine to fail (i.e., the current fork is not the one that produces the final output).
instruction(?^ characters ) (version 3 only)
Reads a single character and fails if it matches any of the specified characters. characters technically can be any instructions, but things other than characters and ranges might behave unexpectedly. See the section on modes for details.
instruction(?? instructions ) (version 3 only)
Chooses one of the instructions at random (equally likely) and executes it. Doesn't attempt to choose another option if it fails.
instructioninstruction *
Executes the specified instruction as many times as necessary.
instructioninstruction +
a+ is a shortcut for (aa*); i.e., executes the specified instruction at least once.
instructioninstruction ?
a? is a shortcut for [()a]; i.e., optionally executes the specified instruction.
instruction(?$) (version 3 only)
Halts the program and causes it to complete successfully.
instruction(?t (char - " ( ) [ ] \ | " (char - ")* ")* ) (version 3 only)
A no-op that causes the interpreter to print the current program state, as described above. The details are implementation-specific, and implementations can choose to ignore this instruction.

Multiple *, +, ? operators cannot appear next to each other without parentheses. Compiler directives cannot come between the statement and the operator.

The (?s) instruction (version 3 only)

There are two new instructions that work together to create a temporary variable:

instruction(?S instructions )
Acts like a sequence instruction, but introduces a new scope. These instructions cannot nest.
instruction(?s instructions )
Must be used within an (?S) instruction. This instruction acts like a [] instruction, except that every (?s) instruction within a single (?S) instruction always chooses the same index.

For instance, (?S(?sab)(?s12)) first matches either a or b, and then matches either 1 (but only if the first match was a) or 2 (but only if the first match was b). This is equivalent to [(a1)(b2)], but it can be useful if there are a lot of instructions between the first and second match, and it can be useful when used with (?r).

Compiler modes

The compiler has several modes, which affect the interpretation of various instructions. The compiler mode can be changed using compiler directives. More details on these modes are in later sections.

Directives run at compile-time and affect any characters after them in the source file, regardless of whether or when the directive would be executed at runtime; for instance, ([ab`]c) is equivalent to ([ab]`c) (both choose between two options, reading a or b, and then write c), and not equivalent to [(ac)(bc)(`c)]. Version 2 of TurExp eliminates run-time mode, which would change this behavior.

Direction compiler directives cannot be used in modeless mode.

Symbols

symbolssymbol | .

Symbols are used for reading and writing. The syntax for a symbol depends on the character mode.

For normal mode

symbolcharacter - reserved-character - whitespace | \ (character - hex-digit) | \ hex-digits

reserved-character → one of . \ " { } ` ' # > < | ( ) [ ] * + ? ^

A normal character or a character preceded by a backslash stands for its ASCII value. A backslash followed by a hex literal is the value of the literal. The number of digits in the hex literal is determined by the digits mode.

For binary mode

symbolhex-digits | \ character

Like normal mode, but the meaning of backslash is reversed. An unrecognized character is a syntax error.

For quoted mode

symbolcharacter - "

Each character stands for its ASCII value.

Ranges (version 3 only)

symbols(?r symbol symbol? symbol )

In any situation where a symbol is expected, a range can be used instead (except within another range, or in quoted mode). These are notated as (?r, the first character, an optional second character, and then the last character, and ). For instance, (?raz) is equivalent to abcdefghijklmnopqrstuvwxyz, and (?r028) is equivalent to 02468 (skipping the odd numbers because the second character is two characters after the first character). Characters use the same syntax that they normally would (so backslashes are necessary to escape reserved characters, and they're entered in hex when hex mode is active).

A range is equivalent to the corresponding sequence of characters, whatever that means in the context where it's used. For instance, ['(?r15)] matches any of the characters 1, 2, 3, 4, or 5, whereas (`(?r15)) writes the string 12345. Modeless mode treats ranges a bit differently.

A range cannot be directly followed by *, +, or ?, and can't be at the top level, without being in some sort of parentheses or brackets.

Simple instructions

The meaning of a simple instruction depends on the current RW mode.

In read mode

simple-instructionsymbols

Reads a symbol from the tape. If the inversion mode is non-inverted, the symbol on the tape must be the same as the one specified in the instruction; if the inversion mode is inverted, it must be different. Otherwise, the current fork fails. If the case mode is case-insensitive, then the match is case-insensitive (regardless of whether the symbol was specified as a character or a hex code). Afterwards, moves in the current direction.

If a dot (.) is specified instead of a symbol, skips over the current symbol without failing, regardless of inversion mode.

In write mode

simple-instructionsymbols

Writes the specified symbol, then moves in the current direction. If a dot is specified instead of a symbol, just moves in the current direction. Inversion mode and case mode are ignored.

In modeless mode

simple-instructionsymbols symbols direction

direction< | > | |

Reads the first symbol (possibly inverted), writes the second, and then moves in the specified direction. Compiler directives cannot be between the three parts of the instruction. . can be specified for either or both symbols to skip reading or writing. The direction can be specified even in quoted mode, and it can be preceded by whitespace even in quoted mode.

If a range is specified for either symbol, the entire instruction is repeated. For instance, (?r13)4> is equivalent to 14>24>34>. If ranges are specified for both symbols, then the ranges must be the same length, and corresponding elements are used. For instance, [#(?r08)(?r19)>90>] will add 1 to the current digit, wrapping around if it's 9.

Macros (version 3 only)

macro-charcharacter - (" . ( ) [ ] \)

compiler-directive(?M macro-char macro-char* . instruction* )

instruction(?m macro-char instruction* )

Macros can be defined using (?M), which can appear anywhere a compiler directive can, and can be called with (?m).

For a simple macro with no arguments, it can be defined as (?M name . instructions) (notice there's a dot there), where name is any single character except quotation mark, backslash, dot, parentheses, or brackets. It can be used with (?m name).

For arguments, when defining the macro, put the parameter names between the macro name and the dot (parameter names must also be single characters). If there's a parameter named x, then you can use it with (?mx). To pass arguments, simply put them after the macro name when calling it. For instance, the macro (?M2x.(?mx)(?mx)) performs any instruction you give it as an argument twice; once this is defined, (?m2a) expands into aa.

If a macro is passed more arguments than it expects, then all of the remaining arguments go into whatever the last parameter is. If a macro is passed fewer arguments than it expects, then remaining parameters will expand to nothing. If a macro that doesn't expect any arguments is passed arguments, those arguments are ignored.

If a macro or a macro argument expands to multiple instructions, the interpretation of those multiple instructions depends on context; for instance, if it's in a sequence instruction, then the different instructions are executed in sequence, if it's in a fork instruction, then each instruction becomes a different fork, etc. If a macro expands to multiple instructions, then it's an error for the (?m) instruction to be immediately followed by a postfix operator or to be at the top level.

Compiler directives take effect before macro expansion. For instance, (?Mfx.'(?mx))`(?mfa) writes a rather than reading it, because write mode (`) is active where the a is. Compiler directives do not count as macro arguments, so (?mf`a) has one argument, not two.

Macros can only be called after they are fully defined. This implies, among other things, that they can't be recursive.

Interpreter written in JavaScript (TurExp 3)

See below for descriptions of the version 2 examples