A Turing Machine equivalent of regular expressions. A nondeterministic esoteric programming language, nondeterministic in the computer science sense of always choosing the right option when there's a choice.
The language runs on a nondeterministic Turing Machine, or a simulator of one. The machine has a tape, extending infinitely in both directions, with 1112064 possible symbols (one per Unicode character, but see below) and a current position that's pointing to a symbol on the tape (not between symbols).
Whitespace in the code is ignored except in quoted mode and after a backslash. There is no built-in support for comments, though no-ops such as ([]"comment goes here")?
can be used instead.
(?v3)
)? compiler-directive* instruction
(?v3)
to indicate that they use TurExp version 3. (This is not necessary to use version 3–specific features; it's mostly there for the possibility of a version 4 or higher.)
Version 2 was documented as having 256 symbols, and the documentation said "The code, input, and output are assumed to be in an 8-bit superset of ASCII (though implementations may choose to treat input and output as numbers instead)". However, the actual implementation used the platform-default character encoding and actually seems to support Unicode just fine, except that characters outside Latin-1 can't be entered as hex codes.
Version 3 fully supports Unicode. There are 1112064 possible symbols, one for each Unicode character except the surrogate code points (d800
-dfff
), which TurExp doesn't consider to be valid characters.
In version 3, the (?u)
and (?-u)
compiler directives determine the encoding of the source file. (?u)
indicates that everything from that point on is in UTF-8; (?-u)
indicates that everything from that point on is Latin-1. Whichever one is in effect at the end of the program determines the encoding for input and output.
However, no implementation actually supports these directives, because the only version 3 interpreter is written in JavaScript, and its input and output are strings rather than binary, so it just ignores these directives. They're still specified in case there are any non-JavaScript implementations at some point, and recognized for compatibility with such implementations.
Before the program starts, a block of text is read as input. This text is put onto the tape before the program starts, and the current position is set to the first character of this text. The rest of the tape is filled with null characters (\00
).
When the program ends, the interpreter outputs whether the program got to the end (accepted) or if all branches failed (rejected). If the program accepted, the interpreter looks to the left and the right of the current position for the nearest null character, and outputs everything on the tape between those characters. If the current position is on a null character, then it starts looking one character to the left. This means that, e.g., if the tape has \00foo\00
on it (and the rest is filled with null characters), and the current position is anywhere between the f
and the \00
directly after the word foo
, inclusive, the program will output foo
; otherwise, it will output nothing.
If the program forks into multiple copies and more than one complete successfully, one of them will arbitrarily chosen to be the output. In the current interpreter, the same output will always be chosen for the same program/input combination, but that's not guaranteed, and the output chosen can differ between interpreters and versions of an interpreter. If the "Show all results" checkbox is checked (or the -a
option is given to the Haskell interpreter), then the interpreter will continue running until all copies of the program have succeeded or failed.
There is no way to do input while the program is running, and no good way to do output before the program finishes. For debugging purposes, in version 3, you can check "Trace" to show every command that's executed, and you can print the current state using the (?t)
instruction, which is a no-op that's always traced even if "Trace" isn't checked. You can put text in the parentheses after the ?t
, which will also be printed.
*
, +
, or ?
.)
(
instructions )
()
is a no-op.
[
instructions ]
[]
causes the current machine to fail (i.e., the
current fork is not the one that produces the final output).
(?^
characters )
(version 3 only)
(??
instructions )
(version 3 only)
*
+
a+
is a shortcut for (aa*)
; i.e., executes the specified instruction at least once.
?
a?
is a shortcut for [()a]
; i.e., optionally executes the specified instruction.
(?$)
(version 3 only)
(?t
(char - "
(
)
[
]
\
| "
(char - "
)* "
)* )
(version 3 only)
Multiple *
, +
, ?
operators cannot appear next to each other without parentheses. Compiler directives cannot come between the statement and the operator.
(?s)
instruction (version 3 only)There are two new instructions that work together to create a temporary variable:
(?S
instructions )
(?s
instructions )
(?S)
instruction. This instruction acts like a []
instruction, except that every (?s)
instruction within a single (?S)
instruction always chooses the same index.
For instance, (?S(?sab)(?s12))
first matches either a
or b
, and then matches either 1
(but only if the first match was a
) or 2
(but only if the first match was b
). This is equivalent to [(a1)(b2)]
, but it can be useful if there are a lot of instructions between the first and second match, and it can be useful when used with (?r)
.
The compiler has several modes, which affect the interpretation of various instructions. The compiler mode can be changed using compiler directives. More details on these modes are in later sections.
"
switches to quoted mode, where most special characters are treated like normal characters. In quoted mode, "
switches to normal mode.
{
switches to binary mode, where symbols are entered as hex values.
}
switches to normal mode, where symbols are entered as characters.
`
switches to write mode.
'
switches to read mode.
#
switches to modeless mode, where each statement can read and write, and where the direction is specified per-instruction.
<
switches to left mode.
>
switches to right mode.
|
switches to stop mode, where the cursor doesn't move.
^
toggles the inversion mode.
[^abc]
won't work, because that means (not a
) or (not b
) or (not c
), and any character matches at least one of those options. Instead, use a sequence of reads in stop mode; for instance, ^|abc^
checks that the character isn't a
, then checks if it isn't b
, then checks if it isn't c
, and the |
means the cursor doesn't move so it checks the same character each time.
(?^characters)
command, which does that for you. It's syntactic sugar for ('|^characters^>.)
, except the ^
s always enter and exit inversion mode, respectively, and the last direction change depends on the current direction at the start of the instruction.
(?i)
switches to case-insensitive mode.
(?-i)
switches to case-sensitive mode.
\
. It can only be changed when in binary mode (in normal mode, you can instead use {x}
, {u}
, and {U}
, which temporarily switch to binary mode). It can be two-digit, four-digit, or six-digit, and is initially two-digit.
x
switches to two-digit mode.
u
switches to four-digit mode.
U
switches to six-digit mode.
(?u)
switches to UTF-8 mode.
(?-u)
switches to Latin-1 mode.
Directives run at compile-time and affect any characters after them in the source
file, regardless of whether or when the directive would be executed at runtime; for instance, ([ab`]c)
is equivalent to ([ab]`c)
(both choose between two options, reading a
or b
, and then write c
), and not equivalent to [(ac)(bc)(`c)]
. Version 2 of TurExp eliminates run-time mode, which would change this behavior.
Direction compiler directives cannot be used in modeless mode.
symbols → symbol | .
Symbols are used for reading and writing. The syntax for a symbol depends on the character mode.
symbol → character - reserved-character - whitespace
| \
(character - hex-digit)
| \
hex-digits
reserved-character → one of
. \ " { } ` ' # > < | ( ) [ ] * + ? ^
A normal character or a character preceded by a backslash stands for its ASCII value. A backslash followed by a hex literal is the value of the literal. The number of digits in the hex literal is determined by the digits mode.
symbol → hex-digits
| \
character
Like normal mode, but the meaning of backslash is reversed. An unrecognized character is a syntax error.
symbol → character - "
Each character stands for its ASCII value.
symbols → (?r
symbol symbol? symbol )
In any situation where a symbol is expected, a range can be used instead (except within another range, or in quoted mode). These are notated as (?r
, the first character, an optional second character, and then the last character, and )
. For instance, (?raz)
is equivalent to abcdefghijklmnopqrstuvwxyz
, and (?r028)
is equivalent to 02468
(skipping the odd numbers because the second character is two characters after the first character). Characters use the same syntax that they normally would (so backslashes are necessary to escape reserved characters, and they're entered in hex when hex mode is active).
A range is equivalent to the corresponding sequence of characters, whatever that means in the context where it's used. For instance, ['(?r15)]
matches any of the characters 1
, 2
, 3
, 4
, or 5
, whereas (`(?r15))
writes the string 12345
. Modeless mode treats ranges a bit differently.
A range cannot be directly followed by *
, +
, or ?
, and can't be at the top level, without being in some sort of parentheses or brackets.
The meaning of a simple instruction depends on the current RW mode.
simple-instruction → symbols
Reads a symbol from the tape. If the inversion mode is non-inverted, the symbol on the tape must be the same as the one specified in the instruction; if the inversion mode is inverted, it must be different. Otherwise, the current fork fails. If the case mode is case-insensitive, then the match is case-insensitive (regardless of whether the symbol was specified as a character or a hex code). Afterwards, moves in the current direction.
If a dot (.
) is specified instead of a symbol, skips over the current symbol without failing, regardless of inversion mode.
simple-instruction → symbols
Writes the specified symbol, then moves in the current direction. If a dot is specified instead of a symbol, just moves in the current direction. Inversion mode and case mode are ignored.
simple-instruction → symbols symbols direction
direction → <
| >
| |
Reads the first symbol (possibly inverted), writes the second, and then moves in the specified
direction. Compiler directives cannot be between the three parts of the instruction. .
can be specified for either or both symbols to skip reading or writing. The direction can be specified even in quoted mode, and it can be preceded by whitespace even in quoted mode.
If a range is specified for either symbol, the entire instruction is repeated. For instance, (?r13)4>
is equivalent to 14>24>34>
. If ranges are specified for both symbols, then the ranges must be the same length, and corresponding elements are used. For instance, [#(?r08)(?r19)>90>]
will add 1 to the current digit, wrapping around if it's 9.
macro-char → character - ("
.
(
)
[
]
\
)
compiler-directive → (?M
macro-char macro-char* .
instruction* )
instruction → (?m
macro-char instruction* )
Macros can be defined using (?M)
, which can appear anywhere a compiler directive can, and can be called with (?m)
.
For a simple macro with no arguments, it can be defined as (?M name . instructions)
(notice there's a dot there), where name is any single character except quotation mark, backslash, dot, parentheses, or brackets. It can be used with (?m name)
.
For arguments, when defining the macro, put the parameter names between the macro name and the dot (parameter names must also be single characters). If there's a parameter named x
, then you can use it with (?mx)
. To pass arguments, simply put them after the macro name when calling it. For instance, the macro (?M2x.(?mx)(?mx))
performs any instruction you give it as an argument twice; once this is defined, (?m2a)
expands into aa
.
If a macro is passed more arguments than it expects, then all of the remaining arguments go into whatever the last parameter is. If a macro is passed fewer arguments than it expects, then remaining parameters will expand to nothing. If a macro that doesn't expect any arguments is passed arguments, those arguments are ignored.
If a macro or a macro argument expands to multiple instructions, the interpretation of those multiple instructions depends on context; for instance, if it's in a sequence instruction, then the different instructions are executed in sequence, if it's in a fork instruction, then each instruction becomes a different fork, etc. If a macro expands to multiple instructions, then it's an error for the (?m)
instruction to be immediately followed by a postfix operator or to be at the top level.
Compiler directives take effect before macro expansion. For instance, (?Mfx.'(?mx))`(?mfa)
writes a
rather than reading it, because write mode (`
) is active where the a
is. Compiler directives do not count as macro arguments, so (?mf`a)
has one argument, not two.
Macros can only be called after they are fully defined. This implies, among other things, that they can't be recursive.
reserved-character | → | one of . \ " { } ` ' # > < | ( ) [ ] * + ? ^ | |
symbol | → | character - reserved-character - whitespace | normal mode |
| | \ (character - hex-digit) | normal mode | |
| | \ hex-digit{n} | normal mode; n depends on the digits mode | |
| | hex-digit{n} | binary mode; n depends on the digits mode | |
| | \ character | binary mode | |
| | character - " | quoted mode | |
symbols | → | symbol | |
| | . | ||
| | (?r symbol symbol? symbol ) | version 3 | |
simple-instruction | → | symbols | in read or write mode |
| | symbols symbols (< | > | | ) | in modeless mode | |
macro-char | → | character - (" . ( ) [ ] \ ) | |
compiler-directive | → | one of " { } ` ' # > < | ^ | |
| | one of x u U | version 3, binary mode | |
| | (? (i | u )* (- (i | u )+)? ) | version 3, must have at least one i or u
| |
| | (?M macro-char macro-char* . instruction* ) | version 3 | |
instruction | → | simple-instruction | |
| | compiler-directive | ||
| | ( instruction* ) | ||
| | [ instruction* ] | ||
| | instruction (* | + | ? ) | limitations on instruction | |
| | (?^ instruction* ) | version 3 | |
| | (?? instruction* ) | version 3 | |
| | (?S instruction* ) | version 3, not inside (?S...)
| |
| | (?s instruction* ) | version 3, only inside (?S...)
| |
| | (?m macro-char instruction* ) | version 3, macro-char must be defined | |
| | (?t (char - " ( ) [ ] \ | " (char - " )* " )* ) | version 3 | |
| | (?$) | version 3 | |
program | → | compiler-directive* instruction | version 2 |
| | ((?v3) )? instruction* | version 3, must expand to 1 instruction |
Some instructions expand to different numbers of instructions: simple instructions with (?r)
expand to one instruction per item in the range, (?m)
can expand to any number of instructions depending on the macro definition, and compiler directives expand to 0 instructions. Postfix operators can only be used on instructions that expand to exactly 1 instruction, and the top-level must have exactly 1 instruction after expansion. Expansion takes place before determining which macro argument is associated with which parameter, and before deciding the probabilities for (??)
instructions.
Additionally, two postfix operators can't occur in a row.
Version 2 is not backwards-compatible with version 1. The differences:
*
, +
, and ?
go before the instructions they modify in version 1, and after them in version 2.
!
, which was removed in version 2.
In version 1, there was a mode mode, which was initially compile-time but could be switched to runtime using the !
directive (but it couldn't be switched back). In runtime mode, all compiler directives would act like statements, affecting instructions that are executed after the directive executes, so for instance (['`]a)
would either read a
or write a
(equivalent to (['a][`a])
).
The main issue with this was that "
was a compiler directive, which meant that which characters were quoted could change depending on how the program runs. For instance, if (foo"bar)?
executes, then the ending parentheses and question mark don't end the instruction and are instead treated as normal characters, whereas if it doesn't, that's the end of the instruction and a question mark to indicate the instruction is optional.
This made things harder to implement, so I first changed the postfix operators to prefix so at least the compiler would know up front if there's a situation like the one I just described, and then got rid of runtime mode entirely (and changed the operators back to postfix), given that it's optional and I expected people (or at least me) to just choose compile-time mode anyways.
Version 3 attempts to be backwards-compatible with version 2 (aside from some Unicode-related issues). The new instruction types generally start with (?
, a sequence of characters that was always a syntax error in version 2: (?i)
(case), (?u)
(Unicode), (?M)
/(?m)
(macros), (?^)
(negation), (??)
(random), (?S)
/(?s)
(same), (?t)
(trace), (?$)
(accept).
Of those, (?M)
/(?m)
(macros), (?^)
(negation), and (?S)
/(?s)
(same) are fairly easy to mechanically translate into version 2 code. (?$)
is trickier, (?i)
requires Unicode knowledge, and (?u)
(Unicode), (??)
(random), and (?t)
(trace) do things that can't be done in version 2.
Also, in version 2, compiler directives can't come after the main instruction.
This assumes a right-infinite tape (going to the left of the beginning of the tape is undefined behavior), and doesn't support input. Input is assumed to be empty; output is in hexadecimal. (This is because actual ASCII output was harder in version 2; the version 3 translation shown here also uses hexadecimal so you can see how version 3 instructions are shortcuts for version 2 instructions.)
Template:
(
#
..<\00.<\000<\000>
commands go here
'^\00*^\00
)
Commands:
Brainfuck | TurExp (v2) | TurExp (v3) |
---|---|---|
+
| [01|12|23|34|45|56|67|78|89|9a|ab|bc|cd|de|ef|
| [(?r08)(?r19)|9a|(?rae)(?rbf)|
|
-
| [fe|ed|dc|cb|ba|a9|98|87|76|65|54|43|32|21|10|
| [(?rfb)(?rea)|a9|(?r91)(?r80)|
|
<
| ..>..>
| ..>..>
|
>
| ..<..<[^\00.|^(\000<.0>)]
| ..<..<[^\00.|^(\000<.0>)]
|
[
| ([^0.|^(0.<^0.>^)]
| ([^0.|^(0.<^0.>^)]
|
]
| )*0.<0.>
| )*0.<0.>
|
.
|
|
|
Or, if you're using version 3, then you can define macros like this:
#
(?M+. [(?r08)(?r19)|9a|(?rae)(?rbf)|(f0<[(?r08)(?r19)>9a>(?rae)(?rbf)>f0>])])
(?M-. [(?rfb)(?rea)|a9|(?r91)(?r80)|(0f<[(?rfb)(?rea)>a9>(?r91)(?r80)>0f>])])
(?M<. ..>..>)
(?M>. ..<..<[^\00.|^(\000<.0>)])
(?MLi. ([^0.|^(0.<^0.>^)](?mi))*0.<0.>)
(?Mp. ..<
(?S(?s(?r09)!>(?raf)!>)'^\00*^\00^\00*^#(?s\00(?r09)<\00(?raf)<)^!.<*^(?s!(?r09)>!(?raf)>))
(?S(?s(?r09)!>(?raf)!>)'^\00*^\00^\00*^#(?s\00(?r09)<\00(?raf)<)^!.<*^(?s!(?r09)|!(?raf)|)))
(
..<\00.<\000<\000>
commands go here
'^\00*^\00
)
and then +
is (?m+)
, -
is (?m-)
, <
is (?m<)
, >
is (?m>)
, [
is (?mL
, ]
is )
, and .
is (?mp)
.
Note that this reverses left and right; this is to make it easier for output to be in the correct order (the memory extends infinitely to the left, and the output infinitely to the right, separated by a null character).
See below for descriptions of the version 2 examples
Interpreter written in Haskell (updated 2020; source only, download and run; 56K compressed, 412KiB uncompressed; also includes T-Write)
Original interpreter written in Haskell (not compatible with modern versions of GHC; source only; download and run; 49K compressed; also includes T-Write)
The interpreter can either be run interactively, by simply typing ./turexp
from the command prompt, or with command-line arguments, by typing ./turexp filename input
. If only filename is provided, the input will be empty. The -a
option can also be used to show all possible outputs rather than just one.
The following examples are included with the newer version of the interpreter:
101+100
and are interpreted in little-endian order (i.e., backwards from how we normally write numbers).
a
nb
nc
n with n ≥ 1, i.e., if the input some number of a
's followed by the same number of b
's and then the same number of c
's. This is an example of a problem that neither finite automata/regular expressions (in the computer science theory sense) nor push-down automata/context-free grammars can solve but Turing machines can.
[^a]a[^abc].
, i.e., a character that's not a
, then the letter a
, then a character that's not a
, b
, or c
, then any character (but demonstrating that [^abc]
doesn't work like it does in regular expressions).