RE-type (esoteric programming language)

Lexiness

Each character in the program's source code is either treated as a literal character or a special character. Unless otherwise specified, when the syntax specifies a specific character, the character must be a special character, and when the syntax specifies that any character is allowed, the character must be a literal character.

By default, the characters space, tab, carriage return, linefeed, and # " \ . ^ $ - { ? * + : ` ' < > ( ) [ ] are treated as special characters, and all other characters are treated as literal characters. If a \ is followed by a non-alphanumeric character, that character is is treated as a literal character (so e.g. \+ is a literal +).

" enters or exits quoted mode. In quoted mode, all characters except " and \ are treated as literal characters, and a backslash followed by a character that would normally be a special character causes it to be treated as a special character. Quoted strings can overlap syntactic constructs, so something like (a"bc\)d" is valid and equivalent to (abc)d.

When treated as special characters, whitespace is ignored and # starts a comment until the end of the line. Whitespace is not necessary to separate tokens.

There are also some character escapes, which always result in literal characters:

\\ and \" always result in a literal \ and ", regardless of quoted mode.
\t, \r, and \n represent tab, carriage return, and linefeed characters
\x followed by two hex digits, \u followed by four hex digits, or \U followed by six hex digits, represent any character in Latin-1, the Basic Multilingual Plane, or all of Unicode, respectively. (Note that \U is followed by six hex digits, not eight, since the maximum Unicode character is \U10ffff.)

The escapes \V, \T, and \L are used for other purposes. Any other alphanumeric escape is reserved for future use.

Types

There are two kinds of types: string types and function types.

String types

There are many different string types, which differ in what strings they can represent. A value of one string type can be used where a different string type is expected when the set of strings that can be represented by the type you have is a subset of the set of strings that can be represented by the type that's expected.

The string types are:

Single-character (any literal character, e.g. a or \.): A type with only one value. There is one such type for each character.
Sequence type ((type1 type2 type3)): Any string that can be formed by concatenating a string of type1, a string of type2, and a string of type3. Any number of types can be specified, including 0; () is a type whose only value is the empty string.; This is equivalent to just putting things next to each other in a regular expression, but in RE-type, sequence types must always be enclosed in parentheses.
Alternative/union type ([type1 type2 type3]): Can have any value from type1, type2, type3, etc. Any number of types can be specified, although [] (0 types) can't have any values.; This is equivalent to the | operator in typical regular expressions; something like foo|bar in a Perl-style regular expression would be written like [(foo)(bar)] in RE-type.
Option type (type?): Can have any value from type or the empty string. Mostly equivalent to [()type].
Repetition types (type{n}, type{n,}, type{n,m}): Any string that can be formed by concatenating some number of strings of type. For {n}, there must be exactly n strings; for {n,}, at least n; for {n,m}, at least n and at most m. Within braces, special vs. literal characters don't matter. type{0,} (0 or more) can be abbreviated type*; type{1,} (1 or more) can be abbreviated type+.
Range types (char1-char2): Any single character within the specified range; specifically, any character whose character code is at least that of char1 and at most that of char2. . is a shorthand for \x00-\U10ffff (any character, including newlines).
Negated character classes ([^characters]): Any single character other than the ones listed. Both single characters and ranges can be listed.

String types only store strings; they don't have any sort of memory of how they were constructed. For example, the string a;b;c of type (.*;.*) is always interpreted as (a;b, ;, c), even if it was constructed by concatenating a of type .*, ; of type ;, and b;c of type .*. The rules for determining which interpretation is used are as follows:

Alternative types always prefer earlier alternatives over later alternatives.
For sequence types, if there are any interpretations where the first value is different, the one with the most-preferred first value is preferred. Otherwise, the one with the most-preferred second value is preferred, then the third, etc.
Repetition types work similarly to sequence types, but ending the repetition is considered the least-preferred option by default. It's possible to change this so that ending the repetition is the most-preferred option; for numeric repetitions, add a ? inside the braces (e.g. .{0,5?}); for * and +, use {*?} and {+?}, respectively.
For option types, nonempty is preferred, but this can be changed by using {??} instead of ?.

Function types

Functions are first-class; functions can take functions as arguments and have functions as their result. However, there's no way to make a function that can take either a function or a string as an argument, or have either a function or a string as its result.

The function type taking type1 as an argument and returning type2 is written as <type1 type2>. If more than two types are specified, then it represents the type of a function taking multiple arguments, which are curried; that is, <type1 type2 type3> is a function that takes two arguments (type1 and type2), and is equivalent to <type1 <type2 type3>>.

If one type is specified, <type> is equivalent to type. This is sometimes useful for disambiguation; single characters and single characters repeated a fixed number of times are by default interpreted as expressions in contexts where that makes sense, so if you want those as types, you can enclose them with angle brackets.

Expressions

String expressions

Any single literal character is an expression that returns that character. Its type is a single-character type.

`expr1 expr2 expr3' concatenates the results of the expressions. Any number of expressions can be specified; zero expressions (`') results in the empty string. Its type is a sequence type made from the types of the expressions.

expr{n}, where n is a natural number, concatenates n copies of the result of expr. Its type is a repetition type. If expr is a function, then the result of that function is repeated n times.

expr: examines strings, and is described in its own section.

Let and variable expressions

A let expression has the form type name: expr1 expr2. This creates a variable of type type with name name whose value is expr1, and then evaluates expr2 with that variable in scope. The variable is not in scope in expr1, so this can't be used for recursion.

The name is any single literal character. (This means that there can be at most 1,112,064 variables in scope at a time (0x10ffff minus 0x800 for the surrogates).) Variables can shadow other variables with the same name.

The variable can be accessed with $name.

A typedef is similar, but with \T instead of a type: \T name: type expr allows type to be represented as $name in expr.

Function expressions

A lambda expression has the form type name expr. This makes a function that takes one parameter of type type named name, which computes the expression expr. Like for let expressions, name is a single literal character, and the variable can be accessed with $name.

If ^ is used in place of the type, then the type is inferred from context.

Function application has the form func arg.

Precedence for function application and lambda expressions, as well as disambiguating between function application and concatenation, is determined automatically based on how many and what type of argument each function expects. There's no way to override this. Postfix operators ({n} and :) always have higher precedence than function application, but since applying them to functions applies them to the result of the function, if you want, say, ($f $x){5}, you can write $f{5} $x instead.

Immediately applying a lambda expression is not allowed; use a let expression instead.

Logging expressions

\L character expr logs a thing, but otherwise behaves like expr. Supported characters are:

E (expecting): at compile time, logs what type and how many arguments are expected here.
T (type): at compile time, logs the type of expr.
S (strings): at run time, logs the value of all string variables in scope
V (verbose): at run time, logs detailed information about all variables in scope

Any other character logs that character, on its own line, along with the line number of the \L expression, although other uppercase letters may be used in future versions of the language for other things.

Note that this language makes no guarantees about laziness/strictness or what is evaluated when. \L is not intended to be (intended not to be) a good, reliable way of producing output; it's only meant as a testing/debugging feature.

Colon expressions

expr: results in a different function depending on the type of expr.

Some types are considered "constant types". A constant type is defined as either a single character type or a sequence type made up entirely of constant types (including the empty sequence type). (Constant types are all types with only one possible value, but not all types with only one possible value are constant types.) Any time one of these functions would pass a value of a constant type to a function you provide, it instead doesn't pass an argument. This is so you don't have to deal with parameters that you're probably going to discard because they can already be inferred from the type.

In the following, a, b, and c are string types that are inferred based on the parameters to $x:.

Type of `$x`	Type of `$x:`	Description
`(x y z)`	`<<x y z a> [a]>`	For a sequence type, `$x:` takes a single function, passes that function each element of the sequence that isn't a constant type, and then returns the result of that function. (Note that the result is an alternative type with a single possibility; this is for consistency and to prevent ambiguity in certain edge cases.)
`[x y z]`	`<<x a> <y b> <z c> [a b c]>`	For alternative types, `$x:` takes one parameter for each option, which is either a function or a string depending on whether the type is a constant. It chooses the parameter corresponding to which option was chosen, and passes `$x` as a parameter. `x?` is treated as equivalent to `[() x]`, with the empty-string option being first, regardless of whether the empty-string option is preferred.
`x?`	`<a <x b> [a b]>`
`x{0,m}`, `x*`	`<a <[a b] x b> [a b]>`	For repetition types, `$x:` takes two parameters and loops over the items, explained in more detail below.
`x{n,m}`, `x+`	`<<x a> <[a b] x b> [a b]>`
`x-y`, `[^xy-z]`	`<a <[a b] b> [a b]>` `<a <[a b c] b> <[a b c] c> [a b c]>` `<a <[a b c d] b> <[a b c d] c> <[a b c d] d> [a b c d]>`	For range and negated character class types, `$x:` takes two or more parameters (depending on the number of possible characters) and loops a number of times determined by which character it is. Explained in more detail below.
`<x y>`	`<x y:>`	For functions, applying the colon operator gives a function that does what the original function did, then applies the colon operator to the result.
any string type	`<.* $x>` `<.* .* $x>` etc.	Applying the colon operator to a string type gives a function that takes one or more parameters and returns a value of that type depending on the number of characters in the parameter(s). Explained in more detail below.

Loops

Applying the colon operator to a repetition type or a character class type creates a loop.

If $x is a repetition type whose minimum is zero (including *), then the first argument to $x: specifies an initial value, and the second argument is a function that takes the value from the previous iteration of the loop and the next string from the repetition (if it's not a repetition of a constant type) and returns the next value. This is basically the same as Haskell's foldl or Python's reduce with an initial value specified.

If $x is a repetition type whose minimum is not zero (including +), then the first argument to $x: is a function that takes the first string in the repetition and returns an initial value. The second argument is then called for each value starting from the second. If the first argument is the identity function, this is basically the same as Haskell's foldl1 or Python's reduce with no initial value.

If the pattern that's repeated can match the empty string and has no maximum number of matches, then it can match the empty string an infinite number of times. In that case, the second argument is called until it reaches a fixed point, after which it'll move onto the next string or end the loop. This is the only way to create an unbounded loop in RE-type.

If $x is a range type or a negated character class type with 256 or fewer possible characters, then $x: takes two arguments, where the first argument provides an initial value for the loop, and the second argument is a function that takes the current value and returns the next value. If $x is the first possible character (based on Unicode character code), then the initial value is used; if it's the second, the function applied once to the initial value is used; if it's the third, the function applied twice to the initial value is used, etc.

If $x is a range type or negated character class type with between 257 and 65536 possible values, then $x: takes three arguments; if more than 65536, then four arguments. The first argument specifies the initial value, and the rest of the arguments are called a number of times that depends on the digits in the base-256 representation of which character it is. For instance, if there are a maximum of 0x800 (so ≤ 65536) possible values for $x and $x is value 0x203 (when starting from zero), then $x: $a $b $c is equal to $c $c $c $b $b $a.

If $x is of type ., then $x: can be used to compute the character code of $x; however, since characters \ud800-\udfff aren't considered valid characters (and therefore not valid values of the type .), the character codes of all characters from \ue000 onwards will all be off by 0x800.

In all loops, the type of the parameters that take the value from the previous iteration can't be inferred with ^.

Type colon

You can also put a type before the colon operator. If $x is a type, then $x: is a function that takes some number of arguments and returns a value of type $x.

For types with 256 or fewer possible values, and for types with an infinite number of possible values, $x: $a gives the nth value (zero-based) of type $x, where n is the length of $a. If $x has more than 256 possible values but not more than 65536, $x: $a $b gives the nth value where n = length of $a × 256 + length of $b. For more possible values, more arguments are used, with the lengths of the arguments being interpreted as the digits in a base-256 number. It's a runtime error n is too high.

The order used is as follows:

Range and inverted character class: ordered by Unicode character code
Sequence: lexicographical order
Alternative: all the possible options for the first alternative, then all the possible options for the second alternative, etc.
Optional: empty string, then all possible options for the type
Repetition: all possible options for repeating zero times, then all possible options for repeating once, etc.

For some types, the same string can be accessed in two or more ways; there's no attempt to remove these redundant strings.

.: can be used to convert from a character code to a character, but, like when converting the other way, since characters \ud800-\udfff aren't considered valid, the character codes from \ue000 onwards will all be off by 0x800.

RE-type (version 2)

Contents

General stuff