RE-type (version 2)

RE-type is a pure functional esoteric programming language where regular expressions are types.

This is based on a programming language I started around the time I made TurExp, so it uses a similar syntax for regular expressions. I never finished that version because type checking was too slow and I couldn't figure out how to fix it; it turned out this was because type checking that language was NP-hard. This language solves that problem by using stochastic type checking; that is, programs with type errors have some chance of running anyways.

Contents

General stuff

A program is a function from strings (type .*) to strings. The input is read before the program starts, and then passed to this function; the result of this function is the program's output. Both the input and output can include line breaks.

All strings, input, output, and code are Unicode. This language does not consider the surrogate characters \ud800-\udfff to be valid characters; the only way to get these characters into strings is if there are unmatched surrogates in the code or input, and if there are, the behavior of the program is undefined.

A program can optionally start with a version specifier, either \Vretype2 or \Vretype2d. If this is specified, it must be the first thing in the program other than comments and whitespace. The version specifier must be followed by a newline.

The d in the second option stands for dynamic typing, and it disables the stochastic type checker. In either case, type mismatches are caught at runtime. Also, this only affects mismatches between different string types; mismatches between a string type and a function type are caught deterministically at compile time no matter what.

Lexiness

Each character in the program's source code is either treated as a literal character or a special character. Unless otherwise specified, when the syntax specifies a specific character, the character must be a special character, and when the syntax specifies that any character is allowed, the character must be a literal character.

By default, the characters space, tab, carriage return, linefeed, and # " \ . ^ $ - { ? * + : ` ' < > ( ) [ ] are treated as special characters, and all other characters are treated as literal characters. If a \ is followed by a non-alphanumeric character, that character is is treated as a literal character (so e.g. \+ is a literal +).

" enters or exits quoted mode. In quoted mode, all characters except " and \ are treated as literal characters, and a backslash followed by a character that would normally be a special character causes it to be treated as a special character. Quoted strings can overlap syntactic constructs, so something like (a"bc\)d" is valid and equivalent to (abc)d.

When treated as special characters, whitespace is ignored and # starts a comment until the end of the line. Whitespace is not necessary to separate tokens.

There are also some character escapes, which always result in literal characters:

The escapes \V, \T, and \L are used for other purposes. Any other alphanumeric escape is reserved for future use.

Types

There are two kinds of types: string types and function types.

String types

There are many different string types, which differ in what strings they can represent. A value of one string type can be used where a different string type is expected when the set of strings that can be represented by the type you have is a subset of the set of strings that can be represented by the type that's expected.

The string types are:

Single-character (any literal character, e.g. a or \.)
A type with only one value. There is one such type for each character.
Sequence type ((type1 type2 type3))
Any string that can be formed by concatenating a string of type1, a string of type2, and a string of type3. Any number of types can be specified, including 0; () is a type whose only value is the empty string.
This is equivalent to just putting things next to each other in a regular expression, but in RE-type, sequence types must always be enclosed in parentheses.
Alternative/union type ([type1 type2 type3])
Can have any value from type1, type2, type3, etc. Any number of types can be specified, although [] (0 types) can't have any values.
This is equivalent to the | operator in typical regular expressions; something like foo|bar in a Perl-style regular expression would be written like [(foo)(bar)] in RE-type.
Option type (type?)
Can have any value from type or the empty string. Mostly equivalent to [()type].
Repetition types (type{n}, type{n,}, type{n,m})
Any string that can be formed by concatenating some number of strings of type. For {n}, there must be exactly n strings; for {n,}, at least n; for {n,m}, at least n and at most m. Within braces, special vs. literal characters don't matter. type{0,} (0 or more) can be abbreviated type*; type{1,} (1 or more) can be abbreviated type+.
Range types (char1-char2)
Any single character within the specified range; specifically, any character whose character code is at least that of char1 and at most that of char2. . is a shorthand for \x00-\U10ffff (any character, including newlines).
Negated character classes ([^characters])
Any single character other than the ones listed. Both single characters and ranges can be listed.

String types only store strings; they don't have any sort of memory of how they were constructed. For example, the string a;b;c of type (.*;.*) is always interpreted as (a;b, ;, c), even if it was constructed by concatenating a of type .*, ; of type ;, and b;c of type .*. The rules for determining which interpretation is used are as follows:

Function types

Functions are first-class; functions can take functions as arguments and have functions as their result. However, there's no way to make a function that can take either a function or a string as an argument, or have either a function or a string as its result.

The function type taking type1 as an argument and returning type2 is written as <type1 type2>. If more than two types are specified, then it represents the type of a function taking multiple arguments, which are curried; that is, <type1 type2 type3> is a function that takes two arguments (type1 and type2), and is equivalent to <type1 <type2 type3>>.

If one type is specified, <type> is equivalent to type. This is sometimes useful for disambiguation; single characters and single characters repeated a fixed number of times are by default interpreted as expressions in contexts where that makes sense, so if you want those as types, you can enclose them with angle brackets.

Expressions

String expressions

Any single literal character is an expression that returns that character. Its type is a single-character type.

`expr1 expr2 expr3' concatenates the results of the expressions. Any number of expressions can be specified; zero expressions (`') results in the empty string. Its type is a sequence type made from the types of the expressions.

expr{n}, where n is a natural number, concatenates n copies of the result of expr. Its type is a repetition type. If expr is a function, then the result of that function is repeated n times.

expr: examines strings, and is described in its own section.

Let and variable expressions

A let expression has the form type name: expr1 expr2. This creates a variable of type type with name name whose value is expr1, and then evaluates expr2 with that variable in scope. The variable is not in scope in expr1, so this can't be used for recursion.

The name is any single literal character. (This means that there can be at most 1,112,064 variables in scope at a time (0x10ffff minus 0x800 for the surrogates).) Variables can shadow other variables with the same name.

The variable can be accessed with $name.

A typedef is similar, but with \T instead of a type: \T name: type expr allows type to be represented as $name in expr.

Function expressions

A lambda expression has the form type name expr. This makes a function that takes one parameter of type type named name, which computes the expression expr. Like for let expressions, name is a single literal character, and the variable can be accessed with $name.

If ^ is used in place of the type, then the type is inferred from context.

Function application has the form func arg.

Precedence for function application and lambda expressions, as well as disambiguating between function application and concatenation, is determined automatically based on how many and what type of argument each function expects. There's no way to override this. Postfix operators ({n} and :) always have higher precedence than function application, but since applying them to functions applies them to the result of the function, if you want, say, ($f $x){5}, you can write $f{5} $x instead.

Immediately applying a lambda expression is not allowed; use a let expression instead.

Logging expressions

\L character expr logs a thing, but otherwise behaves like expr. Supported characters are:

Any other character logs that character, on its own line, along with the line number of the \L expression, although other uppercase letters may be used in future versions of the language for other things.

Note that this language makes no guarantees about laziness/strictness or what is evaluated when. \L is not intended to be (intended not to be) a good, reliable way of producing output; it's only meant as a testing/debugging feature.

Colon expressions

expr: results in a different function depending on the type of expr.

Some types are considered "constant types". A constant type is defined as either a single character type or a sequence type made up entirely of constant types (including the empty sequence type). (Constant types are all types with only one possible value, but not all types with only one possible value are constant types.) Any time one of these functions would pass a value of a constant type to a function you provide, it instead doesn't pass an argument. This is so you don't have to deal with parameters that you're probably going to discard because they can already be inferred from the type.

In the following, a, b, and c are string types that are inferred based on the parameters to $x:.

Type of $xType of $x:Description
(x y z) <<x y z a> [a]> For a sequence type, $x: takes a single function, passes that function each element of the sequence that isn't a constant type, and then returns the result of that function. (Note that the result is an alternative type with a single possibility; this is for consistency and to prevent ambiguity in certain edge cases.)
[x y z] <<x a> <y b> <z c> [a b c]> For alternative types, $x: takes one parameter for each option, which is either a function or a string depending on whether the type is a constant. It chooses the parameter corresponding to which option was chosen, and passes $x as a parameter. x? is treated as equivalent to [() x], with the empty-string option being first, regardless of whether the empty-string option is preferred.
x? <a <x b> [a b]>
x{0,m}, x* <a <[a b] x b> [a b]> For repetition types, $x: takes two parameters and loops over the items, explained in more detail below.
x{n,m}, x+ <<x a> <[a b] x b> [a b]>
x-y, [^xy-z] <a <[a b] b> [a b]>
<a <[a b c] b> <[a b c] c> [a b c]>
<a <[a b c d] b> <[a b c d] c> <[a b c d] d> [a b c d]>
For range and negated character class types, $x: takes two or more parameters (depending on the number of possible characters) and loops a number of times determined by which character it is. Explained in more detail below.
<x y> <x y:> For functions, applying the colon operator gives a function that does what the original function did, then applies the colon operator to the result.
any string type <.* $x>
<.* .* $x>
etc.
Applying the colon operator to a string type gives a function that takes one or more parameters and returns a value of that type depending on the number of characters in the parameter(s). Explained in more detail below.

Loops

Applying the colon operator to a repetition type or a character class type creates a loop.

If $x is a repetition type whose minimum is zero (including *), then the first argument to $x: specifies an initial value, and the second argument is a function that takes the value from the previous iteration of the loop and the next string from the repetition (if it's not a repetition of a constant type) and returns the next value. This is basically the same as Haskell's foldl or Python's reduce with an initial value specified.

If $x is a repetition type whose minimum is not zero (including +), then the first argument to $x: is a function that takes the first string in the repetition and returns an initial value. The second argument is then called for each value starting from the second. If the first argument is the identity function, this is basically the same as Haskell's foldl1 or Python's reduce with no initial value.

If the pattern that's repeated can match the empty string and has no maximum number of matches, then it can match the empty string an infinite number of times. In that case, the second argument is called until it reaches a fixed point, after which it'll move onto the next string or end the loop. This is the only way to create an unbounded loop in RE-type.

If $x is a range type or a negated character class type with 256 or fewer possible characters, then $x: takes two arguments, where the first argument provides an initial value for the loop, and the second argument is a function that takes the current value and returns the next value. If $x is the first possible character (based on Unicode character code), then the initial value is used; if it's the second, the function applied once to the initial value is used; if it's the third, the function applied twice to the initial value is used, etc.

If $x is a range type or negated character class type with between 257 and 65536 possible values, then $x: takes three arguments; if more than 65536, then four arguments. The first argument specifies the initial value, and the rest of the arguments are called a number of times that depends on the digits in the base-256 representation of which character it is. For instance, if there are a maximum of 0x800 (so ≤ 65536) possible values for $x and $x is value 0x203 (when starting from zero), then $x: $a $b $c is equal to $c $c $c $b $b $a.

If $x is of type ., then $x: can be used to compute the character code of $x; however, since characters \ud800-\udfff aren't considered valid characters (and therefore not valid values of the type .), the character codes of all characters from \ue000 onwards will all be off by 0x800.

In all loops, the type of the parameters that take the value from the previous iteration can't be inferred with ^.

Type colon

You can also put a type before the colon operator. If $x is a type, then $x: is a function that takes some number of arguments and returns a value of type $x.

For types with 256 or fewer possible values, and for types with an infinite number of possible values, $x: $a gives the nth value (zero-based) of type $x, where n is the length of $a. If $x has more than 256 possible values but not more than 65536, $x: $a $b gives the nth value where n = length of $a × 256 + length of $b. For more possible values, more arguments are used, with the lengths of the arguments being interpreted as the digits in a base-256 number. It's a runtime error n is too high.

The order used is as follows:

For some types, the same string can be accessed in two or more ways; there's no attempt to remove these redundant strings.

.: can be used to convert from a character code to a character, but, like when converting the other way, since characters \ud800-\udfff aren't considered valid, the character codes from \ue000 onwards will all be off by 0x800.

Interpreter