C// transpiler to Rust

Instructions

Dependencies

You will need rustc and cargo installed. This program has no dependencies on external crates other than std.

You will also need something that can extract gzipped tar files.

Compiling and testing the transpiler

The download link is at the bottom of the page.

cd into the directory, then type cargo build.

You can run the unit tests with cargo test, and run all the included test programs by typing ./do-test. You can run a specific test by typing ./do-test name-of-file (without the .cdd extension or the test-input/ prefix). This will run four tests per file with different options, putting the output into the test-output-* directories, and compare the files with each other and with test-output-expected/*.txt if the file exists. (For both this and the original version, this only tests for correctness, not optimizedness.)

Running the transpiler

To run, either type

cargo run -- options input-file.cdd > output-file.rs

target/debug/cdd-rs options input-file.cdd > output-file.rs

options includes:

--default-share: Changes the default mode to @share (see below).
--optimize: Actually apply the optimizations. If this is not specified, either every struct value will be owned (and copied even when not necessary), or, if --default-share is supplied, every struct value will be reference counted.
-O0 through -O4: -O0 is the default, and -O4 is the same as --optimize. Anything in between will perform only a subset of the optimization steps, which could be helpful if you're trying to figure out what they do or how they work.

Known language differences from the transpiler to C

The modifiers @share, @own, and @borrow can be specified before struct types (explained below).
The original transpiler and the specification requires that structs be defined before they're used in function parameters and return types. Due to differences in how I decided to structure the code, this requirement isn't enforced by this transpiler. structs still need to be defined before they're used as members in other structs, though (this prevents recursive structs).

In addition, the following behaviors, which are undefined or implementation-defined in the transpiler to C, happen to be defined in this version of the transpiler to Rust (but could change in future versions):

ints are 32-bit.
Integer overflow panics (because that's what happens in Rust).
All local variables are initialized to containing all 0's.

Information about the algorithm

Type modes

Every variable, argument, and return type is assigned a type mode. There are three possible type modes, and (unlike in the transpiler to C) you can explicitly specify one to use (syntax: @own struct struct_name var_name):

Mode	Translates to (in Rust)	Equivalent in the transpiler to C	Notes
`@own`	`T`	`modifyable = true`	cheap to modify, assigning to anything but `@borrow` is expensive unless it's the last use
`@share`	`Rc<T>`	`modifyable = false` `isContained = false`	expensive to modify, assigning to other `@share`s is cheap
`@borrow`	`&T`	`modifyable = false` `isContained = true`	cannot modify

Some more notes:

In the current implementation, @borrow can only be used on function arguments, and variables declared @borrow cannot be modified (either by reassigning them or by assigning to one of their members).
Members of @share variables can be modified, but doing so will (potentially) make a copy.
Since @share variables have a reference count (unlike in the C implementation, where memory just leaks), there's an additional optimization where if the reference count is 1, it won't actually make a copy. This means that only using @share is effectively an alternative solution to the problem of implementing value semantics, namely using runtime reference counting rather than static analysis.

If optimizations are disabled, every variable where the mode isn't specified is marked as @own if --default-share isn't specified on the command line, and @share if it is. If optimizations are enabled (≥ -O2), then --default-share only affects struct members; the mode of everything else is determined by the optimizer.

In addition, each use of a variable is marked as "copy" or "move". If optimizations are disabled, then every variable use is "copy", and there's no way to change that.

The algorithm

For ≥ -O2, all parameters start out @borrow and everything else (except struct members) starts out @share.

Analyze liveness (≥ -O1): Goes through each function backwards, with a mutable set of live variables, which is initially empty. Each time it encounters a use of a non-live variable, it marks that use as "move" and adds it to the set of live variables. Each time it encounters an assignment to a variable or variable declaration, removes it from the list of live variables. This corresponds to the findUsedVariables and analyzeLiveness steps in the C version; however, this version assumes that expressions are evaluated left-to-right (so in f(x, x), the second x could be the last use of x), whereas the C version assumes that evaluation order is undefined (so in f(x, x), neither x is treated as the last use).
Find modifications (≥ -O2):

If a member of a variable is assigned to, or it (or a non-@share member of it) has a "move" use in a context that takes an @own variable, then it's marked as @own.
If a @borrow parameter is assigned to, it's upgraded to @share. (This is why -O2 is required before defaulting to parameters to @borrow; otherwise an implied @borrow parameter could be assigned to.)
If a function only returns @own, then the return type is upgraded to @own.

Corresponds to findModifiedVariables and findTransferOpportunities in the C version.

Find share opportunities (≥ -O3):

If a @borrow variable is used in a context where it takes a @share variable, that variable is upgraded to @share.

Kind of corresponds to setting isContained in the C version, except in that version it just used a simple heuristic (is the parameter's type different from the return type?).

Find unneeded reference counts (≥ -O4):

Every time a @share variable is used in a context that expects a @share, or a @share variable is assigned a value that comes from something else @share, that variable is marked.
Any unmarked @share variables except function parameters are upgraded to @own, so they don't need a reference count (since there's nowhere to take advantage of it).
If a function does not return anything @share, then its return type is upgraded to @own.

Doesn't have an equivalent in the C version. Since shared variables didn't have reference counts in the C version (since it just leaked memory), there aren't any reference counts to optimize away. (That said, a function that e.g. calls a constructor and just returns its result unmodified probably could benefit from this, e.g. make_return in constructor.cdd.)

Fix borrows (≥ -O1): Goes through every function call, and every time an variable is borrowed, further uses of the same variable in the same call are marked as moves. This is because e.g. f(&x, x) could be generated by the liveness analyzer, but is invalid (since the borrow lasts for the duration of the call), so the second x becomes cloned. (This wasn't an issue for the C version, since it would have classified both x's as moves regardless of whether one was borrowed.)

Each of steps 2, 3, and 4 run in a loop until they reach a fixed point, going through the whole program each time. (Functions do only depend on functions they call, not functions that call them, but this means that I don't have to figure out what the actual call graph is like, and can also handle recursion.) Variables can only go from @borrow to @share to @own, and can't go back, which guarantees that this will eventually terminate.

In all cases, if a mode is explicitly specified, then the optimizer won't change it.

After all that, when generating the code, there's a huge match statement with a case for each combination of expected mode, mode of the expression, and whether moving is allowed:

		`@borrow =`…	`@share =`…	`@own =`…
…`= @borrow`		`expr`	`Rc::new(expr.clone())`	`expr.clone()`
…`= @share`	Move	`&*expr`	`expr`	`unwrap_or_clone(expr)`
…`= @share`	Copy	`&*expr`	`Rc::clone(&expr)`	`(*expr).clone()`
…`= @own`	Move	`&expr`	`Rc::new(expr)`	`expr`
…`= @own`	Copy	`&expr`	`Rc::new(expr.clone())`	`expr.clone()`

For comparison, here's what the C implementation does:

		`!modifyable =`…		`modifyable =`…	`struct` member `=`…
		`isContained`	`!isContained`	`modifyable =`…	`struct` member `=`…
…`= !modifyable`		no copy `/shared/`		copy	copy
…`= modifyable`	Function return val	no copy `/borrowed/`	no copy `/return transfer/`
	`isTransfer`		no copy `/transfer/`
	`!isTransfer`		copy
…`=` `struct` member		copy			copy

The differences:

The Rust version distinguishes between assigning from @borrow and @share, and copies in the @share = @borrow case. The C version doesn't even have access to the information to make that distinction. In the C version, borrowed parameters and the equivalent of @share variables can be mixed freely, as long as borrowed parameters can't escape the function; if they can escape the function at all, then they're marked as not borrowed.
The Rust version distinguishes between moving and copying from a @share variable, which the C version doesn't. In the @share = @share case, this is used to determine whether to increment the reference count (which doesn't exist in the C version), and in the @own = @share version this is used to potentially eliminate a copy if the reference count is 1 (which the C version can't do, since there's no reference count).
The C version treats moving from a function return and from a variable as separate cases (though this has the same result as the Rust version).
In the C version, struct members are not pointers, so they're always copied.

Known limitations

Some known limitations with the algorithm as implemented, and things that could potentially be improved:

There's no support for arrays, recursive structures, or anything else that would actually be expensive to copy. That means that I haven't attempted to implement optimizations for just treating small values as non-pointers (since that would apply to everything this program is handling; the point here is that I'm pretending these two-int structs are actually big lists), and this language probably isn't useful for taking actual measurements of how long things take (since the aforementioned optimization would always apply). It also means that I haven't written any code to deal with Boxes, which would probably be necessary to support recursive structures.

(Speaking of Boxes, all the values here are probably directly on the stack, so moving might not actually be faster than copying anyways.)

Each variable can have only one mode. If, say, a variable is modified (@own is helpful) and then shared to more than one other variable (@share is helpful), then either each share makes a copy, or each modification makes a copy; it would be helpful if the variable could start out @own and then switch to @share after the last modification. The included tests choose-an-arg.cdd and rec-read-return.cdd are affected by this limitation.
Borrowing support is extremely limited; there's no way to return a borrowed value or put one into a struct. This means that I don't have to deal with lifetimes other than the lifetime of the current function and the lifetime of the function we're currently calling, but it means that some functions can't borrow when they should be able to.
Looping over the full program until reaching a fixed point probably isn't the most efficient in terms of how long it takes the optimizer to run (it could easily become quadratic in the number of functions if the functions are in the wrong order, and if only one function takes multiple iterations to settle, it'll still waste time looking at all the other functions), and sometimes misses things (e.g. in rec-read-mod-return.cdd, the return value of f is inferred as @share because of the recursive call to itself, which returns @share). This seems like the sort of thing that could be reduced to some known problem (constraint something? something optimization? something graph theory?), but I'm not sure what.
It doesn't even attempt to infer the mode of struct members (although simply allowing modes other than @own is an improvement over the transpiler to C). I'm not sure how it could go about that; that might need something more whole program optimization–y (since it's not possible to tell how a struct is used just by looking at its definition), or some way to be generic over the mode. Or maybe defaulting to @share, combined with the fact that refcount = 1 doesn't require copying, is sufficient.