We're currently doing the typing of integer literals wrong with regard to signed vs un

Yes. But we should make sure that if the match expression is a literal, we don't

Sort out integer literal typing,about ponylang/ponyc

Comments (8)

andymcn commented on May 14, 2024

How does whatever we do work with blocks that return a literal, eg if?

from ponyc.

andymcn commented on May 14, 2024

New plan after long discussion:

Basically we do top down typing for integer and float literals.

We type check expressions bottom up as now. Integer literals are assigned the type "integer literal" and float literals the type "float literal". These are internal types that only exist within the type checker, they cannot be used in Pony code.

When literals are combined by operators the result is still a literal. For example, an integer literal plus another integer literal gives an integer literal. An integer literal multiplied by a float literal gives a float literal.

When something with literal type is combined with something with an actual type, we infer what type the literal should and (if legal) we push that type back down the literal typed tree. For example, if we add an integer literal to something with type U32, then we infer that the literal must also be of type U32.

This works perfectly well for control structures such as if as well as for operators.

In some cases we won't be able to tell the type. For example:
var x = 3
Such cases will be errors. Later on we may add more sophistiocated inference based on use later in the code, which would then make some of those cases legal. However such changes can only make existing error cases legal, they cannot make existing legal code illegal or change the effect of existing legal code.

When the type needs to be specified for a literal, constructor call syntax for the required type is used, eg U8(4) will generate a U8 with value 4.

from ponyc.

andymcn commented on May 14, 2024

(Edit: to make future conversations easier, everything described in this comment wil be referred to as Plan C.)

The current state of my thinking.
Terminology:
UIF type - any unsigned, signed or float type. Equivalent to a "number" type, but less ambiguous in general use.

The basics

When type checking, every literal value is marked as having literal type (with a TK_LITERAL AST node). Literal sub-expressions, ie sub-expressions containing only things with literal type, are also marked as literal type. Each literal type has flags to indicate whether it must be a float, must be an integer or can be either (referred to as a number).

When something with literal type is used we coerce that literal to an actual type. "Using" includes assigning to a local or field, passing as a method argument or returning as a method return value. In all of these cases we know the set of types that are legal (referred to as the "target type") and pick the "best" one or generate an error if none is legal.

Examples:
var x:U64 = 3 // Target type is U64, 3 becomes U64
var x:None = 3 // Target type is None, error
var x:Number = 3 // Target type is Number, 3 becomes F64
foo(3) // 3 becomes type of argument of foo() or error, depending on definition of foo()

The definition of "best" type is based on a simple ordering: F64, F32, I128, U128, I64, U64, I32, U32, I16, U16, I8, U8 with F64 being the best.

When something with literal type is used is such a way that we cannot determine the type it is an error. This includes simply discarding the value and using it as a receiver.

Examples:
var x:U64 = 3; 4 // Error. 4 can become U64, but type cannot be deduced for 3
var x = 3 // Error, cannot do simultaneous literal and local inference
3.foo() // Error, no idea what foo() to call

An exception to this last point is the special functions equivalent to the infix operators. For these if either the receiver or the (single) argument has a known type then other is coerced to that type. If both have literal type then the whole sub-expression has literal type. Note that this requires combining the float / integer / number-ness of the 2 types appropriately, eg "number + float" gives a float and "integer + float" gives an error.

Examples:
3 + 4 // Has type literal number
3.add(4) // Has type literal number
3.0 + 4 // Has type literal float
3.add(U32(4)) // 3 becomes U32, whole expression has type U32
var x:U64 = 3 + 4 // 3 and 4 both become U64

The unary operators are handled similarly.

Control structures

For control structures (if, while, etc) we allow each branch to have a separately inferred type. Note that break statements count as "branches" for these purposes, so there may be an arbitrary number of branches. If the value of the whole control structure is assigned to something with a simple type then all the branches will end up with the same type anyway. But if it is assigned to a union then they may be different.

Examples:
var x:U64 = if b then 3 else 4 end // 3 and 4 both become U64
var x:U64 = if b then 3 else 4.0 end // Error, 4.0 cannot be coerced to U64
var x:Number = if b then 3 else 4.0 end // 3 becomes I128 and 4.0 becomes F64. Type of if block is (I128 | F64)

This means that control structures which have some branches with literal type and some with non-literal type just work.

Example:
var x:(U64 | None) = if b then 3 else None end // 3 becomes U64

If the value given by a control structure is used as a receiver then we have to have special consideration. We use the fact that we can't call methods on unions. This means that all the branches from such a control structure must give the same type. This gives us 3 possible cases:

A. At least 1 branch gives a type that is not a UIF type. This must be an error. Example:
(if b then 3 else None end).foo() // Error

Similarly 2 or more branches with different UIF types is an error. Example:

(while(c) do
  if b1 then break U64(3) end
  if b2 then break U32(4) end
  3
end).foo()  // Error

Note that I'm ignoring while else clauses in these examples for the sake of brevity, but they obviously need to be handled.

B. The non-literal branches give exactly 1 UIF type and no non-UIF types. In this case we coerce the literals to the other branch type. Example:
(if b then 3 else U32(4) end).foo() // 3 becomes U32

C. All the branches give literal types. In this case we treat the expression exactly as for a simple literal receiver above, except that combining the arguments is more complex since each might be a union of literals. This is why the properties float, integer and number need to be flags. Example:
(if b then 3 else 4 end).add(5) // Has type literal number
(if b then 3.0 else 4 end).add(5) // Has type literal float
(if b then 3.0 else 4 end).add(1 << 2) // Error, float + integer
(if b then 3.0 else 4 end).add(if b2 then 5 else 6.0 end) // Has type literal float
(if b then 3.0 else 4 end).add(if b2 then 5 else 1 << 6 end) // Error, float + integer
(if b then 3 else 4 end).foo() // Error, don't know what foo() to call

Break non-locality

One important case to watch out for is that break statements may have their value inferred, but may also appear within some other expression that is inferred completely separately. Example:

var x:U32 = (while c do
  var y:F64 = (if b then break 3 else 4 end)
  5
end)

The 4 becomes F64 because it is part of the y inference. However, the 3 becomes a U32 (as does the 5) because it is part of the x inference. This has 3 key implications:

A. Break statements do not affect the type of their immediate surroundings. The if block is only a literal at all because of the 4, not because of the 3.
B. When we coerce a literal sub-expression to a particular value we can only set the type of the literals that are actually part of that value, not all that happen to appear within the AST. In this example when we coerce the if block to be F64 we must take care NOT to coerce the 3 to F64, but to leave it as literal.
C. The if block only actually has 1 branch that gives back a type. When handling control branches we need to check for ones which do not contribute a type and act accordingly.

Since control structures have to be able to coerce their break values, independent of where they are within the control structure, we have to be able to find the breaks. To enable this literal types for control structures need to contain a list of pointers to the ASTs (not just the types) of all their branches, including breaks. An additional TK_ value is used to build a linked list of nodes whose data members point to the relevant ASTs.

Tuples

For tuple literals we handle each tuple element separately. Any tuple expression with at least one element with literal type is itself considered a literal type. A literal tuple type contains a list of element types, each of which is either a non-literal type, a non-tuple literal type or a nested literal tuple type. Example:
(1, None, (2, 3.0)) // Has type literal tuple (literal number, None, literal tuple (literal number, literal float))

When determining the "best" tuple type we pick the best type for the first literal element and if that is a tie we pick the best type for the second literal element and so on. Example:
var x:((U32, U32) | (F64, U8)) = (1, 2) // 1 becomes F64 and 2 becomes U8

Methods cannot be called on tuples (and hence infix operators cannot be used on them). This means that we do not need to worry about special cases above for handling literal receivers, this is always an error. Example:
var x:(U32, U32) = (1, 2).add((3, 4)) // Error

When tuples are combined with control structures we may end up with types that are unions of different tuple cardinalities (including 1, ie not a tuple). This is actually fine and just works out. Examples:
var x:(None | (U16, U32)) = if b then None else (3, 4) end // 3 becomes U16 and 4 becomes U32
var x:(U8 | (U16, U32)) = if b then 3 else (4, 5) end // 3 becomes U8, 4 becomes U16 and 5 becomes U32

Complex tuple target types with control structures

When we consider control structures and complex tuple target types our definition of "best" type falls down. For any literal sub-expression we may be comparing a pair of types, one for floats and one for integers. Example:

var x:((U64, (F64 | U8)) | (U64, U32)) =
  (2, if b then 3 else 1 << 4 end)

Clearly 2 must become U64. But we have a choice for the other literals:
A. 3 becomes F64 and 1 and 4 become U8, or
B. 1, 3 and 4 all become U32.

The rule we use is that first we pick the best float type (if any) and then the best integer type. So in this example we use option A.

Using formal parameters

When a target type includes a formal parameter we cannot necessarily determine the correct type to use for a literal. Instead we split the task into 2 parts:

A. We check at compile time that we are always guaranteed to have at least one legal type for the literal once reification occurs.
B. When reification occurs we then pick the actual types for the literals, which must always be possible because we already performed A.

Examples:
fun box f[A: Number]() => var x:A = 3 // OK, can always pick a legal type for 3
fun box f[A: Signed]() => var x:A = 3 // OK, can always pick a legal type for 3
fun box f[A: Number]() => var x:A = 3.0 // Error, f[U32] does not work
fun box f[A: (Number | None)]() => var x:A = 3 // Error, f[None] does not work

Note that we can always pick the best type at reification even if we don't know if it's going to be the type parameter before reification. Example:
fun box f[A: Number] => var x:(A | U64) = 3
f[F64]() // 3 becomes F64
f[U16]() // 3 becomes U64 (even though A is U16)

For more complex examples involving formal parameters exactly the same principle is followed. Example:

fun box f[A: Float, B: Number]() =>
  var x: ((U32, A) | (I32, B)) = if b then (1, 2.0) else (3, 4) end
  // OK, but don't know types until reification

f[F64, U32]()  // 1 becomes U32, 2.0 becomes F64, 3 becomes I32, 4 becomes U32
f[F32, F64]()  // 1 becomes I32, 2.0 becomes F64, 3 becomes I32, 4 becomes F64

Note that the 2 branches of the if block are inferred separately.

Canonical form

The combination of tuples, intersections and formal parameters make the implementation of determining whether there is a legal type (and what it is after reification) a nightmare. To get round this we convert the target type to a canonical form before we start coercion. To do this we push intersections inside tuples and unions, leaving us with intersections only of simple types and formal parameters (which cannot be tuples). This will be a bit of a pain to implement, but makes everything else much easier.

from ponyc.

andymcn commented on May 14, 2024

This is the outline of the new simplified plan D.

Whereas in plan C we aimed to always acheive the correct result, here we aim to acheive a correct result where sensible, but be a lot more willing to say we can't cope with nasty cases. The aim is to handle all "reasonable" cases and require the programmer to explicitly specify the type of literals in "unreasonable" cases. Obviously the definition of "reasonable" is highly subjective.

There are a few key changes from plan C. These are all interdependent to various degrees, so we cannot treat or apply them independently.

We no longer pick the "best" type, all ambiguity is treated as an error. Whilst picking the best type does simplify some simple cases it makes handling very complex cases much harder, partly because it makes it very difficult to determine when we're in a very complex case. Treating all ambiguity as an inference failure makes spotting and bailing out of complex cases very easy.
We only infer literal types assigned to single UIF type and formal parameters with constraints that are subtypes of [A: (Real[A] & Number)]. (Assigning here includes passing as an argument etc, as in plan C.)
Rather than trying to infer the types of tuples as single unit we handle each tuple element separately. This means we will now sometimes pick types for literals that actually work out to be illegal, but these errors will be caught by the type check for the coercing assignment.
Each literal is inferred to be exactly one of a single UIF type, a single formal parameter or an error. No complex compound types are allowed. This means there is no post-reification step.
As stated above, we do not try to deduce overly complicated types, if it gets too hard give up. This is only possible because of the other changes.

As part of the general simplification we will be giving up on complex type expressions much more quickly. For example, given 2 random types T1 and T2, consider this subtype expression:
((T1 | T2) & T1)
This is actually equivalent to type T1 and T2 is irrelevant. We used to try to work through this, now we don't bother. If this gets too hard (which depends on what T1 and T2 are) we give up, even though it always can be reduced to T1.

There are many things we can do to aid in cases like this. For plan D we will start by implementing exactly none of these. Since only very complex types will benefit from such tricks, it is probable that they will never actually be used. If we do end up running into such complex types then various heuristics can be implemented on an ad hoc basis.

The basics

The basic examples are unaffected by this change.

var x:U64 = 3 // Target type is U64, 3 becomes U64
var x:None = 3 // Target type is None, error
foo(3) // 3 becomes type of argument of foo() or error, depending on definition of foo()
var x:U64 = 3; 4 // Error. 4 can become U64, but type cannot be deduced for 3
var x = 3 // Error, cannot do simultaneous literal and local inference
3.foo() // Error, no idea what foo() to call

Unions

One case that has changed is assigning to a union. Since this is ambiguous it is now an error:
var x:Number = 3 // Error, ambiguity

Assigning to a union is OK if there is only one appropriate UIF type in the union. Examples:
var x:(U64 | None) = 3 // Target type is (U64 | None), 3 becomes U64
var x:(U64 | F64) = 3.0 // Target type is (U64 | F64), 3.0 becomes F64

Control cases can still have different types for different branches, even if both are literals. Example:
var x:(U32 | None) = if b then 3 else None end // 3 becomes U32
var x:(U32 | F64) = if b then 1 << 2 else 3.0 end // 1 becomes U32, 2 becomes U32, 3.0 becomes F64

But each branch must be unambiguous:
var x:(U32 | F64) = if b then 2 else 3.0 end // Error, 2 could be U32 or F64

Tuples

For tuples, in plan C we checked the types of all elements. Example:
var x:(Foo, U32) = (x, 1)
In plan C: if x is a Foo then the 1 becomes U32. If x is not a Foo then we could not infer a type for 1, error.
In plan D: we always infer that 1 is U32 and we ignore the Foo element. If x is not a Foo the assignment type check will spot this, so it's not our problem.

Given a more complex example:
var x:((Foo, U32) | (Bar, U64)) = (x, 1)
In plan C: we would check if x was a Foo and if x was a Bar pick a type appropriately. If x was neither a Foo or a Bar an error would occur.
In plan D: we ignore the first element and take the target type for the second to be (U32 | U64). This is ambiguous, so the type is not inferrable.

Another tuple example:
var x:((Foo, None) | (Bar, U32)) = (x, 1)
In plan C: again we would compare the type of x to Foo and Bar before deciding whether to coerce the 1 to U32 or generate an error.
In plan D: the target type for the second element is (None | U32) so the 1 becomes U32. If x is not a Bar this is an error, but it will be picked up by the assignment.

Formal parameters

Since we now limit literal assignments to constraints that are subtypes of [A: (Real[A] & Number)] the example in Range still works fine:
fun box create[A: (Real[A] & Number)]() => var inc:A = 1 // 1 becomes type A
Since [A: (Real[A] & Number)] can only ever be bound to a single type (not a union) this is unambiguous.

Since our rule specifies the constraint is a subtype this still works for restricting to smaller sets of UIF types. Examples:
fun box create[A: (Real[A] & (U8 | U16))]() => var inc:A = 1 // 1 becomes A, either U8 or U16
fun box create[A: (Real[A] & Float)]() => var inc:A = 1.0 // 1.0 becomes A, either F32 or F64

What we have outlawed is assigning literals to formal parameters with the looser constraint of just Number. Example:
fun box create[A: Number]() => var inc:A = 1 // Error, ambiguous since A may be a union

I have not come up with an example where this would actually be a problem, requiring A to possibly be a union and still needing to assign a literal to it. Since they may be unions you can't do anything with the values anyway, other than store them, hand them out or pass them through a match. If it does ever actually come up an additional formal parameter can be used to specify the initialiser type. Example:
fun box create[A: Number, B: (Real[B] & A)]() => var inc:A = B(1)

Assigning to unions of formal parameters and other formal parameters or UIF types has also been outlawed, since it is ambiguous. Examples:
fun box create[A: (Real[A] & Number), B: (Real[B] & Number)]() => var inc:(A | B) = 1
fun box create[A: (Real[A] & Number)]() => var inc:(A | U32) = 1
Both the above are errors due to ambuguity.

Canonical form

Since we no longer have to deal with intersections of unions of complete tuples we no longer need to convert types to canonical form.

After thought

Reading back this comment having written it, it doesn't really give the impression that plan D is particularly different to plan C, other than for formal parameters. However, the difference in complexity for the implementation is huge.

from ponyc.

andymcn commented on May 14, 2024

This is now done, but there are a few edge cases to tidy up, so I'm leaving this open for a little while.

from ponyc.

andymcn commented on May 14, 2024

I forgot about inference in match patterns. This won't be hard to do, but I want to check I have the right plan before I start.

Given:

match expression
| pattern =>...

We don't infer a type for the expression, because why would you want to do "match(3)" and what type could that possibly be?
The type for the pattern is inferred to be the type of the expression.
All the patterns in a given match infer to the same expression type. There's no inferring to previous patterns or anything like that.

from ponyc.

sylvanc commented on May 14, 2024

Yes. But we should make sure that if the match expression is a literal, we don't assert :)
Yes. Of course, it could be some subtype of the expression, but that's exactly what the literal inference does anyway.
Yes. Patterns are independent of each other, they are only dependent on the match expression.

from ponyc.

andymcn commented on May 14, 2024

All done (hooray)

from ponyc.

Sort out integer literal typing about ponyc HOT 8 CLOSED

Comments (8)

The basics

Control structures

Break non-locality

Tuples

Complex tuple target types with control structures

Using formal parameters

Canonical form

The basics

Unions

Tuples

Formal parameters

Canonical form

After thought

Related Issues (20)

Recommend Projects

React

Vue.js

Typescript

TensorFlow

Django

Laravel

D3

Recommend Topics

javascript

web

server

Machine learning

Visualization

Game

Recommend Org

Facebook

Microsoft

Google

Alibaba

D3

Tencent