coursera / courier

Data interchange for the modern web + mobile stack.

Home Page: http://coursera.github.io/courier/

License: Apache License 2.0

Scala 49.34% Java 41.18% Groovy 1.49% Swift 3.79% Ruby 0.01% Lex 0.55% ANTLR 0.47% Python 0.18% Shell 0.09% TypeScript 2.92%

courier's Issues

Top level union typeref containing custom type fails to compile in Scala

Example Courier schema:

namespace org.example.other

import org.coursera.customtypes.CustomUnionTestId
import org.coursera.customtypes.CustomRecord

typeref EvilUnion = union[CustomUnionTestId, CustomRecord]

The Scala compiler will fail because Custom and SingleElementCaseClassCoercer have not been imported.

Add support for Null

Pegasus schemas have a 'null' type. Courier has not supported it up to this point, largely because there is no need for it when writing idiomatic Scala. But to be 100% compatible with Pegasus, we should support it (and then discourage its use!). Once it is supported, we can guarantee developers that any .pdsc schema can be handled correctly by Courier.

Please release latest to bintray

Hello! An older, unknown version of courier currently lives at https://dl.bintray.com/coursera/generic/courier .

Can somebody with credentials please release the latest version? Thanks!

Erem

Remove `nil` default value

Based on how we have observed the Courier schema language being used in practice, we plan to remove = nil defaults. In many cases, optional fields that could or should be defaulted to nil are not. This is inconvenient because callers must then explicitly provide None for parameters that they should not need to supply. To fix this, we are going to change the default behavior: all optional fields will be generated with = None by default in Scala code (= nil in Swift) unless an @explicit property is added to the field (exact name of the property TBD).
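A rough sketch of the proposed behavior, written in Python for brevity rather than as Courier's generated Scala or Swift. The Course record and its fields are hypothetical; the only point is that an optional field gets a None default, so callers can omit it.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Course:
    # Required field: callers must always supply it.
    name: str
    # Optional field: defaults to None, mirroring the proposed `= None` default.
    description: Optional[str] = None


# The caller no longer has to pass None explicitly.
course = Course(name="Machine Learning")
```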

Inheritance Support

The documentation says that the "include" key adds all the fields but does not provide inheritance. While this is easy to work around for new projects, people looking to switch to this framework may be relying on inheritance in their code (at least I am).

Allow swift codegen to produce classes instead of structs

I think it's a good idea to allow an option to choose classes for records instead of structs (the default). This ticket is a suggestion for the future rather than a bug in the current implementation.

First, a couple of things:

I work at LinkedIn and have helped on a very similar project to this. We started with structs, but later switched to classes.
I am not currently using this library so nothing here is blocking me.

Here's why I think classes are better for this use-case than structs:

Classes are much more stable in swift than structs (compilation). When we used structs (especially with associated types as you do for unions), we uncovered many compiler bugs which were extremely unpredictable and hard to track down. The compiler would usually segfault on some inputs. We filed a bunch of these bugs against Apple and they have fixed some of them, but not all of them.
Classes compile faster. For larger projects, this can become a problem quickly. In our tests, classes compile ~20% faster than structs.
Classes create smaller binaries. Structs can produce much larger binaries. (I think it's something to do with nested associated values getting 'inlined' somehow, but not sure about the details). I forget the exact numbers, but it was significant enough for us to move away from structs for data models.
Classes never need to be copied. I'm not 100% sure on this one, but we did see some hidden performance problems with structs and the compiler 'over-copying' them to be sure. Copy-on-write functionality only seems to be implemented for arrays and dictionaries and you don't get it for custom structs. Though normally this won't be a problem, the additional CPU and memory use seems like a waste. (I don't have any data/checks currently to back this one up...so take with a grain of salt).

I would opt instead to use final class. As far as I can tell, for immutable models, struct and final class are completely interchangeable. As in, from the view of the developer, there is no change in functionality or behavior between the two. Both are immutable, unsubclassable, and thread-safe. (There may be differences under the hood, but none of this will create any changes in program execution).

Therefore, I'd always opt for final class over struct if I could and it'd be cool if this library supported it.

Consider moving to jackson stream parsing

It's a lot faster than GSON, consumes less memory and does not use reflection. The generated code is a bit more verbose, but the tradeoff is worth it.

Generated Scala code for empty records triggers warnings about unused method parameters

[UnusedParameter] Parameter record is not used in method unapply.

Getting Started docs outdated

The docs reference outdated version 1.2.2 and there is no indication that SBT >1 is not supported

Improved "JSON Equivalent" binary protocol support

There are a number of binary protocols that provide "JSON equivalent" semantics:

PSON (non-standard pegasus codec)
BSON (mongoDB)
UBJSON
Smile
CBOR (not sure this is entirely JSON equivalent)
Flatbuffers
???

We should carefully review the performance of these (https://github.com/eishay/jvm-serializers/wiki is a good start) and determine which is the best and add codec support to Courier for that protocol.

Pegasus already includes support for PSON and BSON, so we really only need to review UBJSON, Smile, CBOR or any other contenders and determine if they are better than PSON or BSON, and if so, add codec support.

Fix issues with coercers and non-primitive types

I'm not clear on all the details here, but there are certainly some problems: https://phabricator.dkandu.me/D49527

Add `assemble` methods in Courier record companions that use field inclusion

Right now, records that use field inclusion have the standard two construction methods in their companion (building from DataMap and with all fields specified explicitly). I think it'd be useful to add additional methods, perhaps called assemble, that accept the included arguments as records instead of individual fields. For example:

record A {
  field1: int
  // many others
}

record B {
  ...A
  fieldB: string
}

I'd like to construct B with B.assemble(A(...), "fieldB"), instead of B(1, ..., "fieldB").

For records with multiple inclusions, I think generating a single assemble that accepts all included records plus all additional fields is sufficiently useful (rather than worrying about the complexity of overloads for different inclusion combinations).
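A minimal sketch of what such an assemble method could look like, in Python rather than generated Scala. The second field on A (field2) is a hypothetical stand-in for the "many others"; nothing here reflects Courier's actual generator output.

```python
from dataclasses import dataclass, fields


@dataclass(frozen=True)
class A:
    field1: int
    field2: str


@dataclass(frozen=True)
class B:
    # Fields included from A come first, followed by B's own field.
    field1: int
    field2: str
    fieldB: str

    @classmethod
    def assemble(cls, a: A, fieldB: str) -> "B":
        # Copy every field of the included record, then add B's own field.
        included = {f.name: getattr(a, f.name) for f in fields(a)}
        return cls(**included, fieldB=fieldB)


b = B.assemble(A(1, "x"), "fieldB")
```

With several inclusions, the same method would take one argument per included record and merge their fields in declaration order.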

Cyclic references unsupported

Courier is unable to compile structures like so:

namespace example

record Test1 {
  tst: Test2
}

namespace example

record Test2 {
  tst: Test1
}

So it's impossible to express complex recursive structures (like a JSON tree, for example) with Courier schemas.
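One common way for a schema resolver to tolerate cycles is to resolve in two passes: register every declared name first, then link field references. The Python sketch below illustrates that technique on a hypothetical in-memory schema table; it is not a description of how Courier's parser or resolver works.

```python
# Hypothetical schema table: record name -> {field name: referenced type name}.
schemas = {
    "example.Test1": {"tst": "example.Test2"},
    "example.Test2": {"tst": "example.Test1"},
}


def resolve(schemas):
    # Pass 1: register an empty placeholder for every declared record,
    # so any name can be referenced before its fields are resolved.
    resolved = {name: {"name": name, "fields": {}} for name in schemas}
    # Pass 2: link each field to the placeholder of the type it references.
    for name, field_types in schemas.items():
        for field_name, type_name in field_types.items():
            if type_name not in resolved:
                raise KeyError(f"Type not found: {type_name}")
            resolved[name]["fields"][field_name] = resolved[type_name]
    return resolved


records = resolve(schemas)
```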

Discussion Topic: Python 3 Bindings

Hey y'all,

Having Python bindings for Courier has become increasingly important for us at Instrumental as we scale up our dependence on the language for ML and other flows. As such, I've posted a work-in-progress PR that generates python3 bindings from courier templates. It's at the point where it works in most cases and can be used to idiomatically create, serialize, deserialize, and validate the types in your courier templates. It's not at the point where it's worth reviewing the code.

In the README you will see the details of what is yet-to-be-implemented and what I intend to leave for later. At this point I will be interested in a couple high-level questions:

Will a Python3 generator be a good addition to main-line courier?
General input on the API and idioms of the generated code (e.g. courier.dumps(obj))
Among the remaining features called out under Features missing in action in the PR, some rough sense of priority. Or if they are necessary at all for merge into main-line.

I will be continuing to work on it here and there for the next few weeks before it will be ready for merge. Particularly we will be dogfooding it internally to hammer out the rough points of the generated API.

See the python test-cases for examples of how the API works as-written. The basic gist is:

# Assume we have generated python bindings into the `generated` package
import generated.courier as courier
from generated.org.example.MagicEightBall import MagicEightBall
from generated.org.example.MagicEightBallAnswer import MagicEightBallAnswer

json = """{"question": "Will I ever love again?", "answer": "IT_IS_CERTAIN"}"""
ball = courier.parse(MagicEightBall, json) # raises courier.ValidationError if doesn't match schema

assert(courier.serialize(ball) == json) # Passes
ball.question = 'Am I human?'
new_ball = MagicEightBall(question='Am I human?', answer=MagicEightBallAnswer.IT_IS_CERTAIN)

assert(ball == new_ball) # Passes

Scala 2.13 support

It would be great to see support for Scala 2.13 for this project.

On quick review, it looks like the main breaking change here is the change to the Map API.

Other than that it seems like https://github.com/coursera/courscala would also need to be updated.

Add support for construction-time validation

The Problem

While Courier supports data validation, the validator functions it provides need to be called explicitly.

That's feasible when used with the right frameworks - such as Naptime - which call the validator functions at interface boundaries, but it is impractical for other use cases.

For example, consider the following model:

record JeopardyResponse {
  @validate.regex = {
    "regex": "^(What|Who) is"
  }
  question: string
}

A user might prefer if invalid constructions such as new JeopardyResponse(question = "Why is the sky blue") failed early without any explicit call to validation.

Currently the closest they can get is by either

calling validation at every construction site, which is error prone and tedious
rolling their own wrapper types that call validation (more complicated than it sounds, e.g. generated Scala classes are final)

The bottom line is that users who want their models validated at construction time have to write code that could be generated.

Proposed Solution

Adding a new annotation would let users label types which need to be validated at construction time:

@validateConstruction
record JeopardyResponse {
  @validate.regex = {
    "regex": "^(What|Who) is"
  }
  question: string
}

Courier would then generate constructor code that calls the validator function and signals failure in an idiomatic way. For example, in Java and Scala, new JeopardyResponse(question = "Why is the sky blue") would throw an IllegalArgumentException.
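To make the intended behavior concrete, here is a small Python sketch of construction-time validation. It is an illustration of the idea only, not generated Courier code; in Python the idiomatic failure signal is ValueError rather than IllegalArgumentException.

```python
import re
from dataclasses import dataclass

QUESTION_PATTERN = re.compile(r"^(What|Who) is")


@dataclass(frozen=True)
class JeopardyResponse:
    question: str

    def __post_init__(self):
        # Runs right after the generated __init__, so invalid data never
        # produces a usable instance.
        if QUESTION_PATTERN.search(self.question) is None:
            raise ValueError(f"question does not match regex: {self.question!r}")


ok = JeopardyResponse(question="What is a courier")
try:
    JeopardyResponse(question="Why is the sky blue")
    rejected = False
except ValueError:
    rejected = True
```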

transform "0" to boolean

Hello

I have a case where the JSON has keys with boolean values encoded as the strings "0" and "1".
It comes from an external system that I cannot influence.

I implemented a coercer and a custom type, and managed to get
values coerced from "0" or "1" to IntBoolean(false) and IntBoolean(true),
but I cannot manage to get those values from record.data()
or get them into Avro.

Simple example of what I did is

IntBooleanRecord.courier

namespace test
record IntBooleanRecord {
  key : IntBoolean
}

IntBooleanCoercer.scala

package test

import com.linkedin.data.template.{Custom, DirectCoercer}

case class IntBoolean(value: Boolean) extends AnyVal

class IntBooleanCoercer extends DirectCoercer[IntBoolean] {

  override def coerceInput(obj: IntBoolean): AnyRef = {
    Boolean.box(obj.value)
  }

  override def coerceOutput(obj: Any): IntBoolean = {
    obj match {
      case value: String =>
        if (value == "0") {
          IntBoolean(false)
        } else if (value == "1") {
          IntBoolean(true)
        } else {
          throw new IllegalArgumentException(s"$value is not 0 or 1")
        }
      case _: Any =>
        throw new IllegalArgumentException(
          s"Field must be string with value 0 or 1, but was ${obj.getClass}"
        )
    }
  }
}

object IntBooleanCoercer {
  registerCoercer()
  def registerCoercer(): Unit = {
    Custom.registerCoercer(new IntBooleanCoercer, classOf[IntBoolean])
  }
}

IntBoolean.courier

namespace test

@scala.class = "test.IntBoolean"
@scala.coercerClass = "test.IntBooleanCoercer"
typeref IntBoolean = boolean

scala code to test

val json =
  """{
    |  "key": "1"
    |}""".stripMargin

val dataMap = DataTemplates.readDataMap(json)
val record = IntBooleanRecord(dataMap, DataConversion.SetReadOnly)

println(record) // IntBooleanRecord(IntBoolean(true))
println(record.data()) // {key=1}

I was expecting record.data() to output {key=true}.

Is there something I am doing wrong, or is this not supposed to work this way?
If not, what would be the way to do it?
Help would be appreciated.
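One possible explanation, offered as an assumption rather than a confirmed answer: a coercer converts values when they are read through the typed accessor and leaves the underlying data untouched, which would explain why record.data() still shows the raw "1". If that is the case, one workaround is to normalize the parsed JSON before constructing the record. The sketch below shows the idea in Python on a plain dict; normalize_flags and BOOLEAN_KEYS are hypothetical names, not part of Courier or Pegasus.

```python
import json

# Hypothetical: the keys known to carry "0"/"1" encoded booleans.
BOOLEAN_KEYS = {"key"}


def normalize_flags(data):
    """Return a copy of data with "0"/"1" strings replaced by real booleans."""
    result = {}
    for name, value in data.items():
        if name in BOOLEAN_KEYS:
            if value not in ("0", "1"):
                raise ValueError(f'{name} must be "0" or "1", but was {value!r}')
            result[name] = value == "1"
        else:
            result[name] = value
    return result


raw = json.loads('{"key": "1", "other": "untouched"}')
normalized = normalize_flags(raw)
```

After this step the field holds a real boolean, so a plain boolean field (without a custom coercer) would suffice in the schema.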

Dangling doc comment causes java.lang.NullPointerException

Example input:

record Record {
  field: int
  /** dangling doc comment */
}

Provide coercers for "standard" jvm types

Hello,

Thanks for writing Courier!
It is unclear to me which JVM version Courier targets, but it would be really nice to have default bindings for common non-primitive types from the Java standard library:

java 1.6 +

BigDecimal
BigInteger

java 1.8 +

java.time.Instant (bound to long as the number of milliseconds since the epoch; see timestamp-millis in the Avro spec)
java.time.LocalTime (bound to int as the number of milliseconds after midnight; see time-millis in the Avro spec)
java.time.ZonedDateTime or java.time.LocalDateTime (bound to an ISO 8601 string)
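The conversions such coercers would perform can be sketched as follows. This is Python using the standard datetime module as a stand-in for java.time, and the function names are hypothetical. It follows the Avro logical types, where timestamp-millis counts milliseconds since the Unix epoch and time-millis counts milliseconds after midnight.

```python
from datetime import datetime, time, timedelta, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)
ONE_MILLI = timedelta(milliseconds=1)


def instant_to_millis(instant: datetime) -> int:
    # Milliseconds since the Unix epoch, as in Avro's timestamp-millis.
    return (instant - EPOCH) // ONE_MILLI


def millis_to_instant(millis: int) -> datetime:
    return EPOCH + millis * ONE_MILLI


def local_time_to_millis(value: time) -> int:
    # Milliseconds after midnight, as in Avro's time-millis.
    seconds = (value.hour * 60 + value.minute) * 60 + value.second
    return seconds * 1000 + value.microsecond // 1000


def datetime_to_iso(value: datetime) -> str:
    # ISO 8601 string form.
    return value.isoformat()


moment = datetime(2020, 1, 1, 12, 30, tzinfo=timezone.utc)
```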

Generated constructors for Records should not have default arguments

(I'm talking about the generated scala code, and I've never looked at the code generators for other languages, but I assume that those might have similar behavior.)

Issue description

When you give a field in a courier record a default value, it generates an apply method with default arguments for that field. This can hide bugs where you construct the record without giving it all the data that it needs. Therefore, I suggest that the generated methods not have default arguments.

Obviously, this will probably break a lot of existing code that uses courier. Maybe there should be some sort of "generateDefaultArgument" annotation to ease the transition.

Example

Here is an example of a bug that the current behavior hid from me:

record SpecificationWithId {
  creatorName: CreatorName
  name: SpecificationName
  ...Specification
}
record Specification {
  isStandalone: boolean?
  template: AnyData
  children: array[NodeRequest] = []
  preCreatedNodes: array[PreCreatedNode] = []
}

object SpecificationWithIds {

  def toTuple(specificationWithId: SpecificationWithId):
    (QualifiedSpecificationName, Specification) = {
    val qualifiedSpecificationName = QualifiedSpecificationName(
      specificationWithId.creatorName,
      specificationWithId.name)
    val specification = Specification(
      specificationWithId.isStandalone,
      specificationWithId.template,
      specificationWithId.children)
    (qualifiedSpecificationName, specification)
  }

}

I added the preCreatedNodes field after I wrote the SpecificationWithIds.toTuple "deconstructor", and I forgot that I needed to update the "deconstructor".
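The same failure mode can be reproduced in a few lines of Python, shown here as an illustration only (field types are simplified to plain strings and a dict). The old call site keeps working after the new defaulted field is added, and the field's value is silently lost.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass(frozen=True)
class Specification:
    is_standalone: Optional[bool]
    template: dict
    children: List[str] = field(default_factory=list)
    # Field added later; the default lets old call sites keep working.
    pre_created_nodes: List[str] = field(default_factory=list)


source = Specification(True, {}, ["child"], ["node"])

# Old call site, never updated for the new field: no error is raised,
# and pre_created_nodes is silently dropped.
rebuilt = Specification(source.is_standalone, source.template, source.children)
```

Without the default on pre_created_nodes, the last line would fail with a TypeError, which is the early signal this issue asks for.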

Expand our benchmark suite

Courier currently has a simple, runnable JMH benchmark:

https://github.com/coursera/courier/tree/benchmark/benchmark

We should flesh out these benchmarks and test out cases such as large arrays. We should also benchmark our supported binary protocols using this utility. Once we are satisfied with the benchmarks, we should merge this into master.

Generated typescript for recursively defined records import themselves

When generating typescript bindings from courier records that contain recursive definitions, the generated typescript interface file includes an invalid import of itself.

record test {
  recursiveField: test
}

import { test } from "./.test";

export interface test {
  
  recursiveField : test;
}
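A fix along these lines would skip the type being generated when emitting imports. The Python sketch below shows that filtering step in isolation; import_lines is a hypothetical helper, and the "./Name" path format is a simplification, not the path format the TypeScript generator actually emits.

```python
def import_lines(own_name, referenced_types):
    # Emit one import per referenced type, skipping the type being generated.
    return [
        f'import {{ {name} }} from "./{name}";'
        for name in sorted(set(referenced_types))
        if name != own_name
    ]
```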

Add Swift support

From talking to Swift developers, idiomatic bindings should look something like:

Record:

/**
A fortune cookie.
*/
struct FortuneCookie {

    /**
    A fortune cookie message.
    */
    let message: String

    var certainty: Float?

    let luckyNumbers: [Int]

    let map: [String: Int]

    let simple: Simple
}

Union:

import Foundation

enum Telling {
    case FortuneCookieType(FortuneCookie)
    case MagicEightBallType(MagicEightBall)
    case StringType(String)
}

Enum:

import Foundation

enum MagicEightBallAnswer {

    case IT_IS_CERTAIN

    /**
    Where later is at least 10ms from now.
    */
    case ASK_AGAIN_LATER

    case OUTLOOK_NOT_SO_GOOD
}

Typeref:

import Foundation

/**
ISO 8601 date-time
*/
typealias DateTime = String

JSON serializer:

We are currently prototyping with SwiftyJSON.

Open issues:

Is it possible to make defaulted fields immutable (let instead of var)? If so, how?
How should we bind to protocols?

Handle enums that are not all-caps better

We have a number of persisted models with enum types in which the string representations of the enums are camel-cased. Courier only supports all-caps enums, so these enums cannot be migrated to Courier.

Possible solutions:

Allow enum symbols to be camel-cased, rather than enforcing all enums to be all-caps.
Allow alias values for enum symbols i.e. allow the string representation of the enum symbols to be specified.
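The second option (alias values) can be illustrated with a Python Enum, where the symbol name and the serialized string are separate. PaymentStatus and its symbols are hypothetical examples, not taken from any Courier schema.

```python
from enum import Enum


class PaymentStatus(Enum):
    # Symbol names stay all-caps; the values are the persisted strings.
    IN_PROGRESS = "inProgress"
    COMPLETED = "completed"


def parse_status(serialized: str) -> PaymentStatus:
    # Enum lookup by value maps the persisted string back to its symbol.
    return PaymentStatus(serialized)
```

This keeps generated code in the conventional all-caps style while leaving already-persisted data readable.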

Make a release?

I noticed that unapply on unions was added in September, and so it is not part of the current 0.4.1. Can we get a 0.12.3 release into sonatype?

Generated code legibility

I love twirl, but its one down-side is legibility of the generated code due to excessive whitespace.

What do you guys think would be a good solution to this? Adding a scalariform pass is the first thing that comes to mind, but I'm not sure what your thoughts are wrt adding that dependency.

Add generated copy() methods to generated Swift bindings

In Swift, we can handle modifications to immutable types in a similar way to how we modify Scala immutable types: via a copy method. In Scala, copy methods look like:

def copy(field1: String = this.field1, field2: Int = this.field2, field3: Boolean = this.field3)

Which makes it easy to perform a copy and change only whatever fields one wants changed using named parameters, e.g.:

instance.copy(field2 = 5)

instance.copy(field1 = "updated")
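For comparison, the same copy-with-changes pattern on an immutable type looks like this in Python, where dataclasses.replace plays the role of copy. This is only an illustration of the pattern, not a proposal for the Swift signature; the Example type is hypothetical.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class Example:
    field1: str
    field2: int
    field3: bool


instance = Example(field1="original", field2=1, field3=True)
# replace() builds a new instance, changing only the named fields.
updated = replace(instance, field2=5)
```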

Integrate Pegasus Java generator with Courier

Courier provides an API in the generator-api project that is used by the build system integrations (gradle-plugin and sbt-plugin). This API is then implemented by each language specific generator (scala, swift, android java).

However, there is no implementation of the API for the standard Pegasus Java data binding generator. As a result, it is not possible to generate Pegasus Java data bindings using the Courier schema language.

This is a relatively straightforward task. We simply need to define a new java/generator project and define a java/generator/src/main/java/org/coursera/courier/JavaGenerator.java class that implements PegasusCodeGenerator with a generate method that simply delegates to the existing Pegasus Java generator implementation.

Be sure to document how to set up a Courier project for Java in a README and link to it from the main courier documentation!

Prototype a grammar for Courier

We’ve used the .pdsc file format up to this point for Pegasus schemas for a few reasons:

It’s already been implemented, stable and well tested
The extensibility of JSON has proven very convenient. We have been able to add a number of custom properties to schemas (“defaultNone”, “isTranslatable”, “scala.class” …)

However, writing JSON by hand has some limitations:

Verbosity makes authoring new schemas somewhat annoying/unpleasant
JSON syntax “gets in the way”, reducing readability
All doc strings must be a single line

And the .pdsc JSON structure has a number of warts as well:

Maps and arrays must be declared using an inconvenient "type": { "type": "array", … } format
“defaultNone” feels bolted on
All references to types not in the same namespace must be fully qualified

Proposed .courier Grammar

We have a few goals for a grammar:

Retains extensibility of .pdsc format: All types and fields may be extensible with arbitrary data (which must either be JSON or be fully isomorphic with JSON).
The mapping between the grammar and the .pdsc format must be clear and direct (unless there is a VERY compelling reason not to). This implies that we should continue to use the same keywords used in the .pdsc format unless there is a compelling reason not to.
Foster consistent style: keywords should be lowercase, developer defined types should be PascalCase, field names should be camelCase
Provide a Scala / Swift friendly syntax, also make use of syntax from GraphQL, JSON and HOCON as appropriate
Support multi-line markdown style documentation strings
Eliminate most (ideally all) cruft found in the .pdsc format

Examples:

namespace org.coursera.fortune
include org.coursera.models.common.DateTime

/** A fortune. */
record Fortune {
  @{ "isTranslatable": true }
  title: string,
  /** The fortune telling. */
  telling: FortuneCookie | MagicEightBall | string // a union
  createdAt: DateTime? = nil // optional defaulted to nil/None

}

namespace org.coursera.fortune

enum MagicEightBallAnswer {
  IT_IS_CERTAIN
  /** Where later is at least 10 ms from now. */
  ASK_AGAIN_LATER
  OUTLOOK_NOT_SO_GOOD
}

namespace org.coursera.fortune

record FortuneCookie {
  ...SomeRecord // include fields from another record
  luckyNumbers: array[int]
  exampleMap: map[int, string]
}

namespace org.coursera.models.common

/** ISO 8601 date-time. */
@{
  "scala": {
    "class": "org.joda.time.DateTime",
    "coercerClass": "org.coursera.models.common.DateTimeCoercer"
  }
}
typeref DateTime = string

Grammar

Features:

namespaces and includes at top of file
C-style comments: /* */ and //
/** */ style doc strings, but with markdown support
Declare parameters using: <identifier> ":" <type> (= <default>)?
Default values will be JSON literals: "a string", 1, 3.14, true, ... , [ … ], { … }
Declare generic types in <type>[<typeParams>] syntax
Insignificant commas (?)
Properties expressed using annotation style syntax (@...)
Swift style optional type declarations: <type> “?” (alternatively, we could use Scala style instead, e.g. optional[])
Uniform GADT (generic algebraic data type) representation: <type>[<typeParams>]. We will use this for maps (map[key, value]), arrays (array[items]) and unions (union[Member1, Member2, ...]), and may extend the syntax to user-defined generic types in the future.
GraphQL fragment style includes: …

Markdown

We will start with a “Vanilla” markdown format that is compatible with both Scaladoc and Swift doc strings.

IDE Support

IntelliJ “Custom Language Support” Plugin. Initially we need at least Lexer/Parser/Syntax Highlighter support, but if it’s not too much work, we should go for comprehensive integration.

Future Work

Includes
Multiple schema definitions per file
Markdown references (and appropriate conversion to Scaladoc and Swift doc strings)

Putting an apostrophe (') in a single-line comment (//) makes courier fail to parse the file

For example

namespace org.coursera.learning.course.activity

record Example {
  // The example's field
  field: int
}

causes

[info] Courier: Generating Scala bindings for .pdsc and .courier files for 'compile' configuration.
[error] Courier generator error, cause: java.io.IOException: /Users/marc/base/coursera/infra-services/libs/models/src/main/pegasus/org/coursera/learning/course/activity/Example.courier,"field" or "org.coursera.learning.course.activity.field" cannot be resolved.
[error] 4,19: Type not found: field
[error] 4,16: token recognition error at: '''
[error] 4,19: missing ':' at 'field'

It compiles fine if I remove the apostrophe.

Also interesting: apostrophes in multi-line comments (/**/) don't break it.

Add support for Sets

Developers often want to represent a set in Courier.

Currently, the only available approach is to use a Pegasus array and transform it to/from a Scala Set manually.

Adding an option to our .pdsc files that developers could use to specify the desired binding type would allow us to generate an appropriate Set binding class, e.g.:

{ "type": "array", "items": "Example", "scala": { "type": "set" } }
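The manual transformation that such a binding would automate is small. A Python sketch with hypothetical helper names follows; sorting on write is an assumption made here to keep the serialized array deterministic, and it requires elements that can be ordered.

```python
def array_to_set(items):
    # Reading: duplicates in the serialized array collapse into one element.
    return frozenset(items)


def set_to_array(values):
    # Writing: sort so the serialized form is deterministic.
    return sorted(values)
```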

New top-level type for ids

Right now Courier records work nicely as resource models in Naptime, but Courier doesn't support non-primitive resource keys well.

The problem

It seems natural to use records for composite resource keys. For example, suppose we have a repositories.v1 resource with key :organization~:repositoryName. We can define a record:

record RepositoryId {
  organization: string
  repositoryName: string
}

and make requests like GET /repositories.v1/coursera~courier.

However, if we use this key in another model, it'll be serialized as an object, not as a URL-usable string like "coursera~courier". For example:

record PullRequest {
  title: string
  repositoryId: RepositoryId
}

may be serialized as:

{
  "title": "Example PR",
  "repositoryId": {
    "organization": "coursera",
    "repositoryName": "courier"
  }
}

For client convenience, it'd be nice to have this instead:

{
  "title": "Example PR",
  "repositoryId": "coursera~courier"
}

because then clients can easily pull out the repositoryId field and construct a repositories.v1 request.

A possible workaround right now is to define all ids as typeref RepositoryId = string, which produces the desired serialization, but requires language-specific coercion to preserve type safety.

Proposed solution

Proposed new syntax:

id RepositoryId {
  organization: string
  repositoryName: string
}

where id objects are string-serialized when used in other record or id objects, using one of the string codecs Courier already supports.
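A minimal Python sketch of the intended serialization. It assumes a plain "~"-joined form and that no field value contains "~"; a real string codec would need escaping, and this is not Courier's existing string codec.

```python
from dataclasses import dataclass

SEPARATOR = "~"


@dataclass(frozen=True)
class RepositoryId:
    organization: str
    repository_name: str

    def to_string(self) -> str:
        return SEPARATOR.join([self.organization, self.repository_name])

    @classmethod
    def from_string(cls, serialized: str) -> "RepositoryId":
        parts = serialized.split(SEPARATOR)
        if len(parts) != 2:
            raise ValueError(f"Expected 2 parts, got {len(parts)}: {serialized!r}")
        return cls(*parts)


repo_id = RepositoryId("coursera", "courier")
# When embedded in another record, the id is written as a string.
pull_request = {"title": "Example PR", "repositoryId": repo_id.to_string()}
```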

DataTemplate.readRecord broken on newly generated scala records

(See #71 for reproduction)

For quite a few versions now (since 5acb817, I think), generated Scala records have been incompatible with the readRecord methods in org.coursera.courier.templates.DataTemplates.

Attempting to read a record results in java.lang.NoSuchMethodException: org.coursera.records.test.Simple$.apply, because the previous apply(DataMap, DataConversion) is now build(DataMap, DataConversion).

My temptation is to update DataTemplates to call build instead of apply, but that would be backwards incompatible for Courier users who are still operating on old templates. We could also perform two lookups in case the first one fails, but that would double the reflection work for either old or new clients. I'm curious to hear your thoughts on how to remediate this.

This is relatively high priority for us at Instrumental so would love your thoughts.

SBT plugin does not resolve .courier references from dependencies

The plugin currently only resolves files in the <lib>/pegasus directory and does not offer a simple way to configure or resolve .courier files from a dependent library. This leads to minor annoyances with shared models from other packages that you want to reference in Courier sources, and it is a blocker for splitting up a centralized models repository. Note that Courier codegen classes are resolvable because that happens after the SBT plugin's resolver code path.