Giter Club home page Giter Club logo

smlfmt's Introduction

smlfmt

A custom parser and code formatter for Standard ML. smlfmt is fast, configurable, and usable for large projects.

All of the SML code in this repository (everything in src) has been formatted by smlfmt. Take a look!

(Nov 2, 2023) Released v1.1.0!

(Jan 10, 2023) Released v1.0.0!

(Dec 29, 2022) changed repository name to smlfmt (used to be parse-sml).

Features

  • Full support for Standard ML according to the formal definition.
  • Fast -- see performance results below.
  • Usable for large projects.
  • Helpful error messages with visual code references and syntax highlighting.
  • Support for various SuccessorML syntax extensions.
  • Configurable indentation size and maximum width.
  • Emacs integration (see smlfmt.el)

Performance

smlfmt is fast enough for live reformatting on large individual files, and can reformat large projects in seconds.

Here are measurements for smlfmt --preview-only INPUT > out on my MacBook Air (M2, 2022).

Input Size (LoC) Time
src/prettier-print/PrettierExpAndDec.sml 1024 31 ms
src/base/PrettyTabbedDoc.sml 1349 29 ms
src/parse/ParseExpAndDec.sml 1363 28 ms
MLton elaborate-core.fun 3942 104 ms
MLton x86-allocate-registers.fun 11034 216 ms
Entire smlfmt source code (smlfmt.mlb) 20000 (approx.) 305 ms
Entire MLton source code (mlton.mlb) 160000 (approx.) 6.6s

(Timings are averages reported by hyperfine.)

Examples: Error Messages

Example 1

Example 2

Build and run

You need mlton installed.

Do make and then pass a .sml file to format it in-place, for example:

$ make
$ ./smlfmt --preview src/smlfmt.sml       # modifies src/smlfmt.sml

You can also pass code through stdin and stdout.

$ echo "val x = 5 val y = 6" | ./smlfmt   # formatted output written to terminal

$ cat foo.sml | ./smlfmt > bar.sml        # take foo.sml as input and write output to bar.sml

$ ./smlfmt < foo.sml > bar.sml            # same as above

To format many files all at once, pass a .mlb file. Use --force to skip overwrite confirmations.

$ ./smlfmt --force src/smlfmt.mlb

Command-line options

--force skips interactive overwrite confirmations.

--preview also prints any formatted code to stdout, with syntax highlighting (if shown on terminal supporting ANSI colors).

--preview-only is the same as --preview, but also skips writing to file. (This is incompatible with --force.)

--read-only will only parse files (and report syntax errors). No interactive confirmation, and no files will be modified. (This is incompatible with --force.)

--check will verify that files are already formatted (for example, for CI). No interactive confirmation, and no files will be modified. (This is incompatible with --force.)

-mlb-path-var '<key> <value>' for handling path variables, similar to MLton's path maps.

-ribbon-frac <real> (default 1.0) controls pretty-printing. The ribbon-frac (between 0 and 1) controls how dense each line is, excluding indentation. Low ribbon-fracs will have very little non-whitespace content on each line, whereas high ribbon-fracs will attempt to fill the line as much as possible. Default = 1.0

-max-width <int> (default 80) controls pretty-printing. This is the desired maximum number of columns in each line. Note that it is not always possible to ensure that the output is within the max width. (For example, deeply nested let...in...end expressions will always indent further, and long comments/strings are not broken up.)

-indent-width <int> (default 2) is the desired number of spaces used in the formatted output.

-tab-width <int> (default 4) informs the pretty-printer of the intended tab-width for the input source. Typically this doesn't matter, but there are some cases where a particular tab-width can influence the vertical alignment of multiline comments and strings. (See discussion in issue #28 for more details.)

-engine [pretty|prettier] (default prettier) selects which pretty printing engine to use. The prettier engine is the newest version. The pretty engine is the old version.

--debug-engine enables debugging output, for developers. This flag requires that the --preview-only flag is also enabled.

-allow-top-level-exps [true|false] (default true) controls whether or not top-level expressions (terminated by a semicolon) are allowed.

-allow-opt-bar [true|false] (default false) controls whether or not SuccessorML optional bar syntax is allowed.

-allow-record-pun-exps [true|false] (default false) controls whether or not SuccessorML record punning syntax is allowed.

-allow-or-pats [true|false] (default false) controls whether or not SuccessorML or-pattern syntax is allowed.

-allow-extended-text-consts [true|false] (default false) controls whether or not SuccessorML extended text constants are allowed. Enable this to allow for UTF-8 characters within strings.

smlfmt's People

Contributors

athas avatar azdavis avatar brandonspark avatar harrisongrodin avatar ii8 avatar kopecs avatar must-compute avatar p-ouellette avatar ratsclub avatar shwestrick avatar

Stargazers

 avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar

Watchers

 avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar

smlfmt's Issues

Error recovery

It would be nice to have some sort of error-recovery mechanism, allowing the parser to "work around" small issues and continue parsing the rest of the source. There are some standard techniques (e.g. used by yacc and others) which we should look into.

Format of `andalso` and `orelse`

It seems that smlfmt does not support formating long chain of andalso and orelse, which will exceed the -max-width limit, e.g.:

fun main () =
  if
    1 = 1 andalso 2 = 2 andalso 3 = 3 andalso 4 = 4 andalso 5 = 5 andalso 6 = 6 andalso 7 = 7 andalso 8 = 8
  then
    print "foo\n"
  else
    print "bar\n"

Grouping of function application

Currently,

val _ = this is an extremely long function application which must be wrapped onto additional lines due to its length

formats to

val _ =
  this is an extremely long function application which must be wrapped onto
    additional
    lines
    due
    to
    its
    length

It feels odd that the grouping happens per application; I would expect

val _ =
  this
    is
    an
    extremely
    long
    function
    application
    which
    must
    be
    wrapped
    onto
    additional
    lines
    due
    to
    its
    length

since the entire nested function application can't be grouped together.

My immediate thought for a solution would be to write a simple auxiliary function which traverses the left spine of an App until it reaches a non-App, performing the group only at the top level. It would be nice if we could accomplish this without violating compositionality, though... (That said, given how common curried function application is, perhaps it wouldn't be the worst place to add a special case?)

Errors at EOF

Problem

Error positions are currently reported by attaching to a token, but this doesn't work at the end of the file.

For example, on a file containing just the word val, the parser attempts to parse a pattern immediately after the val, but since this is the last token of the file, it crashes.

This particular example crashes at the call tok i below, because it is an out-of-bounds access.
https://github.com/shwestrick/parse-sml/blob/6e61e1e431ca2aa8f4ddf6f3a47a9d1f8a0b5851/parse/ParsePat.sml#L106-L111

But this problem is more pervasive: many of the errors throughout the parser are similarly designed.

Suggested Solution

Introduce a new error function which similar to previous but now is aware of the token stream, e.g.

fun error (toks: Token.t Seq.t) {pos: int, what: string, explain: string option} =
  if pos >= Seq.length tokens then
    ... (* EOF error *)
  else
    ... (* normal error *)

Then all of the error sites need to be replaced (should be a nice little bit of zen refactoring πŸ’†πŸ»β€β™‚οΈ)

Remove `--stdio` flag

With #84 it doesn't seem necessary anymore. We can use the file input list to determine how to handle stdin/stdout (if no files are given, then use stdin/stdout for input and output).

Report errors involving multiple source lines

Currently, errors are always single-line errors. But it would be really nice to be able to mention multiple lines of the
source in some cases, e.g. "parser error on line 10 but the real problem is on line 8".

To handle this, errors need to be more loosely structured. One possible structure is to describe errors as sequences of elements, where each element is either (1) a paragraph of prose, or (2) a source code reference. We could also add additional elements in the future if we need to, such as lists of elements, inline code within a paragraph, etc.

(Damn, this is going to end up being just another markup DSL isn't it...)

Need to handle infixity correctly for `.mlb` files

Currently, all SML files are parsed using a hard-coded set of infix identifiers corresponding to the initial open standard basis.

However, defining infix operators should carry through in basis declarations in .mlb files. For example if an earlier SML file defines an infix operator and then a later SML file uses that operator in a nonfix style, then this should be an error.

Handling comments for pretty-printing

Currently, comments are ignored for pretty-printing and are not tracked by the AST.

@HarrisonGrodin suggests attaching comments to the current (or next) expression or declaration in the AST. Then, we can pretty-print by e.g. printing the comment before its attached node, which is idempotent.

For example below, comment A would attach to fun foo, comment B would attach to case, and C and D would attach to 0 and 1:

(* A *)
fun foo x = ...

fun bar x =
  case x of (* B *)
    NONE => (* C *) 0
  | SOME y => y + (* D *) 1

each unique `.mlb` should only be parsed once

Currently, when parsing an .mlb, if some other .mlb is included multiple times (e.g. a library) then we will parse the same file multiple times. This is not really a problem, but is not efficient. We could save a few cycles by deduplicating.

(Note: The MLB semantics says that each unique .mlb file should only be "evaluated" once, even if included multiple times. This is significant for SML because loading arbitrary code can have side-effects. But for the purpose of this project---where we only parse code but do not execute it---it's not a big deal to parse a file multiple times.)

Style for nested `if` expressions?

Following up on the discussion here.

Current approach is:

if b then
  if b' then
    e11
  else
    e10
else if b' then
  e01
else
  e00

Alternative might be:

if b then
  if b' then
    e11
  else
    e10
else
  if b' then
    e01
  else
    e00

The former is better for chained if conditions in the else clauses (else if ...), whereas the latter is more symmetric/compositional.

Imprecise basis merging for nonfix declarations

test/fail/basis-merging-nonfix.mlb should fail, but currently it succeeds without error.

The problem: nonfix declarations should override previous infix declarations when merging bases. Currently, they do not.

Suggested solution: while parsing SML source, keep two separate InfixDicts around: one for infix declarations, and one for nonfix declarations. They can cancel each other out when there is any overlap. This allows us to cancel infix/nonfix across source files, as needed for the above test.

Autoformatter tests

It seems like it'd be a good idea to have a test suite for the autoformatter. At a base level, might be good to write "expectation tests", i.e. pairs of files and their autoformatted counterparts, so git diffs would at least show how style changes.

Given expectation tests, might be good to test some desired properties:

  • Idempotence: format o format = format. Can run the formatter twice and make sure the output is the same as formatting once.
  • Token-preserving: can check that lex file = lex (format file)
  • Comment-preserving: can insert comments immediately before and after each token and make sure they stay (including relative position to tokens).

Infix style with large expressions

Currently, this is produced by the autoformatter:

val _ =
  f x + case g x of
          nil => 0
        | x :: xs => x + List.length xs

I'm not sure if it should change, though.

One principle I'm thinking we might want to adopt is:

Modifying code should not move independent code.

This would help modification to formatted code maintain a diff primarily relevant to the code which was actually modified.

Using the current style, modifying the expression f x (e.g., turning it into (f x * 2), or separately renaming f to fLongerName) would move the entire case expression forward, which seems somewhat undesirable since the case branches haven't changed at all.
One idea would be to have the style be

e1
+ e2

which may be grouped if possible; this is consistent with how other infix operators are used (such as |>), and it makes sure the infix operator doesn't get lost at the end of a line.

One downside to this style is that renaming the infix operator itself could move blocks of code; not sure if there's a good fix to this aside from

e1
+
e2

which seems a bit verbose.

Vertically align patterns

Building on the naive #41, I would argue that aligning patterns recursively leads to more readable code. For example:

(* instead of *)
(NONE, _)
(_, NONE)
(SOME x, SOME y)

(* could have *)
(NONE  , _     )
(_     , NONE  )
(SOME x, SOME y)
(* instead of *)
(15, _)
(312, _)

(* could have *)
(15 , _)
(312, _)

The most basic form of alignment, as shown here, would be recursively underneath many tuple patterns.
[Side note: what about record patterns? Should record keys be reordered?]

Unsure if there should be any alignment for sums; e.g.:

(* instead of *)
Apple (x, y)
Orange (x, y)

(* might want *)
Apple  (x, y)
Orange (x, y)

Less clear if this is a good idea, though; while constructors can sometimes be related, their data is often disjoint (and of different types/forms).

After aligning individual patterns, should probably line up each column of patterns in fun declarations, too.

autoformatting: infix chains

The autoformatter currently doesn't handle chains of infix expressions very well, especially with functions that take multiple arguments.

INPUT

infix 2 ++
val x =
  func1 a ++ func1 b ++ func4 a b c d ++ func3 a b c ++ func5 a b c d e
  ++ func3 a b c ++ func1 b ++ func1 c

CURRENT OUTPUT 😬 😬

infix 2 ++
val x =
  func1 a ++ func1 b ++ func4 a b c d ++ func3
                                           a
                                           b
                                           c ++ func5
                                                  a
                                                  b
                                                  c
                                                  d
                                                  e ++ func3
                                                         a
                                                         b
                                                         c ++ func1
                                                                b ++ func1
                                                                       c

Ambiguity in SML grammar definition

Is this valid SML?

val 1 : int :: [] = [1]

Both SML/NJ and MLton's parsers reject it, but I think it's valid according to the Definition.

The Definition says: "If vid has infix status, then β€œexp1 vid exp2” (resp. β€œpat1 vid pat2”) may occur – in parentheses if necessary – wherever the application β€œvid{1=exp1,2=exp2}” or its derived form β€œvid(exp1,exp2)” (resp β€œvid(pat1,pat2)”) would otherwise occur."

This suggests that if the following program is valid, the program up above should be too. (And of course, both MLton and SML/NJ happily accept this program.)

nonfix ::
val ::(1 : int, []) = [1]

Just to be clear, the grammar for patterns is:

<pat> ::= <pat> <vid> <pat>     # where <vid> is an infixed identifier
        | <pat> : <ty>
        | ...

So there is seemingly nothing preventing 1 : int :: [] from being a valid pattern. Why don't SML/NJ and MLton accept it??

Support mlb option "allowExtendedTextConsts true"

Currently, smlfmt will report an error on non-ascii input.

Example file:

val a = "🍰"

Error message:

-- SYNTAX ERROR ----------------------------------------------------------------

Invalid character.

test.sml
  | 
1 | val a = "🍰"
  |          ^

Strings can only contain printable (visible or whitespace) ASCII characters.

Expected behavior

Strings need to handle UTF8 non-ascii characters.

Support top-level expressions

It's not in the Definition, but is supported by many compilers. This would be useful for formatting code that uses the use function.

Parser bug: unexpected functor

This example fails to parse:

fun f x = x
functor F(type t) = struct end

The error:

$ smlfmt test.sml
-- PARSE ERROR ---------------------------

Unexpected token.

test.sml
  |
2 | functor F(type t) = struct end
  | ^^^^^^^

Expected beginning of expression.

This seems like it should be a quick fix. I have a hunch this is due to the strdec / core declaration weirdness in the SML grammar. The parser might be stuck attempting to parse a core declaration, and therefore expects for the expression x immediately before functor to continue. Fixing this might be as simple as forcing at most a single declaration to be parsed whenever strdec -> core dec...? But not sure. Will need to look into it.

TODO: new release

We've had a few nice changes since v1.0.0!

  • #77: --safety-check flag for checking that the output of smlfmt is correct, i.e., that the result is valid SML that exactly matches the source except for whitespace.
    • Note: originally called --check but renamed to --safety-check in #92.
  • #84, #86: automatically detect when stdin / stdout should be used.
  • #87: --read-only flag, useful for checking for syntax errors
  • #93: make install formula
  • #92: --check flag, to check if a file has already been formatted. Useful for CI.
  • Some bugfixes, e.g. 656885b

It would be nice to get #92 in before this next release, since that is in the works. Done.

Also, we should add mention of https://github.com/diku-dk/smlfmt.el to the README, perhaps in feature list. Done.

Editor integration

I don't know whether this should be a PR adding it to the README or whatnot, but I have found this Emacs function quite handy for running smlfmt on the current file buffer:

 (defun smlfmt ()
   (interactive)
   (save-buffer)
   (let* ((file (buffer-file-name))
          (ret (with-temp-buffer
                 (let ((code (call-process "smlfmt" nil '(t t) nil "--force" file)))
                   (if (= code 0)
                       nil
                     (buffer-string))))))
     (if (eq ret nil)
         (revert-buffer t t t)
       (message "%s" ret))))

Heuristic for placing fun decl body on separate line

The heuristic for determining whether the body of a function declaration is placed on a newline seems rather conservative and can lead to seemingly wasted horizontal space, especially when mixed with val declarations that do not seem to use the same heuristic.

❯ smlfmt -max-width 45 --preview-only z.sml 
---- z.sml ----
infixr 1 $

val f = fn (g, x) => x
val f = fn (g, x) => g x
val f = fn (g, x) => g $ x
val f = fn (g, x) => g $ g x
val f = fn (g, x) => g $ g $ x
val f = fn (g, x) => g $ g $ g x
val f = fn (g, x) => g $ g $ g $ x
val f = fn (g, x) => g $ g $ g $ g x
val f = fn (g, x) => g $ g $ g $ g $ x
val f = fn (g, x) => g $ g $ g $ g $ g x
val f = fn (g, x) => g $ g $ g $ g $ g $ x
val f = fn (g, x) => g $ g $ g $ g $ g $ g x
val f = fn (g, x) =>
  g $ g $ g $ g $ g $ g $ x

fun f (g, x) = x
fun f (g, x) = g x
fun f (g, x) = g $ x
fun f (g, x) = g $ g x
fun f (g, x) = g $ g $ x
fun f (g, x) =
  g $ g $ g x
fun f (g, x) =
  g $ g $ g $ x
fun f (g, x) =
  g $ g $ g $ g x
fun f (g, x) =
  g $ g $ g $ g $ x
fun f (g, x) =
  g $ g $ g $ g $ g x
fun f (g, x) =
  g $ g $ g $ g $ g $ x
fun f (g, x) =
  g $ g $ g $ g $ g $ g x
fun f (g, x) =
  g $ g $ g $ g $ g $ g $ x
--------

`sharing` specs move to line of previous spec

sharing specs seem to jump on the same line as existing code. For example, this:

signature S =
  sig
    type t1
    type t2

    sharing type t1 = t2
  end

autoformats to:

signature S =
  sig
    type t1
    type t2 sharing type t1 = t2
  end

Revisit `Promise` overhead in document structure

Interesting discussion here with Matthew about different implementations of Promise and their performance. We should revisit this and see if there are further performance improvements lurking here.

Some context: in cb74c66, I created MemoizedPromise which yielded a slight time improvement at the cost of a slight increase in space. It's possible that a different implementation could yield the same time improvement but with smaller space overhead.

Vertically align tokens

We should at least align delimiters, imo, e.g.:

val apple  = 0
and orange = 1

val () =
  case foo of
    (NONE, _)        => ()
  | (_, NONE)        => ()
  | (SOME x, SOME y) => ()

datatype apple = Foo | Bar
and orange     = Baz

I would also prefer right-aligning keywords before name definitions so names are aligned, e.g.:

eqtype apple
   and orange

datatype apple  = Foo | Bar
     and orange = Baz

Unsure if this is a popular opinion, though.

Unusual command line interface

Currently there seems to be no way to write the output to a different file.
For example:

smlfmt --preview-only a.sml > b.sml

will produce an invalid b.sml that doesn't compile.
And further the default behavior without passing arguments is to overwrite the given file after asking for interactive confirmation.

The typical behavior for command line text transformation tools like this is to use stdin and stdout by default, not ask for confirmation and not overwrite by default. Here are some examples:

# The ubiquitous sed tool takes input from a given file or stdin and outputs to stdout, unless `-i` is passed for "in place"
sed s/a/b/ > output.txt
sed s/a/b/ input.txt > output.txt
sed -i s/a/b/ file.txt

# The haskell code formatter ormolu also prints to stdout, `--mode inplace` is used to modify the file in place
ormolu Module.hs > FormattedModule.hs
ormolu --mode inplace Module.hs

# The popular minify tool has a `-o` option to specify an output file instead of stdout
minify script.js > script.min.js
minify -o script.js script.js

# djot markup language processor parses input from stdin, unless input files are listed, and always outputs to stdout.
djot a.dj b.dj c.dj > result.html

smlfmt is at odds with a convention that, in my opinion, is quite reasonable.

Rightward drift in deep function application

This output has a too much rightward drift in the first branch:

fun filterIdx p s =
  case s of
    Flat (Full slice) =>
      Flat
        (Full
           (AS.full
              (SeqBasis.filter gran (0, AS.length slice) (AS.nth slice)
                 (fn k => p (k, AS.nth slice k)))))

  | Flat (Delay (i, j, f)) =>
      Flat (Full (AS.full (SeqBasis.filter gran (i, j) f (fn k => p (k, f k)))))

  | _ => filterIdx p (force s)

If we increase the max width to 100, it becomes more reasonable again:

fun filterIdx p s =
  case s of
    Flat (Full slice) =>
      Flat (Full (AS.full (SeqBasis.filter gran (0, AS.length slice) (AS.nth slice) (fn k =>
        p (k, AS.nth slice k)))))

  | Flat (Delay (i, j, f)) =>
      Flat (Full (AS.full (SeqBasis.filter gran (i, j) f (fn k => p (k, f k)))))

  | _ => filterIdx p (force s)

To fix this, I think "splittable expressions" need to be generalized to handle multiple possible splits...

Continuous Integration Support

Hello,

I would like to run smlfmt on CI to check if the files are formatted correctly. I tried to run all the command-line flags and couldn't find one that would return an error code in case of a file not being formatted. Is there some way to setup something like this?

Thanks for the awesome tool!

SuccessorML syntax

Hi there - Thanks for the nice tool.

I'm wondering what your stance is on smlfmt being able to format code that uses the SuccessorML syntax additions:

http://mlton.org/SuccessorML

Personally I like allowOptBar, allowOrPats, and allowRecordPunExps of those, and having to choose between being able to use them, and being able to autoformat code, is a tough choice!

Long single-line comments can force large indentation increases

INPUT

val x = ((* this is a long commment which will cause the following expression to be formatted badly *)
  case foo of
    Baz => Bar
  | Bar => Baz)

OUTPUT

val x =
  ((* this is a long commment which will cause the following expression to be formatted badly *)case foo of
                                                                                                  Baz =>
                                                                                                    Bar
                                                                                                | Bar =>
                                                                                                    Baz)

Suggested solution: disallow inlining of comments that are above a threshold. Add a -max-inline-comment <int> command-line option to control this.

Implementation Notes: It turns out this is a little tricky to fix. In TokenDoc.insertComments, we first add comments back in, either inlined or not, based on whether or not the surrounding documents were placed above or beside one another. I first thought we could just change this step. However, comments are also "raised" as high as possible in the doc structure (see discussion here). This complicates things. Perhaps after tokens have been raised, we should do a final pass which re-inserts comments that were inlined but should not have been.

autoformatting: if-then-else chaining increases indentation

The autoformatter should handle chains of if-then-else expressions to make sure that indentation doesn't get
funked up, like in the example below.

INPUT

val x =
  if b1 then e1
  else if b2 then e2
  else if b3 then e3
  else if b4 then e4
  else if b5 then e5
  else if b6 then e6
  else if b7 then e7
  else e8

CURRENT OUTPUT

val x =
  if b1 then
    e1
  else
    if b2 then
      e2
    else
      if b3 then
        e3
      else
        if b4 then
          e4
        else
          if b5 then
            e5
          else
            if b6 then e6 else if b7 then e7 else e8

autoformatting: keep blank lines from source where appropriate

Currently, the autoformatter throws blank lines away. We should preserve these, perhaps with some sort of mild reformatting. One possibility would be to allow up to 2 adjacent blank lines. (Personally, I tend to use 1 blank line between small function definitions and 2 blank lines between large function definitions. It makes it much easier to spot where the function definition ends.)

Examples

Top-level declarations often have one or two blank lines between.

fun f x =
  let ...
  in ...
  end

fun g y = ...

fun h z = ...

Groups of related declarations, separated by blank lines.

val a = f x
val b = g x
val c = h x

val a' = f' x
val b' = g' x
val c' = h' x

Non-idempotent formatting

This is an example of non-idempotent formatting, which seems to be due to how comments are handled. We should fix this.

Original:

if offset i + j + 1 < offset (i+1) then
(* innerIdx := j+1 *) ()
else
( outerIdx := advanceUntilNonEmpty (i+1)
(* ; innerIdx := 0 *)
);

Here's the diff between running the formatter once and twice. This seems to happen because the comment crosses the max width.

                 if offset i + j + 1 < offset (i + 1) then
                   ()
                 else
-                  (outerIdx := advanceUntilNonEmpty (i + 1)(* ; innerIdx := 0 *) );
+                  (outerIdx
+                   := advanceUntilNonEmpty (i + 1) (* ; innerIdx := 0 *));

Note that smlfmt catches this when run with --check:

$ ./smlfmt --check src/lib/github.com/mpllang/mpllib/OldDelayedSeq.sml 
WARNING: src/lib/github.com/mpllang/mpllib/OldDelayedSeq.sml: non-idempotent formatting detected. 
Don't worry! The output is still correct; this is only an aesthetic issue. To help improve `smlfmt`, 
please consider submitting a bug report: https://github.com/shwestrick/smlfmt/issues
overwrite src/lib/github.com/mpllang/mpllib/OldDelayedSeq.sml [y/N]?

Handle fixities for `open` declarations (opening bases) in MLBs

MLB allows for binding bases to variables and then later opening them. Currently, when parsing a .mlb, we ignore these. This is incorrect for fixity parsing, because opening a basis can add/remove infix operators (via SML infix and nonfix declarations).

Comments move to the same line as the following declaration

This code:

structure S = (* A *)
  struct
    (* B *)
    val x = 0
    (* C *)
    val y = 1
  end
(* D *)

val z = 2

autoformats to:

structure S =
  (* A *)
  struct
    (* B *) val x = 0
    (* C *) val y = 1
  end
(* D *) val z = 2

A is probably fine, but B/C/D are clearly a bit odd. I would expect at least B and C to stay where they are; not sure where D should go, but it probably shouldn't pull val z closer, at least.

Issue with persistence of infixity status

Consider the following code fragment:

val << = fn (x, y) => x + y
fun foo x =
  let
    infix <<
  in
    x << 5
  end


val res = 2 << 3

It is rejected by the Standard ML of New Jersey compiler with the following error:

Ξ» ~/P/parse-sml on add-local-dec  rlwrap smlnj test.sml   19:10:08
Standard ML of New Jersey (64-bit) v110.99.1 [built: Mon Apr 19 12:32:21 2021]
[opening test.sml]
test.sml:11.5-11.17 Error: operator is not a function [overload - bad instantiation]
  operator: 'Z[INT]
  in expression:
    2 <<

However, the parser currently accepts the program with:

 ~/P/parse-sml on main β—¦ ./main test.sml                 19:10:50

val << = fn (x, y) => x + y
fun foo x =
  let
    infix <<
  in
    x << 5
  end


val res = 2 << 3

Lexing succeeded.
Parsing...

val << = fn (x, y) => (x + y)
fun foo x = let infix << in (x << 5) end
val res = 2 << 3

Parsing succeeded.

This seems to suggest to be that the infdicts are storing information past when they should, if I'm understanding what you were trying to do with the infdicts correctly.

Automatically detect if stdin isn't a terminal

It would be nice if --stdio wasn't required when stdin isn't a terminal. I think this should be as straightforward as the following, if there's interest:

val stdio = stdio orelse
              (* if stdin is not a terminal and an incompabible flag is not set, take input from stdin *)
              (not (Posix.ProcEnv.isatty Posix.FileSys.stdin)
                  andalso not (doForce orelse preview orelse previewOnly orelse not (List.null inputfiles)))

Given some POSIX APIs are already used this shouldn't impose any additional platform constraints.

Actually, given that --stdio isn't (yet) in a release, would just testing this be generally preferable, instead of a flag? I think this would still be fine for @azdavis too, given how millet would invoke.

Optional semicolons between topdecs

Problem: currently, the parser rejects programs that have semicolons between top-level declarations.

Suggested solution: redefine the Ast.ast type to allow for optional semicolons between top-level declarations.

Background: The <strdec> class allows semicolons to separate structure-level (and core) declarations. So, for example, a single <topdec> could contain multiple structure declarations separated by semicolons. But notably, you can't do the same with signatures or functors! This is absolute madness, and clearly an oversight in the definition of the grammar.

Strangely, the definition separately defines "programs" as sequences of top-level declarations:

<program> ::= <topdec> ; <program>

This effectively allows for sequencing <strdec> with <sigdec> and <fundec> by separating with semicolons. It's a bit awkward though, because the discussion in the definition is overly focused on the semantics of interactive REPLs. It seems equivalent to just allow for an optional semicolon between all top-level declarations. Then, perhaps the semantics of the REPL care about the placement of the semicolons, but the parser certainly should not.

autoformatting: reflow multi-line comments and strings

Currently, the pretty-printer doesn't handle multi-line elements very well. See example below.

This is a little tricky because in the layout algorithm, each text element is assumed to not contain newlines. To fix this, we should probably just do special-case handling of comments and strings, perhaps as a separate pass over a TokenDoc.

Example of bad formatting

INPUT

val x = (* this is a
         * multiline comment
         * which needs some love
         *)
    "I wonder where \
    \the rest of this string will go?"

OUTPUT

val x =
  (* this is a
         * multiline comment
         * which needs some love
         *) "I wonder where \
    \the rest of this string will go?"

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    πŸ–– Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. πŸ“ŠπŸ“ˆπŸŽ‰

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❀️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.