shwestrick / smlfmt Goto Github PK
View Code? Open in Web Editor NEWA custom parser/auto-formatter for Standard ML
License: MIT License
A custom parser/auto-formatter for Standard ML
License: MIT License
With #84 it doesn't seem necessary anymore. We can use the file input list to determine how to handle stdin/stdout (if no files are given, then use stdin/stdout for input and output).
Currently, this is produced by the autoformatter:
val _ =
f x + case g x of
nil => 0
| x :: xs => x + List.length xs
I'm not sure if it should change, though.
One principle I'm thinking we might want to adopt is:
Modifying code should not move independent code.
This would help modification to formatted code maintain a diff primarily relevant to the code which was actually modified.
Using the current style, modifying the expression f x
(e.g., turning it into (f x * 2)
, or separately renaming f
to fLongerName
) would move the entire case
expression forward, which seems somewhat undesirable since the case branches haven't changed at all.
One idea would be to have the style be
e1
+ e2
which may be grouped if possible; this is consistent with how other infix operators are used (such as |>
), and it makes sure the infix operator doesn't get lost at the end of a line.
One downside to this style is that renaming the infix operator itself could move blocks of code; not sure if there's a good fix to this aside from
e1
+
e2
which seems a bit verbose.
Currently, smlfmt will report an error on non-ascii input.
Example file:
val a = "π°"
Error message:
-- SYNTAX ERROR ----------------------------------------------------------------
Invalid character.
test.sml
|
1 | val a = "π°"
| ^
Strings can only contain printable (visible or whitespace) ASCII characters.
Expected behavior
Strings need to handle UTF8 non-ascii characters.
It would be nice if --stdio
wasn't required when stdin isn't a terminal. I think this should be as straightforward as the following, if there's interest:
val stdio = stdio orelse
(* if stdin is not a terminal and an incompabible flag is not set, take input from stdin *)
(not (Posix.ProcEnv.isatty Posix.FileSys.stdin)
andalso not (doForce orelse preview orelse previewOnly orelse not (List.null inputfiles)))
Given some POSIX APIs are already used this shouldn't impose any additional platform constraints.
Actually, given that --stdio
isn't (yet) in a release, would just testing this be generally preferable, instead of a flag? I think this would still be fine for @azdavis too, given how millet would invoke.
e.g.,
!e (* instead of *) ! e
`e (* instead of *) ` e
although
e1 := e2 (* instead of *) e1:=e2
e1 + e2 (* instead of *) e1+e2
This is an example of non-idempotent formatting, which seems to be due to how comments are handled. We should fix this.
Original:
smlfmt/src/lib/github.com/mpllang/mpllib/OldDelayedSeq.sml
Lines 213 to 218 in 8090550
Here's the diff between running the formatter once and twice. This seems to happen because the comment crosses the max width.
if offset i + j + 1 < offset (i + 1) then
()
else
- (outerIdx := advanceUntilNonEmpty (i + 1)(* ; innerIdx := 0 *) );
+ (outerIdx
+ := advanceUntilNonEmpty (i + 1) (* ; innerIdx := 0 *));
Note that smlfmt
catches this when run with --check
:
$ ./smlfmt --check src/lib/github.com/mpllang/mpllib/OldDelayedSeq.sml
WARNING: src/lib/github.com/mpllang/mpllib/OldDelayedSeq.sml: non-idempotent formatting detected.
Don't worry! The output is still correct; this is only an aesthetic issue. To help improve `smlfmt`,
please consider submitting a bug report: https://github.com/shwestrick/smlfmt/issues
overwrite src/lib/github.com/mpllang/mpllib/OldDelayedSeq.sml [y/N]?
It would be nice to have some sort of error-recovery mechanism, allowing the parser to "work around" small issues and continue parsing the rest of the source. There are some standard techniques (e.g. used by yacc and others) which we should look into.
Error positions are currently reported by attaching to a token, but this doesn't work at the end of the file.
For example, on a file containing just the word val
, the parser attempts to parse a pattern immediately after the val
, but since this is the last token of the file, it crashes.
This particular example crashes at the call tok i
below, because it is an out-of-bounds access.
https://github.com/shwestrick/parse-sml/blob/6e61e1e431ca2aa8f4ddf6f3a47a9d1f8a0b5851/parse/ParsePat.sml#L106-L111
But this problem is more pervasive: many of the errors throughout the parser are similarly designed.
Introduce a new error function which similar to previous but now is aware of the token stream, e.g.
fun error (toks: Token.t Seq.t) {pos: int, what: string, explain: string option} =
if pos >= Seq.length tokens then
... (* EOF error *)
else
... (* normal error *)
Then all of the error sites need to be replaced (should be a nice little bit of zen refactoring ππ»ββοΈ)
This code:
structure S = (* A *)
struct
(* B *)
val x = 0
(* C *)
val y = 1
end
(* D *)
val z = 2
autoformats to:
structure S =
(* A *)
struct
(* B *) val x = 0
(* C *) val y = 1
end
(* D *) val z = 2
A
is probably fine, but B
/C
/D
are clearly a bit odd. I would expect at least B
and C
to stay where they are; not sure where D
should go, but it probably shouldn't pull val z
closer, at least.
Currently, the autoformatter throws blank lines away. We should preserve these, perhaps with some sort of mild reformatting. One possibility would be to allow up to 2 adjacent blank lines. (Personally, I tend to use 1 blank line between small function definitions and 2 blank lines between large function definitions. It makes it much easier to spot where the function definition ends.)
Top-level declarations often have one or two blank lines between.
fun f x =
let ...
in ...
end
fun g y = ...
fun h z = ...
Groups of related declarations, separated by blank lines.
val a = f x
val b = g x
val c = h x
val a' = f' x
val b' = g' x
val c' = h' x
Currently, all SML files are parsed using a hard-coded set of infix identifiers corresponding to the initial open standard basis.
However, defining infix operators should carry through in basis declarations in .mlb
files. For example if an earlier SML file defines an infix operator and then a later SML file uses that operator in a nonfix style, then this should be an error.
Consider the following code fragment:
val << = fn (x, y) => x + y
fun foo x =
let
infix <<
in
x << 5
end
val res = 2 << 3
It is rejected by the Standard ML of New Jersey compiler with the following error:
Ξ» ~/P/parse-sml on add-local-dec rlwrap smlnj test.sml 19:10:08
Standard ML of New Jersey (64-bit) v110.99.1 [built: Mon Apr 19 12:32:21 2021]
[opening test.sml]
test.sml:11.5-11.17 Error: operator is not a function [overload - bad instantiation]
operator: 'Z[INT]
in expression:
2 <<
However, the parser currently accepts the program with:
~/P/parse-sml on main β¦ ./main test.sml 19:10:50
val << = fn (x, y) => x + y
fun foo x =
let
infix <<
in
x << 5
end
val res = 2 << 3
Lexing succeeded.
Parsing...
val << = fn (x, y) => (x + y)
fun foo x = let infix << in (x << 5) end
val res = 2 << 3
Parsing succeeded.
This seems to suggest to be that the infdicts are storing information past when they should, if I'm understanding what you were trying to do with the infdicts correctly.
Problem: currently, the parser rejects programs that have semicolons between top-level declarations.
Suggested solution: redefine the Ast.ast
type to allow for optional semicolons between top-level declarations.
Background: The <strdec>
class allows semicolons to separate structure-level (and core) declarations. So, for example, a single <topdec>
could contain multiple structure declarations separated by semicolons. But notably, you can't do the same with signatures or functors! This is absolute madness, and clearly an oversight in the definition of the grammar.
Strangely, the definition separately defines "programs" as sequences of top-level declarations:
<program> ::= <topdec> ; <program>
This effectively allows for sequencing <strdec>
with <sigdec>
and <fundec>
by separating with semicolons. It's a bit awkward though, because the discussion in the definition is overly focused on the semantics of interactive REPLs. It seems equivalent to just allow for an optional semicolon between all top-level declarations. Then, perhaps the semantics of the REPL care about the placement of the semicolons, but the parser certainly should not.
Currently, when parsing an .mlb
, if some other .mlb
is included multiple times (e.g. a library) then we will parse the same file multiple times. This is not really a problem, but is not efficient. We could save a few cycles by deduplicating.
(Note: The MLB semantics says that each unique .mlb
file should only be "evaluated" once, even if included multiple times. This is significant for SML because loading arbitrary code can have side-effects. But for the purpose of this project---where we only parse code but do not execute it---it's not a big deal to parse a file multiple times.)
The autoformatter currently doesn't handle chains of infix expressions very well, especially with functions that take multiple arguments.
INPUT
infix 2 ++
val x =
func1 a ++ func1 b ++ func4 a b c d ++ func3 a b c ++ func5 a b c d e
++ func3 a b c ++ func1 b ++ func1 c
CURRENT OUTPUT π¬ π¬
infix 2 ++
val x =
func1 a ++ func1 b ++ func4 a b c d ++ func3
a
b
c ++ func5
a
b
c
d
e ++ func3
a
b
c ++ func1
b ++ func1
c
This output has a too much rightward drift in the first branch:
fun filterIdx p s =
case s of
Flat (Full slice) =>
Flat
(Full
(AS.full
(SeqBasis.filter gran (0, AS.length slice) (AS.nth slice)
(fn k => p (k, AS.nth slice k)))))
| Flat (Delay (i, j, f)) =>
Flat (Full (AS.full (SeqBasis.filter gran (i, j) f (fn k => p (k, f k)))))
| _ => filterIdx p (force s)
If we increase the max width to 100, it becomes more reasonable again:
fun filterIdx p s =
case s of
Flat (Full slice) =>
Flat (Full (AS.full (SeqBasis.filter gran (0, AS.length slice) (AS.nth slice) (fn k =>
p (k, AS.nth slice k)))))
| Flat (Delay (i, j, f)) =>
Flat (Full (AS.full (SeqBasis.filter gran (i, j) f (fn k => p (k, f k)))))
| _ => filterIdx p (force s)
To fix this, I think "splittable expressions" need to be generalized to handle multiple possible splits...
Currently, errors are always single-line errors. But it would be really nice to be able to mention multiple lines of the
source in some cases, e.g. "parser error on line 10 but the real problem is on line 8".
To handle this, errors need to be more loosely structured. One possible structure is to describe errors as sequences of elements, where each element is either (1) a paragraph of prose, or (2) a source code reference. We could also add additional elements in the future if we need to, such as lists of elements, inline code within a paragraph, etc.
(Damn, this is going to end up being just another markup DSL isn't it...)
test/fail/basis-merging-nonfix.mlb
should fail, but currently it succeeds without error.
The problem: nonfix
declarations should override previous infix declarations when merging bases. Currently, they do not.
Suggested solution: while parsing SML source, keep two separate InfixDict
s around: one for infix declarations, and one for nonfix declarations. They can cancel each other out when there is any overlap. This allows us to cancel infix/nonfix across source files, as needed for the above test.
For example:
-- PARSE ERROR ------------------------------
Unexpected token.
foo.sml
|
1 | fn
| ^^
Here, the error is supposed to pointing at the token fn
, but the alignment is wrong because there is a tab before that token.
MLB allows for binding bases to variables and then later opening them. Currently, when parsing a .mlb
, we ignore these. This is incorrect for fixity parsing, because opening a basis can add/remove infix operators (via SML infix
and nonfix
declarations).
The printer seems to include trailing whitespace:
signature S =
sig
type t
#####
type u
end
#
val x = 3
sharing
specs seem to jump on the same line as existing code. For example, this:
signature S =
sig
type t1
type t2
sharing type t1 = t2
end
autoformats to:
signature S =
sig
type t1
type t2 sharing type t1 = t2
end
It seems that smlfmt
does not support formating long chain of andalso
and orelse
, which will exceed the -max-width
limit, e.g.:
fun main () =
if
1 = 1 andalso 2 = 2 andalso 3 = 3 andalso 4 = 4 andalso 5 = 5 andalso 6 = 6 andalso 7 = 7 andalso 8 = 8
then
print "foo\n"
else
print "bar\n"
It seems like it'd be a good idea to have a test suite for the autoformatter. At a base level, might be good to write "expectation tests", i.e. pairs of files and their autoformatted counterparts, so git diffs would at least show how style changes.
Given expectation tests, might be good to test some desired properties:
format o format = format
. Can run the formatter twice and make sure the output is the same as formatting once.lex file = lex (format file)
We've had a few nice changes since v1.0.0
!
--safety-check
flag for checking that the output of smlfmt
is correct, i.e., that the result is valid SML that exactly matches the source except for whitespace.
--check
but renamed to --safety-check
in #92.stdin
/ stdout
should be used.--read-only
flag, useful for checking for syntax errorsmake install
formula--check
flag, to check if a file has already been formatted. Useful for CI.It would be nice to get #92 in before this next release, since that is in the works. Done.
Also, we should add mention of https://github.com/diku-dk/smlfmt.el to the README, perhaps in feature list. Done.
Building on the naive #41, I would argue that aligning patterns recursively leads to more readable code. For example:
(* instead of *)
(NONE, _)
(_, NONE)
(SOME x, SOME y)
(* could have *)
(NONE , _ )
(_ , NONE )
(SOME x, SOME y)
(* instead of *)
(15, _)
(312, _)
(* could have *)
(15 , _)
(312, _)
The most basic form of alignment, as shown here, would be recursively underneath many tuple patterns.
[Side note: what about record patterns? Should record keys be reordered?]
Unsure if there should be any alignment for sums; e.g.:
(* instead of *)
Apple (x, y)
Orange (x, y)
(* might want *)
Apple (x, y)
Orange (x, y)
Less clear if this is a good idea, though; while constructors can sometimes be related, their data is often disjoint (and of different types/forms).
After aligning individual patterns, should probably line up each column of patterns in fun
declarations, too.
It's not in the Definition, but is supported by many compilers. This would be useful for formatting code that uses the use
function.
Hello,
I would like to run smlfmt
on CI to check if the files are formatted correctly. I tried to run all the command-line flags and couldn't find one that would return an error code in case of a file not being formatted. Is there some way to setup something like this?
Thanks for the awesome tool!
INPUT
val x = ((* this is a long commment which will cause the following expression to be formatted badly *)
case foo of
Baz => Bar
| Bar => Baz)
OUTPUT
val x =
((* this is a long commment which will cause the following expression to be formatted badly *)case foo of
Baz =>
Bar
| Bar =>
Baz)
Suggested solution: disallow inlining of comments that are above a threshold. Add a -max-inline-comment <int>
command-line option to control this.
Implementation Notes: It turns out this is a little tricky to fix. In TokenDoc.insertComments
, we first add comments back in, either inlined or not, based on whether or not the surrounding documents were placed above or beside one another. I first thought we could just change this step. However, comments are also "raised" as high as possible in the doc structure (see discussion here). This complicates things. Perhaps after tokens have been raised, we should do a final pass which re-inserts comments that were inlined but should not have been.
Currently,
val _ = this is an extremely long function application which must be wrapped onto additional lines due to its length
formats to
val _ =
this is an extremely long function application which must be wrapped onto
additional
lines
due
to
its
length
It feels odd that the grouping happens per application; I would expect
val _ =
this
is
an
extremely
long
function
application
which
must
be
wrapped
onto
additional
lines
due
to
its
length
since the entire nested function application can't be grouped together.
My immediate thought for a solution would be to write a simple auxiliary function which traverses the left spine of an App
until it reaches a non-App
, performing the group
only at the top level. It would be nice if we could accomplish this without violating compositionality, though... (That said, given how common curried function application is, perhaps it wouldn't be the worst place to add a special case?)
The heuristic for determining whether the body of a function declaration is placed on a newline seems rather conservative and can lead to seemingly wasted horizontal space, especially when mixed with val declarations that do not seem to use the same heuristic.
β― smlfmt -max-width 45 --preview-only z.sml
---- z.sml ----
infixr 1 $
val f = fn (g, x) => x
val f = fn (g, x) => g x
val f = fn (g, x) => g $ x
val f = fn (g, x) => g $ g x
val f = fn (g, x) => g $ g $ x
val f = fn (g, x) => g $ g $ g x
val f = fn (g, x) => g $ g $ g $ x
val f = fn (g, x) => g $ g $ g $ g x
val f = fn (g, x) => g $ g $ g $ g $ x
val f = fn (g, x) => g $ g $ g $ g $ g x
val f = fn (g, x) => g $ g $ g $ g $ g $ x
val f = fn (g, x) => g $ g $ g $ g $ g $ g x
val f = fn (g, x) =>
g $ g $ g $ g $ g $ g $ x
fun f (g, x) = x
fun f (g, x) = g x
fun f (g, x) = g $ x
fun f (g, x) = g $ g x
fun f (g, x) = g $ g $ x
fun f (g, x) =
g $ g $ g x
fun f (g, x) =
g $ g $ g $ x
fun f (g, x) =
g $ g $ g $ g x
fun f (g, x) =
g $ g $ g $ g $ x
fun f (g, x) =
g $ g $ g $ g $ g x
fun f (g, x) =
g $ g $ g $ g $ g $ x
fun f (g, x) =
g $ g $ g $ g $ g $ g x
fun f (g, x) =
g $ g $ g $ g $ g $ g $ x
--------
Hi there - Thanks for the nice tool.
I'm wondering what your stance is on smlfmt
being able to format code that uses the SuccessorML syntax additions:
Personally I like allowOptBar
, allowOrPats
, and allowRecordPunExps
of those, and having to choose between being able to use them, and being able to autoformat code, is a tough choice!
This example fails to parse:
fun f x = x
functor F(type t) = struct end
The error:
$ smlfmt test.sml
-- PARSE ERROR ---------------------------
Unexpected token.
test.sml
|
2 | functor F(type t) = struct end
| ^^^^^^^
Expected beginning of expression.
This seems like it should be a quick fix. I have a hunch this is due to the strdec / core declaration weirdness in the SML grammar. The parser might be stuck attempting to parse a core declaration, and therefore expects for the expression x
immediately before functor
to continue. Fixing this might be as simple as forcing at most a single declaration to be parsed whenever strdec -> core dec...? But not sure. Will need to look into it.
Currently, the pretty-printer doesn't handle multi-line elements very well. See example below.
This is a little tricky because in the layout algorithm, each text element is assumed to not contain newlines. To fix this, we should probably just do special-case handling of comments and strings, perhaps as a separate pass over a TokenDoc
.
INPUT
val x = (* this is a
* multiline comment
* which needs some love
*)
"I wonder where \
\the rest of this string will go?"
OUTPUT
val x =
(* this is a
* multiline comment
* which needs some love
*) "I wonder where \
\the rest of this string will go?"
The autoformatter should handle chains of if-then-else expressions to make sure that indentation doesn't get
funked up, like in the example below.
INPUT
val x =
if b1 then e1
else if b2 then e2
else if b3 then e3
else if b4 then e4
else if b5 then e5
else if b6 then e6
else if b7 then e7
else e8
CURRENT OUTPUT
val x =
if b1 then
e1
else
if b2 then
e2
else
if b3 then
e3
else
if b4 then
e4
else
if b5 then
e5
else
if b6 then e6 else if b7 then e7 else e8
I don't know whether this should be a PR adding it to the README or whatnot, but I have found this Emacs function quite handy for running smlfmt
on the current file buffer:
(defun smlfmt ()
(interactive)
(save-buffer)
(let* ((file (buffer-file-name))
(ret (with-temp-buffer
(let ((code (call-process "smlfmt" nil '(t t) nil "--force" file)))
(if (= code 0)
nil
(buffer-string))))))
(if (eq ret nil)
(revert-buffer t t t)
(message "%s" ret))))
We should at least align delimiters, imo, e.g.:
val apple = 0
and orange = 1
val () =
case foo of
(NONE, _) => ()
| (_, NONE) => ()
| (SOME x, SOME y) => ()
datatype apple = Foo | Bar
and orange = Baz
I would also prefer right-aligning keywords before name definitions so names are aligned, e.g.:
eqtype apple
and orange
datatype apple = Foo | Bar
and orange = Baz
Unsure if this is a popular opinion, though.
Following up on the discussion here.
Current approach is:
if b then
if b' then
e11
else
e10
else if b' then
e01
else
e00
Alternative might be:
if b then
if b' then
e11
else
e10
else
if b' then
e01
else
e00
The former is better for chained if
conditions in the else
clauses (else if ...
), whereas the latter is more symmetric/compositional.
Is there any way to stop smlfmt from formatting comments? In particular, if I write a comment at the end of a file, it always gets formatted to occupy a single line. I want to disable this behavior.
Interesting discussion here with Matthew about different implementations of Promise
and their performance. We should revisit this and see if there are further performance improvements lurking here.
Some context: in cb74c66, I created MemoizedPromise
which yielded a slight time improvement at the cost of a slight increase in space. It's possible that a different implementation could yield the same time improvement but with smaller space overhead.
Currently, comments are ignored for pretty-printing and are not tracked by the AST.
@HarrisonGrodin suggests attaching comments to the current (or next) expression or declaration in the AST. Then, we can pretty-print by e.g. printing the comment before its attached node, which is idempotent.
For example below, comment A would attach to fun foo
, comment B would attach to case
, and C and D would attach to 0
and 1
:
(* A *)
fun foo x = ...
fun bar x =
case x of (* B *)
NONE => (* C *) 0
| SOME y => y + (* D *) 1
Is this valid SML?
val 1 : int :: [] = [1]
Both SML/NJ and MLton's parsers reject it, but I think it's valid according to the Definition.
The Definition says: "If vid has infix status, then βexp1 vid exp2β (resp. βpat1 vid pat2β) may occur β in parentheses if necessary β wherever the application βvid{1=exp1,2=exp2}β or its derived form βvid(exp1,exp2)β (resp βvid(pat1,pat2)β) would otherwise occur."
This suggests that if the following program is valid, the program up above should be too. (And of course, both MLton and SML/NJ happily accept this program.)
nonfix ::
val ::(1 : int, []) = [1]
Just to be clear, the grammar for patterns is:
<pat> ::= <pat> <vid> <pat> # where <vid> is an infixed identifier
| <pat> : <ty>
| ...
So there is seemingly nothing preventing 1 : int :: []
from being a valid pattern. Why don't SML/NJ and MLton accept it??
Currently there seems to be no way to write the output to a different file.
For example:
smlfmt --preview-only a.sml > b.sml
will produce an invalid b.sml that doesn't compile.
And further the default behavior without passing arguments is to overwrite the given file after asking for interactive confirmation.
The typical behavior for command line text transformation tools like this is to use stdin
and stdout
by default, not ask for confirmation and not overwrite by default. Here are some examples:
# The ubiquitous sed tool takes input from a given file or stdin and outputs to stdout, unless `-i` is passed for "in place"
sed s/a/b/ > output.txt
sed s/a/b/ input.txt > output.txt
sed -i s/a/b/ file.txt
# The haskell code formatter ormolu also prints to stdout, `--mode inplace` is used to modify the file in place
ormolu Module.hs > FormattedModule.hs
ormolu --mode inplace Module.hs
# The popular minify tool has a `-o` option to specify an output file instead of stdout
minify script.js > script.min.js
minify -o script.js script.js
# djot markup language processor parses input from stdin, unless input files are listed, and always outputs to stdout.
djot a.dj b.dj c.dj > result.html
smlfmt is at odds with a convention that, in my opinion, is quite reasonable.
A declarative, efficient, and flexible JavaScript library for building user interfaces.
π Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.
TypeScript is a superset of JavaScript that compiles to clean JavaScript output.
An Open Source Machine Learning Framework for Everyone
The Web framework for perfectionists with deadlines.
A PHP framework for web artisans
Bring data to life with SVG, Canvas and HTML. πππ
JavaScript (JS) is a lightweight interpreted programming language with first-class functions.
Some thing interesting about web. New door for the world.
A server is a program made to process requests and deliver data to clients.
Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.
Some thing interesting about visualization, use data art
Some thing interesting about game, make everyone happy.
We are working to build community through open source technology. NB: members must have two-factor auth.
Open source projects and samples from Microsoft.
Google β€οΈ Open Source for everyone.
Alibaba Open Source for everyone
Data-Driven Documents codes.
China tencent open source team.