modus-continens / modus Goto Github PK

View Code? Open in Web Editor NEW

279.0 279.0 9.0 573 KB

A language for building Docker/OCI container images

Home Page: https://modus-continens.com

License: GNU Affero General Public License v3.0

Rust 89.35% Python 10.56% Shell 0.09%

build-system datalog docker dockerfile language reproducible-builds

modus's People

Contributors

Stargazers

Watchers

Forkers

isgasho snapbuy icodein evanrichter mayhemheroes bbenzikry iq-scm jpx40

modus's Issues

Currently, nullary predicates are denoted as identifiers without parentheses, e.g. a :- b, c. @maowtm suggested to consider using the syntax with (), e.g. a() :- b(), c(). The parentheses can also be optional. Here are my thoughts about this:

Without parentheses

Plus: The same as in Prolog, XSB Datalog
Plus: Similar to the names of build stages in Dockerfiles
Minus: Non-intuitive for the majority of users, because the majority is not familiar with Prolog.

With parentheses

Plus: Intuitive for the majority of users
Plus: The same as in Souffle Datalog
Minus: When used as image predicates, something like ubuntu might be more intuitive than ubuntu(), since it is just a reference to an image, and not a call

With optional parentheses

Plus: Most flexible
Minus: It might be non-intuitive that a :- b(), c is the same as a() :- b, c()

@barr , @thevirtuoso1973 , what are your thoughts about it?

System test infrastructure

Add a framework for executing integration/system tests.

Implement Dockerfile pretty-printer

Efficient solver with cost

Negation

Possible approaches:

Negation as failure - compute a sub-proof that checks whether a negated (grounded) literal has a proof; if it does not, then we say the negation is true. (It requires that the negated literal is grounded, which nicely mirrors the way we handle built-ins.)
I think this is what Prolog has.
The theoretical issue with this is that it uses non-monotonic reasoning - we simply claim !p if we haven't proved p.
Stratification based approach - used for better semantics by enforcing some kind of ordering on the program's clauses. I am not really sure what we gain in practice.
Well-founded semantics - introduces the notion that facts can be true, false or unknown. Quite a big jump in semantics.

Relevant links:

Implement top-down proof generator for non-recursive case

`::merge` remaining issues

Avoid using shell escape (proposed to create a dedicated binary)
Use deterministic mount paths to make caching work - currently because of the random mount path nothing inside merge can be cached.

`docker build` integration...?

As it turns out, it is actually possible to use arbitrary buildkit frontend in a Dockerfile, by adding a special comment at the beginning:

#syntax = image/of/frontend

This allows the user to simply DOCKER_BUILDKIT=1 docker build -f the source file, even though it is not in Dockerfile syntax. And it requires no additional installation of any buildkit tools. If we create a buildkit frontend which takes in Modusfile (and, by some means, the query), user can simply docker build our Modusfile without even needing to download the Modus program themselves...

Wonder if this is a thing worth exploring... Also, actually this is currently the only way I know of to tell Docker to use a custom buildkit frontend (instead of running an additional buildkit service).

New syntax & IR

Main thing(s) left to do:

Add expression operators to the IR.
; for logical or.
#62
Analysis that checks whether a clause is an image/logic/etc. expression.

Improved Max Depth

There are queries/programs one could construct that should work¹, but would exceed some fixed maximum depth.

There may be a way to dynamically determine the maximum depth, given a program.

An ideal value for the maximum depth would be the number of possible facts that can be inferred from a program.
This is because paths in an SLD tree correspond to a proof, where every intermediate node proves a (sub)goal that's required to prove the desired goal. Therefore the depth of a path cannot exceed the number of possible facts that can be inferred from a program.
Of course in Modus we have ungrounded facts so there are infinite possible facts that can be inferred from programs, so we will consider queries instead.

So I think we should use an expression like predicateCount * numConstants^maxArity to approximate the number of facts that can be inferred from a query.

Alternative approach that I was previously considering is below:

Some kind of bottom-up evaluation which levies the maximum chain of expansion may work, e.g for:

foo.
bar :- foo.

something like: (foo, 1) then (bar, 2).

Although one issue with the bottom-up approach is infinite number of producible facts (due to string_concat), e.g.:

f(X) :- f(Y), string_concat(Y, "1", X).
f("")

And also we would have to address that we have ungrounded facts.

Of course, there are also programs which are fully recursive foo :- foo. and would always be infinite. ↩

BuildKit backend

Build the ~~build graph - a directed acyclic graph with nodes being images and edges being operations (?)~~ buildkit LLB, which is a graph of build operations connected by dependencies.
Learn more about BuildKit

Porting OpenJDK Images to Modus

This issue tracks the progress of porting OpenJDK images to Modus, s.t. they no longer use Dockerfiles.

https://github.com/docker-library/openjdk

This is an image provided by Docker as a "Docker Official Image" so it may be interesting to note which parts of their image creation pipeline is a general step vs. specific to the openjdk image.

Some early ideas of features to Modus that might be added to assist this:

Multi-file Modusfiles
Better JSON/YAML support. Some kind of way to import literals/terms/etc. Somewhat related is this cool config language I found https://github.com/dhall-lang/dhall-lang.
CSV support, treating the rows as tuples of ground facts.
Some way to use buildx's multi platform building?

We might find that #69 is related.

Also, see https://github.com/modus-continens/openjdk-images.

Things that would be directly applicable in our port:

Numeric comparisons
Negation

Build front-end image on CI

Namespace/differentiation for builtin rules

use(url, image, path).

url(str, protocol, host, port, path).
image(str, name, tag).
path(str, directory, filename).

or having builtin be prefixed with something like $path or @path

or having uppercase names for builtins etc

or having proper namespaces, such as std/run(...)

Allow space before input

Currently, something like this will fail to parse:

          a :- from("alpine").
          b :- a, run("echo aaa > /tmp/a").

Tried to change the parser myself but I was just not familiar enough with how nom works... @thevirtuoso1973 please help

Make stuff (A)Rc?

Not something urgent now, but it looks like we are cloning Literal etc a lot. Plus when we add source info into literal (?) it will be even more information needed to clone.

Also now that ClauseID need to store a literal for built-in rules it also has the same problem.

Fix Static Check

Update with modus-continens/docs#5
Deal with the recursive clause issue, we can probably evaluate more kinds by changing the implementation.

Parser: comment doesn't work inside rules

This fails:

python(python_version, "ubuntu", distr_version) :-
    # from(f"ubuntu:${distr_version}"),
    string_concat("ubuntu:", distr_version, X),
    from(X),
    # run(f"apt-get install -y python${python_version}").
    string_concat("apt-get install -y python", python_version, Y),
    run(Y).

Not sure what's the best solution. Maybe we should just filter out these things before parsing?

Better error message for operators

error: unknown predicate - _operator_merge_begin
  ┌─ merge-test.Modusfile:6:6
  │
6 │   )::merge.
  │      ^^^^^

Should ideally just say "unknown operator merge"

Use `alpine` for front-end

The current debian image is too big for good UX (120 MB fetched from scratch).

Implement Docker multi-stage build backend

`( from("") )::copy` fails to parse

Looks like it can't handle spaces in certain positions...
@thevirtuoso1973

Tests run in parallel despite `#[serial]`

When we run all the test suites there's a chance the tests marked #[serial] will still be run in parallel. This doesn't appear to happen if we run an individual file's tests.

Could be a bug with the library we use.

Miscellaneous buildkit stuff

Support passing flags like --progress=plain etc.
Add predicate labels
~~Tagging~~ Just json output is enough
Output json containing a description of all images
Take context and -f Modusfile
modus ls - docker image store

Formatted strings

Custom operators

I discussed this with @mechtaev in chat before we decided to put everything on GitHub so it's more visible. Here is a recall of the basic idea...

Previously it was decided that operators would be implemented as "start/end" internal predicates, so

foo :- (a, b)::merge

will become

foo :- merge_begin(...), a, b, merge_end(...)

when translating to the IR, where merge_begin and merge_end would actually not be accessible to the user.

However when implementing it I thought that there is really no reason why we have to forbid any manual usage of these "intrinsics", since they do not break any datalog logic. We could have a defined way to name and use them, for example _predicate_merge_begin(...end) which both takes an "ID" argument to indicate paring, and any ::merge would be translated into those. They will be "intrinsics" in the sense that "run" is an intrinsic - the SLD solver just treats them as always true (as long as it is fully grounded), and they are really only meant to be interpreted by any image generation code.

However, by allowing user access to this "special" namespace, we could allow them to define their own operators. Consider the following Modusfile:

build_project :-
  (cc("a.cpp"), cc("b.cpp"))::with_std("c++14"),
  (cc("c.cpp"), cc("d.cpp"))::with_std("c++17").

_operator_with_std_begin(id, std) :- _operator_env_begin(id, "CFLAGS", f"-std=${std}").
_operator_with_std_end(id, std) :- _operator_env_end(id, "CFLAGS", f"-std=${std}").

Note that the with_std operator is "defined" in the last two line.

I do admit that this just feels too much like a hack, so maybe it's better to hide this _predicate namespace after all and, if needed, have some other way to define operators.

set_entrypoint: variadic arguments or array data-type?

We previously discussed the possibility for having complex data types like arrays, but I remember we decided to go with all strings for now. Since the entrypoint property is a string array, how should set_entrypoint be implemented? Currently it just take one argument and make the entrypoint an array of length 1. (I think this covers most of the use cases, since you can just specify an entry shell script anyway.)

Nicer error message when current directory contains no Modusfile

$> modus build . 'all("release", "stable")' 23:18:52 thread 'main' panicked at 'called Result::unwrap() on an Err value: Os { code: 2, kind: NotFound, message: "No such file or directory" }', src/main.rs:63:57 note: run with RUST_BACKTRACE=1 environment variable to display a backtrace$

Implement Modusfile parser

Combine Datalog and Dockerfile parsers:

Make sure local front-end image is the correct version before building

Maybe just tag the image with the crate version, and use the correct tag when generating Dockerfiles

Proposed implementation for `::merge`

After some thinking I think we can go with the following. The consequence is that as long as the instructions inside ::merge only consists of run, copy and ::copy everything can be squished into one layer. Given this, do we want to simply forbid having anything other than run and (::)copy inside ::merge?

The produced layer would be a normal exec layer.
For each copy or ::copy inside ::merge, mount the source image to a temporary directory like /modus_merge_tmp.xnMpVW0Oo6.

Collect all the runs and copys together in order, and generate a script like the following:

sh -c 'run-command-1' && \
...
cp -R '/modus_merge_tmp.xnMpVW0Oo6/copy/src/path' '/copy/dst/path' && \
...

Make the layer run this script, with all the necessary mounts.

@mechtaev any thoughts?

Format Strings & Escape Chars

There seems to be an issue with using escape chars in f-strings.

Dockerfile COPY relative path

Currently modus transpile doesn't work fully correctly because it turns out COPY --from=src_image . . in Dockerfile does not respect the WORKDIR property of src_image, and will instead resolve any relative path relative to /.

Example:

FROM alpine AS s1
WORKDIR /tmp
RUN echo hello > hello

FROM alpine
COPY --from=s1 hello hello
RUN cat hello

Step 4/6 : FROM alpine
 ---> c059bfaa849c
Step 5/6 : COPY --from=s1 hello hello
COPY failed: stat hello: file does not exist

Whitespace in parser

e.g. a :: b

Correct handling of query

Currently it just reads one literal and ignore the rest, so e.g. from("alpine"), run("apk update") will only build from("alpine").

Also,

Should not panic or error when there are ungrounded built-in literal calls

This causes a panic, but it really shouldn't:

a :- string_concat("", A, A), b(A).
b("hello").

Adding a dummy rule will currently avoid this panic, and give valid proof:

a :- string_concat("", A, A), b(A).
b("hello").
string_concat(A, B, C) :- string_concat(A, B, C).

Implement Datalog parser

Format string starting with a variable fails parsing

thread 'main' panicked at 'can parse outside format expansion: Error(VerboseError { errors: [(LocatedSpan { offset: 0, line: 1, fragment: "${target_folder}/buildkit-frontend", extra: () }, Nom(Escaped))] })', src/translate.rs:57:14

This is probably because escaped returns an error if nothing can be consumed.

Error handling

Let's discuss this in our next meeting...

Pass along source span in AST/Literal/Proofs etc

Decide on how to do this. Add a span to every data type in the AST (e.g. Expression, ModusClause)? Do we want to copy this along whenever we construct e.g. Literal? How to make sld solver aware of/perserve this (so that if e.g. no matching rule we can point to the source)?

Do we want to / how do we report resolution failure? e.g.

Failed to resolve input a("aabbb")
  a("aabbb") requires string_concat("a", X1, "aabbb") + string_concat(X1, "b", X2) + a(X2)
    taking X1 = "abbb", X2 = "abb" // maybe don't report expansion of empty rule explicitly, to make error message shorter?
      a("abb") requires string_concat("a", X3, "abb") + string_concat(X3, "b", X4) + a(X4)
        taking X3 = "bb", X4 = "b"
          a("b") requires string_concat("a", X5, "b")
            which is not satisfiable. // this line may also be: which requires X5 to be known, in the case of invalid ungrounded variables (not the case here)

what to do with alternative rules? report them all and in each case explain why they are not satisfiable?

Do we want to report multiple errors at once? If so then we can't just panic or return error whenever something is wrong.
How do we represent errors at the time of them happening? Maybe a big enum of various types of errors that can happen? I.e. ResolutionError { history: GoalWithHistory, reason: ResolutionErrorReason (e.g. Unsat(Literal), RequireGrounded(Vec<Variable>)) }
For non modus errors (e.g. unable to connect to docker) maybe consider a different way of reporting them, so e.g. if one error happens abort rather than continue? Maybe consider thiserror which makes wrapping of runtime errors easy.
Could catch some errors through semantic analysis? The advantage is that we would have the full structure of the Modusfile instead of the IR. Best case this is simply adding the tag string to the enum type Tag.
Output the SLD tree as a DOT file to let the user display the graph.
Better parse errors output. We can use a custom error printer. However, "Tag" errors are a bit inconvenient - I'll consider submitting a PR to nom to deal with this upstream.

set_env, append_path

"Least Common" Groundness

Consider:

foo(X) :- bar(X) ; baz.

This should be allowed, just enforce that X cannot be ungrounded.
Currently it will throw an error since there are two different signatures for foo.

Cache control

I had a discussion with @barr, and he suggested to use operators that define conditions of cache invalidation. We were thinking of something like that:

::no_cache always invalidate the cache

::invalidate_cache_if_before(CMD) invalidate cache if CMD, executed before the layer is applied, returns non-exit code.

::invalidate_cache_if_after(CMD) invalidate cache if CMD, executed after the layer is applied, returns non-exit code.

::invalidate_cache_if_changed(PATH) invalidate cache if the file/directory changes after the layer is applied.

This is just a rough sketch of the API.

Originally posted by @mechtaev in #57 (comment)

Or maybe something like ::invalidate_cache_if_different(CMD_BEFORE, CMD_AFTER)

Originally posted by @mechtaev in #57 (comment)

Implement the builtin
Implement for buildkit
Implement for Dockerfile (things other than a simple ::no_cache may not be possible)

Support for Windows platform?

Currently there is a few implicit assumption in the code that we're building Linux containers with Modus itself also running on Linux. These includes

path.starts_with("/")
Use of signal handlers

I don't know much about Windows containers - we might not even need to support them if they are quite rare and requires a lot of extra effort, but if we want Modus itself to be runnable on Windows then I will have to address the second point.

Builtin predicates

Basic support & SLD resolution for built-in predicates (#39)
Properly augment the proof tree to indicate what constants were used to unify those built-ins. I.e. we should be able to see run("apt ...") directly in the tree easily.
Implement more built-in stuff
- Image instructions - from, run, copy, etc.
- Logic predicates - semver_greater, more string stuff (?), numeric calculations (?), parsing (path(str, parent, filename))
- Operators?

Unification - asserting equality

Equality predicate
- Temporary: can re-use string_concat
Translation from X = "foo" to that predicate.

Proof pretty-printer

Caching

Caching is an important part of Docker and any build system. Docker currently does not provide sufficient tools to control caching. Here are some things we need to think about:

Caching layers

When we execute something like run(f"git clone ${URL}"), the layer will be reused as long as ${URL} does not change, even if the git repository gets updated. In principle, we can add something like run(f"git clone ${URL}")::no_cache, but then the cache will be invalidated every time, even if the repository is not updated. BuildKit provides an ad-hoc support for git, but this will not work for other VCS and any other scenarios. So, it would be nice to have a more fine-grained control over caching of layers.

Caching images

In Docker, the registry of images and the build cache are separate things. Thus, it is unclear what our "minimal build tree" means w.r.t. caching, i.e. do we compute the minimal tree w.r.t the images in the database, or w.r.t. to the cached data?

Ideally, we should pick a principled and intuitive approach that is also not too far from what Docker and BuildKit do, so that it was realistic to implement and maintain.

@barr , @maowtm , @thevirtuoso1973 , any thoughts?

Turing completeness of string_concat

I've been playing around with this for a bit since we basically have all the core datalog logic stuff ready, and I made this thing:

output(S) :- run(S).
entry(S) :- anbncn(S), output("Accept!").

anbncn("") :- .
anbncn(S) :- abc_split(S, A, B, C), streq(A, B), streq(B, C).
streq(A, B) :- string_concat("", A, B).
charne("a", "b").
charne("a", "c").
charne("b", "c").
charne(A, B) :- charne(B, A).
abc_split(S, A, B, C) :-
    count_chars(S, "a", A, BC),
    count_chars(BC, "b", B, CC),
    count_chars(CC, "c", C, "").

# normal route
count_chars(S, C, O, R) :-
    string_concat(C, RR, S),
    count_chars(RR, C, O2, R),
    string_concat("1", O2, O).
# other char
count_chars(S, C, O, R) :-
    charne(C, C1),
    string_concat(C1, RR, S),
    streq(O, ""),
    streq(R, S).
# empty case
count_chars("", C, O, R) :-
    streq(C, C),
    streq(O, ""),
    streq(R, "").

The same strategy can be used to turn any turing machine into a modus program that takes entry(S) for some string S, and output a docker image iff TM accept (just use the string as a tape, have something like TM(q, tape_left, tape_current, tape_right)). This means that the claim that this language is not Turing complete in the doc repo is actually false.

`from(...)::set_workdir(...)` syntax error

It looks like there has to be a parenthesis i.e. (from(...))::set_.... Was this deliberate/can this be relaxed?