processcache's Issues

Handle chdir

Handle chdir.

  • Execution struct needs to have both starting_cwd and cwd.

  • Intercept chdir.

  • Implement handle_chdir(), which changes the current execution's cwd to the new cwd and does the same for all child executions (see the sketch after this list).

  • Use cwd() instead of starting_cwd()... basically everywhere...

  • EXCEPT when checking preconditions (CachedExecution should still have just starting_cwd).

  • Test: single process.

  • Test: parent and child process.
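
A minimal sketch of the shape this could take, assuming a simplified Execution struct (the real RcExecution lives in src/execution.rs, so the field and method names here are hypothetical):

    use std::path::PathBuf;

    // Hypothetical, simplified shape of an execution, for illustration only.
    struct Execution {
        starting_cwd: PathBuf, // fixed at execve time; used when checking preconditions
        cwd: PathBuf,          // updated on every successful chdir
        children: Vec<Execution>,
    }

    impl Execution {
        // Called from the chdir posthook once we know the call succeeded.
        fn handle_chdir(&mut self, new_cwd: PathBuf) {
            self.cwd = new_cwd.clone();
            // Per the checklist above, child executions pick up the new cwd as well.
            for child in &mut self.children {
                child.handle_chdir(new_cwd.clone());
            }
        }
    }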

Evolving Cache Lookup. Evolving Skipping Mechanism.

The current caching/skipping scheme is very conservative, but that is just because it is the simplest way to implement the functionality. Each process can call execve and have its own execution struct. If it doesn't, its inputs and outputs are just added to its parent's execution struct. If there is more than one process with its own execution struct (i.e., at least the root process forks and the resulting child calls execve, and potentially its children do the same, creating their own children, and so on), we recursively check that the metadata, inputs, and outputs match for each execution in the tree. If ANY execution fails to match on metadata, inputs, or outputs, we report that the execution overall is not skippable and let it run.

Obviously, we can imagine that this is far too conservative. Example: clustal has 904 jobs. If just one of those jobs doesn't match on a single input file or env var, all 904 jobs will be rerun.

The bioinformatics jobs will run as a root process from which all child processes are forked, so it's only one layer of nesting. In general, this means we want to skip at the "leaves" of the execution tree: we want to be able to skip some child executions and run the ones we need to.

The current mechanism for skipping executions only allows us to check the root execution, and then change its next system call to exit. This won't work if we have a root clustal process that spawns two child jobs. Let's say one can be skipped and one cannot. Currently, we don't have the functionality to run the root until it forks the child that should be skipped, skip that child, and also recognize when the other child is forked and has called execve so we can let it run.

Update: I've been focusing on the caching aspect of this for now. The above info is great and will be useful at some point, but in reality we have no skipping mechanics actually working and tested. So first we need to get BASIC cache lookup and BASIC skipping mechanisms working and tested.

Try IOTracker out with the hmmer workflow

I've been adding bits and pieces to IOTracker as we go, but it has been a little ad hoc. Now is a good time to try out one of the bioinformatics workflows with IOTracker to see what happens.

  • First off, I want to see if this baby can run at all. It probably will fail on some system call(s) we aren't handling yet. I will list them here as documentation.
  • Next, see what all is missing and make note of what needs to be added to IOTracker (probably make a new issue).
  • Gather information from what we find. #24 has listed quite a few things we are interested in. We will start there.

The system calls it required were: poll, times, madvise. Also needed to skip the exit posthook event. The workflow displays beautifully disparate use of files by processes which is great to see. These will be fantastic benchmarks! I am excited to report some empirical data, so I will now work on gathering said data (following #24).

Question: How many execs are done by different workloads?
For this workload (and also for clustal and raxml, because they behave the same way), it'll be 900-some (one for each job to run) plus one for the jobrunner process itself. In theory, each of these jobs is appropriate for caching on its own. Right now the jobrunner process creates the output files for the jobs (which are the outputs IOTracker will care about for each job). Say one of the output files gets deleted. If we go to rerun the whole thing, we can't skip because one output file is missing. So we would run the parent exec, skip all the execs whose files haven't changed, and just rerun the one whose file had changed... I think that makes sense...

Note: hmmer needed these additional system calls to run through IOTracker: poll, times, madvise.

Track process lineage

Add a data structure to represent the lineage of processes: a map from (child pid) --> (parent pid). Update this map upon interception of fork. This structure will be useful in the future; it is not on the critical path, just an easy thing to add now.
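
A minimal sketch of what this could look like, using plain i32 pids for brevity (the real code would presumably use nix::unistd::Pid):

    use std::collections::HashMap;

    // child pid -> parent pid, updated whenever we intercept a fork/clone.
    #[derive(Default)]
    struct ProcessLineage {
        parent_of: HashMap<i32, i32>,
    }

    impl ProcessLineage {
        fn record_fork(&mut self, parent: i32, child: i32) {
            self.parent_of.insert(child, parent);
        }

        // Walk back up to the root of the process tree.
        fn root_of(&self, mut pid: i32) -> i32 {
            while let Some(&parent) = self.parent_of.get(&pid) {
                pid = parent;
            }
            pid
        }
    }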

Log openat modes instead of every read / write

Right now we are intercepting every read and write. This is expensive and noisy. For now, let's instead log open calls by their mode: if a file is opened read-only, we assume it has been read; if it is opened write-only, we assume it was written to; if it is opened for both, we assume it was read and written (see the sketch after this list).

  • Remove read and write handling in event loop

  • Stop intercepting read and write

  • Extend handling of openat to report the mode the file is opened in

  • move this logic to a helper function handle_openat()

  • call this function in the posthook, not the prehook. The fd is not correct until the posthook, and we only want to report successful openings in read, write, or read/write mode, and successful file creations
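
A minimal sketch of the mode classification, assuming the libc crate for the flag constants; only the access mode from the openat flags is inspected here, with O_CREAT/O_TRUNC handling left for later:

    #[derive(Debug, PartialEq)]
    enum FileAccess {
        Read,
        Write,
        ReadWrite,
    }

    // `flags` is the flags argument to openat, read from the registers in the posthook.
    fn classify_open_flags(flags: i32) -> FileAccess {
        match flags & libc::O_ACCMODE {
            libc::O_RDONLY => FileAccess::Read,
            libc::O_WRONLY => FileAccess::Write,
            _ => FileAccess::ReadWrite, // O_RDWR
        }
    }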

Error on `ls` program

Hi Kelly,

I'm trying run ls on the latest version of ProcessCache (b953a81) and I'm seeing this error:

> cargo run -- ls
     Running `target/debug/io_tracker ls`
thread 'main' panicked at 'Process has already exec'd and is trying to exec again!', src/execution.rs:401:9
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace

Log execve arguments and env variables

Right now we aren't tracking anything about execve. We need to get the arguments and environment variables, in addition to the pathname (which is easy). @gatoWololo has experience with this so I leave it in his capable hands.

Report exit status of root process

We need to intercept wait4 (or waitpid?) and get the exit status of the "highest ranking" process of the execution. Right now, we don't have a way to get it. We need to have the tracer spawn a dummy process, which spawns the actual program, and then we can have the dummy process wait on the program. Because we will be tracing both the dummy process and the actual program, we will see when the dummy process waits on the program, and we can get the exit status then.
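
As a rough sketch of the waiting half of this (assuming the nix crate), the dummy process would do something like the following; wiring it into the ptrace event loop is the real work:

    use nix::sys::wait::{waitpid, WaitStatus};
    use nix::unistd::Pid;

    // The dummy process waits on the actual program and pulls out its exit status.
    fn wait_for_exit_status(child: Pid) -> nix::Result<Option<i32>> {
        match waitpid(child, None)? {
            WaitStatus::Exited(_, code) => Ok(Some(code)),
            _ => Ok(None), // stopped, signaled, etc.
        }
    }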

Handling stdout

Lots of programs we want to cache will output something to stdout, so we have to handle it. Basically, what we want is, whenever the program writes to stdout, we want it to write to some file (ex: cached_stdout.txt) in our cache. Then, when we go to skip this execution, we will write the contents of the cached file to stdout... thus completing the illusion of the execution running 🔮

fd = open(file);
dup2(fd, 1);

This would open the file, and then make the process write to our file whenever it writes to file descriptor 1. According to the dup2 man page, the close-on-exec flag is off for the new duplicate descriptor. So I think we would also want to make sure that close-on-exec is off for fd in the example above; we could use fcntl() with F_SETFD to clear the close-on-exec flag. We would perform this in the tracee after we fork() but before the tracee calls execve. Doing dup2(fd, 1) closes stdout, so the program we are caching won't write to stdout like it expects. So we would need to write the cached contents to stdout at the end of the program?
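
A minimal sketch of the dup2 trick, using raw libc calls (assuming the libc crate); this would run in the tracee after fork() but before execve, so the traced program's stdout lands in our cache file:

    use std::ffi::CStr;
    use std::io;

    fn redirect_stdout_to(path: &CStr) -> io::Result<()> {
        unsafe {
            // creat == open with O_WRONLY | O_CREAT | O_TRUNC.
            let fd = libc::creat(path.as_ptr(), 0o644);
            if fd < 0 {
                return Err(io::Error::last_os_error());
            }
            // Make fd 1 (stdout) point at our file.
            if libc::dup2(fd, 1) < 0 {
                return Err(io::Error::last_os_error());
            }
            // Clear close-on-exec on the original fd so it survives the execve.
            if libc::fcntl(fd, libc::F_SETFD, 0) < 0 {
                return Err(io::Error::last_os_error());
            }
        }
        Ok(())
    }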

Robust erroring for processes or executions that modify the same resource

Just wanted to make a central issue for this because it just keeps popping up in other issues 😬

Let's be sure we panic! or unreachable! or something whenever:

  • Processes within the same execution tree (sharing the same root process) modify the same resource
  • Processes in disparate execution trees (other executions in the cache) modify the same resource

Eventually, directories have to be handled...

... and the point of this issue is to remind me of this 😬 Obviously, I'm not going to try to do that on top of a redesign right now, but I don't want to forget about directories (because I totally already have, more than once).

  • getcwd
  • mkdir(at)
  • getdents64
  • chdir

Add custom Error type for IOTracker

Many of our functions will have several error types that can be returned. We should strive to propagate these errors up instead of just calling .unwrap() or .expect(...). We should add our own custom error type to represent the errors we expect. This is pretty easy to do and I can do it.
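
A minimal sketch of what such a type could look like, using only std (the variant names are hypothetical; in practice we might derive this with the thiserror crate and keep anyhow at the outer layers):

    use std::fmt;

    #[derive(Debug)]
    enum TrackerError {
        UnhandledSyscall(String),
        CacheIo(std::io::Error),
    }

    impl fmt::Display for TrackerError {
        fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
            match self {
                TrackerError::UnhandledSyscall(name) => write!(f, "unhandled system call: {}", name),
                TrackerError::CacheIo(e) => write!(f, "cache I/O error: {}", e),
            }
        }
    }

    impl std::error::Error for TrackerError {}

    // Lets `?` convert io errors into our error type automatically.
    impl From<std::io::Error> for TrackerError {
        fn from(e: std::io::Error) -> Self {
            TrackerError::CacheIo(e)
        }
    }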

Write to file instead of printing to stderr

Currently we are just printing to stderr. We should write to an output file instead; it'll be cleaner to look at.

  • write
  • read
  • file create (openat only)
  • execve
  • fork
  • Rc<RefCell<_>> can get ugly, wrap this stuff in a struct called file_writer

Ensure all system calls called by each bioinformatics workflow are handled

Using this issue to remind myself, as I go through each of the bioinformatics workflows and try to get them working with P$, that each one is probably calling slightly different system calls. Obviously, I have strace-ed them all before and generally know the calls. But I want to be sure, in all the excitement of refactoring, that all the calls are still covered.

Update: for now I'm individually finding the system calls for each job. Then I'll compile them into an alphabetized list 👍

Simplify handle_* functions

All the helper functions for the currently handled system calls are pretty messy. They all need to be cleaned up, and the log writer logic needs to be taken out from all of them.

  • handle_execve
  • handle_access
  • handle_open
  • handle_stat
  • handle_write

handle_read is already simplified.

Print ALL reasons a cached execution is not "skippable"

Right now, my cache checking function does not print (debug!) ALL reasons for a cached execution to be deemed "not skippable"; it just short-circuits. But there could be more than one reason it is not appropriate to skip (inputs don't match, some metadata doesn't match), and we want to know all of them.

TO-DO:

  • Change get_cached_root_execution() to print all reasons a cached execution is deemed "not skippable".

Put the $ in P$ (Caching Discussion)

UPDATED

With IOTracker being fairly robust for clustal, and the skipping of executions being worked on separately, it is time to start brainstorming about caching. We now know what the inputs and outputs of an execution are, and we can skip the execution. But should we? I think that's what's left to figure out.

What makes something skippable?

  1. We have seen this execution before and have the appropriate output files, output file hashes, and input file hashes stored in our cache.
  2. Its inputs have not changed since the first time we saw them.
  3. It was successful. Executions that failed previously will not be skipped; they will be allowed to run and fail again (literal execve system calls that immediately fail is what I'm talking about here).
  4. It is deterministic / conflict free***** (I don't know what the definition of this is really yet, visions of dependency graphs dance in my head 💃)

How do we know when the inputs have changed?

We hash the version the execution is trying to use and compare it against our original hash for the input file. If they match, this execution can be skipped, i.e., we copy the appropriate output files to the appropriate absolute paths. Or maybe we don't even have to do that, if the output files are already there and also match the hashes we have in our cache.
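
A minimal sketch of the hash-and-compare step, using std's DefaultHasher purely for brevity (a real cache would want a proper content hash such as SHA-256):

    use std::collections::hash_map::DefaultHasher;
    use std::fs;
    use std::hash::{Hash, Hasher};
    use std::path::Path;

    fn hash_file(path: &Path) -> std::io::Result<u64> {
        let bytes = fs::read(path)?;
        let mut hasher = DefaultHasher::new();
        bytes.hash(&mut hasher);
        Ok(hasher.finish())
    }

    // An input is unchanged if the file is still there and hashes to the cached value.
    fn input_unchanged(path: &Path, cached_hash: u64) -> bool {
        matches!(hash_file(path), Ok(h) if h == cached_hash)
    }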

What is an input?

A resource whose starting state contents are utilized in some way by the execution (like a file of data being read in).
Also cwd, environment variables, executables, command line arguments, etc...

What is an output?

A resource whose contents are changed by the execution (like a file being written to or created).

When do we start tracking an input?

When the execution opens a file as read only, we can imagine it's going to be used for reading the contents. We want to hash the file now, as this is the state we are looking for in the future when we see this execution again and are considering skipping it.

When do we start tracking an output?

When the execution opens a file as write only (creating a file also means opening it for write only), we can imagine the contents of that file are going to be changed by this execution. We hash the file at the end of the execution.

I opened another discussion based issue with a "simple" example as a walkthrough of the preliminary cache design (#47).

We might eventually want to build a background inotify daemon to track the inputs and outputs. This is pretty complicated, and we are not familiar with the inotify API, so this is a solution that will be explored if we need the performance boost.

I also think about things in this way:
A unique execution has:

  • inputs and outputs (not going into detail here)
  • "root process" executions (execve calls made by the original process)
  • child process executions (execve calls made by child processes the original process spawned)

Further Thoughts:

Different kinds of programs have different implications for cache implementation:

  1. Programs that just call fork after their first execve. This will be treated like one big execution. So either skip the whole thing or run the whole thing. Kind of a simple case from a caching view.

  2. Next up: a single process program. It can do one or more execve calls.

  • If it only has one execve call or this is the final execve call, we could use the replace with exit method to skip the execution.
  • Otherwise we have to use some other method, maybe replacing the call with a noop? It's kind of weird. I don't think the "replace execve with exit and continue" method will work because the process will just exit. Definitely has to be handled in a different way from fork + execve. It feels like we should be able to pick and choose which execve calls to run in this scenario, just have to get the method right. I think programs that have a mix of "root execution" calls and "child execution" calls will be the trickiest (see 3 below).
  3. Programs that fork + execve (what I expect the most). These programs can come with some mix of fork and execve, like the parent (root) process making a couple of execve calls of its own. But looking just at a straight fork + execve, we skip this by stopping the execve in the prehook, replacing it with exit, and continuing, setting the correct exit code in the correct register for the child process and copying the correct outputs from the cache.

Finally...

Okay, let's say it's a nice fork + execve multiprocess program (i.e., the perfect candidate). What if it has some nondeterminism: how would we know, and what would we do about it? Luckily, the bioinformatics workloads involve each process reading and writing only to its own files. In the future, we may need to do conflict tracing or determinism enforcement, depending on benchmark choice and "best effort" vs. perfect coverage (like DetTrace). I think that perfect coverage will cause performance degradation, so we will have to figure out what "best effort" means for Process Cache.

Clean Up Logging

Logging is super important to understand what is happening at various levels. I will clean up the logging and document how it should be used.

Track failed file system accesses

Right now we are only tracking successful accesses to file system resources; we need to do this for failed ones as well.

System calls:

  • access
  • open / openat / creat
  • read / pread64
  • fstat / stat / newfstatat
  • write / writev

Can we simplify execution structs and get rid of Execution::PendingRoot?

We use an enum to store information about each unique execution. Each execution can be Successful or Failed. But there is also a special case, PendingRoot. This represents the first process we see, before it actually calls execve. We can't just call it Failed, because it isn't failed; it hasn't even called execve. Before we find out whether the root execution's execve call succeeds, it's kind of just pending. I want to know which one the root is, and doing it in the enum seems easiest.

pub enum Execution {
    Failed(ExecMetadata),
    PendingRoot,
    Successful(ExecMetadata, ExecAccesses, ChildExecutions)
}

We have to create the first execution struct outside trace_process(), so we don't accidentally overwrite it within trace_process().

let first_execution = RcExecution::new(Execution::PendingRoot);
let f = trace_process(async_runtime.clone(), Ptracer::new(first_proc), first_execution.clone());

Another option, I guess, would be to pass Rc<RefCell<Option<Execution>>> into trace_process, with just Execution::Successful and Execution::Failed as enum variants, instead of the current Rc<RefCell<Execution>>. But then we would have if let Some(exec) = curr_execution potentially all over the place. The current solution may just be the best we can do, and honestly, I don't think it's all that clunky to begin with. We currently match on 3 things max in each function; getting rid of PendingRoot would mean we match on 2.

So we will have this issue open for if we come up with another idea! Before the design changes again 😉

Get other bioinformatics benchmarks running

Process Cache can handle all 905 clustal jobs 🥳. So now we want it to handle the other bioinformatics workflows as well. Can also break these out into more issues eventually if needed for clarity and to make myself feel good about closing issues ✔️

Other workflows:

hmmer: runs on its own at least

Update: It calls madvise and times, neither of which we are handling. After some discussion, we think madvise is probably okay to just let pass through, and the value for times can be hardcoded.

Second Update: the reason hmmer is "trying to copy a file to the cache that is already there" (and thus panic!-ing) is that it spawns a thread. The root process creates the output file, and then the thread writes to it using the open file descriptor from the parent. So we have a file descriptor shared from parent to child, and we also have a thread problem. I'm going to see if I can find documentation for hmmer; maybe the threading can be suppressed. I should also probably panic! if a clone call uses the CLONE_THREAD flag... 👎

  • try one job with P$ to see if we are missing any syscalls.
  • whitelist madvise
  • handle times by hardcoding the value(s). The value(s) are clock ticks, and the struct has four: user time, system time, user time of children, and system time of children.
  • figure out other reason it is panic!ing
  • suppress threads! (can be done by using the --cpu 0 flag!)
  • get all jobs running through P$ and cached

raxml: runs on its own at least

Update: It panic!s because it opens a file, writes to it, closes it, opens it, writes to it... 😠

Second Update: With the panic! case commented out, a single raxml job runs through P$ beautifully and caches all the output files (there are like 5, the most we've handled!).

  • try one job with P$ to see if we are missing any syscalls
  • change panic! to a debug error for this case.
  • get all jobs running through P$ and cached

bwa: not sure if I got it running on its own, have to look into this.

mothur:

We are able to use the newest version of mothur 🙏 so that's nice, no need for docker. But mothur is kind of strange. So far, it uses unlink (and actually does delete files) and uses renameat2. We haven't discussed renameat2, but I have verified that those are the only two syscalls we are missing. This affects how we keep track of things right now... we are very "event oriented", i.e., we log each access to a file separately. This means we aren't tracking whether something "overrides" a previous event, such as:

  1. create file foo.txt succeeds
  2. rename foo.txt foo1.txt
  3. rename foo1.txt foo2.txt
    Here, the "create" succeeding means the PRECONDITION must be that the file did not exist.
    The first rename creates our first "postcondition".
    The second rename creates our second "postcondition", and actually overrides the previous one.

Skipping Those Pesky Executions

We are at the point where we are ready to really attack the next big part of the project. We have been talking for a while about how to skip an execution. Below is the current plan for how to implement this part (and only this part, not logic about whether it should be skipped or any cache stuff).

  1. In the prehook, we see the execve.
  2. Like we do now to see if it's a successful execution, we do a get_next_event() and check that it is a TraceEvent::Exec(_), meaning success.
  3. If it fails, well, nothing to do there.
  4. If it succeeds, we do a continue and let the loop call get_next_event(). This will stop at the prehook of the first system call made after the execve. Here we change the system call to exit (changing the rax and orig_rax values with ptrace setregs) and update the first argument (rdi) to reflect the exit code (the first argument to exit is the exit status). Then we do a continue so the process can properly clean up and shut down (a rough sketch is below).
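
A rough sketch of step 4, assuming x86_64 and the nix crate's ptrace module (the real code goes through our async ptrace wrapper, so this only shows the register-rewriting idea):

    use nix::sys::ptrace;
    use nix::unistd::Pid;

    // At the prehook of the first syscall after the execve: turn it into exit(code).
    fn rewrite_syscall_to_exit(pid: Pid, exit_code: u64) -> nix::Result<()> {
        let mut regs = ptrace::getregs(pid)?;
        regs.orig_rax = libc::SYS_exit as u64; // the syscall actually executed
        regs.rax = libc::SYS_exit as u64;
        regs.rdi = exit_code;                  // first argument: exit status
        ptrace::setregs(pid, regs)?;
        ptrace::cont(pid, None)                // let the process clean up and shut down
    }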

Whitelist System Calls

Currently we allow all system calls through unfiltered and opt in to intercepting only specific ones. We should probably switch to a whitelist system where we explicitly allow system calls and explicitly intercept some. This way we make sure that we list all system calls and none pass through silently.

Simple fork, exec, file creation and modification tracking

For now, let's make the simplest tracker we can. We will print to standard out for now. The IOTracker's purpose is to track inputs and outputs (file system modifications). We can intercept them with ptrace and report them.

  • Trace fork and print when it happens. Print the new child's pid and the parent's pid.
  • Trace execve and print when it happens.
  • Trace whenever a file is created and print info (for now just handle openat).
  • Trace whenever a file is modified and print info (for now handle write).
  • Trace whenever a file is read and print info (just read for now).

TODO Testing

Next:

  • Print the arguments for each execve. (moving this to its own issue so I'll just mark it done)

Note: broke out the remaining list to their own issues.

Make execution structs per task instead of fully global

Right now all the execution structs are housed in the same global structure, and every time an event happens, the global structure is accessed. Instead, we want to have an execution struct per task. Then we only have to access the global structure to add the execution struct; all changes to the struct will still be reflected in the global struct thanks to reference counting and interior mutability.

Sanity Check Testing

There are still a lot of moving and ever-changing parts in this project, so setting up a full testing suite isn't the goal here. The goal is to have some sanity checks in place to at least be sure the basic functionality doesn't break when changes are made, and that we don't reintroduce bugs (looking at you, fork-signal race).

For DetTrace, the testing was mostly C programs. These were compiled with Makefiles, and our tests involved calling execve on binaries. We want to avoid the whole Makefile / binary testing mess if we can. @gatoWololo had a great idea: we can take the current run_tracer_and_tracee() function and make a slightly different version of it. It will take a function as a parameter, and this function is what the tracee will run while we trace it. Within the original function, a fork is done, and the parent calls trace_program(); this will now return the GlobalExecutions struct for examination in tests. The child does run_tracee(), which currently does an execvp, so we will replace that with a test_function() call, where test_function() is whatever function we want the tracee to run (a rough sketch follows the checklist).

Test function:

  • Create run_function_as_tracee() based on run_tracer_and_tracee(). Change the signature to take a function as a parameter and return anyhow::Result<GlobalExecutions>.
  • Change trace_program() to return anyhow::Result<GlobalExecutions>, ignore it in the return in run_tracer_and_tracee(), return it for run_function_as_tracee().
  • Create run_tracee_function() based on run_tracee() that calls the function instead of using execvp. Also, it must take the function as a parameter.
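
A rough sketch of the shape this could take (GlobalExecutions and trace_program are the project's own types, so the return value is elided here; the names are only an outline, not working tracer code):

    use nix::sys::wait::waitpid;
    use nix::unistd::{fork, ForkResult};

    fn run_function_as_tracee<F: FnOnce()>(test_function: F) -> anyhow::Result<()> {
        match unsafe { fork() }? {
            ForkResult::Child => {
                // Where run_tracee() would call execvp, the test version just runs the closure.
                test_function();
                std::process::exit(0);
            }
            ForkResult::Parent { child } => {
                // The real version would call trace_program() here and return the
                // resulting GlobalExecutions for the test to inspect.
                waitpid(child, None)?;
                Ok(())
            }
        }
    }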

Caching Example Walkthrough

Let's try to write out a simple (if that's even possible) example to demonstrate the cache's workflow with P$.

Example Program:

  1. To start, say file1.txt exists and file2.txt does not exist.
  2. Program opens file1.txt for reading only and reads the contents.
  3. Program creates file2.txt (HOW it makes it is very important. Does it use creat or open? What mode does it open with? Does it use O_TRUNC? O_APPEND? Don't you just love this system call interface? Isn't this just so intuitive? 🧠)
  4. Program writes to file2.txt.
  5. Program exits.
  • file1.txt is an input, and its contents are read. We hash the file when we see it opened as read only.
  • I guess the executable is another input; it should be hashed at the start as well?
  • Also all the usual suspects: cwd, environment variables, yada yada yada...
  • file2.txt is an output, as it is created and written to. We hash the file when the program exits. We would then copy the file to our cache.

How do we know we can skip?
The hashes of file1.txt and the executable should match ours and the file should be present in the file system in the same location it was before.

How do we skip?
We skip the execution (#42).
We can then copy our file2.txt to its appropriate absolute path for the execution. This also means we need to keep track of that path, if we need to copy the output file over.

Can we get away with not copying the output file over?
If the hashes of file1.txt and the executable matched, and also file2.txt matches and is in the right spot in the file system, we don't have to copy over the file.

Further thoughts:
What if the program only used one file file1.txt? It reads the contents. Then it writes to the file. I think we can handle this, whether it uses O_APPEND or O_TRUNC. This is a little in the weeds, probably represents edge cases, but important to think about and document nonetheless.

  • We hash the file when it's opened for reading, this is the input file.
  • We hash the file at the end of the execution, this is the output file.
  • When we see this execution again, if our input file matches the one the new execution is using, we can just replace this file by copying over the output file from the cache.

Roughly what I need to implement:

  • Alter data structures to include file name, full path, and hash of the file
  • Hash input files (access, openat, open, read, pread64, fstat, newfstatat, stat) when we first see the access.
  • Hash output files (creat, open, openat, write, writev) at the end of the execution.
  • Serialize the data structure to a file.
  • Deserialize the data structure.
  • Look ups in the data structure.
  • Copy output files to the "cache" at the end of execution.
  • Copy output files from the "cache" (a rough sketch of these last two steps follows).
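
A minimal sketch of the last two items, with a hypothetical cache layout of one flat directory keyed by file name; the real design would also need to remember the original absolute path of each output:

    use std::fs;
    use std::path::{Path, PathBuf};

    fn copy_output_to_cache(output: &Path, cache_dir: &Path) -> std::io::Result<PathBuf> {
        fs::create_dir_all(cache_dir)?;
        let dest = cache_dir.join(output.file_name().expect("output file has a name"));
        fs::copy(output, &dest)?;
        Ok(dest)
    }

    fn serve_output_from_cache(cached: &Path, original_path: &Path) -> std::io::Result<()> {
        fs::copy(cached, original_path)?;
        Ok(())
    }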

Small fixes/changes leftover from the most recent PR

Just an issue with misc things leftover from the last PR, that are not my top priority today but I need to have them documented so I can do something about them some other time.

  • report correct exit code for skipped processes (we record it now, we just aren't doing anything with that info on the skipping run)
  • hash lookup over a linear lookup when we go to add a file to an execution's output files
  • use slogging to avoid passing caller_pid in to different functions if possible, for example add_output_file_hashes()
  • in create_new_execution() we are assuming if it isn't TraceEvent::Exec then it is a failed execution. We probably should explicitly match on the event we get back. And panic! if we get something unexpected.
  • give a real error message when threads are detected unlike what's currently there ("THREAAAAADSSSSSSSS!!!") which is funny but not helpful.
  • Use unreachable! instead of panic! when appropriate.

Track failed execve calls

Right now we are only tracking execve calls that succeed. Often programs execve a bunch of different paths until one is successful. We want to know that an execve failed so we know not to cache it, and just let it run when we encounter it (it fails, so it should be pretty quick, right? 🙂)

Remove unnecessary syscalls from intercepted set

Currently we are intercepting extra syscalls we don't really need. TODO: list them here. Also, remove the extra logging that goes along with them!

As a side note: the logging in general should be cleaned up, but it's not my highest priority and I don't want to add yet another issue right now.

Better identification of tracked resources

Right now we only report information about resources by fd or path. Whenever we can, we should identify them by inode instead. Updating this issue because I am only going to track opens by mode for reads and writes, instead of intercepting read and write themselves (#20).

  • file access / modification: for openat and open modes, report full path, fd, and inode.
  • file create (openat, open, creat): report fd, full path, and inode.

Failure to serialize/deserialize cache

After implementing #57, I'm currently seeing the following error:

Error: io_tracker::run_tracer_and_tracee(): Failed while tracing program. file: src/main.rs, line: 86.

Caused by:
    0: io_tracker::execution::trace_program(): Unable to serialize execs to our cache file. file: src/execution.rs, line: 54.
    1: io_tracker::cache::serialize_execs_to_cache(): Cannot write to cache location: "./IOTracker/cache/cache". file: src/cache.rs, line: 815.
    2: No such file or directory (os error 2)

Basically, we cannot create the cache file because the directories don't exist: "./IOTracker/cache/cache". We need to make sure the directories IOTracker/ and cache/ exist, probably using std::fs::create_dir_all, which recursively creates a directory and all of its missing parent components:

    use std::fs;
    fs::create_dir_all("/some/dir")?;

Similarly I expect:

let exec_struct_bytes = fs::read("./research/IOTracker/cache/cache").expect("failed");

To fail on my system.

I can fix this to make it create the proper directories and not use the hard-coded ./research/IOTracker/cache/cache path?

One question: are you calling P$ from outside the project directory? The path ./research/IOTracker/cache/cache leads me to believe you are? I have been calling P$ via cargo run -- ARGS.

Handle special close(1) [tar]

Handle when close(1) happens because tar is dumb.

  • Intercept close.
  • Need to store the fd that stdout was duped to in the Execution struct (lol, I have gone back and forth on this a lot now), but man, it's just so much easier mutating something in an async function when it's wrapped in Rc<RefCell<>>.
  • If the process successfully does close(1), i.e., in the posthook we see that the return indicates success, set close_stdout_fd = true.
  • Check for this flag in the prehook. We need to inject a close(newfd) to make sure we stop writing to the stdout file. This will involve inject_system_call() and restore_state(). Then set the flag back to false.

Report the mode in which file system resources are opened

I think I was doing this with the Log/LogWriter, so this will be really easy to add in. The mode tells us a lot about what the process plans to do with the resource, and we should track this info as it'll be useful in the future for caching.

Next inputs to track

After going through some simple examples, I have a list of the next round of inputs that IOTracker needs to be able to handle.

"Basic" Stat Family: for now, we report path for stat and lstat. For lstat, we report if it is a symlink (as this changes the validity of the information in the stat struct). For fstat we report the fd.

  • stat
  • fstat
  • lstat

"Expanded" Stat Family:

  • newfstatat

  • fstatat64 (eventually?)
    (fstatat actually just calls fstatat64 or newfstatat depending on arch, I have only seen newfstatat)

  • cwd: it's an input to a child process. (for now we will check it and report it upon each execve)

  • Flags! There are so many! So this serves as a reminder that we need to handle them. This will probably need to be its own issue(s) at some point. At this point, syscalls we at least somewhat handle that have flags are: open, openat, newfstatat, execveat, clone.

Log only successful syscalls by default, add command line option to log all syscalls

It can be very noisy (and arguably not useful) to track every syscall. It would be useful to be able to toggle between logging all syscalls and logging only the ones that succeed. I'm going to add a command line option to opt into seeing all syscalls, and the default will be to log only calls that succeed (a small sketch follows the list below).

  • Add command line argument
  • Add the logic to skip failed execve in the main event loop of execution.rs
  • Handle this for open, openat, and creat.
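
A tiny sketch of the toggle using std only (the flag name --log-all-syscalls is hypothetical; in practice this would go through whatever argument parsing the project already uses):

    fn log_all_syscalls() -> bool {
        std::env::args().any(|arg| arg == "--log-all-syscalls")
    }

    // Default: only log successful calls; the flag opts in to everything.
    fn should_log(syscall_succeeded: bool) -> bool {
        syscall_succeeded || log_all_syscalls()
    }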

Using IOTracker to empirically test what real programs do.

The design of ProcessCache must be informed by what programs are actually doing. Currently there are many unknowns or speculations on our part about how real programs execute. Below I have started a non-exhaustive list of some questions about how programs execute. Notice by "programs", I mean workflows we hope to target with ProcessCache.

  • Question: How many execs are done by different workloads?
    Reasoning: This will determine the granularity of our caching and the number of cacheable units we can hope to have. This will also tell us what good domains for ProcessCache are, i.e., I expect bash scripts to have a buncha exec calls, while a single python script not so much.
  • Question: How many getdents are done by processes?
    Reasoning: See #23, we don't have a clear way of handling directory reads. Knowing how many programs are actually doing this will inform us whether the proposed solution on #23 would even work.
  • Question: How often are outputs of a sub-exec computation reused as inputs to the parent computation?
    Reasoning: I have been thinking about which exec-units must be reexecuted when an input has changed. As we discussed during our 1/28 meeting, should we reexecute the parent if a child computation needs reexecuting? (I have a bunch of preliminary thoughts on this, but it is too early to write them down.) Knowing this will greatly inform the design of our dependency graph and whether parent processes should be reexecuted...
  • Question: How often are the "at" variants of system calls called (such as openat instead of open)? And when they are used, how often is the dirfd actually used, instead of just AT_FDCWD?
    Reasoning: This will help us understand how common it is for programs to touch the file system outside the cwd they start in.
  • Question: How often are different flags called in different syscalls?
    Reasoning: Many syscalls, such as openat and execveat, have flags, and we want to know how often they are called so we know what common cases we are facing. Plus, there are so many flags, so knowing which to prioritize first would be helpful.

Feel free to add more!

Implement PID Namespace

Printing PIDs is useful for giving a lightweight identifier to each process. This becomes more relevant as we have more processes. The real PIDs are not very useful and, worse, they change from execution to execution. So I think we should run the process in a PID namespace so the PIDs are always the same (assuming determinism). I can do this!
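
A minimal sketch of entering a new PID namespace, assuming the libc crate and sufficient privileges (root or an unprivileged user namespace); after unshare(CLONE_NEWPID), the next child we fork becomes PID 1 in the new namespace, so the PIDs we print are stable across runs:

    fn enter_new_pid_namespace() -> std::io::Result<()> {
        // Affects children forked after this call, not the calling process itself.
        if unsafe { libc::unshare(libc::CLONE_NEWPID) } != 0 {
            return Err(std::io::Error::last_os_error());
        }
        Ok(())
    }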

"At" Variants of System Calls

Many system calls have an "at" variant that we will need to handle precisely in the future. This issue serves as a place to list these as they come up, so we will know what needs to be done later. Edit to add Omar's list to this one.

Current:

  • execveat
  • openat
  • newfstatat

Eventually:

  • fchownat
  • fchmodat
  • renameat
  • renameat2
  • unlinkat
  • faccessat
  • mkdirat
  • mknodat
  • symlinkat
  • linkat
  • readlinkat
  • utimensat
  • futimesat
  • shmat

Better Error Messages

I have been having fun with error handling :)

In the simplest case you can handle Result::Errs by doing .expect("Error Message"). This has several disadvantages (all issues associated with panicking, etc). We use the anyhow crate for better error handling.

Without going into too much detail, we now get nicer error messages on failures with a pseudo-stacktrace (stack traces are only available for errors on nightly (unless you panic)). Like so:

io_tracker::execution::run_process(): Process Pid(154290) failed. file: src/execution.rs, line: 211.

Caused by:
    0: io_tracker::execution::do_run_process(): Unable to get syscall name for syscall=65535. file: src/execution.rs, line: 255.
    1: io_tracker::system_call_names::get_syscall_name(): System call number 65535 out of bounds. file: src/system_call_names.rs, line: 8.

This should help us track down errors and provides semantic information of why the error happened as well as the context from where it happened.

All we have to do is make sure when handling any errors, e.g. using the ? operator, we use .with_context(|| context!("error message")) like here:

let name = get_syscall_name(event_message as usize).with_context(|| context!(
                    "Unable to get syscall name for syscall={}.",
                    event_message
                ))?;

The .with_context method is provided by anyhow. The context! macro is my own; it adds the function name, file name, and line number, and it wraps the format! macro. Notice it accepts string literals like

context!("Unable to fetch regs for unspecified syscall")

or formatted strings. Similarly when using anyhow::bail! we can use the context! macro like so:

bail!(context!("Unhandled system call {:?}", name));

This produces nice errors like:

io_tracker::execution::run_process(): Process Pid(158683) failed. file: src/execution.rs, line: 211.

Caused by:
    io_tracker::execution::do_run_process(): Unhandled system call "brk" file: src/execution.rs, line: 253.

This has already been implemented with my latest changes and we just have to be diligent about adding context info!

Panic when caching "top level" exec

The first time I run the command:

cargo run  -- /usr/bin/echo "foo"
foo
number of child execs: 0

It runs successfully!

However, running it again (once it is cached) it fails with error:

thread 'main' panicked at 'Should not be trying to get child execs from pending root execution!', src/cache.rs:355:17

Looking at the RUST_LOG=info of the failing exec we see:

  INFO trace_process{pid=Pid(73223)}:Syscall{name="execve"}:outputs_match{pid=Pid(73223)}: Checking inputs and outputs of children
  INFO trace_process{pid=Pid(73223)}:Syscall{name="execve"}:execution_matches{pid=Pid(73223)}: Number of cached children: 0
  INFO trace_process{pid=Pid(73223)}:Syscall{name="execve"}: Initiating skip of execution!
  INFO trace_process{pid=Pid(73223)}:Syscall{name="execve"}: Serving outputs
  INFO trace_process{pid=Pid(73223)}:Syscall{name="execve"}:serve_outputs_from_cache{pid=Pid(73223)}: Serving outputs from cache.
  INFO trace_process{pid=Pid(73223)}:Syscall{name="access"}: 
  INFO No more tasks to execute! Executor done.

Basically the error is happening when calling:

println!(
        "number of child execs: {}",
        first_execution.child_executions().len()
    );

This happens because:

                                // We don't skip if it failed. We just let it fail.
                                if cached_exec_succeeded && new_exec_succeeded {
                                    s.in_scope(|| info!("Initiating skip of execution!"));
                                    skip_execution = true;
                                    s.in_scope(|| info!("Serving outputs"));
                                    serve_outputs_from_cache(tracer.curr_proc, &cached_exec)?;
                                }
                            } else {
                                curr_execution.update_root(new_execution);
                            }

We end up going into the if statement here ^. And then the exec is done (since we skipped it). So we never get to run curr_execution.update_root(new_execution); and thus never move curr_execution from PendingRoot to Successful.

Adding

                                    // Update status if this is the root before skipping?
                                    if curr_execution.is_pending_root() {
                                        curr_execution.update_root(new_execution);
                                    }

Inside the if statement seems to fix it?

Handle remaining system calls for clustal

The system calls that are not in some way handled for clustal are:

  • writev: just want to report success or failure really for now.

  • poll: don't have to do anything for now.

  • wait4: we don't have to do anything here... we get the exit status of a child process IFF it has called its own execve, and we use ptrace to do it, the same way we do it for the "root process".

  • close: in the case of clustal, it's just closing the file that was open. Other programs may delete files with close, but I know clustal isn't doing that (yay for simplicity). So nothing to do here.

Handle statfs

Handle statfs.

Statfs:

  • New file event StatFs.
  • Preconditions.
  • Postconditions.
  • Failures: ENOENT (this is what I am seeing), EACCES (search permissions).
  • MyStatFs enum.

Handle pipe, socket, connect [tar]

tar calls these, and generally we want to lump an execution all together if something calls one of these 3 syscalls. But this is pretty complicated. Maybe for tar for now we just... kinda ignore 'em because tar is going to be one big execution anyway.

Simple Tests

Just an issue for listing the sanity check tests I want to write:

  • Sanity check: run an empty C program (or perhaps an empty Rust function?) and just make sure it finishes successfully.

  • Sanity check: spawn a bunch of processes in a loop and be sure they exit correctly (fork-signal race is the worst). One version where they exec, one where they don't exec.

  • Sanity check: Check that a simple program reads the files we expect it to. (have to be careful here, if we want this to be a portable test, I can't just check paths or inodes)

  • SC: check that a program that spawns a child process which does not exec: has ONE SUCCESSFUL execution struct and within that struct it has ONE child process.

  • SC: check that a program that spawns a child process which DOES exec: has TWO SUCCESSFUL execution structs and within the parent's it has ONE child process.

  • SC: check that exit codes get set for each execution (failing ones and successful ones).

  • Maybe something with command line arguments?

Simplify Crate

Currently there is a lot of dead/unused code that makes it a bit confusing to work with IOTracker. We should clean this up and simplify...

(Somewhat) exhaustive list of inputs/outputs to track

I want to keep a list of inputs and outputs that would be valuable to track.

Inputs:

  • Command line args
  • Env vars
  • File system accesses while executing
  • Current working directory
  • Pid
  • Exit status
  • UID/GID: not going to track right now, not very relevant to IOTracker

Later Inputs:

  • Pipes
  • Signals
  • Timers
  • Current time
  • File descriptors (inherited from the parent upon fork)
  • stdin?
  • getdents64?

Outputs:

  • Files created: openat, open, creat
  • Files modified: open for write only, rename, (eventually pwrite, write, but just checking the mode for open is simpler for now)
  • Files read / accessed: open for read only, stat family, pread, read
  • Files deleted: close and unlink, for now assume any unlink or close deletes the file entirely

Later Outputs:

  • Output to stdout/stderr
  • Writing to pipes

Directory Handling

Example Problem

Consider the following workload. make is executed with the Makefile:

%: %.c
    gcc -o $@ $<

This says "For all C files, compile them and name them the same as the C file without the extension". Assume our directory has files: one.c, two.c, three.c, four.c. IOTracker/ProcessCache would detect these four files as inputs to our computation. Now imagine we add an additional file: five.c. The correct thing to do is to reexecute this computation as there is now an additional input to our make process, but with our current design, ProcessCache would see none of the inputs to the computation have changed (all inputs were recorded on the first execution), so it would skip running this process entirely, wrong!


The problem is that we're missing one input to our computation: the reading of all files in the directory, which maps down to the getdents/getdents64 (short for "get directory entries") system call. In general, any FS system call that does a "for all" read has this issue; thankfully, this is the only one that comes to mind right now.

Notes on getdents

The getdents API is very low level. It doesn't open file descriptors or allow you to manipulate files. Instead it just gives you an array of linux_dirent structs, which contain the inode and name of files. This doesn't help us directly, but it is worth thinking about.

getdents is non-recursive, which is nice. This means we don't have to consider subdirectories or recursively read directories. From a syscall interception point of view, they will look like separate getdents calls.

Solution?

We could handle this by considering a directory that gets read to be an input to the computation. So when the directory is modified (like in the example above), this will be considered an input change and the computation will be reexecuted. Inotify has support for being notified on directory modification, which is nice.

This approach may end up being too conservative and reexecuting unnecessarily. For example, if someone adds an unrelated file that is never used by the program (touch omar.txt), under this approach we will end up reexecuting the program even though there was no need to (since the input directory has been modified).

I don't know if this will work... When do we start checking for changes to a directory? After the respective exec computation is done? This won't work, as I expect any multi-exec computation will probably write output files to the relevant directory, making it seem like the directory has been modified... This issue is long enough, so we can talk about it "in person", but it is something to think about.
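
One concrete version of the "directory as input" idea is to record a hash of the sorted entry names at the time the directory is read, so that adding five.c (or deleting a file) changes the hash and forces reexecution. A minimal sketch, using std's DefaultHasher for brevity:

    use std::collections::hash_map::DefaultHasher;
    use std::fs;
    use std::hash::{Hash, Hasher};
    use std::path::Path;

    fn hash_directory_listing(dir: &Path) -> std::io::Result<u64> {
        let mut names: Vec<String> = fs::read_dir(dir)?
            .filter_map(|entry| entry.ok())
            .map(|entry| entry.file_name().to_string_lossy().into_owned())
            .collect();
        names.sort();

        let mut hasher = DefaultHasher::new();
        names.hash(&mut hasher);
        Ok(hasher.finish())
    }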
