
bb-remote-execution's Introduction

Buildbarn Remote Execution

Translations: Chinese

This repository provides tools that can be used in combination with the Buildbarn storage daemon to add support for remote execution, allowing you to create a build farm that can be called into using tools such as Bazel, BuildStream and recc.

This repository provides three programs:

  • bb_scheduler: A service that receives requests from bb_storage to queue build actions that need to be run.
  • bb_worker: A service that requests build actions from bb_scheduler and orchestrates their execution. This includes downloading the build action's input files and uploading its output files.
  • bb_runner: A service that executes the command associated with the build action.

Most setups will run a single instance of bb_scheduler and a large number of pairs of bb_worker/bb_runner processes. Older versions of Buildbarn integrated the functionality of bb_worker and bb_runner into a single process. These processes were decomposed to accomplish the following:

  • To make it possible to use privilege separation. Privilege separation is used to prevent build actions from overwriting input files. This allows bb_worker to cache these files across build actions, exposing them to each build action through hardlinks.
  • To make execution pluggable. bb_worker communicates with bb_runner using a simple gRPC-based protocol. One could, for example, implement a custom runner process that executes build actions using QEMU user-mode emulation (a sketch of this idea follows below this list).
  • To work around a race condition that effectively prevents multi-threaded processes from writing executables to disk and spawning them. Through this decomposition, bb_worker writes executables to disk, while bb_runner spawns them.
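
As noted in the second bullet, execution is pluggable through a small gRPC protocol defined in this repository's pkg/proto/runner package. The sketch below only illustrates the execution side of a hypothetical QEMU-based runner; the "qemu-aarch64" binary name and the function signature are assumptions for illustration, not Buildbarn's actual API.

// Illustrative sketch only: how a custom runner might wrap an action's command
// with QEMU user-mode emulation before executing it. The "qemu-aarch64" binary
// and this function signature are assumptions, not Buildbarn's actual API.
package main

import (
	"fmt"
	"os"
	"os/exec"
)

// runUnderQEMU executes the action's argument vector under the emulator and
// returns the action's exit code.
func runUnderQEMU(workingDirectory string, arguments []string) (int, error) {
	cmd := exec.Command("qemu-aarch64", arguments...)
	cmd.Dir = workingDirectory
	cmd.Stdout = os.Stdout
	cmd.Stderr = os.Stderr
	if err := cmd.Run(); err != nil {
		if exitErr, ok := err.(*exec.ExitError); ok {
			// The action ran, but exited with a non-zero code.
			return exitErr.ExitCode(), nil
		}
		// The emulator itself could not be started.
		return -1, err
	}
	return 0, nil
}

func main() {
	exitCode, err := runUnderQEMU(".", []string{"/bin/true"})
	fmt.Println(exitCode, err)
}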

This repository provides container images for each of these components. For bb_runner, it provides two images: bb_runner_bare and bb_runner_installer. bb_runner_bare contains no userland or Linux installation; it contains just the bb_runner executable. Typically, the actions that run on a runner do expect some userland to be installed.

It would be nice if you could use any image of your choosing as the image that your build actions run on, such as an Ubuntu 16.04 image, to take advantage of the fact that the Bazel project provides ready-to-use toolchain definitions for it.

What makes that tricky is that such an image will not have bb_runner installed. This is where the bb_runner_installer image comes in. It doesn't actually install anything, but it provides the bb_runner executable through its filesystem. You have to configure your orchestration of choice to mount this filesystem from bb_runner_installer into the image of your choice that you want to run on. This way you can use a vanilla image and just run the bb_runner executable from Buildbarn's provided container. There are a few tricks to check whether the volume is already available; you can see an example of how to do this in the docker-compose example.

Please refer to the Buildbarn Deployments repository for examples on how to set up these tools.

bb-remote-execution's People

Contributors

andyleap asuffield charlesoconor cphang99 edschouten finnball gormo jlaxson joeljeske lucasmeijer matthew-mikitka-ssimwave mickael-carl mmikitka moroten mostynb mou-hao peterebden qinusty quval sdclarke

bb-remote-execution's Issues

scheduler handles workers above maximum size poorly

If a worker tries to join a predeclared queue and advertises a worker size above the maximum that the queue has set, the scheduler emits an error trace every one or two milliseconds for every worker, which is probably undesirable.

There's no urgency, and I can probably try to take a look at the problem later, just filing an issue so I don't forget about it.

Allow runners to have multiple platform configurations

As I understand the worker configuration definition, a worker can have a combination of properties, such as this one in the docker-compose example:

     platform: {
        properties: [
          { name: 'OSFamily', value: 'Linux' },
          { name: 'container-image', value: 'docker://marketplace.gcr.io/google/rbe-ubuntu16-04@sha256:b516a2d69537cb40a7c6a7d92d0008abb29fba8725243772bdaf2c83f1be2272' },
        ],
      },

However, as far as I can tell, the platform specification cannot contain multiple property sets for a given runner, e.g. this way:

     platform: [{ 
        properties: [
          { name: 'OSFamily', value: 'Linux' },
          { name: 'container-image', value: 'docker://marketplace.gcr.io/google/rbe-ubuntu16-04@sha256:b516a2d69537cb40a7c6a7d92d0008abb29fba8725243772bdaf2c83f1be2272' },
        ],
      }, { 
        properties: [
          { name: 'OSFamily', value: 'LinuxToo' },
          { name: 'container-image', value: 'docker://marketplace.gcr.io/google/rbe-ubuntu16-04@sha256:b516a2d69537cb40a7c6a7d92d0008abb29fba8725243772bdaf2c83f1be2272' },
        ],
      }],

I am aware that this can be implemented by specifying multiple runners, but that would have consequences for the concurrency specifications. Specifically, my reading of the code is that if N is the number of concurrent runners that can be used in parallel on the worker, then two runners would have to be split N1+N2 <= N, otherwise the worker can be overloaded if all the runners are allocated by the scheduler.

Background: I recently tried to add a couple of old machines as workers to our Goma/Buildbarn system, with Windows cross-compile on Linux, but the system became unstable for some reason. While testing the upgraded system in the past couple of days I found that the instability is still present, and seemed to be caused by the case-insensitive file system mount (ciopfs) we need to use for the Windows cross-compile (many Windows SDK files are included with incorrectly cased names, from inside the SDK). ciopfs seems to stall at times, causing long periods of the worker and ciopfs not doing any building; one case lasted about 40 minutes. My guess is that this problem is related to both SSD disk speed and possibly the number of parallel processes (we have the same ciopfs configuration on a different worker, with a much faster CPU, more cores/threads, and an NVMe disk, which does not have this problem).

While there may be other ways to get a case-insensitive filesystem running, one alternative possibility would be to assign these workers to be Linux-only workers, without Windows-cross-compile (I have not yet tested this configuration).

However, it does not seem like the action system permits multiple platform specifications ("use one of these platforms"). This indicates that the Windows cross-compile workers need to be specified as "LinuxWindows" platforms, while Linux workers have to be specified as "LinuxOnly". However, the "LinuxWindows" workers should also be able to run "LinuxOnly" builds, and AFAICT that is not possible, except by specifying multiple runners and splitting the concurrency number between them, which also means halving the performance, except if there are Windows+Linux builds going on at the same time.

IMO, either the concurrency system must be changed so that only N runners can be active at a time, or a single runner group should be available for multiple platforms.

Better Logging

Buildbarn's logging isn't great.

I propose we improve it. One method could be to use a gRPC interceptor and log all calls made. This would have the benefit of minimal disruption to the core code.

Any thoughts or suggestions on how to implement this?
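
For reference, a minimal sketch of such an interceptor with grpc-go; the log format and function name are just illustrative:

// Example of a unary server interceptor that logs every gRPC call.
package main

import (
	"context"
	"log"
	"time"

	"google.golang.org/grpc"
)

func loggingInterceptor(ctx context.Context, req interface{}, info *grpc.UnaryServerInfo, handler grpc.UnaryHandler) (interface{}, error) {
	start := time.Now()
	resp, err := handler(ctx, req)
	log.Printf("%s duration=%v err=%v", info.FullMethod, time.Since(start), err)
	return resp, err
}

func main() {
	// Installed once when constructing the server, so the core code stays untouched.
	server := grpc.NewServer(grpc.UnaryInterceptor(loggingInterceptor))
	_ = server
}

A matching stream interceptor (installed via grpc.StreamInterceptor) would cover the streaming RPCs such as Execute.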

NOT_FOUND error is hard to diagnose.

We get NOT_FOUND errors in our bazel builds. An example is below. Since they use a hash as their identifier, they are hard to diagnose. Something indicating the bazel target or affected source files would be much clearer.


ERROR: /home/jenkins/workspace/_big_feature_restore-buildbarn-2/src/platform/analyzer/BUILD:703:1: C++ compilation of rule '//platform/analyzer:AudioServer' failed (Exit 34). Note: Remote connection/protocol failed with: execution failed io.grpc.StatusRuntimeException: NOT_FOUND: Build job with name buildnode|6e35fd52-cb53-45bc-a726-87366fe601f5 not found: java.io.IOException: io.grpc.StatusRuntimeException: NOT_FOUND: Build job with name buildnode|6e35fd52-cb53-45bc-a726-87366fe601f5 not found
	at com.google.devtools.build.lib.remote.GrpcRemoteExecutor.executeRemotely(GrpcRemoteExecutor.java:177)
	at com.google.devtools.build.lib.remote.RemoteSpawnRunner.lambda$exec$0(RemoteSpawnRunner.java:265)
	at com.google.devtools.build.lib.remote.Retrier.execute(Retrier.java:237)
	at com.google.devtools.build.lib.remote.RemoteRetrier.execute(RemoteRetrier.java:104)
	at com.google.devtools.build.lib.remote.RemoteSpawnRunner.exec(RemoteSpawnRunner.java:252)
	at com.google.devtools.build.lib.exec.SpawnRunner.execAsync(SpawnRunner.java:225)
	at com.google.devtools.build.lib.exec.AbstractSpawnStrategy.exec(AbstractSpawnStrategy.java:123)
	at com.google.devtools.build.lib.exec.AbstractSpawnStrategy.exec(AbstractSpawnStrategy.java:88)
	at com.google.devtools.build.lib.exec.ProxySpawnActionContext.exec(ProxySpawnActionContext.java:52)
	at com.google.devtools.build.lib.rules.cpp.CppCompileAction.execute(CppCompileAction.java:1351)
	at com.google.devtools.build.lib.skyframe.SkyframeActionExecutor$4.execute(SkyframeActionExecutor.java:832)
	at com.google.devtools.build.lib.skyframe.SkyframeActionExecutor$ActionRunner.continueAction(SkyframeActionExecutor.java:966)
	at com.google.devtools.build.lib.skyframe.SkyframeActionExecutor$ActionRunner.run(SkyframeActionExecutor.java:938)
	at com.google.devtools.build.lib.skyframe.ActionExecutionState.runStateMachine(ActionExecutionState.java:114)
	at com.google.devtools.build.lib.skyframe.ActionExecutionState.getResultOrDependOnFuture(ActionExecutionState.java:78)
	at com.google.devtools.build.lib.skyframe.SkyframeActionExecutor.executeAction(SkyframeActionExecutor.java:562)
	at com.google.devtools.build.lib.skyframe.ActionExecutionFunction.checkCacheAndExecuteIfNeeded(ActionExecutionFunction.java:710)
	at com.google.devtools.build.lib.skyframe.ActionExecutionFunction.compute(ActionExecutionFunction.java:256)
	at com.google.devtools.build.skyframe.AbstractParallelEvaluator$Evaluate.run(AbstractParallelEvaluator.java:450)
	at com.google.devtools.build.lib.concurrent.AbstractQueueVisitor$WrappedRunnable.run(AbstractQueueVisitor.java:387)
	at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)
	at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
	at java.base/java.lang.Thread.run(Unknown Source)
Caused by: io.grpc.StatusRuntimeException: NOT_FOUND: Build job with name buildnode|6e35fd52-cb53-45bc-a726-87366fe601f5 not found
	at io.grpc.Status.asRuntimeException(Status.java:532)
	at io.grpc.stub.ClientCalls$BlockingResponseStream.hasNext(ClientCalls.java:576)
	at com.google.devtools.build.lib.remote.GrpcRemoteExecutor.lambda$executeRemotely$0(GrpcRemoteExecutor.java:141)
	at com.google.devtools.build.lib.remote.Retrier.execute(Retrier.java:237)
	at com.google.devtools.build.lib.remote.RemoteRetrier.execute(RemoteRetrier.java:104)
	at com.google.devtools.build.lib.remote.GrpcRemoteExecutor.executeRemotely(GrpcRemoteExecutor.java:130)
	... 22 more

Buildbarn creates output directories based on the root directory

Buildbarn sets output directories relative to the root directory as opposed to the working directory.

This can be seen here.

The RE protos (v2) specify the output paths and the working directory as:

  // The paths are relative to the working directory of the action execution.
  // The paths are specified using a single forward slash (`/`) as a path
  // separator, even if the execution platform natively uses a different
  // separator. The path MUST NOT include a trailing slash, nor a leading slash,
  // being a relative path. The special value of empty string is allowed,
  // although not recommended, and can be used to capture the entire working
  // directory tree, including inputs.
...
repeated string output_directories = 4;

  // The working directory, relative to the input root, for the command to run
  // in. It must be a directory which exists in the input tree. If it is left
  // empty, then the action is run in the input root.
string working_directory = 6;

This becomes an issue when the working directory is not specified as the input root and the output is specified as a subdirectory of the input directory outside of the working directory.

build_directory
├── input_root
└── output_dir

bb browser

Looking at an example action from a buildstream execution through the eyes of bb-browser shows the following buildbarn-buildstream.pdf

Buildstream specifies the working directory as /buildstream/autotools/hello.bst/doc/amhello whilst expecting buildbarn to provide ../../../../../ relative to the working directory as an output directory (which would be /).

Buildbarn seems to interpret this as ../../../../../ relative to /.

Required change

Specified paths:

  • Working directory is relative to input root
  • Output directories are relative to working directory

Current paths:

  • Working directory is relative to input root
  • Output directories are relative to input root
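
A small, self-contained sketch of the difference, using the Buildstream paths from above (illustrative code, not Buildbarn's actual implementation):

// Demonstrates how output paths resolve against the input root versus against
// the working directory, per the REv2 comments quoted above.
package main

import (
	"fmt"
	"path"
)

func main() {
	inputRoot := "input_root"
	workingDirectory := "buildstream/autotools/hello.bst/doc/amhello"
	outputDirectory := "../../../../.."

	// Current behavior described in this issue: resolved against the input
	// root, escaping it entirely.
	fmt.Println(path.Join(inputRoot, outputDirectory)) // ../../../..

	// Required behavior: resolved against the working directory, which is
	// itself relative to the input root, yielding the input root.
	fmt.Println(path.Join(inputRoot, workingDirectory, outputDirectory)) // input_root
}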

worker/runner does not handle command with `./` in front of the command name

I am attempting to get Pants integrated with the remote-apis-testing project. The integration work is on this branch: https://gitlab.com/tdyas/remote-apis-testing/-/tree/pants_client_support

Pants probes the remote execution environment to find the location of Python interpreters by running a remote execution request. The command passed to the server has "./script.sh" as the name of the command (with script.sh provided in the input root).

Buildbarn rejects the request with the following error:

 Exception: Error from remote execution: 3-INVALID_ARGUMENT: "Failed to create input directory \".\": Invalid filename: \".\""

This should reproduce on my remote-apis-testing branch if you run: cd docker && ./run.sh -s docker-compose-buildbarn.yml -c docker-compose-pants.yml

Any advice on why Buildbarn is generating this error? The path technically is a path relative to the input root and should be in spec, which states:

// The arguments to the command. The first argument must be the path to the
// executable, which must be either a relative path, in which case it is
// evaluated with respect to the input root, or an absolute path.

https://github.com/bazelbuild/remote-apis/blob/5a7b1a472165a0ea47b9060089855385fe351193/build/bazel/remote/execution/v2/remote_execution.proto#L456-L458

(This of course assumes that "./script.sh" as the command is the cause. It was definitely this request though, because the Pants debug logging shows that this was the request that was sent.)
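
Assuming the leading "./" is indeed the trigger, one possible normalization is simply to clean the executable path before resolving it against the input root; a sketch, not Buildbarn's actual fix:

// Demonstrates that path.Clean removes the "." component that appears to
// trigger the "Invalid filename" error.
package main

import (
	"fmt"
	"path"
	"strings"
)

func main() {
	executable := "./script.sh"
	// Naively splitting the path yields a "." component.
	fmt.Println(strings.Split(executable, "/")) // [. script.sh]
	// Cleaning the path first leaves only the component relative to the input root.
	fmt.Println(path.Clean(executable)) // script.sh
}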

Service resolution via Consul

As mentioned in the office hour - we're looking to resolve service addresses in the various configs via Consul, so that we don't need to use any dynamic templating system to look up IPs + ports; this would allow us to have static configs, which obviates the need to have config reloading logic inside buildbarn services or buildbarn reloading logic when the config changes (at least for our use case).

After some investigation, this should be possible with minimal intrusion in buildbarn code:

  • buildbarn takes on a dependency on a grpc Resolver plugin library
  • buildbarn has an underscore import of the resolver plugin library somewhere; an init() in the plugin library registers a resolver for a custom scheme (effectively adding support directly to grpc for consul:// URLs). NOTE: not all existing plugins support this, but they should IMO, and I could fork and add it.
  • configs are rewritten to use consul:// URLs; these are passed to the custom resolver by gRPC.
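
A sketch of what steps 2 and 3 could look like from the Go side; the plugin import path and the Consul service name are hypothetical:

// Sketch of steps 2 and 3: a blank import whose init() registers a
// resolver.Builder for the "consul" scheme. Import path and service name are
// hypothetical.
package main

import (
	"log"

	"google.golang.org/grpc"

	_ "example.com/grpc-consul-resolver" // hypothetical plugin registering consul://
)

func main() {
	conn, err := grpc.Dial(
		"consul://consul.example.internal:8500/bb-scheduler",
		grpc.WithInsecure(),
		// With the Consul HTTP API approach, switch load balancing away from
		// the default "pick_first" so requests are spread across all instances.
		grpc.WithDefaultServiceConfig(`{"loadBalancingConfig": [{"round_robin":{}}]}`),
	)
	if err != nil {
		log.Fatal(err)
	}
	defer conn.Close()
}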

There are a couple resolver libraries that already exist, though they are of varying quality.

One caveat is that there are (at least) two ways resolution can happen under the hood:

  • via the Consul HTTP API. If this approach is used, the Consul HTTP API will apparently return services in a stable order, which means that the gRPC load balancing policy should be changed to "round_robin" in order to shard requests appropriately; this is called out in various READMEs of the aforementioned plugins, and introduces another imposition on buildbarn code (unless customizing dial options is already plumbed up to the user configuration level?)
  • via DNS SRV records. If the Go stdlib is used to do this, LookupSRV does the right thing in terms of randomizing order/respecting priority; no additional dial option should be necessary to change the load balancing policy away from the default "pick_first". However, no plugins seem to exist yet that use SRV records.

What (if any) approach here would the buildbarn team be willing to consider to facilitate this usecase?

  • adopt one of the Consul resolvers above as-is, provide a way to customize grpc dial options (or set automatically for certain schemes)
  • adopt a fork of one of the Consul resolvers above, with fixes
  • adopt a new resolver plugin that works off of SRV records, obviating concerns about the load balancing policy
  • some other approach?

Failed to build bb-remote-execution on Windows

env: Windows 10 64bit

build cmd on powershell

bazel build --platforms=@io_bazel_rules_go//go/toolchain:windows_amd64 //...

output error:

Loading: 
Loading: 0 packages loaded
INFO: Repository com_github_bazelbuild_remote_apis instantiated at:
  E:/git/buildbarn/bb-remote-execution/WORKSPACE:45:16: in <toplevel>
  E:/git/buildbarn/bb-remote-execution/go_dependencies.bzl:149:18: in go_dependencies
Repository rule go_repository defined at:
  C:/users/10172740/_bazel_10172740/rpt2bk7z/external/bazel_gazelle/internal/go_repository.bzl:214:32: in <toplevel>
DEBUG: C:/users/10172740/_bazel_10172740/rpt2bk7z/external/bazel_gazelle/internal/go_repository.bzl:209:18: com_github_bazelbuild_remote_apis: gazelle: rule @bazel_remote_apis//build/bazel/remote/asset/v1:asset imports "github.com/bazelbuild/remote-apis/build/bazel/remote/execution/v2" which matches multiple rules: @bazel_remote_apis//build/bazel/remote/execution/v2:remote_execution_go_proto and @bazel_remote_apis//build/bazel/remote/execution/v2:execution. # gazelle:resolve may be used to disambiguate
gazelle: rule @bazel_remote_apis//build/bazel/remote/execution/v2:execution imports "github.com/bazelbuild/remote-apis/build/bazel/semver" which matches multiple rules: @bazel_remote_apis//build/bazel/semver:semver_go_proto and @bazel_remote_apis//build/bazel/semver. # gazelle:resolve may be used to disambiguate
ERROR: An error occurred during the fetch of repository 'com_github_bazelbuild_remote_apis':
   Traceback (most recent call last):
	File "C:/users/10172740/_bazel_10172740/rpt2bk7z/external/bazel_gazelle/internal/go_repository.bzl", line 212, column 10, in _go_repository_impl
		patch(ctx)
	File "C:/users/10172740/_bazel_10172740/rpt2bk7z/external/bazel_gazelle/internal/go_repository.bzl", line 310, column 17, in patch
		fail("Error applying patch %s:\n%s%s" %
Error in fail: Error applying patch @com_github_buildbarn_bb_storage//:patches/com_github_bazelbuild_remote_apis/golang.diff:
Assertion failed: hunk, file ../patch-2.5.9-src/patch.c, line 354
patching file build/bazel/remote/asset/v1/BUILD
ERROR: E:/git/buildbarn/bb-remote-execution/WORKSPACE:45:16: fetching go_repository rule //external:com_github_bazelbuild_remote_apis: Traceback (most recent call last):
	File "C:/users/10172740/_bazel_10172740/rpt2bk7z/external/bazel_gazelle/internal/go_repository.bzl", line 212, column 10, in _go_repository_impl
		patch(ctx)
	File "C:/users/10172740/_bazel_10172740/rpt2bk7z/external/bazel_gazelle/internal/go_repository.bzl", line 310, column 17, in patch
		fail("Error applying patch %s:\n%s%s" %
Error in fail: Error applying patch @com_github_buildbarn_bb_storage//:patches/com_github_bazelbuild_remote_apis/golang.diff:
Assertion failed: hunk, file ../patch-2.5.9-src/patch.c, line 354
patching file build/bazel/remote/asset/v1/BUILD
ERROR: no such package '@com_github_bazelbuild_remote_apis//': Error applying patch @com_github_buildbarn_bb_storage//:patches/com_github_bazelbuild_remote_apis/golang.diff:
Assertion failed: hunk, file ../patch-2.5.9-src/patch.c, line 354
patching file build/bazel/remote/asset/v1/BUILD
INFO: Elapsed time: 1.069s
INFO: 0 processes.
FAILED: Build did NOT complete successfully (0 packages loaded)
FAILED: Build did NOT complete successfully (0 packages loaded)

From the logs, it seems something went wrong while patching. Do I need some other configuration on Windows 10?

Load Balancing and Redundancy without Proxy

Hi,

One way to get load balancing and redundancy is to use proxies (like bb_storage with sharding, mirroring, etc).

  • But passing all traffic through proxies increases latency and resource consumption.

  • For me it also causes additional problems, since our machines typically have 2 x 10 Gbit/s network interfaces combined into 20 Gbit/s with ethernet bonding. The underlying interface is chosen based on hash of the TCP source port, resulting in unbalanced load, especially since the bb_storage proxy opens only a single TCP connection to each cache instance.

Another approach could be to let the bazel client side control the load balancing and choose a cache instance. And then connect directly to the bb_scheduler and tell which cache instance to use for each action.

  • Perhaps bb_scheduler and bb_worker could connect to alternative cache instances, based on the instance_name, in the current remote execution protocol from the client?

What do you think?

Regards,
Ulrik Falklöf

Use the Remote Worker API instead of a custom protocol... or not?

At the Bloomberg Build Event and FOSDEM earlier this month, we chatted about convergence of remote worker protocols between Buildbarn and BuildGrid. Even though RWAPI would be the most obvious choice, it was mentioned that the authors of this protocol have less interest in using it externally than initially anticipated.

@buchgr stated that he is going to see whether he can get RWAPI moved to https://github.com/bazelbuild/remote-apis. When imported, we should hold discussions to see whether the protocol can be simplified. Once that's done, we should alter bb-scheduler and bb-worker to use that protocol instead.

Old tracking bug: https://github.com/EdSchouten/bazel-buildbarn/issues/7

Don't clean all build directories between runner completions (only clean completed runner directory)

Our setup looks something like this:

  • Worker with ~16 runners.
  • Each runner gets its own connection to the worker (i.e. each runner is set to concurrency=1)

We do not use concurrency because it waits for all jobs to finish on that runner before fetching more jobs.

This setup however doesn't work because every time a worker's job finishes it will cause the build directory to be cleaned up here:

func (d cleanBuildDirectory) Close() error {
	err := d.BuildDirectory.Close()
	d.initializer.Release()
	return err
}

However, it appears that we already clean up the build directory after each job finishes. Currently we can just comment out these lines and dc.initializer.Acquire(buildDirectory.RemoveAllChildren), which causes it to never clean up between jobs, at the risk that if a runner dies it may leave a file behind forever.

Our use-case doesn't have this as an issue because we handle failure cases aggressively and just terminate the node if it has too many failures (and we terminate nodes if they are running more than a set amount of time).

I see two major ways to upstream this.

  1. We can add a configuration option that simply instructs it not to clean the directory between runs (letting each worker completion do it, which already happens).
  2. Make a kind of locking system with folder garbage collection. But this seems complicated and likely will result in edge cases.

Empty input directories are omitted

The rules_proto_grpc project runs cp -r on multiple input directories. Each directory is created by an aspect for a proto_library target. During local execution Bazel creates output directories defined by declare_directory() for aspects and then calls the action to populate them. If the action doesn't produce any files, empty directories stay as they were. When the same build is executed remotely with Buildbarn, empty directories are not present in the inputs (while non-empty ones are there). I assume this is a bug in Buildbarn -- it doesn't download empty directory trees. #47 makes me think that if there are no files the directory tree will not be built. Since declare_directory() was called I would expect it to be present in dependent actions no matter its contents.

CI Testing

The Buildbarn project has no CI testing.

We should have CI testing which ensures that every commit has tests run, formatting, linting, etc., eventually also posting code coverage and whether the build is passing.

bb-scheduler web view listing sessions

The bb-scheduler already has a web view listing individual operations/actions. But an additional view listing bazel client invocations, instead of the individual operations/actions they consist of, would give a much better overview to understand what is going on in the cluster.

Example of a new view listing bazel invocations

Invocation Id First Action Priority Queued Actions Executing Actions Finished Actions Failed Actions Keywords
d252ce12… 2020-02-25 15:44 10 1202 2 host=machine3 user=bob
182bfe59… 2020-02-25 16:13 0 82 28 301 host=machine7 user=alice
372bce18… 2020-02-25 16:14 0 53 host=machine3 user=bob

Notes about each column:

  • Invocation Id: tool_invocation_id field in REv2 protocol.
  • First Action: Queue time for first action with this invocation id.
  • Priority: E.g. of the first action with this invocation id.
  • Queued Actions: Number of actions currently waiting in queue with this invocation Id.
  • Executing Actions: Number of currently executing actions with this invocation Id.
  • Finished Actions: Number of finished actions with this invocation Id.
  • Failed Actions: Number of failed actions with this invocation Id.
  • Keywords: Perhaps derived from information passed via bazel’s new --remote_exec_header option.

There should also be:

  • a link for listing all individual operations/actions, for a particular Invocation Id.
  • a button for aborting all actions with a particular invocation id.

Optionally there could also be a link to a bb-browser web interface for the particular invocation Id, but not all builds handled by bb-scheduler might be using BES, and they should still be shown in the bb-scheduler view.

Problems Updating Runner to Latest Bazel

As documented here and solved here, we needed to update the runner image in order to use Bazel 0.25.2, which initially failed.

Buildbarn cleans up the temporary directories on startup. The new Ubuntu image supplied by Google contains a folder in /tmp which Buildbarn fails to remove. This caused Buildbarn to crash.

For the moment, we apply a patch to get it working.

Add Buildstream compatibility

This issue will encompass the work required to bring Buildbarn into a state in which it can work with Buildstream in accordance with the REAPI spec.

Buildbarn Issues:

  • #18: This needs to be resolved as it is not REAPI compliant and causes issues with clients which choose to use working directories other than the input root.
  • #20: This is likely required to allow local_build_executor.go to provide the input root as an output directory.
    • Alternatively, bb-storage could be altered to allow local_directory to accept "." as a filepath component to Lstat to allow root to be selected for Lstat whilst the outputParentDirectory is root.
  • #23: This could be useful to allow clients to provide their own toolchains and avoid using toolchains provided by the runner.

Following the implementation of fixes for these issues, I predict that Buildbarn should work with Buildstream as long as the runner executes the action with the input root as the root through some form of sandboxing.

bb_runner caches results with ExitCode != 0

An action that runs a command returning a non-zero exit code, e.g. due to a compile error, is still cached as "OK". In my particular case, the script that would run the action failed to start, and returned 2 as the exit code.

This causes at least the Goma server (which is what I am using) to think that it received a current build failure. The result is that, without touching/editing some file (multiple in my case), it will not be possible to complete the build.

Relevant code lines:

} else if !action.DoNotCache && status.ErrorProto(response.Status) == nil {

response.Result.ExitCode = runResponse.ExitCode

// Wait for execution to complete. Permit non-zero exit codes.

Building bb_runner and bb_worker fails on Mac M1/arm64

I have been trying to build bb_runner and bb_worker on an arm64 Mac M1 machine, but it fails due to not being able to resolve a dependency for nodejs_darwin_arm64 (I have previously unsuccessfully attempted to cross-compile for arm64, but only got x64 executables).

Bazel version is 4.2.1

Local installations of Nodejs tested have been both arm64 capable v12 and v16

AFAICT the problem was introduced by an npm dependency in bb_storage, although the root cause is probably in one of the nodejs/npm dependencies.

INFO: Repository npm instantiated at:
  <workdir>/WORKSPACE:175:12: in <toplevel>
  <storage>/external/build_bazel_rules_nodejs/index.bzl:78:17: in npm_install
Repository rule npm_install defined at:
  <storage>/external/build_bazel_rules_nodejs/internal/npm_install/npm_install.bzl:654:30: in <toplevel>
ERROR: An error occurred during the fetch of repository 'npm':
   Traceback (most recent call last):
        File "<storage>//external/build_bazel_rules_nodejs/internal/npm_install/npm_install.bzl", line 556, column 31, in _npm_install_impl
                node = repository_ctx.path(get_node_label(repository_ctx))
Error in path: Unable to load package for @nodejs_darwin_arm64//:bin/node: The repository '@nodejs_darwin_arm64' could not be resolved
ERROR: Error fetching repository: Traceback (most recent call last):
        File "<storage>//external/build_bazel_rules_nodejs/internal/npm_install/npm_install.bzl", line 556, column 31, in _npm_install_impl
                node = repository_ctx.path(get_node_label(repository_ctx))
Error in path: Unable to load package for @nodejs_darwin_arm64//:bin/node: The repository '@nodejs_darwin_arm64' could not be resolved
ERROR: <storage>/external/com_github_buildbarn_bb_storage/pkg/otel/BUILD.bazel:53:8: @com_github_buildbarn_bb_storage//pkg/otel:stylesheet depends on @npm//purgecss/bin:purgecss in repository @npm which failed to fetch. no such package '@npm//purgecss/bin': Unable to load package for @nodejs_darwin_arm64//:bin/node: The repository '@nodejs_darwin_arm64' could not be resolved
ERROR: Analysis of target '//cmd/bb_runner:bb_runner' failed; build aborted: Analysis failed
INFO: Elapsed time: 0.220s
INFO: 0 processes.
FAILED: Build did NOT complete successfully (1 packages loaded, 3 targets configured)

bb_runner: Allow it to chroot into the input root

This comes as a proposal to implement behavior which is currently outside of the REAPI spec but is seen as an additional non-standard platform property within RBE.

The proposal is that the worker/runner be provided with an optional platform property allowing the client to request that the action be executed within a sandbox within the input root. Currently this is the case, but only coincidentally, because the build directory is also the input root at the moment (see #20).

Avoid scheduling two identical actions on the same worker

Currently it is possible for two actions marked doNotCache to hit the same worker. These actions cannot be deduplicated but can cause issues due to the deterministic action subdirectory naming convention.

To avoid having builds fail in this case, the scheduler should avoid scheduling two identical actions onto the same worker.

Example failure:

Note: Remote connection/protocol failed with: execution failed java.io.IOException: com.google.devtools.build.lib.remote.ExecutionStatusException: INTERNAL: Failed to acquire build
environment: Failed to create build subdirectory "e3414cebacf640db": file exists

Support GetOperation function of Operations API

BuildBarn does not appear to implement the GetOperation function in the Google Operations API. The Pants build tool assumes that the remote execution server implements this function (which Pants uses to poll for request completion), and thus Pants can submit a request to a BuildBarn instance, but immediately gets back an UNIMPLEMENTED gRPC error when it calls GetOperation.

The remote execution specification would seem to require implementing GetOperation (but not other Operations API functions):

If the client remains connected after the first response is returned after
the server, then updates are streamed as if the client had called
[WaitExecution][build.bazel.remote.execution.v2.Execution.WaitExecution]
until the execution completes or the request reaches an error. The
operation can also be queried using [Operations
API][google.longrunning.Operations.GetOperation].

The server NEED NOT implement other methods or functionality of the
Operations API.

Is there any reason why GetOperation is not implemented? If not, any objection if I submit a PR to implement it?

I'm tracking the Pants-side of this at pantsbuild/pants#9876.
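
For reference, a rough sketch of what a GetOperation handler could look like; the longrunning package below is the standard generated Go binding, while getOperationByName stands in for whatever lookup the scheduler would actually provide:

// A rough sketch of implementing just GetOperation from the Operations API.
package example

import (
	"context"

	"google.golang.org/genproto/googleapis/longrunning"
	"google.golang.org/grpc/codes"
	"google.golang.org/grpc/status"
)

type operationsServer struct{}

// getOperationByName is a hypothetical lookup into the scheduler's in-memory
// state; it stands in for whatever bb_scheduler would actually consult.
func getOperationByName(name string) (*longrunning.Operation, bool) {
	return nil, false
}

func (s *operationsServer) GetOperation(ctx context.Context, req *longrunning.GetOperationRequest) (*longrunning.Operation, error) {
	op, ok := getOperationByName(req.Name)
	if !ok {
		return nil, status.Errorf(codes.NotFound, "Operation %#v not found", req.Name)
	}
	return op, nil
}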

Schedule actions based on invocation id

It is often better to finish work instead of letting new work in.

Example

  1. Bazel client A submits a batch of actions to remote queue.
  2. Then bazel client B submits a huge batch of actions. They are queued waiting to be executed.
  3. Then client A submits a small final set of actions for its session.
  4. Then client B submits many more actions.

Current behavior

Actions submitted in step 2 are executed before actions from step 3.

Wanted behavior

The actions submitted by client A in step 3 should be executed before the queued actions from client B in step 2, because that would allow the already started client A to finish, and therefore reduce the average latency of completed build sessions.

Suggestion for bb-scheduler

  • Keep track of the initial QueuedTimeStamp for each new bazel invocation (a.k.a. tool_invocation_id in REv2). Call it invocationTimeStamp and associate it with the tool_invocation_id (a sketch follows after this list).

  • Sort actions in queue based on:
    1. Priority
    2. InvocationTimeStamp
    3. QueuedTimeStamp

  • Create some mechanism for discarding information about old tool_invocation_ids.
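
A minimal sketch of the suggested bookkeeping and sort order (all names illustrative, not actual bb-scheduler code):

// Sketch of sorting queued operations by priority, then by the first-seen
// timestamp of their invocation, then by their own queue timestamp.
package example

import (
	"sort"
	"time"
)

// firstSeen remembers when each tool_invocation_id was first observed, so all
// actions of one invocation share the same invocationTimeStamp.
var firstSeen = map[string]time.Time{}

func invocationTimeStampFor(invocationID string, now time.Time) time.Time {
	if t, ok := firstSeen[invocationID]; ok {
		return t
	}
	firstSeen[invocationID] = now
	return now
}

type queuedOperation struct {
	priority            int32     // REv2 priority: lower value runs earlier.
	invocationTimeStamp time.Time // Queue time of the invocation's first action.
	queuedTimeStamp     time.Time // Queue time of this action itself.
}

func sortQueue(ops []queuedOperation) {
	sort.Slice(ops, func(i, j int) bool {
		a, b := ops[i], ops[j]
		if a.priority != b.priority {
			return a.priority < b.priority
		}
		if !a.invocationTimeStamp.Equal(b.invocationTimeStamp) {
			return a.invocationTimeStamp.Before(b.invocationTimeStamp)
		}
		return a.queuedTimeStamp.Before(b.queuedTimeStamp)
	})
}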

Discussion

It could be argued that bb-scheduler can’t know if client B has much less remaining work to do than A, and would finish earlier if given a chance. That could be a reason for allowing a limited set of resources also to B, before A is finished. But that might be hard to accomplish without making the queue algorithm much more complex. And the need for such complexity would be reduced if actions can be executed both locally and remotely, taking the result from whichever finishes first, like Bazel's --experimental_spawn_scheduler option.

With other proprietary solutions, we experienced that a client machine could become a bottleneck for the whole cluster, if not balancing actions from several clients, but that should not occur with bazel remote execution, since actions are not queued until after their input files have been uploaded.

An alternative could be to use correlated_invocations_id instead of tool_invocation_id in REv2.

What do you think?

More resilient error handling if bb-runner disappears

If a bb-worker disappears, then the remaining bb-workers seem to take over gracefully.

But if a bb-runner has crashed or is not running, then the bb-worker will insist on repeatedly accepting new actions for the unavailable bb-runner, resulting in lots of failed builds.

Would it make sense to try preventing bb-worker from accepting new actions for bb-runners that are not alive? Or handle the ‘connection refused’ error more gracefully?

Support remote persistent workers

This ticket gathers information about remote persistent workers and their requirements for a buildbarn implementation.

Bazel has had a concept of persistent workers since 2015. These workers help alleviate the issue that certain compilers, such as the Java compiler, have a long startup and warmup time.

As an optimization, Bazel would wrap the javac instance with a small program that takes WorkRequests describing the next item to compile, feeds them to the compiler, and returns WorkResponses, keeping the JIT warm and fast.

Other languages such as Haskell suffer from the same issue and have therefore also implemented such wrapper programs. Previously this feature was only available for local execution, but recently Bazel introduced the --experimental_remote_mark_tool_inputs flag for annotating remote actions so that they too can be recognized as remote persistent work requests, according to this design document.

An implementation in Buildbarn could be similar to a hardware runner. The scheduler would be configured to utilize worker stickiness in order to reuse persistent workers. And a custom worker which is persistent worker aware would have custom logic for:

  • Startup - Start the tool in a tool workspace if necessary
  • Action invocation - Feed the tool with WorkRequests through stdin and use WorkResponse to create action outputs
  • Cleanup - Shut down the tool and discard the tool workspace

The exact details would be part of the implementation.

Todo:

  • Key extraction for tool key in scheduler
  • Configurable stickiness in scheduler for specific keys
  • Persistent worker aware worker/runner that invokes actions via stdin
  • Cleanup logic for stale persistent workers
  • Upload a specification to the Remote Execution API project

Buildbarn allows actions to write under the user's home directory

Behavior

Buildbarn allows actions to write under the user's home directory.

Specific Scene

Here is an easy action to reproduce the scenario:

def _write_home_file_impl(ctx):
    out_file = ctx.actions.declare_file("a")
    ctx.actions.run_shell(
        outputs = [out_file],
        inputs = [],
        arguments = [out_file.path],
        command = "echo \"command exit not normally\" > ~/hello; echo 1 > $1",
    )
    return [DefaultInfo(files = depset([out_file]))]

bazel build local result:

INFO: From Action shell_command/a:
: /home/ziyi/hello: Read-only file system

while a remote bazel build with Buildbarn succeeds.

Probable Resolution

Using spawn_strategy=local to run the bazel command locally can make bazel allow the action to write under the user's home directory, but that does not conform to bazel's concept, whose default option is sandboxing.
I have also seen the chroot_into_input_root option in bb_worker, but it needs the input root to contain a full userland installation, and I think that is hard for my build process. So is there any solution for the problem? Thanks!

Buildbarn is incompatible with recc

Buildbarn does not function correctly as a RE service for recc in a variety of common cases.

For example, when using CMake, or any build system doing out-of-tree builds, it is typical to have a directory structure such as

project-root (input root)
+- sources
`- my.build (working directory)
   +- objects
   `- Makefile

The build is started by running Make in project-root/my.build, compiling the inputs in ../sources into objects.

However, due to the bug described in #18, Buildbarn will attempt to create the output object files in project-root/objects instead of project-root/my.build/objects. This causes a server-side error, in my case along the lines of: Assembler messages: Fatal error: can't create objects/source_1.c.o: No such file or directory.

Buildbarn needs to run as root for some reason

I've encountered an interesting issue.

Here's the error I got when running a remote build:

ERROR: ${HOME}/${REPO}/${PATH}/BUILD.bazel:6:1: Extracting AndroidManifest.xml from ${TARGET}.aar failed (Exit -1): zipper failed: error executing command 
  (cd ${OUTPUT_BASE}/execroot/${REPO} && \
  exec env - \
    BAZEL_DO_NOT_DETECT_CPP_TOOLCHAIN=1 \
    PATH=/bin:/usr/bin \
  bazel-out/host/bin/external/bazel_tools/third_party/ijar/zipper x bazel-out/k8-fastbuild/bin/${PATH}/${TARGET}.aar -d bazel-out/k8-fastbuild/bin/${PATH}/_aar/${TARGET} AndroidManifest.xml)

Upon debugging I traced it to this code. (Bazel 3.1.0). Bazel checks permissions of the directory being created and fails because geteuid() == 65534 but linkstat.st_uid == 0 (root). When I run the bb-runner as root (i.e.runAsUser: 0) everything builds successfully.

Stack trace for more context:

And the log:

  • f 777 AndroidManifest.xml here

What do you think could be the cause of this issue? The output directory for the zipper is created by some bazel internals or maybe the runner, not manually by user code in rules, so I doubt it's a bug I can fix in my repository. I could provide an example, but it's going to take me a while to create a minimal one and strip all the internal code I can't share. Perhaps you have ideas from just knowing the issue is about output directory permissions?

Action execution fails if worker-created output directory is deleted

This appears to have been introduced by #41 (testing without that commit makes the problem go away). The output directory is now created by the worker, but filesystem.Directory follows the fd instead of the path, so the worker will only try to upload the precise directory it created.

If the action does something like this:

rmdir out
mkdir out

Then the action will fail, reporting that the output directory could not be found.

Notably, @io_bazel_rules_go//:stdlib does this.
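
A small standalone demonstration of the fd-versus-path behavior described above (on Linux): after the rmdir/mkdir pair, the original handle still refers to the old, deleted directory, while re-resolving the path finds the new one.

// Demonstrates that a directory handle backed by a file descriptor keeps
// referring to the removed directory after it is recreated.
package main

import (
	"fmt"
	"os"
)

func main() {
	if err := os.Mkdir("out", 0o755); err != nil {
		panic(err)
	}
	d, err := os.Open("out") // Handle backed by a file descriptor.
	if err != nil {
		panic(err)
	}
	defer d.Close()
	defer os.Remove("out")

	// What the action does: rmdir out && mkdir out.
	if err := os.Remove("out"); err != nil {
		panic(err)
	}
	if err := os.Mkdir("out", 0o755); err != nil {
		panic(err)
	}

	// Listing through the old descriptor still looks at the deleted directory.
	names, err := d.Readdirnames(-1)
	fmt.Println("via old fd:", names, err)

	// Re-resolving the path sees the freshly created directory.
	d2, err := os.Open("out")
	if err != nil {
		panic(err)
	}
	defer d2.Close()
	names2, err := d2.Readdirnames(-1)
	fmt.Println("via path:", names2, err)
}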

More robust and scalable scheduler

I'm looking into large scale deployments of remote execution with buildbarn. I've identified the scheduler as an area we'd like to make more robust and scalable. The problems I have with the scheduler are:

  • Scheduler goes down, we lose all statuses of jobs
  • Not easy to elastically scale the max number of jobs
  • Not easy to make clones of schedulers with pods of workers and make frontends (or storage daemons) aware of new schedulers

I propose that Buildbarn make it possible to remove the storage of the job and its metadata from the actual implementation of the scheduling algorithm. This would allow another service to take care of the fallback in case the service goes down and it needs to repopulate.

Each Backend below could be a different scheduling algorithm. It could, for example, query for jobs with specific metadata, perhaps even depending on the workers it has available.

Have you any thoughts on this problem, comments, queries or input? I'd be especially interested if you had any ideas or suggestions on what to use with the JobStorage.

                                                                      Potential Design

                                                      +-------------+  +-------------+  +-------------+
                                                      |             |  |             |  |             |
               Current Design                         |  Frontend   |  |  Frontend   |  |  Frontend   |
                                                      |             |  |             |  |             |
     +-------------+    +-------------+               +------+------+  +------+------+  +------+------+
     |             |    |             |                      |                |                |
     |  Frontend   |    |  Frontend   |                      |                |                |
     |             |    |             |             +--------+----------------+----------------+--------+
     +-----+-------+    +------+------+             |                                                   |
           |                   |                    |                                                   |
           +-------------------+                    |                   JobStorage                      |
           |                   |                    |                                                   |
    +------+--------+  +-------+-------+            |                                                   |
    |               |  |               |            +--------+----------------+----------------+--------+
    |   Scheduler   |  |   Scheduler   |                     |                |                |
    |               |  |               |                     |                |                |
    +-+---------+---+  +---+---------+-+              +------+------+  +------+------+  +------+------+
      |         |          |         |                |             |  |             |  |             |
      |         |          |         |                |  Backend    |  |  Backend    |  |  Backend    |
+-----+--+ +----+---+  +---+----+ +--+-----+          |             |  |             |  |             |
|        | |        |  |        | |        |          +-+---------+-+  +------+------+  +--+--------+-+
| Worker | | Worker |  | Worker | | Worker |            |         |           |            |        |
|        | |        |  |        | |        |            |         |           |            |        |
+--------+ +--------+  +--------+ +--------+       +----+---+ +---+----+  +---+----+  +----+---+ +--+-----+
                                                   |        | |        |  |        |  |        | |        |
                                                   | Worker | | Worker |  | Worker |  | Worker | | Worker |
                                                   |        | |        |  |        |  |        | |        |
                                                   +--------+ +--------+  +--------+  +--------+ +--------+

Dynamic spawning of workers based on platform properties

For example, I have configured a platform like

platform(
    name = "platform",
    constraint_values = [
        "@bazel_tools//platforms:x86_64",
        "@bazel_tools//platforms:linux",
        "@bazel_tools//tools/cpp:clang",
    ],
    remote_execution_properties = """
        properties: {
          name: "container-image"
          value:"docker://gcr.io/my-image@sha256:d7407d58cee310e7ab788bf4256bba704334630621d8507f3c9cf253c7fc664f"
        }
        properties {
           name: "OSFamily"
           value:  "Linux"
        }
        """,
)

I have set this platform via --host_platform=//config:platform, but it seems that buildbarn has hardcoded platform information in the jsonnet config. Is that correct or am I misunderstanding it?

Queued operations do not seem to be sorted

This is about in_memory_build_queue.go

In the platform queue's getNextNonBlockingOperation, the first operation in queuedOperations is returned, which would be the oldest remaining, as they are added chronologically.

An operationEntry does contain priority, and the operation.

As far as I can tell, the queuedOperationsHeap is not sorted, so if a later invocation adds a new operation with a "better" (in this case lower) priority this will still be added last in the queuedOperationsHeap.

Adding a sort to this method does solve this, but it can potentially cause a large amount of shuffling in case "better" priority operations are added frequently.

This issue also has some relation to issue #38, as introducing invocation-time sorting would have the same problem.

Windows worker

Hi, not sure if this is an issue, but I didn't find anything on the web.
I basically understand how Buildbarn works. My question is:
what is the procedure to set up a build cluster having Windows/Linux workers?
Is Buildbarn meant to run a Windows worker?
(Currently trying to compile the components on Windows, but it fails with a missing BUILD file in the org_golang_google_grpc module.)

Clean up queue is not sorted

This is about in_memory_build_queue.go

There is a clean up mechanism in the file, with a clean up queue, and also a system which uses cleanupKey to refer to the entries in the clean up queue.
Entries added to the clean up queue are added First-In-First-Out.

When the cleanup queue is to be polled, the following code is used:

func (q *cleanupQueue) run(now time.Time) {
	for len(q.heap) > 0 && !q.heap[0].timestamp.After(now) {
		heap.Pop(&q.heap).(cleanupEntry).callback()
	}
}

This does work fine to remove the top one in the heap when it has timed out.
But as far as I can tell there is no sorting of elements in the heap - new elements are just added at the end. So if an element with a LONG timeout is at the top, this will stop any other elements from being cleaned up.
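
For reference, a timestamp-ordered min-heap built with container/heap: as long as entries are inserted with heap.Push rather than appended directly, the entry with the earliest timestamp is always at index 0, so a single long-timeout entry cannot block the others. Field names follow the issue; the rest is an illustrative sketch.

// A timestamp-ordered min-heap: heap.Push keeps the earliest timestamp at
// index 0, matching what the run() loop above expects.
package example

import (
	"container/heap"
	"time"
)

type cleanupEntry struct {
	timestamp time.Time
	callback  func()
}

type cleanupHeap []cleanupEntry

func (h cleanupHeap) Len() int            { return len(h) }
func (h cleanupHeap) Less(i, j int) bool  { return h[i].timestamp.Before(h[j].timestamp) }
func (h cleanupHeap) Swap(i, j int)       { h[i], h[j] = h[j], h[i] }
func (h *cleanupHeap) Push(x interface{}) { *h = append(*h, x.(cleanupEntry)) }
func (h *cleanupHeap) Pop() interface{} {
	old := *h
	n := len(old)
	e := old[n-1]
	*h = old[:n-1]
	return e
}

// add inserts an entry while preserving the heap ordering; appending directly
// to the slice instead would break the invariant described in this issue.
func add(h *cleanupHeap, e cleanupEntry) {
	heap.Push(h, e)
}

// run pops and invokes all callbacks whose timestamp has passed.
func run(h *cleanupHeap, now time.Time) {
	for h.Len() > 0 && !(*h)[0].timestamp.After(now) {
		heap.Pop(h).(cleanupEntry).callback()
	}
}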

BuildBarn always resolves symlinks to origin files

Our build is relying on the fact that symlinked files are actual symlinks.
It turns out that, using Remote Execution, a file that is symlinked via https://bazel.build/rules/lib/actions#declare_symlink is not an actual symlink, but the file itself.

Apparently there is a Bazel flag (https://bazel.build/reference/command-line-reference#flag--incompatible_remote_symlinks), which is enabled by default, that should ensure exactly this. Following this flag we find bazelbuild/bazel#6631, which describes the needed feature is already implemented by Bazel.

A small repository with a reproduction of the issue can be found here: https://github.com/castler/buildbarn_bazel_symlink_issue_repro

The BuildBarn worker was using the native buildDirectories.

Scheduler does not reschedule on Internal Errors

When a worker fails due to internal errors, the scheduler does not reschedule the action; instead that grpc error is propagated all the way back to bazel.

I believe the correct handling should rather be to reschedule the action again and keep the issuing client unaware of the issue.

The reason for this being an issue in our case is that cleaning of build-directories on workers sometimes fails prior to build start due to locked files on nfs-mounts, but I think that rescheduling is how all worker-related errors should be handled.

Autoscaling workers based on queue size

Buildbarn scheduler has the NewCounterVec("in_memory_build_queue_operations_queued_total") metric to describe how many operations have been enqueued since the start of the binary. I want to propose adding a NewGauge("in_memory_build_queue_operations_queued_current") metric with the number of currently enqueued operations. This change would enable automatically scaling the number of workers with a simple strategy: spawn more workers whenever this number is not zero (or above some other small tolerance number).
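
A minimal sketch of the proposed gauge using the Prometheus Go client; only the metric itself and the increment/decrement hooks are shown, and wiring them into the actual enqueue/dequeue paths is omitted:

// Sketch of the proposed gauge; the metric name follows the issue text.
package example

import "github.com/prometheus/client_golang/prometheus"

var operationsQueuedCurrent = prometheus.NewGauge(prometheus.GaugeOpts{
	Name: "in_memory_build_queue_operations_queued_current",
	Help: "Number of operations currently queued.",
})

func init() {
	prometheus.MustRegister(operationsQueuedCurrent)
}

// onEnqueue would be called when an operation is added to the queue.
func onEnqueue() { operationsQueuedCurrent.Inc() }

// onDequeue would be called when an operation starts executing or is removed.
func onDequeue() { operationsQueuedCurrent.Dec() }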

Display bazel target in ListWorkerState / Queue scheduler interfaces

Bazel provides TEST_TARGET for all test actions and scanning through a list of test actions which each have Argv[0] as some test wrapper isn't too useful.

I propose adding an additional column (or replacing the existing Argv[0] column) with something slightly more intrusive. This data can be extracted from the action when the BasicOperationState is constructed to form a label of some sort. For bazel, this will consist of the values of TEST_TARGET, or maybe TARGET in some ideal future. This will allow you to pick out the actions relating to a target you're particularly interested in when scanning through a list of actions.

I have a POC working for this which replaces the Argv[0] column for the value of TEST_TARGET or Argv[0], I'm happy to tidy this up and submit it but wanted to get some opinions on how this should be implemented without becoming purely bazel based.

Unable to access GCS due to missing certs

Hi!

We're seeing problems when using GCS as a storage backend; specifically errors like:

Get https://storage.googleapis.com/<bucketname>/buildbarn-cas/<some-hash>:
oauth2: cannot fetch token: Post https://oauth2.googleapis.com/token: x509:
certificate signed by unknown authority

Is this fixable with a simple apt-get install ca-certificates or similar? Notably not all the images seem to have this problem; specifically we see this error when doing remote execution (as opposed to remote caching, which works fine).

Move input root to sub directory of build directory

That said, one change we should consider making to Buildbarn at some point is that we do move the input root into some kind of subdirectory, reserving the top-level directory for other storage purposes. For example, stdout and stderr of build actions are currently stored as files inside the input root. So like this:

input_root
├── .stdout.txt
├── .stderr.txt
├── input_file.c
└── output_file.o

Though most build actions generally don't care about that, it's a bit sloppy. If a build action would try to tar its entire input root, it would also archive its own stdout/stderr logs. We should consider restructuring it to something like this:

build_directory
├── stdout.txt
├── stderr.txt
└── input_root
    ├── input_file.c
    └── output_file.o

Originally posted by @EdSchouten in #18 (comment)

Scheduler endpoint should show list of workers and currently running actions

Right now the scheduler doesn't expose any information about its running state. We should add some kind of status page that shows at least the following information:

  • The list of workers connected to the scheduler,
  • Which actions these workers are currently executing,
  • The list of actions that are pending execution,
  • Optional: a list of recently completed actions.

We can simply link to bb-browser to show meaningful information about the actions.

Old tracking bug: https://github.com/EdSchouten/bazel-buildbarn/issues/12

RequestMetadata extraction does not work with Bazel

We just bumped to the latest buildbarn version in order to try out the simple action router with invocation based scheduling.

Unfortunately it seems like the InvocationId extraction is currently broken.

When checking the buildbarn scheduler GUI there is always only a single invocation Id listed:

{ "@type": "type.googleapis.com/build.bazel.remote.execution.v2.RequestMetadata" }

Some further investigation showed that bazel does indeed send the RequestMetadata message as expected:


metadata {
tool_details {
tool_name: "bazel"
tool_version: "6.0.0-pre.20211117.1-2021-12-17 (@ed890679)"
}
action_id: "ee32f4898cb0cd9eee152d64360899f8422eb538364fd079827bc99e2371f81c"
tool_invocation_id: "34d8cb06-d35b-49a3-a00b-53dff390d54f"
correlated_invocations_id: "d5452b46-ca78-48f9-a67a-11ef39b58617"
action_mnemonic: "MockGen"
target_id: "@Fw_Framework//Framework/Api/Service:IMemProvider"
configuration_id: "7fe7f8dbd78a9aabed4bc3931acb7fab17e26433fb26359b5376dc8b66dcf1d2"
}
status {
}
method_name: "build.bazel.remote.execution.v2.Execution/Execute"
details {
execute {
request {
instance_name: "buildbarn"
action_digest {
hash: "ee32f4898cb0cd9eee152d64360899f8422eb538364fd079827bc99e2371f81c"
size_bytes: 144
}
}
responses {
name: "buildbarn/operations/850f56b2-8f8c-4502-840b-a335d1fcb70d"
metadata {
type_url: "type.googleapis.com/build.bazel.remote.execution.v2.ExecuteOperationMetadata"
value: "\b\003\022E\n@ee32f4898cb0cd9eee152d64360899f8422eb538364fd079827bc99e2371f81c\020\220\001"
}
}
responses {
name: "buildbarn/operations/850f56b2-8f8c-4502-840b-a335d1fcb70d"
metadata {
type_url: "type.googleapis.com/build.bazel.remote.execution.v2.ExecuteOperationMetadata"
value: "\b\004\022E\n@ee32f4898cb0cd9eee152d64360899f8422eb538364fd079827bc99e2371f81c\020\220\001"
}
done: true
response {
type_url: "type.googleapis.com/build.bazel.remote.execution.v2.ExecuteResponse"
value: "\n\235\006\022\324\001\n\210\001bazel-out/pclinux64-fastbuild-fwl-nostamp/bin/external/Fw_Framework/Framework/Api/Service/Mocks/Framework/Api/Service/IMemProviderMock.c\022G\n@766c99c125bcfbade34fa491503a3f2fb4f5b9
e7a4fc4c636d664eaa63c8a399\020\363\214\300\001\022\323\001\n\210\001bazel-out/pclinux64-fastbuild-fwl-nostamp/bin/external/Fw_Framework/Framework/Api/Service/Mocks/Framework/Api/Service/IMemProviderMock.h\022F\n@57583aaac3b8955915
f7a7773e5929de07cdfda20aae6c67754ffc79a58c64f9\020\264\304\002J\355\002\n={"hostname":"buildbarn-worker-centos7-instance","thread":"0"}\022\v\b\215\211\201\216\006\020\271\217\344@\032\v\b\215\211\201\216\006\020\340\345\3
67@"\f\b\215\211\201\216\006\020\356\206\310\337\002*\v\b\215\211\201\216\006\020\346\312\207A2\v\b\215\211\201\216\006\020\342\306\341i:\v\b\215\211\201\216\006\020\342\306\341iB\f\b\215\211\201\216\006\020\340\211\377\265\002J
f\b\215\211\201\216\006\020\340\211\377\265\002R\f\b\215\211\201\216\006\020\356\206\310\337\002Zf\n>type.googleapis.com/buildbarn.resourceusage.POSIXResourceUsage\022$\n\006\020\340\263\365\256\001\022\005\020\330\305\260\030\030
\200\340\262\r8\214]@\025P\310\026X\200=x'\200\001\003ZC\nAtype.googleapis.com/buildbarn.resourceusage.FilePoolResourceUsageb\006\020\376\302\235\314\001*\223\001Action details (cached result): http://172.17.0.3:7984/buildbarn/bl
obs/action/ee32f4898cb0cd9eee152d64360899f8422eb538364fd079827bc99e2371f81c-144/"
}
}
}
}
start_time {
seconds: 1639990413
nanos: 133000000
}
end_time {
seconds: 1639990413
nanos: 743000000
}

I added some debug printouts in in_memory_build_queue.go

// getRequestMetadata extracts the RequestMetadata message stored in the
// gRPC request headers. This message contains the invocation ID that is
// used to group incoming requests by client, so that tasks can be
// scheduled across workers fairly.
func getRequestMetadata(ctx context.Context) *remoteexecution.RequestMetadata {
	if md, ok := metadata.FromIncomingContext(ctx); ok {
		log.Print("Context Extraction OK")
		log.Print(md)
		for _, requestMetadataBin := range md.Get("build.bazel.remote.execution.v2.requestmetadata-bin") {
			log.Print("Parsing RequestMetadata")
			var requestMetadata remoteexecution.RequestMetadata
			if err := proto.Unmarshal([]byte(requestMetadataBin), &requestMetadata); err == nil {
				return &requestMetadata
			}
		}
	}
	log.Print("Returning nil RequestMetadata")
	return nil
}

And the received output is always:

Dec 20 11:25:09 buildbarn-worker-frontend-centos7-instance bb_scheduler[3339]: 2021/12/20 11:25:09 Context Extraction OK
Dec 20 11:25:09 buildbarn-worker-frontend-centos7-instance bb_scheduler[3339]: 2021/12/20 11:25:09 map[:authority:[0.0.0.0:8982] content-type:[application/grpc] user-agent:[grpc-go/1.42.0]]
Dec 20 11:25:09 buildbarn-worker-frontend-centos7-instance bb_scheduler[3339]: 2021/12/20 11:25:09 Returning nil RequestMetadata

So it seems like the RequestMetadata message is always lost.

I'm not too familiar with gRPC setups, so I think I might need some pointers on how to proceed with troubleshooting from here. Any input on this, @EdSchouten?
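
One thing that might be worth ruling out (a sketch only, under the assumption that the Execute call passes through a forwarding frontend; not a confirmed diagnosis): gRPC does not propagate incoming request metadata to outgoing calls automatically, so any component proxying requests to the scheduler has to copy it across explicitly, roughly like this:

package forwarding

import (
	"context"

	"google.golang.org/grpc/metadata"
)

// forwardRequestMetadata copies the incoming gRPC metadata (which carries the
// RequestMetadata header) onto the context used for the outgoing call, so that
// a proxying service does not silently drop it. Hypothetical helper, for
// illustration only.
func forwardRequestMetadata(ctx context.Context) context.Context {
	if md, ok := metadata.FromIncomingContext(ctx); ok {
		return metadata.NewOutgoingContext(ctx, md.Copy())
	}
	return ctx
}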

Make bb-remote-execution go gettable

Currently this project can't be pulled down with go get as the proto definitions aren't built into .go files.

Is it possible to include these files so that non-Bazel users can build this repo?
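
As a possible interim workaround (a sketch only; the proto path and plugin flags are assumptions, not part of this repo's actual build setup), the Go stubs could be generated locally via a go:generate directive and checked in:

// Package runner is a placeholder showing where generated bindings could live.
// The proto path and flags below are illustrative; adjust them to the actual
// layout of the repository.
//
//go:generate protoc --go_out=. --go-grpc_out=. pkg/proto/runner/runner.proto
package runner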

Thanks,

Henry

Instance name lost when ExecuteRequest arrives in scheduler

We have a Goma/Buildbarn configuration, and I am currently testing an upgrade of the system.

The Goma system uses the instance name "goma", and the storage and workers are configured for this name (and only this name).

During testing, requests failed with 'No workers exist for instance ""' for the specified platform.

Testing so far has shown that the Goma server backend (which is connected to the storage with a unix socket, as are the buildbarn backend components to each other) correctly configures "goma" as the instance name for the ExecuteRequest message.

However, when this message arrives in the pkg/builder/in_memory_build_queue.go function Execute(), in.InstanceName is an empty string. This suggests that the value is lost somewhere along the Goma->storage->scheduler transmission chain. I did notice some possible locations for the problem, but the apparent lack of activity logging functionality in the backend has so far prevented detailed debugging.

At present, during testing, I am working around this issue by forcing the instance name to "goma". I could make that permanent until a fix lands by creating a branch of the repo, but I would rather avoid that.
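
Concretely, the workaround boils down to something like this (a hypothetical sketch of the local patch, not a proposed fix):

package builder

import (
	remoteexecution "github.com/bazelbuild/remote-apis/build/bazel/remote/execution/v2"
)

// forceInstanceName is a hypothetical stop-gap: rewrite an empty instance
// name on incoming ExecuteRequests until the real cause of the loss is found.
func forceInstanceName(in *remoteexecution.ExecuteRequest) {
	if in.InstanceName == "" {
		in.InstanceName = "goma"
	}
}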

For reference, the checkouts are currently:

  • bb-remote-execution: ee554d7
  • bb-storage: a89a0279
  • bb-browser: 61b07488

Scheduler validation breaks after Docker image 20210430T150052Z-ed1e2f4

Reproduction:

docker run -v $PWD/scheduler.libsonnet:/scheduler.libsonnet -it buildbarn/bb-scheduler:20210612T211840Z-40946a1 /scheduler.libsonnet

Error message:

2021/10/08 18:56:04 Failed to read configuration from /scheduler.libsonnet: rpc error: code = Unknown desc = Failed to unmarshal configuration: proto: syntax error (line 37:31): unexpected token {

Where the libsonnet file is:

{
  admin_http_listen_address: ':80',
  clientGrpcServers: [{
    listenAddresses: [':8982'],
    authenticationPolicy: { allow: {} },
  }],
  defaultExecutionTimeout: {
    seconds: 1000,
    nanos: 0
  },
  workerGrpcServers: [{
    listenAddresses: [':8983'],
    authenticationPolicy: { allow: {} },
  }],
  contentAddressableStorage: {
    sharding: {
      hashInitialization: 11946695773637837490,
      shards: [
        {
          backend: { grpc: { address: 'storage-0.storage.buildbarn:8981' } },
          weight: 1,
        },
        {
          backend: { grpc: { address: 'storage-1.storage.buildbarn:8981' } },
          weight: 1,
        },
      ],
    },
  },
  maximumMessageSizeBytes: 30 * 1024 * 1024 * 1024,
  browserUrl: 'bb-browser.dev.kube.local:80',
}
