
authjanitor's Introduction

Manage the lifecycle of your application secrets in Azure with ease. Migrate to more secure, auditable operations standards on your own terms. AuthJanitor supports varying levels of application secret security, based on your organization's security requirements.

Disclaimer: Using AuthJanitor does not guarantee the security of your application. There is no substitute for a proper security review from a reputable cybersecurity and/or auditing partner.

🔴 This system has not been thoroughly tested yet! Please use at your own risk! 🔴

Installation (Azure)

Create the AuthJanitor application

Follow the instructions at https://docs.microsoft.com/en-us/azure/active-directory/develop/quickstart-register-app to register a new application in the Active Directory tenant where you expect to run AuthJanitor. You may use any nomenclature you prefer; make note of the Client ID and generate a Client Secret. These will be necessary for the application to run properly. The application must have the following permissions:

  • Azure AD - Sign in and Read User Profile
  • Azure Service Management API - Access as organization users (ADMIN CONSENT REQUIRED)

(Optional for User Management to work)

  • Graph - Read Applications
  • Graph - Manage App Permission Grants and Role Assignments
  • Graph - Read All users' basic profiles

Deploy Infrastructure

Once your application is created, you can quickly deploy AuthJanitor's infrastructure to your Azure subscription by clicking the button below:

Deploy to Azure

You will need your Client ID and Client Secret from the application created above.

Deploy Code

At the moment, there is no quick way to deploy the AuthJanitor application (expect this soon). Currently, you must build/publish the code manually:

  1. Build and publish the AuthJanitor.AspNet.AdminUi project to a file folder.
  2. Upload the content of the publish/wwwroot directory to the $web share of the AuthJanitor storage account created by the ARM template.
  3. Enable the "Static Website" feature of the AuthJanitor storage account from the Azure Portal. (Currently there is no way to do this via ARM.)
  4. Build and publish the AuthJanitor.Functions.AdminApi project to the Functions app created in the deployment.
  5. Navigate to the URL for the new Functions app and launch "Configuration Wizard" from the main menu, if it does not appear automatically.

🔓 Learn more about how AuthJanitor can improve the security around your application secrets here.

Secret Rotation Process

(Secret rotation process diagram)

Glossary of Terms

Provider

A module which implements some functionality, either to handle an Application Lifecycle or rekey a Rekeyable Object.

Provider Configuration

Configuration which the Provider consumes in order to access the Application Lifecycle or Rekeyable Object.

Application Lifecycle Provider

A Provider with the logic necessary to handle the lifecycle of an application which consumes some information (key, secret, environment variable, connection string) to access a Rekeyable Object.

Rekeyable Object Provider

A Provider with the logic necessary to rekey a service or object. This might be a database, storage, or an encryption key.

Resource

A GUID-identified model which joins a Provider with its Provider Configuration. Resources also have a display name and description. A Resource can either wrap an Application Lifecycle or a Rekeyable Object provider.

Managed Secret

A GUID-identified model which joins one or more Resources which make up one or more rekeyable objects and their corresponding application lifecycles. When using multiple rekeyable objects and/or lifecycles, a User Hint must be specified. When the rekeying is performed, the User Hints are matched between rekeyable objects and application lifecycles to identify where different secrets should be persisted. Managed Secrets also have a display name and description, as well as metadata on the validity period and rotation history.

Rekeying Task

A GUID-identified model referencing a Managed Secret which needs to be rekeyed. A Rekeying Task has a queued date and an expiry date; the expiry date refers to the point in time when the key is rendered invalid. A Rekeying Task must be approved by an administrator, or it will be executed automatically by the AuthJanitor Agent.

Contributing

This project welcomes contributions and suggestions. Most contributions require you to agree to a Contributor License Agreement (CLA) declaring that you have the right to, and actually do, grant us the rights to use your contribution. For details, visit https://cla.opensource.microsoft.com.

When you submit a pull request, a CLA bot will automatically determine whether you need to provide a CLA and decorate the PR appropriately (e.g., status check, comment). Simply follow the instructions provided by the bot. You will only need to do this once across all repos using our CLA.

This project has adopted the Microsoft Open Source Code of Conduct. For more information see the Code of Conduct FAQ or contact [email protected] with any additional questions or comments.

authjanitor's People

Contributors

anthturner, ericmaino, jeffwilcox, microsoftopensource, peterbryntesson, renato-marciano, viniciussouza

authjanitor's Issues

ExternalSignal Azure Function Task waiting bug

TL;DR

Use Task.Run instead of creating and manually starting a Task in the ExternalSignal Azure Function.


The problem

The ExternalSignal Azure Function contains the following code:

var rekeyingTask = new Task(async () =>
{
    var task = new RekeyingTask()
    {
        ManagedSecretId = secret.ObjectId,
        Expiry = secret.Expiry,
        Queued = DateTimeOffset.UtcNow,
        RekeyingInProgress = true
    };
    await RekeyingTasks.Create(task);
    await _taskExecutionMetaService.ExecuteTask(task.ObjectId);
},
TaskCreationOptions.LongRunning);
rekeyingTask.Start();

if (!rekeyingTask.Wait(TimeSpan.FromSeconds(MAX_EXECUTION_SECONDS_BEFORE_RETRY)))
{
    log.LogInformation("Rekeying workflow was started but exceeded the maximum request time! ({0})", TimeSpan.FromSeconds(MAX_EXECUTION_SECONDS_BEFORE_RETRY));
    return new OkObjectResult(RETURN_RETRY_SHORTLY);
}
else
{
    log.LogInformation("Completed rekeying workflow within maximum time! ({0})", TimeSpan.FromSeconds(MAX_EXECUTION_SECONDS_BEFORE_RETRY));
    return new OkObjectResult(RETURN_CHANGE_OCCURRED);
}

This will not work as intended. The Task that is created will seem to always finish almost immediately and the .Wait(...) will always return true the instant it does.

TPL history lesson and modern async best practices

As a quick aside, it's rare that you'd want to use the pattern of new Task(...) + task.Start()... even back in the "pure" TPL days. It's usually preferable to start tasks via Task.Factory.StartNew(...). Stephen Toub has a blog post on this topic here.

Ok, now, that's technically a historical document and there's actually a bigger problem with either of those approaches in the new world of async/await. The real problem with the code stems from passing an async lambda to the constructor of the Task (the same would apply to using StartNew(...)). For an explanation of exactly why, I'll refer to another great blog post, this time from Stephen Cleary. From that post specifically, this is the problem that this code has:

Does not understand async delegates. This is actually the same as point 1 in the reasons why you would want to use StartNew. The problem is that when you pass an async delegate to StartNew, it's natural to assume that the returned task represents that delegate. However, since StartNew does not understand async delegates, what that task actually represents is just the beginning of that delegate. This is one of the first pitfalls that coders encounter when using StartNew in async code.

So, in this code's specific case, what's going to happen is that the rekeyingTask starts and, as soon as the first await statement that triggers an actual async call is reached, the delegate returns and the Task completes. This is a problem because the logic right afterwards waits on the rekeyingTask for a certain amount of time, but the task will be considered completed almost instantly while the actual work is still off running asynchronously.

LongRunning...? Not really.

Now, I assume the reason this code was written this way in the first place was because someone thought they should shuffle the execution off to a dedicated background thread, and that's why TaskCreationOptions.LongRunning is specified. However, this actually makes no sense because the work the task does is not compute bound. The task executes async methods, which means it is completely safe to run on the task pool because it is not going to "hog" a task pool thread for any significant amount of time.

The modern async friendly solution

The solution is to prefer using Task.Run for these kinds of workloads. Task.Run is the modern way to "safely" enqueue async work to the task pool. Easy peasy, done. Almost...

Waiting without blocking

We have one more problem. The code today uses .Wait(...) to detect a timeout. Wait(...) is a blocking call and will block the thread that the Azure Function runtime used to invoke the ExternalSignal::Run method. This is... not good, right? So how can we solve this? Easy, use a Task.Delay(...) to represent the timeout and have the Run method await a Task::WhenAny of the rekeyingTask and the timeout task instead. Doing this is the last step to making the Run method fully asynchronous.
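Putting those two changes together, a minimal sketch of the suggested shape might look like the following (names are reused from the snippet above; it assumes the Run method has been made async Task<IActionResult> so it can await):

var rekeyingTask = Task.Run(async () =>
{
    var task = new RekeyingTask()
    {
        ManagedSecretId = secret.ObjectId,
        Expiry = secret.Expiry,
        Queued = DateTimeOffset.UtcNow,
        RekeyingInProgress = true
    };
    await RekeyingTasks.Create(task);
    await _taskExecutionMetaService.ExecuteTask(task.ObjectId);
});

var timeout = TimeSpan.FromSeconds(MAX_EXECUTION_SECONDS_BEFORE_RETRY);
var completed = await Task.WhenAny(rekeyingTask, Task.Delay(timeout));

if (completed != rekeyingTask)
{
    // The timeout won the race; the rekeying work continues in the background.
    log.LogInformation("Rekeying workflow was started but exceeded the maximum request time! ({0})", timeout);
    return new OkObjectResult(RETURN_RETRY_SHORTLY);
}

log.LogInformation("Completed rekeying workflow within maximum time! ({0})", timeout);
return new OkObjectResult(RETURN_CHANGE_OCCURRED);

Because Task.Run understands async delegates, rekeyingTask now represents the entire delegate rather than just its synchronous beginning, and Task.WhenAny lets the method observe the timeout without blocking the invocation thread.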

Slottable Cryptography

Since customers will inevitably have varying IT security policies, the entire cryptographic subsystem should be able to be slotted in and out with different algorithmic strengths and default configurations. This will hopefully help address #2 as well.

Prefer System.Text.Json over NewtonSoft.Json

This is a .NET Core 3 based codebase and, as such, I would highly recommend ditching the dependency on NewtonSoft.Json and instead preferring to go with System.Text.Json. I would suggest that's always the mentality we take for a new .NET Core codebase unless you know you're going to need some specific behavior that NewtonSoft.Json provides, but I'm not seeing anything in this codebase that requires it yet.

Benefits?

  • One less dependency
  • Performance of System.Text.Json is leaps and bounds above NewtonSoft.Json
  • Promotes the latest guidance for the .NET ecosystem to customers who engage with the codebase

Pattern of Task execution in ProviderManagerService::ExecuteRekeyingWorkflow seems problematic

There is a repeated pattern in the implementation of the ProviderManagerService::ExecuteRekeyingWorkflow method that appears to have a couple of problems. I will use this first instance as the example:

var testTasks = providers.Select(p => p.Test().ContinueWith(t =>
{
    if (t.IsFaulted)
        logger.LogError(t.Exception, "Error running sanity test on provider '{0}'", p.ProviderMetadata.Name);
}));

await Task.WhenAll(testTasks.ToArray());

if (testTasks.Any(t => t.IsFaulted))
    throw new Exception("Error running one or more sanity tests!");

The breakdown is this:

  1. For the set of providers, Select the execution of the Test method
  2. Take the resulting Task produced from the call to Test and just tack on a continuation that handles logging if it faults and produce that Task as the result of the Select
  3. This produces an IEnumerable<Task>, but its execution has not happened yet because no one has enumerated it yet (deferred execution)
  4. Now it wants to await all of these Tasks concurrently, so a call to Task.WhenAll is made, but first there is a call to .ToArray(). This call causes the deferred IEnumerable<Task> to be evaluated and actually triggers the calls to Test.
  5. Now, after all the Tasks have been awaited by WhenAll, the logic checks to see if any one of the Tasks has faulted. It does so via the .Any(...) extension method on IEnumerable<Task> and, if so, throws an exception that stops the workflow from continuing to the next step.

First, a small performance recommendation: I would highly recommend using the overload of ContinueWith that takes a TaskContinuationOptions and specify the OnlyOnFaulted option. This is more optimal when you know you only want the continuation to fire if faulted because the Task won't even be scheduled for execution at all if the previous Task does not meet the condition.

Now, the problem is, as I highlighted in step 3, the evaluation of the IEnumerable<Task> stored in testTasks is deferred until the first piece of code attempts to enumerate it. This first happens when .ToArray() is called on it in step 4. This is fine, but the bigger problem is in step 5 because after the WhenAll has completed, the call to .Any(...) is actually causing a second enumeration of the deferred enumerable which will execute all of the logic again which, in this case, will cause Test() to be called again on all the providers. Not only that, but because the predicate is checking IsFaulted, and the Task instances were "just" kicked off, it's highly likely that they haven't even run to completion yet so checking IsFaulted will return false no matter what and the logic will proceed forward without issue.

A smaller, but related problem is that, had any one of the Tasks passed to WhenAll failed, an exception would have been thrown right at the call site where they were being awaited already, so there is no real need to check the individual tasks for their state here anyway.

Proposed changes for pattern:

var testTasks = providers.Select(p => p.Test().ContinueWith(t =>
    {
        logger.LogError(t.Exception, "Error running sanity test on provider '{0}'", p.ProviderMetadata.Name);
    },
    TaskContinuationOptions.OnlyOnFaulted));

try
{
    await Task.WhenAll(testTasks);
}
catch (Exception exception)
{
    throw new Exception("Error running one or more sanity tests!", exception);
}

NOTE: You don't even need .ToArray() because WhenAll has its own overload that takes an IEnumerable<> source.

Now, this pattern is also repeated about 7 more times in the same method, so if you fix one you should fix them all. Alternatively we can recognize the pattern and write a helper method once that we can use for all of them:

private static async Task PerformProviderActions<TProviderType>(ILogger logger, IEnumerable<TProviderType> providers, Func<TProviderType, Task> providerAction, string individualFailureErrorLogMessageTemplate, string anyFailureExceptionMessage)
    where TProviderType : IAuthJanitorProvider
{
    var providerActions = providers.Select(p => providerAction(p)
        .ContinueWith(t =>
        {
            logger.LogError(t.Exception, individualFailureErrorLogMessageTemplate, p.GetType().Name);
        },
        TaskContinuationOptions.OnlyOnFaulted));

    try
    {
        await Task.WhenAll(providerActions);
    }
    catch (Exception exception)
    {
        throw new Exception(anyFailureExceptionMessage, exception);
    }
}

And then you replace each of the individual blocks with a call like this, N times:

await PerformProviderActions(
    logger, 
    providers, 
    p => p.Test(), 
    "Error running sanity test on provider '{0}'",
    "Error running one or more sanity tests!");

Reduce allocations when serializing by caching singleton instance of JsonSerializerSettings

The AuthJanitorHttpClient::Serialize method currently instantiates both a new JsonSerializerSettings and a CamelCasePropertyNamesContractResolver on every single call:

private string Serialize<T>(T obj) =>
    JsonConvert.SerializeObject(obj, new JsonSerializerSettings()
    {
        Formatting = Formatting.None,
        ContractResolver = new CamelCasePropertyNamesContractResolver()
    });

Recommend moving the JsonSerializerSettings into a private static readonly field and just treating it as a singleton. This will reduce unnecessary allocations that contribute to more frequent GC cycles.

For the record there are no thread safety issues with these classes and standard use in ASP.NET itself is to register/configure a single instance at startup which is used to serialize any incoming/outgoing JSON requests.
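A minimal sketch of that change, reusing the names from the snippet above:

// Created once and reused for every Serialize call.
private static readonly JsonSerializerSettings SerializerSettings = new JsonSerializerSettings()
{
    Formatting = Formatting.None,
    ContractResolver = new CamelCasePropertyNamesContractResolver()
};

private string Serialize<T>(T obj) =>
    JsonConvert.SerializeObject(obj, SerializerSettings);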

FIPS Compliancy

Since it's likely AJ will end up in the hands of regulated customers, making sure we can adhere to FIPS is really important. I believe we're currently using some Managed crypto implementations, which need to be changed to their FIPS-compliant alternatives.

AuthJanitorProvider::Configuration should cache the value after first deserialization

The implementation of the AuthJanitorProvider::Configuration property is actually to de-serialize an object from the string of JSON stored in the SerializedConfiguration property:

https://github.com/microsoft/AuthJanitor/blob/b6c043c43909c131b152977aeeb0dfaf917ce37c/AuthJanitor.Providers/AuthJanitorProvider.cs#L77

This is a good convenience property to offer, but properties should not have such costly backing implementations. For example, look at this method on AppSettingsFunctionsApplicationLifecycleProvider::GetDescription that builds a description from about six different properties off of the configuration value object:

public override string GetDescription() =>
    $"Populates an App Setting called '{Configuration.SettingName}' in an Azure " +
    $"Functions application called {Configuration.ResourceName} (Resource Group " +
    $"'{Configuration.ResourceGroup}'). During the rekeying, the Functions App will " +
    $"be moved from slot '{Configuration.SourceSlot}' to slot '{Configuration.TemporarySlot}' " +
    $"temporarily, and then to slot '{Configuration.DestinationSlot}'.";

Each of those Configuration.XXX calls is going to ultimately call the "getter" implementation of the property which is going to de-serialize the JSON string into a new AppSettingConfiguration instance six times. That's six times the CPU and six times the garbage generated on each JSON parse!

Now, you could argue that someone should probably not dereference the Configuration property itself that many times in the scope of the GetDescription method any way and that's a valid argument (one I might even make), but when you have a property that's that easy to use and there is a common understanding in the .NET ecosystem that reading properties is meant to be cheap, well... code like this will be written and it's not wrong. This class is actually four inheritance levels away from the base class that provides the behavior. Should it really know what the implementation is and try to optimize for it? I would suggest no. This is also just one example of the consumption of this property from a far removed subclass, it's all over the codebase.

There are two fixes for this. The first is to change it to a method explicitly; that indicates to callers there's a cost to it and that they should be smart about caching the result if they intend to use it more than once. I don't think that's the right answer here.

The second is to implement caching of the configuration value instance on first deserialization and have subsequent calls return the cached instance. This is what I would recommend. The added benefit of this is you instantly improve the performance of all your callers and none of them are any the wiser.

Implementation:

// Add a field to the class
private TConfiguration _cachedConfigurationInstance;

public TConfiguration Configuration
{
    get => _cachedConfigurationInstance ??= JsonConvert.DeserializeObject<TConfiguration>(SerializedConfiguration);
    set 
    {
        _cachedConfigurationInstance = value;
        SerializedConfiguration = JsonConvert.SerializeObject(value);
    }
}

Explore Durable Functions/Entities

It is on my short list to replace IDataStore with Entity Framework. As part of this endeavor, I'm wondering whether switching the Functions app to use Durable Entities might be a good idea, as most of the bulk of the Functions code is around CRUD operations, and less around actual orchestration. Using Durable Entities might vastly simplify maintaining the administrator interface.

https://docs.microsoft.com/en-us/azure/azure-functions/durable/durable-functions-overview?tabs=csharp

Publish packages to NuGet

Once we get some more testing in place, I'd like to add an Actions task to publish packages to NuGet. There's a straightforward example that already exists so I don't think it'll be difficult to do, but we need to wait until there's a bit more test rigor, otherwise we'll end up pushing broken things to public feeds.

ViewModelFactory::GetViewModel forces sync over async which could lead to a deadlock

ViewModelFactory::GetViewModel, which has several overloads, is implemented as a synchronous method. However, it makes a call to IDataStore<Resource>::GetOne which is an asynchronous method. It forces this call to be blocking by accessing the .Result property of the Task returned from the call to GetOne. This can lead to deadlocks and other performance issues.

It's easy enough to change GetViewModel to be async, but it will cause ripples through the codebase (as going fully asynchronous usually does).

Proposed steps:

  1. Change GetViewModel overloads to be async - this is actually the easy part. It differs slightly per overload, but it will be easy enough to go make each of them async.
  2. Change all callers to expect GetViewModel calls to be async - this is the harder part, let's dig into why. First, GetViewModel is actually just a backing implementation for a bunch of Func<T, VM> factory methods that get registered into the services collection at start up. I actually like this, it's an elegant way to abstract the caller from how the view models are resolved without the need to introduce your own factory interface (love me some Func<>). Now, what would need to happen to propagate the asynchronous behavior is as follows:
    a. All of these Func<T, VM> would now need to become Func<T, Task<VM>> instead.
    b. All dependents who are expecting these to be injected would match that previous signature change
    c. All dependents need to be updated to await the result of the factory method instead of expecting it synchronously.
    d. Any dependent methods that were previously synchronous will also have to be updated to be async.
    e. Any callers of dependent methods of d now need to await those methods and if they themselves were synchronous, now go to step d again for its dependents and rinse and repeat until you get all the way to the top.

So it's a bunch of menial work which will not only ensure that you benefit from the asynchronous nature of the underlying call, but, more importantly, ensure you don't end up in a deadlock situation.
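To make the shape of steps (a) through (c) concrete, here is a hypothetical sketch; Resource, ResourceViewModel and the consuming class are stand-ins for whatever model/view-model pair a given overload actually handles:

public class ResourceListService
{
    // Before: Func<Resource, ResourceViewModel>, backed by a blocking .Result call.
    private readonly Func<Resource, Task<ResourceViewModel>> _viewModelFactory;

    public ResourceListService(Func<Resource, Task<ResourceViewModel>> viewModelFactory) =>
        _viewModelFactory = viewModelFactory;

    // Any caller of this method must in turn become async (steps d and e).
    public async Task<ResourceViewModel> GetResourceViewModelAsync(Resource resource) =>
        await _viewModelFactory(resource);
}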

Token is treated as expired sometimes in OBO exchange

After a certain amount of time, the on-behalf-of flow ceases to function with the logged in user's token. The error returned is invalid_grant error code 500133. As far as I can tell, this corresponds to the assertion being expired, but I don't know why EasyAuth isn't catching this on the outside (and denying access correspondingly). Logging out and back in fixes the issue.

It might be worth looking for this and sending the user to the login flow if the error code comes back this way.

Testability: Separate services from API hosts

Services are designed to be hosted within functions and ASP.NET API services today. The tight coupling here makes testability difficult and also locks each service into the hosting choices that have already been made.

Unit/Build Tests

I admit to being absolutely no expert on how best to implement testing in C#; help is definitely wanted here. I'd like to think I write good code, but no one is perfect and eventually we're going to need actual build tests.

LoggerFactory issues in Startup

To begin with, in Startup.cs there are actually two separate LoggerFactory instances created:

var logger = new LoggerFactory().CreateLogger(nameof(Startup));
logger.LogDebug("Registering LoggerFactory");
builder.Services.AddSingleton<ILoggerFactory>(new LoggerFactory());

I want to focus on the first instance that's used only for the purposes of logging in Startup itself and then come back to the second.

The first LoggerFactory is created with no explicit providers registered and it lives only for the lifetime of Startup::Configure. With no providers configured, the logging done through this instance isn't actually going anywhere, and all the LogXXX calls effectively get no-op'd.

To remedy this, you would want to create this first LoggerFactory more like this:

var logger = LoggerFactory.Create(builder =>
{
    builder
        .SetMinimumLevel(LogLevel.Debug) // NOTE: may want to conditionally control this
        .AddConsole();
}).CreateLogger<Startup>();

This would now cause the logging output to show up in the console alongside the func.exe output, for example.

Now, as to the second LoggerFactory that is registered into the service collection: this is actually not having any effect on the logging configuration of the WebJob. If you were to look at the builder.Services collection at runtime, you'll see that the host already has a registered singleton for ILoggerFactory at this point, and it has also configured a bunch of ILoggerProvider instances according to the host configuration. For your own safety, the host prevents you from stomping on this instance and accidentally ruining all the logging infrastructure that the SDK wires up for you. Case in point: if this new LoggerFactory instance had taken over, there would be no logger output at all because, just like the first case discussed above, there aren't any providers configured on the factory instance being registered. Therefore, my suggestion would be to just remove the registration of the LoggerFactory with the services collection from Startup altogether.

Logging message templates are not making best use of semantic logging practices

TL;DR: semantic logging placeholder naming should be treated like any other schema design.


I've scanned the codebase and from what I can see all logger message templates are using an ordinal based approach for placeholder names. Placeholder names are meant to have semantic value so that, as the values are recorded, they are recorded with a name that the values can later be efficiently reported on/searched by in log aggregation tools.

For example:

Logger.LogInformation("Committing new Secret with name '{0}'", newSecret.Name);

Should really be:

Logger.LogInformation("Committing new Secret with name '{SecretName}'", newSecret.Name);

If the current, ordinal based placeholder approach is used, values would need to be queried by a field called "0" vs the proposed approach where you could actually query by a field named "SecretName".

Now, the above example is trivial because there's only a single value, but imagine a scenario where you are logging three integer values in the same template and they're logged as "0", "1" and "2" now. As the person who's trying to write a query to identify every time the value of "remaining inventory" has crossed below the value 100, how can I easily tell which of those values I'm supposed to query against? I'd have to actually go look at the message template and see where the value is and then reverse engineer it. What if instead the message template had used the placeholder names more like "InventoryUsed", "RemainingInventory", "NumberOfRubberChickensHarmedDuringInventoryRetrieval"? Well, now it's pretty clear which one I'm interested in.
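To make that concrete, here is what the two styles look like side by side (illustrative logger call and variable names only):

// Ordinal placeholders: the values are recorded in fields named "0", "1" and "2".
Logger.LogInformation("Inventory update: used {0}, remaining {1}, chickens harmed {2}",
    inventoryUsed, remainingInventory, rubberChickensHarmed);

// Named placeholders: the values land in fields you can actually query by name.
Logger.LogInformation("Inventory update: used {InventoryUsed}, remaining {RemainingInventory}, chickens harmed {NumberOfRubberChickensHarmedDuringInventoryRetrieval}",
    inventoryUsed, remainingInventory, rubberChickensHarmed);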

Next, keep in mind these names actually end up in aggregate queries too, so the person who first writes the query might go figure out that the number they're interested in was in field "1", but imagine opening a sufficiently complex aggregate query up later and trying to make heads or tails of it if everything is named "0", "1" and "2" across hundreds of log message templates.

Finally, imagine someone later comes in and edits the message to add a new piece of data, but they kinda re-write the message a bit and put things in a different order. If the names were ordinal, there's a good chance someone tries to reorder the values so that they increase naturally. If they did this, all queries that worked for the original version of the message would start breaking because the values would be offset from their original ordinal names. This would not happen if meaningful placeholder names were used instead.

Retry handling

At times, calls to the management API might fail due to load or other reasons, and so we need the ability to retry an action if it throws an exception not bound to some other condition (like invalid password/token).

Since we are implementing features as interfaces, it's worth implementing something indicating whether or not the service supports the ability to/wants the outer code to retry its actions.
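As a strawman for discussion, the capability flag could be as simple as a small interface that providers opt into (hypothetical shape only; none of these members exist today):

public interface ISupportsRetry
{
    // Whether the outer workflow is allowed to retry this provider's actions at all.
    bool CanRetry { get; }

    // Hints to the caller about how aggressively to retry.
    int MaxRetryAttempts { get; }
    TimeSpan RetryDelay { get; }

    // Lets the provider distinguish transient failures (load, throttling) from
    // permanent ones (invalid password/token) that should never be retried.
    bool IsTransientFailure(Exception exception);
}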

AuthJanitorProvider should not require IServiceProvider to be injected

This was an immediate code-smell for me when I saw it. Couple things:

  • Nowhere does AuthJanitorProvider itself utilize the IServiceProvider instance
  • AuthJanitorProvider is the base class for some deeply nested hierarchies, and by requiring a dependency be injected into it, all types in the inheritance hierarchy now have to have the dependency injected as well. This is one thing if it's ILogger, but it's a completely different thing if it's IServiceProvider, as it will potentially encourage more bad behavior up the chain.
  • It's exposed for use to all subclasses via a protected field, again further encouraging bad behavior.

The good news is, I only see this being used by two subclasses:

In both of these cases, those classes could just ask for their respective dependency (both ICryptographicImplementation, as it turns out) to be injected into their own constructor and clean up everything else so there's no more IServiceProvider proliferation anywhere.

Bonus: This will also improve the testability of the classes that were using the IServiceProvider via IoC as you won't have to mock that up/test it any more.
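For illustration, the shape of the fix in those two subclasses would be something like the following (the class name is a placeholder, and it assumes the base class no longer demands IServiceProvider; ICryptographicImplementation is the dependency both subclasses actually resolve from the IServiceProvider today):

public class SomeRekeyableObjectProvider : AuthJanitorProvider
{
    private readonly ICryptographicImplementation _cryptographicImplementation;

    // Ask for the concrete dependency directly instead of going through IServiceProvider.
    public SomeRekeyableObjectProvider(ICryptographicImplementation cryptographicImplementation)
    {
        _cryptographicImplementation = cryptographicImplementation;
    }
}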

Add Legal notice for SVGs

In each Provider module, we should add a license for the SVG file(s):

(As a reminder, the original source is https://www.microsoft.com/en-us/download/details.aspx?id=41937)

Legal Notices

Microsoft and any contributors grant you a license to the Microsoft documentation and other content in this repository under the [Creative Commons Attribution 4.0 International Public License](https://creativecommons.org/licenses/by/4.0/legalcode), see the [LICENSE](LICENSE) file, and grant you a license to any code in the repository under the [MIT License](https://opensource.org/licenses/MIT), see the [LICENSE-CODE](LICENSE-CODE) file.

Microsoft, Windows, Microsoft Azure and/or other Microsoft products and services referenced in the documentation may be either trademarks or registered trademarks of Microsoft in the United States and/or other countries.

The licenses for this project do not grant you rights to use any Microsoft names, logos, or trademarks.

Microsoft's general trademark guidelines can be found at http://go.microsoft.com/fwlink/?LinkID=254653.

Privacy information can be found at https://privacy.microsoft.com/en-us/

Microsoft and any contributors reserve all other rights, whether under their respective copyrights, patents, or trademarks, whether by implication, estoppel or otherwise.

Downtime Predictor icon

Based on the features of the Providers in a Managed Secret, we can tell whether that rekeying can be performed with or without downtime.

That is, all ALCs need to support the 3-phase workflow of BeforeRekeying, CommitNewSecrets and AfterRekeying. All RKOs need to support the GetSecretToUseDuringRekeying method.

If all of the above is true, we can (theoretically) perform a zero-downtime rotation, and we should distinguish that in some way on the UI.
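A hypothetical sketch of that check (the capability flags here are illustrative placeholders, not existing provider APIs):

// True when every ALC supports the 3-phase workflow and every RKO can issue a
// temporary secret to use during rekeying -- i.e. a zero-downtime rotation is possible.
bool CanRotateWithZeroDowntime(
    IEnumerable<IApplicationLifecycleProvider> lifecycleProviders,
    IEnumerable<IRekeyableObjectProvider> rekeyableObjectProviders) =>
    lifecycleProviders.All(alc => alc.SupportsThreePhaseWorkflow) &&
    rekeyableObjectProviders.All(rko => rko.SupportsGetSecretToUseDuringRekeying);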

AzureBlobDataStorageService::Cache issues

I'm noticing a few issues with the implementation of this method that I'd like to discuss...

Using a sync method to download

The AzureBlobDataStorageService::Cache method is itself async, but it's calling the synchronous version of BlockBlobClient::Download here:

It should be changed to use the asynchronous DownloadAsync at bare minimum.

Deserialization implementation is suboptimal

Looking at the implementation overall though, I would suggest a more optimal solution which will reduce code as well as runtime allocations:

using (var blobJsonStream = new MemoryStream())
{
    await Blob.DownloadToAsync(blobJsonStream);

    CachedCollection = JsonSerializer.Deserialize<List<TStoredModel>>(new ReadOnlySpan<byte>(blobJsonStream.GetBuffer(), 0, (int)blobJsonStream.Length));
}

NOTE: this code is based off the WIP branch to make the switch to System.Text.Json, assumes that will be committed and is meant to build on that.

So, first, it allows the BlockBlobClient to optimize the download and delivery to the actual MemoryStream, with the additional bonus of removing a bunch of code in this codebase that exists just to get to CopyTo. Second, it removes the allocation/copy of the bytes into a byte[] that is caused by the ToArray() call right now. It does this by simply using the existing underlying buffer of the MemoryStream itself and wrapping it in a ReadOnlySpan, while being sure to specify that the span ends at the logical Length of the stream, not the total size of the allocated buffer. Third, it avoids the conversion of the byte[] into a string, which is not only another allocation/copy but also goes through the overhead of the GetString(...) conversion. Since we already know the bytes to be UTF8 bytes, we can simply pass those bytes through raw to the JsonSerializer and let it work right on top of them directly, which is all it wants in the first place.

Now, the only behavioral difference to consider here is if the writer of the JSON blob happened to put a byte order mark (BOM) in place this approach would not skip over that vs the GetString approach which would handle that. Thus, I'm making the assumption that these JSON files are stored without a BOM with this code. If they are definitively stored with a BOM we can account for that. If it's possible that they might be stored with or without a BOM then we can still account for that with more checks too. However, since it's the Commit method that seems to be the source of the blob content and we control the writing of that blob content and its implementation will always write without a BOM right now, then I think my change is safe.

This method doesn't actually "cache"?

Finally, the method is called Cache and it's called as a preamble in almost every other method (e.g. Get, GetOne, etc), but it doesn't actually cache the results... it will execute the download logic every single time. This doesn't seem like a caching method as I traditionally think of it. I would think that at least for the lifetime of the instance once Cache() is called it would not download the results again. Is this the intended behavior of this method?

Bigger picture, it's treated as a singleton by the application, so... in effect there is no caching going on here and you're always going to the source to download the latest. If there are multiple writers that might make a lot of sense, but then I guess I'd change the name from "cache" since it's always doing a read-through to the backing store. Or, if we really do want it to still act like a cache, this could be further optimized to remember the ETag from the last read and only perform the download/deserialization if the ETag has changed. This would at least optimize for the many-read, few-write scenarios.
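If we do want read-through-with-ETag behavior, a minimal sketch might look like this (it assumes the same Azure.Storage.Blobs Blob client and System.Text.Json usage as the snippet above; the _lastKnownETag field is new and illustrative):

private ETag? _lastKnownETag;

private async Task Cache()
{
    var properties = await Blob.GetPropertiesAsync();
    if (_lastKnownETag.HasValue && properties.Value.ETag == _lastKnownETag.Value)
        return; // Blob unchanged since the last read; keep the in-memory collection.

    using (var blobJsonStream = new MemoryStream())
    {
        await Blob.DownloadToAsync(blobJsonStream);
        CachedCollection = JsonSerializer.Deserialize<List<TStoredModel>>(new ReadOnlySpan<byte>(blobJsonStream.GetBuffer(), 0, (int)blobJsonStream.Length));
    }

    _lastKnownETag = properties.Value.ETag;
}

Note this still isn't safe for concurrent writers on its own; that concern is covered separately in the AzureBlobStorageDataStore concurrency issue below.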

Provider Action Breakout

There are a couple of advanced features that need to be implemented in the future: action rollbacks and multi-key awareness. These functions are predicated on provider modules being able to be run in reverse, and being able to accept multiple credentials, respectively.

In order to accomplish this, I think the best approach is to take each logical operation that a provider implements and re-implement it as a self-contained "action" module which understands how to run in forward and reverse.

This would be a major, breaking change. I'm open for alternative ideas!

Make serialization consistent

We tried to switch from Newtonsoft.Json to System.Text.Json a few months ago, which has introduced a handful of bizarre incompatibilities as I've been developing more features (I'm looking at you, 'lack of built-in support for polymorphic deserialization'). As of #101 this has gone back to being a problem. We need to go through and pick either Newtonsoft or System.Text and use it everywhere.

This is some ugly tech debt.

One-Key, Multiple-Apps Slottable Workflow calls Swap multiple times

When committing one key to multiple locations in an application (for example, Azure Functions has 2 settings that must be changed together to prevent the runtime from failing), the swap is being called once for each instance of the configuration item, instead of committing everything together and running a single commit on the application.

This is a P0, as there are multiple things that require this type of interaction. I'm still shooting to make rotating an interlocked key with an Azure Functions app our gold standard for zero-downtime testing (for a number of reliability and stability reasons which make it more difficult) and that requires this.

No caching of Azure client instances leads to constant overhead of configuring/connecting

So I'm noticing something that seems like a potential anti-pattern where Azure client instances are not being cached and instead constantly being re-materialized all the way from configuration up every single time.

It starts with the AzureIAuthJanitorProviderExtensions::GetAzure method. This method uses the fluent Azure SDK to construct the base IAzure instance which includes network calls to authenticate against, retrieve subscriptions and ultimately "bind" to the target Azure subscription. This also includes allocating a new RestClient instance which, if you dive deep enough into it, you'll find actually allocates a new HttpClientHandler each time. This is the root cause of the well known HttpClient anti-pattern which can/will ultimately result in TCP connection resource starvation.

By itself the current design of this method is probably as good as it can be, but it puts the onus on the caller, in this case the various IAuthJanitorProvider implementations, to know they need to cache the resulting client instance... spoiler alert tho: none of them do. 😞

Now, that we understand the implementation and the costs here, let's dive into why it ends up being a problem. If you go look at the various IAuthJanitorProvider implementations you will find they constantly just call the extension method via this.GetAzure() whenever they want to do something against their respective Azure resources. For example, here's one such usage from the AppSettingsWebAppApplicationLifecycleProvider class which illustrates one of the bigger problems that comes from the current design:

private async Task ApplySecrets(string slotName, List<RegeneratedSecret> secrets)
{
    if (secrets.Count > 1 && secrets.Select(s => s.UserHint).Distinct().Count() != secrets.Count)
    {
        throw new Exception("Multiple secrets sent to Provider but without distinct UserHints!");
    }

    IUpdate<IDeploymentSlot> updateBase = (await GetDeploymentSlot(slotName)).Update();
    foreach (RegeneratedSecret secret in secrets)
    {
        var appSettingName = string.IsNullOrEmpty(secret.UserHint) ? Configuration.SettingName : $"{Configuration.SettingName}-{secret.UserHint}";
        _logger.LogInformation("Updating AppSetting '{0}' in slot '{1}' (as {2})", appSettingName, slotName,
            Configuration.CommitAsConnectionString ? "connection string" : "secret");
        updateBase = updateBase.WithAppSetting(appSettingName,
            Configuration.CommitAsConnectionString ? secret.NewConnectionStringOrKey : secret.NewSecretValue);
    }

    _logger.LogInformation("Applying changes.");
    await updateBase.ApplyAsync();
    _logger.LogInformation("Swapping to '{0}'", slotName);
    await (await GetWebApp()).SwapAsync(slotName);
}

In this method you see a call to GetDeploymentSlot first, this actually calls GetWebApp which calls GetAzure(). It then later calls GetWebApp directly which again calls GetAzure. So within that one method call, you've gone through the full creation, configuration and subsequent leaking of the underlying TCP connections. This is just one highly localized example where the problem surfaces. Here's another one from WebApplicationLifecycleProvider::Test:

public override async Task Test()
{
    var sourceDeploymentSlot = await GetDeploymentSlot(SourceSlotName);
    if (sourceDeploymentSlot == null) throw new Exception("Source Deployment Slot is invalid");
    var temporaryDeploymentSlot = await GetDeploymentSlot(TemporarySlotName);
    if (temporaryDeploymentSlot == null) throw new Exception("Temporary Deployment Slot is invalid");
    var destinationDeploymentSlot = await GetDeploymentSlot(DestinationSlotName);
    if (destinationDeploymentSlot == null) throw new Exception("Destination Deployment Slot is invalid");
}

Now, we see the GetDeploymentSlot method being called three distinct times here in just this one method.

But wait, it gets worse. 😫 If you understand the rekeying workflow of AuthJanitor, every provider is retrieved and goes through the following method call sequence:

Test -> GetSecretToUseDuringRekeying -> BeforeRekeying -> Rekey -> CommitNewSecrets -> AfterRekeying -> OnConsumingApplicationSwapped

Each one of those methods, if you dig into each provider implementation, usually makes one or more calls to GetAzure(). If you extrapolate out N providers calling GetAzure() M times per workflow every T increments of time (right now 2mins) you can see how this becomes problematic pretty quickly.

There are several ways to solve this, but I would personally want to think about it more before recommending a solution. I just wanted to surface this problem so that we understand it exists right now and can begin to have discussions about it.
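For the sake of discussion, even a very naive per-provider cache would remove most of the repeated setup cost. A sketch, assuming GetAzure() returns Task<IAzure>:

// Cache the authenticated client for the lifetime of this provider instance so that
// repeated calls within one workflow reuse a single RestClient/HttpClientHandler.
private Task<IAzure> _cachedAzure;

private Task<IAzure> GetAzureCached() =>
    _cachedAzure ??= this.GetAzure();

(This isn't thread-safe as written; something like a Lazy<Task<IAzure>> would harden it, and a longer-lived shared cache would be needed to help across provider instances.)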

Rollback functionality

We pre-test and expect providers to be as honest as possible in that test, but things may still fail. We can't guarantee permissions for some things unless we try and fail, so there's not a good way to necessarily test every variable.

To that end, we should try to include a "rollback" feature in the future to avoid severe failures if something goes wrong mid-cycle.

What is the purpose of building the service provider manually in Startup.cs?

I was reviewing the Startup.cs logic and noticed this:

ServiceProvider = builder.Services.BuildServiceProvider();

I'm not sure what the value of this is. I do not ever see the Startup::ServiceProvider property being used (which is good), but then why do it at all?

The WebJobs runtime ultimately builds the final service provider that it will utilize to resolve services from that point forward. So what this logic is doing is building an intermediate service provider instance which could possibly be incomplete because you don't know/can't control if the WebJobs runtime is going to go out and add more services to the service collection after it calls your startup logic before it bakes the final container.

Again, the good news is that nobody seems to reference the Startup::ServiceProvider property anyway, so I would recommend just removing it and the call to build the provider.

Prefer injecting ILogger<T> over just ILogger

There are several places in the code which depend on a "plain" ILogger instance being injected. I would suggest all of these be updated to request an ILogger<T> instead. This is a simple change that results in the logging output created via that logger instance to be automatically categorized under the name of T so that it's easier to filter by in any logging systems.
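For example (SomeService is illustrative; the change applies to each class currently asking for a plain ILogger):

public class SomeService
{
    private readonly ILogger<SomeService> _logger;

    // Before: public SomeService(ILogger logger) ...
    // After: output from this logger is automatically categorized under SomeService's full type name.
    public SomeService(ILogger<SomeService> logger) => _logger = logger;
}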

Create Component Documentation

A lot has changed recently in AuthJanitor, one of the key items being that the architecture is much cleaner and correctly modularized. The upshot is:

  • AuthJanitorService rolls up all the requisite integrations, providers, and workflows for processing a secret rotation and sending messages to agents to process a secret rotation
  • AuthJanitor.Repository contains AuthJanitorDbContext, which manages the data layer of AuthJanitor, if it is used. This is built on EF, so underlying data store providers will not be AuthJanitor's to maintain anymore, and it allows for really trivial integration to a lot of other consuming platforms (like ASP.NET Core)
  • ITokenCredentialProvider is expected to be implemented by the consumer to handle token acquisition callbacks
  • AuthJanitor.Automation.AdminApi and AuthJanitor.Automation.AdminUi -- as well as the shared code -- have been completely merged together into a Blazor Server app, which makes development much easier and allows for correctly targeting resources from a local debugger. 🎉

There is a new consumer app, AuthJanitor.CLI which should also be documented. That can be used to run Core and optionally Repository as a non-web client. This should end up being the reference implementation, as it's likely to be a lot simpler than the ASP.NET Core implementation.

Token exchange error handling

Currently if the OBO token exchange fails, there is no meaningful way to tell why. This is important because one of the key reasons for failure is consent change. Since EasyAuth applications can't force-refresh their own consent (or at least, as far as I could find), users have to reset their own permissions or the admin must revoke consent for the entire tenant. This is just one potential use case -- I think it's appropriate to sanitize the error and return a useful message. For the consent issue, this might be to redirect the user to myapps.microsoft.com to clear consent.

An example error response to parse (consent mismatch, some information removed):

{
  "error":"invalid_grant",
  "error_description":"AADSTS65001: The user or administrator has not consented to use the application with ID '<<REMOVED>>' named 'AuthJanitor Administrator Tool'. Send an interactive authorization request for this user and resource. [...]",
  "error_codes":[65001],
  "suberror":"consent_required"
}

Function app Settings

Settings that need to be included as part of setup for deployment of the AdminAPI function:

STORAGE_WEB_URL - URL to the storage account hosting the web content (for Functions Proxies)
CLIENT_ID - AAD App Client ID
CLIENT_SECRET - AAD App Client Secret
TENANT_ID - AAD Tenant ID

These are not currently included in the repo.

AKS Application Lifecycle Provider

This is one of the more requested providers but it will also be very difficult for similar reasons to the AppServices providers, in that it needs to implement interlocked rotation.

Containerize Services

The code is all .NET Core and could easily be containerized to make it easy to run in a local environment or for testing and validation. Given there are currently no tests, this would be key to ensuring the system continues to function as tests are added and code is refactored.

AzureBlobStorageDataStore is not safe for concurrent reads/writes

The current implementation of AzureBlobStorageDataStore does not seem like it is safe for multiple instances reading and writing from the underlying blob where it stores its data. It seems to have problems on multiple levels...

Singleton, but not thread safe

Within the app it is treated as a singleton which means it's possible that multiple threads could be accessing the class at the same time. The class in general is not hardened for thread safety so that could be a problem for starters.

Multi-instance definitely not safe

The real problem comes if the app is scaled out and there are multiple instances of it running; it would definitely have problems, simply because there are no good concurrency checks between the point when the blob was read by the Cache method and when it's potentially written later by the Commit method.

Consider the following scenario:

  1. Instance 1 does a read of the blob
  2. Instance 2 does a read of the blob
  3. Instance 1 calls methods that mutate the data in memory
  4. Instance 2 calls methods that mutate the data in memory
  5. Instance 1 commits
  6. Instance 2 commits

The current implementation will always allow for Instance 2 to stomp on Instance 1's changes.

This is due to the fact that the ETag is not recorded by the implementation at the time of the read. So while the Commit method does get the ETag right before it writes to make sure it's writing over whatever version of the data was there in that instance, it doesn't mean it is the version of the data that was read by the Cache method and subsequently mutated.

Possible solutions

The potential fix for this is to remember the ETag at read time in the Cache method and then use that value at write time in Commit. However, this would present a problem that does not exist today (because the implementation just allows clobbering), and that is: what happens if you try to Commit, but another instance has updated the underlying data? Like any concurrency error, the only way to handle that would have to ripple all the way upstream to the logic that was attempting to change the data in the first place, so it could refresh the data it was working with and attempt to apply the same logical changes again; even that may not be possible, and then it has to propagate all the way back to the origin of the operation.
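A minimal sketch of that ETag handshake, assuming the same Azure.Storage.Blobs client used by this class and System.Text.Json for serialization (the _lastReadETag field is illustrative and would be captured by Cache at read time):

private ETag? _lastReadETag; // set by Cache when the blob is read

private async Task Commit()
{
    var bytes = JsonSerializer.SerializeToUtf8Bytes(CachedCollection);
    using (var stream = new MemoryStream(bytes))
    {
        // Only write if the blob still has the version we read; otherwise the service
        // rejects the upload with 412 (Precondition Failed) and the caller must re-read,
        // re-apply its changes, and retry.
        await Blob.UploadAsync(stream, new BlobUploadOptions
        {
            Conditions = new BlobRequestConditions { IfMatch = _lastReadETag }
        });
    }
}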

Now, it's possible that the implementation could act at the more logical level and detect non-conflicting changes to the overall JSON, but then that might actually make the case that it shouldn't be stored in a single JSON blob anyway. That way you wouldn't have to load/cache the whole thing constantly as one, and you also wouldn't have to write that logic because you could just rely on a per-blob ETag instead. That would, of course, require a bigger change.
