
Cosmic Clone

  1. Overview
  2. Deployment Steps
  3. Create backup copy of a collection
  4. Anonymize data of a cosmos collection
  5. Todos
  6. References
  7. Contributing

Overview

screen91

Cosmic Clone is a tool to clone, back up, restore, and anonymize data in an Azure Cosmos DB collection. As more applications adopt Cosmos DB, self-serve capabilities such as collection backup and restore have become essential. Cosmic Clone is a simple utility for cloning a Cosmos DB collection. It helps you:

  • Clone collections for QA, testing, and other non-production environments.
  • Back up the data of a collection.
  • Create collections with similar settings (indexing policies, partition keys, TTL, etc.).
  • Anonymize data by scrubbing or shuffling sensitive fields in documents.

Disclaimer: Please note this is not an official tool from the Azure Cosmos DB team, but a utility developed by an independent developer within Microsoft IT and offered on GitHub as a sample.

Deployment Steps

  1. Compile and run the code, or
  2. Download a pre-compiled binary from the releases section and run the “CosmicCloneUI.exe” file.
  3. For best performance, run the tool on an Azure VM in the same region as the source and destination Cosmos collections.

The tool has the following prerequisites:

  • Microsoft .NET Framework 4.6.1 or higher
  • The source Cosmos collection and read-only keys to its account
  • The destination Cosmos account and its read-write keys
  • If a firewall is enabled on either Cosmos account, ensure the IP address of the machine running the tool is allowed.

Create backup copy of a collection

Initial screen

screen1

Enter Source and Target connection details

screen2

If validation of the entered details fails, an appropriate message is displayed below the Test Connection button.

If the access validation succeeds, the next screen shows various options for cloning a collection.

Set migration options

screen3

All options are checked by default; uncheck any you want to opt out of.

For example, if you want to retain all partition keys and indexes, keep the Indexing policies and Partition keys check boxes checked. Uncheck them if you do not want those settings copied.

If you do not want any of the documents to be copied but just a shell of the collection with similar settings, you can uncheck the Documents check box.

The remaining check boxes (Stored procedures, User defined functions, and Triggers) control whether those code artifacts are copied from collection to collection.

The next page covers the anonymization process, which is discussed in the next section. For now, you can click Next and initiate the cloning of the collection.

screen7

screen8

Open the Azure portal to verify that the new collection has been created with the required settings.

Anonymize data of a cosmos collection

After selecting the cloning options described in the previous section, the following page is displayed:

screen4

Here we can enter the rules and attribute details that need to be masked or sanitized.

To add a rule, click the “Add Rule” button; a mini form for entering the details is displayed.

A rule is an encapsulation of an attribute and the anonymization to be performed on it. A rule tells the Cosmic Clone tool what attribute to scrub and how.

The ‘Attribute to scrub’ represents the field that needs to be scrubbed/anonymized.

The ‘Filter Query’ represents the WHERE condition that determines which documents the rule applies to. If the rule should apply to all documents, leave this field blank.

The ‘Scrub Type’ field provides the following options:

  • Single value: Replace the attribute value with a fixed value.
  • Null value: Empty the attribute content.
  • Shuffle: Randomly shuffle the attribute values across all documents of the collection.
  • PartialMaskFromLeft: Partially mask the attribute value, starting from the left, with the given value.
  • PartialMaskFromRight: Partially mask the attribute value, starting from the right, with the given value.
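The scrub types above can be sketched as plain string transformations. The method names below mirror the UI options but are illustrative only, not the tool's actual implementation:

```csharp
using System;
using System.Linq;

// Illustrative sketches of the scrub types; not Cosmic Clone's actual code.
static class ScrubSketch
{
    // Single value: replace the attribute value with a fixed value.
    public static string SingleValue(string input, string fixedValue) => fixedValue;

    // Null value: empty the attribute content.
    public static string NullValue(string input) => string.Empty;

    // Shuffle: randomly reorder the attribute values across documents.
    public static string[] Shuffle(string[] values, Random rng) =>
        values.OrderBy(_ => rng.Next()).ToArray();

    // PartialMaskFromLeft: mask the first `count` characters with `mask`.
    public static string PartialMaskFromLeft(string input, char mask, int count)
    {
        count = Math.Min(count, input.Length);
        return new string(mask, count) + input.Substring(count);
    }

    // PartialMaskFromRight: mask the last `count` characters with `mask`.
    public static string PartialMaskFromRight(string input, char mask, int count)
    {
        count = Math.Min(count, input.Length);
        return input.Substring(0, input.Length - count) + new string(mask, count);
    }
}
```

For example, `PartialMaskFromLeft("1000-1000-1000-1000", 'x', 15)` yields `"xxxxxxxxxxxxxxx1000"`; the real tool's behavior around separators may differ from this sketch.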

Sample rule1

screen5

This shuffles the Full name attribute value between all documents.

Sample rule2

screen6

To update key values of nested entities, you can configure an anonymization rule as shown above. Note the Filter Query, which tells the tool to perform this operation only if the document's EntityType attribute is “Individual”.

Sample rule3

screen10

To mask a given attribute value partially with some text, use the Scrub Type options “PartialMaskFromLeft” or “PartialMaskFromRight”.

Note that the anonymization screen also provides options to validate, save, and load these rules.

Migration screen

screen7

Completion notification

screen8

Before and After anonymization

screen9

As can be seen above, documents are sanitized according to the configured rules.

Todos

  • Adapt to other Cosmos DB APIs such as Gremlin (Graph) and Cassandra, in addition to the SQL API
  • Parallelize reads and writes to improve efficiency
  • Add anonymization option to mask with random values (predefined patterns and regular expressions)
  • Refactor some of the UI and utility code to improve maintainability
  • Write more tests

References

  • Static data masking: https://docs.microsoft.com/en-us/sql/relational-databases/security/static-data-masking?view=sql-server-2017
  • Cosmos Data Import tool: https://docs.microsoft.com/en-us/azure/cosmos-db/import-data
  • Cosmos Bulk executor tool: https://docs.microsoft.com/en-us/azure/cosmos-db/bulk-executor-overview

Contributing

Contribution guidelines for this project

License

MIT


cosmicclone's Issues

Is There An Official Tool Available? Can We Reference It In The Readme?

Latest commit adds:

Disclaimer: Please note this is not an official tool from the Azure Cosmos DB team, but a utility developed by an independent developer within Microsoft IT and offered on Github as a sample.

What is the equivalent official command line tool if I'd like to copy my entire Prod Azure Cosmos to a staging environment periodically?

I can create a PR to add this information to the readme if anyone knows what it is.

DATA LOSS: "Timestamp" fields are overwritten with "_ts" value from source

If source documents have a "Timestamp" datetime property, its content is overwritten in the target document with the datetime from the "_ts" field of the source. Maybe this is by design, but it's both destructive and undocumented.

Raising this against this project because I cannot find the Microsoft.Azure.CosmosDB.BulkExecutor library, which this tool appears to use, on GitHub.

Spatial Index types causing an error

If the collections have spatial indexes, the application gives an error saying they are not supported with the current API. Hope you can find a fix.

There is no check for offer type

There is no setting/check for offer type. We are forced to create collections with throughput set at the container level, rather than shared throughput.

Does this tool work on macOS

I can't compile CosmicCloneUI on macOS. Do I understand correctly that this is a Windows-only tool?
When I try running CloneConsoleRun, my IDE gives me warnings:

Debug symbols for assembly 'Microsoft.Azure.CosmosDB.BulkImport' could not be loaded correctly. 
Mono runtime doesn't support full pdb format.

Allow pattern or partial Masking

I think this is a fantastic and much needed tool. Thank you for your time on this.

Not an issue but I'd like to add two suggestions or requests to improve the tool. In order of importance for me.

One: allow partial or pattern masking. I would like to, say, replace 1000-1000-1000-1000 with xxxx-xxxx-xxxx-1000, or 01/01/2019 with xx/xx/2019. If this is already possible, let me know the format.

Two: allow filtering on the copy. You allow it on the mask but not on the full copy. I want to do something like: if documentType = "ABC", clone the document to the new collection; otherwise, don't copy it.

Thank you.

OfferThroughput in CloneSettings is not respected

In the clone options page there is a setting for the target collection's Offer Throughput; however, the value you enter there is not respected, and the new collection is always created with 10000 RUs.

When copying from one Cosmos container to another Cosmos container there is a spike in Http 429

We are trying to migrate data from one container to another, with partitions configured and RU/s set at the database level. The container is 3 GB in size, with 528915 documents. We set the destination throughput to 15000 RU/s.

When we ran this we got a lot of HTTP 429 errors. At the end of the copy, Cosmic Clone gave the stats below. Why am I getting HTTP 429 when it consumed only 8909.58 RU/s while my source database RU/s is set to 15000 RU/s?

Given the HTTP 429 errors, is there a way to check data integrity and find out whether any documents got corrupted?

Batch Upload completed

Inserted 4778 docs @ 326.09 writes/s, 8909.58 RU/s in 14.652323 sec
Average RU consumption per document: 27.32

Summary of Batch 49 records retrieved 4778. Records Uploaded: 4778
Total records retrieved 528915. Total records uploaded 528915
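HTTP 429 means the request rate temporarily exceeded the provisioned throughput on a partition. Assuming the v2 DocumentDB SDK (which this tool appears to use, given the BulkExecutor references elsewhere in these issues), client-side retry behavior on 429s can be tuned via `ConnectionPolicy.RetryOptions`; a sketch, with placeholder endpoint and key:

```csharp
using System;
using Microsoft.Azure.Documents.Client;

// Sketch: tuning throttling (429) retries in the v2 DocumentDB SDK.
// The endpoint and key below are placeholders, not values from this project.
var policy = new ConnectionPolicy
{
    RetryOptions = new RetryOptions
    {
        MaxRetryAttemptsOnThrottledRequests = 9, // retry throttled requests
        MaxRetryWaitTimeInSeconds = 30           // total back-off budget per request
    }
};
var client = new DocumentClient(
    new Uri("https://<account>.documents.azure.com"), "<auth-key>", policy);
```

Even with retries, sustained 429s usually mean the writer is outrunning the provisioned (or per-partition) RU/s, so lowering the write concurrency or raising throughput during the copy are the underlying fixes.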

Allow navigating back when the operation finishes

I noticed that when the copy operation finishes, you need to close the app, which loses all the connection settings, so you have to start again from scratch.

It would be very nice to let the user navigate back through the steps once the operation is completed.

Command line support

Hi,

I reviewed the documentation, but I would like to know whether command-line support exists so the tool can be used in a script.

Many thanks in advance

Juan Antonio

Invalid method of counting documents

The number of documents to copy is calculated in CosmosDBHelper.GetSourceRecordCount in an invalid way.

The code below

sourceTotalRecordCount = cosmosClient.CreateDocumentQuery<long>(
    UriFactory.CreateDocumentCollectionUri(sourceDatabaseName, sourceCollectionName), totalCountQuery, queryOptions)
    .AsEnumerable().First();

uses First() to obtain only the first page of the results, and assumes that this is the total number of documents.

In fact, for larger collections, Cosmos DB will return multiple pages with partial counts, which have to be summed up in order to obtain the true total.

The correct code should enumerate all pages in the response.
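A corrected version might look like the following sketch (assuming the same v2 SDK and variables as the snippet above): summing the partial counts from all pages instead of taking only the first.

```csharp
// Sketch of a paging-aware count, not a verified patch. A COUNT query can
// return one partial count per page (e.g. per physical partition), so the
// results must be summed rather than reading only the first page.
sourceTotalRecordCount = cosmosClient.CreateDocumentQuery<long>(
        UriFactory.CreateDocumentCollectionUri(sourceDatabaseName, sourceCollectionName),
        totalCountQuery, queryOptions)
    .AsEnumerable() // enumerating drains all result pages
    .Sum();
```

If the SDK happens to return a single pre-aggregated count, `Sum()` over a one-element sequence is still correct, so this handles both cases.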

Error reading JObject from JsonReader.

Does anyone know why this error is appearing?

Collection Copy log
Begin Document Migration.
Source Database: database Source Collection: collection
Target Database: database Target Collection: collection
LogError
Error: Error reading JObject from JsonReader. Path '', line 0, position 0., Message: Error reading JObject from JsonReader. Path '', line 0, position 0.
Main process exits with error
LogError
Error: One or more errors occurred., Message: Error reading JObject from JsonReader. Path '', line 0, position 0.

Add support to copy multiple containers

Or, at the very least, when you get to the final screen after a copy, enable the Previous button so that you can go back to the previous screens and just edit the container names.

Apart from that, excellent tool. This could/should be added to ADF; pre-creating containers when copying Cosmos DB to Cosmos DB is a waste of time.
