
---
page_type: sample
languages:
- java
products:
- azure
description: "This tool provides the capability to bulk import data from Azure Data Lake or Azure Blob storage into Cosmos DB."
urlFragment: azure-cosmosdb-bulkingestion
---

Cosmos DB Tool for importing data from Azure Data Lake Store and Azure Blobs

This tool provides the capability to bulk import data from Azure Data Lake Store or Azure Blob storage into Cosmos DB. Use this tool for datasets larger than 500 GB. It maximizes the RU utilization of your Cosmos DB collection, providing better performance than a traditional Azure Data Factory pipeline. Furthermore, if a single instance of the tool can't exhaust all the RUs, you can run multiple instances on different VMs; at that point, the rate of ingestion is limited only by the throughput you have provisioned for the collection. The tool handles syncing in a multi-instance configuration to avoid duplicating data.

Prerequisites:

  1. Please set up Azure Data Lake Store and upload your data in chunks; the minimum recommended chunk size is 200 MB. Learn more about Azure Data Lake Store

  2. Set up an AAD application to access Data Lake Store from the client. Learn more about AAD registration

  3. Set up an Azure Storage account, load your data into a single container, and make sure the file names are unique.

  4. Set up two Cosmos DB partitioned collections:

    a. A partitioned collection with high provisioned RUs to ingest the actual data.

    b. A standard collection for distributing load across multiple workers and for storing the migration status.

Note

Only account creation is mandatory; the rest you can provide in the settings file, and the tool generates the collections if they don't exist. For help, reach out to [email protected]
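
As a sketch of the account-level setup with the Azure CLI (the resource names myResourceGroup, myCosmosAccount, mystorageacct, and mycontainer are placeholders to substitute with your own; the collections themselves can be left to the tool, per the note above):

    # Create the Cosmos DB account (the only mandatory creation step per the note above)
    az cosmosdb create --name myCosmosAccount --resource-group myResourceGroup --kind GlobalDocumentDB

    # If importing from Blob storage: create the storage account and a single
    # container, then batch-upload the data files (file names must be unique)
    az storage account create --name mystorageacct --resource-group myResourceGroup
    az storage container create --name mycontainer --account-name mystorageacct
    az storage blob upload-batch --account-name mystorageacct --destination mycontainer --source ./data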

Tool Setup:

  1. Generate the JAR using Maven: mvn package

  2. Update settings.properties using the following guide:

    • Map the id and partition key columns of the input JSON document to the Cosmos DB document

      • Map the key1 column value in the input document to the Cosmos DB document id column by setting the value below. An empty value ignores this setting:

          idField=key1
        
      • Map the key2 column value in the input document to the Cosmos DB document partition key column by setting the value below. An empty value ignores this setting:

          pkField=key2
        
      • Set the following boolean to true if you left idField above empty (the tool will generate a GUID for the id in the Cosmos DB document), or to false if you mapped the id field above:

          useGuidForId=false
        
      • Set the following boolean to true if you left pkField above empty (the tool will generate a GUID for the pk column in the Cosmos DB document), or to false if you mapped the pk field above:

          useGuidForPk=false
        
      • Provide the partition key defined at the time of collection creation in the following field:

          CosmosDbDataCollectionPkValue=<PartitionKey>
        
    • If you're importing data from Blob storage, provide the storage account connection string under the Azure Blob Settings section; if the source is Data Lake, provide the connection settings under the Azure Data Lake Settings section.

    • Provide the Cosmos DB account credentials under the CosmosDB Settings section. A consolidated example is sketched below.
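
For reference, here is a sketch of a consolidated settings.properties. The idField, pkField, useGuidForId, useGuidForPk, and CosmosDbDataCollectionPkValue keys are taken from the guide above; the connection-related key names are hypothetical placeholders, so confirm the exact names against the sample settings.properties shipped with the tool:

    # Field mapping (keys documented above)
    idField=key1
    pkField=key2
    useGuidForId=false
    useGuidForPk=false
    CosmosDbDataCollectionPkValue=<PartitionKey>

    # Azure Blob Settings -- key name below is a hypothetical placeholder
    azureBlobConnectionString=DefaultEndpointsProtocol=https;AccountName=<account>;AccountKey=<key>

    # CosmosDB Settings -- key names below are hypothetical placeholders
    cosmosDbEndpoint=https://<account>.documents.azure.com:443/
    cosmosDbMasterKey=<primary key>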

Running the Tool:

Step 1: Run the following command to queue the data files; this distributes the load across multiple workers and tracks the status of each file import:

java -Xmx8G -jar jsonstore-cosmosdb-import-1.0-SNAPSHOT-jar-with-dependencies.jar -conf <your settings.properties file absolute path> -storeType <adl|azureblob> -queue <Azure Data Lake Store folder containing the data files | Azure Blob container name>

For ADL: 
java -Xmx8G -jar jsonstore-cosmosdb-import-1.0-SNAPSHOT-jar-with-dependencies.jar -conf settings.properties -storeType adl -queue <QueueName>

For Azure Blobs: 
java -Xmx8G -jar jsonstore-cosmosdb-import-1.0-SNAPSHOT-jar-with-dependencies.jar -conf settings.properties -storeType azureblob -queue <ContainerName>

Please use the Azure portal or https://github.com/mingaliu/DocumentDBStudio/releases/tag/0.72 to verify the queue.

Step 2: Execute the following command on a Linux VM to start the import of data into the Cosmos DB data collection:

nohup java -Xmx8G -jar jsonstore-cosmosdb-import-1.0-SNAPSHOT-jar-with-dependencies.jar -conf <your settings.properties file absolute path> -ingestionFrom cosmosdb >run.out 2>&1 &

Example: nohup java -Xmx8G -jar jsonstore-cosmosdb-import-1.0-SNAPSHOT-jar-with-dependencies.jar -conf settings.properties -ingestionFrom cosmosdb >run.out 2>&1 &

Note

Please run the same command on other VMs if you plan to distribute work across multiple workers.
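
As an illustrative sketch of scripting that multi-worker launch (assuming passwordless SSH to each VM, a hypothetical workers.txt listing one hostname per line, and the JAR plus settings.properties already deployed under the same path on every VM):

    # Start one importer per worker VM listed in workers.txt (hypothetical helper)
    while read -r host; do
      ssh -n "$host" 'cd /opt/cosmos-import && nohup java -Xmx8G \
        -jar jsonstore-cosmosdb-import-1.0-SNAPSHOT-jar-with-dependencies.jar \
        -conf settings.properties -ingestionFrom cosmosdb >run.out 2>&1 &'
    done < workers.txt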

If you want to test the data import, we support the following input stores:

Upload an ADL JSON docs file:

java -Xmx8G -jar jsonstore-cosmosdb-import-1.0-SNAPSHOT-jar-with-dependencies.jar -conf settings.properties -ingestionFrom adl -ingestionFilePath /test/part-v000-o000-r-00000

Upload an Azure Blob file:

java -Xmx8G -jar jsonstore-cosmosdb-import-1.0-SNAPSHOT-jar-with-dependencies.jar -conf settings.properties -ingestionFrom azureblob -ingestionFilePath {container name}|{blob full name}

Upload a local JSON docs file:

java -Xmx8G -jar jsonstore-cosmosdb-import-1.0-SNAPSHOT-jar-with-dependencies.jar -conf settings.properties -ingestionFrom local -ingestionFilePath C:\Adobe\part-v000-o000-r-00000.txt

Step 3: Check the progress of the migration

Please use the Azure portal or https://github.com/mingaliu/DocumentDBStudio/releases/tag/0.72 to check the migration progress via the import tracking collection.
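
The exact schema of the tracking documents is defined by the tool; purely as an illustration, a Cosmos DB SQL query like the one below (the status field name is a hypothetical placeholder, so check an actual tracking document for the real name) could be run from the portal's query pane or DocumentDBStudio against the tracking collection to list files still in flight:

    SELECT c.id, c.status FROM c WHERE c.status != "Completed"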

Recommendations

TBD

Known issues

TBD
