
data-engineer-project's People

Contributors

fpcarneiro

data-engineer-project's Issues

Step 1: Scope the Project and Gather Data

Since the scope of the project will be highly dependent on the data, these two things happen simultaneously. In this step, you'll:

  • Identify and gather the data you'll be using for your project (at least two sources and more than 1 million rows); see Project Resources for ideas of what data you can use. A loading sketch follows this list.
  • Explain what end use cases you'd like to prepare the data for (e.g., analytics table, app back-end, source-of-truth database).
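
A minimal gathering sketch in Python, assuming pandas and two hypothetical input files; the file and variable names are placeholders, not part of the project:

    import pandas as pd

    # Hypothetical sources -- substitute the datasets you actually gathered.
    # Source 1: a large, fact-like dataset; together the sources should
    # exceed 1 million rows.
    immigration = pd.read_csv("immigration_data.csv")

    # Source 2: a smaller reference dataset, ideally in a different format.
    temperatures = pd.read_json("city_temperatures.json")

    # Sanity-check the volume requirement up front.
    total_rows = len(immigration) + len(temperatures)
    print(f"Total rows across sources: {total_rows:,}")
    assert total_rows > 1_000_000, "need more than 1 million rows in total"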

Step 2: Explore and Assess the Data

  • Explore the data to identify data quality issues, such as missing values and duplicate data; a profiling sketch follows this list.
  • Document the steps necessary to clean the data.
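
One way to approach the profiling and cleaning with pandas; the specific checks and cleaning rules below are assumptions to adapt to your own datasets:

    import pandas as pd

    def assess_quality(df: pd.DataFrame, name: str) -> None:
        # Profile one dataset for common quality issues.
        print(f"--- {name} ---")
        print("rows:", len(df))
        print("missing values per column:")
        print(df.isna().sum())
        print("duplicate rows:", df.duplicated().sum())

    def clean(df: pd.DataFrame) -> pd.DataFrame:
        # Candidate cleaning steps; document each one for the write-up.
        df = df.drop_duplicates()    # drop exact duplicate records
        df = df.dropna(how="all")    # drop rows that are entirely empty
        return df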

Step 3: Define the Data Model

  • Map out the conceptual data model and explain why you chose that model; a schema sketch follows this list.
  • List the steps necessary to pipeline the data into the chosen data model.
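
As an illustration only, a hypothetical star schema expressed as SQLite DDL; any relational engine, and any fact/dimension design that fits your use case, would do:

    import sqlite3

    # A hypothetical star schema: one fact table keyed into two dimensions.
    # All table and column names are illustrative, not prescribed by the project.
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE dim_city (
            city_id   INTEGER PRIMARY KEY,
            city_name TEXT NOT NULL
        );
        CREATE TABLE dim_date (
            date_id INTEGER PRIMARY KEY,
            date    TEXT NOT NULL UNIQUE
        );
        CREATE TABLE fact_arrivals (
            arrival_id INTEGER PRIMARY KEY,
            city_id    INTEGER NOT NULL REFERENCES dim_city(city_id),
            date_id    INTEGER NOT NULL REFERENCES dim_date(date_id),
            visitors   INTEGER NOT NULL
        );
    """)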

Step 4: Run ETL to Model the Data

  • Create the data pipelines and the data model.
  • Include a data dictionary.
  • Run data quality checks to ensure the pipeline ran as expected; a check sketch follows this list.
    • Integrity constraints on the relational database (e.g., unique key, data type, etc.)
    • Unit tests for the scripts to ensure they are doing the right thing
    • Source/count checks to ensure completeness
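
A sketch of the kinds of checks asked for above, assuming pandas tables; the table and key names are hypothetical:

    import pandas as pd

    def check_has_rows(df: pd.DataFrame, name: str) -> None:
        # Source/count check: an empty table usually means a broken load.
        if len(df) == 0:
            raise ValueError(f"quality check failed: {name} has no rows")

    def check_unique_key(df: pd.DataFrame, key: str, name: str) -> None:
        # Integrity check: the key column must be non-null and unique.
        if df[key].isna().any() or df[key].duplicated().any():
            raise ValueError(f"quality check failed: {name}.{key} is not a unique key")

    # Example usage on a (hypothetical) fact table:
    fact_arrivals = pd.DataFrame({"arrival_id": [1, 2, 3], "visitors": [10, 20, 30]})
    check_has_rows(fact_arrivals, "fact_arrivals")
    check_unique_key(fact_arrivals, "arrival_id", "fact_arrivals")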

Step 5: Complete Project Write Up

  • What's the goal? What queries will you want to run? How would Spark or Airflow be incorporated? Why did you choose the model you chose?
  • Clearly state the rationale for the choice of tools and technologies for the project.
  • Document the steps of the process.
  • Propose how often the data should be updated and why.
  • Post your write-up and final data model in a GitHub repo.
  • Include a description of how you would approach the problem differently under the following scenarios:
    • If the data was increased by 100x.
    • If the pipelines were run on a daily basis by 7 a.m. (a scheduling sketch follows this list).
    • If the database needed to be accessed by 100+ people.