datastacktv / data-engineer-roadmap Goto Github PK
Roadmap to becoming a data engineer in 2021
Home Page: https://datastack.tv
As we are in the cloud era, it's good to have knowledge of basic cloud architecture and the common services from AWS, Azure, etc.
I see you did mention a few cloud-based tools, but it's good to have a basic understanding of cloud services and how they relate to each other. Maybe a section named "Cloud fundamentals"?
What do you think?
Cheers!
Ozone moved into GA late last year and has seen some adoption since then.
It can handle billions of objects, so it is hailed as the replacement for HDFS, which struggles at around 400 million files.
Created a PR with updates to the markdown doc but someone else will need to fix the image.
Could you publish a text version? Even as an outline.
Can I ask which mind map app you used to create this beautiful outline? It's simply awesome! <3
BTW the structure and content of your course is refreshing to say the least. Just purchased your annual membership. Looking forward to more such great content.
It'd be great if you could add Kibana to visualize data. It's one of the popular components in the ELK stack.
Looking at the roadmap, it's overwhelming to see so many frameworks and technologies to learn.
My suggestion is to divide the technologies horizontally and vertically by years of experience. This would narrow down the roadmap, or at least give a clearer road map than just listing the tech stack. Along with that division, mention the projects to build for the corresponding years of experience.
The above is solely my opinion.
Love this graphic, and would love to use it (with attribution obviously). Could you clarify the license under which these images are distributed?
Hello, I haven't seen frameworks like Celery: https://docs.celeryproject.org/en/stable/index.html
Or spring https://spring.io/projects/spring-cloud-dataflow#overview
In my personal experience I had to create many batch pipelines using these. Now with Airflow I'm planning to migrate some of them, but there is still some legacy code I can't change :) so the knowledge to maintain it is necessary.
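For readers unfamiliar with the pattern, the worker-queue idea behind Celery can be sketched with the standard library alone. This is a hypothetical minimal version for illustration, not Celery's API; the "doubling" task stands in for real batch work.

```python
import queue
import threading

# Minimal task-queue sketch: a producer enqueues work items and worker
# threads consume them -- the core pattern behind Celery workers.
tasks: queue.Queue = queue.Queue()
results = []
results_lock = threading.Lock()

def worker() -> None:
    while True:
        item = tasks.get()
        if item is None:            # sentinel: shut this worker down
            tasks.task_done()
            break
        with results_lock:
            results.append(item * 2)  # stand-in for real batch work
        tasks.task_done()

workers = [threading.Thread(target=worker) for _ in range(2)]
for w in workers:
    w.start()

for n in range(5):                  # producer side
    tasks.put(n)
for _ in workers:
    tasks.put(None)                 # one sentinel per worker

tasks.join()                        # block until every item is processed
for w in workers:
    w.join()
```

Real Celery adds what this sketch lacks: a broker (e.g. Redis or RabbitMQ) so producers and workers can live on different machines, plus retries, scheduling, and result backends.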
The certificate for datastack.tv expired on 11/08/2021
If the site is no longer in use the link should be removed.
Hello,
Not sure how to edit a png to add a new pull request, so I'm creating this issue.
I believe GCP's Pub/Sub messaging system deserves to be under the "Messaging" section too.
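For context, the publish/subscribe decoupling that Pub/Sub provides can be sketched in-process. This hypothetical broker is only an illustration of the pattern, not the GCP client API: publishers and subscribers share a topic name, never a direct reference to each other.

```python
from collections import defaultdict
from typing import Callable

class Broker:
    """Minimal in-process publish/subscribe broker (illustration only)."""

    def __init__(self) -> None:
        self._subscribers: dict = defaultdict(list)

    def subscribe(self, topic: str, handler: Callable) -> None:
        # Register a callback for every future message on this topic.
        self._subscribers[topic].append(handler)

    def publish(self, topic: str, message: dict) -> int:
        # Deliver to every subscriber of the topic; return delivery count.
        handlers = self._subscribers.get(topic, [])
        for handler in handlers:
            handler(message)
        return len(handlers)

received = []
broker = Broker()
broker.subscribe("clicks", received.append)
broker.publish("clicks", {"user": "a", "page": "/home"})
```

Real systems like GCP Pub/Sub add durability, at-least-once delivery, and acknowledgements on top of this basic topic fan-out.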
Hello! Congratulations on such an awesome roadmap. I think a data engineer should know about the Lambda architecture and the Kappa architecture; these are base architectures to start building custom data processing architectures for specific problems. Here are some resources:
Great project! I would suggest replacing Pulumi with the AWS CDK: https://github.com/aws/aws-cdk. Its variants, cdk8s and CDK for Terraform, already have incredible utility for how long the projects have existed, and in my opinion the CDK is the dominant player in the Infrastructure as Code space.
Not really a technical skill, but considering that most tech companies have adopted agile methodologies, I think having some knowledge of how Scrum or Kanban works is also an important skill for any data engineer.
Would there be interest in creating cloud-specific versions of the roadmap that go into more specific detail for each product choice?
I work at Google Cloud so would be happy to contribute towards that.
I think this is a great way to show all the options and now use this as reference for the wider ecosystem when people ask so thank you for creating this.
Hi,
How about including Azure Synapse Analytics among the modern data warehouse solutions?
Cheers,
Kostas
digdag is a nice workflow scheduler, and much easier to set up than Airflow.
This is an interesting and somewhat controversial point, but why IaC? I am very active in this area, yet I have found that it is still not that well established; even cloud engineers and infrastructure teams don't always follow it.
Still, these tools are very useful in everyday work.
So all applications have business processes. I saw you mentioned workflow scheduling, but can that also be used for a BPM kind of system?
Right now this is more an invitation to discussion than a request.
What modeling techniques does a data engineer need and for what use cases? Does anybody do simulation before actually designing a system / solution? If yes: what are the tools / approaches?
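On the simulation question: the core of discrete-event simulation, which libraries like simpy implement with generators, can be sketched with just the standard library. The jobs and timings below are made up for illustration; this is the idea, not simpy's API.

```python
import heapq

# Minimal discrete-event simulation loop: events are (time, seq, action)
# tuples kept in a heap ordered by simulated time. The seq counter is a
# tie-breaker so heapq never has to compare the callables.
def simulate(until: float):
    clock = 0.0
    seq = 0
    events = []
    log = []

    def schedule(delay, action):
        nonlocal seq
        heapq.heappush(events, (clock + delay, seq, action))
        seq += 1

    def arrival(n):
        # A toy process: jobs arrive every 2 simulated time units.
        log.append((clock, f"job-{n} arrives"))
        if n < 3:
            schedule(2.0, lambda: arrival(n + 1))

    schedule(0.0, lambda: arrival(1))
    while events and events[0][0] <= until:
        clock, _, action = heapq.heappop(events)
        action()
    return log
```

simpy wraps exactly this loop in a nicer generator-based interface (`env.process`, `yield env.timeout(...)`), which is why it is worth trying before hand-rolling a simulator.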
Following potential use cases came to my mind:
Python: there is a library called simpy. Does anybody have experience using it? Also, diagrams (e.g. UML) can be used to do logical modeling for almost everything: state diagrams, data flow, components, etc.

Hi everyone,
I LOVE THIS. Thanks!
I would humbly add, from my experience, 3 domains:
Hey there,
Thanks for putting together this awesome resource! I’d strongly suggest adding GitLab Pipelines to the CI/CD section. It’s an extremely useful platform and, as far as I know, it’s the competition that actually prompted GitHub Actions to emerge.
Hope this helps!
As a data engineer, one works with a lot of storage services, which can be block storage or object storage. So maybe you could mention storage systems like:
I think it would be interesting to refine the knowledge of algorithms with Big O and Big Theta notations and code complexity.
"Algorithms" on its own seems vague. Moreover, things like SOLID and clean code would help as well.
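As a concrete example of why the notation matters day to day, membership tests have different complexity on Python's built-in containers: O(n) for a list, amortized O(1) for a set. The sizes below are arbitrary, chosen only to make the gap measurable.

```python
import timeit

# Membership test: the list must be scanned element by element (O(n)),
# while the set does a hash lookup (O(1) on average).
n = 100_000
data_list = list(range(n))
data_set = set(data_list)
needle = n - 1  # worst case for the list: the last element

t_list = timeit.timeit(lambda: needle in data_list, number=100)
t_set = timeit.timeit(lambda: needle in data_set, number=100)
```

On any machine `t_set` comes out orders of magnitude smaller than `t_list`, and the gap grows linearly with `n` — which is exactly what the Big-O analysis predicts.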
AWS CloudFormation as a general recommendation doesn't make sense in my opinion. I see Terraform with much more usage and a bigger community, plus it is cloud agnostic.
I am a complete beginner who decided to follow the roadmap a couple of months ago. Sharing a few books that helped me get started.
I am a self-learner who is looking forward to receiving further support on next steps.
Excellent, it would be great to have this in Russian! Nice.
Somewhere early in the tree maybe mention markdown? Useful for documentation, github issues, Jupyter.
Hello,
I would suggest including encryption in transit (SSL/TLS) in the networking part, and refining the data security & privacy section with encryption at rest.
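To make the "in transit" half concrete, Python's `ssl` module shows what a sane client-side TLS configuration looks like: the defaults from `create_default_context()` already enforce the properties that encryption in transit requires.

```python
import ssl

# Encryption in transit, client side: the default context validates the
# server certificate against trusted CAs and checks that the hostname
# matches the certificate -- both are required for real transport security.
ctx = ssl.create_default_context()

assert ctx.verify_mode == ssl.CERT_REQUIRED  # reject unverified certificates
assert ctx.check_hostname is True            # reject mismatched hostnames

# The minimum protocol version can also be raised explicitly (TLS 1.2+):
ctx.minimum_version = ssl.TLSVersion.TLSv1_2
```

A context configured this way is what you would pass to `socket` wrapping or to HTTP clients; weakening any of these defaults (e.g. `verify_mode = CERT_NONE`) silently gives up the guarantees the roadmap section would be teaching.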
Hi,
First and foremost, nice job on characterizing concepts and the fields. I really liked the picture.
On the issue itself: Why do you characterize DynamoDB as Key-Value and not as Wide-Column?
If I were asked to characterize the difference, I would say that a key-value store (like Redis or RocksDB) is one where you know nothing about the value part (except maybe its datatype), whereas a wide-column store is still a key-value store, since you always need a primary key (aka partition key), but one where you can characterize the value into multiple sub-columns and have secondary indexes (aka sort keys).
At least someone in Wikipedia agrees with me.
Am I missing something?
Thanks
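The distinction drawn above can be sketched with plain Python dictionaries. The data is hypothetical and this is not DynamoDB's actual API; it only illustrates what the engine can and cannot "see" in each model.

```python
# Key-value store: the value is an opaque blob -- the engine knows
# nothing about its internals, so it can only get/put whole values by key.
kv_store = {
    "user:42": b"opaque serialized blob",
}

# Wide-column store: still keyed by a (partition key, sort key) pair,
# but the value is structured into named columns the engine understands.
wide_column_store = {
    ("user:42", "2021-08-01"): {"name": "Ada", "country": "UK", "plan": "pro"},
    ("user:43", "2021-08-02"): {"name": "Bob", "country": "DE", "plan": "free"},
}

# Because the engine can see the columns, it can maintain a secondary
# index -- just another lookup keyed by a column value:
by_country: dict = {}
for (partition_key, _sort_key), row in wide_column_store.items():
    by_country.setdefault(row["country"], []).append(partition_key)
```

In the key-value case no such index is possible without deserializing every blob in application code, which is the practical heart of the distinction.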
For a modern data engineer, knowledge of concurrency models is important:
- threading vs multiprocessing: what are the differences, and what problems does Python have with threading.
- Workflow engines (example: Apache Airflow) vs state machines (example: Amazon Step Functions) vs ... This is actually covered by 'Data structures and algorithms', but it might be good to mention it as an example of how knowledge of them can be helpful for a data engineer.
- CUDA on GPU.

If you agree on at least some of the points, I can prepare the text.
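On the threading point above, a minimal sketch of what CPython threads do and don't give you: shared memory and safe coordination, but no parallel speed-up for CPU-bound work under the GIL. The summing workload is a toy stand-in for real computation.

```python
import threading

# Threads share the interpreter's memory: both workers append to the
# same list, so mutation must be guarded with a lock. Processes would
# each get their own copy, which is why multiprocessing needs explicit
# IPC (queues, pipes) to exchange results.
#
# Under CPython's GIL only one thread executes Python bytecode at a
# time, so this CPU-bound work gains nothing from threading; threads
# only pay off for I/O-bound work, and multiprocessing for CPU-bound.
results = []
lock = threading.Lock()

def worker(n: int) -> None:
    total = sum(range(n))   # CPU-bound: serialized by the GIL
    with lock:              # shared state needs synchronization
        results.append((n, total))

threads = [threading.Thread(target=worker, args=(n,)) for n in (10, 100)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

Swapping `threading.Thread` for `multiprocessing.Process` would run the two sums in parallel, but `results` would then stay empty in the parent: each process mutates its own copy, which is exactly the trade-off the bullet point asks candidates to understand.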
Does "CS Fundamentals" mean Cloud Storage?
I believe these are very old as well as very mature relational database solutions and should be added to this roadmap.
I know this is a Data Engineer roadmap and not a Data Science roadmap; however, I still think that maths and statistics should have their own box in the roadmap (even though they could be included in "Fundamentals").
Being a data engineer without enough math and statistics knowledge falls into the "Danger Zone" of Drew Conway's diagram: http://drewconway.com/zia/2013/3/26/the-data-science-venn-diagram
Other than that, your roadmap is awesome @alexandraabbas, thanks a lot.
Where do tools like OpenNebula fit in?
https://opennebula.io/