Biggudeta-Kyokyoku

This project is made for personal learning, and to showcase an instance of an end-to-end solution in a Big Data environment. Although the solution discussed here is specific to one problem, it covers some of the basics from an end-to-end perspective. Basic knowledge of Scala-Spark (PySpark is also enough), Linux, Docker, virtual machines, Ansible etc. can be handy. Before we dive into the solution, let us discuss a few things about the technology stack used. It is just a brain dump and I don't intend to explain each of these technologies.

Unix to GNU to Linux Kernel

I am not sure why I am explaining this here, but I am always interested in some history - of Linux: It started when Unix was created at Bell Labs by Ken Thompson and Dennis Ritchie as a multi-user, multi-tasking operating system. As Unix became increasingly proprietary, Richard Stallman started GNU (GNU's Not Unix), a free Unix-like system, but its own kernel was not ready for practical use. Later on, Linus Torvalds came up with the super-cool, now-highly-used Linux kernel and released it under the GNU GPL, so it could be combined with the GNU tools to become what we commonly call Linux. Distributions include Red Hat, Ubuntu, DSL (Damn Small Linux) etc. Google's Android is also built on the Linux kernel.

Which distribution to use?

It depends on what you want to do with your computer. Every distribution targets a specific need. For instance, if you want enterprise-level support, go for Red Hat; if you want a really fast, small, lightweight Linux, go for DSL, and so on. In our example, we will use CentOS. By the way, is Linux free? The answer is that open source software doesn't always mean free-of-charge software. Vendors get paid, for example for support, and that applies to Linux too.

Virtual Machines - basics..

If we need virtual machines running on a host, a hypervisor has to manage the host resources given to each of those VM instances. Each VM can have a separate OS (with its own kernel), a separate /usr/lib, and specific configurations and capabilities. Each one gets a pre-allocated share of the host resources, managed through the hypervisor. This has its overhead. Various virtualisation products, such as VirtualBox, are available for you to set this up in your environment. It is also important to know about Vagrant: it is a command line utility for managing the life cycle of virtual machines, and it can provide a portable and reproducible development environment using virtual machines.

Docker - My understanding..

Docker works on the concept of containerisation. So how does it differ from VMs? Firstly, it is not a replacement for VirtualBox and virtual machines. Secondly, it can solve a lot of the problems that VMs used to solve, with less overhead. It spins up containers directly on the Linux kernel. Multiple containers share the host kernel, so they use its resources more efficiently than VMs that each boot their own OS. Also, having a container for each of your processes or applications makes them easy to manage. You can spin up as many containers as you want; each one is spun up from a Docker image and makes use of the host Linux kernel. You start a container from an image with a docker run command. The command is issued from the client, which talks to the Docker server (the daemon), which in turn spins up the container for that application - as simple as that. The whole notion of dockerisation becomes a real value add in cluster computing. You have a client talking to a manager such as Docker Swarm or Mesos, which manages containers on different nodes through the Docker server on each node. You can spin up as many containers as you want, and it is really scalable.
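
To make the docker run step a bit more concrete, here is a minimal Python sketch that only builds the command lines a client would send, one per container. The image name centos:7 and the container names node1..node3 are my assumptions for illustration, not something this project defines, and nothing is executed against a Docker server here.

```python
import shlex


def docker_run_command(image, name, detach=True):
    """Build the argv list for a `docker run` invocation (nothing is executed)."""
    cmd = ["docker", "run"]
    if detach:
        cmd.append("--detach")  # long form of -d: run the container in the background
    cmd += ["--name", name, image]
    return cmd


# Hypothetical names: three containers acting as "nodes", all from the same image.
commands = [docker_run_command("centos:7", f"node{i}") for i in range(1, 4)]

for cmd in commands:
    # shlex.join (Python 3.8+) renders the argv as a copy-pastable shell line.
    print(shlex.join(cmd))
```

To actually start the containers you could pass each list to subprocess.run(cmd, check=True) on a machine where the Docker client and server are available. An image with no long-running command will exit straight away, so a real setup would also need a command or entrypoint that keeps the container alive.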

Host and docker container

The host knows (ps -ef) what is running in each of its containers. You can notice that the process ID of a container process as seen from the host is different from the one you see for the same process from inside the container. However, when you are inside the container and do a ps -ef, you can't see anything that runs on the host. I think that explains the abstraction - maybe a formalised view of a container is that it handles only your application and only your process, without making any impact on the host - hopefully.
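
The two different process IDs can be seen side by side on the host. On Linux 4.1 and later, /proc/<pid>/status has an NSpid line that lists the PID of a process in each nested PID namespace, outermost first. Below is a small Python sketch that parses that line. The sample text and the PID values in it are made up for illustration.

```python
def namespace_pids(status_text):
    """Return the PIDs on the NSpid line of a /proc/<pid>/status dump.

    The first value is the PID in the outermost namespace (the host's view),
    the last one is the PID inside the innermost namespace (the container's view).
    """
    for line in status_text.splitlines():
        if line.startswith("NSpid:"):
            return [int(field) for field in line.split()[1:]]
    return []  # kernel too old, or the line is missing


# Made-up sample: a process that is PID 4321 on the host and PID 1 in its container.
sample = "Name:\tjava\nPid:\t4321\nNSpid:\t4321\t1\n"
print(namespace_pids(sample))  # [4321, 1]
```

On a real host you would read the text with open(f"/proc/{pid}/status").read() for a PID taken from ps -ef. A process that is not in any container shows a single value on that line.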

Anti docker patterns

I would say it is better not to mutate our container. This is my own understanding, and hopefully it stays relevant while I explore more advanced concepts and tech in the Docker world in future. The container, ideally, should handle one process at a time. Oh wait! Really? This might be a rather hard rule to satisfy when we apply dockerisation in a Big Data environment. Assume that we need 5 nodes as 5 containers, and each container should run a data node, task tracker and so forth as the essential daemons of Big Data - a situation that contradicts the above statement. We call it an anti-Docker pattern. Here, we first spin up the container and then keep adding processes to it - as part of installation - as part of advancement - as part of upgrades. That's basically mutating the container with no trace of its state transitions. Since the rule is not formally written down anywhere, it is a pattern that enterprises can still follow. So.... this project is an anti-Docker pattern if we want to name it :D

Ansible

Ansible is an IT automation tool, mainly intended to configure systems, deploy software and orchestrate more advanced IT tasks such as continuous deployment. Please note that it is just a configuration management system, which may work alongside authentication systems such as Kerberos, LDAP and so forth. I went with Ansible since I liked its simplicity and was able to use it and deploy with it while knowing very little about it. If you define the inventory and the set of operations in a yml file, you are almost there. In the inventory, you can define how many nodes have to be spun up, and assign which nodes should be the master node, namenode, secondary namenode, datanodes and task tracker. Then in the yml file, we can configure, for each node, the containers we need to spin up. In short, each node can be a container with multiple processes running in it. It is easy to test too, and even to change the system from Docker to actual virtual machines. All you need to do is change the yml file from Docker containers to actual VMs, and change the inventory to the VM host names. VMs are closer to an actual cluster - a set of VMs that can act as multiple nodes. However, let us go with containerisation for easy testability and the fact that it can be changed to a VM-based setup at any time.
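
As a rough illustration of the inventory idea, the Python sketch below renders a mapping of roles to hosts in the INI style that Ansible inventories commonly use ([group] followed by one host per line). The group names and the node1..node5 host names are hypothetical and are not taken from this project's actual inventory.

```python
def render_inventory(groups):
    """Render {group: [hosts]} as an INI-style Ansible inventory string."""
    lines = []
    for group, hosts in groups.items():
        lines.append(f"[{group}]")  # group header, e.g. [datanodes]
        lines.extend(hosts)         # one host name per line
        lines.append("")            # blank line between groups
    return "\n".join(lines).rstrip() + "\n"


# Hypothetical layout: five nodes split across three roles.
groups = {
    "namenode": ["node1"],
    "secondarynamenode": ["node2"],
    "datanodes": ["node3", "node4", "node5"],
}

inventory = render_inventory(groups)
print(inventory)
```

Moving from containers to VMs would then mostly mean replacing the host names in the groups dictionary with the VM host names, which is the switch described above.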

Spark-Scala

https://github.com/afsalthaj/supaku-sukara

This is my personal repo and exists for the personal purpose of learning Scala, functional programming and Big Data with Spark, with the intent to share it with my friends later on.

Reference

https://github.com/Ranjandas/hadoop-yarn
