This README documents my learning process for Hadoop.
I initially started learning about Big Data in general, but homed in on Hadoop after some research.
-
Decided to learn NoSQL and Big Data 🥳
-
Watched video comparing relational databases (which I know) to NoSQL databases (which I don't know)
-
Started a course on Big Data + Hadoop
-
Started a MongoDB crash course, as I've already used MongoDB for work, and for personal web projects
-
Did more research and decided to learn Hadoop instead
-
Learned about the evolution of technology that led to Big Data
-
Learned the definition of Big Data: datasets so large and complex that they can't be processed using traditional tools
-
Learned about the 5 Vs of Big Data: Volume, Velocity, Variety, Value, Veracity
-
Decided to start a more practical course, as my current Hadoop course is very theoretical and lecture-based
-
Started downloading the HDP Sandbox for VirtualBox (6-hour download time)
-
Sandbox won't import from downloaded file
↳ Error "Failed to import appliance C:/Users/nikun/Downloads/HDP_2.5_virtualbox.ova.
Result Code: E_INVALIDARG (0x80070057)"
-
Searched the internet extensively for a solution
-
Finally found a workaround: extracting the VMDK file and running it separately
-
Workaround didn't work; it turns out the VirtualBox file is corrupted, so I'll download it again overnight
-
Downloaded the HDP Sandbox again
-
Got further along than before
-
Ran into a memory error
-
Stack Overflow tells me I don't have enough RAM
↳ Need 8GB of free memory, but I only have 8GB of total memory
-
Finally got Hadoop running after upgrading my RAM
-
Navigated to the localhost port provided by Ambari, but I'm seeing loads of errors
-
Errors fixed themselves (the processes must have been initialising)
-
Uploaded 2 movie record databases using the Hive interface
-
Used Hive's SQL-like syntax to query for the most popular movie up to 1998
↳ The winner was Star Wars (1977)
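The query itself boiled down to grouping ratings by movie and counting them. The same aggregation can be sketched in plain Python; the sample rows and "most-rated = most popular" definition below are my own stand-ins for the MovieLens-style data, not the actual uploaded tables.

```python
from collections import Counter

# Hypothetical sample of (user_id, movie_title, rating) rows, standing in
# for the movie ratings data uploaded through the Hive interface.
ratings = [
    (1, "Star Wars (1977)", 5),
    (2, "Star Wars (1977)", 4),
    (3, "Star Wars (1977)", 5),
    (1, "Toy Story (1995)", 4),
    (2, "Toy Story (1995)", 3),
]

# "Most popular" here means most-rated: count ratings per title and take
# the top one, which is what a GROUP BY title / ORDER BY COUNT(*) DESC
# style HiveQL query computes.
counts = Counter(title for _, title, _ in ratings)
most_popular, num_ratings = counts.most_common(1)[0]
print(most_popular, num_ratings)  # Star Wars (1977) 3
```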
-
Decided to brush up on my SQL and started a separate SQL course
-
Completed the SQL course
-
Returned to Hive, where I'm now extremely comfortable using its SQL-like syntax
-
Used the HDFS web interface to upload and view data files through its explorer
-
Opened a command-line SSH connection to the HDP Sandbox using PuTTY
-
Created and deleted data files using the command line
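A sketch of the kind of `hdfs dfs` session this involves, run over the PuTTY connection into the sandbox. The subcommands are standard Hadoop filesystem shell commands, but the user, directory, and file names are hypothetical examples.

```shell
hdfs dfs -mkdir -p /user/maria_dev/ml-100k    # create a directory in HDFS
hdfs dfs -put u.data /user/maria_dev/ml-100k  # upload a local file into it
hdfs dfs -ls /user/maria_dev/ml-100k          # list the directory contents
hdfs dfs -rm /user/maria_dev/ml-100k/u.data   # delete the uploaded file
hdfs dfs -rmdir /user/maria_dev/ml-100k       # remove the now-empty directory
```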
-
Learned about MapReduce on a conceptual level: mapper, shuffle & sort, reducer
-
Also learned how MapReduce works across a cluster
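The mapper / shuffle & sort / reducer flow above can be imitated in a single Python process with a classic word count. This is a toy sketch of the concept, not Hadoop's actual API; all function names here are my own.

```python
from itertools import groupby
from operator import itemgetter

def mapper(line):
    # Map phase: emit a (word, 1) pair for every word in the line.
    for word in line.lower().split():
        yield (word, 1)

def reducer(word, counts):
    # Reduce phase: sum the 1s emitted for this word.
    return (word, sum(counts))

def map_reduce(lines):
    # Run all mappers, then sort by key (the "shuffle & sort" step,
    # which on a real cluster routes each key to one reducer node),
    # then run one reducer call per distinct key.
    mapped = [pair for line in lines for pair in mapper(line)]
    mapped.sort(key=itemgetter(0))
    return dict(
        reducer(word, (count for _, count in group))
        for word, group in groupby(mapped, key=itemgetter(0))
    )

print(map_reduce(["the quick brown fox", "the lazy dog the end"]))
```

On a real cluster the mappers and reducers run in parallel on different nodes; the shuffle & sort step is what lets each reducer see every value for its key.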
-
Discovered VirtualBox Snapshots on my own, which skip the long wait for the Ambari dashboard to load all processes
-
Pausing Hadoop to learn Kafka! https://github.com/johnobla/kafka