
# 5 MapReduce+

Q. What commands must be used to run your scripts?

```bash
python3 OneStepMean.py -r hadoop hdfs:///data/HT_Sensor_dataset.dat > output_OneStepMean.txt
python3 TwoStepMean.py -r hadoop hdfs:///data/HT_Sensor_dataset.dat > output_TwoStepMean.txt
python3 AllColumnsMean.py -r hadoop hdfs:///data/HT_Sensor_dataset.dat > output_AllColumnsMean.txt
```

Q. What technical errors did you experience?

While working on the extra task, I created a smaller version of the .dat file containing only 10 rows of the original data. The programs worked fine when tested against that test data, but they kept getting killed or failing to connect when run against the original dataset; the runs succeeded about 1 time out of 10. I made sure I had the latest Docker images for hadoop-resourcemanager and hadoop-namenode, so I suspect the problem is related to my CPU or memory configuration. I'm still in debugging mode.
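For anyone reproducing this, here is a minimal sketch of how such a test file can be built before submitting anything to Hadoop. The input file name follows the commands above; the test file name and the header assumption are mine:

```python
# Build a small test file from the full dataset: keep the header line
# plus the first 10 data rows. "HT_Sensor_test.dat" is a made-up name.
with open("HT_Sensor_dataset.dat") as src, open("HT_Sensor_test.dat", "w") as dst:
    for i, line in enumerate(src):
        if i > 10:  # line 0 is assumed to be the header, lines 1-10 are data
            break
        dst.write(line)
```

Running a script on that file without `-r hadoop` (for example `python3 OneStepMean.py HT_Sensor_test.dat`) uses mrjob's default inline runner, which is a quick sanity check before involving the cluster at all.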

Q. What conceptual difficulties did you experience?

Even after reading all the documents about map and reduce, when it came to actual practice, imagining how partitioning works and how the data move through each map and reduce phase was quite confusing. I was not quite sure how to create the partitions or how to assign data to a partition.

Also, using a .dat file rather than a .csv file made me struggle a bit, as the data weren't easy to work with. A sketch of what eventually made partitioning click for me follows below.
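The thing that helped: the mapper's output key is what decides which reducer (partition) a record lands in. A minimal sketch of a single-step mean, assuming a whitespace-delimited .dat file and a hard-coded column index (both assumptions; the real scripts are in this repo):

```python
from mrjob.job import MRJob

class OneStepMeanSketch(MRJob):
    """Single-step mean of one column under a single constant key."""

    def mapper(self, _, line):
        fields = line.split()
        try:
            value = float(fields[2])  # column index is an assumption
        except (IndexError, ValueError):
            return  # skip the header row and malformed lines
        # The output key decides the partition: one constant key means
        # every value is routed to the same reducer.
        yield "mean", (value, 1)

    def combiner(self, key, pairs):
        # Pre-aggregate on each mapper node to cut shuffle traffic.
        total = count = 0
        for v, n in pairs:
            total += v
            count += n
        yield key, (total, count)

    def reducer(self, key, pairs):
        total = count = 0
        for v, n in pairs:
            total += v
            count += n
        yield key, total / count

if __name__ == "__main__":
    OneStepMeanSketch.run()
```

Because every record is emitted under the same constant key, all values end up in one reducer; emitting several distinct keys is what spreads the work across partitions.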

Q. How much time did you spend on each part of the assignment?

About 5+ days.

Q. Track your time according to the following items: Gitlab & Git, Docker setup/usage, actual reflection work, etc.

Gitlab & Git: 5min

Docker setup/usage: 10min

Actual reflection work: Infinity

Q. What was the hardest part of this assignment?

The hardest part was figuring out how to get the number of items in the column, take its square root, and use that value as the number of partitions to create.
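For the record, the arithmetic itself is small once the count is known. This is a hedged sketch under the assumption that the item count comes from an earlier counting pass; the helper names are hypothetical, not from the solution:

```python
import math

def num_partitions(item_count):
    """Roughly sqrt(N) partitions for N items (illustrative only)."""
    return max(1, int(math.sqrt(item_count)))

def partition_key(row_index, item_count):
    """Bucket a row into one of ~sqrt(N) partitions by its index."""
    return row_index % num_partitions(item_count)
```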

Q. What was the easiest part of this assignment?

Opening Canvas.

Q. What did you actually learn from doing this assignment?

I learned how to use a multi-step process with MRJob to process large data. Mapping and partitioning the data and then reducing them in the Hadoop environment actually made the processing go a lot faster. This assignment involved a lot of confusion and frustration, but after seeing the solution and getting more projects and assignments done, I will get better at MapReduce jobs. Fingers crossed.
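A hedged sketch of what a multi-step job looks like with MRJob: step one emits partial sums per bucket, step two combines them into a single mean. This is illustrative, not necessarily identical to TwoStepMean.py; the column index and bucket count are assumptions:

```python
import zlib

from mrjob.job import MRJob
from mrjob.step import MRStep

class TwoStepMeanSketch(MRJob):
    """Step 1: partial sums per bucket. Step 2: combine into one mean."""

    N_BUCKETS = 4  # placeholder; could be derived from sqrt of the row count

    def steps(self):
        return [
            MRStep(mapper=self.mapper_bucket, reducer=self.reducer_partial),
            MRStep(reducer=self.reducer_global),
        ]

    def mapper_bucket(self, _, line):
        fields = line.split()
        try:
            value = float(fields[2])  # column index is an assumption
        except (IndexError, ValueError):
            return  # skip the header row and malformed lines
        # A deterministic hash spreads records across buckets so several
        # reducers can share the first round of summing.
        yield zlib.crc32(line.encode()) % self.N_BUCKETS, value

    def reducer_partial(self, bucket, values):
        total = count = 0
        for v in values:
            total += v
            count += 1
        # Re-key the partial results under one constant key for step 2.
        yield "mean", (total, count)

    def reducer_global(self, key, partials):
        total = count = 0
        for t, c in partials:
            total += t
            count += c
        yield key, total / count

if __name__ == "__main__":
    TwoStepMeanSketch.run()
```

The second MRStep has no explicit mapper, so mrjob passes step one's output straight through to the final reducer.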

Q. Why does what I learned matter both academically and practically?

Whenever I look at job descriptions for Data Science positions, one of the requirements companies list is Hadoop/Hive/Spark skills. I now understand why they consider these core qualifications for applicants. My goal is to be able to work with Hadoop and Spark freely.
