infochimps-labs / big_data_for_chimps

A Seriously Fun guide to Big Data Analytics in Practice

Ruby 95.66% PigLatin 3.33% Shell 0.01% HTML 1.00%

big_data_for_chimps's Introduction

Big Data for Chimps: A Seriously Fun guide to Terabyte-scale data processing

This is the work-in-progress version of the upcoming O'Reilly book, Big Data for Chimps: A Seriously Fun guide to Hadoop and Terabyte-scale data processing.

Our intent is to provide the best guide for exploratory data analytics using Hadoop -- for data science in practice. We use high-level languages (Pig and Ruby) that make Hadoop a tool, not a framework, allowing re-use and rapid development. We'll cover enough Hadoop internals to save you from diving into the source code, and enough tuning advice to let you know where to drill deep.

In all cases, the focus is on maximizing your time and creativity -- on helping you uncover what question to ask and the right way to ask it.

O'Reilly has courageously agreed to release the book under a http://creativecommons.org/licenses/by-nc-sa/3.0/[CC-BY-NC-SA] license. To buy a physical copy of the book, or a Kindle (.mobi) or iOS/Nook (.epub) edition, visit the early release http://shop.oreilly.com[O'Reilly bookstore] (TODO: link to early release page). Buy it now, and you'll get frequently-updated access and the final version once available.

License

This work is licensed under the Creative Commons Attribution-NonCommercial-ShareAlike 3.0 Unported License. To view a copy of this license, visit http://creativecommons.org/licenses/by-nc-sa/3.0/ or send a letter to Creative Commons, 171 Second Street, Suite 300, San Francisco, California, 94105, USA.

Code is Apache licensed unless specifically labeled otherwise.

big_data_for_chimps's Issues

Describe specific practical applications for each case study

Tie each case study to two-ish practical vertical-focused use cases. For example, for the server log chapter, describe application to ad tech conversion metrics and simple security intrusion detection.

These will typically be in the form of end-of-chapter examples (sketch of the solution and how you'd approach it, but not the solution itself; readership hopefully provides).

/cc @dhruvbansal @joeman @timgasper -- super practical hardnosed use cases invited

It's a Big Storm, is it as Big as Two Chapters, or More?

This is more of a question than I can answer, so hopefully your git folks can speak up here, as well as future reviewers: How many chapters do you plan to dedicate to Storm? (Looks like 2 now.) Is the order right, as is? -Amy

Begin Each Chapter with Two-Flavor Introduction

Suggest that you start each chapter by warming up the reader: introduce the chapter's topic and the concepts surrounding it. Second, use the start of each chapter to talk about "locality" and wrangle that concept, going deeper into it with each chapter, carrying the reader along. (I made notes on this in your files at the start of each chapter.) -Amy

Needs Why You Do What You Do.

Tell us why you get up in the morning and choose to do this stuff.

Feedback for early release

"First Exploration" Must Haves:
*Either complete, or delete, sentences that don't end (like the one on page 15 about Austin).
*The section that starts "Why" on page 16 has to get rid of your instructions to yourself and become a real section
*"Plot of this story" on page 17 has to be an actual section, not a numbered list
*"Examplars and Touchstones" has a bunch of unfinished sentences/thoughts. Those have to be fixed.
*On page 18 there are two random headers and then an image. The image needs to be put in context and the headers need to turn into sections or be deleted.

"The Stream" Must Haves:
*Take a look for any sentences that don't end and either finish or delete them.
*Is the final bullet in the Ruby Helper box on page 24 complete?
*Add some more detail to the "Running on the Cluster" section on page 27

"Chimpanzee and Elephant Save Christmas" Must Haves:
*Take a look for any sentences that don't end and either finish or delete them.
*Make the "The Reducer Guarantee" section on page 40 more than just bullet points - either give more context, or turn them into body text.
*Same for the "Partition Key and Sort Key" section on page 41
*Right now, the chapter just kind of ends, give it more of an ending. Turning the last section into a full section will likely help with this.

That's what we need to turn those chapters into the first early release of the book.

Thoughts on "HBase Data Model" chapter:

*The bulleted list on page 122 needs more context - some sort of intro or explanation rather than just starting it up.
*The "Note" on page 123 is much too long. Make it part of a section (or one of its own), and you can include the most "noteworthy" part of it as a smaller note within that.
*The first footnote on page 124 should be part of the text.
*Page 129 has unfinished sentences and a note that isn't a note.
*The note on page 130 should be part of the text.
*The note on page 132 is also much too long. Make that part of another section as well, or its own section, with a smaller note within.
*The bulleted list in the "IP Address Geolocation" section on page 135 needs context
*Does Table 16-5 really need to be its own section?
*"Review of HBase options" on page 137 needs to be a real section, not just bullets and todo's
*Same for "Feature Set review" and “Design for Reads”

Overall, this chapter looked really good. If you could also get these things done over the weekend, I'd like to make it part of the first release (give it some meat).

Clarify Hadoop's advantages over a database

From @kromerma reading first few chapters:

First chapters feedback

explain locality means disk head moving & also computer-computer moving
put in "numbers everyone should know"
move first section to end
expand on the "Hadoop is good at" list: unruly, huge, connected, dimensional
using the wikip data, show things db is good at & move to progressively less and less
- dimensional: with lots of properties am effectively making table on fly
- connected: flights -> airports -> geo -> weather stations -> weather // time
in "since there have been ten computers" part: Don't use phrase "talk to", say "Access data from"
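
One way to make the locality point concrete is a back-of-envelope calculation. The sketch below is only an illustration: the seek time and throughput constants are assumed round figures, not measurements, and the method names are hypothetical. It also treats the disk read and the network transfer as happening one after the other, which is a simplification, since a real cluster would overlap them.

```ruby
# Assumed round figures for illustration only; measure your own hardware.
SEEK_SECONDS       = 0.010           # one disk head move, about 10 ms
DISK_BYTES_PER_SEC = 100 * 1024**2   # sequential disk read, about 100 MB/s
NET_BYTES_PER_SEC  = 100 * 1024**2   # computer-to-computer link, about 100 MB/s

# Local, sequential: the disk head moves once, then streams.
def sequential_read_seconds(bytes)
  SEEK_SECONDS + bytes.to_f / DISK_BYTES_PER_SEC
end

# Local, scattered: the disk head moves once per chunk.
def random_read_seconds(bytes, chunk_bytes)
  chunks = (bytes.to_f / chunk_bytes).ceil
  chunks * SEEK_SECONDS + bytes.to_f / DISK_BYTES_PER_SEC
end

# Remote: read sequentially on one computer, then ship to another.
def remote_read_seconds(bytes)
  sequential_read_seconds(bytes) + bytes.to_f / NET_BYTES_PER_SEC
end

one_gib = 1024**3
puts format("sequential, local: %.2f s", sequential_read_seconds(one_gib))
puts format("sequential, remote: %.2f s", remote_read_seconds(one_gib))
puts format("random 4 KiB reads: %.2f s", random_read_seconds(one_gib, 4096))
```

Under these assumptions, the scattered read is slower than the other two by orders of magnitude, which is the "disk head moving" half of locality. The remote read shows the "computer-computer moving" half.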

After E&C Inc. save xmas:

Walk through carefully the geo-flavor example:

chimpanzee is labelling records to go to same place
... turning one record into zero one or many (spam, solo, gift for me & sis)
(also, unstructured => structured)
practical example: market basket analysis
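
As a minimal sketch of the walk-through above, here is the idea in plain Ruby, deliberately not using Wukong or Hadoop. The hypothetical `label_letter` method plays the chimpanzee: it turns one unstructured letter into zero, one, or many labeled records. `group_by_label` stands in for the step that sends equal labels to the same place. The letter format and all names are invented for illustration.

```ruby
# Hypothetical input: unstructured letters, one string each.
LETTERS = [
  "SPAM: cheap bananas, click here",
  "Dear Santa, robot for Joe in Austin",
  "Dear Santa, doll for Ann in Boston; pony for Sue in Austin",
].freeze

# Map step: one letter in; zero, one, or many [city, gift] records out.
# The city is the label that decides where each record goes.
def label_letter(letter)
  return [] if letter.start_with?("SPAM")   # spam yields zero records
  letter.scan(/(\w+) for \w+ in (\w+)/).map { |gift, city| [city, gift] }
end

# Simulated shuffle: records with the same label end up together.
def group_by_label(records)
  records.group_by(&:first).transform_values { |pairs| pairs.map(&:last) }
end

labeled = LETTERS.flat_map { |letter| label_letter(letter) }
grouped = group_by_label(labeled)
grouped.each { |city, gifts| puts "#{city}\t#{gifts.join(',')}" }
```

The same shape would fit the market basket example: emit one record per pair of items in a basket, labeled by the pair, and then group by that label to count how often items are bought together.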

What Tragedy Befell Chimp and Elephant?

It makes sense to abandon Chimp and Elephant once you progress into the book; however, you drop them too coldly. As you revise, add a note to the reader in the last chapter that uses C & E, letting the reader know that, while they've been useful characters, we won't be seeing them any more. That said, is there anything stopping you from weaving a brief example that uses C & E into later chapters? My point is, they get dropped too suddenly, and that's a loose thread. There are various ways to tie it up. -Amy

Operating System Environment for Examples

Hi,

Are the examples in the book meant to be run in the Linux operating system? Is it possible to run the examples in Windows? Thanks.

Shipment #1: 'Data Formats' and 'About'

About:

Flesh out the dangling sentence in 'Who for'
Move 'Who For'/'Who Not For' to the top.
Describe the fun we'll all be having with example code and exercises more fully

Data Formats

TODO

/cc @shoogie

Front Matter and Preface

Add a b-level header just after My Questions for You titled, "Chimpanzee & Elephant" and include a paragraph there in which you explain who Chimpanzee and Elephant are and how you will be using them. Explaining that upfront will save the reader from having to wait till later to understand what that's all about.
Add a few (brief) examples of the types or categories of "hard problems" you mention under About - What this Book Covers.
Clear up what a reader may get stuck on as a possible contradiction between "This is not a beginner's book" and "If you're a beginning user" mentioned three paragraphs later in the same section, About - What this Book Covers.
Add four b-level heads to the Who This Book is For section. Title the first three "Ruby," "Hadoop," and "Wukong," and say a bit about each. That is, say why you chose Ruby, put the paragraph about "All of the code in this book will run unmodified..." under "Hadoop," etc. The fourth header should be "Essential Hadoop Reading," under which you can list books, sites, etc. that you think the reader should understand before moving on.
Rename "Probable Contents" to just "Contents" to avoid reader doubt and anxiety. This section will appear in the final book as "What's In This Book?" Also, write each summary as a prose paragraph (like you do for First Exploration (Ch. 1) and Simple Transform (Ch. 2)) rather than as a bulleted list. (I have a feeling you plan to do this, but wanted to be sure to note it.)