
Hi there

So GitHub recently enabled personal GitHub READMEs, which I think is a great idea. I've been wanting an excuse to start writing about things I work on and things I've been learning and reading, and this is a perfect reason to start: it combines my love of all things Data Science with my love of writing and teaching/explaining concepts to people.

  • 🔭 I'm currently working on computer vision projects (2D, 3D, video), time-series forecasting, NLP tasks like sentiment analysis on alternative data sources via Transformer models, and applying deep learning techniques to good ol' fashioned tabular datasets, signal processing, and AutoML.

  • 🌱 I'm currently learning pytorch_tabnet, NLP tasks with BERT, and how to classify pulmonary embolisms from 3D CT medical image data for the Kaggle competition.

  • 💬 You can often find me ... learning some new machine learning technique, reading and writing code, or thinking about how to solve a problem I'm working on while riding my bike.

  • 📫 How to reach me: ... chris dot poptic @ gmail dot com

  • ⚡ Fun fact: I logged over 10,000 miles riding my bike and won the Ohio State Road Race Championship a few years back.

-- A few blogs I follow (more to come as I add to this list):

Sebastian Raschka's excellent blog. Also, follow him on Twitter.

TODAY I LEARNED / THINGS I'M READING

2020-09-14

One thing that we ML engineers must prioritize is writing "production-ready code". That term has been thrown around a lot, and for good reason; code that integrates well with an app and can be used by a) other devs on your team or b) end-users is the only type of code that ultimately brings "value" to an organization.

BTW, an excellent book on this topic is Ben G Weber's "Data Science in Production: Building Scalable Model Pipelines with Python". Highly recommend. In Ben's words:

The general theme of the book is to take simple machine learning models and to scale them up in different configurations across multiple cloud environments

Ben continues:

Putting predictive models into production is one of the most direct ways that data scientists can add value to an organization. By learning how to build and deploy scalable model pipelines, data scientists can own more of the model production process and rapidly deliver data products.

Another important concept is API-first design. This is related to Writing Software from the Outside In, explained in another excellent blog post by Jesse Johnson.

It's easy to crank out a quick data analysis or a model that works but has no way of "talking" with the rest of your system/app.
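A minimal sketch of what designing from the outside in might look like: settle the request and response shapes the rest of the app will depend on, then fill in the model behind them. The names `SentimentRequest`, `SentimentResponse`, and `score_sentiment` are hypothetical, and the keyword rule is only a stand-in for a real model.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SentimentRequest:
    """What a caller sends (hypothetical shape, agreed on before any modeling)."""
    text: str


@dataclass(frozen=True)
class SentimentResponse:
    """What a caller gets back: a label plus a score in [0, 1]."""
    label: str
    score: float


def score_sentiment(request: SentimentRequest) -> SentimentResponse:
    """Public entry point for the rest of the app.

    The body is a placeholder keyword rule, not a real model. It can be
    swapped for a trained model later without changing the signature.
    """
    positive_words = {"good", "great", "excellent"}
    words = request.text.lower().split()
    hits = sum(word in positive_words for word in words)
    score = hits / len(words) if words else 0.0
    label = "positive" if hits > 0 else "neutral"
    return SentimentResponse(label=label, score=score)
```

Because callers only ever see `score_sentiment` and the two dataclasses, the modeling work behind them can change freely.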

One phenomenon every machine learning engineer has experienced is "Tech Debt". If you haven't heard of tech debt, it's fairly self-explanatory: shortcuts taken today have to be paid back later, with interest, in maintenance effort.

What contributes to complexity?

One of the most important techniques for managing software complexity is to design systems so that developers only need to face a small fraction of the overall complexity at any given time.

-- John Ousterhout

This is so true. On a large project, the only way to stay organized and keep your sanity is to leverage the object-oriented principle of abstraction. Functions are your friend here. If you find a block of related code that is used more than once, follow the D.R.Y. principle: Don't Repeat Yourself. Wrap that code in a function, move it into your utilities file, and import it in your notebook.
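As a rough illustration, suppose the same standardization arithmetic was pasted into several notebook cells. A sketch of the refactor, where the function name `standardize` and the file name `utils.py` are assumptions made for the example:

```python
from statistics import mean, pstdev


def standardize(values):
    """Rescale numbers to zero mean and unit (population) standard deviation."""
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        # A constant column has no spread; return zeros instead of dividing by zero.
        return [0.0 for _ in values]
    return [(v - mu) / sigma for v in values]


# In the notebook, each repeated block collapses to a single call.
# If the function lived in an assumed utils.py, the cell would start with:
#     from utils import standardize
heights = standardize([150.0, 160.0, 170.0])
weights = standardize([50.0, 50.0, 50.0])
```

A bug fix or edge case (like the constant column here) now only has to be handled in one place.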

David Tan speaks in detail about this in yet another blog post I highly recommend to any junior or aspiring data scientist. https://www.thoughtworks.com/insights/blog/coding-habits-data-scientists

David Tan recommends a good software engineering practice: make it a goal to move your code out of your Jupyter notebook ASAP. By doing so, you accomplish three things:

  1. You reduce the cognitive load that we, as developers, experience when trying to hold a bunch of complex operations in our mind.
  2. You enable that chunk of code to be unit-tested (see Test-Driven Development), which gives you more confidence in the code you write.
  3. You make that chunk of code more reusable, not only by you, but by any other members of your project's team. Save them time from re-inventing the wheel; they can simply import your library and use the function you wrote.
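To make the second point concrete, here is a small sketch of a helper pulled out of a notebook together with tests for it. The helper `fill_missing` is hypothetical, and the `test_` naming assumes a runner such as pytest, which collects functions with that prefix.

```python
def fill_missing(values, fill_value):
    """Return a copy of `values` with every None replaced by `fill_value`."""
    return [fill_value if v is None else v for v in values]


def test_replaces_none():
    assert fill_missing([1, None, 3], 0) == [1, 0, 3]


def test_leaves_input_untouched():
    original = [None, 2]
    fill_missing(original, 0)
    # The helper builds a new list, so the caller's data is not mutated.
    assert original == [None, 2]


def test_empty_input():
    assert fill_missing([], 0) == []
```

Tests like these are awkward to write against a notebook cell but straightforward once the logic lives in an importable function.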

2020-09-15

The upcoming Nvidia 3000-series GPUs have me pretty excited to train deep learning models more quickly and efficiently. I'm especially excited about the massive 24 GB of VRAM on the RTX 3090, which will make training big Transformer models actually feasible for modern NLP tasks.
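A back-of-the-envelope sketch of why VRAM matters so much here. It assumes plain fp32 training with the Adam optimizer, which needs about 16 bytes per parameter (4 for the weight, 4 for the gradient, 8 for Adam's two moment estimates), and it deliberately ignores activations, which often take up much of the remaining memory. The function name is made up for the example.

```python
def training_memory_gib(num_params, bytes_per_param=16):
    """Rough memory for weights, gradients, and Adam state, in GiB.

    Default of 16 bytes per parameter assumes fp32 weights and gradients
    plus two fp32 Adam moments. Activations are not included.
    """
    return num_params * bytes_per_param / 2**30


# BERT-large has roughly 340 million parameters.
bert_large_gib = training_memory_gib(340_000_000)
print(f"{bert_large_gib:.1f} GiB before activations")
```

Under these assumptions a BERT-large-sized model already needs several GiB before a single batch is loaded, which is why the jump to 24 GB leaves far more room for batch size and sequence length.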

Because there's so much marketing hype surrounding the Ampere release and because, as a principle, I believe in trusting people's opinion to be more informed than mine if they are more experienced in an area (as any good Bayesian would), I read through a blog post by Tim Dettmers on choosing a GPU for your next deep learning PC. Tim has an excellent analysis on the topic, which I highly recommend.

Attended a great webinar today by Thomas Ott, Customer Solutions Engineer/Business Data Scientist at H2O.ai, called "Real AI Transformation: Getting Models Into Production". A key takeaway: "What makes companies successful is getting models into production."

2020-09-21

Watched an excellent talk from AIDevFest20 on Machine Learning Design Patterns for MLOps by Valliappa Lakshmanan. This is such an important topic in the life of a machine learning engineer, and addresses a major pain-point in our daily work. I'm looking forward to reading his book, "Machine Learning Design Patterns", to be released in November 2020.

Some key points:

  • An ML workflow has all the dev challenges of ensuring things are repeatable, but also the added challenges during production of ensuring you can continuously retrain and serve a model.

  • How to resolve these challenges? Consider your workflow as a pipeline. This pipeline is an executable DAG of ML steps.

  • Each of those steps is a container:
    -- data pre-processing and validation container
    -- feature engineering container
    -- train XGBoost model container
    -- train LGB model container
    -- evaluate model container
    -- deploy model container
    -- etc.

  • These pipeline steps (containers) are not independent; they are connected like a network graph. More precisely, they form a directed graph with no cycles, so you can never get caught in an endless loop (i.e. a Directed Acyclic Graph).
    -- Use dependency tracking: the output of each step in the pipeline is an artifact, which becomes the input of the subsequent step.

  • Google TFX: Obviously, being employees of Google, they're biased towards Google products, but TFX is a great tool for this.
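The pipeline-as-DAG idea from the talk can be sketched with nothing but the Python standard library. This is not how TFX does it; it is a toy illustration that assumes Python 3.9+ for `graphlib`, uses made-up step names, and represents each "artifact" as a plain string instead of launching a container.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Each step maps to the set of steps whose artifacts it consumes.
# Step names are illustrative only.
PIPELINE = {
    "preprocess": set(),
    "features": {"preprocess"},
    "train_xgboost": {"features"},
    "train_lgb": {"features"},
    "evaluate": {"train_xgboost", "train_lgb"},
    "deploy": {"evaluate"},
}


def run_pipeline(dag):
    """Run steps in dependency order, handing each step its upstream artifacts."""
    artifacts = {}
    order = []
    # static_order() raises graphlib.CycleError if the graph contains a cycle.
    for step in TopologicalSorter(dag).static_order():
        inputs = {dep: artifacts[dep] for dep in dag[step]}
        # Placeholder for running a container; the artifact is just a label.
        artifacts[step] = f"{step}-artifact(from={sorted(inputs)})"
        order.append(step)
    return order, artifacts


order, artifacts = run_pipeline(PIPELINE)
```

Because every step only reads artifacts from steps that have already run, the same structure makes it clear which steps must be re-executed when an upstream artifact changes.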

Cpop's Projects

attentionmask

AttentionMask: Attentive, Efficient Object Proposal Generation Focusing on Small Objects (ACCV 2018, accepted as oral)

bilm-tf

TensorFlow implementation of contextualized word representations from bi-directional language models

capsule_networks

This is the code for "Capsule Networks: An Improvement to Convolutional Networks" by Siraj Raval on YouTube

cgp-cnn

A Genetic Programming Approach to Designing CNN Architectures, In GECCO 2017 (oral presentation, Best Paper Award)

cincyhack2017

Used MSFT Kinect facial detection to measure the sentiment of faces at the hackathon and aggregate all attendees' emotional data via JSON. I then did some rudimentary analysis on that data using Python.

densenet

Densely Connected Convolutional Networks, In CVPR 2017 (Best Paper Award).

dilation

Dilated Convolution for Semantic Image Segmentation

face_recognition

The world's simplest facial recognition api for Python and the command line

flair

A very simple framework for state-of-the-art Natural Language Processing (NLP)

keras-one-cycle

Implementation of One-Cycle Learning rate policy (adapted from Fast.ai lib)
