topn's Introduction

Who I am ...

👋 I'm Paul Power, an experienced software and data technology leader with over 20 years' experience shipping successful commercial products. I am based in Dublin, Ireland, but I mostly work with globally distributed teams.

I am a senior individual contributor who helps product and development teams identify barriers to business value and prioritise the most appropriate areas to focus and deliver on. I've been described as a distinguished engineer, senior principal engineer, senior staff engineer, chief architect, CTO and product lead, depending on the company and context. I see myself as an experienced generalist who just happens to be very good at identifying and solving awkward problems that typically span multiple, diverse teams and stakeholders.

I'm open to new opportunities right now, so if you think I might be a good fit for your team, or that I can help as a consultant or advisor, then please get in touch to explore.

My previous experiences ...

I have been fortunate to spend much of my career working on building platforms, distributed services and products focused on data management and integration.

Specifically, I've worked with teams at corporates like

  • Elastic - As tech lead for the Kibana platform for Elasticsearch-based analytics and applications
  • Workday - As senior principal of a SaaS data ingestion and onboarding platform for customer HR data
  • Informatica - As Chief Architect for SaaS and on-premise data profiling, quality analysis, preparation and integration products

I've also enjoyed working with smaller companies and startups like

See LinkedIn for more details.

I'm currently exploring ...

Reach me via ...


topn's Issues

Problem: Sometimes hangs when using external runtimes

On my machine (see specs below) inline tests run fine, but I sometimes experience hangs when I run with -r local or other parallel runtimes.

For example, in a new iTerm window:
$ cd TopN
$ python TopN.py data/data-1M.txt --jobconf mapreduce.job.reduces=1
.. results ok !
$ python TopN.py data/data-1M.txt --jobconf mapreduce.job.reduces=1
.. results ok
$ python TopN.py data/data-10M.txt --jobconf mapreduce.job.reduces=1
.. doesn't complete, kill Ctrl-C

Hardware:
Model Name: MacBook Pro
Model Identifier: MacBookPro11,3
Processor Name: Intel Core i7
Processor Speed: 2.3 GHz
Number of Processors: 1
Total Number of Cores: 4
L2 Cache (per Core): 256 KB
L3 Cache: 6 MB
Memory: 16 GB

Software:
System Version: OS X 10.10.5 (14F2009)
Kernel Version: Darwin 14.5.0
Boot Volume: Macintosh HD
Python: Python 3.5.2 |Anaconda 4.2.0 (x86_64)
MrJob: 0.5.8

Problem: Test data doesn't include duplicates or gaps

Currently test data is generated from seq and shuffled. As a result, it does not contain any duplicate values or gaps.

Instead, generate the test data, remove lines at random, and fold a subset of the remaining lines back in again. Look at xsv or gshuf -n samplesize.
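The same idea can be sketched in Python rather than with xsv or gshuf. This is only an illustration under stated assumptions: the function name `make_test_data` and its parameters are hypothetical and not part of the repository, and the data is assumed to be one integer per line, as produced by seq.

```python
import random


def make_test_data(size, drop_count, dup_count, seed=None):
    """Build shuffled integer test data containing gaps and duplicates.

    Hypothetical helper: starts from 1..size, drops `drop_count` values
    at random (gaps), then folds `dup_count` of the surviving values
    back in (duplicates).
    """
    rng = random.Random(seed)
    values = list(range(1, size + 1))

    # Remove lines at random to create gaps in the sequence.
    kept = rng.sample(values, size - drop_count)

    # Fold a random subset of the surviving lines back in as duplicates.
    duplicates = rng.sample(kept, dup_count)

    data = kept + duplicates
    rng.shuffle(data)
    return data


if __name__ == "__main__":
    # Write one value per line, matching the layout seq would produce.
    for value in make_test_data(size=20, drop_count=4, dup_count=3, seed=42):
        print(value)
```

Passing a seed keeps the generated file reproducible, which makes it easier to compare runs across runtimes.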

Problem: Duplicate data values can lead to under-reporting of TopN

Currently the heapq implementation allows duplicates. This is not an issue when the test data is clean, but when there are duplicates, and those duplicates happen to be among the largest values in a partition, we end up with a heap full of duplicates. As a result, the final reducer produces fewer than 10 distinct values.

Consider wrapping the heap with a set that tracks the values in the heap and is checked before an entry is added. If the value already exists in the set, don't add it to the heap.
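A minimal sketch of that suggestion, assuming plain comparable values and a standalone function instead of the actual mapper and reducer code (the name `top_n_distinct` is hypothetical):

```python
import heapq


def top_n_distinct(values, n=10):
    """Return the n largest distinct values, largest first."""
    heap = []     # min-heap holding the current top-n candidates
    seen = set()  # mirrors the values currently in the heap

    for value in values:
        if value in seen:
            continue  # already in the heap, so skip the duplicate
        if len(heap) < n:
            heapq.heappush(heap, value)
            seen.add(value)
        elif value > heap[0]:
            # Replace the smallest candidate and keep the set in sync.
            evicted = heapq.heappushpop(heap, value)
            seen.discard(evicted)
            seen.add(value)

    return sorted(heap, reverse=True)
```

An evicted value can be dropped from the set safely: the smallest value in the heap never decreases, so if the evicted value appears again it fails the `value > heap[0]` check and is not re-added.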
