Data Science at the Command Line

Home Page: https://datascienceatthecommandline.com

License: Other

Python 0.62% R 0.12% Shell 0.26% HTML 92.12% TeX 1.99% CSS 1.88% Makefile 0.19% JavaScript 1.32% NewLisp 0.02% Jupyter Notebook 1.16% Nunjucks 0.32%

bash book bookdown cli cowsay curl data-science ggplot2 gnuplot jq linux oreilly oreilly-books python r shell terminal unix zsh


data-science-at-the-command-line's Issues

Docker pull error 'incorrect username or password'

I'm trying to pull the latest version but getting an error message as shown below:
docker pull datascienceworkshops/data-science-at-the-command-line
Using default tag: latest
Error response from daemon: Get https://registry-1.docker.io/v2/datascienceworkshops/data-science-at-the-command-line/manifests/latest: unauthorized: incorrect username or password

My docker info output in case it may help:
docker info
Containers: 0
Running: 0
Paused: 0
Stopped: 0
Images: 0
Server Version: 18.03.1-ce
Storage Driver: overlay2
Backing Filesystem: extfs
Supports d_type: true
Native Overlay Diff: true
Logging Driver: json-file
Cgroup Driver: cgroupfs
Plugins:
Volume: local
Network: bridge host macvlan null overlay
Log: awslogs fluentd gcplogs gelf journald json-file logentries splunk syslog
Swarm: inactive
Runtimes: runc
Default Runtime: runc
Init Binary: docker-init
containerd version: 773c489c9c1b21a6d78b5c538cd395416ec50f88
runc version: 4fc53a81fb7c994640722ac585fa9ca548971871
init version: 949e6fa
Security Options:
seccomp
Profile: default
Kernel Version: 4.9.87-linuxkit-aufs
Operating System: Docker for Windows
OSType: linux
Architecture: x86_64
CPUs: 2
Total Memory: 1.934GiB
Name: linuxkit-00155d01344c
ID: SUNZ:CTDE:UKFI:MPMJ:KJRG:ODSS:7LGL:MNUF:NKKI:GHRH:CSP3:DYLO
Docker Root Dir: /var/lib/docker
Debug Mode (client): false
Debug Mode (server): true
File Descriptors: 19
Goroutines: 35
System Time: 2018-04-29T21:39:50.4780016Z
EventsListeners: 1
Registry: https://index.docker.io/v1/
Labels:
Experimental: false
Insecure Registries:
127.0.0.0/8
Live Restore Enabled: false

Scrape in python 3 env

Seems not to be working with python 3 versions.

args.expression = [e.decode('utf-8') for e in args.expression]
AttributeError: 'str' object has no attribute 'decode'

issue with "header"

Last line of "header" fails if the header contains spaces.

Change from

print_header $OLDHEADER

to

print_header "${OLDHEADER}"

"Executing a Command-line Tool" paragraph

Hi,
first of all this is a great book.

In "Executing a Command-line Tool" paragraph of chapter 2, page 21, there is "cd book/ch02/" command, but this directory does not exist in my vagrant installation.
Is it normal?

Thank you

movies.txt does not exist - but it does?

First of all, I would like to thank Crevax for solving the previous issue I had - I was able to pull the data and access the files in the book! But now, I am running into another issue. This may not be a problem with the data or files, but I believe it is a problem for those who are new to the command line because I am not able to resolve this issue myself.

Following the examples in Chapter 2 to get to the directory for chapter two and examine the 'movies.txt' file, I can see that the file is there, but when I try to run the command:
head -n 3 data/movies.txt
I get the error "head: cannot open 'data/movies.txt' for reading: No such file or directory" even though I can see the file with the ls command!
I have included a screenshot of my command line prompt for someone better versed in docker to see if I have made a mistake.

I think it is worth noting that I cannot exactly follow the docker run command from the book as the ` symbol throws errors in the windows cmd, so maybe my arbitrary directory naming of "dsacm" is the problem. Also there is the fact that I have to take a roundabout path to get to the actual directory containing the data - not sure if this has something to do with it either.

If this is an issue with my docker experience I would greatly appreciate some sources for learning docker (preferably other than the official docker website; its descriptions are a bit abstract for someone with no formal CS education). But really, if anyone could shed some light on the issue I am having I would be very grateful. Thanks!

Additional info: I am running Windows 10 x64 with Docker version 17.12.0-ce, build c97c6d6

Adding the data science toolbox to homebrew repository

The toolbox is great, as well as the book. But insofar as the book and the tools are so easy to use, perhaps, it's worth adding the tools to the homebrew repository as a single package? Or maybe to add readme describing how to install the tools without installing the environment, which is described in the book?

Question for vagrant up

I use a virtual Linux machine on Windows 7.
After I installed VirtualBox and Vagrant and followed the environment setup steps, the terminal showed an error like this:
Progress: 90%There was an error while executing VBoxManage, a CLI used by Vagrant
for controlling VirtualBox. The command and stderr is shown below.

Command: ["import", "/home/datatest1/.vagrant.d/boxes/data-science-toolbox-VAGRANTSLASH-data-science-at-the-command-line/1.0.0/virtualbox/box.ovf", "--vsys", "0", "--vmname", "packer-virtualbox-iso_1523945099207_28351", "--vsys", "0", "--unit", "7", "--disk", "/home/datatest1/VirtualBox VMs/packer-virtualbox-iso_1523945099207_28351/packer-virtualbox-iso-disk1.vmdk"]

Stderr: 0%...10%...20%...30%...40%...50%...60%...70%...80%...90%...100%
Interpreting /home/datatest1/.vagrant.d/boxes/data-science-toolbox-VAGRANTSLASH-data-science-at-the-command-line/1.0.0/virtualbox/box.ovf...
OK.
0%...10%...20%...30%...40%...50%...60%...70%...
Progress state: VBOX_E_FILE_ERROR
VBoxManage: error: Appliance import failed
VBoxManage: error: Could not create the imported medium '/home/datatest1/VirtualBox VMs/packer-virtualbox-iso_1523945099207_28351/packer-virtualbox-iso-disk1.vmdk'.
VBoxManage: error: VMDK: cannot write allocated data block in '/home/datatest1/VirtualBox VMs/packer-virtualbox-iso_1523945099207_28351/packer-virtualbox-iso-disk1.vmdk' (VERR_DISK_FULL)
VBoxManage: error: Details: code VBOX_E_FILE_ERROR (0x80bb0004), component ApplianceWrap, interface IAppliance
VBoxManage: error: Context: "RTEXITCODE handleImportAppliance(HandlerArg*)" at line 886 of file VBoxManageAppliance.cpp

How can I fix it !!!!
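The VERR_DISK_FULL code in the log above suggests the import ran out of disk space. A quick way to check, as a sketch: free_kb is a hypothetical helper, and the directory to check is an assumption (the filesystem that holds the "VirtualBox VMs" folder, typically the home directory):

```shell
# free_kb DIR prints the free space, in kilobytes, on the filesystem holding DIR.
# df -P forces the portable one-line-per-filesystem format; -k reports kilobytes.
free_kb() {
    df -Pk "$1" | awk 'NR == 2 { print $4 }'
}

# Check the filesystem that holds the home directory (falls back to /).
free_kb "${HOME:-/}"
```

If the number is smaller than the size of the box being imported, freeing space or moving the VirtualBox machine folder to a larger disk would be the first thing to try.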

How to install tools manually from source

hi,
I am reading your book at data-science-at-the-command-line.

I want to install some of the tools, such as 'Rio'. The book mentions that 'The installation instructions are for Ubuntu only'.

Would you be so kind as to tell me where to find these installation instructions?

Thanks in advance!

FengYu

Can't run dst command

I installed everything according to http://datascienceatthecommandline.com/. After logging in with putty, I simply can't run the dst command.

Rio question

Hello:
Very nice tools! I am running Rio on Mac OS X, but I always get this message: "ARGUMENT ... ignored". How can I correct or remove this? Any help?

seq 100 | Rio -nf summary

ARGUMENT '',~+~~file='/var/folders/_0/yvrx0nkx76nfvz85q1dxg27183vg3m/T/Rio-X9BIzDnt.pdf')}else~~+~if(exists('is.ggplot')&&is.ggplot(last)){ggsave('/var/folders/_0/yvrx0nkx76nfvz85q1dxg27183vg3m/T/Rio-X9BIzDnt.pdf',last,dpi=72,units='cm',width=20,height=15);}else{sink('/var/folders/_0/yvrx0nkx76nfvz85q1dxg27183vg3m/T/Rio-X9BIzDnt.pdf');print(last);}' ignored

Wrong output to cols example in Ch. 3.2.1

I'm on Ubuntu 16.04.6 LTS

Trying to run the ch 3.2.1 example on cols, I got strange output:

|        day |  bill |  tip | sex    | smoker | time   | size |
| ---------- | ----- | ---- | ------ | ------ | ------ | ---- |
| 0001-01-07 | 16,99 | 1,01 | Female |  False | Dinner |    2 |
| 0001-01-07 | 10,34 | 1,66 | Male   |  False | Dinner |    3 |
| 0001-01-07 | 21,01 | 3,50 | Male   |  False | Dinner |    3 |
| 0001-01-07 | 23,68 | 3,31 | Male   |  False | Dinner |    2 |

When switching 'day' with 'sex', the upper-case operation works but the day column is still messed up:

| sex    |  bill |  tip | smoker |        day | time   | size |
| ------ | ----- | ---- | ------ | ---------- | ------ | ---- |
| FEMALE | 16,99 | 1,01 |  False | 0001-01-07 | Dinner |    2 |
| MALE   | 10,34 | 1,66 |  False | 0001-01-07 | Dinner |    3 |
| MALE   | 21,01 | 3,50 |  False | 0001-01-07 | Dinner |    3 |
| MALE   | 23,68 | 3,31 |  False | 0001-01-07 | Dinner |    2 |

I printed the head of my tips.csv (downloaded latest version) and it's all in order.

Any idea what went wrong ?

Installing the docker image on Windows with WSL

Hi all,

I believe I managed to get the environment up and running on Windows, for those of us who are interested. This requires the following steps, which are not too hard but can be time-consuming. I'll assume you're using Windows 10.

Follow the steps described by Windows here.

This involves 1.a) ensuring you are a Windows Insider user, explained here

and 1.b) having WSL up and running, explained here.

I had the following issue but the fix is in the thread.

Next, for point 2), you'll need a valid docker setup with WSL 2, which is explained

in 2.a) the docker documentation

The following 2.b) article was also helpful, be sure to run Windows containers.

Finally, with docker up and running as seen in the pictures and with a WSL prompt open, you can simply use the steps described in point 2 of the book.

The only thing I didn't get right is mapping the book to a directory with a docker volume, if anyone has any ideas I'm all ears.

Hope it helps!

SSL error while downloading the toolbox

Hi Jeroen, performing vagrant up led to the following SSL error, can you please update the certificates or guide me to a solution. Thanks.

[centos@localhost MyDataScienceToolbox]$ vagrant up
Bringing machine 'default' up with 'virtualbox' provider...
==> default: Box 'data-science-toolbox/data-science-at-the-command-line' could not be found. Attempting to find and install...
default: Box Provider: virtualbox
default: Box Version: >= 0
==> default: Loading metadata for box 'data-science-toolbox/data-science-at-the-command-line'
default: URL: https://atlas.hashicorp.com/data-science-toolbox/data-science-at-the-command-line
==> default: Adding box 'data-science-toolbox/data-science-at-the-command-line' (v1.0.0) for provider: virtualbox
default: Downloading: https://atlas.hashicorp.com/data-science-toolbox/boxes/data-science-at-the-command-line/versions/1.0.0/providers/virtualbox.box
An error occurred while downloading the remote file. The error
message, if any, is reproduced below. Please fix this error and try
again.

SSL certificate problem: unable to get local issuer certificate
More details here: http://curl.haxx.se/docs/sslcerts.html

curl performs SSL certificate verification by default, using a "bundle"
of Certificate Authority (CA) public keys (CA certs). If the default
bundle file isn't adequate, you can specify an alternate file
using the --cacert option.
If this HTTPS server uses a certificate signed by a CA represented in
the bundle, the certificate verification probably failed due to a
problem with the certificate (it might be expired, or the name might
not match the domain name in the URL).
If you'd like to turn off curl's verification of the certificate, use
the -k (or --insecure) option.

jsontsv might replace json2csv

I just released jsontsv yesterday. It is similar to json2csv which you mention on your original blog post "Seven command line tools for data science." But I believe jsontsv is strictly speaking more powerful and not strictly speaking easier to use. Feedback is appreciated if you have time but in any case thanks for writing about this whole topic.

https://github.com/danchoi/jsontsv

Link to text file is now in binary format and that breaks book samples

Resource http://www.gutenberg.org/cache/epub/76/pg76.txt is now stored as a compressed file.
That breaks samples.

There are a couple of ways to fix it:

Update link to point to uncompressed version of the text file
Update samples to uncompress file after curl downloads it

I think option one is better, because with an added command the samples would no longer match the ones in the book, which might confuse readers.
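For completeness, option two could be sketched like this, assuming the file is served gzip-compressed and that a GNU-style gzip is installed. to_plain_text is a hypothetical helper name; curl's own --compressed option is another way to get the same effect:

```shell
# to_plain_text reads stdin and writes plain text to stdout.
# gzip -d decompresses, -c writes to stdout, and -f (together with -c)
# copies input through unchanged when it is not gzip-compressed.
to_plain_text() {
    gzip -dcf
}

# Works on compressed and on plain input alike:
printf 'plain text\n' | gzip -c | to_plain_text
printf 'plain text\n' | to_plain_text

# Hypothetical use with the book's URL (not run here):
# curl -sL 'http://www.gutenberg.org/cache/epub/76/pg76.txt' | to_plain_text | head
```

The pass-through behavior means the same pipeline keeps working if the server later goes back to serving uncompressed text.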

Second Virtual Machine

Hi
I already have a VM from Mining the Social Web by Matthew Russell.
I would like to know what precautions to take when installing the Command Line VM.
Do I have to run vagrant as well?
I am using Windows 8.1 and I don't have any issues with Matthew's VM.

Any suggestions or recommendations are appreciated.

Thanks for your help

Franco

Outdated Project Gutenberg link

The link in chapter 3.6 currently downloads a binary file. The correct link to Huckleberry Finn, which produces the desired result, is:

http://www.gutenberg.org/files/76/76-0.txt

Refactor Rio

Rio started out as a proof-of-concept (see: http://jeroenjanssens.com/2013/09/19/seven-command-line-tools-for-data-science.html). However, over time, this quick-and-dirty Bash script has been proven to be useful to me (and who knows, perhaps even to others). Unfortunately, the code is quite messy making it difficult to maintain and extend.

Because of this, and also because it's playing a role in my upcoming book, I believe that Rio deserves to be cleaned up. (I've attempted to add some whitespace and newlines to that horrible SCRIPT string, but either Rscript or Bash didn't like that.) Please let me know if you have any suggestions.

Not able to use cols with sed

Hi,
I have this CSV (input_out.csv):

nome,id,url,start,creato,venue_id,logo
Palermo (Sicilia) - 25 Maggio 2016 25/5/2016 Karaoke di beneficenza di Chi ama la Sicilia,25180566753,http://www.eventbrite.it/e/biglietti-palermo-sicilia-25-maggio-2016-2552016-karaoke-di-beneficenza-di-chi-ama-la-sicilia-25180566753?aff=ebapi,2016-05-25T20:00:00,2016-05-03T20:40:12Z,15149091,https://img.evbuc.com/https%3A%2F%2Fimg.evbuc.com%2Fhttp%253A%252F%252Fcdn.evbuc.com%252Fimages%252F20814649%252F175077385541%252F1%252Foriginal.jpg%3Frect%3D0%252C172%252C860%252C430%26s%3Dd9a4bfa29cc27f85d8428de320cc9b3c?h=200&w=450&s=7ed859da13004f17403a9a3b0e1b2f7b
Tech-Marketplace & StartupItalia! Open Summit Tour 2016,25161107550,http://www.eventbrite.it/e/biglietti-tech-marketplace-startupitalia-open-summit-tour-2016-25161107550?aff=ebapi,2016-05-17T15:00:00,2016-05-03T10:49:31Z,15134493,https://img.evbuc.com/https%3A%2F%2Fimg.evbuc.com%2Fhttp%253A%252F%252Fcdn.evbuc.com%252Fimages%252F20794591%252F68137449621%252F1%252Foriginal.jpg%3Frect%3D473%252C0%252C3674%252C1837%26s%3De156d1a692274f2b3e48fa61b9e3964d?h=200&w=450&s=cfa6f9b49d4dee5180e53c71931a167d

I would like to replace the & character that I have in the first column with &.

If I write:

cat input_out.csv | sed -e 's/[\/&]/test/g' > out.txt

I have what I want. But I would like to work only in the first column. But if I write:

< input_out.csv cols -c nome body sed -e 's/[\/&]/test/g' > out.txt

I have

sed: -e expression #1, char 4: unterminated `s' command

What's wrong in my command?

Thank you
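One possible explanation, offered as an assumption rather than something checked against the tools' source: if body or cols hands the command to eval, one layer of quoting is stripped and the & is re-parsed as the shell's background operator, cutting the sed expression short. The sketch below uses two hypothetical wrappers and a simplified expression to show the effect and a workaround:

```shell
# run_direct is a hypothetical wrapper that executes its arguments as given.
run_direct() { "$@"; }

# run_eval is a hypothetical wrapper that re-parses its arguments with eval,
# which removes one layer of quoting before the command runs.
run_eval() { eval "$@"; }

# Direct execution: sed receives the expression intact.
printf 'R&D\n' | run_direct sed -e 's/&/ and /g'

# Through eval: an extra pair of quotes keeps & inside the sed expression
# instead of letting the shell treat it as "run in background".
printf 'R&D\n' | run_eval sed -e "'s/&/ and /g'"
```

Both calls print "R and D". If the assumption holds, wrapping the original expression in a second pair of quotes in the cols command may be worth trying.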

Drakefile typo

Here: https://github.com/jeroenjanssens/data-science-at-the-command-line/blob/master/book/ch06/Drakefile

On the last line, change ed to sed.

scrape syntax in chapter05 does not work

< data/wiki.html scrape -b -e 'table.wikitable > tr:not(:first-child)'

This command does not produce the result shown in the book.
If I just type scrape -b -e 'table.wikitable', then the content of the table is printed.

So how do I remove the first 'tr', as described in the book? Please give me some help.

New tool to format data into tables

I'm releasing a new tool that might be useful to the data science toolkit. (I don't know how else to inform you about it @jeroenjanssens other than a GitHub issue, since I don't use Twitter.)

https://github.com/danchoi/table

table formats lines of TSV, CSV, or DSV (delimiter-separated values) into a pretty plain text table, wrapping cells with long content to try to fit the table on the screen.

Rio -r flag

The -r flag loads the dplyr library, but I get an error:

cat iris.csv | Rio -e -r -v "df %>% group_by(species) %>% mean(sepal_length)"
Error: object 'r' not found
Execution halted
cat: /var/folders/h5/3xs5c90n0njgp7n9_qdwwj2m0000gn/T//Rio-OLVgmOU2.err: No such file or directory

scrape does not accept right XPath query?

Hi,
I need to do this query

curl -L -s -A "User-Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10.9; rv:32.0) Gecko/20100101 Firefox/32.0" "http://www.unionesulserio.it"  | /usr/local/bin/nadir/scrape -be '(//a[(contains(.,"pretorio")) or (contains(.,"Pretorio"))])[1]'

I have tested (//a[(contains(.,"pretorio")) or (contains(.,"Pretorio"))])[1] XPath query with other tools and it seems to work, but scrape gives me "Invalid CSS selector".

Is my XPath really wrong?

Thank you

scrape: why empty results?

Dear @jeroenjanssens,
first of all, thank you: I learned a lot from your book, and after reading it there is a bit more light in my brain.

I am trying to use scrape with an XML file that seems properly formatted. I use a correct XPath query, but I get an empty result.

This is my command:

curl 'http://referendum2016.comune.palermo.it/AFFLSEZ_1_82053_R1.xml' | scrape -be '//SV'

What's wrong in it?

Thank you

Vagrant up issues

Hi,

I am having trouble starting the VM using vagrant up. I have tried on both windows and ubuntu.
I get to the password/private key authentication part and can progress no further. Working with the VirtualBox GUI doesn't help either.

Is there any other information I can provide to help diagnose the problem? My Vagrant config file is quite basic (mostly defaults, except that I tried to set password instead of relying on private key due to error- didn't help), but I can share that if needed.

Thanks a lot.
Rohail

Unable to use Rio at terminal for ggplotting

Hi, I am following your book Data Science at the Command Line and it's awesome. While most things have worked so far as I install individual components on my CentOS machine, there are issues now and then. Is this a good place to ask? For example, the plotting code fashion.csv Rio -ge... throws the error display: no decode delegate for this image format `/tmp/magick-02f7rn9B' @ error/constitute.c/ReadImage/544. I have tried re-installing different versions of ImageMagick but without success. Your suggestions will be greatly appreciated. Thanks.

Sequence of dates

Allow me to introduce my dateutils, along with its dateseq command to produce sequences of dates (or date/times) not only faster but more portable and flexible.

Your semantics (you specify dates as day difference relative to today) would have to go through dateadd first though, as all tools in the toolkit take absolute dates. Your example from the book would hence become:

$ dateseq `dateadd today -2` today
2016-03-19
2016-03-20
2016-03-21

So I suppose one could write a wrapper so your dseq tool wouldn't break its API.

Rio ggplot2 not giving expected results

Working from the book, to reproduce Fig 7-4, I run this from the ch07 directory:

$ <data/tips.csv Rio -ge 'g+geom_bar(aes(factor(size)))'
ARGUMENT '',~+~file='/var/folders/4s/gd5gw2bd2n16njbfqsg0j5yc0000gn/T/Rio-zQ6olMgv.png')}else~+~if(exists('is.ggplot')&&is.ggplot(last)){ggsave('/var/folders/4s/gd5gw2bd2n16njbfqsg0j5yc0000gn/T/Rio-zQ6olMgv.png',last,dpi=72,units='cm',width=20,height=15);}else{sink('/var/folders/4s/gd5gw2bd2n16njbfqsg0j5yc0000gn/T/Rio-zQ6olMgv.png');print(last);}' __ignored__

Loading required package: ggplot2
Loading required package: methods

I would expect the binary to stream to stdout so that I could pipe it to display or a .png file, but I get only this. Also, if I run the following, I get the results below:

$ <data/tips.csv Rio -e 'head(df)'
ARGUMENT '',~+~file='/var/folders/4s/gd5gw2bd2n16njbfqsg0j5yc0000gn/T/Rio-xGYRL7dR.png')}else~+~if(exists('is.ggplot')&&is.ggplot(last)){ggsave('/var/folders/4s/gd5gw2bd2n16njbfqsg0j5yc0000gn/T/Rio-xGYRL7dR.png',last,dpi=72,units='cm',width=20,height=15);}else{sink('/var/folders/4s/gd5gw2bd2n16njbfqsg0j5yc0000gn/T/Rio-xGYRL7dR.png');print(last);}' __ignored__

   bill  tip    sex smoker day   time size
1 16.99 1.01 Female     No Sun Dinner    2
2 10.34 1.66   Male     No Sun Dinner    3
3 21.01 3.50   Male     No Sun Dinner    3
4 23.68 3.31   Male     No Sun Dinner    2
5 24.59 3.61 Female     No Sun Dinner    4
6 25.29 4.71   Male     No Sun Dinner    4

What I'm not expecting is the line starting with ARGUMENT .... Could not figure out what is wrong. Any ideas?

Ch2 Paste & Bc Example Not Working

Example provided will cause an error with paste using Bash & Zsh on MacOS. Believe this will cause an issue on all variants, but haven't checked on Linux/Docker as of yet.

$ fac() { (echo 1; seq $1) | paste -s -d\* | bc; }
$ fac 5
> usage: paste [-s] [-d delimiters] file ...

Suggestion is to add in a - character, which will cause it to run correctly, making me think this is just an unchecked typo.

$ fac() { (echo 1; seq $1) | paste -s -d\* - | bc; }
$ fac 5
> 120

Rio -d flag

Hi there, I have a tab delimited file, and when I use Rio -d "\t" -e "summary(df)", I got some errors.
cat iris.tsv | head | Rio -d"\t" -e "summary(df)"
ARGUMENT '',stringsAsFactors=F);summary(df);last<-.Last.value;if(is.matrix(last)){last<-as.data.frame(last)};if(is.data.frame(last)){write.table(last,'/var/folders/h5/3xs5c90n0njgp7n9_qdwwj2m0000gn/T//Rio-fK1bGblg.png',sep=',',quote=T,qmethod='double',row.names=F,col.names=T);}else~+~~if(is.vector(last)){cat(last,sep='\n',~~+~~file='/var/folders/h5/3xs5c90n0njgp7n9_qdwwj2m0000gn/T//Rio-fK1bGblg.png')}else~~+~if(exists('is.ggplot')&&is.ggplot(last)){ggsave('/var/folders/h5/3xs5c90n0njgp7n9_qdwwj2m0000gn/T//Rio-fK1bGblg.png',last,dpi=72,units='cm',width=20,height=15);}else{sink('/var/folders/h5/3xs5c90n0njgp7n9_qdwwj2m0000gn/T//Rio-fK1bGblg.png');print(last);}' ignored

Do you have any idea?
Thanks,
Ming Tang

Rio-scatter to tty should display plot on X11

This will dump a PNG file to the tty:

< iris.csv Rio-scatter sepal_length sepal_width species

You never want that. If the output is not redirected to a file, display it on the screen with R's native displayer.

To see if stdout is a terminal you can use:

if [ -t 1 ] ; then
...
fi
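A slightly fuller sketch of that test, with a hypothetical helper name. A tool like Rio-scatter could branch on the same condition to decide between opening a viewer and writing the PNG bytes:

```shell
# stdout_kind reports where standard output (file descriptor 1) is going.
stdout_kind() {
    if [ -t 1 ]; then
        echo "terminal"      # interactive: showing the plot in a viewer makes sense
    else
        echo "redirected"    # pipe or file: writing the raw image bytes makes sense
    fi
}

stdout_kind          # result depends on how the script itself is run
stdout_kind | cat    # stdout is a pipe here, so this prints "redirected"
```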

close failed in file object destructor

Running your example curl -s 'http://en.wikipedia.org/wiki/List_of_countries_and_territories_by_border/area_ratio' | scrape -b -e 'table.wikitable > tr:not(:first-child)' | head I get:

close failed in file object destructor:
sys.excepthook is missing
lost sys.stderr

drake and grep: write error

Hi, great book !
I have an issue with running the first drake example:

Any clue ?

Book could use some reviews

The book Data Science at the Command Line has been getting some really good reviews on the O'Reilly product page. I have to say: it's great to get feedback from readers. (It's also a nice confirmation that I'm not the only person crazy enough to think that the command line can be used for doing data science.) ;)

On Amazon and other book websites, there are unfortunately currently very few or no reviews. This makes it difficult for someone to decide whether the book would be useful to them or not.

If you have read the book and you have an opinion about it, whether it's positive or negative, then it would be greatly appreciated if you would spend a few minutes writing it down and submitting it as a review to Amazon (or any other book website). More reviews means that potential readers can better inform themselves, which could potentially lead to more command-line users.

Thanks for helping me out!

Cheers,

Jeroen

PS. I realize that not everybody who's using this repository has actually read the book, so forgive me for posting this question here. If you're still wondering whether you should buy the book or not, the first chapter is available for free at O'Reilly.

PPS. Of course, if you have any feedback not suited for a review, then you can always open a GitHub issue or contact me on Twitter.

sql2csv --db "mysql: ..." says I don't have the necessary backend installed

vagrant@data-science-toolbox:~$ sql2csv --db "mysql://user:[email protected]:3306/database" --query "select count(*) from session"
You don't appear to have the necessary database backend installed for connection string you're trying to use.. Available backends include:

Postgresql: pip install psycopg2
MySQL: pip install MySQL-python

For details on connection strings and other backends, please see the SQLAlchemy documentation on dialects at:

http://www.sqlalchemy.org/docs/dialects/

To fix I did:

sudo apt-get update
sudo apt-get install libmysqlclient-dev
sudo pip install MySQL-python

Document how to provision this image on AWS or other cloud server

I'm sure it must be possible to install this set of tools directly onto a server running on AWS; I don't know yet if the supplied information supports this directly.

I have the vagrant-aws plugin for vagrant running.

Ah - some more searching - the directions (which are quite good) are at

http://datasciencetoolbox.org/

and select the tab "In the cloud"; that should be enough info to get it running.

Add Dockerfile

It would be awesome to have a Dockerfile, to autoinstall the project and run with. I'll open a Pull Request for this tomorrow.

Online book is missing appendix

The online version of the book is missing the appendix, which helpfully lists all of the command-line tools mentioned in the book with a brief description of each. It is mentioned several times in the text but seems to be missing from this repository and the hosted version of the book.

This is a super useful reference that I refer to often when trying to figure out how to do something on the command line, or to remind myself of tools that I don't use often—it should be included in the online version!

Scrubbing Data (Page 56)

@jeroenjanssens
On page 56 the following bash snippet does not work "out of the box" in the vagrant / virtualbox environment running Ubuntu 14.04 LTS as suggested in the book:

$ echo 'foo\nbar\nfoo' | sort | uniq -c | sort -nr
1 foo\nbar\nfoo
$ echo 'foo\nbar\nfoo' | sort | uniq -c | sort -nr | awk '{print $2","$1}' | header -a value,count 
value,count
foo\nbar\nfoo,1

The following snippet works correctly with -e parameter given to echo:

$ echo -e 'foo\nbar\nfoo' | sort | uniq -c | sort -nr
2 foo
1 bar
$ echo -e 'foo\nbar\nfoo' | sort | uniq -c | sort -nr | awk '{print $2","$1}' | header -a value,count 
value,count
foo,2
bar,1

An alternative would be to use printf instead of echo. There may also be differences between Unix distributions and versions in how echo handles the -e parameter.

The following snippet might be more universally acceptable:

$ echo $'foo\nbar\nfoo' | sort | uniq -c | sort -nr
2 foo
1 bar
$ echo $'foo\nbar\nfoo' | sort | uniq -c | sort -nr | awk '{print $2","$1}' | header -a value,count 
value,count
foo,2
bar,1
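The printf alternative mentioned above could look like the following sketch. printf interprets \n in its format string regardless of the shell's echo behavior; count_values is a hypothetical name for the counting part of the pipeline, and the header tool is left out so that only standard utilities are needed:

```shell
# count_values reads one value per line and prints "value,count",
# most frequent value first.
count_values() {
    sort | uniq -c | sort -nr | awk '{print $2","$1}'
}

# printf expands \n itself, so no -e flag or $'...' quoting is needed.
printf 'foo\nbar\nfoo\n' | count_values
```

This prints foo,2 followed by bar,1, matching the two working snippets above.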

Wrong return code

There are some places that return a non-successful error code (anything but 0) when there's no error. For example, https://github.com/jeroenjanssens/data-science-at-the-command-line/blob/master/tools/header#L65

This is a big problem because some interprocess communication tools fail because they depend on a successful exit status. Should I submit a pull-request?
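To illustrate why this matters, here is a small sketch with two hypothetical stand-in tools that produce the same output and differ only in exit status. Any caller that branches on the status (if, &&, set -e, make, and so on) treats them differently:

```shell
# Hypothetical stand-ins for a command-line tool: both print the same data,
# but bad_tool returns a non-zero status even though nothing went wrong.
good_tool() { echo "data"; return 0; }
bad_tool()  { echo "data"; return 1; }

# report runs a tool and prints what a status-checking caller would conclude.
report() {
    if "$1" > /dev/null; then
        echo "$1: success"
    else
        echo "$1: failure"
    fi
}

report good_tool   # prints "good_tool: success"
report bad_tool    # prints "bad_tool: failure"
```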

Stuck after running Docker image

Hi, Jeroen!

First of all, thanks for the awesome work and clear explanations about Data Science and Command Line Tools.

I found the site when reading HN today. Then I followed along through the first chapters until I got stuck at 2.2 Installing the Docker Image.

I'm aware it is currently a WIP to update the online version of the book. But I was wondering if there is something I could do to keep going once I ran the Docker container. Now, it seems the files you use are not there yet (e.g. book/ folder), which makes it difficult to follow your first examples.

Am I missing something here? I would appreciate some help 😃

Thanks in advance!

Twitter API v1 is outdated

Hi Jeroen,

on page 39 you call an outdated Twitter API:

$ curlicue -f credentials \
> 'https://api.twitter.com/1/statuses/home_timeline.xml'

It returns

<?xml version="1.0" encoding="UTF-8"?>
<errors>
  <error code="64">
    The Twitter REST API v1 is no longer active. Please migrate to API v1.1.
    https://dev.twitter.com/docs/api/1.1/overview.
  </error>
</errors>

The actual API endpoint is

https://api.twitter.com/1.1/statuses/home_timeline.json

So the proper call would be

$ curlicue -f credentials \
> 'https://api.twitter.com/1.1/statuses/home_timeline.json'

However, it returns JSON and not XML as the v1 call did. Since it is the end of the pipeline and the returned value is not processed, this is not a big issue. I guess the API was deactivated at the end of last year.

Scrape in Golang

Hi, do you mind writing a version of the scrape utility in Golang for better cross platform support? Thanks in advance!

sample is already a base tool

man sample

DESCRIPTION
     sample is a command-line tool for gathering data about the running behav-
     ior of a process.  It suspends the process at specified intervals (by
     default, every 1 millisecond), records the call stacks of all threads in
     the process at that time, then resumes the process.  The analysis done by

cols not working

It seems like cols is not working.
No matter how I run it, I always get this response:

usage: mktemp [-d] [-q] [-t prefix] [-u] template ...
mktemp [-d] [-q] [-u] -t prefix
mkfifo: /other_columns: Permission denied
/Applications/command-line-tools/cols: line 24: ${ARG~~}: bad substitution
tee: /other_columns: Permission denied
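The "bad substitution" line points at ${ARG~~}, a case-toggling expansion that needs Bash 4 or newer, while macOS ships Bash 3.2. A portable replacement could be sketched with tr; toggle_case is a hypothetical helper, and whether cols only needs to flip a flag such as -c to -C is an assumption:

```shell
# toggle_case swaps upper and lower case in its argument using tr,
# which avoids the Bash 4+ only ${VAR~~} expansion.
toggle_case() {
    printf '%s\n' "$1" | tr 'a-zA-Z' 'A-Za-z'
}

toggle_case -c   # prints -C
toggle_case -C   # prints -c
```

The mktemp usage message is a separate portability problem: BSD mktemp, as found on macOS, expects a template argument.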

Off-topic: cron and csvsql

Hi,
I'm writing here because you have great experience with csvkit.

Have you ever run a csvsql process from cron? If so, were you able to produce an output file?
I always get a 0 KB output file.

I have just opened an issue, wireservice/csvkit#342. I think it could be related to the creation of the temporary SQLite file, but I'm not able to read anything in the log either (it's a 0 KB file too).

Thank you very much

A standalone version of scrape

Hi,
I have created (with pyinstaller) a standalone version of scrape, because I need it in a python 3 PC https://github.com/aborruso/scrape-cli/releases

Scrape is a great tool, thank you to its author

`mktemp` failure

$ seq 12 | Rio -e 'df**2'
mktemp: failed to create file via template `/tmp/user/1001Rio-XXXXXXXX': Permission denied
mktemp: failed to create file via template `/tmp/user/1001Rio-XXXXXXXX': Permission denied
mktemp: failed to create file via template `/tmp/user/1001Rio-XXXXXXXX': Permission denied
/opt/Rio: line 115: $IN: ambiguous redirect
Rscript execution error: No such file or directory
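The template in the error, /tmp/user/1001Rio-XXXXXXXX, looks as if the temporary directory and the file prefix were joined without a separator. That is a guess from the message, not something checked in Rio's source. A sketch of a join that works whether or not TMPDIR ends in a slash, using a hypothetical helper name:

```shell
# make_tmp PREFIX creates an empty temporary file and prints its path.
# ${_dir%/} drops a single trailing slash so the join never yields
# "dirPREFIX" or "dir//PREFIX".
make_tmp() {
    _dir="${TMPDIR:-/tmp}"
    mktemp "${_dir%/}/$1-XXXXXXXX"
}

# Create a file, show its path, and clean up again.
tmp_file=$(make_tmp Rio) && echo "$tmp_file" && rm -f "$tmp_file"
```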

URL of "Calling Web APIs" outdated?

Perhaps the URL of the API on page 38 is outdated:

curl -s http://api.randomuser.me | jq '.'

It returns nothing at all

Best,
Òscar
