
ai-metrics's Introduction

Measuring the Progress of AI Research

This repository contains a Jupyter Notebook, which you can see live at https://eff.org/ai/metrics. It collects problems and metrics/datasets from the artificial intelligence and machine learning research literature, and tracks progress on them. You can use it to see how things are progressing in specific subfields or in AI/ML as a whole, as a place to report new results you've obtained, as a place to look for problems that might benefit from having new datasets/metrics designed for them, or as a source of data to build on for data science projects.

At EFF we're also interested in collecting this data to understand the likely implications of AI, but to begin with we're focused on gathering it.

Original authors: Peter Eckersley and Yomna Nasser at EFF

With contributions from: Gennie Gebhart and Owain Evans
Inspired by and merging data from several prior collections of AI progress measurements.

How to contribute to this notebook

This notebook is an open source, community effort. You can help by adding new metrics, data and problems to it! If you're feeling ambitious you can also improve its semantics or build new analyses into it. Here are some high level tips on how to do that.

1. If you're comfortable with git and Jupyter Notebooks, or are happy to learn

If you've already worked a lot with git and IPython/Jupyter Notebooks, here's a quick list of things you'll need to do:

  1. Install Jupyter Notebook and git.

    • On an Ubuntu or Debian system, you can do:
      sudo apt-get install git
      sudo apt-get install ipython-notebook || sudo apt-get install jupyter-notebook || sudo apt-get install python-notebook
    • Make sure you have IPython Notebook version 3 or higher. If your OS doesn't provide it, you might need to enable backports, or use pip to install it.
  2. Install this notebook's Python dependencies:

    • On Ubuntu or Debian, do:
          sudo apt-get install python-{cssselect,lxml,matplotlib{,-venn},numpy,requests,seaborn}
    • On other systems, use your native OS packages, or use pip:
          pip install cssselect lxml matplotlib{,-venn} numpy requests seaborn
  3. Fork our repo on github: https://github.com/AI-metrics/AI-metrics#fork-destination-box

  4. Clone the repo on your machine, and cd into the directory it creates

  5. Configure your copy of git to use IPython Notebook merge filters to prevent conflicts when multiple people edit the Notebook simultaneously. You can do that with these two commands in the cloned repo:

    git config --file .gitconfig filter.clean_ipynb.clean $PWD/ipynb_drop_output
    git config --file .gitconfig filter.clean_ipynb.smudge cat
  6. Run Jupyter Notebook in the project directory (the command may be ipython notebook, jupyter notebook, jupyter-notebook, or python notebook, depending on your system), then go to localhost:8888 and edit the Notebook AI-progress-metrics.ipynb to your heart's content

  7. Save and commit your work (git commit -a -m "DESCRIPTION OF WHAT YOU CHANGED")

  8. Push it to your remote repo

  9. Send us a pull request!

2. If you want something very simple

Microsoft Azure has an IPython / Jupyter service that will let you run and modify notebooks from their servers. You can clone this Notebook and work with it via their service: https://notebooks.azure.com/EFForg/libraries/ai-progress. Unfortunately there are a few issues with running the notebook on Azure:

  • arXiv seems to block requests from Azure's IP addresses, so it's impossible to automatically extract information about papers when running the Notebook there
  • The Azure Notebooks service seems to transform Unicode characters in strange ways, creating extra work merging changes from that source

Notes on importing data

  • Each .measure() call is a data point for a specific algorithm on a specific metric/dataset, so one paper will often produce multiple measurements across multiple metrics (see the sketch after this list). It's most important to enter results that were at or near the frontier of best performance on the date they were published, though this isn't a strict requirement: it's also nice to have a sense of the performance of the field, and of algorithms that are otherwise notable even if they aren't the frontier for a specific problem.
  • When multiple revisions of a paper (typically on arXiv) have the same results on some metric, use the date of the first version (the CBTest results in this paper are an example)
  • When subsequent revisions of a paper improve on the original results (example), use the date and scores of the first results, or, if each revision is interesting / on the frontier of best performance, include each revision as a separate measurement
    • We didn't check this carefully for our first ~100 measurement data points :(. In order to denote when we've checked which revision of an arXiv preprint first published a result, cite the specific version (https://arxiv.org/abs/1606.01549v3 rather than https://arxiv.org/abs/1606.01549); that way we can see which previous entries should be double-checked for this form of inaccuracy.
  • Where possible, use a clear short name or acronym for each algorithm. The full paper name can go in the papername field (and is auto-populated for some papers). When matplotlib 2.1 ships we may be able to get nice rollovers with metadata like this. Or perhaps we can switch to D3 to get that type of interactivity.
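
For concreteness, here is a hypothetical sketch of a single data point; the exact signature is defined in the Notebook itself, and some_metric stands in for a metric object created earlier in it, so this is illustrative rather than standalone code:

    from datetime import date

    # date of the first arXiv version reporting the result, the score, and a
    # clear short algorithm name; url/papername are optional metadata
    some_metric.measure(date(2016, 6, 1), 0.9, "SHORT-ALGO-NAME",
                        url="https://arxiv.org/abs/XXXX.XXXXX",
                        papername="Full Paper Title")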

What to work on

  • If you know of ML datasets/metrics that aren't included yet, add them
  • If there are papers with interesting results for metrics that aren't included, add them
  • If you know of important problems that humans can solve, and machine learning systems may or may not yet be able to, and they're missing from our taxonomy, you can propose them
  • Look at our GitHub issue list, perhaps starting with the issues tagged as good volunteer tasks.
  • You can also add missing conferences / journals to the venue-to-date mapping table (unhide the source code and search for conference_dates):
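
For illustration only, an entry in that table has roughly this shape (the real mapping lives in the Notebook source; the venues and dates below are merely examples):

    from datetime import date

    conference_dates = {
        "ICML 2016": date(2016, 6, 19),
        "NIPS 2016": date(2016, 12, 5),
        # add the missing venue and its start date here
    }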

FAQ

Q: What's the point of this project? How does it tie in with the EFF's mission?

Given that machine learning tools and AI techniques are increasingly part of our everyday lives, it is critical that journalists, policy makers, and technology users understand the state of the field. When improperly designed or deployed, machine learning methods can violate privacy, threaten safety, and perpetuate inequality and injustice. Stakeholders must be able to anticipate such risks and policy questions before they arise, rather than playing catch-up with the technology. To this end, it's part of the responsibility of researchers, engineers, and developers in the field to help make information about their life-changing research widely available and understandable.

Q: Why haven't you included dataset X?

There are a tiny number of us and this is a large task! If you'd like to add more data, please send us a pull request.

Q: Do you track other things besides how well-solved particular tasks are? For instance, the speed and efficiency of training?

No, but we'd love to. If you are motivated to help organize that data, please dive in and improve the notebook!

Q: Have you thought about how to visualise this data and make it more accessible?

We've considered a variety of things, but decided that the IPython notebook was ultimately the most accessible option for now. We're very open to suggestions about visualizations and accessibility, so feel free to reach out if you have any ideas!

Also, if you'd like to build visualizations on top of this project, all the data we use is available in easily digestible JSON format in progress.json. If you do so, let us know and we'll try to link to it.

Q: Is this an EFF project?

Yes, but we'd like it to grow to be a self-sustaining community effort supported by a coalition of organizations. EFF did the initial work of making the Notebook, but we built on several excellent datasets collected by many other people, and had a number of productive collaborative discussions, most especially with people at OpenAI and the Future of Humanity Institute, in preparing the document. We will strive to keep the authorship section of the Notebook accurate as others continue to contribute.

Q: When will artificial general intelligence happen?

We don't know, and this project is not meant to answer that question. Instead, we're interested in compiling data to guide evidence-based conversations about the state of the art in various corners of AI and machine learning research.

ai-metrics's People

Contributors

drschwenk, gegebhart, guicho271828, jackclarksf, jamesscottbrown, jekbradbury, lschatzkin, mfb, miroli, owainevans, pde, thewolf-aws, void-elf, ynasser


ai-metrics's Issues

External hyperlink navigation is broken for some links

Because EFF serves the Notebook (nbconvert --execute --to html) in an iframe, it seems that some of the external links are broken unless you right-click and choose "open in new tab/window". Some others load within the iframe, which certainly isn't what we want.

[CSS] Right-align the card2code images

There's a <div> containing sample images of MTG and HS cards, and a screenshot of Java code for the MTG card. It should really be right-aligned with the main text flowing past on the left, but I'm failing to make that happen in CSS. Perhaps someone who knows CSS better can help...

Make a metric.table!

... in the bottom right corner, including an "edit on github" button.

As a last resort, use: from IPython.display import HTML.
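
If that fallback is needed, here's a minimal sketch of the idea, assuming we simply render an HTML string (the table contents below are placeholders):

    from IPython.display import HTML, display

    rows = "".join("<tr><td>%s</td><td>%s</td></tr>" % (name, value)
                   for name, value in [("hypothetical-algorithm", 0.9)])
    display(HTML("<table>" + rows + "</table>"
                 + '<a href="https://github.com/AI-metrics/AI-metrics">edit on github</a>'))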

Tables are mangled on mobile

There's a block of whitespace overlaying the right hand side of the tables on mobile devices, which makes them unreadable.

Replace "solved" with something more nuanced

In different cases, it can be different things:

  • Parity with typical human performance?
  • Parity with best human performance?
  • Parity with some handcrafted algorithmic benchmark?

HTML encoding bug

When exporting as HTML, URLs need to be HTML-encoded, e.g. http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.88.9924&rep=rep1&type=pdf should become http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.88.9924&amp;rep=rep1&amp;type=pdf
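
A minimal sketch of the fix, assuming the URLs are plain strings at the point where the HTML is generated:

    import html  # Python 3 standard library; cgi.escape plays this role on Python 2

    url = ("http://citeseerx.ist.psu.edu/viewdoc/download"
           "?doi=10.1.1.88.9924&rep=rep1&type=pdf")
    print(html.escape(url))  # ampersands come out as &amp;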

Missing text in adversarial_examples.notes

adversarial_examples.notes = """
We know that humans have significant resistance to adversarial examples.  Although methods like camouflage sometimes
work to fool us into thinking one thing is another, those
"""

Rest of the sentence is missing.

VQA: Is LSTM Q+I info correct?

I'm concerned that the reported LSTM Q + I data may be incorrect. Looking at the different versions, I don't think LSTM Q + I was included in the analysis until v3 of the initial VQA dataset paper, published on October 15th 2015 (https://arxiv.org/abs/1505.00468v3). Additionally, v3 is the first version of the paper that explicitly reports any result on the test-standard split used throughout the VQA 1.0 evaluations. LSTM Q + I is reported (because of its superior performance on test-dev), and it achieves 54.06% accuracy in the open-answer framework. The 58.2% currently reported in the repo for LSTM Q + I seems to come from v5 onward, which was not submitted to arXiv until March 7 2016.

On the multiple choice side of things, it is not until v5 (https://arxiv.org/abs/1505.00468v5) that the initial VQA dataset paper explicitly states a best result on "test-standard". The 63.1% figure that is currently reported in this repo lines up with the results in the v5 paper, but again, the v5 paper was not submitted to arXiv until 7 Mar 2016.

I believe the following actions should be taken:

  • v3 of the initial VQA paper should be the starting point of the open-ended evaluations (15 Oct 2015 on arxiv).
  • It should be reflected that this initial v3 VQA paper reports an accuracy of 54.06% on test-standard, not 58.2%.
  • LSTM Q + I should not be the starting point of the multiple-choice evaluations, but it should be included in the graph on March 7th 2016. The earliest results I have for test-standard results in the multiple choice framework are from the DPPnet paper: https://arxiv.org/abs/1511.05756. Just one version on arxiv, November 18 2015.

Finish sentence: "When matplotlib 2.1 ships"

I'm proofreading the notebook and am not sure how this sentence is supposed to end:

"The full paper name can go in the papername field (and is auto-populated for some papers). When matplotlib 2.1 ships" ... ?

Anaphora resolution

I recently learned about Winograd Schemas

https://en.wikipedia.org/wiki/Winograd_Schema_Challenge

which are an approach to formalizing the testing of anaphora resolution (figuring out what a pronoun refers to when this can only be done from context and knowledge about the world). For example, in the pair of sentences

The doctor examined the patient because she felt ill.

The doctor examined the patient because she was familiar with endocrine disorders.

the word she in the first sentence refers to the patient, but in the second sentence to the doctor, because feeling ill is a reason for someone to be examined by a doctor, while medical expertise is a reason for someone to examine a patient. Or in a somewhat similar vein

Mahayana Buddhist monks may not eat carne asada burritos because they are not vegetarian.

Mahayana Buddhist monks may not eat tacos al pastor because they are vegetarian.

(In one case, they [are/are not] vegetarian refers to people, and in the other case to food items. For the Winograd schema case we would probably just say "Mahayana Buddhist monks may not eat carne asada burritos because they [are/are not] vegetarian", where then the word they will refer either to people or to food items.)

This is a remarkably difficult natural language processing problem because particular instances of it can test arbitrary knowledge about the world and the ability to make likely inferences about relationships, institutions, abilities, and motivations.

In Winograd schemas, a single change in a sentence flips the most natural resolution of a pronoun to refer to a different entity, and the question is to figure out which entity is referred to by each version of the sentence.

The Wikipedia article shows that there are several attempts to test how well different AIs and natural language processing systems can solve Winograd schema problems. Are any tests of this kind already included in any of the reading comprehension tasks? If not, they probably should be!

Sort out VQA real vs abstract

At the moment, the VQA data assumes that results are for a combined real+abstract dataset. But at least some of the results are in fact just for "real". So we need to fix that...

Add all the interesting Atari games

There are lots; we should add all the ones that have 3+ reports in the literature or which appear to be interesting for some conceptual reason.

Measurements shouldn't be full paper names

In some cases we have measurements whose name is an algorithm, like PixelCNN++. In other cases we have measurements whose name is a paper.

The former is better, because it allows cleaner graphs and it allows one paper to generate multiple measurement datapoints if the paper reports new results from multiple algorithms (which is common).

Multipanel Atari graphs

There are a lot of Atari games, and their graphs are currently limited in the number of data points they contain, so it would be great to have them displayed in a slightly less voluminous way, perhaps in two columns at 50% of their current size.

This page gives some examples of how to do this kind of thing.

Yomna, if you can make a graph_atari_games function that renders them all in a nice way, I'll figure out how to get rid of the current Atari graphs at the bottom of the Notebook.
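
A possible sketch of that function, assuming each game already has a routine that can draw its graph onto a supplied matplotlib Axes (plot_game below is a hypothetical per-game helper):

    import math
    import matplotlib.pyplot as plt

    def graph_atari_games(games, plot_game, cols=2):
        rows = int(math.ceil(len(games) / float(cols)))
        fig, axes = plt.subplots(rows, cols, figsize=(8, 2.5 * rows))
        axes = axes.flatten()
        for ax, game in zip(axes, games):
            plot_game(game, ax=ax)
            ax.set_title(game)
        for ax in axes[len(games):]:  # hide any unused panels
            ax.set_visible(False)
        fig.tight_layout()
        return fig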

Broken export of progress.json

The export_json function in AI-progress-metrics.ipynb contains the line:

problem_data[problem_attr] = map(lambda x: x.name, getattr(problem, problem_attr))

However, in Python 3 the map function returns a map object rather than a list, so the most recently committed progress.json contains unhelpful things like:

        "subproblems": "<map object at 0x7faccf3be4a8>",
        "superproblems": "<map object at 0x7faccf3bec50>",

To fix this, the line should be changed to:

problem_data[problem_attr] = list(map(lambda x: x.name, getattr(problem, problem_attr)))

Audit all the scale code

There are some wonky things in there. For instance

  • why is news-test-2014 en-fr BLEU labelled as solved?
  • is the use of log to order and linearise proportion-based metrics correct?
  • should we ever use percentages vs. proportions? Or allow either? Or allow them but force them to be something like proportions at the end?

Figure out how to embed images in the converted notebook more efficiently

Currently, we produce the hosted copy of the notebook with ipython nbconvert --execute --to html. That produces a giant HTML blob with very large embedded PNG graphs. That HTML blob is huuuugge, 11MB and counting. It would be nice to (1) embed SVG graphs instead, which should be much smaller, and (2) split them out into separate files so the page can render before the graphs have been downloaded.

This is a good volunteer task if you're knowledgeable about Jupyter Notebook, Matplotlib and nbconvert...
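
One possible approach for (1), sketched rather than tested against this build: ask the inline backend for SVG output before any figures are drawn, so nbconvert embeds SVG rather than large PNGs.

    # run near the top of the Notebook, before any plots are produced
    from IPython.display import set_matplotlib_formats
    set_matplotlib_formats('svg')

Splitting the figures into separate files, point (2), would still need additional nbconvert work.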

Mysterious warning from VQA graph

We get this strange warning from matplotlib when producing the first VQA graph:

/usr/lib/python2.7/dist-packages/matplotlib/axes/_axes.py:545: UserWarning: No labelled objects found. Use label='...' kwarg on individual plots.
warnings.warn("No labelled objects found. "
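
This warning usually just means legend() was called on an Axes where none of the plot() calls set a label, so a likely fix is along these lines (the names and numbers below are illustrative):

    import matplotlib.pyplot as plt

    fig, ax = plt.subplots()
    ax.plot([2015, 2016], [0.5, 0.6], label="some algorithm")  # label= silences the warning
    ax.legend()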

figure out how to autogenerate + load the contributing docs into the notebook

Our contributing docs are currently located in two places: at the bottom of the notebook and in the README. Right now if we want to update those docs, we have to manually update them in both places.

It might be possible to make the notebook automagically load those docs into the document, while having them keep their nice formatting, which would make it easy to write a script to insert the docs in both places, and only have to edit them from one central location.

I played around with %load and %%markdown (https://ipython.org/ipython-doc/dev/interactive/magics.html#magic-load), but didn't get very far. There's also this: http://guido.vonrudorff.de/ipython-notebook-code-output-as-markdown/, which is close to what we want, but its output is a strange kind of markdown that doesn't look like the rest of the document.

This is a good volunteer task for someone who is interested in hacking Jupyter notebooks!
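
For reference, a minimal sketch of the "load shared docs into the Notebook" half of this, assuming the shared text lives in a Markdown file (the filename is hypothetical):

    from IPython.display import Markdown, display

    with open("contribute.md") as f:
        display(Markdown(f.read()))

The open questions are feeding the same file into the README and getting the output cell to render like the rest of the document.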

Graph scale tweaks

One person asked for percentages to always be shown on a 0-100 range and proportions on a 0-1 range. That might be worth trying, though we should also try some log or skewed version of that, allocating more vertical space as you get close to perfection.

Several people have asked us to make the Atari scale logarithmic.

We could implement this by having the scale objects affect the matplotlib axis properties.
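
A minimal sketch of what that could look like, assuming each graph exposes its matplotlib Axes (the data below is made up):

    import matplotlib.pyplot as plt

    fig, ax = plt.subplots()
    ax.plot([2013, 2014, 2015], [400, 4000, 40000])  # hypothetical Atari scores
    ax.set_yscale("log")  # log scale, as several people have requested for Atari
    # for percentage metrics: ax.set_ylim(0, 100); for proportions: ax.set_ylim(0, 1)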

Export taxonomy and data to JSON

We should have some code that can provide a JSON export of all of the data in the notebook, so that other people can grab it for data science projects.
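
A rough sketch of the kind of export this is asking for, assuming an in-memory registry of problem objects (the attribute names here are hypothetical):

    import json

    def export_json(problems, path="progress.json"):
        data = [{"name": p.name,
                 "metrics": [m.name for m in p.metrics]} for p in problems]
        with open(path, "w") as f:
            json.dump(data, f, indent=2, sort_keys=True)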

Investigate normalization of scaling for scores in Atari papers

Some of the Atari data we collected had negative game "scores" for games where the scores are always positive, and one paper mentions this: "Since the scale of scores varies greatly from game to game, we fixed all positive rewards to be 1 and all negative rewards to be −1, leaving 0 rewards unchanged."[1]

We need to make sure the scores we are entering are all normalized in the same way.

[1] Playing Atari with Deep Reinforcement Learning
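
For checking whether a reported "score" was normalized that way, the clipping described in the quotation amounts to the following (a sketch):

    def clip_reward(reward):
        """Map positive rewards to 1, negative rewards to -1, and leave 0 unchanged."""
        return (reward > 0) - (reward < 0)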

Add augmentations used by the image recognition methods

In order to make comparing different image recognition methods easier, it would help if the tables and charts included the augmentations used by each paper. Image recognition can be made easier by augmenting the input data with rotated, flipped, color-distorted, etc. versions of the images. Research papers often intentionally limit augmentation to keep the problems hard and to make comparisons between papers easier.

E.g. from the list it looks like CIFAR-10 is best solved by fractional max pooling. However, there has been a lot of progress since then; that paper just used a more extensive set of image augmentations than most of the other papers.

For CIFAR-10, the standard set of image augmentations applied in image research papers appears to be random horizontal flips and image translations. It would be helpful to distinguish papers that use only these augmentations from the papers that use a more extensive set. I did not check, but I assume this applies to other image recognition tasks and tables as well.
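
For concreteness, a minimal sketch of the "standard" augmentations described above (random horizontal flip plus small random translations), written in plain numpy; the 4-pixel maximum shift is just an assumption:

    import numpy as np

    def standard_augment(image, max_shift=4):
        """image: HxWxC array; flip horizontally half the time, then translate."""
        if np.random.rand() < 0.5:
            image = image[:, ::-1, :]  # random horizontal flip
        padded = np.pad(image, ((max_shift, max_shift),
                                (max_shift, max_shift), (0, 0)), mode="constant")
        dy, dx = np.random.randint(0, 2 * max_shift + 1, size=2)
        h, w = image.shape[:2]
        return padded[dy:dy + h, dx:dx + w, :]  # random crop == translation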

bAbI task 10k are solved, but not 1k

bAbI tasks are in two flavors: with 1k and 10k training examples.

You write: "The Facebook BABI 20 QA dataset is an example of a basic (and now solved) reading comprehension task."

But this is only true in the 10k case, where memorization plays an important part.
For 1k, on which many people have published results (in fact the original paper only uses 1k), it is much harder; see e.g. Table 1 of https://arxiv.org/pdf/1606.04582.pdf

So there are still interesting things to try there, which relate to important problems in neural networks: data efficiency and capacity control, learning compositionality, small-shot learning and task transfer.
