
ai-metrics's Introduction

Measuring the Progress of AI Research

This repository contains a Jupyter Notebook, which you can see live at https://eff.org/ai/metrics. It collects problems and metrics/datasets from the artificial intelligence and machine learning research literature, and tracks progress on them. You can use it to see how things are progressing in specific subfields or in AI/ML as a whole, as a place to report new results you've obtained, as a place to look for problems that might benefit from having new datasets/metrics designed for them, or as a source of data to build on for data science projects.

At EFF we're also interested in collecting this data to understand the likely implications of AI, but to begin with we're focused on gathering it.

Original authors: Peter Eckersley and Yomna Nasser at EFF

With contributions from: Gennie Gebhart and Owain Evans
Inspired by and merging data from several prior collections of AI progress measurements.

How to contribute to this notebook

This notebook is an open source, community effort. You can help by adding new metrics, data and problems to it! If you're feeling ambitious you can also improve its semantics or build new analyses into it. Here are some high level tips on how to do that.

1. If you're comfortable with git and Jupyter Notebooks, or are happy to learn

If you've already worked a lot with git and IPython/Jupyter Notebooks, here's a quick list of things you'll need to do:

  1. Install Jupyter Notebook and git.

    • On an Ubuntu or Debian system, you can do:
      sudo apt-get install git
      sudo apt-get install ipython-notebook || sudo apt-get install jupyter-notebook || sudo apt-get install python-notebook
    • Make sure you have IPython Notebook version 3 or higher. If your OS doesn't provide it, you might need to enable backports, or use pip to install it.
  2. Install this notebook's Python dependencies:

    • On Ubuntu or Debian, do:
          sudo apt-get install python-{cssselect,lxml,matplotlib{,-venn},numpy,requests,seaborn}
    • On other systems, use your native OS packages, or use pip:
          pip install cssselect lxml matplotlib{,-venn} numpy requests seaborn
  3. Fork our repo on github: https://github.com/AI-metrics/AI-metrics#fork-destination-box

  4. Clone the repo on your machine, and cd into the directory it creates

  5. Configure your copy of git to use IPython Notebook merge filters to prevent conflicts when multiple people edit the Notebook simultaneously. You can do that with these two commands in the cloned repo:

    git config --file .gitconfig filter.clean_ipynb.clean $PWD/ipynb_drop_output
    git config --file .gitconfig filter.clean_ipynb.smudge cat
  6. Run Jupyter Notebook in the project directory (the command may be ipython notebook, jupyter notebook, jupyter-notebook, or python notebook, depending on your system), then go to localhost:8888 and edit the Notebook AI-progress-metrics.ipynb to your heart's content

  7. Save and commit your work (git commit -a -m "DESCRIPTION OF WHAT YOU CHANGED")

  8. Push it to your remote repo

  9. Send us a pull request!

2. If you want something very simple

Microsoft Azure has an IPython / Jupyter service that will let you run and modify notebooks from their servers. You can clone this Notebook and work with it via their service: https://notebooks.azure.com/EFForg/libraries/ai-progress. Unfortunately there are a few issues with running the notebook on Azure:

  • arXiv seems to block requests from Azure's IP addresses, so it's impossible to automatically extract information about papers when running the Notebook there
  • The Azure Notebooks service seems to transform Unicode characters in strange ways, creating extra work merging changes from that source

Notes on importing data

  • Each .measure() call is a data point for a specific algorithm on a specific metric/dataset, so one paper will often produce multiple measurements across multiple metrics (see the sketch after this list). It's most important to enter results that were at or near the frontier of best performance on the date they were published, though this isn't a strict requirement: it's also nice to have a sense of the performance of the field, and of algorithms that are otherwise notable even if they aren't the frontier for a specific problem.
  • When multiple revisions of a paper (typically on arXiv) have the same results on some metric, use the date of the first version (the CBTest results in this paper are an example)
  • When subsequent revisions of a paper improve on the original results (example), use the date and scores of the first results, or, if each revision is interesting / on the frontier of best performance, include each revision as a separate measurement
    • We didn't check this carefully for our first ~100 measurement data points :(. In order to denote when we've checked which revision of an arXiv preprint first published a result, cite the specific version (https://arxiv.org/abs/1606.01549v3 rather than https://arxiv.org/abs/1606.01549); that way we can see which previous entries should be double-checked for this form of inaccuracy.
  • Where possible, use a clear short name or acronym for each algorithm. The full paper name can go in the papername field (and is auto-populated for some papers). When matplotlib 2.1 ships we may be able to get nice rollovers with metadata like this. Or perhaps we can switch to D3 to get that type of interactivity.
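
For concreteness, here is a hypothetical sketch of a single data point; the exact signature is defined in the Notebook itself, and some_metric stands in for a metric object created earlier in it, so this is illustrative rather than standalone code:

    from datetime import date

    # date of the first arXiv version reporting the result, the score, and a
    # clear short algorithm name; url/papername are optional metadata
    some_metric.measure(date(2016, 6, 1), 0.9, "SHORT-ALGO-NAME",
                        url="https://arxiv.org/abs/XXXX.XXXXX",
                        papername="Full Paper Title")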

What to work on

  • If you know of ML datasets/metrics that aren't included yet, add them
  • If there are papers with interesting results for metrics that aren't included, add them
  • If you know of important problems that humans can solve, and machine learning systems may or may not yet be able to, and they're missing from our taxonomy, you can propose them
  • Look at our GitHub issue list, perhaps starting with the issues tagged as good volunteer tasks.
  • You can also add missing conferences / journals to the venue-to-date mapping table (unhide the source code and search for conference_dates):
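
For illustration only, an entry in that table has roughly this shape (the real mapping lives in the Notebook source; the venues and dates below are merely examples):

    from datetime import date

    conference_dates = {
        "ICML 2016": date(2016, 6, 19),
        "NIPS 2016": date(2016, 12, 5),
        # add the missing venue and its start date here
    }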

FAQ

Q: What's the point of this project? How does it tie in with the EFF's mission?

Given that machine learning tools and AI techniques are increasingly part of our everyday lives, it is critical that journalists, policy makers, and technology users understand the state of the field. When improperly designed or deployed, machine learning methods can violate privacy, threaten safety, and perpetuate inequality and injustice. Stakeholders must be able to anticipate such risks and policy questions before they arise, rather than playing catch-up with the technology. To this end, it's part of the responsibility of researchers, engineers, and developers in the field to help make information about their life-changing research widely available and understandable.

Q: Why haven't you included dataset X?

There are a tiny number of us and this is a large task! If you'd like to add more data, please send us a pull request.

Q: Do you track other things besides how well-solved particular tasks are? For instance, the speed and efficiency of training?

No, but we'd love to. If you are motivated to help organize that data, please dive in and improve the notebook!

Q: Have you thought about how to visualise this data and make it more accessible?

We've considered a variety of things, but decided that the IPython notebook was ultimately the most accessible option for now. We're very open to suggestions about visualizations and accessibility, so feel free to reach out if you have any ideas!

Also, if you'd like to build visualizations on top of this project, all the data we use is available in easily digestible JSON format in progress.json. If you do so, let us know and we'll try to link to it.

Q: Is this an EFF project?

Yes, but we'd like it to grow to be a self-sustaining community effort supported by a coalition of organizations. EFF did the initial work of making the Notebook, but we built on several excellent datasets collected by many other people, and had a number of productive collaborative discussions, most especially with people at OpenAI and the Future of Humanity Institute, in preparing the document. We will strive to keep the authorship section of the Notebook accurate as others continue to contribute.

Q: When will artificial general intelligence happen?

We don't know, and this project is not meant to answer that question. Instead, we're interested in compiling data to guide evidence-based conversations about the state of the art in various corners of AI and machine learning research.

ai-metrics's People

Contributors

drschwenk, gegebhart, guicho271828, jackclarksf, jamesscottbrown, jekbradbury, lschatzkin, mfb, miroli, owainevans, pde, thewolf-aws, void-elf, ynasser


ai-metrics's Issues

External hyperlink navigation is broken for some links

Because EFF serves the Notebook (nbconvert --execute --to html) in an iframe, it seems that some of the external links are broken unless you right-click and choose "open in new tab/window". Some others load within the iframe, which certainly isn't what we want.

[CSS] Right-align the card2code images

There's a <div> containing sample images of MTG and HS cards, and a screenshot of Java code for the MTG card. It should really be right-aligned with the main text flowing past on the left, but I'm failing to make that happen in CSS. Perhaps someone who knows CSS better can help...

Make a metric.table!

... in the bottom right corner, including an "edit on github" button.

As a last resort, use: from IPython.display import HTML.
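
If that fallback is needed, here's a minimal sketch of the idea, assuming we simply render an HTML string (the table contents below are placeholders):

    from IPython.display import HTML, display

    rows = "".join("<tr><td>%s</td><td>%s</td></tr>" % (name, value)
                   for name, value in [("hypothetical-algorithm", 0.9)])
    display(HTML("<table>" + rows + "</table>"
                 + '<a href="https://github.com/AI-metrics/AI-metrics">edit on github</a>'))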

Tables are mangled on mobile

There's a block of whitespace overlaying the right hand side of the tables on mobile devices, which makes them unreadable.

Replace "solved" with something more nuanced

In different cases, it can be different things:

  • Parity with typical human performance?
  • Parity with best human performance?
  • Parity with some handcrafted algorithmic benchmark?

HTML encoding bug

When exporting as HTML, URLs need to be HTML-encoded, e.g. http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.88.9924&rep=rep1&type=pdf should become http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.88.9924&amp;rep=rep1&amp;type=pdf
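
A minimal sketch of the fix, assuming the URLs are plain strings at the point where the HTML is generated:

    import html  # Python 3 standard library; cgi.escape plays this role on Python 2

    url = ("http://citeseerx.ist.psu.edu/viewdoc/download"
           "?doi=10.1.1.88.9924&rep=rep1&type=pdf")
    print(html.escape(url))  # ampersands come out as &amp;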

Missing text in adversarial_examples.notes

adversarial_examples.notes = """
We know that humans have significant resistance to adversarial examples.  Although methods like camouflage sometimes
work to fool us into thinking one thing is another, those
"""

Rest of the sentence is missing.

VQA: Is LSTM Q+I info correct?

I'm concerned that the reported LSTM Q + I data may be incorrect. Looking at the different versions, I don't think LSTM Q + I was included in the analysis until v3 of the initial VQA dataset paper, published on October 15th 2015 (https://arxiv.org/abs/1505.00468v3). Additionally, v3 is the first version of the paper that explicitly reports any result on the test-standard split used throughout the VQA 1.0 evaluations. LSTM Q + I is reported (because of its superior performance on test-dev), and it achieves 54.06% accuracy in the open-answer framework. The 58.2% currently reported in the repo for LSTM Q + I seems to come from v5 onward, which was not submitted to arXiv until March 7 2016.

On the multiple choice side of things, it is not until v5 (https://arxiv.org/abs/1505.00468v5) that the initial VQA dataset paper explicitly states a best result on "test-standard". The 63.1% figure that is currently reported in this repo lines up with the results in the v5 paper, but again, the v5 paper was not submitted to arXiv until 7 Mar 2016.

I believe the following actions should be taken:

  • v3 of the initial VQA paper should be the starting point of the open-ended evaluations (15 Oct 2015 on arxiv).
  • It should be reflected that this initial v3 VQA paper reports an accuracy of 54.06% on test-standard, not 58.2%.
  • LSTM Q + I should not be the starting point of the multiple-choice evaluations, but it should be included in the graph on March 7th 2016. The earliest results I have for test-standard results in the multiple choice framework are from the DPPnet paper: https://arxiv.org/abs/1511.05756. Just one version on arxiv, November 18 2015.

Finish sentence: "When matplotlib 2.1 ships"

I'm proofreading the notebook and am not sure how this sentence is supposed to end:

"The full paper name can go in the papername field (and is auto-populated for some papers). When matplotlib 2.1 ships" ... ?

Anaphora resolution

I recently learned about Winograd Schemas

https://en.wikipedia.org/wiki/Winograd_Schema_Challenge

which are an approach to formalizing the testing of anaphora resolution (figuring out what a pronoun refers to when this can only be done from context and knowledge about the world). For example, in the pair of sentences

The doctor examined the patient because she felt ill.

The doctor examined the patient because she was familiar with endocrine disorders.

the word she in the first sentence refers to the patient, but in the second sentence to the doctor, because feeling ill is a reason for someone to be examined by a doctor, while medical expertise is a reason for someone to examine a patient. Or in a somewhat similar vein

Mahayana Buddhist monks may not eat carne asada burritos because they are not vegetarian.

Mahayana Buddhist monks may not eat tacos al pastor because they are vegetarian.

(In one case, they [are/are not] vegetarian refers to people, and in the other case to food items. For the Winograd schema case we would probably just say "Mahayana Buddhist monks may not eat carne asada burritos because they [are/are not] vegetarian", where then the word they will refer either to people or to food items.)

This is a remarkably difficult natural language processing problem because particular instances of it can test arbitrary knowledge about the world and the ability to make likely inferences about relationships, institutions, abilities, and motivations.

In Winograd schemas, a single change in a sentence flips the most natural resolution of a pronoun to refer to a different entity, and the question is to figure out which entity is referred to by each version of the sentence.

The Wikipedia article shows that there are several attempts to test how well different AIs and natural language processing systems can solve Winograd schema problems. Are any tests of this kind already included in any of the reading comprehension tasks? If not, they probably should be!

Sort out VQA real vs abstract

At the moment, the VQA data assumes that results are for a combined real+abstract dataset. But at least some of the results are in fact just for "real". So we need to fix that...

Add all the interesting Atari games

There are lots; we should add all the ones that have 3+ reports in the literature or which appear to be interesting for some conceptual reason.

Measurements shouldn't be full paper names

In some cases we have measurements whose name is an algorithm, like PixelCNN++. In other cases we have measurements whose name is a paper.

The former is better, because it allows cleaner graphs and it allows one paper to generate multiple measurement datapoints if the paper reports new results from multiple algorithms (which is common).

Multipanel Atari graphs

There are a lot of Atari games, and their graphs are currently limited in the number of data points they contain, so it would be great to have them displayed in a slightly less voluminous way, perhaps in two columns at 50% of their current size.

This page gives some examples of how to do this kind of thing.

Yomna, if you can make a graph_atari_games function that renders them all in a nice way, I'll figure out how to get rid of the current Atari graphs at the bottom of the Notebook.
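
A possible sketch of that function, assuming each game already has a routine that can draw its graph onto a supplied matplotlib Axes (plot_game below is a hypothetical per-game helper):

    import math
    import matplotlib.pyplot as plt

    def graph_atari_games(games, plot_game, cols=2):
        rows = int(math.ceil(len(games) / float(cols)))
        fig, axes = plt.subplots(rows, cols, figsize=(8, 2.5 * rows))
        axes = axes.flatten()
        for ax, game in zip(axes, games):
            plot_game(game, ax=ax)
            ax.set_title(game)
        for ax in axes[len(games):]:  # hide any unused panels
            ax.set_visible(False)
        fig.tight_layout()
        return fig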

Broken export of progress.json

The export_json function in AI-progress-metrics.ipynb contains the line:

problem_data[problem_attr] = map(lambda x: x.name, getattr(problem, problem_attr))

However, in Python 3 the map function returns a map object rather than a list, so the most recently committed progress.json contains unhelpful things like:

        "subproblems": "<map object at 0x7faccf3be4a8>",
        "superproblems": "<map object at 0x7faccf3bec50>",

To fix this, the line should be changed to:

problem_data[problem_attr] = list(map(lambda x: x.name, getattr(problem, problem_attr)))

Audit all the scale code

There are some wonky things in there. For instance

  • why is news-test-2014 en-fr BLEU labelled as solved?
  • is the use of log to order and linearise proportion-based metrics correct?
  • should we ever use percentages vs. proportions? Or allow either? Or allow them but force them to be something like proportions at the end?

Figure out how to embed images in the converted notebook more efficiently

Currently, we produce the hosted copy of the notebook with ipython nbconvert --execute --to html. That produces a giant HTML blob with very large embedded PNG graphs. That HTML blob is huuuugge, 11MB and counting. It would be nice to (1) embed SVG graphs instead, which should be much smaller, and (2) split them out into separate files so the page can render before the graphs have been downloaded.

This is a good volunteer task if you're knowledgeable about Jupyter Notebook, Matplotlib and nbconvert...
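
One possible approach for (1), sketched rather than tested against this build: ask the inline backend for SVG output before any figures are drawn, so nbconvert embeds SVG rather than large PNGs.

    # run near the top of the Notebook, before any plots are produced
    from IPython.display import set_matplotlib_formats
    set_matplotlib_formats('svg')

Splitting the figures into separate files, point (2), would still need additional nbconvert work.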

Mysterious warning from VQA graph

We get this strange warning from matplotlib when producing the first VQA graph:

/usr/lib/python2.7/dist-packages/matplotlib/axes/_axes.py:545: UserWarning: No labelled objects found. Use label='...' kwarg on individual plots.
warnings.warn("No labelled objects found. "
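
This warning usually just means legend() was called on an Axes where none of the plot() calls set a label, so a likely fix is along these lines (the names and numbers below are illustrative):

    import matplotlib.pyplot as plt

    fig, ax = plt.subplots()
    ax.plot([2015, 2016], [0.5, 0.6], label="some algorithm")  # label= silences the warning
    ax.legend()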

figure out how to autogenerate + load the contributing docs into the notebook

Our contributing docs are currently located in two places: at the bottom of the notebook and in the README. Right now if we want to update those docs, we have to manually update them in both places.

It might be possible to make the notebook automagically load those docs into the document, while having them keep their nice formatting, which would make it easy to write a script to insert the docs in both places, and only have to edit them from one central location.

I played around with %load and %%markdown (https://ipython.org/ipython-doc/dev/interactive/magics.html#magic-load), but didn't get very far. There's also this: http://guido.vonrudorff.de/ipython-notebook-code-output-as-markdown/, which is close to what we want, but its output is a strange kind of markdown that doesn't look like the rest of the document.

This is a good volunteer task for someone who is interested in hacking Jupyter notebooks!
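
For reference, a minimal sketch of the "load shared docs into the Notebook" half of this, assuming the shared text lives in a Markdown file (the filename is hypothetical):

    from IPython.display import Markdown, display

    with open("contribute.md") as f:
        display(Markdown(f.read()))

The open questions are feeding the same file into the README and getting the output cell to render like the rest of the document.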

Graph scale tweaks

One person asked for percentages to always be shown on a 0-100 range and proportions on a 0-1 range. That might be worth trying, though we should also try some log or skewed version of that, allocating more vertical space as you get close to perfection.

Several people have asked us to make the Atari scale logarithmic.

We could implement this by having the scale objects affect the matplotlib axis properties.
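
A minimal sketch of what that could look like, assuming each graph exposes its matplotlib Axes (the data below is made up):

    import matplotlib.pyplot as plt

    fig, ax = plt.subplots()
    ax.plot([2013, 2014, 2015], [400, 4000, 40000])  # hypothetical Atari scores
    ax.set_yscale("log")  # log scale, as several people have requested for Atari
    # for percentage metrics: ax.set_ylim(0, 100); for proportions: ax.set_ylim(0, 1)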

Export taxonomy and data to JSON

We should have some code that can provide a JSON export of all of the data in the notebook, so that other people can grab it for data science projects.
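
A rough sketch of the kind of export this is asking for, assuming an in-memory registry of problem objects (the attribute names here are hypothetical):

    import json

    def export_json(problems, path="progress.json"):
        data = [{"name": p.name,
                 "metrics": [m.name for m in p.metrics]} for p in problems]
        with open(path, "w") as f:
            json.dump(data, f, indent=2, sort_keys=True)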

Investigate normalization of scaling for scores in Atari papers

Some of the Atari data we collected had negative game "scores" for games where the scores are always positive, and one paper mentions this: "Since the scale of scores varies greatly from game to game, we fixed all positive rewards to be 1 and all negative rewards to be −1, leaving 0 rewards unchanged."[1]

We need to make sure the scores we are entering are all normalized in the same way.

[1] Playing Atari with Deep Reinforcement Learning
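
For checking whether a reported "score" was normalized that way, the clipping described in the quotation amounts to the following (a sketch):

    def clip_reward(reward):
        """Map positive rewards to 1, negative rewards to -1, and leave 0 unchanged."""
        return (reward > 0) - (reward < 0)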

Add augmentations used by the image recognition methods

In order to make comparing different image recognition methods easier, it would help if the tables and charts included the augmentations used by each paper. Image recognition can be made easier by augmenting the input data with rotated, flipped, color-distorted, etc. versions of the images. Research papers often intentionally limit augmentation to keep the problems hard and to make comparisons between papers easier.

E.g. from the list it looks like CIFAR-10 is best solved by fractional max pooling. However, there has been a lot of progress since then; that paper just used a more extensive set of image augmentations than most of the other papers.

For CIFAR-10, the standard set of image augmentations applied in image research papers appears to be random horizontal flips and image translations. It would be helpful to distinguish papers that use only these augmentations from the papers that use a more extensive set. I did not check, but I assume this applies to other image recognition tasks and tables as well.
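
For concreteness, a minimal sketch of the "standard" augmentations described above (random horizontal flip plus small random translations), written in plain numpy; the 4-pixel maximum shift is just an assumption:

    import numpy as np

    def standard_augment(image, max_shift=4):
        """image: HxWxC array; flip horizontally half the time, then translate."""
        if np.random.rand() < 0.5:
            image = image[:, ::-1, :]  # random horizontal flip
        padded = np.pad(image, ((max_shift, max_shift),
                                (max_shift, max_shift), (0, 0)), mode="constant")
        dy, dx = np.random.randint(0, 2 * max_shift + 1, size=2)
        h, w = image.shape[:2]
        return padded[dy:dy + h, dx:dx + w, :]  # random crop == translation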

bAbI task 10k are solved, but not 1k

bAbI tasks are in two flavors: with 1k and 10k training examples.

You write: "The Facebook BABI 20 QA dataset is an example of a basic (and now solved) reading comprehension task."

But this is only true in the 10k case, where memorization plays an important part.
For 1k, on which many people have published results (in fact the original paper only uses 1k), it is much harder; see e.g. Table 1 of https://arxiv.org/pdf/1606.04582.pdf

So there are still interesting things to try there, which relate to important problems in neural networks: data efficiency and capacity control, learning compositionality, small-shot learning and task transfer.
