snorkel's People

Contributors

4d4stra, ajratner, alldefector, anerirana, bhancock8, brahmaneya, bryanhe, catalinvoss, danich1, dependabot[bot], dhimmel, fpoms, fsonntag, hangyao, hardianlawi, henryre, humzaiqbal, jason-fries, jasontlam, larskarg, lukehsiao, moreymat, netj, paidi, paroma, pmlandwehr, regoldman, stephenbach, thammegowda, vincentschen

snorkel's Issues

Parallelize nltk CoreNLP parser in simple way

The emphasis is on simple: this is not going to be an optimal preprocessing setup either way; we just want to make it a bit better through simple means that don't require any additional installs, configs, etc.
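One simple option, sketched here using only the standard library, is to fan sentences out over a thread pool. `parse_sentence` is a hypothetical stand-in for the real parser call, and threads are assumed to help only if that call spends most of its time waiting on an external parser process or server.

```python
from concurrent.futures import ThreadPoolExecutor


def parse_sentence(text):
    """Hypothetical stand-in for a call to the CoreNLP parser."""
    return text.split()


def parse_all(texts, max_workers=4):
    """Parse texts concurrently; results come back in input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(parse_sentence, texts))


docs = ["Snorkel labels data .", "Parsing is slow ."]
parsed = parse_all(docs)
```

If the parser turns out to be CPU-bound in Python, swapping in `ProcessPoolExecutor` is the same one-line change.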

Error analysis workflow 1

  1. User gets random subsample of candidates in Mindtagger, and labels them
  2. User gets statistics over the labeling functions, some w.r.t. this label set (e.g. empirical accuracy)
  3. Learn model
  4. Get precision stats
  5. Log (?) and repeat

Stats to show for label function development:

  • Coverage
  • Overlap
  • Conflict
  • Empirical accuracy
  • Show labeling functions and/or candidates that are conflict-heavy (plus labeling functions with low empirical accuracy)
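A minimal sketch of how the first four statistics could be computed, assuming a label matrix with one row per candidate and one column per labeling function, values in {-1, 0, +1} with 0 meaning "abstain", and a gold label for every row. The function name `lf_stats` and these definitions are assumptions for illustration, not an existing API.

```python
def lf_stats(L, gold):
    """Per-labeling-function coverage, overlap, conflict and accuracy."""
    n = len(L)
    stats = []
    for j in range(len(L[0])):
        covered = overlap = conflict = correct = 0
        for row, y in zip(L, gold):
            if row[j] == 0:
                continue  # this LF abstained on this candidate
            covered += 1
            others = [v for k, v in enumerate(row) if k != j and v != 0]
            if others:
                overlap += 1  # at least one other LF also voted
            if any(v != row[j] for v in others):
                conflict += 1  # at least one other LF disagreed
            if row[j] == y:
                correct += 1
        stats.append({
            "coverage": covered / n,
            "overlap": overlap / n,
            "conflict": conflict / n,
            "accuracy": correct / covered if covered else None,
        })
    return stats


L = [
    [1, 1, 0],
    [1, -1, 0],
    [0, 0, 0],
    [-1, 0, -1],
]
gold = [1, 1, -1, -1]
stats = lf_stats(L, gold)
```

Sorting the result by `conflict` (descending) or `accuracy` (ascending) gives the conflict-heavy and low-accuracy labeling functions from the last bullet.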

Questions:

  • Should we be prescriptive, and automatically (opaquely) split their label set into a "label fn. validation set" and a test set (as a default option which can be turned off)?
  • How to integrate ground truth that they bring in externally?

Add raw (untokenized) text as attribute to Sentence object

For regex matching, it would be very helpful to have access to the text of a single sentence without any tokenization. When tagging chemical names, for example, we frequently get tokenization artifacts like these:
Li ( 3 ) PS ( 4 ) vs. Li(3)PS(4)
Some of these can be fixed with modified regexes, but it would be nice to operate on the original text itself. As far as mapping back to tokens for entity tags, we could just consider a match as anything that overlaps with the original span.
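The overlap-based mapping could look roughly like this, assuming each token carries its character span in the raw sentence text. The helper name `tokens_overlapping` and the `token_spans` layout are hypothetical.

```python
import re


def tokens_overlapping(pattern, raw_text, token_spans):
    """For each regex match on the raw text, list the overlapping tokens.

    token_spans holds one (start, end) character span per token,
    with end exclusive.
    """
    results = []
    for m in re.finditer(pattern, raw_text):
        idxs = [i for i, (start, end) in enumerate(token_spans)
                if start < m.end() and end > m.start()]
        results.append((m.group(0), idxs))
    return results


raw = "We studied Li(3)PS(4) today."
# Spans for: We studied Li ( 3 ) PS ( 4 ) today .
spans = [(0, 2), (3, 10), (11, 13), (13, 14), (14, 15), (15, 16),
         (16, 18), (18, 19), (19, 20), (20, 21), (22, 27), (27, 28)]
hits = tokens_overlapping(r"[A-Z][a-z]?\(\d\)[A-Z]+\(\d\)", raw, spans)
```

Here the regex sees `Li(3)PS(4)` as one string, and the match is mapped back to the eight tokens it covers.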

Create a simple DocParser class

Desired initial functionalities:

  • Take as input a directory filepath, a single filename, a filename pattern
  • Strip XML, HTML (i.e. strip tags without corrupting basic sentence structures)

Ideally there would be some simple way to extend so that users could write basic XML/HTML parser modules (e.g. to grab metadata, preserve section structure, etc.) via some python library (e.g. lxml, beautifulsoup). This kind of solution would not be performant, but could potentially be very simple...
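As a rough starting point, both pieces can be sketched with the standard library alone. The names `iter_paths` and `strip_tags` are hypothetical, and the tag stripping below simply concatenates text nodes, so it makes no attempt to keep section structure or to skip script content.

```python
import glob
import os
from html.parser import HTMLParser


def iter_paths(source):
    """Accept a directory, a single filename, or a filename pattern."""
    if os.path.isdir(source):
        return sorted(os.path.join(source, f) for f in os.listdir(source))
    return sorted(glob.glob(source))


class TextExtractor(HTMLParser):
    """Collect the text nodes of a document and drop the tags."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)


def strip_tags(markup):
    """Return the text content of markup with whitespace normalized."""
    extractor = TextExtractor()
    extractor.feed(markup)
    extractor.close()
    return " ".join("".join(extractor.chunks).split())


text = strip_tags("<p>Hello <b>world</b>.</p>")
```

A user-supplied parser module could then be a subclass of `TextExtractor` that overrides `handle_starttag` to capture metadata.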

DB / DeepDive connectivity

One simple way to have db connectivity in the notebook is our favorite extension, ipython-sql. We could initially just build some helper functions around this (or any other psql connector).

However, in DDL we pass around an object containing the entire dataset (Relations). This would allow us to connect to the database in a way that is opaque to the user, essentially turning this Relations object into a cache for the DeepDive db...

What else?

Reload tags in MindTagger

When we reopen MindTagger, we can keep the same sample as before, but the tags are not reloaded. @netj is there a nice way to do this with the API like how we retrieve the tags, or should we form and dump tags.json to the instance directory?

Output marginals + calibration plots + histograms

We need to output the basic deepdive calibration plots (notebooks are perfect for this!), as well as potentially some other histograms which guide users towards correct error analysis / debugging procedures.

We also need to output the marginals, which is a minor sub-function to add in.
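The numbers behind a calibration plot can be sketched as follows, assuming marginals in [0, 1] and held-out labels coded as 1 (true) or 0 (false). The function name and bin layout are assumptions; this is not the exact DeepDive plot format.

```python
def calibration_bins(marginals, labels, n_bins=10):
    """Bucket marginals and compare mean prediction to observed accuracy.

    Returns one (bin_index, count, mean_marginal, fraction_positive)
    tuple per non-empty bin.
    """
    bins = [[] for _ in range(n_bins)]
    for p, y in zip(marginals, labels):
        idx = min(int(p * n_bins), n_bins - 1)  # p == 1.0 goes in last bin
        bins[idx].append((p, y))
    rows = []
    for i, bucket in enumerate(bins):
        if not bucket:
            continue
        mean_p = sum(p for p, _ in bucket) / len(bucket)
        frac_pos = sum(y for _, y in bucket) / len(bucket)
        rows.append((i, len(bucket), mean_p, frac_pos))
    return rows


marginals = [0.05, 0.15, 0.85, 0.95, 0.95]
labels = [0, 0, 1, 1, 0]
rows = calibration_bins(marginals, labels, n_bins=2)
```

Plotting `mean_marginal` against `fraction_positive` gives the calibration curve, and the `count` column is the histogram of marginals.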

Write documentation

Example notebooks are the de facto documentation. Example notebooks are not actually documentation.

Add dependency tree helper functionality

E.g. the user should be able to access a path_between attribute of a Relation object, etc. This is currently being done with treedlib, but I am trying to decouple these two repos. We could, however, bring this back in under the covers in a more limited form (e.g. Relation objects initialize with several dependency path attributes like path_between, but don't expose the direct XPath mechanisms to the user).
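For illustration only, a path_between value can be computed without XPath from a list of head indices. This is a plain tree walk, not how treedlib does it, and the `heads` representation (index of each token's head, -1 for the root) is an assumption.

```python
def path_between(heads, a, b):
    """Token indices on the dependency path from token a to token b.

    heads[i] is the index of token i's head, or -1 for the root.
    Assumes a well-formed tree (single root, no cycles).
    """
    def ancestors(i):
        chain = [i]
        while heads[chain[-1]] != -1:
            chain.append(heads[chain[-1]])
        return chain

    up_a, up_b = ancestors(a), ancestors(b)
    on_b = set(up_b)
    # Lowest common ancestor: first node above a that is also above b.
    lca = next(n for n in up_a if n in on_b)
    left = up_a[:up_a.index(lca) + 1]
    right = up_b[:up_b.index(lca)]
    return left + right[::-1]


# "Aspirin treats headaches": both nouns attach to the verb.
heads = [1, -1, 1]
path = path_between(heads, 0, 2)
```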

Save MindTagger Output

Refine saving and loading annotation dumps from MindTagger during DSR refinement. I realize "items.csv" is dumped in the MindTagger directory under some unique folder id and tags can be fetched using get_mindtagger_tags() on the MindTagger instance, but the metrics associated with these values should be wrapped up in some sort of "classification_report" type function.
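A "classification_report"-style wrapper might look like the sketch below. It assumes the tags have already been fetched and reduced to a dict mapping candidate id to True/False; that shape, and the function name, are assumptions rather than the actual return format of get_mindtagger_tags().

```python
def classification_report(predictions, tags):
    """Precision, recall and F1 of predictions against human tags.

    Both arguments map candidate id -> bool. Only tagged candidates
    are scored; an untagged prediction is ignored.
    """
    tp = sum(1 for c, t in tags.items() if t and predictions.get(c, False))
    fp = sum(1 for c, t in tags.items()
             if not t and predictions.get(c, False))
    fn = sum(1 for c, t in tags.items()
             if t and not predictions.get(c, False))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall:
        f1 = 2 * precision * recall / (precision + recall)
    else:
        f1 = 0.0
    return {"tp": tp, "fp": fp, "fn": fn,
            "precision": precision, "recall": recall, "f1": f1}


tags = {"c1": True, "c2": False, "c3": True, "c4": True}
predictions = {"c1": True, "c2": True, "c3": False, "c4": True}
report = classification_report(predictions, tags)
```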

Create a RegexMatch entity mention operator

This will be almost identical to the existing DictionaryMatch operator.

This RegexMatch operator, if essentially copied from DictionaryMatch, will be able to trivially do e.g. POS tag sequence matches (using match_attrib=poses); however this could be wrapped / presented as a separate operator...?
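A sketch of the idea, assuming a sentence is a dict of parallel token attribute lists (`words`, `poses`). The function `regex_match` is hypothetical and does not mirror the real DictionaryMatch interface; it joins the chosen attribute with spaces, runs the regex, and keeps only matches aligned to token boundaries.

```python
import re


def regex_match(pattern, sentence, match_attrib="words"):
    """Return inclusive (first, last) token index pairs for each match."""
    seq = sentence[match_attrib]
    joined = " ".join(seq)
    start_of, end_of = {}, {}
    pos = 0
    for i, tok in enumerate(seq):
        start_of[pos] = i             # char offset where token i starts
        end_of[pos + len(tok)] = i    # char offset where token i ends
        pos += len(tok) + 1
    matches = []
    for m in re.finditer(pattern, joined):
        # Skip matches that begin or end in the middle of a token.
        if m.start() in start_of and m.end() in end_of:
            matches.append((start_of[m.start()], end_of[m.end()]))
    return matches


sentence = {
    "words": ["The", "BRCA1", "gene", "mutates"],
    "poses": ["DT", "NN", "NN", "VBZ"],
}
noun_runs = regex_match(r"NN( NN)*", sentence, match_attrib="poses")
```

The same function with the default `match_attrib="words"` covers ordinary word-level regexes, which is why a separate POS operator would only be a thin wrapper.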

Fix parentheses encoding

Not a big deal, but parentheses show up as -LRB- and -RRB- in MindTagger, which looks a lot like a gene to the underinformed.
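One possible fix is to map the Penn Treebank bracket escapes back to literal brackets before display. The helper name `restore_brackets` is hypothetical.

```python
# Penn Treebank escape tokens and the brackets they stand for.
PTB_BRACKETS = {
    "-LRB-": "(", "-RRB-": ")",
    "-LSB-": "[", "-RSB-": "]",
    "-LCB-": "{", "-RCB-": "}",
}


def restore_brackets(tokens):
    """Replace PTB bracket escapes with the literal characters."""
    return [PTB_BRACKETS.get(tok, tok) for tok in tokens]


shown = restore_brackets(["Li", "-LRB-", "3", "-RRB-", "PS"])
```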

Add "smart" Viewer sampling

Rather than a fully random sample, do we want

  • Some all-abstained candidates?
  • Some high conflict candidates?
  • Some low conflict candidates?
  • Some candidates with probability close to 0.5?
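A possible shape for such a sampler, assuming the same {-1, 0, +1} label matrix convention (0 = abstain) and one marginal per candidate. The name `smart_sample` is hypothetical, and conflict is treated here as binary (any disagreement versus none) rather than graded.

```python
import random


def smart_sample(L, marginals, k=2, seed=0):
    """Pick up to k candidate indices from each group of interest."""
    rng = random.Random(seed)

    def has_conflict(row):
        return 1 in row and -1 in row

    abstained = [i for i, row in enumerate(L) if not any(row)]
    conflicted = [i for i, row in enumerate(L) if has_conflict(row)]
    agreed = [i for i, row in enumerate(L)
              if any(row) and not has_conflict(row)]
    # Candidates whose marginal is closest to 0.5 come first.
    by_uncertainty = sorted(range(len(L)),
                            key=lambda i: abs(marginals[i] - 0.5))

    def pick(pool):
        return rng.sample(pool, min(k, len(pool)))

    return {
        "all_abstained": pick(abstained),
        "high_conflict": pick(conflicted),
        "low_conflict": pick(agreed),
        "uncertain": by_uncertainty[:k],
    }


L = [[0, 0], [1, -1], [1, 1], [1, 0], [0, 0]]
marginals = [0.5, 0.48, 0.9, 0.8, 0.1]
sample = smart_sample(L, marginals, k=2)
```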

Refactor Extractions so it isn't a big state machine?

Would be good to separate the concepts of Extractions as a data container/operator and the learning algorithms it implements? Should also probably implement Relation and Entity using a proxy pattern.

@ajratner let's talk before deciding either way on this?

edit: Adding question marks so it sounds like a question?
