Giter Club home page Giter Club logo

paper_comparison_internal's Introduction

Paper Comparison

Installation

pip install -e ".[dev]"

High quality initial dataset

For the data in data/arxiv_tables_2308_high_quality, there are three relevant files:

  • tables.jsonl: a list of 30 papers. Each line is a json object with the following fields:
    • tabid: a unique hash for the table that comes from the latex-processing pipeline
    • table: a python dictionary dump of the table created frompd.DataFrame.to_dict(). Can be loaded back into pandas with pd.DataFrame(table['table']). The index of the data frame is the the corpus_id of the full text of the paper (which can be accessed at data/full_texts/{corpus_id}.jsonl).
    • row_bib_map: a list of json objects, where each represents a citation from the table. (This shouldn't be needed anymore, but I'm leaving it in in case it's useful for now. Will probably be deleted soon.) Each object contains:
      • the corpus_id of the full text of the paper
      • the row in the table that corpus_id corresponds to
      • the type of the citation - "ref" means it's an external reference and "ours" means that the paper containing the table is represented in the given row.
      • the bib_hash_or_arxiv_id associated with that corpus_id. arxiv_id is used when the type is "ours" and the bib_hash is used when it's "ref". The bib_hash comes from the latex-processing pipeline and it's a function of the citation text and the citing paper's arxiv_id.
  • papers.jsonl: a list of all the papers needed for generating the tables. Each paper entry contains:
    • tabids: a list of ids for the tables the paper is cited in.
    • corpus_id, the corpus id of a paper. If the full text is available, it can be accessed at data/full_texts/{corpus_id}.jsonl. (For the small subset, all of these texts should be available.)
    • title: the title of the paper from s2.
    • paper_id: the s2 id of the paper.
  • dataset.jsonl: contains a superset of the information in tables.jsonl (it additionally includes an html version of the table from which the json format was derived, some artefacts from the html->json conversion, and some additional metadata about the papers and tables)

These are loaded in using the configs/full_texts config, with the data.FullTexts class performing the loading part.

python paper_comparison/run.py data=full_texts data.path=data/arxiv_tables_2308_high_quality endtoend=debug endtoend.name=oracle

(Setting endtoend.name=oracle just has the system output the gold tables.)

ArXiv data processing

See scripts/data_processing

Running

When running, you need to pass some input example, and either a baseline or a schematizer and populator.

To run an end-to-end model on a single example:

python paper_comparison/run.py data.path=data/debug_abstracts endtoend=debug

To run a schematizer-populator combination on a single example:

python paper_comparison/run.py data.path=data/debug_abstracts schematizer=followup_questions populator=qa

After running, the results path will be output to the terminal. This path contains three files:

  • tables.jsonl: a file with one table per line
  • metrics.jsonl: a file with table metrics
  • commands.txt: a list of commands that were run to obtain the results in the directory. Most recent at the bottom.

Data

Currently all data is in sub-directories of the data path. These directories are passed to the data.path field at the command line. In these directories, there are one or two files: one (usually called papers.jsonl) has one paper json object per line. For example, one line will have the following fields:

{
    "tabids": ["<Table IDs. All papers that will be compared should have the same Table ID>. This is a list because papers can"],
    "paperid": "<Paper ID. A string used as the name of the row in the table. It's used to link tables to papers."
    <other information eg abstracts, title, authors, etc.>
}

If this data is being used for training or supervised evaluation, there will also be gold tables in this directory (in a file usually called tables.jsonl). Each line in the table has the following structure:

{
    "tabid" : "<Table ID",
    "table": {
        "row1": {"paperid_1": "value", "paperid_2": "value"},
        "row2": {"paperid_1": "value", "paperid_2": "value"},
        ...
    }
}

Each of these files can be overridden individually by setting the data.papers_path or data.tables_path at the command line. E.g.:

python paper_comparison/run.py data.papers_path=data/debug_abstracts/papers.jsonl data.tables_path=data/debug_abstracts/tables.jsonl endtoend=debug

Training

(Not implemented yet) For training, train one of endtoend, schematizer, or populator. The relevant configs must have training details in the training field of the config.

Training an end-to-end model:

python paper_comparison/train.py train_data=<data_config> val_data=<data_config> endtoend=<endtoend_config> trainer=<train_config>

Training a schematizer model:

python paper_comparison/train.py train_data=<data_config> val_data=<data_config> schematizer=<schematizer_config>

Training a populator model requires specifying a schematizer to use as well:

python paper_comparison/train.py train_data=<data_config> val_data=<data_config> schematizer=<schematizer_config> populator=<populator>

Code Structure

  • loaders: Code for loading in tables (in json format) and papers (using papermage) [Not implemented yet]
  • endtoend.py: Transform collections of papers directly into tables. Used for baselines.
  • schematizer.py: Transform collections of papers into schema
  • populator.py: Tansforms papers + schema into table
  • eval.py: Evaluates tables
  • utils.py: Various experiment utilities.

paper_comparison_internal's People

Contributors

bnewm0609 avatar aakanksha19 avatar kyleclo avatar yjo2lee avatar

Watchers

Pao Siangliulue avatar  avatar  avatar  avatar Raymond Fok avatar  avatar

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    ๐Ÿ–– Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. ๐Ÿ“Š๐Ÿ“ˆ๐ŸŽ‰

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google โค๏ธ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.