automating-metadata

A central place to drop all of our experiments! As a disclaimer: we're still in the experimentation phase, so much of this code will change.

The core of this project is to see how we can use LLMs and other available technology to address the following problems (not an exhaustive list):

  1. Metadata is non-standard/incomparable.
  2. Metadata is inconsistently applied.
  3. Metadata is inflexible.

Together, these problems make relevant literature difficult to connect: the decision of what metadata to include is left up to researchers or journals, and metadata is only useful when it is applied in a standard way. We end up with inconsistent metadata that can't connect research projects either to their own research objects, such as the code and data associated with them, or to other papers that might be relevant. These issues make academic search, and building an accurate 'map' of science, extremely difficult.

We hope to see how we can create a flexible metadata standard that accurately connects papers both to their own additional materials and to other research based on as much data as we can gather.

We're operationalizing this problem as:

How might we reliably automate a literature review?

Publication Text Extraction

Aim/Goal

The aim of this component is to provide a way to programmatically extract machine-readable text from journal publications when provided an identifier such as a title or DOI. It is adapted from the methodology described in "Automated pipeline for superalloy data by text mining" (https://www.nature.com/articles/s41524-021-00687-2).
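As a rough illustration of the first step, the sketch below decides whether an incoming identifier looks like a DOI or should be treated as a title. The function name `classify_identifier`, the prefix list, and the regular expression are assumptions made for this example and are not taken from the repository; the pattern only checks the general shape of a DOI and does not verify that it resolves.

```python
import re
from typing import Tuple

# Assumed shape of a DOI: "10.", a 4-9 digit registrant prefix, "/", then a suffix.
DOI_PATTERN = re.compile(r"^10\.\d{4,9}/\S+$")

# Common ways a DOI is written with a resolver or scheme in front of it.
DOI_PREFIXES = (
    "https://doi.org/",
    "http://doi.org/",
    "https://dx.doi.org/",
    "http://dx.doi.org/",
    "doi:",
)


def classify_identifier(identifier: str) -> Tuple[str, str]:
    """Return ("doi", bare_doi) if the input looks like a DOI, else ("title", text)."""
    original = identifier.strip()
    candidate = original
    lowered = candidate.lower()
    for prefix in DOI_PREFIXES:
        if lowered.startswith(prefix):
            # Drop the resolver/scheme so only the bare DOI remains.
            candidate = candidate[len(prefix):]
            break
    if DOI_PATTERN.match(candidate):
        return ("doi", candidate)
    # Anything that does not look like a DOI is treated as a title.
    return ("title", original)
```

A caller could then send DOIs to a direct lookup and titles to a search step, which tends to be less reliable.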

Setup and use

Make sure you have Docker installed on your machine. You can download it from Docker's official website.

Clone the Repository

git clone https://github.com/DeSci-md/automating-metadata.git
cd automating-metadata

Build the Docker Image

docker build -t your-image-name .

Run the Docker Container

docker run -e NODE_ENV=your-node-env -e DOI_ENV=your-doi-env your-image-name

Replace your-node-env and your-doi-env with the desired values for NODE_ENV and DOI_ENV.

Setting Environment Variables

  • NODE_ENV: The node you want to get metadata for, e.g. 46.
  • DOI_ENV: A DOI that corresponds to the node. If you do not have a DOI for the article, do not set this variable.

You can set these environment variables using the -e option with the docker run command.

Example

docker run -e NODE_ENV="46" -e DOI_ENV="10.3847/0004-637X/828/1/46" your-image-name
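For readers wondering how the container might consume these values, here is a minimal sketch of reading the two variables in Python. The names `RunConfig` and `load_config` are hypothetical, and treating NODE_ENV as required is an assumption based on the description above; the repository's actual entry point may handle this differently.

```python
import os
from typing import Mapping, NamedTuple, Optional


class RunConfig(NamedTuple):
    node: str            # value of NODE_ENV, e.g. "46"
    doi: Optional[str]   # value of DOI_ENV, or None when it is not set


def load_config(env: Optional[Mapping[str, str]] = None) -> RunConfig:
    """Read NODE_ENV and DOI_ENV from the given mapping (defaults to os.environ)."""
    if env is None:
        env = os.environ
    node = env.get("NODE_ENV", "").strip()
    if not node:
        # Assumption: a run without a node id is not meaningful.
        raise ValueError("NODE_ENV must be set, e.g. NODE_ENV=46")
    # An unset or empty DOI_ENV means no DOI is available for the article.
    doi = env.get("DOI_ENV", "").strip() or None
    return RunConfig(node=node, doi=doi)
```

Passing a plain dictionary instead of relying on `os.environ` makes the function easy to exercise without starting a container.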

automating-metadata's People

Contributors

plikt, hhlohwv, gizmotronn

automating-metadata's Issues

Figuring out how to add the submodule.

I have added the submodule for PDFMetadataExtractor on my machine. However, although Git reports that the submodule was added successfully, I can't find that module anywhere in the repository, and the little arrow file that @Gizmotronn said would exist isn't showing up. What am I missing here?

Switching from OpenAI to open source models

ISSUE: We aim to be fully open source, but are currently using OpenAI as our model provider due to its ease of use and higher quality.

TASK: Switch the OpenAI prompts to open-source language models that have similar reliability to OpenAI for the tasks we have in langchain_api.py.

ADDITIONAL INFO:
Reach out here if you have additional questions!

Visual + LLM parsing to associate metadata annotations with locations on a PDF

ISSUE: We're currently updating a manifest file in nodes.desci.com as specified by the API description at docs.desci.com. However, we hope to associate some of the annotations we're making with specific locations on the PDFs, code files, etc held within a research object (see desci.com for descriptions of a research object).

TASK: Combine visual parsing and LLMs to place each relevant annotation on the document itself.

ADDITIONAL INFO:
See this Semantic Scholar reader for an example of what this could look like: https://www.semanticscholar.org/reader/eb56aaadb392044fc5264b109b5b298e10a39b95

Author Disambiguation using the OpenAlex API

ISSUE: Currently we simply pass the name of an author, obtained by reading the PDF text, to the https://openalex.org/ API. However, we would like to pass the author's name together with additional context and use the author disambiguation feature of the OpenAlex API.

TASK: Find what additional information we can feed to the API to get a more accurate ORCID. This is in the get_orcid method in the langchain_api.py file.

ADDITIONAL INFO:
Reach out here if you have additional questions.
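One possible direction, sketched under assumptions: search OpenAlex authors by name, then use extra context from the paper (here, an institution name) to choose among the returned candidates. The `Candidate` class, `build_author_search_url`, and `pick_candidate` are hypothetical names for this example and do not reflect the current get_orcid implementation. The `Candidate` shape is a simplified stand-in, so mapping the real API response into it is left out.

```python
from dataclasses import dataclass, field
from typing import List, Optional
from urllib.parse import urlencode

OPENALEX_AUTHORS_URL = "https://api.openalex.org/authors"


@dataclass
class Candidate:
    """Simplified, hypothetical view of one author search result."""
    display_name: str
    orcid: Optional[str] = None
    institutions: List[str] = field(default_factory=list)


def build_author_search_url(name: str) -> str:
    """Build a name search URL for the OpenAlex authors endpoint."""
    return f"{OPENALEX_AUTHORS_URL}?{urlencode({'search': name})}"


def pick_candidate(
    candidates: List[Candidate], institution_hint: Optional[str] = None
) -> Optional[Candidate]:
    """Prefer the first candidate whose institutions mention the hint.

    Falls back to the first candidate (the API's own ranking) when no
    hint is given or nothing matches.
    """
    if not candidates:
        return None
    if institution_hint:
        hint = institution_hint.casefold()
        for candidate in candidates:
            if any(hint in inst.casefold() for inst in candidate.institutions):
                return candidate
    return candidates[0]
```

Other signals extracted from the PDF, such as co-author names or the DOI of the paper itself, could be used in the same way to narrow the candidates.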

Looking for more LLM developers!

ISSUE: We're a small team of newbie coders, and we'd love to work with someone more experienced in LLMs to help move this project forward.

Reach out here if you're interested!
