
paperai's Introduction

Semantic search and workflows for medical/scientific papers



paperai is a semantic search and workflow application for medical/scientific papers.


Applications range from semantic search indexes that find matches for medical/scientific queries to full-fledged reporting applications powered by machine learning.

(architecture diagram)

paperai and NeuML have been recognized in a number of articles.

Installation

The easiest way to install is via pip and PyPI

pip install paperai

Python 3.8+ is supported. Using a Python virtual environment is recommended.

paperai can also be installed directly from GitHub to access the latest, unreleased features.

pip install git+https://github.com/neuml/paperai

See this link to help resolve environment-specific install issues.

Docker

Run the steps below to build a docker image with paperai and all dependencies.

wget https://raw.githubusercontent.com/neuml/paperai/master/docker/Dockerfile
docker build -t paperai .
docker run --name paperai --rm -it paperai

paperetl can also be added to create a single image that can both index and query content. Follow the instructions to build a paperetl docker image, then run the following.

docker build -t paperai --build-arg BASE_IMAGE=paperetl --build-arg START=/scripts/start.sh .
docker run --name paperai --rm -it paperai

Examples

The following notebooks and applications demonstrate the capabilities provided by paperai.

Notebooks

Notebook | Description
Introducing paperai | Overview of the functionality provided by paperai (Open In Colab)

Applications

Application | Description
Search | Search a paperai index. Set query parameters, execute searches and display results.

Building a model

paperai indexes databases previously built with paperetl. The following shows how to create a new paperai index.

  1. (Optional) Create an index.yml file

    paperai uses the default txtai embeddings configuration when not specified. Alternatively, an index.yml file can be specified that takes all the same options as a txtai embeddings instance. See the txtai documentation for more on the possible options. A simple example is shown below.

    path: sentence-transformers/all-MiniLM-L6-v2
    content: True
    
  2. Build embeddings index

    python -m paperai.index <path to input data> <optional index configuration>
    

The paperai.index process requires an input data path and optionally takes index configuration. This configuration can either be a vector model path or an index.yml configuration file.
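For example, both configuration forms can be passed as the second argument (the paths here are illustrative placeholders, not required locations):

```shell
# Build an index using an index.yml configuration file
python -m paperai.index cord19/models index.yml

# Or pass a vector model path as the index configuration
python -m paperai.index cord19/models cord19-300d.magnitude
```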

Running queries

The fastest way to run queries is to start a paperai shell:

paperai <path to model directory>

A prompt will appear; queries can be typed directly into the console.

Building a report file

Reports support generating output in multiple formats. An example report call:

python -m paperai.report report.yml 50 md <path to model directory>

The following report formats are supported:

  • Markdown (Default) - Renders a Markdown report. Columns and answers are extracted from articles with the results stored in a Markdown file.
  • CSV - Renders a CSV report. Columns and answers are extracted from articles with the results stored in a CSV file.
  • Annotation - Columns and answers are extracted from articles with the results annotated over the original PDF files. Requires passing in a path with the original PDF files.

In the example above, a file named report.md will be created. Example report configuration files can be found here.
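As a rough illustration, a report configuration might look like the sketch below. This is a hypothetical example modeled on the published CORD-19 task files; the exact schema is defined by the example configuration files linked above, and the field names shown here (query, columns, question) should be checked against those files.

```yaml
# Hypothetical report definition; verify field names against the example files
Risk Factors:
    query: risk factors
    columns:
        - Date
        - Study
        - {name: Severe, query: severe outcomes, question: What outcomes are severe}
```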

Tech Overview

paperai is a combination of a txtai embeddings index and a SQLite database with the articles. Each article is parsed into sentences and stored in SQLite along with the article metadata. Embeddings are built over the full corpus.
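The SQLite side of this design can be sketched with the Python standard library. The table and column names below are illustrative only, not paperai's exact schema:

```python
import sqlite3

# In-memory database standing in for the articles database built by paperetl
db = sqlite3.connect(":memory:")

# One row per article plus one row per parsed sentence (illustrative schema)
db.execute("CREATE TABLE articles (id TEXT PRIMARY KEY, title TEXT)")
db.execute("CREATE TABLE sections (id INTEGER PRIMARY KEY, article TEXT, text TEXT)")

db.execute("INSERT INTO articles VALUES ('a1', 'Sample study')")
db.executemany("INSERT INTO sections (article, text) VALUES (?, ?)",
               [("a1", "Sentence one."), ("a1", "Sentence two.")])

# An embeddings search would return matching section ids; article metadata
# is then joined back in from SQLite
row = db.execute("SELECT a.title, s.text FROM sections s "
                 "JOIN articles a ON a.id = s.article WHERE s.id = 1").fetchone()
print(row)  # ('Sample study', 'Sentence one.')
```

The embeddings index holds only vectors and ids, so the database supplies the human-readable text and metadata at query time.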

Multiple entry points exist to interact with the model.

  • paperai.report - Builds a report for a series of queries. For each query, the top scoring articles are shown along with matches from those articles. There is also a highlights section showing the most relevant results.
  • paperai.query - Runs a single query from the terminal
  • paperai.shell - Allows running multiple queries from the terminal

paperai's People

Contributors

davidmezzetti


paperai's Issues

Add Dockerfile

Add a Dockerfile for running paperai. The Dockerfile should have a way to integrate with paperetl so both can run within the same image.

Add wildcard query reports

Currently, reports are built off search results. For smaller sources, it should be possible to build a report off a full database without a driving search. Add support for wildcard query reports.

Nested Output Directory

When given a nested input directory and a desired output directory, the input file tree is not mirrored in the output directory; the tei.xml files are all dumped flat into the output directory. This can lead to issues when multiple pdf files in a nested input directory share the same name: if the client is told to force overwrite xml files, identically named files are overwritten. To fix this, the nested file tree needs to be copied to the output directory. I have created a PR that solves this issue, which you may use at your discretion; it also creates the output directory in case it hasn't been created yet.

Installation issues

The system reports "UnicodeDecodeError: 'gbk' codec can't decode byte 0x82 in position 12007: illegal multibyte sequence" when I execute the command "pip install paperai". I wonder if Windows cannot decompress tar.gz packages.

Embeddings index method is inefficient with memory

The current index method makes a number of calls that force creating copies of the working embeddings index array. For larger datasources, this can cause out of memory errors. Make the following improvements:

  • When streaming embeddings back from disk, write into an empty NumPy array pre-allocated to the size of the embeddings index. Appending NumPy arrays to a list forces a copy to be created when building the final NumPy embeddings array.

  • Modify the removePC method to operate directly on the input array vs returning a copy

  • Modify the normalize method to operate directly on the input array vs returning a copy
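The difference between returning a copy and operating in place, which drives all three bullets above, can be shown with a toy normalize function (plain Python, not paperai's actual NumPy implementation):

```python
import math

def normalize(vector):
    """Scale the vector to unit length in place and return it (no copy made)."""
    norm = math.sqrt(sum(x * x for x in vector))
    for i, x in enumerate(vector):
        vector[i] = x / norm
    return vector

v = [3.0, 4.0]
result = normalize(v)

print(result is v)  # True: the input list itself was modified
print(v)            # [0.6, 0.8]
```

Applied to large NumPy arrays, the same pattern avoids allocating a second full-size array.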

Add additional command line parameters to vector process

Currently, the vectors process defaults the output location to ~/.cord19/vectors/cord19-300d.magnitude

Add optional command line parameters for:

  • output file
  • number of dimensions for vectors
  • minimum count required to include a word token in the vector model

Support query negation

Currently, extractive qa filters support both + and - modifiers to include/not include results when pre-filtering for qa candidate rows. Also add this logic to query filters.
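A sketch of what +/- modifier parsing can look like (illustrative only, not paperai's actual parser):

```python
def parse(query):
    """Split query tokens into required (+), excluded (-) and plain terms."""
    required, excluded, terms = [], [], []
    for token in query.split():
        if token.startswith("+") and len(token) > 1:
            required.append(token[1:])
        elif token.startswith("-") and len(token) > 1:
            excluded.append(token[1:])
        else:
            terms.append(token)
    return required, excluded, terms

print(parse("risk factors +diabetes -children"))
# (['diabetes'], ['children'], ['risk', 'factors'])
```

Required terms must appear in candidate rows, excluded terms must not, and the remaining terms drive the similarity query.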

Add annotation report

Add a new report type that annotates the answers over input PDF documents. This report type requires a path to the original PDFs.

Use the txtmarker library.

Add additional column parameters

Add the following new column parameters:

  • Source: Allow exporting the source column
  • matches: int. Export top n embedding matches for query instead of running QA process against embeddings matches
  • surround: int. Add n additional surrounding lines to either matches or the question result.
  • section: boolean. True if full text section with match should be exported. False otherwise.

I'm not sure I followed the procedure for running paperai with pre-trained vectors correctly

After successfully installing paperai in Linux (Ubuntu 20.04.1 LTS), I tried to run it by using the pre-trained vectors option to build the model, as follows:

(1) I downloaded the vectors from https://www.kaggle.com/davidmezzetti/cord19-fasttext-vectors#cord19-300d.magnitude
(2) My Downloads folder in my computer ended up with a Zip file containing the vectors.
(3) I created a directory ~/.cord19/vectors/ and moved the downloaded Zip file into this directory (see yellow folder in the figure below).
(4) I extracted the Zip file, which resulted in the grey folder shown below, which contained the file cord19-300d.magnitude
(5) I moved the cord19-300d.magnitude file outside of the grey folder and thus into the ~/.cord19/vectors/ directory (see figure below)

(screenshot of the ~/.cord19/vectors/ directory)

(6) I executed the following command to build the embeddings index with the above pre-trained vectors:

python -m paperai.index

Upon running the above, I got an error message (screenshot attached to the original issue).

Am I getting this error because the above steps are not the correct ones?
If so, what would be the correct steps?
Otherwise, what other things should I try to eliminate the issue?

Add report reference column

Currently, all columns are uniquely generated. The extraction process for each column can be time-consuming. The framework should be able to generate new columns from existing columns to apply formatting to the output. This change adds a new column type derived from a reference column: a prior column can be referenced in the query section of the new column.

Windows install issue

It was reported that paperai can't be installed in a Windows environment due to the following error:

ValueError: path 'src/python/' cannot end with '/'

Allow indexing partial datasources

Currently, the indexing process runs on all articles in the input database. Add a parameter to specify the maximum number of documents. This parameter will pull the most recent N documents by entry date and index only those.
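The "most recent N by entry date" selection amounts to a simple SQL query. A runnable sketch with the standard library (the column names are assumptions based on the issue text, not paperai's exact schema):

```python
import sqlite3

db = sqlite3.connect(":memory:")

# Toy articles table with an entry date column (illustrative schema)
db.execute("CREATE TABLE articles (id TEXT, entry TEXT)")
db.executemany("INSERT INTO articles VALUES (?, ?)",
               [("a", "2020-01-01"), ("b", "2020-03-01"), ("c", "2020-02-01")])

# Index only the two most recently entered articles
rows = db.execute("SELECT id FROM articles ORDER BY entry DESC LIMIT 2").fetchall()
print([r[0] for r in rows])  # ['b', 'c']
```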

risk-factors.yml issues

When I run the command "python -m paperai.report tasks/risk-factors.yml 50 md cord19/models", I can't find the file risk-factors.yml, and I don't understand the "50" argument.

Allow setting report options within task yml files

Currently, report options must be passed on the command line. This is cumbersome as the number of arguments grows. This change will add support for report options directly in the report yml file. The current command line options will still work and take precedence, but moving forward new options will primarily be defined in the report yml.

Add pre-trained vector model to GitHub

Currently, the pre-trained vector model is stored on Kaggle. Put a copy of this file on the next GitHub release. This will allow automation/docker builds.

# Can optionally use pre-trained vectors
# https://www.kaggle.com/davidmezzetti/cord19-fasttext-vectors#cord19-300d.magnitude
# Default location: ~/.cord19/vectors/cord19-300d.magnitude

Allow running reports against full databases

Currently, an embeddings index is required for reports. But for wildcard queries, the full database can be processed without an embeddings search. Modify the model loading logic to allow this use case.

Add report column format parameter

Add a new report column format parameter (dtype) that allows applying formatting rules to the column output. Examples include converting the result to a number, formatting a duration field and converting the text to a categorical value.
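The formatting rules described could be sketched as simple post-processing functions. The dtype names below are illustrative, not paperai's actual values:

```python
def apply_dtype(value, dtype):
    """Apply an output formatting rule to an extracted column value."""
    if dtype == "int":
        # Strip non-digit characters, e.g. "50 patients" -> 50
        digits = "".join(c for c in value if c.isdigit())
        return int(digits) if digits else None
    if dtype == "days":
        # Tag a numeric duration with its unit
        return f"{value} days"
    # No rule: pass the extracted text through unchanged
    return value

print(apply_dtype("50 patients", "int"))  # 50
print(apply_dtype("14", "days"))          # 14 days
```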

cannot install successfully

Thanks for the code. But when I install it, there is a problem: ERROR: Failed building wheel for hnswlib. Do you know how to solve it? Many thanks.

Remove study design columns

neuml/paperetl#34 removes the study design columns as discussed in the issue. paperai needs to remove references to those columns. The same functionality can be provided with extraction queries.

Vector model file not found (cord19-300d.magnitude)

(Issue moved here from another project.)

Hi,

I get the following error when running python -m paperai.index

raise IOError(ENOENT, "Vector model file not found", path)
FileNotFoundError: [Errno 2] Vector model file not found: 'C:\Users\x\.cord19\vectors\cord19-300d.magnitude'

PS. I am quite new to all this; so, apologies if the mistake is on my end.

When trying to download cord19-300d.magnitude from https://www.kaggle.com/davidmezzetti/cord19-fasttext-vectors#cord19-300d.magnitude, I get the error: "Too many requests"

Add parameters to index process

Currently, the Embeddings index is hardcoded to BM25 + fastText. Allow passing in embeddings configuration to the indexing process, so that any txtai supported index can be created.

Allow custom transformer qa model

Currently, reports are hardcoded to the "NeuML/bert-small-cord19qa" model for qa extraction. Add a path parameter to allow Hugging Face hub models and local models. If no qa model is set, fall back to "NeuML/bert-small-cord19qa".
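The fallback described reduces to a small default-resolution step. A sketch (the function name and the custom model path are assumptions for illustration):

```python
# Default QA model used when no custom model is configured
DEFAULT_QA = "NeuML/bert-small-cord19qa"

def resolve_qa_model(path=None):
    """Return the caller-provided QA model path, falling back to the default."""
    return path if path else DEFAULT_QA

print(resolve_qa_model())                  # NeuML/bert-small-cord19qa
print(resolve_qa_model("custom/qa-model")) # custom/qa-model
```

The resolved path can then be handed to the QA pipeline, whether it names a Hugging Face Hub model or a local directory.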

RuntimeError: CUDA error: out of memory (NVidia V100, 32 GB DDRAM)

What are the minimum memory requirements for paperai? When running on an NVidia V100 (32 GB) I got: RuntimeError: CUDA error: out of memory. GPU memory seems to be completely free.

Is there a way to run it on the GPU, or can I run it exclusively on TPUs?

from txtai.embeddings import Embeddings
import torch

torch.cuda.empty_cache()

# MEMORY
id = 1
t = torch.cuda.get_device_properties(id).total_memory
c = torch.cuda.memory_cached(id)
a = torch.cuda.memory_allocated(id)
f = c-a  # free inside cache

print("TOTAL", t / 1024/1024/1024," GB")
print("ALLOCATED", a)

# Create embeddings model, backed by sentence-transformers & transformers
embeddings = Embeddings({"method": "transformers", "path": "sentence-transformers/bert-base-nli-mean-tokens"})

import numpy as np

sections = ["US tops 5 million confirmed virus cases",
            "Canada's last fully intact ice shelf has suddenly collapsed, forming a Manhattan-sized iceberg",
            "Beijing mobilises invasion craft along coast as Taiwan tensions escalate",
            "The National Park Service warns against sacrificing slower friends in a bear attack",
            "Maine man wins $1M from $25 lottery ticket",
            "Make huge profits without work, earn up to $100,000 a day"]


query = "health"
uid = np.argmax(embeddings.similarity(query, sections))
print("%-20s %s" % (query, sections[uid]))

TOTAL 31.74853515625 GB
ALLOCATED 0
Traceback (most recent call last):
File "pokus2.py", line 32, in
uid = np.argmax(embeddings.similarity(query, sections))
File "/home/user/.local/lib/python3.8/site-packages/txtai/embeddings.py", line 228, in similarity
query = self.transform((None, query, None)).reshape(1, -1)
File "/home/user/.local/lib/python3.8/site-packages/txtai/embeddings.py", line 179, in transform
embedding = self.model.transform(document)
File "/home/user/.local/lib/python3.8/site-packages/txtai/vectors.py", line 264, in transform
return self.model.encode([" ".join(document[1])], show_progress_bar=False)[0]

Processing custom sqlite file

I want to create an index and vector file over a custom sqlite articles database. I have created an articles.sqlite database of medical papers using paperetl, but I could not find any instructions on how to process it. Can you please provide instructions?

Integration: DeepSource

I ran DeepSource analysis on my fork of this repository and found some code quality issues. Have a look at the issues caught in this repository by DeepSource here.

DeepSource is a code review automation tool that detects code quality issues and helps you automatically fix some of them. In addition to detecting issues in code, you can use DeepSource to track test coverage, detect problems in Dockerfiles, and more.

The PR #24 fixed some of the issues caught by DeepSource.

All the features of DeepSource are mentioned here.
I'd suggest you integrate DeepSource since it is free forever for Open Source projects.

Integrating DeepSource to continuously analyze your repository:

  • Install DeepSource on your repository here.
  • Create .deepsource.toml configuration specific to this repo or use the configuration mentioned below which I used to run the analysis on the fork of this repo.
  • Activate analysis here.
version = 1

test_patterns = ["/test/python/*.py"]

[[analyzers]]
name = "python"
enabled = true

  [analyzers.meta]
  runtime_version = "3.x.x"
