csda-publication-usage-finder's People

Contributors

krisstanton, simha4

csda-publication-usage-finder's Issues

Develop the Initial Application Structure

  • Create a new branch initial_application_structure or iss4__initial_app_structure
  • Create files/classes for the application - According to the initial Architecture Diagram
  • Fill out with initial class stubs - Class Names, Constructors, Standard output or any other common class functions needed - at first, this will mostly be empty.
  • Connect the communication between classes according to the Architecture Diagram
  • Create a section for dev/fixture tests - This is a place (directory) to test any complex or unknown processes in isolation
  • Create an entry point for both devs and production
    • Code to start the process by a local developer
    • Code to start the process triggered by a single function call from inside an Airflow process.
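The dual entry point described above can be sketched as follows. The function and key names (`run_pipeline`, `main`, `airflow_entry`, `environment`) are illustrative placeholders, not names from the actual code base:

```python
# Sketch of a dual entry point: one shared core, called either by a
# local developer or by a single function call from an Airflow task.
# All names here are assumptions for illustration.

def run_pipeline(settings: dict) -> dict:
    """Core entry point shared by local dev runs and Airflow runs."""
    # A real implementation would wire up the classes from the
    # architecture diagram; here we just echo the settings back.
    return {"status": "ok", "settings": settings}

def main() -> dict:
    """Local developer entry point: hard-coded dev settings."""
    return run_pipeline({"environment": "local"})

def airflow_entry(**context) -> dict:
    """Single-function entry point intended for an Airflow task."""
    dag_run = context.get("dag_run")
    conf = (dag_run.conf if dag_run else {}) or {}
    return run_pipeline({"environment": "cloud", **conf})

if __name__ == "__main__":
    print(main())
```

Keeping both entry points as thin wrappers around one core function means the dev and production paths cannot drift apart.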

Examine scholarly python library

Examine the python library called scholarly to see if it can work as a direct way to search for publications on Google Scholar.

Direct link to the function reference for a generic Google Scholar search
https://github.com/scholarly-python-package/scholarly/blob/9269ff36ad2314e6cc0c5b499efc3b79b844707e/scholarly/_scholarly.py#L91C12-L91C12

Another helper for working with scholarly
https://stackoverflow.com/questions/62938110/does-google-scholar-have-an-api-available-that-we-can-use-in-our-research-applic

  • Install scholarly python library locally
  • Try out the example functions with our test string
  • Analyze the results and figure out the next steps needed to integrate with the existing code base (next ticket)
    • Examine the output from the search results to see what properties we have and how to access them.
    • Determine the list of properties that we will need to save for each publication (Draft: Title, Authors list, Citation string, Citation_ID, journal name, etc)
    • Next Step - #16
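A minimal sketch of trying out scholarly and pulling the draft property list from a result. `scholarly.search_pubs` returns an iterator of result dicts whose `bib` key holds the bibliographic fields; the exact keys below (`title`, `author`, `pub_year`, `venue`) follow the library's documented output but may vary by result, so treat them as assumptions to verify:

```python
# Sketch: extract the draft property list (Title, Authors, year,
# venue, citation count) from one scholarly search result.

def extract_publication(result: dict) -> dict:
    """Map a raw scholarly result dict onto the fields we plan to save."""
    bib = result.get("bib", {})
    return {
        "title": bib.get("title"),
        "authors": bib.get("author"),
        "year": bib.get("pub_year"),
        "venue": bib.get("venue"),
        "num_citations": result.get("num_citations"),
    }

def search(query: str, limit: int = 5) -> list[dict]:
    """Run a generic Google Scholar search via scholarly (network call)."""
    from scholarly import scholarly  # pip install scholarly
    gen = scholarly.search_pubs(query)
    return [extract_publication(next(gen)) for _ in range(limit)]
```

Separating the extraction step makes it easy to inspect which properties each result actually carries before committing to a schema.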

Handle Storage of previous runs - Specific to Staging and Production environments (cloud)

  • Update the existing local-storage code so it is compatible with upgrading to a multi-storage model
  • Store previous runs' results data (JSON object) in an AWS S3 location
    • Create the S3 Bucket that will be designated to be the staging storage area
    • Create the S3 Bucket that will be designated to be the production storage area
    • If necessary, create a scriptable version of creating the AWS S3 / infrastructure components
    • Write code and functions for saving and loading this data (AWS CLI Wrapper or Boto3 python interface - This depends on the way we usually do it in Airflow)
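The save/load pair could look like the sketch below, using the boto3 interface (the Airflow-vs-CLI choice is still open per the ticket). The bucket and key values are placeholders; the serialization helpers are split out so the local and S3 storage versions can share them:

```python
# Hedged sketch of saving/loading run results JSON in S3 via boto3.
# Bucket/key names are placeholders, not the real infrastructure.
import json

def serialize_results(results: dict) -> str:
    """Shared serializer for both local and S3 storage backends."""
    return json.dumps(results, indent=2, sort_keys=True)

def deserialize_results(raw: str) -> dict:
    return json.loads(raw)

def save_results_to_s3(bucket: str, key: str, results: dict) -> None:
    import boto3  # assumed available in the Airflow environment
    boto3.client("s3").put_object(
        Bucket=bucket, Key=key,
        Body=serialize_results(results).encode("utf-8"))

def load_results_from_s3(bucket: str, key: str) -> dict:
    import boto3
    body = boto3.client("s3").get_object(Bucket=bucket, Key=key)["Body"]
    return deserialize_results(body.read().decode("utf-8"))
```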

Implement the report generation

Design the report generation feature.
Ensure that this report meets the needs of the stakeholders (Does this need to be emailed? Is it enough to just store the report in S3? Are there particular formatting requirements? etc.)
For local copies, the reports should go into a local directory, similar to how result storage is handled.
Create the code to generate the report

Part 1 of 2 (Some of these items are dependent on other tickets so splitting this work up into two chunks)

The report should include

  • Current list of results from the requests on the current application run.
  • For each publication,
    • When this publication / citation was first seen (was it this run, was it a previous run? -- outputting the saved date handles this)
    • Output the List of Authors
    • Output the Publication Title
    • Output the Citation
    • Output the Search String that got to this result
    • Output the name of the API / search method that got us to this result
    • Output other search / API parameters used (Please do not output the API key here!!)
  • Make sure search results in the report are sorted by publication year
  • Test this report generation
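The per-publication output above can be sketched as a plain-text formatter. The field names (`title`, `authors`, `citation`, `first_seen_date`, `search_string`, `api_name`) mirror the draft property list but are assumptions about the final schema; note the API key is deliberately never printed:

```python
# Illustrative report formatter: one block per publication, sorted
# by publication year, covering the fields listed in the ticket.

def format_report(publications: list[dict]) -> str:
    lines = []
    # Sort by publication year as the ticket requires.
    for pub in sorted(publications, key=lambda p: p.get("year", 0)):
        lines.append(f"Title: {pub.get('title')}")
        lines.append(f"Authors: {', '.join(pub.get('authors', []))}")
        lines.append(f"Citation: {pub.get('citation')}")
        lines.append(f"First seen: {pub.get('first_seen_date')}")
        lines.append(f"Search string: {pub.get('search_string')}")
        lines.append(f"Source API: {pub.get('api_name')}")
        lines.append("")  # blank line between publications
    return "\n".join(lines)
```

Outputting the saved `first_seen_date` is what answers "was it this run or a previous run?" without any extra bookkeeping.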

Part 2 of 2 (Some of these items are dependent on other tickets so splitting this work up into two chunks)

Remaining to-do list

  • Bringing the Search String forward (through the code) to the report
  • Sort the Publications by published Year (maybe with an option to FIRST sort by brand-new search results, and then by published year within each group)
  • Once the local storage work is done, test it and make sure the code which handles loading previously searched articles is working properly
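The optional two-level sort above reduces to a single sort key: brand-new results first, then published year within each group. The `is_new` flag is an assumed field, set when a publication was not in the previously saved run:

```python
# Sketch of the two-level sort: new publications first, then by
# published year within each group. 'is_new' is an assumed flag.

def sort_for_report(publications: list[dict]) -> list[dict]:
    return sorted(
        publications,
        key=lambda p: (not p.get("is_new", False), p.get("year", 0)))
```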

API Prototyping

For each Publication API, write code that shows how to use it to perform one of these searches.
This is a simple and relatively quick way to ensure that the API functions properly before attempting to write production ready code. Also, these prototypes can serve as a starting point for writing the production ready code.

Note: This could be as simple as a single script file with hard-coded inputs, a single function to call the API, and print statements to demonstrate that an output can be achieved.

List of Publication APIs to prototype

  • Elsevier
  • SerpAPI
  • Google Custom Search API (JSON results)
  • python scholarly library (#9)
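As a concrete example of the prototype shape described above, here is a sketch against the Google Custom Search JSON API. The endpoint and parameter names (`key`, `cx`, `q`) follow Google's documented API; the key, engine ID, and query values are placeholders:

```python
# Prototype-style script: hard-coded inputs, one function to call
# the API, print statements to show an output can be achieved.
API_KEY = "YOUR_API_KEY"      # placeholder -- never commit a real key
ENGINE_ID = "YOUR_ENGINE_ID"  # placeholder custom search engine (cx)
QUERY = "CSDA commercial data publications"  # placeholder test string

def build_request(key: str, cx: str, query: str):
    """Assemble the Custom Search JSON API endpoint and parameters."""
    url = "https://www.googleapis.com/customsearch/v1"
    return url, {"key": key, "cx": cx, "q": query}

def run_prototype() -> None:
    import requests  # pip install requests
    url, params = build_request(API_KEY, ENGINE_ID, QUERY)
    resp = requests.get(url, params=params, timeout=30)
    resp.raise_for_status()
    for item in resp.json().get("items", []):
        print(item["title"], "->", item["link"])

if __name__ == "__main__":
    run_prototype()
```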

Publication API Research

Research the various Publication APIs.
Here is an initial list of Publication Search APIs I found after some checking.
https://guides.temple.edu/APIs#:~:text=An%20API%20is%20a%20protocol,the%20corpus%20they've%20downloaded.
(Note: Some of the below questions are already answered on this page)
The initial goal here is to have 1 or 2 highly reliable APIs to start with that may generate positive results.

Also note: There is a third-party API for Google Scholar called SerpAPI that should also be examined. https://serpapi.com/google-scholar-api

For each of the above APIs we need to understand what kinds of limitations may exist.
Here is a starter list of things to look for, for each API. (Note: not every item and follow-up question is expected to have an answer for every API.)

  • Cost: Is this API free? If so, is it always free? If not, is there a limited free tier, and how much does the paid version cost?
  • What is the complexity/process for getting an API key or regular API access? Is there simply an HTTP request we can make? Do we need to deal with auth tokens? Do we need to get a new key every so often?
  • Rate limits: Are we limited to a certain number of requests per time period (e.g. 5 requests per second, or 1,000 requests per month)? We need to know this so we don't overuse any of these APIs in development or production environments.
  • Would an Earth science publication show up here? Are there content limitations? Some publication searches are limited to certain scientific disciplines; we would probably not expect to find an Earth science publication in a biomedical journal (although that kind of crossover is something we should later consider as a possibility).
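To avoid overusing any API during development, a small rate-limit helper can enforce a minimum spacing between requests. This is an illustrative sketch, not project code; the clock is injectable so the logic stays testable:

```python
# Minimal helper for respecting a per-request rate limit while
# prototyping. Illustrative only; not from the project code base.
import time

class RateLimiter:
    def __init__(self, min_interval: float, clock=time.monotonic):
        self.min_interval = min_interval  # seconds between requests
        self.clock = clock                # injectable for testing
        self._last = None

    def wait_time(self) -> float:
        """Seconds to sleep before the next request is allowed."""
        if self._last is None:
            return 0.0
        return max(0.0, self.min_interval - (self.clock() - self._last))

    def record(self) -> None:
        """Call immediately after each request is sent."""
        self._last = self.clock()
```

For example, `RateLimiter(0.2)` caps usage at five requests per second; the caller sleeps for `wait_time()` before each request and calls `record()` after it.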

Create the initial Application Architecture Design

Design an initial, flexible code flow for the application, draw a diagram to visualize it, and then create tasks

  • Design an initial flexible code flow for the application (Notes and some brief explanations)
  • Generate an application architecture diagram (Lucid Charts)
  • Add an image of the Architecture diagram to the code base
  • Create the next tickets to execute the work

Handle Storage of previous runs - Logic

When this application is run, it generates some outputs.
When the application is run again, we need to reference the previously generated outputs and perform operations with those.

  • Convert python dictionary object to JSON
  • During report generation, be able to load this previously run data and compare with the current results for the report.
    • Make sure this works for local runs
    • Make sure this works for cloud runs
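The dict-to-JSON conversion and the compare-with-previous-run step can be sketched together. Keying publications on a unique `citation_id` and flagging new ones with `is_new` are assumptions drawn from the draft property list:

```python
# Sketch: convert run results to JSON, and flag which current
# publications were not present in the previous run's saved data.
# The 'publications'/'citation_id'/'is_new' schema is an assumption.
import json

def to_json(results: dict) -> str:
    """Convert the in-memory python dictionary to a JSON string."""
    return json.dumps(results, indent=2)

def mark_new_publications(previous: dict, current: dict) -> dict:
    """Set is_new on each current publication absent from the last run."""
    seen = set(previous.get("publications", {}))
    for citation_id, pub in current.get("publications", {}).items():
        pub["is_new"] = citation_id not in seen
    return current
```

The same compare step works for local and cloud runs as long as both storage backends hand back the same JSON structure.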

Handle Storage of previous runs - Specific to Local dev environment

Store previous runs' results data (JSON object) in a local file

  • Store run results in a local directory
    • Create a local directory that will be designated to be the local storage area
      • Add this directory to the .gitignore file
    • Write the reusable python function for saving and loading this data

Integrate code with Airflow for production

Steps

  • Make sure this can run within an Airflow environment
  • Find out how to, and implement custom Airflow Run Configurations (where we pass in JSON settings)
  • Deploy Application
  • Test application
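Reading custom Airflow run configurations (the JSON settings passed via `dag_run.conf`) can be sketched as a task callable that merges the conf over defaults. The settings keys here are placeholders; the DAG wiring itself is omitted:

```python
# Sketch of consuming custom run configuration inside an Airflow
# task callable. Settings keys are illustrative placeholders.
DEFAULTS = {"search_strings": [], "storage_bucket": None}

def task_entry(**context) -> dict:
    """Callable for a PythonOperator; merges dag_run.conf over defaults."""
    dag_run = context.get("dag_run")
    conf = (dag_run.conf or {}) if dag_run else {}
    settings = {**DEFAULTS, **conf}
    # ...hand settings to the main application entry point here...
    return settings
```

Merging over defaults means a manually triggered run with partial JSON settings still gets sane values for everything it did not override.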

Acceptance Criteria

  • The app runs and generates a report (a list of publications that matched the search) which is either emailed or stored in an S3 bucket

Initial Project Setup

Set up this repo, condense notes, begin edits on the readme file, plan project goals, and create a plan of action to move forward.

  • Loosely Plan Initial Object Models
  • Draw initial Architecture (code and interactions with the web) #2
  • Research potential websites to target
  • Research potential libraries to use to achieve the goals
