csda-publication-usage-finder's People

Contributors

krisstanton, simha4

csda-publication-usage-finder's Issues

Develop the Initial Application Structure

  • Create a new branch initial_application_structure or iss4__initial_app_structure
  • Create files/classes for the application - According to the initial Architecture Diagram
  • Fill out with initial class stubs - Class Names, Constructors, Standard output or any other common class functions needed - at first, this will mostly be empty.
  • Connect the communication between classes according to the Architecture Diagram
  • Create a section for dev/fixture tests - This is a place (directory) to test any complex or unknown processes in isolation
  • Create an entry point for both devs and production
    • Code to start the process by a local developer
    • Code to start the process triggered by a single function call from inside an Airflow process.
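The dual entry point described above can be sketched as follows. The function and key names (`run_pipeline`, `main`, `airflow_entry`, `environment`) are illustrative placeholders, not names from the actual code base:

```python
# Sketch of a dual entry point: one shared core, called either by a
# local developer or by a single function call from an Airflow task.
# All names here are assumptions for illustration.

def run_pipeline(settings: dict) -> dict:
    """Core entry point shared by local dev runs and Airflow runs."""
    # A real implementation would wire up the classes from the
    # architecture diagram; here we just echo the settings back.
    return {"status": "ok", "settings": settings}

def main() -> dict:
    """Local developer entry point: hard-coded dev settings."""
    return run_pipeline({"environment": "local"})

def airflow_entry(**context) -> dict:
    """Single-function entry point intended for an Airflow task."""
    dag_run = context.get("dag_run")
    conf = (dag_run.conf if dag_run else {}) or {}
    return run_pipeline({"environment": "cloud", **conf})

if __name__ == "__main__":
    print(main())
```

Keeping both entry points as thin wrappers around one core function means the dev and production paths cannot drift apart.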

Examine scholarly python library

Examine the python library called scholarly to see if it can work as a direct way to search for publications on Google Scholar.

Direct link to the function reference for a generic Google Scholar search
https://github.com/scholarly-python-package/scholarly/blob/9269ff36ad2314e6cc0c5b499efc3b79b844707e/scholarly/_scholarly.py#L91C12-L91C12

Another helper for working with scholarly
https://stackoverflow.com/questions/62938110/does-google-scholar-have-an-api-available-that-we-can-use-in-our-research-applic

  • Install scholarly python library locally
  • Try out the example functions with our test string
  • Analyze the results and figure out the next steps needed to integrate with the existing code base (next ticket)
    • Examine the output from the search results to see what properties we have and how to access them.
    • Determine the list of properties that we will need to save for each publication (Draft: Title, Authors list, Citation string, Citation_ID, journal name, etc)
    • Next Step - #16
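A minimal sketch of trying out scholarly and pulling the draft property list from a result. `scholarly.search_pubs` returns an iterator of result dicts whose `bib` key holds the bibliographic fields; the exact keys below (`title`, `author`, `pub_year`, `venue`) follow the library's documented output but may vary by result, so treat them as assumptions to verify:

```python
# Sketch: extract the draft property list (Title, Authors, year,
# venue, citation count) from one scholarly search result.

def extract_publication(result: dict) -> dict:
    """Map a raw scholarly result dict onto the fields we plan to save."""
    bib = result.get("bib", {})
    return {
        "title": bib.get("title"),
        "authors": bib.get("author"),
        "year": bib.get("pub_year"),
        "venue": bib.get("venue"),
        "num_citations": result.get("num_citations"),
    }

def search(query: str, limit: int = 5) -> list[dict]:
    """Run a generic Google Scholar search via scholarly (network call)."""
    from scholarly import scholarly  # pip install scholarly
    gen = scholarly.search_pubs(query)
    return [extract_publication(next(gen)) for _ in range(limit)]
```

Separating the extraction step makes it easy to inspect which properties each result actually carries before committing to a schema.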

Handle Storage of previous runs - Specific to Staging and Production environments (cloud)

  • Update the existing local-storage code so it is compatible with upgrading to a multi-storage model
  • Store previous runs' results data (JSON object) in an AWS S3 location
    • Create the S3 Bucket that will be designated to be the staging storage area
    • Create the S3 Bucket that will be designated to be the production storage area
    • If necessary, create a scriptable version of creating the AWS S3 / infrastructure components
    • Write code and functions for saving and loading this data (AWS CLI Wrapper or Boto3 python interface - This depends on the way we usually do it in Airflow)
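The save/load pair could look like the sketch below, using the boto3 interface (the Airflow-vs-CLI choice is still open per the ticket). The bucket and key values are placeholders; the serialization helpers are split out so the local and S3 storage versions can share them:

```python
# Hedged sketch of saving/loading run results JSON in S3 via boto3.
# Bucket/key names are placeholders, not the real infrastructure.
import json

def serialize_results(results: dict) -> str:
    """Shared serializer for both local and S3 storage backends."""
    return json.dumps(results, indent=2, sort_keys=True)

def deserialize_results(raw: str) -> dict:
    return json.loads(raw)

def save_results_to_s3(bucket: str, key: str, results: dict) -> None:
    import boto3  # assumed available in the Airflow environment
    boto3.client("s3").put_object(
        Bucket=bucket, Key=key,
        Body=serialize_results(results).encode("utf-8"))

def load_results_from_s3(bucket: str, key: str) -> dict:
    import boto3
    body = boto3.client("s3").get_object(Bucket=bucket, Key=key)["Body"]
    return deserialize_results(body.read().decode("utf-8"))
```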

Implement the report generation

Design the report generation feature.
Ensure that this report meets the needs of the stakeholders (Does this need to be emailed? Is it enough to just store the report in S3? Are there particular formatting requirements? etc.)
For local copies, the reports should go into a local directory, similar to how result storage is handled.
Create the code to generate the report

Part 1 of 2 (Some of these items are dependent on other tickets so splitting this work up into two chunks)

The report should include

  • Current list of results from the requests on the current application run.
  • For each publication,
    • When this publication / citation was first seen (was it this run, was it a previous run? -- outputting the saved date handles this)
    • Output the List of Authors
    • Output the Publication Title
    • Output the Citation
    • Output the Search String that got to this result
    • Output the name of the API / search method that got us to this result
    • Output other search / API parameters used (Please do not output the API key here!!)
  • Make sure search results in the report are sorted by publication year
  • Test this report generation
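The per-publication output above can be sketched as a plain-text formatter. The field names (`title`, `authors`, `citation`, `first_seen_date`, `search_string`, `api_name`) mirror the draft property list but are assumptions about the final schema; note the API key is deliberately never printed:

```python
# Illustrative report formatter: one block per publication, sorted
# by publication year, covering the fields listed in the ticket.

def format_report(publications: list[dict]) -> str:
    lines = []
    # Sort by publication year as the ticket requires.
    for pub in sorted(publications, key=lambda p: p.get("year", 0)):
        lines.append(f"Title: {pub.get('title')}")
        lines.append(f"Authors: {', '.join(pub.get('authors', []))}")
        lines.append(f"Citation: {pub.get('citation')}")
        lines.append(f"First seen: {pub.get('first_seen_date')}")
        lines.append(f"Search string: {pub.get('search_string')}")
        lines.append(f"Source API: {pub.get('api_name')}")
        lines.append("")  # blank line between publications
    return "\n".join(lines)
```

Outputting the saved `first_seen_date` is what answers "was it this run or a previous run?" without any extra bookkeeping.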

Part 2 of 2 (Some of these items are dependent on other tickets so splitting this work up into two chunks)

Remaining to-do list

  • Bringing the Search String forward (through the code) to the report
  • Sort the Publications by published Year (maybe with an option to FIRST sort by brand-new search results, and then by published year within each group)
  • Once the local storage work is done, test it and make sure the code which handles loading previously searched articles is working properly
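The optional two-level sort above reduces to a single sort key: brand-new results first, then published year within each group. The `is_new` flag is an assumed field, set when a publication was not in the previously saved run:

```python
# Sketch of the two-level sort: new publications first, then by
# published year within each group. 'is_new' is an assumed flag.

def sort_for_report(publications: list[dict]) -> list[dict]:
    return sorted(
        publications,
        key=lambda p: (not p.get("is_new", False), p.get("year", 0)))
```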

API Prototyping

For each Publication API, write code that shows how to use it to perform one of these searches.
This is a simple and relatively quick way to ensure that the API functions properly before attempting to write production ready code. Also, these prototypes can serve as a starting point for writing the production ready code.

Note: This could be as simple as a single script file with hard-coded inputs, a single function to call the API, and print statements to demonstrate that an output can be achieved.

List of Publication APIs to prototype

  • Elsevier
  • SerpAPI
  • Google Custom Search API (JSON results)
  • python scholarly library (#9)
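As a concrete example of the prototype shape described above, here is a sketch against the Google Custom Search JSON API. The endpoint and parameter names (`key`, `cx`, `q`) follow Google's documented API; the key, engine ID, and query values are placeholders:

```python
# Prototype-style script: hard-coded inputs, one function to call
# the API, print statements to show an output can be achieved.
API_KEY = "YOUR_API_KEY"      # placeholder -- never commit a real key
ENGINE_ID = "YOUR_ENGINE_ID"  # placeholder custom search engine (cx)
QUERY = "CSDA commercial data publications"  # placeholder test string

def build_request(key: str, cx: str, query: str):
    """Assemble the Custom Search JSON API endpoint and parameters."""
    url = "https://www.googleapis.com/customsearch/v1"
    return url, {"key": key, "cx": cx, "q": query}

def run_prototype() -> None:
    import requests  # pip install requests
    url, params = build_request(API_KEY, ENGINE_ID, QUERY)
    resp = requests.get(url, params=params, timeout=30)
    resp.raise_for_status()
    for item in resp.json().get("items", []):
        print(item["title"], "->", item["link"])

if __name__ == "__main__":
    run_prototype()
```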

Publication API Research

Research the various Publication APIs.
Here is an initial list of Publication Search APIs I found after some checking.
https://guides.temple.edu/APIs#:~:text=An%20API%20is%20a%20protocol,the%20corpus%20they've%20downloaded.
(Note: Some of the below questions are already answered on this page)
The initial goal here is to have 1 or 2 highly reliable APIs to start with that may generate positive results.

Also note: There is a third-party API for Google Scholar called SerpAPI that should also be examined. https://serpapi.com/google-scholar-api

For each of the above APIs we need to understand what kinds of limitations may exist.
Here is a starter list of things to look for, for each API. (Note: not every item and follow-up question is expected to have an answer for every API.)

  • Cost: Is this API free? If so, is it always free? If not, is there a limited free tier, and how much does the paid version cost?
  • What is the complexity/process for getting an API key or regular API access? Is there simply an HTTP request we can make? Do we need to deal with auth tokens? Do we need to get a new key every so often?
  • Rate limits: Are we limited to a certain number of requests per time period (e.g. 5 requests per second, or 1,000 requests per month)? We need to know this so we don't overuse any of these APIs in development or production environments.
  • Would an Earth science publication show up here? Are there content limitations? Some publication searches are limited to certain scientific disciplines; we would probably not expect to find an Earth science publication in a biomedical journal (although that kind of crossover is something we should later consider as a possibility).
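To avoid overusing any API during development, a small rate-limit helper can enforce a minimum spacing between requests. This is an illustrative sketch, not project code; the clock is injectable so the logic stays testable:

```python
# Minimal helper for respecting a per-request rate limit while
# prototyping. Illustrative only; not from the project code base.
import time

class RateLimiter:
    def __init__(self, min_interval: float, clock=time.monotonic):
        self.min_interval = min_interval  # seconds between requests
        self.clock = clock                # injectable for testing
        self._last = None

    def wait_time(self) -> float:
        """Seconds to sleep before the next request is allowed."""
        if self._last is None:
            return 0.0
        return max(0.0, self.min_interval - (self.clock() - self._last))

    def record(self) -> None:
        """Call immediately after each request is sent."""
        self._last = self.clock()
```

For example, `RateLimiter(0.2)` caps usage at five requests per second; the caller sleeps for `wait_time()` before each request and calls `record()` after it.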

Create the initial Application Architecture Design

Design an initial, flexible code flow for the application, draw a diagram to visualize it, and then create tasks

  • Design an initial flexible code flow for the application (Notes and some brief explanations)
  • Generate an application architecture diagram (Lucid Charts)
  • Add an image of the Architecture diagram to the code base
  • Create the next tickets to execute the work

Handle Storage of previous runs - Logic

When this application is run, it generates some outputs.
When the application is run again, we need to reference the previously generated outputs and perform operations with those.

  • Convert python dictionary object to JSON
  • During report generation, be able to load this previously run data and compare with the current results for the report.
    • Make sure this works for local runs
    • Make sure this works for cloud runs
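The dict-to-JSON conversion and the compare-with-previous-run step can be sketched together. Keying publications on a unique `citation_id` and flagging new ones with `is_new` are assumptions drawn from the draft property list:

```python
# Sketch: convert run results to JSON, and flag which current
# publications were not present in the previous run's saved data.
# The 'publications'/'citation_id'/'is_new' schema is an assumption.
import json

def to_json(results: dict) -> str:
    """Convert the in-memory python dictionary to a JSON string."""
    return json.dumps(results, indent=2)

def mark_new_publications(previous: dict, current: dict) -> dict:
    """Set is_new on each current publication absent from the last run."""
    seen = set(previous.get("publications", {}))
    for citation_id, pub in current.get("publications", {}).items():
        pub["is_new"] = citation_id not in seen
    return current
```

The same compare step works for local and cloud runs as long as both storage backends hand back the same JSON structure.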

Handle Storage of previous runs - Specific to Local dev environment

Store previous runs' results data (JSON object) in a local file

  • Store run results in a local directory
    • Create a local directory that will be designated to be the local storage area
      • Add this directory to the .gitignore file
    • Write the reusable python function for saving and loading this data

Integrate code with Airflow for production

Steps

  • Make sure this can run within an Airflow environment
  • Find out how to, and implement custom Airflow Run Configurations (where we pass in JSON settings)
  • Deploy Application
  • Test application
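Reading custom Airflow run configurations (the JSON settings passed via `dag_run.conf`) can be sketched as a task callable that merges the conf over defaults. The settings keys here are placeholders; the DAG wiring itself is omitted:

```python
# Sketch of consuming custom run configuration inside an Airflow
# task callable. Settings keys are illustrative placeholders.
DEFAULTS = {"search_strings": [], "storage_bucket": None}

def task_entry(**context) -> dict:
    """Callable for a PythonOperator; merges dag_run.conf over defaults."""
    dag_run = context.get("dag_run")
    conf = (dag_run.conf or {}) if dag_run else {}
    settings = {**DEFAULTS, **conf}
    # ...hand settings to the main application entry point here...
    return settings
```

Merging over defaults means a manually triggered run with partial JSON settings still gets sane values for everything it did not override.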

Acceptance Criteria

  • The app runs and generates a report (a list of publications that matched the search) which is either emailed or stored in an S3 bucket

Initial Project Setup

Set up this repo, condense notes, begin edits on the readme file, plan project goals, and create a plan of action to move forward.

  • Loosely Plan Initial Object Models
  • Draw initial Architecture (code and interactions with the web) #2
  • Research potential websites to target
  • Research potential libraries to use to achieve the goals
