
elsametric's Issues

Create a script to match database authors with institution faculty members

Feature description

In the past, the unorganized and messy shcopus repo was used (with some success) to match Sharif University of Technology's faculty members against .csv files exported from Scopus. If other institutions are to be added to the database, their faculty members must be distinguished from other authors. To achieve that, a new script is needed.

Suggested solution

A new script should be developed that:

  • Uses the institutions section of the config.json (to loop through desired institutions).
  • Employs fuzzy matching of strings (using the fuzzywuzzy package); see the sketch after this list.
  • Returns a list of Scopus IDs and a degree of confidence associated with the results.
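
A minimal sketch of the matching step (the data shapes, a list of faculty names and a name-to-Scopus-ID mapping, are illustrative assumptions, not the repo's actual structures):

from fuzzywuzzy import fuzz, process

# faculty names of one institution, e.g. read from a .csv file
faculty_names = ['Jane Doe', 'John Roe']

# database authors, assumed here as a full name -> Scopus ID mapping
db_authors = {'Doe J.': 12345678900, 'Johnny Roe': 98765432100}

matches = []
for name in faculty_names:
    best = process.extractOne(
        name, db_authors.keys(), scorer=fuzz.token_sort_ratio)
    if best:
        matched_name, confidence = best
        # (Scopus ID, degree of confidence in the match)
        matches.append((db_authors[matched_name], confidence))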

Merge branches "db_updates" & "fastapi" together.

Feature description

Working on the backend would be easier if it had only 1 development branch, namely dev.

Suggested solution

  1. Create a dev branch from master.
  2. Merge branches db_updates & fastapi into it.
  3. Remove branches db_updates & fastapi.

Write an account for "elsametric"

Feature description

Since its beginning 5 months ago, elsametric has only existed as a thought project. Its goals, and the path to achieving them, have remained vague ideas. Writing about how it came to life can help clarify many things, among them what to do next.

Suggested solution

Write about the following points:

  • the history of analyzing Scopus data
  • why there was a need to create a database
  • steps taken so far
  • the current state of the project
  • steps needed to reach production-ready code

Use the above to prioritize and make a plan for future steps.

Put a cap on the number of an author's collaborators

Feature description

Currently, the size of an author's list of collaborations is determined only by network_threshold, which specifies the minimum number of joint papers between author and co-author required to be on the returned list.

That list, however, can be quite long for some authors. Increasing the network_threshold cannot be considered since some authors do not have many collaborations. The solution, then, is to have an alternative method of limiting the final list.

Suggested solution

  • Rename the variable network_threshold to collaboration_threshold, since the former name is too vague.
  • Add a new variable called network_max_count which accepts an integer.
  • Edit the get_author_network function of api_queries.py to use that variable to limit the final list (removing the least strong collaborations first); see the sketch below.
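
A toy sketch of the limiting step, assuming the collaborations are already filtered by collaboration_threshold and shaped as (co-author, joint paper count) pairs:

# assumed shape: (co-author name, number of joint papers)
collaborators = [('A. Smith', 12), ('B. Jones', 3), ('C. Lee', 7)]
network_max_count = 2  # the new config variable proposed above

# keep the strongest collaborations, dropping the weakest first
collaborators.sort(key=lambda pair: pair[1], reverse=True)
collaborators = collaborators[:network_max_count]
# -> [('A. Smith', 12), ('C. Lee', 7)]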

elsametric cannot read environment variables

Bug description

In #53, elsametric started using the environs package to access environment variables. If, however, elsametric is called from another module, environs cannot find the required .env file, unless a specific path is supplied.

Suggested solution

Replace the line

env.read_env()

inside __init__.py with the one below:

env.read_env(path=Path.cwd())

Note that Path may need to be imported from the pathlib library.

Add support for two-level department structures

Feature description

Currently, elsametric supports institutions with only one level of departments. However, many universities have a two-level structure: they have several faculties/colleges/schools (such as "College of Engineering" or "School of Fine Arts"), each of which can contain multiple departments/study groups/research institutes (like "Department of Mechanical Engineering" or "Department of Performing Arts").

Adding support for these latter universities, though requiring a significant change to the current structure of elsametric, will enable it to cover a broader range of academic institutions.

Suggested solution

Here are some of the challenges:

  • Some institutions have only 1 level of departmental structure (as is already supported).
  • The names of parent & child departments vary from institution to institution; even within a single university, there might be names such as "school of this" and "college of that".
  • Within institutions with 2 levels of departments, there might be units that have no parent or child.

To overcome these challenges, it might be best to (these are initial, inconsistent thoughts):

  • Create a parent > child structure that can accept any name (faculty, college, ... for parents and department, group, ... for children).
  • Each Institution can have multiple parent departments (one to many). Each parent department can have multiple child departments of its own (one to many).
  • All authors of an institution must belong to at least 1 child department. No authors can belong to a parent department directly.
  • If an institution has only 1 level of departmental structure, all of its units must be the children of a generic parent department (Undefined).
  • If an institution with 2 levels of departmental structure also has units with no parent or children, those units should be treated as children of a generic parent department (Undefined).

Parent Department & Child Department models should support the following fields:

  1. English & Farsi names
  2. English abbreviation
  3. Type
  4. Other fields already present in the department model.

... to be continued?
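
As a first sketch, the proposed chain could look like this in SQLAlchemy (class, table & column names here are assumptions, not the final model):

from sqlalchemy import Column, ForeignKey, Integer, Unicode
from sqlalchemy.ext.declarative import declarative_base
from sqlalchemy.orm import relationship

Base = declarative_base()  # stand-in for the project's own Base

class ParentDepartment(Base):
    __tablename__ = 'parent_department'

    id = Column(Integer, primary_key=True)
    # assumes the existing institution table (one to many)
    institution_id = Column(Integer, ForeignKey('institution.id'))
    name = Column(Unicode(128), nullable=False)  # English name
    name_fa = Column(Unicode(128))               # Farsi name
    abbreviation = Column(Unicode(16))           # English abbreviation
    type = Column(Unicode(32))                   # faculty, college, school, ...

    children = relationship('ChildDepartment', back_populates='parent')

class ChildDepartment(Base):
    __tablename__ = 'child_department'

    id = Column(Integer, primary_key=True)
    parent_id = Column(Integer, ForeignKey('parent_department.id'))
    name = Column(Unicode(128), nullable=False)
    name_fa = Column(Unicode(128))
    abbreviation = Column(Unicode(16))
    type = Column(Unicode(32))                   # department, group, ...

    parent = relationship('ParentDepartment', back_populates='children')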

Turn the SQLAlchemy model into a full python package

Feature description

The folder structure of the repo is becoming too cluttered. The SQLAlchemy model should be its own package, which can be installed using pip.

Suggested solution

Using Python's setuptools package, turn the elsametric directory within the repo into its own package. Since all of this is experimental, the available methods of the setuptools package can be used to exclude unnecessary folders from the final package, so the current structure can be left alone.
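
A minimal setup.py sketch; the name, version & excluded folders below are placeholders:

from setuptools import find_packages, setup

setup(
    name='elsametric',
    version='0.1.0',
    # keep the repo layout intact, but leave helper folders out of the
    # distributed package
    packages=find_packages(
        exclude=['junkyard*', 'helper_scripts*', 'db_design*']),
    install_requires=['sqlalchemy', 'sqlalchemy-utils'],
)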

"sqlalchemy_utils" creates the database with "utf8" instead of "utf8mb4"

Bug description

As a result of #24, elsametric now uses the sqlalchemy_utils package to create a database, if one does not exist. This happens in the package's __init__.py file as:

if not database_exists(engine_uri):
    create_database(engine_uri)

This code, by default, creates a schema using the utf8 charset, which is fine for most cases, but there are some publications that use special characters not supported by utf8. That is why MySQL supports another encoding, utf8mb4, which covers more characters.

Not using utf8mb4 may result in errors like Incorrect string value when dealing with VARCHAR or TEXT columns.

Suggested solution

Explicitly tell sqlalchemy_utils to use utf8mb4 when creating a database:

if not database_exists(engine_uri):
    create_database(engine_uri, encoding='utf8mb4')

Move unrelated files to another repo

Introduction

After #22, elsametric should be used as a separate package and the current repo should only host files necessary for its development. To achieve this, the scope of the package elsametric must be determined and to do that, its mission has to be defined first.

What is elsametric?

It began as a collection of loose ideas about how to break down the raw academic publication data (such as data obtained from Scopus) in a database, which then could be queried to extract information.

elsametric has two main parts:

  • a part which designs the database using SQLAlchemy (the models directory)
  • a part which gives access to some process functions that help populate the database with the data gathered from here and there (the helpers directory, along with db_populate.py)

There are also some other parts as well:

  • some (mostly visual) files on the current shape of the database (the db_design folder)
  • some Python scripts and Jupyter Notebooks which usually contain old, experimental scripts, including:
    • custom.ipynb which retrieves data from Scopus API
    • Tehran.py & Modares.py which attempt to crawl faculty data from the University of Tehran and Tarbiat Modares University respectively
    • queries.py & queries.ipynb which were used to test different queries

These files should be moved to either the junkyard or the helper_scripts directories.

Additionally, the repo contains:

  • config.json which is used to connect to the database, populate it using different sources, configure the API, ...
  • gsc_profile.py in the helper_scripts directory which is used to get author metrics from google scholar

To create the API, two other files were created: main.py (API routes) & api_queries.py, which includes process functions for the API to work.

What should elsametric do (and what should it not)

elsametric is about designing and maintaining an efficient database to store academic publications data. As such, it should consist of:

  1. the elsametric folder, which includes the SQLAlchemy model and some helper functions to process data
  2. the db_design folder, which holds a graphical version of the SQLAlchemy model, created using MySQL Workbench
  3. scripts to populate the database, which at the moment only includes db_populate.py
  4. scripts to gather data from the web, including:
    • scripts for getting publications data from servers such as Scopus & WOS
    • crawlers for getting the profile of the faculty members
    • crawlers for getting author metrics (such as h-index from google scholar)

Of the items mentioned above, only the elsametric directory will be installed using pip. Other scripts and files reside solely in the repo. Future releases might install them along with the elsametric folder.

Other functionality regarding the growth and maintenance of the database can be included in the future. For example, the CSV-processing functions in the shcopus repo, which can analyze CSV exports from Scopus, can be added to this repo in case of Scopus API limitations.

Yet other functionality might include ways of migrating the database, probably using Alembic, SQLAlchemy's migration tool. These tools will enable the package to avoid re-populating the entire database every time a change in the structure is needed.

elsametric is not about creating and maintaining a webserver or an API. That should be the job of another repo. Hence, the files main.py and api_queries.py are to be moved out of this repository.

Any remaining scripts, whether Python or Jupyter Notebook, should be moved to the helper_scripts directory or, if not needed, to the junkyard directory. Eventually, the junkyard folder should be reviewed for any useful files and subsequently deleted from the repo... one should travel light!

Add tests

Feature description

Currently, there are no tests implemented against the codebase. Frankly, that is probably why the repo's feature requests far outnumber its bug reports.

Suggested solution

Add unit tests and integration tests to the codebase.
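
As a starting point, a minimal pytest unit test might look like the following, using the token_generator function shown elsewhere in these issues as the unit under test:

import secrets

def token_generator(nbytes=8):
    # copied from base.py, as described in the token_generator issue
    return secrets.token_urlsafe(nbytes)

def test_token_length():
    # 8 random bytes encode to 11 url-safe Base64 characters
    assert len(token_generator(8)) == 11

def test_tokens_are_unique():
    assert token_generator() != token_generator()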

Rename "Session" to "SessionLocal"

Feature description

The idea comes from the FastAPI documentation: import Session from sqlalchemy.orm wherever a session object is needed, for extra editor support.

Suggested solution

Rename the Session object returned by the sessionmaker function in base.py to SessionLocal. This will help avoid clashes between it and the Session imported from sqlalchemy.orm. Then, in every file that needs it, do the following:

  • Use SessionLocal to issue queries and such.
  • Use Session (from sqlalchemy.orm) for type hinting.
  • Also, rename instances of SessionLocal from session (all lower-case) to just db. That way, an example query would look like:
author = db.query(Author).get(1)

Instead of:

author = session.query(Author).get(1)

This has the benefit of being shorter and more meaningful; a fuller sketch follows.
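
A minimal sketch of the naming scheme, using an in-memory SQLite engine as a stand-in for the one built in base.py:

from sqlalchemy import create_engine, text
from sqlalchemy.orm import Session, sessionmaker

engine = create_engine('sqlite://')  # stand-in for the engine in base.py

# base.py: the sessionmaker result, renamed from Session to SessionLocal
SessionLocal = sessionmaker(bind=engine)

# consuming modules: sqlalchemy.orm.Session is used only for type hints
def ping(db: Session) -> int:
    return db.execute(text('SELECT 1')).scalar()

db = SessionLocal()
assert ping(db) == 1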

Review and refactor the "gsc_profile.py" script

Feature description

With the creation of the new helpers.py module as a result of #30, the script gsc_profile.py should be refactored to use the functions in that module, instead of defining new ones within the script.

Suggested solution

Remove the function definitions from gsc_profile.py and use the ones imported from helpers.py.

Author Stats

Feature suggestion

This issue intends to make common author stats available to the front-end clients.

Stats potentially include:

  • h-index
  • i10-index
  • total papers
  • total citations
  • papers this year
  • citations this year
  • collaborations: institutional
  • collaborations: national
  • collaborations: international
  • number of books & book chapters
  • conference papers percentage (or conference paper to journal paper ratio)

Additionally, the rank of the author for different metrics within the institution can also be shown as separate metrics:

  • rank of h-index
  • rank of i10-index
  • rank of total papers
  • rank of total citations

These ranks can be accompanied by the histogram data of that metric within the institution, to better show the author's position among his/her colleagues.

Suggested solution

  • A new API endpoint should be made available, e.g. /a/authorID/stats
  • Some stats are readily available, such as h-index & total citations, while others need to be calculated, like international collaborations
  • All stats should have a date attached to them. For some of them, this is the date that the metrics were obtained from a 3rd-party source, such as h-index & i10-index, while for others like total citations, this must be equal to the retrieval date of the paper with the oldest retrieval_time attribute.
  • For each institution (or more precisely in this case, for the home_institution), the histogram data of all metrics should be calculated once, perhaps asynchronously. The results then can be attached to the rank metrics of the requested author.
  • The returned object should be formatted roughly like this:
{
  "hIndex": {
    "value": 1,
    "retrievalTime": "timestamp or datetime object",
    "institutionRank": "5/120 tied 2",
    "histogram": "data"
  },
  "papers": {
    "value": 50,
    "retrievalTime": "timestamp or datetime object",
    "institutionRank": "7/120",
    "histogram": "data"
  },
  "papersThisYear": {
    "value": 10,
    "retrievalTime": "timestamp or datetime object",
    "institutionRank": "12/120 tied 1",
    "histogram": []
  }
}
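
A minimal FastAPI route sketch for the proposed endpoint; the helper function and the returned object below are placeholders, not the actual implementation:

from fastapi import FastAPI

app = FastAPI()

@app.get('/a/{author_id}/stats')
def author_stats(author_id: int) -> dict:
    # in practice, a process function in api_queries.py would assemble
    # the full object shown above
    return {'hIndex': {'value': 1, 'institutionRank': '5/120 tied 2'}}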

Review and refactor the "faculties_tehran.py" script

Feature description

It has been some time since the creation of the faculties_tehran.py script (formerly Tehran.py). The script should be reviewed and refactored so that it uses the recent changes in the repo's programming style (like the extensive use of config.json).

Additionally, some helper functions that are useful across multiple scripts may be extracted from faculties_tehran.py to their own helpers.py module.

Suggested solution

Here are some tips:

  • Use more meaningful names for variables.
  • Use constants where possible.
  • Import constants from config.json.
  • Extract CSV I/O functions to a new sibling module, called helpers.py.
  • Simplify the script as much as possible (e.g. by removing the part for "education history" of the faculty members).

Good Luck!

Change API endpoint "qs" to something more descriptive

Feature description

The API endpoint /a/authorID/qs, which returns the count of published papers in journals with different percentiles, is vague. It should be replaced with a more descriptive path.

Suggested solution

  • The qs should be renamed to jmetrics (as in "journal metrics").
  • The /a/authorID/papers endpoint should accept the parameter metric instead of q.

Review and refactor the "faculties_modares.py" script

Feature description

The script faculties_modares.py (formerly Modares.py) was created long ago and never tested. Now that #30 is in progress, it is best to refactor this script as well, so that it uses the recent changes in the repo's programming style (like the extensive use of config.json).

Suggested solution

Here are some tips:

  • Use more meaningful names for variables.
  • Use constants where possible.
  • Import constants from config.json.
  • Use helper functions from helpers.py. Add to them if necessary.
  • Simplify the script as much as possible (e.g. by removing the part for "education history" of the faculty members).

Refactor the code in "helpers"

Feature description

Regarding the text in elsametric.md, this issue is going to address the refactoring of the functions in the helpers directory.

Suggested solution

  • Add type annotations to functions (if any remain) and important variables (such as SQLAlchemy's query results).
  • Deal with any remaining TODOs.
  • Use fewer if statements and more try/except blocks (EAFP); see the toy example after this list.
  • Simplify the functions as much as possible.
  • For special fields that can belong to only MySQL or PostgreSQL (but not both), decide on a way to proceed: different code for different backends, or a more general type for these columns that both MySQL and Postgres support (for example, Postgres supports the bool type, but MySQL does not, so both can use an int type that accepts only 1 or 0). To achieve better results, wait for #50 to be closed first.
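
To illustrate the EAFP point from the list above, a toy comparison:

record = {'id_scp': 123}  # a toy paper/author record

# LBYL style, with an explicit check first:
if 'h_index' in record:
    h_index = record['h_index']
else:
    h_index = None

# EAFP style, the approach suggested above:
try:
    h_index = record['h_index']
except KeyError:
    h_index = None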

Make "elsametric" compatible with PostgreSQL

Feature description

Regarding the text in elsametric.md, this issue is going to address how to make elsametric work with Postgres.

Suggested solution

  • Run the code, as is, with a Postgres database (without creating one first, so as to test the functionality of sqlalchemy-utils along the way); a sketch of the connection URIs follows this list.
  • Quickly solve the problems, one by one, to see which column types are problematic.
  • Decide whether to use different column types for different DBMSs or use a more general type that can cover both. This decision will affect the functions using the created database, including the repo elsaserver.
  • Apply the decision to all modified classes.
  • Decide if config.json file needs restructuring (to store Postgres configurations).
  • Update __init__.py and base.py if needed.
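
A sketch of the two engine URIs side by side; the driver names (mysqlconnector & psycopg2) are common choices, not confirmed by the repo:

MYSQL_URI = 'mysql+mysqlconnector://user:password@localhost:3306/elsametric'
POSTGRES_URI = 'postgresql+psycopg2://user:password@localhost:5432/elsametric'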

Create schema if it does not exist

Feature description

Currently, if the schema in config.json does not exist, scripts such as db_populate.py will throw an exception complaining about it. Devise a way of creating the schema if one isn't found.

Suggested solution

Use the Python package sqlalchemy-utils to check for the existence of the schema and create it if one is not present.
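
A minimal sketch; the URI below is a placeholder for the one built from config.json:

from sqlalchemy_utils import create_database, database_exists

# placeholder URI; the real one is built from config.json
engine_uri = 'mysql+mysqlconnector://user:password@localhost:3306/elsametric'

if not database_exists(engine_uri):
    create_database(engine_uri)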

Make "token_generator" function configurable

Feature description

The function token_generator has a hard-coded value within it:

def token_generator(nbytes=8):
    return secrets.token_urlsafe(nbytes)

This value should be coming from config.json.

Suggested solution

  1. Add a new key, token_bytes to the database section of the config.json file.
  2. Create a new constant, TOKEN_BYTES, that gets its value from config.json (inside __init__.py).
  3. Import TOKEN_BYTES from __init__.py into base.py and use it in token_generator.
  4. Calculate the required length of a VARCHAR column to store the token (inside base.py) and store it in a constant, VARCHAR_COLUMN_LENGTH; see the sketch after these steps.
  5. Import VARCHAR_COLUMN_LENGTH and use it in any module that uses token_generator.
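
A sketch of the outcome, assuming config.json is already parsed in __init__.py; the ceiling division accounts for Base64's 4-characters-per-3-bytes expansion:

import secrets

TOKEN_BYTES = 8  # in practice, read from config.json in __init__.py

# url-safe Base64 encodes every 3 bytes as 4 characters, so take the
# ceiling of TOKEN_BYTES * 4 / 3:
VARCHAR_COLUMN_LENGTH = -(-TOKEN_BYTES * 4 // 3)

def token_generator(nbytes=TOKEN_BYTES):
    return secrets.token_urlsafe(nbytes)

assert len(token_generator()) == VARCHAR_COLUMN_LENGTH  # 11 for 8 bytes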

Review the "ext_faculty_process" function for updates

Bug description

Working on #30, #31, and #32 caused some modifications to the way faculty data is handled in Airtable. For example, some columns now have a different name. This can affect the ext_faculty_process function in the helpers.py of the elsametric package.

Suggested solution

Review the function ext_faculty_process and the script db_populate.py to see if any updates are necessary.

Add google scholar IDs to author profiles

Feature description

Having an author's google scholar id provides a way of accessing his/her metrics, such as h-index & i10-index. These can be used in the front-end's dashboard, along with a link to the google scholar profile itself, which can serve as an additional contact method.

Suggested solution

Create a script that reads faculty names from a .csv file and searches for them in google scholar's database. A good way would be to use the scholarly package. If an author is found, use his/her google scholar id (id_gsc) to obtain the h-index & i10-index (probably using BeautifulSoup4).

Export the results to a .csv file.
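
A hedged sketch with the scholarly package; the API shown matches recent scholarly releases and may differ for older ones, and the CSV layout (a name column) is an assumption:

import csv

from scholarly import scholarly

with open('faculties.csv', encoding='utf-8') as f:
    names = [row['name'] for row in csv.DictReader(f)]  # assumed column

rows = []
for name in names:
    try:
        author = next(scholarly.search_author(name))
    except StopIteration:
        continue  # no google scholar profile found
    author = scholarly.fill(author, sections=['indices'])
    rows.append({
        'name': name,
        'id_gsc': author['scholar_id'],
        'h_index': author['hindex'],
        'i10_index': author['i10index'],
    })

with open('gsc_profiles.csv', 'w', newline='', encoding='utf-8') as f:
    writer = csv.DictWriter(
        f, fieldnames=['name', 'id_gsc', 'h_index', 'i10_index'])
    writer.writeheader()
    writer.writerows(rows)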

Add columns for "preferred" first and last names to the "Author" class of SQLAlchemy model

Feature description

The Author class of the SQLAlchemy model should have support for "preferred first name" and "preferred last name". These columns should be used in front-end clients for 2 reasons:

  1. When populating the database, the algorithm may fill the first column with an "initial", which is not suitable.
  2. In the future, users may be able to suggest edits on the author's info, in this case first and last names. If that happens, the "preferred" first and last names should be updated and displayed.

Suggested solution

  • Add two columns to the SQLAlchemy Author model, such as first_pref & last_pref.
  • Configure the Author model to use the values of the first and last columns as initial values; see the sketch after this list.
  • Values from "validated sources" can update first_pref and last_pref columns. These sources include a curated list of authors supplied from a third-party source (faculties.csv) and the author him/herself (using the front-end client).
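
A minimal sketch of the model change; the column types/lengths and the seeding approach are assumptions:

from sqlalchemy import Column, Integer, Unicode
from sqlalchemy.ext.declarative import declarative_base

Base = declarative_base()  # stand-in for the project's own Base

class Author(Base):
    __tablename__ = 'author'

    id = Column(Integer, primary_key=True)
    first = Column(Unicode(64))
    last = Column(Unicode(64))
    first_pref = Column(Unicode(64))  # new column
    last_pref = Column(Unicode(64))   # new column

    def __init__(self, **kwargs):
        super().__init__(**kwargs)
        # seed the preferred names from the raw columns; "validated
        # sources" (faculties.csv, the author him/herself) can
        # overwrite them later
        if self.first_pref is None:
            self.first_pref = self.first
        if self.last_pref is None:
            self.last_pref = self.last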

Use environment variables instead of config.json

Feature description

It would appear that using "environment variables" is the standard way of injecting runtime variables into an application. Currently, elsametric uses a config.json file with a nested structure, which makes it impossible for services such as Heroku to use the configurations without committing config.json itself, which is not recommended at all.

Using environment variables has, unfortunately, an immediate downside: one has to use a flat structure. To compensate, section names can be chained to the variable names using underscores, e.g. DB_STARTUP_MYSQL_DRIVER, which can be represented in JSON format as:

{
  "db": {
    "startup": {
      "mysql": {
        "driver": "mysqlconnector"
      }
    }
  }
}

Suggested solution

Use a package to parse environment variables, either injected from the command line, or from a .env file. One example is dynaconf, another is environs.

Change the code in __init__.py and elsewhere to employ the newly installed package.
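
A minimal environs sketch for the flattened naming scheme, following the DB_STARTUP_MYSQL_DRIVER example above:

from environs import Env

env = Env()
env.read_env()  # loads variables from a .env file, if one is found

mysql_driver = env.str('DB_STARTUP_MYSQL_DRIVER', 'mysqlconnector')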

Close branch flask

Feature description

The development of the Flask API stopped some time ago in favor of FastAPI. It is best to remove the flask branch to reduce the complexity of the project.

Suggested solution

  • Remove the flask branch.

Refactor api_queries.py

Feature description

The api_queries.py file has become too large and too complex. It should be refactored, and possibly split into smaller modules.

Expand the "papers" endpoint to accept more kinds of journal metrics

Feature description

Currently, the /a/authorID/papers endpoint accepts q1, q2, q3, and q4 as values for the metric parameter. It should also accept percentile values (such as p79) and undefined (for papers published in non-ranked sources).

Suggested solution

The function get_author_papers_metric should be changed to accept these new parameters. Care must be taken with data validation.
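
A hedged validation sketch for the expanded metric parameter; the pattern below treats q1–q4, p0–p99 & undefined as valid:

import re

METRIC_PATTERN = re.compile(r'^(q[1-4]|p[1-9]?[0-9]|undefined)$')

def is_valid_metric(metric: str) -> bool:
    return bool(METRIC_PATTERN.match(metric))

assert is_valid_metric('p79') and is_valid_metric('undefined')
assert not is_valid_metric('q5')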

Use string values in the get_keywords method of the Author class of the SQLAlchemy model

Feature description

Currently, the Author class of the SQLAlchemy model has a get_keywords() method which returns a dictionary of keywords used by the author, along with the count of each keyword. The dictionary is of the following format:

{keyword1: count1, keyword2: count2}

The problem is that the keys in that dictionary are of the type keyword_.Keyword instead of simple str. Since the Keyword class of the SQLAlchemy model does not hold much information (only the string of the actual keyword, in fact), it is safe to use string values as the keys for the above dictionary.

Suggested solution

  • Change the code in the Author model to use str values for the keys in the get_keywords() method; see the sketch below.
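
A hedged sketch of the changed method inside the Author model; the relationships and the Keyword model's string attribute (assumed to be keyword) are guesses about the existing internals:

def get_keywords(self) -> dict:
    result = {}
    for paper in self.papers:           # assumed relationship
        for keyword in paper.keywords:  # assumed relationship
            key = keyword.keyword       # plain str instead of a Keyword object
            result[key] = result.get(key, 0) + 1
    return result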
