
elsametric's Issues

Create a script to match database authors with institution faculty members

Feature description

In the past, the unorganized and messy shcopus repo was used (with some success) to match Sharif University of Technology's faculty members against .csv files exported from Scopus. If other institutions are to be added to the database, their faculty members must be distinguished from other authors. To achieve that, a new script is needed.

Suggested solution

A new script should be developed that:

  • Uses the institutions section of the config.json (to loop through desired institutions).
  • Employs fuzzy matching of strings (using the fuzzywuzzy package); see the sketch after this list.
  • Returns a list of Scopus IDs and a degree of confidence associated with the results.
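
A minimal sketch of the matching step (the data shapes, a list of faculty names and a name-to-Scopus-ID mapping, are illustrative assumptions, not the repo's actual structures):

from fuzzywuzzy import fuzz, process

# faculty names of one institution, e.g. read from a .csv file
faculty_names = ['Jane Doe', 'John Roe']

# database authors, assumed here as a full name -> Scopus ID mapping
db_authors = {'Doe J.': 12345678900, 'Johnny Roe': 98765432100}

matches = []
for name in faculty_names:
    best = process.extractOne(
        name, db_authors.keys(), scorer=fuzz.token_sort_ratio)
    if best:
        matched_name, confidence = best
        # (Scopus ID, degree of confidence in the match)
        matches.append((db_authors[matched_name], confidence))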

Merge branches "db_updates" & "fastapi" together.

Feature description

Working on the backend would be easier if it had only 1 development branch, namely dev.

Suggested solution

  1. Create a dev branch from master.
  2. Merge branches db_updates & fastapi into it.
  3. Remove branches db_updates & fastapi.

Write an account for "elsametric"

Feature description

Since its beginning 5 months ago, elsametric has only existed as a thought project. Its goals, and the path to achieving them, have remained vague ideas. Writing about how it came to life can help clarify many things, among them what to do next.

Suggested solution

Write about the following points:

  • the history of analyzing Scopus data
  • why there was a need to create a database
  • steps taken so far
  • the current state of the project
  • steps needed to reach production-ready code

Use the above to prioritize and make a plan for future steps.

Put a cap on the number of an author's collaborators

Feature description

Currently, the size of an author's list of collaborations is determined only by network_threshold, which specifies the minimum number of joint papers between author and co-author required to be on the returned list.

That list, however, can be quite long for some authors. Increasing the network_threshold cannot be considered since some authors do not have many collaborations. The solution, then, is to have an alternative method of limiting the final list.

Suggested solution

  • Rename the variable network_threshold to collaboration_threshold, since the former name is too vague.
  • Add a new variable called network_max_count which accepts an integer.
  • Edit the get_author_network function of api_queries.py to use that variable to limit the final list (removing the least strong collaborations first); see the sketch below.
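
A toy sketch of the limiting step, assuming the collaborations are already filtered by collaboration_threshold and shaped as (co-author, joint paper count) pairs:

# assumed shape: (co-author name, number of joint papers)
collaborators = [('A. Smith', 12), ('B. Jones', 3), ('C. Lee', 7)]
network_max_count = 2  # the new config variable proposed above

# keep the strongest collaborations, dropping the weakest first
collaborators.sort(key=lambda pair: pair[1], reverse=True)
collaborators = collaborators[:network_max_count]
# -> [('A. Smith', 12), ('C. Lee', 7)]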

elsametric cannot read environment variables

Bug description

In #53, elsametric started using the environs package to access environment variables. If, however, elsametric is called from another module, environs cannot find the required .env file, unless a specific path is supplied.

Suggested solution

Replace the line

env.read_env()

inside __init__.py with the one below:

env.read_env(path=Path.cwd())

Note that Path may need to be imported from the pathlib library.

Add support for two-level department structures

Feature description

Currently, elsametric supports institutions with only one level of departments. However, many universities have a two-level structure: they have several faculties/colleges/schools (such as "College of Engineering" or "School of Fine Arts"), each of which can contain multiple departments/study groups/research institutes (like "Department of Mechanical Engineering" or "Department of Performing Arts").

Adding support for these latter universities, though requiring a significant change to the current structure of elsametric, will enable it to cover a broader range of academic institutions.

Suggested solution

Here are some of the challenges:

  • Some institutions have only 1 level of departmental structure (as is already supported).
  • The names of parent & child departments vary from institution to institution; even within a single university, there might be names such as "school of this" and "college of that".
  • Within institutions with 2 levels of departments, there might be units that have no parent or child.

To overcome these challenges, it might be best to (these are initial, inconsistent thoughts):

  • Create a parent > child structure that can accept any name (faculty, college, ... for parents and department, group, ... for children).
  • Each Institution can have multiple parent departments (one to many). Each parent department can have multiple child departments of its own (one to many).
  • All authors of an institution must belong to at least 1 child department. No authors can belong to a parent department directly.
  • If an institution has only 1 level of departmental structure, all of its units must be the children of a generic parent department (Undefined).
  • If an institution with 2 levels of departmental structure also has units with no parent or children, those units should be treated as children of a generic parent department (Undefined).

Parent Department & Child Department models should support the following fields:

  1. English & Farsi names
  2. English abbreviation
  3. Type
  4. Other fields already present in the department model.

... to be continued?
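
As a first sketch, the proposed chain could look like this in SQLAlchemy (class, table & column names here are assumptions, not the final model):

from sqlalchemy import Column, ForeignKey, Integer, Unicode
from sqlalchemy.ext.declarative import declarative_base
from sqlalchemy.orm import relationship

Base = declarative_base()  # stand-in for the project's own Base

class ParentDepartment(Base):
    __tablename__ = 'parent_department'

    id = Column(Integer, primary_key=True)
    # assumes the existing institution table (one to many)
    institution_id = Column(Integer, ForeignKey('institution.id'))
    name = Column(Unicode(128), nullable=False)  # English name
    name_fa = Column(Unicode(128))               # Farsi name
    abbreviation = Column(Unicode(16))           # English abbreviation
    type = Column(Unicode(32))                   # faculty, college, school, ...

    children = relationship('ChildDepartment', back_populates='parent')

class ChildDepartment(Base):
    __tablename__ = 'child_department'

    id = Column(Integer, primary_key=True)
    parent_id = Column(Integer, ForeignKey('parent_department.id'))
    name = Column(Unicode(128), nullable=False)
    name_fa = Column(Unicode(128))
    abbreviation = Column(Unicode(16))
    type = Column(Unicode(32))                   # department, group, ...

    parent = relationship('ParentDepartment', back_populates='children')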

Turn the SQLAlchemy model into a full python package

Feature description

The folder structure of the repo is becoming too cluttered. The SQLAlchemy model should be its own package, which can be installed using pip.

Suggested solution

Using Python's setuptools package, turn the elsametric directory within the repo into its own package. Since all of this is experimental, the available methods of the setuptools package can be used to exclude unnecessary folders from the final package, so the current structure can be left alone.
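
A minimal setup.py sketch; the name, version & excluded folders below are placeholders:

from setuptools import find_packages, setup

setup(
    name='elsametric',
    version='0.1.0',
    # keep the repo layout intact, but leave helper folders out of the
    # distributed package
    packages=find_packages(
        exclude=['junkyard*', 'helper_scripts*', 'db_design*']),
    install_requires=['sqlalchemy', 'sqlalchemy-utils'],
)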

"sqlalchemy_utils" creates the database with "utf8" instead of "utf8mb4"

Bug description

As a result of #24, elsametric now uses the sqlalchemy_utils package to create a database, if one does not exist. This happens in the package's __init__.py file as:

if not database_exists(engine_uri):
    create_database(engine_uri)

This code, by default, creates a schema using the utf8 charset, which is fine for most cases, but there are some publications that use special characters not supported by utf8. That is why MySQL supports another encoding, utf8mb4, which covers more characters.

Not using utf8mb4 may result in errors like Incorrect string value when dealing with VARCHAR or TEXT columns.

Suggested solution

Explicitly tell sqlalchemy_utils to use utf8mb4 when creating a database:

if not database_exists(engine_uri):
    create_database(engine_uri, encoding='utf8mb4')

Move unrelated files to another repo

Introduction

After #22, elsametric should be used as a separate package and the current repo should only host files necessary for its development. To achieve this, the scope of the package elsametric must be determined and to do that, its mission has to be defined first.

What is elsametric?

It began as a collection of loose ideas about how to break down the raw academic publication data (such as data obtained from Scopus) in a database, which then could be queried to extract information.

elsametric has two main parts:

  • a part which designs the database using SQLAlchemy (the models directory)
  • a part which gives access to some process functions that help populate the database with the data gathered from here and there (the helpers directory, along with db_populate.py)

There are also some other parts as well:

  • some (mostly visual) files on the current shape of the database (the db_design folder)
  • some Python scripts and Jupyter Notebooks which usually contain old, experimental scripts, including:
    • custom.ipynb which retrieves data from Scopus API
    • Tehran.py & Modares.py which attempt to crawl faculty data from the University of Tehran and Tarbiat Modares University respectively
    • queries.py & queries.ipynb which were used to test different queries

These files should be moved to either the junkyard or the helper_scripts directories.

Additionally, the repo contains:

  • config.json which is used to connect to the database, populate it using different sources, configure the API, ...
  • gsc_profile.py in the helper_scripts directory which is used to get author metrics from google scholar

To create the API, two other files were created: main.py (API routes) & api_queries.py, which includes process functions for the API to work.

What should elsametric do (and what should it not)

elsametric is about designing and maintaining an efficient database to store academic publications data. As such, it should consist of:

  1. the elsametric folder, which includes the SQLAlchemy model and some helper functions to process data
  2. the db_design folder, which holds a graphical version of the SQLAlchemy model, created using MySQL Workbench
  3. scripts to populate the database, which at the moment only includes db_populate.py
  4. scripts to gather data from the web, including:
    • scripts for getting publications data from servers such as Scopus & WOS
    • crawlers for getting the profile of the faculty members
    • crawlers for getting author metrics (such as h-index from google scholar)

Of the items mentioned above, only the elsametric directory will be installed using pip. Other scripts and files reside solely in the repo. Future releases might install them along with the elsametric folder.

Other functionality regarding the growth and maintenance of the database can be included in the future. For example, the CSV-processing functions in the shcopus repo, which can analyze CSV exports from Scopus, can be added to this repo in case of Scopus API limitations.

Yet other functionality might include ways of migrating the database, probably using Alembic, SQLAlchemy's migration tool. These tools will enable the package to avoid re-populating the entire database every time a change in the structure is needed.

elsametric is not about creating and maintaining a webserver or an API. That should be the job of another repo. Hence, the files main.py and api_queries.py are to be moved out of this repository.

Any remaining scripts, whether Python or Jupyter Notebook, should be moved to the helper_scripts directory or, if not needed, to the junkyard directory. Eventually, the junkyard folder should be reviewed for any useful files and subsequently deleted from the repo... one should travel light!

Add tests

Feature description

Currently, there are no tests implemented against the codebase. Frankly, that is probably why the repo's feature requests far outnumber its bug reports.

Suggested solution

Add unit tests and integration tests to the codebase.
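
As a starting point, a minimal pytest unit test might look like the following, using the token_generator function shown elsewhere in these issues as the unit under test:

import secrets

def token_generator(nbytes=8):
    # copied from base.py, as described in the token_generator issue
    return secrets.token_urlsafe(nbytes)

def test_token_length():
    # 8 random bytes encode to 11 url-safe Base64 characters
    assert len(token_generator(8)) == 11

def test_tokens_are_unique():
    assert token_generator() != token_generator()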

Rename "Session" to "SessionLocal"

Feature description

The idea comes from the FastAPI documentation: import Session from sqlalchemy.orm wherever a session object is needed, for extra editor support.

Suggested solution

Rename the Session object returned by the sessionmaker function in base.py to SessionLocal. This will help avoid clashes between it and the Session imported from sqlalchemy.orm. Then, in every file that needs it, do the following:

  • Use SessionLocal to issue queries and such.
  • Use Session (from sqlalchemy.orm) for type hinting.
  • Also, rename instances of SessionLocal from session (all lower-case) to just db. That way, an example query would look like:
author = db.query(Author).get(1)

Instead of:

author = session.query(Author).get(1)

This has the benefit of being shorter and more meaningful; a fuller sketch follows.
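
A minimal sketch of the naming scheme, using an in-memory SQLite engine as a stand-in for the one built in base.py:

from sqlalchemy import create_engine, text
from sqlalchemy.orm import Session, sessionmaker

engine = create_engine('sqlite://')  # stand-in for the engine in base.py

# base.py: the sessionmaker result, renamed from Session to SessionLocal
SessionLocal = sessionmaker(bind=engine)

# consuming modules: sqlalchemy.orm.Session is used only for type hints
def ping(db: Session) -> int:
    return db.execute(text('SELECT 1')).scalar()

db = SessionLocal()
assert ping(db) == 1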

Review and refactor the "gsc_profile.py" script

Feature description

With the creation of the new helpers.py module as a result of #30, the script gsc_profile.py should be refactored to use the functions in that module, instead of defining new ones within the script.

Suggested solution

Remove the function definitions from gsc_profile.py and use the ones imported from helpers.py.

Author Stats

Feature suggestion

This issue intends to make common author stats available to the front-end clients.

Stats potentially include:

  • h-index
  • i10-index
  • total papers
  • total citations
  • papers this year
  • citations this year
  • collaborations: institutional
  • collaborations: national
  • collaborations: international
  • number of books & book chapters
  • conference papers percentage (or conference paper to journal paper ratio)

Additionally, the rank of the author for different metrics within the institution can also be shown as separate metrics:

  • rank of h-index
  • rank of i10-index
  • rank of total papers
  • rank of total citations

These ranks can be accompanied by the histogram data of that metric within the institution, to better show the author's position among his/her colleagues.

Suggested solution

  • A new API endpoint should be made available, e.g. /a/authorID/stats
  • Some stats are readily available, such as h-index & total citations, while others need to be calculated, like international collaborations
  • All stats should have a date attached to them. For some of them, this is the date that the metrics were obtained from a 3rd-party source, such as h-index & i10-index, while for others like total citations, this must be equal to the retrieval date of the paper with the oldest retrieval_time attribute.
  • For each institution (or more precisely in this case, for the home_institution), the histogram data of all metrics should be calculated once, perhaps asynchronously. The results then can be attached to the rank metrics of the requested author.
  • The returned object should be formatted roughly like this:
{
  "hIndex": {
    "value": 1,
    "retrievalTime": "timestamp or datetime object",
    "institutionRank": "5/120 tied 2",
    "histogram": "data"
  },
  "papers": {
    "value": 50,
    "retrievalTime": "timestamp or datetime object",
    "institutionRank": "7/120",
    "histogram": "data"
  },
  "papersThisYear": {
    "value": 10,
    "retrievalTime": "timestamp or datetime object",
    "institutionRank": "12/120 tied 1",
    "histogram": []
  }
}
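
A minimal FastAPI route sketch for the proposed endpoint; the helper function and the returned object below are placeholders, not the actual implementation:

from fastapi import FastAPI

app = FastAPI()

@app.get('/a/{author_id}/stats')
def author_stats(author_id: int) -> dict:
    # in practice, a process function in api_queries.py would assemble
    # the full object shown above
    return {'hIndex': {'value': 1, 'institutionRank': '5/120 tied 2'}}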

Review and refactor the "faculties_tehran.py" script

Feature description

It has been some time since the creation of the faculties_tehran.py script (formerly Tehran.py). The script should be reviewed and refactored so that it uses the recent changes in the repo's programming style (like the extensive use of config.json).

Additionally, some helper functions that are useful across multiple scripts may be extracted from faculties_tehran.py to their own helpers.py module.

Suggested solution

Here are some tips:

  • Use more meaningful names for variables.
  • Use constants where possible.
  • Import constants from config.json.
  • Extract CSV I/O functions to a new sibling module, called helpers.py.
  • Simplify the script as much as possible (e.g. by removing the part for "education history" of the faculty members).

Good Luck!

Change API endpoint "qs" to something more descriptive

Feature description

The API endpoint /a/authorID/qs, which returns the count of published papers in journals with different percentiles, is vague. It should be replaced with a more descriptive path.

Suggested solution

  • The qs should be renamed to jmetrics (as in "journal metrics").
  • The /a/authorID/papers endpoint should accept the parameter metric instead of q.

Review and refactor the "faculties_modares.py" script

Feature description

The script faculties_modares.py (formerly Modares.py) was created long ago and never tested. Now that #30 is in progress, it is best to refactor this script as well, so that it uses the recent changes in the repo's programming style (like the extensive use of config.json).

Suggested solution

Here are some tips:

  • Use more meaningful names for variables.
  • Use constants where possible.
  • Import constants from config.json.
  • Use helper functions from helpers.py. Add to them if necessary.
  • Simplify the script as much as possible (e.g. by removing the part for "education history" of the faculty members).

Refactor the code in "helpers"

Feature description

Regarding the text in elsametric.md, this issue is going to address the refactoring of the functions in the helpers directory.

Suggested solution

  • Add type annotations to functions (if any remain) and important variables (such as SQLAlchemy's query results).
  • Deal with any remaining TODOs.
  • Use fewer if statements and more try/except blocks (EAFP); see the toy example after this list.
  • Simplify the functions as much as possible.
  • For special fields that can belong to only MySQL or PostgreSQL (but not both), decide on a way to proceed: different code for different backends, or a more general type for these columns that both MySQL and Postgres support (for example, Postgres supports the bool type, but MySQL does not, so both can use an int type that accepts only 1 or 0). To achieve better results, wait for #50 to be closed first.
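
To illustrate the EAFP point from the list above, a toy comparison:

record = {'id_scp': 123}  # a toy paper/author record

# LBYL style, with an explicit check first:
if 'h_index' in record:
    h_index = record['h_index']
else:
    h_index = None

# EAFP style, the approach suggested above:
try:
    h_index = record['h_index']
except KeyError:
    h_index = None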

Make "elsametric" compatible with PostgreSQL

Feature description

Regarding the text in elsametric.md, this issue is going to address how to make elsametric work with Postgres.

Suggested solution

  • Run the code, as is, with a Postgres database (without creating one first, so as to test the functionality of sqlalchemy-utils along the way); a sketch of the connection URIs follows this list.
  • Quickly solve the problems, one by one, to see which column types are problematic.
  • Decide whether to use different column types for different DBMSs or use a more general type that can cover both. This decision will affect the functions using the created database, including the repo elsaserver.
  • Apply the decision to all modified classes.
  • Decide if config.json file needs restructuring (to store Postgres configurations).
  • Update __init__.py and base.py if needed.
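
A sketch of the two engine URIs side by side; the driver names (mysqlconnector & psycopg2) are common choices, not confirmed by the repo:

MYSQL_URI = 'mysql+mysqlconnector://user:password@localhost:3306/elsametric'
POSTGRES_URI = 'postgresql+psycopg2://user:password@localhost:5432/elsametric'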

Create schema if it does not exist

Feature description

Currently, if the schema in config.json does not exist, scripts such as db_populate.py will throw an exception complaining about it. Devise a way of creating the schema if one isn't found.

Suggested solution

Use the Python package sqlalchemy-utils to check for the existence of the schema and create it if one is not present.
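
A minimal sketch; the URI below is a placeholder for the one built from config.json:

from sqlalchemy_utils import create_database, database_exists

# placeholder URI; the real one is built from config.json
engine_uri = 'mysql+mysqlconnector://user:password@localhost:3306/elsametric'

if not database_exists(engine_uri):
    create_database(engine_uri)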

Make "token_generator" function configurable

Feature description

The function token_generator has a hard-coded value within it:

def token_generator(nbytes=8):
    return secrets.token_urlsafe(nbytes)

This value should be coming from config.json.

Suggested solution

  1. Add a new key, token_bytes to the database section of the config.json file.
  2. Create a new constant, TOKEN_BYTES, that gets its value from config.json (inside __init__.py).
  3. Import TOKEN_BYTES from __init__.py into base.py and use it in token_generator.
  4. Calculate the required length of a VARCHAR column to store the token (inside base.py) and store it in a constant, VARCHAR_COLUMN_LENGTH; see the sketch after these steps.
  5. Import VARCHAR_COLUMN_LENGTH and use it in any module that uses token_generator.
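
A sketch of the outcome, assuming config.json is already parsed in __init__.py; the ceiling division accounts for Base64's 4-characters-per-3-bytes expansion:

import secrets

TOKEN_BYTES = 8  # in practice, read from config.json in __init__.py

# url-safe Base64 encodes every 3 bytes as 4 characters, so take the
# ceiling of TOKEN_BYTES * 4 / 3:
VARCHAR_COLUMN_LENGTH = -(-TOKEN_BYTES * 4 // 3)

def token_generator(nbytes=TOKEN_BYTES):
    return secrets.token_urlsafe(nbytes)

assert len(token_generator()) == VARCHAR_COLUMN_LENGTH  # 11 for 8 bytes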

Review the "ext_faculty_process" function for updates

Bug description

Working on #30, #31, and #32 caused some modifications to the way faculty data is handled in Airtable. For example, some columns now have a different name. This can affect the ext_faculty_process function in the helpers.py of the elsametric package.

Suggested solution

Review the function ext_faculty_process and the script db_populate.py to see if any updates are necessary.

Add google scholar IDs to author profiles

Feature description

Having an author's google scholar id provides a way of accessing his/her metrics, such as h-index & i10-index. These can be used in the front-end's dashboard, along with a link to the google scholar profile itself, which can serve as an additional contact method.

Suggested solution

Create a script that reads faculty names from a .csv file and searches for them in google scholar's database. A good way would be to use the scholarly package. If an author is found, use his/her google scholar id (id_gsc) to obtain the h-index & i10-index (probably using BeautifulSoup4).

Export the results to a .csv file.
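
A hedged sketch with the scholarly package; the API shown matches recent scholarly releases and may differ for older ones, and the CSV layout (a name column) is an assumption:

import csv

from scholarly import scholarly

with open('faculties.csv', encoding='utf-8') as f:
    names = [row['name'] for row in csv.DictReader(f)]  # assumed column

rows = []
for name in names:
    try:
        author = next(scholarly.search_author(name))
    except StopIteration:
        continue  # no google scholar profile found
    author = scholarly.fill(author, sections=['indices'])
    rows.append({
        'name': name,
        'id_gsc': author['scholar_id'],
        'h_index': author['hindex'],
        'i10_index': author['i10index'],
    })

with open('gsc_profiles.csv', 'w', newline='', encoding='utf-8') as f:
    writer = csv.DictWriter(
        f, fieldnames=['name', 'id_gsc', 'h_index', 'i10_index'])
    writer.writeheader()
    writer.writerows(rows)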

Add columns for "preferred" first and last names to the "Author" class of SQLAlchemy model

Feature description

The Author class of the SQLAlchemy model should have support for "preferred first name" and "preferred last name". These columns should be used in front-end clients for 2 reasons:

  1. When populating the database, the algorithm may fill the first column with an "initial", which is not suitable.
  2. In the future, users may be able to suggest edits on the author's info, in this case first and last names. If that happens, the "preferred" first and last names should be updated and displayed.

Suggested solution

  • Add two columns to the SQLAlchemy Author model, such as first_pref & last_pref.
  • Configure the Author model to use the values of the first and last columns as initial values; see the sketch after this list.
  • Values from "validated sources" can update first_pref and last_pref columns. These sources include a curated list of authors supplied from a third-party source (faculties.csv) and the author him/herself (using the front-end client).
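
A minimal sketch of the model change; the column types/lengths and the seeding approach are assumptions:

from sqlalchemy import Column, Integer, Unicode
from sqlalchemy.ext.declarative import declarative_base

Base = declarative_base()  # stand-in for the project's own Base

class Author(Base):
    __tablename__ = 'author'

    id = Column(Integer, primary_key=True)
    first = Column(Unicode(64))
    last = Column(Unicode(64))
    first_pref = Column(Unicode(64))  # new column
    last_pref = Column(Unicode(64))   # new column

    def __init__(self, **kwargs):
        super().__init__(**kwargs)
        # seed the preferred names from the raw columns; "validated
        # sources" (faculties.csv, the author him/herself) can
        # overwrite them later
        if self.first_pref is None:
            self.first_pref = self.first
        if self.last_pref is None:
            self.last_pref = self.last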

Use environment variables instead of config.json

Feature description

It would appear that using "environment variables" is the standard way of injecting runtime variables into an application. Currently, elsametric uses a config.json file with a nested structure, which makes it impossible for services such as Heroku to use the configurations without committing config.json itself, which is not recommended at all.

Using environment variables has, unfortunately, an immediate downside: one has to use a flat structure. To compensate, section names can be chained to the variable names using underscores, e.g. DB_STARTUP_MYSQL_DRIVER, which can be represented in JSON format as:

{
  "db": {
    "startup": {
      "mysql": {
        "driver": "mysqlconnector"
      }
    }
  }
}

Suggested solution

Use a package to parse environment variables, either injected from the command line, or from a .env file. One example is dynaconf, another is environs.

Change the code in __init__.py and elsewhere to employ the newly installed package.
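
A minimal environs sketch for the flattened naming scheme, following the DB_STARTUP_MYSQL_DRIVER example above:

from environs import Env

env = Env()
env.read_env()  # loads variables from a .env file, if one is found

mysql_driver = env.str('DB_STARTUP_MYSQL_DRIVER', 'mysqlconnector')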

Close branch flask

Feature description

The development of the Flask API stopped some time ago in favor of FastAPI. It is best to remove the flask branch to reduce the complexity of the project.

Suggested solution

  • Remove the flask branch.

Refactor api_queries.py

Feature description

The api_queries.py file has become too large and too complex. It should be refactored, and possibly split into smaller modules.

Expand the "papers" endpoint to accept more kinds of journal metrics

Feature description

Currently, the /a/authorID/papers endpoint accepts q1, q2, q3, and q4 as values for the metric parameter. It should also accept percentile values (such as p79) and undefined (for papers published in non-ranked sources).

Suggested solution

The function get_author_papers_metric should be changed to accept these new parameters. Care must be taken with data validation.
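
A hedged validation sketch for the expanded metric parameter; the pattern below treats q1–q4, p0–p99 & undefined as valid:

import re

METRIC_PATTERN = re.compile(r'^(q[1-4]|p[1-9]?[0-9]|undefined)$')

def is_valid_metric(metric: str) -> bool:
    return bool(METRIC_PATTERN.match(metric))

assert is_valid_metric('p79') and is_valid_metric('undefined')
assert not is_valid_metric('q5')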

Use string values in the get_keywords method of the Author class of the SQLAlchemy model

Feature description

Currently, the Author class of the SQLAlchemy model has a get_keywords() method which returns a dictionary of keywords used by the author, along with the count of each keyword. The dictionary is of the following format:

{keyword1: count1, keyword2: count2}

The problem is that the keys in that dictionary are of the type keyword_.Keyword instead of simple str. Since the Keyword class of the SQLAlchemy model does not hold much information (only the string of the actual keyword, in fact), it is safe to use string values as the keys for the above dictionary.

Suggested solution

  • Change the code in the Author model to use str values for the keys in the get_keywords() method; see the sketch below.
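
A hedged sketch of the changed method inside the Author model; the relationships and the Keyword model's string attribute (assumed to be keyword) are guesses about the existing internals:

def get_keywords(self) -> dict:
    result = {}
    for paper in self.papers:           # assumed relationship
        for keyword in paper.keywords:  # assumed relationship
            key = keyword.keyword       # plain str instead of a Keyword object
            result[key] = result.get(key, 0) + 1
    return result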
