elsametric's Issues
Create a script to match database authors with institution faculty members
Feature description
In the past, the unorganized and messy repo `shcopus` was used (with some success) to match Sharif University of Technology's faculty members with `.csv` files exported from Scopus. If other institutions are to be added to the database, their faculty members must be distinguished from other authors. To achieve that, a new script is needed.
Suggested solution
A new script should be developed that:
- Uses the `institutions` section of the `config.json` (to loop through desired institutions).
- Employs fuzzy matching of strings (using the `fuzzywuzzy` package).
- Returns a list of Scopus IDs and a degree of confidence associated with the results.
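The steps above could be sketched as follows, using the stdlib `difflib` as a stand-in for `fuzzywuzzy` (all function names and sample data here are hypothetical):

```python
from difflib import SequenceMatcher

# Hypothetical sample data; real names would come from the "institutions"
# section of config.json and from the database, respectively.
faculty_names = ["Ali Akbar Salehi", "Maryam Mirzakhani"]
db_authors = {10001: "A. A. Salehi", 10002: "M. Mirzakhani", 10003: "J. Doe"}

def similarity(a: str, b: str) -> int:
    """Return a 0-100 similarity score, like fuzzywuzzy's ratio()."""
    return round(SequenceMatcher(None, a.lower(), b.lower()).ratio() * 100)

def match_faculty(faculty: list, authors: dict, cutoff: int = 60) -> list:
    """Return (scopus_id, faculty_name, confidence) for the best match
    of each faculty name, keeping only matches above the cutoff."""
    matches = []
    for name in faculty:
        scopus_id, score = max(
            ((sid, similarity(name, db_name)) for sid, db_name in authors.items()),
            key=lambda pair: pair[1],
        )
        if score >= cutoff:
            matches.append((scopus_id, name, score))
    return matches
```

The returned confidence scores would let a human reviewer double-check the borderline matches instead of trusting the matching blindly.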
Merge branches "db_updates" & "fastapi" together.
Feature description
Working on the backend would be easier if it had only one development branch, namely `dev`.
Suggested solution
- Create a `dev` branch from `master`.
- Merge branches `db_updates` & `fastapi` into it.
- Remove branches `db_updates` & `fastapi`.
Write an account for "elsametric"
Feature description
Since its beginning 5 months ago, `elsametric` has only existed as a thought project. Its goals and the path to achieving them have remained vague ideas. Writing about how it came to life can help clarify many things, among them what to do next.
Suggested solution
Write about the following points:
- the history of analyzing Scopus data
- why there was a need to create a database
- steps taken so far
- the current state of the project
- steps needed for production-ready code
Use the above to prioritize and make a plan for future steps.
Put a cap on the number of author's collaborators
Feature description
Currently, the size of the list of an author's collaborations is determined only by `network_threshold`, which specifies the minimum number of joint papers between the author and a co-author required to be on the returned list.
That list, however, can be quite long for some authors. Increasing `network_threshold` cannot be considered, since some authors do not have many collaborations. The solution, then, is to have an alternative method of limiting the final list.
Suggested solution
- Rename the variable `network_threshold` to `collaboration_threshold`, since the former name is too vague.
- Add a new variable called `network_max_count` which accepts an integer.
- Edit the `get_author_network` function of `api_queries.py` to use that variable to limit the final list (removing the weakest collaborations first).
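A minimal sketch of the proposed capping logic, assuming the collaboration list is a mapping of co-author Scopus IDs to joint-paper counts (the function and variable names are hypothetical):

```python
# Hedged sketch: collaborations maps a co-author's Scopus ID to the
# number of joint papers with the author under consideration.
def limit_network(collaborations: dict,
                  collaboration_threshold: int = 2,
                  network_max_count: int = 3) -> list:
    """Apply the joint-paper threshold, then cap the list length,
    dropping the weakest collaborations first."""
    strong = [(sid, papers) for sid, papers in collaborations.items()
              if papers >= collaboration_threshold]
    strong.sort(key=lambda pair: pair[1], reverse=True)  # strongest first
    return strong[:network_max_count]

network = {101: 12, 102: 1, 103: 5, 104: 3, 105: 2}
```

With these sample values, the threshold drops co-author `102` and the cap then keeps only the three strongest remaining collaborations.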
elsametric cannot read environment variables
Bug description
In #53, `elsametric` started using the `environs` package to access environment variables. If, however, `elsametric` is called from another module, `environs` cannot find the required `.env` file, unless a specific path is supplied.
Suggested solution
Replace the line `env.read_env()` inside `__init__.py` with the one below:

```python
env.read_env(path=Path.cwd())
```

Note that it may be necessary to import `Path` from the `pathlib` library.
Add support for two-level department structures
Feature description
Currently, `elsametric` supports institutions with only one level of departments. However, many universities have a two-level structure: they have faculties/colleges/schools (such as "College of Engineering" or "School of Fine Arts"), each of which can contain multiple departments/study groups/research institutes (like "Department of Mechanical Engineering" or "Department of Performing Arts").
Adding support for the latter, though requiring a significant change in the current structure of `elsametric`, will let it cover a broader range of academic institutions.
Suggested solution
Here are some of the challenges:
- Some institutions have only one level of departmental structure (which is already supported).
- The names of parent & child departments vary from institution to institution; even within a single university, there might be names such as "school of this" and "college of that".
- Within institutions with two levels of departments, there might be units that have no parent or no children.
To overcome these challenges, it might be best to (these are initial, inconsistent thoughts):
- Create a parent > child structure that can accept any name (faculty, college, ... for parents; department, group, ... for children).
- Each Institution can have multiple parent departments (one-to-many). Each parent department can have multiple child departments of its own (one-to-many).
- All authors of an institution must belong to at least one child department. No author can belong to a parent department directly.
- If an institution has only one level of departmental structure, all of its units must be the children of a generic parent department (`Undefined`).
- If an institution with two levels of departmental structure also has units with no parent or children, those units should be treated as children of a generic parent department (`Undefined`).
Parent Department & Child Department models should support the following fields:
- English & Farsi names
- English abbreviation
- Type
- Other fields already present in the `department` model.
... to be continued?
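The generic-parent rule described above can be sketched in plain Python (the helper name and the sample units are hypothetical; the real implementation would live in the SQLAlchemy models):

```python
# A sketch of the proposed rule: every unit becomes the child of some parent
# department, falling back to a generic "Undefined" parent when none exists.
UNDEFINED = "Undefined"

def group_by_parent(units: list) -> dict:
    """Group (child, parent) pairs into a parent -> children mapping."""
    tree: dict = {}
    for child, parent in units:
        tree.setdefault(parent or UNDEFINED, []).append(child)
    return tree

# Hypothetical sample units; a None parent marks a stand-alone unit
units = [
    ("Department of Mechanical Engineering", "College of Engineering"),
    ("Department of Performing Arts", "School of Fine Arts"),
    ("Institute of Water Research", None),
]
```

This keeps the one-to-many invariant intact for single-level institutions as well: all of their units simply end up under `Undefined`.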
Turn the SQLAlchemy model into a full python package
Feature description
The folder structure of the repo is becoming too cluttered. The SQLAlchemy model should be its own package, which can be installed using `pip`.
Suggested solution
Using Python's `setuptools` package, turn the `elsametric` directory within the repo into its own package. Since all of this is experimental, the available methods of the `setuptools` package can be used to exclude unnecessary folders from the final package, so the current structure can be left alone.
"sqlalchemy_utils" creates the database with "utf8" instead of "utf8mb4"
Bug description
As a result of #24, `elsametric` now uses the `sqlalchemy_utils` package to create a database if one does not exist. This happens in the package's `__init__.py` file as:

```python
if not database_exists(engine_uri):
    create_database(engine_uri)
```
This code, by default, creates a schema using the `utf8` charset, which is fine for most cases, but some publications use special characters not supported by `utf8`. That is why MySQL supports another encoding, `utf8mb4`, which covers more characters.
Not using `utf8mb4` may result in errors like `Incorrect string value` when dealing with `VARCHAR` or `TEXT` columns.
Suggested solution
Explicitly tell `sqlalchemy_utils` to use `utf8mb4` when creating the database:

```python
if not database_exists(engine_uri):
    create_database(engine_uri, encoding='utf8mb4')
```
Move unrelated files to another repo
Introduction
After #22, `elsametric` should be used as a separate package, and the current repo should only host files necessary for its development. To achieve this, the scope of the `elsametric` package must be determined, and to do that, its mission has to be defined first.
What is `elsametric`?
It began as a collection of loose ideas about how to break down the raw academic publication data (such as data obtained from Scopus) in a database, which then could be queried to extract information.
`elsametric` has two main parts:
- a part which designs the database using SQLAlchemy (the `models` directory)
- a part which provides process functions that help populate the database with data gathered from here and there (the `helpers` directory, along with `db_populate.py`)
There are also some other parts:
- some (mostly visual) files on the current shape of the database (the `db_design` folder)
- some Python scripts and Jupyter Notebooks which usually contain old, experimental scripts, including:
  - `custom.ipynb`, which retrieves data from the Scopus API
  - `Tehran.py` & `Modares.py`, which attempt to crawl faculty data from the University of Tehran and Tarbiat Modares University, respectively
  - `queries.py` & `queries.ipynb`, which were used to test different queries

These files should be moved to either the `junkyard` or the `helper_scripts` directories.
Additionally, the repo contains:
- `config.json`, which is used to connect to the database, populate it using different sources, configure the API, ...
- `gsc_profile.py` in the `helper_scripts` directory, which is used to get author metrics from Google Scholar

To create the API, two other files were created: `main.py` (API routes) & `api_queries.py`, which includes process functions for the API to work.
What should `elsametric` do (and what should it not)?
`elsametric` is about designing and maintaining an efficient database for storing academic publications data. As such, it should consist of:
- the `elsametric` folder, which includes the SQLAlchemy model and some helper functions to process data
- the `db_design` folder, which holds a graphical version of the SQLAlchemy model, created using MySQL Workbench
- scripts to populate the database, which at the moment only include `db_populate.py`
- scripts to gather data from the web, including:
  - scripts for getting publications data from servers such as Scopus & WOS
  - crawlers for getting the profiles of the faculty members
  - crawlers for getting author metrics (such as h-index from Google Scholar)
Of the items mentioned above, only the `elsametric` directory will be installed using `pip`. Other scripts and files reside solely in the repo; future releases might install them along with the `elsametric` folder.
Other functionality regarding the growth and maintenance of the database can be included in the future. For example, the CSV-processing functions of the `shcopus` repo, which can analyze CSV exports from Scopus, could be added to this repo in case of Scopus API limitations.
Yet other functionality might include ways of migrating the database, probably using SQLAlchemy's Alembic. Such tools would let the package avoid re-populating the entire database every time a change in the structure is needed.
`elsametric` is not about creating and maintaining a webserver or an API; that should be the job of another repo. Hence, the files `main.py` and `api_queries.py` are to be moved out of this repository.
Any remaining scripts, whether Python or Jupyter Notebook, should be moved to the `helper_scripts` directory, and if they are not needed, to the `junkyard` directory. Eventually, the `junkyard` folder should be reviewed for any useful files and subsequently deleted from the repo... one should travel light!
Add tests
Feature description
Currently, there are no tests implemented against the codebase. Frankly, I think that is why the repo's feature requests far outnumber its bug reports.
Suggested solution
Add Unit Testing and Integration Testing to the codebase.
Rename "Session" to "SessionLocal"
Feature description
The idea comes from the FastAPI documentation: import `Session` from `sqlalchemy.orm` wherever a `session` object is needed, for extra editor support.
Suggested solution
Rename the `Session` returned by the `sessionmaker` function in `base.py` to `SessionLocal`. This will help avoid any clashes between it and the `Session` imported from `sqlalchemy.orm`. Then, in every file needed, do the following:
- Use `SessionLocal` to issue queries and such.
- Use `Session` (from `sqlalchemy.orm`) for type hinting.
- Also, rename instances of `SessionLocal` from `session` (all lower-cased) to just `db`. That way, an example query would look like:

```python
author = db.query(Author).get(1)
```

Instead of:

```python
author = session.query(Author).get(1)
```

This has the benefit of being shorter and more meaningful.
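The naming scheme described above can be sketched end to end with an in-memory SQLite database (the `Author` model here is a hypothetical stand-in, and `filter_by(...).first()` is used in place of the legacy `Query.get()`):

```python
from sqlalchemy import Column, Integer, String, create_engine
from sqlalchemy.orm import Session, declarative_base, sessionmaker

Base = declarative_base()

class Author(Base):
    __tablename__ = "author"
    id = Column(Integer, primary_key=True)
    first = Column(String(50))

engine = create_engine("sqlite:///:memory:")
Base.metadata.create_all(engine)

# The sessionmaker result is named SessionLocal, leaving "Session" free
# to be used purely as a type hint from sqlalchemy.orm
SessionLocal = sessionmaker(bind=engine)

def get_author(db: Session, author_id: int):
    # "db" is shorter and more meaningful than "session"
    return db.query(Author).filter_by(id=author_id).first()
```

Editors can then autocomplete on the `db: Session` annotation while the runtime object is still produced by `SessionLocal`.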
Review and refactor the "gsc_profile.py" script
Feature description
With the creation of the new `helpers.py` module as a result of #30, the script `gsc_profile.py` should be refactored to use the functions in that module instead of defining new ones within the script.
Suggested solution
Remove the function definitions from `gsc_profile.py` and use the ones imported from `helpers.py`.
Author Stats
Feature suggestion
This issue intends to make common author stats available to the front-end clients.
Stats potentially include:
- h-index
- i10-index
- total papers
- total citations
- papers this year
- citations this year
- collaborations: institutional
- collaborations: national
- collaborations: international
- number of books & book chapters
- conference papers percentage (or conference paper to journal paper ratio)
Additionally, the rank of the author for different metrics within the institution can also be shown as separate metrics:
- rank of h-index
- rank of i10-index
- rank of total papers
- rank of total citations
These ranks can be accompanied by the histogram data of that metric within the institution, to better show the position of the author among his/her colleagues.
Suggested solution
- A new API endpoint should be made available, e.g. `/a/authorID/stats`.
- Some stats are readily available, such as h-index & total citations, while others need to be calculated, like international collaborations.
- All stats should have a date attached to them. For some, this is the date the metric was obtained from a 3rd-party source, such as h-index & i10-index; for others, like total citations, it must equal the retrieval date of the paper with the oldest `retrieval_time` attribute.
- For each institution (or more precisely in this case, for the `home_institution`), the histogram data of all metrics should be calculated once, perhaps asynchronously. The results can then be attached to the rank metrics of the requested author.
- The returned object should be formatted roughly like this:
```json
{
  "hIndex": {
    "value": 1,
    "retrievalTime": "timestamp or datetime object",
    "institutionRank": "5/120 tied 2",
    "histogram": "data"
  },
  "papers": {
    "value": 50,
    "retrievalTime": "timestamp or datetime object",
    "institutionRank": "7/120",
    "histogram": "data"
  },
  "papersThisYear": {
    "value": 10,
    "retrievalTime": "timestamp or datetime object",
    "institutionRank": "12/120 tied 1",
    "histogram": []
  }
}
```
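As a sketch of the rank computation suggested by the `"5/120 tied 2"` strings in the example object (rank, institution size, and tie count; the function name is hypothetical):

```python
# Hedged sketch: compute an author's rank string for one metric, given the
# values of that metric across the whole institution.
def institution_rank(author_value: int, all_values: list) -> str:
    ordered = sorted(all_values, reverse=True)
    rank = ordered.index(author_value) + 1  # 1-based; best value ranks first
    ties = ordered.count(author_value)
    suffix = f" tied {ties}" if ties > 1 else ""
    return f"{rank}/{len(ordered)}{suffix}"
```

Since `all_values` is the same for every author of an institution, it is a natural candidate for the once-per-institution (possibly asynchronous) pre-computation mentioned above.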
Review and refactor the "faculties_tehran.py" script
Feature description
It has been some time since the creation of the `faculties_tehran.py` script (formerly `Tehran.py`). The script should be reviewed and refactored so that it uses the recent changes in the repo's programming style (like the extensive use of `config.json`).
Additionally, some helper functions that are useful across multiple scripts may be extracted from `faculties_tehran.py` into their own `helpers.py` module.
Suggested solution
Here are some tips:
- Use more meaningful names for variables.
- Use constants where possible.
- Import constants from `config.json`.
- Extract CSV I/O functions to a new sibling module, called `helpers.py`.
- Simplify the script as much as possible (e.g. by removing the part for "education history" of the faculty members).
Good Luck!
Change API endpoint "qs" to something more descriptive
Feature description
The API endpoint `/a/authorID/qs`, which returns the count of published papers in journals of different percentiles, is vague. It should be replaced with a more descriptive path segment.
Suggested solution
- The `qs` segment should be renamed to `jmetrics` (as in "journal metrics").
- The `a/authorID/papers` endpoint should accept the parameter `metric` instead of `q`.
Review and refactor the "faculties_modares.py" script
Feature description
The script `faculties_modares.py` (formerly `Modares.py`) was created long ago and was never tested. Now that #30 is in progress, it is best to refactor this script as well, so that it uses the recent changes in the repo's programming style (like the extensive use of `config.json`).
Suggested solution
Here are some tips:
- Use more meaningful names for variables.
- Use constants where possible.
- Import constants from `config.json`.
- Use helper functions from `helpers.py`. Add to them if necessary.
- Simplify the script as much as possible (e.g. by removing the part for "education history" of the faculty members).
Refactor the code in "helpers"
Feature description
Regarding the text in `elsametric.md`, this issue addresses the refactoring of the functions in the `helpers` directory.
Suggested solution
- Add type annotations to functions (if any remain) and important variables (such as SQLAlchemy's query results).
- Deal with any remaining TODOs.
- Use fewer `if` statements and more `try/except` blocks (EAFP).
- Simplify the functions as much as possible.
- For special fields that can belong to only MySQL or PostgreSQL (but not both), decide on a way to proceed: different code for different backends, or a more general type for these columns that both MySQL and Postgres support (for example, Postgres supports a `bool` type but MySQL does not, so both can use an `int` type that accepts only `1` or `0`). To achieve better results, wait for #50 to be closed first.
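To illustrate the `if` vs. `try/except` point, here is a hedged before/after sketch (the dict-based `data` argument is a hypothetical stand-in for the structures processed in `helpers`):

```python
# LBYL style ("look before you leap"), guarded with an `if`:
def get_title_lbyl(data: dict) -> str:
    if "title" in data and data["title"]:
        return data["title"]
    return "Untitled"

# EAFP style ("easier to ask forgiveness than permission"),
# preferred in the refactor:
def get_title_eafp(data: dict) -> str:
    try:
        return data["title"] or "Untitled"
    except KeyError:
        return "Untitled"
```

EAFP keeps the happy path on top and handles the rare missing-key case in one place, which tends to read better for deeply nested Scopus payloads.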
Make "elsametric" compatible with PostgreSQL
Feature description
Regarding the text in `elsametric.md`, this issue addresses how to make `elsametric` work with Postgres.
Suggested solution
- Run the code, as is, with a Postgres database (without creating one first, so as to test the functionality of `sqlalchemy-utils` along the way).
- Quickly solve the problems, one by one, to see which column types are problematic.
- Decide whether to use different column types for different DBMSs or a more general type that can cover both. This decision will affect the functions using the created database, including the repo `elsaserver`.
- Apply the decision to all modified classes.
- Decide if the `config.json` file needs restructuring (to store Postgres configurations).
- Update `__init__.py` and `base.py` if needed.
Create schema if it does not exist
Feature description
Currently, if the schema named in `config.json` does not exist, scripts such as `db_populate.py` will throw an exception complaining about it. Devise a way of creating the schema if one isn't found.
Suggested solution
Use the Python package `sqlalchemy-utils` to check for the existence of the schema and create it if one is not present.
Make "token_generator" function configurable
Feature description
The function `token_generator` has a hard-coded value within it:

```python
def token_generator(nbytes=8):
    return secrets.token_urlsafe(nbytes)
```

This value should come from `config.json`.
Suggested solution
- Add a new key, `token_bytes`, to the `database` section of the `config.json` file.
- Create a new constant, `TOKEN_BYTES`, that gets its value from `config.json` (inside `__init__.py`).
- Import `TOKEN_BYTES` from `__init__.py` into `base.py` and use it in `token_generator`.
- Calculate the required length of a `VARCHAR` column to store the token (inside `base.py`) and store it in a constant, `VARCHAR_COLUMN_LENGTH`.
- Import `VARCHAR_COLUMN_LENGTH` and use it in any module that uses `token_generator`.
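The column-length calculation can be sketched as follows (treating `TOKEN_BYTES` as if it had already been read from `config.json`): URL-safe base64 encodes every 3 bytes into 4 characters and `token_urlsafe` strips the padding, so the required length follows directly.

```python
import math
import secrets

TOKEN_BYTES = 8  # would be read from the `database` section of config.json

def token_generator(nbytes: int = TOKEN_BYTES) -> str:
    return secrets.token_urlsafe(nbytes)

# base64 turns every 3 bytes into 4 characters; token_urlsafe strips the
# padding, so the column needs exactly ceil(4 * nbytes / 3) characters.
VARCHAR_COLUMN_LENGTH = math.ceil(TOKEN_BYTES * 4 / 3)  # 11 for 8 bytes
```

Deriving the length rather than hard-coding it keeps the column in sync if `token_bytes` is ever changed in the configuration.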
Review the "ext_faculty_process" function for updates
Bug description
Working on #30, #31, and #32 caused some modifications to the way faculty data is handled in Airtable. For example, some columns now have different names. This can affect the `ext_faculty_process` function in the `helpers.py` of the `elsametric` package.
Suggested solution
Review the function `ext_faculty_process` and the script `db_populate.py` to see if any updates are necessary.
Add google scholar IDs to author profiles
Feature description
Having an author's Google Scholar ID provides a way of accessing his/her metrics, such as h-index & i10-index. These can be used in the front-end's dashboard, along with a link to the Google Scholar profile itself, which can serve as an additional contact point.
Suggested solution
Create a script that reads faculty names from a `.csv` file and searches for them in Google Scholar's database. A good way would be to use the package `scholarly`. If an author is found, use his/her Google Scholar ID (`id_gsc`) to obtain the h-index & i10-index (probably using `BeautifulSoup4`).
Export the results to a `.csv` file.
Add columns for "preferred" first and last names to the "Author" class of SQLAlchemy model
Feature description
The `Author` class of the SQLAlchemy model should have support for "preferred first name" and "preferred last name". These columns should be used in front-end clients for two reasons:
- When populating the database, the algorithm may fill the `first` column with an "initial", which is not suitable.
- In the future, users may be able to suggest edits to an author's info, in this case first and last names. If that happens, the "preferred" first and last names should be updated and displayed.
Suggested solution
- Add two columns to the SQLAlchemy `Author` model, such as `first_pref` & `last_pref`.
- Configure the `Author` model to use the values of the `first` and `last` columns as initial values.
- Values from "validated sources" can update the `first_pref` and `last_pref` columns. These sources include a curated list of authors supplied from a third-party source (`faculties.csv`) and the author him/herself (using the front-end client).
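The fallback behaviour described above might look roughly like this (a plain-Python sketch; the real logic would live in the SQLAlchemy model or the API layer, and the function name is hypothetical):

```python
# Preferred values win when a validated source has supplied them;
# otherwise fall back to the original first/last columns.
def display_name(first: str, last: str,
                 first_pref: str = None, last_pref: str = None) -> str:
    return f"{first_pref or first} {last_pref or last}"
```

This way an initial such as `"J."` is shown only until a validated source provides the full preferred name.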
Use environment variables instead of config.json
Feature description
It would appear that using "environment variables" is the standard way of injecting runtime variables into an application. Currently, `elsametric` uses a `config.json` file with a nested structure, which makes it impossible for services such as Heroku to use the configurations, short of committing the `config.json` itself, which is not recommended at all.
Using environment variables has, unfortunately, an immediate downside: one has to use a flat structure. To compensate, section names can be chained to the variable names using underscores, e.g. `DB_STARTUP_MYSQL_DRIVER`, which can be represented in JSON format as:
```json
{
  "db": {
    "startup": {
      "mysql": {
        "driver": "mysqlconnector"
      }
    }
  }
}
```
Suggested solution
Use a package to parse environment variables, either injected from the command line or read from a `.env` file. One example is `dynaconf`; another is `environs`.
Change the code in `__init__.py` and elsewhere to employ the newly installed package.
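A sketch of how the underscore-chained names could be folded back into the nested shape shown above (the `nested_config` helper is hypothetical; packages like `dynaconf` provide similar behaviour out of the box):

```python
def nested_config(flat: dict) -> dict:
    """Rebuild a nested dict from flat, underscore-chained variable names."""
    config: dict = {}
    for key, value in flat.items():
        node = config
        *parents, leaf = key.lower().split("_")
        for part in parents:
            # Descend, creating intermediate sections as needed
            node = node.setdefault(part, {})
        node[leaf] = value
    return config

# Hypothetical flat variables, as they might appear in a .env file
flat_env = {"DB_STARTUP_MYSQL_DRIVER": "mysqlconnector"}
```

Note the implied constraint: section names themselves cannot contain underscores, or the chaining becomes ambiguous.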
Close branch flask
Feature description
The development of the Flask API stopped some time ago, in favor of FastAPI. It is best to remove the `flask` branch to reduce the complexity of the project.
Suggested solution
- Remove the `flask` branch.
Refactor api_queries.py
Feature description
The `api_queries.py` file has become too large and too complex. It should be refactored, and possibly split into smaller pieces.
Expand the "papers" endpoint to accept more kinds of journal metrics
Feature description
Currently, the `/a/authorID/papers` endpoint can accept `q1`, `q2`, `q3`, `q4` as values for the `metric` parameter. It should also accept percentile values (such as `p79`) and `undefined` (for papers published in non-ranked sources).
Suggested solution
The function `get_author_papers_metric` should be changed to accept these new parameters. Care must be taken with the data validations.
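The validation could be sketched with a regular expression covering all three kinds of values (the function name is hypothetical; in FastAPI this would more likely live in a query-parameter validator):

```python
import re

# Accepts q1-q4, percentile values such as p79 (p0-p99),
# and the literal "undefined" for non-ranked sources
METRIC_PATTERN = re.compile(r"q[1-4]|p[1-9]?[0-9]|undefined")

def is_valid_metric(metric: str) -> bool:
    # fullmatch rejects partial hits such as "q5x" or "p100"
    return METRIC_PATTERN.fullmatch(metric) is not None
```

Validating up front keeps malformed values out of the database query entirely.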
Use string values in the get_keywords method of the Author class of the SQLAlchemy model
Feature description
Currently, the `Author` class of the SQLAlchemy model has a `get_keywords()` method which returns a dictionary of keywords used by the author, along with the count of each keyword. The dictionary is of the following format:

```python
{keyword1: count1, keyword2: count2}
```

The problem is that the keys in that dictionary are of the type `keyword_.Keyword` instead of simple `str`. Since the `Keyword` class of the SQLAlchemy model does not hold much information (only the string of the actual keyword, in fact), it is safe to use string values as the keys in the above dictionary.
Suggested solution
- Change the code in the `Author` model to use `str` values for the keys in the `get_keywords()` method.
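A minimal sketch of the change, using plain-Python stand-ins for the SQLAlchemy classes (attribute names such as `keyword` are assumptions):

```python
from collections import Counter

class Keyword:
    """Stand-in for the SQLAlchemy Keyword model; it only wraps a string."""
    def __init__(self, keyword: str):
        self.keyword = keyword

class Author:
    """Stand-in for the SQLAlchemy Author model."""
    def __init__(self, keywords: list):
        self._keywords = keywords

    def get_keywords(self) -> dict:
        # Use the plain string of each Keyword object as the dict key
        return dict(Counter(k.keyword for k in self._keywords))
```

With `str` keys, the result can be serialized to JSON for the API without any extra conversion step.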