The full documentation of our application can be found at https://cutt.ly/Mn78T3f
- Make sure you have Docker installed and running on your machine.
- Run the 'docker-compose up' command in the root directory of the project.
- Access the application through your web browser by going to 'http://localhost:8000/'.
The most recent version of the application can be accessed at https://aip.irqize.dev/
Use 'docker-compose up --build' to rebuild your project. This is useful if you want to update the application after pulling it from git.
If you want to update the database, run 'docker volume ls' to find the name of the volume containing the database, then delete it using the 'docker volume rm <volume_name>' command.
Article Information Parser (AIP) is an instrument to parse, unify, and in some cases correct article metadata. AIP creates a PostgreSQL database that makes it easy to find related work.
Developing such a database is tricky. An excerpt from our article introducing this instrument explains why:
Current information sources do not cover the spectrum of the systems community entirely.
For example, DBLP -- which specifically focuses on computer science articles -- lacks certain venues and does not record article abstracts.
Other datasets such as Semantic Scholar and AMiner have similar and other limitations.
Moreover, these datasets overlap, yet each contains important information the others do not offer; none of them is a superset of the others.
Our approach is to parse each dataset and filter and unify the information provided.
This instrument combines three data sources: DBLP, Semantic Scholar, and AMiner, which we filter and store in a PostgreSQL database. DBLP is a well-known European archive that focuses on computer science and features all the top-level venues (journals and conferences). Semantic Scholar is an American project created by the Allen Institute for AI. The project aims to analyze and extract important data from scientific publications. AMiner is an Asian project that aims to provide a knowledge graph for mining academic social networks. Both AMiner and Semantic Scholar now incorporate Microsoft's Academic Graph (MAG) in their datasets.
AIP tackles several non-trivial challenges in unifying these datasets:
- Data discrepancies between sources. For example, titles in DBLP end with a dot, whereas they do not in the Semantic Scholar and AMiner corpora, causing exact matching to fail.
- Titles and abstracts may contain encoded characters, so records that describe the same article fail to match.
- Although every data source specifies a format, we encountered several instances where the data does not adhere to that format or is malformed.
- Venue strings differ among these sources. Some sources use an abbreviation, some use a BibTeX string, etc. AIP maps all these occurrences to the same abbreviation.
- Complementing existing entries. For example, DBLP does not offer abstracts whilst Semantic Scholar and AMiner do.
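
The challenges above can be illustrated with a minimal Python sketch. It is not AIP's actual implementation: the function names, the `VENUE_ALIASES` table, the record field names (`title`, `abstract`), and the assumption that a source is read as one JSON object per line are all hypothetical and chosen only for illustration.

```python
import html
import json
import unicodedata

# Hypothetical alias table (assumption): lower-cased venue strings -> abbreviation.
VENUE_ALIASES = {
    "nsdi": "NSDI",
    "usenix symposium on networked systems design and implementation": "NSDI",
}


def normalize_title(title: str) -> str:
    """Build a comparison key so the same title matches across sources."""
    text = html.unescape(title)                 # decode entities such as "&amp;"
    text = unicodedata.normalize("NFKC", text)  # unify Unicode representations
    text = " ".join(text.split())               # collapse runs of whitespace
    text = text.rstrip(". ")                    # drop a trailing dot (as in DBLP titles)
    return text.casefold()                      # case-insensitive comparison


def canonical_venue(raw: str) -> str:
    """Map a venue string to its abbreviation, or return it stripped if unknown."""
    key = " ".join(raw.split()).casefold()
    return VENUE_ALIASES.get(key, raw.strip())


def parse_json_lines(lines):
    """Parse one JSON object per line, skipping malformed or title-less lines."""
    records, skipped = [], 0
    for line in lines:
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            skipped += 1
            continue
        if isinstance(record, dict) and record.get("title"):
            records.append(record)
        else:
            skipped += 1
    return records, skipped


def merge_records(primary: dict, secondary: dict) -> dict:
    """Fill fields that are empty or missing in `primary` from `secondary`."""
    merged = dict(primary)
    for field, value in secondary.items():
        if not merged.get(field) and value:
            merged[field] = value
    return merged
```

In this sketch, two records would be treated as the same article when `normalize_title` returns the same key for both, after which `merge_records` could, for example, copy an abstract from a Semantic Scholar record into a DBLP record that lacks one. Exact key matching is a deliberate simplification; a real pipeline may need additional signals such as year or authors.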