medialab / bibliotools3.0

A modification of bibliotools 2.2 by Sébastian Grauwin

License: Apache License 2.0

Python 99.80% Shell 0.20%

bibliotools3.0's Issues

User-interface

Develop a minimal user-interface for the script, notably to:

merge a series of CSV files exported from ISI into one corpus
define a series of time periods (with feedback on the number of bibliographic records in each)
define the thresholds for the different types of nodes (with feedback on the number of nodes obtained)
define the blacklist of items not to be included in the graph (e.g. the keywords present in the original query; see the blacklist issue)
extract the graphs
set the spatialisation, colour and ranking parameters to preview the network (see the issue on scripting the graph visualisation)
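
The feedback on time periods could be sketched as follows. This is only an illustration, not part of the existing script: the function name `count_records_per_period` is hypothetical, and it assumes the publication years have already been extracted from the merged corpus as integers.

```python
from collections import Counter


def count_records_per_period(years, periods):
    """Count how many records fall in each inclusive (start, end) period.

    `years` is an iterable of publication years (assumed already parsed).
    `periods` is a list of (start, end) pairs, assumed not to overlap.
    """
    periods = [tuple(p) for p in periods]
    counts = Counter()
    for year in years:
        for start, end in periods:
            if start <= year <= end:
                counts[(start, end)] += 1
                break  # a record belongs to at most one period
    # Report every period, including the empty ones.
    return {p: counts.get(p, 0) for p in periods}


if __name__ == "__main__":
    years = [1990, 1995, 2001, 2010, 2010]
    periods = [(1990, 1999), (2000, 2009), (2010, 2019)]
    for (start, end), n in count_records_per_period(years, periods).items():
        print(f"{start}-{end}: {n} records")
```

Showing these counts before extraction would let the user adjust the period boundaries until each period holds a workable number of records.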

Parse data from Scopus

Currently the script works on ISI Web of Science data.
It would be interesting to develop a parser for Scopus (http://www.scopus.com) as well.

Writing all the thresholds in the report

In the config of the script it is possible to define two different thresholds for each type of node:

occurrence count
weight
In the report, however, only the threshold of the 'weight' is mentioned. Both should be indicated.
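
A minimal sketch of what reporting both thresholds could look like. The function name and the shape of the `thresholds` dictionary are assumptions made for illustration; the actual config format of the script may differ.

```python
def threshold_report_lines(thresholds):
    """Build one report line per node type, listing both thresholds.

    `thresholds` maps a node type to a dict with the keys "occ"
    (occurrence count) and "weight"; this layout is hypothetical.
    """
    lines = []
    for node_type in sorted(thresholds):
        t = thresholds[node_type]
        lines.append(
            f"{node_type}: occurrence count >= {t['occ']}, weight >= {t['weight']}"
        )
    return lines


if __name__ == "__main__":
    example = {
        "keywords": {"occ": 5, "weight": 2},
        "authors": {"occ": 3, "weight": 1},
    }
    print("\n".join(threshold_report_lines(example)))
```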

Distinguishing different types of keywords

The script does not distinguish between:

the Title Keywords (TK - the words present in the title)
the Author Keywords (IK - the keywords assigned by the author)

In particular:

it is not possible to assign them different thresholds
in the GraphML, they are in the same partition, "keyword".

Scripting the graph visualisation

It would save a lot of time if the default visualisation of the networks were automated (scripted with the Gephi Toolkit, Sigma or another tool).

In particular, these are the operations to script:

SPATIALISATION
LinLog Mode (yes)
Scaling (0.02)
Prevent Overlap (no)
gravity (1)

COLORS (as named in Gephi)
reference: light grey
institutions: olivedrab
authors: gold
keywords: hot pink
subjects: dark orange
countries: powder blue

RANKING (according to the occurrence count)
5-150
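
As a starting point, the colour and ranking settings above could be captured in plain Python before being passed to whichever toolkit is chosen. This sketch rests on assumptions: the hex values are the standard CSS/X11 equivalents of the colour names, which are presumed to match Gephi's; the range 5-150 is read as a node size range; and the linear interpolation is a simple stand-in for whatever ranking the toolkit applies.

```python
# Hex codes are the CSS/X11 values for the named colours (assumed to
# correspond to the names shown in Gephi).
NODE_COLORS = {
    "reference": "#D3D3D3",     # light grey
    "institutions": "#6B8E23",  # olivedrab
    "authors": "#FFD700",       # gold
    "keywords": "#FF69B4",      # hot pink
    "subjects": "#FF8C00",      # dark orange
    "countries": "#B0E0E6",     # powder blue
}

MIN_SIZE, MAX_SIZE = 5, 150  # ranking range, interpreted here as node size


def node_size(occ, occ_min, occ_max):
    """Map an occurrence count linearly onto the [MIN_SIZE, MAX_SIZE] range."""
    if occ_max == occ_min:
        # All nodes have the same count: fall back to the smallest size.
        return float(MIN_SIZE)
    ratio = (occ - occ_min) / (occ_max - occ_min)
    return MIN_SIZE + ratio * (MAX_SIZE - MIN_SIZE)


if __name__ == "__main__":
    print(NODE_COLORS["keywords"], node_size(6, 1, 11))
```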

Blacklists of entities

Blacklist

Users often want to exclude some entities from all the extracted graphs. A typical case is the exclusion of the keywords that were present in the original query and are therefore, by construction, connected to almost all the items in the graph.
It would save users a lot of time to be able to define these entities once and for all in a blacklist applied to all the extracted graphs.

Whitelist

Researchers or experts may want to include in the maps some items that would be excluded by the selected thresholds. It would therefore be useful to be able to 'impose' some items in the networks.
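
The combined effect of a blacklist, a whitelist and an occurrence threshold could be sketched like this. The function name `keep_node` is hypothetical, and giving the blacklist precedence over the whitelist is an assumption, not something specified in the issue.

```python
def keep_node(label, occ, min_occ, blacklist=frozenset(), whitelist=frozenset()):
    """Decide whether a node should appear in the extracted graph.

    Assumed precedence: the blacklist always excludes, the whitelist
    bypasses the threshold, and every other node must reach `min_occ`.
    """
    if label in blacklist:
        return False
    if label in whitelist:
        return True
    return occ >= min_occ


if __name__ == "__main__":
    blacklist = {"climate change"}   # e.g. a keyword from the original query
    whitelist = {"carbon capture"}   # a rare item an expert wants to see
    nodes = {"climate change": 900, "carbon capture": 2, "policy": 40, "misc": 1}
    kept = [n for n, occ in nodes.items() if keep_node(n, occ, 10, blacklist, whitelist)]
    print(kept)
```

Defining both lists once and passing them to every extraction would apply the same exclusions and inclusions to all the graphs.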

Community detection and analysis

Connect the version of the script optimised by Paul with Sébastian's algorithm to detect and analyse the communities (see http://sebastian-grauwin.com/BIBLIOMAP/ and click on the nodes).

NB. this would be super helpful for the CO2 project

Adding the journals

It would be nice to have the titles of the journals appearing in the corpus as a new type of node (apparently the parser is capable of parsing them, but they are not exploited).

Problem with the connection of the metadata in the latest version of the script

While analysing them with the PhD students from Leuven, we found something odd in the maps you sent me last Thursday.
The references are laid out well, but there is a problem with the metadata.
On quite a few of the maps (notably those for the latest time slices), almost all the metadata nodes end up in the centre (because they are uniformly linked to all parts of the reference graph).

At first I thought it was a problem with the thresholds, but then we ran the extraction of the same maps (with the same thresholds) on Kari's computer and the problem does not appear (on the contrary, the metadata nodes are laid out where one would expect to find them).

Since the problem seems to affect only the latest time slices, I wonder whether it might be caused by the parallelisation.

Threshold by average and quartile

Extracting comparable networks (in the sense of having roughly the same number of nodes) from time spans containing very different numbers of bibliographic records requires setting different filtering thresholds: lower for time spans containing fewer nodes, higher for time spans containing more nodes.
A more systematic way of doing this may be to use the average and the quartiles.
Instead of filtering out all the nodes with an occurrence count ("occ") lower than N or the edges with a weight ("weight") lower than N, we could filter out all the nodes and edges with an occurrence count or weight lower than the average (or the 1st quartile or the 3rd quartile).
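
A minimal sketch of this idea, using only the standard library. The function names and the `mode` values are hypothetical; `statistics.quantiles` is called with its default settings, and other quartile conventions would give slightly different cut points.

```python
import statistics


def occurrence_threshold(values, mode="mean"):
    """Return a data-driven threshold: the mean, 1st quartile or 3rd quartile."""
    values = list(values)
    if mode == "mean":
        return statistics.mean(values)
    # quantiles(..., n=4) returns the three quartile cut points.
    q1, _median, q3 = statistics.quantiles(values, n=4)
    if mode == "q1":
        return q1
    if mode == "q3":
        return q3
    raise ValueError(f"unknown mode: {mode!r}")


def filter_by_threshold(value_by_item, mode="mean"):
    """Drop the nodes (or edges) whose count or weight is below the threshold."""
    threshold = occurrence_threshold(value_by_item.values(), mode)
    return {k: v for k, v in value_by_item.items() if v >= threshold}


if __name__ == "__main__":
    occ = {"a": 1, "b": 2, "c": 3, "d": 10}
    print(filter_by_threshold(occ, "mean"))
```

Because the threshold is recomputed for each time span, sparse periods get a lower cut-off and dense periods a higher one without manual tuning.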
