
puma's People

Contributors

beccawilson, hgarner, katrinleinweber, ollybutters, twyburton

puma's Issues

pubmed moving to https only

Dear Colleagues,

As we announced on June 10, 2016 (https://www.ncbi.nlm.nih.gov/news/06-10-2016-ncbi-https/), NCBI will be transitioning to using HTTPS-only protocols on September 30, 2016. This change may affect any software that uses NCBI APIs such as the E-utilities or NCBI software toolkits such as the C/C++ and SRA toolkits.

Based on a review of incoming API requests, we have identified your email address as being associated with API calls using the HTTP protocol. We are writing to ensure that you are aware of these upcoming changes so that you can make any necessary updates to your software. If you are not making such calls, please disregard this message and accept our apologies.

Recently we published a page that describes these changes and offers suggested actions you can take to mitigate any problems with your software that may arise. It also lists several test servers that you can use to check the performance of your current API calls.

http://www.ncbi.nlm.nih.gov/home/develop/https-guidance.shtml

This page is also linked from the NCBI Develop page (a direct link from the NCBI home page) and the NCBI API page:

http://www.ncbi.nlm.nih.gov/home/develop/
http://www.ncbi.nlm.nih.gov/home/develop/api.shtml

Please review these documents and plan to take action soon. If you have questions, please write to [email protected].

Deal with deltas

There will be some things that we can't get from PubMed etc. We need to be able to deal with that. One option is to grab all the data from Zotero and then only use it if all the other sources of info (e.g. PubMed et al.) fail. We would need a way of highlighting when a vital bit of info is missing - perhaps a web page listing e.g.

DOI: 1234
MISSING: first author

Then the end user could make sure a first author is present in Zotero.
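
A rough sketch of the fallback merge idea (the field names and record shapes here are made up - the real merge code will differ):

    # Sketch only: assumes each source is a dict keyed by field name.
    REQUIRED_FIELDS = ['first_author', 'title', 'year']

    def merge_with_fallback(pubmed, doi, zotero):
        """Take each field from PubMed first, then DOI, then Zotero; report anything still missing."""
        merged = {}
        missing = []
        for field in REQUIRED_FIELDS:
            for source in (pubmed, doi, zotero):
                if source.get(field):
                    merged[field] = source[field]
                    break
            else:
                missing.append(field)
        return merged, missing

    # merged, missing = merge_with_fallback(pubmed_record, doi_record, zotero_record)
    # if missing:
    #     print('MISSING: ' + ', '.join(missing))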

Get citations from pubmed?

Scopus gives us one number for how many citations there have been, but that is only a reflection of what is in their database - different companies will give different values. Can we get the number of citations from PubMed? Looking at a publication:

https://www.ncbi.nlm.nih.gov/pubmed/18276894

It has a number of citations in the right-hand column. Can we get that? The original 'get.py' file does some simple PubMed API stuff with the Biopython module - if we can get the citations it will be with something like that. See if that is possible.

We could end up with 2 or 3 different numbers for how many citations each paper has.
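
If it is possible at all through the E-utilities, it will probably look something like this elink call in Biopython (untested - the 'pubmed_pubmed_citedin' link name is my best guess at the right one):

    from Bio import Entrez

    Entrez.email = 'someone@example.org'  # placeholder - NCBI ask for a real contact address

    def pubmed_citation_count(pmid):
        """Count the PubMed records that elink reports as citing this PMID."""
        handle = Entrez.elink(dbfrom='pubmed', db='pubmed',
                              LinkName='pubmed_pubmed_citedin', id=str(pmid))
        record = Entrez.read(handle)
        handle.close()
        linksets = record[0].get('LinkSetDb', [])
        return len(linksets[0]['Link']) if linksets else 0

    # print(pubmed_citation_count(18276894))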

Tidy HTML and validate

The output HTML looks like it is one long string - I find sticking some line breaks in at the end of each line makes it easier to debug the HTML from the browser. Maybe do that?

The output HTML is not valid - that needs to be fixed.
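
One option for the line breaks, assuming we are happy to add BeautifulSoup as a dependency (it is not in the pipeline at the moment as far as I know):

    from bs4 import BeautifulSoup

    def prettify_html(html_string):
        """Re-indent one long HTML string so it is readable in view-source."""
        return BeautifulSoup(html_string, 'html.parser').prettify()

    # html_out = prettify_html(html_out)  # just before writing the page to disk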

Keywords usage over time statistics

I think that this can be achieved fairly easily using the current metadata, and it can be displayed using a bar chart of years on the relevant keyword page.
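
Something like this should do it (assuming the merged papers end up as a list of dicts with 'keywords' and 'year' fields - the real field names may differ):

    from collections import Counter
    import matplotlib.pyplot as plt

    def plot_keyword_years(papers, keyword, outfile):
        """Bar chart of how many papers carry a keyword in each year."""
        counts = Counter(p['year'] for p in papers
                         if keyword in p.get('keywords', []) and p.get('year'))
        years = sorted(counts)
        plt.bar(years, [counts[y] for y in years])
        plt.xlabel('Year')
        plt.ylabel('Papers with keyword "%s"' % keyword)
        plt.savefig(outfile)
        plt.close()

    # plot_keyword_years(papers, 'genetics', 'html/keywords/genetics_years.png')  # example keyword/path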

Missing first author institute

Seems to be some discrepancy with the institute between the raw PMID, DOI and merged files, e.g.:

PMID:
20372150

"AuthorList": [
{
"LastName": "Freathy",
"Initials": "RM",
"Identifier": [],
"AffiliationInfo": [
{
"Affiliation": "Genetics of Complex Traits, Peninsula College of Medicine and Dentistry, University of Exeter, Exeter, UK.",
"Identifier": []
}
],
"ForeName": "Rachel M"
},

DOI:
3GQ5C9FC

"creators": [
{
"lastName": "Freathy",
"creatorType": "author",
"firstName": "Rachel M."
},

But the merged file:
00e81dc2fda509749bb4b56a351ff282

"author": [
{
"affiliation": [
{
"identifier": "",
"name": ""
}
],
"given": "Rachel M",
"family": "Freathy"
},

Volume number missing

The volume number is missing from the merged files.

For example:
RAW -
"Journal": { "ISSN": "2168-6238", "ISOAbbreviation": "JAMA Psychiatry", "JournalIssue": { "Volume": "71", "Issue": "10", "PubDate": { "Month": "Oct", "Year": "2014" } }
MERGED -
"Journal": { "ISOAbbreviation": "JAMA Psychiatry", "JournalIssue": { "Issue": "10", "PubDate": { "Year": "2014", "Month": "Oct" } }

Use wikidata to look up university coordinates
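
Rough sketch of how the lookup could work against the Wikidata SPARQL endpoint (matching on the English label is fragile, so this is a starting point rather than a finished thing):

    import requests

    WIKIDATA_SPARQL = 'https://query.wikidata.org/sparql'

    def wikidata_coordinates(institution_name):
        """Return the Wikidata coordinate location (property P625) for an institution, or None."""
        query = '''
            SELECT ?coord WHERE {
              ?item rdfs:label "%s"@en ;
                    wdt:P625 ?coord .
            } LIMIT 1
        ''' % institution_name
        r = requests.get(WIKIDATA_SPARQL,
                         params={'query': query, 'format': 'json'},
                         headers={'User-Agent': 'papers pipeline (placeholder contact)'})
        r.raise_for_status()
        bindings = r.json()['results']['bindings']
        return bindings[0]['coord']['value'] if bindings else None

    # wikidata_coordinates('University of Exeter') -> a WKT string like 'Point(<lon> <lat>)'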

Missing affiliations

I've been chasing some missing affiliations that I think should be there. The extract from the log below is one that is flagged as a missing institute. As far as I can tell it should be there though - it is in the raw PubMed XML file, but then it gets lost in getPubmed.py. There are ~50 missing institute errors and I suspect some are from this.

ob13747@IT017411:~/git/papers/logs$ grep 26257770 papers.log
INFO:root:Working on 26257770
INFO:root:Downloading 26257770
WARNING:root:Unable to read pmid 26257770

Build HTML report of missing required fields

At the end of the 'get/merge' phase it would be good to double-check that all the required fields are there. Spitting out any issues into an HTML page to highlight them seems like a good idea. Will need to decide what counts as a required field first though! Could be somewhat fuzzy with this too, e.g.:

  • Missing first author -> highlight red and say it needs to be fixed.
  • Missing keywords, titles, geolocation etc. -> highlight yellow and say it ought to be fixed.

Since the merging order is something like PubMed, DOI, elsewhere, Zotero, we can then add the missing fields to Zotero and that will fix the problem.
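
Rough sketch of the report builder (the required/desired field lists here are placeholders - they still need to be agreed):

    REQUIRED_FIELDS = ['first_author', 'title']                 # missing -> red
    DESIRED_FIELDS = ['keywords', 'journal', 'geolocation']     # missing -> yellow

    def build_missing_report(papers, outfile):
        """Write a simple HTML table listing papers with missing fields, coloured by severity."""
        rows = []
        for paper in papers:
            label = paper.get('doi') or paper.get('pmid', '?')
            for field in REQUIRED_FIELDS:
                if not paper.get(field):
                    rows.append((label, field, 'red'))
            for field in DESIRED_FIELDS:
                if not paper.get(field):
                    rows.append((label, field, 'yellow'))
        with open(outfile, 'w') as out:
            out.write('<!DOCTYPE html>\n<html><body><table>\n')
            for label, field, colour in rows:
                out.write('<tr style="background:%s"><td>%s</td><td>MISSING: %s</td></tr>\n'
                          % (colour, label, field))
            out.write('</table></body></html>\n')

    # build_missing_report(papers, 'html/missing_fields.html')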

Date ranges from pubmed

Some PubMed results return a date range for a paper, because the authors feel the need to say they did the work over a specific date range. I've had a go at dealing with this, but it is rudimentary and should probably check that what is returned makes sense - e.g. an int in the range 1980-2100 or something?
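
The sanity check could be as simple as this (sketch only - the 1980-2100 range is just the one suggested above):

    import re

    def parse_pub_year(pubdate):
        """Pull one plausible year out of a PubMed PubDate that may be a range,
        e.g. '2014 Oct' or '2013-2014 Winter'. Returns None if nothing sensible is found."""
        match = re.search(r'\b(\d{4})\b', str(pubdate))
        if not match:
            return None
        year = int(match.group(1))
        return year if 1980 <= year <= 2100 else None

    # parse_pub_year('2013-2014 Winter') -> 2013
    # parse_pub_year('garbage') -> None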

Make paths and imports consistent in html2

Try to import all the modules at the start of the file and not inside functions - otherwise you end up importing the same thing again and again.

There are still some relative paths, e.g. line 482.

Use different metadata sources

Currently it's all about PubMed and Scopus. Need to figure out how to deal with other sources of metadata - think about the social science papers etc.

Issued dates missing

Some papers are missing an issued date.

For example: 032b01c1ed7b8c4ac1e77d1591ec1d8b

Inconsistent parsing of 'extras' field

Currently the Zotero notes field is parsed as [fieldname]: [fielddata]\n in collate.py and [fieldname]=[fielddata]\\ in clean.py. Need to:
i) make this consistent (see the sketch below)
ii) make additions to the notes upload to Zotero
iii) update incorrectly formatted data
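
For (i), a single parser that accepts both of the current formats would be a start (sketch, not tested against the real data):

    import re

    def parse_extra(raw):
        """Parse a notes blob in either current format:
        'field: data' separated by newlines (collate.py) or 'field=data' separated by backslashes (clean.py)."""
        fields = {}
        for chunk in re.split(r'[\n\\]+', raw):
            match = re.match(r'\s*([^:=]+?)\s*[:=]\s*(.+)', chunk)
            if match:
                fields[match.group(1)] = match.group(2).strip()
        return fields

    # parse_extra('doi: 10.1000/xyz\nfirst_author: Freathy')  -> {'doi': '10.1000/xyz', 'first_author': 'Freathy'}
    # parse_extra('doi=10.1000/xyz\\first_author=Freathy')    -> same thing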

Add completeness stats to analysis outputs

When we output the results from the analysis routines it would be good to have an idea of how complete the source data is. That way, if I get a result of 100% of papers coming from one journal, I can see whether that is true for all papers, or whether it is just because only 1 paper has info for the journal.
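
Could be as simple as reporting a completeness fraction next to each result, e.g. (field names assumed):

    def completeness(papers, field):
        """Fraction of papers with a non-empty value for the given field."""
        if not papers:
            return 0.0
        return sum(1 for p in papers if p.get(field)) / float(len(papers))

    # print('Journal known for %.0f%% of papers' % (100 * completeness(papers, 'journal')))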

Concatenated PMID

Merged file:
11cdfce682822ce04f875402e4f5f244

has PMID:
11397886

But there is no PMID in the raw pmids folder with that ID; there is
1139788

which has a crazy date in 1975 that is getting picked up elsewhere. So my guess is that the long PMID is getting chopped somewhere.

Mismatch in numbers of papers

@hgarner I have a bit of a discrepancy in the numbers of papers we have. In the ALSPAC all-papers Zotero library I have 1319 papers, in the cache snapshot I have 1073, and that is what comes out of the pipeline. What are you running your Zotero extraction against?

current master dies at html part

HTML - Home

Traceback (most recent call last):
  File "./source/papers.py", line 136, in
    cohort_rating, cohort_rating_data_from = html.build_htmlv2.build_home(papers, error_log)
  File "/home/ob13747/git/papers/source/html/build_htmlv2.py", line 126, in build_home
    shutil.copyfile(config.template_dir + '/style_main.css', config.html_dir + '/css/style_main.css')
  File "/usr/lib/python2.7/shutil.py", line 82, in copyfile
    with open(src, 'rb') as fsrc:
IOError: [Errno 2] No such file or directory: 'html/template/style_main.css'
ob13747@IT017411:~/git/papers$

Citations plot additions

On the number-of-citations plot it would be good if we could show the mean and median values, e.g. by colouring the relevant bar a different colour, or putting a label on the plot somehow.

It would also be good if we could show the high-value citations too - I am not sure of the best way to do this - we could do a log plot, or we could bin the citations up into e.g. bins of 10, or we could show the larger values on a different plot. Have a look at the data and see what you think is the best way to display it.
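
Rough sketch of both ideas together - mean/median as vertical lines, bins of 10 and a log count axis so the high-value papers stay visible (assumes a 'citations' field on each merged paper):

    import numpy as np
    import matplotlib.pyplot as plt

    def plot_citations(papers, outfile):
        citations = [p.get('citations', 0) for p in papers]
        mean, median = np.mean(citations), np.median(citations)
        bins = range(0, int(max(citations)) + 10, 10)   # bins of 10
        plt.hist(citations, bins=bins, color='steelblue')
        plt.axvline(mean, color='red', label='mean = %.1f' % mean)
        plt.axvline(median, color='orange', label='median = %.1f' % median)
        plt.yscale('log')                               # or plot the outliers on a separate chart
        plt.xlabel('Citations')
        plt.ylabel('Number of papers')
        plt.legend()
        plt.savefig(outfile)
        plt.close()

    # plot_citations(papers, 'html/citations_hist.png')  # hypothetical output path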

Merge-clean-testing branch merging issues

Issues with the current main branches -

mct - correct number of papers pulled from Zotero, and years allocated properly to most of them. But hardly any members of the exec are labelled. Figure this out before merging.

Also, I don't understand the collate settings. The log seems to imply it is still downloading stuff from Zotero and PubMed/DOI.

current master ->
https://github.com/OllyButters/papers/tree/3cc3c72d2f9c9ba87198add92ed6a982fbd982b7

merge-clean-testing
https://github.com/OllyButters/papers/tree/c8846e5cc3fe5afab8886c23b613bc0c8a3c1999

Hardcoded zotero links

There might be a couple of places where the full Zotero link has been hardcoded with the group ID in it. Grep through the code for zotero and make sure this is added as a variable.
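
i.e. something along these lines in the config (names made up - I have not checked what the existing config variables are called):

    # config.py
    zotero_group_id = '123456'   # placeholder group ID
    zotero_api_base = 'https://api.zotero.org/groups/' + zotero_group_id

    # elsewhere, build URLs from the config instead of hardcoding the group:
    # url = config.zotero_api_base + '/items'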

Tidying up old files

Trying to spring-clean the source tree a bit. Candidates for deletion:

get/
build_pmid_list.py
get.py ???
google_ss.py
google.py

@hgarner - you got any reason not to delete these?

Make a 1958 html template in the config dir

Make it look a bit different - pick a random colour scheme and make a couple of images of '1958' or something. Need to have a proof of concept that we can skin it differently, even if we do just show the same data for now.
