
puma's People

Contributors

beccawilson, hgarner, katrinleinweber, ollybutters, twyburton

puma's Issues

pubmed moving to https only

Dear Colleagues,

As we announced on June 10, 2016 (https://www.ncbi.nlm.nih.gov/news/06-10-2016-ncbi-https/), NCBI will be transitioning to using HTTPS-only protocols on September 30, 2016. This change may affect any software that uses NCBI APIs such as the E-utilities or NCBI software toolkits such as the C/C++ and SRA toolkits.

Based on a review of incoming API requests, we have identified your email address as being associated with API calls using the HTTP protocol. We are writing to ensure that you are aware of these upcoming changes so that you can make any necessary updates to your software. If you are not making such calls, please disregard this message and accept our apologies.

Recently we published a page that describes these changes and offers suggested actions you can take to mitigate any problems with your software that may arise. It also lists several test servers that you can use to check the performance of your current API calls.

http://www.ncbi.nlm.nih.gov/home/develop/https-guidance.shtml

This page is also linked from the NCBI Develop page (a direct link from the NCBI home page) and the NCBI API page:

http://www.ncbi.nlm.nih.gov/home/develop/
http://www.ncbi.nlm.nih.gov/home/develop/api.shtml

Please review these documents and plan to take action soon. If you have questions, please write to [email protected].

Deal with deltas

There will be some things that we can't get from PubMed etc. We need to be able to deal with that. One option is to grab all the data from Zotero and then only use it if all the other sources of info (e.g. PubMed et al.) fail. We would need a way of highlighting when a vital bit of info is missing - perhaps a web page listing e.g.

DOI: 1234
MISSING: first author

Then the end user could make sure a first author is present in Zotero.
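
A rough sketch of the fallback merge idea (the field names and record shapes here are made up - the real merge code will differ):

    # Sketch only: assumes each source is a dict keyed by field name.
    REQUIRED_FIELDS = ['first_author', 'title', 'year']

    def merge_with_fallback(pubmed, doi, zotero):
        """Take each field from PubMed first, then DOI, then Zotero; report anything still missing."""
        merged = {}
        missing = []
        for field in REQUIRED_FIELDS:
            for source in (pubmed, doi, zotero):
                if source.get(field):
                    merged[field] = source[field]
                    break
            else:
                missing.append(field)
        return merged, missing

    # merged, missing = merge_with_fallback(pubmed_record, doi_record, zotero_record)
    # if missing:
    #     print('MISSING: ' + ', '.join(missing))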

Get citations from pubmed?

Scopus gives us one number for how many citations there have been, but that is only a reflection of what is in their database - different companies will give different values. Can we get the number of citations from PubMed? Looking at a publication:

https://www.ncbi.nlm.nih.gov/pubmed/18276894

It has a number of citations in the right-hand column. Can we get that? The original 'get.py' file does some simple PubMed API stuff with the Biopython module - if we can get the citations it will be with something like that. See if that is possible.

We could end up with 2 or 3 different numbers for how many citations each paper has.
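
If it is possible at all through the E-utilities, it will probably look something like this elink call in Biopython (untested - the 'pubmed_pubmed_citedin' link name is my best guess at the right one):

    from Bio import Entrez

    Entrez.email = 'someone@example.org'  # placeholder - NCBI ask for a real contact address

    def pubmed_citation_count(pmid):
        """Count the PubMed records that elink reports as citing this PMID."""
        handle = Entrez.elink(dbfrom='pubmed', db='pubmed',
                              LinkName='pubmed_pubmed_citedin', id=str(pmid))
        record = Entrez.read(handle)
        handle.close()
        linksets = record[0].get('LinkSetDb', [])
        return len(linksets[0]['Link']) if linksets else 0

    # print(pubmed_citation_count(18276894))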

Tidy HTML and validate

The output HTML looks like it is one long string - I find sticking some line breaks in at the end of each line makes it easier to debug the HTML from the browser. Maybe do that?

The output HTML is not valid - that needs to be fixed.
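
One option for the line breaks, assuming we are happy to add BeautifulSoup as a dependency (it is not in the pipeline at the moment as far as I know):

    from bs4 import BeautifulSoup

    def prettify_html(html_string):
        """Re-indent one long HTML string so it is readable in view-source."""
        return BeautifulSoup(html_string, 'html.parser').prettify()

    # html_out = prettify_html(html_out)  # just before writing the page to disk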

Keywords usage over time statistics

I think that this can be achieved fairly easily using the current metadata, and it can be displayed using a bar chart of years on the relevant keyword page.
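
Something like this should do it (assuming the merged papers end up as a list of dicts with 'keywords' and 'year' fields - the real field names may differ):

    from collections import Counter
    import matplotlib.pyplot as plt

    def plot_keyword_years(papers, keyword, outfile):
        """Bar chart of how many papers carry a keyword in each year."""
        counts = Counter(p['year'] for p in papers
                         if keyword in p.get('keywords', []) and p.get('year'))
        years = sorted(counts)
        plt.bar(years, [counts[y] for y in years])
        plt.xlabel('Year')
        plt.ylabel('Papers with keyword "%s"' % keyword)
        plt.savefig(outfile)
        plt.close()

    # plot_keyword_years(papers, 'genetics', 'html/keywords/genetics_years.png')  # example keyword/path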

Missing first author institute

Seems to be some discrepancy with the institute between the raw PMID, DOI and merged files, e.g.:

PMID:
20372150

"AuthorList": [
{
"LastName": "Freathy",
"Initials": "RM",
"Identifier": [],
"AffiliationInfo": [
{
"Affiliation": "Genetics of Complex Traits, Peninsula College of Medicine and Dentistry, University of Exeter, Exeter, UK.",
"Identifier": []
}
],
"ForeName": "Rachel M"
},

DOI:
3GQ5C9FC

"creators": [
{
"lastName": "Freathy",
"creatorType": "author",
"firstName": "Rachel M."
},

But the merged file:
00e81dc2fda509749bb4b56a351ff282

"author": [
{
"affiliation": [
{
"identifier": "",
"name": ""
}
],
"given": "Rachel M",
"family": "Freathy"
},

Volume number missing

The volume number is missing from the merged files.

For example:
RAW -
"Journal": { "ISSN": "2168-6238", "ISOAbbreviation": "JAMA Psychiatry", "JournalIssue": { "Volume": "71", "Issue": "10", "PubDate": { "Month": "Oct", "Year": "2014" } }
MERGED -
"Journal": { "ISOAbbreviation": "JAMA Psychiatry", "JournalIssue": { "Issue": "10", "PubDate": { "Year": "2014", "Month": "Oct" } }

Use wikidata to look up university coordinates
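
Rough sketch of how the lookup could work against the Wikidata SPARQL endpoint (matching on the English label is fragile, so this is a starting point rather than a finished thing):

    import requests

    WIKIDATA_SPARQL = 'https://query.wikidata.org/sparql'

    def wikidata_coordinates(institution_name):
        """Return the Wikidata coordinate location (property P625) for an institution, or None."""
        query = '''
            SELECT ?coord WHERE {
              ?item rdfs:label "%s"@en ;
                    wdt:P625 ?coord .
            } LIMIT 1
        ''' % institution_name
        r = requests.get(WIKIDATA_SPARQL,
                         params={'query': query, 'format': 'json'},
                         headers={'User-Agent': 'papers pipeline (placeholder contact)'})
        r.raise_for_status()
        bindings = r.json()['results']['bindings']
        return bindings[0]['coord']['value'] if bindings else None

    # wikidata_coordinates('University of Exeter') -> a WKT string like 'Point(<lon> <lat>)'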

Missing affiliations

I've been chasing some missing affiliations that I think should be there. The extract from the log below is one that is flagged as a missing institute. As far as I can tell it should be there though - it is in the raw PubMed XML file, but then it gets lost in getPubmed.py. There are ~50 missing institute errors and I suspect some are from this.

ob13747@IT017411:~/git/papers/logs$ grep 26257770 papers.log
INFO:root:Working on 26257770
INFO:root:Downloading 26257770
WARNING:root:Unable to read pmid 26257770

Build HTML report of missing required fields

At the end of the 'get/merge' phase it would be good to double-check that all the required fields are there. Spitting out any issues into an HTML page to highlight them seems like a good idea. Will need to decide what counts as a required field first though! Could be somewhat fuzzy with this too, e.g.:

  • Missing first author -> highlight red and say it needs to be fixed.
  • Missing keywords, titles, geolocation etc. -> highlight yellow and say it ought to be fixed.

Since the merging order is something like PubMed, DOI, elsewhere, Zotero, we can then add the missing fields to Zotero and that will fix the problem.
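
Rough sketch of the report builder (the required/desired field lists here are placeholders - they still need to be agreed):

    REQUIRED_FIELDS = ['first_author', 'title']                 # missing -> red
    DESIRED_FIELDS = ['keywords', 'journal', 'geolocation']     # missing -> yellow

    def build_missing_report(papers, outfile):
        """Write a simple HTML table listing papers with missing fields, coloured by severity."""
        rows = []
        for paper in papers:
            label = paper.get('doi') or paper.get('pmid', '?')
            for field in REQUIRED_FIELDS:
                if not paper.get(field):
                    rows.append((label, field, 'red'))
            for field in DESIRED_FIELDS:
                if not paper.get(field):
                    rows.append((label, field, 'yellow'))
        with open(outfile, 'w') as out:
            out.write('<!DOCTYPE html>\n<html><body><table>\n')
            for label, field, colour in rows:
                out.write('<tr style="background:%s"><td>%s</td><td>MISSING: %s</td></tr>\n'
                          % (colour, label, field))
            out.write('</table></body></html>\n')

    # build_missing_report(papers, 'html/missing_fields.html')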

Date ranges from pubmed

Some PubMed results return a date range for a paper, because the authors feel the need to say they did the work over a specific date range. I've had a go at dealing with this, but it is rudimentary and should probably check that what is returned makes sense - e.g. an int in the range 1980-2100 or something?
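
The sanity check could be as simple as this (sketch only - the 1980-2100 range is just the one suggested above):

    import re

    def parse_pub_year(pubdate):
        """Pull one plausible year out of a PubMed PubDate that may be a range,
        e.g. '2014 Oct' or '2013-2014 Winter'. Returns None if nothing sensible is found."""
        match = re.search(r'\b(\d{4})\b', str(pubdate))
        if not match:
            return None
        year = int(match.group(1))
        return year if 1980 <= year <= 2100 else None

    # parse_pub_year('2013-2014 Winter') -> 2013
    # parse_pub_year('garbage') -> None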

Make paths and imports consistent in html2

Try to import all the modules at the start of the file and not inside functions - otherwise you end up importing the same thing again and again.

There are still some relative paths, e.g. line 482.

Use different metadata sources

Currently it's all about PubMed and Scopus. Need to figure out how to deal with other sources of metadata - think about the social science papers etc.

Issued dates missing

Some papers are missing an issued date.

For example: 032b01c1ed7b8c4ac1e77d1591ec1d8b

Inconsistent parsing of 'extras' field

Currently the Zotero notes field is parsed as [fieldname]: [fielddata]\n in collate.py and [fieldname]=[fielddata]\\ in clean.py. Need to:
i) make this consistent (see the sketch below)
ii) make additions to the notes upload to Zotero
iii) update incorrectly formatted data
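
For (i), a single parser that accepts both of the current formats would be a start (sketch, not tested against the real data):

    import re

    def parse_extra(raw):
        """Parse a notes blob in either current format:
        'field: data' separated by newlines (collate.py) or 'field=data' separated by backslashes (clean.py)."""
        fields = {}
        for chunk in re.split(r'[\n\\]+', raw):
            match = re.match(r'\s*([^:=]+?)\s*[:=]\s*(.+)', chunk)
            if match:
                fields[match.group(1)] = match.group(2).strip()
        return fields

    # parse_extra('doi: 10.1000/xyz\nfirst_author: Freathy')  -> {'doi': '10.1000/xyz', 'first_author': 'Freathy'}
    # parse_extra('doi=10.1000/xyz\\first_author=Freathy')    -> same thing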

Add completeness stats to analysis outputs

When we output the results from the analysis routines it would be good to have an idea of how complete the source data is. That way, if I get a result of 100% of papers coming from one journal, I can see whether that is true for all papers, or whether it is just because only 1 paper has info for the journal.
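
Could be as simple as reporting a completeness fraction next to each result, e.g. (field names assumed):

    def completeness(papers, field):
        """Fraction of papers with a non-empty value for the given field."""
        if not papers:
            return 0.0
        return sum(1 for p in papers if p.get(field)) / float(len(papers))

    # print('Journal known for %.0f%% of papers' % (100 * completeness(papers, 'journal')))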

Concatenated PMID

Merged file:
11cdfce682822ce04f875402e4f5f244

has PMID:
11397886

But there is no PMID in the raw pmids folder with that ID; there is
1139788

which has a crazy date in 1975 that is getting picked up elsewhere. So my guess is that the long PMID is getting chopped somewhere.

Mismatch in numbers of papers

@hgarner I have a bit of a discrepancy in the numbers of papers we have. In the ALSPAC all-papers Zotero library I have 1319 papers, in the cache snapshot I have 1073, and that is what comes out of the pipeline. What are you running your Zotero extraction against?

current master dies at html part

HTML - Home

Traceback (most recent call last):
  File "./source/papers.py", line 136, in
    cohort_rating, cohort_rating_data_from = html.build_htmlv2.build_home(papers, error_log)
  File "/home/ob13747/git/papers/source/html/build_htmlv2.py", line 126, in build_home
    shutil.copyfile(config.template_dir + '/style_main.css', config.html_dir + '/css/style_main.css')
  File "/usr/lib/python2.7/shutil.py", line 82, in copyfile
    with open(src, 'rb') as fsrc:
IOError: [Errno 2] No such file or directory: 'html/template/style_main.css'
ob13747@IT017411:~/git/papers$

Citations plot additions

On the number-of-citations plot it would be good if we could show the mean and median values, e.g. by colouring the relevant bar a different colour, or putting a label on the plot somehow.

It would also be good if we could show the high-value citations too - I am not sure of the best way to do this - we could do a log plot, or we could bin the citations up into e.g. bins of 10, or we could show the larger values on a different plot. Have a look at the data and see what you think is the best way to display it.
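
Rough sketch of both ideas together - mean/median as vertical lines, bins of 10 and a log count axis so the high-value papers stay visible (assumes a 'citations' field on each merged paper):

    import numpy as np
    import matplotlib.pyplot as plt

    def plot_citations(papers, outfile):
        citations = [p.get('citations', 0) for p in papers]
        mean, median = np.mean(citations), np.median(citations)
        bins = range(0, int(max(citations)) + 10, 10)   # bins of 10
        plt.hist(citations, bins=bins, color='steelblue')
        plt.axvline(mean, color='red', label='mean = %.1f' % mean)
        plt.axvline(median, color='orange', label='median = %.1f' % median)
        plt.yscale('log')                               # or plot the outliers on a separate chart
        plt.xlabel('Citations')
        plt.ylabel('Number of papers')
        plt.legend()
        plt.savefig(outfile)
        plt.close()

    # plot_citations(papers, 'html/citations_hist.png')  # hypothetical output path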

Merge-clean-testing branch merging issues

Issues with the current main branches -

mct - correct number of papers pulled from Zotero, and years allocated properly to most of them. But hardly any members of the exec are labelled. Figure this out before merging.

Also, I don't understand the collate settings. The log seems to imply it is still downloading stuff from Zotero and PubMed/DOI.

current master ->
https://github.com/OllyButters/papers/tree/3cc3c72d2f9c9ba87198add92ed6a982fbd982b7

merge-clean-testing
https://github.com/OllyButters/papers/tree/c8846e5cc3fe5afab8886c23b613bc0c8a3c1999

Hardcoded zotero links

There might be a couple of places where the full Zotero link has been hardcoded with the group ID in it. Grep through the code for zotero and make sure this is added as a variable.
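
i.e. something along these lines in the config (names made up - I have not checked what the existing config variables are called):

    # config.py
    zotero_group_id = '123456'   # placeholder group ID
    zotero_api_base = 'https://api.zotero.org/groups/' + zotero_group_id

    # elsewhere, build URLs from the config instead of hardcoding the group:
    # url = config.zotero_api_base + '/items'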

Tidying up old files

Trying to spring-clean the source tree a bit. Candidates for deletion:

get/
build_pmid_list.py
get.py ???
google_ss.py
google.py

@hgarner - you got any reason not to delete these?

Make a 1958 html template in the config dir

Make it look a bit different - pick a random colour scheme and make a couple of images of '1958' or something. Need to have a proof of concept that we can skin it differently, even if we do just show the same data for now.
