
alexandria3k's Issues

PubMed Data Set Query Option Missing in PyPI Version of Alexandria3k

Environment:

  • OS : Ubuntu 22.04 (Windows Subsystem for Linux 2)
  • Python Version : 3.11.5

Alexandria3k Version: 3.1.2

Issue Description:

When I try to query the PubMed data with the alexandria3k CLI, it returns an error saying that 'pubmed-data' is not a valid choice for data_name, and the list of accepted choices contains no PubMed option at all. This occurs even though the package documentation suggests that PubMed data set queries are supported.

Command Attempted:

panaspanakis@hpc:~$ a3k query pubmed-data 'pubmed_data' --query 'SELECT DOI, title FROM pubmed_articles WHERE published_year > 2020 AND title LIKE "%Machine Learning%"'

Error Received:

usage: a3k query [-h] [-a ATTACH_DATABASES [ATTACH_DATABASES ...]] [-E OUTPUT_ENCODING] [-F FIELD_SEPARATOR] [-H] [-o OUTPUT] [-P] (-Q QUERY_FILE | -q QUERY) [-s SAMPLE]
                 {orcid,doaj,asjcs,journal-names,uspto,crossref,ror,funder-names} [data_location]
a3k query: error: argument data_name: invalid choice: 'pubmed-data' (choose from 'orcid', 'doaj', 'asjcs', 'journal-names', 'uspto', 'crossref', 'ror', 'funder-names')
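For readers unfamiliar with this kind of message: it is the standard rejection that Python's argparse emits when a positional argument is declared with a fixed `choices` list and the given value is not in it. The sketch below reproduces that behaviour in isolation. It is not a3k's actual code; `KNOWN_SOURCES`, `build_parser`, and `is_accepted` are hypothetical names, and the list is only a subset of the choices shown in the error above.

```python
import argparse

# Hypothetical subset of data sources. In a3k the accepted names are
# whatever the installed release registers, as listed in the error above.
KNOWN_SOURCES = ["orcid", "doaj", "crossref"]


def build_parser():
    """Build a parser whose positional argument only accepts known names."""
    parser = argparse.ArgumentParser(prog="demo")
    parser.add_argument("data_name", choices=KNOWN_SOURCES)
    return parser


def is_accepted(name):
    """Return True if argparse accepts the name, False if it rejects it."""
    try:
        build_parser().parse_args([name])
        return True
    except SystemExit:
        # argparse prints a usage message plus "invalid choice" and exits.
        return False


if __name__ == "__main__":
    print(is_accepted("crossref"))
    print(is_accepted("pubmed-data"))
```

If this is what happens in a3k, the error would mean the installed release simply does not register a PubMed data source, rather than the command being mistyped.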

[Screenshot of the error described above]

Possible non-random sampling

Hi there. Firstly, thanks for a3k, I'm finding it very useful.

I noticed a problem when using --sample 'random.random() < 0.0001' to randomly sample from the latest Crossref dataset. It seemed to produce identical samples each time, whereas I was expecting it to produce different samples each time. I've not yet looked through the code, but I wondered if it might be an issue with seeding the random generator? Perhaps this is expected behaviour, so apologies if I missed this in the docs.

An example:

$ a3k populate --sample 'random.random() < 0.0001' /tmp/crossref.db crossref ./data-files/crossref/

$ sqlite3 -batch /tmp/crossref.db 'select id,doi,title from works where id in (select max(id) from works) or id in (select min(id) from works) order by id';
id     doi                          title                                                                                     
-----  ---------------------------  ------------------------------------------------------------------------------------------
0      10.1007/978-3-658-29701-5_1  Keynote Speech Disruption in mobility – new trends, new concepts and new business models?!
21383  10.18356/98a0368f-en-fr      No. 47244 International Bank for Reconstruction and Development and Brazil

$ rm /tmp/crossref.db

$ a3k populate --sample 'random.random() < 0.0001' /tmp/crossref.db crossref ./data-files/crossref/

$ sqlite3 -batch /tmp/crossref.db 'select id,doi,title from works where id in (select max(id) from works) or id in (select min(id) from works) order by id';
id     doi                          title                                                                                     
-----  ---------------------------  ------------------------------------------------------------------------------------------
0      10.1007/978-3-658-29701-5_1  Keynote Speech Disruption in mobility – new trends, new concepts and new business models?!
21383  10.18356/98a0368f-en-fr      No. 47244 International Bank for Reconstruction and Development and Brazil

Notice the identical results after deleting and recreating the database with a 'fresh' sample. Perhaps this is expected behaviour, but I was expecting a random sample, and hence different each time.
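To illustrate the seeding hypothesis in plain Python, independent of a3k: a generator started from the same fixed seed makes the same sequence of sampling decisions, while one seeded from fresh entropy does not. This is only a sketch of the suspected mechanism; I have not confirmed that a3k seeds its generator this way, and `sample_ids` is a hypothetical helper.

```python
import random


def sample_ids(n_records, rate, seed=None):
    """Return the indices selected by the predicate `rng.random() < rate`.

    With seed=None the generator is seeded from OS entropy (or the clock);
    with a fixed seed the selection is fully reproducible.
    """
    rng = random.Random(seed)
    return [i for i in range(n_records) if rng.random() < rate]


if __name__ == "__main__":
    # Same fixed seed: identical samples on every run.
    print(sample_ids(100_000, 0.0001, seed=42) == sample_ids(100_000, 0.0001, seed=42))
    # Fresh seed each time: samples almost certainly differ.
    print(sample_ids(100_000, 0.0001) == sample_ids(100_000, 0.0001))
```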

Some quick sanity checks:

$ sqlite3 -batch /tmp/crossref.db 'select count(*) from works;'
count(*)
--------
10000

$ ls -l data-files/crossref/ | head -n 4
total 185934MB
-rwxrwxrwx 1 austinjp austinjp  8MB 2023-08-10 19:35 0.json.gz
-rwxrwxrwx 1 austinjp austinjp 11MB 2023-08-10 20:10 10000.json.gz
-rwxrwxrwx 1 austinjp austinjp  7MB 2023-08-10 20:11 10001.json.gz

$ ls -1 data-files/crossref/ | wc -l
28702

Workaround

As a workaround, I use --sample '( random.seed() ) or random.random() < 0.0001' to re-seed the random generator at every sample decision. It's inefficient, but it gives the results I'd expected:

$ rm /tmp/crossref.db

$ a3k populate --sample '( random.seed() ) or random.random() < 0.0001' /tmp/crossref.db crossref ./data-files/crossref/

$ sqlite3 -batch /tmp/crossref.db 'select id,doi,title from works where id in (select max(id) from works) or id in (select min(id) from works) order by id';

id     doi                               title                                                                  
-----  --------------------------------  -----------------------------------------------------------------------
0      10.1097/00001721-199206000-00004  Protein S negates the activated protein C inhibitory activity of plasma
37767  10.1177/109980040000200201        Summer Camp for Scientists

$ rm /tmp/crossref.db

$ a3k populate --sample '( random.seed() ) or random.random() < 0.0001' /tmp/crossref.db crossref ./data-files/crossref/

$ sqlite3 -batch /tmp/crossref.db 'select id,doi,title from works where id in (select max(id) from works) or id in (select min(id) from works) order by id';

id     doi                            title                                                                                                                     
-----  -----------------------------  --------------------------------------------------------------------------------------------------------------------------
0      10.15296/ijwhr.2017.33         Health Promoting Behaviors and Self-efficacy of Physical Activity During Pregnancy: An Interventional Study               
21383  10.1016/s0973-0826(08)60299-9  Risk and uncertainty analysis of natural environmental assets threatened by hydropower projects: case study from Sri Lanka
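For anyone wondering why the workaround expression is valid: `random.seed()` returns `None`, which is falsy, so `( random.seed() ) or X` always evaluates to `X` after re-seeding from fresh entropy. A minimal sketch of the same expression as a plain Python function (`reseeding_predicate` is a hypothetical name, and this says nothing about how a3k evaluates `--sample` internally):

```python
import random


def reseeding_predicate(rate):
    """Re-seed the global generator, then make one sampling decision.

    random.seed() with no argument re-seeds from OS entropy (or the clock)
    and returns None, so the `or` falls through to the comparison.
    """
    return (random.seed()) or random.random() < rate


if __name__ == "__main__":
    random.seed(0)  # a fixed seed set beforehand no longer matters
    decisions = [reseeding_predicate(0.5) for _ in range(10)]
    print(decisions)
```

Re-seeding on every record is the inefficient part; it works only because each call discards whatever state the generator had before.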

Best wishes.

CI dependencies installation error

After five consecutive pipenv releases during August, something that was present in previous versions seems to be missing. As a result, CI raises an error when installing the environment dependencies.

The history of releases can be seen here.

One of the missing dependencies is "virtualenv-clone"; however, the raised error is not related to that specific dependency.

[Screenshot of the CI error]

As a workaround, I propose using the last pipenv version released in July (2023.7.23). With it, the dependency installation on CI works fine.

I will keep following the pipenv releases and testing them. If pipenv stops releasing new versions and the a3k CI still fails, I can contact its authors and share our error to give them further information.

GSOC 2023 Proposal

Greetings Mr. Diomidis,

I have uploaded my GSoC 2023 proposal on adding extensions to Alexandria3k. I would love to hear your feedback so that I can improve it and clarify any unclear parts.

Kind Regards,
Aggelos Margkas.
