
geocompy's Introduction

geocompy


https://py.geocompx.org

Running the code in this book requires the following:

  1. Python dependencies, which can be installed with pip, with a package manager such as conda, or via a Docker container (see below)
  2. An integrated development environment (IDE) such as VS Code (running locally or on Codespaces/other host) or Jupyter Notebook for running and exploring the Python code interactively
  3. Quarto, which is used to generate the book

Reproduce the book with GitHub Codespaces

GitHub Codespaces minimise set-up costs by providing access to a modern IDE (VS Code) plus dependencies in your browser. This can save time on package installation. Codespaces allow you to make and commit changes, providing a way to test changes and contribute fixes in an instant.

To run the book in Codespaces, click on the link below.

Open in GitHub Codespaces

You should then be able to run all the code in the book by opening the terminal (e.g. with the command Ctrl+J) and entering the following command:

quarto preview

Reproduce the book with Docker (devcontainer)

If you can install Docker, this is likely to be the quickest way to reproduce the contents of this book. To do this from within VS Code:

  1. Install Microsoft’s official Dev Container extension
  2. Open the folder containing the repo in VS Code and click on the ‘Reopen in container’ button that should appear, as shown below (you need to have Docker installed on your computer for this to work)

  3. Edit the code in the containerised instance of VS Code that will appear 🎉

See details below for other ways to get the dependencies and reproduce the book.

Install dependencies with pip

Use pip to install the dependencies as follows, after cloning the repo and opening a terminal in the root folder of the repo.

First we’ll set up a virtual environment to install the dependencies in:

# Create a virtual environment called geocompy
python -m venv geocompy
# Activate the virtual environment
source geocompy/bin/activate

Then install the dependencies (the optional python -m prefix ensures pip runs under the active Python interpreter):

# Install dependencies from the requirements.txt file
python -m pip install -r requirements.txt

You can also install packages individually, e.g.:

pip install jupyter-book

Deactivate the virtual environment when you’re done:

deactivate

Install dependencies with a package manager

The environment.yml file contains a list of dependencies that can be installed with a package manager such as conda, mamba or micromamba. The instructions below are for micromamba but should work with conda or mamba if you substitute the command name.

# For Linux, the default shell is bash:
curl -L micro.mamba.pm/install.sh | bash
# For macOS, the default shell is zsh:
curl -L micro.mamba.pm/install.sh | zsh

After answering the questions, install dependencies with the following command:

micromamba env create -f environment.yml

Activate the environment as follows:

micromamba activate geocompy

Install the IPython kernel, which will allow you to select the environment in VS Code or Jupyter, as follows:

python -m ipykernel install --user

You can now reproduce the book (requires quarto to be installed):

micromamba run -n geocompy quarto preview

Reproduce chapters with jupyter

VS Code’s quarto.quarto extension can run the Python code in the chunks of the .qmd files in this book interactively.

However, you can also run any of the chapters in a Jupyter Notebook, e.g. as follows:

cd ipynb
# jupyter notebook . # open a notebook showing all chapters
jupyter notebook 02-spatial-data.ipynb

You should see the notebook for that chapter open in your browser.

See documentation on running and developing Python code in a Jupyter notebook at docs.jupyter.org.

Additional information

If you’re interested in how to auto-generate and run the .py and .ipynb files from the .qmd files, see below.

Updating the .py and .ipynb files

The Python scripts and IPython notebook files stored in the code and ipynb folders are generated from the .qmd files. To regenerate them, you can use the following commands, to generate .ipynb and .py files for local versions of Chapter 2, for example:

quarto convert 02-spatial-data.qmd # generate .ipynb file
jupytext --to py *.ipynb # generate .py files from the .ipynb files

Do this for all chapters with the following bash script in the repo:

./convert.sh

Updating .py and .ipynb files with GitHub Actions

We have set up a GitHub Action to do this automatically: every commit message that contains the text string ‘convert’ will create and push updated .ipynb and .py files.

Executing the .py and .ipynb files

Running the code chunks in the .qmd files in an IDE such as VSCode or directly with quarto is the main way code in this book is designed to be run interactively, but you can also execute the .py and .ipynb files directly. To run the code for chapter 2, for example, you can run one of the following commands from your system shell:

python code/chapters/02-spatial-data.py # currently requires manual intervention to complete, see #71
ipython ipynb/02-spatial-data.ipynb # currently requires manual intervention to complete, see #71
bash ./run-code.sh # run all .py files

Updating packages

We pin package versions in the environment.yml and requirements.txt files to ensure reproducibility.

To update requirements.txt, run the following:

python -m pip install pur
pur -r requirements.txt
python -m pip install -r requirements.txt

To update the environment.yml file in the same way based on your newly installed packages, run the following:

micromamba env export > environment.yml

geocompy's People

Contributors

anisotropi4, anitagraser, joshcole-dta, jtmiclat, michaeldorman, nowosad, robinlovelace, robinlovelace-ate, sgillies, smkerr


geocompy's Issues

Resolving use of both environment.yml and requirements.txt files

Currently we have two files defining dependencies:

https://github.com/geocompr/py/blob/main/data/requirements.txt

and

https://github.com/geocompr/py/blob/main/environment.yml

The fact that we have two versions of the deps can cause issues, e.g. #31.

I think we should standardise; looking at various help files, I'm thinking that an environment.yml file that points to the pip deps should be fine. Source: https://stackoverflow.com/a/68164027

# environment.yml
name: geocompy
dependencies:
  - python>=3.9
  - miniconda3
  - pip
  - pip:
    - -r file:data/requirements.txt

However, I note these caveats from https://docs.conda.io/projects/conda/en/latest/user-guide/tasks/manage-environments.html :

Issues may arise when using pip and conda together. When combining conda and pip, it is best to use an isolated conda environment. Only after conda has been used to install as many packages as possible should pip be used to install any remaining software. If modifications are needed to the environment, it is best to create a new environment rather than running conda after pip. When appropriate, conda and pip requirements should be stored in text files.

Thoughts? Another advantage of standardising this could be to automatically update the Docker images when the requirements get updated here, linking to #6

Graph showing popularity of languages

Inspired by https://geobgu.xyz/py/ by @michaeldorman

Updated image from StackOverflow:

(image: Stack Overflow language-popularity chart, not reproduced here)

Points: although Python is more popular overall, for data processing (roughly, but not completely, captured by pandas questions) Python-for-data-science and R are more similar. I've thrown a few new ones in there: Rust is a super future-proof systems language (with low-level geo crates in GeoRust), and Julia is a new kid on the block with a similar focus to R.

Making `landsat.tif` available to GitHub Actions

I've added a few more files to the (local) data directory; one of them (landsat.tif) is large and in .gitignore, which makes the build fail. It's used in this example:

(screenshot of the example not reproduced here)

Will be happy to hear any ideas on how we can make landsat.tif available to GitHub Actions. Thanks!

Switch to Python 3.11

It seems this will be a major release with lots of performance enhancements: https://docs.python.org/3.11/whatsnew/3.11.html

Problem is, it seems that conda will lag behind the official release: conda-forge/conda-forge.github.io#1629

According to the release schedule, the full 3.11 release will be published around October 2022, so we don't need to do anything atm; just opening this up here so we're ready, and to get feedback. From then, it seems that Python 3.11 will be supported for the next 5 years, so I suggest we pin to that version for stability when it's ready and stable, including on conda. Sound reasonable @anitagraser?

Use explore instead of hvplot for geopandas

Hey,

this is a great initiative! Thanks! I was checking Section 2 and noticed that in 2.2.2 you try to get an interactive plot using hvplot. I would suggest going with the built-in explore method, as it is already there and you don't need any other external package. So unless you really prefer holoviews, or need to do something explore cannot, I believe it is better to point users to the built-in interactive plotting. Depending on folium and leaflet instead of hvplot is also much more lightweight.
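For illustration, a minimal sketch of the built-in approach (assuming a GeoDataFrame read from the book's data/world.gpkg dataset and its pop column):

import geopandas as gpd

gdf = gpd.read_file("data/world.gpkg")
# explore() returns a folium.Map, an interactive leaflet view,
# with no plotting dependency beyond folium itself
m = gdf.explore(column="pop", legend=True)
m.save("world.html")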

Set-up notes on Windows

Making some notes here from recent installation experience.

  • Installed Chocolatey
  • Installed miniconda3
  • Installed VS Code
  • Installed the Quarto plugin in VS Code

At that point I tried running some Python code and was asked to select the Python Interpreter:

(screenshot of the interpreter selection prompt not reproduced here)

IPython versions of chapters

  • You can convert from qmd to ipynb with quarto convert test.qmd
  • Should we do that automatically in the CI?
  • Plan: have unevaluated ipynb versions of every chapter to be format agnostic and as accessible as possible
  • May make it easier when making Binder

Domain name

Currently the book is hosted in 2 places:

Directly from GitHub pages: http://geocompr.github.io/py

Random netlify deploy: https://vocal-longma-0d407b.netlify.app/ #68

We can continue with just the former as the default, no problem. Advantage: simplicity.

Advantage of custom domain hosted by Netlify:

  • Netlify deploys do not add anything to the commit history or repo size keeping it clean
  • Potential advantage in terms of discoverability with custom domain

This is linked to a broader question about where to host all 'geocompr' repos, we could get economies of scale by hosting many of them on the same site:

  • Geocomputation with R, deployed from the gh-pages branch via Netlify to https://geocompr.robinlovelace.net/; the default plan is to continue to host it there, but I'm open to moving it at some point, e.g. to geocomp.xyz/r
  • Associated book website, blog and solutions
  • Spanish translation
  • French translation
  • Japanese translation
  • ... translation in another human language
  • translation into another computer language e.g. JavaScript or Rust

All this and more could live on a custom domain. That may take some setting up. Options:

geocomp.xyz suggested by Jakub

geowith.com suggested by me (over a coffee ☕), thinking that r.geowith.com and py.geowith.com, as in geo(computation) with r/py/..., could be good, and then r.geowith.com/en etc. as consistent sites.

This would be a big operation and not to be taken lightly. Simple is good. So the default is to keep it as is for now, with the silent Netlify deploys as an experimental backup, but I thought I'd dump thoughts on this here while fresh in my head (I have discussed this broadly with Jakub, who suggested a dedicated URL, partly as geocompr does not fit so well with the Pythonic nature of this project; keen to hear what others think).

Switch to conda only build/deploy

Just noticed another CI fail, with this message:

NameError                                 Traceback (most recent call last)
Input In [22], in <cell line: 1>()
----> 1 multipolygon = shapely.geometry.MultiPolygon([Polygon([(1,5), (2,2), (4,1), (4,4), (1,5)]), 
      2                              Polygon([(0,2), (1,2), (1,3), (0,3), (0,2)])])
      3 multipolygon
NameError: name 'Polygon' is not defined

Source: https://github.com/geocompr/py/runs/7087792814?check_suite_focus=true#step:4:107

I'm wondering at this point if we should use the conda image in the main.yml file. However, I do not understand this latest error message and it would be good to understand the cause before acting.
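That said, the traceback itself points at a likely cause: the cell uses the name Polygon unqualified, so it presumably just needs an explicit import. A minimal sketch of that fix (assuming the cell otherwise only imports shapely.geometry):

import shapely.geometry
from shapely.geometry import Polygon  # the unqualified name the failing cell relies on

multipolygon = shapely.geometry.MultiPolygon([
    Polygon([(1, 5), (2, 2), (4, 1), (4, 4), (1, 5)]),
    Polygon([(0, 2), (1, 2), (1, 3), (0, 3), (0, 2)]),
])
print(multipolygon.wkt)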

Also thinking we should have CI on Windows, Mac and Linux but can save that for another issue. Thoughts?

Pitching Ideas

Good evening,

Following a short exchange with Robin on Twitter, I'd like to see how I could contribute to the project. Which prompts the questions: who is the target audience, and how can I help?

For my part I use Open Data sources to present interactive data visualization as hobby. Examples includes:

  • Population and census data from the Office of National Statistics (ONS) and National Library of Scotland,
  • Transport network data from OpenStreetMap (OSM)
  • Transport locations from the National Public Transport Access Nodes (NaPTAN) dataset
  • Planning and operation data from Network Rail OpenDataFeeds

You appear to have covered these already, but I typically use shell scripts and Python on Linux to extract, clean and wrangle data with numpy, pandas, GeoPandas, NetworkX, OSMnx and SciPy. I've also been playing with PySAL (the Python Spatial Analysis Library) and sklearn (scikit-learn).

For what it's worth, my day job is a bit more prosaic: I am technical lead on projects involving national planning, operations and performance industry data exchange on the British network, and I have a role in European rail data exchange governance.

Given all this, my initial thought would be to take one of the visualizations, say how to draw an interactive population density map, as here.

Or perhaps a trunk road and rail transport map, as here, or a fizzy-knitting* map of the British railway, such as here.

Does any of this work?

Apologies for the length of the note. Shout if you have any questions with this stuff.

Ta,

Will

How to reproduce the book on Windows?

Just trying this after #22 and hitting this error message:

quarto preview
Preparing to preview
[ 1/10] index.qmd

Starting Jupyter kernel...
ERROR: Error executing 'C:/tools/miniconda3/python.exe': The pipe is being closed. (os error 232)

Anyone got any ideas? I guess I should set the environment but don't know how...

Update appendix

I think fixing this is a bit of an MVP requirement before we put this out there: https://geocompr.github.io/py/a1-starting.html

Reasoning: great that it's got installation instructions, but it's for the wrong language (a result of copying the template from jjallaire)!

Happy to give this a bash, but if anyone else has insight/experience into good practice when installing Python for geo work on Linux/Windows/Mac/Docker, here's the place to say so.

Key links:

Build failing

Since removing the landsat and 'air' datasets for #12, the build is failing :( sorry!

I suggest we solve this by downloading the file if it does not exist, e.g. with:

import subprocess
url = "https://github.com/geocompr/py/releases/download/0.1/landsat.tif"
path = "data"  # directory to download into
# -nc skips the download if the file already exists; -P sets the target directory
subprocess.run(["wget", "-nc", "-P", path, url])

Error message related to CRSs in c2

This is my output:

quarto preview # generate live preview of the book

Preparing to preview
[1/8] 02-spatial-data.qmd

Starting Jupyter kernel...Done

Executing '02-spatial-data.ipynb'
  Cell 1/47...Done
  Cell 2/47...Done
  Cell 3/47...Done
  Cell 4/47...

An error occurred while executing the following cell:
------------------
gdf = gpd.read_file("data/world.gpkg")
------------------

---------------------------------------------------------------------------
CRSError                                  Traceback (most recent call last)
Input In [4], in <module>
----> 1 gdf = gpd.read_file("data/world.gpkg")

File /opt/anaconda3/envs/OSMNX/lib/python3.10/site-packages/geopandas/io/file.py:244, in _read_file(filename, bbox, mask, rows, **kwargs)
    239 if kwargs.get("ignore_geometry", False):
    240     return pd.DataFrame(
    241         [record["properties"] for record in f_filt], columns=columns
    242     )
--> 244 return GeoDataFrame.from_features(
    245     f_filt, crs=crs, columns=columns + ["geometry"]
    246 )

File /opt/anaconda3/envs/OSMNX/lib/python3.10/site-packages/geopandas/geodataframe.py:610, in GeoDataFrame.from_features(cls, features, crs, columns)
    608     row.update(feature["properties"])
    609     rows.append(row)
--> 610 return GeoDataFrame(rows, columns=columns, crs=crs)

File /opt/anaconda3/envs/OSMNX/lib/python3.10/site-packages/geopandas/geodataframe.py:126, in GeoDataFrame.__init__(self, data, geometry, crs, *args, **kwargs)
    122     super().__init__(data, *args, **kwargs)
    124 # need to set this before calling self['geometry'], because
    125 # getitem accesses crs
--> 126 self._crs = CRS.from_user_input(crs) if crs else None
    128 # set_geometry ensures the geometry data have the proper dtype,
    129 # but is not called if `geometry=None` ('geometry' column present
    130 # in the data), so therefore need to ensure it here manually
   (...)
    134 
    135 # if gdf passed in and geo_col is set, we use that for geometry
    136 if geometry is None and isinstance(data, GeoDataFrame):

File /opt/anaconda3/envs/OSMNX/lib/python3.10/site-packages/pyproj/crs/crs.py:479, in CRS.from_user_input(cls, value, **kwargs)
    477 if isinstance(value, cls):
    478     return value
--> 479 return cls(value, **kwargs)

File /opt/anaconda3/envs/OSMNX/lib/python3.10/site-packages/pyproj/crs/crs.py:326, in CRS.__init__(self, projparams, **kwargs)
    324     self._local.crs = projparams
    325 else:
--> 326     self._local.crs = _CRS(self.srs)

File pyproj/_crs.pyx:2352, in pyproj._crs._CRS.__init__()

CRSError: Invalid projection: epsg:4326: (Internal Proj Error: proj_create: no database context specified)

(geocompy) 

Is this a PROJ version issue? Will try in a different environment.

GitHub actions error

The GitHub action produces an error as follows:

 fatal: not in a git directory
Error: Process completed with exit code 128.

(screenshot of the failing step not reproduced here)

However, the book is built nevertheless:

(screenshot of the built book not reproduced here)

Use of an explicit index paradigm

Please would you consider, in the introductory chapters, introducing an example of using an index to filter dataframes, so that rather than writing

df[df['A'] == 3]

I would create a variable idx to hold the boolean filter:

idx = df['A'] == 3
df[idx]

This has the advantage that, when the filter becomes more complex, the intent is clearer. For example, when using the ~ (not) operation.

Then working with multiple filters makes using & and | (logical and and or) more straightforward.

For example, I feel the intent of the following examples is somewhat clearer:

idx1 = df['A'] == 3
idx2 = df['B'] == 'Q'

df[~idx1]
df[idx1 & ~idx2]

Rather than

df[~(df['A'] == 3)]
df[(df['A'] == 3) & ~(df['B'] == 'Q')]
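A minimal runnable illustration of the pattern, with a made-up DataFrame (the values here are hypothetical):

import pandas as pd

df = pd.DataFrame({"A": [1, 3, 3], "B": ["P", "Q", "R"]})

# named boolean filters make the intent explicit
idx1 = df["A"] == 3
idx2 = df["B"] == "Q"

print(df[~idx1])         # rows where A is not 3
print(df[idx1 & ~idx2])  # rows where A is 3 and B is not 'Q'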

Update instructions to run .ipynb notebooks

Currently the README states:

Then, navigate to the above-mentioned working directory and open the Jupyter notebook of any of the chapters using a command such as:

jupyter notebook 02-spatial-data.ipynb

But now the .ipynb files are in the ipynb/ directory.

Add text around new content for chapter 9

Following the recently merged #57, we should add some text to describe the code at the very least. I'm up for taking a first look at this, so am assigning myself, but I may ask for input @anisotropi4, so any comments from you or anyone else are welcome. The vis. chapter is in an early state of development, so it may be worth adding basic content before this more advanced stuff, but we can add a link and a brief description at the very least.

Operations on data and Geometry in GeoPandas and Shapely

Having more or less completed the NaPTAN PR, I have been having a look at the other chapters, which look very good.

They did pose a question about how to keep consistency with the PR, for example:

  1. Reading CSV data from a file or buffer with pandas (.read_csv), or from memory (.from_dict)
  2. Extracting dataframe columns to numpy array, and handling issues around numpy array shape
  3. Creating GeoSeries geometry from a 2-d numpy array using .from_xy
  4. Using Shapely binary predicates to identify geometric relationships, for example .within with a GeoSeries

There are then other interesting predicate functions such as .contains, .crosses and so on.
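A minimal sketch of steps 1, 3 and 4 with made-up coordinate data (the column names here are hypothetical):

import pandas as pd
import geopandas as gpd
from shapely.geometry import Polygon

# point data, as if read with pd.read_csv
df = pd.DataFrame({"name": ["a", "b"], "lon": [0.5, 2.0], "lat": [0.5, 2.0]})

# create GeoSeries geometry from the coordinate columns
pts = gpd.GeoSeries.from_xy(df["lon"], df["lat"], crs="EPSG:4326")

# binary predicate: which points fall within a unit square?
square = Polygon([(0, 0), (1, 0), (1, 1), (0, 1)])
print(pts.within(square))  # True for 'a', False for 'b'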

Error when the correct environment is not activated

Documenting here as a common issue that people will hit, from the PowerShell command line this time (I really don't like using Windows, but it's good to see how ~90% of people will hit these questions of reproducibility):

quarto preview
Preparing to preview
[1/5] 02-spatial-data.qmd

Starting Jupyter kernel...Done

Executing '02-spatial-data.ipynb'
  Cell 1/40...

An error occurred while executing the following cell:
------------------
import geopandas as gpd
------------------

---------------------------------------------------------------------------
ModuleNotFoundError                       Traceback (most recent call last)
Input In [1], in <cell line: 1>()

ModuleNotFoundError: No module named 'geopandas'
ModuleNotFoundError: No module named 'geopandas'

There is an environment.yml file in this directory. Is this for a conda env that you need to restore?

Attempted solution:


PS C:\Users\robinadmin\gh\geocompr\py> conda env activate geocompy
conda : The term 'conda' is not recognized as the name of a cmdlet, function, script file, or operable program. Check the spelling of the name, or if a path was 
included, verify that the path is correct and try again.
At line:1 char:1
+ conda env activate geocompy
+ ~~~~~
    + CategoryInfo          : ObjectNotFound: (conda:String) [], CommandNotFoundException
    + FullyQualifiedErrorId : CommandNotFoundException

Error in requirements.txt?

Heads-up @michaeldorman, I get the following:

pip install -r data/requirements.txt                     

ERROR: Invalid requirement: 'affine=2.3.0=py_0' (from line 4 of data/requirements.txt)
Hint: = is not a valid operator. Did you mean == ?
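For what it's worth, a quick hypothetical cleanup sketch, assuming every offending line uses conda's name=version=build pin format:

import re

# rewrite conda-style pins (name=version=build) as pip-style name==version
with open("data/requirements.txt") as f:
    lines = f.read().splitlines()

fixed = [re.sub(r"^([A-Za-z0-9_.-]+)=([^=]+)=.*$", r"\1==\2", line) for line in lines]

with open("data/requirements.txt", "w") as f:
    f.write("\n".join(fixed) + "\n")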

No module named topojson issue

That error message appeared after running:

conda activate geocompy # the default name of the environment
quarto preview

I guess it's not in the environment.yml file...
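A quick hypothetical check of whether the package is importable in the active environment:

import importlib.util

# None means topojson is not installed in the active environment
print(importlib.util.find_spec("topojson"))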

Add sections on reproducing the book on the landing page

One of the most important things for reproducibility and accessibility. Heads-up @anitagraser, any suggestions on the local installation side are very welcome:

  • Running the code locally (including python set-up instructions)
  • Running the code in Docker with IPython notebook (possibly the second most used approach)
  • Running the code in Docker with VSCode
  • Running the code in Docker with RStudio
  • Running the code in Binder
