mgastner / cartogram-cpp Goto Github PK
View Code? Open in Web Editor NEWCartogram generator in C++
License: MIT License
Cartogram generator in C++
License: MIT License
I came across a bug on the master
branch.
This first screenshot below shows a GeoDiv of russia.geojson
(input GeoJSON) with its NAME_1
as Sakha
.
This second screenshot below shows the same GeoDiv of an inset of russia_B_cartogram_scaled.geojson
with its NAME_1
as Bryansk
, which is incorrect. This bug can be traced back to commit 551633ccf0a33c562a7d5b0bc9315c15e320d4df
, which seems to be the first commit where this bug occurs, while commit 3996e831f259ddeeae60161ecaafc95ad19af686
, 2 commits before, runs fine, which I believe means that this error is due to the inset code. (Commit da6ea99fd6dc1736955d66e8db99cbae123f379b
, which is the commit between the 2 commits mentioned above had some compiling errors so I moved on to the commit after this one, which is 551633ccf0a33c562a7d5b0bc9315c15e320d4df
)
To reproduce this bug, please follow these steps:
git checkout master
Place the attached russia.geojson
(after unzipping) and inset_russia_population.csv
into the sample_data/inset_data/
directory.
inset_russia_modified_population.csv
russia.geojson.zip
Comment out lines 190-254 in cartogram_generator/main.cpp
.
Run the following in your terminal.
cd build/
cmake .
make
./cartogram -g ../sample_data/inset_data/russia.geojson -v ../sample_data/inset_data/inset_russia_population.csv
russia_B_cartogram_scaled.geojson
into Mapshaper to view the incorrectly labelled GeoDiv.We frequently use the pattern:
Apply function f() to the exterior ring.
For all internal rings,
apply function f().
Mapshaper uses callback functions of the type: polygon_with_holes.forEachRing(f)
. I think we should adopt Mapshaper's strategy. It is more readable and more maintainable if we decide to switch our internal representation of polygons with holes (e.g. to 2D arrangements).
We often have long type declarations, for example:
std::vector<std::vector<intersection> >
I am wondering whether we should use a typedef for this declaration and, in fact, all other STL containers, similar to the way in which we handle CGAL types. It would shorten our code without running the risk of using
namespaces, which is risky because many different things in our code are called 'map'. Thus, my proposal is to add a header file stl_typedef.h
that contains lines such as:
typedef std::vector<intersection> intersection_vector;
typedef std::vector<intersection_vector> intersection_vector_2;
Implementing a Topology Check with CGAL 2D Arrangements or our own functions may be helpful. Some ideas:
Using contiguity graph and segment intersections.
To speed up above code:
Note:
Above function may also help with topology checks where two straight polylines fully overlap, but one has more points than the other.
Currently we create new copies of polygons if we want to make changes to them (densify, project, simplify etc). It may be faster if simply modify them in place.
This would allow users to see intersections in their input maps, as well as developers to see any intersections incorrectly added by our program.
When we want to make comparisons between two GeoDivs that touch or are nearby, we currently use bounding boxes to filter out far away GeoDivs; checking the contiguity graph and only testing against Adjacent GeoDivs may be faster.
Additionally, a map with the index of all GeoDivs may be helpful to speed up calculations and store GeoDivs in this manner. Perhaps, something like GeoDiv.index(). Thus, the adjacency list could then be stored as integers. This may also help us how with intersection calculation.
We should study and implement the best practices as to what and what not to include in source files compared to header files. The conventions seems to be too include as few include
directives as possible. I have noticed inconsistencies in our code; the convention followed seems to be ambiguous.
These may assist in further reading and rewriting:
https://stackoverflow.com/questions/2596449/including-includes-in-header-file-vs-source-file
https://softwareengineering.stackexchange.com/questions/167723/what-should-and-what-shouldnt-be-in-a-header-file
After the blur width hits zero, we observed that the maximum area error began increasing. The temporary fix, implemented by #93, is to keep the blur width above zero at all times. While it is a short-term fix, we must investigate the root cause of this bug.
Should we consistently adopt British English in all repos related to go-cart.io?
Currently, cartogram_cpp
uses US spelling, whereas the website uses British spelling. Our target audience is international, and our main sources of data are international organisations. The United Nations recommends adopting the spelling in the Concise Oxford English Dictionary (see https://www.un.org/dgacm/en/content/editorial-manual/spelling); thus, much of our data is likely to follow British spelling. It would feel more consistent if we adopt the same rule.
A customised makefile for creating a static build may help us in the future.
Perhaps, something like:
cartogram ../sample_data/belgium.geojson -V ../sample_data/belgium_population2019.csv -e
-V
flag accepts the belgium_population2019.csv
files.-e
flag will output an EPS file.GitHub replaced the naming convention of the "default branch" in all new repositories. Changing the name from master
to main
is likely to reduce confusion in the future. Here are instructions: https://www.git-tower.com/learn/git/faq/git-rename-master-to-main
I was able to finally get our program to compile natively on M1, as well as x86, using LLVM's clang. M1 has a considerable speed advantage when compiling and running our program. Finally, this would allow us to significantly streamline our repository (such that M1 users and other macOS users have the same build instructions). In the future, we'd also be able to create a simple autobuild script such that the installation process would boil down to running one file.
The one caveat is that we would need to remove our references to ranges as LLVM's clang does not support ranges yet (https://llvm.discourse.group/t/where-can-i-see-status-of-c-20-range-views-in-clang/4706). I argue that we should not use the ranges feature as it does not bring much benefit and is only partially supported, that too by very few compilers (GNU gcc-13, at the moment). Further, by using GNU gcc on a Mac, we make ourselves vulnerable to future bugs, as all the libraries we install from Homebrew are built by clang on Macs. Thus, using LLVM clang, and removing the ranges feature would ensure that the libraries we link, too, are built with the same compiler that we use for our program.
Functions in project.cpp
refer to InsetState *inset_state
. It would make sense to move this file into the directory inset_state
and shift the function signatures to inset_state.h
.
std::cerr
and std::cout
are also notoriously slow and may be having an impact on our performance without us realising. We may want to move to proper logging
using any logging library (perhaps, even something that is contained within a single header, like spdlog
: https://github.com/gabime/spdlog).
Currently, URLs in our code that exceed 80 characters are split into two comments. However, this ends up making it tougher to copy paste and visit the URL, or open it directly using an editor.
Indeed, the Google style guide for C++ to mentions that URLs can be above 80 characters:
https://google.github.io/styleguide/cppguide.html#Line_Length
This stackexchange post has some more discussion about it.
https://meta.stackexchange.com/questions/12527/do-i-have-to-worry-about-copyright-issues-for-code-posted-on-stack-overflow
Shall we perhaps consider merging the split URLs into one comment per URL?
The style the following post proposes seems interesting:
https://stackoverflow.com/questions/2527355/using-the-slash-character-in-git-branch-name
For instance, perhapsarea_at
and color_at
may work better, instead of areas_at
of colors_at
.
Right now, there are many files in sample_data/
that were at some point useful for debugging. Let us remove those files and only keep GeoJSONs for real countries with real data and without artificial insets. Also remove topologically invalid GeoJSONs.
There are break
statements in flatten_density()
and print_properties_map()
. There is never a good reason for break
. Instead, the code should be refactored so that the loops with the break
statements become functions that return when the condition for break
is met.
All the other density related functions follow the pattern [verb]_density
(blur_density
, flatten_density
). Thus, would it be more appropriate to rename fill_with_density
to fill_density
@mgastner ?
I came across a bug on the master
branch.
To reproduce this bug, please follow these steps:
git checkout master
Place the attached russia.geojson
(after unzipping) and inset_russia_population.csv
into the sample_data/inset_data/
directory.
inset_russia_modified_population.csv
russia.geojson.zip
Run the following in your terminal.
cd build/
cmake .
make
./cartogram -g ../sample_data/inset_data/russia.geojson -v ../sample_data/inset_data/inset_russia_population.csv
main.cpp
, however, it runs fine. It seems then that this Russia example is getting stuck in holes_inside_polygons()
.On several occasions we access x-coordinates of CGAL points by subsetting with [0]
. It is more semantic to use the x()
method.
The analogous change should be made for y-coordinates.
Does CGAL's polyline simplification rely on the assumption that polylines that are shared by two neighboring polygons also share the same vertices? For example, the following two polygons touch along the line from [10, 1], to [10, 9]:
polygon 1: [[0, 0], [10, 0], [10, 10], [0, 10], [0, 0]]
polygon 2: [[10, 1], [19, 1], [19, 9], [10, 9], [10, 8], [10, 7], [10, 6], [10, 5], [10, 4], [10, 3], [10, 2], [10, 1]]
However, none of the line segments [[10, 1], [10, 2]], ..., [[10, 8], [10, 9]] are in polygon 1. Would this prevent CGAL's algorithm to remove the intermediate vertices?
Although not a bug that has affected any real-life cases, some pathological cases may not have adjacency detected correctly through the current method.
Currently, our code is not very friendly to debug, without using tools like lldb
or gdb
. It may be helpful if we "upgrade" our try-catch blocks to throw nested exceptions with std::nested_exception
. This way, we will be able to create a custom backtrace, whenever our program fails.
We were originally under the assumption that the compiler would automatically optimise (with the -O3 flag, at the very least) us calling functions that return copies of an object's member rather than accessing the member directly. However, I was playing around with optimising some code earlier, and this does not seem to be the case.
For instance, in src/inset_state/contiguity_graph.cpp
(currently on 105; however, since the pull request is not directly related to this issue, I have chosen to not use the # symbol), going from:
geo_divs_[gd_i].adjacent_to(geo_divs()[gd_j].
geo_divs_[gd_j].adjacent_to(geo_divs()[gd_i].id());
geo_divs_[gd_i].adjacent_to(geo_divs_[gd_j].id());
geo_divs_[gd_j].adjacent_to(geo_divs_[gd_i].id());
makes a significant performance difference (a few magnitude even, possibly).
However, geo_divs() simply returns geo_divs_ as such:
const std::vector<GeoDiv> InsetState::geo_divs() const
{
return geo_divs_;
}
Thus, we should no longer hold this assumption and modify any parts of our code that we initially thought were already being optimised by the compiler. We may able to significantly reduce runtime for everything but the Integration itself.
We sometimes use C-style typecasting as in (int)2.5
and sometimes functional typecasting as in int(2.5)
. We should be consistent. Should both be replaced by static_cast<int>(2.5)
?
The code currently has two issues:
Here is the start of the output:
Usage: ./cartogram [options] geometry_file
Positional arguments:
geometry_file File path: GeoJSON file
Optional arguments:
-h --help shows help message and exits [default: false]
-v --version prints version information and exits [default: false]
-v --visual_variable_file File path: CSV file with ID, area, and (optionally) colour
Can the short form of -v --version
be removed?
https://stackoverflow.com/questions/11947587/is-there-a-naming-convention-for-git-repositories
Additionally, a lot of our branches have names with -
in between their words.
cartogram ../sample_data/sandbox/world/world_map_with_antarctica.geojson -tw
Running this, seems to yield the following error for me on the master branch:
❯ cartogram ../sample_data/sandbox/world/world_map_with_antarctica.geojson ../sample_data/sandbox/world/world_gdp_with_antarctica.csv -tw
Using geometry from file ../sample_data/sandbox/world/world_map_with_antarctica.geojson
Using visual variables from file ../sample_data/sandbox/world/world_gdp_with_antarctica.csv
area_field: NA
area_field: NA
area_field: NA
area_field: NA
area_field: NA
area_field: NA
area_field: NA
area_field: NA
area_field: NA
area_field: NA
area_field: NA
area_field: NA
area_field: NA
area_field: NA
area_field: NA
area_field: NA
area_field: NA
area_field: NA
area_field: NA
area_field: NA
area_field: NA
area_field: NA
area_field: NA
area_field: NA
area_field: NA
area_field: NA
area_field: NA
area_field: NA
area_field: NA
area_field: NA
area_field: NA
area_field: NA
area_field: NA
area_field: NA
area_field: NA
Coordinate reference system: urn:ogc:def:crs:OGC:1.3:CRS84
Replacing small target areas.
Turks and Caicos Is.: 9.1755e+08 to 1.8511e+09
Kiribati: 1.96e+08 to 1.8511e+09
Guinea-Bissau: 1.669e+09 to 1.8511e+09
British Virgin Is.: 9.71237e+08 to 1.8511e+09
St. Kitts and Nevis: 1.118e+09 to 1.8511e+09
Palau: 3.24e+08 to 1.8511e+09
Micronesia: 3.89e+08 to 1.8511e+09
Montserrat: 6.20514e+07 to 1.8511e+09
Dominica: 5.9e+08 to 1.8511e+09
Grenada: 1.33e+09 to 1.8511e+09
Nauru: 1.17e+08 to 1.8511e+09
Samoa: 9.6e+08 to 1.8511e+09
Comoros: 7.73e+08 to 1.8511e+09
Sao Tome and Principe: 5.27e+08 to 1.8511e+09
Anguilla: 3.3752e+08 to 1.8511e+09
Tonga: 5.12e+08 to 1.8511e+09
Solomon Is.: 1.607e+09 to 1.8511e+09
Seychelles: 1.739e+09 to 1.8511e+09
Cook Is.: 2.90194e+08 to 1.8511e+09
Marshall Is.: 2.28e+08 to 1.8511e+09
Vanuatu: 9.94e+08 to 1.8511e+09
Sint Maarten: 1.07207e+09 to 1.8511e+09
San Marino: 1.672e+09 to 1.8511e+09
St. Vin. and Gren.: 9.03e+08 to 1.8511e+09
Rescaling to 512-by-256 grid with bounding box
(-2.50663, -1.25331, 2.50663, 1.25331)
Integration number 0
blur_width = 32
[1] 16477 bus error cartogram ../sample_data/sandbox/world/world_map_with_antarctica.geojson -tw
~/Desktop/github/cartogram_cpp/tests master ✘ BUS
❯ cartogram ../sample_data/sandbox/world/world_map_with_antarctica.geojson ../sample_data/sandbox/world/world_gdp_with_antarctica.csv -tw
Using geometry from file ../sample_data/sandbox/world/world_map_with_antarctica.geojson
Using visual variables from file ../sample_data/sandbox/world/world_gdp_with_antarctica.csv
area_field: NA
area_field: NA
area_field: NA
area_field: NA
area_field: NA
area_field: NA
area_field: NA
area_field: NA
area_field: NA
area_field: NA
area_field: NA
area_field: NA
area_field: NA
area_field: NA
area_field: NA
area_field: NA
area_field: NA
area_field: NA
area_field: NA
area_field: NA
area_field: NA
area_field: NA
area_field: NA
area_field: NA
area_field: NA
area_field: NA
area_field: NA
area_field: NA
area_field: NA
area_field: NA
area_field: NA
area_field: NA
area_field: NA
area_field: NA
area_field: NA
Coordinate reference system: urn:ogc:def:crs:OGC:1.3:CRS84
Replacing small target areas.
Turks and Caicos Is.: 9.1755e+08 to 1.8511e+09
Kiribati: 1.96e+08 to 1.8511e+09
Guinea-Bissau: 1.669e+09 to 1.8511e+09
British Virgin Is.: 9.71237e+08 to 1.8511e+09
St. Kitts and Nevis: 1.118e+09 to 1.8511e+09
Palau: 3.24e+08 to 1.8511e+09
Micronesia: 3.89e+08 to 1.8511e+09
Montserrat: 6.20514e+07 to 1.8511e+09
Dominica: 5.9e+08 to 1.8511e+09
Grenada: 1.33e+09 to 1.8511e+09
Nauru: 1.17e+08 to 1.8511e+09
Samoa: 9.6e+08 to 1.8511e+09
Comoros: 7.73e+08 to 1.8511e+09
Sao Tome and Principe: 5.27e+08 to 1.8511e+09
Anguilla: 3.3752e+08 to 1.8511e+09
Tonga: 5.12e+08 to 1.8511e+09
Solomon Is.: 1.607e+09 to 1.8511e+09
Seychelles: 1.739e+09 to 1.8511e+09
Cook Is.: 2.90194e+08 to 1.8511e+09
Marshall Is.: 2.28e+08 to 1.8511e+09
Vanuatu: 9.94e+08 to 1.8511e+09
Sint Maarten: 1.07207e+09 to 1.8511e+09
San Marino: 1.672e+09 to 1.8511e+09
St. Vin. and Gren.: 9.03e+08 to 1.8511e+09
Rescaling to 512-by-256 grid with bounding box
(-2.50663, -1.25331, 2.50663, 1.25331)
Integration number 0
blur_width = 32
[1] 16498 bus error cartogram ../sample_data/sandbox/world/world_map_with_antarctica.geojson -tw
Using bash
scripts may be helpful. Some examples can be found here:
https://stackoverflow.com/questions/20796200/how-to-loop-over-files-in-directory-and-change-path-and-add-suffix-to-filename
The graticules are not directly related to any part of the polygons. Thus, write_polygons_to_eps
becomes much longer.
I updated our testing script to also show the progress bar statements printed to stdout/stderr.
However, I noticed that our progress bar goes backwards sometimes. You may test this out with:
cartogram ../sample_data/usa/usa_with_puerto_rico.geojson ../sample_data/usa/usa_with_puerto_rico_na_test_inset.csv
Perhaps, we could use https://github.com/p-ranav/indicators for our progress bars.
Thereafter, use the this
pointer to reference members.
https://www.geeksforgeeks.org/this-pointer-in-c/
For instance, put getter and setter functions together.
apt
is a more user-friendly way to interact with Ubuntu's Advanced Package Tool than apt-get
, see:
https://askubuntu.com/questions/445384/what-is-the-difference-between-apt-and-apt-get
We should update README.md accordingly.
This can be applied to inset_state.h
and cartogram_info.h
. Otherwise, looking at the header file does not easily convey what all the class's purpose is.
At present, it is difficult to determine the meaning of the command-line flags. Should we at least sort them alphabetically (e.g. 'A', 'C', 'D', ..., 'P', 'd', 'g', ..., 'w')?
Would it not be better to have this hidden in the control flow? Perhaps have cartogram_info call a member function that updates the non-positive target areas in the background? My reason for this is because we do not require min_positive_area or replacement_for_nonpositive_area anywhere else. Of course, progress would still remain in the same scope.
Further, the wording on line 203 "Replacing zero target area with [replacement_for_nonpositive_area] times the minimum non-positive area" may possibly be wrong; the wording suggests that the newly assigned area is also non-positive (because we multiply X times the non-positive area). Would "Replacing 0 target area with [replacement_for_nonpositive_area] (0.1 times the minimum positive area)." work better?
Instead of
Point p1(10, 20);
p1 = Point(p1.x() + 10, p1.y() + 10);
We could use vectors as they support addition to Points with the +
operator.
https://doc.cgal.org/latest/Kernel_23/classCGAL_1_1Point__2.html
https://doc.cgal.org/latest/Kernel_23/classCGAL_1_1Vector__2.html
A declarative, efficient, and flexible JavaScript library for building user interfaces.
🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.
TypeScript is a superset of JavaScript that compiles to clean JavaScript output.
An Open Source Machine Learning Framework for Everyone
The Web framework for perfectionists with deadlines.
A PHP framework for web artisans
Bring data to life with SVG, Canvas and HTML. 📊📈🎉
JavaScript (JS) is a lightweight interpreted programming language with first-class functions.
Some thing interesting about web. New door for the world.
A server is a program made to process requests and deliver data to clients.
Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.
Some thing interesting about visualization, use data art
Some thing interesting about game, make everyone happy.
We are working to build community through open source technology. NB: members must have two-factor auth.
Open source projects and samples from Microsoft.
Google ❤️ Open Source for everyone.
Alibaba Open Source for everyone
Data-Driven Documents codes.
China tencent open source team.