taken from https://raw.githubusercontent.com/msyamkumar/cs220-f20-projects/master/p12/README.md
Before you start the following assignment, you need to have some understanding of Pandas. You can start off here, or, if you feel ready to jump into something a bit faster paced, use this cheat sheet here.
In this project, you will
- Gain more experience with reading and writing files
- Practice using a linter with your code
- Practice using Pandas with Python
- Practice creating DataFrames
For this project, you're going to analyze the whole world!
Specifically, you're going to study various statistics for 174 countries, answering questions such as: What is the correlation between a country's literacy rate and GDP?
To start, clone the repository. git clone github.com/iancam/pandas1
You'll do all your work in main.ipynb.
For this project, you'll be using one large JSON file with statistics from about 174 countries. We found them here. You will also extract data from a snapshot of this page.
To get started, open up main.ipynb and run the first cell of the notebook, and then the second. Great! You should now see our two data tables.
Some of the columns in the data require a little extra explanation:
- Area: measured in square miles
- Coastline: Ratio of coast to area
- Birth-rate: Births per 1000 people per year
- Death-rate: Deaths per 1000 people per year
- Infant-mortality: Deaths per 1000 births per year
- Literacy: (out of 100%)
- Phones: Number of phones per 1000 people
DISCLAIMER: This data is probably wrong in a lot of ways. We're using it to practice pandas.
All the right answers can be found in test.py. If you're wondering if you did the right thing, take a look over there.
Hint: Review how to extract a single column as a Series from a
DataFrame. You can add all the values in a Series with the .sum()
method.
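As a rough illustration of that hint, the sketch below pulls one column out as a Series and sums it. The DataFrame contents and the `population` column name are made up for the example and are not taken from the project data.

```python
import pandas as pd

# Hypothetical stand-in for the countries DataFrame; the values are made up.
countries = pd.DataFrame({
    "country": ["A", "B", "C"],
    "population": [100, 250, 50],
})

# Indexing with a single column name returns a Series.
population = countries["population"]

# .sum() adds up every value in the Series.
total = population.sum()
print(total)
```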
Use the capitals and countries DataFrames to answer the following questions.
Answer with an alphabetically-sorted Python list.
Hint: you can use fancy indexing to extract the row where the country equals "Italy". Then, extract the capital Series, from which you can grab the only value using next(iter(...)).
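A minimal sketch of that lookup, assuming the DataFrame has columns named `country` and `capital` (the tiny two-row table here is only a stand-in for the real capitals data):

```python
import pandas as pd

# Small stand-in for the capitals DataFrame; column names are assumed.
capitals = pd.DataFrame({
    "country": ["Italy", "France"],
    "capital": ["Rome", "Paris"],
})

# A boolean Series that is True only on the row for Italy.
mask = capitals["country"] == "Italy"

# Indexing with the mask keeps just the matching row(s).
italy_rows = capitals[mask]

# The capital column of that one-row DataFrame is a one-element Series.
capital_series = italy_rows["capital"]

# next(iter(...)) grabs the first (and only) value regardless of its index label.
italy_capital = next(iter(capital_series))
print(italy_capital)
```

Using `next(iter(...))` avoids having to know the row's index label, which is usually not 0 after filtering.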
Produce a Python list of the 7, with southernmost first.
Hint: look at the documentation examples of how to sort a DataFrame with the sort_values function.
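For instance, sorting by a latitude column puts the southernmost rows first, since `sort_values` sorts in ascending order by default. The sketch below uses placeholder capital names and made-up latitudes, and assumes columns named `capital` and `latitude`.

```python
import pandas as pd

# Placeholder rows; names and latitudes are made up for illustration.
capitals = pd.DataFrame({
    "capital": ["P", "Q", "R"],
    "latitude": [10.0, -35.5, 48.2],
})

# Ascending sort on latitude: the most negative (southernmost) value comes first.
southern_first = capitals.sort_values(by="latitude")

# Take the first two rows and convert the capital column to a plain list.
names = list(southern_first["capital"].head(2))
print(names)
```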
The expected output should be sorted according to the distance, with the closest capital at the start of the list.
A "land-locked" country is one that has zero coastline. Smallest is in terms of area.
A "coastal" country is one that has non-zero coastline. Largest is in terms of area.
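These two definitions translate into a filter followed by a sort. The sketch below is one possible way to do it, using made-up rows and assuming columns named `country`, `area`, and `coastline`.

```python
import pandas as pd

# Made-up rows; column names are assumptions about the countries DataFrame.
countries = pd.DataFrame({
    "country": ["A", "B", "C", "D"],
    "area": [500, 120, 900, 300],
    "coastline": [0.0, 0.0, 1.5, 0.2],
})

# Land-locked: zero coastline. Sort by area and take the first row.
landlocked = countries[countries["coastline"] == 0]
smallest_landlocked = landlocked.sort_values(by="area").iloc[0]["country"]

# Coastal: non-zero coastline. Sort by area descending and take the first row.
coastal = countries[countries["coastline"] != 0]
largest_coastal = coastal.sort_values(by="area", ascending=False).iloc[0]["country"]

print(smallest_landlocked, largest_coastal)
```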
This isn't related to countries, but it's a good warmup for the next problems. Your answer should be about 1.4339 miles.
Assumptions:
- The latitude/longitude of Randall Stadium is 43.070231, -89.411893
- The latitude/longitude of the Wisconsin Capitol is 43.074645, -89.384113
- Use the Haversine formula: http://www.movable-type.co.uk/scripts/gis-faq-5.1.html
- The radius of the earth is 3956 miles
- You should answer in miles
If you find code online that computes the Haversine distance for you, great! You are allowed to use it as long as (1) it works and (2) you cite the source with a comment. Note that we won't help you troubleshoot Haversine functions you didn't write yourself during office hours, so if you want help, you should start from scratch on this one.
If you decide to implement it yourself (it's fun!), here are some tips:
- Review the formula: http://www.movable-type.co.uk/scripts/gis-faq-5.1.html
- Remember that latitude and longitude are in degrees, but sin, cos, and other Python math functions usually expect radians. Consider math.radians.
- This means that before you do anything with the longitudes and latitudes, make sure to convert them to radians as your FIRST STEP.
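As a starting point, here is a minimal sketch of the Haversine computation under the assumptions listed above. The function name `haversine_miles` is made up for illustration, and the body is the standard Haversine formula rather than code from the course. If you adapt code from anywhere, the project still asks you to cite the source in a comment.

```python
import math

EARTH_RADIUS_MILES = 3956  # radius given in the assumptions above


def haversine_miles(lat1, lon1, lat2, lon2):
    """Great-circle distance in miles between two points given in degrees."""
    # First step: convert every coordinate from degrees to radians.
    lat1, lon1, lat2, lon2 = map(math.radians, (lat1, lon1, lat2, lon2))

    dlat = lat2 - lat1
    dlon = lon2 - lon1

    # Haversine formula.
    a = (math.sin(dlat / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin(dlon / 2) ** 2)
    # min(1.0, ...) guards against tiny floating-point overshoot before asin.
    c = 2 * math.asin(min(1.0, math.sqrt(a)))

    return EARTH_RADIUS_MILES * c


# Randall Stadium to the Capitol, using the coordinates from the assumptions.
distance = haversine_miles(43.070231, -89.411893, 43.074645, -89.384113)
print(round(distance, 4))
```

The printed value should land close to the 1.4339 miles mentioned above.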
For the coordinates of a country, use its capital.
Your result should be a DataFrame with 3 rows (for each country) and 3 columns (again for each country). The value in each cell should be the distance between the country of the row and the country of the column. For a general idea of what this should look like, open the expected.html file you downloaded. When displaying the distance between a country and itself, the table should display NaN (instead of 0).
Your result should be a table with 24 rows (for each country) and 24 columns (again for each country). The value in each cell should be the distance between the country of the row and the country of the column. For a general idea of what this should look like, open the expected.html file you downloaded. When displaying the distance between a country and itself, the table should display NaN (instead of 0).
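One way to build such a table is to start from a DataFrame filled with NaN and overwrite only the off-diagonal cells. The sketch below assumes a Haversine helper (here called `haversine_miles`, a made-up name) and uses three placeholder locations with made-up coordinates; the `latitude` and `longitude` column names are also assumptions.

```python
import math
import pandas as pd


def haversine_miles(lat1, lon1, lat2, lon2, radius=3956):
    """Standard Haversine distance in miles; inputs are in degrees."""
    lat1, lon1, lat2, lon2 = map(math.radians, (lat1, lon1, lat2, lon2))
    a = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return radius * 2 * math.asin(min(1.0, math.sqrt(a)))


# Placeholder locations indexed by name; the coordinates are made up.
coords = pd.DataFrame(
    {"latitude": [0.0, 0.0, 10.0], "longitude": [0.0, 10.0, 0.0]},
    index=["A", "B", "C"],
)

names = list(coords.index)

# Start with every cell set to NaN so the diagonal never needs special handling.
distances = pd.DataFrame(float("nan"), index=names, columns=names)

for row in names:
    for col in names:
        if row != col:  # leave the distance from a place to itself as NaN
            distances.loc[row, col] = haversine_miles(
                coords.loc[row, "latitude"], coords.loc[row, "longitude"],
                coords.loc[col, "latitude"], coords.loc[col, "longitude"],
            )

print(distances)
```

The same loop works for any number of rows, so it should scale from the 3-country table to the 24-country one.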
This is the country that has the shortest average distance to other countries in North America.
Hint 1: Check out the following Pandas functions:
- DataFrame.mean
- Series.sort_values (note this is not the same as the DataFrame.sort_values function you've used before)
Hint 2: A Pandas Series contains indexed values. If you have a Series s and you want just the values, you can use s.values; if you want just the index, you can use s.index. Both of these objects can readily be converted to lists.
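Putting the two hints together might look like the sketch below. The distance table is a made-up 3x3 example with NaN on the diagonal; `DataFrame.mean` skips NaN by default, so each column average only covers the other places.

```python
import pandas as pd

nan = float("nan")

# Made-up symmetric distance table with NaN on the diagonal.
distances = pd.DataFrame(
    {"A": [nan, 10.0, 40.0], "B": [10.0, nan, 20.0], "C": [40.0, 20.0, nan]},
    index=["A", "B", "C"],
)

# Column-wise mean; the result is a Series indexed by column name.
averages = distances.mean()

# Series.sort_values needs no "by" argument because there is only one set of values.
ranked = averages.sort_values()

names = list(ranked.index)    # labels, smallest average first
values = list(ranked.values)  # the averages themselves

most_central = names[0]    # shortest average distance
least_central = names[-1]  # largest average distance
print(most_central, least_central)
```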
This one has the largest average distance to other countries.
The answer should be in a table with countries as the index and two columns: nearest will contain the name of the nearest country and distance will contain the distance to that nearest country.
Hint 1: Find a Series of numerical data you can experiment with (perhaps from one of the DataFrames you've been using for this project). If your Series is named s, try running s.min(). Unsurprisingly, this returns the smallest value in the Series. Now try running s.idxmin(). What does it seem to be doing?
Hint 2: If you run df.min() on a DataFrame, Pandas applies that function to every column Series in the DataFrame. The returned value is a Series. The index of the returned Series contains the columns of the DataFrame, and the values of the returned Series contain the minimum values. If you run df.idxmin() on a DataFrame, the returned values contain indexes from the DataFrame.
Hint 3: If you get an error message about dtypes when running idxmin, make sure your DataFrame contains only floats (use df.astype(float) if necessary).
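To see how `min` and `idxmin` combine into a two-column table, here is a small sketch on a made-up distance table. The column names `nearest` and `distance` follow the description above; everything else is a placeholder.

```python
import pandas as pd

nan = float("nan")

# Made-up symmetric distance table with NaN on the diagonal.
distances = pd.DataFrame(
    {"A": [nan, 10.0, 40.0], "B": [10.0, nan, 20.0], "C": [40.0, 20.0, nan]},
    index=["A", "B", "C"],
).astype(float)  # make sure every column is float before calling idxmin

# For each column, idxmin gives the row label where the minimum occurs,
# and min gives the minimum value itself. NaN cells are skipped.
nearest = pd.DataFrame({
    "nearest": distances.idxmin(),
    "distance": distances.min(),
})
print(nearest)
```

Swapping in `idxmax` and `max` gives the analogous table for the furthest place.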
The answer should be in a table with countries as the index and two columns: furthest will contain the name of the furthest country and distance will contain the distance to that furthest country.
#Q19: For birth-rate and death-rate, what are various summary statistics (e.g., mean, max, standard deviation, etc.)?
Format: Use the describe function on a DataFrame that contains birth-rate and death-rate columns. You may include summary statistics for other columns in your output, as long as your summary table has stats for birth-rate and death-rate.
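A quick sketch of what `describe` returns, using made-up numbers in columns named `birth-rate` and `death-rate` (the real values come from the countries data):

```python
import pandas as pd

# Made-up rates for three placeholder countries.
countries = pd.DataFrame({
    "birth-rate": [10.0, 20.0, 30.0],
    "death-rate": [5.0, 7.0, 9.0],
})

# Select the two columns of interest, then summarize them.
summary = countries[["birth-rate", "death-rate"]].describe()
print(summary)

# Individual statistics can be looked up by row label and column name.
mean_birth = summary.loc["mean", "birth-rate"]
max_death = summary.loc["max", "death-rate"]
```

The rows of the summary are count, mean, std, min, the three quartiles, and max.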
Very often, you don't have data in a nice JSON format like capitals.json. Instead, data needs to be scraped from a webpage and requires some cleanup.
This is a long but fun exercise where we will do the same by scraping this webpage: http://techslides.com/list-of-countries-and-capitals.
Our capitals.json file was created from this same webpage.
You need to write the code to create the capitals.json file from this table yourself.
Start by installing BeautifulSoup using pip, as discussed in class (learn how to install from lecture slides).
Then call download('capitals.html', 'https://raw.githubusercontent.com/msyamkumar/cs220-f20-projects/master/p12/techslides-snapshot.html') to download the webpage. Note that this code is not downloading the original webpage, but a snapshot of it (this is to avoid creating excessive load on their servers). You can open capitals.html and make sure that this page looks fine.
Now do the following:
- Read from capitals.html and use BeautifulSoup to convert the HTML text to soup.
- Find the table containing the data (Hint: .find() or .find_all() methods can be used).
- Find all the rows in the table (Note: rows are inside the 'tr' HTML tag and data is in the 'td' tag).
- Create a dictionary containing country name, capital and location coordinates, and then create a list of dictionaries for all the countries.
- Careful! This web page has more countries than countries.json. We will ignore the countries that are not in that file. You need to filter and keep only the 174 countries whose names also appear in countries.json. The column names should be consistent with the original capitals.json.
- Save this list into a file titled my_capitals.json. You can use the json.dump() method.
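The steps above can be sketched roughly as follows. This is only an outline under several assumptions: the inline HTML is a tiny made-up stand-in for capitals.html, the column order (country, capital, latitude, longitude) is a guess about the real table, the key names are a guess about capitals.json, and `known_countries` stands in for the names you would load from countries.json. The file is written to a temporary directory here so the sketch does not touch your real my_capitals.json.

```python
import json
import os
import tempfile

from bs4 import BeautifulSoup

# Made-up stand-in for the contents of capitals.html (column order is assumed).
html = """
<table>
  <tr><td>Country Name</td><td>Capital Name</td><td>Latitude</td><td>Longitude</td></tr>
  <tr><td>Country A</td><td>City A</td><td>10.5</td><td>-20.25</td></tr>
  <tr><td>Country B</td><td>City B</td><td>-5.0</td><td>30.0</td></tr>
</table>
"""

# Stand-in for the set of country names that appear in countries.json.
known_countries = {"Country A"}

soup = BeautifulSoup(html, "html.parser")
table = soup.find("table")

rows = []
for tr in table.find_all("tr")[1:]:  # skip the header row
    cells = [td.get_text(strip=True) for td in tr.find_all("td")]
    if len(cells) < 4 or cells[0] not in known_countries:
        continue  # ignore malformed rows and countries we do not track
    rows.append({
        "country": cells[0],
        "capital": cells[1],
        "latitude": float(cells[2]),   # store coordinates as floats
        "longitude": float(cells[3]),
    })

# Save the list of dictionaries as JSON, then read the text back.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "my_capitals.json")
    with open(path, "w") as f:
        json.dump(rows, f)
    with open(path, "r") as f:
        saved_text = f.read()

print(saved_text)
```

When working with the real file, check the actual header row and cell order in capitals.html before relying on positional indexes like `cells[2]`.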
Your answer will be the data from your constructed my_capitals.json file, read in using open("my_capitals.json", "r").read().
This answer should look like capitals.json; however, this time you have parsed the data using BeautifulSoup. Note that the data type of the latitude and longitude columns should be float.
After you add your name and the name of your partner to the notebook in the first cell, please remember to Kernel->Restart and Run All to check for errors, then run the test.py script one more time before submission. To keep your code concise, please remove your own testing code that does not influence the correctness of answers.
Cheers!