arfc / webscraping

scraping the web for international reactor data

License: Creative Commons Attribution 4.0 International

Python 100.00%

webscraping's Introduction

Webscraping

Yukun (Tifa) Tan's SPIN project.

Updated by ARFC.

Scrapes reactor coordinates from Wikipedia.

License

This is under a CC-BY license.

How-to-use

Run: python scraping_wikidata.py

Output File: coordinates.sqlite
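
To check what the script wrote, one option is to list the tables in the output file. This is a minimal sketch using only the standard library; the helper name list_tables is hypothetical and not part of the repository, and the table names inside coordinates.sqlite are not stated in this README.

```python
import sqlite3


def list_tables(db_path):
    """Return the names of all tables in a SQLite file, sorted by name."""
    # Note: sqlite3.connect creates an empty file if db_path does not exist.
    conn = sqlite3.connect(db_path)
    try:
        rows = conn.execute(
            "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
        ).fetchall()
    finally:
        conn.close()
    return [name for (name,) in rows]
```

After running the script, calling list_tables("coordinates.sqlite") should show which tables are available to query.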

webscraping's People

katyhuff ytan15 nsryan2

webscraping's Issues

Cannot put the new pandas dataframe into sqlite table

Hi Dr. @katyhuff! As mentioned earlier today, it seems like the code

CREATE TABLE testTable(
        'index', 'Name' TEXT, 'Coord' TEXT, 'Long' REAL, 'Lat' REAL
    )
    '''

(lines 84-86 in https://github.com/ytan15/webscraping/blob/master/scraping_wikidata.py)
is not working. The error message says "probably unsupported type".

Earlier it was

CREATE TABLE testTable(
        'index', 'Name' OBJECT, 'Coord' OBJECT
    )
    '''

and it was working fine.
Can you please take a look at it? Thanks!
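
One common cause of a "probably unsupported type" message is binding a value that sqlite3 cannot adapt, such as a tuple or list stored in a DataFrame cell. Whether that is the cause here cannot be confirmed from the snippet alone. The sketch below assumes each record carries a (latitude, longitude) pair and flattens it into plain Python floats before inserting. The helper names to_row and load_rows and the sample records are hypothetical placeholders, not real reactor data.

```python
import sqlite3


def to_row(index, name, coord):
    """Flatten one record into plain Python types that sqlite3 can bind.

    Assumption: coord is a (latitude, longitude) pair of numbers or numeric strings.
    """
    lat, lon = (float(value) for value in coord)
    # Column order matches the table: index, Name, Coord, Long, Lat.
    return (int(index), str(name), f"{lat}, {lon}", lon, lat)


def load_rows(conn, records):
    """Create testTable if needed and insert the flattened records."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS testTable("
        '"index" INTEGER, "Name" TEXT, "Coord" TEXT, "Long" REAL, "Lat" REAL)'
    )
    conn.executemany(
        "INSERT INTO testTable VALUES (?, ?, ?, ?, ?)",
        [to_row(*record) for record in records],
    )
    conn.commit()


conn = sqlite3.connect(":memory:")
# Placeholder records for illustration only.
load_rows(conn, [(0, "Example A", (40.0, -88.0)), (1, "Example B", ("35.5", "139.5"))])
```

Converting to int, str, and float up front means every bound parameter is a type sqlite3 handles natively, regardless of what the DataFrame stored.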

I'm submitting a ...

feature request

Expected Behavior

There should be a LICENSE file that covers the entire repository.

Actual Behavior

There is a subsection in the README.md file that claims the repository is under CC-BY, and I'm not sure if that's adequate.

How can this issue be closed?

First, a determination has to be made about the type of License under which to cover this repository.
Second, that License should then be added through a pull request to this repository.

Webscrape results missing data

The results of Webscrape are missing some reactors that are currently shut down.
This is especially the case for reactors outside the United States.

Next step, if you're getting frustrated with wikipedia: check out the PRIS database. If necessary, @jbae11 can help with understanding it and perhaps would be willing to show you what he's done so far.

Some, but not all, of the information we want should be in that database.

Scrape from Wikidata and Wikipedia

Hi @ytan15 ! Sorry it took a while to describe this goal. Let's see what you can get into your sqlite3 file out of wikipedia alone. Here are some tutorials on scraping wikipedia and wikidata:

You ought to be able to find names and locations of reactors. You may also be able to find some of the other columns as well. Let us know what you can find!
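
As a rough starting point for names and locations, the Wikidata SPARQL endpoint can be queried over HTTP. This is a sketch, not the repository's script. It assumes wd:Q80877 is the Wikidata item for "nuclear reactor" (worth verifying before use), uses P625 (coordinate location), and relies on the endpoint returning coordinates as "Point(longitude latitude)" text. The function names and the User-Agent string are hypothetical.

```python
import json
import re
import urllib.parse
import urllib.request

ENDPOINT = "https://query.wikidata.org/sparql"

# Assumption: wd:Q80877 identifies "nuclear reactor"; check the item page before relying on it.
QUERY = """
SELECT ?reactor ?reactorLabel ?coord WHERE {
  ?reactor wdt:P31/wdt:P279* wd:Q80877 ;
           wdt:P625 ?coord .
  SERVICE wikibase:label { bd:serviceParam wikibase:language "en". }
}
"""

POINT = re.compile(r"Point\(([-+0-9.eE]+) ([-+0-9.eE]+)\)")


def parse_point(wkt):
    """Convert 'Point(lon lat)' text into a (lat, lon) tuple of floats."""
    match = POINT.fullmatch(wkt.strip())
    if match is None:
        raise ValueError(f"unexpected coordinate format: {wkt!r}")
    lon, lat = (float(group) for group in match.groups())
    return lat, lon


def fetch_bindings(query=QUERY):
    """Run the query against the public endpoint (requires network access)."""
    url = ENDPOINT + "?" + urllib.parse.urlencode({"query": query, "format": "json"})
    request = urllib.request.Request(
        url, headers={"User-Agent": "reactor-scraper-example/0.1"}
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)["results"]["bindings"]


def to_records(bindings):
    """Turn SPARQL JSON bindings into (name, lat, lon) tuples."""
    records = []
    for binding in bindings:
        lat, lon = parse_point(binding["coord"]["value"])
        records.append((binding["reactorLabel"]["value"], lat, lon))
    return records
```

Calling to_records(fetch_bindings()) would yield tuples ready to insert into a sqlite3 table. Note that longitude comes first in the coordinate text, which is why parse_point swaps the order.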

First Task: Create Sqlite file

This should be a mostly empty database.

Create a python script
In the python script, import and use the sqlite3 package to create a "reactors.sqlite" file.
This file should have one table, called reactors, and should have the following columns:

ID, Name, Lat, Long, Institution, Country, Type, Fuel, Enrichment, Electrical Capacity, Thermal Capacity, Thermal Efficiency, Capacity Factor
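
The task above could be sketched as follows. The column names come from the list in this issue; the SQL types and the choice of ID as the primary key are assumptions, since the issue does not specify them. The function name create_reactors_db is hypothetical.

```python
import sqlite3

# Column names are taken from the issue; the types are assumptions.
COLUMNS = [
    ("ID", "INTEGER PRIMARY KEY"),
    ("Name", "TEXT"),
    ("Lat", "REAL"),
    ("Long", "REAL"),
    ("Institution", "TEXT"),
    ("Country", "TEXT"),
    ("Type", "TEXT"),
    ("Fuel", "TEXT"),
    ("Enrichment", "REAL"),
    ("Electrical Capacity", "REAL"),
    ("Thermal Capacity", "REAL"),
    ("Thermal Efficiency", "REAL"),
    ("Capacity Factor", "REAL"),
]


def create_reactors_db(path="reactors.sqlite"):
    """Create the database file with an empty reactors table and return the connection."""
    conn = sqlite3.connect(path)
    # Double quotes allow column names that contain spaces.
    column_sql = ", ".join(f'"{name}" {sql_type}' for name, sql_type in COLUMNS)
    conn.execute(f"CREATE TABLE IF NOT EXISTS reactors ({column_sql})")
    conn.commit()
    return conn
```

Calling create_reactors_db() with no arguments writes reactors.sqlite in the current directory. Names with spaces must be double-quoted in later queries, so renaming them with underscores may be more convenient.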

Need some nuclear knowledge boost & clarification of our goals

Hi Dr. @katyhuff! I've been looking into scraping from Wikidata, and I think I've grasped the gist of it. So I started to expand on scraping_wikidata.py, trying to find more information. However, I've run into a few points of confusion:

From my understanding, a nuclear reactor is the energy source of a nuclear power plant, and a nuclear power plant is the key facility of a nuclear power station. Is that right? Are we looking for nuclear power plants, nuclear power stations, or nuclear reactors?
I was trying to find the country of a nuclear reactor by adding the line

 ?reactors wdt:P17 ?country .

into the query. However, the problem is that not all nuclear reactors have a country attribute (because in Wikidata "country" means "sovereign state of this item"). For example, Bhabha Atomic Research Centre (https://www.wikidata.org/wiki/Q854682) doesn't have a "country" attribute, although from the description we know that it is based in India. So if I add this line to the query, it will filter out this entry (Bhabha Atomic Research Centre). Is this something that we should be concerned about?

Although I haven't started on Wikipedia yet, I'm a bit concerned that querying both Wikidata and Wikipedia might produce overlapping results, since Wikidata stores Wikipedia's data.
If Wikidata contains Wikipedia's data, how come there is a "Category: Nuclear power reactor types" in Wikipedia, but not in Wikidata? Am I misunderstanding something?

Would you mind discussing these issues with me? Either on here, or I could make an appointment with you if you feel it would be easier to talk in person.
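
On the country question, one possible approach is to wrap the P17 triple in a SPARQL OPTIONAL block, so that items without a country are kept and simply have no value for it. The sketch below assumes wd:Q80877 is the Wikidata item for "nuclear reactor" (worth verifying) and that unbound variables are omitted from each row of the JSON results. The helper name country_of and the sample rows are hypothetical.

```python
# Assumption: wd:Q80877 identifies "nuclear reactor"; check the item page before relying on it.
QUERY_WITH_OPTIONAL_COUNTRY = """
SELECT ?reactors ?reactorsLabel ?countryLabel WHERE {
  ?reactors wdt:P31/wdt:P279* wd:Q80877 .
  OPTIONAL { ?reactors wdt:P17 ?country . }
  SERVICE wikibase:label { bd:serviceParam wikibase:language "en". }
}
"""


def country_of(binding):
    """Return the country label for one result row, or None if it is unbound."""
    cell = binding.get("countryLabel")
    return cell["value"] if cell is not None else None


# Placeholder rows shaped like SPARQL JSON bindings, for illustration only.
sample_bindings = [
    {"reactorsLabel": {"value": "Example A"}, "countryLabel": {"value": "India"}},
    {"reactorsLabel": {"value": "Example B"}},
]
countries = [country_of(binding) for binding in sample_bindings]
```

A None country can then be stored as NULL in the sqlite table and filled in later from another source.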