Comments (13)
The scraper is working and can write to a file within the project. Its flow is:
- Make a POST to http://sanjoseca.gov/Facilities/Facility/Search, with the complete query being something like http://sanjoseca.gov/Facilities/Facility/Search?featureIDs=&categoryIDs=15&occupants=null&keywords=&pageSize=100&pageNumber=1&sortBy=3&currentLatitude=null&currentLongitude=null&isReservableOnly=false. This returns HTML listing artworks.
- Have the scraper follow the link to each individual artwork.
- Have the scraper assign data to categories such as "artist", "description"...
- Iterate through the data and clean up its formatting.
- Save to file.
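A minimal sketch of the first step, assuming the search endpoint accepts the query parameters exactly as quoted in the example query. `buildSearchUrl` is a hypothetical helper name, and the parameter values are copied from that example rather than from any documentation of the site:

```javascript
// Hypothetical helper: rebuilds the search URL from the example query.
const SEARCH_ENDPOINT = "http://sanjoseca.gov/Facilities/Facility/Search";

function buildSearchUrl(pageNumber, pageSize = 100) {
  // URLSearchParams keeps insertion order, so the output matches the example query.
  const params = new URLSearchParams({
    featureIDs: "",
    categoryIDs: "15", // value taken from the example query
    occupants: "null",
    keywords: "",
    pageSize: String(pageSize),
    pageNumber: String(pageNumber),
    sortBy: "3",
    currentLatitude: "null",
    currentLongitude: "null",
    isReservableOnly: "false",
  });
  return `${SEARCH_ENDPOINT}?${params.toString()}`;
}

// Usage (not executed here): fetch(buildSearchUrl(1), { method: "POST" })
```

Paging through the results would then just mean calling `buildSearchUrl(2)`, `buildSearchUrl(3)`, and so on until a page comes back empty.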
The cleanup is not straightforward because the HTML is inconsistent from one artwork page to the next. Two examples:
<div class="editorContent">
<font class="Subhead1">1737 Trees</font>
<font class="Subhead2">
Artist: Angela Buenning Filo<br>
<font class="Normal">2006</font><br>
</font>
</div>
<div class="editorContent">
<div class="Normal" style="text-align: left;">
<font class="Subhead1">
8 Minutes<em><br></em>
<font class="Subhead2">
Artists: Merge Conceptual Design (Franka Diehnelt and Claudia Reisenberger)
</font><br>
</font>
2013
</div>
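One way to sidestep the inconsistent nesting is to ignore the tag structure entirely and work on the text lines that remain once the tags are stripped. The sketch below does that for the two snippets above; it assumes the artist line always starts with "Artist:" or "Artists:" and that the year sits alone on its own line, which may not hold for every page. `parseArtworkDetails` is a hypothetical name, not the scraper's actual function.

```javascript
// Hypothetical helper: pulls title, artist and year out of a details snippet
// by looking at text lines instead of relying on the tag nesting.
function parseArtworkDetails(html) {
  const lines = html
    .replace(/<[^>]*>/g, "\n") // turn every tag into a line break
    .split("\n")
    .map((line) => line.trim())
    .filter((line) => line.length > 0);

  const details = { title: null, artist: null, year: null };
  for (const line of lines) {
    const artistMatch = line.match(/^Artists?:\s*(.+)$/i);
    if (artistMatch) {
      if (details.artist === null) details.artist = artistMatch[1];
    } else if (/^\d{4}$/.test(line)) {
      // A line that is only four digits is treated as the year.
      if (details.year === null) details.year = Number(line);
    } else if (details.title === null) {
      // First remaining line is assumed to be the title.
      details.title = line;
    }
  }
  return details;
}

const sampleA = `<div class="editorContent">
<font class="Subhead1">1737 Trees</font>
<font class="Subhead2">
Artist: Angela Buenning Filo<br>
<font class="Normal">2006</font><br>
</font>
</div>`;

const sampleB = `<div class="editorContent">
<div class="Normal" style="text-align: left;">
<font class="Subhead1">
8 Minutes<em><br></em>
<font class="Subhead2">
Artists: Merge Conceptual Design (Franka Diehnelt and Claudia Reisenberger)
</font><br>
</font>
2013
</div>`;

console.log(parseArtworkDetails(sampleA));
// → { title: '1737 Trees', artist: 'Angela Buenning Filo', year: 2006 }
console.log(parseArtworkDetails(sampleB).title); // → 8 Minutes
```

Pages where a field comes back as `null` could be logged and cleaned up by hand.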
I'll think some more on how to grab this data without too much hassle.
By the way, how were the geolocations computed in art.js?
For the data being scraped, is the idea to use the postal address to determine the lat and long coordinates?
from heartofthevalley.
Got 60 records from the spreadsheet, researched and found additional information from the SJC website, and added it to the spreadsheet.
@amygcho will work on scraping and parsing data about City sponsored public art from http://sanjoseca.gov/facilities, and adding data to art.js
@JMStudiosJoe @amygcho added start code on branch
@ychoy @JMStudiosJoe ...
@ctram awesome job. I would assume we take the postal address and convert it to lon/lat. From what I could tell, every public art link title started with "Public Art:", and it looks to be the same with "Artist"? Please let me know if you need more help on this and I'll do what I can.
@JMStudiosJoe I am able to save the address, but I'm having trouble neatly getting the details (artist, title, etc.) under their proper labels because of the inconsistent HTML structure. Please take a look if you have time. I'll scrape all 200 or so pages later; if the number of exceptions is reasonable, it might be worth it to just clean up the oddballs manually.
I'm currently writing the data as JSON, so a future task is to inject that data into the map.
Where did the current data come from, and how did it get into JS object format in the art.js file?
@ctram Current data came from a spreadsheet (outdated) and from @ychoy manually entering data. I have not gone into the art.js file; I have mainly been going after that web scraper.
@JMStudiosJoe @ychoy To check, am I OK to use the MapBox API key to generate the geolocation based on postal address?
@JMStudiosJoe @ychoy I believe we can make X amount of API requests per month before they start charging someone's card? : ]
@ctram yes, the MapBox API should be good to use, and this won't make that many requests per month.
@ctram, thanks for working on the scraper! Once you start putting the scraped data into art.js, there may be duplicate information: I think we got about 60 records from the City's website into art.js. It's okay to overwrite what I have and just take the information you get from the City's website.
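A rough sketch of that overwrite rule, assuming duplicates can be detected by comparing titles after trimming and lowercasing. The thread does not say how duplicates should be matched, so the title key and the names `normalizeTitle` and `mergeFeatures` are assumptions for illustration only:

```javascript
// Assumption: two records describe the same artwork when their titles match
// after trimming, lowercasing and collapsing whitespace.
function normalizeTitle(title) {
  return title.trim().toLowerCase().replace(/\s+/g, " ");
}

// Hypothetical helper: scraped records replace existing ones with the same title.
function mergeFeatures(existing, scraped) {
  const byTitle = new Map();
  for (const feature of existing) {
    byTitle.set(normalizeTitle(feature.properties.title), feature);
  }
  // Written second, so the City website data wins on a title clash.
  for (const feature of scraped) {
    byTitle.set(normalizeTitle(feature.properties.title), feature);
  }
  return Array.from(byTitle.values());
}
```

Entries that only exist in the current art.js are kept as they are, and titles that differ by more than case or spacing would still need a manual check.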
For geocoding lat and long: we've been trying to use OpenStreetMap for everything in this project. Maybe consider using Nominatim-Browser (https://www.npmjs.com/package/nominatim-browser)? It won't be entirely accurate, because sometimes the public art/mural is not located at the lat and long of its postal address. But until all of this information is entered into OSM and can be queried, this will work for now.
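As an illustration of address-to-coordinate lookup, here is a sketch that calls the public Nominatim search endpoint directly instead of going through the nominatim-browser package, whose API is not shown in this thread. The helper names are hypothetical, and a global `fetch` is assumed (browsers, or Node 18 and later):

```javascript
const NOMINATIM_SEARCH = "https://nominatim.openstreetmap.org/search";

// Builds a search URL that asks for the single best match as JSON.
function buildGeocodeUrl(address) {
  const params = new URLSearchParams({ q: address, format: "json", limit: "1" });
  return `${NOMINATIM_SEARCH}?${params.toString()}`;
}

// Nominatim returns lat/lon as strings; GeoJSON expects [longitude, latitude] numbers.
function toCoordinates(results) {
  if (!Array.isArray(results) || results.length === 0) return null;
  const { lat, lon } = results[0];
  return [Number(lon), Number(lat)];
}

// Resolves to [lon, lat], or null when the address is not found.
async function geocode(address) {
  const response = await fetch(buildGeocodeUrl(address));
  if (!response.ok) throw new Error(`Geocoding failed: ${response.status}`);
  return toCoordinates(await response.json());
}
```

The public Nominatim server has a usage policy that limits request rates, so a couple hundred addresses should be looked up with a pause between requests rather than all at once.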
This is the general format of each JS object in art.js. We have a separate issue about art.js needing to be cleaned up (I injected a lot of HTML tags, since some pieces have multiple artists and thus multiple websites, etc.). So I propose that we add additional attributes to look out for: separate fields for each artist and artist website, plus sourceOfInformation and sourceURL. sourceOfInformation would be the City of San Jose Public Art Program, and sourceURL is the specific webpage with the details about the public art/mural piece. If information about an artist's website exists, include it.
{
  "geometry": {
    "type": "Point",
    "coordinates": [
    ]
  },
  "properties": {
    "title": "",
    "artist1": "",
    "artist2": "",
    "artist3": "",
    "artist1website": "",
    "artist2website": "",
    "artist3website": "",
    "description": "",
    "sourceOfInformation": "",
    "sourceURL": "",
    "address": "",
    "city": "",
    "country": "",
    "postalCode": "",
    "state": ""
  }
}
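A sketch of how a scraped record could be mapped onto this format. The shape of the input `record` (with `artists` and `artistWebsites` arrays) is an assumption made for illustration, as is the function name `toFeature`; only the output keys come from the format above:

```javascript
// Hypothetical helper: maps one scraped record onto the art.js object format.
function toFeature(record) {
  const artists = record.artists || [];
  const websites = record.artistWebsites || [];
  return {
    geometry: {
      type: "Point",
      coordinates: record.coordinates || [], // filled in after geocoding
    },
    properties: {
      title: record.title || "",
      artist1: artists[0] || "",
      artist2: artists[1] || "",
      artist3: artists[2] || "",
      artist1website: websites[0] || "",
      artist2website: websites[1] || "",
      artist3website: websites[2] || "",
      description: record.description || "",
      sourceOfInformation: "City of San Jose Public Art Program",
      sourceURL: record.sourceURL || "",
      address: record.address || "",
      city: record.city || "",
      country: record.country || "",
      postalCode: record.postalCode || "",
      state: record.state || "",
    },
  };
}
```

Missing values fall back to empty strings so every object keeps the same set of keys.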
I realized I hadn't updated the API key. I have a key from CFA, which should allow for more API requests each month. I'll update it today.
Let me know if you have any more questions.
@ychoy Thanks! To be clear, the art.js data came from the city website, but did not come from http://sanjoseca.gov/Facilities, is that correct?
Yes, I will be working to consolidate all the data into a single JSON file.
I have Nominatim up and running, thanks for the suggestion!
@ychoy @JMStudiosJoe might you know how to get JSON data to the client without necessitating a call to a server? I am saving the scraped data as JSON, but I'm not familiar with how to include JSON data with the index.html file download. For example, would you include a <script> tag with a source pointing to the JSON?
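One commonly used option, offered here as a sketch and not as something this thread settled on, is to write the data out as a small JavaScript file that assigns it to a global variable, so a plain script tag can load it with no separate request for JSON. The global name `artData` and the file name `art-data.js` are hypothetical:

```javascript
// Hypothetical helper: wraps the scraped features in a JS assignment so the
// resulting file can be loaded with an ordinary script tag.
function toScriptSource(features, globalName = "artData") {
  return `window.${globalName} = ${JSON.stringify(features, null, 2)};\n`;
}

// Usage in the scraper (Node), not executed here:
//   require("fs").writeFileSync("art-data.js", toScriptSource(features));
```

index.html would then include `<script src="art-data.js"></script>` before the map script, which could read the records from `window.artData`.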
@ctram likely we will have a frontend built with something such as React or Angular that will serve the file as needed; at least, that would be a part of the plan.