The data-science-playground from creativecommons

Discussion: Project to Quantify the Commons

Background:

Creative Commons has submitted a project to UMSI, which has determined that the project is a potential fit for the course SI 485: Information Analysis Capstone and Final Project. In this course, advanced undergraduate students deliver data-oriented solutions through the development and analysis of data sets, building tools to extract useful information for clients through manipulation, analysis, and visualization. This ticket is intended for discussion of the project, with the goal of refining the potential questions we'd like answered and getting input from those who have considered this challenge in the past.

Project General Information

Project Idea:

Creative Commons (CC) seeks to quantify the use of CC legal tools (works in the commons). CC legal tools include the licenses (e.g. CC BY, CC BY-NC-SA) and public declarations (e.g. CC0, PDM). This project would include data collection, analysis, and visualization.

Potential questions to be answered:

How many works are in the commons?

What can we determine from the rate of change?

How can those works be characterized (e.g. by legal tool, region, language)?

How can the data be managed to allow future trend analysis (e.g. which languages saw the largest growth in legal tool adoption)?

How can the use of CC legal tools be meaningfully visualized?

Developing reproducible methodologies for gathering information about the use of CC legal tools will help CC communicate its impact, support policy work (at all levels of government and within institutions), and support the wider community.

Full Description

Creative Commons (CC) seeks to quantify the use of CC legal tools (works in the commons). CC legal tools include the licenses (e.g. CC BY, CC BY-NC-SA) and public declarations (e.g. CC0, PDM). This project would include data collection, analysis, and visualization.

First, this project should create reproducible processes or methodologies for creating a dataset of information about works that are CC licensed or dedicated to the public domain. The dataset may be built from platform APIs (e.g. Flickr), Common Crawl data, etc. The project should create a starting place not only for the project itself, but also for future efforts to extend the dataset and the meaning derived from it.
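Whatever the source (platform API or crawl data), one recurring step is likely to be turning a license or declaration URL into a legal tool label. The following is a minimal sketch of that step, not a proposed implementation. It assumes the common `creativecommons.org/licenses/<code>/<version>/` and `creativecommons.org/publicdomain/<zero|mark>/<version>/` URL layouts; the function name `classify_legal_tool` and the returned dictionary keys are hypothetical, and unusual or legacy URLs simply return `None`.

```python
import re
from typing import Optional
from urllib.parse import urlparse

# Assumed path layout: /licenses/<code>/<version>/[<jurisdiction>/]
# or /publicdomain/<zero|mark>/<version>/
_PATH_RE = re.compile(
    r"^/(licenses|publicdomain)/([a-z\-]+)/(\d+\.\d+)(?:/([a-z\-]+))?/?"
)

# Public domain tools use a path segment that differs from their short name.
_PUBLIC_DOMAIN_NAMES = {"zero": "CC0", "mark": "PDM"}


def classify_legal_tool(url: str) -> Optional[dict]:
    """Hypothetical helper: map a creativecommons.org URL to a legal tool.

    Returns None for URLs that are not recognized.
    """
    parsed = urlparse(url.strip().lower())
    host = parsed.netloc
    if host.startswith("www."):
        host = host[4:]
    if host != "creativecommons.org":
        return None

    match = _PATH_RE.match(parsed.path)
    if match is None:
        return None

    category, code, version, jurisdiction = match.groups()
    if category == "publicdomain":
        tool = _PUBLIC_DOMAIN_NAMES.get(code)
        if tool is None:
            return None
    else:
        tool = "CC " + code.upper()

    return {"tool": tool, "version": version, "jurisdiction": jurisdiction}


if __name__ == "__main__":
    print(classify_legal_tool("https://creativecommons.org/licenses/by-nc-sa/4.0/"))
    print(classify_legal_tool("http://creativecommons.org/publicdomain/zero/1.0/"))
```

Keeping this mapping in one small function would let every data source share the same labels, which matters for later comparisons across sources.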

Second, the project should begin to create meaning from the dataset. How many works are currently in the commons? How has that number changed over time? How can those works be characterized (e.g. by legal tool, region, language)? How can the data be managed to allow future trend analysis (e.g. which languages saw the largest growth in legal tool adoption)?
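As a rough illustration of the first two questions, the sketch below counts records by one attribute and computes the relative change between consecutive yearly totals. The record layout (dictionaries with a `legal_tool` key), the function names, and the numbers in the example are all made up for illustration; they are not real measurements of the commons.

```python
from collections import Counter


def count_by(records, key):
    """Count records per value of `key`, skipping records that lack it."""
    return Counter(r[key] for r in records if r.get(key) is not None)


def yearly_growth(counts_by_year):
    """Relative change between consecutive observed years.

    `counts_by_year` maps year -> total works. Years with no preceding
    observation, or a preceding total of zero, are omitted.
    """
    years = sorted(counts_by_year)
    growth = {}
    for prev, cur in zip(years, years[1:]):
        base = counts_by_year[prev]
        if base:
            growth[cur] = (counts_by_year[cur] - base) / base
    return growth


if __name__ == "__main__":
    # Toy records, invented for this example only.
    sample = [
        {"url": "https://example.org/a", "legal_tool": "CC BY"},
        {"url": "https://example.org/b", "legal_tool": "CC BY"},
        {"url": "https://example.org/c", "legal_tool": "CC0"},
    ]
    print(count_by(sample, "legal_tool"))
    # Toy totals, invented for this example only.
    print(yearly_growth({2020: 100, 2021: 150, 2022: 120}))
```

At the scale discussed in this ticket the counting would probably happen in a database or a distributed job rather than in memory, but the definitions of "count" and "rate of change" could stay the same.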

Third, optionally, how can the data be visualized to communicate meaning and allow exploration?

Project Outcome

What deliverable(s) would students produce and share with your organization as a result of this project?
How do you plan to use the feedback, recommendations, or product you receive from the student team?

Students should create reproducible processes or methodologies for creating a dataset, the resulting dataset, and analysis. Optionally, students may create visualizations of the dataset.

The processes, dataset, and analysis will help Creative Commons communicate its impact, support policy work (at all levels of government and within institutions), and support the wider community.

What do students need for this project to be successful?

Examples: skills needed, social impact orientation, interest or experience in a specific field/domain/industry.

Curiosity, motivation, proficiency in a programming language that can be used to query APIs and manipulate data (e.g. JavaScript, Perl, Python, Ruby; Python is preferred), and a recognition of the value of open knowledge.

Data Proposal Information

Data Set

We expect students to create a new data set for us

Size of Data Set

How big is the data set? Approximately how many rows and columns does it have?

Between 200 million and 2 billion rows with 10 columns. The last effort to quantify the commons in 2017 estimated 1.4 billion works. I expect metadata can be discovered on at least 200 million works. Columns could include: URL, author, date, legal_tool, language, reference_count, etc.
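To make the column list concrete, here is one possible shape for a single row, written as a Python dataclass and serialized to CSV. The column names come from the list above; the types, which fields are optional, the ISO 8601 date convention, and the choice of CSV are assumptions made only for this sketch, since no format has been recommended. `WorkRecord` and `write_csv` are hypothetical names.

```python
import csv
import io
from dataclasses import asdict, dataclass, fields
from typing import Optional


@dataclass
class WorkRecord:
    """One row of the dataset (hypothetical schema)."""
    url: str
    legal_tool: str
    author: Optional[str] = None
    date: Optional[str] = None       # assumed to be an ISO 8601 string
    language: Optional[str] = None   # assumed to be a language code
    reference_count: int = 0


def write_csv(records, out):
    """Write records to a text stream as CSV with a header row."""
    names = [f.name for f in fields(WorkRecord)]
    writer = csv.DictWriter(out, fieldnames=names)
    writer.writeheader()
    for record in records:
        writer.writerow(asdict(record))


if __name__ == "__main__":
    buffer = io.StringIO()
    write_csv([WorkRecord(url="https://example.org/photo/1", legal_tool="CC BY")], buffer)
    print(buffer.getvalue())
```

Missing values are written as empty fields here. A columnar or database format may suit hundreds of millions of rows better than CSV; that choice is left open.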

Findings from Data Set

What do you want to learn from your data set? Please share 3-5 specific questions that the data can help solve:

How many works are in the commons?
What can we determine from the rate of change?
How can those works be characterized (e.g. by legal tool, region, language)?
How can the data be managed to allow future trend analysis (e.g. which languages saw the largest growth in legal tool adoption)?
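The last question suggests keeping dated snapshots of per-group counts so that any two snapshots can be compared later. A minimal sketch of such a comparison follows, assuming each snapshot is a mapping from a group (here a language code) to a count of works. The function name and the sample numbers are invented for illustration.

```python
from collections import Counter


def growth_by_group(earlier, later):
    """Rank groups by absolute growth between two snapshots.

    Groups missing from a snapshot are treated as having a count of zero.
    Ties are broken alphabetically so the ordering is deterministic.
    """
    groups = set(earlier) | set(later)
    deltas = {g: later.get(g, 0) - earlier.get(g, 0) for g in groups}
    return sorted(deltas.items(), key=lambda item: (-item[1], item[0]))


if __name__ == "__main__":
    # Toy snapshots, invented for this example only.
    snapshot_a = Counter({"en": 100, "es": 40})
    snapshot_b = Counter({"en": 120, "es": 90, "fr": 10})
    for language, delta in growth_by_group(snapshot_a, snapshot_b):
        print(language, delta)
```

The same function would work for legal tools or regions, as long as the snapshots use consistent group labels over time.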

Data Availability, Type, Format

No dataset currently exists and CC has not made a recommendation on format. Input is welcome on this subject.
