Comments (67)

rossdakin avatar rossdakin commented on May 19, 2024 10

I firmly believe that the proposed schema should include a standards-compliant 3.5mm headphone jack.

from code-gov-web.

rossdakin avatar rossdakin commented on May 19, 2024 7

But seriously, some thoughts:

Data Format

Some thoughts on proposed data format standards for agency publication (code.gov consumption).

NOTE: a related but distinct feature of code.gov should be the publication of its aggregated inventory. There may be value in providing this inventory in many formats (expecting many varied consumers), whereas below I advocate for a single data format (expecting code.gov to be the sole consumer).

JSON

This should be the standard, IMO, for all the reasons that JSON has become popular: easily readable, ubiquitous, expressive (i.e. allows for collections (arrays) unlike CSV), and libraries exist in all major languages/platforms for JSON generation.

If only one format is supported, I suggest it be JSON.

XML

This wasn't mentioned, but is ubiquitous enough to warrant discussion. As I see it, JSON can do everything XML can do while being more readable, easier to construct, and less complex to define (no WSDL, etc.).

If the schema were intended for broad consumption, I might suggest discussing XML support, but seeing as the schema is primarily intended for consumption exclusively by code.gov, I don't think the added complexity yields much additional benefit.

CSV

I don't see any benefits to supporting CSV, which lacks support for nested or multi-valued structures (e.g. arrays) beyond the single dimension of the rows in a CSV table. One could hack around this constraint by supporting dynamic column headers (e.g. tag_1, tag_2, ... tag_n) or implicitly enumerated columns (e.g. tag, tag, tag -- similar to how some web frameworks handle array POST values). The same could be done for nested attributes (e.g. contractor_1_contacts_contact_2_phone) but that's incredibly inelegant.
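To make the trade-off concrete, here is a minimal Python sketch of the column-flattening workaround described above. The record, its field names, and the `flatten` helper are all hypothetical and not part of any proposed schema; the numbering convention is just one of several possible ones.

```python
import csv
import io
import json

# Hypothetical inventory record; field names are illustrative only.
record = {
    "name": "example-project",
    "tags": ["gis", "maps"],
    "contractors": [{"contacts": [{"phone": "555-0100"}]}],
}


def flatten(value, prefix=""):
    """Flatten nested dicts/lists into a single-level dict with numbered keys."""
    flat = {}
    if isinstance(value, dict):
        for key, child in value.items():
            flat.update(flatten(child, f"{prefix}{key}_"))
    elif isinstance(value, list):
        for index, child in enumerate(value, start=1):
            flat.update(flatten(child, f"{prefix}{index}_"))
    else:
        flat[prefix.rstrip("_")] = value
    return flat


flat_record = flatten(record)

# JSON keeps the nesting as-is.
as_json = json.dumps(record)

# CSV needs one synthetic column per leaf value.
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=list(flat_record))
writer.writeheader()
writer.writerow(flat_record)
as_csv = buffer.getvalue()
```

Note that the CSV header depends on how many tags and contacts a given record happens to have, so two records from the same agency can need different columns.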

One could argue that CSV is simpler to publish when maintaining an inventory by hand (e.g. by exporting an Excel spreadsheet). While this is true, I don't think that benefit outweighs the inherent limits of the format. It also seems that in the long term, we would want agencies to programmatically generate their inventory file rather than hand-crafting it; not supporting CSV may nudge them in the desired direction.

YML

For the sake of completeness -- not mentioned above, but worth discussing. Same attributes as JSON but somewhat more human-readable and somewhat more fragile (white space dependency). I don't see a benefit to supporting YML in addition to JSON.

Collection Methodology

Is it best to retrieve or ask agencies to submit their inventories?

Pros and cons either way. A "pull" methodology seems to be the simplest (avoids "push" credential checking, account maintenance, etc.; also puts burden of initiation on code.gov centrally rather than on each agency individually).

One benefit of a "push" methodology would be more real-time reporting, though I'm not convinced that real-time reporting is very important in this project or outweighs the additional complexity.

CRUD

It's also worth talking about how specific actions should be taken and how certain situations should be interpreted.

For example, if a record suddenly stops being included in an agency's reported inventory, what does code.gov do? Delete it? Ignore the omission?

Should code.gov assign unique identifiers or require them as a part of inventory submission (to avoid duplication and enable "upserts")?

Which inventory actions should be idempotent?

Etc.
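As a rough illustration of what these questions mean in practice, here is a minimal Python sketch of an idempotent "upsert" with an optional delete-on-omission policy. It assumes each record carries a unique `identifier` field and that `stored` holds a single agency's records; both are assumptions for the sake of the example, not decisions made in this thread.

```python
def sync_inventory(stored, reported, delete_missing=False):
    """Merge one agency's reported inventory into its stored records.

    Records are keyed by a hypothetical "identifier" field, so replaying the
    same report any number of times leaves the store unchanged (idempotent).
    """
    merged = dict(stored)
    reported_ids = set()
    for record in reported:
        identifier = record["identifier"]
        reported_ids.add(identifier)
        merged[identifier] = record  # insert or overwrite: an "upsert"
    if delete_missing:
        # One possible policy: drop records the agency no longer reports.
        for identifier in list(merged):
            if identifier not in reported_ids:
                del merged[identifier]
    return merged


stored = {
    "a1": {"identifier": "a1", "name": "old name"},
    "a2": {"identifier": "a2", "name": "unreported"},
}
report = [
    {"identifier": "a1", "name": "new name"},
    {"identifier": "a3", "name": "added"},
]
once = sync_inventory(stored, report)
twice = sync_inventory(once, report)
pruned = sync_inventory(stored, report, delete_missing=True)
```

With `delete_missing=False` the omitted record `a2` is kept; with `delete_missing=True` it is removed. Without stable identifiers, neither policy can tell a renamed project from a new one.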

Collected Data

1,000% agree with @theresaanna on fully spelling out field names rather than using abbreviated/Hungarian-like naming.

Relationships / Reuse

If one agency does start using code from another agency, how is that represented in the code.gov data model?

ctubbsii avatar ctubbsii commented on May 19, 2024 4

@theresaanna wrote:

...not all software will have a license...

I still think the License field should be required, even if the value is None, Not Available, Not Publicly Licensed, or similar. That way, nobody is left wondering.

afeld avatar afeld commented on May 19, 2024 4

I admittedly haven't read the whole thread, but wanted to drop a few links for existing schemas that might be worth looking at:

NoahKunin avatar NoahKunin commented on May 19, 2024 4

Obviously not going to self-deal here, but I can represent the position from the Technology Transformation Service front-office that we'd like to go with About YML, or at least some kind of YML...

...beyond my own personal preferences, I simply think it will be more accessible to people in Product roles who want to keep this data up to date.

RobertRM avatar RobertRM commented on May 19, 2024 3

And Git Hooks would be a good way to submit this information while pushing to github for projects hosted on that platform.

http://githooks.com/

bbrotsos avatar bbrotsos commented on May 19, 2024 3

Collected Data

Code.gov should reuse data element names and definitions from the Project Open Data metadata schema https://project-open-data.cio.gov/v1.1/schema/ where possible. These are based on W3C DCAT http://www.w3.org/TR/vocab-dcat/ and Dublin Core, which have been around for many years. Alternatively, if GitHub, GitLab, or another code repository host has existing data elements and types, this project could use those fields. Code.gov could reuse the following fields from Project Open Data:

  • title
  • description
  • keyword
  • contactPoint
  • publisher
  • license
  • landingPage
  • identifier

An example:

{
     "projects":{
          "title": "Important USDA Code Repository",
          "description": "Creates new automated farms",
          "landingPage": "usda.gov",
          "repositoryURL": "github.usda.gov/automated-farms",
          "softwareLanguage": ["ada", "perl", "cobol"],
          ...

There may be more fields to reuse. I also recommend adding fields which will be good for analytics of what agencies and investments are releasing their code:

  • bureauCode
  • programCode

By aligning to these field names, there is also hope of developing a common system for storing data sets, data assets, and code repositories. For example, we could potentially create an extension for CKAN or DKAN to also store code repositories. You could also reuse existing documentation.

okamanda avatar okamanda commented on May 19, 2024 3

@jecb and others have brought up making this a tool or process to make generating the code inventories as easy as possible. I think the first step in doing so is mapping schema fields to some of the web-based repo hosting tools (e.g., GitHub, Bitbucket), especially those that have APIs.

To that end, I've put together this table which shows what this might look like.

schema field | github field | bitbucket field
--- | --- | ---
agencyAcronym | given | [given]
projects.vcs | [git] |
projects.repoPath | [html_url] or [url] |
projects.repoID | [id] |
projects.projectURL | [homepage] |
projects.projectName | [name] or [full_name] |
projects.projectDescription | [description] |
projectTags.tag | (?) process/analyze from [description], [name], and [language] |
codeLanguage.language | [language] |
Updated.LastCommitDate | [updated_at] |
Updated.LastMetadataUpdate | [pushed_at] or [updated_at] |
Updated.LastPullRequest | grab [updated_at] from [pulls_url] |
POCemail | (?) |
license | (?) grab/process/analyze from LICENSE.MD/README.MD, etc. |
openproject | 1 |
govwideReuseproject | 0 |
closedproject | 0 |
exemption | null |
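A mapping like the one in the table could be applied mechanically. The following Python sketch converts a repository object, shaped like the one the GitHub REST API returns, into draft-schema fields. The function name, the nested `updated` layout, and the sample values are assumptions made for illustration; only the GitHub field names (`html_url`, `id`, `homepage`, `name`, `description`, `language`, `updated_at`, `pushed_at`) come from the API itself.

```python
def github_repo_to_project(repo, agency_acronym):
    """Map a GitHub repository object (a dict) to draft-schema fields."""
    language = repo.get("language")
    return {
        "agencyAcronym": agency_acronym,  # given by the agency, not by GitHub
        "vcs": "git",
        "repoPath": repo["html_url"],
        "repoID": repo["id"],
        "projectURL": repo.get("homepage"),
        "projectName": repo["name"],
        "projectDescription": repo.get("description"),
        "codeLanguage": [language] if language else [],
        "updated": {
            # Follows the table above; which timestamp is the better proxy
            # for each field is still an open question.
            "lastCommitDate": repo.get("updated_at"),
            "lastMetadataUpdate": repo.get("pushed_at"),
        },
    }


# Made-up sample using real GitHub field names.
sample_repo = {
    "id": 12345,
    "name": "automated-farms",
    "full_name": "example-agency/automated-farms",
    "html_url": "https://github.com/example-agency/automated-farms",
    "homepage": None,
    "description": "Creates new automated farms",
    "language": "Python",
    "updated_at": "2016-09-01T12:00:00Z",
    "pushed_at": "2016-08-30T08:30:00Z",
}
project = github_repo_to_project(sample_repo, "USDA")
```

Fields marked (?) in the table, such as license and POCemail, are left out here because they cannot be read from a single field.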

VisionPaul avatar VisionPaul commented on May 19, 2024 3

Collected Data
Adding the name of the system or platform may help for purchased environments that allow for custom solutions to be developed within. We use both Salesforce and ServiceNow - and many other agencies are using these platforms as well, and it would be great to search and post developed solution sets - especially since they probably already come with some level of A&A.

Maybe "softwareLanguage" as @bbrotsos has listed above would be appropriate usage for this example....

ckaran avatar ckaran commented on May 19, 2024 3

As a suggestion, let's drop the updated field and replace it with a version field where the value is a string that obeys the Semantic Versioning guidelines. That will allow both people and automated systems to determine how important a change is, which time stamps don't allow.

rossdakin avatar rossdakin commented on May 19, 2024 3

One thought on the topology.

"Project" here seems synonymous with "repository" — I could see this being confusing when listing projects that have multiple repositories (e.g. a UI, an API, etc.).

Possible mitigations:

  • use "Repo description" etc. rather than "Project description"
  • allow multiple repositories per project
  • leave topology as is but add a field for "related projects"

philipashlock avatar philipashlock commented on May 19, 2024 3

Maybe it was intentional to get a fresh perspective, but it seems like the original discussions on this topic from the policy should be required reading here. See:

Seems like we may be re-hashing many of the same points. In fact, it seems like there are at least four different threads on the metadata schema topic and it's a bit confusing to follow along. Here are the threads I've identified in chronological order:

Where possible, I'd suggest trying to de-duplicate or consolidate these threads, or at least update the first post on each thread to distinguish them if each is meant to serve a distinct purpose.

rough68fish avatar rough68fish commented on May 19, 2024 2

I think it would be a good idea to follow the process established by the data.gov effort as much as possible. Since most agencies have been working on setting that up, they should be familiar with JSON and have processes for creating and maintaining the JSON data.

Also try not to invent a whole new schema and if possible try to reuse data.gov data descriptions where you are talking about the same thing.

okamanda avatar okamanda commented on May 19, 2024 2

Re: Updated, LastUpdated, LastCommit, etc.

@theresaanna @niden I think the value of having some form of timestamp updated field is to show currency. Stale, unattended, outdated code is not particularly helpful to the open source community or the federal agencies. There's merit in giving developers a sense of how much attention a certain repository receives.

Difficulties in implementation notwithstanding, these fields or something like them will be an important piece of data in evaluating whether or not to rely on a segment of code.

A good question for us to consider is whether there is an easy proxy that we can use. For repos on popular hosting platforms like GitHub or Bitbucket, we have a few fields from which to choose. For other private or more obscure repos, it becomes a bit more difficult to find that timestamp. Difficult, but not impossible, given that timestamping changes to code is a major reason why we have version control software!

So I'd love to hear from folks here who've wrestled with this problem in some form or another. Are there creative ways to get timestamp info on:

  • date created
  • date of last commit
  • date of last push
  • date of last modification generally

?

theresaanna avatar theresaanna commented on May 19, 2024 1

I've asked a handful of developers here at 18F for some feedback on approach and schema. Here are some highlights:

  • Projects that have instances per-agency, like the eRegulations platform, will be duplicated in the directory.
  • We should provide a way for agencies to submit tarball or other packaged software URLs.
  • In the draft schema, we should make the names clearer and avoid abbreviations like "openPjct", "govwideReusePjct", and "closedPjct", opting to instead spell out "project".
  • "pjctTags" should be an array.
  • "license" may have multiple values
  • It could be helpful to have some searchable metadata as indicators of overall project health (# contributors, alpha/beta/prod status, etc.)
  • We may be able to remove "closedPjct" because the presence of "exemption" would indicate that it's a closed project.

mgifford avatar mgifford commented on May 19, 2024 1

Would be interesting to be able to record if projects:

  • Meet Section 508 (WCAG 2.0 AA) requirements
  • Support multi-lingual content (English/Spanish) and interface
  • Have undergone external security reviews
  • Are actively maintained or have critical mass
  • Include screenshots or other documentation

That being said, good to make it as easy as possible for folks to get started. We don't want to overwhelm folks.

ctubbsii avatar ctubbsii commented on May 19, 2024 1

If data entry is provided, then the format CSV or JSON doesn't matter, because the view can be exposed either way. The format does matter for bulk-import of metadata, and for that, I'd prefer JSON.

I think it's best to ask agencies to submit their inventory to code.gov (this is where that bulk-import feature helps), rather than rely on them to publish on their own site and pull from there (not all government agencies have up-to-date and convenient sites, and if you provide the platform for receiving the information, it'll probably be easier and faster to get the data than requiring them to sustain their own platform for publishing). Some incentive should be provided to ensure project managers submit this data. Using the data to have a "featured projects" page might be one way to incentivize timely submissions.

As for fields,

  • Project URL should be required. It's the single most important piece of information. Everything else can typically be discerned by visiting that URL. If it doesn't have a URL, I'm not sure why it would ever be listed here (unless the intent is to publish metadata about closed-source projects).
  • POC Email should not be required, because sometimes, the best way to contact an open source project is through the forums, not directly. Additionally, email as a mandatory method of communication is not really future-proof.
  • Last Updated is a confusing field. Does it refer to the last commit? The last time a user reported a bug? The last mailing list discussion? The last time the metadata was updated? If it's going to be there at all, it should refer to the last update of the metadata. It's not reasonable for projects to update this metadata field every time the project itself has activity, so it's pretty useless for that purpose. It would, however, be useful to see if the metadata is old or not. If it gets used this way, it should be required.
  • License should be required. This is a pretty important field.
  • Open Project Status is confusing. Is this metadata intended to index non-open source projects as well? Even if it is, this raises the question of "whose definition of open is being used?". This is also redundant with the License field, because the status is determined solely by the license.
  • Government-wide Reuse Project Status is also confusing. Why would this ever matter? An agency's intention that their release be reused across government has no bearing on whether or not it will be.
  • Exemption field may not be useful. I imagine that most things that would exempt it would also exempt the metadata being requested. As an optional field, I guess it's fine, but it's probably better to simplify things and eliminate fields until a demonstrated value exists. Best to start small and grow bigger than to start big and just grow more complex.

niden avatar niden commented on May 19, 2024 1

Adding to what @ctubbsii wrote:

Last Updated could be renamed to something like Updated and become an array that has more information in there such as LastCommitDate, LastMetadataUpdate, LastPullRequest etc.

The Languages should be an array, not a comma-separated field. It will be easier to index that way, IMO.

I see little value in supporting CSV or XML. As @rossdakin points out, not offering CSV will point people in the right direction :)

bondsbw avatar bondsbw commented on May 19, 2024 1

Government approval processes often become roadblocks and cause systems and data to become stale and unreliable for their purposes. I fear the same for this effort. As red tape is added, this data could become so dated that nobody finds it useful.

I suggest that Code.gov needs to get in front of this problem before the culture settles. Encourage agencies to push metadata updates as quickly and as often as possible while reducing red tape in these processes. Make the update process responsive by eliminating any approval processes aside from standard security and authorization measures.

I would hate to see all this effort reduced to the usual "I technically did my part" checkbox I find in too many government tasks.

jasonduley avatar jasonduley commented on May 19, 2024 1

Which data format is the best fit: CSV or JSON?

  • We prefer a JSON based serialization as we have tools inside of NASA to support both open data and internal code sharing that operate seamlessly with JSON. Also, it should be allowable for agencies to extend the base schema to append additional attributes and this is not possible or easily done with CSV.

Is it best to retrieve or ask agencies to submit their inventories?

  • We would like to mimic the scheme that data.gov uses: post the JSON file on a web server and have it harvested at some interval. Ideally, we'd like to have access to the harvest job admin screen in order to run it manually, as well as a dev harvester for performing end-of-quarter batch loads.

As I mentioned on our call today, since the majority of our code is behind the NASA firewall, it would reduce perceived risk and increase NASA's adoption of this policy if URLs are optional. Of course, for open source repositories such as the ones NASA maintains here: www.github.com/nasa, we would include the URL fields, as they are important in this context. I think title, description, and POC are all important for code discovery and setting up potential collaborations between government parties.

Schema comments I have, within the Projects array:

  • VCS should be typed ENUM (to avoid confusion between SVN and Subversion, for example).
  • pjctTags should be typed as an array of strings.
  • codeLanguage should be typed as an array of strings, and potentially ENUM, to avoid terminology mismatches (node vs node.js).
  • POCemail should be replaced with POC of type object, similar to:

    POC: {
    email: "[email protected]",
    name: "Jason Duley"
    }

  • Boolean fields should be true/false.

Also, from a schema-standard perspective, we should decide whether attributes should be included with NULL values or whether those NULL-valued attributes should be omitted.
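For illustration, the typing rules above could be checked with something like the following Python sketch. The enum members, the spelled-out field names (`projectTags`, `openProject`, `governmentWideReuseProject`), and the error messages are hypothetical; the draft schema does not define them.

```python
from enum import Enum


class VCS(Enum):
    """Hypothetical closed list of version control systems."""

    GIT = "git"
    SVN = "svn"
    MERCURIAL = "hg"


def validate_project(project):
    """Return a list of problems found in one project entry (empty if none)."""
    problems = []
    try:
        VCS(project.get("vcs"))
    except ValueError:
        allowed = ", ".join(member.value for member in VCS)
        problems.append(f"vcs must be one of: {allowed}")
    for field in ("projectTags", "codeLanguage"):
        value = project.get(field)
        if not (isinstance(value, list) and all(isinstance(v, str) for v in value)):
            problems.append(f"{field} must be an array of strings")
    poc = project.get("POC")
    if not (isinstance(poc, dict) and isinstance(poc.get("email"), str)):
        problems.append("POC must be an object with an email")
    for field in ("openProject", "governmentWideReuseProject"):
        # isinstance(1, bool) is False, so 0/1 flags are rejected.
        if not isinstance(project.get(field), bool):
            problems.append(f"{field} must be true or false")
    return problems
```

This sketch treats a missing attribute and a NULL-valued one the same way, which sidesteps rather than answers the question of which convention the schema should adopt.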

jbjonesjr avatar jbjonesjr commented on May 19, 2024 1

I want to take a second before responding myself to thank @rossdakin for his detailed post above. He did a great job laying out reasoning behind multiple formats and each delivery mechanism. Thank you for taking the time to share that and add to the conversation.


Now, some thoughts in no particular order....

  • It would be really nice if tags could be a fixed set instead of freeform. I'd be curious if StackOverflow published a list of its top tags that could seed this project. While rejecting data for incomplete tags is not optimal, dealing with multiple tags that mean the same thing ("Subversion, SVN, svn") can make discovery very difficult.

  • As @IanLee1521 mentioned, this metadata should be derived wherever possible (the GitHub API is great for this). Creating an API process to iterate each repository in the organization, collect the proper information, then "push" it to either code.gov or the agency website would be a pretty low lift (for systems already collecting that information at least).

  • For @jasonduley,

    it would reduce perceived risk and increase NASA's adoption of this policy if URLs are optional.

Can you tell me more about how NASA treats internally-resolvable URLs as a risk? I'd think, as the government works towards more inner-sourcing and reuse, that being able to go "to" the code will be a big help.

  • As we talk about push vs pull, keep in mind what happens when projects are abandoned. Who updates this information then (or if someone wants to register a long-abandoned project)? Might it be helpful to include "expected project period of performance" to provide a hint that a project might be OBE at a later point in time?
  • Regardless of ingest format, couldn't code.gov do the translation and offer data in many formats? Very Write Once, Read Many...
  • In terms of push vs pull, would you require the parent agency's website to publish the inventory? So all the DOE labs would be required to roll up to DOE? How about multi-agency organizations? I think it would simplify the data flow if push was decided as the standard. The only issue there is you lose some agency capability to keep on top of the inventory.
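On the tag-synonym problem raised in the first bullet ("Subversion, SVN, svn"), one alternative to rejecting submissions is to normalize tags at ingest. A minimal Python sketch follows; the alias table is hypothetical and would need to be seeded from a real tag list.

```python
# Hypothetical alias table; a real one might be seeded from a published tag list.
TAG_ALIASES = {
    "svn": "subversion",
    "subversion": "subversion",
    "node": "node.js",
    "nodejs": "node.js",
    "node.js": "node.js",
}


def normalize_tags(tags):
    """Map free-form tags to canonical ones, dropping duplicates but keeping order."""
    canonical = []
    for tag in tags:
        key = tag.strip().lower()
        name = TAG_ALIASES.get(key, key)  # unknown tags pass through lowercased
        if name not in canonical:
            canonical.append(name)
    return canonical


normalized = normalize_tags(["Subversion", "SVN", "svn", "Python"])
# normalized == ["subversion", "python"]
```

Unknown tags are passed through rather than rejected, so agencies are not blocked while the canonical list is still incomplete.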

theresaanna avatar theresaanna commented on May 19, 2024 1

That being said, good to make it as easy as possible for folks to get started. We don't want to overwhelm folks.

@mgifford I agree that we should make it easy for folks to get started, but you bring up some valuable data points we might collect. Thanks so much for your feedback. A question that remains for me is whether it's better to have an initial version of the schema that we add onto as agencies feel more comfortable or if it's better to be thorough up front.

thecapacity avatar thecapacity commented on May 19, 2024 1

@theresaanna I think you've got a lot of good material in the above discussion, and may have already seen this from some of my colleagues: https://18f.gsa.gov/2016/08/29/data-act-prototype-simplicty-is-key/

"... One of the earliest decisions our team grappled with centered on the data format we would receive from agencies. ... "

I wanted to augment some of the earlier comments: it definitely seems like an "and" rather than an "or", and making one machine-readable format is a good way to validate another (e.g. CSV to validate a "more formal" JSON/XML/... spec).

theresaanna avatar theresaanna commented on May 19, 2024 1

@rossdakin Thank you so much for your thoughtful feedback! You've brought up some great food for thought. I am in agreement with you that a JSON, pull-based system makes the most sense. Some thoughts:

For example, if a record suddenly stops being included in an agency's reported inventory, what does code.gov do? Delete it? Ignore the omission?

My assumption is that code.gov always reflects the most recent version of agency inventories, meaning we'd delete the record. I don't know if this is a good assumption. Are there cases in which we'd want to hold onto old data? I imagine it'll be normal for software to drop out of inventories as it becomes replaced.

Should code.gov assign unique identifiers or require them as a part of inventory submission (to avoid duplication and enable "upserts")?

You bring up a great point. I think that for a first version, given the aggressive timeline the policy lays out, we won't be able to tackle this, however, I will add it to our backlog for addressing in the future. I cringe a little to say that, as this is something we'll want to think about sooner rather than later admittedly.

If one agency does start using code from another agency, how is that represented in the code.gov data model?

That is a fantastic question! I think we will need to have some discussion around how we might represent that - whether it's in the data model or a layer on top of it. Do you see any benefits to having it in the data model?

okamanda avatar okamanda commented on May 19, 2024 1

Re: exemption/closedProject

@theresaanna @ctubbsii - Exemption/closedProject reflects the scenario in which an agency cannot report the details about a particular code inventory and is relying on one of the five exemptions provided for in the policy (e.g., national security risk).

In this scenario, agencies would have to remove or 'blackline' certain fields like projectURL, repoPath, and others (to be determined) prior to publishing/posting their code.json inventory. In addition, they'd have to state the exemption upon which they rely in the 'exemption' field. That way when the code.json inventory is published with the missing fields, the public will be able to see the reason those fields are missing in the exemption field.

I could see an argument that if the exemption field is null, then closedProject should be 'false'. But in any event, the exemption/closedProject fields are closely related.
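A small Python sketch of how the exemption and closedProject fields might be tied together when preparing an entry for publication. The list of blacklined fields is an assumption (the real list is still to be determined), and deriving closedProject from exemption is only one reading of the argument above.

```python
# Assumed list of fields to blackline; the real list is still to be determined.
REDACTED_FIELDS = ("projectURL", "repoPath")


def publishable(project):
    """Return a copy of a project entry prepared for the public code.json."""
    entry = dict(project)
    if entry.get("exemption"):
        # Exempt: remove the sensitive fields but keep the stated reason.
        for field in REDACTED_FIELDS:
            entry.pop(field, None)
        entry["closedProject"] = True
    else:
        # No exemption claimed: the project is not closed.
        entry["exemption"] = None
        entry["closedProject"] = False
    return entry


exempt_entry = publishable({
    "projectName": "example-internal-tool",
    "projectURL": "https://internal.example.gov/tool",
    "repoPath": "https://internal.example.gov/tool.git",
    "exemption": "national security risk",
})
open_entry = publishable({
    "projectName": "example-open-tool",
    "projectURL": "https://example.gov/open-tool",
})
```

If closedProject can always be derived this way, it is redundant with exemption, which supports the earlier suggestion to drop it.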

neilmartis avatar neilmartis commented on May 19, 2024 1

@NoahKunin @afeld @theresaanna I found this, created by a former PIF, Rob Baker. Not sure if this can be added to the list by Aidan: https://github.com/rrbaker/maker.json

jbjonesjr avatar jbjonesjr commented on May 19, 2024 1

Regarding formats of submitted data (json? csv? xml? yml? ), I would remind that we don't need to find one format to solve all problems and use cases.

Government/Code.gov ingest is a separate problem from discoverability by external users, which is a separate issue from user presentation, which is a separate issue from data generation. There are various tools (converters? formatters? etc.) to solve many of these issues.

I'd prioritize (selfishly) whichever of these formats is most important for you, and let code.gov provide the facilities for other conversions (Pull Requests welcome!)

ckaran avatar ckaran commented on May 19, 2024 1

@jbjonesjr I see the point that @okamanda was making about avoiding stale projects. However, I'm unaware of any version control system that doesn't store the date when changes were made, so the last update date can always be extracted from the VCS directly. Semantic versioning can help an automated system determine which patches can be applied, and which can't. Going from 1.10.9 to 2.0 means something in my system is going to break. But going from 1.10.9 to 1.10.10 could be handled automatically when my system downloads and applies patches.
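To show what an automated system could do with such a field, here is a minimal Python sketch that classifies the change between two version strings. The helper names are hypothetical, short forms like "2.0" are padded to "2.0.0" as a convenience, and pre-release or build suffixes from the full Semantic Versioning spec are not handled.

```python
def parse_version(text):
    """Parse "MAJOR.MINOR.PATCH" into a tuple of ints, padding short forms like "2.0"."""
    parts = [int(part) for part in text.split(".")]
    if not 1 <= len(parts) <= 3:
        raise ValueError(f"not a MAJOR.MINOR.PATCH version: {text!r}")
    return tuple(parts + [0] * (3 - len(parts)))


def change_level(old, new):
    """Classify the difference between two versions: major, minor, patch, or none."""
    old_v, new_v = parse_version(old), parse_version(new)
    for level, before, after in zip(("major", "minor", "patch"), old_v, new_v):
        if before != after:
            return level
    return "none"


def safe_to_auto_apply(old, new):
    """Treat only forward patch-level changes as safe to apply unattended."""
    return change_level(old, new) == "patch" and parse_version(new) > parse_version(old)


breaking = change_level("1.10.9", "2.0")      # "major"
routine = change_level("1.10.9", "1.10.10")   # "patch"
```

Comparing the components as integers matters: as plain strings, "1.10.10" would sort before "1.10.9".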

ctubbsii avatar ctubbsii commented on May 19, 2024 1

@massonpj Obviously, any new license should be approved by both FSF and OSI. The biggest issue I think OSI would have is seeing it as "duplicative" if it's too similar.

ddelmoli avatar ddelmoli commented on May 19, 2024

If considering a JSON format, it may be useful to follow / look at the npm package file format https://docs.npmjs.com/files/package.json

IanLee1521 avatar IanLee1521 commented on May 19, 2024

Personally, I would prefer not XML for the reason that it isn't as well supported by tools like Jekyll which may be used for the display / web visualization of the data.

Another thought, should the fields / the spelling of the fields be aligned with the type of information that can be grabbed from sources like the GitHub REST API?

This would allow, at least for open / GitHub repos, the ability to absorb all projects by only knowing the organization names. This is something that I am doing for the @LLNL organization to create a software portal, much like what Code.gov will become, at software.llnl.gov.

IanLee1521 avatar IanLee1521 commented on May 19, 2024

Oh, and I also agree with @jasonduley that the ability for agencies to push into the repository would greatly ease the integration of "inside the firewall" code hosting.

jasonduley avatar jasonduley commented on May 19, 2024

@jbjonesjr
Today, mission-based CM systems that contain flight code, vehicle commands, ground software, etc. and other sensitive projects are not going to allow a firewall exception to government partners and will most likely share "released" source code by re-hosting to neutrally located CM systems outside the NASA internal firewall for government-wide sharing. For the inventory, the URL should be optional for internal source as they live behind the firewall.

IanLee1521 avatar IanLee1521 commented on May 19, 2024

@jasonduley -- Would providing the links, even if they are inaccessible, be an issue? It seems like if it were possible to provide the "where" now, that would assist with identifying where new connections need to be established.

@jbjonesjr -- One other thought is that the number of sources for the metadata we (all) would be scraping is fairly limited... There are only so many tools for hosting code. GitHub.com obviously, but also: GitLab, Bitbucket.org, Bitbucket Server, SourceForge, etc. By deciding on a common format and building tools for scraping that data out of these sources, all of the agencies would be able to contribute collaboratively.

jasonduley avatar jasonduley commented on May 19, 2024

@IanLee1521
I think supplying URLs to NASA's internal and tightly secured code projects will cause issues for us. Please note this would be a subset of projects in the inventory and all already released open source would contain URLs.

IanLee1521 avatar IanLee1521 commented on May 19, 2024

Makes sense... For what it's worth, I suspect we would have similar issues @LLNL.

CynthiaParr-USDA avatar CynthiaParr-USDA commented on May 19, 2024

Because new code is often generated in association with research data, we are encouraging data submitters to the Ag Data Commons (https://data.nal.usda.gov) to also submit a pointer and metadata description for their software (which we hope is primarily managed in an open source code repository). Two points to make about this:

  1. We have the same POD 1.1 metadata for the software (which we have augmented with a few fields -- see https://data.nal.usda.gov/description-fields-%E2%80%9Cedit-dataset%E2%80%9D-page)
  2. We obtain DataCite DOIs for software tools, whether they are registered separately from their data or included as a resource in a data package.

I would encourage processes to align as closely as possible with the existing open data.gov processes. I have no problem with additional value-added metadata.

jecb avatar jecb commented on May 19, 2024

Apologies if this question has been asked, but has there been discussion around creating a JSON conversion tool similar to the DCOI Strategic Plan: https://datacenters.cio.gov/json-conversion-tool/?

theresaanna avatar theresaanna commented on May 19, 2024

@ctubbsii Thank you so much for all of your feedback. You've brought up some great points that are so valuable in helping us think this through. I've replied to much of your comment inline:

Project URL should be required. It's the single most important piece of information. Everything else can typically be discerned by visiting that URL. If it doesn't have a URL, I'm not sure why it would ever be listed here (unless the intent is to publish metadata about closed-source projects).

So, we will be collecting data about presumably many closed-source projects, and so a public URL may not be available.

POC Email should not be required, because sometimes, the best way to contact an open source project is through the forums, not directly. Additionally, email as a mandatory method of communication is not really future-proof.

I agree that it's not very future-proof. My assumption was that agencies would need a way to get in contact with the project maintainers if this inventory were to be useful. However, I'm not sure that's a good assumption. I'm planning to remove it as a required field unless a good argument is made to the contrary.

Last Updated is a confusing field. Does it refer to the last commit? The last time a user reported a bug? The last mailing list discussion? The last time the metadata was updated? If it's going to be there at all, it should refer to the last update of the metadata. It's not reasonable for projects to update this metadata field every time the project itself has activity, so it's pretty useless for that purpose. It would, however, be useful to see if the metadata is old or not. If it gets used this way, it should be required.

Interesting. I had thought of this field as a signifier of the activity of a project, but this would be hard to maintain unless we were pulling project info right from Github or similar. I agree that it would be useful to see when the metadata was last changed. The more I think about this field, though, the less convinced I am that we need it. Until we have a tool to generate the inventory JSON, I imagine this field will fail to be updated with changes, making it unreliable.

License should be required. This is a pretty important field.

Agreed that it is important, but unfortunately not all software will have a license. Along the lines of the suggestion you made about incentives, perhaps there's a way to encourage folks to release code and help them decide which license is right.

Government-wide Reuse Project Status is also confusing. Why would this ever matter? An agency's intention that their release be reused across government has no bearing on whether or not it will be.

There are projects that are built as platforms or to be reused specifically. For example, the eRegulations project. https://eregs.github.io/. This field will allow users to look specifically for these types of projects.

Exemption field may not be useful. I imagine that most things that would exempt it would also exempt the metadata being requested. As an optional field, I guess it's fine, but it's probably better to simplify things and eliminate fields until a demonstrated value exists. Best to start small and grow bigger than to start big and just grow more complex.

Definitely agreed on preferring to start small.
@okamanda or @mattbailey0 - I realize I don't fully understand what it means for a project to be exempt. Is this exemption from the open source part of the policy?
A related question: If you look at the original schema, it has exemption and closedPjct. Will a closed project always have an exemption? Put another way, can we remove closedPjct and rely on the existence of exemption to indicate that it's closed?

theresaanna avatar theresaanna commented on May 19, 2024

Last Updated could be renamed to something like Updated and become an array that has more information in there such as LastCommitDate, LastMetadataUpdate, LastPullRequest etc.

The Languages should be an array not a comma separated field. It will be easier to index that way IMO

@niden these are great suggestions, thank you. I will implement your languages field suggestion - I agree.
In the interest of ease of use, I'm thinking we may want to drop the Last Updated field. Though, if we do implement it in the future, this object-based approach would make things clearer. I could see this field being more useful when we provide a JSON generator or can pull data from somewhere like Github.
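
A minimal sketch of what the two suggestions could look like in a single project entry, assuming hypothetical key names inside `updated` and made-up sample dates (neither is taken from the actual schema). It also shows why an array is easier to search than a comma-separated string.

```python
import json

# Hypothetical entry: the key names inside "updated" and the dates are illustrative only.
project = {
    "name": "example-project",
    "languages": ["Python", "JavaScript"],
    "updated": {
        "lastCommitDate": "2016-10-01",
        "lastMetadataUpdate": "2016-10-15",
    },
}


def uses_language(entry, language):
    """Exact, case-insensitive match against the languages array."""
    wanted = language.lower()
    return any(lang.lower() == wanted for lang in entry.get("languages", []))


# A substring test on a comma-separated string gives a false positive for "Java",
# because "JavaScript" contains "Java".
comma_separated = ", ".join(project["languages"])
false_positive = "Java" in comma_separated

print(json.dumps(project, indent=2))
```

With the array form, `uses_language(project, "Java")` is false, while the naive substring check on the comma-separated form reports a match.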

@okamanda, I'm interested in your thoughts here. Do you see a need for Last Updated that I don't? I worry that it will fail to be updated and then become unreliable data if folks are updating it manually.

neilmartis avatar neilmartis commented on May 19, 2024

maker.json is a schema to promote standards in the information we share about DIY spaces around the world toward fostering further awareness and improving collaboration.

IanLee1521 avatar IanLee1521 commented on May 19, 2024

In this scenario, agencies would have to remove or 'blackline' certain fields like projectURL, repoPath, and others (to be determined) prior to publishing/posting their code.json inventory.
-- @okamanda

Interesting... So the intention here is that the code will be acknowledged and named, with a note that it is exempt?

jasonduley avatar jasonduley commented on May 19, 2024

hello everyone, I had a few questions I received today during discussions with some NASA stakeholders:

Q1) Does the code inventory we post exclude software the agency has developed prior to Aug 8th and only include "new code" projects formulated after Aug 8th
OR
Is the list exhaustive in that all software must be accounted for?

I think I know the answer but wanted to document the question so I can pass it along to some folks within the agency (this question stemmed from the "is not retroactive" part in the document)

Q2) Should NASA or other agencies include code projects written via hackathons/challenges (e.g. spaceappschallenge), grants, proposals as part of the inventory?

ctubbsii avatar ctubbsii commented on May 19, 2024

Expanding on @jasonduley's question about hackathons, grants, and proposals... I'm also curious what granularity this is going for.

If the goal is to capture metadata for all government-produced software, this clearly means some threshold above the level of "script to search my shell history", "this saved SQL predicate", "vim plugin to format XML the way I like it", "script to launch my favorite apps when I log in for the day", or "example pseudo-code that accompanies a paper I wrote".

Developers write, think, speak, and dream in code, and not all of it attains the level of "this is a named government software project with metadata to inventory", even if it is produced by a government agency and was given a name by its author.

Up to now, I had assumed that this effort was mainly about inventorying published open source projects. But, with the comments above about inventorying closed-source or unpublished software, this question of granularity really becomes important.

mikecharles avatar mikecharles commented on May 19, 2024

Q1) Does the code inventory we post exclude software the agency has developed prior to Aug 8th and only include "new code" projects formulated after Aug 8th
OR
Is the list exhaustive in that all software must be accounted for?

And if it only includes code after Aug 8th, I assume we can still add all of our older code if we want?

jbjonesjr avatar jbjonesjr commented on May 19, 2024

I'd give a 👍 (despite implementation details) to @okamanda's points about update date. Something that semantic versions or other fields don't tell you without other information is how recently the code has been updated (basically a proxy for whether the code is still maintained).

While not all version control systems make this data easy to derive (a pity), even something as simple as the year of last maintenance would be an improvement and an important signal of maintenance status.

jbjonesjr avatar jbjonesjr commented on May 19, 2024

@ckaran totally agree with the value of Semantic Versioning done right. One of my concerns with Semantic Versioning is that the end users (aka, the developers already writing this code) have to be relied on to use semantic versioning correctly. This may not always be a given (sometimes 2.0 is because a contract was recompeted, sometimes it's because of breaking API changes, but sometimes it's because of marketing). By contrast, even government projects can't screw up the meaning of last commit/update.

This is key because the use case in my mind, as a former govt developer and architect, is that I'd go to code.gov to see if there is already a widget that does PetShop aggregation before I build my own. When billions of dollars of government projects are provided with information on code.gov, I expect there to be many PetShop widgets available for my use, many likely from my home agency. Having an update date will help me figure out which widget to use without diving directly into each repository to get specifics.

Not against Semantic Versioning, but don't think it can replace an update_date.

bondsbw avatar bondsbw commented on May 19, 2024

Semantic Versioning and update date have different purposes and serve different needs. I suggest having both.

ckaran avatar ckaran commented on May 19, 2024

@jbjonesjr I agree with you that sometimes there are version bumps solely for marketing and other purposes; however, nothing prevents someone from performing a pointless update to a code base solely to cause the Last Updated field to get updated [1]. That said, if we assume that people are generally honest and will not deliberately game the system, then @bondsbw is right that both have their uses. Semantic versioning will tell you how important the change is, while the Last Updated field gives you a clue about the vibrancy of the project. So, I guess I'm now voting for both fields.

[1] I'm assuming here that once a project's URL has been submitted to code.gov, then the servers can automatically look for any updated projects and update their databases accordingly. Computers are lousy at determining which changes are important ones, so this would be a trivial trick for an unscrupulous person to make it appear that their project is getting lots of updates.

bandrzej avatar bandrzej commented on May 19, 2024

Some feedback, from my personal opinion:

  • codeLanguage.language
    Realize this should be a multiple value field, unless you are going to specify in the description the primary code language in use as a single value.
  • license
    This should be required to point to a README.md or license document within the code repository. This clearly defines the secure rights obtained for the source code that this OMB memo is trying to solve.
  • openproject vs. closed project
    What about combining this into one field, and its values are "Open" or "Closed"
  • govwideReuseproject
    Why would we be listing projects that are not gov wide re-use? Use case?
  • exemption
    Is the plan here to list source code that has exemptions to gov-wide release? How do you plan to deal with FOIAs?

bandrzej avatar bandrzej commented on May 19, 2024

+1 for YML per @NoahKunin

It is assumed a developer would do this task, but it is left up to the agency how it is accomplished. It would not surprise me if some agencies tasked their Public Affairs or Security Offices with maintaining it, since it is public facing.

bandrzej avatar bandrzej commented on May 19, 2024

Question:

How do you plan to track government contributions to existing public OSS projects that were not started by the government?

mattbailey0 avatar mattbailey0 commented on May 19, 2024

Thanks @philipashlock. Especially with your helpful write up in place, let's consolidate the discussion here. I'm going to close out the other issues and point folks to this thread, which is the most active overall.

IanLee1521 avatar IanLee1521 commented on May 19, 2024

@mattbailey0 -- Would it make sense to start working all of these discussions into the draft guidance, rather than continuing to solely use the issue threads?

philipashlock avatar philipashlock commented on May 19, 2024

Current Status

What's the status of this schema? There's documentation on code.gov that seems somewhat final, but there seem to be a number of important points that haven't been addressed, this issue is still open, and the code.gov site somewhat confusingly says that both the publication of the metadata schema and implementation of the schema by agencies are due December 9th (referring to Section 7.2).

Allow for revisions

Whether or not this is final or it's possible to make some minor updates, I would suggest creating some expectations or provisions for a revision within a year or so after there's sufficient experience and feedback from those who have implemented and consumed it. We did this with the Project Open Data Metadata schema and the 1.1 update not only allowed us to address issues that had come up, but to also fully align it with the international standard established by voluntary consensus bodies (DCAT). It's understandable that there was a short timeline to establish this schema, but we don't want to create the impression that this draft will be locked in for perpetuity.

One of the ways we addressed this with the Project Open Data schema is that in the v1.1 update we required implementors to explicitly state the version of the schema at the top of the file.
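
As a rough sketch of how a harvester could use such a declaration, the function below reads a top-level version key and rejects files it does not understand. The key name `version` and the supported version strings are assumptions for illustration, not part of any published code.json schema.

```python
import json

# Hypothetical set of schema versions this harvester understands.
SUPPORTED_VERSIONS = {"1.0", "1.1"}


def schema_version(document_text):
    """Return the declared schema version, or raise ValueError if absent or unknown."""
    doc = json.loads(document_text)
    if not isinstance(doc, dict):
        raise ValueError("inventory must be a JSON object")
    version = doc.get("version")  # assumed key name
    if version is None:
        raise ValueError("inventory does not declare a schema version")
    if version not in SUPPORTED_VERSIONS:
        raise ValueError(f"unsupported schema version: {version!r}")
    return version
```

Declaring the version up front lets a consumer choose the right parsing rules before touching any project entries, which is what makes a later revision practical.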

Use existing standards

While it may not seem like the development of this schema is part of a standards making process, it really is if agencies are required to follow it. OMB A-119 sets out basic requirements for the use of standards in government, specifically "this Circular directs agencies to use voluntary consensus standards in lieu of government-unique standards except where inconsistent with law or otherwise impractical." In other words, government should avoid creating government-specific standards unless it has a good reason to do so. Avoiding reinventing the wheel also meets the spirit of reuse set out in this policy. With that in mind, it would be good to review existing standards and document why or why not they are practical to use here.

A number of existing schemas and specifications have been raised in this discussion, including the Asset Description Metadata Schema for Software (ADMS.SW) used by federated national software catalogs across Europe, which integrates much of the DCAT vocabulary used for the Project Open Data data.json schema; the civic.json schema (with various flavors that have been used or proposed by the civic tech community in the U.S., including BetaNYC, Code for America, and DC Government); the Schema.org SoftwareSourceCode and SoftwareApplication schemas, which appear to be implemented by a relatively small number of websites (10 and less than 50,000 respectively); and the NIST specification for Asset Identification, which I think is mostly used to describe software in an operational environment rather than as an autonomous asset ready for reuse.

The current schema appears to be largely based on the civic.json specification. The pro is that it's something that's already been developed by the community and it's relatively simple. The con is that it's not clear that it's been widely used, well documented, or even proposed consistently enough to enable interoperability.

The ADMS.SW specification seems like the most robust standard aligned with the needs of Code.gov. The pros are that it's been developed through formal voluntary consensus bodies, is thoroughly documented, aligns with the DCAT schema used for the open data policy, and is implemented in a federated way by European government bodies just as needed by U.S. federal agencies. The con is that it appears overly complex with very dense documentation. You can see a full PDF copy of the ADMS.SW spec here (copied from here) and a presentation about it here

The Schema.org schemas are fairly simple, well documented, and developed through a voluntary consensus process. One of the biggest pros is that these are supported by the major search engines which means that they should be indexed by search engines and that's the most likely way people will find software (not on code.gov). The con is that these are not yet well adopted, at least not SoftwareSourceCode, and the search engines do not yet appear to be doing anything special to index these. However, it's totally possible to implement one of the schemas mentioned above while also implementing a schema.org schema, but you'll want to be sure there's a good mapping between the two. We did this with the Project Open Data metadata schema, but it was fairly easy because the POD schema is merely an extension of DCAT and the schema.org Dataset schema was explicitly based on DCAT. None of the major search engines were doing anything special by indexing the schema.org Dataset schema when it was first implemented on Data.gov, but Google is now working on this more and expanding the Dataset schema for the way Google wants to index things like Science Datasets and I think we can expect something similar to happen with software.
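
To illustrate what "a good mapping between the two" might involve, here is a minimal sketch that converts a project entry into a schema.org SoftwareSourceCode object for JSON-LD. The input key names (`repository`, `languages`, `tags`, and so on) and the choice of target properties are assumptions for illustration, not an agreed crosswalk.

```python
import json

# Assumed crosswalk from code.json-style keys to schema.org properties.
OPTIONAL_FIELDS = {
    "description": "description",
    "repository": "codeRepository",
    "languages": "programmingLanguage",
    "license": "license",
    "tags": "keywords",
}


def to_schema_org(project):
    """Build a schema.org SoftwareSourceCode dict from a project entry."""
    doc = {
        "@context": "https://schema.org",
        "@type": "SoftwareSourceCode",
        "name": project["name"],
    }
    for source_key, target_key in OPTIONAL_FIELDS.items():
        if source_key in project:
            doc[target_key] = project[source_key]
    return doc


example = to_schema_org({
    "name": "example-project",
    "repository": "https://example.gov/example-project",  # placeholder URL
    "languages": ["Python"],
})
print(json.dumps(example, indent=2))
```

Fields without a counterpart in the crosswalk are simply dropped here; a real mapping would need to decide what to do with them.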

So while it seems like a fairly final decision to develop something new based on the civic.json schema, I think it's worth considering whether more could be done to leverage the work that's gone into ADMS.SW, to reuse the elements in DCAT already used by the open data policy, to align with a formal voluntary consensus standard, and to allow for interoperability with the federated European software catalog. That said, more should be done to provide a simplified profile of ADMS.SW and to better understand the pros and cons of ADMS.SW in practice. We did this with POD v1.1 and DCAT by working with W3C to make data.json a formal representation of DCAT with JSON-LD and I think we found a good compromise. When POD v1.0 was developed, it was mostly aligned with DCAT, but DCAT had not been finalized. POD v1.1 is now compatible with DCAT and a large portion of national data catalogs around the world use DCAT. The European Union uses DCAT as the basis for their federated Europe-wide data catalog.

And even where an existing specification isn't fully packaged to meet all the needs here, you can still assemble fields from existing vocabularies. This allows for field level interoperability and can ensure you reuse properties that are already well defined rather than coin new ones that are vague or inconsistent.

Feedback on fields

In the meantime, here's some feedback on specific fields (some of this reiterates or emphasizes John's comments)

agency - there are no official or consistent acronyms for government agencies in the federal government. To ensure consistency, you'll have to use a unique identifier like we did with Project Open Data. We primarily used bureauCode but GSA is also working on a more universal unique identifier system for agencies. Additionally, ideally this field would not be government specific. I would also suggest that this field be associated with each project entry rather than with the whole catalog as this will allow the metadata to be more easily mixed and aggregated across multiple sources without losing this important data.

organization - for Project Open Data we allowed folks to use the publisher field to optionally provide the context of where the office sits in the agency by indicating some level of hierarchy. I would also suggest that this field be associated with each project entry rather than with the whole catalog as this will allow the metadata to be more easily mixed and aggregated across multiple sources without losing this important data.

openSourceProject - this seems somewhat redundant with license. The policy defines open source as anything meeting the Open Source Definition and OSI has a list of licenses that meet that definition, so this field could just be derived from the license. Even if you feel the need to keep it here, I'd make it explicit that this means the code is licensed (or unlicensed) in a way that meets the OSD. It's also worth noting that OSI has not accepted CC0 as meeting the definition, but does recognize the public domain status of U.S. Government Works. This is a topic that should be discussed and debated further, but it might be worth considering whether it's better to use the usa.gov URL for U.S. public domain as defined by Project Open Data rather than assert international public domain with CC0 like we suggest for datasets. The difference in these use cases, as explained by OSI, has to do with patent rights which are relevant for software, but not data. Additionally, this field should use a boolean (true or false) not an integer since the boolean datatype is intended specifically for this purpose and is more human readable.

governmentWideReuseProject - this should be renamed so it's less government-specific, e.g. designedForReuse, and it should use a boolean (true or false) not an integer since the boolean datatype is intended specifically for this purpose and is more human readable.

languages - this should make it clear that it's referring to the code language rather than human language. In ADMS.SW, DCAT, and Schema.org, language is used to refer to the human language used by the asset, whereas schema.org uses a term like programmingLanguage on their SoftwareSourceCode schema to be clear they're referring to code not content. This should also be singular, not plural regardless of whether the data type is singular or not.

exemption - I'd suggest making this more explicit like reuseExemption and using a more human readable controlled vocabulary for the exemption reasons rather than integers.
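
A small sketch of the boolean and controlled-vocabulary suggestions above, assuming entries currently carry 0/1 flags and integer exemption codes. The exemption labels are placeholders, since the thread does not spell out what each code means.

```python
# Placeholder vocabulary: the real meaning of each exemption code is not given here.
EXEMPTION_LABELS = {
    1: "placeholder-reason-1",
    2: "placeholder-reason-2",
}

FLAG_FIELDS = ("openSourceProject", "governmentWideReuseProject")


def normalize(entry):
    """Return a copy with 0/1 flags as booleans and exemption codes as labels."""
    out = dict(entry)
    for key in FLAG_FIELDS:
        if key in out:
            if out[key] not in (0, 1):  # True/False compare equal to 1/0
                raise ValueError(f"{key} must be 0, 1, true or false")
            out[key] = bool(out[key])
    if out.get("exemption") is not None:
        out["exemption"] = EXEMPTION_LABELS[out["exemption"]]
    return out
```

For example, `normalize({"openSourceProject": 1, "exemption": 2})` yields a `True` flag and a readable label in place of the bare integers.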

Missing Fields

identifier - It's important to try to establish a globally unique identifier for each project since many other fields will change and it will be hard to track the entry without a unique identifier. Data.gov uses the identifier field to know when an entry has been added or removed rather than updated. This field should be globally unique using a URI to avoid collisions from different catalogs when aggregated from multiple sources. This should be a required field.

provenance or source - In the spirit of reuse, it'd be helpful to know this codebase was forked or otherwise derived from a separate upstream codebase. This could be the URI of the unique identifier or the URL of the project.
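
The following sketch shows why a stable identifier matters to a harvester: with it, two successive harvests can be compared to tell added, removed, and updated entries apart. It assumes each entry carries an `identifier` key holding a unique URI; the function name is hypothetical and this is not a description of how Data.gov actually implements harvesting.

```python
def diff_catalogs(old_entries, new_entries):
    """Compare two harvests keyed by 'identifier'; return (added, removed, updated)."""
    old_by_id = {entry["identifier"]: entry for entry in old_entries}
    new_by_id = {entry["identifier"]: entry for entry in new_entries}

    added = sorted(new_by_id.keys() - old_by_id.keys())
    removed = sorted(old_by_id.keys() - new_by_id.keys())
    updated = sorted(
        identifier
        for identifier in old_by_id.keys() & new_by_id.keys()
        if old_by_id[identifier] != new_by_id[identifier]
    )
    return added, removed, updated
```

Without the identifier, a renamed project would look like one removal plus one unrelated addition rather than an update to the same entry.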

Serialization Format

I recommend JSON for many of the reasons others have stated. It has worked for Project Open Data data.json and we have built out the infrastructure to validate and harvest in this format. JSON-LD is also now the format recommended by Google for schema.org schemas and other structured data on webpages. Some have suggested YAML as an alternate because it's more human readable and easy for folks to edit, but this also means it's more likely to result in poor or inconsistent data quality for any data structure with even moderate complexity. With the initial implementation of the Project Open Data data.json schema, many folks attempted to maintain their JSON metadata by hand, but this resulted in the majority of the problems we encountered with regard to harvesting and interoperability. I would strongly suggest that we do not rely on a structured data format that is edited by hand, but agencies are free to allow for this upstream as long as they validate it when compiling their aggregate copy. It's worth noting that JSON is actually a subset of YAML, so agencies could allow either YAML or JSON from individual offices if they're using a YAML parser, but they'll still have to validate it against the final JSON schema requirements and provide a comprehensive JSON version.
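
A minimal sketch of the kind of check an agency could run before publishing an aggregate file, using only the Python standard library. The top-level `projects` key and the set of required fields are assumptions for illustration; a real deployment would more likely validate against the published JSON Schema.

```python
import json

# Hypothetical required fields; the real list would come from the final schema.
REQUIRED_PROJECT_FIELDS = ("name", "description")


def validate_inventory(text):
    """Return a list of human-readable problems; an empty list means none were found."""
    try:
        doc = json.loads(text)
    except json.JSONDecodeError as exc:
        return [f"not valid JSON: {exc.msg} (line {exc.lineno})"]

    projects = doc.get("projects") if isinstance(doc, dict) else None
    if not isinstance(projects, list):
        return ["top-level 'projects' array is missing"]

    problems = []
    for index, project in enumerate(projects):
        if not isinstance(project, dict):
            problems.append(f"projects[{index}]: entry is not an object")
            continue
        for field in REQUIRED_PROJECT_FIELDS:
            if field not in project:
                problems.append(f"projects[{index}]: missing required field '{field}'")
    return problems
```

A hand-edited file with a stray trailing comma, a common slip, is caught by the first check before any field-level validation runs.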

philipashlock avatar philipashlock commented on May 19, 2024

I've attempted an initial mapping between code.json and ADMS.SW. Note that ADMS.SW follows the same conceptual model as DCAT used for Project Open Data data.json:

To ensure that the Data Catalog Vocabulary (DCAT), the Asset Description Metadata Schema (ADMS), and the Asset Description Metadata Schema for Software (ADMS.SW) are seeded on the same structure, the RADion vocabulary was created [RADion]. RADion is shorthand for Repository, Asset, and Distribution – the three structural elements that RADion abstracts from.

In ADMS.SW, the concepts Software Repository, Software Release and Software Package are defined as specialisations of the more general concepts Repository, Asset and Distribution specified by RADion

To clarify these relationships, I created a visual diagram similar to the Schema Object Model Diagram provided for the Project Open Data version of DCAT, but this diagram includes all the fields provided by ADMS.SW rather than being pared down to just the required, optional, and extended fields as is the case with the POD diagram.

[Image: adms-schema-diagram]

The property mapping and descriptions here are based on the full ADMS.SW documentation PDF and the HTML version of the RDF schema. I would refer to those documents for full property definitions. Also note that some of the properties here are synonymous with those in DCAT even if they use a different property name or namespace.

Software Repository

A Software Repository is a system or service that provides facilities for storage and maintenance of descriptions of Software Projects, Software Releases and Software Packages, and functionality that allows users to search and access these descriptions. A Software Repository will typically contain descriptions of several Software Projects, Software Releases and related Software Packages.

An example of a Software Repository is the Apache Software Foundation Project Catalogue

| ADMS.SW Property | ADMS.SW Label | Namespace:Property | code.json Property |
| --- | --- | --- | --- |
| accessURL | Access URL | adms:accessURL | |
| created | Date of Creation | dcterms:created | |
| modified | Date of Last Modification | dcterms:modified | |
| description | Description | dcterms:description | |
| label | Name | rdfs:label | |
| supportedSchema | Supported Schema | adms:supportedSchema | |
| hasPart | Includes | dcterms:hasPart | |
| publisher | Publisher | dcterms:publisher | agency or organization |
| spatial | Spatial Coverage | dcterms:spatial | |
| themeTaxonomy | Theme Taxonomy | rad:themeTaxonomy | |

Software Project

A Software Project is a time-delimited undertaking with the objective to produce one or more software releases, materialised as software packages. Some projects are long-running undertakings, and do not have a clear time-delimited nature or project organisation. In this case, the term ‘software project’ can be interpreted as the result of the work: a collection of related software releases that serve a common purpose.

An example of a Software Project is the Apache HTTP Server Project

| ADMS.SW Property | ADMS.SW Label | Namespace:Property | code.json Property |
| --- | --- | --- | --- |
| description | Description | doap:description | project.description |
| homepage | Homepage | doap:homepage | project.homepage |
| keyword | Keyword | rad:keyword | project.tags |
| name | Name | doap:name | project.name |
| release | Release | doap:release | |
| contributor | Contributor | schema:contributor | project.partners |
| fundedBy | Funded By | admssw:fundedBy | project.partners |
| forkOf | Fork Of | admssw:forkOf | |
| developer | Developer | doap:developer | project.partners |
| documenter | Documenter | doap:documenter | project.partners |
| maintainer | Maintainer | doap:maintainer | project.contact |
| helper | Helper | doap:helper | project.partners |
| tester | Tester | doap:tester | project.partners |
| translator | Translator | doap:translator | project.partners |
| metrics | Metrics | admssw:metrics | |
| theme | Theme | rad:theme | |
| intendedAudience | Intended Audience | admssw:intendedAudience | project.governmentWideReuseProject |
| locale | Locale | admssw:locale | |
| userInterfaceType | User Interface Type | admssw:userInterfaceType | |
| programmingLanguage | Programming Language | admssw:programmingLanguage | project.languages |
| isPartOf | Repository Origin | dcterms:isPartOf | project.repository |
| operatingSystem | Operating System | schema:operatingSystem | |
| supportsFormat | Supports Format | admssw:supportsFormat | |
| status | Status | admssw:status | project.status |

Software Release

A Software Release is an abstract entity that reflects the intellectual content of the software at a particular point in time and represents those characteristics of the software that are independent of its physical embodiment. This abstract entity corresponds to the FRBR entity expression (the intellectual or artistic realization of a work). A release is typically associated with a version number.

An example of a Software Release is the Apache HTTP Server 2.2.22 (httpd) release.

| ADMS.SW Property | ADMS.SW Label | Namespace:Property | code.json Property |
| --- | --- | --- | --- |
| alternative | Alternative Name | dcterms:alternative | |
| created | Date of Creation | dcterms:created | |
| modified | Date of Last Modification | dcterms:modified | project.updated.sourceCodeLastModified |
| description | Description | dcterms:description | |
| identifier | Identifier | admssw:identifier | |
| keyword | Keyword | rad:keyword | |
| metadataDate | Metadata Date | adms:metadataDate | project.updated.metadataLastUpdated |
| name | Label | rdfs:label | |
| revision | Version | doap:revision | |
| releaseNotes | Version Notes | schema:releaseNotes | |
| assessment | Assessment | admssw:assessment | |
| contactPoint | Contact Point | adms:contactPoint | project.contact |
| includedAsset | Included Asset | admssw:includedAsset | |
| metrics | Metrics | admssw:metrics | |
| language | Language | dcterms:language | |
| logo | Logo | foaf:logo | |
| describedBy | Main Documentation | wdrs:describedby | |
| metadataLanguage | Metadata Language | adms:metadataLanguage | |
| last | Current Version | xhv:last | |
| next | Next Version | xhv:next | |
| prev | Previous Version | xhv:prev | |
| project | Project | admssw:project | |
| publisher | Publisher | dcterms:publisher | |
| relation | Related Asset | dcterms:relation | |
| relatedWebPage | Related Web Page | adms:relatedWebPage | |
| package | Package | admssw:package | |
| isPartOf | Repository Origin | dcterms:isPartOf | project.repository |
| spatial | Spatial Coverage | dcterms:spatial | |
| status | Status | admssw:status | |
| theme | Theme | rad:theme | |
| usedBy | Used By | admssw:usedBy | |

Software Package

A Software Package represents a particular physical embodiment of a Software Release, which is an example of the FRBR entity manifestation (the physical embodiment of an expression of a work). A Software Package is typically a downloadable computer file (but in principle it could also be a paper document) that implements the intellectual content of a Software Release. A particular Software Package is associated with one and only one Software Release, while all Packages of an Asset share the same intellectual content in different physical formats.

An example of a Software Package is httpd-2.2.22.tar.gz, which represents the Unix Source of the Apache HTTP Server 2.2.22 (httpd) software release.

Software often has at least two kinds of physical embodiments: a source code package and a binary package. Binary packages are sometimes compiled for different operating systems or are released under different licences, e.g. in case of dual licensing. Also scripting languages need some sort of packaging for installation systems used by end users.

| ADMS.SW Property | ADMS.SW Label | Namespace:Property | code.json Property |
| --- | --- | --- | --- |
| created | Date of creation | dcterms:created | |
| modified | Date of last modification | dcterms:modified | |
| description | Description | dcterms:description | |
| label | Name | rdfs:label | |
| software_id | Software_id | swid:software_id | |
| tagURL | Tag URL | admssw:tagURL | |
| fileSize | File size | schema:fileSize | |
| checksum | Checksum | spdx:checksum | |
| format | Format | dcterms:format | |
| license | License | dcterms:license | project.license |
| downloadUrl | Download URL | schema:downloadUrl | project.downloadURL |
| release | Release | admssw:release | |
| publisher | Publisher | dcterms:publisher | |
| status | Status | admssw:status | |
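
As a rough illustration of how the mapping above could be applied, the sketch below translates the keys of a code.json project entry into the corresponding ADMS.SW namespaced properties. It covers only a handful of the one-to-one rows from the tables and flattens the Project/Release/Package distinction, so it is a simplification rather than a faithful ADMS.SW serialization; the function name is hypothetical.

```python
# Subset of the one-to-one rows from the mapping tables above.
CODE_JSON_TO_ADMS_SW = {
    "name": "doap:name",
    "description": "doap:description",
    "homepage": "doap:homepage",
    "tags": "rad:keyword",
    "languages": "admssw:programmingLanguage",
    "status": "admssw:status",
    "license": "dcterms:license",
    "downloadURL": "schema:downloadUrl",
}


def to_adms_sw(project):
    """Split a project entry into (mapped, unmapped) dictionaries."""
    mapped, unmapped = {}, {}
    for key, value in project.items():
        target = CODE_JSON_TO_ADMS_SW.get(key)
        if target is None:
            unmapped[key] = value  # no one-to-one counterpart in this subset
        else:
            mapped[target] = value
    return mapped, unmapped
```

Returning the unmapped keys separately makes it easy to see which code.json fields still lack an ADMS.SW counterpart.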

ckaran avatar ckaran commented on May 19, 2024

As @philipashlock noted on November 7, CC0 is not considered Open Source by the OSI. Appendix A defines Open Source Software as:

Open Source Software (OSS): Software that can be accessed, used, modified, and shared by anyone. OSS is often distributed under licenses that comply with the definition of “Open Source” provided by the Open Source Initiative (https://opensource.org/osd) and/or that meet the definition of “Free Software” provided by the Free Software Foundation (https://www.gnu.org/philosophy/free-sw.html).

The first part suggests to me that anything under the CC0 license is considered to be Open Source, but the second part suggests that it isn't. What is the official consensus? How should we mark a project's openSourceProject key in their code.json file if they are using the CC0 license?

Part of my concern is how automated tools will handle the openSourceProject key for metrics purposes; if CC0 is not considered to be Open Source, then quite a few agencies will not be able to meet the 20% requirement, even though they are putting their code out there for others to use.
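
To make the metrics concern concrete, here is a small sketch that computes the share of open source projects in an inventory under two readings: one that counts only OSI-approved licenses, and one that also counts CC0. The license list is a tiny illustrative subset, the identifiers follow SPDX naming, and the sample inventory is invented.

```python
# Tiny illustrative subset of OSI-approved licenses (SPDX identifiers).
OSI_APPROVED = {"Apache-2.0", "MIT", "BSD-3-Clause"}


def open_source_share(projects, extra_open=frozenset()):
    """Fraction of projects whose license counts as open source."""
    if not projects:
        return 0.0
    accepted = OSI_APPROVED | set(extra_open)
    open_count = sum(1 for project in projects if project.get("license") in accepted)
    return open_count / len(projects)


# Invented inventory: 1 Apache-2.0 project, 2 CC0 projects, 7 with no license.
inventory = (
    [{"license": "Apache-2.0"}]
    + [{"license": "CC0-1.0"}] * 2
    + [{"license": None}] * 7
)

strict_share = open_source_share(inventory)
share_with_cc0 = open_source_share(inventory, extra_open={"CC0-1.0"})
```

In this invented case the strict reading gives 10% and the CC0-inclusive reading gives 30%, so the same inventory falls on either side of the 20% requirement depending on the definition used.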

ctubbsii avatar ctubbsii commented on May 19, 2024

@ckaran CC0 is addressed in OSI's FAQs. No decision was made by OSI whether it meets their definition of "Open Source". However, it would be useful to know what definition of "Open Source" to use when completing the openSourceProject field. Does code.gov offer a definition for the purpose of this field? Personally, I'd prefer deferring to OSI's definition, but if OSI can't reach a decision on CC0, then their definition is insufficient.
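
If the field did defer to OSI's list, a tool could derive the flag from the license identifier. The sketch below assumes licenses are recorded as SPDX identifiers and uses a small, deliberately incomplete subset of OSI-approved licenses; the set and the function name are hypothetical.

```python
# Illustrative subset of SPDX identifiers for OSI-approved licenses.
# This is NOT the full OSI list; a real tool would need the complete one.
OSI_APPROVED_SUBSET = {
    "Apache-2.0",
    "MIT",
    "BSD-2-Clause",
    "BSD-3-Clause",
    "GPL-3.0-only",
    "MPL-2.0",
}


def open_source_flag(license_id) -> int:
    """Return 1 if the license is in the OSI-approved subset, else 0."""
    return 1 if license_id in OSI_APPROVED_SUBSET else 0


for license_id in ("Apache-2.0", "CC0-1.0", None):
    print(license_id, open_source_flag(license_id))
```

Under this rule CC0-1.0 gets a 0 simply because it is absent from the list, not because anyone decided it fails the definition, which is exactly the gap described above.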

My personal recommendation is to avoid "traps" like CC0, where it's "open" with respect to copyright, but patent use rights are explicitly not conferred. MIT and BSD avoid the question entirely (not explicitly conferred), and GPL tends to impose restrictions on consumers that I don't think the government should be in the business of imposing, so I prefer ASL 2.0, myself, for government-released open source projects. (ASL 2.0 also provides a convention to use a NOTICE file for copyright notices, separate from the license, where it would be appropriate to add a brief text noting the license is not applicable domestically for the portions of code produced exclusively by government employees on behalf of the U.S. government.)

jbjonesjr avatar jbjonesjr commented on May 19, 2024

cc/ @benbalter if you have specific thoughts to share:

ckaran avatar ckaran commented on May 19, 2024

@ctubbsii You're right about the problems of patents, etc. with regard to CC0. The lab I work for has been working to avoid the problem by requiring that all external contributors sign a contributor license agreement (CLA) before their contributions are included in any of the lab's projects (you can read the policy here). The lab's lawyers believe that will solve the issues directly related to patents and other IP rights.

Note that the policy was adopted by the lab on 19 Dec 2016, but it already has one issue: we can't currently post our CLA, nor can we accept CLAs at the current time, because (by design) an executed CLA will contain what can be argued to be personally identifiable information (PII). The lawyers I've talked to tell me that means the lab must obey the Privacy Act, which requires some more work. So, if you read the policy and expect that we'll be able to start accepting contributions immediately, I'm sorry to say that we can't.

ctubbsii avatar ctubbsii commented on May 19, 2024

@ckaran Another great thing about ASL 2.0: it contains an embedded CLA, and defines "Contributor" and "Contributions". No need for a separate CLA 😄

ckaran avatar ckaran commented on May 19, 2024

@ctubbsii Honestly, if we could, I would recommend the standard OSI-approved licenses, including the ASL 2.0, for exactly that reason. Unfortunately, most of the work produced by my lab doesn't have copyright attached, which means that copyright-based licenses may fail in court.

ctubbsii avatar ctubbsii commented on May 19, 2024

@ckaran Not sure what you mean by "doesn't have copyright attached". My guess is that you mean "public domain" or you simply mean that nobody is interested in asserting copyright. If it's the former, it probably only applies domestically. The creators still may own copyright internationally, so a license is still worth recommending. If you mean the latter, well, omission of a copyright notice does not disclaim copyrights.

If a work isn't covered by copyright (because it's public domain, for instance), an infringement claim would certainly fail in court... but I'm not sure why that matters. That only matters if the creators intend to enforce/assert their copyright claims in the face of a particular infringement, via a lawsuit. If you know the work is public domain in the jurisdiction where the violation occurred... simply don't pursue it with a lawsuit in that case... it's really as simple as that.

The license still communicates the limitations of the rights granted in jurisdictions where copyright is applicable (who cares if it's void in jurisdictions where it's not applicable?) and communicates a minimum set of rights guaranteed to everybody else. This instills confidence in the project's users, allowing them to use it according to the license conditions without fear of reprisal. Often (as in the case of ASL 2.0), it also explicitly conveys the rights the project expects contributors to grant, in order for the contributions to be accepted into the project (in other cases, this might be implicit). This is valuable to a project, even if some portions of the project are not subject to copyright protections (public domain).

ckaran avatar ckaran commented on May 19, 2024

@ctubbsii Sorry, I've been talking with our legal counsel for too long. Yes, I mean works that are in the public domain. I've talked with the appropriate people in the Justice Department to see if US Government works have copyright outside of the US. They told me that the US Government's position is that it does, but the lawyer I spoke with wasn't able to find any case law to back that up. What's more, it would have to be litigated in the courts of each nation individually, so there isn't a single 'right' answer.

As for why all this is important, it comes down to severability and warranty/liability. Assume that some Government work is licensed under the Apache License 2.0, which is a license that depends on copyright. Someone can sue the Government claiming that the clauses that depend on copyright are void and, because there is no severability clause, that all the other clauses are void too. If a court agrees that the license as a whole is void simply because the US Government doesn't have copyright within the US, then that includes the clauses regarding warranty and liability, which means that the Government might be on the hook for damages in some manner[1]. Moreover, downstream users/projects may also have problems[1].

For works that have copyright and are contributed to the Government, I think that the Government would be OK with any of the standard OSI-approved licenses. However, work that is created by Government employees might be in the public domain, so then you have a weird mix of stuff that is protected by the license, and stuff that might not be[1]. Will this cause an issue? I don't know, but I'm not interested in finding out.

[1] I'm not a lawyer, this is not legal advice, and as far as I know, this has not yet been litigated in a court.

ctubbsii avatar ctubbsii commented on May 19, 2024

@ckaran Oh, I see. Perhaps code.gov should fork ASL 2.0 (which is permitted) and add a severability clause. (Note: I'm currently promoting a discussion on the Apache Mailing Lists about adding this in some future version of the license, perhaps 2.1).

ckaran avatar ckaran commented on May 19, 2024

@ctubbsii I've thought about forking it, but that could also start to fork Open Source (there would be questions about which licenses are compatible with which others, which could be problematic; @massonpj, is this a good assessment?).

@ctubbsii I've seen your discussions on the ASL lists; I think that is the best way to go. Not only could everyone (Government and private) use the same license, it would also mean that the license is OSI-approved, which the forked license may not be. This matters because some journals will only accept code that is under an OSI-approved license; JOSS is one of them. See the discussion here for some of the issues.

Basically, what I want are modifications to the standard Open Source licenses that ensure that works without copyright attached get all of the following:

  1. As many of the protections that the various licenses give as possible, for both code that has copyright and code that is in the public domain. At a minimum, this has to include warranty, liability, and IP protections[1].
  2. Protection for anyone who uses the code or includes it in their own works.
  3. Full interoperability with the standard licenses (forked licenses might not be interoperable; updated licenses will be by definition).

[1] Public domain code by definition doesn't have copyright protections, but in a mixed work that has some copyrighted material and some public domain material, the copyrighted material should not be effectively reduced to being public domain; if that was what the authors had intended, they would have put it in the public domain. That means that the license has to be inherently flexible enough to handle this case. IP protection here means that public domain work doesn't get hammered by patent headaches from contributions.

massonpj avatar massonpj commented on May 19, 2024

@ckaran

> is this a good assessment

Yes.

@ctubbsii, while anyone can create their own license, the OSI's License Review Process, "ensures that licenses and software labeled as 'open source' conforms to existing community norms and expectations." Simply creating a new license and labeling it an "open source software license" is not good.
