
data.gov's Introduction

data.gov main repository

Data.gov Actions

Site / Repo Badges
  • catalog.data.gov: 1 - Commit, 2 - Publish & Deploy, 3 - Restart Apps, 4 - Automated CKAN Jobs, 5 - Run CKAN Command, Check for Snyk Vulnerabilities
  • inventory.data.gov: commit, deploy, restart application, Update Inventory Publishers List, Check for Snyk Vulnerabilities
  • egress actions: disable egress proxy, enable egress proxy, restart egress proxy
  • ssb egress actions: disable egress proxy, enable egress proxy, restart egress proxy
  • data.gov: Build & Test
  • www-redirects: deploy
  • datagov-ssb: commit, plan, apply
  • resources.data.gov: Build & Test, QA
  • strategy.data.gov: Build & Test, QA
  • federation.data.gov: Build, QA
  • sdg.data.gov: Build Static Site, QA Static Site

This is the main repository for the Data.gov Platform. We use this repository primarily to track our team's work, but also to house datagov-wide code (GitHub Actions templates, egress, etc.). If you are looking for documentation for cloud.gov environments, see the application repositories:

  • www.data.gov (Static site on Federalist) [repo]
  • catalog.data.gov (CKAN 2.9) [repo]
  • inventory.data.gov (CKAN 2.9) [repo]
  • dashboard.data.gov (CodeIgniter PHP) [repo]

GitHub Actions and Templates

A number of our GitHub Actions have been refactored to use templates in this repository. See the templates here, and examples of invoking them in Inventory and Catalog.

data.gov's People

Contributors

adborden, anup-khanal, avdata99, btylerburton, chris-macdermaid, codeshtuff, dano-reisys, dependabot-preview[bot], dependabot[bot], eric-asongwed, fuhuxia, hareeshreddyg, hkdctol, jasonschulte, jbrown-xentity, jin-sun-tts, jjediny, kvuppala, mogul, neilhunt1, nickumia-reisys, pburkholder, philipashlock, pjsharpe07, robert-bryson, rshewitt, starsinmypockets, thejuliekramer, woodt, ydave-reisys


data.gov's Issues

Add this repository to the GSA code inventory

(I work in GSA IT, Office of the CTO. I am submitting this as part of our work to ensure GSA complies with the new Federal Source Code Policy.)

GSA needs to create an inventory of all agency source code, whether open source or closed source. The inventory we create will appear on Code.gov. The inventory will contain basic information about each source code repository, but will not include the source code itself. Please read the implementation guide and use it to submit this repository to the inventory by December 5.

Basically, please do one of the following, the details of which are described in the implementation guide:

Let me know if you would like me to open a PR with an example .codeinventory.yml file.
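For reference, a sketch of what a .codeinventory.yml might contain; the field names below are illustrative recollections of the format and should be verified against the implementation guide rather than copied as-is:

```yaml
# Illustrative only - verify field names and values against GSA's
# implementation guide before submitting.
name: datagov-deploy
description: Main repository for the Data.gov platform.
license: https://creativecommons.org/publicdomain/zero/1.0/
openSourceProject: 1
governmentWideReuseProject: 1
tags:
  - data.gov
  - ckan
contact:
  email: # team contact address goes here
```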

Please let me know if you have any questions.

Thanks!


References:

Jenkins Access

We need access to Jenkins with at least the following:

  1. Within a folder, create/edit/destroy jobs
  2. Latest Pipeline plugin suite installed
  3. Ability to create and use credentials

S3 Key/Credentials Auto Refresh

Three of our current instances (inventory, catalog, and wordpress) write to an S3 bucket. CKAN's plugin for this can use the instance's credentials (IAM role) without a keypair. However, the WordPress plugin needs a key:

Current Options:

  • create a cron job to fetch the instance's credentials from the host and write the contents to a shared directory that the plugin can read from, using those as the key (these credentials are rotated every 24 hours).
  • get an issued IAM keypair for the WordPress plugin to write to S3 (we'd need to account for rotating the keypair every 90 days).
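A minimal sketch of the cron-job option, assuming the standard EC2 instance-metadata credentials endpoint and that jq is available on the host; the output path is a placeholder:

```shell
# Sketch: fetch the instance profile's temporary credentials from the EC2
# metadata service and write them where the WordPress plugin can read them.
METADATA="http://169.254.169.254/latest/meta-data/iam/security-credentials"
OUT="${CREDS_OUT:-/var/shared/s3-credentials.env}"   # placeholder path

# Render the env-file body from a keypair; split out so it is easy to test.
render_creds() {
  printf 'AWS_ACCESS_KEY_ID=%s\nAWS_SECRET_ACCESS_KEY=%s\n' "$1" "$2"
}

# Only attempt the metadata service when it is reachable (i.e. on EC2).
if ROLE_NAME="$(curl -sf --max-time 2 "$METADATA/")"; then
  CREDS_JSON="$(curl -sf --max-time 2 "$METADATA/$ROLE_NAME")"
  ACCESS_KEY="$(printf '%s' "$CREDS_JSON" | jq -r .AccessKeyId)"
  SECRET_KEY="$(printf '%s' "$CREDS_JSON" | jq -r .SecretAccessKey)"
  umask 077   # the key file must not be world-readable
  render_creds "$ACCESS_KEY" "$SECRET_KEY" > "$OUT"
fi
```

Scheduled from cron, this keeps the file current as the instance profile's credentials rotate.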

Setup ElasticSearch logging through fluentd

Set up an AWS-managed Elasticsearch domain and configure the fluentd agents to output their logs to it, probably using the fluentd Elasticsearch plugin. Verify this can all be provisioned using our catalog-deploy playbooks.

Initially try this on dev instances in the OCSIT account; then move to prod instances in BSP and ensure they can communicate with the Elasticsearch domain.
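As a sketch, the agent-side output section with the fluentd Elasticsearch plugin might look like the following; the tag pattern and the AWS ES endpoint are placeholders:

```
# Illustrative td-agent.conf fragment (endpoint and tag pattern are placeholders)
<match datagov.**>
  @type elasticsearch
  host search-datagov-logs-xxxxxxxx.us-east-1.es.amazonaws.com
  port 443
  scheme https
  logstash_format true   # daily indices, Kibana-friendly
</match>
```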

Setup Role for Host Hardening and Baseline Configuration - CIS Ubuntu Benchmark

https://github.com/GSA/cis-ubuntu-ansible
http://cis-ubuntu-ansible.readthedocs.io/en/latest/CIS_Ubuntu/

The Center for Internet Security Benchmark for Ubuntu 14.04 is a set of best practices for "hardening" the host operating system. The baseline image is then used as the base that applications are built on top of.

  • Fix or Remove currently failing/skipped tasks
  • Use Ansible tags to map tasks to NIST-800-53 security control groups for future auditing

Setup CloudWatch FluentD integration

We would like to set up bi-directional CloudWatch/FluentD integration.

Our CloudWatch metrics should pump into our FluentD aggregator. We would presumably use a plugin such as this.

Our FluentD agents should pump our application logs to CloudWatch so that we can see them in the AWS console and register alerts on them, possibly using this plugin.
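For the agent-to-CloudWatch direction, a hypothetical td-agent fragment using the fluent-plugin-cloudwatch-logs output; the group/stream names and region are placeholders:

```
# Illustrative fragment for fluent-plugin-cloudwatch-logs
<match app.**>
  @type cloudwatch_logs
  region us-east-1
  log_group_name datagov-app
  log_stream_name "#{Socket.gethostname}"   # one stream per host
  auto_create_stream true
</match>
```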

Automate the CKAN sysadmin user creation

For CKAN, the first sysadmin account is set up interactively at a command prompt using the paster command:
ckan sysadmin add {{ user }}

  • Set up a Linux expect script to launch the ckan sysadmin add command, using Ansible variables injected as part of the secrets file.

Approach:

  • setup an expect script to automate the interactive session?
  • use Ansible directly for the interactive session?
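A rough sketch of the expect option: generate the expect script from the injected variables, then pipe it to expect(1). The prompt strings and config path below are assumptions and should be checked against an actual interactive session:

```shell
# Emit an expect script that answers the interactive password prompts.
# $1 = username, $2 = password (both would come from the Ansible secrets file).
make_expect_script() {
  cat <<EOF
spawn ckan sysadmin add $1 -c /etc/ckan/production.ini
expect "Password:"
send "$2\r"
expect "Confirm password:"
send "$2\r"
expect eof
EOF
}

# On a real host, Ansible would template the secrets in and pipe this to expect:
#   make_expect_script "{{ sysadmin_user }}" "{{ sysadmin_pass }}" | expect -
```

Generating the script rather than hard-coding it keeps the credentials out of the playbook itself.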

Setup Graylog2 Stack

Pending confirmation from BSP, we will set up a Graylog2 server, presumably using Ansible. This would include Elasticsearch (the AWS Elasticsearch service only exposes HTTP endpoints, while Graylog2 makes backend TCP calls to ES, so we can't use AWS's service) and probably other services as well.

There is an open-source role for this, but it would definitely need to be modified.

Efficient handling of cron jobs and catalog harvester on BSP

As most of the data.gov instances have two web servers in BSP, we need to make sure that any cron jobs in dashboard/wordpress, and the CKAN catalog harvester gather/consume processes, don't duplicate work or data.

Remediation

  • For Dashboard, we can make sure to enable the cron jobs only on the main harvester.
  • For CKAN Catalog, both harvesters can be configured so that the gather function is not duplicated, while the consume process is split and runs on both harvesters.
  • Also for catalog, all other back-end cron jobs (QA, topics CSV snapshot, etc.) can be configured on the second harvester.
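The "only on the main harvester" idea can be sketched as a small cron guard that gates each job on the host's name, so only one of the two servers runs it; the primary hostname and paths here are placeholders:

```shell
# Elect a single host to run shared cron jobs; the value is a placeholder.
PRIMARY_HOST="${PRIMARY_HOST:-catalog-harvester1}"

run_if_primary() {
  # Run the given command only when this host is the elected primary.
  if [ "$(hostname -s)" = "$PRIMARY_HOST" ]; then
    "$@"
  fi
}

# Example crontab line using the guard (paths are placeholders):
#   0 2 * * *  . /etc/datagov/cron-guard.sh && run_if_primary /usr/local/bin/qa-report.sh
```

The same crontab can then be deployed to both servers without duplicating the work.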

Setup branching/merging strategy in git

Would be good if we weren't all committing directly to master. We should consider at minimum using feature branches and tagging our commits with issue numbers.

Build Failing - Error importing PyZ3950

Not sure if this is an issue with the order of scripts or an issue with importing or installing the PyZ3950 library. In general, we should also try to maintain the py/PIL dependencies by forking the stable versions or using pip/PyPI instead of pointing to an unmaintained remote git repo...

https://github.com/GSA/ckanext-geodatagov/blob/master/pip-requirements.txt#L8

TASK: [ckan-db | install postgis script] **************************************
changed: [default]

TASK: [ckan-db | Initialize ckan db] ******************************************
failed: [default] => {"changed": true, "cmd": ["ckan", "db", "init"], "delta": "0:00:02.136923", "end": "2015-11-23 01:51:34.372962", "rc": 1, "start": "2015-11-23 01:51:32.236039", "warnings": []}
stderr: Traceback (most recent call last):
File "/usr/bin/ckan", line 59, in
load_entry_point('PasteScript', 'console_scripts', 'paster')()
File "/usr/lib/ckan/lib/python2.7/site-packages/paste/script/command.py", line 104, in run
invoke(command, command_name, options, args[1:])
File "/usr/lib/ckan/lib/python2.7/site-packages/paste/script/command.py", line 143, in invoke
exit_code = runner.run(args)
File "/usr/lib/ckan/lib/python2.7/site-packages/paste/script/command.py", line 238, in run
result = self.command()
File "/usr/lib/ckan/src/ckan/ckan/lib/cli.py", line 208, in command
self._load_config(cmd!='upgrade')
File "/usr/lib/ckan/src/ckan/ckan/lib/cli.py", line 148, in _load_config
load_environment(conf.global_conf, conf.local_conf)
File "/usr/lib/ckan/src/ckan/ckan/config/environment.py", line 232, in load_environment
p.load_all(config)
File "/usr/lib/ckan/src/ckan/ckan/plugins/core.py", line 138, in load_all
load(*plugins)
File "/usr/lib/ckan/src/ckan/ckan/plugins/core.py", line 152, in load
service = _get_service(plugin)
File "/usr/lib/ckan/src/ckan/ckan/plugins/core.py", line 258, in _get_service
return plugin.load()(name=plugin_name)
File "/usr/lib/ckan/lib/python2.7/site-packages/pkg_resources.py", line 2048, in load
entry = import(self.module_name, globals(),globals(), ['name'])
File "/usr/lib/ckan/src/ckanext-geodatagov/ckanext/geodatagov/harvesters/init.py", line 15, in
from ckanext.geodatagov.harvesters.z3950 import Z3950Harvester
File "/usr/lib/ckan/src/ckanext-geodatagov/ckanext/geodatagov/harvesters/z3950.py", line 3, in
from PyZ3950 import zoom
ImportError: No module named PyZ3950

FATAL: all hosts have already failed -- aborting

PLAY RECAP ********************************************************************
to retry, use: --limit @/home/john/site.retry

default : ok=44 changed=29 unreachable=0 failed=1

Ansible failed to complete successfully. Any error output should be
visible above. Please fix these errors and try again.

deploy catalog and inventory through Ansible

Create the AMI using packer

  • CKAN Catalog (Solr, FrontEnd, Harvester)
  • CKAN Inventory (Solr, FrontEnd, Datapusher)
  • CKAN Catalog FGDC to ISO conversion license key approach (encrypt and decrypt)

labs.data.gov/dashboard

@alex-perfilov-reisys emailed about the ansible build of the Project Open Dashboard:
https://github.com/GSA/project-open-data-dashboard-deploy/tree/ansible

Setup Role for Fluentd and Graylog2 Stack - Centralized Log Management/Monitoring

This is the master issue for log management. Most work should be performed in the other issues; in fact, we may close this once those are all created and associated with the milestone.

Setup a Master Log Aggregator (Fluentd), Storage/Search (AWS Elasticsearch), and Dashboard/Event Management (Graylog2) for centralized logging and monitoring.

  • Setup inventory of td-agents/plugins for each Ansible --tag
    • Nginx
    • Apache2
    • Tomcat7
    • Python interpreter
    • PHP interpreter
  • Setup Role for main Graylog2 server
  • Setup AWS Elasticsearch Service
  • Config AWS CloudWatch IAM and Plugin

SSL Configuration on Application Server

Has anyone come across scenarios where user login and logout drop to port 80 (HTTP) even though I've configured my load balancer to listen only on port 443 (HTTPS)?

I've followed all the steps here - https://github.com/ckan/ckan/wiki/SSL.

My site starts from https://mysite.com
But when I try to log out/log in, it tries to reach http://mysite.com/user/login?came_from=/user/logged_in
After a while, my browser's connection is reset with a grey screen.
I can tell that I'm logged in because I can get to https://mysite.com/dashboard, but only if I enter the URL manually after the connection reset.

One thing I noticed is that my SSL certs are already installed at the load balancer level (we're using AWS).
I'm not sure whether giving the SSL cert and key again in the NGINX config would work.

This whole SSL configuration is new to me, so I'm not sure where to look for answers.
Would really appreciate it if anyone has any answers here.

Thanks!


Turns out you need to add this chunk of code to your apache.wsgi file in /etc/ckan/default/apache.wsgi

environ['wsgi.url_scheme'] = environ.get('HTTP_X_URL_SCHEME', 'https')
return _application(environ, start_response)

Now it's redirecting all requests to https
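For that environ lookup to see anything, the proxy in front has to send the header. A hypothetical nginx fragment (upstream address is a placeholder; WSGI exposes the X-Url-Scheme request header as HTTP_X_URL_SCHEME):

```nginx
# Illustrative nginx fragment; the upstream address is a placeholder.
location / {
    proxy_set_header X-Url-Scheme $scheme;  # read as HTTP_X_URL_SCHEME in WSGI
    proxy_set_header Host $host;
    proxy_pass http://127.0.0.1:8080;
}
```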

Ubuntu AMI bug fixes

Dan reported the following issues with the current Ubuntu AMI

  1. git is unusable: I have worked on this one and can get it to function correctly; however, it seems to be affected by the Packer build, because after Packer builds, git doesn't seem to work. Will retry it without Packer since I'm not sure Packer is critical anymore.
  2. there is an issue with hostname resolution (will try this solution: echo 127.0.0.1 $(hostname) >> /etc/hosts)
  3. sometimes I can't install packages (will try this solution: sed -i '/postdrop/d' /var/lib/dpkg/statoverride && sed -i '/ssl-cert/d' /var/lib/dpkg/statoverride)

Production CI/CD Pipeline

Packer may or may not be available to us in the new environment. @dano-reisys @neilhunt1, let's re-propose all alternative approaches for discussion:

Including but not limited to:

  • Replacing Packer with Jenkins/Ansible for AMI Creation (not efficient but neither is not using Terraform for infrastructure creation 🎱 )
  • Spectrums of using Jenkins (100%) <-> Ansible (80%)

Digging deeper and deeper into the rabbit hole of all options and approaches... CI/CD with ansible has two core approaches with no downtime:

  • Pre-baking Images (AMI) to an AWS Auto-Scaling Group.
  • Rolling updates to a highly available set of services for each app, coordinating with a web proxy to drain traffic away from the machine being provisioned, then cycling through instances until all are updated.

Prebaked Images (compared to) Rolling updates

Pros

  • Simplest approach to vertically scale the application
  • Most cost effective, compared to having to run the application in high availability with preferably an odd number (3-5) of instances per service/application. While not a best practice, rolling updates can be done with 1 load balancer/proxy and 2 app servers, particularly if you have a mirror of the same setup in another region/availability zone. That's still 6 always-on instances for rolling updates, compared to a minimum of 2-3 (with the maximum scaled entirely by traffic demand, CPU/memory thresholds, or time of day), which is more economical.
  • AMI(s) are their own artifact
  • Immutable/fault tolerant when scaled exponentially
  • Talks AWS native speak to ELBs.
  • Everything is ready to run out of the gate (near 0 downtime)

Cons

  • Managing state is a nightmare if the system needs state.
  • Is rigged for A/B testing scenarios.
  • Doesn't pay to invest in if the "service" for a "Service" could otherwise have been handled by one instance running over a day or week... especially if CPU/memory demands aren't related to daily traffic and production code is pushed infrequently.
  • Images are large and cumbersome to push out for minor code updates.
  • Expensive in the cheap sense: they are heavy to make, but only when code is taken up by Jenkins and then offloaded to S3 for 5x revert-ready rollbacks. Counter-intuitive with the last point; this is a pro or con depending on the reality of traffic/update frequency.
Git --> [ Jenkins {ansible} Jenkins ] ---> AMI [ ] --->  |  <---[ ASG ]--->   |  ----> S3

OR

             [ Ansible ]         
        /       /        \       \    
   [ X ]    [ X ]    [ X ]   [ X ]
     2\        4\         6\        8\
[ 1 ][ 2 ] [ 3 ][ 4 ] [ 5 ][ 6 ] [ 7 ][ 8 ] 

              .... Then.....

             [ Ansible ]         
        /       /        \       \    
   [ X ]    [ X ]    [ X ]   [ X ]
    /1         /3         /5        /7
[ 1 ][ 2 ] [ 3 ][ 4 ] [ 5 ][ 6 ] [ 7 ][ 8 ] 

Document Dev/Test/Deploy Commands

Requirements:

  • boto3 (for infrastructure provisioning only): https://github.com/boto/boto3
  • ansible-secret.txt: export ANSIBLE_VAULT_PASSWORD_FILE=~/ansible-secret.txt
  • run all provisioning/app deployment commands from repo's ansible folder
  • for wordpress/dashboard/crm run the following command within the role's root folder before you provision anything: ansible-galaxy install -r requirements.yml
  • {{ inventory }} can be:
    • inventories/staging/hosts
    • inventories/production/hosts
    • inventories/local/hosts

Provision Infrastructure

VPC:

create vpc:
ansible-playbook create_datagov_vpc.yml -e "vpc_name=datagov"

delete vpc:
ansible-playbook delete_datagov_vpc.yml -e "vpc_name=datagov"

Catalog:

create stack:
ansible-playbook create_catalog_stack.yml -e "vpc_name=datagov"

delete stack:
ansible-playbook delete_catalog_stack.yml -e "vpc_name=datagov"

Provision apps

Wordpress:

provision vm: ansible-playbook datagov-web.yml -i {{ inventory }} --skip-tags="deploy-rollback" --limit wordpress-web
deploy app: ansible-playbook datagov-web.yml -i {{ inventory }} --tags="deploy" --limit wordpress-web
deploy rollback: ansible-playbook datagov-web.yml -i {{ inventory }} --tags="deploy-rollback" --limit wordpress-web

Dashboard

provision vm: ansible-playbook dashboard-web.yml -i {{ inventory }} --skip-tags="deploy-rollback" --limit dashboard-web
deploy app: ansible-playbook dashboard-web.yml -i {{ inventory }} --tags="deploy" --limit dashboard-web
deploy rollback: ansible-playbook dashboard-web.yml -i {{ inventory }} --tags="deploy-rollback" --limit dashboard-web

CRM

provision vm: ansible-playbook crm-web.yml -i {{ inventory }} --skip-tags="deploy-rollback" --limit crm-web
deploy app: ansible-playbook crm-web.yml -i {{ inventory }} --tags="deploy" --limit crm-web
deploy rollback: ansible-playbook crm-web.yml -i {{ inventory }} --tags="deploy-rollback" --limit crm-web

Catalog:

provision vm - web: ansible-playbook catalog.yml -i {{ inventory }} --tags="frontend,ec2" --skip-tags="solr,db,cron" --limit catalog-web

provision vm - harvester: ansible-playbook catalog.yml -i {{ inventory }} --tags="harvester,ec2" --skip-tags="apache,solr,db,saml2,redis" --limit catalog-harvester

provision vm - solr: ansible-playbook catalog.yml -i {{ inventory }} --tags="solr" --limit catalog-solr

Inventory

provision vm - web: ansible-playbook inventory.yml -i {{ inventory }} --skip-tags="solr,db" --limit inventory-web
deploy app: ansible-playbook inventory.yml -i {{ inventory }} --tags="deploy" --skip-tags="solr,db" --limit inventory-web
provision vm - solr: ansible-playbook inventory.yml -i {{ inventory }} --tags="solr" --limit inventory-solr

Labs Dashboard - External service interaction (DNS and HTTP)

The service sends out DNS and HTTP requests when provided a URL. While this appears to be the intention of the underlying code, the URLs are not restricted and can therefore be pointed at a website which contains malicious code. We suggest that the submitted URLs be restricted in some fashion.

Verified

Remediation

  • None

Justification

A user is able to enter a URL for a file and/or directly upload the file for validation purposes. These user-provided requests/parameters cannot be shared with another user as a URL or within a session, so the client/user is isolated from using this to affect other users.

Setup Role for Web Application Firewall (WAF) - OWASP Modsecurity on NGINX and Apache

https://www.owasp.org/index.php/Category:OWASP_ModSecurity_Core_Rule_Set_Project
http://www.modsecurity.org/download.html

The OWASP ModSecurity CRS provides protections for the following attack/threat categories:

  • HTTP Protection - detecting violations of the HTTP protocol and a locally defined usage policy.
  • Real-time Blacklist Lookups - utilizes 3rd Party IP Reputation
  • HTTP Denial of Service Protections - defense against HTTP Flooding and Slow HTTP DoS Attacks.
  • Common Web Attacks Protection - detecting common web application security attacks.
  • Automation Detection - Detecting bots, crawlers, scanners and other surface malicious activity.
  • Integration with AV Scanning for File Uploads - detects malicious files uploaded through the web application.
  • Tracking Sensitive Data - Tracks Credit Card usage and blocks leakages.
  • Trojan Protection - Detecting access to Trojan horses.
  • Identification of Application Defects - alerts on application misconfigurations.
  • Error Detection and Hiding - Disguising error messages sent by the server.

Setup Role for Host Intrusion Detection - OSSEC-HIDS

http://ossec.github.io/
https://github.com/JJediny/ansible-ossec-server

OSSEC is an open-source, host-based intrusion detection system (HIDS) that performs log analysis, integrity checking, Windows registry monitoring, rootkit detection, time-based alerting, and active response.

  • File Integrity Checking: checks whether something suspicious was written to the file and reports it.
  • Log Analysis: uses the syslog log data already available on a computer and analyzes it, trying to detect some anomaly.
  • Rootkit Detection: is able to detect the presence of rootkits.
  • Policy Monitoring: checks whether all the components in the system adhere to the security policy.
  • Active Response: is able to actively respond to the malicious threat, so we can eliminate it as soon as possible.
