CEDAR.GitHub.Collector

Introduction

CEDAR.GitHub.Collector is a set of Azure Functions to collect engineering metadata from GitHub. It consists of four collectors:

  1. Main: the main collector processes the data coming directly from the GitHub Webhooks
  2. Delta: the delta collector makes requests against the EventsTimeline API to ensure that data is not missed through the main collector
  3. Onboarding: the onboarding collector collects the current state of a given GitHub repository / organization
  4. Traffic: the traffic collector collects Traffic API data

Contributing

This project welcomes contributions and suggestions. Most contributions require you to agree to a Contributor License Agreement (CLA) declaring that you have the right to, and actually do, grant us the rights to use your contribution. For details, visit https://cla.opensource.microsoft.com.

When you submit a pull request, a CLA bot will automatically determine whether you need to provide a CLA and decorate the PR appropriately (e.g., status check, comment). Simply follow the instructions provided by the bot. You will only need to do this once across all repos using our CLA.

This project has adopted the Microsoft Open Source Code of Conduct. For more information see the Code of Conduct FAQ or contact opencode@microsoft.com with any additional questions or comments.

1. Download Required Software and Extensions

Developing and debugging CEDAR.GitHub.Collector is easiest using Visual Studio (2017 or later) with the Azure Functions Tools extension.

Download Visual Studio here: https://visualstudio.microsoft.com/downloads/

The Azure Functions Tools extension can be installed during the VS installation process or added after download.

2. Fork Repository

Create a fork of this repository and open the GitHub.Collectors.sln solution file in Visual Studio.

3. Create local.settings.json

Create a local.settings.json file under the GitHub.Collectors.Functions project.

Find the file local.settings.barebones.template.json and copy its contents into your new local.settings.json file.

Add your GitHub account username under the key “Identity”.

Add a Personal Access Token associated with your GitHub account under the key “PersonalAccessToken”.

4. Set Up Azure Storage

In Azure, create a storage account where the data collected from GitHub will be saved.

Paste the connection string of this new storage account into your local.settings.json file under the key “AzureWebJobsStorage”.

5. Set Up Application Insights

In Azure, create an Application Insights resource where telemetry from your function executions will be sent.

Add the instrumentation key from this resource to your local.settings.json file under the key “APPINSIGHTS_INSTRUMENTATIONKEY”.
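
Once steps 3 to 5 are done, a quick check can confirm that the four keys named so far are present and non-empty. The following is a minimal sketch, not part of the repository: the helper names `find_key` and `missing_settings` are hypothetical, and because the layout of the template file is not shown here, the sketch simply looks for each key at any nesting depth.

```python
import json

# Keys that steps 3 to 5 ask you to fill in.
REQUIRED_KEYS = [
    "Identity",
    "PersonalAccessToken",
    "AzureWebJobsStorage",
    "APPINSIGHTS_INSTRUMENTATIONKEY",
]


def find_key(node, key):
    """Return the first value stored under `key` at any depth, or None."""
    if isinstance(node, dict):
        if key in node:
            return node[key]
        children = node.values()
    elif isinstance(node, list):
        children = node
    else:
        return None
    for child in children:
        found = find_key(child, key)
        if found is not None:
            return found
    return None


def missing_settings(settings):
    """List the required keys that are absent or have an empty value."""
    return [key for key in REQUIRED_KEYS if not find_key(settings, key)]


def load_settings(path):
    """Parse a JSON settings file such as local.settings.json."""
    with open(path, encoding="utf-8") as handle:
        return json.load(handle)
```

Calling `missing_settings(load_settings("local.settings.json"))` from the project folder should return an empty list when every key has a value.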

6. Create Settings.json

Create a Settings.json file in the GitHub.Collectors.Functions project.

Find the file Settings.barebones.template.json and copy its contents into your new Settings.json file.

Add your GitHub account username under the key “Identity”.

7. Upload Settings.json

Create a github-settings Blob container in your Azure Storage account.

Open the container and upload Settings.json.

8. Run the Azure Functions Locally with Visual Studio

In Visual Studio, select the Debug solution configuration and run GitHub.Collectors.Functions.

Test the Onboarding Collector

Create a storage queue named onboarding. To test the Onboarding function, onboard this repository by adding the following message to the onboarding queue in your storage account:

{
    "OrganizationId": 6154722,
    "OrganizationLogin": "microsoft",
    "RepositoryId": 282058629,
    "RepositoryName": "CEDAR.GitHub.Collector",
    "OnboardingType": "Repository",
    "IgnoreCache": true
}

After the function has completed you should be able to see your collected data under the github blob container in your storage account.
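
If you prefer to build the queue message in a script rather than paste it by hand, the sketch below serializes the same fields and Base64-encodes the result. It is illustrative only: the helper names are hypothetical, and it assumes the queue trigger expects a Base64-encoded body, as the Troubleshooting section suggests. It does not talk to Azure; it only produces the string to enqueue.

```python
import base64
import json


def build_onboarding_message(org_id, org_login, repo_id, repo_name):
    """Serialize an onboarding request using the field names shown above."""
    return json.dumps({
        "OrganizationId": org_id,
        "OrganizationLogin": org_login,
        "RepositoryId": repo_id,
        "RepositoryName": repo_name,
        "OnboardingType": "Repository",
        "IgnoreCache": True,
    })


def encode_for_queue(message):
    """Base64-encode the UTF-8 bytes of a message body."""
    return base64.b64encode(message.encode("utf-8")).decode("ascii")


# Values taken from the example message above.
body = build_onboarding_message(
    6154722, "microsoft", 282058629, "CEDAR.GitHub.Collector"
)
encoded = encode_for_queue(body)
```

Generating the body this way also avoids stray characters picked up by copying and pasting.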

Test the Traffic Collector

Create a storage queue named traffic. Test the Traffic function by adding the following message to the traffic queue in your storage account:

{
    "OrganizationId": 6154722,
    "OrganizationLogin": "microsoft",
    "RepositoryId": 282058629,
    "RepositoryName": "CEDAR.GitHub.Collector"
}

After the function has completed you should be able to see your collected data under the github blob container in your storage account.

Test the Webhook Collector

To test the webhook collector, post a payload that matches (or closely resembles) a GitHub webhook payload to your localhost endpoint. Using your favorite HTTP client (e.g., Postman), post the following example headers and message body to http://localhost:7071/api/ProcessWebHook:

Headers:

X-GitHub-Delivery: <any GUID of your choice>
X-GitHub-Event: <a valid GitHub event, e.g., issues>

Body:

{
  "action": "opened",
  "issue": {
    "url": "https://api.github.com/repos/microsoft/CEDAR.GitHub.Collector/issues/5",
    "repository_url": "https://api.github.com/repos/microsoft/CEDAR.GitHub.Collector",
    "labels_url": "https://api.github.com/repos/microsoft/CEDAR.GitHub.Collector/issues/5/labels{/name}",
    "comments_url": "https://api.github.com/repos/microsoft/CEDAR.GitHub.Collector/issues/5/comments",
    "events_url": "https://api.github.com/repos/microsoft/CEDAR.GitHub.Collector/issues/5/events",
    "html_url": "https://github.com/microsoft/CEDAR.GitHub.Collector/issues/5",
    "id": 702507654,
    "node_id": "MDU6SXNzdWU3MDI1MDc2NTQ=",
    "number": 5,
    "title": "Expand ReadMe.md with details on how to test the remaining collectors",
    "user": {
      "login": "kivancmuslu",
      "id": 43969379,
      "node_id": "MDQ6VXNlcjQzOTY5Mzc5",
      "avatar_url": "https://avatars1.githubusercontent.com/u/43969379?v=4",
      "gravatar_id": "",
      "url": "https://api.github.com/users/kivancmuslu",
      "html_url": "https://github.com/kivancmuslu",
      "followers_url": "https://api.github.com/users/kivancmuslu/followers",
      "following_url": "https://api.github.com/users/kivancmuslu/following{/other_user}",
      "gists_url": "https://api.github.com/users/kivancmuslu/gists{/gist_id}",
      "starred_url": "https://api.github.com/users/kivancmuslu/starred{/owner}{/repo}",
      "subscriptions_url": "https://api.github.com/users/kivancmuslu/subscriptions",
      "organizations_url": "https://api.github.com/users/kivancmuslu/orgs",
      "repos_url": "https://api.github.com/users/kivancmuslu/repos",
      "events_url": "https://api.github.com/users/kivancmuslu/events{/privacy}",
      "received_events_url": "https://api.github.com/users/kivancmuslu/received_events",
      "type": "User",
      "site_admin": true
    },
    "labels": [],
    "state": "open",
    "locked": false,
    "assignee": null,
    "assignees": [],
    "milestone": null,
    "comments": 0,
    "created_at": "2020-09-16T06:54:49Z",
    "updated_at": "2020-09-16T06:54:49Z",
    "closed_at": null,
    "author_association": "MEMBER",
    "active_lock_reason": null,
    "body": "Currently, it only describes how to test the onboarding collector.",
    "performed_via_github_app": null
  },
  "repository": {
    "id": 282058629,
    "node_id": "MDEwOlJlcG9zaXRvcnkyODIwNTg2Mjk=",
    "name": "CEDAR.GitHub.Collector",
    "full_name": "microsoft/CEDAR.GitHub.Collector",
    "private": false,
    "owner": {
      "login": "microsoft",
      "id": 6154722,
      "node_id": "MDEyOk9yZ2FuaXphdGlvbjYxNTQ3MjI=",
      "avatar_url": "https://avatars2.githubusercontent.com/u/6154722?v=4",
      "gravatar_id": "",
      "url": "https://api.github.com/users/microsoft",
      "html_url": "https://github.com/microsoft",
      "followers_url": "https://api.github.com/users/microsoft/followers",
      "following_url": "https://api.github.com/users/microsoft/following{/other_user}",
      "gists_url": "https://api.github.com/users/microsoft/gists{/gist_id}",
      "starred_url": "https://api.github.com/users/microsoft/starred{/owner}{/repo}",
      "subscriptions_url": "https://api.github.com/users/microsoft/subscriptions",
      "organizations_url": "https://api.github.com/users/microsoft/orgs",
      "repos_url": "https://api.github.com/users/microsoft/repos",
      "events_url": "https://api.github.com/users/microsoft/events{/privacy}",
      "received_events_url": "https://api.github.com/users/microsoft/received_events",
      "type": "Organization",
      "site_admin": false
    },
    "html_url": "https://github.com/microsoft/CEDAR.GitHub.Collector",
    "description": "Data collection pipeline for GitHub",
    "fork": false,
    "url": "https://api.github.com/repos/microsoft/CEDAR.GitHub.Collector",
    "forks_url": "https://api.github.com/repos/microsoft/CEDAR.GitHub.Collector/forks",
    "keys_url": "https://api.github.com/repos/microsoft/CEDAR.GitHub.Collector/keys{/key_id}",
    "collaborators_url": "https://api.github.com/repos/microsoft/CEDAR.GitHub.Collector/collaborators{/collaborator}",
    "teams_url": "https://api.github.com/repos/microsoft/CEDAR.GitHub.Collector/teams",
    "hooks_url": "https://api.github.com/repos/microsoft/CEDAR.GitHub.Collector/hooks",
    "issue_events_url": "https://api.github.com/repos/microsoft/CEDAR.GitHub.Collector/issues/events{/number}",
    "events_url": "https://api.github.com/repos/microsoft/CEDAR.GitHub.Collector/events",
    "assignees_url": "https://api.github.com/repos/microsoft/CEDAR.GitHub.Collector/assignees{/user}",
    "branches_url": "https://api.github.com/repos/microsoft/CEDAR.GitHub.Collector/branches{/branch}",
    "tags_url": "https://api.github.com/repos/microsoft/CEDAR.GitHub.Collector/tags",
    "blobs_url": "https://api.github.com/repos/microsoft/CEDAR.GitHub.Collector/git/blobs{/sha}",
    "git_tags_url": "https://api.github.com/repos/microsoft/CEDAR.GitHub.Collector/git/tags{/sha}",
    "git_refs_url": "https://api.github.com/repos/microsoft/CEDAR.GitHub.Collector/git/refs{/sha}",
    "trees_url": "https://api.github.com/repos/microsoft/CEDAR.GitHub.Collector/git/trees{/sha}",
    "statuses_url": "https://api.github.com/repos/microsoft/CEDAR.GitHub.Collector/statuses/{sha}",
    "languages_url": "https://api.github.com/repos/microsoft/CEDAR.GitHub.Collector/languages",
    "stargazers_url": "https://api.github.com/repos/microsoft/CEDAR.GitHub.Collector/stargazers",
    "contributors_url": "https://api.github.com/repos/microsoft/CEDAR.GitHub.Collector/contributors",
    "subscribers_url": "https://api.github.com/repos/microsoft/CEDAR.GitHub.Collector/subscribers",
    "subscription_url": "https://api.github.com/repos/microsoft/CEDAR.GitHub.Collector/subscription",
    "commits_url": "https://api.github.com/repos/microsoft/CEDAR.GitHub.Collector/commits{/sha}",
    "git_commits_url": "https://api.github.com/repos/microsoft/CEDAR.GitHub.Collector/git/commits{/sha}",
    "comments_url": "https://api.github.com/repos/microsoft/CEDAR.GitHub.Collector/comments{/number}",
    "issue_comment_url": "https://api.github.com/repos/microsoft/CEDAR.GitHub.Collector/issues/comments{/number}",
    "contents_url": "https://api.github.com/repos/microsoft/CEDAR.GitHub.Collector/contents/{+path}",
    "compare_url": "https://api.github.com/repos/microsoft/CEDAR.GitHub.Collector/compare/{base}...{head}",
    "merges_url": "https://api.github.com/repos/microsoft/CEDAR.GitHub.Collector/merges",
    "archive_url": "https://api.github.com/repos/microsoft/CEDAR.GitHub.Collector/{archive_format}{/ref}",
    "downloads_url": "https://api.github.com/repos/microsoft/CEDAR.GitHub.Collector/downloads",
    "issues_url": "https://api.github.com/repos/microsoft/CEDAR.GitHub.Collector/issues{/number}",
    "pulls_url": "https://api.github.com/repos/microsoft/CEDAR.GitHub.Collector/pulls{/number}",
    "milestones_url": "https://api.github.com/repos/microsoft/CEDAR.GitHub.Collector/milestones{/number}",
    "notifications_url": "https://api.github.com/repos/microsoft/CEDAR.GitHub.Collector/notifications{?since,all,participating}",
    "labels_url": "https://api.github.com/repos/microsoft/CEDAR.GitHub.Collector/labels{/name}",
    "releases_url": "https://api.github.com/repos/microsoft/CEDAR.GitHub.Collector/releases{/id}",
    "deployments_url": "https://api.github.com/repos/microsoft/CEDAR.GitHub.Collector/deployments",
    "created_at": "2020-07-23T21:26:30Z",
    "updated_at": "2020-09-15T22:06:22Z",
    "pushed_at": "2020-09-16T06:53:28Z",
    "git_url": "git://github.com/microsoft/CEDAR.GitHub.Collector.git",
    "ssh_url": "git@github.com:microsoft/CEDAR.GitHub.Collector.git",
    "clone_url": "https://github.com/microsoft/CEDAR.GitHub.Collector.git",
    "svn_url": "https://github.com/microsoft/CEDAR.GitHub.Collector",
    "homepage": "",
    "size": 74,
    "stargazers_count": 1,
    "watchers_count": 1,
    "language": "C#",
    "has_issues": true,
    "has_projects": true,
    "has_downloads": true,
    "has_wiki": true,
    "has_pages": false,
    "forks_count": 1,
    "mirror_url": null,
    "archived": false,
    "disabled": false,
    "open_issues_count": 2,
    "license": {
      "key": "mit",
      "name": "MIT License",
      "spdx_id": "MIT",
      "url": "https://api.github.com/licenses/mit",
      "node_id": "MDc6TGljZW5zZTEz"
    },
    "forks": 1,
    "open_issues": 2,
    "watchers": 1,
    "default_branch": "main"
  },
  "organization": {
    "login": "microsoft",
    "id": 6154722,
    "node_id": "MDEyOk9yZ2FuaXphdGlvbjYxNTQ3MjI=",
    "url": "https://api.github.com/orgs/microsoft",
    "repos_url": "https://api.github.com/orgs/microsoft/repos",
    "events_url": "https://api.github.com/orgs/microsoft/events",
    "hooks_url": "https://api.github.com/orgs/microsoft/hooks",
    "issues_url": "https://api.github.com/orgs/microsoft/issues",
    "members_url": "https://api.github.com/orgs/microsoft/members{/member}",
    "public_members_url": "https://api.github.com/orgs/microsoft/public_members{/member}",
    "avatar_url": "https://avatars2.githubusercontent.com/u/6154722?v=4",
    "description": "Open source projects and samples from Microsoft"
  },
  "enterprise": {
    "id": 1578,
    "slug": "microsoftopensource",
    "name": "Microsoft Open Source",
    "node_id": "MDEwOkVudGVycHJpc2UxNTc4",
    "avatar_url": "https://avatars0.githubusercontent.com/b/1578?v=4",
    "description": "Microsoft's organizations for open source collaboration",
    "website_url": "https://opensource.microsoft.com",
    "html_url": "https://github.com/enterprises/microsoftopensource",
    "created_at": "2019-12-09T02:41:53Z",
    "updated_at": "2020-05-19T18:21:45Z"
  },
  "sender": {
    "login": "kivancmuslu",
    "id": 43969379,
    "node_id": "MDQ6VXNlcjQzOTY5Mzc5",
    "avatar_url": "https://avatars1.githubusercontent.com/u/43969379?v=4",
    "gravatar_id": "",
    "url": "https://api.github.com/users/kivancmuslu",
    "html_url": "https://github.com/kivancmuslu",
    "followers_url": "https://api.github.com/users/kivancmuslu/followers",
    "following_url": "https://api.github.com/users/kivancmuslu/following{/other_user}",
    "gists_url": "https://api.github.com/users/kivancmuslu/gists{/gist_id}",
    "starred_url": "https://api.github.com/users/kivancmuslu/starred{/owner}{/repo}",
    "subscriptions_url": "https://api.github.com/users/kivancmuslu/subscriptions",
    "organizations_url": "https://api.github.com/users/kivancmuslu/orgs",
    "repos_url": "https://api.github.com/users/kivancmuslu/repos",
    "events_url": "https://api.github.com/users/kivancmuslu/events{/privacy}",
    "received_events_url": "https://api.github.com/users/kivancmuslu/received_events",
    "type": "User",
    "site_admin": true
  }
}

After the function has completed you should be able to see your collected data under the github blob container in your storage account.
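
As an alternative to a GUI client, the request can be scripted with the Python standard library. This is a sketch under assumptions rather than a supported tool: the function names are hypothetical, the `Content-Type` header is an assumption (only the two `X-GitHub-*` headers are named above), and the one-field payload is a stand-in that may not be enough for the collector, so paste the full body when testing for real.

```python
import json
import urllib.request
import uuid

# Local endpoint named in the instructions above.
ENDPOINT = "http://localhost:7071/api/ProcessWebHook"


def build_webhook_request(payload, event, url=ENDPOINT):
    """Build a POST request carrying the delivery and event headers."""
    return urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Content-Type": "application/json",  # assumption, see lead-in
            "X-GitHub-Delivery": str(uuid.uuid4()),  # any GUID works
            "X-GitHub-Event": event,
        },
        method="POST",
    )


def send(request):
    """Send the request and return the HTTP status code."""
    with urllib.request.urlopen(request) as response:
        return response.status


# Stand-in payload; replace with the full example body before sending.
request = build_webhook_request({"action": "opened"}, "issues")
```

Call `send(request)` while the functions host is running locally to deliver the payload.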

Investigating collector telemetry

Long-running functions (Onboarding and Traffic) print some additional progress stats when they are executed locally. However, richer telemetry is sent to Application Insights. To consume the telemetry data from your functions, open your Application Insights resource and navigate to the Monitoring -> Logs tab.

Retrieve session events

customEvents
| where name in ("SessionStart", "SessionEnd")
| extend Context = parse_json(customDimensions)
| extend SessionId = tostring(Context.SessionId),
         CollectorType = tostring(Context.CollectorType),
         Success = tostring(Context.Success)

Retrieve requests done in a particular session

let sessionId = "<session ID>";
dependencies
| extend Context = parse_json(customDimensions)
| extend SessionId = tostring(Context.SessionId)
| where SessionId == sessionId
| order by timestamp desc

Retrieve exceptions in a particular session

let sessionId = "<session ID>";
exceptions
| extend Context = parse_json(customDimensions)
| extend SessionId = tostring(Context.SessionId)
| where SessionId == sessionId
| order by timestamp desc

9. Make and debug changes

Create and check out feature branches from your fork on your local machine and make your contributions to the code base.

Test and debug your changes. Running GitHub.Collectors.Functions in the Debug configuration lets you use the Visual Studio debugging tools (breakpoints, variable tracking, etc.) while your functions run.

Note: CEDAR.GitHub.Collector depends on CEDAR.Core.Collector and consumes it as a Git submodule. If you are making changes to CEDAR.Core.Collector, first create a PR on that repository (following the same practices described here), have that PR merged, and then create your PR in this repository with the updated submodule SHA.

10. Write unit tests to cover new code

New code should be covered by comprehensive unit tests using the Microsoft.VisualStudio.TestTools.UnitTesting framework.

11. Commit and Push your changes and make a Pull Request

Once your contributions have been tested, commit them to your remote branch and open a pull request asking for your changes to be merged into the CEDAR.GitHub.Collector repository.

Troubleshooting

System.Private.CoreLib: The input is not a valid Base-64 string as it contains a non-base 64 character, more than two padding characters, or an illegal character among the padding characters. 

If you are using the Azure web portal, check the box that says Encode the message body in Base64. If the box is unavailable to check, then the message body contains an illegal character and cannot be encoded. Check to make sure that it is not an invisible character (copying and pasting from GitHub has caused an invisible illegal character in the past).
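
To locate such a character, you can scan the message body before pasting it into the portal. The following is a minimal sketch with a hypothetical helper name; it does not know which character caused the problem in the past, so it flags anything that is neither printable ASCII nor ordinary whitespace. Legitimate non-ASCII text would also be reported, which is acceptable for the example messages in this document because they are pure ASCII.

```python
import unicodedata


def suspicious_characters(text):
    """Return (index, code point, Unicode name) for every character that
    is neither printable ASCII nor a newline, carriage return, or tab."""
    findings = []
    for index, char in enumerate(text):
        if char in "\n\r\t" or " " <= char <= "~":
            continue
        name = unicodedata.name(char, "UNKNOWN")
        findings.append((index, f"U+{ord(char):04X}", name))
    return findings
```

An empty list means the body contains only plain ASCII; otherwise each entry gives the position of a character to delete or retype.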

Microsoft.CloudMine.GitHub.Collectors.Functions: Invalid URI: The hostname could not be parsed. 

The API domain isn't set. It is currently configured in Settings.json, so make sure a value is mapped to the key ApiDomain (e.g., "ApiDomain": "api.github.com").

cedar.github.collector's Issues

ADO collectors on Azure Functions

This effort aims to improve our collector architecture, latency, and reliability, making CEDAR more reliable with an improved code base. This work will be completed in phases.

Implement a point collector

High Level Design

  1. We add another Azure queue to store payloads for entities that need to be collected.
  2. We add another collector (Azure Function) that processes the items in the queue by querying GitHub API endpoints while abiding by GitHub throttling.

This will allow us to do two things in the long run:

  • Caching of entities to reduce the number of queries against the API.
  • Untangle entities in the ADLS data (which right now prevents us from switching to ADLS for GitHub processing).

Create barebones local and global settings files

We need barebones local.settings.json and Settings.json files to make it easier for users to get the collectors running. These barebones files will only include the fields required to run the collectors. The user will only need to add their details (usernames, secrets, etc.).

Move ApiDomain to global section of the config

ApiDomain, which can be different between different GitHub collector deployments, is currently set through an Environment variable (function app variable), which is set at the release definition. It would be beneficial to move this to the main config in the global section to keep config-related things together.

GitHub collector parity with Azure DevOps equivalents

The goal is to close the gap between both services to enable customers (1st and 3rd party) to invoke analytics on both platforms with respect to the data offerings of GitHub and what exists in Azure DevOps.

Infinite retention for queue messages

Currently, queue-based collectors use the default Azure Function retention policy, which is 7 days. However, it is possible to change this. We should change this to infinite to ensure that no work is lost (even in poison queues). A similar implementation is already available on the ADO side.

Make it possible to turn off putting notification messages for AzureBlob writer settings

The core library already contains logic that skips the notification message when NotificationQueueSotrageConnectionEnvironmentVariable is set to an empty string or null. However, GitHub collectors always provide "AzureWebJobs" (the default value) for this setting. The goal of this issue is to extend the GitHub collectors with the ability to skip putting these notification messages.
