[Bug][GitHub Data source] pull_requests table does not populate all the repos added to the project (apache/incubator-devlake)

Comments (12)

dosubot commented on September 27, 2024 2

The "pull_requests" table might not populate all the repos added to the project after a successful pipeline run because the CollectPrs function relies on the GithubApiParams struct, which includes the ConnectionId and Name parameters. If these parameters are not correctly set or if the new repositories are not included in the scope defined by these parameters, the pull requests for those repositories will not be extracted and populated into the table.

Additionally, the CollectPrs function uses a GraphQL query to collect pull requests, and it orders the results by CREATED_AT and uses cursor pagination. This approach may miss some pull requests that are updated during the collection process. Although these missed pull requests will be collected in the next run, it might not be sufficient for immediate data consistency.

Here is the relevant code snippet from the CollectPrs function:

func CollectPrs(taskCtx plugin.SubTaskContext) errors.Error {
    data := taskCtx.GetData().(*tasks.GithubTaskData)
    var err errors.Error
    apiCollector, err := api.NewStatefulApiCollector(api.RawDataSubTaskArgs{
        Ctx: taskCtx,
        Params: tasks.GithubApiParams{
            ConnectionId: data.Options.ConnectionId,
            Name:         data.Options.Name,
        },
        Table: RAW_PRS_TABLE,
    })
    if err != nil {
        return err
    }

    err = apiCollector.InitGraphQLCollector(api.GraphqlCollectorArgs{
        GraphqlClient: data.GraphqlClient,
        PageSize:      10,
        BuildQuery: func(reqData *api.GraphqlRequestData) (interface{}, map[string]interface{}, error) {
            query := &GraphqlQueryPrWrapper{}
            if reqData == nil {
                return query, map[string]interface{}{}, nil
            }
            ownerName := strings.Split(data.Options.Name, "/")
            variables := map[string]interface{}{
                "pageSize":   graphql.Int(reqData.Pager.Size),
                "skipCursor": (*graphql.String)(reqData.Pager.SkipCursor),
                "owner":      graphql.String(ownerName[0]),
                "name":       graphql.String(ownerName[1]),
            }
            return query, variables, nil
        },
        GetPageInfo: func(iQuery interface{}, args *api.GraphqlCollectorArgs) (*api.GraphqlQueryPageInfo, error) {
            query := iQuery.(*GraphqlQueryPrWrapper)
            return query.Repository.PullRequests.PageInfo, nil
        },
        ResponseParser: func(iQuery interface{}, variables map[string]interface{}) ([]interface{}, error) {
            query := iQuery.(*GraphqlQueryPrWrapper)
            prs := query.Repository.PullRequests.Prs
            for _, rawL := range prs {
                if apiCollector.GetSince() != nil && !apiCollector.GetSince().Before(rawL.CreatedAt) {
                    return nil, api.ErrFinishCollect
                }
            }
            return nil, nil
        },
    })
    if err != nil {
        return err
    }

    return apiCollector.Execute()
}
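
The BuildQuery callback above splits data.Options.Name on "/" and indexes both halves, so a scope name that is not in owner/repo form would panic at ownerName[1]. Below is a minimal sketch of a defensive split. splitRepoName is a hypothetical helper written for this illustration and is not part of DevLake.

```go
package main

import (
	"fmt"
	"strings"
)

// splitRepoName is a hypothetical helper (not part of DevLake). It splits a
// scope name of the form "owner/repo" and returns an error instead of
// panicking when the name has any other shape.
func splitRepoName(fullName string) (owner, repo string, err error) {
	parts := strings.Split(fullName, "/")
	if len(parts) != 2 || parts[0] == "" || parts[1] == "" {
		return "", "", fmt.Errorf("scope name %q is not in owner/repo form", fullName)
	}
	return parts[0], parts[1], nil
}

func main() {
	for _, name := range []string{"apache/incubator-devlake", "incubator-devlake"} {
		owner, repo, err := splitRepoName(name)
		if err != nil {
			fmt.Println("skipping:", err)
			continue
		}
		fmt.Printf("owner=%s repo=%s\n", owner, repo)
	}
}
```

Running a check like this over the scope names configured for a project is one way to spot a repository whose name would never produce a valid query.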

Ensure that the ConnectionId and Name parameters are correctly set to include all the repositories you want to track. Additionally, consider the limitations of the current pagination and ordering strategy, and plan for potential updates to the collection logic to ensure all relevant pull requests are captured ^[1]^[2]^[3].
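
To make the pagination limitation concrete, here is a small sketch of the stop condition used in the ResponseParser above. It assumes the pages arrive newest-first by creation time, which the snippet does not state explicitly. The pr type, day, samplePages and collectSince are hypothetical names for this illustration only.

```go
package main

import (
	"fmt"
	"time"
)

// pr is a hypothetical, stripped-down pull request record.
type pr struct {
	Number    int
	CreatedAt time.Time
}

// day builds a UTC date in September 2024 for the sample data.
func day(d int) time.Time {
	return time.Date(2024, time.September, d, 0, 0, 0, 0, time.UTC)
}

// samplePages returns two pages ordered newest-first (an assumption).
func samplePages() [][]pr {
	return [][]pr{
		{{Number: 4, CreatedAt: day(20)}, {Number: 3, CreatedAt: day(15)}},
		{{Number: 2, CreatedAt: day(10)}, {Number: 1, CreatedAt: day(5)}},
	}
}

// collectSince walks the pages in order and stops at the first record
// created at or before `since`, mirroring the `!since.Before(createdAt)`
// check in the snippet above. A nil `since` means a full collection.
func collectSince(pages [][]pr, since *time.Time) []pr {
	var collected []pr
	for _, page := range pages {
		for _, p := range page {
			if since != nil && !since.Before(p.CreatedAt) {
				return collected // older records are not visited in this pass
			}
			collected = append(collected, p)
		}
	}
	return collected
}

func main() {
	since := day(10)
	for _, p := range collectSince(samplePages(), &since) {
		fmt.Println("collected PR", p.Number)
	}
}
```

In this sketch, a pull request created on or before the cutoff is never revisited, even if it changed afterwards, which is the kind of gap the comment above warns about.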

Details

About Dosu
This response is meant to be useful and save you time. It is not meant to be a precise solution, but rather a starting point for your own research.

Help me be more useful! Please leave a 👍 if this is helpful and 👎 if it is irrelevant.

To continue the conversation, mention @dosu.

from incubator-devlake.

dosubot commented on September 27, 2024 2

Since you are using the GitHub Server option with the GraphQL option toggled off, data collection relies on the REST API instead of GraphQL, so the CollectApiPullRequests function is used to collect pull request data.

The ConnectionId and Name parameters are captured by the table defined by the archived.GithubConnection model. Here is the relevant code snippet:

err = db.AutoMigrate(archived.GithubConnection{})
if err != nil {
    return err
}
encodeKey := basicRes.GetConfig(plugin.EncodeKeyEnvStr)
connection := &archived.GithubConnection{}
connection.Endpoint = basicRes.GetConfig(`GITHUB_ENDPOINT`)
connection.Proxy = basicRes.GetConfig(`GITHUB_PROXY`)
connection.Token = basicRes.GetConfig(`GITHUB_AUTH`)
connection.Name = `GitHub`
if connection.Endpoint != `` && connection.Token != `` && encodeKey != `` {
    // update from .env and save to db
    err = db.Create(connection)
    if err != nil {
        return err
    }
}

This code creates a GithubConnection entry from the .env settings. The entry carries the Name set above, and its ID is the value the ConnectionId parameter refers to ^[1].

from incubator-devlake.

dosubot commented on September 27, 2024 2

The pull_requests table might not be populating with all the repositories added to your project due to the filtering conditions in the EnrichPullRequestIssues function. Specifically, the dal.Where clause filters the data based on repo_id and connection_id. If these IDs do not match the repositories you expect to be included, the data will not be processed.

Here is the relevant code snippet:

cursor, err := db.Cursor(
    dal.From(&models.GithubPullRequest{}),
    dal.Where("repo_id = ? and connection_id = ?", repoId, data.Options.ConnectionId),
)

Ensure that the repoId and data.Options.ConnectionId are correctly set for all the repositories you have added to your project. If these values are incorrect or not set for some repositories, the corresponding pull requests will not be included in the pull_requests table.
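
One way to check this is to compare the scopes you expect against the (connection_id, repo_id) pairs that actually have rows. The sketch below does that in memory. The scopeKey and prRow types and the sample values are hypothetical, and in practice the rows would come from a query against your own database.

```go
package main

import "fmt"

// scopeKey identifies a repository within a connection, the same pair the
// dal.Where clause above filters on.
type scopeKey struct {
	ConnectionId uint64
	RepoId       int
}

// prRow is a hypothetical, stripped-down tool-layer pull request row.
type prRow struct {
	ConnectionId uint64
	RepoId       int
	Number       int
}

// scopesWithoutPrs returns the expected scopes that have no matching row,
// in the order they were given.
func scopesWithoutPrs(expected []scopeKey, rows []prRow) []scopeKey {
	seen := make(map[scopeKey]bool, len(rows))
	for _, r := range rows {
		seen[scopeKey{ConnectionId: r.ConnectionId, RepoId: r.RepoId}] = true
	}
	var missing []scopeKey
	for _, s := range expected {
		if !seen[s] {
			missing = append(missing, s)
		}
	}
	return missing
}

func main() {
	expected := []scopeKey{{1, 100}, {1, 200}, {2, 100}}
	rows := []prRow{{1, 100, 7}, {1, 100, 8}, {2, 100, 3}}
	for _, s := range scopesWithoutPrs(expected, rows) {
		fmt.Printf("no pull requests for connection_id=%d repo_id=%d\n", s.ConnectionId, s.RepoId)
	}
}
```

A scope that shows up as missing here was either never collected or was stored under a different ID pair, which narrows down where to look next.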

Additionally, verify that the PullRequest struct in your codebase is correctly defined and matches the expected schema. Here is the definition for reference:

type PullRequest struct {
    DomainEntity
    BaseRepoId     string `gorm:"index"`
    HeadRepoId     string `gorm:"index"`
    Status         string `gorm:"type:varchar(100);comment:open/closed or other"`
    Title          string
    Description    string
    Url            string `gorm:"type:varchar(255)"`
    AuthorName     string `gorm:"type:varchar(100)"`
    AuthorId       string `gorm:"type:varchar(100)"`
    ParentPrId     string `gorm:"index;type:varchar(100)"`
    PullRequestKey int
    CreatedDate    time.Time
    MergedDate     *time.Time
    ClosedDate     *time.Time
    Type           string `gorm:"type:varchar(100)"`
    Component      string `gorm:"type:varchar(100)"`
    MergeCommitSha string `gorm:"type:varchar(40)"`
    HeadRef        string `gorm:"type:varchar(255)"`
    BaseRef        string `gorm:"type:varchar(255)"`
    BaseCommitSha  string `gorm:"type:varchar(40)"`
    HeadCommitSha  string `gorm:"type:varchar(40)"`
}

func (PullRequest) TableName() string {
    return "pull_requests"
}

Make sure that the table name and field types are correctly set to avoid any schema mismatches.

Additionally, ensure that the EnrichPullRequestIssues function in your code is correctly extracting and processing the pull request data. Here is the relevant part of the EnrichPullRequestIssues function:

func EnrichPullRequestIssues(taskCtx plugin.SubTaskContext) (err errors.Error) {
    db := taskCtx.GetDal()
    data := taskCtx.GetData().(*GithubTaskData)
    repoId := data.Options.GithubId

    var prBodyCloseRegex *regexp.Regexp
    prBodyClosePattern := data.Options.ScopeConfig.PrBodyClosePattern
    prBodyClosePattern = strings.Replace(prBodyClosePattern, "%s", data.Options.Name, 1)
    if len(prBodyClosePattern) > 0 {
        prBodyCloseRegex, err = errors.Convert01(regexp.Compile(prBodyClosePattern))
        if err != nil {
            return errors.Default.Wrap(err, "regexp Compile prBodyClosePattern failed")
        }
    }
    charPattern := regexp.MustCompile(`[\/a-zA-Z\s,]+`)
    cursor, err := db.Cursor(dal.From(&models.GithubPullRequest{}),
        dal.Where("repo_id = ? and connection_id = ?", repoId, data.Options.ConnectionId))
    if err != nil {
        return err
    }
    defer cursor.Close()

    converter, err := api.NewDataConverter(api.DataConverterArgs{
        InputRowType: reflect.TypeOf(models.GithubPullRequest{}),
        Input:        cursor,
        RawDataSubTaskArgs: api.RawDataSubTaskArgs{
            Ctx: taskCtx,
            Params: GithubApiParams{
                ConnectionId: data.Options.ConnectionId,
                Name:         data.Options.Name,
            },
            Table: RAW_PULL_REQUEST_TABLE,
        },
        Convert: func(inputRow interface{}) ([]interface{}, errors.Error) {
            githubPullRequst := inputRow.(*models.GithubPullRequest)
            results := make([]interface{}, 0, 1)

            issueNumberListStr := ""
            if prBodyCloseRegex != nil {
                issueNumberListStr = prBodyCloseRegex.FindString(githubPullRequst.Body)
            }

            if issueNumberListStr == "" {
                return nil, nil
            }

            issueNumberListStr = charPattern.ReplaceAllString(issueNumberListStr, "#")
            issueNumberList := strings.Split(issueNumberListStr, "#")
            for _, issueNumberStr := range issueNumberList {
                issue := &models.GithubIssue{}
                issueNumberStr = strings.TrimSpace(issueNumberStr)
                issueNumber, numFormatErr := strconv.Atoi(issueNumberStr)
                if numFormatErr != nil {
                    continue
                }
                err = db.All(
                    issue,
                    dal.Where("number = ? and repo_id = ? and connection_id = ?",
                        issueNumber, repoId, data.Options.ConnectionId),
                    dal.Limit(1),
                )
                if err != nil {
                    return nil, err
                }
                if issue.Number == 0 {
                    continue
                }
                githubPullRequstIssue := &models.GithubPrIssue{
                    ConnectionId:      data.Options.ConnectionId,
                    PullRequestId:     githubPullRequst.GithubId,
                    IssueId:           issue.GithubId,
                    PullRequestNumber: githubPullRequst.Number,
                    IssueNumber:       issue.Number,
                }
                results = append(results, githubPullRequstIssue)
            }
            return results, nil
        },
    })
    if err != nil {
        return err
    }

    return converter.Execute()
}

Ensure that the ConnectionId and repoId are set correctly for every repository. As the code above shows, EnrichPullRequestIssues reads only the GithubPullRequest rows that match both IDs and emits GithubPrIssue records linking pull requests to issues, so a repository whose IDs do not match yields no rows at this step ^[1]^[2]^[3]^[4].
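
The issue-number extraction inside the Convert callback can be exercised on its own. The sketch below reuses the charPattern expression from the snippet. closePattern is an illustrative pattern chosen for this example and is not DevLake's default PrBodyClosePattern. closedIssueNumbers is a hypothetical helper.

```go
package main

import (
	"fmt"
	"regexp"
	"strconv"
	"strings"
)

// closePattern is an illustrative pattern for this sketch only; it is NOT
// DevLake's default PrBodyClosePattern.
var closePattern = regexp.MustCompile(`(?i)(fix(es|ed)?|close[sd]?|resolve[sd]?)\s+((and\s+)?#\d+[ ,]*)+`)

// charPattern is copied from the snippet above: it matches runs of
// letters, slashes, whitespace and commas.
var charPattern = regexp.MustCompile(`[\/a-zA-Z\s,]+`)

// closedIssueNumbers returns the issue numbers referenced by the first
// closing phrase found in a pull request body.
func closedIssueNumbers(body string) []int {
	match := closePattern.FindString(body)
	if match == "" {
		return nil
	}
	// Replace each run of letters, slashes, whitespace and commas with "#",
	// then split on "#" so only the digit groups remain as candidates.
	match = charPattern.ReplaceAllString(match, "#")
	var numbers []int
	for _, part := range strings.Split(match, "#") {
		n, err := strconv.Atoi(strings.TrimSpace(part))
		if err != nil {
			continue // empty fragments between separators
		}
		numbers = append(numbers, n)
	}
	return numbers
}

func main() {
	fmt.Println(closedIssueNumbers("This PR fixes #12, #34 and #56."))
	fmt.Println(closedIssueNumbers("Refactor only, no linked issue."))
}
```

A body without a closing phrase yields no numbers, which corresponds to the early `return nil, nil` in the Convert callback.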

from incubator-devlake.

dosubot commented on September 27, 2024 2

The absence of the labels field in the devlake._raw_github_api_issues table for the missing repository could indeed be a contributing factor to why the data is not being loaded into the pull_requests table. The labels field is processed and transformed during the data ingestion process, and its absence might affect the overall data integrity and the subsequent transformations.

The CollectApiIssues function handles the labels field by using the convertGithubLabels function, which processes each label associated with an issue. If the labels field is missing or malformed, it could lead to incomplete or incorrect data being ingested, which might prevent the data from being correctly loaded into the pull_requests table ^[1].

To address this issue, ensure that the labels field is correctly populated in the devlake._raw_github_api_issues table. You might need to verify the data collection process and ensure that the GitHub API responses include the labels field for all issues. If the labels field is missing in the API response, it could indicate a problem with the data source configuration or the API request parameters.
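
When inspecting raw payloads, it helps to tell a payload with no labels key at all apart from one with an empty labels array. The sketch below shows one way to do that with encoding/json. The sample payloads are hypothetical minimal examples rather than real rows from devlake._raw_github_api_issues, and inspectLabels is a helper invented for this illustration.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// labelState describes what the "labels" key looks like in one payload.
type labelState string

const (
	labelsMissing labelState = "missing" // key absent or null
	labelsEmpty   labelState = "empty"   // key present, zero labels
	labelsPresent labelState = "present" // key present, at least one label
)

// inspectLabels decodes one raw issue payload and classifies its labels key.
func inspectLabels(raw []byte) (labelState, error) {
	var payload struct {
		// RawMessage keeps the original bytes, so an absent key can be
		// told apart from an empty array.
		Labels json.RawMessage `json:"labels"`
	}
	if err := json.Unmarshal(raw, &payload); err != nil {
		return "", err
	}
	if len(payload.Labels) == 0 || string(payload.Labels) == "null" {
		return labelsMissing, nil
	}
	var labels []struct {
		Name string `json:"name"`
	}
	if err := json.Unmarshal(payload.Labels, &labels); err != nil {
		return "", err
	}
	if len(labels) == 0 {
		return labelsEmpty, nil
	}
	return labelsPresent, nil
}

func main() {
	payloads := []string{
		`{"number": 1, "labels": [{"name": "bug"}]}`,
		`{"number": 2, "labels": []}`,
		`{"number": 3}`,
	}
	for _, p := range payloads {
		state, err := inspectLabels([]byte(p))
		if err != nil {
			fmt.Println("invalid payload:", err)
			continue
		}
		fmt.Printf("%s -> labels %s\n", p, state)
	}
}
```

An empty array simply means the issue has no labels, whereas a missing key suggests the response came back in a different shape than expected.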

from incubator-devlake.

dosubot commented on September 27, 2024 2

The missing labels in the devlake._raw_github_api_issues table could be due to several factors:

1. Regex Configuration: The IssueRegexes might not be configured to match the labels in the issues. If the regex patterns for severity, component, priority, or type do not match any label, the corresponding issue fields stay empty, although the convertGithubLabels code shown below still stores every label.
2. Label Extraction Logic: The logic in the convertGithubLabels function might not be extracting and storing the labels correctly. Ensure that the labels are appended to the results slice and that the joinedLabels are joined and assigned to githubIssue.Type.
3. Database Schema: The schema for the GithubIssueLabel table might be missing or incorrectly defined. Ensure that the table _tool_github_issue_labels exists and is defined so that it can store the labels.

To resolve this issue, you can:

1. Verify Regex Patterns: Ensure that the regex patterns in the GithubScopeConfig are correctly defined to match the labels you expect.
2. Debug Label Extraction: Add logging or debugging statements in the convertGithubLabels function to verify that labels are being correctly processed and appended to the results slice.
3. Check Database Schema: Verify that the GithubIssueLabel table is correctly defined and exists in the database. The table should have the following schema:

package archived

import (
	"github.com/apache/incubator-devlake/core/models/migrationscripts/archived"
)

// Please note that Issue Labels can also apply to Pull Requests.
// Pull Requests are considered Issues in GitHub.

type GithubIssueLabel struct {
	ConnectionId uint64 `gorm:"primaryKey"`
	IssueId      int    `gorm:"primaryKey;autoIncrement:false"`
	LabelName    string `gorm:"primaryKey;type:varchar(255)"`
	archived.NoPKModel
}

func (GithubIssueLabel) TableName() string {
	return "_tool_github_issue_labels"
}

Here is the relevant code for reference:

func convertGithubLabels(issueRegexes *IssueRegexes, issue *IssuesResponse, githubIssue *models.GithubIssue) ([]interface{}, errors.Error) {
	var results []interface{}
	var joinedLabels []string
	for _, label := range issue.Labels {
		results = append(results, &models.GithubIssueLabel{
			ConnectionId: githubIssue.ConnectionId,
			IssueId:      githubIssue.GithubId,
			LabelName:    label.Name,
		})

		if issueRegexes.SeverityRegex != nil && issueRegexes.SeverityRegex.MatchString(label.Name) {
			githubIssue.Severity = label.Name
		}
		if issueRegexes.ComponentRegex != nil && issueRegexes.ComponentRegex.MatchString(label.Name) {
			githubIssue.Component = label.Name
		}
		if issueRegexes.PriorityRegex != nil && issueRegexes.PriorityRegex.MatchString(label.Name) {
			githubIssue.Priority = label.Name
		}
		if issueRegexes.TypeRequirementRegex != nil && issueRegexes.TypeRequirementRegex.MatchString(label.Name) {
			githubIssue.StdType = ticket.REQUIREMENT
		} else if issueRegexes.TypeBugRegex != nil && issueRegexes.TypeBugRegex.MatchString(label.Name) {
			githubIssue.StdType = ticket.BUG
		} else if issueRegexes.TypeIncidentRegex != nil && issueRegexes.TypeIncidentRegex.MatchString(label.Name) {
			githubIssue.StdType = ticket.INCIDENT
		}
		joinedLabels = append(joinedLabels, label.Name)
	}
	if len(joinedLabels) > 0 {
		githubIssue.Type = strings.Join(joinedLabels, ",")
	}
	return results, nil
}

Additionally, ensure that the ConvertIssueLabels subtask is correctly converting the labels from the tool layer to the domain layer. Here is the relevant code for the ConvertIssueLabels subtask:

func ConvertIssueLabels(taskCtx plugin.SubTaskContext) errors.Error {
	db := taskCtx.GetDal()
	data := taskCtx.GetData().(*GithubTaskData)
	repoId := data.Options.GithubId

	cursor, err := db.Cursor(
		dal.From(&models.GithubIssueLabel{}),
		dal.Join(`left join _tool_github_issues on _tool_github_issues.github_id = _tool_github_issue_labels.issue_id`),
		dal.Where("_tool_github_issues.repo_id = ? and _tool_github_issues.connection_id = ?", repoId, data.Options.ConnectionId),
		dal.Orderby("issue_id ASC"),
	)
	if err != nil {
		return err
	}
	defer cursor.Close()
	issueIdGen := didgen.NewDomainIdGenerator(&models.GithubIssue{})

	converter, err := api.NewDataConverter(api.DataConverterArgs{
		RawDataSubTaskArgs: api.RawDataSubTaskArgs{
			Ctx: taskCtx,
			Params: GithubApiParams{
				ConnectionId: data.Options.ConnectionId,
				Name:         data.Options.Name,
			},
			Table: RAW_ISSUE_TABLE,
		},
		InputRowType: reflect.TypeOf(models.GithubIssueLabel{}),
		Input:        cursor,
		Convert: func(inputRow interface{}) ([]interface{}, errors.Error) {
			issueLabel := inputRow.(*models.GithubIssueLabel)
			domainIssueLabel := &ticket.IssueLabel{
				IssueId:   issueIdGen.Generate(data.Options.ConnectionId, issueLabel.IssueId),
				LabelName: issueLabel.LabelName,
			}
			return []interface{}{
				domainIssueLabel,
			}, nil
		},
	})
	if err != nil {
		return err
	}

	return converter.Execute()
}

Ensure that the issue.Labels are being correctly iterated over and that each label is being processed and appended to the results slice. Additionally, verify that the joinedLabels are correctly joined and assigned to githubIssue.Type. Also, make sure that the GithubIssueLabel table is correctly defined and exists in the database.

Finally, ensure that the issue_labels table in the domain layer is correctly defined to store the labels. The schema should be as follows:

package ticket

import (
	"github.com/apache/incubator-devlake/core/models/common"
)

// Please note that Issue Labels can also apply to Pull Requests.
// Pull Requests are considered Issues in GitHub.

type IssueLabel struct {
	IssueId   string `json:"id" gorm:"primaryKey;type:varchar(255);comment:This key is generated based on details from the original plugin"` // format: <Plugin>:<Entity>:<PK0>:<PK1>
	LabelName string `gorm:"primaryKey;type:varchar(255)"`
	common.NoPKModel
}

func (IssueLabel) TableName() string {
	return "issue_labels"
}

By verifying these configurations and ensuring the correct processing of labels, you can address the issue of missing labels in the devlake._raw_github_api_issues table ^[1]^[2]^[3]^[4].
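
The comment on IssueId above gives the domain-layer key format as <Plugin>:<Entity>:<PK0>:<PK1>. The sketch below builds a string in that shape, which can be handy when matching issue_labels rows back to tool-layer rows by hand. domainId is a hypothetical helper and not DevLake's didgen implementation, and the plugin and entity names in the example are assumptions.

```go
package main

import (
	"fmt"
	"strings"
)

// domainId is a hypothetical helper that joins a plugin name, an entity
// name and any number of primary-key values with ":", following the
// <Plugin>:<Entity>:<PK0>:<PK1> shape from the struct comment above.
func domainId(plugin, entity string, pks ...interface{}) string {
	parts := []string{plugin, entity}
	for _, pk := range pks {
		parts = append(parts, fmt.Sprint(pk))
	}
	return strings.Join(parts, ":")
}

func main() {
	// Assumed example values: connection ID 1 and issue ID 4242.
	fmt.Println(domainId("github", "GithubIssue", uint64(1), 4242))
}
```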

from incubator-devlake.

mparthi3-ford commented on September 27, 2024

Thanks for your reply.
I have a few queries:

1. Since we are using the GitHub Server option, the GraphQL option was toggled off when the connection was established. A screenshot is attached for reference.
2. Which table captures the ConnectionId and Name parameters?

from incubator-devlake.

mparthi3-ford commented on September 27, 2024

I am able to see the corresponding connection ID and name in the devlake._tool_github_connections table.
However, I cannot see the data in the pull_requests table.

from incubator-devlake.

mparthi3-ford commented on September 27, 2024

I see that the missing repo does not have the snippet below in devlake._raw_github_api_issues. Could this be the reason why the data was not loaded into the pull_requests table?

"labels": [
{
"id":
"node_id":
"url":
"name": "",
"color": "",
"default":,
"description":
}

from incubator-devlake.

mparthi3-ford commented on September 27, 2024

I am just a user who uses the DevLake tool to get metrics. How can I find out why the labels field is missing?
What could be causing the missing labels in my connection?

I am sure that I established the connection and configured the data source (GitHub), and hence the project, in the same way as before.
Is this something that needs to be fixed on the DevLake side?

from incubator-devlake.

mparthi3-ford commented on September 27, 2024

@JoshuaPoddoku: Even after the latest upgrade to v1.0.0 Beta8, I still face the same issue.
I would appreciate it if anyone from the DevLake community could join a quick meeting.
There is no GitHub-related data in pull_requests.

from incubator-devlake.

github-actions commented on September 27, 2024

This issue has been automatically marked as stale because it has been inactive for 60 days. It will be closed in the next 7 days if no further activity occurs.

from incubator-devlake.

github-actions commented on September 27, 2024

This issue has been closed because it has been inactive for a long time. You can reopen it if you encounter a similar problem in the future.

from incubator-devlake.
