Comments (8)

bulwahn commented on July 25, 2024

Nice collection of stuff, but unfortunately probably all irrelevant.

The pasta project faces two issues:

  1. We have very few applications built on PaStA (and no real users of the overall program), because many of the ideas for its use cases have not been implemented. This should be the focus.

  2. We have only a very small ground truth. No sophisticated algorithm helps while the ground truth is this small. There is no way to increase the ground truth dataset, so we should focus on fixing specific systematic issues.

This issue is about extending the data structures to identify and include the notion of a patch series, and then trying to compute relationships between series.
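
As a rough illustration of what "identifying a patch series" could mean at the data-structure level, here is a minimal sketch. It assumes the usual `git format-patch` subject convention (`[PATCH v2 3/7]`, with `0/N` marking the cover letter); the `PatchSeries` class and both helper functions are hypothetical names and not part of PaStA.

```python
import re
from dataclasses import dataclass, field
from typing import Dict, Optional

# Matches tags such as "[PATCH 2/5]", "[PATCH v3 0/7]" or "[RFC PATCH v2 1/4]".
SUBJECT_TAG = re.compile(
    r"\[(?P<prefix>[^\]]*?PATCH[^\]]*?)\s(?P<index>\d+)/(?P<total>\d+)\]"
)


@dataclass
class PatchSeries:
    """Hypothetical container for one version of a patch series."""
    version: int
    total: int
    cover_letter: Optional[str] = None
    patches: Dict[int, str] = field(default_factory=dict)


def parse_subject(subject):
    """Return (version, index, total) for a numbered patch subject, else None."""
    match = SUBJECT_TAG.search(subject)
    if match is None:
        return None
    version = re.search(r"\bv(\d+)\b", match.group("prefix"))
    return (
        int(version.group(1)) if version else 1,  # no "vN" tag means version 1
        int(match.group("index")),
        int(match.group("total")),
    )


def build_series(subjects):
    """Collect the subjects of one mail thread into a PatchSeries."""
    series = None
    for subject in subjects:
        if subject.lower().startswith("re:"):
            continue  # skip review replies
        parsed = parse_subject(subject)
        if parsed is None:
            continue
        version, index, total = parsed
        if series is None:
            series = PatchSeries(version=version, total=total)
        if index == 0:
            series.cover_letter = subject  # "0/N" is the cover letter
        else:
            series.patches[index] = subject
    return series
```

The sketch assumes that all subjects passed to `build_series` come from a single thread; a real implementation would more likely group mails by their threading headers than by subject alone.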

from pasta.

bulwahn commented on July 25, 2024

PaStA relates patches to each other based on heuristics optimised for relating patches.

However, PaStA also has the information about which patches are in which series. So we can use a further algorithm or heuristic to conclude, from related patches across multiple series, which series (possibly identified by their cover letters) are related to each other.

We can also consider possible metrics on the cover letters of the series as a further factor for determining the correct relation between series.
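
One way this could look, purely as a sketch: assume we already have a mapping from each patch to a cluster of related patches (the input format here is an assumption, not PaStA's actual data structures), and score two series by the Jaccard overlap of the clusters their patches fall into. The function name and the `threshold` parameter are hypothetical.

```python
from itertools import combinations


def relate_series(series_patches, patch_cluster, threshold=0.5):
    """Score pairs of series by the overlap of their patches' clusters.

    series_patches: dict mapping series id -> iterable of patch ids
    patch_cluster:  dict mapping patch id -> id of its cluster of related patches
    Returns {(series_a, series_b): score} for pairs scoring >= threshold.
    """
    # Replace each series by the set of clusters its patches belong to.
    clusters = {
        sid: {patch_cluster[p] for p in patches if p in patch_cluster}
        for sid, patches in series_patches.items()
    }
    related = {}
    for a, b in combinations(sorted(clusters), 2):
        union = clusters[a] | clusters[b]
        if not union:
            continue
        # Jaccard index: shared clusters divided by all clusters touched.
        score = len(clusters[a] & clusters[b]) / len(union)
        if score >= threshold:
            related[(a, b)] = score
    return related
```

A cover-letter metric could later be folded in as an extra term in `score`; the sketch leaves that open.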

@rralf Would this be a suitable task for a bachelor's/master's thesis topic?

bulwahn commented on July 25, 2024

The issue should probably be renamed to "Compute relation between patch series"; the cover letter is only one part of a patch series, albeit one that identifies the series and is unique to it.

vaniisgh commented on July 25, 2024

Hey :)
I read about PaStA on the community bridge website and was looking at this issue. It seems really interesting (& challenging), and I would like to try to work on something like this, or possibly contribute to smaller issues independently. If you have any pointers on how to go about this process, I would really appreciate it.

thanks & regards

edit: maybe something like #33 combining PaStA with the cregit tool will be more suitable, but the algorithm part of this issue really excites me :)

bulwahn commented on July 25, 2024

@vaniisgh we have enough work on all ends of this project, deep internals, nice visualisations, connecting with other tools etc.

I think this task is suitable for a mentorship. To begin with, I need to ask whether you roughly know the kernel workflow on the mailing lists, e.g., do you know what a cover letter is, what a patch series is, etc.

A somewhat simpler way to get started is to look into #21 or #14; please have a look, and then we can create a vision for a tool that we would like to develop for those points.

vaniisgh commented on July 25, 2024

Thanks for the reply :)
Honestly, I am a beginner when it comes to contribution workflows: I have only ever used GitHub & Gerrit to push changes, so I haven't really used a mailing list before. Though I am aware of cover letters, I haven't ever sent one. I think my knowledge of patches and patch series is a bit better :)

I will look at the issues you have mentioned and possibly comment on my doubts/ideas on the appropriate one and try to get started on one of those first.

vaniisgh commented on July 25, 2024

So I was reading through the papers mentioned in the readme :)
and was thinking about how the current algorithms could be extended most elegantly. I was wondering if this should be done by

  • extracting keywords from the cover letter and commit messages (which is currently done with the Levenshtein string distance after tokenisation), extending that with something like the Needleman-Wunsch or an affine-gap algorithm.

I was referring to these resources to understand the string matching better:

and then extend the same methodology for the diffs too.
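
To make the Needleman-Wunsch idea concrete, here is a minimal sketch of the global alignment score computed over tokens instead of characters. It uses a linear gap penalty rather than an affine one, and the scoring values (`match`, `mismatch`, `gap`) are arbitrary assumptions, not values tuned for PaStA.

```python
def needleman_wunsch(a, b, match=1, mismatch=-1, gap=-1):
    """Global alignment score of token sequences a and b (linear gap penalty)."""
    # prev[j] holds the best score for aligning the processed part of a with b[:j].
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        curr = [i * gap]  # aligning a[:i] against an empty prefix of b
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # Best of: align both tokens, gap in b, gap in a.
            curr.append(max(diag, prev[j] + gap, curr[j - 1] + gap))
        prev = curr
    return prev[-1]


# Example: tokenise two commit subjects on whitespace and compare them.
score = needleman_wunsch("fix null deref".split(), "fix deref".split())
```

The same function would work on tokenised diff lines, since it only relies on tokens being comparable with `==`.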

I also have this ... kind of adventurous idea. It really is half baked though ...
In bioinformatics we use a lot of sequence alignment algorithms, and it would be cool to use them here too, since code, like DNA or protein sequences, has a fixed number of sensible tokens. This is still something I am thinking about, but I wanted to share. I was thinking of something like:

  • the BLAST algorithm

  • The algorithm used by Clustal.
    The steps of the CLUSTAL algorithm are:

    1. Calculate all possible pairwise alignments and record the score for each pair.
    2. Calculate a guide tree based on the pairwise distances (algorithm: Neighbor Joining).
    3. Find the two most closely related sequences.
    4. Align the sequences by the progressive method:
       i. Calculate a consensus of this alignment.
       ii. Replace the two sequences with the consensus.
       iii. Find the two next-most closely related sequences (one of these could be a previously determined consensus sequence).
       iv. Iterate until all sequences have been aligned.
    5. Expand the consensus sequences with the (gapped) original sequences.
    6. Report the multiple sequence alignment.

Then we could use this sequence alignment to generate similarity results based on the weights/significance and the amount of total changed code that matches?
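
Only the opening of the Clustal procedure above (scoring all pairs, then picking the two closest sequences) is sketched here; the guide tree and the progressive alignment are left out. As a stand-in for a real alignment score, the sketch uses `difflib.SequenceMatcher.ratio()` from the Python standard library, which is an assumption made for brevity and not what Clustal does.

```python
from difflib import SequenceMatcher
from itertools import combinations


def pairwise_scores(sequences):
    """Score every pair of token sequences (higher means more alike)."""
    return {
        (i, j): SequenceMatcher(None, sequences[i], sequences[j]).ratio()
        for i, j in combinations(range(len(sequences)), 2)
    }


def closest_pair(scores):
    """Return the pair of sequence indices with the highest score."""
    return max(scores, key=scores.get)


# Example: three tokenised code lines; the first two differ in one token.
sequences = [
    ["int", "x", "=", "0", ";"],
    ["int", "x", "=", "1", ";"],
    ["return", "y", ";"],
]
scores = pairwise_scores(sequences)
best = closest_pair(scores)
```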

vaniisgh commented on July 25, 2024

Thanks for taking the time to review all this and answer any doubts I have. I'm just trying to understand PaStA atm, so sorry about all the irrelevant comments.
I think I understand the issues outlined now. thanks :)
I will follow up soon with a more appropriate solution idea/POC.
