Comments (8)
Nice collection of stuff, but unfortunately probably all irrelevant.
The PaStA project faces two issues:
- We have very few applications implemented on top of PaStA (and no real users of the overall program), because many of the ideas for the use cases are not implemented. This should be the focus.
- We have only a very small ground truth. No sophisticated algorithm helps as long as the ground truth is this small. There is no way to enlarge the ground truth dataset, so we should focus on fixing specific systematic issues.
This issue is about extending the data structures to identify and include the notion of patch series, and then trying to compute a relationship between those series.
from pasta.
PaStA relates patches to each other based on a heuristic optimised for relating patches.
However, PaStA also knows which patches belong to which series. So we can use a further algorithm or heuristic to conclude, from related patches spread across multiple series, which series (possibly identified by their cover letters) are related to each other.
We can also consider metrics derived from the cover letters of the series as a further factor in determining the correct relation between series.
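A minimal sketch of that idea, assuming we already have a mapping from patch to series and the groups of related patches. The function name `relate_series`, the input shapes, and the `min_overlap` threshold are all hypothetical and not part of PaStA:

```python
from collections import defaultdict
from itertools import combinations


def relate_series(series_of, clusters, min_overlap=0.5):
    """Return pairs of series that share enough related patches.

    series_of: maps a patch id to the id of its series (hypothetical input).
    clusters:  iterable of sets of patch ids that were judged related.
    """
    # Count the patches of each series.
    size = defaultdict(int)
    for series in series_of.values():
        size[series] += 1

    # For each pair of series, collect the patches that have a related
    # counterpart in the other series.
    matched = defaultdict(lambda: (set(), set()))
    for cluster in clusters:
        by_series = defaultdict(set)
        for patch in cluster:
            if patch in series_of:
                by_series[series_of[patch]].add(patch)
        for a, b in combinations(sorted(by_series), 2):
            left, right = matched[(a, b)]
            left |= by_series[a]
            right |= by_series[b]

    related = {}
    for (a, b), (left, right) in matched.items():
        # Fraction of the larger series that is covered by related patches.
        score = min(len(left), len(right)) / max(size[a], size[b])
        if score >= min_overlap:
            related[(a, b)] = score
    return related
```

Dividing by the larger series keeps a single shared patch from relating a long series to a one-patch series. A cover-letter metric could later be folded into `score` as an additional weighted term.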
@rralf Would this be a suitable task for a bachelor's/master's thesis topic?
from pasta.
Probably, the issue should be renamed to "Compute relation between patch series"; the cover letter is only one part of a patch series, the part that identifies the series and is unique to it.
from pasta.
Hey :)
I read about PaStA on the community bridge website and was looking at this issue. It seemed really interesting (and challenging), and I would like to try to work on something like this, or even contribute to smaller issues independently. If you have any pointers on how to go about this process, I would really appreciate it.
thanks & regards
edit: maybe something like #33 (combining PaStA with the cregit tool) will be more suitable, but the algorithm part of this issue really excites me :)
from pasta.
@vaniisgh we have enough work on all ends of this project, deep internals, nice visualisations, connecting with other tools etc.
I think this task is suitable for a mentorship. To begin with, I need to ask whether you roughly know the kernel workflow on the mailing lists, e.g., do you know what a cover letter is, what a patch series is, etc.?
Also, a bit simpler to get started is to look into #21 or #14; please have a look, then we create a vision for a tool that we would like to develop for those points.
from pasta.
Thanks for the reply :)
Honestly, I am a beginner to contribution workflows. I have only ever used GitHub and Gerrit to push changes, so I haven't really used a mailing list before. Though I am aware of cover letters, I haven't ever sent one. I think my knowledge of patches and patch series is a bit better :)
I will look at the issues you have mentioned and possibly comment on my doubts/ideas on the appropriate one and try to get started on one of those first.
from pasta.
So I was reading through the papers mentioned in the readme :)
and was thinking about how the current algorithms could be extended most elegantly. I was wondering if this should be done by
- extracting keywords from the cover letter and commit messages, and extending the current comparison (Levenshtein string distance after tokenisation) with something like the Needleman-Wunsch algorithm or an affine-gap algorithm.
I was referring to these resources to understand the string matching better:
- https://people.cs.umass.edu/~mccallum/courses/cl2006/lect4-stredit.pdf
- http://www.cs.utexas.edu/~ml/papers/marlin-kdd-03.pdf
and then extend the same methodology for the diffs too.
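To make the suggestion concrete, here is a small sketch of Needleman-Wunsch global alignment scoring over token sequences. It uses a linear gap penalty (not the affine variant), the scoring values are arbitrary illustrative defaults, and the function names are hypothetical rather than anything PaStA provides:

```python
def needleman_wunsch(a, b, match=1, mismatch=-1, gap=-1):
    """Global alignment score of two token sequences (linear gap penalty)."""
    # prev holds the previous row of the dynamic-programming table.
    prev = [j * gap for j in range(len(b) + 1)]
    for i, tok_a in enumerate(a, start=1):
        curr = [i * gap]
        for j, tok_b in enumerate(b, start=1):
            diag = prev[j - 1] + (match if tok_a == tok_b else mismatch)
            curr.append(max(diag, prev[j] + gap, curr[j - 1] + gap))
        prev = curr
    return prev[-1]


def similarity(a, b):
    """Scale the score by the longer sequence and clip negatives to 0."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    return max(0.0, needleman_wunsch(a, b) / longest)


first = "fix null pointer dereference in foo".split()
second = "fix null pointer deref in foo".split()
print(needleman_wunsch(first, second), similarity(first, second))
```

With these default scores, identical sequences get a similarity of 1.0; an affine-gap variant would additionally distinguish opening a gap from extending one.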
I also have this ... kind of adventurous idea. It really is half baked though ...
In bioinformatics we use a lot of sequence alignment algorithms, and it would be cool to use them here too, since code, like DNA or protein sequences, has a fixed number of sensible tokens. This is still something I am thinking about, but I wanted to share. I was thinking of something like:
- the BLAST algorithm
- the algorithm used by Clustal. The steps of the CLUSTAL algorithm are:
-- Calculate all possible pairwise alignments and record the score for each pair.
-- Calculate a guide tree based on the pairwise distances (algorithm: Neighbor Joining).
-- Find the two most closely related sequences.
-- Align the sequences by the progressive method:
i. Calculate a consensus of this alignment.
ii. Replace the two sequences with the consensus.
iii. Find the two next-most closely related sequences (one of these could be a previously determined consensus sequence).
iv. Iterate until all sequences have been aligned.
-- Expand the consensus sequences with the (gapped) original sequences.
-- Report the multiple sequence alignment.
Then we could use this sequence alignment to generate similarity results based on the weights/significance and the amount of changed code that matches?
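As a rough illustration of the first Clustal steps only (pairwise scores and a guide tree), here is a sketch on token sequences. It is a simplification under stated assumptions: pairwise similarity comes from Python's `difflib.SequenceMatcher` rather than a real alignment score, and the tree is built with average linkage instead of Neighbor Joining. The function names are hypothetical, and the progressive alignment itself is not shown:

```python
from difflib import SequenceMatcher
from itertools import combinations


def pairwise_distances(seqs):
    """Step 1: distance (1 - similarity ratio) for every pair of sequences."""
    dist = {}
    for i, j in combinations(range(len(seqs)), 2):
        ratio = SequenceMatcher(None, seqs[i], seqs[j], autojunk=False).ratio()
        dist[frozenset((i, j))] = 1.0 - ratio
    return dist


def guide_tree(seqs):
    """Steps 2-3: repeatedly join the two closest clusters.

    Uses average linkage for brevity; Clustal itself uses Neighbor Joining.
    Expects at least one sequence and returns the tree as nested tuples
    of sequence indices.
    """
    dist = pairwise_distances(seqs)
    # Each cluster is a pair (tree, member indices).
    clusters = [(i, (i,)) for i in range(len(seqs))]

    def cluster_distance(x, y):
        pairs = [(a, b) for a in x[1] for b in y[1]]
        return sum(dist[frozenset(p)] for p in pairs) / len(pairs)

    while len(clusters) > 1:
        x, y = min(combinations(clusters, 2), key=lambda p: cluster_distance(*p))
        clusters.remove(x)
        clusters.remove(y)
        clusters.append(((x[0], y[0]), x[1] + y[1]))
    return clusters[0][0]
```

The nested tuples give the order in which a progressive aligner would merge the sequences, starting with the innermost pair.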
from pasta.
Thanks for taking the time to review all this and answer any doubts I have. I'm just trying to understand PaStA atm, so sorry about all the irrelevant comments.
I think I understand the issues outlined now. thanks :)
I will follow up soon with a more appropriate solution idea/POC.
from pasta.