microsoft / linux-comma Goto Github PK

The Linux Commit Analyzer tracks patches from upstream and downstream kernels.

License: MIT License

Python 100.00%

linux-comma's Introduction

Linux Commit Analyzer (CommA)

This tool populates a database with the details of which patches that are upstream (in linux-mainline) are missing from distros in their downstream kernels (e.g. Ubuntu kernels). For our use case, we are looking at Hyper-V patches and checking downstream azure specific kernels, but this is adaptable.

Getting Started

Install Dependencies

Setup Microsoft SQL repos (this is for mssql-tools):

Caution: these commands use sudo and add a package repository. Change the Ubuntu version to your version (cat /etc/os-release for the number).

. /etc/os-release
curl https://packages.microsoft.com/keys/microsoft.asc | sudo apt-key add -
curl https://packages.microsoft.com/config/ubuntu/${VERSION_ID}/prod.list | sudo tee /etc/apt/sources.list.d/msprod.list

Install the required packages for connecting to the Microsoft SQL database:

sudo apt update
sudo apt install mssql-tools unixodbc-dev

For the symbol matcher, install exuberant-ctags as the default ctags will not work:

apt install exuberant-ctags

Install CommA

pip install .

Running CommA

Provide the URL to the secrets repo with export COMMA_SECRETS_URL=<URL/to/secrets/repo>.
Run comma run --upstream --downstream

This will parse the upstream and downstream repos.

Setting Up Secrets

Place database info into the following environment variables before running CommA: COMMA_DB_URL COMMA_DB_NAME COMMA_DB_USERNAME COMMA_DB_PW

Tip

We leave it to the user to keep their secrets secure before running CommA. Azure Pipelines provides a few mechanisms for managing secrets, use a comparable tool when creating a CommA pipeline to ensure you don't leak your database credentials.

Contributing

This project welcomes contributions and suggestions. Most contributions require you to agree to a Contributor License Agreement (CLA) declaring that you have the right to, and actually do, grant us the rights to use your contribution. For details, visit https://cla.opensource.microsoft.com.

When you submit a pull request, a CLA bot will automatically determine whether you need to provide a CLA and decorate the PR appropriately (e.g., status check, comment). Simply follow the instructions provided by the bot. You will only need to do this once across all repos using our CLA.

This project has adopted the Microsoft Open Source Code of Conduct. For more information see the Code of Conduct FAQ or contact [email protected] with any additional questions or comments.

Legal Notices

Microsoft and any contributors grant you a license to the Microsoft documentation and other content in this repository under the Creative Commons Attribution 4.0 International Public License, see the LICENSE-DOCS file, and grant you a license to any code in the repository under the MIT License, see the LICENSE file.

Microsoft, Windows, Microsoft Azure and/or other Microsoft products and services referenced in the documentation may be either trademarks or registered trademarks of Microsoft in the United States and/or other countries. The licenses for this project do not grant you rights to use any Microsoft names, logos, or trademarks. Microsoft's general trademark guidelines can be found at http://go.microsoft.com/fwlink/?LinkID=254653.

Privacy information can be found at https://privacy.microsoft.com/en-us/

Microsoft and any contributors reserve all other rights, whether under their respective copyrights, patents, or trademarks, whether by implication, estoppel or otherwise.

linux-comma's People

Contributors

Stargazers

Watchers

Forkers

taffywrinkle claudiusgonzo mcgov standardgalactic frankfanslc isabella232 avylove

linux-comma's Issues

Improve spreadsheet handling

Multiple changes can be implemented to improve spreadsheet experience

Spreadsheet export stalls if a column doesn't exist, create and log instead
Add support for creating spreadsheet from scratch rather than requiring an existing spreadsheet
Export and update spreadsheet in single operation
Add option to limit spreadsheet operations (ex: commit, tag, date, distro, etc)
Support updating spreadsheet in place rather than generating new file
- modify a copy, replace original as last step only if there are no errors
See if fetch operations can be avoided for spreadsheet operations
- Can this use only info from the database?
Can a spreadsheet be generated directly without requiring a database?

Refactor session-specific code to facilitate testing

A few components, specifically database sessions, tracking info, and config settings are implemented in a way they are singletons within a run of the program. However, this makes testing more difficult because the tester will need to be aware of all of those and reset them, or run all the tests in subprocesses.

Typically, this is all done through session instances which are instantiated at or close to the program entry point and held in the entry point function or class instance. This makes it easier to control and test.

Getting to a standard state will need to be done in steps, some of which have already been implemented.

Convert config to another object

Config is currently a module, but that creates issues for testing because values need to be changed or mocked for each test. Config should be a class or potentially an argparse.Namespace object. It may, at some point, also make sense to add validation and this should be kept in mind when implementing a solution. It would also be helpful to be able to specify configuration options both through CLI options and a config file with CLI options taking precedence.

Add option to run downstream against specific distros or repos

Currently, downstream is run against all the distros and monitoring subjects in the database. This can take a very long time. Sometimes it desireable to just check one distro or repo, so an option should be added to handle that case.

Allow specifying priority of upstream commits

Currently, priority exists in the database, but has to be set manually.

Priority could potentially be set in the following ways:

Subcommand that requires commit and priority number, sets value in database, could also take file or stdin for bulk
Config-base priority overrides
Import from spreadsheet

Add type annotations and checking

In order to to detect inconsistencies and bugs, type annotations should be added to the code and type checking should be added to CI.

Determine if metadata and code fuzzy-matching is reliable

This is a followup to #34 since the data is not currently available.
Look at the result to see how often this is used to find additional merged commits and how accurate it is.

Reference naming inconsistent

Git references are referred to by multiple names in the code and documentation: ref, reference, revision, tag, kernel
They should be referred to consistently as reference (or ref for short, though it should be avoided) unless referring to a specific kind of reference (tag, branch, head, commit). This will make it clearer what the code is doing and operating against.

Add metrics report option to 'run' subcommand

In order to determine bottlenecks and identify regressions, record metrics for time spent in each stage for upstream and each downstream target. For example:

Total run time: 8m 37s
-- Upstream --
linux:
Fetching: 4m 2s
Processing Commits: 12s

-- Downstream --
Ubuntu22.04:
Fetching: 1m 12s
Cherries search: 34s
Metadata matching: 12s
Code comparison: 1m 37s

Is it necessary to skip the root commit in process_commits()

Original comment

        # First ever commit, we don't need to store this as it'll be present in any distro
        # TODO revisit, maybe check against set hash of first commit?
        # Get code some other way? Unsure if first commit matters or not.

This skips commits if they don't have parent, i.e. root commit. There's no checking to make sure the root commits match and other mechanisms should filter out the root commit if it even comes up, so this probably isn't necessary. Technically, in the current state, it would be wrong if compared against an empty repo.

Avoid creating PatchDiff objects more than once

PatchDiff objects seem to get generated multiple times for the same objects

For every missing cherry, patch_matches() is called. assuming there is no metadata match, a new PatchDiff object will be created for all the downstream patches
- Probably better to do the metadata matching for all upstream in bulk, and then do code matching in bulk to avoid duplicate operations

Add capability to check a specific commit

There may be cases where an end user is only interested in a single commit, potentially only for a single distro and reference. This could be combined with a --quiet option that sets the return code based on the result. This could then be used in scripts. It may also be useful to give an option to only check the database rather that having to do all the analysis.

Spreadsheet isn't sorted

New commits are added to the bottom of the spreadsheet, usually after a few empty rows rather than being sorted with the latest commits on top. openpyxl is used fir spreadsheet operations, and does support adding sort commands to the spreadsheet, but will not actually sort on save. In order to do this, all the rows need to be read, sorted, and written back to the spreadsheet.

Spreadsheet uses hardcoded tag and ignores `--since` for oldest commit

The spreadsheet logic currently uses ancestors of the tag v4.15 to determine if a commit should be included for export. Using a hardcoded tag has limitations of it's own, but it also should rely of the --since tag if it is provided (or the default value) since the age of the commit should be determinable from the commit metatdata already in the database.

This requires the spreadsheet operations to clone the repo even when the tag won't be found in the database.

Migrate to SQLAlchemy 2.0

In preparation for SQLAlchemy 2.0, SQLAlchemy 1.4 (released March 2021) added support for SQLAlchemy 2.x syntax while retaining backwards support for 1.x syntax to give time for projects to migrate. SQLAlchemy 2.0 was released in January 2023. SQLAlchemy 1.x is now in maintenance status and maintainers recommend active projects migrate to 2.x. Comma will first need to move to 2.x syntax with 1.4 installed and confirm functionality before switching to the 2.x package. Since 2.x adds improved typing support, it may be more efficient to do this before typing is added.

Migration guide: https://docs.sqlalchemy.org/en/20/changelog/migration_20.html

Improve get_tracked_paths()

get_tracked_paths() looks at the maintainers file for all tags in the repo and then filters so it's only looking at v4 and above. But you're not guaranteed to have all the tags locally, so the only one that will always get checked is the default one, origin/master.

There is a chance, a path we care about is in a previous tag, but not in master. So that wouldn't be caught. There's really no way to know unless you make sure all the tags are local. Previously this was accomplished by pulling the complete history, but that's expensive if you do it over and over.

It's also problematic because this is a hard-coded limit.

A potential fix is to provide an option such as --fetch-all to pull the complete history in cases where the cost is accepted or the full repo is already available. Another potential fix is to use another source for this data, such as GitHub. Since the goal is to get history of a single text file, this would be less bandwidth intensive than pulling the whole repo. This would make more sense if this logic is moved to a Linux-specific plugin rather than being in the main program logic.

Use local diff data for comparisons

Patch objects are created and uploaded to the database and then later downloaded for comparison. Does it make sense to use these objects for comparison rather than just doing a direct comparison from the commit objects available locally?

One reason we may still want to do this is for cases where we aren't pulling all of the repo history. However, what is pulled from the database is based on the result of the missing cherries, so it's all already available locally

Review database model

Review the current database model and modify if changes are needed.

There are several TODO items in the code describing changes that could/should be made

PatchData.commitDiffs currently stores data in the format "filename1\n+added line\n-removed line\nfilename2\n+added line\n-removed line". This is not a robust format and could break if a \n within a line is not properly handled by str.splitline(). This is a corner case, but possible. A binary blob format would be more reliable where each entry is prefaced with the number of bytes to read. For example: 0x080x02filename0x0b+added line0x0e-removed line where denotes the filename has a length of 8 and there are 2 lines changed in the file. Each line is prefaced with the length of that line. This example uses 2 bytes for the numbers, which would be a maximum size of 65535. That should likely be more that sufficient for the file name length, the number of lines changed in a file, and the length of a line.

When no paths are returned from get_tracked_paths(), all paths are checked

If get_tracked_paths() returns no paths, then it gets treated as all paths. This could happen if an invalid section is given. It should log a warning that no paths for found and skip path-dependent operations. If the section is not found, it should also print a warning.

Review TODO statments in code

There are currently 21 TODO statements in the code. Review to make sure they are still relevant and, if they are, create and add an issue number to track.

Treat database as input source and output destination, not working area

The database should be treated as an optional input source and output destination rather than a working memory space.
This would then support capabilities like working directly with a spreadsheet, output as a text report, or output as JSON.

Configuration information such as the repos to check can still come from the database, but should be available another way as well.

Add capability to edit and remove distros

There are currently only options to add a new distro or add a new revision to track. There should be options to remove either and to edit them.

Symbols doesn't print in CI

nox -vs symbols does not seem to detect any symbols when run in GitHub Actions, but does detect symbols when run locally. Will need to investigate,

print-symbols clones the Linux repo twice

The print-symbols command clones the repo once for get_tracked_paths() and again to generate the symbol table. This significantly slows down the total time. Use the same repo for get_tracked_paths() to avoid this.

Support tracking of out-of-tree patches

Add capability to track patches that are not in upstream.

There are two approaches for this depending on the use case.

Patches are consolidated in another branch or repo
Use capability tracked in #28 to track a different upstream source.
Potentially we'd want to support multiple upstream sources in order to combine results from multiple sources

Patches are in multiple repositories
Add a capability to add patches to database. They would be pulled down when the comparison occurs. The only way to check these now is fuzzy matching, but we could also use patch apply and patch apply --reverse before fuzzy matching, though this may not be worth it.

If one wanted to track when these get into upstream, the upstream reference could be added as downstream as well. Since they are the same source, there would be no missing cherries and the only patched compared would be the ones in the database.

Internal Issue: https://microsoft.visualstudio.com/OS/_workitems/edit/23401757

Spreadsheets uses a hard-coded path value to exclude commits from spreadsheet

Spreadsheet.include_commit() uses a hard-coded value, 'tools/hv/' to determine if a commit should be excluded from the spreadsheet..

Spreadsheets.get_db_commits() also uses a hard-coded value, "%fs/cifs%" to exclude commits.

There shouldn't be two different methods for the same thing. These should be made into a configurable filter to both make the tool generic and to better inform the user what's happening.

This repo is missing important files

There are important files that Microsoft projects should all have that are not present in this repository. A pull request has been opened to add the missing file(s). When the pr is merged this issue will be closed automatically.

Microsoft teams can learn more about this effort and share feedback within the open source guidance available internally.

Merge this pull request

Add Centos Stream as default downstream source

Centos stream kernel seems to located here:
https://gitlab.com/redhat/centos-stream/src/kernel/centos-stream-9

Internal issue: https://microsoft.visualstudio.com/OS/_workitems/edit/36368506/

Improve CLI argument parsing

Currently, global command line arguments have to be applied before the sub command or they are treated as unknown arguments. Instead, global options should be interpreted regardless of where they are.

There are multiple ways to do this, but the easiest is probably to create a base parser for global options and pass the parents argument when creating new parsers, including the base parser as a parent.

Some options can be better group following a typical noun verb order, for example distro -list and distro --add rather than print-distros and add-distro

Separate downstream and upstream "since"

Currently "since" is treated universally for upstream and downstream, but it is desirable to use a longer upstream history than downstream history.

Note: The original behavior did not put a "since" on upstream. This is less efficient, but may need to be restored.

Option to address:

Adding an options to set upstream since
Use a fixed multiple of upstream since for download (2x, 4x, etc)
Use full upstream history (less efficient for fetching and finding missing cherries)

Whichever option is used should also apply to git log --cherry-pick

Track how a patch was determined to be present

Currently a patch the release a patch is found in, but not how that was determined. Recording how this was determined is useful for validation and to see if matching is effective. An additional field should e added that records this (cherries, metadata match, code patch, manual).

Make repository location configurable

Repositories are currently stored in a Repos directory in the directory comma is executed from and can not be configured. The default directory name should identify itself with comma. The directory should also be configurable.

Add capability to check for patches in code, not git

This is to be able to check against source tarballs and similar when there is no git repo available.

Internal issue: https://microsoft.visualstudio.com/OS/_workitems/edit/23401829/

Add support for Debian Kernels

The debian kernel, available at https://salsa.debian.org/kernel-team/linux.git, uses method of pulling the upstream source and layering a collection of patches on it. Patches can also be found at https://sources.debian.org/src/linux//debian/patches.

Additional logic is needed to parse the Debian config and patches to determine what patches have been applied.

print-symbols raises an error if called twice without deleting repo

The symbols subcommand currently fails with a raised exception when called more than once without deleting the database. It likely is trying to update the repo in an unsupported way.

$ comma --dry-run --no-fetch --verbose print-symbols --file /tmp/CommA_kaefc2on
Welcome to Commit Analyzer!
Starting the Symbol Checker...
05-18 13:24 root  INFO    Using local SQLite database at 'comma.db'.
05-18 13:24 root  INFO    Pulling 'linux-sym' repo...
Traceback (most recent call last):
  File "/bin/comma", line 8, in <module>
    sys.exit(main())
             ^^^^^^
  File "/lib/python3.11/site-packages/comma/__main__.py", line 242, in main
    args.func(args)
  File "/lib/python3.11/site-packages/comma/__main__.py", line 138, in <lambda>
    symbol_parser.set_defaults(func=(lambda args: print_missing_symbols(args.file)))
                                                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/lib/python3.11/site-packages/comma/util/symbols.py", line 113, in print_missing_symbols
    get_hyperv_patch_symbols()
  File "/lib/python3.11/site-packages/comma/util/symbols.py", line 83, in get_hyperv_patch_symbols
    map_symbols_to_patch(
  File "/lib/python3.11/site-packages/comma/util/symbols.py", line 51, in map_symbols_to_patch
    repo = get_repo(name="linux-sym", shallow=False, pull=True)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/lib/python3.11/site-packages/comma/util/tracking.py", line 59, in get_repo
    repo.remotes.origin.pull(progress=GitProgressPrinter())
  File "/lib/python3.11/site-packages/git/remote.py", line 1055, in pull
    res = self._get_fetch_info_from_stderr(proc, progress, kill_after_timeout=kill_after_timeout)
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/lib/python3.11/site-packages/git/remote.py", line 854, in _get_fetch_info_from_stderr
    proc.wait(stderr=stderr_text)
  File "/lib/python3.11/site-packages/git/cmd.py", line 604, in wait
    raise GitCommandError(remove_password_if_present(self.args), status, errstr)
git.exc.GitCommandError: Cmd('git') failed due to: exit code(1)
  cmdline: git pull -v --progress -- origin

Improve logging

A few improvements are possible with logging

Tasks

Beta Give feedback

Use lazy logging to avoid unnecessary interpolation
Since we almost always run with verbose, default to info and add option for quiet
Avoid using root logger directly, use per-module loggers instead
Enumerate the number of distros / refs (ex: 1 of 5) to give indication of progress
Colorize output for readability
Remove or convert unnecessary print statements
Options

Add support to import data from spreadsheets

This came from unimplemented methods and notes in the codebase.

Speadsheet data could be added to existing database data or used to seed a new database.

Data will need to generated for commits based on the commit ID, which will require cloning the upstream repo and searching for the referenced commits.

Add standard linting

Enable pylint in CI, disabling any failing checks, then enable checks as issues are resolved. An initial run indicates there may be bugs in the code that are being flagged along with some inefficient and inconsistent code.

Add support for Fedora Kernel repo

Fedora kernel is maintained at https://gitlab.com/cki-project/kernel-ark

Look to see how tags are managed and add logic to get latest if needed

Update documentation

With the various changes, we need to ensure there is sufficient documentation available to end users. Currently, the documentation is pretty limited. There's a readme and some internal documentation. A readme may be sufficient, but there may be a need to create additional documentation.

Symbols uses a hard-coded value to determine where to begin checking

map_symbols_to_patch() uses a hard-coded value for previous commit, 097c1bd5673edaf2a162724636858b71f658fdd2

This could be passed as a command line argument or use the git log against the paths (like done in get_missing_cherries) to find the earliest commit

--dryrun is not an actual dry run

The --dryrun option will use a local database comma.db in the current directory (creating one if it doesn't exist) instead of having to specify a database. This is not actually a dry run which is typically performing all the normal operations, but not writing.

This behavior should probably be renamed and an actual dry run behavior implemented. The main roadblock is the database is used as a working space rather than an input/output destination. We'll need to see if it's possible to remove it from the core processing.

Review Patch-Helper to see if it offers any additional capability that can be integrated

Look at Rakesh Ginjupalli's (Microsoft) patch-helper to see if it has additional capabilities that can be used in Comma.

https://github.com/linuxelf001/Patch-helper

Internal issue: https://microsoft.visualstudio.com/OS/_workitems/edit/25899563/

Genericize tool

Comma is coded to wok specifically with the Linux kernel, but most of the logic is not Linux-specific and could work with other forked projects.

Outside of some hard-coded values, the only logic that appears to be Linux-specific is the parsing of the maintainers file to get paths. This can likely be done through an plugin referenced in a config file.

Git operations fail for transient HTTP errors

Ubuntu repos randomly, but regularly fail with errors similar to:

  stderr: 'fatal: unable to access 'https://git.launchpad.net/~canonical-kernel/ubuntu/+source/linux-azure/+git/jammy/': The requested URL returned error: 500'

Usually retrying will succeed for the same operation. Detect error, log, and retry operation.

Exception is raised when called without a subcommand

This should not throw an exception. Probably it should print usage.

$ comma
Welcome to Commit Analyzer!
Traceback (most recent call last):
  File ".venv/bin/comma", line 8, in <module>
    sys.exit(main())
             ^^^^^^
  File "comma/__main__.py", line 242, in main
    args.func(args)
    ^^^^^^^^^
AttributeError: 'Namespace' object has no attribute 'func'

Add capability to tweak what is analyzed

This would add the ability to adjust what upstream patches are analyzed, For example:

Add or remove specific commits
- Ex: commit unrelated to monitored sections
Add or remove specific paths
- Ex: all paths for a section except one