From <a class="issue-link js-issue-link" data-error-text="Failed to load title" data-i

<a class="user-mention notranslate" data-hovercard-type="user" data-hovercard-url="/us

Write demonstrator of Python bytecode → Python AST,about numba/numba-rvsdg

Comments (24)

esc commented on June 14, 2024 1

@jpivarski you can also run the restructure method from https://numba-rvsdg.readthedocs.io/en/latest/reference/datastructures.html#numba_rvsdg.core.datastructures.byte_flow.ByteFlow.restructure

This will apply the LOOP-RESTRUCTURE and BRANCH-RESTRUCTURE algorithms from sections 4.1 and 4.2 of:

https://dl.acm.org/doi/pdf/10.1145/2693261

and then render the graph to see what it looks like.

The main goal of the RVSDG package right now is to transform Python bytecode into an SCFG which can be restructured.

Note also, that some of the top-level classes are scheduled for removal in:

#75

But it shouldn't be that big of a change.

from numba-rvsdg.

rocky commented on June 14, 2024 1

For a prototype, I'd start with https://github.com/rocky/python-decompile3 which handles 3.7 and 3.8 code only. The code style is generally more modern Python and it follows better the ideas in the branch that I believe addresses control flow issues.

I have written a bit about recent thoughts and work as to how to do decompilation in rocky/python-uncompyle6#412

from numba-rvsdg.

jpivarski commented on June 14, 2024

(Oh, I can't assign it to myself, but feel free to assign it to me. I'll do the work in a fork and make it a draft PR, which will probably never get out of draft status or be merged.)

from numba-rvsdg.

esc commented on June 14, 2024

@jpivarski I have assigned you.

BTW: the current numba-rvsdg 0.0.2 only supports transformation from Python bytecode to SCFG (structured control flow graph). While the documentation is still a bit sparse at present, we do have:

https://numba-rvsdg.readthedocs.io/en/latest/reference/datastructures.html#numba_rvsdg.core.datastructures.scfg.SCFG

Although the SCFG object has mostly methods for creating the SCFG from python bytecode, rather than introspecting it.

from numba-rvsdg.

jpivarski commented on June 14, 2024

Thanks! I figured that at present, there would only be conversions from Python bytecode to something new, such as this SCFG class. Through this project, I'm hoping to learn about the new code representation, even if the API changes and I'll have to rewrite it later.

from numba-rvsdg.

jpivarski commented on June 14, 2024

I started investigating it, and learned that not just any bytecode will produce a graph—only bytecode with control flow.

from numba_rvsdg.core.datastructures.byte_flow import ByteFlow

def foo(x, y):
    return x + y

byteflow = ByteFlow.from_bytecode(foo.__code__)

print(byteflow.scfg.to_dict())

produces

{'python_bytecode_block_0': {'jt': []}}

but

def foo(num, start):
    out = start
    for i in range(num):
        out += i**2
    return out

produces

{
    "python_bytecode_block_0": {"jt": ["python_bytecode_block_1"]},
    "python_bytecode_block_1": {"jt": ["python_bytecode_block_2", "python_bytecode_block_3"]},
    "python_bytecode_block_2": {"jt": ["python_bytecode_block_1"]},
    "python_bytecode_block_3": {"jt": []},
}

It mainly seems to be grouping bytecodes in a useful way, perhaps indicating data dependencies. Still going with that second example,

bcmap = byteflow.scfg.bcmap_from_bytecode(byteflow.bc)

for i in range(4):
    print(f"python_bytecode_block_{i}")
    python_bytecode_block = byteflow.scfg.graph[f"python_bytecode_block_{i}"]
    for instruction in python_bytecode_block.get_instructions(bcmap):
        print(f"    {instruction.opname} {instruction.arg} {instruction.argval}")

produces

python_bytecode_block_0
    LOAD_FAST 1 start
    STORE_FAST 2 out
    LOAD_GLOBAL 0 range
    LOAD_FAST 0 num
    CALL_FUNCTION 1 1
    GET_ITER None None
python_bytecode_block_1
    FOR_ITER 8 30
python_bytecode_block_2
    STORE_FAST 3 i
    LOAD_FAST 2 out
    LOAD_FAST 3 i
    LOAD_CONST 1 2
    BINARY_POWER None None
    INPLACE_ADD None None
    STORE_FAST 2 out
    JUMP_ABSOLUTE 6 12
python_bytecode_block_3
    LOAD_FAST 2 out
    RETURN_VALUE None None

while

import dis
print(dis.dis(foo))

produces

  6           0 LOAD_FAST                1 (start)
              2 STORE_FAST               2 (out)

  7           4 LOAD_GLOBAL              0 (range)
              6 LOAD_FAST                0 (num)
              8 CALL_FUNCTION            1
             10 GET_ITER
        >>   12 FOR_ITER                 8 (to 30)
             14 STORE_FAST               3 (i)

  8          16 LOAD_FAST                2 (out)
             18 LOAD_FAST                3 (i)
             20 LOAD_CONST               1 (2)
             22 BINARY_POWER
             24 INPLACE_ADD
             26 STORE_FAST               2 (out)
             28 JUMP_ABSOLUTE            6 (to 12)

  9     >>   30 LOAD_FAST                2 (out)
             32 RETURN_VALUE

Looking at it with

from numba_rvsdg.rendering.rendering import ByteFlowRenderer

bfr = ByteFlowRenderer()
bfr.bcmap_from_bytecode(byteflow.bc)
byteflow.scfg.view("scfg")

produces

I know that I can convert expressions from bytecode to AST; the control flow was always the hard part. This graph must be capturing the control flow in some way, though I'll need to play with it and maybe ask some specific questions to understand how.

Now I'm going to try running through a complete set of Python expressions (no statements) to see if this always gives me single-node graphs, as I would expect.

from numba-rvsdg.

jpivarski commented on June 14, 2024

Hypothesis: all Python expressions produce an SCFG graph of length 1.

Result: yup, they do.

import ast

from numba_rvsdg.core.datastructures.byte_flow import ByteFlow

# all of the expression AST node types
expression_types = [
    x
    for x in dir(ast)
    if (
            isinstance(getattr(ast, x), type)
            and issubclass(getattr(ast, x), ast.expr)
            and x not in (
                "expr",

                "Num",           # deprecated since Python 3.8
                "Str",           #   |
                "Bytes",         #   |
                "NameConstant",  #   |
                "Ellipsis",      #   V
            )
    )
]

# remove an expression node type when we see one
class RemoveFromList(ast.NodeVisitor):
    def visit(self, node):
        if type(node).__name__ in expression_types:
            expression_types.remove(type(node).__name__)
        return super().visit(node)

remove_from_list = RemoveFromList()

# examples of all expression types in Python
for string_expression in [
        "x.y",                                      # Attribute
        "await f()",                                # Await
        "x + y",                                    # BinOp
        "x and y",                                  # BoolOp
        "f()",                                      # Call
        "x < y",                                    # Compare
        "123",                                      # Constant
        "{1: 2}",                                   # Dict
        "{x: x for x in collection}",               # DictComp
        "f'{x}'",                                   # FormattedValue
        "(x for x in collection)",                  # GeneratorExp
        "consequent if predicate else alternate",   # IfExp
        "f'x'",                                     # JoinedStr
        "lambda: None",                             # Lambda
        "[1, 2, 3]",                                # List
        "[x for x in collection]",                  # ListComp
        "x",                                        # Name
        "(x := 4)",                                 # NamedExpr
        "{1, 2, 3}",                                # Set
        "{x for x in collection}",                  # SetComp
        "x[1:2]",                                   # Slice
        "(x, *y)",                                  # Starred
        "x[1]",                                     # Subscript
        "(1, 2, 3)",                                # Tuple
        "~x",                                       # UnaryOp
        "yield x",                                  # Yield
        "yield from x",                             # YieldFrom
]:
    # has to be wrapped in a function so that Await is valid
    asyncy = "async " if "await " in string_expression else ""
    syntax_tree = ast.parse(f"{asyncy}def _(): {string_expression}")

    # remove from the list of expression types
    remove_from_list.visit(syntax_tree)

    code = compile(syntax_tree, "<string>", "exec")

    # hypothesis: SCFG for all bare expressions is a one-node graph
    byteflow = ByteFlow.from_bytecode(code)
    assert len(byteflow.scfg.graph) == 1

# did we miss any expression types?
assert len(expression_types) == 0

from numba-rvsdg.

esc commented on June 14, 2024

@jpivarski what you have discovered is correct. Any linear sequence of bytecodes without if-else or for constructs will produce only a single PythonBytecodeBlock. As soon as control flow or looping are introduced you get multiple blocks. You can assume that any PythonBytecodeBlock contains a linear sequence of Python bytecodes / expressions.

from numba-rvsdg.

jpivarski commented on June 14, 2024

Any linear sequence of bytecodes without if-else or for constructs will produce only a single PythonBytecodeBlock. As soon as control flow or looping are introduced you get multiple blocks.

If you get graph structures for if-else and for, I think there would also be graph structures for while.

For the following statement types, here's what I think about whether they'll split into multiple graph nodes. (YES = you said that they split, MAYBE = I think they would, and NO = I don't think they would; I think several of these can be part of the same graph node.)

FunctionDef: probably a whole new __code__ object
AsyncFunctionDef: probably a whole new __code__ object
ClassDef: probably a whole new __code__ object
Return: NO
Delete: NO
Assign: NO
AugAssign: NO
AnnAssign: NO
For: YES
AsyncFor: YES (whether Numba will support async for or not is a different matter)
While: MAYBE, because if for splits nodes, while should, too
If: YES
With: NO
AsyncWith: NO
Match: MAYBE, because if if splits nodes, match should, too
Raise: NO
Try: MAYBE, because there's control flow in try-except (same comment about Numba supporting it, though)
TryStar: MAYBE, because there's control flow in try-except (same comment about Numba supporting it, though)
Assert: NO
Import: NO, and this definitely won't be lowered, anyway
ImportFrom: NO, and this definitely won't be lowered, anyway
Global: NO
Nonlocal: NO
Expr: NO
Pass: NO
Break: YES, because it's part of the for and while logic
Continue: YES, because it's part of the for and while logic

The main goal of the RVSDG package right now is to transform Python bytecode into an SCFG which can be restructured.

It's probably the case that I'll only want to look at reconstructed graphs, then. Right now, I'm still getting a basic understanding of these things.

I wonder if I should first focus on the parts that aren't spread across graph nodes. The parsing rules from uncompyle6 would still be useful here, applied to each graph node as an individual sequence of instructions, since they still need to be turned into AST trees. It would be like this, which targets Python expressions only, but making ast.AST nodes instead of the rejig.syntaxtree thing in that old demo.

As for the control flow parts that are spread across graph nodes, they could be reconstructed into for, if, while, etc. from the restructured graph (somehow) instead of using uncompyle6.

And maybe I shouldn't be thinking of this project as using uncompyle6, but maybe being part of uncompyle6, as a way to help @rocky get that tool up-to-date with recent Pythons.

(Hi, @rocky, bringing you into this conversation for the first time. I've been thinking about the problem of converting bytecode → AST for recent versions of Python, which is also a problem for Numba and this RVSDG formalism promises to fix it. I've been doing these tests to see if the bytecode interpretation part can remain independent of LLVM/pure Python. Should we talk on https://github.com/rocky/python-uncompyle6/discussions about whether you'd want to fold that into your project? This is all very early/preliminary/just seeing what's possible.)

Note also, that some of the top-level classes are scheduled for removal in:

#75

But it shouldn't be that big of a change.

I understand that it's in flux. I'll keep doing git pull as things change.

from numba-rvsdg.

rocky commented on June 14, 2024

@jpivarski - I am interested, but it will take me some time for me to get up to speed.

from numba-rvsdg.

rocky commented on June 14, 2024

Some initial comments. Overall I think the approach or idea is sound, although this is going to be a lot of work.

(Believe it or not, that the approach is sound is a big deal. I have read about other approaches that feel a little whacky and not that sound.)

A couple of observations though when I look at https://github.com/diana-hep/rejig/blob/master/reinterpreted-python/rejig/pybytecode.py

If this is to be able to work on more than one Python bytecode, you will probably want to structure in a way that segregates versions.

There are basically two approaches - one that treats each Python version as its own separate entity, and one accommodates sharing between versions. For accommodating sharing between versions, the approach I have been using has been something like version Python 3.9 is like 3.8 except here and here and here. In other words we say that this for kind of rule or action change it it such and such a way.

Personally I have been moving away from the shared approach on the grammar or parse side because it was just too complicated to keep track of things. So here I just cut and paste. If I then need to adjust okay.

On the semantic action side though there is still sharing going on. See https://github.com/rocky/python-uncompyle6/blob/master/uncompyle6/semantics/customize3.py#L354-L376

One other important thing to understand and keep in mind is the difference between uncompyle6, decompyle3, and this yet another not-public fork called decompile-cfg.

In theory and ideally, there is no reason there could not be one code base for all versions. The problem has always been the amount of work needed to back-port stuff.

There is a lot of work needed still to backport decompyle3 code to uncompyle6. And although you can't see decompile-cfg, the same is true there to backport to decompyle3. (And the decompile-cfg code is still changing a bit on its own.)

So for my own use such as in the trepan3k debugger, I import the right code base of these depending on version: for 3.7 and 3.8 I use decompyle3. Otherwise uncompyle6.

On recent and big change in the way I think about this, is that I am now doing more in the way of abstracting the parse tree that I get from a parse of tokenized bytecode instructions. See rocky/python-uncompyle6#412 (comment)

This has a big and useful impact in that when we get to the semantic actions there is less junk in the Tree, and that means things will be more then same across different Python versions (and Python bytecode versions). In the semantic actions we have to go by position in the tree - 0th item, 1st item and so on. Again, reducing the noise or pseudo operations that don't have any real bytecode in them makes different versions or different derivations in the same Python version look more similar.

You will find a little bit of this in https://github.com/rocky/python-decompile3/blob/master/decompyle3/semantics/transform.py#L30-L67 and https://github.com/rocky/python-decompile3/blob/master/decompyle3/semantics/transform.py#L558

from numba-rvsdg.

rocky commented on June 14, 2024

In the list above for branching ops there are:

IfElse
all of the Comprehensions.
And/Or ?

from numba-rvsdg.

jpivarski commented on June 14, 2024

Having looked around the uncompyle6 codebase (though not the other ones), I'm aware of how big of a project it is. What I'm hoping is that we can take advantage of the modularization of Numba to get some parts from shared effort, in particular the hardest parts. After all, Numba has to unravel the control structures, too, though Numba does not need to parse instructions into an AST. (I think Numba may be able to do a Python instruction → LLVM instruction mapping without knowing the tree structure of expressions.)¹

Separating by Python versions sounds like a good idea to me. Maybe the parsing code should be the same and each Python version should have its own set of grammar rules, copying where necessary? I say that because it's how I had thought that uncompyle6 works. I saw that there was a lot of separation between the Python versions already, and I had assumed that it went all the way down to the grammar rules.

In the list above for branching ops there are:

IfElse

all of the Comprehensions.

And/Or ?

That's a good point. In #81 (comment), these formed a single graph node, but they ought to be control flow (including and/or). Maybe that's still to be addressed.

We should also keep in mind that the Numba team has a large project in front of them, and ensuring that Python bytecode → AST is not a goal of that project. I'm hoping to piggy-back this on it, since it seems so promising. ↩

from numba-rvsdg.

rocky commented on June 14, 2024

I think I am confused. What would you like to do, and how do you propose doing it? I see code, but perhaps I don't understand to bigger picture.

We should also keep in mind that the Numba team has a large project in front of them, and ensuring that Python bytecode → AST is not a goal of that project. I'm hoping to piggy-back this on it, since it seems so promising.

Python bytecode to Python AST (or any sort of equivalent AST) is about as hard as Python bytecode to source code - no easier and no more difficult.

The reality is that with respect to source code is that I have not been able to keep up. I have been doing this as a hobby is something that would keep a couple of people fully occupied full time as a paid activity.

So I am confused on what you expect to see. A proof of concept? Being satisfied working in some limited context? e.g. for basic blocks of only? (Then why RVSDS?) If a limited context, what limited context?

Maybe the parsing code should be the same and each Python version should have its own set of grammar rules, copying where necessary? I say that because it's how I had thought that uncompyle6 works.

Yes, that is how uncompyle6 works. But not in the newer, not public decompiler that works for 3.8-3.10 though. Here there is no grammar sharing. I mention this though as kind of an aside. decompyle3 is in the middle - I started moving in this direction.

The main thing here on parsing is that is important, is that over time there is too much drift in bytecode to have one piece of code that has tests for all the variations or a grammar that is a superset grammar of all the versions. Instead, it has been conceptually easier to separate these into separate pieces of code for a single bytecode version.

from numba-rvsdg.

jpivarski commented on June 14, 2024

The bigger picture isn't clear yet. As a first step, I want to see if this is going to be possible, so it's a proof of concept. Beyond that, if it does work, it could be

integrated into uncompyle6 or uncompyle3 (whichever is the one with the long-term future), and then uncompyle would depend on numba-rvsdg and not LLVM,
a package that depends on numba-rvsdg and uncompyle,
a package that depends on numba-rvsdg only,
nothing at all, because maybe you've got another way to address Python bytecode changes that doesn't get any help from numba-rvsdg.

I think it would be better overall if there's only one package that provides Python bytecode → AST, though option (1) would have to be something you're interested in, too, if it's going to go forward.

Still, the first thing (for me) to do is the proof of concept, to know if this is even a good approach. What scope? definitely more than basic blocks only, though I was thinking of starting with that and using uncompyle6 for it, exactly because it's an already-solved problem. Then, when addressing something like an if statement or a for loop, I'd have a framework to fit it in.

Python AST (or any sort of equivalent AST) is about as hard as Python bytecode to source code - no easier and no more difficult.

Especially because ast.unparse is in the Python standard library.

not public decompiler that works for 3.8-3.10

That's interesting, and it's the reason I added option (4) above. I didn't know that you were ready to make this leap.

I have been doing this as a hobby is something that would keep a couple of people fully occupied full time as a paid activity.

That's why I thought that using numba-rvsdg was a good idea: it will be kept up-to-date with Python bytecode changes, and the Numba team is large enough that I know this will happen.

from numba-rvsdg.

esc commented on June 14, 2024

Any linear sequence of bytecodes without if-else or for constructs will produce only a single PythonBytecodeBlock. As soon as control flow or looping are introduced you get multiple blocks.

If you get graph structures for if-else and for, I think there would also be graph structures for while.

Oh, so sorry for the misunderstanding, for means "loop" and that includes while.

from numba-rvsdg.

esc commented on June 14, 2024

For a prototype, I'd start with https://github.com/rocky/python-decompile3 which handles 3.7 and 3.8 code only. The code style is generally more modern Python and it follows better the ideas in the branch that I believe addresses control flow issues.

numba-rvsdg supports 3.11 only.

from numba-rvsdg.

rocky commented on June 14, 2024

I just cloned numba-rvsdg and built it. The project seems to be pretty new - less than a year old.

It seems similar to https://github.com/rocky/python-control-flow , although the end goal is to produce a RSVDG rather than a control-flow graph with dominator and reverse dominator information.

If the future of Python is like its past, then if this project is to be long-lived it will have to come up with a strategy for supporting more Python versions than 3.11. Initially and the path of least resistance, is to assume that the next version bytecode won't be that different from the last.

That generally happens. But periodically it doesn't. Python went from a variable 1 or 3 bytecode format to a fixed 2-byte "wordcode" format, I think between 3.5 and 3.6. Nowadays this kind of change wouldn't be a problem, because I think Python now has abstracted instructions from bytecode. But back in the 3.5 to 3.6 change, I don't believe that was the case. And although I haven't looked at the details all that much, Python 3.11 is a bit different from 3.10.

Assuming 3.12 will be not that different from 3.11 there is the question whether you'll allow handing a different bytecode from the bytecode used by Python interpreter running numba-rvsdg. If so, then I believe you will need to use something other than the Python dis module. And that's why I wrote and use xdis instead.

But as I said, this kind of complication and decision on how to handle more than one bytecode is definitely something that can be put off for later.

Now that I understand numba-rvsdg better, let me close with how its overall approach to understanding control flow is different from the decompiling approach in uncompyle6 and its descendants.

uncompyle6 leverages a couple features of the high-level bytecode interpreter approach (also found in Ruby, Smalltalk, Lua, various Lisps like Emacs Lisp). First, for all practical purposes the translation to bytecode is done by one kind of parsing system for any given version. (What happens afterwards can vary with say pyston or pypy). The fact that there is just one source code to bytecode translator reduces variation in how a given construct might be conceivably be represented in bytecode: the interpreter picks largely one way and uses that.

Given this, it becomes easier to use parsing technology that is similar (but not the same) as in conventional compilers. Because the translation is like DNA and has its quirks and leaves fingerprints all over the place, we can use those quirks to retrieve more often the exact Python construct that was used, rather than some other construct or sequence that is semantically equivalent.

The control-flow to higher-level construct used in numba-rvsdg, while it is more conventional and is general purpose, does not make use of these kinds of clues that specific compilers put in. Uncompyle6 can distinguish if a and b: from if a: if b: in some bytecode versions when he compiler produces different bytecode for these. (Nowadays this happens less often).

Python, in particular, has this rich and whacky set of control-flow structures, such as the "else" branches on "for", "while" and "try" statements. I doubt numba-rsvg would bother to go through the effort needed to look for these as opposed to trying to write these along simpler and more basic control flow structures involving if's wrapped around something else. Should it try to tackle "try ... else .. finally" I think you'll find the control-flow algorithm getting a bit more complicated.

In sum, right now the two approaches to handling control flow, while each is valid, is a bit different from the other.

from numba-rvsdg.

sklam commented on June 14, 2024

Thanks @rocky for the detailed insights.

The control-flow to higher-level construct used in numba-rvsdg, while it is more conventional and is general purpose, does not make use of these kinds of clues that specific compilers put in. Uncompyle6 can distinguish if a and b: from if a: if b: in some bytecode versions when he compiler produces different bytecode for these. (Nowadays this happens less often).

Yes, unlikely uncompyle6, the goal for numba-rvsdg is not to provide a syntactic equivalent (AST recovery) but instead to provide a semantic equivalent. For python compilers (particularly Numba), syntactic information is not important. The semantic equivalent in rvsdg will provide compilers a nice data-centric view (where control-flow is more or less implied) that makes compiler passes a lot easier to write. Also, we don't want to rely on the bytecode "clues" and "quirks" because they change overtime. The heuristic based approach is a huge burden in the current Numba bytecode frontend and we anticipate more bytecode changes due to the faster-cpython effort, which will be adding more bytecode level optimizations.

Python, in particular, has this rich and whacky set of control-flow structures, such as the "else" branches on "for", "while" and "try" statements. I doubt numba-rsvg would bother to go through the effort needed to look for these as opposed to trying to write these along simpler and more basic control flow structures involving if's wrapped around something else. Should it try to tackle "try ... else .. finally" I think you'll find the control-flow algorithm getting a bit more complicated.

We will need to tackle all whacky python control-flow to reach parity with what the Numba bytecode frontend currently support. The good thing is that for, while, and else are modeled by jumps/branches in the bytecode so no additional work is needed. With Python>=3.11, try-except-finally are encoded in a side table in __code__.co_exceptiontable. The plan is to represent try/except as specific regions so the exception-flow are again implied.

from numba-rvsdg.

jpivarski commented on June 14, 2024

@gordonwatts

from numba-rvsdg.

Write demonstrator of Python bytecode → Python AST about numba-rvsdg HOT 24 CLOSED

Comments (24)

Related Issues (20)

Recommend Projects

React

Vue.js

Typescript

TensorFlow

Django

Laravel

D3

Recommend Topics

javascript

web

server

Machine learning

Visualization

Game

Recommend Org

Facebook

Microsoft

Google

Alibaba

D3

Tencent

Comments (24)

Footnotes

Related Issues (20)

Recommend Projects

Recommend Topics

Recommend Org