Comments (24)
@jpivarski you can also run the restructure
method from https://numba-rvsdg.readthedocs.io/en/latest/reference/datastructures.html#numba_rvsdg.core.datastructures.byte_flow.ByteFlow.restructure
This will apply the LOOP-RESTRUCTURE
and BRANCH-RESTRUCTURE
algorithms from sections 4.1 and 4.2 of:
https://dl.acm.org/doi/pdf/10.1145/2693261
and then render the graph to see what it looks like.
The main goal of the RVSDG package right now is to transform Python bytecode into an SCFG which can be restructured.
Note also, that some of the top-level classes are scheduled for removal in:
But it shouldn't be that big of a change.
from numba-rvsdg.
For a prototype, I'd start with https://github.com/rocky/python-decompile3 which handles 3.7 and 3.8 code only. The code style is generally more modern Python and it follows better the ideas in the branch that I believe addresses control flow issues.
I have written a bit about recent thoughts and work as to how to do decompilation in rocky/python-uncompyle6#412
from numba-rvsdg.
(Oh, I can't assign it to myself, but feel free to assign it to me. I'll do the work in a fork and make it a draft PR, which will probably never get out of draft status or be merged.)
from numba-rvsdg.
@jpivarski I have assigned you.
BTW: the current numba-rvsdg 0.0.2 only supports transformation from Python bytecode to SCFG (structured control flow graph). While the documentation is still a bit sparse at present, we do have:
Although the SCFG
object has mostly methods for creating the SCFG from python bytecode, rather than introspecting it.
from numba-rvsdg.
Thanks! I figured that at present, there would only be conversions from Python bytecode to something new, such as this SCFG class. Through this project, I'm hoping to learn about the new code representation, even if the API changes and I'll have to rewrite it later.
from numba-rvsdg.
I started investigating it, and learned that not just any bytecode will produce a graph—only bytecode with control flow.
from numba_rvsdg.core.datastructures.byte_flow import ByteFlow
def foo(x, y):
return x + y
byteflow = ByteFlow.from_bytecode(foo.__code__)
print(byteflow.scfg.to_dict())
produces
{'python_bytecode_block_0': {'jt': []}}
but
def foo(num, start):
out = start
for i in range(num):
out += i**2
return out
produces
{
"python_bytecode_block_0": {"jt": ["python_bytecode_block_1"]},
"python_bytecode_block_1": {"jt": ["python_bytecode_block_2", "python_bytecode_block_3"]},
"python_bytecode_block_2": {"jt": ["python_bytecode_block_1"]},
"python_bytecode_block_3": {"jt": []},
}
It mainly seems to be grouping bytecodes in a useful way, perhaps indicating data dependencies. Still going with that second example,
bcmap = byteflow.scfg.bcmap_from_bytecode(byteflow.bc)
for i in range(4):
print(f"python_bytecode_block_{i}")
python_bytecode_block = byteflow.scfg.graph[f"python_bytecode_block_{i}"]
for instruction in python_bytecode_block.get_instructions(bcmap):
print(f" {instruction.opname} {instruction.arg} {instruction.argval}")
produces
python_bytecode_block_0
LOAD_FAST 1 start
STORE_FAST 2 out
LOAD_GLOBAL 0 range
LOAD_FAST 0 num
CALL_FUNCTION 1 1
GET_ITER None None
python_bytecode_block_1
FOR_ITER 8 30
python_bytecode_block_2
STORE_FAST 3 i
LOAD_FAST 2 out
LOAD_FAST 3 i
LOAD_CONST 1 2
BINARY_POWER None None
INPLACE_ADD None None
STORE_FAST 2 out
JUMP_ABSOLUTE 6 12
python_bytecode_block_3
LOAD_FAST 2 out
RETURN_VALUE None None
while
import dis
print(dis.dis(foo))
produces
6 0 LOAD_FAST 1 (start)
2 STORE_FAST 2 (out)
7 4 LOAD_GLOBAL 0 (range)
6 LOAD_FAST 0 (num)
8 CALL_FUNCTION 1
10 GET_ITER
>> 12 FOR_ITER 8 (to 30)
14 STORE_FAST 3 (i)
8 16 LOAD_FAST 2 (out)
18 LOAD_FAST 3 (i)
20 LOAD_CONST 1 (2)
22 BINARY_POWER
24 INPLACE_ADD
26 STORE_FAST 2 (out)
28 JUMP_ABSOLUTE 6 (to 12)
9 >> 30 LOAD_FAST 2 (out)
32 RETURN_VALUE
Looking at it with
from numba_rvsdg.rendering.rendering import ByteFlowRenderer
bfr = ByteFlowRenderer()
bfr.bcmap_from_bytecode(byteflow.bc)
byteflow.scfg.view("scfg")
produces
I know that I can convert expressions from bytecode to AST; the control flow was always the hard part. This graph must be capturing the control flow in some way, though I'll need to play with it and maybe ask some specific questions to understand how.
Now I'm going to try running through a complete set of Python expressions (no statements) to see if this always gives me single-node graphs, as I would expect.
from numba-rvsdg.
Hypothesis: all Python expressions produce an SCFG graph of length 1.
Result: yup, they do.
import ast
from numba_rvsdg.core.datastructures.byte_flow import ByteFlow
# all of the expression AST node types
expression_types = [
x
for x in dir(ast)
if (
isinstance(getattr(ast, x), type)
and issubclass(getattr(ast, x), ast.expr)
and x not in (
"expr",
"Num", # deprecated since Python 3.8
"Str", # |
"Bytes", # |
"NameConstant", # |
"Ellipsis", # V
)
)
]
# remove an expression node type when we see one
class RemoveFromList(ast.NodeVisitor):
def visit(self, node):
if type(node).__name__ in expression_types:
expression_types.remove(type(node).__name__)
return super().visit(node)
remove_from_list = RemoveFromList()
# examples of all expression types in Python
for string_expression in [
"x.y", # Attribute
"await f()", # Await
"x + y", # BinOp
"x and y", # BoolOp
"f()", # Call
"x < y", # Compare
"123", # Constant
"{1: 2}", # Dict
"{x: x for x in collection}", # DictComp
"f'{x}'", # FormattedValue
"(x for x in collection)", # GeneratorExp
"consequent if predicate else alternate", # IfExp
"f'x'", # JoinedStr
"lambda: None", # Lambda
"[1, 2, 3]", # List
"[x for x in collection]", # ListComp
"x", # Name
"(x := 4)", # NamedExpr
"{1, 2, 3}", # Set
"{x for x in collection}", # SetComp
"x[1:2]", # Slice
"(x, *y)", # Starred
"x[1]", # Subscript
"(1, 2, 3)", # Tuple
"~x", # UnaryOp
"yield x", # Yield
"yield from x", # YieldFrom
]:
# has to be wrapped in a function so that Await is valid
asyncy = "async " if "await " in string_expression else ""
syntax_tree = ast.parse(f"{asyncy}def _(): {string_expression}")
# remove from the list of expression types
remove_from_list.visit(syntax_tree)
code = compile(syntax_tree, "<string>", "exec")
# hypothesis: SCFG for all bare expressions is a one-node graph
byteflow = ByteFlow.from_bytecode(code)
assert len(byteflow.scfg.graph) == 1
# did we miss any expression types?
assert len(expression_types) == 0
from numba-rvsdg.
@jpivarski what you have discovered is correct. Any linear sequence of bytecodes without if-else
or for
constructs will produce only a single PythonBytecodeBlock
. As soon as control flow or looping are introduced you get multiple blocks. You can assume that any PythonBytecodeBlock
contains a linear sequence of Python bytecodes / expressions.
from numba-rvsdg.
Any linear sequence of bytecodes without
if-else
orfor
constructs will produce only a singlePythonBytecodeBlock
. As soon as control flow or looping are introduced you get multiple blocks.
If you get graph structures for if
-else
and for
, I think there would also be graph structures for while
.
For the following statement types, here's what I think about whether they'll split into multiple graph nodes. (YES = you said that they split, MAYBE = I think they would, and NO = I don't think they would; I think several of these can be part of the same graph node.)
- FunctionDef: probably a whole new
__code__
object - AsyncFunctionDef: probably a whole new
__code__
object - ClassDef: probably a whole new
__code__
object - Return: NO
- Delete: NO
- Assign: NO
- AugAssign: NO
- AnnAssign: NO
- For: YES
- AsyncFor: YES (whether Numba will support
async for
or not is a different matter) - While: MAYBE, because if
for
splits nodes,while
should, too - If: YES
- With: NO
- AsyncWith: NO
- Match: MAYBE, because if
if
splits nodes,match
should, too - Raise: NO
- Try: MAYBE, because there's control flow in
try
-except
(same comment about Numba supporting it, though) - TryStar: MAYBE, because there's control flow in
try
-except
(same comment about Numba supporting it, though) - Assert: NO
- Import: NO, and this definitely won't be lowered, anyway
- ImportFrom: NO, and this definitely won't be lowered, anyway
- Global: NO
- Nonlocal: NO
- Expr: NO
- Pass: NO
- Break: YES, because it's part of the
for
andwhile
logic - Continue: YES, because it's part of the
for
andwhile
logic
The main goal of the RVSDG package right now is to transform Python bytecode into an SCFG which can be restructured.
It's probably the case that I'll only want to look at reconstructed graphs, then. Right now, I'm still getting a basic understanding of these things.
I wonder if I should first focus on the parts that aren't spread across graph nodes. The parsing rules from uncompyle6 would still be useful here, applied to each graph node as an individual sequence of instructions, since they still need to be turned into AST trees. It would be like this, which targets Python expressions only, but making ast.AST
nodes instead of the rejig.syntaxtree
thing in that old demo.
As for the control flow parts that are spread across graph nodes, they could be reconstructed into for
, if
, while
, etc. from the restructured graph (somehow) instead of using uncompyle6.
And maybe I shouldn't be thinking of this project as using uncompyle6, but maybe being part of uncompyle6, as a way to help @rocky get that tool up-to-date with recent Pythons.
(Hi, @rocky, bringing you into this conversation for the first time. I've been thinking about the problem of converting bytecode → AST for recent versions of Python, which is also a problem for Numba and this RVSDG formalism promises to fix it. I've been doing these tests to see if the bytecode interpretation part can remain independent of LLVM/pure Python. Should we talk on https://github.com/rocky/python-uncompyle6/discussions about whether you'd want to fold that into your project? This is all very early/preliminary/just seeing what's possible.)
Note also, that some of the top-level classes are scheduled for removal in:
But it shouldn't be that big of a change.
I understand that it's in flux. I'll keep doing git pull
as things change.
from numba-rvsdg.
@jpivarski - I am interested, but it will take me some time for me to get up to speed.
from numba-rvsdg.
Some initial comments. Overall I think the approach or idea is sound, although this is going to be a lot of work.
(Believe it or not, that the approach is sound is a big deal. I have read about other approaches that feel a little whacky and not that sound.)
A couple of observations though when I look at https://github.com/diana-hep/rejig/blob/master/reinterpreted-python/rejig/pybytecode.py
If this is to be able to work on more than one Python bytecode, you will probably want to structure in a way that segregates versions.
There are basically two approaches - one that treats each Python version as its own separate entity, and one accommodates sharing between versions. For accommodating sharing between versions, the approach I have been using has been something like version Python 3.9 is like 3.8 except here and here and here. In other words we say that this for kind of rule or action change it it such and such a way.
Personally I have been moving away from the shared approach on the grammar or parse side because it was just too complicated to keep track of things. So here I just cut and paste. If I then need to adjust okay.
On the semantic action side though there is still sharing going on. See https://github.com/rocky/python-uncompyle6/blob/master/uncompyle6/semantics/customize3.py#L354-L376
One other important thing to understand and keep in mind is the difference between uncompyle6, decompyle3, and this yet another not-public fork called decompile-cfg.
In theory and ideally, there is no reason there could not be one code base for all versions. The problem has always been the amount of work needed to back-port stuff.
There is a lot of work needed still to backport decompyle3 code to uncompyle6. And although you can't see decompile-cfg, the same is true there to backport to decompyle3. (And the decompile-cfg code is still changing a bit on its own.)
So for my own use such as in the trepan3k debugger, I import the right code base of these depending on version: for 3.7 and 3.8 I use decompyle3. Otherwise uncompyle6.
On recent and big change in the way I think about this, is that I am now doing more in the way of abstracting the parse tree that I get from a parse of tokenized bytecode instructions. See rocky/python-uncompyle6#412 (comment)
This has a big and useful impact in that when we get to the semantic actions there is less junk in the Tree, and that means things will be more then same across different Python versions (and Python bytecode versions). In the semantic actions we have to go by position in the tree - 0th item, 1st item and so on. Again, reducing the noise or pseudo operations that don't have any real bytecode in them makes different versions or different derivations in the same Python version look more similar.
You will find a little bit of this in https://github.com/rocky/python-decompile3/blob/master/decompyle3/semantics/transform.py#L30-L67 and https://github.com/rocky/python-decompile3/blob/master/decompyle3/semantics/transform.py#L558
from numba-rvsdg.
In the list above for branching ops there are:
- IfElse
- all of the Comprehensions.
- And/Or ?
from numba-rvsdg.
Having looked around the uncompyle6 codebase (though not the other ones), I'm aware of how big of a project it is. What I'm hoping is that we can take advantage of the modularization of Numba to get some parts from shared effort, in particular the hardest parts. After all, Numba has to unravel the control structures, too, though Numba does not need to parse instructions into an AST. (I think Numba may be able to do a Python instruction → LLVM instruction mapping without knowing the tree structure of expressions.)1
Separating by Python versions sounds like a good idea to me. Maybe the parsing code should be the same and each Python version should have its own set of grammar rules, copying where necessary? I say that because it's how I had thought that uncompyle6 works. I saw that there was a lot of separation between the Python versions already, and I had assumed that it went all the way down to the grammar rules.
In the list above for branching ops there are:
- IfElse
- all of the Comprehensions.
- And/Or ?
That's a good point. In #81 (comment), these formed a single graph node, but they ought to be control flow (including and/or). Maybe that's still to be addressed.
Footnotes
-
We should also keep in mind that the Numba team has a large project in front of them, and ensuring that Python bytecode → AST is not a goal of that project. I'm hoping to piggy-back this on it, since it seems so promising. ↩
from numba-rvsdg.
I think I am confused. What would you like to do, and how do you propose doing it? I see code, but perhaps I don't understand to bigger picture.
We should also keep in mind that the Numba team has a large project in front of them, and ensuring that Python bytecode → AST is not a goal of that project. I'm hoping to piggy-back this on it, since it seems so promising.
Python bytecode to Python AST (or any sort of equivalent AST) is about as hard as Python bytecode to source code - no easier and no more difficult.
The reality is that with respect to source code is that I have not been able to keep up. I have been doing this as a hobby is something that would keep a couple of people fully occupied full time as a paid activity.
So I am confused on what you expect to see. A proof of concept? Being satisfied working in some limited context? e.g. for basic blocks of only? (Then why RVSDS?) If a limited context, what limited context?
Maybe the parsing code should be the same and each Python version should have its own set of grammar rules, copying where necessary? I say that because it's how I had thought that uncompyle6 works.
Yes, that is how uncompyle6 works. But not in the newer, not public decompiler that works for 3.8-3.10 though. Here there is no grammar sharing. I mention this though as kind of an aside. decompyle3 is in the middle - I started moving in this direction.
The main thing here on parsing is that is important, is that over time there is too much drift in bytecode to have one piece of code that has tests for all the variations or a grammar that is a superset grammar of all the versions. Instead, it has been conceptually easier to separate these into separate pieces of code for a single bytecode version.
from numba-rvsdg.
The bigger picture isn't clear yet. As a first step, I want to see if this is going to be possible, so it's a proof of concept. Beyond that, if it does work, it could be
- integrated into uncompyle6 or uncompyle3 (whichever is the one with the long-term future), and then uncompyle would depend on numba-rvsdg and not LLVM,
- a package that depends on numba-rvsdg and uncompyle,
- a package that depends on numba-rvsdg only,
- nothing at all, because maybe you've got another way to address Python bytecode changes that doesn't get any help from numba-rvsdg.
I think it would be better overall if there's only one package that provides Python bytecode → AST, though option (1) would have to be something you're interested in, too, if it's going to go forward.
Still, the first thing (for me) to do is the proof of concept, to know if this is even a good approach. What scope? definitely more than basic blocks only, though I was thinking of starting with that and using uncompyle6 for it, exactly because it's an already-solved problem. Then, when addressing something like an if
statement or a for
loop, I'd have a framework to fit it in.
Python AST (or any sort of equivalent AST) is about as hard as Python bytecode to source code - no easier and no more difficult.
Especially because ast.unparse
is in the Python standard library.
not public decompiler that works for 3.8-3.10
That's interesting, and it's the reason I added option (4) above. I didn't know that you were ready to make this leap.
I have been doing this as a hobby is something that would keep a couple of people fully occupied full time as a paid activity.
That's why I thought that using numba-rvsdg was a good idea: it will be kept up-to-date with Python bytecode changes, and the Numba team is large enough that I know this will happen.
from numba-rvsdg.
Any linear sequence of bytecodes without
if-else
orfor
constructs will produce only a singlePythonBytecodeBlock
. As soon as control flow or looping are introduced you get multiple blocks.If you get graph structures for
if
-else
andfor
, I think there would also be graph structures forwhile
.
Oh, so sorry for the misunderstanding, for
means "loop" and that includes while
.
from numba-rvsdg.
For a prototype, I'd start with https://github.com/rocky/python-decompile3 which handles 3.7 and 3.8 code only. The code style is generally more modern Python and it follows better the ideas in the branch that I believe addresses control flow issues.
numba-rvsdg supports 3.11 only.
from numba-rvsdg.
I just cloned numba-rvsdg and built it. The project seems to be pretty new - less than a year old.
It seems similar to https://github.com/rocky/python-control-flow , although the end goal is to produce a RSVDG rather than a control-flow graph with dominator and reverse dominator information.
If the future of Python is like its past, then if this project is to be long-lived it will have to come up with a strategy for supporting more Python versions than 3.11. Initially and the path of least resistance, is to assume that the next version bytecode won't be that different from the last.
That generally happens. But periodically it doesn't. Python went from a variable 1 or 3 bytecode format to a fixed 2-byte "wordcode" format, I think between 3.5 and 3.6. Nowadays this kind of change wouldn't be a problem, because I think Python now has abstracted instructions from bytecode. But back in the 3.5 to 3.6 change, I don't believe that was the case. And although I haven't looked at the details all that much, Python 3.11 is a bit different from 3.10.
Assuming 3.12 will be not that different from 3.11 there is the question whether you'll allow handing a different bytecode from the bytecode used by Python interpreter running numba-rvsdg. If so, then I believe you will need to use something other than the Python dis
module. And that's why I wrote and use xdis
instead.
But as I said, this kind of complication and decision on how to handle more than one bytecode is definitely something that can be put off for later.
Now that I understand numba-rvsdg better, let me close with how its overall approach to understanding control flow is different from the decompiling approach in uncompyle6 and its descendants.
uncompyle6 leverages a couple features of the high-level bytecode interpreter approach (also found in Ruby, Smalltalk, Lua, various Lisps like Emacs Lisp). First, for all practical purposes the translation to bytecode is done by one kind of parsing system for any given version. (What happens afterwards can vary with say pyston or pypy). The fact that there is just one source code to bytecode translator reduces variation in how a given construct might be conceivably be represented in bytecode: the interpreter picks largely one way and uses that.
Given this, it becomes easier to use parsing technology that is similar (but not the same) as in conventional compilers. Because the translation is like DNA and has its quirks and leaves fingerprints all over the place, we can use those quirks to retrieve more often the exact Python construct that was used, rather than some other construct or sequence that is semantically equivalent.
The control-flow to higher-level construct used in numba-rvsdg, while it is more conventional and is general purpose, does not make use of these kinds of clues that specific compilers put in. Uncompyle6 can distinguish if a and b:
from if a: if b:
in some bytecode versions when he compiler produces different bytecode for these. (Nowadays this happens less often).
Python, in particular, has this rich and whacky set of control-flow structures, such as the "else" branches on "for", "while" and "try" statements. I doubt numba-rsvg would bother to go through the effort needed to look for these as opposed to trying to write these along simpler and more basic control flow structures involving if's wrapped around something else. Should it try to tackle "try ... else .. finally" I think you'll find the control-flow algorithm getting a bit more complicated.
In sum, right now the two approaches to handling control flow, while each is valid, is a bit different from the other.
from numba-rvsdg.
Thanks @rocky for the detailed insights.
The control-flow to higher-level construct used in numba-rvsdg, while it is more conventional and is general purpose, does not make use of these kinds of clues that specific compilers put in. Uncompyle6 can distinguish
if a and b:
fromif a: if b:
in some bytecode versions when he compiler produces different bytecode for these. (Nowadays this happens less often).
Yes, unlikely uncompyle6, the goal for numba-rvsdg is not to provide a syntactic equivalent (AST recovery) but instead to provide a semantic equivalent. For python compilers (particularly Numba), syntactic information is not important. The semantic equivalent in rvsdg will provide compilers a nice data-centric view (where control-flow is more or less implied) that makes compiler passes a lot easier to write. Also, we don't want to rely on the bytecode "clues" and "quirks" because they change overtime. The heuristic based approach is a huge burden in the current Numba bytecode frontend and we anticipate more bytecode changes due to the faster-cpython effort, which will be adding more bytecode level optimizations.
Python, in particular, has this rich and whacky set of control-flow structures, such as the "else" branches on "for", "while" and "try" statements. I doubt numba-rsvg would bother to go through the effort needed to look for these as opposed to trying to write these along simpler and more basic control flow structures involving if's wrapped around something else. Should it try to tackle "try ... else .. finally" I think you'll find the control-flow algorithm getting a bit more complicated.
We will need to tackle all whacky python control-flow to reach parity with what the Numba bytecode frontend currently support. The good thing is that for
, while
, and else
are modeled by jumps/branches in the bytecode so no additional work is needed. With Python>=3.11, try-except-finally are encoded in a side table in __code__.co_exceptiontable
. The plan is to represent try/except as specific regions so the exception-flow are again implied.
from numba-rvsdg.
from numba-rvsdg.
Related Issues (20)
- release.yml is broken HOT 1
- Fix version number in docs HOT 1
- Feature request: Data structure to hide details of graph rendering
- Feature request: remove `BasicBlock.replace_back_edge`
- Feature Request: hookup readthedocs HOT 6
- Feature Request: getting started ipynb HOT 2
- Feature Request: add tracing to simulator
- 0.0.3 Checklist HOT 1
- Refactor: Make `jump_target` a property of SCFG HOT 2
- Refactor: Remove `ByteFlow` and `FlowInfo` classes. HOT 1
- Feature Request: better rendering
- Feature Request: add region testing to YAML
- Triage: potential bug
- Branch transformation creates invalid edge HOT 4
- Refactor `numba_rvsdg/tests/test_figures.py` HOT 1
- Feature Request: conda packages
- Unifying RETURN_VALUE bytecode
- Include package building in `test.yml` workflow.
- Add `conda` package building to `release.yml`
Recommend Projects
-
React
A declarative, efficient, and flexible JavaScript library for building user interfaces.
-
Vue.js
🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.
-
Typescript
TypeScript is a superset of JavaScript that compiles to clean JavaScript output.
-
TensorFlow
An Open Source Machine Learning Framework for Everyone
-
Django
The Web framework for perfectionists with deadlines.
-
Laravel
A PHP framework for web artisans
-
D3
Bring data to life with SVG, Canvas and HTML. 📊📈🎉
-
Recommend Topics
-
javascript
JavaScript (JS) is a lightweight interpreted programming language with first-class functions.
-
web
Some thing interesting about web. New door for the world.
-
server
A server is a program made to process requests and deliver data to clients.
-
Machine learning
Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.
-
Visualization
Some thing interesting about visualization, use data art
-
Game
Some thing interesting about game, make everyone happy.
Recommend Org
-
Facebook
We are working to build community through open source technology. NB: members must have two-factor auth.
-
Microsoft
Open source projects and samples from Microsoft.
-
Google
Google ❤️ Open Source for everyone.
-
Alibaba
Alibaba Open Source for everyone
-
D3
Data-Driven Documents codes.
-
Tencent
China tencent open source team.
from numba-rvsdg.