rgrig / barista
fork of Barista, an OCaml library for handling Java bytecode
Home Page: http://barista.x9c.fr/
License: GNU Lesser General Public License v3.0
It would speed up topl a lot if we could parse only signatures.
The type inferer wrongly identifies which objects are initialized by INVOKESPECIAL. It is possible, for example, to execute the same NEW twice before executing a corresponding INVOKESPECIAL.
Topl was doing this, but it is really Barista's job.
Any code that passes the type inference in Barista should pass JVM's type checking. See issue 12 of topl for a counterexample.
Type inference should use the types from Section 4.10.1.1 of the JVM7 spec. For example, the following should be illegal.
Barista fails on interfaces with default methods. The function AccessFlag.check_method_flags should implement what S4.6 of the JVM8 spec says (or whatever section it is in the JVM9 spec).
Make the symbolic execution size-agnostic. This should be much easier now that locals are size-agnostic in the high-level representation.
It is somewhat risky to preserve unknown attributes. They could contain, for example, references into the constant pool that become wrong. It might be safer (1) to warn when an unknown attribute is decoded, and (2) to drop it.
bincheck should count Longs in the constant pool as 2 entries; see 14f7cfe.
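For reference, the counting rule (JVMS S4.4.5: Long and Double take two constant-pool slots) can be sketched as follows. The pool_entry type and function names here are illustrative, not Barista's actual types.

```ocaml
(* Simplified constant-pool element; only the size rule matters here. *)
type pool_entry =
  | Utf8 of string
  | Integer of int32
  | Long of int64      (* occupies two pool indices per JVMS S4.4.5 *)
  | Double of float    (* likewise *)

(* Number of pool indices consumed by a list of entries. *)
let pool_size entries =
  List.fold_left
    (fun n e -> match e with Long _ | Double _ -> n + 2 | _ -> n + 1)
    0 entries
```

So a pool containing one Utf8 and one Long counts as 3 entries, which is the behavior bincheck should enforce.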
There are several Coder.decode_attr_ functions that are unimplemented. In particular, processing Tomcat 7 fails because Coder.decode_attr_bootstrap_methods is unimplemented.
Reduce the use of polymorphic variants.
Pros of PV: you don't need to decide to which module they belong, you can do cool stuff with subtyping.
Cons of PV: you get less help from the type system, and they are one of the more complicated features of OCaml.
What prompted this: the data type t for constants in (HighInstruction.LDC t).
At the moment, encode and encode_code in HighAttribute are mutually recursive. For decoding we broke up the recursion, and we might want to do the same for encoding.
I would like to put all code for the high-level representation of the bytecode in a module High that has submodules Class, Instruction, Field, Attribute, Method, and so on.
At the moment, we have both the old high-level representation (Class, Instruction, Attribute, ...) and the new high-level representation (HighClass and its submodules HighInstruction, HighAttribute, ...). In a few instances I could not understand OCaml's name resolution rules when one has both module A and module B.A, so I think it's a bad idea to have both Instruction and High.Instruction. In other words, the renaming I propose here should be done only after we get rid of the old high-level representation altogether.
There should be a high-level way to deal with the argument-to-local correspondence in clients of Barista.
Example of the problem: if the arguments are [Class Foo, long, int], then the locals are [0, 1, 3], because long occupies two local slots.
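A helper for this correspondence could look like the sketch below. The arg type and function names are hypothetical, not Barista's API; only the slot-size rule (long and double are category-2, per JVMS S2.6.1) is from the spec.

```ocaml
(* Illustrative argument descriptors. *)
type arg = Class of string | Long | Int | Double | Float

(* JVM category sizes: long and double occupy two local slots. *)
let arg_size = function Long | Double -> 2 | _ -> 1

(* Local index of each argument: a running sum of the sizes so far.
   For [Class "Foo"; Long; Int] this yields [0; 1; 3]. *)
let locals_of_args args =
  List.rev
    (snd
       (List.fold_left
          (fun (next, acc) a -> (next + arg_size a, next :: acc))
          (0, []) args))
```

Exposing something like this would let clients stop recomputing slot offsets by hand.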
The ConstantValue attribute must not exist in the high-level representation of the bytecode. Instead, the constant should be attached to the field. This way, we avoid situations where the type of the field and the type of the constant used to initialize it do not match.
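One way to encode this proposal is to make the initializer part of the field record itself; the types below are a sketch with made-up names, not Barista's actual representation.

```ocaml
(* Illustrative constant type; a real version would cover all
   ConstantValue kinds (int, long, float, double, String). *)
type constant =
  | Int_const of int32
  | Long_const of int64
  | String_const of string

(* The initializer lives on the field, so a field and its constant
   can be checked for type agreement in one place. *)
type field = {
  name : string;
  descriptor : string;
  init : constant option;  (* None when the field has no initializer *)
}
```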
The bytecode generated by TOPL on Tomcat 7 crashes Infer, so it is likely invalid.
The flowgraph should be built during symbolic execution (especially because of JSR). Then symbolic execution should also return this graph. Finally, [compute_max_stack_locals] should filter the entry points of basic blocks (using the flow graph).
The following are times of running parts of topl on dacapo:
Encoding is more than twice as slow as decoding. There is no good reason for this difference; it should be fixed.
Right now we always use the wide variant of instructions. To pick the smallest size we need another fixed-point computation: Changing the size of instructions may change the size of jumps, which in turn may change the size of instructions.
I'm not sure if this really makes a difference for the speed of the bytecode. If it only affected the size of the bytecode, I wouldn't bother.
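The iteration described above is a plain fixed-point computation; a minimal generic sketch, independent of Barista's actual types, is below. In Barista, f would recompute each instruction's encoded size from the current jump offsets, and iteration stops once the size assignment is stable.

```ocaml
(* Apply [f] repeatedly until the result stops changing (w.r.t. [eq]).
   Termination relies on [f] being monotone on a finite domain, which
   holds here: instruction sizes only grow, and are bounded. *)
let rec fixpoint eq f x =
  let y = f x in
  if eq x y then x else fixpoint eq f y
```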
It is undesirable to distinguish between INVOKEINTERFACE and INVOKEVIRTUAL in the high-level representation. The choice should be made based on whether the type name is that of an interface or that of a class. Which brings up another point: It is impossible to distinguish between interfaces and classes given their name. In particular, Barista won't complain if someone uses both INVOKEs on the same type name. Is there a setup in which such bytecode will run without error?
I'm inclined to say that the high-level representation should distinguish between names of classes and names of interfaces. One option is to introduce a type Name.for_interface, another is to have a map Name.for_class -> bool. However, this whole issue needs more thought.
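The second option (a map from names to their kind) could be sketched as follows; all names here are illustrative, and the mnemonic strings stand in for the real instruction constructors.

```ocaml
module StringMap = Map.Make (String)

(* Whether a type name denotes a class or an interface. *)
type kind = Klass | Interface

(* Pick the invoke instruction from the callee's kind, so clients
   never choose between the two INVOKEs by hand. Unknown names
   default to INVOKEVIRTUAL here; a real version should error. *)
let pick_invoke kinds name =
  match StringMap.find_opt name kinds with
  | Some Interface -> "INVOKEINTERFACE"
  | Some Klass | None -> "INVOKEVIRTUAL"
```

The phantom-type option (a separate Name.for_interface) would push the same check to compile time, at the cost of duplicating some name-handling code.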
Name.for_class is a pair that tries to distinguish the name of an inner class. However, there seems to be nothing in the spec that constrains the names of nested classes in any way.
Recently, Rasmus solved a cyclic dependency between submodules of HighClass by separating types and operations into different modules. It might be a good idea to have one module with the types and one module with the operations: High and, respectively, HighOperations. (Each will have several submodules, such as High.Attribute, High.Instruction, and HighOperations.Attribute.)
The module High would have only an ml file, and no mli file. So we won't have to keep those huge variants in sync.
Related: #3
@yoff inserted a comment saying that SymbExe should be size-agnostic. I'm filing this issue so we have a place to discuss it.
These are two attributes that need some extra work during encoding and decoding. They hold debugging information, so they aren't high priority.
Class files with this property exist and may lead to errors during symbolic execution, usually:
"SE: found top where a reference was expected".
The symbolic execution in coder.ml doesn't handle long and double properly. (For example, TOPL's test longarg fails.)
The function Coder.encode_attr_line_number_table has a dummy implementation and a todo note on it.
We need to implement something sensible. (Why? I think this is what happens: Infer detects errors in the bytecode instrumented by TOPL but does not print anything because it has no line numbers.)
The following scenario is allowed by our SE, but not by the spec:
We must have a type Right_half, or one with a better name.
Put s8_of_2s4 into Utils and make sure it's used everywhere (grep for shift_left).
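Assuming s8_of_2s4 combines two 32-bit words (high, low) into one signed 64-bit value, as for CONSTANT_Long's high_bytes/low_bytes in JVMS S4.4.5, the Utils version could be:

```ocaml
(* Combine two 32-bit halves into a signed 64-bit value. The low half
   must be masked so its sign bit is not extended into the high half. *)
let s8_of_2s4 (hi : int32) (lo : int32) : int64 =
  Int64.logor
    (Int64.shift_left (Int64.of_int32 hi) 32)
    (Int64.logand (Int64.of_int32 lo) 0xFFFFFFFFL)
```

The masking of the low word is exactly the step that ad-hoc shift_left code tends to get wrong, which is one reason to centralize this.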
It's quite clear now that we have a full fork. We don't want to maintain two high-level representations. All that depends on the old high-level representation should be removed.