mitdbg / amoeba
License: MIT License
Line 158:
if (TypeUtils.compareTo(p.value, value, type) < 0)
should be
if (TypeUtils.compareTo(p.value, value, type) <= 0)
right?
For example, suppose the value on the RNode is 10. Even if the predicate value is also 10, we still do not need to go right.
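For context, a minimal, self-contained sketch of the boundary case (a hypothetical goLeft helper, not the actual RobustTree/RNode API), assuming the left subtree holds all records with attribute values <= the node's split value:

    class RouteSketch {
        // With '<', a predicate value equal to the split value would be
        // routed right, even though every matching record is on the left.
        static boolean goLeft(int predicateValue, int nodeValue) {
            return predicateValue <= nodeValue;
        }

        public static void main(String[] args) {
            System.out.println(goLeft(10, 10)); // true: no need to go right
        }
    }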
We do not parse A,"B1,B2",C correctly. These are just 3 fields, not 4.
We currently just do split(",").
A temporary hack is to use the | separator.
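A minimal sketch of a quote-aware splitter that handles this case (a hypothetical helper, not the current parser):

    import java.util.ArrayList;
    import java.util.List;

    class CsvSplit {
        // Split on commas that are not inside double quotes.
        static List<String> split(String line) {
            List<String> fields = new ArrayList<>();
            StringBuilder cur = new StringBuilder();
            boolean inQuotes = false;
            for (char c : line.toCharArray()) {
                if (c == '"') {
                    inQuotes = !inQuotes;       // toggle quoted state
                } else if (c == ',' && !inQuotes) {
                    fields.add(cur.toString()); // unquoted comma ends a field
                    cur.setLength(0);
                } else {
                    cur.append(c);
                }
            }
            fields.add(cur.toString());
            return fields;
        }

        public static void main(String[] args) {
            System.out.println(split("A,\"B1,B2\",C")); // [A, B1,B2, C]
        }
    }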
For example, if the values are [1,2,3,4,5] and the given value is 6,
it returns [1,2,3,4], [5].
Use the new Schema class. Clean up the keyAttributeIdx logic.
Not urgent.
Floats get expanded and are represented with full precision. As a result,
1.02 becomes 1.020000001. Parsing and then serializing a 227 MB file turns it into a 277 MB file.
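The expansion is consistent with parsing the value as a float and then printing it by another route; Float.toString already produces the shortest string that round-trips, so serializing with it avoids the blow-up. A small demonstration (the exact code path in the parser is an assumption):

    class FloatRoundTrip {
        public static void main(String[] args) {
            float f = Float.parseFloat("1.02");
            // Shortest decimal that round-trips to the same float:
            System.out.println(Float.toString(f));          // 1.02
            // Widening to double exposes the binary representation:
            System.out.println(String.valueOf((double) f)); // 1.0199999809265137
        }
    }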
We can reproduce the bug by adding System.out.println("~ pos: " + position); at line 143 in https://github.com/mitdbg/mdindex/blob/3ba081a7b345d7bd35566e7cef6f13d92a6b6a55/src/main/java/core/index/build/InputReader.java
You can see some duplicated positions in the output below. The reason is that if the test fails at line 154, the same block can be sampled more than once! (A sketch of one way to avoid this follows the log.)
Conf: /Users/ylu/Documents/workspace/mdindex/conf/cartilage.properties
~ pos: 30720
~ pos: 35840
~ pos: 35840
~ pos: 92160
~ pos: 194560
~ pos: 209920
~ pos: 368640
~ pos: 394240
~ pos: 419840
~ pos: 542720
~ pos: 593920
~ pos: 599040
~ pos: 599040
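One way to rule the duplicates out (a sketch under the assumption that positions are drawn independently on each retry; names are hypothetical, not the actual InputReader code) is to remember every position already emitted:

    import java.util.HashSet;
    import java.util.Random;
    import java.util.Set;

    class DedupSampler {
        public static void main(String[] args) {
            Random rng = new Random();
            long blockSize = 5120;
            long numBlocks = 120;            // fileSize / blockSize
            Set<Long> seen = new HashSet<>();
            while (seen.size() < 10) {
                long pos = (long) (rng.nextDouble() * numBlocks) * blockSize;
                // Even if a record-boundary test fails and we retry,
                // 'seen' prevents the same block from being emitted twice.
                if (seen.add(pos)) {
                    System.out.println("~ pos: " + pos);
                }
            }
        }
    }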
Observed that in the partitions written out, very rarely, there are two newlines between records.
This creates an error, as we get a super tiny record.
See lines 187 and 199.
If we put \n in RawIndexKey and the last attribute of a record is a string, we cannot route to the correct bucket based on the RobustTree.
The result is that we have non-empty buckets in the index but empty data blocks.
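A tiny demonstration of the misrouting (plain String comparison standing in for TypeUtils.compareTo):

    class KeySketch {
        public static void main(String[] args) {
            String splitValue = "MAIL";
            String lastAttr = "MAIL\n"; // record delimiter leaked into the key
            // > 0: the key sorts after an equal split value, so it is
            // routed to the wrong side of the tree.
            System.out.println(lastAttr.compareTo(splitValue));
            // Stripping the delimiter before comparing fixes the routing.
            System.out.println(lastAttr.trim().compareTo(splitValue)); // 0
        }
    }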
Globals currently holds a single global schema. Make it per-table information.
I would like to work on this, and it can also solve Issue #18 in the meantime.
The recommended limit is 100 KB, but we exceed it.
SparkFileSplit stores paths[] explicitly. We can probably factor the base path out into a single variable. This would greatly reduce the size.
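A sketch of the idea (hypothetical field names, not the actual SparkFileSplit): serialize the base path once and keep only relative suffixes, reconstructing full paths on demand.

    class CompactSplit implements java.io.Serializable {
        final String basePath;        // serialized once
        final String[] relativePaths; // short suffixes instead of full paths

        CompactSplit(String basePath, String[] relativePaths) {
            this.basePath = basePath;
            this.relativePaths = relativePaths;
        }

        String fullPath(int i) {
            return basePath + "/" + relativePaths[i];
        }
    }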
Line 74:
if (TypeUtils.compareTo(this.value, value, this.type) < 0)
should be
if (TypeUtils.compareTo(this.value, value, this.type) > 0)
right?
Please make sure TPCHWorkload is ok. @anilshanbhag
I am going to fix TPCHJoinWorkload
It happens if there is a long double string.
Use it to adjust the buffer size. Make sure we don't get a heap-size-exceeded exception.
Running the following query:
SELECT COUNT(*) FROM tpch WHERE l_shipmode == "MAIL" AND l_receiptdate >= "1997-01-01" AND l_receiptdate < "1998-01-01";
we took 30.972 s,
while Spark took 21.73 s.
The results did match, so that is good.
Currently we are hand-setting it. Set it automatically to ensure enough samples, but make sure it is not larger than 1 GB.
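One possible sizing rule (the fraction and floor here are assumptions, not current code):

    class SampleSize {
        // Sample a fixed fraction of the data, but stay above a floor
        // (enough samples) and below a 1 GB cap.
        static long sampleBytes(long dataBytes, double fraction, long floorBytes) {
            long want = (long) (dataBytes * fraction);
            return Math.min(Math.max(want, floorBytes), 1L << 30);
        }
    }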
Reference URLs
http://hbase.apache.org/book.html#trouble
http://ccgtech.blogspot.com/2010/02/hadoop-hdfs-deceived-by-xciever.html
2015-09-09 17:07:53,370 WARN org.apache.hadoop.hdfs.server.datanode.DataNode: istc2:50010:DataXceiverServer:
java.io.IOException: Xceiver count 4098 exceeds the limit of concurrent xcievers: 4096
at org.apache.hadoop.hdfs.server.datanode.DataXceiverServer.run(DataXceiverServer.java:140)
at java.lang.Thread.run(Thread.java:722)
2015-09-09 17:07:53,384 ERROR org.apache.hadoop.hdfs.server.datanode.DataNode: istc2:50010:DataXceiver error processing WRITE_BLOCK operation src: /128.30.77.86:53903 dst: /128.30.77.88:50010
java.io.EOFException: Premature EOF: no length prefix available
at org.apache.hadoop.hdfs.protocolPB.PBHelper.vintPrefixed(PBHelper.java:2203)
at org.apache.hadoop.hdfs.server.datanode.DataXceiver.writeBlock(DataXceiver.java:672)
at org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.opWriteBlock(Receiver.java:137)
at org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.processOp(Receiver.java:74)
at org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:235)
at java.lang.Thread.run(Thread.java:722)
2015-09-09 17:07:53,383 ERROR org.apache.hadoop.hdfs.server.datanode.DataNode: DataNode{data=FSDataset{dirpath='[/data/mdindex/dfs/dfs/data/current]'}, localName='istc2:50010', datanodeUuid='96a62a6a-65de-4007-bc6c-b1acc64b9d14', xmitsInProgress=0}:Exception transfering block BP-336117526-127.0.1.1-1432094327835:blk_1077379710_21271759 to mirror 128.30.77.83:50010: java.io.EOFException: Premature EOF: no length prefix available
Don't keep files open: open, write, and close.
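A sketch of the pattern with try-with-resources (local java.nio used as a stand-in for the HDFS FileSystem API), so each write holds a stream, and hence a DataXceiver, only briefly:

    import java.io.IOException;
    import java.io.OutputStream;
    import java.nio.file.Files;
    import java.nio.file.Path;
    import java.nio.file.StandardOpenOption;

    class PartitionWriter {
        // Open, write, close: the stream never outlives one write.
        static void append(Path partition, byte[] bytes) throws IOException {
            try (OutputStream out = Files.newOutputStream(partition,
                    StandardOpenOption.CREATE, StandardOpenOption.APPEND)) {
                out.write(bytes);
            } // closed here, releasing the underlying connection
        }
    }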
One bug is in Optimizer::NumTuplesAccessed.
There might be more.
Currently we just choose the best plan among a set of plans, each of which tries to add just one of the predicates.
The most obvious improvement would be to then add the other predicates into that best plan; see the sketch below.
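A greedy sketch of that option (the cost function is a stand-in for the optimizer's estimate, not the real Optimizer API):

    import java.util.ArrayList;
    import java.util.List;
    import java.util.function.ToDoubleFunction;

    class GreedyPlanner {
        // A plan is modeled as the list of predicates folded into it.
        // Starting from empty, repeatedly add the predicate that lowers
        // estimated cost the most; stop when nothing improves the plan.
        static <P> List<P> greedy(List<P> preds, ToDoubleFunction<List<P>> cost) {
            List<P> plan = new ArrayList<>();
            List<P> remaining = new ArrayList<>(preds);
            while (!remaining.isEmpty()) {
                P bestPred = null;
                double bestCost = cost.applyAsDouble(plan);
                for (P p : remaining) {
                    plan.add(p);
                    double c = cost.applyAsDouble(plan);
                    plan.remove(plan.size() - 1);
                    if (c < bestCost) {
                        bestCost = c;
                        bestPred = p;
                    }
                }
                if (bestPred == null) break; // no predicate improves the plan
                plan.add(bestPred);
                remaining.remove(bestPred);
            }
            return plan;
        }
    }

The first iteration picks exactly the best single-predicate plan, so this subsumes what we do today.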
The write_partitions completed; I think it got stuck on the ZooKeeper count update.
Run write_partitions correctly and verify, using the sum_bucket_counters call, that the bucket count was set correctly.
It is the same as RepartitionIterator in our current implementation.
public Query(String predString) {
    String[] parts = predString.split(";");
    this.predicates = new Predicate[parts.length - 1];
    for (int i = 0; i < parts.length - 1; i++) {
        this.predicates[i] = new Predicate(parts[i]);
    }
}
should be
public Query(String predString) {
    String[] parts = predString.split(";");
    this.predicates = new Predicate[parts.length];
    for (int i = 0; i < parts.length; i++) {
        this.predicates[i] = new Predicate(parts[i]);
    }
}
(Note that Java's split drops trailing empty strings, so even a predString ending in ";" will not produce an empty last part; the length - 1 version always drops the final predicate.)
In the index builder, after splitting the sample based on the chosen attribute, use the sample sizes of the two splits to adjust the allocation to be deducted in the left and right branches.
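A minimal sketch of the proportional deduction (my reading of the issue; the names are hypothetical):

    class AllocationSketch {
        // Split the deduction in proportion to how much of the sample
        // fell on each side, instead of evenly.
        static double[] splitDeduction(double deduction, int leftSamples, int rightSamples) {
            double leftFrac = (double) leftSamples / (leftSamples + rightSamples);
            return new double[] { deduction * leftFrac, deduction * (1.0 - leftFrac) };
        }
    }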
AccessMethod::init - We are reading from the file /info, which is not written out anywhere.
We set it to 20 or something now. Make sure it is at least 25. Bonus: make it a variable.
We have many variations of the index builder. Remove all except the one that will actually be used.
Tried to sample 700 MB of data with a 512 MB heap size. This should be easy, but we get an OOM.
One issue might be that we store the entire sample. However, there might be other things we are storing accidentally.
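If the full sample must stay in memory, a bounded reservoir keeps memory at O(k) regardless of input size (a sketch, not the current sampler):

    import java.util.ArrayList;
    import java.util.List;
    import java.util.Random;

    class Reservoir {
        // Algorithm R: each record ends up in the reservoir with equal
        // probability k / n, using only k slots of memory.
        static List<String> sample(Iterable<String> records, int k, Random rng) {
            List<String> reservoir = new ArrayList<>(k);
            long seen = 0;
            for (String r : records) {
                seen++;
                if (reservoir.size() < k) {
                    reservoir.add(r);
                } else {
                    long j = (long) (rng.nextDouble() * seen);
                    if (j < k) {
                        reservoir.set((int) j, r);
                    }
                }
            }
            return reservoir;
        }
    }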
Currently we require the data to be on local disk.
We have CartilageIndexKeyMT and TypeUtilsMT. Make sure they are correct, make them the default, and remove the old ones.