mitdbg / amoeba Goto Github PK

View Code? Open in Web Editor NEW

16.0 16.0 3.0 7.1 MB

License: MIT License

Shell 2.11% Python 15.42% Java 74.12% Scala 8.16% PLpgSQL 0.18%

amoeba's People

Contributors

Stargazers

Watchers

Forkers

weima527 qinzhewudao

amoeba's Issues

A bug in RNode

Line 158:

if (TypeUtils.compareTo(p.value, value, type) < 0)

should be

if (TypeUtils.compareTo(p.value, value, type) <= 0)

right?

For example, the value on the RNode is 10. Even the predicate value is 10, we still do not need to go right.

Creating a new index should overwrite the existing index

CSV parsing incorrect

We do not parse A,"B1,B2",C correctly. These are 3 just fields not four.
We currently just do split(",").

Temporary hack is to use | separator.

Binary search in ParsedTupleList.splitAt is buggy

for example, the values are [1,2,3,4,5], and the given value is 6

it returns [1,2,3,4], [5]

Rewrite the CartilageIndexKey::detectTypes function

Use the new Schema class. Cleanup the keyAttributeIdx logic.

Keep floats compact in the sample.

Not urgent.
Floats get expanded and are represented with full precision. As a results
1.02 becomes 1.020000001. This makes 227 MB file on parsing -> serializing 277 MB.

Some blocks could be sampled more than once!

We can reproduce the bug by adding System.out.println("~ pos: " + position); at line 143 in https://github.com/mitdbg/mdindex/blob/3ba081a7b345d7bd35566e7cef6f13d92a6b6a55/src/main/java/core/index/build/InputReader.java

You can see some duplicated position. The reason is that if the test fails at line 154, then this block will be sampled more than once!

Conf: /Users/ylu/Documents/workspace/mdindex/conf/cartilage.properties
~ pos: 30720
~ pos: 35840
~ pos: 35840
~ pos: 92160
~ pos: 194560
~ pos: 209920
~ pos: 368640
~ pos: 394240
~ pos: 419840
~ pos: 542720
~ pos: 593920
~ pos: 599040
~ pos: 599040

Partition writer is probably introducing new lines

Observed that in the partitions written out, very rarely there are two newlines between records.
This creates an error as we get a super tiny record.

Severe bug in InputReader.java

line 187, 199.

If we put \n in RawIndexKey and the last attribute of a record is a string, we cannot go to the right bucket based on RobustTree.

The result is we have non-empty bucket in the index, but empty data blocks.

Move globals to per table info

Globals currently a global schema. Make it a per table info.

write a Spark app to sample

I would like to work on this and it can also solve Issue #18 meanwhile.

Improve SparkFileSplit format.

Recommended limit is 100KB but we exceed it.
SparkFileSplit has paths[] put explicitly. We can probably remove the base path and make it a variable. This will greatly reduce the size.

A bug in Predicate

Line 74:

if (TypeUtils.compareTo(this.value, value, this.type) < 0)

should be

if (TypeUtils.compareTo(this.value, value, this.type) > 0)

right?

Merge Join

Please make sure TPCHWorkload is ok. @anilshanbhag

I am going to fix TPCHJoinWorkload

int overflow in RawIndexKey.getDoubleAttribute

it happens if there is a long double string

While writing partitions, take input memory limit

Use it to adjust buffer size. Make sure we don't get exceeded heap size exception.

Perf Improvement Needed

Running the following query

SELECT COUNT(*) FROM tpch WHERE l_shipmode == "MAIL" AND l_receiptdate >= "1997-01-01" AND l_receiptdate < "1998-01-01";

We took 30.972s
While spark took 21.73s

The results did match, so that is good.

Dont parse data while sampling

Choose sampling percentage automatically

Currently we are hand setting it. Set it automatically to ensure enough samples but make sure it is not larger than 1GB.

fill_cmd does not work if parameters share the same prefix

Do we need to change it to

cmd = cmd.replace('"$' + k + '"', '"' + env.conf[k] + '"')

@anilshanbhag

Too many files on HDFS opened

Reference URLs
http://hbase.apache.org/book.html#trouble
http://ccgtech.blogspot.com/2010/02/hadoop-hdfs-deceived-by-xciever.html

2015-09-09 17:07:53,370 WARN org.apache.hadoop.hdfs.server.datanode.DataNode: istc2:50010:DataXceiverServer:
java.io.IOException: Xceiver count 4098 exceeds the limit of concurrent xcievers: 4096
        at org.apache.hadoop.hdfs.server.datanode.DataXceiverServer.run(DataXceiverServer.java:140)
        at java.lang.Thread.run(Thread.java:722)

2015-09-09 17:07:53,384 ERROR org.apache.hadoop.hdfs.server.datanode.DataNode: istc2:50010:DataXceiver error processing WRITE_BLOCK operation  src: /128.30.77.86:53903 dst: /128.30.77.88:50010
java.io.EOFException: Premature EOF: no length prefix available
        at org.apache.hadoop.hdfs.protocolPB.PBHelper.vintPrefixed(PBHelper.java:2203)
        at org.apache.hadoop.hdfs.server.datanode.DataXceiver.writeBlock(DataXceiver.java:672)
        at org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.opWriteBlock(Receiver.java:137)
        at org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.processOp(Receiver.java:74)
        at org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:235)
        at java.lang.Thread.run(Thread.java:722)

2015-09-09 17:07:53,383 ERROR org.apache.hadoop.hdfs.server.datanode.DataNode: DataNode{data=FSDataset{dirpath='[/data/mdindex/dfs/dfs/data/current]'}, localName='istc2:50010', datanodeUuid='96a62a6a-65de-4007-bc6c-b1acc64b9d14', xmitsInProgress=0}:Exception transfering block BP-336117526-127.0.1.1-1432094327835:blk_1077379710_21271759 to mirror 128.30.77.83:50010: java.io.EOFException: Premature EOF: no length prefix available

Don't keep files open. Open, write and close.

GEQ and LT are buggy

One bug is in Optimizer::NumTuplesAccessed
There might be more.

Support adding multiple predicates into the tree.

Currently we just choose the best plan among a set of plans, each of which tries to add just one of the predicates.
Adding the other predicates into the best plan would be the most obvious option.

Zookeeper bucket count incorrect

The write_partitions completed, I think it got stuck on the zookeeper count update.

Run the write_partitions correctly and verify using the sum_bucket_counters call that the bucket count was set correctly.

Remove DistributedRepartitionIterator

It is the same as RepartitionIterator in our current implementation.

A bug in Query constructor

public Query(String predString) {
        String[] parts = predString.split(";");
        this.predicates = new Predicate[parts.length - 1];
        for (int i = 0; i < parts.length - 1; i++) {
            this.predicates[i] = new Predicate(parts[i]);
        }
    }

should be

public Query(String predString) {
        String[] parts = predString.split(";");
        this.predicates = new Predicate[parts.length];
        for (int i = 0; i < parts.length; i++) {
            this.predicates[i] = new Predicate(parts[i]);
        }
    }

mitdbg / amoeba Goto Github PK

amoeba's People

Contributors

Stargazers

Watchers

Forkers

amoeba's Issues

Recommend Projects

Recommend Topics

Recommend Org