bluestreak01 avatar bluestreak01 commented on May 22, 2024

Hi Jaromir, you are quite right, it is a problem with the index. At the moment there is no efficient way to store a large volume of distinct ids and search on them. This problem has been hanging over my head for quite some time and I'll definitely add an efficient index for cases like this one.

Symbol is a kind of enum type. It is designed for time series data where you have a relatively low number of items (stock symbols, sensors or other subjects) and a large volume of time series data associated with each of them. Symbols do not work well with counts over 100k.

One workaround that you can try is this:

$str("uniqueId").size(15).index().buckets(5000)

This will create the index you want: a hash table.
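
To see why the bucket count matters once there are many distinct ids, here is a minimal plain-Java sketch. It does not use the library's API and does not reflect its actual index structure; `BucketIndexSketch` and its methods are hypothetical names for illustration only. It assumes keys are assigned to buckets by hash code modulo the bucket count and that a lookup scans every row id in the matching bucket. The row and bucket counts are scaled down to keep the same 200-rows-per-bucket ratio as 1M rows over 5000 buckets.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical illustration only; not the library's index implementation. */
final class BucketIndexSketch {
    private final List<List<Integer>> buckets = new ArrayList<>();
    private final List<String> rows = new ArrayList<>(); // row id -> key value

    BucketIndexSketch(int bucketCount) {
        for (int i = 0; i < bucketCount; i++) {
            buckets.add(new ArrayList<>());
        }
    }

    private int bucketOf(String key) {
        // floorMod keeps the result non-negative for negative hash codes
        return Math.floorMod(key.hashCode(), buckets.size());
    }

    /** Appends a row and records its row id in the key's bucket. */
    void append(String key) {
        int rowId = rows.size();
        rows.add(key);
        buckets.get(bucketOf(key)).add(rowId);
    }

    /** Returns the first row id whose key matches, or -1 if absent. */
    int find(String key) {
        for (int rowId : buckets.get(bucketOf(key))) {
            if (rows.get(rowId).equals(key)) {
                return rowId;
            }
        }
        return -1;
    }

    /** Average number of row ids stored per bucket. */
    double averageBucketSize() {
        return (double) rows.size() / buckets.size();
    }

    public static void main(String[] args) {
        BucketIndexSketch index = new BucketIndexSketch(500);
        for (int i = 0; i < 100_000; i++) {
            index.append(Integer.toString(i));
        }
        System.out.println(index.find("1001"));        // prints 1001
        System.out.println(index.averageBucketSize()); // prints 200.0
    }
}
```

In this model a lookup cost grows with the average bucket size, so fewer buckets for the same number of distinct ids means longer scans per search.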

from questdb.

jaromirs avatar jaromirs commented on May 22, 2024

Hi Vlad, thanks for your prompt response. I'll try the workaround and let you know.

jaromirs avatar jaromirs commented on May 22, 2024

I've tried the workaround and it is considerably faster, but the performance is still much worse than a standard index with few distinct values. Is there any ETA for when you will add the efficient solution for a case like this?

Thanks a lot.
Jaromir

bluestreak01 avatar bluestreak01 commented on May 22, 2024

Could you give me an example of your object and the config setup for it? It may be possible to tweak performance as is.

I cannot give an ETA, but that's very high on the priority list. I'll keep this issue updated with progress.

jaromirs avatar jaromirs commented on May 22, 2024

OK, thanks.

Here is my object:

public class Order {
    private long id;
    private String clOrdId;
    private long timestamp;

    public void clear(){
        clOrdId = null;
    }

    public long getId() {
        return id;
    }

    public void setId(long id) {
        this.id = id;
    }

    public long getTimestamp() {
        return timestamp;
    }

    public void setTimestamp(long timestamp) {
        this.timestamp = timestamp;
    }

    public void setClOrdId(String clOrdId) {
        this.clOrdId = clOrdId;
    }

    public String getClOrdId() {
        return clOrdId;
    }
}

And the setup is here:
$(Order.class)
.partitionBy(PartitionType.DAY)
.location("orders-by-day")
.key("id")
.$str("clOrdId").index().size(15).buckets(5000)
.$ts()

bluestreak01 avatar bluestreak01 commented on May 22, 2024

This code appends 1M Orders in 216ms without index and 300ms with index. Is this similar to what you are getting?

        JournalFactory factory = new JournalFactory(new JournalConfigurationBuilder() {{
            $(Order.class)
                    .partitionBy(PartitionType.DAY)
                    .$str("clOrdId").index().buckets(5000)
                    .$ts()
            ;
        }}.build(args[0]));


        Order order = new Order();

        long t = System.nanoTime();
        JournalWriter<Order> w = factory.writer(Order.class);
        for (int i = -1000000; i < 1000000; i++) {
            if (i == 0) {
                t = System.nanoTime();
            }
            order.setTimestamp(System.currentTimeMillis());
            order.setId(i);
            order.setClOrdId(Integer.toString(i));
            w.append(order);
        }
        w.commit();

        System.out.println(System.nanoTime() - t);

jaromirs avatar jaromirs commented on May 22, 2024

I was getting much worse numbers - I compared my example with yours and the difference was the timestamp. In my example the timestamp was calculated so it distributed the items among 90 days - thus creating 90 partitions.
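
A small plain-Java sketch of the effect described above: with day partitioning, every distinct calendar day in the data ends up in its own partition. The class and method names are hypothetical, the start date is arbitrary, and the sketch assumes timestamps are epoch milliseconds bucketed by UTC day, which may differ from how the library derives partition boundaries.

```java
import java.time.Instant;
import java.time.LocalDate;
import java.time.ZoneOffset;
import java.util.Set;
import java.util.TreeSet;

/** Hypothetical illustration only; not the library's partitioning code. */
final class DayPartitionSketch {
    private static final long MILLIS_PER_DAY = 86_400_000L;

    /** Maps an epoch-millisecond timestamp to its UTC calendar day. */
    static LocalDate partitionDay(long epochMillis) {
        return Instant.ofEpochMilli(epochMillis).atZone(ZoneOffset.UTC).toLocalDate();
    }

    /** Counts distinct day partitions when rows are spread evenly over the given number of days. */
    static int countPartitions(long startMillis, int rows, int days) {
        long spanMillis = days * MILLIS_PER_DAY;
        Set<LocalDate> partitions = new TreeSet<>();
        for (int i = 0; i < rows; i++) {
            long ts = startMillis + (spanMillis * i) / rows;
            partitions.add(partitionDay(ts));
        }
        return partitions.size();
    }

    public static void main(String[] args) {
        // Arbitrary start date chosen for the example
        long start = LocalDate.of(2014, 1, 1)
                .atStartOfDay(ZoneOffset.UTC)
                .toInstant()
                .toEpochMilli();
        System.out.println(countPartitions(start, 1_000_000, 90)); // calculated timestamps: prints 90
        System.out.println(countPartitions(start, 1_000_000, 0));  // all rows at one instant: prints 1
    }
}
```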

One last question, Vlad: in the example above, how do I find an item with a specific clOrdId? It seems the journal query does not work:

journal.query().all().withSymValues("clOrdId", "1001")

Thanks again, Jaromir

bluestreak01 avatar bluestreak01 commented on May 22, 2024

With multiple partitions it is likely to be a sizing/memory issue rather than index performance. I'm going to try to tune that and let you know.

In the meantime there are a couple of things you can try:

  • $(Order.class).recordCountHint(10000): this hint is per partition. With 1M rows spread over 90 days, 10K per partition should be fine; otherwise the journal will map memory too aggressively.
  • Depending on how much RAM you have, you may fill it up quite quickly. To avoid that, use bulkWriter() instead of writer(); it is a bit slower but much more memory frugal:
JournalWriter<Order> w = factory.bulkWriter(Order.class);
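
The 10K figure follows from simple division, shown here as a minimal plain-Java sketch. The class and method names are hypothetical; the sketch assumes rows are spread evenly across day partitions and says nothing about how the library uses the hint internally.

```java
/** Hypothetical helper for estimating a per-partition record count. */
final class RecordCountHintSketch {
    /** Rows per partition, rounded up, assuming an even spread across partitions. */
    static long rowsPerPartition(long totalRows, int partitions) {
        if (partitions <= 0) {
            throw new IllegalArgumentException("partitions must be positive");
        }
        return (totalRows + partitions - 1) / partitions;
    }

    public static void main(String[] args) {
        // 1M rows over 90 day partitions, as in the example discussed above
        System.out.println(rowsPerPartition(1_000_000, 90)); // prints 11112
    }
}
```

The result is close to the 10,000 suggested as the hint.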

In the current release there is no simple way to search on a string. You can do it, but it takes quite a few lines of code (and an understanding of the index structure). In the snapshot, however, you can do something like this:

    for (Order o : q.ds(
            q.top(1
                    , q.forEachPartition(
                            q.source(w, false)
                            , q.forEachRow(
                                    q.kvSource("clOrdId", q.hashSource("clOrdId", "10"))
                                    , q.equalsConst("clOrdId", "10")
                            )
                    )
            )
            , order
    )) {
        System.out.println(o);
    }

It is a little fiddly as well, but simpler than doing it all by hand :)

bluestreak01 avatar bluestreak01 commented on May 22, 2024

With config like this:

        JournalFactory factory = new JournalFactory(new JournalConfigurationBuilder() {{
            $(Order.class).recordCountHint(10000)
                    .partitionBy(PartitionType.DAY)
                    .$str("clOrdId").index().buckets(100)
                    .$ts()
            ;
        }}.build(args[0]));

I can append 1M in 700ms without index and ~900ms with index.

For practical applications it doesn't make sense to partition 1M rows of data like that. It's about 126MB on disk, underpopulated. You could grow the database to 50-100M rows before considering partitioning. Otherwise the cost of managing files is higher than the cost of searching the data if it were all kept in one place.

sirinath avatar sirinath commented on May 22, 2024

I am not sure if this might help. Perhaps you can borrow some ideas from:

https://code.google.com/p/cqengine/
https://code.google.com/p/concurrent-trees/

jaromirs avatar jaromirs commented on May 22, 2024

Hi Vlad,

thanks for your response. You are right, it does not make sense to have such small partitions. I wasn't planning it - it was just an artificial example. Btw: when do you think you will release the support for string searching?

Hi Sirinath,

thanks for the tip - I think we might have come across the CQEngine already - will double check.

Jaromir

bluestreak01 avatar bluestreak01 commented on May 22, 2024

Hi Jaromir, the snapshot is already available in the Maven repo if you want to play with it. Just add this to your pom:

<repositories>
    <repository>
        <id>sonatype-snapshots</id>
        <url>https://oss.sonatype.org/content/repositories/snapshots</url>
    </repository>
</repositories>

I'm planning to add a hash index on int fields, so perhaps you can use "id" as the unique key? I'll release over this weekend. Having said that, the Q API is part of a large project and will definitely take some time to stabilise and test properly, so you can use it, but I'd very much appreciate it if you treat it as beta.

bluestreak01 avatar bluestreak01 commented on May 22, 2024

I have released 2.0.1, which contains support for key search (both int and string):

https://github.com/NFSdb/nfsdb/releases/tag/2.0.1

jaromirs avatar jaromirs commented on May 22, 2024

Cool. Thanks again Vlad.

bluestreak01 avatar bluestreak01 commented on May 22, 2024

My pleasure!

jaromirs avatar jaromirs commented on May 22, 2024

Hi Vlad,

I haven't found Release 2.0.1 on Maven Central Repository - could you please publish it? I am using 2.0.2-SNAPSHOT for the time being ...

Thanks, Jaromir

bluestreak01 avatar bluestreak01 commented on May 22, 2024

Hi Jaromir, you are right, the release was incomplete.

All steps are now done and it's with the Maven replication system. Please check in an hour or two.

Vlad

jaromirs avatar jaromirs commented on May 22, 2024

Thanks!

bluestreak01 avatar bluestreak01 commented on May 22, 2024

It is out, finally 👍

jaromirs avatar jaromirs commented on May 22, 2024

Great, thanks.
