pdk's People

Contributors

acmiyaguchi, alanbernstein, asvetlik, codysoyland, jaffee, kcrodgers24, shaqque, travisturner, yuce


pdk's Issues

Can I use pdk to ingest data from other database like Hive?

Hi guys:
I'm new to Pilosa and Go. I've read some of the docs on your website, but I'm still not clear on Pilosa and its data model.
For example, take some records from a relational database like the ones below:

name age
A 18
B 19

They would be transformed into bitmap indexes:

Field name

row (ID)  | col 1  col 2
A (1)     |   1      0
B (2)     |   0      1

Field age (BSI)

row      | age=18  age=19
comp 0   |    1       0
comp 2   |    0       0
comp 3   |    1       1
comp 4   |    1       1
comp 5   |    0       0
not_null |    1       1

Is that right?

If I understand the above correctly, I have the following questions:

  1. Pilosa stores row and column IDs in a field, so how can I transform string values to integer IDs during ingestion? For example, in field name, how do I map A to row ID 1 while ingesting? Does the mapping from A to 1 need to be maintained on a separate server in the user's system, or will Pilosa do it automatically? If the latter, can I get the value A back when querying Row(name=1)?
  2. If I want to ingest data from Hive, what should I do? Is there any DML-like batch query, or does the user need to set rows of a field one by one?

Looking forward to hearing from you, and thanks for helping me use and understand Pilosa better!
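Regarding question 1: since Pilosa works with integer IDs, the string-to-ID mapping is typically maintained by the ingest layer. A minimal in-memory sketch of such a mapper (hypothetical names, not pdk's actual translator API) might look like:

```go
package main

import "fmt"

// StringIDMapper assigns a stable integer ID to each distinct string.
// This is an illustration only, not pdk's actual translation code.
type StringIDMapper struct {
	ids  map[string]uint64
	vals []string // index i holds the string for ID i, for reverse lookup
}

func NewStringIDMapper() *StringIDMapper {
	return &StringIDMapper{ids: make(map[string]uint64)}
}

// ID returns the existing ID for s, allocating the next free one if needed.
func (m *StringIDMapper) ID(s string) uint64 {
	if id, ok := m.ids[s]; ok {
		return id
	}
	id := uint64(len(m.vals))
	m.ids[s] = id
	m.vals = append(m.vals, s)
	return id
}

// Value is the reverse lookup: given a row ID, return the original string.
func (m *StringIDMapper) Value(id uint64) (string, bool) {
	if id >= uint64(len(m.vals)) {
		return "", false
	}
	return m.vals[id], true
}

func main() {
	m := NewStringIDMapper()
	fmt.Println(m.ID("A"), m.ID("B"), m.ID("A")) // prints: 0 1 0
	v, _ := m.Value(0)
	fmt.Println(v) // prints: A
}
```

The reverse lookup is what would let a query layer turn Row(name=1) back into the value B.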

Develop CSV importer

For the use case work, we put together a CSV import system that is specific to the two use cases, but lays some groundwork for working with more general data sources. The scope is limited to well-formatted, well-defined tabular data, so users will be responsible for providing clean data.

Fix/Rewrite Proxy

The current Pilosa proxy has a couple of limitations.

  1. It only translates /query requests, and doesn't forward other requests to Pilosa.
  2. It translates rows but not columns.
  3. The API is a little awkward to set up and use.

We should probably rewrite it using something like https://golang.org/pkg/net/http/httputil/#ReverseProxy to handle most of the proxying details, and we can add in column translation on the response as well.

Write an S3 Source and command for pdk

We've seen some interest in ingesting JSON data from S3. It should be pretty easy to write a pdk.Source which returns JSON objects one by one given an S3 bucket or a list of S3 files.

unify taxi and net mapping and nexter code

The net subcommand contains an in-memory nexter implementation and a generic string mapper. taxi has separated its mapping functionality out into a library, and I think it has some kind of nexter as well. We should move all mapping functionality into the pdk library with as uniform an interface as possible, and make sure both commands share nexter code if possible.
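For reference, the shared nexter (monotonic ID generator) both commands need can be tiny; an in-memory implementation along these lines (illustrative, not the existing code) would remove the duplication:

```go
package main

import (
	"fmt"
	"sync/atomic"
)

// Nexter hands out monotonically increasing uint64 IDs and is safe for
// concurrent use by multiple ingest goroutines.
type Nexter struct {
	next uint64
}

// Next returns the next unused ID, starting from 0.
func (n *Nexter) Next() uint64 {
	return atomic.AddUint64(&n.next, 1) - 1
}

func main() {
	n := &Nexter{}
	fmt.Println(n.Next(), n.Next(), n.Next()) // prints: 0 1 2
}
```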

Use the opentracing support in go-pilosa

Need to expose options so that PDK ingestion code can configure go-pilosa to use tracing.

Additionally, the usual configuration will probably be that go-pilosa is writing to the same tracing infra as Pilosa, so it might be worth Pilosa exposing its tracing config through an endpoint which go-pilosa can consume, thereby allowing hands-off config by default... 🤔 This second part is not too relevant directly to the PDK... I think we still need the first part anyway.
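One way to expose this is the functional-option pattern; a hypothetical sketch of plumbing a tracer through PDK's setup (none of these names are real pdk or go-pilosa API, and Tracer here stands in for an opentracing.Tracer):

```go
package main

import "fmt"

// Tracer is a stand-in for an opentracing.Tracer.
type Tracer interface{ Name() string }

type indexerConfig struct {
	tracer Tracer
}

// Option configures the indexer; OptTracer is the hypothetical knob that
// would hand the tracer down to the go-pilosa client.
type Option func(*indexerConfig)

func OptTracer(t Tracer) Option {
	return func(c *indexerConfig) { c.tracer = t }
}

func NewIndexer(opts ...Option) *indexerConfig {
	c := &indexerConfig{}
	for _, o := range opts {
		o(c)
	}
	return c
}

type noopTracer struct{}

func (noopTracer) Name() string { return "noop" }

func main() {
	idx := NewIndexer(OptTracer(noopTracer{}))
	fmt.Println(idx.tracer.Name()) // prints: noop
}
```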

taxi import failing after ~150GB with OOM

While doing some multi-cloud benchmarks, I noticed that the taxi import would use a "normal" (~10GB) amount of memory for a long time, and then suddenly over the course of about 90 seconds spike up and OOM even on boxes as large as 128GB.

It did not happen at the same point in the import each time - I observed it at 115.5GB, 159.6GB, 162.4GB, and 165.3GB among others.

I saw this happen both on AWS and Azure instances, but not OCI. The OCI instances I was using did have >200GB of RAM though - I did not measure their memory usage during operation to see if it spiked up over 128GB at any point.

data races in pdk.Index.setupFrame

setupFrame accesses some maps in Index, but also launches some goroutines which access the same maps. I think we can pass the accessed values to the goroutines rather than having them access the maps directly.

(This is in generic-improvements)

GenericParser should be resilient to bad fields

The generic parser currently returns an error any time it fails to parse a field's value, and stops processing for that record. In reality, there are lots of innocuous reasons why a certain field might fail to parse, and it doesn't indicate that the entire record is suspect.

We should have stat/log options for notifying and counting which fields are failing, why, and how often, but we should make a best effort to parse any record and return some data.

I don't think we need to go crazy with configurability: just some count stats that use the field path and encapsulate what the error is (e.g. null value, unsupported type, etc.). pdk/ingest.go has an example of a simple stats interface that we should probably extend throughout the codebase so that it can be configured at the top level.
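A minimal sketch of the kind of counting stats described above, keyed by field path and error kind (names are illustrative, loosely modeled on the simple stats idea in pdk/ingest.go):

```go
package main

import (
	"fmt"
	"sync"
)

// FieldErrorStats counts parse failures per (field path, error kind) so
// ingestion can log and continue instead of aborting the whole record.
type FieldErrorStats struct {
	mu     sync.Mutex
	counts map[string]int
}

func NewFieldErrorStats() *FieldErrorStats {
	return &FieldErrorStats{counts: make(map[string]int)}
}

// Count records one failure, e.g. Count("user.age", "null value").
func (s *FieldErrorStats) Count(path, kind string) {
	s.mu.Lock()
	defer s.mu.Unlock()
	s.counts[path+": "+kind]++
}

// Get returns how often a given (path, kind) failure has occurred.
func (s *FieldErrorStats) Get(path, kind string) int {
	s.mu.Lock()
	defer s.mu.Unlock()
	return s.counts[path+": "+kind]
}

func main() {
	stats := NewFieldErrorStats()
	stats.Count("user.age", "null value")
	stats.Count("user.age", "null value")
	stats.Count("user.id", "unsupported type")
	fmt.Println(stats.Get("user.age", "null value")) // prints: 2
}
```

The parser would call Count and skip the bad field, still returning whatever data it did manage to parse for the record.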

convert network, taxi, and weather use cases to use Go client

Currently, pdk/import.go contains import functionality which relies on importing pilosa/ctl and using that importer. We should remove this code and use the code in pdk/pilosa.go, which uses the Go client and has nascent support for BSI field values.

Enable CI

We should have CI enabled on this repo.

taxi, speed_mph broken after import

Do the taxi import, then issue the following query and observe the result (edited error message for readability):

curl 10.1.110.70:10101/index/taxi/query -d'TopN(speed_mph)'
{"error":
"executing: retrieving full counts: server error 500 Internal Server Error: 'PANIC: strconv.ParseInt: parsing \"9223372036854775808\": value out of range
goroutine 25657836 [running]:
runtime/debug.Stack(0xcc4b37c000, 0x1f4, 0xc57417be90)
	/usr/local/go/src/runtime/debug/stack.go:24 +0xa7
github.com/pilosa/pilosa/http.(*Handler).ServeHTTP.func1(0xe26100, 0xcc4b37c000, 0xc00020c780)
	/home/ubuntu/go/src/github.com/pilosa/pilosa/http/handler.go:286 +0x91
panic(0xc4ef40, 0xc57417be90)
	/usr/local/go/src/runtime/panic.go:513 +0x1b9
github.com/pilosa/pilosa/pql.(*Query).addNumVal(0xc84ffe8010, 0xc9bdde7cf5, 0x13)
	/home/ubuntu/go/src/github.com/pilosa/pilosa/pql/ast.go:150 +0x758
github.com/pilosa/pilosa/pql.(*PQL).Execute(0xc84ffe8010)
	/home/ubuntu/go/src/github.com/pilosa/pilosa/pql/pql.peg.go:474 +0x2ddc
github.com/pilosa/pilosa/pql.(*parser).Parse(0xc84ffe8000, 0xc84ffe8000, 0x0, 0x0)
	/home/ubuntu/go/src/github.com/pilosa/pilosa/pql/parser.go:62 +0x19f
github.com/pilosa/pilosa.(*API).Query(0xc0002aeab0, 0xe26e40, 0xc57417b8c0, 0xc138a1c740, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, ...)
	/home/ubuntu/go/src/github.com/pilosa/pilosa/api.go:110 +0x1b7
github.com/pilosa/pilosa/http.(*Handler).handlePostQuery(0xc00020c780, 0xe26100, 0xcc4b37c000, 0xcddb1d8400)
	/home/ubuntu/go/src/github.com/pilosa/pilosa/http/handler.go:458 +0x24b
github.com/pilosa/pilosa/http.(*Handler).handlePostQuery-fm(0xe26100, 0xcc4b37c000, 0xcddb1d8400)
	/home/ubuntu/go/src/github.com/pilosa/pilosa/http/handler.go:254 +0x48
net/http.HandlerFunc.ServeHTTP(0xc0002de4d0, 0xe26100, 0xcc4b37c000, 0xcddb1d8400)
	/usr/local/go/src/net/http/server.go:1964 +0x44
github.com/pilosa/pilosa/http.(*Handler).extractTracing.func1(0xe26100, 0xcc4b37c000, 0xcddb1d8300)
	/home/ubuntu/go/src/github.com/pilosa/pilosa/http/handler.go:231 +0x153
net/http.HandlerFunc.ServeHTTP(0xd20d491fe0, 0xe26100, 0xcc4b37c000, 0xcddb1d8300)
	/usr/local/go/src/net/http/server.go:1964 +0x44
github.com/pilosa/pilosa/http.(*Handler).queryArgValidator.func1(0xe26100, 0xcc4b37c000, 0xcddb1d8300)
	/home/ubuntu/go/src/github.com/pilosa/pilosa/http/handler.go:222 +0xcc
net/http.HandlerFunc.ServeHTTP(0xc990d08000, 0xe26100, 0xcc4b37c000, 0xcddb1d8300)
	/usr/local/go/src/net/http/server.go:1964 +0x44
github.com/pilosa/pilosa/vendor/github.com/gorilla/mux.(*Router).ServeHTTP(0xc000234af0, 0xe26100, 0xcc4b37c000, 0xcddb1d8300)
	/home/ubuntu/go/src/github.com/pilosa/pilosa/vendor/github.com/gorilla/mux/mux.go:162 +0xf1
github.com/pilosa/pilosa/vendor/github.com/gorilla/handlers.(*cors).ServeHTTP(0xc00031e2d0, 0xe26100, 0xcc4b37c000, 0xcddb1d8100)
	/home/ubuntu/go/src/github.com/pilosa/pilosa/vendor/github.com/gorilla/handlers/cors.go:51 +0xa32
github.com/pilosa/pilosa/http.(*Handler).ServeHTTP(0xc00020c780, 0xe26100, 0xcc4b37c000, 0xcddb1d8100)
	/home/ubuntu/go/src/github.com/pilosa/pilosa/http/handler.go:294 +0xde
net/http.serverHandler.ServeHTTP(0xc00020b380, 0xe26100, 0xcc4b37c000, 0xcddb1d8100)
	/usr/local/go/src/net/http/server.go:2741 +0xab
net/http.(*conn).serve(0xc000214000, 0xe26d80, 0xc82f1b3540)
	/usr/local/go/src/net/http/server.go:1847 +0x646
created by net/http.(*Server).Serve
	/usr/local/go/src/net/http/server.go:2851 +0x2f5
'"}

PLACEHOLDER: pdk ingest command

Currently we have a number of PDK commands for doing ingest, each of which tends to be built around a Source. Each one is pretty much the same except for the Source and its options. We should try to unify them somehow so that they can share logic for setting up the parser, mapper, Pilosa client, signal handling, etc.

taxi: fully download/open CSV files before importing

We're seeing slightly different results in Pilosa between imports of the taxi data. Errors have been observed while downloading the CSV files from S3, e.g. 2017/07/06 02:54:19 scan error on https://s3.amazonaws.com/nyc-tlc/trip+data/yellow_tripdata_2009-10.csv, err: unexpected EOF

In order to make imports repeatable and compare results between different versions of Pilosa, we should ensure that each CSV is downloaded completely before beginning to import it.

Ensure frames don't share go-pilosa StatusChannels

When providing a gopilosa.ImportStatusUpdate channel to pdk.SetupPilosa() via:

    statusChan := make(chan gopilosa.ImportStatusUpdate, 1000)

    indexer, err := pdk.SetupPilosa(m.PilosaHosts, m.Index, frames,
        gopilosa.OptImportStatusChannel(statusChan))

and when multiple frames are passed in, two problems occur:

  1. multiple frames share the same status channel
  2. upon import completion, all frames try to close the channel, which causes a panic

I'm still not sure if it's best to handle this in PDK or go-pilosa, but it seems we need a way to provide multiple channels (one per frame) for status updates. An alternative would be to modify go-pilosa such that the single status channel is aware of index/frame, and therefore supports multiple frames.

This needs more investigation.
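One shape the "multiple channels (one per frame)" option could take: give each frame its own channel and fan them into a single consumer channel, closing the merged channel exactly once after every sender is done, which avoids the double-close panic (illustrative only, with a placeholder status type):

```go
package main

import (
	"fmt"
	"sync"
)

// statusUpdate stands in for gopilosa.ImportStatusUpdate.
type statusUpdate struct {
	Frame string
	Count int
}

// mergeStatus fans several per-frame status channels into one. Each frame
// closes only its own channel; the merged channel is closed exactly once,
// after all inputs are drained.
func mergeStatus(inputs ...<-chan statusUpdate) <-chan statusUpdate {
	out := make(chan statusUpdate)
	var wg sync.WaitGroup
	for _, in := range inputs {
		wg.Add(1)
		go func(in <-chan statusUpdate) {
			defer wg.Done()
			for s := range in {
				out <- s
			}
		}(in)
	}
	go func() {
		wg.Wait()
		close(out)
	}()
	return out
}

func main() {
	a := make(chan statusUpdate, 1)
	b := make(chan statusUpdate, 1)
	a <- statusUpdate{"name", 10}
	close(a)
	b <- statusUpdate{"age", 20}
	close(b)
	total := 0
	for s := range mergeStatus(a, b) {
		total += s.Count
	}
	fmt.Println(total) // prints: 30
}
```

The alternative mentioned above (one index/frame-aware channel in go-pilosa) would move this fan-in inside the client instead.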
