Comments (9)
+1 for rewriting using Beam: I think there is much to gain, given that we will not have to manually implement many of the functionalities provided by Beam, e.g., windowing, parallel processing, and micro-batching. Let's go for whichever of the two approaches works best.
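For context, here is a minimal sketch (plain Java, no Beam dependency; the `Event` record and method names are made up for illustration) of the kind of fixed-window micro-batching logic we would otherwise have to hand-roll, and which Beam provides out of the box via `FixedWindows` plus `GroupByKey`:

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class ManualWindowing {
  // A timestamped change event (hypothetical stand-in for a DB update message).
  record Event(long timestampMillis, String payload) {}

  // Assign each event to a fixed, non-overlapping window keyed by window start.
  // Beam adds triggers, watermarks, and parallel execution on top of this.
  static Map<Long, List<String>> fixedWindows(List<Event> events, long windowMillis) {
    Map<Long, List<String>> windows = new HashMap<>();
    for (Event e : events) {
      long windowStart = (e.timestampMillis() / windowMillis) * windowMillis;
      windows.computeIfAbsent(windowStart, k -> new ArrayList<>()).add(e.payload());
    }
    return windows;
  }

  public static void main(String[] args) {
    List<Event> events = List.of(
        new Event(1_000, "obs-1"), new Event(9_000, "obs-2"), new Event(12_000, "obs-3"));
    // 10-second windows: obs-1 and obs-2 land in [0, 10000), obs-3 in [10000, 20000).
    System.out.println(fixedWindows(events, 10_000));
  }
}
```

This only shows window assignment; handling late data and firing windows correctly is exactly the part that is easy to get wrong by hand.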
I like the scalability that Kafka can provide! On the issue of architectural complexity, it is possible to run Kafka without ZooKeeper; however, it is going to take a few more years before that becomes production-ready: https://www.confluent.io/blog/removing-zookeeper-dependency-in-kafka/. On the other hand, the same stack (Kafka + ZooKeeper) can be significantly scaled down, depending on server specs; please take a look at this: https://kafka.blog/posts/scaling-down-apache-kafka/
from fhir-data-pipes.
Thanks @kimaina for the notes.
Re. benefits of Beam: Yes, Beam definitely brings many nice features, and our long-term plan should be to switch.
Re. scaled-down Kafka: My concern is not so much about resource usage as about complexity. When things go wrong (and they definitely will), debugging issues is potentially much more complicated, since you need to deal with/understand 5 pieces of infrastructure instead of 2.
BTW, do you care about the scalability of Kafka in the AMPATH use case? I am asking because you have a central installation, and my expectation is that converting data to FHIR would be the bottleneck, not the message passing/processing part.
We really don't care about scaling Kafka. I think one Kafka broker should be able to handle all incoming streams of data at any given point in time (whether peak or off-peak). During peak, we should expect an average of 80 records per second.
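A quick back-of-the-envelope check on that figure (plain arithmetic; the 80 records/second number is from the comment above):

```java
public class ThroughputEstimate {
  public static void main(String[] args) {
    long peakRecordsPerSecond = 80;     // expected peak rate from the comment above
    long secondsPerDay = 24 * 60 * 60;  // 86,400
    long recordsPerDay = peakRecordsPerSecond * secondsPerDay;
    // Even if the peak rate were sustained all day, that is ~6.9M records/day,
    // orders of magnitude below what a single Kafka broker can handle.
    System.out.println(recordsPerDay);  // 6912000
  }
}
```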
When things go wrong (and they definitely will), debugging issues is potentially much more complicated, since you need to deal with/understand 5 pieces of infrastructure instead of 2.
I see your point! Let's look at the other option. I am curious to know how long this would take to implement.
Implement a custom IO connector for Beam, as described here, which includes an UnboundedSource that wraps Debezium. The main drawback of this approach is its custom nature, and also the fact that it is not extendable to a distributed environment (for which Kafka is probably the right approach). The main benefit is its architectural simplicity.
@bashir2, have we concluded to take on the second option?
I could do some work on this.
Thanks @mozzy11 for volunteering. I don't think that we need to work on this for now. Both of these options are significant endeavors, and the reason I was considering them was #65, for which I have implemented a different (temporary?) solution for now.
I'll add more notes about pros/cons of the two options soon and make a suggestion (based on what I have learnt so far) to see what everyone thinks. But I would say this is not something to do for the MVP.
But I would say this is not something to do for the MVP.
oh sure , that makes sense @bashir2
Here are some more notes to have a record of my investigation/thoughts and to put this issue on the back burner for now:
Between the two approaches, i.e., using Kafka or embedded Debezium ("Debezium" for short), here is a list of benefits of each:
- Debezium provides simpler architecture; no need for dealing with Kafka and ZooKeeper.
- Kafka makes it easier to merge multiple OpenMRS instances into a single Data Warehouse (DW). It also makes it easier to import data from other non-OpenMRS sources.
- Kafka has standard support in Beam. But I have learnt that there are plans to implement a DebeziumIO in Beam too, which is what I was considering implementing myself (so we can just wait for that standard implementation instead).
- The Kafka-based approach has an extra pair of serialization/deserialization of DB update messages compared to Debezium.
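To make that last point concrete, here is a toy sketch (plain Java; the `encode`/`decode` pair is a hypothetical stand-in for whatever wire format Kafka messages would use) counting the extra serialization hops on each path:

```java
import java.nio.charset.StandardCharsets;

public class SerdeHops {
  static int serdeOps = 0;

  // Stand-ins for the wire-format encode/decode that the Kafka path adds.
  static byte[] encode(String event) {
    serdeOps++;
    return event.getBytes(StandardCharsets.UTF_8);
  }

  static String decode(byte[] bytes) {
    serdeOps++;
    return new String(bytes, StandardCharsets.UTF_8);
  }

  // Embedded Debezium: the change event object is handed to the pipeline directly.
  static String debeziumPath(String changeEvent) {
    return changeEvent;
  }

  // Kafka: the event is serialized to the broker and deserialized by the consumer
  // before the pipeline sees it -- one extra encode/decode pair per message.
  static String kafkaPath(String changeEvent) {
    return decode(encode(changeEvent));
  }

  public static void main(String[] args) {
    String event = "{\"op\":\"u\",\"table\":\"obs\"}";
    debeziumPath(event);           // 0 extra serde operations
    kafkaPath(event);              // 2 extra serde operations (encode + decode)
    System.out.println(serdeOps);  // 2
  }
}
```

At 80 records/second this overhead is negligible; it only matters as an architectural observation, not a performance one.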
The main motivation for considering this rewrite at this time was issue #65, which I resolved with a custom windowing implementation to deal with Parquet file issues in streaming mode. So we don't need to do the Beam rewrite for now. We should wait until DebeziumIO is available and until we have a better sense of whether consolidating multiple data sources into a single DW is a need. If it is, we should consider the Kafka-based approach; if it is not, we should go with the DebeziumIO-based approach, IMO.
This is obsolete now because of #952.