Comments (9)
+1 for rewriting using Beam: I think there is much to gain, given that we will not have to manually implement many of the functionalities provided by Beam, e.g., windowing, parallel processing, and micro-batching. Let's go for whichever of the two approaches works best.
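For context, here is a minimal sketch (plain Java, no Beam dependency; the `Event` record and method names are made up for illustration) of the kind of fixed-window micro-batching logic we would otherwise have to hand-roll, and which Beam provides out of the box via `FixedWindows` plus `GroupByKey`:

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class ManualWindowing {
  // A timestamped change event (hypothetical stand-in for a DB update message).
  record Event(long timestampMillis, String payload) {}

  // Assign each event to a fixed, non-overlapping window keyed by window start.
  // Beam adds triggers, watermarks, and parallel execution on top of this.
  static Map<Long, List<String>> fixedWindows(List<Event> events, long windowMillis) {
    Map<Long, List<String>> windows = new HashMap<>();
    for (Event e : events) {
      long windowStart = (e.timestampMillis() / windowMillis) * windowMillis;
      windows.computeIfAbsent(windowStart, k -> new ArrayList<>()).add(e.payload());
    }
    return windows;
  }

  public static void main(String[] args) {
    List<Event> events = List.of(
        new Event(1_000, "obs-1"), new Event(9_000, "obs-2"), new Event(12_000, "obs-3"));
    // 10-second windows: obs-1 and obs-2 land in [0, 10000), obs-3 in [10000, 20000).
    System.out.println(fixedWindows(events, 10_000));
  }
}
```

This only shows window assignment; handling late data and firing windows correctly is exactly the part that is easy to get wrong by hand.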
I like the scalability that Kafka can provide! On the issue of architectural complexity, it is possible to run Kafka without ZooKeeper; however, it is going to take a few more years before that becomes production-ready: https://www.confluent.io/blog/removing-zookeeper-dependency-in-kafka/. On the other hand, the same stack (Kafka + ZooKeeper) can be significantly scaled down, depending on server specs; please take a look at this: https://kafka.blog/posts/scaling-down-apache-kafka/
from fhir-data-pipes.
Thanks @kimaina for the notes.
Re. benefits of Beam: Yes, Beam definitely brings many nice features, and our long-term plan should be to switch.
Re. scaled-down Kafka: My concern is not so much about resource usage as about complexity. When things go wrong (and they definitely will), debugging issues is potentially much more complicated, since you need to deal with/understand 5 pieces of infrastructure instead of 2.
BTW, do you care about the scalability of Kafka in the AMPATH use case? I am asking because you have a central installation, and my expectation is that converting data to FHIR would be the bottleneck, not the message passing/processing part.
We really don't care about scaling Kafka. I think one Kafka broker should be able to handle all incoming streams of data at any given point in time (whether peak or off-peak). During peak, we should expect an average of 80 records per second.
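A quick back-of-the-envelope check on that figure (plain arithmetic; the 80 records/second number is from the comment above):

```java
public class ThroughputEstimate {
  public static void main(String[] args) {
    long peakRecordsPerSecond = 80;     // expected peak rate from the comment above
    long secondsPerDay = 24 * 60 * 60;  // 86,400
    long recordsPerDay = peakRecordsPerSecond * secondsPerDay;
    // Even if the peak rate were sustained all day, that is ~6.9M records/day,
    // orders of magnitude below what a single Kafka broker can handle.
    System.out.println(recordsPerDay);  // 6912000
  }
}
```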
When things go wrong (and they definitely will), debugging issues is potentially much more complicated, since you need to deal with/understand 5 pieces of infrastructure instead of 2.
I see your point! Let's look at the other option. I am curious to know how long this would take to implement.
Implement a custom IO connector for Beam, as described here, which includes an UnboundedSource that wraps Debezium. The main drawback of this approach is its custom nature, and also the fact that it is not extendable to a distributed environment (for which Kafka is probably the right approach). The main benefit is its architectural simplicity.
@bashir2, have we concluded to take on the second option?
I could do some work on this.
Thanks @mozzy11 for volunteering. I don't think that we need to work on this for now. Both of these options are significant endeavors, and the reason I was considering them was #65, for which I have implemented a different (temporary?) solution for now.
I'll add more notes about pros/cons of the two options soon and make a suggestion (based on what I have learnt so far) to see what everyone thinks. But I would say this is not something to do for the MVP.
But I would say this is not something to do for the MVP.
oh sure , that makes sense @bashir2
Here are some more notes to have a record of my investigation/thoughts and to put this issue on the back burner for now:
Between the two approaches, i.e., using Kafka or embedded Debezium ("Debezium" for short), here is a list of benefits of each:
- Debezium provides simpler architecture; no need for dealing with Kafka and ZooKeeper.
- Kafka makes it easier to merge multiple OpenMRS instances into a single Data Warehouse (DW). It also makes it easier to import data from other non-OpenMRS sources.
- Kafka has standard support in Beam. But I have learnt that there are plans to implement a DebeziumIO in Beam too, which is what I was considering implementing myself (so we can just wait for that standard implementation instead).
- The Kafka-based approach has an extra pair of serialization/deserialization of DB update messages compared to Debezium.
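To make that last point concrete, here is a toy sketch (plain Java; the `encode`/`decode` pair is a hypothetical stand-in for whatever wire format Kafka messages would use) counting the extra serialization hops on each path:

```java
import java.nio.charset.StandardCharsets;

public class SerdeHops {
  static int serdeOps = 0;

  // Stand-ins for the wire-format encode/decode that the Kafka path adds.
  static byte[] encode(String event) {
    serdeOps++;
    return event.getBytes(StandardCharsets.UTF_8);
  }

  static String decode(byte[] bytes) {
    serdeOps++;
    return new String(bytes, StandardCharsets.UTF_8);
  }

  // Embedded Debezium: the change event object is handed to the pipeline directly.
  static String debeziumPath(String changeEvent) {
    return changeEvent;
  }

  // Kafka: the event is serialized to the broker and deserialized by the consumer
  // before the pipeline sees it -- one extra encode/decode pair per message.
  static String kafkaPath(String changeEvent) {
    return decode(encode(changeEvent));
  }

  public static void main(String[] args) {
    String event = "{\"op\":\"u\",\"table\":\"obs\"}";
    debeziumPath(event);           // 0 extra serde operations
    kafkaPath(event);              // 2 extra serde operations (encode + decode)
    System.out.println(serdeOps);  // 2
  }
}
```

At 80 records/second this overhead is negligible; it only matters as an architectural observation, not a performance one.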
The main motivation for considering this rewrite at this time was issue #65, which I resolved with a custom windowing implementation to deal with Parquet file issues in streaming mode. So we don't need to do the Beam rewrite for now. We should wait until DebeziumIO is available and until we have a better sense of whether consolidating multiple data sources into a single DW is a need. If it is, we should consider the Kafka-based approach; if it is not, we should go with the DebeziumIO-based approach, IMO.
This is obsolete now because of #952.