
etl's Introduction

Flow PHP - ETL

Flow PHP is an ETL (Extract, Transform, Load) framework built for precise data processing and transformation. Strong typing keeps data consistent and accurate throughout your workflows. Its memory footprint stays small because it is built on PHP generators, which process data iteratively instead of loading it all at once. Flow PHP also ships with many adapters, offering a wide range of extractors and loaders for different data sources and destinations. This makes it adaptable to large-scale data processing tasks and scalable web systems, whether you are transforming data or orchestrating complex data flows.
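
The memory claim rests on PHP generators. As a minimal illustration in plain PHP (this is not Flow's own API, and the function names `extractLines` and `toUpper` are hypothetical), each stage below pulls one item at a time, so the full dataset never has to sit in memory:

```php
<?php

declare(strict_types=1);

/**
 * Hypothetical extractor: yields one trimmed line at a time from a stream.
 *
 * @param resource $stream
 */
function extractLines($stream): Generator
{
    while (($line = fgets($stream)) !== false) {
        yield rtrim($line, "\r\n");
    }
}

/**
 * Hypothetical transformer: lazily upper-cases each item it receives.
 *
 * @param iterable<string> $items
 */
function toUpper(iterable $items): Generator
{
    foreach ($items as $item) {
        yield strtoupper($item);
    }
}

// An in-memory stream stands in for a large file.
$stream = fopen('php://memory', 'w+');
fwrite($stream, "alpha\nbeta\ngamma\n");
rewind($stream);

$result = [];
foreach (toUpper(extractLines($stream)) as $value) {
    $result[] = $value; // a real loader would write each value out instead
}
fclose($stream);
```

Only the line currently being processed is held by the pipeline; the `$result` array exists here purely so the output can be inspected.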

Important

This repository is a subtree split from our monorepo. If you'd like to contribute, please visit our main monorepo flow-php/flow.

etl's People

Contributors

aeon-automation, dawidsajdak, dependabot[bot], norbertmwk, norberttech, peter279k, stloyd, szepeviktor, tomaszhanc, wiktor6


etl's Issues

MySQL DB-Adapter for etl-adapter-doctrine

There is currently no support for MySQL databases. As most infrastructures provide MySQL/MariaDB instead of PostgreSQL, it would be a good addition to support MySQL-based databases as well.

CSV's having "" removed

Hi there,

I'm using .csv files with enclosed headers and values, like "Header", "Header2", etc.
But for some reason all the "" get stripped off. Any ideas why this happens?
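
One likely explanation, offered as an assumption rather than a diagnosis of Flow's CSV extractor: in CSV the double quotes are enclosure characters, part of the file syntax rather than the data, so a conforming parser drops them on read. PHP's built-in `str_getcsv()` behaves this way, and literal quotes survive only when they are escaped by doubling:

```php
<?php

declare(strict_types=1);

// Enclosure quotes are syntax: the parser removes them.
$plain = str_getcsv('"Header","Header2"', ',', '"', '');

// Doubled quotes inside an enclosed field are data: one literal quote is kept
// on each side of the first value.
$quoted = str_getcsv('"""Header""","Header2"', ',', '"', '');
```

If the quotes must be present in the parsed values, they would need to be doubled in the source file or added back in a transformation step.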

dynamically add new Entry to row

I have a requirement where I need to add a new entry after a transformation. The pseudo-code would be something like this:

    public function transform(Rows $rows): Rows
    {
        foreach ($rows as $row) {
            $entries = $row->entries();

            // This will return a new entries object
            $newEntries = $entries->add(new Row\Entry\StringEntry('name', 'Blabla'));

            // The newEntries variable cannot be used within the Rows object. It would be great to have the option
            // to dynamically add a new entry to an existing row.
        }

        return $rows;
    }

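It seems this case is not supported at the moment. As a workaround, I create a new Rows object, something like this:

    public function transform(Rows $rows): Rows
    {
        $newRows = [];

        foreach ($rows as $row) {
            // Build a replacement row instead of modifying the existing one.
            $newRows[] = Row::create(
                Entry::string('name', '')
            );
        }

        // Return all replacement rows, not only the first one.
        return new Rows(...$newRows);
    }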

flow-php/etl-adapter-excel

https://phpspreadsheet.readthedocs.io/en/latest/ brings huge potential to Flow ETL for both reading and writing data.
If anyone is interested in taking this on, please let me know so I can prepare the repository or help design the adapter.

Even if this adapter initially supported only reading from and writing to Xlsx, it would be enough.

  • PhpSpreadsheetXlsxExtractor
  • PhpSpreadsheetXlsxLoader

General understanding

I want to evaluate the following case:

  1. Read data from one system and enrich it with data from another system. One system returns a json while the other one returns some strings.
  2. The data requires heavy transformation. Basically, some algorithms need to be applied, and the result is a new JSON payload for an external system.
  3. Afterwards the data needs to be grouped in bulks of for example 100 items and then published to another system.

I currently struggle with the transformation part. I would build a class that runs all the transformations, but I do not understand how this could be achieved with this library:

$flow
    ->read($externalSystem1)
    ->transform() // How would I pass a custom transformer? It would be a class.
    ->write(To::memory($array))
    ->run();
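
Independently of Flow's API, the three steps described above can be sketched with plain PHP generators. Every name here (`enrich`, `transformItem`, `inBatches`, and the sample data) is hypothetical; the sketch only shows per-item transformation logic being applied and the results grouped into fixed-size bulks. A bulk size of 2 is used instead of 100 to keep the sample small.

```php
<?php

declare(strict_types=1);

/** Step 1: merge each record with extra data looked up from a second source. */
function enrich(iterable $records, array $lookup): Generator
{
    foreach ($records as $record) {
        yield $record + ['label' => $lookup[$record['id']] ?? null];
    }
}

/** Step 2: placeholder for the heavy transformation; returns a JSON payload. */
function transformItem(array $record): string
{
    return json_encode(
        ['id' => $record['id'], 'label' => strtoupper((string) $record['label'])],
        JSON_THROW_ON_ERROR
    );
}

/** Step 3: group items into bulks of at most $size. */
function inBatches(iterable $items, int $size): Generator
{
    $batch = [];
    foreach ($items as $item) {
        $batch[] = $item;
        if (count($batch) === $size) {
            yield $batch;
            $batch = [];
        }
    }
    if ($batch !== []) {
        yield $batch; // final, possibly smaller bulk
    }
}

$records = [['id' => 1], ['id' => 2], ['id' => 3]];
$lookup  = [1 => 'a', 2 => 'b', 3 => 'c'];

// Chain the steps lazily: enrich, then transform each record.
$payloads = (function () use ($records, $lookup): Generator {
    foreach (enrich($records, $lookup) as $record) {
        yield transformItem($record);
    }
})();

$bulks = iterator_to_array(inBatches($payloads, 2), false);
```

Each bulk in `$bulks` could then be published to the target system in a single call.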

Remove webmozart/assert dependency

The Webmozart Assert dependency was used here in the early POC version. Now that the library is open, we should reduce the number of dependencies as much as possible.

Ascii-Table will not work if Rows only contains one element

I was able to reproduce this error: If the rows-entries only contain one single element, then an exception will be thrown.

yield new Rows(
    Row::create(Entry::integer('id', 1), Entry::string('name', 'EN'))
);

Exception

max(): Argument #1 ($value) must be of type array, int given

In AbstractErrorHandler.php line 38:

  [Exception]
  TypeError: max(): Argument #1 ($value) must be of type array, int given in /var/www/share/dev/htdocs/vendor/flow-php/etl/src/Flow/ETL/Formatter/ASCII/ASCIITable.php:176
  Stack trace:
  #0 /var/www/share/dev/htdocs/vendor/flow-php/etl/src/Flow/ETL/Formatter/ASCII/ASCIITable.php(176): max(1)
  #1 /var/www/share/dev/htdocs/vendor/flow-php/etl/src/Flow/ETL/Formatter/ASCII/ASCIITable.php(89): Flow\ETL\Formatter\ASCII\ASCIITable->getColWidths(Array, 20)
  #2 /var/www/share/dev/htdocs/vendor/flow-php/etl/src/Flow/ETL/Formatter/AsciiTableFormatter.php(31): Flow\ETL\Formatter\ASCII\ASCIITable->makeTable(Array, 20)
  #3 /var/www/share/dev/htdocs/vendor/flow-php/etl/src/Flow/ETL/Loader/StreamLoader.php(89): Flow\ETL\Formatter\AsciiTableFormatter->format(Object(Flow\ETL\Rows), 20)
  #4 /var/www/share/dev/htdocs/vendor/flow-php/etl/src/Flow/ETL/Pipeline/SynchronousPipeline.php(66): Flow\ETL\Loader\StreamLoader->load(Object(Flow\ETL\Rows))
  #5 /var/www/share/dev/htdocs/vendor/flow-php/etl/src/Flow/ETL/DataFrame.php(260): Flow\ETL\Pipeline\SynchronousPipeline->process(Object(Flow\ETL\Config))
  #6 /var/www/share/dev/htdocs/bundle/Mothership/YadiBundle/src/Command/AbstractFlowCommand.php(138): Flow\ETL\DataFrame->run()

Complete example:

$flow = new Flow();
$countries = new Rows(
    Row::create(Entry::integer('id', 1), Entry::string('country', 'EN')),
);

$flow->read(From::rows($countries))
    ->write(To::stdout())
    ->run();
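
The trace shows `max(1)`, that is, `max()` called with a single integer, which PHP 8 rejects with a `TypeError`. One way this can happen is spreading a one-element array into `max()`; this is an assumption, since the body of `ASCIITable::getColWidths()` is not shown here. The hypothetical helpers below reproduce the error in isolation and show that passing the array itself avoids it:

```php
<?php

declare(strict_types=1);

/** Spreads the list into max(); fails when the list has exactly one element. */
function widestBySpread(array $lengths): int
{
    return max(...$lengths);
}

/** Passes the list itself; works for any list, returning 0 when it is empty. */
function widestByArray(array $lengths): int
{
    return $lengths === [] ? 0 : max($lengths);
}

$error = null;
try {
    widestBySpread([1]); // equivalent to max(1)
} catch (TypeError $e) {
    $error = $e->getMessage();
}

$single = widestByArray([1]);
$many   = widestByArray([2, 7, 4]);
```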

CSV::from_file does not work with remote URLs

I want to be able to extract documents stored on S3, so I have registered the stream wrapper:

$client = $sdk->createS3();
$client->registerStreamWrapper();

And then I attempt to load the file:

(new Flow())
    ->extract(CSV::from_file(
        file_name: 's3://my-bucket-name/test.csv',
        header_offset: 0,
    ))
    // ...
    ->run();

But the following exception is thrown:

 [Flow\ETL\Exception\InvalidArgumentException]                                                    
  File s3://my-bucket-name/test.csv not found.

I think the check for file existence should be modified to allow remote URLs:

if (! (str_contains($file_name, needle: '://') || is_file($file_name))) { /* throw exception */ }
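
A slightly stricter variant of that check, sketched under the assumption that any registered stream wrapper should be accepted: instead of allowing every string that contains `://`, look the scheme up in `stream_get_wrappers()`. The function name `isSupportedSource` is hypothetical and not part of Flow.

```php
<?php

declare(strict_types=1);

/**
 * Hypothetical check: accept a local file, or any path whose scheme
 * belongs to a stream wrapper registered in this PHP process.
 */
function isSupportedSource(string $path): bool
{
    if (preg_match('#^([a-z][a-z0-9+.\-]*)://#i', $path, $matches) === 1) {
        return in_array(strtolower($matches[1]), stream_get_wrappers(), true);
    }

    return is_file($path);
}

$localFile = tempnam(sys_get_temp_dir(), 'csv');

$local      = isSupportedSource($localFile);             // existing local file
$registered = isSupportedSource('php://memory');         // "php" wrapper is built in
$unknown    = isSupportedSource('nosuchscheme://x.csv'); // no wrapper registered

unlink($localFile);
```

With this approach an `s3://` path would pass only after `registerStreamWrapper()` has been called, assuming the SDK registers the wrapper under the `s3` scheme. The check does not confirm that the remote object exists; that would surface when the stream is opened.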

OR add another method to handle resources, eg:

$extractor = CSV::from_resource(
    resource: fopen('s3://my-bucket-name/test.csv', 'r'),
    header_offset: 0,
);

RFC | Replace "in-house" cache system with PSR-6

In the PHP community, PSR-6 defines common interfaces for cache implementations.

Its key concepts can be described as:

Item
A single unit of information stored as a key/value pair, where the key is the unique identifier of the information and the value is its contents.

Pool
A logical repository of cache items. All cache operations (saving items, looking for items, etc.) are performed through the pool. Applications can define as many pools as needed.

Adapter
It implements the actual caching mechanism to store the information in the filesystem, in a database, etc. The component provides several ready-to-use adapters for common caching backends (Redis, APCu, PDO, etc.)

Implementing these interfaces would allow the use of more adapters and reduce the maintenance burden of the ETL library while improving the DX.
