
basemodelai / cleora


Cleora AI is a general-purpose model for efficient, scalable learning of stable and inductive entity embeddings for heterogeneous relational data.

Home Page: https://cleora.ai

License: Other

Languages: Rust 17.99%, Jupyter Notebook 82.01%
Topics: ai, graphs, synerise, embeddings, ml, machine-learning, pytorch-biggraph, deepwalk, cleora-embeddings, entity, hypergraphs, inductive-entity-embeddings, datasets

Cleora's Introduction

Cleora logo

Achievements

1️⃣st place at SIGIR eCom Challenge 2020

2️⃣nd place and Best Paper Award at WSDM Booking.com Challenge 2021

2️⃣nd place at Twitter RecSys Challenge 2021

3️⃣rd place at KDD Cup 2021

Cleora

Cleora is a genus of moths in the family Geometridae. Their scientific name derives from the Ancient Greek geo γῆ or γαῖα "the earth", and metron μέτρον "measure" in reference to the way their larvae, or "inchworms", appear to "measure the earth" as they move along in a looping fashion.

Cleora is a general-purpose model for efficient, scalable learning of stable and inductive entity embeddings for heterogeneous relational data.

Read the whitepaper "Cleora: A Simple, Strong and Scalable Graph Embedding Scheme"

Cleora embeds entities in n-dimensional spherical spaces using extremely fast, stable, iterative random projections, which allows for unparalleled performance and scalability.
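For intuition, here is a minimal NumPy sketch of the iteration — an illustration under stated assumptions, not the actual Rust implementation. It assumes the random-walk matrix M_ab = e_ab / deg(v_a) from the whitepaper, and a fixed seed stands in for Cleora's hash-based deterministic initialization:

import numpy as np

def embed(edges, num_nodes, dim=8, iterations=4, seed=0):
    # Random-walk transition matrix M_ab = e_ab / deg(v_a); assumes
    # every node appears in at least one (undirected) edge.
    M = np.zeros((num_nodes, num_nodes))
    for a, b in edges:
        M[a, b] += 1.0
        M[b, a] += 1.0
    M /= M.sum(axis=1, keepdims=True)

    # Deterministic starting vectors with entries in {-1, 1}; Cleora
    # derives these from entity hashes, a fixed seed stands in here.
    rng = np.random.default_rng(seed)
    T = rng.choice([-1.0, 1.0], size=(num_nodes, dim))

    for _ in range(iterations):
        T = M @ T                                      # propagate over edges
        T /= np.linalg.norm(T, axis=1, keepdims=True)  # back onto the unit sphere
    return T

print(embed([(0, 1), (1, 2), (2, 3)], num_nodes=4))

Each iteration smooths an entity's vector toward those of its neighbors, which is why the small, fixed iteration count (the -n flag seen in the examples below) controls how far similarity spreads through the graph.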

Types of data that can be embedded include, for example:

  • heterogeneous undirected graphs
  • heterogeneous undirected hypergraphs
  • text and other categorical array data
  • any combination of the above

Key competitive advantages of Cleora:

  • more than 197x faster than DeepWalk
  • ~4x-8x faster than PyTorch-BigGraph (depending on the use case)
  • star expansion, clique expansion, and no expansion support for hypergraphs
  • result quality outperforming or competitive with other embedding frameworks such as PyTorch-BigGraph, GOSH, DeepWalk, and LINE
  • can embed extremely large graphs & hypergraphs on a single machine

Embedding times - example:

Algorithm          FB dataset   RoadNet dataset   LiveJournal dataset
Cleora             00:00:43 h   00:21:59 h        01:31:42 h
PyTorch-BigGraph   00:04:33 h   00:31:11 h        07:10:00 h

Link Prediction results - example:

                   FB dataset         RoadNet dataset    LiveJournal dataset
Algorithm          MRR    HitRate@10  MRR    HitRate@10  MRR    HitRate@10
Cleora             0.072  0.172       0.929  0.942       0.586  0.627
PyTorch-BigGraph   0.035  0.072       0.850  0.866       0.565  0.672

Cleora design principles

Cleora is built as a multi-purpose "just embed it" tool, suitable for many different data types and formats.

Cleora ingests a relational table of rows representing a typed and undirected heterogeneous hypergraph, which can contain multiple:

  • typed categorical columns
  • typed categorical array columns

For example, a relational table representing shopping baskets may have the following columns:

user <\t> product <\t> store

With the input file containing values:

user_id <\t> product_id product_id product_id <\t> store_id

Every column has a type, which determines whether the identifier spaces of different columns are shared or distinct. Two columns may share a type, which is the case for homogeneous graphs:

user <\t> user

Based on the column format specification, Cleora performs:

  • Star decomposition of hyper-edges
  • Creation of pairwise graphs for all pairs of entity types
  • Embedding of each graph

The final output of Cleora consists of multiple files for each (undirected) pair of entity types in the table.
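To make the pairing concrete, the sketch below enumerates the pairwise graphs produced for the shopping-basket example above; the file-naming scheme shown is an assumption for illustration, not the documented format:

from itertools import combinations

# One pairwise graph (and one embedding file) per unordered pair
# of entity types from the column specification.
columns = ["user", "product", "store"]
for a, b in combinations(columns, 2):
    print(f"emb__{a}__{b}.out")  # hypothetical file name, for illustration
# -> user-product, user-store, product-store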

Those embeddings can then be utilized in a novel way thanks to their dim-wise independence property, which is described further below.

Key technical features of Cleora embeddings

The embeddings produced by Cleora are different from those produced by Node2vec, Word2vec, DeepWalk or other systems in this class by a number of key properties:

  • efficiency - Cleora is two orders of magnitude faster than Node2Vec or DeepWalk
  • inductivity - as Cleora embeddings of an entity are defined only by interactions with other entities, vectors for new entities can be computed on-the-fly (see the sketch after this list)
  • updatability - refreshing a Cleora embedding for an entity is a very fast operation allowing for real-time updates without retraining
  • stability - all starting vectors for entities are deterministic, which means that Cleora embeddings on similar datasets will end up being similar. Methods like Word2vec, Node2vec or DeepWalk return different results with every run.
  • cross-dataset compositionality - thanks to stability of Cleora embeddings, embeddings of the same entity on multiple datasets can be combined by averaging, yielding meaningful vectors
  • dim-wise independence - thanks to the process producing Cleora embeddings, every dimension is independent of the others. This property allows for an efficient, low-parameter method of combining multi-view embeddings with Conv1d layers.
  • extreme parallelism and performance - Cleora is written in Rust utilizing thread-level parallelism for all calculations except input file loading. In practice this means that the embedding process is often faster than loading the input data.
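As a concrete illustration of the inductivity property, here is a minimal sketch of the assumed recipe, matching the one suggested in issue #21 quoted further down this page: embed a new entity as the L2-normalized average of the embeddings of the entities it interacts with.

import numpy as np

def embed_new_entity(neighbor_vectors):
    # On-the-fly vector for an unseen entity: average its neighbors'
    # existing embeddings, then project back onto the unit sphere.
    v = np.mean(neighbor_vectors, axis=0)
    return v / np.linalg.norm(v)

# e.g. a new node connected to two already-embedded entities:
print(embed_new_entity([np.array([0.6, 0.8]), np.array([1.0, 0.0])]))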

Key usability features of Cleora embeddings

The technical properties described above imply good production-readiness of Cleora, which from the end-user perspective can be summarized as follows:

  • heterogeneous relational tables can be embedded without any artificial data pre-processing
  • mixed interaction + text datasets can be embedded with ease
  • cold start problem for new entities is non-existent
  • real-time updates of the embeddings do not require any separate solutions
  • multi-view embeddings work out of the box
  • temporal, incremental embeddings are stable out of the box, with no need for re-alignment, rotations or other methods
  • extremely large datasets are supported and can be embedded within seconds / minutes

Documentation

More information can be found in the full documentation.

Cleora Enterprise

Cleora Enterprise is now available for selected customers. Key improvements over this open-source version:

  • performance optimizations: 10x faster embedding times
  • latest research: significantly improved embedding quality
  • new feature: item attributes support
  • new feature: multimodal fusion of multiple graphs, text and image embeddings
  • new feature: compressed embeddings in various formats (spherical, hyperbolic, sparse)

For details contact us at [email protected]

Cite

Please cite our paper (and the respective papers of the methods used) if you use this code in your own work:

@article{DBLP:journals/corr/abs-2102-02302,
  author    = {Barbara Rychalska and Piotr Babel and Konrad Goluchowski and Andrzej Michalowski and Jacek Dabrowski},
  title     = {Cleora: {A} Simple, Strong and Scalable Graph Embedding Scheme},
  journal   = {CoRR},
  volume    = {abs/2102.02302},
  year      = {2021}
}

License

Synerise Cleora is MIT licensed, as found in the LICENSE file.

How to Contribute

You are welcome to contribute to this open-source toolbox. Detailed instructions will be released soon as issues.

Cleora's People

Contributors

ajk4, jaroslawkrolewski, kacperlukawski, kodieg, kosciej, michaelmior, mkrzywda, piobab, ponythewhite


Cleora's Issues

--seed not working

In the latest version, -s or --seed is not working. The error is:
error: Found argument '--seed' which wasn't expected, or isn't valid in this context

edges with weight

Hi,
I am not able to find two things:

  1. How do I provide data where edges carry weights, so that the edge weights are taken into account in the embedding calculation?
  2. How do I use the embeddings for cold-start problems?

[paper] definition of M vs Fig. 2.

Hi!
Apologies if this is not the best place to ask, but I've been reading your paper and I hope you can clarify this little thing which is confusing to me:

  1. Section III.B offers the definition of the matrix M: M_ab = e_ab / deg(v_a).
  2. Figure 2 presents a toy example that includes explicit M matrices.

Are those latter matrices in 2. meant to follow the definition from 1.?
If so, I must be misinterpreting something about this definition. Based on it, I was expecting the rows of the M matrices in Fig. 2 to be normalized. Can I ask, for example, what the value of deg(v_a) is for the third row (i.e. v_a = p2_hash, if I understand correctly) of M_3 in Fig. 2? It seems that the entries in this row were constructed as follows:

  • taking v_b=u1_hash, we have e_ab=2 and deg(v_a)=4, and the entry in M_3 reads 1/2;
  • and taking v_b=u2_hash, we have e_ab=1 and deg(v_a)=3, and the entry in M_3 reads 1/3;
    so it seems that deg(v_a) is not just a function of v_a, which - for me - is not captured in the presentation in Section III.

What am I missing? Thanks in advance for any hints! :)

Calculating embeddings for new nodes after training

I am trying to run Cleora on a simple dataset. My TSV file is simple and follows the format of "leads attributes"

l1 <\t> a1
l2 <\t> a1
l1 <\t> a2
l3 <\t> a2

Leads are connected to some attributes.

I have Set A, which is used to train embeddings for all nodes (leads and attributes) in the set.

For new nodes with the same "leads attributes" format in Set B, I calculate embeddings using the following two methods. Then I use the embeddings of all "leads" nodes of Set A to train an XGBoost model and predict on the "leads" nodes of Set B to calculate the AUC.

Method 1

I jointly train embeddings by combining Set A and Set B and take the embeddings for all "leads" nodes. On Set B, the AUC of the XGBoost model (trained on the "leads" embeddings of Set A) is ~0.8.

Method 2

I used the method suggested in closed issue #21: train the embeddings only on Set A; then, for each "leads" node of Set B, extract the embeddings of all the attributes that lead is connected to, average them, and apply L2 normalization. With the XGBoost model trained on the Set A "leads" embeddings, I predict on the "leads" embeddings of Set B. The AUC drops to 0.65.

Is there any reason for the drop in AUC with Method 2, which was the suggested way to calculate embeddings for incoming nodes on the fly? The alternative is Method 1, where I have to retrain the graph, including the new nodes, every time.

Thanks

Question about reproducing Dunnhumby product complements/substitutes from the white paper

Hello,
Thanks for the cool project!

I was trying to reproduce the results of using Cleora to identify product complements/substitutes in the Dunnhumby Complete Journey dataset shown in the white paper, but had a question about how the transaction data should be formatted and how the column types should be specified when running the Cleora binary:

I've formatted the transaction data by "basket" such that each row contains a user and a sequence of product_ids:

user_id <\t> product_id product_id product_id
user_id <\t> product_id product_id
user_id <\t> product_id product_id product_id product_id

and so on. I then ran Cleora as follows.

For product complements:
./cleora-v1.1.1-x86_64-unknown-linux-gnu -i ./dunhumby_data/dh_clique.txt --columns="transient::user complex::products" -d 1024 -n 1

and for product substitutes:
./cleora-v1.1.1-x86_64-unknown-linux-gnu -i ./dunhumby_data/dh_clique.txt --columns="transient::user complex::products" -d 1024 -n 4

After this, comparing cosine similarities and looking at the top 5 most similar "complements" and "substitutes" for one of the products from the white paper ("SOUP RAMEN NOODLES/RAMEN CUPS 3 OZ"), the complements and substitutes are not really similar to what is reported there. E.g., the top 3 substitutes are

BUTTER BUTTER 8OZ 
CODIMENTS/SAUCES BBQ SAUCE 18OZ
STATIONERY & SCHOOL SUPPLIES TAPE & MAILING PRODUCTS 60CT

instead of

SOUP RAMEN NOODLES/RAMEN CUPS 3 OZ
SOUP RAMEN NOODLES/RAMEN CUPS 3 OZ
SOUP RAMEN NOODLES/RAMEN CUPS 3 OZ

as reported in the white paper.

I guess I've either not set the data up correctly or have specified the column types incorrectly. Is there code somewhere that describes using Cleora for this type of problem? Otherwise, any hints would be greatly appreciated. Thanks!

Building cleora on Ubuntu 20.04 needs clang-11

Hello all,
Great project! The README is missing a dependency: clang-11 (required by the simd-json crate). On Ubuntu 20.04 I had to install it from the LLVM repo:

export LLVM_VERSION=12
sudo add-apt-repository "deb http://apt.llvm.org/focal/ llvm-toolchain-focal-12 main"
sudo apt-get install -y clang-$LLVM_VERSION lldb-$LLVM_VERSION lld-$LLVM_VERSION clangd-$LLVM_VERSION

Calculation of embedding for new entity on-the-fly

This module claims that, as Cleora embeddings of an entity are defined only by interactions with other entities, vectors for new entities can be computed on-the-fly.

Could you please help me with how to calculate embeddings for a new node? It would be really helpful if you could share a Jupyter notebook on this as well.

Interesting phenomenon

Hello Cleora team, a very interesting and clever solution for creating embeddings. However, I noticed a behavior that I cannot explain. When creating embeddings from a single column (one node type) that contains both the start and the end node of each edge (a simple edge list), nodes that are further away from each other generate vectors that are closer to each other. E.g.: (a) -> (b) -> (c) -> (d)

as Edge List:
a b
b c
c d

The vectors a and d are closer together than the vectors a and b (by cosine similarity).

Volume: approx. 5.5 million nodes and 41 million edges

I created the embeddings with the following call:

--columns='complex::reflexive::nodes' -d 128 -i 'node.edgelist' -n 4

As I understand the pattern, the reflexive relationship on a single-type (complex) column should cover an edge list with one category of node types. What am I doing wrong with the configuration, or is this an issue?

A short tip would be very appreciated.

Best

Remove binaries from code repository and instead use Github Releases

It's usually not good practice to keep binaries in a git source repository. It bloats the repository size forever, which in turn causes a lot more data to be transferred when the repository is cloned. GitHub provides a convenient solution for keeping and managing binary releases.

Performance on homogeneous graphs

Hi! I was trying to measure the performance of Cleora on the homogeneous graphs used in the original YouTube paper, with no success. Specifically, I was trying to evaluate BlogCatalog, as it is the smallest graph.

I used the ./target/release/cleora -c "node node" configuration to get a single embedding per node. This way, the performance is close to random; however, on a different dataset with a far simpler structure (the CoCit dataset from the VERSE paper), I was able to obtain poor but not random performance: 0.21 Macro-F1 avg. for 5% labelled nodes, compared with ~0.30 for DW/VERSE.

Could you clarify whether I am doing something horribly wrong, or does the method need some other hyperparameters to run properly? I imagine that adding a homogeneous example used across the literature would strengthen the repository greatly. :)

directed graph

How can I force it to learn the embedding of a directed graph?

Usage with EMDE?

Cleora is used by Synerise for internal purposes, working together with Terrarium DB to process billions of datapoints in real time and solve multi-modal challenges involving graph data. The Cleora algorithm is flexible and can be applied to different segments of the market, inter alia retail, banking, and telco behavioral data at scale (billions of entities, trillions of interactions). Cleora also provides the input embeddings for EMDE (Efficient Manifold Density Estimation), which was likewise created by the Synerise team.

Is there an example of Cleora being used with EMDE somewhere?

The abstract of the EMDE paper says

We release the source code and our own real-world dataset of e-commerce product purchases, with special focus on modeling of the item cold-start problem.

But the code doesn't appear to be public anywhere. I've been trying to replicate the method myself, but I'd like to check a few details I'm not sure about.

Thanks!

/lib64/libc.so.6: version `GLIBC_2.18' not found

I am trying to run Cleora on a simple dataset. My TSV file is simple and follows the format of "leads attributes"

l1 <\t> a1
l2 <\t> a1
l1 <\t> a2
l3 <\t> a2

I am trying to run an embedding task with the following command, taken from one of the Jupyter notebook examples, on a Linux machine.

#!/bin/bash
! ./new_method/cleora-v1.1.1-x86_64-unknown-linux-gnu --input DATA_PATH+'edges.tsv' --columns="leads attributes" --dimension 32 --number-of-iterations 4

I am getting the following error:

./new_method/cleora-v1.1.1-x86_64-unknown-linux-gnu: /lib64/libc.so.6: version `GLIBC_2.18' not found (required by ./new_method/cleora-v1.1.1-x86_64-unknown-linux-gnu)

Is there a solution to this? Even with the latest version of the Linux binary, I get the same error.

graph partitioning

There's an option named 'num_partitions' in pytorch-biggraph that can reduce peak memory usage. Can Cleora provide that option too? Is it possible in the future?
My situation:
40M nodes
180M edges
more than 20GB of peak memory usage to train Cleora embeddings!
I also set --in-memory-embedding-calculation 0.

RAM Issue with Cleora on Large Graphs

Hello everyone,

I am working on a graph consisting of 300M+ nodes and 1.4B+ relationships. I tried training embeddings with Cleora on this graph but am running out of RAM. The server I am trying to run this on has 661GB of RAM.

Is there any way to avoid loading the entire files into memory and instead train based on a batch size, so that memory usage stays steady and relatively low?

Turn off cleora logs

Hi Team,

I want to turn off Cleora's line-processing logs. Please help me with this. Thanks.

[2021-11-09T06:22:17Z INFO cleora::entity] Number of lines processed: 2960000
[2021-11-09T06:22:17Z INFO cleora::entity] Number of lines processed: 2970000
[2021-11-09T06:22:17Z INFO cleora::entity] Number of lines processed: 2980000
[2021-11-09T06:22:17Z INFO cleora::entity] Number of lines processed: 2990000
[2021-11-09T06:22:17Z INFO cleora::entity] Number of lines processed: 3000000
[2021-11-09T06:22:17Z INFO cleora::entity] Number of lines processed: 3010000
[2021-11-09T06:22:17Z INFO cleora::entity] Number of lines processed: 3020000
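A hedged suggestion rather than a documented switch: the timestamped [... INFO cleora::entity] lines look like output from Rust's env_logger, which is controlled by the RUST_LOG environment variable. If that assumption holds, setting RUST_LOG=error should silence the INFO messages, e.g. when launching Cleora from Python:

import os
import subprocess

# Assumption: cleora uses env_logger (suggested by its log format),
# so RUST_LOG=error suppresses the INFO "lines processed" messages.
subprocess.run(
    ["./cleora", "--input", "edges.tsv", "--columns", "users items",
     "--dimension", "32", "--number-of-iterations", "4"],
    env={**os.environ, "RUST_LOG": "error"},
    check=True,
)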

User-Item Embedding

Hello,
Thank you very much for this work. The performance of your algorithm is stunning!
We are testing Cleora on a user-item embedding task. I have run into a result and am wondering whether it is by design or my mistake.
My TSV file is simple and follows the format of "user item"

u1 <\t> i1
u2 <\t> i1
u1 <\t> i2
u3 <\t> i2

As you can see, the relation between users and items is many-to-many.
I'm running a simple embedding task:
./cleora --input ~/test.tsv --columns="users items" --dimension 32 --number-of-iterations 4

In the resulting embeddings, users and items seem "remote" from each other, as in the image below (cluster 0 is users, cluster 1 is items). That is very different from cases in which we used simple matrix factorization, where users were closer to the items they buy than to other items; here these relationships seem to be somewhat lost.
Does my question make sense? Is this result expected in this case?

Many thanks!

[image: 2D projection of the embeddings showing users and items in two separate clusters]

Online learning

Is there any way to update the model on the fly with new data?

help in understanding output file format

Hi

I was running cleora using the command below:

cleora-v1.2.3-x86_64-apple-darwin --columns transient::cluster_id StarNode --dimension 1024 -n 5 --input fb_cleora_input_star.txt -o output

I got output similar to the following (I added some spacing just for better readability):

39361 1024
1        1    0.029419877 ..... -0.0073362226
16260    7    0.033474464 ..... -0.00906976
.
.
.
22459    1    0.010709517 ..... 0.026430061

I can't figure out what the 1st column (1, 16260, ..., 22459) and the 2nd column (1, 7, ..., 1) represent.

Thanks
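For readers hitting the same question, here is a hedged parsing sketch. It assumes the header line holds "<entity count> <dimension>" and that each row is "<entity> <occurrence count> <dimension floats>"; reading the second column as an occurrence count is an inference from the produce_entity_occurrence_count: true option visible in the configuration dump further down this page, not an authoritative answer.

def read_embeddings(path):
    # Parse Cleora's text output under the assumptions stated above.
    embeddings = {}
    with open(path) as f:
        num_entities, dim = map(int, f.readline().split())
        for line in f:
            parts = line.split()
            entity, count = parts[0], int(parts[1])
            embeddings[entity] = (count, [float(x) for x in parts[2:2 + dim]])
    return embeddings

vectors = read_embeddings("output/emb__cluster_id__StarNode.out")  # hypothetical path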

pyo3 integration and adding support for parquet output and S3 stores

I'd like to add a few features.

  1. Integration with pyo3 bindings, which will enable publishing the library as a Python package and using it without a subprocess

  2. Support for parquet output persistence:
    output_format="parquet"
    Because writing to parquet row by row is inefficient, an additional parameter will be required to write in chunks:
    chunk_size=3000

  3. Support for S3 as an input and output store

Example usage:

import cleora

output_dir = 's3://output'
fb_cleora_input_clique_filename = "s3://input/fb_cleora_input_clique.txt"
fb_cleora_input_star_filename = "s3://input/fb_cleora_input_star.txt"

cleora.run(
    input=[fb_cleora_input_clique_filename],
    type_name="tsv",
    dimension=1024,
    max_iter=5,
    seed=10,
    prepend_field=False,
    log_every=1000,
    in_memory_embedding_calculation=True,
    cols_str="complex::reflexive::CliqueNode",
    output_dir=output_dir,
    output_format="parquet",
    relation_name="emb",
    chunk_size=3000,
)

Check for malformed lines in input. Computation partially proceeds without warning instead of aborting on Windows.

I spoke with Jack Dabrowski today about some problems with processing a large input file on Windows. The input and output files can be found below. Here is a sample command execution for reproduction purposes:

C:\Path\To\Repo\cleora\target\debug\cleora.exe --input edges.tsv --dimension 100 --number-of-iterations 10 --columns="media complex::tropes" --output-dir Output

This was run on a Windows 10 machine and yields the following debug information...

C:\Path\To\Repo\cleora>C:\Path\To\Repo\cleora\target\debug\cleora.exe --input edges.tsv --dimension 100 --number-of-iterations 10 --columns="media complex::tropes" --output-dir Output
[2022-06-24T16:47:38Z INFO  cleora] Reading args...
[src\main.rs:202] &config = Configuration {
    produce_entity_occurrence_count: true,
    embeddings_dimension: 100,
    max_number_of_iteration: 10,
    seed: None,
    prepend_field: false,
    log_every_n: 10000,
    in_memory_embedding_calculation: true,
    input: "edges.tsv",
    file_type: Tsv,
    output_dir: Some(
        "Output",
    ),
    output_format: TextFile,
    relation_name: "emb",
    columns: [
        Column {
            name: "media",
            transient: false,
            complex: false,
            reflexive: false,
            ignored: false,
        },
        Column {
            name: "tropes",
            transient: false,
            complex: true,
            reflexive: false,
            ignored: false,
        },
    ],
}
[2022-06-24T16:47:38Z INFO  cleora] Starting calculation...
[src\pipeline.rs:25] &sparse_matrices = [
    SparseMatrix {
        col_a_id: 0,
        col_a_name: "media",
        col_b_id: 1,
        col_b_name: "tropes",
        edge_count: 0,
        hash_2_id: {},
        id_2_hash: [],
        row_sum: [],
        pair_index: {},
        entries: [],
    },
]
[2022-06-24T16:47:38Z INFO  cleora::sparse_matrix] Number of entities: 6629
[2022-06-24T16:47:38Z INFO  cleora::sparse_matrix] Number of edges: 13985
[2022-06-24T16:47:38Z INFO  cleora::sparse_matrix] Number of entries: 27970
[2022-06-24T16:47:38Z INFO  cleora::sparse_matrix] Total memory usage by the struct ~ 0 MB
[2022-06-24T16:47:40Z INFO  cleora::pipeline] Number of lines processed: 10000
[2022-06-24T16:47:41Z INFO  cleora::pipeline] Number of lines processed: 20000
[2022-06-24T16:47:43Z INFO  cleora::pipeline] Number of lines processed: 30000
[2022-06-24T16:47:44Z INFO  cleora::pipeline] Number of lines processed: 40000
[2022-06-24T16:47:46Z INFO  cleora::pipeline] Number of lines processed: 50000
[2022-06-24T16:47:49Z INFO  cleora::pipeline] Number of lines processed: 60000
[2022-06-24T16:47:53Z INFO  cleora::pipeline] Number of lines processed: 70000
[2022-06-24T16:47:56Z INFO  cleora] Finished Sparse Matrices calculation in 18 sec
[2022-06-24T16:47:56Z INFO  cleora::embedding] Start initialization. Dims: 100, entities: 6629.
[2022-06-24T16:47:56Z INFO  cleora::embedding] Done initializing. Dims: 100, entities: 6629.
[2022-06-24T16:47:56Z INFO  cleora::embedding] Start propagating. Number of iterations: 10.
[2022-06-24T16:47:56Z INFO  cleora::embedding] Done iter: 0. Dims: 100, entities: 6629, num data points: 27970.
[2022-06-24T16:47:57Z INFO  cleora::embedding] Done iter: 1. Dims: 100, entities: 6629, num data points: 27970.
[2022-06-24T16:47:57Z INFO  cleora::embedding] Done iter: 2. Dims: 100, entities: 6629, num data points: 27970.
[2022-06-24T16:47:57Z INFO  cleora::embedding] Done iter: 3. Dims: 100, entities: 6629, num data points: 27970.
[2022-06-24T16:47:57Z INFO  cleora::embedding] Done iter: 4. Dims: 100, entities: 6629, num data points: 27970.
[2022-06-24T16:47:57Z INFO  cleora::embedding] Done iter: 5. Dims: 100, entities: 6629, num data points: 27970.
[2022-06-24T16:47:57Z INFO  cleora::embedding] Done iter: 6. Dims: 100, entities: 6629, num data points: 27970.
[2022-06-24T16:47:58Z INFO  cleora::embedding] Done iter: 7. Dims: 100, entities: 6629, num data points: 27970.
[2022-06-24T16:47:58Z INFO  cleora::embedding] Done iter: 8. Dims: 100, entities: 6629, num data points: 27970.
[2022-06-24T16:47:58Z INFO  cleora::embedding] Done iter: 9. Dims: 100, entities: 6629, num data points: 27970.
[2022-06-24T16:47:58Z INFO  cleora::embedding] Done propagating.
[2022-06-24T16:47:58Z INFO  cleora::embedding] Start saving embeddings.
[2022-06-24T16:47:58Z INFO  cleora::embedding] Done saving embeddings.
[2022-06-24T16:47:58Z INFO  cleora::embedding] Finalizing embeddings calculations!
[2022-06-24T16:47:58Z INFO  cleora] Finished in 20 sec

C:\Path\To\Repo\cleora>

I was told the following:

1. My binary (gnu-linux), on your input file, throws an exception during the data-loading phase and fails immediately:
thread 'main' panicked at 'index out of bounds: the len is 1 but the index is 1', src/entity.rs:227:29

2. This is caused by lines containing only a single entity, without any other corresponding entities (e.g. line number 101 in your input file).
Such inputs are meaningless (because they do not represent an edge in the graph), and we do not handle them currently.

3. The code should throw an error and abort, but apparently on Windows the exception happens silently and the code proceeds to the next phase, despite not having loaded all inputs into memory successfully.

We will introduce a proper workaround (handle the case without errors + display a warning that "such lines are meaningless and will be skipped").

If there are any other materials necessary for addressing this issue, please reach out. Thank you very much!

Output:
Output.zip

Input:
https://drive.google.com/file/d/1YjTSQ-DMEaOE5wRbO1bN4SuXv__PBE0W/view?usp=sharing
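Until the fix lands, a hedged pre-filtering sketch based on the explanation above: drop lines that do not contain at least two non-empty tab-separated fields (such lines do not represent an edge) before handing the file to Cleora.

# Skip malformed single-entity lines, which (per the discussion above)
# can crash the load or silently corrupt the run.
with open("edges.tsv") as src, open("edges.clean.tsv", "w") as dst:
    for line in src:
        fields = line.rstrip("\n").split("\t")
        if len(fields) >= 2 and all(f.strip() for f in fields):
            dst.write(line)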
