
Comments (33)

that1matt avatar that1matt commented on September 27, 2024 4

But I think we are both of the opinion that one does not necessarily want to disclose that information for that use case, which is why an additional dedicated field makes more sense. Do you feel the same way?

Yes, along with keeping the file id and filename_disk in sync as they currently are, which was the case before the patch that allows a custom filename_disk went into place.

If we're actually going to update the filename_disk field et al., the modified_on field will be updated as well when replacing a file. But that also seems to make sense to me.

That makes sense to me as well. The goal of this change is to provide a definitive way of identifying files that have been replaced, so that cache can be invalidated without every modification invalidating it. (This PR could also fix the places in the app where modified_on is used to invalidate cache.)

Got it, thanks! This sounds like something that is usually solved by storing a checksum/hash of the file (I thought we already had something like this), which has the added benefit that the cache is only invalidated if the file is actually different.

I agree, having a hash/checksum would solve the problem, as long as this field is returned through the API. I am happy to work on changing the implementation; I also like this better than replaced_on. My goal is to cause as little disruption as possible.

from directus.

that1matt avatar that1matt commented on September 27, 2024 2

@nickrum the issue right now is that modified_on gets updated when you make any change to the file, which doesn't narrow the scope down to when a file has been replaced. Relying on modified_on would defeat the ability to break the cache when using an external cache provider. The replaced_on field will only update when the file is replaced, not when the file is renamed.


rijkvanzanten avatar rijkvanzanten commented on September 27, 2024 2

I'm out one day for a wedding and we've changed the plot! 😄 Good shoutout to use a checksum instead @nickrum. Feels more appropriate than a version number indeed.

const string = await readableStreamToString(stream);
const checksum = generateChecksum(string);

@that1matt This feels dangerous to me. That readable stream could be gigabytes worth of file, which is then all read into a single string. This is an easy way to crash the server with an out of memory issue. I believe you can pipe a readable stream to the hasher, so hopefully we can change that to something like (pseudo code):

import { pipeline } from 'node:stream/promises';
import { createHash } from 'node:crypto';

const hash = await pipeline(stream, createHash('md5'));


that1matt avatar that1matt commented on September 27, 2024 2

I have updated the linked PR to use a replaced_on field. Please feel free to modify it as needed. I also kept the logic that updates file extensions and mime-types when you replace the image.


that1matt avatar that1matt commented on September 27, 2024 1

Ideally I wanted a replaced_on timestamp field that was only updated when an existing file was replaced. I wasn't able to figure out how to get that to work in my testing; PostgreSQL would always update the field on any change. If you have thoughts on this, I would love to replace the version with a replaced_on timestamp value, as it makes more sense than a version field.

I do like that the primary key is the same as the asset filename_disk, it does save some logic when we are pulling down files from Directus.

My goal with the additional field was to keep it simple without making a potentially breaking change to how users expect file assets to work (only because I am not as familiar with the depth of Directus and how such a change can affect different areas). Having just a replaced_on or version field would add a single field to the response; if it is not null, the file was replaced. The timestamp/version value would indicate the date/integer at which it was replaced and could be used as a query string to break caching when pulling the asset from Directus.


that1matt avatar that1matt commented on September 27, 2024 1

Also would we want to block direct modification of that value?

Yes, this field should only change when an existing file is being replaced. You should not be able to change it via API/APP.

I'd consider using a replacement date instead,

For a version field, as mentioned by you, I'd choose a more descriptive name like file_version to make it clearer that the value is only about the file itself.

Your comment here about the date field solved the issue I was having when using a timestamp field; the INSERT/UPDATE issue no longer happens when any field is changed. So I created a new field, replaced_on, to match the convention of modified_on & uploaded_on.


nickrum avatar nickrum commented on September 27, 2024 1

@that1matt Got it, thanks! This sounds like something that is usually solved by storing a checksum/hash of the file (I thought we already had something like this), which has the added benefit that the cache is only invalidated if the file is actually different.


paescuj avatar paescuj commented on September 27, 2024 1

@that1matt Thanks again!

Some final remarks:

  • My initial idea in #22606 (comment) was that an additional field could be avoided by using the updated filename_disk field as a cache invalidator. But I think we are both of the opinion that one does not necessarily want to disclose that information for that use case, which is why an additional dedicated field makes more sense. Do you feel the same way?
  • If we're actually going to update the filename_disk field et al., the modified_on field will be updated as well when replacing a file. But that also seems to make sense to me.

@that1matt Got it, thanks! This sounds like something that is usually solved by storing a checksum/hash of the file (I thought we already had something like this), which has the added benefit that the cache is only invalidated if the file is actually different.

Awesome input ❤️ I think I'd prefer this to a replaced_on date field, as we could, as mentioned above, now get this information from the modified_on field if needed. Implementation-wise, we would need to see how we can calculate the checksum, though.


that1matt avatar that1matt commented on September 27, 2024 1

@paescuj @nickrum made the first round of changes, let me know what you think.


hanneskuettner avatar hanneskuettner commented on September 27, 2024 1

While we can't manually save and restore the hasher instance, couldn't we just keep it around in memory? I haven't checked, but I would assume that the hasher instance isn't much bigger than the resulting hash.

Not possible, since the upload can happen across multiple horizontally scaled instances and does not have to be pinned to one instance only


nickrum avatar nickrum commented on September 27, 2024 1

Not possible, since the upload can happen across multiple horizontally scaled instances and does not have to be pinned to one instance only

It's never easy when multiple horizontally scaled instances come into play...
Looks like this xxhash implementation allows you to access and modify the internal state, which means we could store the state somewhere where it can be accessed by all instances.


paescuj avatar paescuj commented on September 27, 2024 1

Thanks @that1matt for your active participation in this matter!

We've now decided within the team to continue with the date approach. Hashing currently raises too many questions, especially when it comes to the new "resumable uploads" feature.

The reason for the separate explorational pull request from my side (#22900) was that, if we were to add hashing, it would most likely have to happen directly during the upload (as opposed to loading the file again from storage after the upload).
The solution in #22900 seems to work so far; however, as mentioned above, we chose the date approach, which is much simpler but still meets the original requirements.

Nevertheless, I'd like to keep your pull request (#22751) open for a bit, as it contains other useful work that we can look at at a later point. We'll open an additional pull request shortly, which will exclusively cover the date field change.


that1matt avatar that1matt commented on September 27, 2024 1

@paescuj let me know if you want me to adjust the PR I have open; I can change it back to using a replaced_on date, with the other updates to make sure that the file extension gets updated as well.


that1matt avatar that1matt commented on September 27, 2024

Potential Solution:

  • Create a new field on directus_files named replaced_on that takes the same timestamp/datetime value as other datetime fields. This might not work, as the timestamp value would be updated on insertion/update.

  • Then, when calling the Replace File functionality, the timestamp value of this field would be updated. Only calling Replace File through the API or UI would ever update this field.

  • Alternatively, create a version integer column on directus_files that is incremented by one when an image is replaced using Replace File. The current example uses version, but it might be a good idea to use a more descriptive name such as cache_version, file_version, file_revision, or replaced_version.

This has the benefit of recording a version number when the image was replaced, and would allow users to build their own caching logic: if the field is not null, append it as a query param to bypass previously cached versions.

I am happy to take on solving this, with whatever solution the team thinks would be ideal.


paescuj avatar paescuj commented on September 27, 2024

I'd like to suggest another potential solution:

When replacing a file, we could update the filename_disk field, using a new random UUID.
That way we can also ensure that the file extension present in filename_disk is updated - currently this is not the case (for example, when replacing a jpeg with a png file).
This would possibly eliminate the need for a new version field and solve #22556 at the same time 🤔

If we take this path, we might want to use an arbitrary UUID for filename_disk on the initial upload as well, instead of the primary key, so that there's no coupling from the beginning.


that1matt avatar that1matt commented on September 27, 2024

@paescuj I didn't even think about the file extension, that is a great point. I think we would also want to update the filename_download to make sure it includes the file extension as well. We have had issues where the filename_download doesn't have the file extension, and users cannot open the file. Would we want to overwrite the filename_disk/filename_download to keep them both in "sync"?

Would the primary key stay the same, just the filename_disk/filename_download change?

I did see that this change got added in #22848, which allows a custom filename_disk.


that1matt avatar that1matt commented on September 27, 2024

@paescuj another option would be to check the data passed in from the new file, and replace the filename_disk extension here https://github.com/directus/directus/blob/main/api/src/services/files.ts#L89


that1matt avatar that1matt commented on September 27, 2024

@paescuj I have a change within files.ts that solves changing the filename_disk extension, along with making sure that the filename_download extension is present. I added some tests to make sure that when changing an existing file, only the extension is updated for both filename_download and filename_disk.

// Note: `extension(type)` maps a mime type to a file extension (presumably from
// the `mime-types` package); `primaryKey` and `fileExtension` come from the
// surrounding files.ts context.
if (!path.extname(payload.filename_download!) && payload?.type) {
	payload.filename_download = payload.filename_download + '.' + extension(payload.type);
}

// The filename_disk is the FINAL filename on disk
payload.filename_disk ||= primaryKey + (fileExtension || '');

// Keep the on-disk extension in sync with the replacement file's type
if (path.extname(payload.filename_disk!) !== fileExtension) {
	payload.filename_disk = primaryKey + (fileExtension || '');
}


paescuj avatar paescuj commented on September 27, 2024

@paescuj I didn't even think about the file extension, that is a great point. I think we would also want to update the filename_download to make sure it includes the file extension as well. We have had issues where the filename_download doesn't have the file extension, and users cannot open the file. Would we want to overwrite the filename_disk/filename_download to keep them both in "sync"?

Would the primary key stay the same, just the filename_disk/filename_download change?

I did see that this change got added in #22848, which allows a custom filename_disk.

Thanks for your prompt feedback! ❤️

I think we would also want to update the filename_download [...]

Oh good point! Yeah, I think consequently we should also update that - as well as the type field!

Would the primary key stay the same [...]

Right, I think the primary key always stays the same, as after a replacement, it's still the same file "resource/container/..." that is now just pointing to a different "asset".


With those changes, do you think there's still a need for the version field?
Not that I am fundamentally against it, but in the end it's a new field that we would have to maintain.
As it seems, it would primarily serve as cache invalidation, so I wonder how useful it really is besides that.
For example, the revisions already indicate the "version" of a file.

If it turns out that such a field is necessary:

  • I'd consider using a replacement date instead, as in my opinion this is a bit more meaningful than a simple version counter. It seems like for the current use case we just need a value that changes, but we're not really interested in the actual "version number". At the same time, a date field would contain information that could be useful in other ways...
  • For a version field, as mentioned by you, I'd choose a more descriptive name like file_version to make it clearer that the value is only about the file itself.
  • Seems pretty irrelevant at the moment, but still (🤓😇): What happens when the version counter reaches the max integer value? [1] Also, would we want to block direct modification of that value?

Footnotes

  1. https://arstechnica.com/information-technology/2014/12/gangnam-style-overflows-int_max-forces-youtube-to-go-64-bit/


that1matt avatar that1matt commented on September 27, 2024

@paescuj what do you think about this implementation change on #22751 ?


that1matt avatar that1matt commented on September 27, 2024

This would possibly eliminate the need for a new version field and solve #22556 at the same time 🤔

I do not have a Cloudinary account ATM, but I will create one and work on seeing if I can test this change against that issue.


nickrum avatar nickrum commented on September 27, 2024

I'm probably missing something, but wouldn't it make sense to simply update modified_on when a file is replaced? I feel like I would expect a replacement to also be a kind of modification 🤔


that1matt avatar that1matt commented on September 27, 2024

@nickrum I wasn't able to find anything that informed me when a file was replaced; nothing is returned through the API.


that1matt avatar that1matt commented on September 27, 2024

@paescuj I believe we will still need a new field to be returned through the API when replacing a file, as there is currently no way to determine through the API whether a file has been replaced.

Just to make sure I understand what to implement: we would have a new hash field that defaults to null, where null indicates that the original file hasn't been changed. Then, when replacing the file, we would calculate the hash of the image and update the hash field with this value, along with updating the filename_disk & filename_download extensions, as well as the type field if the mime-type has changed.


paescuj avatar paescuj commented on September 27, 2024

Correct, while already creating a hash for the initial upload 👍


that1matt avatar that1matt commented on September 27, 2024

I am having some trouble getting a hash value, hopefully I am just missing something that someone can point out.

Here is the snippet of code

import { readableStreamToString } from '@directus/utils/node';
import { createHash } from 'node:crypto';

const generateChecksum = (data: string) => createHash('md5').update(data, 'utf8').digest('hex');

const string = await readableStreamToString(stream);
const checksum = generateChecksum(string);

While this does give me a hash, I am now getting an error.

Input buffer contains unsupported image format at Sharp.metadata

Any insights would be much appreciated.

It seems that consuming the stream before line 123 throws the error, while consuming it after does not.


that1matt avatar that1matt commented on September 27, 2024

@rijkvanzanten thanks for the feedback. I appreciate all the insights into Directus that I miss.

I have a working implementation with the changes you suggested. I am still figuring out a TypeScript issue.

Property 'toArray' does not exist on type AsyncIterable

/**
 * Extract file hash from stream content
 */
async getHash(stream: Readable): Promise<string> {
	let hash = '';

	// With setEncoding('hex'), the hash transform emits hex-encoded string chunks
	await pipeline(stream, createHash('md5').setEncoding('hex'), async function (source: AsyncIterable<string>) {
		for await (const chunk of source) {
			hash += chunk;
		}
	});

	return hash;
}

I haven't played with streams much, so hopefully this is on the right track.


hanneskuettner avatar hanneskuettner commented on September 27, 2024

Just a consideration here: since we will support uploading multi-gig files (in chunks) in the next version, hashing does seem like a performance pitfall if the file has to be downloaded from the provider just to do the local hashing in the Directus instance.


nickrum avatar nickrum commented on September 27, 2024

@hanneskuettner Provided the chunked upload still goes through the Directus instance, couldn't we just calculate the hash on the fly and update it as new chunks become available?
Also, we might want to use a very fast hashing algorithm that is intended for this kind of operation like xxhash.


paescuj avatar paescuj commented on September 27, 2024

@hanneskuettner Provided the chunked upload still goes through the Directus instance, couldn't we just calculate the hash on the fly and update it as new chunks become available? Also, we might want to use a very fast hashing algorithm that is intended for this kind of operation like xxhash.

Problem is that we cannot continue hashing once the upload has paused, as we cannot access and store the hash state... See e.g. nodejs/node#25411 (comment)
Probably the same with xxhash?


nickrum avatar nickrum commented on September 27, 2024

While we can't manually save and restore the hasher instance, couldn't we just keep it around in memory? I haven't checked, but I would assume that the hasher instance isn't much bigger than the resulting hash.


that1matt avatar that1matt commented on September 27, 2024

@paescuj I actually like a lot of your changes in #22900 for the file hashing; they solved the issue I was having with trying to clone the stream. Given your PR, should I stop working on #22751 in favor of yours, or could we combine them?


rijkvanzanten avatar rijkvanzanten commented on September 27, 2024

The hashing approach felt like the theoretically "correct" way to do it, but testing it out in those PRs raised too many performance concerns and stability questions. The main goal of this implementation is to have a flag indicating whether the file was changed; to solve that, we don't need full checksumming. Juice ain't worth the squeeze 🙂

