Comments (33)
> But I think we are both of the opinion that one does not necessarily want to disclose that information for that use case, which is why an additional dedicated field makes more sense. Do you feel the same way?
Yes, along with keeping the file id and `filename_disk` in sync as they currently are, which was the case before that patch went into place to allow a custom `filename_disk`.
> If we're actually going to update the `filename_disk` field et al., the `modified_on` field will be updated as well when replacing a file. But that also seems to make sense to me.
That makes sense to me as well. The goal of this change is to provide a definitive way of identifying files that have been replaced, and to invalidate cache without allowing every modification to invalidate it. (This PR could fix areas where you are calling `modified_on` as a way to invalidate cache within the app.)
Got it, thanks! This sounds like something that is usually solved by storing a checksum/hash of the file (I thought we already had something like this), which has the added benefit that the cache is only invalidated if the file is actually different.
I agree, having a hash/checksum would solve the problem, as long as this field is returned through the API. I am happy to work on changing the implementation. I also like this better than the `replaced_on` field. My goal is to cause as little disruption as possible.
from directus.
@nickrum the issue right now is that `modified_on` gets updated when you make any change to the file. This doesn't narrow the scope of when a file has been replaced. Using `modified_on` would defeat the ability of breaking the cache when using an external cache provider. This `replaced_on` will only update when the file is replaced, not if the filename was renamed.
I'm out one day for a wedding and we've changed the plot! 😄 Good shoutout to use a checksum instead @nickrum. Feels more appropriate than a version number indeed.
```js
const string = await readableStreamToString(stream);
const checksum = generateChecksum(string);
```
@that1matt This feels dangerous to me. That readable stream could be gigabytes worth of file, which is then all read into a single string. This is an easy way to crash the server with an out of memory issue. I believe you can pipe a readable stream to the hasher, so hopefully we can change that to something like (pseudo code):
```js
import { pipeline } from 'node:stream/promises';
import { createHash } from 'node:crypto';

const hash = await pipeline(stream, createHash('md5'));
```
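For reference, a minimal runnable sketch of that idea (not the actual Directus code): consume the stream chunk by chunk and feed each chunk to an incremental hasher, so memory use stays constant regardless of file size.

```javascript
import { createHash } from 'node:crypto';

// Feed each chunk to the incremental hasher as it arrives, instead of
// joining the whole stream into one giant string first.
async function hashStream(stream, algorithm = 'md5') {
	const hash = createHash(algorithm);

	for await (const chunk of stream) {
		hash.update(chunk);
	}

	return hash.digest('hex');
}
```

Only one chunk is ever held in memory at a time, so a multi-gigabyte upload no longer risks an out-of-memory crash.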
I have updated the linked PR to use a `replaced_on` field; please feel free to modify it as needed. I also kept the logic for updating file extensions and mime-types when you replace the image.
Ideally I wanted to have a `replaced_on` field, a timestamp that was only updated when the existing file was replaced. I wasn't able to figure out how to get that to work in my testing; PostgreSQL would always update the field on any change. If you have thoughts on this, I would love to replace the `version` field with a `replaced_on` timestamp value, as it makes more sense than a version field.
I do like that the primary key is the same as the asset `filename_disk`; it does save some logic when we are pulling down files from Directus.
My goal with the additional field was to try and keep it simple without making a potentially breaking change to how users expect file assets to work (only because I am not as familiar with the depth of Directus and how a change like that can affect different areas). Having just a `replaced_on` or `version` field would allow a simple additional field to be returned; if it is not null, it means the file was replaced. The timestamp/version value would indicate the date/int at which it was replaced, and could be used as a query string to break caching when we are pulling the asset from Directus.
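As an illustration of that query-string idea, a hypothetical helper (not part of Directus; names assumed) that turns the replacement timestamp into a cache-busting parameter:

```javascript
// Hypothetical helper: if the file was ever replaced, append the
// replacement timestamp as a query param so caches/CDNs treat the
// replaced asset as a brand-new URL; otherwise return the plain URL.
function assetUrl(baseUrl, fileId, replacedOn) {
	const url = new URL(`/assets/${fileId}`, baseUrl);

	if (replacedOn) {
		url.searchParams.set('v', String(Date.parse(replacedOn)));
	}

	return url.toString();
}
```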
> Also would we want to block direct modification of that value?
Yes, this field should only change when an existing file is being replaced. You should not be able to change it via API/APP.
> I'd consider using a replacement date instead, [...]
>
> For a version field, as mentioned by you, I'd choose a more descriptive name like `file_version` to make it clearer that the value is only about the file itself.
Your comment here about the date field solved the issue I was having when using a timestamp field; the INSERT/UPDATE issue no longer happens when any field is changed. So I created a new field `replaced_on` to match the convention of `modified_on` & `uploaded_on`.
@that1matt Thanks again!
Some final remarks:
- My initial idea in #22606 (comment) was that an additional field could be avoided by using the updated `filename_disk` field as a cache invalidator. But I think we are both of the opinion that one does not necessarily want to disclose that information for that use case, which is why an additional dedicated field makes more sense. Do you feel the same way?
- If we're actually going to update the `filename_disk` field et al., the `modified_on` field will be updated as well when replacing a file. But that also seems to make sense to me.
> @that1matt Got it, thanks! This sounds like something that is usually solved by storing a checksum/hash of the file (I thought we already had something like this), which has the added benefit that the cache is only invalidated if the file is actually different.

Awesome input ❤️ I think I'd prefer this to a `replaced_on` date field, as we could, as mentioned above, now get this information from the `modified_on` field if needed. Implementation-wise, we would need to see how we can calculate the checksum, though.
@paescuj @nickrum made the first round of changes, let me know what you think.
While we can't manually save and restore the hasher instance, couldn't we just keep it around in memory? I haven't checked it, but I would assume that the hasher instance isn't much bigger than the resulting hash.
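A sketch of what "keep it around in memory" could look like (single-process only; the names are assumed, not from Directus):

```javascript
import { createHash } from 'node:crypto';

// One incremental hasher per in-flight upload, fed as chunks arrive.
// This only works while every chunk of an upload reaches this process.
const hashers = new Map();

function addChunk(uploadId, chunk) {
	if (!hashers.has(uploadId)) hashers.set(uploadId, createHash('md5'));
	hashers.get(uploadId).update(chunk);
}

function finishUpload(uploadId) {
	const digest = hashers.get(uploadId).digest('hex');
	hashers.delete(uploadId); // free the hasher once the upload completes
	return digest;
}
```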
Not possible, since the upload can happen across multiple horizontally scaled instances and does not have to be pinned to one instance only
> Not possible, since the upload can happen across multiple horizontally scaled instances and does not have to be pinned to one instance only
It's never easy when multiple horizontally scaled instances come into play...
Looks like this xxhash implementation allows you to access and modify the internal state, which means we could store the state somewhere where it can be accessed by all instances.
Thanks @that1matt for your active participation in this matter!
We've now decided within the team to continue with the date approach. Hashing currently raises too many questions, especially when it comes to the new "resumable uploads" feature.
The reason for the separate explorational pull request from my side (#22900), was that if we would add hashing, this most likely has to happen directly during the upload (as opposed to loading the file again from storage after upload).
The solution in #22900 seems to work so far; however, as mentioned above, we chose the date approach, which is much simpler but still meets the original requirements.
Nevertheless, I'd like to keep your pull request (#22751) open for a bit, as it contains other useful work that we can look at later. We'll open an additional pull request shortly, which will exclusively cover the date field change.
@paescuj let me know if you want me to adjust the PR I have open. I can change it back to use a `replaced_on` date, with the other updates to make sure that the file extension gets updated as well.
Potential Solution:
- Create a new field on `directus_files` named `replaced_on` that takes the same timestamp/datetime value as other datetime fields. Then, wherever the Replace File functionality is called, it would update the value of the timestamp for this field; only calling Replace File through the API or UI would ever update this field. (This might not work, as the timestamp value would be updated on insertion/update.)
- Create a `version` integer column on `directus_files` that is incremented by one when an image is replaced using Replace File. The current example uses `version`, but it might be a good idea to use another, more descriptive name such as `cache_version`, `file_version`, `file_revision` or `replaced_version`.

This has the benefit of giving a version number when the image was replaced, and would allow the user to build their own logic on how to handle caching, with it being as simple as checking the field: if it is not null, append a query param to break previous cache issues.
I am happy to take on solving this, with whatever solution the team thinks would be ideal.
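For illustration, the version-counter option above could look roughly like this (hypothetical sketch; the field and function names are assumptions, not Directus code):

```javascript
// Hypothetical replace handler: the counter is bumped only in the
// dedicated "Replace File" code path, never on ordinary metadata edits.
function onFileReplaced(fileRecord) {
	return {
		...fileRecord,
		version: (fileRecord.version ?? 0) + 1,
	};
}
```

A `null`/absent `version` then means "never replaced", matching the check described above.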
I'd like to suggest another potential solution:
When replacing a file, we could update the `filename_disk` field, using a new random UUID.
That way we can also ensure that the file extension present in `filename_disk` is updated; currently this is not the case (for example, when replacing a `jpeg` with a `png` file).
This would possibly eliminate the need for a new `version` field and solve #22556 at the same time 🤔
If we take this path, we might want to use an arbitrary UUID for `filename_disk` on the initial upload as well, instead of the primary key, so that there's no coupling from the beginning.
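A minimal sketch of that naming scheme (assumed helper, not the Directus implementation):

```javascript
import { randomUUID } from 'node:crypto';
import path from 'node:path';

// On every replace, mint a fresh on-disk name from a random UUID plus
// the *new* upload's extension, so the name is decoupled from the
// primary key and the extension always matches the current file type.
function newFilenameDisk(uploadedName) {
	return `${randomUUID()}${path.extname(uploadedName)}`;
}
```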
@paescuj I didn't even think about the file extension, that is a great point. I think we would also want to update the `filename_download` to make sure it includes the file extension as well. We have had issues where the `filename_download` doesn't have the file extension, and users cannot open the file. Would we want to overwrite `filename_disk`/`filename_download` to keep them both in "sync"?
Would the primary key stay the same, and just the `filename_disk`/`filename_download` change?
I did see that this change got added in #22848, which allows a custom `filename_disk`.
@paescuj another option would be to check the data passed in from the new file, and replace the `filename_disk` extension here: https://github.com/directus/directus/blob/main/api/src/services/files.ts#L89
@paescuj I have a change within `files.ts` that would solve changing the `filename_disk` extension, along with making sure that the `filename_download` extension is present. I added some tests to make sure that when changing an existing file, it updates only the extension, for both `filename_download` and `filename_disk`.
```ts
if (!path.extname(payload.filename_download!) && payload?.type) {
	payload.filename_download = payload.filename_download + '.' + extension(payload.type);
}

// The filename_disk is the FINAL filename on disk
payload.filename_disk ||= primaryKey + (fileExtension || '');

if (path.extname(payload.filename_disk!) !== fileExtension) {
	payload.filename_disk = primaryKey + (fileExtension || '');
}
```
> @paescuj I didn't even think about the file extension, that is a great point. I think we would also want to update the `filename_download` to make sure it includes the file extension as well. We have had issues where the `filename_download` doesn't have the file extension, and users cannot open the file. Would we want to overwrite `filename_disk`/`filename_download` to keep them both in "sync"?
>
> Would the primary key stay the same, and just the `filename_disk`/`filename_download` change?
>
> I did see that this change got added in #22848, which allows a custom `filename_disk`.
Thanks for your prompt feedback! ❤️
> I think we would also want to update the `filename_download` [...]

Oh good point! Yeah, I think consequently we should also update that, as well as the `type` field!
> Would the primary key stay the same [...]

Right, I think the primary key always stays the same, as after a replacement it's still the same file "resource/container/..." that is now just pointing to a different "asset".
With those changes, do you think there's still a need for the `version` field?
Not that I am fundamentally against it, but in the end it's a new field that we would have to maintain.
As it seems, it would primarily serve as cache invalidation, so I wonder how useful it really is besides that.
For example, the revisions already indicate the "version" of a file.
If it turns out that such a field is necessary:
- I'd consider using a replacement date instead, as in my opinion this is a bit more meaningful than a simple version counter. It seems like for the current use case we just need a value that changes, but we're not really interested in the actual "version number". At the same time, a date field would contain information that could be useful in other ways...
- For a version field, as mentioned by you, I'd choose a more descriptive name like `file_version` to make it clearer that the value is only about the file itself.
- Seems pretty irrelevant at the moment, but still (🤓😇): what happens when the version counter reaches the max integer value? Also, would we want to block direct modification of that value?
@paescuj what do you think about this implementation change on #22751 ?
> This would possibly eliminate the need for a new `version` field and solve #22556 at the same time 🤔
I do not have a Cloudinary account ATM, but I will create one and work on seeing if I can test this change against that issue.
I'm probably missing something, but wouldn't it make sense to simply update `modified_on` when a file is replaced? I feel like I would expect a replacement to also be a kind of modification 🤔
@nickrum I wasn't able to find anything that informed me when a file was replaced. Nothing that returned through the API.
@paescuj I believe we will still need a new field to be returned through the API when replacing a file, as there is currently no way to determine through the API when a file has been replaced.
Just to make sure I am understanding what to implement: we would have a new `hash` field that by default would be `null`. Starting off `null` indicates that the original file hasn't been changed. Then, when replacing the file, we would calculate the `hash` of the image and update the `hash` field with this value, along with updating only the `filename_disk` & `filename_download` extensions, as well as the `type` field if the mime-type has changed.
Correct, while already creating a `hash` for the initial upload 👍
I am having some trouble getting a hash value, hopefully I am just missing something that someone can point out.
Here is the snippet of code:

```ts
import { readableStreamToString } from '@directus/utils/node';
import { createHash } from 'node:crypto';

const generateChecksum = (data: string) => createHash('md5').update(data, 'utf8').digest('hex');

const string = await readableStreamToString(stream);
const checksum = generateChecksum(string);
```
While this does give me a hash, I am now getting an error.
```
Input buffer contains unsupported image format at Sharp.metadata
```
Any insights would be much appreciated.
It seems that consuming the stream before line 123 will throw the error; consuming it after will not.
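One likely explanation: a readable stream can only be consumed once, so hashing it first leaves nothing for sharp to read, hence the "unsupported image format" error. A possible workaround (sketch, not the Directus implementation) is to tee the stream into two `PassThrough`s so the hasher and the image pipeline each get a full copy. Note that both branches must actually be consumed, or backpressure will stall the source:

```javascript
import { PassThrough } from 'node:stream';

// Pipe one source into two independent PassThrough branches; each
// branch receives every chunk of the original stream.
function teeStream(source) {
	const a = new PassThrough();
	const b = new PassThrough();
	source.pipe(a);
	source.pipe(b);
	return [a, b];
}
```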
@rijkvanzanten thanks for the feedback. I appreciate all the insights into Directus that I miss.
I have a working implementation with the changes you suggested. I am still figuring out a typescript issue.
```
Property 'toArray' does not exist on type AsyncIterable
```
```ts
/**
 * Extract file hash from stream content
 */
async getHash(stream: Readable): Promise<string> {
	let hash = '';

	await pipeline(stream, createHash('md5').setEncoding('hex'), async function (source: AsyncIterable<Hash>) {
		for await (const chunk of source) {
			hash += chunk;
		}
	});

	return hash;
}
```
I haven't played with streams much, so hopefully this is on the right track.
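For what it's worth, here is a plain-JavaScript variant of the same approach in which the problematic `AsyncIterable<Hash>` generic disappears entirely. With `setEncoding('hex')`, the hash transform emits the hex digest as strings once the input ends, so the final pipeline stage just collects them (assumes `stream` is a Node `Readable`):

```javascript
import { pipeline } from 'node:stream/promises';
import { createHash } from 'node:crypto';

// The hash transform emits the hex digest after the input ends; the
// final stage of the pipeline collects that output into a string.
async function getHash(stream) {
	let digest = '';

	await pipeline(stream, createHash('md5').setEncoding('hex'), async (source) => {
		for await (const chunk of source) digest += chunk;
	});

	return digest;
}
```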
Just a consideration here: since we do support uploading multi-gig files (in chunks) in the next version, doing a hash does seem like a performance pitfall if the file has to be downloaded from the provider just to do local hashing in the Directus instance.
@hanneskuettner Provided the chunked upload still goes through the Directus instance, couldn't we just calculate the hash on the fly and update it as new chunks become available?
Also, we might want to use a very fast hashing algorithm that is intended for this kind of operation like xxhash.
> @hanneskuettner Provided the chunked upload still goes through the Directus instance, couldn't we just calculate the hash on the fly and update it as new chunks become available? Also, we might want to use a very fast hashing algorithm that is intended for this kind of operation like xxhash.
Problem is that we cannot continue hashing once the upload has paused, as we cannot access and store the hash state... See e.g. nodejs/node#25411 (comment)
Probably the same with xxhash?
@paescuj I actually like a lot of your changes on #22900 for the file hashing; they solved the issue I was having with trying to clone the stream. With your PR, should I stop working on #22751 in favor of yours, or could we combine them?
The hashing approach felt like the theoretical "correct" way to do it, but while testing it out in those PRs it does raise too many performance concerns and stability questions. The main goal of this implementation is to have the flag to indicate whether the file was changed, to solve that we don't need full checksumming. Juice ain't worth the squeeze 🙂