
Comments (4)

regisb avatar regisb commented on July 28, 2024

@ormsbee Thanks for taking the time to thoroughly answer my questions, I appreciate it.

I can understand the cost/simplicity argument for choosing file-based storage. For large amounts of data, such as those required by edX, cost and simplicity are important factors to take into account.

That being said, file-based data storage is not all flowers and lollipops 🌹 🍭 Basically, what we are building here is a database. And databases usually have nice properties that we are so used to that we tend to forget they even exist. For SQL databases, these properties are summarised as ACID. With file-based databases, it's difficult to guarantee Atomicity and Isolation, and very difficult to guarantee Consistency.

For instance, what happens in case of power failure? In MySQL, a transaction either succeeds entirely, or fails entirely. In blockstore, a power failure or crash in the middle of writing a bundle would leave us with corrupt data:

Here is the function responsible for writing a snapshot to disk:

def create_snapshot(self, bundle_uuid, paths_to_files):
    files = {}
    for path, data in paths_to_files.items():
        files[str(path)] = self._save_file(bundle_uuid, path, data)
    return self._create_snapshot(bundle_uuid, files)

There are ways to make this method ACID, but they are not trivial. If you try to improve this function, basically you are going to re-create a file-based database, and there are other, very sophisticated tools that already do that.

I would suggest using plain old PostgreSQL. Just store binary blobs and JSON files there. Costs can be reduced by limiting read/write calls and not using RDS. For serving large binary assets, such as videos, we can use a file-based caching layer, such as HAProxy, or even S3. That way, we shift the responsibility of serving assets away from the blockstore, which then becomes a simpler component.

In addition, choosing PostgreSQL is a first step for moving Open edX away from MySQL, which would be an improvement.

from blockstore.

ormsbee avatar ormsbee commented on July 28, 2024

@regisb

With file-based databases, it's difficult to guarantee Atomicity, and Isolation, and very difficult to guarantee Consistency.

For instance, what happens in case of power failure? In MySQL, a transaction either succeeds entirely, or fails entirely. In blockstore, a power failure or crash in the middle of writing a bundle would leave us with corrupt data:

I agree with you in the general case, but I believe that the way we're using files guards against this when creating new versions. Putting files one by one into a file system isn't going to be ACID compliant without a lot of extra complexity (which I have no desire to add to Blockstore). But the result of an abrupt failure in this scenario should be a little wasted space, and not data corruption.

The overall flow of creating a new BundleVersion is:

  1. Create data files, with naming derived from a hash of the data.
  2. Create a summary JSON file that points to the files that were created. This is the Snapshot.
  3. Create a BundleVersion in the database that points to the Snapshot.
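The three steps above could be sketched roughly like this (a minimal, hypothetical illustration -- the function and storage layout are invented for this sketch, not Blockstore's actual API; the real Snapshot also includes a create time, omitted here for brevity):

```python
import hashlib
import json
import os

def create_bundle_version(root, paths_to_data, db):
    """Hypothetical sketch of the three-step BundleVersion flow."""
    os.makedirs(os.path.join(root, "data"), exist_ok=True)
    os.makedirs(os.path.join(root, "snapshots"), exist_ok=True)

    # 1. Create data files, named after a hash of their content.
    files = {}
    for path, data in paths_to_data.items():
        digest = hashlib.sha256(data).hexdigest()
        file_path = os.path.join(root, "data", digest)
        if not os.path.exists(file_path):  # a retry simply reuses the file
            with open(file_path, "wb") as f:
                f.write(data)
        files[str(path)] = digest

    # 2. Create a summary JSON file pointing at those files (the Snapshot).
    # (Real Blockstore also puts a create time in the snapshot content.)
    snapshot = {"files": files}
    snapshot_id = hashlib.sha256(
        json.dumps(snapshot, sort_keys=True).encode()
    ).hexdigest()
    with open(os.path.join(root, "snapshots", snapshot_id + ".json"), "w") as f:
        json.dump(snapshot, f)

    # 3. Record a BundleVersion that points at the Snapshot
    # (a plain list stands in for the database here).
    db.append({"snapshot": snapshot_id})
    return snapshot_id
```

A crash between any two of these steps leaves, at worst, unreferenced files behind, because nothing points at a data file until the Snapshot exists, and nothing points at the Snapshot until the BundleVersion row exists.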

File writes are atomic on object storage systems, and they can be made atomic on most file systems by writing to a temp file and then doing a move (again, not sure if Django storage does the right thing here). So now let's say we're creating a new version in this way and we have a sudden failure:
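The temp-file-plus-move pattern mentioned above is commonly done like this on POSIX systems (a generic sketch, not Blockstore or Django storage code; `os.replace` is atomic on POSIX, so readers see either the old file or the complete new one, never a partial write):

```python
import os
import tempfile

def atomic_write(path, data: bytes):
    """Write data to path so that readers never observe a partial file."""
    dir_name = os.path.dirname(os.path.abspath(path))
    # Create the temp file in the same directory so the rename stays
    # within one filesystem (cross-device renames are not atomic).
    fd, tmp_path = tempfile.mkstemp(dir=dir_name)
    try:
        with os.fdopen(fd, "wb") as f:
            f.write(data)
            f.flush()
            os.fsync(f.fileno())  # ensure data hits disk before the rename
        os.replace(tmp_path, path)  # atomic rename over the target
    except BaseException:
        os.unlink(tmp_path)  # clean up the temp file on failure
        raise
```

A crash before the `os.replace` leaves only a stray temp file; a crash after it leaves the new file fully in place.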

If failure happens during data file creation, then those files are basically orphaned. No summary JSON file is ever created, so no Snapshot exists. We just have junk data lying around. But re-trying will want to create the same data files over again, and since the files are named after their content hash, we will end up using those files for the next Snapshot.

If failure happens after Snapshot creation but before the BundleVersion is created, then we have a Snapshot file that's never referenced. Re-trying will create a new Snapshot (the create time is part of the snapshot content), and the BundleVersion will point to that new Snapshot. It's not atomic or consistent in totality, but it is for the parts we actually care about.

If failure happens during BundleVersion creation, we are now purely in the database, and the net result of a rollback is again a Snapshot that has no reference to it.
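That last case can be illustrated with an ordinary database transaction (using sqlite3 purely as a stand-in -- not Blockstore's actual MySQL schema -- and a hypothetical one-column table):

```python
import sqlite3

# Stand-in for the metadata database. If the transaction creating the
# BundleVersion row fails, the rollback leaves the database untouched;
# the Snapshot file already on disk simply becomes orphaned junk.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE bundle_version (id INTEGER PRIMARY KEY, snapshot TEXT)"
)

snapshot_id = "abc123"  # hypothetical Snapshot already safely written to storage
try:
    with conn:  # opens a transaction; commits on success, rolls back on error
        conn.execute(
            "INSERT INTO bundle_version (snapshot) VALUES (?)", (snapshot_id,)
        )
        raise RuntimeError("simulated crash mid-transaction")
except RuntimeError:
    pass

rows = conn.execute("SELECT COUNT(*) FROM bundle_version").fetchone()[0]
# rows is 0: no half-written BundleVersion, only an unreferenced Snapshot.
```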

The abrupt failure scenario is more of a potential issue with Drafts, but one that I hope we've mitigated in other ways. Typical writes to Draft files in the Studio scenario happen one at a time (as changes are made to a given XBlock). Every import would get its own Draft, so imports wouldn't step on each other; a failure mid-import means that you have some junk left in the system, but it has not caused corruption -- your next attempt to import would create a new Draft.

We're also making the dependency between Drafts and Bundles one-way, to make it easier to throw out the Drafts implementation we have today if those assumptions turn out to be wrong.

FWIW, the reason BundleVersion and Snapshots are separate things is mostly to have a stronger separation between where we're going to hang the data itself (Snapshot) vs. the metadata that references it (tags, search, etc.). This makes it easier for us to pull down data that violates licensing agreements without disrupting a bunch of plugins which may need to update asynchronously. It also lets us walk back from the storage decisions we're making today and migrate to something else (whether that's a database, a version control system, or something else), with less disruption to those same plugins.

I would suggest using plain old PostgreSQL. Just store binary blobs and JSON files there. Costs can be reduced by limiting read/write calls and not using RDS. For serving large binary assets, such as videos, we can use a file-based caching layer, such as HAProxy, or even S3. That way, we shift the responsibility of serving assets away from the blockstore, which then becomes a simpler component.

Managing a 40+ TB replicated Postgres database + caching layer is more operational complexity than we want to take on. Putting metadata in the database via hosted RDS and the raw data files in an object store strikes a balance between cost and operational complexity that we're more confident about.

In addition, choosing PostgreSQL is a first step for moving Open edX away from MySQL, which would be an improvement.

I am not a fan of MySQL. Most of the bundles models.py docstring is me explaining the horribleness of MySQL. But for the sake of overall stack simplification, I don't want to introduce a transitional period where for years we're going to have part of the Open edX stack on MySQL and part of it on PostgreSQL. As much as MySQL annoys me, there are more important things to work on, and the feature set difference is not so compelling that it makes up for the long term pain of understanding both systems operating at scale.


I realize that going with file/object storage is not all "flowers and lollipops". But I think it's currently the most pragmatic tradeoff we can make between simplicity, scale, and cost. I definitely agree that it's an area where we could fall into a trap of trying to graft on features that our primitives don't support well. It's definitely something we need to keep an eye on going forward.


bradenmacdonald avatar bradenmacdonald commented on July 28, 2024

Great discussion so far, guys - thanks!

A couple other points I want to mention:

A lot of the data that will eventually be stored in blockstore (everything on the "Files & Uploads" page in Studio) is currently stored as blobs in MongoDB (GridFS), and that approach is awful. Regardless of the decision about where XBlock data & metadata gets stored, moving the image/PDF/video/etc. files that they use to a proper object store (which can more easily be fronted by a CDN) is a huge win, and helps us get rid of MongoDB.

Remember that most reading + searching queries done in the system will happen in the LMS, and the LMS is unlikely to talk to Blockstore directly, ever. XBlock data will be read from some intermediate caching system like the course blocks API (with hot data stored in redis/memcached) and course content searching will be done from ElasticSearch.

Even in Studio, which will talk to Blockstore directly, most read operations other than "fetch the data of one specific XBlock for rendering in the XBlock runtime" should be handled by Blockstore's SQL DB and/or ElasticSearch, and not require reads to S3.

I would love to move to PostgreSQL too, but the fact is that we haven't been able to do the (relatively straightforward) MySQL 5.6->5.7 upgrade yet, nor even the utf8->utf8mb4 upgrade, both of which I think offer major improvements (efficient JSON fields, emoji support). That should give us pause when considering anything that's orders of magnitude harder, like a MySQL->PostgreSQL change.


ormsbee avatar ormsbee commented on July 28, 2024

Closing this for now, but happy to reopen if folks want to discuss further.

