Context
During the same conversation I had yesterday with @angelorc, already mentioned in #60, he raised another possible problem about our chain: the storage of posts.
He told me that in Bitsong they have spent the whole year testing different methods of storing song data and metadata, so that the chain always performs well without growing larger than it should. In the end, the best approach they found was to store all the metadata on IPFS and link the CID (the IPFS hash of the file containing that data) on the chain itself.
After that conversation, I made some estimates of how large our chain could become if we keep all post contents on-chain. This is what I got.
Chain space analysis
Premises
All of the following analysis refers to the JSON representation of posts, which is the one used when exporting them to the genesis file or importing them during an upgrade. This is because it is simpler to measure the size of a JSON file than the space it occupies inside the KV stores.
Current status
Currently, posts are stored using the following JSON structure:
{
  "id": "134",
  "parent_id": "156",
  "message": "This is an estimated average long post message",
  "created": "566879",
  "last_edited": "899978225",
  "allows_comments": false,
  "subspace": "dwitter",
  "optional_data": {
    "local_id": "e3b98a4da31a127d4bde6e43033f66ba274cab0eb7eb1c70ec41402bf6273dd8"
  },
  "owner": "desmos1wjtg20d7hl9y409hhfydeqaph5pnfmzxlgxjg0"
}
In this example I've used an average-length value for each field, and assumed the optional_data object contains a single key, which should be the average case. This JSON has an on-disk size of 378 bytes.
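To reproduce this measurement, one can marshal such a post and check the resulting length. Below is a minimal Go sketch; the Post struct and its field tags are assumptions modeled on the JSON above, not the actual Desmos types:

package main

import (
    "encoding/json"
    "fmt"
)

// Post mirrors the JSON structure above; field names and types are
// assumptions made for this estimate, not the real Desmos types.
type Post struct {
    ID             string            `json:"id"`
    ParentID       string            `json:"parent_id"`
    Message        string            `json:"message"`
    Created        string            `json:"created"`
    LastEdited     string            `json:"last_edited"`
    AllowsComments bool              `json:"allows_comments"`
    Subspace       string            `json:"subspace"`
    OptionalData   map[string]string `json:"optional_data"`
    Owner          string            `json:"owner"`
}

func main() {
    post := Post{
        ID:         "134",
        ParentID:   "156",
        Message:    "This is an estimated average long post message",
        Created:    "566879",
        LastEdited: "899978225",
        Subspace:   "dwitter",
        OptionalData: map[string]string{
            "local_id": "e3b98a4da31a127d4bde6e43033f66ba274cab0eb7eb1c70ec41402bf6273dd8",
        },
        Owner: "desmos1wjtg20d7hl9y409hhfydeqaph5pnfmzxlgxjg0",
    }

    // MarshalIndent approximates a pretty-printed genesis export; the
    // exact byte count depends on indentation and key ordering.
    bz, err := json.MarshalIndent(post, "", "  ")
    if err != nil {
        panic(err)
    }
    fmt.Printf("size: %d bytes\n", len(bz))
}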
Now, with the introduction of media-supporting posts (#36), an average JSON file of this kind could look like the following:
{
  "id": "134",
  "parent_id": "156",
  "message": "This is an estimated average long post message",
  "created": "566879",
  "last_edited": "899978225",
  "allows_comments": false,
  "subspace": "dwitter",
  "optional_data": {
    "local_id": "e3b98a4da31a127d4bde6e43033f66ba274cab0eb7eb1c70ec41402bf6273dd8"
  },
  "owner": "desmos1wjtg20d7hl9y409hhfydeqaph5pnfmzxlgxjg0",
  "medias": [
    {
      "uri": "http://ipfs.pics/QmXr1iejP2zHptFFDr3hycZvbaXaQNwrK6VVXYbxFAYQ7x",
      "mime_type": "image/png"
    },
    {
      "uri": "http://ipfs.pics/QmTbx9HLaN7gsKiunNc5NWvNRytydKn5uWWmpUzAHPU3NS",
      "mime_type": "image/png"
    },
    {
      "uri": "http://ipfs.pics/QmQZWZxfSHB3FpU8w4EMfX9bF52wUyPCiVNjtjf6jeNBBp",
      "mime_type": "image/png"
    }
  ]
}
Due to the larger amount of information stored, this JSON has an on-disk size of 768 bytes.
Let's now consider the case in which Desmos gains 1/1000th of Twitter's usage. This could lead to 2 million posts per year (source of Twitter usage stats: InternetLiveStats).
Assuming an average of 1 in 5 posts being a media post, this would lead to 400,000 media posts/year and 1,600,000 text posts/year. At 378 bytes per text post and 768 bytes per media post, this sums up to approximately 0.92GB of posts per year.
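For clarity, here is the arithmetic behind that figure, as a quick Go sketch using only the numbers stated above:

package main

import "fmt"

func main() {
    const (
        textPostSize  = 378       // bytes per text post (measured above)
        mediaPostSize = 768       // bytes per media post
        textPosts     = 1_600_000 // posts/year
        mediaPosts    = 400_000   // posts/year
    )

    totalBytes := textPosts*textPostSize + mediaPosts*mediaPostSize
    fmt.Printf("%.2f GB/year\n", float64(totalBytes)/1e9)
    // Prints 0.91 GB/year, i.e. the ~0.92GB/year figure quoted above
}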
If we scale up to 1/100th of Twitter's usage, we would collect 9.2GB/year of posts alone. This amount is far too high, and I personally think we should reduce it: in the long run it might become a problem, considering there are many post types we still need to define (e.g. #14).
Solution
Storing all the post content inside an IPFS file could be a good way to decrease the on-disk size of the chain. What we could do is define a new Post structure that is stored as follows:
{
  "id": "87556",
  "created": "566879",
  "content": "QmP8jTG1m9GSDJLCbeWhVSVgEzCPPwXRdCRuJtQ5Tz9Kc9",
  "owner": "desmos1wjtg20d7hl9y409hhfydeqaph5pnfmzxlgxjg0"
}
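In Go terms, the on-chain type would shrink to something like the following sketch (field names are illustrative and do not reflect actual Desmos code):

// A sketch of the lighter on-chain type; names are illustrative only.
type Post struct {
    ID      string `json:"id"`
    Created string `json:"created"`
    // Content holds the IPFS CID of the file containing the actual
    // post data (message, medias, ...); only this reference is
    // stored on-chain.
    Content string `json:"content"`
    Owner   string `json:"owner"`
}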
Inside the IPFS file reachable through the content reference, we can then have different JSON structures based on the type of the post itself. As an example, we could have:
Text post
{
  "type": "post/Text",
  "parent_id": "156",
  "message": "This is an estimated average long post message",
  "allows_comments": false,
  "subspace": "dwitter",
  "optional_data": {
    "local_id": "e3b98a4da31a127d4bde6e43033f66ba274cab0eb7eb1c70ec41402bf6273dd8"
  }
}
Media post
{
  "type": "post/Media",
  "parent_id": "156",
  "message": "This is an estimated average long post message",
  "allows_comments": false,
  "subspace": "dwitter",
  "optional_data": {
    "local_id": "e3b98a4da31a127d4bde6e43033f66ba274cab0eb7eb1c70ec41402bf6273dd8"
  },
  "medias": [
    {
      "uri": "http://ipfs.pics/QmXr1iejP2zHptFFDr3hycZvbaXaQNwrK6VVXYbxFAYQ7x",
      "mime_type": "image/png"
    },
    {
      "uri": "http://ipfs.pics/QmTbx9HLaN7gsKiunNc5NWvNRytydKn5uWWmpUzAHPU3NS",
      "mime_type": "image/png"
    },
    {
      "uri": "http://ipfs.pics/QmQZWZxfSHB3FpU8w4EMfX9bF52wUyPCiVNjtjf6jeNBBp",
      "mime_type": "image/png"
    }
  ]
}
Of course, these are just generic examples: better schemas should be defined in more depth, and maybe even stored on-chain as a common reference.
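To give an idea of how clients could consume these files, here is a minimal Go sketch that decodes the type field first and then unmarshals into the matching structure. The TextPost and MediaPost types are hypothetical mirrors of the JSON examples above, not existing Desmos code:

package posts

import (
    "encoding/json"
    "fmt"
)

// TextPost and MediaPost are hypothetical Go mirrors of the JSON
// examples above.
type TextPost struct {
    Type           string            `json:"type"`
    ParentID       string            `json:"parent_id"`
    Message        string            `json:"message"`
    AllowsComments bool              `json:"allows_comments"`
    Subspace       string            `json:"subspace"`
    OptionalData   map[string]string `json:"optional_data"`
}

type Media struct {
    URI      string `json:"uri"`
    MimeType string `json:"mime_type"`
}

type MediaPost struct {
    TextPost
    Medias []Media `json:"medias"`
}

// parseContent decodes the "type" discriminator first, then
// unmarshals the raw IPFS file into the matching structure.
func parseContent(raw []byte) (interface{}, error) {
    var meta struct {
        Type string `json:"type"`
    }
    if err := json.Unmarshal(raw, &meta); err != nil {
        return nil, err
    }

    switch meta.Type {
    case "post/Text":
        var post TextPost
        if err := json.Unmarshal(raw, &post); err != nil {
            return nil, err
        }
        return post, nil
    case "post/Media":
        var post MediaPost
        if err := json.Unmarshal(raw, &post); err != nil {
            return nil, err
        }
        return post, nil
    default:
        return nil, fmt.Errorf("unknown post type: %s", meta.Type)
    }
}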
On-disk space changes
Thanks to this approach, every post, regardless of its type, would have an on-disk size of just 166 bytes, which would reduce the yearly disk-space increase above from 0.92GB/year to 0.332GB/year (a reduction of roughly 64%) considering 2,000,000 posts/year.
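For reference, the math behind those numbers (the ~64% figure comes from comparing the two yearly totals):

package main

import "fmt"

func main() {
    const (
        postSize  = 166       // bytes per post with the IPFS approach
        posts     = 2_000_000 // posts/year
        oldGBYear = 0.92      // GB/year with the current approach
    )

    newGBYear := float64(postSize*posts) / 1e9
    fmt.Printf("%.3f GB/year (%.0f%% less)\n",
        newGBYear, (1-newGBYear/oldGBYear)*100)
    // Prints: 0.332 GB/year (64% less)
}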
The best part is that this approach keeps the disk-space growth linear and completely independent of the posts' contents. If all users suddenly started creating media posts, those would weigh the same as text posts on-chain, so this also acts as a spam-prevention mechanism.
Querying
Thanks to the system proposed in #60, the parser could simply read the posts from the chain, fetch the real content from IPFS, index it, and put it into a database, allowing for easier queries.
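As a rough sketch of that flow, assuming the content is fetched through a public IPFS gateway (the gateway URL and the fetchContent helper are illustrative, not part of the actual parser):

package main

import (
    "fmt"
    "io"
    "net/http"
)

// fetchContent downloads a post's content through a public IPFS
// gateway; the gateway URL is an assumption, a local node or a
// cluster would work just as well.
func fetchContent(cid string) ([]byte, error) {
    resp, err := http.Get("https://ipfs.io/ipfs/" + cid)
    if err != nil {
        return nil, err
    }
    defer resp.Body.Close()

    if resp.StatusCode != http.StatusOK {
        return nil, fmt.Errorf("unexpected status: %s", resp.Status)
    }
    return io.ReadAll(resp.Body)
}

func main() {
    // CID taken from the on-chain Post example above.
    raw, err := fetchContent("QmP8jTG1m9GSDJLCbeWhVSVgEzCPPwXRdCRuJtQ5Tz9Kc9")
    if err != nil {
        panic(err)
    }
    fmt.Printf("fetched %d bytes, ready to be indexed into the DB\n", len(raw))
}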
Considerations on IPFS
IPFS has proven to be reliable, and we could further improve availability by creating an IPFS cluster or a private network so that post contents are always reachable, although I don't think this will be needed.
Cons of this approach
The main con is that the chain would not be able to verify the contents of the posts during transaction processing. However, I don't know whether that would actually be a problem for us.
Also, clients would have to upload the content to IPFS before broadcasting the transaction, so we would be moving that responsibility onto them. This could however be addressed by creating a REST API that performs the upload on behalf of the client.
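Such an API could look roughly like the following hypothetical handler, which pins the received content to IPFS and returns the CID for the client to embed in its transaction. The /ipfs/pin endpoint, the pinToIPFS helper and the response shape are all made up for illustration:

package main

import (
    "encoding/json"
    "io"
    "net/http"
)

// pinToIPFS is a placeholder: a real implementation would talk to an
// IPFS node (e.g. via its HTTP API) and return the resulting CID.
func pinToIPFS(content []byte) (string, error) {
    return "QmExampleCID", nil
}

func uploadHandler(w http.ResponseWriter, r *http.Request) {
    content, err := io.ReadAll(r.Body)
    if err != nil {
        http.Error(w, err.Error(), http.StatusBadRequest)
        return
    }

    cid, err := pinToIPFS(content)
    if err != nil {
        http.Error(w, err.Error(), http.StatusInternalServerError)
        return
    }

    // The client puts this CID inside the "content" field of its tx.
    json.NewEncoder(w).Encode(map[string]string{"content": cid})
}

func main() {
    http.HandleFunc("/ipfs/pin", uploadHandler)
    http.ListenAndServe(":8080", nil)
}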
Conclusion
I personally think this is the way to follow, as it would allow us to:
- focus less on the development of the posts module: once the specification for each post type is defined, we no longer need to change the way posts are stored on-chain;
- achieve higher scalability of the chain itself: as posts are lighter on disk, the chain can scale better and transactions will be processed faster.
I would love to hear @kwunyeung's and @bragaz's feedback on this, and on what might be improved or changed in this approach.