Learn decentralized Storage

How to store data in decentralized Applications and Smart Contracts. What are the best practices?

How to think about Storage

Storage, the hero of the digital age. It’s the backbone that upholds our entire digital existence. Just close your eyes for a moment and imagine a world without storage – a place where all your posts, your treasured family photos, and crucial business data simply vanish into the digital ether. It’s a chilling thought, isn’t it?

As more of our lives turn digital, storage becomes our new memory. Your brain has memories; digitally, we have storage. We want that storage to be very fast, or to last a long time without breaking. Ideally both, but here we’ll focus on longevity. How long will that storage last, and what is it for?

For NFTs and blockchain projects we typically want the data to be there forever, or at least for decades. How bad would it be if your favorite NFT picture were gone tomorrow because storage failed? Some projects keep their images and metadata modifiable so they can level up your token or track redemptions of some kind. Maybe they also want the option to fix errors. If you’re okay with that, it is fine. Otherwise you want permanent storage. So how do we get there?

Different types of storage try to achieve these goals. We’ll first go through Onchain storage, meaning that a blockchain is used to actually store our data. The likelihood of our storage being dead and gone is then the likelihood of the blockchain failing. If all your data is on Ethereum, how likely is it to be gone? Not very likely, right? Then, because of the disadvantages we’ll see, we’ll learn about the types of Offchain storage used to work around them. Finally we’ll discuss what to choose and look forward to the future.

Types of Onchain Storage

Fully Onchain

We can store all the data on the blockchain itself. This is Onchain storage (on the blockchain) so the guarantee about longevity is the same as for the blockchain itself.

This means that as long as the blockchain is around, the data is available. There is no risk of the data getting lost while the blockchain lives on, which would leave broken links behind.

Additionally, depending on the implementation, the data can’t be changed. What you see is what you get, forever. No risk of your legendary item turning into a normal one because the developers want it for themselves. You should still check the Smart Contract to see whether there really is no way of changing it, because storage can be onchain and still changeable. This is actually the norm. The owner of an NFT, for example, is stored onchain and changeable. You will always know who that NFT on Ethereum belongs to as long as Ethereum is around, but of course the owner has to be changeable, otherwise you couldn’t transfer your NFT to a friend or sell it.

Technically, Onchain storage is mostly a variable in the smart contract holding the data value, just like variables in any other program or fields in a database you already know from outside of blockchain.

Fully Onchain Storage visualized

If we are talking about NFT Metadata, there are two ways of doing fully onchain storage:

Fully stored Onchain
Fully generated Onchain

Fully stored Onchain

If we fully store our data onchain then we guarantee that it is available as long as the chain exists. How can we do that?

Typically we use Smart Contract variables, just like fields in a traditional database, and store the data values in them. Let us look at an example.

We want to store the loyalty points of our lemonade stand and we want all of it to be on the blockchain itself with the data being available at once without generation!

What we do is we store the amount of points belonging to the corresponding user.

// Loyalty points per user, keyed by wallet address; public so anyone can query it.
mapping(address => uint256) public pointsOf;

Here we assume the user is identified by a wallet address. If you change the key type to bytes32 you can map from other identifiers, such as a full name or customer ID, as long as the value fits into 32 bytes or you store a hash of it instead.

Onchain Storage access visualized

What we get is that we can ask pointsOf for the amount of points a user has. If we need extra information we just add variables and can then query all of them. One way to do that is to define a struct holding all the values and a single mapping from the user’s identifier to that struct. But what if we can infer some data by calculation and don’t need to store it?

Fully generated Onchain

Every write to onchain storage costs gas, and those fees add up quickly. If we can reduce the number of times we write to storage, we end up with huge savings!

So how do we do it?

We combine data we already have to derive what we need. Since that data is stored completely on chain we retain the guarantee that data is available as long as the underlying blockchain is.

Let’s stick with our lemonade stand loyalty program and assume we made the whole thing bigger. We added some alcoholic beverages and will not let children collect or redeem points for these.

If we have the user’s birthdate stored, we can get their age by a simple calculation, and we can also give them a bonus loyalty point if they come in on their birthday. How can we do this with just one stored variable?

Well, we have the birthdate. The difference between the birthdate and the current time is their age. By checking whether the day and month of the current time match those of the birthdate, we know whether it’s their birthday.

This is a simplistic example but it illustrates that you do not need to store every value and should look to infer from the minimal set of variables needed.

In the example below we leverage the fact that the blockchain exposes the current time as a timestamp (block.timestamp in the EVM). We store the birthdate for each address in our smart contract and compute the age from it in the ageOf(address) function.

Fully generated Onchain visualized
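To make the derivation concrete, here is a minimal sketch of the same idea, written in TypeScript rather than Solidity so it can be run without a chain. The names birthdateOf and isBirthday are illustrative assumptions, the `now` parameter stands in for block.timestamp, and all dates are treated as UTC.

```typescript
// Sketch only: an offchain model of "store one value, derive the rest".
// birthdateOf and isBirthday are hypothetical names chosen for illustration.

/** The single stored value per user: birthdate as a Unix timestamp in seconds. */
const birthdateOf: { [user: string]: number | undefined } = {};

/** Look up the stored birthdate or fail loudly if none was stored. */
function storedBirthdate(user: string): Date {
  const seconds = birthdateOf[user];
  if (seconds === undefined) {
    throw new Error("no birthdate stored for " + user);
  }
  return new Date(seconds * 1000);
}

/** Age in full years at time `now` (Unix seconds, like block.timestamp). */
function ageOf(user: string, now: number): number {
  const birth = storedBirthdate(user);
  const today = new Date(now * 1000);
  let age = today.getUTCFullYear() - birth.getUTCFullYear();
  // Subtract one year if this year's birthday has not happened yet.
  const beforeBirthday =
    today.getUTCMonth() < birth.getUTCMonth() ||
    (today.getUTCMonth() === birth.getUTCMonth() &&
      today.getUTCDate() < birth.getUTCDate());
  if (beforeBirthday) {
    age -= 1;
  }
  return age;
}

/** True if day and month of `now` match the stored birthdate. */
function isBirthday(user: string, now: number): boolean {
  const birth = storedBirthdate(user);
  const today = new Date(now * 1000);
  // Note: a birthdate on 29 February only matches in leap years.
  return (
    today.getUTCMonth() === birth.getUTCMonth() &&
    today.getUTCDate() === birth.getUTCDate()
  );
}
```

Neither the age nor the birthday flag is ever written to storage. Both are recomputed from the one stored birthdate whenever someone asks.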

But what if we have sooooo much data that we can’t store it on chain? For simple SVG images you can do the generation on chain, but what if we have movies? Let us look at linking to another chain.

Link to another chain

There are chains optimized for storage. These allow for much lower costs per unit compared to execution optimized blockchains like Ethereum.

We will be looking at the biggest storage optimized chains:

Arweave
IPFS + Filecoin

Here, our Smart Contract on the blockchain only remembers one variable: where to send us to look up the data we actually want. This is how most NFTs operate today. They send you somewhere other than the blockchain the Smart Contract (NFT) is on. That link may not even go to a storage blockchain, but we’ll cover that when talking about Offchain Storage.

So when the Smart Contract is asked “Hey, give me the data for this identifier”, it says “Sure, you can look it up at dtech.vision”, where dtech.vision stands for a link to the storage location. This way you only pay to store one short link on a blockchain like Ethereum, which saves costs. You are now dealing with two additional risks, though.

Additional Risks when not using Onchain Storage in your Smart Contract:

The lifetime of the other storage solution (e.g. a storage blockchain like Arweave) may be shorter than that of your Smart Contract -> links stop working, data inaccessible
The link you put in the Smart Contract may not be resolvable by the user asking for the data -> the data may still exist, but it is inaccessible

These additional risks should be considered when deciding on the optimal storage solution and thinking about cost. The cost of things breaking may be higher than the increased ongoing cost of writing to fully onchain storage whenever the variables change.

The example below illustrates having to look up the data behind the URL offchain instead of getting all of it directly from the smart contract on the blockchain.

Linking to Offchain storage from Blockchain SmartContract visualized

Arweave

What would a blockchain system look like whose single purpose is storing data for eternity? Arweave uses a blockweave (that’s where the name comes from): each block is linked both to the block that preceded it and to a recall block, a block from the earlier history of the blockweave. You can picture each piece of data living in a neighbourhood that checks whether it is still there. If you now duplicate it across multiple cities (nodes), you get redundancy and others checking that it does not disappear. This is basically what Arweave and its consensus algorithm do by asking new block proposers to prove that they still hold a randomly chosen historic piece of data.

Due to randomness the optimal solution to maximize rewards for Arweave storage providers is to store the full network.

The weave of Arweave

Each Arweave node stores some part of all the data on the network, so the chance of permanent loss is very low: if one node goes missing, others take on that storage and keep it. As long as there are enough nodes with free storage capacity, your data will not be deleted from Arweave.

The permanence guarantee that Arweave attempts to give is also why I list it under the types of onchain storage rather than offchain: it is not on the original blockchain where the Smart Contract lives, but it is as close as one gets.

Arweave files will, guaranteed by network design, be available as long as Arweave exists.

To link to Arweave we either use the protocol followed by our identifier (example: ar://) or an https link to a gateway (example: https://arweave.net/).

The benefit of direct gateway links is that any user knows how to deal with them. Say we take arweave.net as the gateway: our link then starts with https://arweave.net, which is like linking to any other web content. But if that gateway is down, no one can reach our content, since they treat the link like any normal link.

With a gateway link we have the issue of the gateway disappearing, which breaks our link. If the DNS name of the gateway has issues, we are equally doomed, as we can’t use the link. We could extract the identifier and query our own gateway, but then we might as well use ar:// links.

So what can go wrong with protocol links? The person or service we give the protocol link to may try to use it as a web link, and it will not work. You need to know that ar:// means Arweave and use the gateway of your choice to access it. Otherwise it is awesome, because we as user or service developer choose the gateway. If one breaks, we change our default gateway and the link is still fine. The link will work as long as Arweave is around!

That’s why I’d recommend the following:

When sending to people (e.g. messages, mail, social media posts) make it as easy as possible and send Gateway Links.
For anything else, use protocol links! Your API returns Arweave links? If they are not going to an end consumer, use protocol links. Your smart contract points to Arweave? You want permanence? Use protocol links. The receiver implements the gateway logic of their choosing and can use what works best for them, potentially their own nodes or gateways!

How do we deal with the gateway problem? If we use ar:// protocol links we need tool support to resolve them to a gateway, but the link itself never changes, and users and tools choose a gateway that is online. This way we sidestep DNS and gateway failures.

So what should we do? If you share a link on social media, use the form that works for most people and include the gateway. When building permanent infrastructure like Smart Contracts, use the ar:// protocol link so that it is future proof! You can’t predict which gateway will still work and should let tools and infrastructure resolve that. A simple solution is to strip ar:// and replace it with https://arweave.net/, and voilà, you have a working link created by tooling. Any other gateway works the same way.
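The strip-and-replace step described above could look roughly like this in TypeScript. The function name resolveArweaveLink and the placeholder identifier in the usage line are made up for illustration; arweave.net is only the default, and any gateway base URL can be passed instead.

```typescript
// Sketch only: resolveArweaveLink is a hypothetical helper, not part of any Arweave SDK.

/** Turn an ar:// protocol link into an https URL on the chosen gateway. */
function resolveArweaveLink(
  link: string,
  gateway: string = "https://arweave.net"
): string {
  const prefix = "ar://";
  if (link.slice(0, prefix.length) !== prefix) {
    throw new Error("not an ar:// link: " + link);
  }
  // Everything after ar:// is the identifier, optionally followed by a path.
  const rest = link.slice(prefix.length);
  if (rest.length === 0) {
    throw new Error("ar:// link has no identifier");
  }
  // Drop trailing slashes so we never produce a double slash.
  const base = gateway.replace(/\/+$/, "");
  return base + "/" + rest;
}

// Usage: the stored link stays the same, only the gateway argument changes.
const viaDefault = resolveArweaveLink("ar://EXAMPLE_TX_ID");
const viaOther = resolveArweaveLink("ar://EXAMPLE_TX_ID", "https://gateway.example/");
```

The Smart Contract keeps storing the protocol link, and each consumer decides at read time which gateway to put in front of it.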

Why do we want different gateways to exist in the first place? Why is it important?

With DNS we resolve the gateway’s name to an IP address, to where we are actually going. So arweave.net becomes the server we access and then ask for the content we wish to see.

But if that resolution is faulty or manipulated, we can get anything back and have no guarantees. With ar:// and a gateway of our own choosing we can simply switch to a non-manipulated gateway. With the fixed link we can’t, unless we implement the same logic that makes ar:// work in the first place. So the full gateway link has no benefit once you start thinking about security and resilience.
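Switching gateways can be automated. The following TypeScript sketch walks through a list of gateways and returns the URL on the first one that passes a check supplied by the caller. The names urlOn and pickGatewayUrl and the example hosts are assumptions for illustration. In real tooling the check would be an asynchronous HTTP request; it is a plain predicate here so the selection logic stays visible.

```typescript
// Sketch only: urlOn and pickGatewayUrl are hypothetical helpers.

/** Build the URL for one identifier on one gateway. */
function urlOn(gateway: string, id: string): string {
  return gateway.replace(/\/+$/, "") + "/" + id;
}

/**
 * Try the gateways in order and return the URL on the first one that the
 * caller-supplied check accepts. Throws if no gateway is usable.
 */
function pickGatewayUrl(
  id: string,
  gateways: string[],
  isUsable: (url: string) => boolean
): string {
  for (let i = 0; i < gateways.length; i++) {
    const url = urlOn(gateways[i], id);
    if (isUsable(url)) {
      return url;
    }
  }
  throw new Error("no usable gateway for " + id);
}

// Usage with a stand-in check that pretends the first gateway is down.
const preferredOrder = ["https://down.example", "https://up.example/"];
const chosenUrl = pickGatewayUrl("EXAMPLE_TX_ID", preferredOrder, function (url) {
  return url.indexOf("https://up.example") === 0;
});
```

A fixed gateway link cannot do this without first extracting the identifier again, which is the same work a protocol link avoids.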

Though like stated earlier the full link including the gateway is what’s compatible with what we have now in terms of social link sharing and browser support. Just make sure you understand the trade offs outlined here.

Because Arweave assigns a unique identifier to each uploaded file, we get a new link for each upload, so permanent names are not directly possible. If a new file version should be referred to, we update the pointer in the Smart Contract or platform we use. Or we leverage the Arweave Name System, a project aiming to bring human-readable permanent names to content on Arweave, so that “dtechsnewblogpost” always points to the newest blog post even though the underlying upload changes.

To learn how to use Arweave in your application, please refer to our Arweave documentation

The Arweave ecosystem has expanded quite far beyond simple storing of files to databases and smart contracts running on the permaweb (the permanent version of the web based on permanent storage on Arweave). To learn more please refer to our Arweave Section in the Documentation.

IPFS + Filecoin

IPFS itself does not guarantee permanence, as it relies on nodes to “pin” (keep) your content. If no node wants to keep your content, it is gone. This is not the case with Arweave, which is designed to keep it.

For links to IPFS content you can use either protocol links (ipfs://) or links to a gateway (https://ipfs.io/ipfs/). For a discussion of when to use which, and their pros and cons, please refer to the Arweave section above.
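Resolving an ipfs:// link works much like the Arweave case, with one difference: HTTP gateways serve IPFS content under an /ipfs/ path prefix. A minimal TypeScript sketch, where ipfsToGatewayUrl is a hypothetical helper name and EXAMPLE_CID is a placeholder rather than a real content identifier:

```typescript
// Sketch only: ipfsToGatewayUrl is a hypothetical helper, not an IPFS library call.

/** Turn an ipfs:// protocol link into a path-style URL on the chosen gateway. */
function ipfsToGatewayUrl(
  link: string,
  gateway: string = "https://ipfs.io"
): string {
  const prefix = "ipfs://";
  if (link.slice(0, prefix.length) !== prefix) {
    throw new Error("not an ipfs:// link: " + link);
  }
  // Everything after ipfs:// is the CID, optionally followed by a path.
  const rest = link.slice(prefix.length);
  if (rest.length === 0) {
    throw new Error("ipfs:// link has no CID");
  }
  return gateway.replace(/\/+$/, "") + "/ipfs/" + rest;
}

// Usage: same stored link, different gateways.
const publicUrl = ipfsToGatewayUrl("ipfs://EXAMPLE_CID/metadata.json");
const ownNodeUrl = ipfsToGatewayUrl("ipfs://EXAMPLE_CID", "https://my-node.example/");
```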

But you can back IPFS with Filecoin, a project aiming to provide the permanence guarantee. Combined, you get IPFS for tooling and accessibility, with Filecoin as the backing that keeps your data stored permanently.

The public documentation of both IPFS and Filecoin has guides to get you started. The advantages and disadvantages discussed in the Arweave section apply here too. An overview including cost is presented at the end of this post.

When Filecoin is not used for permanence, IPFS falls under the semipermanent offchain storage category.

Remarks

I intentionally put linking to a storage blockchain under Onchain Storage, because it is onchain, just not on the original chain where the Smart Contract lives. As long as you can guarantee that your links do not break and that the other blockchain stays alive for as long as you need it, or even as long as the original blockchain, you are unlikely to run into issues and you save costs.

You want to check whether the storage solution fits your requirements, though. Arweave, for example, doesn’t let you change the data available at a given link. If you change data on Arweave you need to change the link. There are solutions to this, but they require more effort and therefore add complexity.

Types of Offchain Storage

Link to semipermanent storage

When linking to semipermanent storage we get no guarantee of permanence. I call it semipermanent because the data could end up being stored permanently, but nothing guarantees it.

IPFS

IPFS is a distributed file storage network, where anyone can join and store data on the network. This allows for anyone to also host the content you host and even keep hosting it after you want it deleted, still making it accessible.

This means there is no delete on IPFS other than everyone on the network no longer “pinning” (keeping) the content. But that can happen, so you also get no guarantee that it is never gone (deleted).

There are pinning service providers you can pay a subscription fee to keep your content on IPFS for as long as you pay, or you can run your own node.

Other than permanence, the decentralized nature of IPFS behaves similarly to the permanent storage solutions discussed under linking to another chain: we can use ipfs:// protocol links as well as direct links including the gateway. The tradeoff is again future proofing and potentially corrupt gateways, as discussed before.

Also, if you want to change a file on IPFS you need to re-upload it and get a new link, which is the same on Arweave.

IPNS

But what if you don’t want a new link? Say you want one link that stays valid even when the file on IPFS changes. Enter IPNS (InterPlanetary Name System), a system for creating such mutable pointers to CIDs, known as names or IPNS names. IPNS names can be thought of as links that can be updated over time while retaining the verifiability of content addressing.

It gives the same guarantees as IPFS or IPFS with permanence on Filecoin and you can read more about it in the IPFS Docs about IPNS.

Link to Offchain Storage

When resilience through decentralization and permanence don’t matter for your use case, or running a server is cheaper than the one-time upload cost to Arweave, you can always leverage your existing infrastructure.

There are two options to consider when running your own offchain infrastructure:

AWS (or any other Cloud provider)
Own Server

When using your own server you at least gain full control and are the king of your data, while cloud providers can give you better scalability, speed and potentially ease of use, as you may already use them and don’t need to teach people the upload process of other solutions.

Especially for NFT Metadata this is considered highly suboptimal, as you offer no permanence guarantee. If you don’t pay your monthly bill the data is gone, whereas on Arweave or Onchain you pay once and have the data as long as the network lives.

You are also at the mercy of the cloud provider and your regulators as to the extent to which you can freely choose what to host and who to serve it to. Imagine a country being banned: its users can no longer see your data.

Why choose Onchain or Offchain Storage?

Encryption and privacy are also things to consider. Putting encrypted data that only the intended receiver can read on a permanent solution like a Blockchain or Arweave may not be a good idea. Say your encryption is broken in 10 years: then the data becomes readable. If the encryption is strong enough that the odds of it being broken within your intended timeframe are low, you are fine. This needs to be checked with lawyers and your product team, though. And yes, you can’t change it once it’s on the permanent storage layer, else it would not be permanent ;)

One of the main reasons for using Offchain Storage is exactly this concern as you may consider anything that goes onto a Blockchain or permanent data storage layer (potentially even semi-permanent layer) to be public. With Offchain Storage you have access controls not available on public networks as of today, though Zero Knowledge (ZK) technology and other cryptographic tools being developed may make these available on decentralized networks at some point in the future.

From a cost standpoint:

Onchain: gas fees are network dependent, but very expensive for large amounts of data; very little capacity
Permanent offchain: one-time fee for upload (Arweave), works fine with large or small amounts of data
Offchain: mostly monthly fees for managed services, covering not only storage but also traffic, or the cost of managing your own hardware
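To get a feeling for why Onchain storage gets expensive for large data, here is a rough back-of-the-envelope estimate in TypeScript. It assumes the EVM, where writing a fresh 32-byte storage slot costs on the order of 20,000 gas. The gas price is an input you have to supply, the function names are made up for this sketch, and transaction overhead and access surcharges are ignored, so treat the result as a lower bound and not as a quote.

```typescript
// Sketch only: rough lower-bound estimate, not a fee calculator.
// Assumption: about 20,000 gas to write one previously empty 32-byte slot on the EVM.
const GAS_PER_NEW_SLOT = 20000;
const BYTES_PER_SLOT = 32;

/** Number of 32-byte storage slots needed to hold `bytes` bytes. */
function slotsNeeded(bytes: number): number {
  return Math.ceil(bytes / BYTES_PER_SLOT);
}

/** Approximate gas needed to write `bytes` bytes into fresh storage slots. */
function storageGas(bytes: number): number {
  return slotsNeeded(bytes) * GAS_PER_NEW_SLOT;
}

/** Approximate cost in gwei for a gas price given in gwei per gas. */
function storageCostGwei(bytes: number, gasPriceGwei: number): number {
  return storageGas(bytes) * gasPriceGwei;
}

// Usage: one loyalty counter versus one mebibyte of image data.
const oneCounterGas = storageGas(32);
const oneMebibyteGas = storageGas(1024 * 1024);
```

Under these assumptions a single counter needs one slot, while one mebibyte needs 32,768 slots and roughly 655 million gas, which is far more than a single Ethereum block can hold. That is the "very little capacity" from the list above expressed in numbers.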

Looking forward

Other honorable mentions to look into and dive deeper:

Arweave Name System (ArNS)
WarpContracts (Smart Contracts on Arweave)
WeaveDB (database on top of Arweave)
KwilDB (database on top of Arweave)
Arweave’s GraphQL interface
Sia
Ceramic Network
Storj
ICP (Internet Computer - Dfinity)

Thanks to Sam for pointing out the Arweave options on X here

Link article on Farcaster/Warpcast: https://farcaster.xyz/samuellhuber/0x7e0b3245

Fun little add-on: here is what an LLM (codellama) has to say

>>> would you use arweave, ipfs or aws to store your valuable data?

It depends on the specific use case and requirements. However, I can provide some general insights on each option:

1. Arweave: Arweave is a decentralized data storage network that allows users to store data in a highly secure and fault-tolerant
manner. It uses a blockchain-based architecture to ensure data integrity and authenticity. Arweave charges a small fee for storing
data, but it provides a high level of security and durability.
2. IPFS: InterPlanetary File System (IPFS) is a decentralized storage system that allows users to store and share files in a
peer-to-peer manner. It is built on top of the blockchain technology and uses a distributed hash table (DHT) to store and retrieve
data. IPFS provides a highly secure and decentralized storage solution, but it may not be as scalable or cost-effective as other
options.
3. AWS: Amazon Web Services (AWS) is a cloud computing platform that offers a wide range of services for storing, processing, and
analyzing data. It provides a high level of security, scalability, and reliability, but it can be more expensive than other
options.