Do you know what your data's up to?

Historically, your data hasn’t been in your control. It lives with the institutions you interact with: large tech companies, governments, health care providers, etc. This applies to both physical records, of which there are increasingly fewer, and digital records, of which there are increasingly more. There’s so much digital data that most of it goes untouched, sitting around doing nothing. When something is done with the data, given the misaligned incentives of our existing financial system, it is often used to extract value from individuals: selling data to advertisers, facilitating political agendas, and dragging you deeper into walled-garden ecosystems. Chris Dixon’s essay Why Decentralization Matters clearly articulates the implications of our risk-seeking behavior when engaging with centralized systems.

Source: https://onezero.medium.com/why-decentralization-matters-5e3f79f7638e

Luckily for us, “web3” is here to save the day. Web3 is jargon for software built on top of decentralized systems, and for the wider zeitgeist that encourages taking back control from the man. So what does decentralization mean? Simply put, it is the set of properties that make a system difficult to censor, usually by distributing trust and creating redundancies. I have not seen a reliably used decentralization metric, but the Nakamoto Coefficient, the minimum number of entities that would need to collude to compromise a system, comes closest to defining one. Whether web3 systems are truly decentralized is a question that’s rightfully rearing its head in more and more places!
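To make the Nakamoto Coefficient a little more concrete, here is a minimal sketch in Python. It assumes the common reading of the metric: sort entities by their share of some resource (hash power, stake, nodes) and count how many of the largest are needed to pass a threshold. The 50% threshold, the function name, and the sample numbers are my own assumptions for illustration, not measurements of any real network.

```python
def nakamoto_coefficient(shares, threshold=0.5):
    """Smallest number of entities whose combined share exceeds `threshold`."""
    total = sum(shares)
    if total <= 0:
        raise ValueError("shares must sum to a positive number")
    running = 0.0
    # Start with the largest holders, since they reach the threshold fastest.
    for count, share in enumerate(sorted(shares, reverse=True), start=1):
        running += share
        if running / total > threshold:
            return count
    return len(shares)


# Hypothetical distributions of mining power or stake (invented numbers).
concentrated = [40, 25, 15, 10, 5, 5]
even = [10] * 10

print(nakamoto_coefficient(concentrated))  # 2, because 40 + 25 = 65% of the total
print(nakamoto_coefficient(even))          # 6, because five entities hold exactly 50%
```

A higher number means more parties would have to collude, which is one rough way to compare how decentralized two systems are along a single dimension.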

Succinctly: too much data is not in your hands, centralization has risks, decentralization has promises.

Why I Think Decentralization Is Cool

Before I go on, I’d like to give an example that illustrates one of the promises I’m hopeful for as we move toward a more decentralized world. I’ve spoken about this example for some time and believe it shows why decentralization and owning your own data have benefits in many situations.

I started using Netflix in the mid-2000s. I became obsessed with curating my preferences to get better recommendations. I rated every movie I’d seen in hopes that Netflix would learn what I like and suggest new content that I would have a high chance of enjoying. Then came Hulu, Prime Video, HBO Max, Disney+, Paramount, and an endless flurry of other streaming services. At the same time as I thought “ah, the beauty of the free market,” I also thought “what a pain in the ass.”

I continually found, and still find, myself switching between platforms frequently to find specific content that the other platforms did not have. The better-than-nothing content recommendations I had been receiving on Netflix were weakened with each external stream. The problem was clear: I don’t own my own preference data. Or in a less extreme sense, my preference data isn't directly accessible to me in a communicable format.

I like to imagine a streaming utopia where the free market can still flourish, but I have less of a pain in my ass. The vision is simple: flip the system on its head. I do a one-time extraction of my preference data from each service I use and put it into a standards-based preference format. I store this data somewhere reliable (more on this later) and choose whom I give access to. I hook my data API up to each streaming service I use, sharing the data seamlessly between service providers. This API can be bi-directional: as I receive new data insights from each service, they can be shared with all the others. In this utopia, I own my preference data, I get stronger recommendations on each platform, and everyone benefits. This pattern can be applied to many, many data systems.
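Here is a rough sketch of what that extraction-and-merge step could look like. Everything in it is hypothetical: no such standard format exists that I know of, and the `Rating` fields, the 0-to-1 score scale, the title identifiers, and the "newest rating wins" rule are all assumptions made up for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Rating:
    title_id: str   # assumed identifier shared across services
    score: float    # normalized to the range 0.0 to 1.0
    rated_at: int   # Unix timestamp of when the rating was made
    source: str     # service the rating was exported from


def normalize(raw_score, scale_max):
    """Map a service-specific score (for example 4 of 5 stars) onto 0.0 to 1.0."""
    return raw_score / scale_max


def merge(*exports):
    """Combine exports from several services, keeping the newest rating per title."""
    latest = {}
    for export in exports:
        for rating in export:
            current = latest.get(rating.title_id)
            if current is None or rating.rated_at > current.rated_at:
                latest[rating.title_id] = rating
    return latest


# Invented exports from two imaginary services with different rating scales.
service_a = [
    Rating("title-001", normalize(5, 5), 1_200_000_000, "service-a"),
    Rating("title-002", normalize(4, 5), 1_200_000_500, "service-a"),
]
service_b = [
    Rating("title-002", normalize(10, 10), 1_600_000_000, "service-b"),
]

profile = merge(service_a, service_b)
print(profile["title-002"].score, profile["title-002"].source)  # 1.0 service-b
```

The merged `profile` is the thing I would own and hand to each new service, rather than starting from zero every time.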

Notably, as I was writing this I saw Tidal come out with a playlist export feature quite similar to what I described. I'm thrilled to see a voice like Jack Dorsey's championing a decentralized vision. An import/export model is not a perfect user experience, but it moves us in the right direction, toward decentralization.

Considerations for Your Data

Now we have some understanding of why decentralization can be cool, so let’s own all our data and put it on public data networks! What?! Wait a minute…

As more and more decentralized systems sprout up, it has become clear that there’s an emergent pattern of embracing decentralization, both for retrofits of preexisting systems and for web3-native systems. Yet I’ve seen little discourse about the types of data people have, the data systems used, the different risk and threat models that exist for these data and data systems, and the options that exist, or don’t yet, for storing and accessing all this data.

Sharing data is a lot like sharing pizza

Access

For simplicity, I view data access as falling into one of three main buckets: public, shared, and private. Public data is what it sounds like, and it can be both personal and impersonal: your blog, tweets, profile pictures, open-source code, your name, Wikipedia pages, YouTube videos, and so on. Private data may be sensitive, but primarily it has some sort of access control around it. It could be your SSN, medical history, street address, recovery words, bank account information, photo library, etc. Shared data is private data that has been selectively opened up, such as sharing my medical history with a doctor or a resume with a prospective employer, but it can also exist in an in-between private/public sense. The "in-between" is what we are seeing with NFT-gated communities or, more traditionally, with web services, social profiles, or any resource that you need permission to access.

Data can change the bucket it's in, but moving up the bucket hierarchy, from private to shared to public, carries risk and irreversibility. The primary risks we can consider are leaks, where data gets unwillingly exposed, and privilege escalation, where data is used to gain access to data not intended for your eyes. The instant I post a private photo of myself to a social network, even if I later delete it, I cannot fully return it to a private state. It's possible that someone or some piece of software saved a copy. Similarly, shared data carries publicity risk. I may trust a friend with a draft of this blog post, but there's not much I can do to prevent them from making it public data. One strategy to reduce access risk is to change the longevity of access. Put plainly, if I publish a shared link to my personal photo album, the longer that link resolves to my photos, the higher the chance it'll get into the wrong hands. If I change the link on a regular cadence, I begin a cat-and-mouse game to reduce my risk (and put a burden on all my friends).
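One way to limit the longevity of a shared link without rotating it by hand is to bake an expiry time into the link and sign it, so the server can refuse it once the time has passed. The sketch below uses only Python's standard library. The URL layout, the `SECRET_KEY` value, and the function names are assumptions for illustration, and a real deployment would need proper key management on top of this.

```python
import hashlib
import hmac
import time
from urllib.parse import parse_qs, urlsplit

SECRET_KEY = b"replace-with-a-random-secret"  # hypothetical server-side secret


def _sign(resource, expires_at):
    """HMAC over the resource name and expiry, so neither can be altered."""
    message = f"{resource}:{expires_at}".encode()
    return hmac.new(SECRET_KEY, message, hashlib.sha256).hexdigest()


def make_link(resource, ttl_seconds, now=None):
    """Build a share link that stops working `ttl_seconds` from now."""
    now = int(time.time()) if now is None else now
    expires_at = now + ttl_seconds
    return f"/share/{resource}?expires={expires_at}&sig={_sign(resource, expires_at)}"


def check_link(link, now=None):
    """Return True only if the signature matches and the link has not expired."""
    now = int(time.time()) if now is None else now
    parts = urlsplit(link)
    query = parse_qs(parts.query)
    try:
        resource = parts.path.split("/share/", 1)[1]
        expires_at = int(query["expires"][0])
        signature = query["sig"][0]
    except (IndexError, KeyError, ValueError):
        return False
    expected = _sign(resource, expires_at)
    # compare_digest avoids leaking information through comparison timing.
    return hmac.compare_digest(expected.encode(), signature.encode()) and now < expires_at


link = make_link("album-2021", ttl_seconds=3600, now=1000)
print(check_link(link, now=2000))   # True: still inside the hour
print(check_link(link, now=5000))   # False: the link has expired
```

This does nothing about copies a friend has already saved, which is the irreversibility problem described above. It only shrinks the window in which the link itself is useful to the wrong hands.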

Longevity

Data longevity, or how long a piece of data is considered valid, and risk longevity, or how long a piece of data carries a risk if it is exposed, are closely related considerations. If I rotate my password for an account every month and a bunch of my old, rotated passwords are leaked, there's no risk, assuming I practice good password hygiene. However, my SSN or an old medical record may forever carry personal risk if leaked. PII, or Personally Identifiable Information, is what leads to really fun regulations like GDPR, CCPA, and many others. If you're not careful with the data you store, you could get an unwelcome knock on the door.

The longer data carries risk, the more we should treat it like toxic waste. The most straightforward way to mitigate data longevity risk is to reduce the longevity of your data. Changing your passwords frequently is one example, but it has a hard time generalizing. Much data is out of our control, and even when it is in our control we can't affect its temporal relevance. We cannot rotate our SSN every month. We cannot rewrite our health information. What we can change are the identifiers of our data (filenames, content addresses) and the means by which it is accessed (storage systems, URIs).

In designing systems, how to make data "leak irrelevant" should be the topmost consideration. In web3 apps, it's often really hard to make your data leak irrelevant because so much of it is public anyway. For what isn't fully public, or what's public but encrypted, consider limiting it.

BigCorp is handling your data really carefully

Correlation

Correlation risk arises when multiple pieces of data that independently don't leak information are strung together and, in combination, do. Let's consider a contrived example:

An individual is searching for a new job and starts applying online to a number of companies across borders. One international company has 'age' as a required field on the job application, perfectly legal in its jurisdiction. No other companies applied to in the US ask for such information. Unknowingly, the individual applied to a company in the US which is a subsidiary of the international company. This company has questionable ethics, and as such shares the applicant's age with the US team, resulting in the candidate being unfairly disqualified for the position.

This may seem far-fetched, but similar correlations happen all the time. Probably one of the most prevalent examples is in the advertising industry: user X bought product A on our site, and product B on your site, user Y bought product B on your site, so let's show them ads for product A on our site.

The lesson from correlation risk is that the more you give away, the more your data can be used against you. Who knows what associations will be made between all the pieces of data you publicize? And it goes further than that: who knows how your data will be used to make inferences about the groups and systems you are a part of, and how your data, mixed with the data of millions, will strengthen targeted influence machines. Fortunately, legislators are at work to limit such systems. On blockchains, a similar risk is rampant. Associating your financial data (blockchain wallet) with identity-based web3 applications (ENS, OpenSea, BAYC), Discord communities, forums, Telegram groups, and countless others exposes you to significant correlation risk. Tooling to prevent such risk is lacking, and so is the conversation around the risk. Beware of all the immutable breadcrumbs you leave behind.
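To see how little it takes, here is a toy sketch of the wallet case. Three datasets that each look harmless are joined on the one value they share. All handles, addresses, names, and purchases are invented, and the `correlate` function is just an illustration of a join, not any real tool.

```python
# Each dataset looks harmless on its own. All values are invented.
forum_profiles = {"pizza_fan": "0xA11CE", "quiet_dev": "0xB0B"}  # handle -> wallet
name_service = {"0xA11CE": "alice.example"}                      # wallet -> public name
purchases = {                                                    # wallet -> activity
    "0xA11CE": ["rare-jpeg-17", "dao-membership"],
    "0xB0B": ["coffee-token"],
}


def correlate(profiles, names, activity):
    """Join the three datasets on the wallet address they have in common."""
    linked = []
    for handle, wallet in profiles.items():
        if wallet in names:
            linked.append({
                "handle": handle,
                "name": names[wallet],
                "activity": activity.get(wallet, []),
            })
    return linked


for record in correlate(forum_profiles, name_service, purchases):
    print(record)
```

In this made-up data, `quiet_dev` stays unlinked only because that wallet never registered a public name. One extra breadcrumb would be enough to tie the handle, the name, and the purchase history together.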

Obfuscation

Obfuscation is a method by which you hide meaningful data inside meaningless data. To the untrained eye, the data may be gibberish; to the trained, meaningful. Obfuscation can also be done to subvert data mining and other predatory algorithms. One of my favorite talks from DEF CON 28 was from a guy who created a tool to fill all his social media accounts with bogus information. Imagine taking your name and creating dozens or hundreds of fake profiles using it, filled with photos, preference data, writing, and more. Someone searching for authentic data about you will have a much harder time!

"Are you obfuscating me right now?"

In blockchain communities you'll frequently notice folks using pseudonyms or going 'full anon' – this is obfuscation at work and a great countermeasure against surveillance.

Encryption

Encryption is a really broad topic and I don't aim to cover much of it here. Some data is encrypted, which makes it indecipherable unless you are given access to decrypt it. Some cryptosystems are better than others. Some are better for some data types than others, and some better for some situations than others: it depends. When used according to best practices, encryption is the best method we have for making sure data is "for your eyes only." However, similar to the access risk described above, once a decryption mechanism is distributed it can be hard to guarantee that unintended audiences don't end up with access to encrypted data.

Source: https://www.flickr.com/photos/ibm_research_zurich/40645906341

The looming threat for encryption is post-quantum risk. Once quantum computers are sufficiently sophisticated, they'll be able to crack many of our existing cryptosystems. The US government, and certainly many governments and private entities worldwide, are taking this threat seriously. Encrypted data is being stockpiled today for exploitation in our post-quantum future. Since 2016, NIST, the National Institute of Standards and Technology in the United States, has been hosting a competition to develop next-generation post-quantum cryptosystems. Most experts believe we have about a decade to solve this problem. Unfortunately, the encrypted data that is already public will forever be vulnerable.

Data Storage

With a better sense of your data and the risks it may carry, you can make more informed decisions about which storage medium makes sense. Let's illustrate the data storage landscape:

Definitely not a pyramid scheme

As much as possible, you should store data offline. This often isn't practical, and physical security is a whole 'nother beast. For the sake of simplicity, I'm assuming offline storage to mean storing sensitive documents in a bolted-down fireproof safe. When that's not possible, use a sufficiently robust cryptosystem like AES-256 or an authenticated encryption scheme like the popular XChaCha20-Poly1305. Moving online, we have private web storage. This could be a local server or one from a trusted cloud provider (aka "the public cloud"). You can limit the access controls on this server to keep data private or shared as you wish. Private servers can and often do store public information, much like the content on the websites you visit.

Moving to the orange and red portions of the pyramid, we enter "web3 storage". Networks like IPFS, Filecoin (which uses IPFS under the hood), Arweave, Storj, Sia, and Ceramic (which also uses IPFS under the hood) are all public, decentralized storage solutions. Data on these networks is often unencrypted, though most protocols support encryption. Any data on these networks is publicly accessible and often replicated across many machines. In a similar class to these purpose-specific data networks are blockchains. Blockchains store public information! If the blockchain in question is any good, you can be pretty damn sure that data on the blockchain is going to be long-lived. Without pinning or another incentivization mechanism, there is a risk that data stored on other storage networks will not be permanent.

An important dimension in many decentralized web projects is preventing censorship. Censorship resistance here generally means making it difficult for one individual or a group of individuals to remove content or access to content. The more censorship-resistant your data storage, the more censorship-resistant your data, but also the more risk you expose your data to. A blockchain built on your 13-year-old self's tweets may be censorship-resistant, but do you really want that information out there forever?

What You Should Do

Awareness is the greatest skill in the face of the 'data wild west'. If you begin to consider all facets of your data, its risks, and where you can store it, you're on the right track to make some informed decisions! The most important thing before unleashing your data into the wild is to do a light risk and threat modeling exercise: plan for the worst; hope for the best. Consider all that you store and expose, and see what can be made more private, or what can be eliminated from storage altogether.

Her data's safe, is yours?

We're at a great point in innovation where data privacy and cryptography are being given the spotlight. There are more tools now than ever to take control of your data and your identity, and to regain your privacy. I am hopeful that the proliferation of Decentralized Identity, Decentralized Identifiers, and their ecosystems will put privacy-first technologies in the hands of the world. Make your data your own!