Elasticsearch and GDPR

Distributed storage meets regulation

Edit: Since this was originally posted, I've come up with a solution.

When I was in Philadelphia this summer, taking the Elasticsearch Engineer I and II courses, something interesting occurred to me which might not be immediately obvious – and which may further complicate the already non-trivial technical issues that have appeared in the wake of the GDPR.

Some background

If you don't already know, Elasticsearch is an open source, near real-time distributed search engine with a REST API. Though it can be used for NoSQL-style storage, you should not use Elasticsearch as a primary data store (further detailed here).

Don't get me wrong – Elasticsearch is great for search, but it's not a database.

We were in the midst of discussing the nitty-gritty of Elasticsearch's advanced features related to distributed storage – specifically the internals of Lucene, upon which Elasticsearch is based.

The relevant concepts are neatly summed up here, but in short: you insert documents (data) into an index (a logical namespace, comparable to a database), which is mapped to one or more primary shards (and however many replica shards); the shards are distributed across the nodes in your cluster; each node runs a number of shards (primary and replica alike); and each shard consists of segments. Segments are essentially immutable "mini-indices", each of which handles the search over its own part of the data collection when Lucene performs a search. Since segments are immutable, deleted documents are not really deleted, but only marked as such – so the segments filter out documents marked for deletion when searches are performed.
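As a rough illustration of those moving parts, here is a minimal sketch in Python using the requests library – assuming a local cluster at http://localhost:9200 and a hypothetical users index (all names are made up for illustration):

    import requests

    ES = "http://localhost:9200"

    # Create an index with an explicit number of primary and replica shards.
    # Each shard copy internally consists of immutable Lucene segments.
    requests.put(f"{ES}/users", json={
        "settings": {"number_of_shards": 3, "number_of_replicas": 1}
    })

    # Index a document; it first lands in the in-memory index buffer of the
    # shard it is routed to, and only becomes searchable as part of a segment
    # after the next refresh.
    requests.put(f"{ES}/users/_doc/1", json={
        "name": "Jane Doe",
        "email": "jane@example.com"
    })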

Any given shard continuously processes its document queue, turning the incoming data into Lucene documents. These documents are added to the index buffer, which is eventually written out as a new segment (on refresh) and ultimately committed to disk (on flush). When all of this happens is up to the shard's host node, and it is not synchronized across nodes – so situations may occur where, for instance, primary and replica shards hold differing "truths" (data) while refreshes propagate across the cluster.
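Both operations can also be triggered explicitly, which makes them easy to observe – a sketch, continuing with the hypothetical users index from above:

    import requests

    ES = "http://localhost:9200"

    # A refresh writes the current index buffer out as a new searchable segment...
    requests.post(f"{ES}/users/_refresh")

    # ...while a flush performs a Lucene commit, making the segments durable on disk.
    requests.post(f"{ES}/users/_flush")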

Since the distributed search gets increasingly complicated as more segments are added, Lucene will on occasion merge segments according to a merge policy. When this happens, documents marked as deleted are dropped. This also means that adding more documents may sometimes result in a smaller index size, since the additions can trigger a segment merge.
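You can see the "marked as deleted" state for yourself: after deleting a document, the index statistics still count it under docs.deleted until a merge drops it. A sketch, again using the hypothetical users index:

    import requests

    ES = "http://localhost:9200"

    # "Delete" a document. Internally, Lucene only marks it as deleted
    # inside its (immutable) segment.
    requests.delete(f"{ES}/users/_doc/1")

    # The index stats expose how many documents are merely tombstoned and
    # still sitting in segments on disk, waiting for a merge to drop them.
    stats = requests.get(f"{ES}/users/_stats/docs").json()
    print(stats["_all"]["primaries"]["docs"]["deleted"])

    # The segments API breaks the same numbers down per segment.
    print(requests.get(f"{ES}/users/_segments").json())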

Though you can use the force merge API (formerly the optimize API) to force a merge operation, this is usually not wise. In short, it reduces the number of segments (often to one) and hinders the background merge process. It should not be used on a dynamic (actively updated) index – it is typically only beneficial for older, essentially read-only indices.
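If you do decide to force the issue on an old, effectively read-only index, a call like the following asks Elasticsearch to merge away segments containing deleted documents – a sketch, not a recommendation for live indices:

    import requests

    ES = "http://localhost:9200"

    # Force a merge that expunges deleted documents from the index's segments.
    requests.post(f"{ES}/users/_forcemerge",
                  params={"only_expunge_deletes": "true"})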

Elastic explains the merge process here, and Michael McCandless further details segment merging here.

Segment merging can be visualized as smaller segments being continuously combined into larger ones, with documents marked as deleted being dropped along the way.

You don't necessarily understand what data you have

All of this means that at any given point you may still have data on disk that is supposed to be deleted (Lucene segments are files, after all) – depending on your cluster architecture, differing amounts of data between primary and replica shards on different nodes, and your merge settings.

In addition, continuous index refreshing (which is what enables near real-time search) happens every second by default. Each refresh writes the index buffer out as a new segment, and the growing number of segments can in turn trigger a merge.

It is also possible to mess up your refresh_interval setting, or to disable automatic refreshing altogether – for instance while (re-)indexing – and forget to re-enable it later.
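For instance, a bulk (re-)indexing job might temporarily disable refreshes and then never restore them. A sketch of the two settings calls involved:

    import requests

    ES = "http://localhost:9200"

    # Disable automatic refreshes while bulk indexing...
    requests.put(f"{ES}/users/_settings",
                 json={"index": {"refresh_interval": "-1"}})

    # ...and restore the default one-second interval afterwards. Forgetting
    # this step changes when new segments appear and when merges can happen.
    requests.put(f"{ES}/users/_settings",
                 json={"index": {"refresh_interval": "1s"}})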

Lastly: once a segment reaches the maximum segment size (5GB by default), it only becomes eligible for merging again when it accumulates 50% deletions.
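These thresholds come from Lucene's tiered merge policy, and you can inspect the values in effect on an index (the exact setting names may vary between versions) – a sketch:

    import requests

    ES = "http://localhost:9200"

    # List the merge policy settings in effect, including defaults such as
    # index.merge.policy.max_merged_segment (the ~5GB ceiling mentioned above).
    settings = requests.get(f"{ES}/users/_settings",
                            params={"include_defaults": "true",
                                    "flat_settings": "true"}).json()
    for name, value in settings["users"]["defaults"].items():
        if name.startswith("index.merge.policy"):
            print(name, value)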

TLDR: Depending on your rate of indexing and searching, your configuration, your segment sizes, and your overall architecture (and even more technical details), you may very well have "deleted" data on disk for longer than you are aware of.

The GDPR

According to Article 17 of the GDPR (the right to erasure, better known as the "right to be forgotten"), users have the right to have their personal data erased by a controller. The controller then has 30 days to confirm what data has been deleted, or to explain why it cannot delete it.

To be able to do this, one would need to:

  1. Know what data is stored
  2. Locate all instances of the data in question
  3. Guarantee that all relevant data is in fact erased
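In Elasticsearch terms, locating and erasing all instances of a person's data (steps 2 and 3) might start with a search and a delete-by-query across all indices. A sketch, assuming a hypothetical user_id field that ties documents to the person in question:

    import requests

    ES = "http://localhost:9200"

    # Step 2: find every document referencing the person.
    hits = requests.post(f"{ES}/_all/_search", json={
        "query": {"term": {"user_id": "42"}}
    }).json()["hits"]["hits"]
    print([(h["_index"], h["_id"]) for h in hits])

    # Step 3: "delete" them. As discussed above, this only marks the documents
    # as deleted inside immutable Lucene segments – it does not guarantee that
    # the bytes are gone from disk.
    requests.post(f"{ES}/_all/_delete_by_query", json={
        "query": {"term": {"user_id": "42"}}
    })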

Back to our story...

During the Elasticsearch course, we were told that documents aren't really deleted, but marked as such – they are only truly removed when segments are merged (at some, possibly hard-to-define, point in the future). On top of that, there are the issues of synchronisation, which doesn't necessarily happen instantaneously, and the fact that segment merging may happen at different times on different shards.

I then asked: "Does that mean that there's no way to guarantee that information is deleted - let's say in the context of complying with regulation, such as the GDPR?"

Our instructor said "no", but quickly added "Is there ever really such a guarantee?"

He also pointed out that you will potentially have the same issue all the way down to the disk level, i.e. files aren't typically really deleted from hard drives, but only "marked" as such.

Consequences

It is not necessarily easy to predict when Elasticsearch will merge segments and drop "deleted" documents. Depending on the system and its configuration, you may (in the worst case) be unable to guarantee that this happens within 30 days – though this might not be a common issue in practice.

It is not a given that you can manually override Elasticsearch's internal storage and performance policies in a way that is practical in production environments; even if you can, things will likely get pretty complicated pretty fast. And even then: who knows what happens on the different nodes in the cluster, not to mention at the disk level.

If it is really necessary to guarantee that the data in question is deleted in a way that makes it impossible to restore, the solution will likely be very expensive.

One would perhaps have to reimplement the merging mechanism in a way that meets these demands and manage it manually, or build a program that finds and deletes the relevant data at the file level on the relevant nodes' disks, triggered whenever something is "deleted" in Elasticsearch.

Another approach would be to strongly encrypt the relevant data and "throw away the key" when erasure is requested.
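The idea (sometimes called crypto-shredding) is to encrypt each person's data with a per-person key kept outside Elasticsearch; deleting the key renders whatever copies linger in un-merged segments unreadable. A minimal sketch using Python's cryptography library, with an in-memory key store purely for illustration:

    import requests
    from cryptography.fernet import Fernet

    ES = "http://localhost:9200"

    # Per-user keys live outside Elasticsearch (here just a dict, for illustration).
    keys = {"42": Fernet.generate_key()}

    # Store only ciphertext in the index.
    cipher = Fernet(keys["42"])
    requests.put(f"{ES}/users/_doc/42", json={
        "user_id": "42",
        "email_encrypted": cipher.encrypt(b"jane@example.com").decode(),
    })

    # "Forgetting" the user: throw away the key. Any stale copies still sitting
    # in old segments are now just undecipherable noise.
    del keys["42"]

The obvious trade-off is that encrypted fields can no longer be meaningfully searched or aggregated, so this only works for data you need to store and retrieve, not query.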

A final approach is something like what Reddit user 1s44c suggested in a comment on this post's submission to the Elasticsearch subreddit:

On ES versions less than 6 we see a lot of minor inconsistencies in document counts on rebuilt shards on busy indices. This causes us to rebuild indices at least weekly and dump the old ones. That should wipe out all old data and is a short enough timeframe to solve GDPR worries.

Like you say ES is a great tool but it really, really isn't a primary data store. You should always have a well tested way to recreate indices from source. Having said that version 6.3 is FAR better at keeping consistent document counts in every shard.
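A sketch of that rebuild approach – reindex the still-live documents into a fresh index (in which the "deleted" documents simply never existed) and then drop the old one; the index names here are hypothetical:

    import requests

    ES = "http://localhost:9200"

    # Copy the live documents into a brand new index. Documents that have been
    # deleted (or removed from the source of truth) are simply not carried over.
    requests.post(f"{ES}/_reindex", json={
        "source": {"index": "users"},
        "dest": {"index": "users-rebuilt-2018-07"},
    })

    # Dropping the old index deletes its segment files outright, rather than
    # waiting for merges to discard the tombstoned documents.
    requests.delete(f"{ES}/users")

In practice you would point an alias at the new index before deleting the old one, so that searches keep working throughout the rebuild.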

No matter the approach, it is sure to become complicated. Additionally, this is probably a relevant problem for other comparable solutions as well.

Third party solutions

There is also the matter of solutions built around Elasticsearch (again, not that you should use ES as your primary data store).

For instance, there are content management systems such as Enonic XP, which uses ES as the primary data store in the Enonic Content Repository ("NoSQL database – Lightning fast and built on Elasticsearch"). There are probably also custom solutions out there that do the same thing.

Anyone storing personal data in such systems may have a problem down the line if a "difficult" person demands that their data be deleted.

Conclusion

Many people and companies use Elasticsearch as a primary data store in some shape or form (even though they shouldn't), ranging from custom-built architectures to commercial solutions.

One cannot necessarily guarantee deletion of data (as defined by the GDPR) within 30 days – as the law requires – because of how Lucene segment merging works, as well as a myriad of other technical details.

This means that a lot of people and companies out there are at risk of not being GDPR-compliant, should the EU take a principled stand on the issue. Non-compliance could, in the case of enforcement, lead to fines of up to €20 million, or 4% of worldwide annual turnover for the preceding financial year, whichever is higher – the tier that covers infringements of data subjects' rights.

There is no known solution as of yet, as this is new territory.

Edit: Since this was originally posted, I've come up with a solution.
