Last week, I wrote about a potential GDPR issue with Elasticsearch (Lucene segments, really) that I discovered this summer. As this is uncharted territory, there was no obvious solution – and there are potentially lots of people and organizations at risk.
Now, I think I have a solution.
Edit: I've since held a presentation regarding this (in Norwegian).
Edit: I spoke about this subject at JavaZone 2019.
As I described in my post about Elasticsearch and GDPR, there are in all probability a large number of solutions out there that are in violation of the GDPR, at least under my conservative, non-lawyer interpretation of the situation. This is because they rely upon Elasticsearch as a primary data store (which they should not do), or otherwise depend on Lucene for storage of personal data.
To sum it all up, the problem is that data isn't actually deleted from Lucene segments until the segment containing the data in question is merged – and the timing of that merge is not necessarily easy to predict, as it depends on a lot of technical details. This means that one can't necessarily guarantee deletion of data within 30 days, in accordance with the GDPR.
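One can actually observe this lingering data from the outside. The sketch below uses Elasticsearch's cat APIs and assumes a local cluster on port 9200 and a hypothetical index named customers; the docs.deleted columns show documents that are marked as deleted but still physically present in segments:

```shell
# Index-level view: live vs. still-present "deleted" docs
curl -s "http://localhost:9200/_cat/indices/customers?v&h=index,docs.count,docs.deleted,store.size"

# Per-segment view: deleted docs stay in each segment until that segment is merged away
curl -s "http://localhost:9200/_cat/segments/customers?v&h=segment,docs.count,docs.deleted,size"
```

As long as docs.deleted is non-zero for a segment, the "deleted" personal data in that segment is still recoverable from the index files on disk.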
It is important to point out that the problem does not stem from Elasticsearch or Lucene themselves, but from how people use them, i.e. for something they were not designed to do.
I sent an email to Jan Høydahl (Lucene contributor and member of the Lucene Project Management Committee), who – in addition to sparring about GDPR and practical solutions – told me that Lucene 7.5 (which will likely be released in the coming week) will change the segment merging logic to also handle segments with more than 5 GB of data, and therefore generally ensure that deleted docs disappear faster than before, even in large segments. The percentage of deleted docs that triggers a segment merge will also become configurable, which means that one can set this to a lower value, at the expense of I/O, to ensure more frequent segment merging.
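On the Elasticsearch side, this threshold is, as far as I can tell, exposed as the index setting index.merge.policy.deletes_pct_allowed once you are on a version built on Lucene 7.5 or later. A sketch of lowering it, assuming a local cluster and a hypothetical index named customers:

```shell
# Lower the share of deleted docs tolerated in the index before
# merges are triggered (lower = more aggressive merging, at the
# cost of extra I/O). The setting maps to Lucene's TieredMergePolicy.
curl -s -X PUT "http://localhost:9200/customers/_settings" \
  -H 'Content-Type: application/json' \
  -d '{ "index.merge.policy.deletes_pct_allowed": 20 }'
```

The exact setting name and accepted range may vary between versions, so check the merge policy documentation for the version you are running.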
On LinkedIn, he commented:
Luckily the upcoming Lucene/Solr 7.5, which is currently in the release pipeline, fixes some of the shortcomings you mention in your post, and allows Lucene to weed out deleted docs in a predictable manner while doing its ordinary merges in the background, also from large segments, see https://jira.apache.org/jira/browse/LUCENE-7976
He also told me that one could, for instance, perform a forced optimize or expungeDeletes operation every 30th day to force a complete cleanup of deleted docs. Because of the large amounts of data that would be read from and written to disk, this would be a very costly and time-consuming operation. One should therefore run it either in a maintenance window or off-peak, or alternatively scale one's cluster up a little more than normally needed so it can handle this I/O during peak hours. Since one would only do this every 30th day, the off-peak solution might be preferable.
If the GDPR does lead to fines for "deleted" personal data being discoverable in an index file after 30 days, one should upgrade to Lucene 7.5 (once it is released) and run expungeDeletes every 30th day (e.g. via a cronjob).
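A minimal sketch of such a job, assuming an Elasticsearch cluster on localhost:9200 and a hypothetical index named customers; in Elasticsearch, Lucene's expungeDeletes operation is exposed through the force-merge API:

```shell
#!/bin/sh
# expunge-deletes.sh (hypothetical name): rewrite only the segments
# that contain deleted docs, physically removing the "deleted" data
# from disk instead of waiting for ordinary background merges.
curl -s -X POST "http://localhost:9200/customers/_forcemerge?only_expunge_deletes=true"
```

A crontab entry such as `0 3 1 * * /usr/local/bin/expunge-deletes.sh` would run it off-peak on the first of every month, which roughly approximates the 30-day window – if you need exactly 30 days, schedule accordingly.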
This will likely lead to all sorts of upgrade and dependency issues for related software, and will in many cases mean upgrading to a version of Elasticsearch (or Solr, for that matter) that is built upon Lucene 7.5. Then again, there is no telling when Elasticsearch will update its version of Lucene, or when solutions that depend upon Elasticsearch will upgrade their version of Elasticsearch, and so on.
Another thing I've thought about is that one will perhaps have to prove compliance to some extent, i.e. show that configs, merge rates and architecture enable you to guarantee that data is deleted after 30 days. In the end, we likely won't know specifics like this – or what "deleted" means in practice (i.e. how good is good enough) – before a case dealing with this topic is tried in court.