3rd October 2023
Storing data – from genomic blueprints to green footprints
Sequencing genomes costs money, energy and time. As a person's genome remains the same throughout their life, once a patient has had whole genome (or whole exome) sequencing, it seems sensible to store that information in case it is needed again later.
Sequencing a whole genome is computationally expensive but the monetary cost is continually falling, and if it continues to do so, the costs of storage may become more significant by comparison.
However, a patient's genome represents a huge volume of data compared with, say, standard medical records, so the pros and cons of storing it may be more complex than they first appear.
The pros and cons of storing data
There are some undeniable advantages to storing genomic data, two of which are especially pertinent.
The first is the ability to retrieve the data quickly in future, to inform precision medicine approaches or to check for pharmacogenomic interactions. This use is a gamble, because when the data is stored it may not be known whether it will ever be used again.
The second is the ability to use the data (usually deidentified) in research. When patients consent to their data being used in research, large genomic databases can be assembled and analysed to uncover genetic variants associated with diseases. The larger the dataset, the easier it becomes to detect small effects, or correlations between rare variants.
The potential disadvantages of storing data include the risk of data breaches. Any data that is stored has a non-zero risk of being hacked, leaked or subject to some other kind of privacy breach, even when measures are taken to provide the best possible security. Patients will need to understand and accept this risk.
Other potential issues could arise around whether genomic data, in the form we currently record it, will remain useful across a patient's lifetime. Perhaps the format in which the data is stored will become incompatible with newer technologies, in the way that few of us still listen to cassettes or CDs, or own devices on which to play them.
Or perhaps newer sequencing technologies will come along that provide more detailed data, such as long-read sequencing, which is already able to read genomic and epigenomic data together. If that information became medically useful, people might be re-sequenced at regular intervals in future, to look at epigenomic changes over time.
When we think about storage, it can intuitively feel like something passive, rather than something that is actively using energy and resources. Why does storing data impact the environment?
The simple answer is that the servers on which data is stored need to be cooled, and that uses energy and other resources such as water. Although many large databases are stored 'in the cloud', in practice the data is still held on physical drives somewhere.
The potential of cloud storage is that, by outsourcing storage in this way, the computers storing the data are professionally managed and can be optimised for maximum efficiency. Cloud storage providers can also manage the trade-offs between accessibility and energy costs. By building data storage facilities in naturally cold places, the energy used for cooling the drives can be significantly reduced, but as these locations tend to be further from major population centres, accessing the data is slower.
Many of the large companies that offer cloud storage are proud to talk about their sustainability initiatives. For example, Amazon Web Services (AWS), which hosts the data from Genomics England initiatives such as the 100,000 Genomes Project [source], aims to be powered by 100% renewable energy by 2025, and claims to be five times more energy efficient than average data centres in Europe. However, it can be difficult to evaluate whether claims like these indicate genuine environmental benefit, or are cherry-picked statistics amounting to little more than greenwashing.
Separately from the energy used to run data storage centres, there is also the energy and materials cost of manufacturing the hard drives on which the data is stored. Hard drives contain many different materials, making them difficult to recycle, and they commonly include rare earth metals such as neodymium, the mining of which has its own complicated environmental impacts.
These concerns need to be balanced against the environmental costs involved in sequencing genomic data, potentially repeatedly, at the time it is needed. This itself uses energy and generates medical waste, as well as taking longer to generate a result, potentially delaying the patient's treatment, compared with accessing information already stored.
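To make this balance concrete, the comparison can be written as a small piece of arithmetic. The Python sketch below is illustrative only: the class, the function names and every number in it are hypothetical placeholders rather than measurements, and it counts only the energy used for running storage and for sequencing. It leaves out the manufacturing of hard drives, medical waste and the clinical cost of waiting for a result.

```python
from dataclasses import dataclass


@dataclass
class Scenario:
    """Hypothetical inputs for one patient's genome (placeholder values only)."""
    sequencing_kwh: float        # energy to sequence and process one genome
    storage_kwh_per_year: float  # energy to keep one genome in storage for a year
    years: int                   # how long the data would be kept
    expected_reuses: float       # expected number of times the data is needed again


def store_energy(s: Scenario) -> float:
    """Sequence once, then keep the data for the whole period."""
    return s.sequencing_kwh + s.storage_kwh_per_year * s.years


def resequence_energy(s: Scenario) -> float:
    """Sequence once now, discard the data, and sequence again at each reuse."""
    return s.sequencing_kwh * (1 + s.expected_reuses)


def break_even_reuses(s: Scenario) -> float:
    """Number of reuses at which the two strategies use the same energy."""
    return s.storage_kwh_per_year * s.years / s.sequencing_kwh


# Placeholder figures chosen only to show the calculation, not real estimates.
example = Scenario(sequencing_kwh=100.0, storage_kwh_per_year=2.0,
                   years=50, expected_reuses=0.5)

print("store:", store_energy(example), "kWh")
print("re-sequence:", resequence_energy(example), "kWh")
print("break-even reuses:", break_even_reuses(example))
```

With these made-up inputs, storage only uses less energy if the data is expected to be needed at least once more over the period. Real figures would change the answer, which is why reliable estimates from data centre operators and sequencing laboratories matter.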
Another necessary use of energy is encryption. The security of stored genomic data is also important, and essential to maintaining patient trust.
A few years ago, providers at various expos seemed to be touting blockchain as the answer to securing genomic data, and a quick search will bring up numerous scientific papers on the topic. Blockchain-associated technologies, especially bitcoin mining, can be extremely energy intensive. Providers such as AWS and their competitors are offering blockchain-based services, although the buzz around this technology seems to have died down, perhaps since cryptocurrencies became mainstream news with a brief bubble and the subsequent collapse of the cryptocurrency exchange FTX.
Regardless of the method, storing genomic data securely requires some kind of encryption to protect the data while it is in storage and while it is being transferred, and the data must then be decrypted for use. Any method of encryption will have some energy requirement.
Overall, sustainability needs to be at the heart of discussions surrounding when and how genomic data is stored. It deserves to be a central concern, in the same way as other ethical issues such as consent and data privacy are.
These discussions need to go beyond the medical community – it would be unfair and unreasonable to expect clinicians and clinical scientists to be experts in all the facets of computing and environmental science that are involved in weighing the impact of storage. Experts in these fields need to support the genomic medicine community to empower informed decision-making in the interests of patients, NHS budgets, and the planet.