What is 'big data'?
Data is everywhere. It affects what you eat and how you drive; it powers our towns and shapes our economy. And one day it may help keep you alive.
And as computers have become more powerful and sophisticated, we can now combine huge amounts of data to generate even deeper insights about the world around us.
Welcome to the era of ‘big data’.
Big data is the collision of huge collections of information, so massive that you need to throw the rule book out of the window and devise entirely new ways of looking at them.
The potential of big data is massive. In recent years, we’ve seen supermarkets use data about buying habits to send customers targeted pregnancy adverts before they’ve even told their families the happy news.
And computer scientists have used data taken from Twitter to successfully track a flu outbreak in New York far faster than public health agencies.
With investment bankers, sports teams, traffic networks and engineers now all routinely using vast stacks of data, what role does big data have in improving our society and health, including the outlook for people with cancer? And is it all it’s hyped up to be?
From here to Pluto
Physicists working on the Large Hadron Collider (LHC) – a huge physics experiment in Geneva – can generate so much data that, if it were printed in paperback-book format, one year’s worth would produce a stack of books stretching from London to Pluto. And back. Fifty times. And 8,000 physicists need regular access to this data.
The investment firm Winton Capital Management downloads two billion prices from just the US stock market every single day.
Engineering, traffic networks, hospitals and retailers all collect and use vast amounts of data. And all these different disciplines encounter problems and develop their own unique solutions to them.
Increasingly, the solution to one field’s data problem is developed by someone in another field.
For example, Formula 1 engineers are providing advanced monitoring software to intensive care wards to help them track patient health changes.
And systems used for spotting stars have been adapted to spot breast cancer cells.
The human genome and beyond
But biology and big data first caught the public eye in the year 2000. The world’s media was buzzing with the news that the most ambitious international biology project ever – the Human Genome Project – was almost finished.
Bill Clinton and Tony Blair announced the first draft of a complete genetic map of human DNA. This was, they said, “the scientific breakthrough of the century, perhaps of all time.”
2,000 scientists from across the globe spent more than a decade, and several billion dollars, working to give the world the first genetic blueprint of a human.
At more than three billion ‘letters’ long, this string of four chemicals, A, T, C and G, represented the essence of a human, containing all the instructions that we need to be us.
And deciphering this code had the potential to change medicine forever. Newspapers excitedly called it ‘the breakthrough that changes everything’ and ‘the discovery that will touch the life of every person on the planet’.
This fanfare was perhaps premature. Mapping the genome was only the first step; the formidable task of actually understanding what the various bits of our genome do still lay ahead.
But what this did do was open the data floodgates.
The data boom
In 1991, at the beginning of the project, the best scientific technology could analyse a thousand DNA ‘letters’ a day. Just 10 years later they could ‘read’ a million. And today? The latest techniques can analyse ten billion letters a day.
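To put those reading speeds in context, here is a rough back-of-the-envelope sketch in Python. It uses only the figures quoted in this article (a genome of about three billion letters, and the three daily reading rates), and it assumes a single reader working at exactly the quoted rate, ignoring repeated reads and machines running in parallel. The function name is ours, chosen for illustration.

```python
# Figures quoted in the article; the genome length is rounded to three billion letters.
GENOME_LETTERS = 3_000_000_000

# Letters read per day, as quoted for each period.
LETTERS_PER_DAY = {
    "1991": 1_000,
    "about 2001": 1_000_000,
    "today": 10_000_000_000,
}


def days_to_read_genome(letters_per_day: int, genome_letters: int = GENOME_LETTERS) -> float:
    """Return how many days it takes to read one genome at a given daily rate."""
    return genome_letters / letters_per_day


if __name__ == "__main__":
    for period, rate in LETTERS_PER_DAY.items():
        days = days_to_read_genome(rate)
        print(f"{period}: {days:,.1f} days ({days / 365:,.1f} years)")
```

On these round numbers, one genome would take roughly 8,200 years at the 1991 rate, about eight years at the rate of a decade later, and around seven hours today.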
This means that scientists can now look at individual people’s genes in a timeframe that could make a difference. They can compare the genes of a patient’s tumour to those of their healthy cells, searching for the differences that may be fuelling that cancer.
But to even have a chance of understanding these differences and taking the first steps towards more personalised treatments, scientists need to embrace big data. And that’s exactly what they are doing.
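As a toy illustration of comparing a tumour’s DNA with healthy DNA, the Python sketch below lines up two short strings of letters and reports where they differ. The sequences are made up, and the function assumes both strings are already aligned and the same length. Real genome comparisons involve billions of letters, sequencing errors and far more sophisticated methods, so treat this only as a picture of the idea.

```python
def find_differences(healthy: str, tumour: str) -> list[tuple[int, str, str]]:
    """Return (position, healthy_letter, tumour_letter) wherever the two differ.

    Assumes both sequences are already aligned and the same length, which real
    sequencing data is not; this only illustrates the idea of a comparison.
    """
    if len(healthy) != len(tumour):
        raise ValueError("sequences must be the same length")
    return [
        (position, h, t)
        for position, (h, t) in enumerate(zip(healthy, tumour))
        if h != t
    ]


if __name__ == "__main__":
    # Made-up sequences for illustration only; not real patient data.
    healthy_dna = "ATCGGATTCA"
    tumour_dna = "ATCGTATTGA"
    for position, h, t in find_differences(healthy_dna, tumour_dna):
        print(f"position {position}: {h} -> {t}")
```

Even in this ten-letter example the comparison has to check every position. Scaling the same idea up to three billion letters per genome, and many genomes per patient, is where the big data problem begins.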
Big data and cancer
As part of our TRACERx study, researchers are analysing the tumour DNA of 850 people with lung cancer, looking to understand how the disease evolves over time and to find new ways to treat it.
This ambitious project will generate a lot of data. For each patient on the study, the researchers will ‘read’ the equivalent of 65,000 human genomes, storing and analysing the equivalent of 200 times the contents of Wikipedia.
It raises some big challenges that are common to the field of big data:
Finding a home for your data
The more data you have, the larger the storage required to house it. But because the cost of generating data has fallen so quickly, it now costs as much to store it as to produce it. And access to the data must be ‘future proof’.
Sharing is caring
As part of the International Cancer Genome Consortium (ICGC), our researchers, along with others around the world, are mapping and cataloguing all of the genetic faults seen in samples from 50 different types of cancer.
This will create one huge repository of data that any scientist can tap into to further their research. But many computers cannot handle such huge volumes of data.
And even when they can, the length of time required to sift through the sea of information can be great. This means scientists are collecting more data than they can analyse: the dreaded ‘data bottleneck’.
Be careful what you test for
Computers are excellent at spotting patterns. But knowing which patterns are important can be a real challenge for even the smartest computer.
A computer scanning data may spot a chance pattern between increasing global temperatures and the decline in piracy on the high seas. But of course, pirates don’t prevent global warming. A human can understand this, but a machine may not.
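The pirate example can be sketched in a few lines of Python. The code below builds two entirely synthetic series, with invented numbers: one drifts upwards over time (standing in for temperature) and the other drifts downwards (standing in for pirate numbers). Each gets its own independent random noise, so neither influences the other. It then measures how strongly the two are correlated. The function names and all the values are assumptions made for illustration, not real climate or piracy data.

```python
import math
import random


def pearson(xs: list[float], ys: list[float]) -> float:
    """Pearson correlation coefficient of two equal-length series."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def make_series(seed: int = 0, years: int = 50) -> tuple[list[float], list[float]]:
    """Build two synthetic series that never influence each other.

    All numbers are invented: one series drifts up over time, the other
    drifts down, and each gets its own independent random noise.
    """
    rng = random.Random(seed)
    temperature = [14.0 + 0.02 * year + rng.gauss(0, 0.05) for year in range(years)]
    pirates = [5000 - 80 * year + rng.gauss(0, 200) for year in range(years)]
    return temperature, pirates


if __name__ == "__main__":
    temperature, pirates = make_series()
    r = pearson(temperature, pirates)
    print(f"correlation between temperature and pirate numbers: {r:.2f}")
```

The correlation comes out strongly negative, simply because both series change steadily over the same period. A pattern-spotting program sees a tight relationship; it takes a human to recognise that the passage of time is the only thing the two have in common.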
Getting together to tackle big data
To help tackle these big data issues, Cancer Research UK, in partnership with conference sponsors Winton Capital Management, recently organised ‘The Big Data Analytic Conference Series: Multidisciplinary challenges of big data’.
At the event, hosted by the British Library, financiers mingled with biologists, and physicists and engineers conversed with traffic planners, as they all had the chance to share their data problems and potential solutions.
Dr Harry Cliff, a particle physicist from CERN, outlined some of the techniques the Large Hadron Collider uses to sift through its gargantuan data hauls. And expert speakers from Deloitte and Google outlined challenges and solutions to big data conundrums.
The Minister for Universities and Science, the Rt Hon David Willetts MP, stressed the importance of big data to the future prosperity of the country and outlined some future projects. Dr Mark Roulston from Winton Capital Management warned the audience how easy it is to draw inaccurate conclusions from data, and described ways to prevent this.
The day ended with an international panel of experts – led by one of our world-leading computational biologists, Professor Nick Luscombe – discussing the problem of selection bias in research with such large sets of data.
“Today is only the beginning,” said Dr Barry Leventhal, a member of the conference organising committee. “We are seeing where we all are and starting to get the ball rolling.”
He went on to say that future meetings will focus on specific problems encountered by big data users, seeking innovative ways to tackle them.
The future is bright, the future is big data
We’re just beginning to realise the potential of big data. But we’re also only just beginning to grasp the scale of the challenges it presents.
By working together, banks, engineers, internet search engines and scientists will be able to take on these challenges and harness this information.
Big data found the Higgs boson. And big data is helping us gaze into the heart of cancer, moving us closer to a time when we can predict how a tumour will behave and choose the right treatment to take it on.
Despite all the big obstacles in our way, we are on the cusp of something special, and it’s an exciting place to be.
Sam Godfrey is a science communications manager at Cancer Research UK