The first thing I notice when I enter the operations center of the Genomic Data Commons is a map of the world displayed on a large HD monitor. The map is speckled with circles, varying in size and shade. I count 16 circles: 16 users worldwide drinking from the GDC’s tap. At 11 a.m. on a Tuesday.
The GDC, which is run out of the University of Chicago and funded by the National Cancer Institute, is one of the largest data repositories of its kind on the planet. Specifically, genomic cancer data. And the size of each circle corresponds to the amount of data being accessed.
At present, five of these circles are in or near China. I turn to the U. of C. professor heading up the GDC, Robert Grossman. “So someone in eastern Mongolia might be downloading genetic information right now?”
He takes a look. “It appears that way.”
“And you don’t think that’s a little odd?”
He seems unfazed, but allows that the circle in Mongolia is “pretty big.”
I scan the globe, such as it is. “Not as big as the one over Nashville,” I point out. “Look at that. What is that? Vanderbilt?”
Grossman shrugs. “I have no idea.”
It doesn’t matter to him who is using the data. That’s the way it’s supposed to work. The data is there for the taking. It’s there for the world. Hurry up and take it.
The Genomic Data Commons represents a true pivot in the fight against cancer. It has taken 10 years to assemble a library of this size and in this form. Think of the GDC as a digital warehouse, one currently storing five petabytes (or five million gigabytes) of molecular-level information on cancerous tumors, accompanied by heavily detailed (but anonymized) data on the patients and the treatments they underwent, all collected from NCI-supported research.
The federal government began pulling this data soon after the mapping of the human genome was completed in 2003. By last June, when the GDC opened its database to the public, it had 14,500 patient cases. In five years, according to the plan, it will grow to 70 times that.
And the beauty of it: It’s available to anyone, anywhere, with an internet connection. (Yes, even you: gdc.cancer.gov. Good luck.) With massive amounts of new information at the fingertips of doctors and researchers, the philosophy goes, will come better—that is, more precise and tailored—treatments.
“We have over a thousand users who come to our site each day,” says Grossman, a genetic medicine professor whose official GDC title is coprincipal investigator (along with U. of C. research scientist Allison Heath). “They’re only beginning to understand what’s here.”
Just like that, the cancer battle shifts from the anatomical to the molecular. And with that comes the realization that an outright cure, if there ever is one, may come not out of laboratories, hospitals, or even pharmaceutical giants but instead from some yet-unwritten algorithm built from this mound of data being assembled here in Chicago.
Scale is the first challenge of comprehending genomic science and the era of big data. Consider that the genomic sequence is a humongous expression of information drawn from the teensiest components of what we are. The genome, the entirety of a person’s DNA, contains roughly 20,000 protein-coding genes and three billion base pairs of nucleotides, the building blocks of genetic material. Genomic sequencing measures and maps all that.
And consider that the massive computing capability that stores and drives the analysis of that information resides in the smallish space, in relative terms, of a rack of servers and hard drives. (Actually, 43 racks, in the case of the GDC.) That’s like saying you just put the entirety of the human language into a pinball machine.
So the tiny becomes massive, and the massive hides in something small.
“Right now, research is not being done at the scale it can be because people don’t have the tools or they don’t have the data,” says Grossman. “That’s what the Commons does. We’re working to create a culture of data sharing, changing how people think about data sharing, while giving them tools for data sharing to improve health.”
He continues: “We’re designing the next-gen version of this now. I hope in five years to be able to manage a genomic commons that will hold a cohort of a million patients.”
I misunderstand. “A quarter of a million?”
“A cohort of a million,” he says. “Meaning, around a million.”
“Around a million,” I say. “Why don’t you just say that?”
He laughs. “ ‘Cohort’ is the term in medicine when you collect data on a group,” he says. “Part of President Obama’s Precision Medicine Initiative was to begin to create these cohorts, to collect genomic data on about a million individuals to improve health.”
He tries to explain why: “What we’re trying to do is, based on the genomics of the tumor, find the best treatment at that time for the particular patient, to make discoveries directly relevant to their particular cancer. That’s a big-data problem.”
Here’s the thing about big data. It’s useful only if it can be examined easily. Data sets need to be compared, clustered, and placed into cohorts quickly. So one of the major challenges in creating the GDC’s cancer library was taking hospital files from all over the country, stripping them of identifying details, and then “harmonizing” them—making the disparate data uniform. The National Cancer Institute has spent more than $400 million to collect and unify the information.
But now that the initial work has been done, it’s possible for a doctor in Boise, Idaho, to sort this information from a desktop. With a few mouse clicks, that doctor could pull genomic information on all female breast cancer survivors. Under 50. Smokers. With children. With a family history of stroke. And the doctor could get a full record of their treatment to see how the cancer responded. No waiting. No cost.
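For readers who want to see the shape of that kind of query, here is a minimal Python sketch of the filtering step. It is an illustration only: the records, the field names, and the `select_cohort` helper are all invented for this example and are not the GDC's actual schema, portal, or API.

```python
# Toy, made-up patient records. Every field name here is hypothetical.
CASES = [
    {"id": "A1", "sex": "female", "cancer": "breast", "age": 44,
     "survivor": True, "smoker": True, "children": True, "stroke_history": True},
    {"id": "A2", "sex": "female", "cancer": "breast", "age": 61,
     "survivor": True, "smoker": True, "children": True, "stroke_history": True},
    {"id": "A3", "sex": "male", "cancer": "lung", "age": 47,
     "survivor": True, "smoker": True, "children": False, "stroke_history": False},
    {"id": "A4", "sex": "female", "cancer": "breast", "age": 39,
     "survivor": True, "smoker": False, "children": True, "stroke_history": True},
    {"id": "A5", "sex": "female", "cancer": "breast", "age": 48,
     "survivor": True, "smoker": True, "children": True, "stroke_history": True},
]


def select_cohort(cases, max_age=50):
    """Keep only cases that satisfy every criterion from the Boise example."""
    return [
        case for case in cases
        if case["sex"] == "female"
        and case["cancer"] == "breast"
        and case["survivor"]
        and case["age"] < max_age
        and case["smoker"]
        and case["children"]
        and case["stroke_history"]
    ]


cohort_ids = [case["id"] for case in select_cohort(CASES)]
print(cohort_ids)  # ['A1', 'A5']
```

Each added criterion narrows the group, which is why harmonized, uniform fields matter: a filter like this only works if every hospital's records describe age, smoking, and family history in the same way.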
“The Genomic Data Commons is designed with the end user being a researcher,” Grossman says. He doesn’t mean just academic researchers, but doctors, too. He even mentions patients as potential users.
Seriously? Patients? “Over time, we’re going to make it easy for them to contribute their own data. At that point, we’ll probably give them a way to put a context to their data.”
A young bioinformatician at the GDC shows me what form that context takes, at least for now. The image on her screen looks like a funnel of pale spiderwebs. The webs, I’m told, represent masses of data points. The bioinformatician has isolated more than 60 such points, like clumps of houseflies caught in the webs. “Each green dot is a patient,” she tells me. “Typically, doctors in these cases are just looking at the slide stains from a tissue biopsy, which would look like this.” A mouse click, and everything flattens out into two points on x- and y-axes. Another click, and the spiderweb returns. “Here we’re looking at the genomic signature. This is a next-generation diagnosis based on each tumor’s specific genomics.”
Grossman, who’s been looking on, can see I’m a bit confused. He brings it back to the bigger picture. “By inputting more and more data for each particular genomic signature and the particular combination of drugs used in its treatment,” he says, “we are more able to predict how long a patient with a similar genomic type of cancer will live, which will allow doctors to decide what combinations of drugs for what tumors are most effective in that available time.”
Even with all its data clouds and clusters, you have to wonder if a government-funded, academic-run project like the GDC can, alone, push cancer treatment into its molecular future. Can it get doctors to fundamentally change their practices? Eventually, that force may come from the for-profit commercial side.
A couple of years ago, after Eric Lefkofsky made his fortune cofounding Groupon, his wife was diagnosed with cancer. This led to a lot of time spent in hospital rooms, watching instruments at work. Instruments, not technology.
“The vast majority of cancer patients will walk into a hospital and be treated in much the same way they were 10 years ago—and 10 years before that,” Lefkofsky tells me. “The standard regimen is still chemotherapy, surgery, and radiation, with a little bit of immunotherapy just showing up now. What I found so amazing is that this process isn’t more technology driven. Look at how we use data in health care, versus how we use data in e-commerce or in entertainment. The amount of technology you use in deciding what movie to watch tonight is infinitely more powerful than the amount of technology we’ve been putting in the hands of physicians fighting cancer. That was the bridge that had to be crossed.”
This realization led Lefkofsky in 2015 to launch Tempus, which he sees as that bridge between the kind of big data the GDC is compiling and the treatment practice for individual cancer patients. The health-tech startup, ensconced in the Groupon building on Goose Island, performs genomic sequencing of cancerous cells for doctors but also develops software tools to make it easier to employ that information in treatment.
Lefkofsky explains where Tempus fits into this puzzle: “There are companies that can help hospitals or physicians sequence a patient. There are companies that sell the software to read that. There are companies that you could engage as service providers to help with IT solutions. Nobody was taking a holistic approach, and nobody was really trying to say, ‘All right, how do we institutionalize this at scale?’ How do we actually sit down with physicians and say, ‘What are the tools you need in a clinical setting in order to, actually, improve patient outcomes?’ ”
Which is why Tempus has recruited an army of thinkers from various disciplines to come together under one roof. “One of the things that’s really special about this space,” says Tempus president Kevin White, who is on leave as a professor of human genetics and medicine at the U. of C., “is that you have computational biologists sitting next to software engineers, sitting next to data scientists, sitting next to operations folks, sitting next to physicians. It’s a collaborative space.”
Tempus’s future lies largely in whether it can make genomic sequencing a routine part of all cancer care. “Physicians aren’t used to collecting that data at the outset of treatment,” Lefkofsky says. “The possibility didn’t exist even 15 years ago. Sequencing somebody then would have cost millions of dollars, hundreds of millions even. Today the price is in the thousands of dollars—and falling.”
But genomic sequencing alone won’t be enough to bring more targeted care to cancer patients. “Just gathering the data is interesting, but only so interesting,” Lefkofsky says. “Where it gets clinically relevant is when you take that data and you actually help patients digest it and make decisions that impact what they do next. That’s what Tempus will do. We ask ourselves: How do you make this quantity of data digestible to an oncologist who has 50 patients at any given moment when their days are already overwhelming?”
Tempus is about to dive into the work of it. In September, the Robert H. Lurie Comprehensive Cancer Center of Northwestern University tapped the company to provide genomic sequencing and molecular analysis for a significant new program called OncoSET. The program targets patients whose cancers have resisted traditional therapies. The idea: use genomic data to identify new, individually tailored treatments, alongside novel clinical trials representing the best hope for these patients. (Lefkofsky has previous ties with the Lurie Cancer Center: In 2015, he and his wife donated $1 million to the center for the study of oncology treatment.) For Tempus, it’s another step toward standardizing genomic sequencing as a part of every cancer patient’s testing.
With a subject like cancer, sometimes you just want the bottom line. Some big takeaway. For me, my “Holy shit!” moment came in front of another computer screen at the GDC while I was speaking with Zhenyu Zhang, the GDC’s lead bioinformatician.
Grossman introduced us: “Zhenyu was in charge of the bioinformatics pipelines that process data that shows we can begin to look at cancer from a fundamentally different perspective. Not based upon ‘Is it from the lung? Is it from the ovaries? Is it from the skin?’ We examine how it looks genomically.”
Grossman has spent the better part of two hours convincing me that the human brain can’t really “look” at anything as incomprehensibly large as the genome. You don’t glance from a lighthouse to measure how much water is in the ocean. That’s where the analysis tools the GDC is developing come in.
“What can you really see?” I ask.
Zhang directs me to the screen, to the clumps of dots representing patients. One grouping is breast cancer, another is colon cancer, another esophageal cancer, and so forth. The clumps are generally color-specific. Yet in each group there are data points in the color of another cluster.
“What’s that?” I say, pointing at two brown triangles among a clump of green squares. The brown triangles represent cases of bile duct cancer, and the green squares are liver cancer. As Zhang explains, these particular cases of bile duct cancer have more in common, genetically speaking, with cases of liver cancer than with other bile duct cancers.
“So are you saying where the cancer is located doesn’t necessarily define the best way to classify it?” I ask. “Or even treat it? That every case of cancer is distinct from every other?”
“Yes,” Zhang says. “Every individual cancer has a genomic signature that is its own.”
This feels disorienting to me. Suddenly cancer as I’ve known it seems like a 19th-century medical delusion.
I ask Zhang to elaborate on what we are seeing on screen. “Basically, it’s a neighborhood method,” he says. “For each patient, to find the drug treatments that work, we look at what the other cancers around them look like. The genomic groupings tell us more than the [anatomic] labels.”
Grossman tries to clarify. “We do a mathematical analysis in several hundred dimensions.” He pauses. “Wait, I think it’s actually more.”
He turns to Zhang. “How many dimensions was it?”
Zhang clears his throat. “There are like half a million dimensions. I tried to trace back to 200 dimensions at first, and then I used another algorithm to push that back to—”
Grossman smiles and interrupts. “We have a lot of computers to do that,” he says, turning back to me. He’s made his point.
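The "neighborhood method" Zhang describes can be sketched in a few lines of Python. What follows is a toy k-nearest-neighbors lookup, which is one common way to act on a neighborhood idea; whether it matches the GDC's actual algorithm is an assumption. The three-number "signatures" stand in for the hundreds of dimensions Zhang mentions, the dimension-reduction step is not shown, and every patient, value, and function name is invented.

```python
import math
from collections import Counter

# Made-up patients: (id, anatomic label, toy genomic signature).
PATIENTS = [
    ("liver-1", "liver", (0.9, 0.1, 0.2)),
    ("liver-2", "liver", (0.8, 0.2, 0.1)),
    ("liver-3", "liver", (1.0, 0.0, 0.3)),
    ("bile-1", "bile duct", (0.1, 0.9, 0.8)),
    ("bile-2", "bile duct", (0.2, 0.8, 0.9)),
    ("bile-3", "bile duct", (0.0, 1.0, 0.7)),
]


def nearest_neighbors(signature, patients, k=3):
    """Return the k patients whose signatures are closest (Euclidean distance)."""
    ranked = sorted(patients, key=lambda p: math.dist(signature, p[2]))
    return ranked[:k]


def neighborhood_label(signature, patients, k=3):
    """Return the most common anatomic label among the k nearest patients."""
    labels = [label for _, label, _ in nearest_neighbors(signature, patients, k)]
    return Counter(labels).most_common(1)[0][0]


# A tumor found in the bile duct whose toy signature sits among the liver cases.
new_case = (0.85, 0.15, 0.25)
genomic_neighborhood = neighborhood_label(new_case, PATIENTS)
print(genomic_neighborhood)  # liver
```

This is the brown triangle among the green squares: the anatomic label says bile duct, but the nearest neighbors are all liver cases, so their treatment histories are the ones worth looking at first.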