Interview with Dr. Anthony Kerlavage

Given that modern cancer research efforts need to link molecular information with patient data to inform precision medicine-based treatment strategies, what are the major obstacles to achieving this, and how are they being overcome?

As you say, to progress towards the goal of true precision medicine, the cancer research community will need to access, integrate, and analyze many different types of data, and to be successful those data must be portable and easily shared among providers, researchers, patients, and research participants. Facilitating this kind of data integration and access requires significant planning and investment in the underlying technology and informatics. Traditionally, these different data types have been stored in separate databases without much consideration for how they might be shared.

This creates several challenges, which NCI and the broader cancer community are currently working to overcome.

One of the biggest challenges is the quality and consistency of data, which requires harmonization and the application of standard metadata. Without this, sharing data becomes much more difficult, and even within repositories containing only one major data type, the value and usability of the data are significantly diminished. Active curation of data as it is submitted to a repository or data commons is a necessary step in making it usable and reliable.

A related issue is access to data across different domains – for example, genomic data and associated patient clinical data – which need to be queried and analyzed together to be truly useful. Efforts to create standard patient identifiers that allow for search and analysis while protecting patient privacy are critical, as are standardized metadata and APIs that facilitate search across repositories. Additionally, patient consent needs to be much broader: most patients want to share their information to advance research and help other patients, yet consents still tend to be quite limited, restricted only to the study in which the patient is participating. Broader consent and support for data curation will also make it easier for researchers to contribute their data to open repositories, removing barriers that currently exist in data sharing.

Finally, the size of the data and the compute power required for analysis present additional challenges. Storing genomic and pathology imaging data, for example, requires extremely large databases, and the data are difficult and time-consuming to download. Many smaller institutions simply don’t have the servers to store or compute on such data. Investment in innovative infrastructures that support researcher and clinician access to big data is absolutely critical to progress towards the vision of precision medicine.

Can you give us an example of recent projects which are really pushing the field forward?

How can artificial intelligence allow us to achieve precision medicine at scale?

The Beau Biden Cancer Moonshot is allocating funding for the specific activities and goals outlined by its Blue Ribbon Panel. NCI is heavily engaged in all of the recommendations; one of particular note is building a national cancer data ecosystem. NCI has initiated several programs to advance this ambitious goal.

The Genomic Data Commons (GDC), which was launched by NCI last year, is a unified data repository that enables data sharing across cancer genomic studies. The GDC has been tasked with creating a standardized data submission process, ensuring that the data are fit for submission, harmonizing submitted genomic datasets using a common pipeline, and providing secure access to visualize and download these data. This curation ensures that the data are high quality and consistent.
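[Editor's note: for readers who want to explore the harmonized data programmatically, the GDC also exposes a public REST API. The sketch below is an unofficial illustration, not an NCI example; it builds the query parameters for the GDC's documented JSON filter syntax without actually contacting the server.]

```python
import json

GDC_CASES_ENDPOINT = "https://api.gdc.cancer.gov/cases"  # public GDC REST API

def build_gdc_case_query(primary_site, size=5):
    """Build query parameters for a GDC /cases request.

    Uses the GDC API's JSON filter syntax; pass the result to
    requests.get(GDC_CASES_ENDPOINT, params=...) to execute it.
    """
    filters = {
        "op": "in",
        "content": {"field": "primary_site", "value": [primary_site]},
    }
    return {
        "filters": json.dumps(filters),    # filters travel as a JSON string
        "fields": "case_id,primary_site",  # restrict the response to two fields
        "format": "JSON",
        "size": str(size),                 # number of records to return
    }

params = build_gdc_case_query("Lung")
```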

The NCI also launched three Cancer Genomics Cloud (CGC) Pilots, which co-locate genomic and other research data with elastic cloud computing and analytic tools that allow researchers to analyze large datasets. As the pilot stage of these projects ends, the cloud platforms are transitioning to become ongoing cloud resources for the cancer research community.

Leveraging the components created by the GDC and the CGC Pilots, NCI is also working to create a Cancer Research Data Commons: a framework for the infrastructure required to stand up and support a commons, along with a set of tools and guidelines for interoperability and data quality. This Data Commons would become a crucial component of the larger cancer data ecosystem described by the Blue Ribbon Panel.

There are many other activities and projects moving the field forward.

The Global Alliance for Genomics and Health (GA4GH), which has broad participation from the research community, has four working groups and 25 initiatives focused on approaches to interoperability for genomic data. One team, for example, is focused on strategies for containers and workflows, to provide the ability to share reproducible genomic pipelines. Another team is developing a set of standardized APIs to allow for streaming of genomic data.
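[Editor's note: GA4GH's streaming work is exemplified by the htsget protocol, which serves slices of genomic read data over plain HTTPS. The sketch below is an unofficial illustration of the idea; the server URL and dataset ID are hypothetical.]

```python
from urllib.parse import urlencode

def htsget_reads_url(base_url, read_id, reference_name, start, end, fmt="BAM"):
    """Compose an htsget-style request URL for a slice of a read set.

    In the htsget pattern, the server replies with JSON "tickets"
    pointing at byte ranges of the underlying file, so a client can
    stream just the genomic region it needs instead of downloading
    the whole dataset.
    """
    query = urlencode({
        "format": fmt,
        "referenceName": reference_name,
        "start": start,
        "end": end,
    })
    return f"{base_url}/reads/{read_id}?{query}"

# Hypothetical server and dataset ID, for illustration only.
url = htsget_reads_url("https://htsget.example.org", "NA12878", "chr1", 10000, 20000)
```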

The NCI Informatics Technology for Cancer Research (ITCR) program focuses exclusively on the informatics that support cancer research and allows for the development of innovative tools and approaches to solving difficult problems in cancer. The program provides seed money for early stage tools and algorithms, as well as larger grants when these are ready to scale up.

Artificial intelligence is a term that is used rather broadly. In the context of cancer research, I would focus on the machine learning efforts that strive to train algorithms to detect potential cancers and narrow the screening process for clinicians. Ron Summers, who directs the Imaging Biomarkers and Computer-Aided Diagnosis (CAD) Laboratory at the NIH Clinical Center, is doing great work in fully automated CT interpretation and decision support in radiology, for example. The Data Science Bowl this past spring focused on developing machine learning algorithms to diagnose the presence of lung cancer more accurately, at lower false positive rates than are currently encountered.

Another new project, the Exascale Cancer Distributed Learning Environment (CANDLE), is a partnership between the Department of Energy (DOE) and the Frederick National Laboratory for Cancer Research (FNLCR). DOE laboratories are drawing on their strengths in high-performance computing (HPC), machine learning, and data analytics, and combining them with the scientific domain strengths of NCI and FNLCR, to deliver critical tools to advance precision medicine.

Recognizing that at least 65% of clinical data elements come from unstructured data, the NCI Surveillance, Epidemiology, and End Results (SEER) program, in partnership with the DOE, is launching a natural language processing (NLP) pilot. The plan is to use the computing power of the DOE and its expertise in complex algorithms, in combination with the scientific expertise at the NCI, to create scalable deep learning for text comprehension and extraction. The CDC and FDA are also partners in this effort.
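[Editor's note: deep-learning text models are beyond a short snippet, but the underlying task, pulling structured data elements out of free-text clinical notes, can be illustrated with a toy rule-based extractor. This is entirely the editor's example; a production SEER pipeline would use trained models, not regular expressions.]

```python
import re

# Toy patterns for two data elements a cancer registry might want.
PATTERNS = {
    "histology": re.compile(r"\b(adenocarcinoma|squamous cell carcinoma)\b", re.I),
    "laterality": re.compile(r"\b(left|right)\b", re.I),
}

def extract_elements(report_text):
    """Return the first match for each known data element, or None."""
    found = {}
    for name, pattern in PATTERNS.items():
        match = pattern.search(report_text)
        found[name] = match.group(0).lower() if match else None
    return found

note = "Biopsy of the right upper lobe shows adenocarcinoma."
elements = extract_elements(note)
```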

The concept of expanding cancer data access and sharing has been a fundamental principle in large precision medicine initiatives over the last few years. With the competition we are witnessing in the immuno-oncology field, how can we ensure this continues?

We first need to acknowledge that for all of these efforts to be successful, data sharing and access are an absolute necessity. Researchers across the board know this, and realize that whether they are in a competitive or a collaborative situation, the more data they have, the more likely they are to succeed. The field of cancer research is so vast and complex that there is plenty of room for competition, and even more so when data are readily available.

The NIH is also working to incentivize data sharing and access. All grant recipients are required to submit a Data Sharing Plan as part of their proposal. There are also initiatives like the one sponsored by the Gates Foundation to eliminate the embargo period for data associated with a new publication, and ensure appropriate tagging and metadata for discoverability. These kinds of initiatives are moving the scientific field in the inevitable direction of open access to scientific data.

There are also major data sharing consortia already collecting and sharing patient data. The AACR Project GENIE is a regulatory-grade registry that aggregates and links clinical-grade cancer genomic data with clinical outcomes from tens of thousands of cancer patients treated at multiple international institutions. GENIE is a model for aggregating, harmonizing, and sharing clinical-grade, next-generation sequencing data obtained during routine medical practice. CancerLinQ is another collaborative project, spearheaded by ASCO and SAP, that aggregates real-world data from patients, over one million records to date, and provides access to stakeholders across the cancer community.

The Applied Proteogenomics OrganizationaL Learning and Outcomes (APOLLO) network is yet another collaboration, this one between NCI, the Department of Defense (DoD), and the Department of Veterans Affairs (VA), with a goal of incorporating proteogenomics into patient care as a way of looking beyond the genome, to the activity and expression of the proteins that the genome encodes. APOLLO data will be shared through the NCI Data Commons.

‘Challenges’ have become a popular way to engage and motivate the research and innovation communities to solve difficult problems. I have seen that your department has initiated such projects in the past – e.g. the Cancer Genomics Cloud Pilots and DREAM Challenges – what is planned now and for the future?

Challenges are a great way to engage the community. The imaging community has really spearheaded this approach with great success, with challenges to create new algorithms in areas such as CT radiomics, classification and nuclei segmentation in digital pathology, and quantitative image analysis methods for cancerous lesions. GA4GH is sponsoring a new challenge to evaluate systems and platforms for executing portable analysis workflows in the interest of developing common standards and best practices. The NCI is also sponsoring a challenge to use public and novel proteogenomic data generated by the Clinical Proteomic Tumor Analysis Consortium (CPTAC) to benchmark an understanding of the interfaces between different layers of information in a population of cancer cells. I think that challenges will continue to be used as a way to encourage and reward innovation.

What conversations and who are you hoping to meet at the Big Data in Precision Medicine Summit in the fall in Washington DC?

I’m looking forward to hearing about innovative initiatives in cancer research and other areas of medicine and health. The advances in technology have precipitated such a revolution in discovery and treatment that it is impossible to stay abreast of everything that is happening – summits like these are a great way to hear about the vast array of work being done in the field.

