Publicly Available Databases: Closing the Gap in Cancer Health Disparities Research 

Cancer health disparities are driven by the compounding of multiple factors, including gaps in research resulting from insufficient representation of racial and ethnic minorities in clinical, genomic, and population studies.  

Because of this lack of diversity, results are not always applicable to all patients, and potential disease drivers can be missed. 

Large, publicly available databases provide a source of more diverse and comprehensive data and can be leveraged to extend the applicability of research findings, uncover biologic and socioeconomic aspects that contribute to disproportionate cancer burden in some populations, and expand the potential of precision medicine. 

A session at the 15th AACR Conference on The Science of Cancer Health Disparities in Racial/Ethnic Minorities and the Medically Underserved discussed some of these databases and the research they are enabling. 

National Cancer Database 

The National Cancer Database (NCDB), jointly sponsored by the American College of Surgeons and the American Cancer Society, is a clinical oncology database built from hospital registry data. Ryan McCabe, PhD, NCDB senior manager from the American College of Surgeons, described it as the preeminent multidisciplinary national clinical cancer registry. 

Launched in 1989, the NCDB collects data from 1,500 Commission on Cancer-accredited hospitals, about 26 percent of the hospitals in the U.S., and captures approximately 72 percent of all newly diagnosed cancer cases in the country each year across 80 disease sites. 

The database includes information on tumor characteristics—such as site-specific markers, disease staging, and first course of treatment—and outcomes, collecting all-cause mortality for 15 years after diagnosis.  

As McCabe explained, NCDB’s primary focus is on quality measures to support improvement in cancer care. “We have algorithms that we build that process the data and then label the data as being compliant or noncompliant with a number of quality measures, and then we provide that data back to the hospitals.” 

These data are accessible to clinician-investigators at Commission on Cancer-accredited hospitals. “Every year, we distribute about 1,000 files, and, since 2009, the data that we distributed has resulted in over 1,500 publications,” said McCabe. “It is a very big program that is creating a lot of information out there.” 

The NCDB also maintains web-based applications to promote access by researchers, clinicians, and the general public. 

The NCDB represents a significant resource for cancer health disparities research, with information on race, insurance status, distance from patient address to treating hospital, and average income and educational level for each ZIP code. 

McCabe pointed out that NCDB data are being used to address many questions in cancer disparities research. For example, one study found that the expansion of Medicaid resulted in a shift to early-stage cancer at diagnosis and reduced certain disparities among young adult patients. Another study uncovered substantial racial disparities in the use of proton beam therapy. 

All of Us Research Program 

The All of Us Research Program is an initiative of the National Institutes of Health that intends to recruit more than 1 million participants across the U.S. to build one of the most diverse health databases in history.  

The Program is now in its fifth year. As its chief engagement officer Karriem S. Watson, DHSc, MS, MPH, pointed out, 80 percent of the 523,000 people enrolled so far come from groups underrepresented in biomedical research, and 45 percent self-identified as belonging to racial and ethnic minorities. 

The All of Us database has three levels of access. According to Watson, the “public tier” access, available to anyone with internet access, is critical because it allows participants to see firsthand how the information they volunteered is used by researchers, and what type of research questions are being asked. 

The “registered-tier” access, available to approved researchers, includes deidentified data from electronic health records, wearable devices, surveys, and physical measurements taken at the time of participant enrollment. 

Lastly, in addition to the data present in the registered tier, the “controlled tier” dataset contains expanded demographic data and genomic data from whole genome sequencing and genotyping arrays, and it will soon provide ZIP code data. 

Currently, the All of Us database includes about 42,000 individuals with a cancer diagnosis, representing approximately 18 percent of all participants.  

Watson reported that 162 active cancer-focused research projects are using the All of Us dataset to investigate areas such as the impact of nutrition and physical fitness on prostate cancer, the social determinants of colorectal cancer, the utility of data on breast cancer genomic variants in predicting cancer risk, and the differences in the prevalence of cancer between people with and without a family history of cancer.  

He highlighted some examples of published studies that used All of Us data to analyze factors that impact health care utilization among cancer survivors; examine the current landscape of precision medicine for cancer patients; and study lung cancer risk factors.  

A priority of the All of Us Research Program is ensuring data access for investigators at all levels and from all backgrounds. The program is partnering with Historically Black Colleges and Universities to provide training in the use of the dataset and is working with the University of Utah to develop an All of Us data curriculum for high schools to ensure that teachers introduce students from all backgrounds to the importance of data analysis and genomics.  

The All of Us Research Program represents the NIH’s foray into the field of liberation data science, which is based on inclusion, cooperation, and data sharing between researchers and the communities impacted by health disparities, Watson explained.  


AACR Project GENIE 

AACR Project GENIE is an international pan-cancer registry of real-world data. “We are driven by openness, transparency, and inclusion,” said Jocelyn Lee, PhD, associate director for Project GENIE. “The ultimate goal and the foundation of GENIE is to link clinical sequencing data to clinical outcomes from the patients that are seen and treated at our participating institutions.” 

Launched in 2015, Project GENIE has had 12 public data releases and contains more than 150,000 sequenced tumors from about 130,000 patients, spanning 110 major cancer types. Data are contributed by 19 leading cancer centers in North America and Europe.  

So far, Project GENIE has had 11 publications and more than 700 citations, has completed five sponsored studies in collaboration with pharmaceutical companies, and has contributed to one regulatory filing.  

The registry includes approximately 5,000 pediatric patients and 9,000 young adult patients. As Lee explained, most of the cancers within the registry are lung, breast, and colorectal cancers, but given the large number of samples, it also provides data for research on rare cancers.  

About a third of the data in the registry comes from non-white patients (more than 43,000). “In terms of our absolute numbers, Project GENIE is probably one of the most diverse datasets that are currently available,” said Lee. 

Different initiatives are underway to enhance the registry to catalyze health equity research, including an open call to recruit new participating institutions that treat and sequence underrepresented populations, and a pilot project using ZIP code and other proxies to surface social determinants of health variables related to the data already present in the database. 

Lee discussed how data in the Project GENIE registry have been used so far to focus on diversity. For example, a study presented at the AACR Annual Meeting 2021 found that Black patients with early-onset colorectal cancer had significantly higher tumor mutation burden than white patients, providing evidence that the molecular features of this disease may differ by race. This study and additional research on cancer disparities powered by Project GENIE were reviewed in an earlier blog post.   

Another study used Project GENIE data to analyze the racial differences in tumor genomic profiles of prostate cancer and reported that Black patients with metastatic prostate cancer were more likely to carry mutations in the androgen receptor gene and in DNA repair genes than white or Asian patients. 

Research published in the AACR journal Cancer Discovery revealed an association between Native American genetic ancestry and lung cancer molecular alterations. The investigators developed an algorithm to identify ancestry and detect associated molecular changes, and are now working with Project GENIE to understand the influence of ancestry-associated genes on patient outcomes.