Solid Tumor Clinical Abstraction Data Model Development RFP
Materials
Updated RFP Slide Deck
GENIE CDM Base Data Elements Overview
Request for proposals (RFP) Questions and Answers
The following questions and answers are also available as a downloadable document.
- What is the project timeline?
- Completion by Q2 2025
- How many patients are in the source dataset? How many years average of patient data?
- There are currently approximately 185,000 patients in the registry. Data elements are typically collected retrospectively, starting from the initial cancer diagnosis; as a result, some patient timelines may exceed 20 years. However, most of our genomic data spans 2013 to the present.
- Please describe the registry process today. (1) How is data shared, aggregated, and harmonized? (2) What is the user experience? (3) What about this process is working/not working? (4) What tools are being used?
- More information and the current workflow can be found online. The current process is well established and defined. The focus of this work is to build a unified data model that facilitates programmatic data collection and, optionally, to facilitate its deployment in our infrastructure. Our tech stack includes REDCap as our EDC; Synapse (Sage Bionetworks) to store, harmonize, and provide access to and compute on the data; and cBioPortal for visualization.
- Given the registry is already live, why are you looking to do the clinical abstraction data model now? How do you currently get data?
- Past studies deployed different but related data models and dictionaries. A common base data model closely aligns with our strategic plan and facilitates scaling and better interoperability with other datasets, among other benefits.
- For the components of the data collection that may need to be completed with the assistance of manual curation, will that be done by clinical staff at each individual cancer center?
- Yes
- Will the AACR provide samples of the actual data for the data model construction?
- Yes
- Who will be responsible for training cancer center clinical staff on the manual curation of content into eCRFs? The vendor or AACR GENIE?
- AACR GENIE. If the vendor is able to produce curation directives and training materials, please include in the proposal; however, this is not a requirement.
- Source data is coming from different organizations. Is it normalized to a single format at the AACR? If not, how many different formats are there?
- The data is harmonized; for more information about the collection and harmonization of Project GENIE data please see the AACR Project GENIE data guide.
- How many new partners/centers will be encouraged in the GENIE initiative in the future?
- GENIE is always considering new applications to become a consortium member. There is no defined maximum number of consortium sites allowed in the project.
- What security and compliance requirements exist?
- Our current data security and compliance processes are also outlined in the AACR Project GENIE data guide.
- Is design of data protection and data governance layer in scope for this project?
- No, these exist and the proposed solution will be deployed within this infrastructure. We welcome feedback on our existing governance if weaknesses are uncovered during the work.
- Are users accessing the data on a pay-per-use model, for example to fund the programs at the AACR and/or cover infrastructure costs?
- The current business model has no bearing on the scope or implementation of the RFP.
- Do you have use cases already defined which will drive the scope and granularity of the data model?
- Yes, you may review more about how the GENIE data are utilized.
- Have the data elements been specified and is the data dictionary built? Does GENIE contain PHI?
- Yes, a high-level overview of the data elements has now been listed on the RFP webpage. GENIE does not contain PHI; all data are de-identified.
- For the “other data standards or value sets,” will the AACR provide the terminology sets to load into the data model?
- The AACR has defined data standards and value sets. However, we’re open to additional suggestions or recommendations by your team.
- How often does the AACR anticipate this resource being updated? Constantly or once a year? From a scope perspective, are you interested in a model for long term support and further development of the data model or is the scope limited to the initial development only?
- The scope of this RFP is focused on the initial development of a model; technologies that facilitate long-term maintenance and continued development by our existing technological partners will be prioritized.
- Structured vs. unstructured data and Manual Curation:
- The final work product must support ingestion of both structured and unstructured data through both curation and NLP/ML/AI methodologies.
- If curation is in scope of this project, how much manual vs. automatic curation does the AACR anticipate? Do they have any idea how standardized the data they want to incorporate already is? If curation is in scope of this project, are there any data processing steps it would be useful to include as part of the curation? For example, automated histopathology results could be generated from image files.
- We have not yet determined the relative contributions of manual vs. programmatic data collection. Please include any proposed solutions for programmatic data collection for consideration.
- For the unstructured data, do you envision the use of NLP?
- There are parallel efforts underway to collect unstructured data programmatically. Please include any proposed solutions in your response for consideration.
- Is the data model intended to encompass all 12 cancer types or have 12 separate data model products?
- The request is to build a ‘core’ data model with additional cancer-specific modules contained within the same data model solution.
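One way to picture a 'core' model with cancer-specific modules inside a single solution is sketched below. This is only an illustration: the field names, types, module names, and the `fields_for` helper are hypothetical and are not taken from the GENIE data elements.

```python
from typing import Dict

# Hypothetical core fields shared by every cancer type (illustration only).
CORE_FIELDS: Dict[str, str] = {
    "patient_id": "string",
    "cancer_type": "string",
    "stage_at_diagnosis": "string",
}

# Hypothetical cancer-specific modules; each adds fields on top of the core.
CANCER_MODULES: Dict[str, Dict[str, str]] = {
    "breast": {"her2_status": "string"},
    "prostate": {"psa_ng_ml": "number"},
}


def fields_for(cancer_type: str) -> Dict[str, str]:
    """Return the core fields plus any module fields for one cancer type."""
    fields = dict(CORE_FIELDS)  # copy so the core definition is never mutated
    for name, field_type in CANCER_MODULES.get(cancer_type, {}).items():
        # Prefix module fields so they stay distinct from core fields.
        fields[f"{cancer_type}.{name}"] = field_type
    return fields


breast_fields = fields_for("breast")
```

In this sketch, a cancer type without a module simply receives the core fields, so a new module can be added later without touching the core definition.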
- Does the model need to be extendable to other cancer types than the ones mentioned in slide seven?
- Yes, but it is not part of this initial scoping. Currently, the focus is on a solid tumor cancer model and the 12 initial cancer types.
- What is the current development environment at the AACR? Is it cloud based (AWS/Azure/Google) or internal network?
- It is a mixture of US-based on-premises infrastructure and AWS.
- Are there tools/platforms that must be integrated with this solution?
- Synapse (via Sage Bionetworks).
- Are there any existing vendor relationships or technologies that we should consider in our proposal? Is the Synapse infrastructure non-negotiable or can alternative providers be accepted?
- Use of Synapse is non-negotiable; however, we will review suggestions for alternative solutions, provided they are also compatible with REDCap.
- Does the final solution, along with data collection and modeling, for this RFP include a platform for data storage and management, and for users to perform analyses?
- No
- What EHR Systems (types), Registries are utilized by GENIE data partners/centers?
- Most use the Epic EHR; however, the sources are varied, and pipelines to the sources are out of scope for this RFP.
- Would the AACR be able to provide further detail on the desired format of the data model and the final format of the harmonized datasets? Would those datasets ultimately be stored in databases? If yes, would these databases be MongoDB (NoSQL) or relational SQL databases?
- The ultimate target for the data is a relational database, currently SQL. Any solution should also be compatible with REDCap.
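As a rough illustration of a relational target, the sketch below lays out three linked tables. It makes several assumptions: SQLite is used only because it ships with Python (the answer above says only that the target is relational, currently SQL), and the table names, column names, and sample values are hypothetical.

```python
import sqlite3

# In-memory database for illustration; the real target engine is not specified.
conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite leaves this off by default

# Hypothetical tables: one patient has many diagnoses, one diagnosis has many treatments.
conn.executescript("""
CREATE TABLE patient (
    patient_id TEXT PRIMARY KEY
);
CREATE TABLE diagnosis (
    diagnosis_id INTEGER PRIMARY KEY,
    patient_id   TEXT NOT NULL REFERENCES patient(patient_id),
    icd10_code   TEXT NOT NULL
);
CREATE TABLE treatment (
    treatment_id INTEGER PRIMARY KEY,
    diagnosis_id INTEGER NOT NULL REFERENCES diagnosis(diagnosis_id),
    regimen      TEXT NOT NULL
);
""")

# Made-up sample rows.
conn.execute("INSERT INTO patient VALUES ('P-0001')")
conn.execute("INSERT INTO diagnosis VALUES (1, 'P-0001', 'C50.9')")
conn.execute("INSERT INTO treatment VALUES (1, 1, 'example regimen')")

# Join the three tables back into one analysis-ready row per treatment.
rows = conn.execute("""
    SELECT p.patient_id, d.icd10_code, t.regimen
    FROM patient p
    JOIN diagnosis d ON d.patient_id = p.patient_id
    JOIN treatment t ON t.diagnosis_id = d.diagnosis_id
""").fetchall()
```

Keeping diagnosis and treatment in separate tables joined by keys mirrors the separation of modules described elsewhere in these answers.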
- Per RFP, the model should establish a common language and framework and support research by providing a framework for analyzing data. Given this instruction, is the need only for a GENIE model or for a duo of model + data analysis framework?
- Currently for this proposal, we’re interested in just a GENIE model (data schema).
- Do you seek just that data schema model and UML, or do you need a working database with testing and real-world manual curation?
- Currently, we’re looking for assistance with developing the data schema model and UML.
- Do you need a machine-readable data schema, or do you also need the algorithm to use the machine-readable data schema?
- We need a machine-readable data schema, preferably in UML.
- Is there a preferred format for the “machine-readable data schema”?
- No
- Is data harmonization or automatic data curation expected as a part of the deliverables? Related: What are expectations for the “Modular and editable derived variable code using python to support collected aligning to the AACR GENIE data model framework”?
- Data harmonization is out of scope. “Modular and editable” refers to structuring the data model/schema in a way that allows integration of individual modules for analysis while remaining distinct for individual updates. For example, cancer diagnosis data should be kept separate and distinct from treatment data within the code itself such that changing or editing one module’s code will not adversely affect the code of other modules.
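The separation described in this answer could look roughly like the sketch below, where each module owns its own derived-variable function and a small registry combines them. The field names, the derived variables, and the `derive_all` helper are hypothetical and chosen only to show the structure.

```python
from typing import Callable, Dict

Record = Dict[str, float]
DerivedModule = Callable[[Record], Dict[str, float]]


def derive_diagnosis(record: Record) -> Dict[str, float]:
    """Diagnosis-only derived variables (hypothetical field names)."""
    # Approximate age in whole years from an age given in days.
    return {"age_at_dx_years": int(record["age_at_dx_days"] / 365.25)}


def derive_treatment(record: Record) -> Dict[str, float]:
    """Treatment-only derived variables (hypothetical field names)."""
    return {"tx_duration_days": record["tx_end_days"] - record["tx_start_days"]}


# Registry of modules; editing one entry does not touch the others.
MODULES: Dict[str, DerivedModule] = {
    "diagnosis": derive_diagnosis,
    "treatment": derive_treatment,
}


def derive_all(record: Record, modules: Dict[str, DerivedModule] = MODULES) -> Dict[str, float]:
    """Run every module and namespace its outputs by module name."""
    derived: Dict[str, float] = {}
    for module_name, derive in modules.items():
        for key, value in derive(record).items():
            derived[f"{module_name}.{key}"] = value
    return derived


example = {"age_at_dx_days": 20000, "tx_start_days": 30, "tx_end_days": 120}
derived = derive_all(example)
```

Because each function reads only its own inputs and writes to its own namespace, a change to the treatment logic cannot alter the diagnosis outputs.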
- Does the AACR have requirements for data consumption layer? In other words, how does the AACR want the data to be presented for downstream analysis or consumption?
- We do not. The data will be stored in SQL tables and ingested into analytic layers from there. Providing the model in UML will allow mapping into our analytic layers.
- OMOP and FHIR are different standards. Have you chosen one and if so, what process will be used to decide which model to focus on? Both models are incomplete models for oncology and mapping from OMOP and FHIR is not a 1-1 mapping.
- The data schema should be built to the FHIR standard. If you also have a solution for mapping the produced schema to OMOP, please include in your proposal; however, it is not necessary for the data schema to be in the OMOP format.
- What is the role of HL7/FHIR: Is it extraction of HL7/FHIR into OMOP, or creation of an HL7/FHIR model?
- Creation of an HL7/FHIR model
- Will it be required that all source data from participating cancer centers be provided in FHIR or OMOP CDM?
- The intention is to collect data programmatically (FHIR based) and map the final dataset to OMOP, but it is not necessary for the data schema to be in the OMOP structure.
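To make the FHIR-in, OMOP-out idea concrete, the sketch below maps a minimal FHIR Condition resource to a row shaped like the OMOP `condition_occurrence` table. It is an assumption-laden illustration, not the project's mapping: the `condition_to_omop` helper, the lookup dictionaries, and the sample resource are hypothetical, and a real mapping would resolve concept IDs against the OMOP vocabularies.

```python
from typing import Dict, Optional


def condition_to_omop(
    condition: dict,
    person_ids: Dict[str, int],
    concept_ids: Optional[Dict[str, int]] = None,
) -> dict:
    """Map one FHIR Condition to a condition_occurrence-style row.

    person_ids and concept_ids are hypothetical lookup tables supplied by the caller.
    """
    concept_ids = concept_ids or {}
    source_code = condition["code"]["coding"][0]["code"]
    return {
        "person_id": person_ids[condition["subject"]["reference"]],
        # 0 marks a source code with no mapped standard concept.
        "condition_concept_id": concept_ids.get(source_code, 0),
        # Keep only the date part of the FHIR dateTime.
        "condition_start_date": condition["onsetDateTime"][:10],
        "condition_source_value": source_code,
    }


# Made-up example resource with only the fields this sketch reads.
example_condition = {
    "resourceType": "Condition",
    "subject": {"reference": "Patient/example-1"},
    "code": {
        "coding": [
            {"system": "http://hl7.org/fhir/sid/icd-10-cm", "code": "C50.911"}
        ]
    },
    "onsetDateTime": "2015-03-02T00:00:00Z",
}

row = condition_to_omop(example_condition, {"Patient/example-1": 1})
```

The sketch keeps the source code alongside the concept ID so that an unmapped code can still be traced back to its origin.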
- Could you provide more details on how you envision integrating OMOP CDM, HL7 FHIR, and other data standards (e.g., NAACCR, ICD-10, AJCC) within the clinical data model?
- We have already specified the required data elements, value sets, and standard ontologies/standards (e.g., NAACCR, ICD-10, AJCC). The scope of this request is generation of a data schema and model; an optional book of work is to map this model to OMOP, but it is not a requirement for a successful application.
- You mention widely accepted standards (e.g., OMOP, CDM, HL7)—do you intend to build an infrastructure only for a restricted set of clinical models (restricting inclusion to GENIE only to those conforming with those models) or is the intent to be able to ingest any type of data models? Also, regarding OMOP and FHIR, do we expect ingestion of data in FHIR only and delivery of queryable data in OMOP? Or ingestion and query in both?
- See answer four questions above: Standards have been selected and will be iterated as appropriate. If your proposed solution can functionally ETL any model into the model created in this SOW, then please include for consideration; however, it is not a requirement for a successful application. We expect data ingestion via FHIR; mapping to OMOP is optional. Please include in submitted proposal if your team has this capability.
- Regarding testing and validation, does the AACR want an ongoing validation process that can be reviewed and used to update the mapping based on results? (For example: a report that identifies how well variables are populated, changes over time, and the introduction of new data formats)
- The AACR will consider the inclusion of an ongoing validation process in the proposal.
- Do you envision the genomic data to be transformed into OMOP?
- While we intend to map the finalized, collected data (including genomic data) to OMOP, it is not necessary for the data schema to be in the OMOP structure.
- How important is it to the AACR GENIE team to contribute any model work back to the community (e.g., standards development organizations (SDOs) and terminology SDOs such as SNOMED International (for SNOMED CT), Regenstrief Institute (for LOINC), or HL7)? Should the proposal only utilize existing routine health care or health care research standards and leave the extension of such standards to the larger community, or, within reason, try to actively advance them?
- Proposed solutions should focus on current health care/research standards. If an opportunity to advance/expand/iterate a community standard arises, we could explore engagement with the respective organization(s) as appropriate.
- There is an indication in the RFP that at least some of the data is in HL7 format. Is it C-CDA? Is a FHIR dataset also available?
- Currently our data are not in HL7 format.
- Are vendors expected to actually deploy the developed models into physical databases or FHIR APIs?
- Currently not in the scope of this request.
- Does the Vendor have to provide the specification details for the data model, or is it a collaboration process between the Vendor and the AACR experts?
- Specification details should be provided, with the ability of AACR staff and associated team to make recommendations or suggestions.
- What, if any, types of molecular and genomic data might be included in or linked to the clinical data model? This is relevant because mutational signature data already known to stratify cancer subtypes (e.g., HER2+) may be more readily incorporated as a clinical feature than something like RNA-seq or scRNA-seq data that may be linked to a patient timepoint. It would be helpful to know more precisely what types of data are expected in order to advise solutions.
- Project GENIE currently collects Next-Generation Sequencing (NGS) data for all patients. You can learn more in the data guide. Future efforts will focus on incorporating RNA-seq or single-cell RNA-seq (scRNA-seq) data. A proposal that can accommodate the integration of these types of molecular and genomic data, in addition to NGS, will be highly beneficial. Standard-of-care clinical tests, such as HER2 testing, PSA, etc., should be modeled as clinical variables.
- Given that the OMOP CDM is developed by the OHDSI open-source community, how do you plan to license the clinical data model that will be built through this project?
- We plan to make the data model available to the community under Creative Commons licenses.
- Are you open to a co-ownership of the code base and IP produced during the project?
- Our intention is to make this available to the community under a Creative Commons license.
- Will the model be open source or closed source?
- Open source
- Additionally, do you intend to share the model and any related tools or resources with the broader community, continuing the spirit of openness and collaboration that defines the GENIE Project?
- Yes
- What communication and project management tools will be used?
- The AACR Project GENIE team currently utilizes Box and Freedcamp for communication and project management, respectively, but is open to other tools as needed.
- For the proposal, are we allowed to include an appendix section with supporting information (e.g., detailed CVs, etc.)?
- Yes
- Will AACR GENIE entertain a proposal submitted by two vendors in partnership?
- Yes
- If multiple vendors can submit as a single solution, is the response limit 10 pages total or 10 pages for each vendor?
- 10 pages total
- Can you provide us some details on the current stakeholder environment that the chosen vendor’s team will have to collaborate with, including AACR staff, volunteers, any other vendors, etc.?
- Other stakeholders beyond AACR staff will include supporting partners from cBioPortal and Sage Bionetworks.
- How does the AACR anticipate collaboration between the AACR, its participating centers, and the vendor? Will there be a working group that is established by the AACR?
- Yes. The AACR has already established a working group for this project.
- Will there be AACR colleagues available to support the core project team? If so, please describe their role(s).
- Yes. The AACR has a staff of project managers and a clinical data manager who will support the project, in addition to technical staff from Sage Bionetworks.