How "anonymous" is Direct-to-Consumer genetic testing?

What's in the contract?

"Currently, some terms commonly included in DTC genomic testing contracts could be construed as unfair or unconscionable in the UK and EU, and also possibly in some US states."

When you go for genetic testing at a hospital, your doctor or genetic counselor can help you give informed consent, and your results are clearly shared with your insurance provider and other caregivers within the healthcare system. Direct-to-Consumer (DTC) genetic testing companies, in contrast, have complex privacy rules that appear in fine print as Terms of Use, Terms of Service, Terms and Conditions, Privacy Policy or Privacy Statement. At the end of the day, as a consumer, you will need to make an informed decision based on the information provided in that fine print. So, how easy is it to make an informed decision?

A 2015 publication surveys DTC genetic testing contracts and finds that over a three-year period (Oct 2011-Nov 2014), 228 companies met the criteria for DTC genetic testing (DTCGT) across multiple subcategories (pharmacogenetic; predisposition; pre-symptomatic; nutrigenetic; carrier testing; testing available through physicians; ancestry; paternity; non-consensual; DNA dating; child talent; athletic ability; miscellaneous). Of these 228 companies, 102 offered health-related services and 71 websites had terms and conditions available. The author concludes that there is a need for greater transparency about the respective risks and benefits of DTCGT. Currently, some terms commonly included in DTCGT contracts could be construed as unfair or unconscionable in the UK and EU, and possibly also in some US states.

Mar 2015: "23andMe and the Promise of Anonymous Genetic Testing" - New York Times Room for Debate


It all boils down to a risk-benefit analysis. A few pieces of information to keep in mind while doing this analysis: a) as research grows, more of our genetic markers will have established correlations with disease risks, and b) genetic data is long lived and is shared with current family members and progeny. In the US, a law is in place, the Genetic Information Nondiscrimination Act (GINA), that protects against discrimination based on genetic information; Canada passed an equivalent law only this year. In the US, GINA was first proposed in 2003, eventually passed in 2008, and faces continued threats. For example, a recent GOP bill, if passed, would allow bosses to pressure their employees to undergo genetic tests and demand to see the results and medical histories of family members. And if employees refuse, they would end up paying a surcharge on their insurance.

 

Can algorithms help protect participant privacy?


While genetic data sharing has more recently highlighted the privacy problem (a de-identified genome is like a fingerprint, and certain biomarkers are shared within families), the issue of re-identification from de-identified data exists with practically any data type. A 2006 publication provides an up-to-date picture of the threat to privacy posed by the disclosure of simple demographic information.

The Global Alliance for Genomics and Health (GA4GH) has provided a forum to discuss the ramifications of privacy breaches in data sharing. And the last two iDASH Privacy & Security Workshops have seen dozens of teams from around the world participate in solving some hard problems in mathematics and computation to enable privacy-protecting analytics.

  1. 2017 challenge 1: De-duplication for Global Alliance for Genomics and Health (GA4GH): Participating teams were given hashed patient attributes (identification number, first and last name, gender, etc.) to develop efficient secure multiparty (>=3) patient linkage protocols that scale well to real world applications (e.g., thousands of centers and millions of records in total). 

  2. 2017 challenge 2: Software Guard Extensions (SGX) based whole genome variants search: Given a database of Whole Genome Sequence VCFs (labeled with case/control), participating teams used SGX to generate the top K most significant SNPs.

  3.  2017 challenge 3: Homomorphic encryption (HME) based logistic regression model learning: Participating teams developed homomorphic algorithms for training a logistic regression model. 

  4. 2016 challenge 1: Practical Protection of Genomic Data Sharing through Beacon Services (privacy-preserving output release): Given a sample Beacon database, participating teams were asked to develop solutions to mitigate the Bustamante attack. The winning solution was from Vanderbilt University.

  5. 2016 challenge 2: Privacy-Preserving Search of Similar Cancer Patients across Organizations (secure multiparty computing): The scenario of this challenge is to find the top-k most similar patients in a database on a panel of genes. The similarity is measured by the edit distance between a query sequence and sequences in the database (a plaintext sketch of this computation appears after this list).

  6. 2016 challenge 3: Testing for Genetic Diseases on Encrypted Genomes (secure outsourcing): Participating teams had to calculate the probability of genetic diseases by matching a set of biomarkers against encrypted genomes stored in a commercial cloud service. The winning solution was from Microsoft Research.
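
As a point of reference for what the 2016 challenge 2 protocols compute, here is a minimal plaintext sketch (no encryption, purely illustrative) of a top-k similar patient query by edit distance; the secure versions perform the same computation under multiparty protocols. The patient IDs and sequences are made up.

```python
from heapq import nsmallest

def edit_distance(a: str, b: str) -> int:
    """Classic dynamic-programming (Levenshtein) edit distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]

def top_k_similar(query_seq: str, database: dict, k: int = 3):
    """Return the k patients whose gene-panel sequence is closest to the query."""
    return nsmallest(k, database.items(),
                     key=lambda item: edit_distance(query_seq, item[1]))

# Toy database with made-up patient IDs and short sequences.
db = {"patient_01": "ACGTACGTAC",
      "patient_02": "ACGTTCGTAA",
      "patient_03": "TTGTACGAAC"}
print(top_k_similar("ACGTACGTAA", db, k=2))
```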

The nuances of these algorithms have indeed turned out to be non-trivial in terms of privacy risk, practicality and analytical accuracy. However, these algorithms will continue to improve and will become available for general research use. And in combination with secure computing infrastructure, they will enable trusted insight sharing across existing data silos.

When scale-up wins over scale-out computing


Graph databases

deep learning, metagenomics, ...

High Performance Computing has been the bread and butter of scientific computing. Cloud computing has enabled massive-scale distributed computing, but there are still some applications that benefit from scale-up computing (aka a supercomputer, i.e. a single computer with a large number of cores or a large amount of RAM). For example, researchers from Oklahoma State University completed the largest metagenome assembly to date, from soil metagenome sequencing data, which required 4 TB of memory. In another example, the NVidia Pascal GPUs (P100) show deep learning acceleration in recent benchmarks.

A biomedical application that is nascent but shows promise for the scale-up computing paradigm is the use of graph databases for modern biomedical data mining. In complex multi-modal biology (e.g. omics, wearables, imaging, ...), the relationships between datasets are hard to characterize using relational databases. The appropriate paradigm for storing and mining these datasets is a graph database. Graph analytics offers the capability to search for and identify different characteristics of a graph dataset: nodes connected to each other, communities containing nodes, the most influential nodes, chokepoints in a dataset, and nodes similar to each other. New implementations in industry have shown that graph algorithms can solve real-world problems such as detecting cyberattacks, creating value from internet-of-things sensor data, analyzing the spread of epidemics (Ebola), and identifying drug interactions faster and more precisely than ever before. An open source tool, Bio4j, is a graph database framework for querying and managing protein-related information that integrates most data available in UniProt KB (SwissProt + TrEMBL), Gene Ontology (GO), UniRef (50, 90, 100), NCBI Taxonomy, and Expasy Enzyme DB. NeuroArch is a graph database framework for querying and executing fruit fly brain circuits. Researchers are increasingly looking towards graph databases when current data models and schemas will not support their research queries and the study has lots of new and disparate data sources that are inherently unstructured.
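
To make the kinds of graph queries described above concrete, here is a minimal sketch using the open source NetworkX library on a toy gene-drug-disease graph; the nodes and edges are invented for illustration and not drawn from Bio4j or NeuroArch.

```python
import networkx as nx

# Toy heterogeneous graph: genes, drugs and diseases as nodes, relationships as edges.
G = nx.Graph()
G.add_edges_from([
    ("BRCA1", "breast cancer"), ("BRCA2", "breast cancer"),
    ("TP53", "breast cancer"), ("TP53", "lung cancer"),
    ("olaparib", "BRCA1"), ("olaparib", "BRCA2"),
    ("cisplatin", "lung cancer"), ("cisplatin", "TP53"),
])

# Most connected / most influential nodes.
print(sorted(nx.degree_centrality(G).items(), key=lambda x: -x[1])[:3])

# Chokepoints: nodes that sit on many shortest paths between others.
print(sorted(nx.betweenness_centrality(G).items(), key=lambda x: -x[1])[:3])

# Communities of tightly connected nodes.
print(list(nx.algorithms.community.greedy_modularity_communities(G)))

# Nodes "similar" to BRCA1 (sharing neighbors), via the Jaccard coefficient.
print(list(nx.jaccard_coefficient(G, [("BRCA1", "BRCA2"), ("BRCA1", "TP53")])))
```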

Scale-up (aka supercomputing) architectures tend to be expensive. For academic researchers, access to these supercomputers is available at large academic centers or supercomputing centers. Here are some of the recent supercomputer installations in the news:

  1. May 2017: The Department of Genetics at Stanford University has acquired its first supercomputer, an SGI (now part of HPE) UV300 unit, via an NIH S10 Shared Instrumentation Grant. This is a newer and badder version of the TGAC system (item 3 below). It has 360 cores, 10 terabytes of RAM, 20 terabytes of flash memory (essentially SSDs with NVMe storage technology), 4 NVidia Pascal GPUs (P100s are especially suited to deep learning), and 150+ terabytes of local scratch storage. (More)
  2. Aug 2016: Pittsburgh Supercomputing Center (PSC) funded by NSF has two HPE Integrity Superdome Xs, each with 16 CPUs (22 cores per CPU totalling 352 cores), 12TB RAM, and 64TB on-node storage (More)
  3. May 2016: The Genome Analysis Centre (TGAC) has recently procured a set of SGI UV300 supercomputers. TGAC is a UK hub for innovative Bioinformatics and hosts one of the largest computing hardware facilities dedicated to life science research in Europe. Their new TGAC platform comprises two SGI UV 300 systems totalling 24 terabytes (12 terabytes each) of RAM, 512 cores and 64TB NVMe storage. (More)

Improving healthcare by data mining Electronic Health Records


Bringing AI to healthcare

More than 60% of deaths in the US happen in an acute care hospital. This predictive model helps the palliative care team get engaged early enough to ensure meaningful services.

Although the number of palliative care teams is at an all-time high (67% of US hospitals have such teams), only 50% of patients in need of palliative care receive the service. The reason for the gap is twofold. First, physicians may not refer patients to palliative care because of overoptimism, time pressures, or treatment inertia. Second, there isn't sufficient capacity to proactively identify candidate patients via manual chart review, an expensive and time-consuming process. This leads to patients experiencing end-of-life discomfort.

This particular study uses EHR data to accurately predict which patients may need palliative care. Most hospitals in the US now have 10-20 years of Electronic Health Record (EHR or EMR) data. It is becoming increasingly possible to harness EHR data to aid healthcare, and a study like this is a demonstration of the possibilities that lie ahead.
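
As an illustration of the general idea only (not the authors' actual model, features, or data), here is a minimal scikit-learn sketch of training a classifier on hypothetical EHR-derived features and ranking patients by predicted risk for chart review.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score

# Hypothetical EHR-derived features per patient, e.g. age, admissions in the
# past year, number of distinct diagnosis codes, days since last discharge.
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 4))
# Hypothetical label: 1 if the patient was later judged to need palliative care.
y = (X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=0.5, size=1000) > 1).astype(int)

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
model = LogisticRegression().fit(X_train, y_train)

# Rank patients by predicted probability so the palliative care team can
# review the highest-risk charts first.
risk = model.predict_proba(X_test)[:, 1]
print("AUC:", roc_auc_score(y_test, risk))
```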

Nov 28, 2017: "A New Algorithm Identifies Candidates for Palliative Care by Predicting When Patients Will Die" - MIT Tech Review


Secure Cloud Computing for Genomic Data

Figure from: peer-reviewed commentary in Nature Biotechnology 34, 588–591 (2016)


Large scale genomics studies involving thousands of whole genome or exome sequences are underway on the Cloud. What makes the Cloud security landscape discussion challenging is that security recommendations differ across regulatory bodies, besides being inconsistent between on-premise and Cloud requirements. For example, Institutional Review Boards (IRBs) often require Health Insurance Portability and Accountability Act (HIPAA) level Cloud security even for data that is not Protected Health Information (PHI). In another example, the Database of Genotypes and Phenotypes (dbGaP) has different encryption requirements for on-premise and Cloud environments. This peer-reviewed commentary provides the genomics community with a set of Cloud security guidelines that will meet a wide range of regulatory requirements. Although the Cloud technology stack will continue to evolve rapidly, thus changing the specifics of implementation, these guidelines will be applicable for the foreseeable future.

While security is a necessary prerequisite for genomic privacy, it is not sufficient. Privacy researchers have shown time and again that availability of de-identified partial genomic data can result in patient re-identification. Several studies suggest that algorithmic methods such as partial homomorphic encryption, secure multi-party computation or differential privacy can provide the necessary privacy-protecting layer within such an architecture. The extent to which these algorithmic methods can be integrated with genomic workflows and with statistical and machine learning tools is under active investigation.
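
As one concrete example of these algorithmic methods, here is a minimal sketch of a differentially private release of a count, such as the number of carriers of a variant that a Beacon-style service might report; the epsilon value and data are illustrative.

```python
import numpy as np

def dp_count(true_count: int, epsilon: float = 0.5) -> float:
    """Release a count with Laplace noise (the Laplace mechanism).

    Adding or removing one participant changes a count by at most 1,
    so noise drawn from Laplace(1/epsilon) gives epsilon-differential privacy.
    """
    noise = np.random.laplace(loc=0.0, scale=1.0 / epsilon)
    return true_count + noise

# Example: release a noisy carrier count instead of the exact value.
print(dp_count(true_count=42, epsilon=0.5))
```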

Interactive data analysis with terabytes of data


Interactive Big Data analytics

using Cloud technologies

Biomedical Big Data and the associated research are rapidly finding a home on public clouds. Part of this comes from the maturation of cloud technologies, and part comes from the desire to play well within a research community. The Cloud democratizes the availability of affordable tools to the broader community. As little as ten years ago, doing large scale computing required building dedicated data centers. These data centers are still a lot more cost effective than the Cloud, but the Cloud takes away the need for upfront investment, thus making it possible to explore first.

This publication (originally posted on the bioRxiv preprint server) demonstrates interactive analytics on terabyte-scale genomic data using Google BigQuery. What makes this approach different? Firstly, interactive analysis brings unprecedented power compared to batch-mode analysis. Data exploration, by definition, is iterative. New exploration strategies are often based on intuition gained from previous exploration. If you can reduce the cost sufficiently and make the exploration real time (or near real time), then you can explore faster and further.

"Google wants to store your genome" - MIT Tech Review Link

"Google wants to store your genome" - MIT Tech Review Link

Figure from original submission to bioRxiv preprint server.


Due to the nature of Big Data, it may not be possible to move data around unless you are affiliated with a university on a fast research network. So the authors here demonstrate use of the Cloud environment for a range of genomic data exploration, all the way from variant calling to QA, GWAS and machine learning methods.
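
For a flavor of this style of interactive exploration, here is a minimal sketch using the BigQuery Python client; the project, dataset, table and column names are hypothetical stand-ins for a variant table loaded into BigQuery, not the authors' dataset.

```python
from google.cloud import bigquery

client = bigquery.Client()  # uses your default Google Cloud project credentials

# Hypothetical variants table; count variants per chromosome for one sample.
sql = """
    SELECT reference_name AS chromosome, COUNT(*) AS n_variants
    FROM `my-project.my_dataset.genome_variants`
    WHERE sample_id = 'NA12878'
    GROUP BY chromosome
    ORDER BY n_variants DESC
"""

for row in client.query(sql).result():
    print(row.chromosome, row.n_variants)
```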

The following quote was part of a Google Cloud blog that summarized adoption of Google Cloud at Stanford University.

“We’re entering an era where people are working with tens of thousands or even millions of genome projects, and you’re never going to easily do that on a local cluster. Cloud computing is where the field is going.”
— Mike Snyder, PhD, Director, Stanford Center for Genomics and Personalized Medicine

NASA twin study

Nasa Twin Study.png

Biology, meet Space

Mark and Scott Kelly are both engineers, retired US Navy captains and now retired astronauts.

 

Mark Kelly (@ShuttleCDRKelly) retired from NASA in 2011 and his space missions are:

2001: STS-108 (12 days)

2006: STS-121 (13 days)

2008: STS-124 (14 days)

2011: STS-134 (16 days)

 

Scott Kelly (@StationCDRKelly) retired from NASA in 2016. His space missions are:

1999: STS-103 (8 days)

2007: STS-118 (13 days)

2010: International Space Station, Expeditions 25-26 (159 days)

2015: International Space Station, Expeditions 43-45 (340 days). The twin study was conducted during this mission.

The Twins Study is ten separate investigations coordinating together and sharing all data and analysis as one large, integrated research team. NASA selected the 10 investigations, two of them at Stanford, to conduct with identical twin astronauts Scott and Mark Kelly. These investigations will provide broader insight into the subtle effects and changes that may occur in spaceflight as compared to on Earth, by studying two individuals who have the same genetics but are in different environments for one year. The studies are broadly classified under four categories:

  1. Human Physiology: How does the spaceflight environment induce changes in different organs like the heart, muscles or brain?
  2. Behavioral Health: How does spaceflight affect perception, reasoning, decision making and alertness?
  3. Microbiology/Microbiome: How do dietary differences and stressors affect the organisms in the twins’ guts? 
  4. Molecular / Omics: How do genes in the cells turn on and off as a result of spaceflight, and how do stressors like radiation, confinement and microgravity prompt changes in the proteins and metabolites gathered in biological samples like blood, saliva, urine and stool? 

 

    NASA Twin Study in News:

    • Oct 28, 2017: NASA Twins Study spots thousands of genes toggling on and off in Scott Kelly, PBS (Link)
    • Aug 26, 2017: NASA sent one identical twin brother to space for a year and studied how it changed him — here are the first results, Independent (Link)
    • Aug 24, 2017: Exploring the ground truth: NASA's twin study investigates metabolites, ScienceDaily (Link)
    • Sep 7, 2016: What are the long-term health effects of living in space? NASA is studying twins Mark and Scott Kelly to find out. - Los Angeles Times (Link)
    • Mar 4, 2016: "Everybody Stretches" without Gravity: Mark Kelly Talks About NASA's Twins Study - NPR (Link)
    • Mar 1, 2016: A tale of two astronauts: Scott and Mark Kelly begin new phase of NASA Twins Study - Los Angeles Times (Link)

    Shifting bioinformatics bottlenecks

    Somalee Datta chaired a session on "Genome: Silos, Hacking, Privacy, and Collaboration" at the Precision Medicine World Conference 2016, where she presented "Shifting Bottlenecks in Bioinformatics". The other presenters were Dr. Philip Tsao, VA Palo Alto, who presented "The VA Million Veterans Program", and William Knox Carey, Intertrust Technologies, who presented "Access vs Privacy: A False Dichotomy".

    The following video is the talk on "Shifting Bioinformatics Bottlenecks", where she presents how the challenges have shifted from managing data analytics for one genome, to multiple genomes, and now to the current challenge of data sharing, all within the 2010-2016 timespan.

    Heart Arrhythmias in Apple Heart Study

    The Apple Heart Study, a collaboration between Apple and Stanford (in partnership with American Well and BioTelemetry), uses data from the Apple Watch to identify irregular heart rhythms, including those from potentially serious heart conditions such as atrial fibrillation.

    Atrial fibrillation (also called AFib or AF) is a quivering or irregular heartbeat (arrhythmia) that can lead to blood clots, stroke, heart failure and other heart-related complications. According to the American Heart Association, if you have AFib, you’re five times more likely to have a stroke than someone who doesn’t. If your heart beats too fast, it may even lead to heart failure. AFib can cause blood to clot in your heart. Blood clots can travel in the bloodstream, eventually causing a blockage (ischemia).

    News media coverage:

    • Dec 11, 2017: Why This FDA Approval Could Be a Huge Deal for Apple (The Motley Fool)

    • Dec 4, 2017: Apple’s First Medical Study Signals Broader Health Ambitions (WSJ)

    • Nov 30, 2017: Apple Heart Study launches to identify irregular heart rhythms (Apple Newsroom)

    • Nov 30, 2017: Apple and Stanford begin Heart Study to detect irregular heart rhythms using Watch (9to5Mac)

    • Nov 30, 2017: Stanford begins irregular heartbeat research using Apple Watch data (Engadget)

    • Nov 30, 2017: Apple Watch will alert heart-study participants if they have an irregular beat (USAToday)

    Stanford team's wearable sensor study explores the limits of possibilities


    Tracking personalized physiology

    250,000 daily measurements from more than 40 individuals

    Eminent scientist and Chair of Genetics Prof. Mike Snyder's team published a PLOS Biology paper that recorded and analyzed over 250,000 daily measurements for up to 43 individuals and found personalized differences in physiological parameters. Mike Snyder compared the information from the sensors to seeing the “check engine” light in your car. “You might hear some knocks” in the engine beforehand, tipping you off to a potential problem, but “it’s nice to see a little light when something’s not right”.
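
    As a rough illustration of the "check engine light" idea (not the paper's actual algorithm or data), here is a minimal sketch that flags wearable heart rate readings that deviate strongly from a person's own baseline; the numbers are simulated.

```python
import numpy as np

def flag_anomalies(heart_rate: np.ndarray, baseline_days: int = 30,
                   z_threshold: float = 3.0) -> np.ndarray:
    """Flag daily resting heart rate values far from the personal baseline."""
    baseline = heart_rate[:baseline_days]
    mu, sigma = baseline.mean(), baseline.std()
    z = (heart_rate - mu) / sigma
    return np.where(np.abs(z) > z_threshold)[0]  # indices of anomalous days

# Toy data: a stable personal baseline around 62 bpm, then a few elevated days.
rng = np.random.default_rng(1)
hr = rng.normal(62, 2, size=60)
hr[45:48] += 15  # hypothetical illness-related elevation
print(flag_anomalies(hr))
```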

    It was a tour de force study that explored the limits of possibilities. Understandably, it created a media frenzy and got mentioned anywhere and everywhere. Here are some of the early pieces:

    • Jan 17, 2017: "Built for the Future. Study Shows Wearable Devices Can Help Detect Illness Early" (Dr. Francis Collins, NIH Director's Blog)
    • Jan 13, 2017: "Wearables show what “healthy” means for you—then tell if you’re not" (Ars Technica)
    • Jan 12, 2017: "Wearables could soon know you're sick before you do" (Wired
    • Jan 12, 2017: "Can wearable sensors tell when you're sick?" (Reuters)
    • Jan 12, 2017: "Smartwatches know you’re getting a cold days before you feel ill" (New Scientist)
    • Jan 12, 2017: "Fitness Bracelets May Warn of Serious Illness" (Scientific American)
    • Jan 12, 2017: "Smartwatches could soon tell you when you’re getting sick" (Tech Crunch)
    • Jan 12, 2017: "Testing wearable sensors as ‘check engine’ light for health" (Washington Times)