name: seal
class: seal, center, middle, hide_logo, no-scribble
background-image: url("img/bg-seal.jpg")
background-position: center
background-size: contain
<style type="text/css">
.highlight-last-item > ul > li,
.highlight-last-item > ol > li {
  opacity: 0.5;
}
.highlight-last-item > ul > li:last-of-type,
.highlight-last-item > ol > li:last-of-type {
  opacity: 1;
}
</style>

# London School of Hygiene <br> and Tropical Medicine

## Improving Health Worldwide

???

The first slide hello!

---
name: title
class: left, middle, title, lshtm-blue, no-scribble

# Big data in <br> environmental epidemiology

##

### Arturo de la Cruz Libardi

### Environment and Health Modelling (EHM) Lab<br>London School of Hygiene and Tropical Medicine

### 2024-03-14

---
name: ilo
class: no-scribble

# intended learning outcomes

## by the end of this session (lecture-demonstration), you will be able to:

### 1. Critically define big data

### 2. Describe implications and applications of big data in public health and epidemiology

### 3. Evaluate sources of big health and environmental data

### 4. Think critically about data linkage in the context of exposure assessment

---
name: lecture-outline
class: no-scribble

# lecture outline

- [online slides link](https://adlcruz.github.io/linked_content/pres_bigdataenvepi_2024/bdee_slides.html)

#### 1. **motivation**
- brief history

#### 2. **big data**
- definitions
- trends and implications
- epidemiology
- applications

#### 3. **health and environmental data**
- source examples
- harmonization and modelling
- exposure assessment
- examples

---
name: motivation
class: center, middle, highlight-last-item

.bg-washed-green.b--dark-green.ba.bw2.br3.shadow-5.ph4.mt5[
### big data → epidemiology → public health
]

---
name: brief-history
class: no-scribble

# very brief history

#### integration of subject matter knowledge, (large scale) data, and analysis

#### from weekly burial counts (1662) to maps (1854) and death certificates to 180k cohort (1952)

--

#### enabled by technology, .red[creativity], individual and collective effort

#### ink and paper, punch-cards, telephone...
--

.pull-left[
<img src="img/chol_an.gif" width="60%" height="100%" style="display: block; margin: auto;" />
]

---
name: big-def

# the data line

.pull-left[
## big data
### Variety: many datasets merged
### Volume: very large tables
### Velocity: real-time updates
]

--

.pull-right[
## not big data
### ...
]

--

.class-q[_more V's?_<br><br>_is it just about data?_]

---
name: big-terms

# specialized infrastructure, pipelines and jargon

**Data - oceans, lakes, warehouses, bases**

--

<img src="img/gad_21.png" width="60%" height="100%" style="display: block; margin: auto;" />

.footnote[Gadekallu, Pham, Huynh-The et al. [1]]

---
name: big-trends

# concurrent global trends
- ageing population
- urbanization (demographic change)
- environmentally complex climate change

--

# and technological developments
- powerful unknowable functions (machine-learning)
- smart(ish), cheap(er) and .red[pervasive monitoring]

--

# implications
- different training and emphasis
- widened research opportunities

---
name: big-chall

# big data in epidemiology - challenges and opportunities

### Variety
- measurement error, confounding etc...

### Volume
- wide and tall datasets, methods, coverage, power relevant research questions ...

### Velocity
- highest impact potential, most dependent on infrastructure

--

.class-q[_anything else?_]

---
name: big-app-ex

# applications in research and public health

- Research (genomics, electronic health records)
- Healthcare administration (logistics)
- COVID-19 (emergency response, tracking, data sharing)
- [in references](#ref-source)

---
name: thyroidautism
class: top, left, no-scribble

# using EHRs to question maternal thyroid function and ASD link

- brain anatomy linked to autism is present at birth
- thyroid hormones play key role in brain development

--

#### Q 1: is hypothyroidism associated with inc. risk of autism (430k births) ?

#### Q 2: does risk for medicated mothers differ from Q1 risk ?
#### Q 3: does risk for lab-tested medicated mothers differ from Q1 and Q2 risks ?

#### Q 4: are levels of TSH/fT4 associated with inc. risk of autism (50k births) ?

--

> Results indicate that maternal thyroid conditions are associated with increased ASD risk in progeny, but suggestively not due to direct effects of thyroid hormones. Instead, factors that influence maternal thyroid function could have etiologic roles in ASD through pathways independent of maternal gestational thyroid hormones and thus be unaffected by medication treatment. Factors known to disrupt thyroid function should be examined for possible involvement in ASD etiology.

.footnote[Rotem, Chodick, Shalev et al. [2]]

---
name: opensafely
class: top, left, no-scribble

# OpenSAFELY

[OpenSAFELY: the origin story](https://www.bennett.ox.ac.uk/blog/2021/05/opensafely-the-origin-story/)

.vsmall[.pull-left[
> On 7th May 2020, the OpenSAFELY Collaborative pre-printed the world’s largest study into factors associated with death from Covid-19, based on an analysis running across the full pseudonymised health records of 40% of the English population. This is an unprecedented scale of data.
>
> ... a huge collaboration including the Bennett Institute for Applied Data Science at the University of Oxford, the EHR research group at London School of Hygiene and Tropical Medicine, NHS England, and TPP. Over 42 days during the peak of the first wave of COVID-19 this team worked day and night to produce a fully open-source, privacy-preserving software platform, capable of running open and reproducible analytics across electronic health records, all held securely in situ.
Since then the OpenSAFELY platform has expanded to a full scale analytic environment for secure data analysis, reproducible data curation, federated analysis, and code sharing, with every line of code for the platform, for data management, and for data analysis .red[all shared openly by default, in re-usable forms, automatically, and without exception.]
]]

.pull-right[
<embed src="img/Collaborative_et_al_2020_OpenSAFELY.pdf" width="100%" height="400" type="application/pdf" />
]

.center[
[All LSHTM OpenSAFELY projects](https://jobs.opensafely.org/orgs/lshtm/)
]

---
name: bigh-ex

# health data sources

.pull-left[
- datasets: [Project Tycho](https://www.tycho.pitt.edu/)
- cohorts: [UK Biobank](https://www.ukbiobank.ac.uk/) and [Our Future Health](https://ourfuturehealth.org.uk/)
- platforms: [CPRD](https://cprd.com/) and [OpenSAFELY](https://www.opensafely.org/research/)
- personal sensor data
]

--

.pull-right[
<img src="img/hdsize_Tonne2017.png" width="100%" height="100%" />
]

.footnote[
Tonne, Basagaña, Chaix et al. [3]
]

---
name: bige-ex

# environmental data

- modelled: atmospheric dispersion models, reanalysis, digital twins
- raw: ground monitors, mobile sensors, satellites
- raster vs vector

--

.pull-left[
<img src="img/daily_birch_longer_div.gif" width="50%" height="50%" />
]

.pull-right[
** <img src="img/satsearch.png" width="100%" height="100%" />
]

.footnote[
** [figure from: https://search.earthdata.nasa.gov](https://search.earthdata.nasa.gov)
]

---
name: big-env-concept

# why use both environment and health data

- a part of disease etiology remains unexplained and is likely due to the environment
- big data processes offer great potential for environmental health research
- most of the data generated has a spatial and a temporal reference

# environment + health data .synerg[synergy]

1. research question
1. get health data
1. **get/harmonize/model environmental data**
1. .red[**LINK**]
1. analyse

---
name: datlink-types

# from data to exposure

--

.pull-left[
### Env. data
]

--

.pull-right[
### Linkage
]

--

.pull-left[
(complexity)

(none) → continuous modelled output

(simple) → inverse distance weighted surface from point measurement

(complex) → multi-stage machine-learning models using harmonized features
]

--

.pull-right[
(simple) → matching nearest

(simple/medium) → points on raster (bilinear interpolation) [4]

(medium) → aggregate over small area

(complex) → from a trajectory accounting for microenvironments [5]
]

.footnote[
Vanoli, Mistry, De La Cruz Libardi et al. [4]

Smith, Mitsakou, Kitwiroon et al. [5]
]

---
name: stuk

## A Satellite-Based Spatio-Temporal Machine Learning Model to Reconstruct Daily PM2.5 Concentrations across Great Britain [6]

- Ground observations of PM<sub>2.5</sub>
- A **lot** of environmental data
- Random forest (ML) algorithms

.footnote[Schneider, Vicedo-Cabrera, Sera et al. [6]]

---
name: stuk-fig
class: left, top

## A Satellite-Based Spatio-Temporal Machine Learning Model to Reconstruct Daily PM2.5 Concentrations across Great Britain [6]

.center[
<img src="img/schne_1.png" width="33%" height="100%" /><img src="img/schne_2.png" width="33%" height="100%" /><img src="img/schne_3.png" width="33%" height="100%" />
]

---
name: lhem

## London Hybrid Exposure Model: Improving Human Exposure Estimates to NO2 and PM2.5 in an Urban Setting [5]

> ...the London Hybrid Exposure Model (LHEM), (...) calculates exposure of the Greater London population to outdoor air pollution sources, in-buildings, in-vehicles, and outdoors, using survey data of when and where people spend their time.
- London Travel Demand Survey, trip route simulation

> Exposure to outdoor air pollution was provided by CMAQ-urban, which couples the Weather Research and Forecasting (WRF) meteorological model, the Community Multiscale Air Quality (CMAQ) regional scale model, and the Atmospheric Dispersion Modeling System (ADMS) roads model

- I/O ratio for indoor air levels
- for in-vehicle levels: `\(\frac{dC_{in}}{dt}=\lambda_{in}(C_{out}-C_{in})-n\lambda_{HVAC}\cdot C_{in}-V_{g}(\frac{A^{*}}{V})\cdot C_{in}+\frac{Q}{V}\)`
- constant value for the underground
- **"microenvironments"**

.footnote[Smith, Mitsakou, Kitwiroon et al. [5]]

---
name: lhem-fig

## London Hybrid Exposure Model: Improving Human Exposure Estimates to NO2 and PM2.5 in an Urban Setting [5]

## residential vs modelled exposure

.center[
<img src="img/lhem_corrpmno2.png" width="50%" height="100%" /><img src="img/lhem_travmodes_miscl.png" width="50%" height="100%" />
]

---
name: recap

## we have learned to

### 1. Critically define big data as .red[big data processes]

### 2. Describe implications and applications of big data in public health and epidemiology
- **classical .red[(measurement error, confounding)] challenges**
- **new .red[(comprehensive health data, real-time action)] opportunities**

### 3. Evaluate sources of big health and environmental data
- **health .red[genetic data, EHRs, wearable sensors]**
- **environment .red[reanalyses, satellites, ground sensors]**

### 4. Think critically about data linkage in the context of exposure assessment

---
name: ref-source
class: left,top, no-scribble

# references

[1] T. R. Gadekallu, Q. Pham, T. Huynh-The, et al. _Federated Learning for Big Data: A Survey on Opportunities, Applications, and Future Directions_. En. 2021.

[2] R. S. Rotem, G. Chodick, V. Shalev, et al. "Maternal Thyroid Disorders and Risk of Autism Spectrum Disorder in Progeny". En-US. In: _Epidemiology_ 31.3 (May. 2020), p. 409. ISSN: 1044-3983.
DOI: [10.1097/EDE.0000000000001174](https://doi.org/10.1097%2FEDE.0000000000001174). URL: [https://journals.lww.com/epidem/fulltext/2020/05000/maternal_thyroid_disorders_and_risk_of_autism.15.aspx](https://journals.lww.com/epidem/fulltext/2020/05000/maternal_thyroid_disorders_and_risk_of_autism.15.aspx) (visited on 03/13/2024).

[3] C. Tonne, X. Basagaña, B. Chaix, et al. "New frontiers for environmental epidemiology in a changing world". In: _Environment International_ 104 (Jul. 2017), pp. 155-162. ISSN: 0160-4120. DOI: [10.1016/j.envint.2017.04.003](https://doi.org/10.1016%2Fj.envint.2017.04.003). URL: [https://www.sciencedirect.com/science/article/pii/S0160412017301459](https://www.sciencedirect.com/science/article/pii/S0160412017301459) (visited on 02/15/2024).

[4] J. Vanoli, M. N. Mistry, A. De La Cruz Libardi, et al. "Reconstructing individual-level exposures in cohort analyses of environmental risks: an example with the UK Biobank". En. In: _Journal of Exposure Science & Environmental Epidemiology_ (Jan. 2024). ISSN: 1559-0631, 1559-064X. DOI: [10.1038/s41370-023-00635-w](https://doi.org/10.1038%2Fs41370-023-00635-w). URL: [https://www.nature.com/articles/s41370-023-00635-w](https://www.nature.com/articles/s41370-023-00635-w) (visited on 03/10/2024).

---
name: ref-source
class: left,top, no-scribble

# references

[5] J. D. Smith, C. Mitsakou, N. Kitwiroon, et al. "London Hybrid Exposure Model: Improving Human Exposure Estimates to NO2 and PM2.5 in an Urban Setting". En. In: _Environmental Science & Technology_ 50.21 (Nov. 2016), pp. 11760-11768. ISSN: 0013-936X, 1520-5851. DOI: [10.1021/acs.est.6b01817](https://doi.org/10.1021%2Facs.est.6b01817). URL: [https://pubs.acs.org/doi/10.1021/acs.est.6b01817](https://pubs.acs.org/doi/10.1021/acs.est.6b01817) (visited on 02/02/2023).

[6] R. Schneider, A. Vicedo-Cabrera, F. Sera, et al. "A Satellite-Based Spatio-Temporal Machine Learning Model to Reconstruct Daily PM2.5 Concentrations across Great Britain". En.
In: _Remote Sensing_ 12.22 (Nov. 2020), p. 3803. ISSN: 2072-4292. DOI: [10.3390/rs12223803](https://doi.org/10.3390%2Frs12223803). URL: [https://www.mdpi.com/2072-4292/12/22/3803](https://www.mdpi.com/2072-4292/12/22/3803) (visited on 02/03/2022).

[7] D. Cox, C. Kartsonaki, and R. H. Keogh. "Big data: Some statistical issues". In: _Statistics & Probability Letters_ 136 (May. 2018), pp. 111-115. ISSN: 0167-7152. DOI: [10.1016/j.spl.2018.02.015](https://doi.org/10.1016%2Fj.spl.2018.02.015). URL: [https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5992743/](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5992743/) (visited on 03/11/2024).

[8] E. J. Williamson, A. J. Walker, K. Bhaskaran, et al. "Factors associated with COVID-19-related death using OpenSAFELY". En. In: _Nature_ 584.7821 (Aug. 2020). Publisher: Nature Publishing Group, pp. 430-436. ISSN: 1476-4687. DOI: [10.1038/s41586-020-2521-4](https://doi.org/10.1038%2Fs41586-020-2521-4). URL: [https://www.nature.com/articles/s41586-020-2521-4](https://www.nature.com/articles/s41586-020-2521-4) (visited on 03/12/2024).

---

# references

[9] M. J. Khoury and J. P. A. Ioannidis. "Big data meets public health". En. In: _Science_ 346.6213 (Nov. 2014), pp. 1054-1055. ISSN: 0036-8075, 1095-9203. DOI: [10.1126/science.aaa2709](https://doi.org/10.1126%2Fscience.aaa2709). URL: [https://www.science.org/doi/10.1126/science.aaa2709](https://www.science.org/doi/10.1126/science.aaa2709) (visited on 03/08/2024).

[10] S. J. Mooney and V. Pejaver. "Big Data in Public Health: Terminology, Machine Learning, and Privacy". In: _Annual Review of Public Health_ 39.1 (2018), pp. 95-112. DOI: [10.1146/annurev-publhealth-040617-014208](https://doi.org/10.1146%2Fannurev-publhealth-040617-014208). URL: [https://doi.org/10.1146/annurev-publhealth-040617-014208](https://doi.org/10.1146/annurev-publhealth-040617-014208) (visited on 03/05/2024).
### other info

Presentation made with [xaringan](https://slides.yihui.org/xaringan/#1) in RStudio.

Contact: Arturo.de-la-Cruz-Libardi@lshtm.ac.uk

Slides: https://adlcruz.github.io/linked_content/pres_bigdataenvepi_2024/bdee_slides.html

---
name: links1
class: middle, center

#### point trajectories on a dynamic map and corresponding exposure

<img src="img/trajectory3_exposure.gif" width="70%" height="100%" />

---
name: links2

#### Finnish Meteorological Institute reanalysis

<iframe width="90%" height="50" src="https://silam.fmi.fi/pollen.html?parameter=alder&region=europe" title="FMIPage" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture" style="width:100%; height:100%;" allowfullscreen></iframe>

---
name: links3
class: left, middle

### how is OpenSAFELY testing their new features? with ChatGPT of course.

- OpenSAFELY query (ehrQL) reliability [testing](https://www.bennett.ox.ac.uk/blog/2023/12/how-we-test-ehrql/) using .red[generative artificial intelligence!]

---
name: citizenship
class: highlight-last-item

# suggestions?

--

- DASH 26th March opening event

--

- hundreds of hours of free and open resources

--

- a lot of local and global circumstances to improve

---
name: break
class: center, middle
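
---
name: linkage-sketch
class: left, top

#### linkage sketch: points on raster (bilinear interpolation)

The "points on raster (bilinear interpolation)" linkage listed on the "from data to exposure" slide could look roughly like the sketch below. It is a minimal illustration and not the method of any cited study: the `bilinear` helper, the 3 x 3 `surface`, the participant ids and the coordinates are all hypothetical, and the raster is assumed to be a regular grid whose values sit at cell centres.

```python
import math


def bilinear(grid, x0, y0, dx, dy, x, y):
    """Bilinearly interpolate a raster at point (x, y).

    grid[r][c] is the value at cell centre (x0 + c * dx, y0 + r * dy).
    Hypothetical helper for illustration; needs at least a 2 x 2 grid.
    """
    n_rows, n_cols = len(grid), len(grid[0])
    fx = (x - x0) / dx  # fractional column index
    fy = (y - y0) / dy  # fractional row index
    if not (0 <= fx <= n_cols - 1 and 0 <= fy <= n_rows - 1):
        raise ValueError("point lies outside the raster cell centres")
    # Lower corner of the 2 x 2 window, clamped so edge points stay valid.
    c = min(math.floor(fx), n_cols - 2)
    r = min(math.floor(fy), n_rows - 2)
    tx, ty = fx - c, fy - r
    # Interpolate along x on both rows of the window, then along y.
    low = grid[r][c] * (1 - tx) + grid[r][c + 1] * tx
    high = grid[r + 1][c] * (1 - tx) + grid[r + 1][c + 1] * tx
    return low * (1 - ty) + high * ty


# Hypothetical exposure surface (e.g. PM2.5 in ug/m3) on a 1 km grid.
surface = [
    [8.0, 9.0, 10.0],
    [10.0, 11.0, 12.0],
    [12.0, 13.0, 14.0],
]
# Hypothetical residential locations (x, y) in km from the grid origin.
homes = {"id_01": (0.5, 0.5), "id_02": (2.0, 1.0), "id_03": (1.25, 0.0)}

# LINK step: one exposure value per participant.
linked = {pid: bilinear(surface, 0.0, 0.0, 1.0, 1.0, x, y)
          for pid, (x, y) in homes.items()}
print(linked)
```

In practice this step is usually delegated to a geospatial library, and the simpler "matching nearest" option would replace the weighted average with the value of the closest cell.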