Writing a more complex dataset definition
In this section, you will write the following dataset definition.
from ehrql import (
    case,
    codelist_from_csv,
    create_dataset,
    days,
    when,
)
from ehrql.tables.tpp import (
    addresses,
    clinical_events,
    apcs,
    medications,
    patients,
    practice_registrations,
)

index_date = "2023-10-01"

dataset = create_dataset()
dataset.configure_dummy_data(population_size=10)

# codelists
ethnicity_codelist = codelist_from_csv(
    "codelists/opensafely-ethnicity.csv",
    column="Code",
    category_column="Grouping_6",
)
asthma_inhaler_codelist = codelist_from_csv(
    "codelists/opensafely-asthma-inhaler-salbutamol-medication.csv",
    column="code",
)

# population variables
is_female_or_male = patients.sex.is_in(["female", "male"])
was_adult = (patients.age_on(index_date) >= 18) & (
    patients.age_on(index_date) <= 110
)
was_alive = (
    patients.date_of_death.is_after(index_date)
    | patients.date_of_death.is_null()
)
was_registered = practice_registrations.for_patient_on(
    index_date
).exists_for_patient()
dataset.define_population(
    is_female_or_male
    & was_adult
    & was_alive
    & was_registered
)

# demographic variables
dataset.age = patients.age_on(index_date)
dataset.sex = patients.sex
dataset.ethnicity = (
    clinical_events.where(
        clinical_events.ctv3_code.is_in(ethnicity_codelist)
    )
    .sort_by(clinical_events.date)
    .last_for_patient()
    .ctv3_code.to_category(ethnicity_codelist)
)
imd_rounded = addresses.for_patient_on(
    index_date
).imd_rounded
max_imd = 32844
dataset.imd_quintile = case(
    when(imd_rounded < int(max_imd * 1 / 5)).then(1),
    when(imd_rounded < int(max_imd * 2 / 5)).then(2),
    when(imd_rounded < int(max_imd * 3 / 5)).then(3),
    when(imd_rounded < int(max_imd * 4 / 5)).then(4),
    when(imd_rounded <= max_imd).then(5),
)

# exposure variables
dataset.num_asthma_inhaler_medications = medications.where(
    medications.dmd_code.is_in(asthma_inhaler_codelist)
    & medications.date.is_on_or_between(
        index_date - days(30), index_date
    )
).count_for_patient()

# outcome variables
dataset.date_of_first_admission = (
    apcs.where(
        apcs.admission_date.is_after(
            index_date
        )
    )
    .sort_by(apcs.admission_date)
    .first_for_patient()
    .admission_date
)
Assign the index date to a variable
We define the population, and several demographic, exposure, and outcome variables, relative to an index date. Rather than repeatedly type the index date, it's less error-prone to assign it to a variable.
index_date = "2023-10-01"
Create the dataset
We create the dataset with the create_dataset function, which we import now.
from ehrql import create_dataset
We must assign the dataset to a variable called dataset.
dataset = create_dataset()
Define the population
To be included in the population, a patient:
- is female or male
- was an adult on the index date
- was alive on the index date
- was registered with a GP practice on the index date
is vs was
The values of some of these characteristics don't change over time; their values on the date the dataset is generated are the same as their values on the index date. We prefix such characteristics with is. However, the values of some of these characteristics might change over time; their values on the date the dataset is generated might be different to their values on the index date. We prefix such characteristics with was.
These characteristics come from the patients and the practice_registrations tables, which we import now.
from ehrql.tables.tpp import patients, practice_registrations
Is a patient female or male?
We query the patients.sex column to determine whether a patient is female or male. Rows that match the strings "female" or "male" return True. Rows that don't match return False. We assign the result column to the is_female_or_male variable.
is_female_or_male = patients.sex.is_in(["female", "male"])
Strings and lists
We enclose characters in double quotes to create strings, meaning that "female" and "male" are strings. We use lists to group items, such as strings, together. We enclose items in square brackets to create lists, meaning that ["female", "male"] is a list of strings.
Notice that patients.sex is a column of strings but patients.sex.is_in(["female", "male"]) is a column of Booleans, meaning that is_female_or_male is a column of Booleans.
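The step from a column of strings to a column of Booleans can be sketched in plain Python. This is only an analogy, not ehrQL: the sexes list stands in for the patients.sex column, and its values are made up for illustration.

```python
# Plain-Python analogy (not ehrQL): one made-up value per patient.
sexes = ["female", "male", "intersex", "unknown"]

# The list of strings we want to match against.
allowed = ["female", "male"]

# A membership test turns each string into a Boolean.
is_female_or_male = [sex in allowed for sex in sexes]

print(is_female_or_male)  # [True, True, False, False]
```

Each item in the result corresponds to one item in the input, just as each row of the result column corresponds to one patient.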
Was a patient an adult on the index date?
We call patients.age_on to determine whether a patient was an adult (18 or over and 110 or under) on the index date. We assign the result column to the was_adult variable.
was_adult = (patients.age_on(index_date) >= 18) & (
    patients.age_on(index_date) <= 110
)
Notice that patients.age_on(index_date) is a column of integers but patients.age_on(index_date) >= 18 is a column of Booleans.
Notice that we combine the two columns of Booleans with the & operator (AND), meaning that was_adult is a column of Booleans.
Operator precedence
Normally, the & operator has a higher precedence than the >= and <= operators. However, we want the expressions that include the >= and <= operators to be evaluated first, so we enclose them in parentheses. What do you think would happen if these expressions weren't enclosed in parentheses?
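One way to explore this question is with ordinary Python integers. The sketch below is plain Python, not ehrQL, and the age is hypothetical. With ehrQL columns the result of omitting the parentheses may differ (for example, it may raise an error), but the way Python groups the operators is the same.

```python
age = 17  # a hypothetical patient who is not an adult

# With parentheses, each comparison is evaluated first,
# and the two Booleans are then combined with &.
with_parentheses = (age >= 18) & (age <= 110)

# Without parentheses, & binds more tightly than >= and <=,
# so Python reads this as: age >= (18 & age) <= 110.
# Here 18 & 17 is the bitwise AND of two integers, which is 16.
without_parentheses = age >= 18 & age <= 110

print(with_parentheses)     # False
print(without_parentheses)  # True
```

The version without parentheses silently answers a different question, which is why the parentheses matter.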
Was a patient alive on the index date?
We query the patients.date_of_death column to determine whether a patient was alive on the index date.
- If a patient died after the index date, then the patient was alive on the index date.
- If a patient's date of death is null, then the patient was alive on the index date.
We assign the result column to the was_alive variable.
was_alive = (
    patients.date_of_death.is_after(index_date)
    | patients.date_of_death.is_null()
)
Notice that for patients.date_of_death.is_after(index_date), rows where the date of death is after the index date return True; other rows return False. In other words, the result is a column of Booleans.
Notice that for patients.date_of_death.is_null(), rows where the date of death is null return True; other rows return False. In other words, the result is a column of Booleans.
Notice that we combine the two columns of Booleans with the | operator (OR), meaning that was_alive is a column of Booleans.
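The two conditions can be sketched for a single patient in plain Python. This is an analogy rather than ehrQL: the was_alive function below is a hypothetical helper, None stands in for a null date of death, and the dates are made up.

```python
from datetime import date

index_date = date(2023, 10, 1)


def was_alive(date_of_death, index_date):
    """Return True if there is no date of death, or it falls after index_date."""
    return date_of_death is None or date_of_death > index_date


print(was_alive(None, index_date))               # True: no recorded death
print(was_alive(date(2024, 1, 15), index_date))  # True: died after the index date
print(was_alive(date(2023, 10, 1), index_date))  # False: died on the index date
print(was_alive(date(2022, 5, 3), index_date))   # False: died before the index date
```

Notice that, as in the dataset definition, a patient who died on the index date itself does not count as alive, because the comparison is strictly "after".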
Was a patient registered with a GP practice on the index date?
We call practice_registrations.for_patient_on and then query the result table to determine whether a patient was registered with a GP practice on the index date. We assign the result column to the was_registered variable.
was_registered = practice_registrations.for_patient_on(
    index_date
).exists_for_patient()
Notice that practice_registrations is a many rows per patient table, but practice_registrations.for_patient_on(index_date) is a one row per patient table.
Notice that
practice_registrations.for_patient_on(
    index_date
).exists_for_patient()
is a column of Booleans, meaning that was_registered is a column of Booleans.
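The idea of reducing many registration rows to one Boolean per patient can be sketched in plain Python. This is a simplified analogy, not how ehrQL implements for_patient_on: the field names, the treatment of the end date, and the spans helper are all assumptions made for illustration, and any tie-breaking between overlapping registrations is ignored.

```python
from datetime import date

index_date = date(2023, 10, 1)

# Hypothetical registration rows for one patient (many rows per patient).
registrations = [
    {"start_date": date(2015, 3, 1), "end_date": date(2019, 6, 30)},
    {"start_date": date(2019, 7, 1), "end_date": None},  # no end date recorded
]


def spans(registration, on_date):
    """Return True if the registration had started and not ended on on_date."""
    started = registration["start_date"] <= on_date
    end_date = registration["end_date"]
    not_ended = end_date is None or end_date > on_date  # assumed end-date rule
    return started and not_ended


# Reduce the patient's rows to a single Boolean.
was_registered = any(spans(r, index_date) for r in registrations)

print(was_registered)  # True
```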
Define the population
We combine the above variables with the & operator (AND) and pass the result column to dataset.define_population.
dataset.define_population(
    is_female_or_male
    & was_adult
    & was_alive
    & was_registered
)
Notice that because we use variables with descriptive names, we can easily reason about the population.
Assign variables to the dataset
We assign a variable to the dataset by adding a dot and the name of the variable to the dataset, followed by an equals sign and the definition of the variable. In the following example, the name of the variable is my_variable and the definition of the variable is ... (a placeholder).
dataset.my_variable = ...
Codelists
The repository contains two codelists that we use when we assign demographic and exposure variables to the dataset. They are stored in two CSV files. We read each CSV file using the codelist_from_csv function, which we import now.
from ehrql import codelist_from_csv
Codelists
Codelists are beyond the scope of the tutorial. If you would like to know more about them in general, then please read "What are codelists and how are they constructed?" on the Bennett Institute blog. If you would like to know more about them specifically in OpenSAFELY, then please read "Introduction to codelists".
Demographic variables
Sex and age
We assign patients.sex to dataset.sex. Remember that to be included in the population, a patient is female or male.
dataset.sex = patients.sex
We call patients.age_on to determine the age of a patient on the index date and assign the result column to dataset.age. Remember that to be included in the population, a patient was an adult on the index date.
dataset.age = patients.age_on(index_date)
Ethnicity
We use the "Ethnicity" codelist to query the clinical_events table as well as to convert from 266 codes to six groups. The codelist is stored in codelists/opensafely-ethnicity.csv. If we open the CSV file, then we see that the Code column contains the codes and that the Grouping_6 column contains the groups.
First, we use the codelist_from_csv function to read the CSV file.
ethnicity_codelist = codelist_from_csv(
    "codelists/opensafely-ethnicity.csv",
    column="Code",
    category_column="Grouping_6",
)
Notice that because we specified column and category_column, ethnicity_codelist shows pairs of codes and groups.
Dictionaries
ethnicity_codelist is a dictionary, or a data structure that maps from unique keys to values. In this case, the keys are strings that represent codes and the values are strings that represent groups. We separate keys from values with colons, and enclose key-value pairs in curly brackets to create dictionaries, meaning that {"my_key": "my_value"} is a dictionary of one key-value pair, where the key and the value are strings.
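A small plain-Python dictionary shows how a lookup from code to group works. The codes and group labels below are placeholders invented for illustration; they are not taken from the real codelist.

```python
# Hypothetical codes (keys) mapped to hypothetical groups (values).
example_codelist = {
    "code_a": "1",
    "code_b": "1",
    "code_c": "4",
}

# Look up the group for a code.
print(example_codelist["code_c"])  # 4

# Membership tests check the keys, not the values.
print("code_a" in example_codelist)  # True

# .get returns None, rather than raising an error, for a missing key.
print(example_codelist.get("code_z"))  # None
```

Several keys can map to the same value, which is how many codes are collapsed into a few groups.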
Next, we import the clinical_events table.
from ehrql.tables.tpp import clinical_events
Finally, we query the table and assign the result column to dataset.ethnicity.
dataset.ethnicity = (
    clinical_events.where(
        clinical_events.ctv3_code.is_in(ethnicity_codelist)
    )
    .sort_by(clinical_events.date)
    .last_for_patient()
    .ctv3_code.to_category(ethnicity_codelist)
)
Notice that we:
- Filter the table using the codelist
- Sort the result table
- Select the last row for each patient
- Convert the code in that row to its group
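The steps in this query can be mimicked for a single patient in plain Python. This is an analogy, not ehrQL: the events, codes, and groups are hypothetical, and the final lookup stands in for converting the selected code to its group.

```python
from datetime import date

# Hypothetical codelist mapping codes to groups.
codelist = {"code_a": "1", "code_b": "3"}

# Hypothetical clinical events for one patient.
events = [
    {"date": date(2021, 5, 1), "code": "code_b"},
    {"date": date(2019, 2, 10), "code": "code_a"},
    {"date": date(2022, 8, 20), "code": "other"},  # not in the codelist
]

# Filter: keep only events whose code is in the codelist.
matching = [event for event in events if event["code"] in codelist]

# Sort: order the remaining events by date.
ordered = sorted(matching, key=lambda event: event["date"])

# Select: take the last (most recent) event, if there is one.
last = ordered[-1] if ordered else None

# Convert: look up the group for the selected code.
ethnicity = codelist[last["code"]] if last else None

print(ethnicity)  # 3
```

The most recent event overall has a code outside the codelist, so the filter must come before the sort and select for the result to be meaningful.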
Index of Multiple Deprivation
from ehrql import case, when
from ehrql.tables.tpp import addresses
imd_rounded = addresses.for_patient_on(
    index_date
).imd_rounded
max_imd = 32844
dataset.imd_quintile = case(
    when(imd_rounded < int(max_imd * 1 / 5)).then(1),
    when(imd_rounded < int(max_imd * 2 / 5)).then(2),
    when(imd_rounded < int(max_imd * 3 / 5)).then(3),
    when(imd_rounded < int(max_imd * 4 / 5)).then(4),
    when(imd_rounded <= max_imd).then(5),
)
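The thresholds in the case expression can be checked with ordinary Python arithmetic. The sketch below is plain Python rather than ehrQL: imd_quintile is a hypothetical helper that mirrors the order of the when clauses for a single value, and it assumes that a missing value, or a value that matches no clause, yields no quintile.

```python
max_imd = 32844

# The four cut points used by the first four when clauses.
thresholds = [int(max_imd * n / 5) for n in range(1, 5)]
print(thresholds)  # [6568, 13137, 19706, 26275]


def imd_quintile(imd_rounded):
    """Return the quintile (1 to 5) for one value, or None if nothing matches."""
    if imd_rounded is None:
        return None
    # The clauses are tried in order, so the first match wins.
    for quintile, threshold in enumerate(thresholds, start=1):
        if imd_rounded < threshold:
            return quintile
    if imd_rounded <= max_imd:
        return 5
    return None


print(imd_quintile(100))    # 1
print(imd_quintile(6600))   # 2
print(imd_quintile(32800))  # 5
```

Because the clauses are tried in order, each one only needs an upper bound: a value that reaches the second clause has already failed the first.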
Exposure variables
Number of medications
We use the Asthma Inhaler Salbutamol Medication codelist to query the medications table. The codelist is stored in codelists/opensafely-asthma-inhaler-salbutamol-medication.csv. If we open the CSV file, then we see that the code column contains the codes.
First, we use the codelist_from_csv function to read the CSV file.
asthma_inhaler_codelist = codelist_from_csv(
    "codelists/opensafely-asthma-inhaler-salbutamol-medication.csv",
    column="code",
)
Notice that because we specified only column, asthma_inhaler_codelist shows only codes.
Next, we import the medications table, along with days, which we use to express a date range relative to the index date.
from ehrql import days
from ehrql.tables.tpp import medications
Finally, we query the table and assign the result column to dataset.num_asthma_inhaler_medications.
dataset.num_asthma_inhaler_medications = medications.where(
    medications.dmd_code.is_in(asthma_inhaler_codelist)
    & medications.date.is_on_or_between(
        index_date - days(30), index_date
    )
).count_for_patient()
Notice that we:
- Filter the table using the codelist and a date range
- Count the number of rows for each patient
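The date range and the count can be sketched for a single patient in plain Python, using the standard datetime module in place of ehrQL's days. The medication rows and codes are hypothetical, and the sketch assumes that both ends of the range are included.

```python
from datetime import date, timedelta

index_date = date.fromisoformat("2023-10-01")

# The start of the 30-day window that ends on the index date.
window_start = index_date - timedelta(days=30)
print(window_start)  # 2023-09-01

# Hypothetical codelist and medication rows for one patient.
inhaler_codes = {"code_x", "code_y"}
medications = [
    {"date": date(2023, 9, 1), "code": "code_x"},   # on the start boundary
    {"date": date(2023, 9, 20), "code": "code_y"},  # inside the window
    {"date": date(2023, 9, 25), "code": "other"},   # code not in the codelist
    {"date": date(2023, 8, 31), "code": "code_x"},  # before the window
    {"date": date(2023, 10, 2), "code": "code_x"},  # after the window
]

# Filter by codelist and date range, then count the remaining rows.
num_asthma_inhaler_medications = sum(
    1
    for medication in medications
    if medication["code"] in inhaler_codes
    and window_start <= medication["date"] <= index_date
)

print(num_asthma_inhaler_medications)  # 2
```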
Outcome variables
Date of first admission
First, we import the apcs table.
from ehrql.tables.tpp import apcs
Next, we query the table and assign the result column to dataset.date_of_first_admission.
dataset.date_of_first_admission = (
    apcs.where(
        apcs.admission_date.is_after(
            index_date
        )
    )
    .sort_by(apcs.admission_date)
    .first_for_patient()
    .admission_date
)
Notice that we:
- Filter the table using a date range
- Sort the result table
- Select the first row for each patient
First filter, then reduce
The ethnicity, number of medications, and date of first admission variables follow a pattern: first filter, then reduce. The filter steps involve filtering by a codelist, by a date range, or by both. The reduce steps involve sorting and then selecting the first or last row, or counting the number of rows.
By first filtering, then reducing, we transform a many rows per patient table into a one row per patient table.
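The change of shape, from many rows per patient to one row per patient, can be sketched across several patients in plain Python. This is an analogy and not how ehrQL evaluates queries: the admissions data is hypothetical, and grouping by patient is done by hand here.

```python
from collections import defaultdict
from datetime import date

index_date = date(2023, 10, 1)

# Hypothetical admissions: many rows per patient.
admissions = [
    {"patient_id": 1, "admission_date": date(2023, 11, 5)},
    {"patient_id": 1, "admission_date": date(2023, 10, 20)},
    {"patient_id": 1, "admission_date": date(2023, 6, 1)},
    {"patient_id": 2, "admission_date": date(2023, 9, 30)},
]
patient_ids = sorted({row["patient_id"] for row in admissions})

# Filter: keep admissions after the index date.
after_index = [row for row in admissions if row["admission_date"] > index_date]

# Group the remaining admission dates by patient.
dates_by_patient = defaultdict(list)
for row in after_index:
    dates_by_patient[row["patient_id"]].append(row["admission_date"])

# Reduce (sort, then select first): one date, or None, per patient.
date_of_first_admission = {
    patient_id: min(dates_by_patient[patient_id])
    if patient_id in dates_by_patient
    else None
    for patient_id in patient_ids
}

# Reduce (count): one number per patient.
num_admissions = {
    patient_id: len(dates_by_patient.get(patient_id, []))
    for patient_id in patient_ids
}

print(date_of_first_admission[1])  # 2023-10-20
print(date_of_first_admission[2])  # None
print(num_admissions)              # {1: 2, 2: 0}
```

Both results have exactly one entry per patient, even for the patient whose rows were all removed by the filter.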