Filtering Data

Interlude: Getting Organized

Now that our data is ready to be further explored and plotted, it is a good time to take a look at the project's organization. It would be advisable to move the data processing parts into their own module, so that __main__.py only outlines the steps to be taken without getting overloaded with all the details.

To achieve this, we create a new module processing.py, along with a module plotting.py for future use. We can then move the data loading and cleaning code into a function prepare_data(data_file) inside the processing module and call it from the __main__ module.

Here is how our project currently looks:

Current project structure

📁 weather_analysis
├── __main__.py
├── label.py
├── plotting.py
├── processing.py
└── 📁 data
    └── 725053-94728-2020.gz

Code Checkpoint

You can download the current state of the code files here:


Understanding Filter Masks

Filtering data with pandas is a multi-step process.

  1. Create one (or more) filter masks.
  2. (Optional) Modify or combine filter masks to filter for more complex conditions.
  3. Apply filter masks to the data set.

A filter mask is a data frame (or series) in itself, holding boolean values. The cells to be included after filtering are True, the ones to be excluded are False. Filter masks use the same row indexes and column labels as the data frame (or series) from which they were created.
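As a minimal sketch of the idea (using a toy series rather than our weather data), comparing a series to a value yields a boolean mask that carries the same index, and applying that mask keeps only the rows where it is True:

```python
import pandas as pd

# A toy series of temperatures with a string index
temps = pd.Series([3.0, 7.5, 1.2], index=["a", "b", "c"])

# Comparing against a value yields a boolean mask with the same index
mask = temps <= 5

# Applying the mask keeps only the rows where it is True
filtered = temps[mask]
```
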

In our weather data there is a special case for the precipitation columns: if some trace precipitation was measured (e.g. morning dew) that was not enough to reach at least 0.1 mm, it is recorded as -1 instead of 0 (which would mean no precipitation at all).

These values of -1 will of course throw off any statistics, so we need to get them out of the column without losing the encoded information.

Our solution involves three steps:

  1. Find all rows in which the precipitation has a value of -1.
  2. Create a new column which contains the information about the trace precipitation.
  3. Replace all the -1 values by nan to prevent them from being taken into account for statistics.

nan or not nan - That is a question!

Arguments can be made for replacing the -1 values with 0 instead, since that would be close enough. Which option to choose depends on the intended use case and evaluation methods.
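To see why the choice matters, here is a sketch with hypothetical readings: pandas skips NaN values when computing statistics, while zeros are counted, so the two replacements yield different means.

```python
import pandas as pd
from math import nan

# Hypothetical precipitation readings: two real measurements, two trace (-1) entries
precip = pd.Series([0.5, -1.0, 1.5, -1.0])

# Option 1: treat traces as missing -> they are ignored in statistics
as_nan = precip.replace({-1.0: nan})
mean_nan = as_nan.mean()    # (0.5 + 1.5) / 2 = 1.0

# Option 2: treat traces as zero -> they pull the mean down
as_zero = precip.replace({-1.0: 0.0})
mean_zero = as_zero.mean()  # (0.5 + 0 + 1.5 + 0) / 4 = 0.5
```
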

To prepare, we will introduce two new labels for the columns to be created.

label.py
# This module contains all the label names
# We want to use throughout our program

DATE_TIME_COLUMNS = [
    YEAR := "Year",
    MONTH := "Month",
    DAY := "Day",
    HOUR := "Hour"
]

MEASUREMENT_COLUMNS = [
    AIR_TEMP := "Air Temperature [°C]",
    DEW_POINT := "Dew Point [°C]",
    PRESSURE := "Sea Level Pressure [HPa]",
    WIND_DIRECTION := "Wind Direction [°]",
    WIND_SPEED := "Wind Speed [m/s]",
    SKY_CONDITION := "Sky Condition & Coverage Code",
    PRECIPITATION_1H := "Precipitation (1h) [mm]",
    PRECIPITATION_6H := "Precipitation (6h) [mm]"
]

# The column we create to use as an index
INDEX_COLUMN = "Date & Time"

DERIVED_COLUMNS = [
    TRACE_PRECIP_1H := "Trace precipitation (1h)",
    TRACE_PRECIP_6H := "Trace Precipitation (6h)"
]

# The columns that have been scaled by a factor of 10
# in the ISD Lite format
SCALED_COLUMNS = [
    AIR_TEMP, DEW_POINT, PRESSURE, WIND_SPEED,
    PRECIPITATION_1H, PRECIPITATION_6H
]

Creating Filter Masks

Instead of tediously searching with loops and conditionals, we create a filter mask. There are multiple ways to create filter masks. One option is to compare a data frame, or a single column of it, with a value.

processing.py
from pandas import read_csv, to_datetime, DataFrame
from pathlib import Path
from math import nan
import label

def prepare_data(data_file: Path) -> DataFrame:
    """ Load a file containing weather data and make it ready to use.

        The file might be compressed but is expected to hold data
        in the ISD Lite format. The contained data will be labelled and cleaned,
        so it is ready to use. Labels are taken from the `label` module.

        Args:
            data_file: A path to an existing data file that is to be loaded

        Returns:
            A fully labelled and cleaned pandas data frame
    """

    # Loading the data
    # No header in data file, use one or more whitespaces as column separator
    weather_data = read_csv(data_file, header=None, sep=r"\s+")

    # Set the proper column labels
    # IMPORTANT: this only works so nicely
    # because the columns are provided in the correct order!
    weather_data.columns = label.DATE_TIME_COLUMNS + label.MEASUREMENT_COLUMNS

    # Combine the date and time columns into one, correctly parsing date and time
    weather_data[label.INDEX_COLUMN] = to_datetime(
        weather_data[label.DATE_TIME_COLUMNS]
    )

    weather_data = weather_data.set_index(label.INDEX_COLUMN)

    # We don't need the separate date and time columns anymore
    weather_data = weather_data.drop(label.DATE_TIME_COLUMNS, axis="columns")

    # Replace the missing value indicator -9999 by nan
    weather_data = weather_data.replace({-9999: nan})

    # Filter and Extract the trace precipitation
    for original_column, new_column in [
        (label.PRECIPITATION_1H, label.TRACE_PRECIP_1H),
        (label.PRECIPITATION_6H, label.TRACE_PRECIP_6H)
    ]:
        filter_traces = weather_data[original_column] == -1
        weather_data[new_column] = filter_traces

    # Undo the column scaling
    weather_data[label.SCALED_COLUMNS] = weather_data[label.SCALED_COLUMNS] / 10

    return weather_data

We can see the results by looking at one of the newly created columns:

>>> filter_trace_1h = weather_data[label.TRACE_PRECIP_1H]
>>> filter_trace_1h
Date & Time
2020-01-01 01:00:00    False
2020-01-01 02:00:00    False
2020-01-01 03:00:00     True
2020-01-01 04:00:00     True
2020-01-01 05:00:00    False
                       ...
2020-12-31 19:00:00     True
2020-12-31 20:00:00    False
2020-12-31 21:00:00     True
2020-12-31 22:00:00    False
2020-12-31 23:00:00    False
Name: Trace precipitation (1h), Length: 8781, dtype: bool

Any at All?

Boolean columns can be reduced to a single value with the .any() and .all() methods. If, for example, we wanted to know whether any trace precipitation occurred at all, we could try

>>> filter_trace_1h.any()
True
>>> filter_trace_1h.all()
False
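Since True counts as 1 and False as 0, a boolean mask can also be summed to count the matching rows. A sketch with a toy mask standing in for filter_trace_1h:

```python
import pandas as pd

# A toy boolean mask, standing in for filter_trace_1h
mask = pd.Series([False, True, True, False, True])

print(mask.any())  # at least one trace event?
print(mask.all())  # did every hour have traces?
print(mask.sum())  # how many hours had trace precipitation?
```
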

Applying Filter Masks

To select parts of a data frame based on a filter mask, you can pass it as an index just as we did with the column name before.

>>> weather_data[filter_trace_1h]
                     Air Temperature [°C]  ...  Trace Precipitation (6h)
Date & Time                                ...
2020-01-01 03:00:00                   6.1  ...                     False
2020-01-01 04:00:00                   5.6  ...                     False
2020-01-03 08:00:00                   8.9  ...                     False
2020-01-03 13:00:00                   7.2  ...                     False
2020-01-04 01:00:00                   9.4  ...                     False
...                                   ...  ...                       ...
2020-12-25 16:00:00                  12.2  ...                     False
2020-12-28 16:00:00                   7.8  ...                     False
2020-12-31 18:00:00                   6.7  ...                     False
2020-12-31 19:00:00                   6.7  ...                     False
2020-12-31 21:00:00                   5.6  ...                     False

[499 rows x 10 columns]

As a result, we get the data set reduced to the rows that were True in the filter mask. If you are only interested in the filtered results for one specific column, you can combine the column indexing with the filter mask application.
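Here is what such a combination could look like, sketched with a toy data frame (the two indexing steps can be chained in either order):

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 2, 3], "b": [10, 20, 30]})
mask = df["a"] > 1

# Filter the rows first, then pick the column ...
by_rows_first = df[mask]["b"]

# ... or pick the column first, then filter it
by_column_first = df["b"][mask]
```

Note that such chained indexing is fine for reading, but for assignments you should use the .loc property instead to avoid pandas' SettingWithCopy pitfalls.
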

Alternatively, you can use filter masks as a row index in the .loc property, which is very handy for selecting specific columns at the same time.

>>> weather_data.loc[filter_trace_1h, label.PRECIPITATION_1H]
Date & Time
2020-01-01 03:00:00   -1
2020-01-01 04:00:00   -1
2020-01-03 08:00:00   -1
2020-01-03 13:00:00   -1
2020-01-04 01:00:00   -1
                      ...
2020-12-25 16:00:00   -1
2020-12-28 16:00:00   -1
2020-12-31 18:00:00   -1
2020-12-31 19:00:00   -1
2020-12-31 21:00:00   -1
Name: Precipitation (1h) [mm], Length: 499, dtype: float64

We can use this approach to set all precipitation values that previously were -1 to nan.

processing.py
from pandas import read_csv, to_datetime, DataFrame
from pathlib import Path
from math import nan
import label

def prepare_data(data_file: Path) -> DataFrame:
    """ Load a file containing weather data and make it ready to use.

        The file might be compressed but is expected to hold data
        in the ISD Lite format. The contained data will be labelled and cleaned,
        so it is ready to use. Labels are taken from the `label` module.

        Args:
            data_file: A path to an existing data file that is to be loaded

        Returns:
            A fully labelled and cleaned pandas data frame
    """

    # Loading the data
    # No header in data file, use one or more whitespaces as column separator
    weather_data = read_csv(data_file, header=None, sep=r"\s+")

    # Set the proper column labels
    # IMPORTANT: this only works so nicely
    # because the columns are provided in the correct order!
    weather_data.columns = label.DATE_TIME_COLUMNS + label.MEASUREMENT_COLUMNS

    # Combine the date and time columns into one, correctly parsing date and time
    weather_data[label.INDEX_COLUMN] = to_datetime(
        weather_data[label.DATE_TIME_COLUMNS]
    )

    weather_data = weather_data.set_index(label.INDEX_COLUMN)

    # We don't need the separate date and time columns anymore
    weather_data = weather_data.drop(label.DATE_TIME_COLUMNS, axis="columns")

    # Replace the missing value indicator -9999 by nan
    weather_data = weather_data.replace({-9999: nan})

    # Filter and Extract the trace precipitation
    for original_column, new_column in [
        (label.PRECIPITATION_1H, label.TRACE_PRECIP_1H),
        (label.PRECIPITATION_6H, label.TRACE_PRECIP_6H)
    ]:
        filter_traces = weather_data[original_column] == -1
        weather_data[new_column] = filter_traces
        weather_data.loc[filter_traces, original_column] = nan

    # Undo the column scaling
    weather_data[label.SCALED_COLUMNS] = weather_data[label.SCALED_COLUMNS] / 10

    return weather_data

Filtering for Incomplete Data

As a second option, pandas also offers some utilities that create filter masks, most notably the isna() and notna() methods. These can be used to check for the presence or absence of NaN values, respectively.

Example

Let’s find all rows in which the air temperature is missing by creating the filter mask

temperature_unknown = weather_data[label.AIR_TEMP].isna()

If we apply this filter mask, the output will be

>>> weather_data[temperature_unknown]
                     Air Temperature [°C]  ...  Precipitation (6h) [mm]
Date & Time                                ...
2020-01-05 06:00:00                   NaN  ...                      NaN
2020-01-22 03:00:00                   NaN  ...                      NaN
2020-01-22 20:00:00                   NaN  ...                      NaN
2020-01-30 05:00:00                   NaN  ...                      NaN
2020-02-08 01:00:00                   NaN  ...                      NaN
2020-02-11 12:00:00                   NaN  ...                      NaN
2020-02-12 10:00:00                   NaN  ...                      NaN
2020-02-13 08:00:00                   NaN  ...                      NaN
2020-03-03 07:00:00                   NaN  ...                      NaN
2020-04-28 15:00:00                   NaN  ...                      NaN
2020-05-12 13:00:00                   NaN  ...                      NaN
2020-05-21 22:00:00                   NaN  ...                      NaN
2020-05-23 02:00:00                   NaN  ...                      NaN
2020-05-23 14:00:00                   NaN  ...                      NaN
2020-06-05 05:00:00                   NaN  ...                      NaN
2020-06-06 12:00:00                   NaN  ...                      NaN
2020-07-07 08:00:00                   NaN  ...                      NaN
2020-07-08 14:00:00                   NaN  ...                      NaN
2020-07-30 10:00:00                   NaN  ...                      NaN
2020-08-28 15:00:00                   NaN  ...                      NaN
2020-10-13 02:00:00                   NaN  ...                      NaN
2020-10-18 21:00:00                   NaN  ...                      NaN
2020-10-27 23:00:00                   NaN  ...                      NaN
2020-11-04 03:00:00                   NaN  ...                      NaN
2020-11-13 19:00:00                   NaN  ...                      NaN

[25 rows x 8 columns]
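The same masks are useful for counting gaps: summing an isna() mask counts the missing values, per column if applied to the whole frame. A sketch with a toy frame:

```python
import pandas as pd
from math import nan

df = pd.DataFrame({
    "temp": [6.1, nan, 5.6, nan],
    "wind": [3.0, 4.5, nan, 2.1],
})

# Count missing values in a single column ...
missing_temp = df["temp"].isna().sum()

# ... or per column across the whole frame
missing_per_column = df.isna().sum()
```
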

Modifying and combining Filter Masks

Filter masks can be modified by inverting them or combining them with other filter masks. For all these operations we can use Python's element-wise logic operators.

Logic Operators

Besides the purely boolean logic operators, which only work on boolean values, Python also has element-wise variants that can be used with multi-element data types and apply the logical operation to each element of the arguments individually.

Operation     Boolean    Element-wise
Negation      not X      ~X
Conjunction   X and Y    X & Y
Disjunction   X or Y     X | Y
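Two pitfalls are worth noting when using the element-wise operators with pandas: the boolean operators (and, or, not) raise a ValueError when applied to a whole series, and the element-wise operators bind more tightly than comparisons, so each comparison needs its own parentheses.

```python
import pandas as pd

temps = pd.Series([3.0, 7.5, 1.2])
winds = pd.Series([8.0, 2.0, 9.0])

# Boolean operators fail on a whole series:
# temps <= 5 and winds > 6   ->  ValueError: truth value is ambiguous

# Element-wise operators bind more tightly than comparisons,
# so each comparison must be parenthesized:
both = (temps <= 5) & (winds > 6)
```
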

Example

Let us assume we have a cold_hours filter mask indicating all hours in which the air temperature was at 5°C or below.

>>> cold_hours = weather_data[label.AIR_TEMP] <= 5

We can also invert it to filter for all rows which do not fulfill our condition of being cold (i.e. have an air temperature > 5.0°C).

weather_data[~cold_hours]

If we create a second filter mask windy_hours for all rows where the wind speed exceeds 6.0 m/s, we can combine this with our cold_hours to figure out which periods in our measurement were uncomfortable, because they were either windy or cold. We could further determine which moments in the measurement were very uncomfortable because they were windy and cold at the same time.

windy_hours = weather_data[label.WIND_SPEED] > 6.0
weather_data[cold_hours | windy_hours]  # Cold, windy or both
weather_data[cold_hours & windy_hours]  # Cold and windy at the same time

Below are the outputs of each of the filter applications.

>>> weather_data[~cold_hours]
                     Air Temperature [°C]  ...  Precipitation (6h) [mm]
Date & Time                                ...
2020-01-01 01:00:00                   6.7  ...                      NaN
2020-01-01 02:00:00                   6.1  ...                      NaN
2020-01-01 03:00:00                   6.1  ...                      NaN
2020-01-01 04:00:00                   5.6  ...                      NaN
2020-01-02 16:00:00                   6.1  ...                      NaN
...                                   ...  ...                      ...
2020-12-31 17:00:00                   6.7  ...                      NaN
2020-12-31 18:00:00                   6.7  ...                      9.9
2020-12-31 19:00:00                   6.7  ...                      NaN
2020-12-31 20:00:00                   6.1  ...                      NaN
2020-12-31 21:00:00                   5.6  ...                      NaN

[7135 rows x 8 columns]
>>> weather_data[windy_hours | cold_hours]
                     Air Temperature [°C]  ...  Precipitation (6h) [mm]
Date & Time                                ...
2020-01-01 02:00:00                   6.1  ...                      NaN
2020-01-01 05:00:00                   5.0  ...                      NaN
2020-01-01 06:00:00                   4.4  ...                      0.5
2020-01-01 07:00:00                   4.0  ...                      NaN
2020-01-01 08:00:00                   3.9  ...                      NaN
...                                   ...  ...                      ...
2020-12-30 22:00:00                   4.4  ...                      NaN
2020-12-30 23:00:00                   4.4  ...                      NaN
2020-12-31 00:00:00                   5.0  ...                      NaN
2020-12-31 22:00:00                   5.0  ...                      NaN
2020-12-31 23:00:00                   5.0  ...                      NaN

[1751 rows x 8 columns]
>>> weather_data[windy_hours & cold_hours]
                     Air Temperature [°C]  ...  Precipitation (6h) [mm]
Date & Time                                ...
2020-01-01 08:00:00                   3.9  ...                      NaN
2020-01-05 08:00:00                   3.9  ...                      NaN
2020-01-05 10:00:00                   3.9  ...                      NaN
2020-01-05 11:00:00                   3.3  ...                      NaN
2020-01-05 14:00:00                   3.3  ...                      NaN
...                                   ...  ...                      ...
2020-12-26 07:00:00                  -2.8  ...                      NaN
2020-12-26 13:00:00                  -4.4  ...                      NaN
2020-12-29 15:00:00                   2.2  ...                      NaN
2020-12-29 18:00:00                   2.8  ...                      NaN
2020-12-29 22:00:00                   1.7  ...                      NaN

[92 rows x 8 columns]

Note that the result of the conjunction is a subset of the result of the disjunction: every row that fulfills the cold-and-windy condition automatically fulfills the cold-or-windy condition as well.
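This subset relationship can be verified directly: a row that is selected by the conjunction but not by the disjunction is impossible. A sketch with toy masks:

```python
import pandas as pd

cold = pd.Series([True, True, False, False])
windy = pd.Series([True, False, True, False])

both = cold & windy
either = cold | windy

# "Selected by the conjunction but not by the disjunction" never happens
is_subset = not (both & ~either).any()
```
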