Task 03: Cleaning the Data¶

This is a suggested solution. It is meant to help you out if you struggle with a certain aspect of the exercise. Your own solution may differ widely and can still be perfectly valid.

Replacing the Placeholders for No value¶

Usually, Pandas would like to use the Not a number - value that comes with definition of floating point numbers (defined in the IEEE 754 standard). We can get this special value from the math module.

from math import nan

In the weather_data DataFrame, we can replace all the -9999 values as follows:

weather_data.replace(
    {-9999: nan},  # (1)
    inplace=True    # (2)
)

Explanations

(1) Replacements are given as a dictionary pairing what to replace (as the key) to the new value
(2) If we omit this, the replace method will return a modified copy and not change the original data set.

The wind direction is a special case and we need to take extra care of that. This time we replace multiple values at once for one specific column. Note that even though the replacements both involve the value 0, they do not affect each other.

weather_data[LABEL_DIRECTION].replace(
    {
        0: nan, 
        360: 0
    }, 
    inplace=True
)

Adding columns for trace amounts¶

For the new colums, we will introduce constants for the labels

LABEL_RAIN_TRACE_1H = "Rain (Trace) (1h)"
LABEL_RAIN_TRACE_6H = "Rain (Trace) (6h)"

Then we will add the respective columns by filtering for the magic -1 value.

for (original_column, new_column) in [
    (LABEL_RAIN_1H, LABEL_RAIN_TRACE_1H),
    (LABEL_RAIN_6H, LABEL_RAIN_TRACE_6H)
]:
    weather_data[new_column] = weather_data[original_column] == -1
    weather_data[original_column].replace({-1: 0}, inplace=True)

Since we want to do this multiple times we can use a loop, so we only have to write down the processing steps once. Note that we want to access and modify two columns each time, so we bundled them up in a tuple for each step. Thanks to Pythons’ unpacking mechanic we can now have two loop variables, one mapping to the column with our original data and one for the column that we create in the process.

Removing empty columns¶

“Empty” columns are those who do only contain the Not a number value for each cell. Luckily, the dropna(…) - method can therefore do all the heavy lifting for us:

weather_data.dropna(how="all", axis="columns", inplace=True)

Fixing the scaling factors¶

Some of the columns need to be scaled by a factor of 10. For easy handling we create a list of all column labes for the columns to be changes and then use Python‘s expanded assign operators.

Modify and assign

Many operators in Python naturally support the combination with the assignment. This allows to write something of the form a = a * b in the shorter combined form a *= b. It works for the operators-assignment combinations += -= *= /= %= **= <<= >>= &= ^= |=

scale_columns = [
    LABEL_TEMP, LABEL_DEW, LABEL_PRES, LABEL_SPEED, LABEL_RAIN_1H, LABEL_RAIN_6H
]
weather_data[scale_columns] /= 10

Deal with the missing hours¶

To ensure our data is continuous (especially for calculating statistics) we want to ensure that we fill up the indexes with the missing hours of the year for which no data exist.

To do so we define a timestamp for the start and end of our measurement period and use the date_range(…)-function from pandas to create a generator for all expected timestamps. Afterwards, we iterate over all the expected timestamps and if such a timestamp is not already present in our data, we add an empty row for that timestamp.

timestamp_start = "2020-01-01 00:00"
timestamp_end = "2020-12-31 23:00"

expected_timestamps = pandas.date_range(start=timestamp_start, end=timestamp_end, freq="H")

for timestamp in expected_timestamps:
    if not (weather_data.index == timestamp).any():
        new_row = DataFrame(index=[timestamp], columns=weather_data.columns)  # (1)
        weather_data = pandas.concat([weather_data, new_row])  # (2)
weather_data.sort_index(inplace=True) # (3)

Explanation

(1) We construct a data frame that has the same columns as the weather data and the missing timestamp as an index. Note that the index has to be givven as a collection i.e. a structure that can hold multiple elements. We therefore wrapped the index in […] to wrap it into a 1-element list.
(2) We can use concat(…) to glue the two data frames together.
(3) Because the concatenating put the new rows at the end, we sort the indexes to get all data back in chronological order.