Skip to content

Task 03: Cleaning the Data

This is a suggested solution. It is meant to help you out if you struggle with a certain aspect of the exercise. Your own solution may differ widely and can still be perfectly valid.

Replacing the Placeholders for No value

Usually, Pandas would like to use the Not a number - value that comes with definition of floating point numbers (defined in the IEEE 754 standard). We can get this special value from the math module.

from math import nan

In the weather_data DataFrame, we can replace all the -9999 values as follows:

weather_data.replace(
    {-9999: nan},  # (1)
    inplace=True    # (2)
)

Explanations

  • (1) Replacements are given as a dictionary pairing what to replace (as the key) to the new value
  • (2) If we omit this, the replace method will return a modified copy and not change the original data set.

The wind direction is a special case and we need to take extra care of that. This time we replace multiple values at once for one specific column. Note that even though the replacements both involve the value 0, they do not affect each other.

weather_data[LABEL_DIRECTION].replace(
    {
        0: nan, 
        360: 0
    }, 
    inplace=True
)

Removing empty columns

“Empty” columns are those who do only contain the Not a number value for each cell. Luckily, the dropna(…) - method can therefore do all the heavy lifting for us:

weather_data.dropna(how="all", axis="columns", inplace=True)

Fixing the scaling factors

Some of the columns need to be scaled by a factor of 10. For easy handling we create a list of all column labes for the columns to be changes and then use Python‘s expanded assign operators.

Modify and assign

Many operators in Python naturally support the combination with the assignment. This allows to write something of the form a = a * b in the shorter combined form a *= b. It works for the operators-assignment combinations += -= *= /= %= **= <<= >>= &= ^= |=

scale_columns = [
    LABEL_TEMP, LABEL_DEW, LABEL_PRES, LABEL_SPEED, LABEL_RAIN_1H, LABEL_RAIN_6H
]
weather_data[scale_columns] /= 10

Deal with the missing hours

To ensure our data is continuous (especially for calculating statistics) we want to ensure that we fill up the indexes with the missing hours of the year for which no data exist.

To do so we define a timestamp for the start and end of our measurement period and use the date_range(…)-function from pandas to create a generator for all expected timestamps. Afterwards, we iterate over all the expected timestamps and if such a timestamp is not already present in our data, we add an empty row for that timestamp.

timestamp_start = "2020-01-01 00:00"
timestamp_end = "2020-12-31 23:00"

expected_timestamps = pandas.date_range(start=timestamp_start, end=timestamp_end, freq="H")

for timestamp in expected_timestamps:
    if not (weather_data[LABEL_DATETIME]] == timestamp).any():
        weather_data.append(pandas.Series(name=timestamp))  # (1)

Explanation

  • (1) Remember that rows are made from Series in pandas. The series name becomes the index in the data frame. In this way we add an empty row into the weather data for the missing hours.