Task 02: Loading the Data

This is a suggested solution. It is meant to help you out if you struggle with a certain aspect of the exercise. Your own solution may differ widely and can still be perfectly valid.

The data file used for this solution is New York (Central Park) 2020. If you chose another file or placed it in a different location, please adapt the value of DATA_FILE below accordingly.

Setting up

Do not forget to import pandas before anything else.

We start by setting up a few constants to make things more manageable. We will need the column labels quite often, so it is a good idea to introduce constants for them: you can make use of auto-completion, avoid errors caused by typos, and adapt your labels more easily, since you only need to change the original definition.

DATA_FILE = "725060-14756-2020.gz"

LABEL_DATETIME = "Date & Time"
LABEL_TEMP = "Temperature"
LABEL_DEW = "Dew Point"
LABEL_PRES = "Air Pressure"
LABEL_SPEED = "Wind Speed"
LABEL_DIRECTION = "Wind Direction"
LABEL_SKY = "Sky Condition"
LABEL_RAIN_1H = "Rain (1h)"
LABEL_RAIN_6H = "Rain (6h)"

This assumes that the data file is in the working directory when the script is executed. If you have placed the file somewhere else, please set DATA_FILE to the correct path.
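A minimal sketch of pointing DATA_FILE at a different location, assuming the file sits in a folder called data relative to the working directory. The folder name and the DATA_DIR constant are made up for illustration; pandas.read_csv accepts a pathlib.Path just like a plain string.

```python
from pathlib import Path

# Hypothetical folder name; replace it with wherever you stored the download
DATA_DIR = Path("data")

# Joining with "/" keeps the path valid on Windows, macOS and Linux
DATA_FILE = DATA_DIR / "725060-14756-2020.gz"
```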

Loading the Data

Now, let’s load the data from the data file. We can use the read_csv function. It handles the gzip compression on its own, since it infers the compression from the .gz file extension by default.

weather_data = pandas.read_csv(
        DATA_FILE,
        sep=r"\s+",                     # (1) raw string, so "\s" is not treated as a string escape
        header=None,                    # (2)
        parse_dates=[[0, 1, 2, 3]]      # (3)
)

Explanations

  • (1) Set the separator so that any run of whitespace is recognized as the boundary between data elements in a row.
  • (2) Set no header, since the data file does not include one. We will add our own in a following step.
  • (3) The date in the data file is given as four separate data fields. We want the first four columns (index 0 (year), 1 (month), 2 (day) and 3 (hour)) to be bundled together. The documentation for the parameter states:

    list of lists. e.g. If [[1, 3]] -> combine columns 1 and 3 and parse as a single date column.
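Recent pandas releases (2.2 and later, as far as we know) mark the nested-list form of parse_dates as deprecated. If you see a warning or an error at this step, one possible alternative is to read the columns as plain numbers and combine them afterwards with pandas.to_datetime. The following is only a sketch: the two sample rows are made up and merely mimic the layout of the data file, and the names SAMPLE, raw, parts and sample_data are chosen for illustration.

```python
import io

import pandas

# Two made-up rows in the same whitespace-separated layout (not real measurements)
SAMPLE = (
    "2020 01 01 00   56   11 10204  260   31    8 -9999 -9999\n"
    "2020 01 01 01   50   11 10207  250   26    8     0 -9999\n"
)

# Read everything as plain columns first, without any date handling
raw = pandas.read_csv(io.StringIO(SAMPLE), sep=r"\s+", header=None)

# to_datetime can assemble timestamps from columns named year, month, day, hour
parts = raw[[0, 1, 2, 3]].copy()
parts.columns = ["year", "month", "day", "hour"]
timestamps = pandas.to_datetime(parts)

# Replace the four date columns with a single datetime column at the front
sample_data = raw.drop(columns=[0, 1, 2, 3])
sample_data.insert(0, "Date & Time", timestamps)
```

The result has nine columns, one datetime column followed by the eight measurement columns, so the column labels can be assigned in the same way as for the solution above.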

Setting the Header and Index

First we set the columns.

weather_data.columns = [
        LABEL_DATETIME,
        LABEL_TEMP,
        LABEL_DEW,
        LABEL_PRES,
        LABEL_DIRECTION,
        LABEL_SPEED,
        LABEL_SKY,
        LABEL_RAIN_1H,
        LABEL_RAIN_6H
]

And afterwards, the index.

weather_data.set_index(LABEL_DATETIME, inplace=True)

Make sure to set inplace=True so the changes persist. Without it, set_index returns a new DataFrame and leaves weather_data unchanged.
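If you prefer not to use inplace=True, assigning the return value of set_index works as well. A small sketch with a made-up two-row frame (the names frame and indexed and the values are for illustration only):

```python
import pandas

# Tiny stand-in for weather_data with made-up values
frame = pandas.DataFrame({
    "Date & Time": pandas.to_datetime(["2020-01-01 00:00", "2020-01-01 01:00"]),
    "Temperature": [56, 50],
})

# Without inplace=True the original frame is left untouched ...
indexed = frame.set_index("Date & Time")

# ... so "Date & Time" is still a regular column in frame,
# while the returned DataFrame uses it as its index
still_a_column = "Date & Time" in frame.columns
```

Both styles give the same result; the assignment form simply makes it explicit which variable holds the re-indexed data.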