Task 02: Loading the Data
This is a suggested solution. It is meant to help you out if you struggle with a certain aspect of the exercise. Your own solution may differ widely and can still be perfectly valid.
The data file used for this solution is New York (Central Park) 2020. If you chose another file or placed it in a different location, please adapt the value of DATA_FILE below accordingly.
Do not forget to import pandas before anything else.
We start by setting up a few constants to make things more manageable. We will need the column labels quite often, so it is a good idea to introduce constants for them: you can make use of auto-completion, avoid errors caused by typos more easily, and adapt your labels with less effort, since you only need to change the original definition.
This assumes that the data file is in the working directory when the script is executed. If you have placed the file somewhere else, please insert the correct path there.
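As a rough sketch, the constants described above could look like the following. The file name and all column labels here are hypothetical placeholders, not the names used in the original solution; adapt them to your own data file.

```python
import pandas as pd  # imported first, as recommended above

# Hypothetical file name; point this at your own download.
DATA_FILE = "new_york_central_park_2020.gz"

# Hypothetical column labels, defined once and reused everywhere.
COL_DATE = "date"
COL_TEMPERATURE = "temperature"
COL_DEW_POINT = "dew_point"

# Collected in file order so they can be assigned to the DataFrame in one go.
COLUMN_LABELS = [COL_DATE, COL_TEMPERATURE, COL_DEW_POINT]
```

Referring to COL_TEMPERATURE instead of retyping the string means a typo shows up as a NameError right away rather than as a missing column later on.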
Loading the Data
Now, let’s load the data from the data file.
We can use the
It will handle the fact that the data is compressed in a GZ-archive on its own.
- (1) Set the separator so that any sequence of whitespace is recognized as the delimiter between data elements in a row.
- (2) Set no header, since the data file does not include any. We will add our own in a following step.
- (3) The date in the data file is given as four separate data fields. We want the first four columns (index 0 (year), 1 (month), 2 (day) and 3 (hour)) to be bundled together. The documentation for the parameter states:
list of lists. e.g. If [[1, 3]] -> combine columns 1 and 3 and parse as a single date column.
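The three settings above can be sketched as follows, assuming the function in question is pandas.read_csv. The sample rows are made up, and the file is written to a temporary directory only so that the example runs on its own. Note that this sketch swaps in a different technique for step (3): to my knowledge, the list-of-lists form of parse_dates quoted above is deprecated in recent pandas releases, so the timestamp is assembled after loading with pd.to_datetime instead.

```python
import gzip
import os
import tempfile

import pandas as pd

# Made-up sample rows in the assumed layout: year month day hour value value
SAMPLE = (
    "2020 01 01 00    61    11\n"
    "2020 01 01 01    56     6\n"
    "2020 01 01 02    50     0\n"
)

# Write the sample as a gzip file so the example is self-contained.
tmp_dir = tempfile.mkdtemp()
sample_path = os.path.join(tmp_dir, "sample.gz")
with gzip.open(sample_path, "wt") as handle:
    handle.write(SAMPLE)

raw = pd.read_csv(
    sample_path,   # the ".gz" suffix lets pandas infer the compression
    sep=r"\s+",    # (1) any run of whitespace separates two fields
    header=None,   # (2) the file has no header row
)

# (3) Bundle year, month, day and hour into a single timestamp column.
parts = raw[[0, 1, 2, 3]].rename(
    columns={0: "year", 1: "month", 2: "day", 3: "hour"}
)
timestamps = pd.to_datetime(parts)

# Replace the four date columns with the combined one, placed first.
data = raw.drop(columns=[0, 1, 2, 3])
data.insert(0, "date", timestamps)
```

On older pandas versions, passing parse_dates=[[0, 1, 2, 3]] directly to read_csv should achieve the same bundling in one step, as the quoted documentation describes.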
Setting the Header and Index
First we set the columns.
And afterwards, the index.
Make sure to set inplace=True so the changes persist.
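A minimal sketch of both steps on a small stand-in DataFrame is shown below. The column labels are the same hypothetical placeholders as before, and the values are made up.

```python
import pandas as pd

# Hypothetical column labels; use your own constants here.
COL_DATE = "date"
COL_TEMPERATURE = "temperature"
COL_DEW_POINT = "dew_point"

# Stand-in for the freshly loaded data, which has no column names yet.
data = pd.DataFrame(
    [
        [pd.Timestamp("2020-01-01 00:00"), 61, 11],
        [pd.Timestamp("2020-01-01 01:00"), 56, 6],
    ]
)

# First the columns ...
data.columns = [COL_DATE, COL_TEMPERATURE, COL_DEW_POINT]

# ... and afterwards the index. Without inplace=True, set_index returns
# a new DataFrame and leaves `data` unchanged.
data.set_index(COL_DATE, inplace=True)
```

An equivalent alternative is to reassign the result, as in data = data.set_index(COL_DATE), which avoids inplace altogether.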