Manipulating Time-series#

Time-series are a key element when assessing solar resource data. In this section, we present several examples of how to deal with different data formats and a few common tasks that prepare a time-series for later analysis, such as down- and up-sampling data when we need a temporal resolution different from the one initially available, or interpolating missing values in the data.

The dataset used in the examples of this section is a customized dataset built from solar radiation measurements from the Measurement and Instrumentation Data Center (MIDC) of the U.S. National Renewable Energy Laboratory (NREL). The selected station is located at the University of Nevada - Las Vegas (UNLV), and the data used are 1-minute GHI, DHI and DNI measurements for the year 2020 (Andreas and Stoffel, 2006).

In this section, we cover:

  • 1 Time-series handling

  • 2 Down and up-sampling time-series data

  • 3 Interpolating time-series data

  • 4 Visualizing time-series data


Content by Javier Lopez Lorente

1 Time-series handling#

Datasets often come in different formats depending on the source. These formats sometimes cannot be used straight away to build a time-series and may require additional processing steps. For example:

  • What if date and time are in different columns?

  • What if the year, month, day and time are in separate columns?

  • How to define the timestamp format for a particular dataset?

  • How to deal with timezone issues, e.g. local vs. Coordinated Universal Time (UTC)?

This subsection presents several examples of how to deal with the different formats in which time-series data may come, and shows how to build a time-series, or datetime series as it is known in Python, for later analysis. The processing steps to build the time-series are based on the pandas library.

Let’s get started!

# Importing the needed libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import pvlib

1.1 Build our customized dataset#

In order to build the customized dataset for this section, we make use of the I/O tools of the Python library pvlib to retrieve the data from the UNLV station in the MIDC. Data from other MIDC stations can also be retrieved with this method by adapting the station ID in the query. The different station IDs are available on the MIDC raw data page.

# Dictionary to rename certain variables from the raw data
var_map = {'Global Horiz [W/m^2]': 'ghi',
           'Direct Normal [W/m^2]':'dni',
           'Diffuse Horiz (calc) [W/m^2]':'dhi',
           'Year':'year'}

# Retrieving the raw data from the station 
df_ref = pvlib.iotools.read_midc_raw_data_from_nrel('UNLV',                     # Station id
                                                    pd.Timestamp('20200101'),   # Start date YYYYMMDD
                                                    pd.Timestamp('20201231'),   # End date  YYYYMMDD
                                                    variable_map=var_map)       # Variable Map
# Let's have a look at the first 2 rows of the dataset
df_ref.head(2)
Unnamed: 0 year DOY PST dni ghi Global UVA [W/m^2] Global UVE [W/m^2] Global UVE [Index] Dry Bulb Temp [deg C] ... Avg Wind Direction @ 30ft [deg from N] Peak Wind Speed @ 30ft [m/s] UVSAET Temp [deg C] Logger Temp [deg C] Logger Battery [VDC] Wind Chill Temp [deg C] dhi Zenith Angle [degrees] Azimuth Angle [degrees] Airmass
2020-01-01 00:00:00-08:00 0 2020 1 0 -1.67051 -2.53035 -0.005 0.0 0.002 6.004 ... 17.67 0.098 24.9 22.25 13.28 6.004 0.0 166.494 16.4003 -7999.0
2020-01-01 00:01:00-08:00 0 2020 1 1 -1.64962 -2.53035 -0.007 0.0 0.003 6.210 ... 8.79 1.176 24.9 22.26 13.28 6.210 0.0 166.435 17.3510 -7999.0

2 rows × 21 columns

# Let's have a look at the last 2 rows of the dataset
df_ref.tail(2)
Unnamed: 0 year DOY PST dni ghi Global UVA [W/m^2] Global UVE [W/m^2] Global UVE [Index] Dry Bulb Temp [deg C] ... Avg Wind Direction @ 30ft [deg from N] Peak Wind Speed @ 30ft [m/s] UVSAET Temp [deg C] Logger Temp [deg C] Logger Battery [VDC] Wind Chill Temp [deg C] dhi Zenith Angle [degrees] Azimuth Angle [degrees] Airmass
2020-12-31 23:58:00-08:00 0 2020 366 2358 0.000000 -3.09366 -0.016 0.0 0.003 10.45 ... 4.834 7.546 24.75 18.97 13.34 8.06 0.0 166.558 14.0772 -7999.0
2020-12-31 23:59:00-08:00 0 2020 366 2359 0.473466 -3.09367 -0.015 0.0 0.003 10.46 ... 7.997 6.664 24.81 18.97 13.35 8.24 0.0 166.507 15.0409 -7999.0

2 rows × 21 columns

The dataset has 1-minute resolution with 21 variables of meteorological and other relevant data: ambient temperature, wind speed, wind direction, global horizontal irradiance (GHI), direct normal irradiance (DNI), diffuse horizontal irradiance (DHI), zenith and azimuth angles, and airmass, among others.

For the examples in this section we will use GHI, DNI and DHI measurements and time-related data.

# Slice desired variables out of the 21 variables provided in the raw data. 
df_ref = df_ref[['ghi', 'dni', 'dhi', 'year']]

# Add multiple temporal data to the dataset
df_ref['month'] = df_ref.index.month
df_ref['day'] = df_ref.index.day
df_ref['hour'] = df_ref.index.hour
df_ref['minute'] = df_ref.index.minute
df_ref['date'] = df_ref.index.strftime('%Y-%m-%d')
df_ref['time'] = df_ref.index.strftime('%H:%M:%S')
df_ref['timestamp'] = df_ref.index.strftime('%Y-%m-%d %H:%M:%S%z')

# Epoch format (seconds since 1970-01-01 UTC)
df_ref['epoch'] = df_ref.index.astype('int64')//10**9

# Reset the Index of the DataFrame
df_ref = df_ref.reset_index(drop=True)

# Let's have a look at the resulting columns of the dataset
df_ref.columns
Index(['ghi', 'dni', 'dhi', 'year', 'month', 'day', 'hour', 'minute', 'date',
       'time', 'timestamp', 'epoch'],
      dtype='object')

Let’s visualize the first rows of the customized reference dataframe:

# First 3 rows in the dataframe
df_ref.head(3)
ghi dni dhi year month day hour minute date time timestamp epoch
0 -2.53035 -1.670510 0.0 2020 1 1 0 0 2020-01-01 00:00:00 2020-01-01 00:00:00-0800 1577865600
1 -2.53035 -1.649620 0.0 2020 1 1 0 1 2020-01-01 00:01:00 2020-01-01 00:01:00-0800 1577865660
2 -2.53035 -0.856135 0.0 2020 1 1 0 2 2020-01-01 00:02:00 2020-01-01 00:02:00-0800 1577865720

Now that we have our customized reference dataset of 1-minute irradiance measurements for 2020 and temporal data, we can start building the time-series in different ways.

1.2 Time-series when timestamps are available:#

When timestamps are available, the most straightforward way to build the DataFrame with a datetime index is to convert the column with the timestamp into datetime format and set it as index.

Let’s see how!

# A new dataframe copy of the reference dataset
df = df_ref.copy()
# Convert the timestamp string into datetime format 
df['timestamp'] = pd.to_datetime(df['timestamp'], format='%Y-%m-%d %H:%M:%S%z')
# Set timestamp column as index
df = df.set_index(df['timestamp'])
# See the first 3 rows of the DataFrame with Datetime Index
df.head(3)
ghi dni dhi year month day hour minute date time timestamp epoch
timestamp
2020-01-01 00:00:00-08:00 -2.53035 -1.670510 0.0 2020 1 1 0 0 2020-01-01 00:00:00 2020-01-01 00:00:00-08:00 1577865600
2020-01-01 00:01:00-08:00 -2.53035 -1.649620 0.0 2020 1 1 0 1 2020-01-01 00:01:00 2020-01-01 00:01:00-08:00 1577865660
2020-01-01 00:02:00-08:00 -2.53035 -0.856135 0.0 2020 1 1 0 2 2020-01-01 00:02:00 2020-01-01 00:02:00-08:00 1577865720

The format of the timestamp is specified in the argument ‘format’ as a string and can be adapted to any case. The available options in Python can be checked in this link.
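As an illustration, a day-first (European-style) timestamp can be parsed by adapting the format string. The timestamps below are made up for the sketch, not part of the MIDC data:

```python
import pandas as pd

# Hypothetical day-first timestamps: '01/06/2020' means 1 June 2020 here
raw = pd.Series(['01/06/2020 13:30', '01/06/2020 13:31'])
parsed = pd.to_datetime(raw, format='%d/%m/%Y %H:%M')
print(parsed.dt.month.tolist())  # [6, 6] -- parsed as June, not January
```

Without the explicit format string, an American-style parser could read these same strings as 6 January instead.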

Coordinated Universal Time (UTC) is the timestamp reference provided by many solar radiation data networks and platforms, like the BSRN, PVGIS, etc. However, data can also be reported in local time, as in our example. Timestamps can be converted to other timezones with the function tz_convert, which is useful when dealing with data from different databases and locations worldwide:

# Add UTC timestamp from the local time (Pacific Standard Time)
df['timestamp_utc'] = df.index.tz_convert('UTC')
# See the first 3 rows 
df.head(3)
ghi dni dhi year month day hour minute date time timestamp epoch timestamp_utc
timestamp
2020-01-01 00:00:00-08:00 -2.53035 -1.670510 0.0 2020 1 1 0 0 2020-01-01 00:00:00 2020-01-01 00:00:00-08:00 1577865600 2020-01-01 08:00:00+00:00
2020-01-01 00:01:00-08:00 -2.53035 -1.649620 0.0 2020 1 1 0 1 2020-01-01 00:01:00 2020-01-01 00:01:00-08:00 1577865660 2020-01-01 08:01:00+00:00
2020-01-01 00:02:00-08:00 -2.53035 -0.856135 0.0 2020 1 1 0 2 2020-01-01 00:02:00 2020-01-01 00:02:00-08:00 1577865720 2020-01-01 08:02:00+00:00

The valid timezone strings for other timezones can be found in this link. When the timezone is not provided as part of the timestamp, the function tz_localize can be used to localize the values in a timezone-naive series. tz_localize will be used in the next example.
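As a sketch of what tz_localize can handle, the made-up timestamps below fall around the 2020 US daylight-saving transition, where 02:00-03:00 local time does not exist; the nonexistent argument controls how such gaps are treated:

```python
import pandas as pd

# Hypothetical naive timestamps around the US spring-forward on 2020-03-08:
# 02:30 local time does not exist in 'America/Los_Angeles' that day.
naive = pd.DatetimeIndex(['2020-03-08 01:30', '2020-03-08 02:30'])
localized = naive.tz_localize('America/Los_Angeles', nonexistent='shift_forward')
print(localized)  # 02:30 is shifted forward to 03:00 local time
```

A fixed-offset zone such as 'Etc/GMT+8' (used in this section) has no daylight-saving transitions, so these edge cases never arise there.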

1.3 Time-series when date and time are available:#

When date and time are available in separate columns, a timestamp can be created in a new column, which can then be set as index and localized. Let's have a look at how to do that:

# A new dataframe copy of the reference dataset
df = df_ref.copy()
# New column with the date and time 
df['datetime'] = df['date'] + 'T' + df['time']
# Convert the new column into datetime format 
df['datetime'] = pd.to_datetime(df['datetime'], format='%Y-%m-%dT%H:%M:%S')
# Set the column 'datetime' as index and localize it to its timezone
df = df.set_index(df['datetime']).tz_localize('Etc/GMT+8')
# See the first 3 rows 
df.head(3)
ghi dni dhi year month day hour minute date time timestamp epoch datetime
datetime
2020-01-01 00:00:00-08:00 -2.53035 -1.670510 0.0 2020 1 1 0 0 2020-01-01 00:00:00 2020-01-01 00:00:00-0800 1577865600 2020-01-01 00:00:00
2020-01-01 00:01:00-08:00 -2.53035 -1.649620 0.0 2020 1 1 0 1 2020-01-01 00:01:00 2020-01-01 00:01:00-0800 1577865660 2020-01-01 00:01:00
2020-01-01 00:02:00-08:00 -2.53035 -0.856135 0.0 2020 1 1 0 2 2020-01-01 00:02:00 2020-01-01 00:02:00-0800 1577865720 2020-01-01 00:02:00

1.4 Time-series when the time data is split in multiple columns:#

If time-related data are split across multiple columns, a timestamp can be created in a new column similarly to the previous case. Suppose our dataset had the year, month, day, hour, and minute in separate columns. In that case, we could build our time-series as follows:

# A new dataframe copy of the reference dataset
df = df_ref.copy()
# pandas assembles the timestamps directly from the component columns
df['datetime'] = pd.to_datetime(df[['year', 'month', 'day', 'hour', 'minute']])
# Set the column 'datetime' as index
df = df.set_index(df['datetime']) 
# Localize the datetime series
df.index = df.index.tz_localize('Etc/GMT+8') 
# See the first 3 rows 
df.head(3)
ghi dni dhi year month day hour minute date time timestamp epoch datetime
datetime
2020-01-01 00:00:00-08:00 -2.53035 -1.670510 0.0 2020 1 1 0 0 2020-01-01 00:00:00 2020-01-01 00:00:00-0800 1577865600 2020-01-01 00:00:00
2020-01-01 00:01:00-08:00 -2.53035 -1.649620 0.0 2020 1 1 0 1 2020-01-01 00:01:00 2020-01-01 00:01:00-0800 1577865660 2020-01-01 00:01:00
2020-01-01 00:02:00-08:00 -2.53035 -0.856135 0.0 2020 1 1 0 2 2020-01-01 00:02:00 2020-01-01 00:02:00-0800 1577865720 2020-01-01 00:02:00

1.5 Time-series when the timestamp is given as epoch (Unix Time)#

If the dataset has epoch timestamps, note that epoch time is referenced to UTC; however, it can be converted to any timezone using the function tz_convert. A datetime series can be formed from epoch timestamps as follows:

# A new dataframe copy of the reference dataset
df = df_ref.copy()
# Convert epoch timestamps to datetime format and localize
df['datetime'] = pd.to_datetime(df['epoch'], unit='s', utc=True)
# Set datetime as index and convert UTC time to local time
df = df.set_index(df['datetime']).tz_convert('Etc/GMT+8')
# See the results
df.head(3)
ghi dni dhi year month day hour minute date time timestamp epoch datetime
datetime
2020-01-01 00:00:00-08:00 -2.53035 -1.670510 0.0 2020 1 1 0 0 2020-01-01 00:00:00 2020-01-01 00:00:00-0800 1577865600 2020-01-01 08:00:00+00:00
2020-01-01 00:01:00-08:00 -2.53035 -1.649620 0.0 2020 1 1 0 1 2020-01-01 00:01:00 2020-01-01 00:01:00-0800 1577865660 2020-01-01 08:01:00+00:00
2020-01-01 00:02:00-08:00 -2.53035 -0.856135 0.0 2020 1 1 0 2 2020-01-01 00:02:00 2020-01-01 00:02:00-0800 1577865720 2020-01-01 08:02:00+00:00

We have seen how the same DataFrame with a DatetimeIndex can be obtained in multiple ways, depending on the format of the time data provided.
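As a quick sketch of the inverse conversion (from a tz-aware index back to epoch seconds, mirroring the epoch column built earlier), note that the integer representation is nanoseconds since 1970-01-01 UTC:

```python
import pandas as pd

# Two tz-aware timestamps matching the first rows of our dataset
idx = pd.DatetimeIndex(['2020-01-01 00:00:00-08:00', '2020-01-01 00:01:00-08:00'])

# Nanoseconds since 1970-01-01 UTC, floor-divided down to seconds
epoch = idx.astype('int64') // 10**9
print(epoch.tolist())  # [1577865600, 1577865660]
```

The result matches the epoch column of the reference dataset: midnight local time (UTC-8) on 2020-01-01 corresponds to 08:00 UTC.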


2 Down and up-sampling time-series data#

When assessing solar resource, you may need a different time resolution than your data provides for a particular part of the analysis. In those cases, it is possible to down-sample and up-sample the data to different temporal resolutions using two pandas methods called resample and asfreq. Depending on your needs, you will opt for one or the other. Regardless of the choice, both of them require a DataFrame with a DatetimeIndex, either tz-aware (localized) or tz-naive (not localized).

2.1 Method ‘asfreq’ vs. ‘resample’#

Let’s first create a new DataFrame with only the solar data columns and see the differences between the two methods with examples.

# New DataFrame with 1-minute data and solar data
df_1min = df[['ghi', 'dhi', 'dni']]
# See our new DataFrame
df_1min.head(3)
ghi dhi dni
datetime
2020-01-01 00:00:00-08:00 -2.53035 0.0 -1.670510
2020-01-01 00:01:00-08:00 -2.53035 0.0 -1.649620
2020-01-01 00:02:00-08:00 -2.53035 0.0 -0.856135

Let’s obtain a down-sampled DataFrame with the monthly maxima using both methods and compare the differences. With asfreq, it would be the following:

df_1min.asfreq("1M").max()
ghi   -2.529330
dhi    0.000000
dni   -0.834868
dtype: float64

With resample the result would be:

df_1min.resample("1M").max()
ghi dhi dni
datetime
2020-01-31 00:00:00-08:00 802.383 386.083 972.789
2020-02-29 00:00:00-08:00 956.928 582.715 1024.650
2020-03-31 00:00:00-08:00 1237.910 632.961 1011.700
2020-04-30 00:00:00-08:00 1370.180 705.203 999.360
2020-05-31 00:00:00-08:00 1253.240 590.468 983.210
2020-06-30 00:00:00-08:00 1340.240 943.659 999.233
2020-07-31 00:00:00-08:00 1260.490 1056.520 1015.320
2020-08-31 00:00:00-08:00 1114.410 659.908 1010.130
2020-09-30 00:00:00-08:00 1081.750 670.886 959.591
2020-10-31 00:00:00-08:00 1047.400 895.895 2758.840
2020-11-30 00:00:00-08:00 902.913 411.350 973.702
2020-12-31 00:00:00-08:00 793.258 393.958 985.791

The outputs are clearly not the same, and that is because the methods work differently. asfreq takes the instantaneous value at each timestamp given by the frequency argument. See below:

df_1min.asfreq("1M")
ghi dhi dni
datetime
2020-01-31 00:00:00-08:00 -3.65348 0.0 -1.669830
2020-02-29 00:00:00-08:00 -3.09137 0.0 -1.669820
2020-03-31 00:00:00-08:00 -2.52933 0.0 -0.834917
2020-04-30 00:00:00-08:00 -3.61963 0.0 -1.669340
2020-05-31 00:00:00-08:00 -3.22207 0.0 -1.669590
2020-06-30 00:00:00-08:00 -3.65326 0.0 -0.834868
2020-07-31 00:00:00-08:00 -4.21426 0.0 -1.857120
2020-08-31 00:00:00-08:00 -5.14251 0.0 -1.620990
2020-09-30 00:00:00-08:00 -4.21555 0.0 -1.690710
2020-10-31 00:00:00-08:00 -3.65348 0.0 -1.669830
2020-11-30 00:00:00-08:00 -3.09265 0.0 -1.670510
2020-12-31 00:00:00-08:00 -2.53110 0.0 -1.176670

Calling .max() on that result then returned the maximum of each column.

In contrast, resample returns the maximum value within each period at the specified frequency. The resample method requires a mathematical operation to apply to the resampled data (the maximum value in our case); otherwise, it returns a DatetimeIndexResampler object without showing any data. See below:

df_1min.resample("1M")
<pandas.core.resample.DatetimeIndexResampler object at 0x00000213CAB87C18>

The resample method accepts multiple mathematical and statistical operations. For example: maximum (max), minimum (min), arithmetic mean (mean), standard deviation (std), median (median), sum (sum) and count (count), among others.

Both methods allow multiple frequency options; the available frequency aliases within pandas can be found in this link.
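For instance, several statistics can be computed at once by passing a list of operation names to .agg. The sketch below uses a small made-up series rather than the MIDC data so that it stands alone:

```python
import pandas as pd

# Toy series: four 12-hourly values spanning two days (made-up numbers)
idx = pd.date_range('2020-01-01', periods=4, freq='12h')
s = pd.Series([1.0, 3.0, 2.0, 6.0], index=idx)

# Mean, maximum and minimum per day in a single resampling pass
stats = s.resample('1D').agg(['mean', 'max', 'min'])
print(stats)
```

This yields one column per statistic, which is handy when the same down-sampled index is needed for several aggregations.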

2.2 Down-sampling the data in a time-series#

Down-sampling turns more frequent values into less frequent ones. In the context of solar resource, and considering our 1-minute resolution dataset, down-sampling can be used for:

  • Producing a time-series of hourly/daily average irradiance.

  • Producing a time-series of maximum daily irradiance.

  • Estimating the hourly/daily/monthly sums of irradiation.

  • And many more!

Let’s implement some of these listed examples!

Producing hourly average irradiance from minutely observations#

# Resampling to hourly mean values
df_hourly = df_1min.resample("1H").mean()
# Showing the shape of the new DataFrame
df_hourly.shape # returns Rows, Columns
(8784, 3)

There are 8784 hours in 2020, a leap year, which matches the number of rows. Let’s have a look at the first few rows of the DataFrame:

df_hourly.head(12)
ghi dhi dni
datetime
2020-01-01 00:00:00-08:00 -2.576505 0.000000 -0.764837
2020-01-01 01:00:00-08:00 -2.508561 0.000000 -0.631080
2020-01-01 02:00:00-08:00 -1.943371 0.000000 -0.374704
2020-01-01 03:00:00-08:00 -2.023108 0.000000 -0.379228
2020-01-01 04:00:00-08:00 -1.683542 0.000000 -0.013457
2020-01-01 05:00:00-08:00 -1.891827 0.000000 -0.006960
2020-01-01 06:00:00-08:00 0.116990 0.351740 1.146154
2020-01-01 07:00:00-08:00 42.787289 42.289485 3.388552
2020-01-01 08:00:00-08:00 128.938495 114.490610 50.297509
2020-01-01 09:00:00-08:00 303.623558 163.197578 371.458155
2020-01-01 10:00:00-08:00 327.718433 248.435667 171.612487
2020-01-01 11:00:00-08:00 345.318217 304.172200 80.353128

A time-series with the maximum irradiance can be obtained by replacing ‘mean()’ with ‘max()’.
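For instance, a minimal sketch of a daily-maximum series, shown on a made-up toy series rather than df_1min so that the snippet is self-contained:

```python
import pandas as pd

# Toy 1-min series spanning exactly two days (made-up values, not irradiance)
idx = pd.date_range('2020-06-01', periods=2880, freq='1min')
ghi = pd.Series(range(2880), index=idx, dtype=float)

# Daily maxima, analogous to df_1min.resample("1D").max()
daily_max = ghi.resample('1D').max()
print(daily_max.tolist())  # [1439.0, 2879.0]
```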

Producing time-series of monthly total GHI, DHI, DNI irradiation from minutely observations#

# Resampling to monthly aggregated values
monthly_energy = df_1min[['ghi', 'dhi', 'dni']].resample("1M").sum()*(1/60)
# See the results expressed in kWh/m^2
monthly_energy/1000
ghi dhi dni
datetime
2020-01-31 00:00:00-08:00 100.817440 29.251020 175.754908
2020-02-29 00:00:00-08:00 129.220507 30.514029 206.374110
2020-03-31 00:00:00-08:00 164.224723 59.140048 179.448458
2020-04-30 00:00:00-08:00 200.459859 59.307923 217.186189
2020-05-31 00:00:00-08:00 255.336215 54.922454 294.929842
2020-06-30 00:00:00-08:00 254.262295 62.085767 277.856782
2020-07-31 00:00:00-08:00 257.105784 48.821531 305.900237
2020-08-31 00:00:00-08:00 223.419662 56.213721 251.208590
2020-09-30 00:00:00-08:00 181.248566 58.956766 199.959672
2020-10-31 00:00:00-08:00 153.368877 51.432641 195.650044
2020-11-30 00:00:00-08:00 105.726263 25.099434 190.355946
2020-12-31 00:00:00-08:00 88.972596 25.698295 165.691623

The same can be done in a similar way for other resolutions (e.g. daily or annual irradiation).
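As a sketch of a daily sum, the toy series below holds a constant 600 W/m^2 over two hours of 1-minute samples (made-up values, not MIDC data):

```python
import pandas as pd

# Toy data: 120 one-minute samples of a constant 600 W/m^2
idx = pd.date_range('2020-06-01 10:00', periods=120, freq='1min')
ghi = pd.Series(600.0, index=idx)

# Sum of 1-min W/m^2 samples -> Wh/m^2 (divide by 60), then kWh/m^2
daily_kwh = ghi.resample('1D').sum() * (1/60) / 1000
print(daily_kwh.round(6).tolist())  # [1.2]
```

The sanity check: 600 W/m^2 sustained for 2 hours is 1200 Wh/m^2, i.e. 1.2 kWh/m^2.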

2.3 Up-sampling the data in a time-series#

Up-sampling obtains more frequent values from less frequent ones. For solar data, some applications require up to sub-minute resolution, and up-sampling provides a way to increase the temporal resolution of a dataset to match such needs. For example, turning an hourly time-series into a half-hourly one. Let’s see an example using both resample and asfreq.

Producing half-hourly irradiance series from hourly observations#

Using the DataFrame df_hourly created previously, it can be up-sampled as follows:

# Using 'resample' method:
df_hourly.resample("30Min").mean().head(10)
ghi dhi dni
datetime
2020-01-01 00:00:00-08:00 -2.576505 0.0 -0.764837
2020-01-01 00:30:00-08:00 NaN NaN NaN
2020-01-01 01:00:00-08:00 -2.508561 0.0 -0.631080
2020-01-01 01:30:00-08:00 NaN NaN NaN
2020-01-01 02:00:00-08:00 -1.943371 0.0 -0.374704
2020-01-01 02:30:00-08:00 NaN NaN NaN
2020-01-01 03:00:00-08:00 -2.023108 0.0 -0.379228
2020-01-01 03:30:00-08:00 NaN NaN NaN
2020-01-01 04:00:00-08:00 -1.683542 0.0 -0.013457
2020-01-01 04:30:00-08:00 NaN NaN NaN
# Using 'asfreq' method:
df_hourly.asfreq("30Min").head(10)
ghi dhi dni
datetime
2020-01-01 00:00:00-08:00 -2.576505 0.0 -0.764837
2020-01-01 00:30:00-08:00 NaN NaN NaN
2020-01-01 01:00:00-08:00 -2.508561 0.0 -0.631080
2020-01-01 01:30:00-08:00 NaN NaN NaN
2020-01-01 02:00:00-08:00 -1.943371 0.0 -0.374704
2020-01-01 02:30:00-08:00 NaN NaN NaN
2020-01-01 03:00:00-08:00 -2.023108 0.0 -0.379228
2020-01-01 03:30:00-08:00 NaN NaN NaN
2020-01-01 04:00:00-08:00 -1.683542 0.0 -0.013457
2020-01-01 04:30:00-08:00 NaN NaN NaN

Contrary to the case of down-sampling, both asfreq and resample provide similar results when up-sampling. However, asfreq provides additional functionality to treat the new timestamps without data, i.e. NaN values.

Passing the argument ‘method’ with the string ‘backfill’ or ‘bfill’ uses the next valid observation to fill each NaN value (back filling). If instead the string ‘pad’ or ‘ffill’ is given, the method propagates the last valid observation forward (forward filling).

Let’s see the same example adding this argument:

# Half-hourly up-sample with back filling function
df_hourly.asfreq("30Min", method='bfill').head(10)
ghi dhi dni
datetime
2020-01-01 00:00:00-08:00 -2.576505 0.0 -0.764837
2020-01-01 00:30:00-08:00 -2.508561 0.0 -0.631080
2020-01-01 01:00:00-08:00 -2.508561 0.0 -0.631080
2020-01-01 01:30:00-08:00 -1.943371 0.0 -0.374704
2020-01-01 02:00:00-08:00 -1.943371 0.0 -0.374704
2020-01-01 02:30:00-08:00 -2.023108 0.0 -0.379228
2020-01-01 03:00:00-08:00 -2.023108 0.0 -0.379228
2020-01-01 03:30:00-08:00 -1.683542 0.0 -0.013457
2020-01-01 04:00:00-08:00 -1.683542 0.0 -0.013457
2020-01-01 04:30:00-08:00 -1.891827 0.0 -0.006960

We see that the new half-hourly timestamps now contain the next valid hourly value. With forward filling, they would take the previous valid hourly value instead. For example:

# Half-hourly up-sample with forward-filling function
df_hourly.asfreq("30Min", method='ffill').head(10)
ghi dhi dni
datetime
2020-01-01 00:00:00-08:00 -2.576505 0.0 -0.764837
2020-01-01 00:30:00-08:00 -2.576505 0.0 -0.764837
2020-01-01 01:00:00-08:00 -2.508561 0.0 -0.631080
2020-01-01 01:30:00-08:00 -2.508561 0.0 -0.631080
2020-01-01 02:00:00-08:00 -1.943371 0.0 -0.374704
2020-01-01 02:30:00-08:00 -1.943371 0.0 -0.374704
2020-01-01 03:00:00-08:00 -2.023108 0.0 -0.379228
2020-01-01 03:30:00-08:00 -2.023108 0.0 -0.379228
2020-01-01 04:00:00-08:00 -1.683542 0.0 -0.013457
2020-01-01 04:30:00-08:00 -1.683542 0.0 -0.013457

The forward-filling option provides the same value for the o’clock and half-past timestamps within the same hour. In addition to these two ways to complete the NaN values, the method asfreq can replace the NaN values with a constant. See below:

# Half-hourly up-sample filling the new timestamps with a constant
df_hourly.asfreq("30Min", fill_value=0).head(10)
ghi dhi dni
datetime
2020-01-01 00:00:00-08:00 -2.576505 0.0 -0.764837
2020-01-01 00:30:00-08:00 0.000000 0.0 0.000000
2020-01-01 01:00:00-08:00 -2.508561 0.0 -0.631080
2020-01-01 01:30:00-08:00 0.000000 0.0 0.000000
2020-01-01 02:00:00-08:00 -1.943371 0.0 -0.374704
2020-01-01 02:30:00-08:00 0.000000 0.0 0.000000
2020-01-01 03:00:00-08:00 -2.023108 0.0 -0.379228
2020-01-01 03:30:00-08:00 0.000000 0.0 0.000000
2020-01-01 04:00:00-08:00 -1.683542 0.0 -0.013457
2020-01-01 04:30:00-08:00 0.000000 0.0 0.000000

The use of the methods asfreq or resample will depend on your dataset and the analysis you aim to undertake.
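As a compact recap of the difference between the two methods, here is a sketch on a made-up toy series (not the MIDC data):

```python
import pandas as pd

# Toy series: six 10-minute values over one hour
idx = pd.date_range('2020-01-01', periods=6, freq='10min')
s = pd.Series([0, 10, 20, 30, 40, 50], index=idx, dtype=float)

# 'asfreq' picks the instantaneous value at each new timestamp...
print(s.asfreq('30min').tolist())           # [0.0, 30.0]
# ...while 'resample' aggregates every value that falls in each 30-min bin.
print(s.resample('30min').mean().tolist())  # [10.0, 40.0]
```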


3 Interpolating time-series data#

When up-sampling a time-series, it can happen that back-filling, forward-filling and constant replacement do not suit your analysis or application. An alternative approach is replacing the NaN values with interpolated results. Interpolation in pandas DataFrames with a DatetimeIndex is done with the interpolate method.

The mathematical interpolation method in interpolate is defined with the argument called ‘method’. Pandas permits several interpolation methods, such as ‘linear’, ‘cubic’, ‘quadratic’, ‘spline’, ‘polynomial’ and others. All the interpolation options can be found in the documentation of the interpolate method.

Following the previous example, let’s interpolate the missing values at the half-hourly timestamps using the ‘linear’, ‘cubic’ and ‘polynomial’ methods:

# Up-sample using the 'asfreq' method
df_30min = df_hourly.asfreq("30Min")
# Interpolate missing values (NaN) with linear interpolation
df_linear = df_30min.interpolate(method='linear')
# See the results:
df_linear.head(10)
ghi dhi dni
datetime
2020-01-01 00:00:00-08:00 -2.576505 0.0 -0.764837
2020-01-01 00:30:00-08:00 -2.542533 0.0 -0.697958
2020-01-01 01:00:00-08:00 -2.508561 0.0 -0.631080
2020-01-01 01:30:00-08:00 -2.225966 0.0 -0.502892
2020-01-01 02:00:00-08:00 -1.943371 0.0 -0.374704
2020-01-01 02:30:00-08:00 -1.983240 0.0 -0.376966
2020-01-01 03:00:00-08:00 -2.023108 0.0 -0.379228
2020-01-01 03:30:00-08:00 -1.853325 0.0 -0.196342
2020-01-01 04:00:00-08:00 -1.683542 0.0 -0.013457
2020-01-01 04:30:00-08:00 -1.787684 0.0 -0.010209

Other methods can be applied similarly:

# Interpolate missing values (NaN) with cubic interpolation
df_cubic = df_30min.interpolate(method='cubic')
# See the results:
df_cubic.head(10)
ghi dhi dni
datetime
2020-01-01 00:00:00-08:00 -2.576505 0.000000 -0.764837
2020-01-01 00:30:00-08:00 -2.708668 0.015477 -0.754314
2020-01-01 01:00:00-08:00 -2.508561 0.000000 -0.631080
2020-01-01 01:30:00-08:00 -2.184142 -0.015477 -0.477191
2020-01-01 02:00:00-08:00 -1.943371 0.000000 -0.374704
2020-01-01 02:30:00-08:00 -1.929019 0.046430 -0.371559
2020-01-01 03:00:00-08:00 -2.023108 0.000000 -0.379228
2020-01-01 03:30:00-08:00 -2.027425 -0.170243 -0.284694
2020-01-01 04:00:00-08:00 -1.683542 0.000000 -0.013457
2020-01-01 04:30:00-08:00 -1.097302 0.634542 0.333660

With polynomial interpolation, the degree or order of the polynomial function needs to be defined as an argument:

# Interpolate missing values (NaN) with polynomial interpolation
df_polynomial = df_30min.interpolate(method='polynomial', order=5)
# See the results:
df_polynomial.head(10)
ghi dhi dni
datetime
2020-01-01 00:00:00-08:00 -2.576505 0.000000 -0.764837
2020-01-01 00:30:00-08:00 -1.992626 0.512587 0.655667
2020-01-01 01:00:00-08:00 -2.508561 0.000000 -0.631080
2020-01-01 01:30:00-08:00 -2.485236 -0.214189 -1.068238
2020-01-01 02:00:00-08:00 -1.943371 0.000000 -0.374704
2020-01-01 02:30:00-08:00 -1.619687 0.201377 0.199990
2020-01-01 03:00:00-08:00 -2.023108 0.000000 -0.379228
2020-01-01 03:30:00-08:00 -2.535919 -0.346026 -1.263387
2020-01-01 04:00:00-08:00 -1.683542 0.000000 -0.013457
2020-01-01 04:30:00-08:00 -0.233498 0.762150 2.341007

The interpolation of NaN values when up-sampling time-series data can help overcome the issues of back or forward filling, especially if you aim to up-sample to higher frequencies than in the example shown (e.g. from 1-hour to 15-minute resolution). The mathematical methods available for interpolation within pandas are diverse and go beyond the most common interpolation functions.
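For instance, a sketch of up-sampling a made-up hourly series to 15-minute resolution, using the ‘time’ interpolation method (which weights by the actual timestamp spacing of the index):

```python
import pandas as pd

# Toy hourly series (made-up values, not real irradiance)
idx = pd.date_range('2020-06-01 10:00', periods=3, freq='1h')
hourly = pd.Series([100.0, 200.0, 160.0], index=idx)

# Up-sample to 15-min timestamps, then interpolate the NaN values
quarter = hourly.asfreq('15min').interpolate(method='time')
print(quarter.tolist())
# [100.0, 125.0, 150.0, 175.0, 200.0, 190.0, 180.0, 170.0, 160.0]
```

With an equally spaced index, ‘time’ gives the same result as ‘linear’; it differs when the original timestamps are irregular.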

4 Visualizing time-series data#

It is often useful to visualize the data to gain insights and observe trends. This section shows a few examples of how to visualize time-series data.

4.1 Plotting a time-series for a day of interest#

Below is an example that visualizes a single day of interest. With a DatetimeIndex it is easy to select a particular day, and pandas integrates with the matplotlib.pyplot library to plot straight away.

# Plotting GHI for a given day in the time-series
df_1min.loc['2020-06-01', 'ghi'].plot(label='GHI')
plt.ylabel('Irradiance [W/m$^2$]')
plt.xlabel('Local Time [HH:MM]')
plt.legend(loc='best')
plt.show() # Not needed in Jupyter Notebooks but usually required in other IDEs.

We can visualize the effect of using average (resample) vs. instantaneous (asfreq) measurements when down-sampling our data.

# Plotting GHI for a given day in the time-series
df_1min.loc['2020-06-01', 'ghi'].plot(label='1-min data', alpha=0.4) # Reference data
df_1min.asfreq('30Min').loc['2020-06-01', 'ghi'].plot(label='30-min instant.') # Instantaneous 30-min values
df_1min.resample('30Min').mean().loc['2020-06-01', 'ghi'].plot(label='30-min average') # Average 30-min values
plt.title('Average vs. Actual GHI Measurements') # title of the figure
plt.ylabel('Irradiance [W/m$^2$]') # y-axis label
plt.xlabel('Local Time [HH:MM]') # x-axis label
plt.legend(loc='upper left') # insert legend
plt.show() # Not needed in Jupyter Notebook but usually required in other IDEs.

4.2 Plotting a time-series for a few consecutive days of interest#

Below is an example that visualizes a few consecutive days of interest (e.g. 5 days). By slicing with [‘start date’]:[‘end date’], it is possible to select time ranges easily in a DataFrame with a DatetimeIndex.

# Variables to plot
vars = ['ghi', 'dni', 'dhi'] 
# Create 3 subplots, with shared X and Y axis
fig, axs = plt.subplots(3, sharex=True, sharey=True, figsize=(9,6))
# Add title to the plot
fig.suptitle('Average Hourly Solar Radiation Observations', fontsize=14)

for i in range(3):
    axs[i].plot(df_1min.resample('1H').mean()['2020-06-01':'2020-06-05'][vars[i]], label='Average') # Average hourly
    axs[i].plot(df_1min.resample('1H').max()['2020-06-01':'2020-06-05'][vars[i]], label='Maximum') # Max. hourly
    axs[i].plot(df_1min.resample('1H').min()['2020-06-01':'2020-06-05'][vars[i]], label='Minimum') # Min. hourly
    axs[i].set_title(vars[i].upper()) # Title for each subplot
fig.subplots_adjust(hspace=0.3) # Adjust the white space between the subplots titles
fig.text(0.04, 0.5, 'Irradiance [W/m$^2$]', va='center', rotation='vertical', fontsize=12) # Common Y Axis
fig.text(0.51, 0.04, 'Local Time', ha='center', fontsize=12) # Common X Axis
plt.legend(loc='upper center', ncol=3) # Legend for the last subplot or 'axs[i].legend()' in the loop to a legend to each.
plt.show()

4.3 Plotting a time-series for a few non-consecutive days of interest#

Below is an example that visualizes a few non-consecutive days of interest, which is useful when we would like to observe several days scattered throughout the year in a single plot. To do this, we select each day of interest from the DataFrame and then reset its DatetimeIndex. For example:

# List of days of interest
days = ['2020-01-01', '2020-03-01', '2020-06-01', '2020-09-01']
# Iterate over the days and plot each of them
for day in days: 
    df_day = df_1min.resample('1H').mean().loc[day, 'ghi'].to_frame()  # average hourly GHI for the current day
    df_day = df_day.reset_index(drop=True) # reset its index to numeric (i.e. 0, 1, 2, 3...)
    plt.plot(df_day, label=day) # plot the current day
plt.title('Average Hourly GHI Measurements for Days of Interest') # title of the figure
plt.xticks(np.arange(0, 25, step=3), np.arange(0, 25, step=3)) # set labels positions and names
plt.ylabel('Irradiance [W/m$^2$]') # y-axis label
plt.xlabel('Local Time') # x-axis label
plt.legend(loc='best') # insert legend
plt.show()

4.4 Daily insolation throughout the year#

The hourly/daily/monthly insolation (i.e. the sum of accumulated energy) can also be analysed throughout the year with time-series data. For example, below the daily insolation is visualized:

# Calculate the daily insolation expressed in kWh/m^2 from GHI measurements
daily_energy = (df_1min['ghi'].resample("1D").sum()*(1/60))/1000 # selecting only GHI returns a pandas Series

# Create time-series plot
daily_energy.plot(figsize=(9,6), legend=False) # plot timeseries 
plt.title('Time-series of Daily Insolation')  # add title
plt.ylabel('Energy [kWh/m$^2$]') # add Y-axis label
plt.xlabel('Time') # add X-axis label
plt.show()

Time-series data can also be visualized in other ways, for instance, as a heat map.

# Prepare the data for a heat map of daily insolation
energy_array = pd.DataFrame() # empty DataFrame for the results
for i in range(1,13): # iterate over months
    # select the data in the month and drop the DatetimeIndex
    df_month = daily_energy[daily_energy.index.month==i].reset_index(drop=True) 
    # rename the resulting Series with the number of the month
    df_month = df_month.rename(str(i))
    # Append the results to the DataFrame
    energy_array = pd.concat([energy_array, df_month], axis=1)
# Transpose to have months in y-axis and days in x-axis
energy_array = energy_array.transpose()
# Rename the columns of the days 
energy_array.columns = np.arange(1, 32)

# Plot heat map of daily insolation
months = ['Jan', 'Feb', 'Mar', 'Apr', 'May', 'Jun', # month labels
          'Jul', 'Aug', 'Sep', 'Oct', 'Nov', 'Dec']
plt.figure(figsize=(10, 5))
ax = sns.heatmap(energy_array, cmap='CMRmap', linewidths=0.2, # plot heatmap with Seaborn (sns) library 
                xticklabels=2, annot=False,
                cbar_kws={'label': 'Daily Energy [kWh/m$^2$]'})
ax.set_title('Heat Map of Daily Insolation') # add title
ax.set_yticklabels(months,rotation=0) # add the months as tick-labels for the y-axis
ax.set_xticklabels(ax.get_xticklabels(),rotation=0) # add the days as tick-labels for the x-axis
ax.set_xlabel('Day of the Month')
plt.show()

Section summary#

This section has shown how to build and work with a time-series in Python with multiple examples:

  • We have seen how to prepare a DataFrame with a DatetimeIndex to be used as a time-series when timestamps are given in multiple formats and in local/UTC time.

  • Changes in the temporal resolution of the data can be applied by down- and up-sampling, and the differences between the two available methods (asfreq and resample) have been shown with examples at different sampling frequencies.

  • The interpolation of missing data can be used when up-sampling the resolution of a time-series, and examples with several methods have been shown.

  • Finally, several ideas to visualize data have been presented.

Overall, the possibilities with solar resource time-series are many. The most useful analyses and visualizations will be determined by the application and scope of the study.


References#

Andreas, A.; Stoffel, T. (2006). University of Nevada (UNLV): Las Vegas, Nevada (Data); NREL Report No. DA-5500-56509. http://dx.doi.org/10.5439/1052548