11. Timestamps and Time Series
Time Series
Data analysis often examines how variables develop over time, so datasets commonly contain time information. The data may be recorded at fixed intervals, e.g., "every 15 seconds" or "once a month", or at irregular intervals.
Time information can be in different formats in the data:
- timestamps, e.g., 2019-04-05 11:23
- periods, e.g., the year 2019, January 2019
- a period can also be given as a pair of timestamps (start and end)
- elapsed time, e.g., seconds since the start of an experiment
pandas has many tools and algorithms for processing time series; for example, aggregating values at a desired time interval is straightforward.
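As a minimal sketch of aggregating at a chosen interval, the example below averages made-up readings taken every 15 seconds into one-minute means. The data and variable names are hypothetical, and resample is one pandas method for this kind of aggregation that the text above does not name.

```python
import pandas as pd

# Hypothetical readings taken every 15 seconds (8 values, two full minutes)
idx = pd.date_range("2019-04-05 11:00:00", periods=8, freq="15s")
readings = pd.Series([1, 2, 3, 4, 5, 6, 7, 8], index=idx)

# Aggregate to one-minute intervals by taking the mean of each minute
per_minute = readings.resample("1min").mean()
print(per_minute)
```

Each output row is labeled with the start of its one-minute interval.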
datetime and time data types in basic Python
Python's datetime module has data types for date and time information:
- date (year, month, and day)
- time (time of day: hours, minutes, seconds, and microseconds)
- datetime (date and time)
- timedelta (the difference between two datetime values, stored as days, seconds, and microseconds)
- tzinfo (time zone)
from datetime import datetime
now = datetime.now()
print(now)
sometime = datetime(2019,5,6,11,23,45)
print(sometime)
another_time = datetime(2019,8,31)
print(another_time)
2022-11-23 15:03:10.621445
2019-05-06 11:23:45
2019-08-31 00:00:00
from datetime import date
today_date = date.today()
print(today_date)
2022-11-23
from datetime import timedelta
difference = sometime-now
print(difference)
print(type(difference))
print(difference.days)
print(difference.seconds)
print(now)
print(now + timedelta(10))
print(now + timedelta(1,10))
-1298 days, 20:20:34.378555
<class 'datetime.timedelta'>
-1298
73234
2022-11-23 15:03:10.621445
2022-12-03 15:03:10.621445
2022-11-24 15:03:20.621445
You cannot add datetimes together, but you can add a timedelta to a datetime.
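The output above shows a negative difference as a negative number of days plus a positive number of seconds. The sketch below, using made-up timestamps, illustrates that normalization and the two arithmetic rules; total_seconds gives the signed difference as a single number.

```python
from datetime import datetime, timedelta

start = datetime(2022, 11, 23, 15, 0, 0)
earlier = datetime(2022, 11, 22, 18, 0, 0)  # 21 hours before start

diff = earlier - start
print(diff)                     # -1 day, 3:00:00
print(diff.days, diff.seconds)  # -1 10800
print(diff.total_seconds())     # -75600.0

# Adding two datetimes is not defined and raises TypeError
can_add = True
try:
    start + earlier
except TypeError:
    can_add = False
print(can_add)                  # False

# Adding a timedelta to a datetime works
later = start + timedelta(days=1, seconds=10)
print(later)                    # 2022-11-24 15:00:10
```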
datetime <-> string
A datetime object, and the pandas Timestamp object introduced later, can be converted to a string with the str function or the strftime method:
now = datetime.now()
print(now)
print(str(now))
print(now.strftime('%d.%m.%Y'))
print(now.strftime('%m.%d.%Y'))
print(now.strftime('%c'))
2022-11-23 15:03:10.664621
2022-11-23 15:03:10.664621
23.11.2022
11.23.2022
Wed Nov 23 15:03:10 2022
strftime formatting codes:
- %Y Four-digit year
- %y Two-digit year
- %m Two-digit month [01, 12]
- %d Two-digit day [01, 31]
- %H Hour (24-hour clock) [00, 23]
- %I Hour (12-hour clock) [01, 12]
- %M Two-digit minute [00, 59]
- %S Second [00, 61] (seconds 60, 61 account for leap seconds)
- %w Weekday as integer [0 (Sunday), 6]
- %U Week number of the year [00, 53]; Sunday is considered the first day of the week, and days before the first Sunday of the year are “week 0”
- %W Week number of the year [00, 53]; Monday is considered the first day of the week, and days before the first Monday of the year are “week 0”
- %z UTC time zone offset as +HHMM or -HHMM; empty if time zone naive
- %F Shortcut for %Y-%m-%d (e.g., 2012-04-18)
- %D Shortcut for %m/%d/%y (e.g., 04/18/12)
- %a Abbreviated weekday name
- %A Full weekday name
- %b Abbreviated month name
- %B Full month name
- %c Full date and time (e.g., ‘Tue 01 May 2012 04:20:57 PM’)
- %p Locale equivalent of AM or PM
- %x Locale-appropriate formatted date (e.g., in the United States, May 1, 2012 yields ‘05/01/2012’)
- %X Locale-appropriate time (e.g., ‘04:24:12 PM’)
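A few of the codes listed above can be tried on a fixed datetime, as in this small sketch. The variable name is arbitrary, and the last line prints locale-dependent names, so no output is shown for it.

```python
from datetime import datetime

moment = datetime(2012, 5, 1, 16, 20, 57)  # May 1, 2012 was a Tuesday

print(moment.strftime("%Y-%m-%d %H:%M:%S"))  # 2012-05-01 16:20:57
print(moment.strftime("%d.%m.%y"))           # 01.05.12
print(moment.strftime("%I:%M"))              # 04:20
print(moment.strftime("%w %U %W"))           # 2 18 18
print(moment.strftime("%A, %B %d"))          # weekday and month names depend on the locale
```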
Using the same formatting codes, a string can be parsed into a datetime object with the datetime.strptime method:
string = '22.11.2022'
time = datetime.strptime(string, '%d.%m.%Y')
print(time)
2022-11-22 00:00:00
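The format must match the string exactly. The sketch below, with made-up input strings, parses a date with a time part and shows that a mismatched format raises ValueError.

```python
from datetime import datetime

fmt = "%d.%m.%Y %H:%M"
stamp = datetime.strptime("05.04.2019 11:23", fmt)
print(stamp)                # 2019-04-05 11:23:00

# strftime with the same format reproduces the original string
print(stamp.strftime(fmt))  # 05.04.2019 11:23

# A string that does not match the format cannot be parsed
parsed_ok = True
try:
    datetime.strptime("2019-04-05", "%d.%m.%Y")
except ValueError:
    parsed_ok = False
print(parsed_ok)            # False
```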
To avoid writing formatting codes, you can use the parser.parse function from the dateutil library, which recognizes most common date/time formats:
from dateutil.parser import parse
then = parse('22.11.2022 14:15')
print(then)
at_that_time = parse('2.12.22')
print(at_that_time) # wrong if 2 is the day: by default the month is assumed to come first, as in US usage
at_that_time = parse('2.12.22', dayfirst=True)
print(at_that_time)
2022-11-22 14:15:00
2022-02-12 00:00:00
2022-12-02 00:00:00
pandas.to_datetime
The pandas to_datetime function similarly parses recognizable strings into pandas Timestamp objects.
import pandas as pd
tt = pd.to_datetime('1.4.34', dayfirst=True)
print(tt)
print(type(tt))
2034-04-01 00:00:00
<class 'pandas._libs.tslibs.timestamps.Timestamp'>
When the argument is a list, pd.to_datetime returns a DatetimeIndex object; given a Series, it returns a Series, and so on.
times = ['12:23', '13:23', '23:34']
print(pd.to_datetime(times))
DatetimeIndex(['2022-11-23 12:23:00', '2022-11-23 13:23:00', '2022-11-23 23:34:00'], dtype='datetime64[ns]', freq=None)
timesS = pd.Series(['12:23', '13:23', '23:34'])
print(pd.to_datetime(timesS))
0 2022-11-23 12:23:00
1 2022-11-23 13:23:00
2 2022-11-23 23:34:00
dtype: datetime64[ns]
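The return types can be checked directly. This sketch uses made-up strings that include the date, so the result does not depend on the day the code is run.

```python
import pandas as pd

raw = ["2022-11-23 12:23", "2022-11-23 13:23", "2022-11-23 23:34"]

single = pd.to_datetime(raw[0])             # one string
as_index = pd.to_datetime(raw)              # a list
as_series = pd.to_datetime(pd.Series(raw))  # a Series

print(type(single).__name__)     # Timestamp
print(type(as_index).__name__)   # DatetimeIndex
print(type(as_series).__name__)  # Series
```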
You can also tell to_datetime which format to use:
pd.to_datetime('1.4.34', format='%H.%M.%S')
Timestamp('1900-01-01 01:04:34')
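An explicit format is also a way to remove the day/month ambiguity when parsing dates. A minimal sketch with made-up day.month.year strings:

```python
import pandas as pd

# Hypothetical date strings written as day.month.year
texts = ["1.4.2034", "13.4.2034"]
parsed_dates = pd.to_datetime(texts, format="%d.%m.%Y")

print(parsed_dates[0])  # 2034-04-01 00:00:00
print(parsed_dates[1])  # 2034-04-13 00:00:00
```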
pd.to_timedelta creates a time difference (Timedelta), which can, for example, be added to a timestamp.
df = pd.read_csv('Datasets/time.txt')
print(df)
    broadcast  duration
0  2019-01-01         5
1  2019-01-02         7
2  2019-01-02        12
3  2019-01-03         8
4  2019-01-08         3
5  2019-01-11         4
df['broadcast'] = pd.to_datetime(df['broadcast']) # convert to timestamps
df['delivery'] = df['broadcast'] + pd.to_timedelta(df['duration'], unit='D') # add the number of days ('D') given in duration
print(df)
   broadcast  duration   delivery
0 2019-01-01         5 2019-01-06
1 2019-01-02         7 2019-01-09
2 2019-01-02        12 2019-01-14
3 2019-01-03         8 2019-01-11
4 2019-01-08         3 2019-01-11
5 2019-01-11         4 2019-01-15
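The same calculation can be sketched without the data file by building a small DataFrame inline. The rows are taken from the output above; the elapsed column is a hypothetical addition showing that subtracting two timestamp columns gives time differences back.

```python
import pandas as pd

shipments = pd.DataFrame({
    "broadcast": ["2019-01-01", "2019-01-02", "2019-01-08"],
    "duration": [5, 7, 3],
})

# Convert strings to timestamps, then add `duration` days to each one
shipments["broadcast"] = pd.to_datetime(shipments["broadcast"])
shipments["delivery"] = shipments["broadcast"] + pd.to_timedelta(shipments["duration"], unit="D")

# Subtracting timestamps yields Timedelta values
shipments["elapsed"] = shipments["delivery"] - shipments["broadcast"]
print(shipments)
print(shipments["elapsed"].dt.days.tolist())  # [5, 7, 3]
```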
Timestamps in the index
A common representation for a time series is a Series or DataFrame whose index consists of timestamps (Timestamp objects held in a DatetimeIndex).
from datetime import datetime
import numpy as np
dates = [datetime(2022, 1, 2), datetime(2022, 1, 5), datetime(2022, 1, 7),
         datetime(2022, 4, 8), datetime(2023, 1, 10), datetime(2023, 1, 12)]
ts = pd.Series(np.random.randn(6), index=dates)
print(ts)
2022-01-02 -0.234078
2022-01-05 1.457706
2022-01-07 -0.193775
2022-04-08 -1.387376
2023-01-10 0.833841
2023-01-12 -0.517104
dtype: float64
Such a time series can be indexed and sliced like any other Series or DataFrame.
In addition, any string that can be interpreted as a date or time can be used as a label, including a partial one such as just the year, or the year and month.
When slicing, the given date/time does not need to appear in the index.
print(ts.iloc[2]) # positional access: the third value
print('\n--------\n')
print(ts['1.2.2022']) # parsed month-first: January 2, 2022
print('\n--------\n')
print(ts['2022'])
print('\n--------\n')
print(ts['2022/1'])
print('\n--------\n')
print(ts['2022/2':'2023'])
print('\n--------\n')
print(ts[:'20220110'])
-0.19377497567135307
--------
-0.23407829942228328
--------
2022-01-02 -0.234078
2022-01-05 1.457706
2022-01-07 -0.193775
2022-04-08 -1.387376
dtype: float64
--------
2022-01-02 -0.234078
2022-01-05 1.457706
2022-01-07 -0.193775
dtype: float64
--------
2022-04-08 -1.387376
2023-01-10 0.833841
2023-01-12 -0.517104
dtype: float64
--------
2022-01-02 -0.234078
2022-01-05 1.457706
2022-01-07 -0.193775
dtype: float64
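Because the values above are random, the selections are easier to follow with fixed numbers. The sketch below repeats the same kinds of selection on made-up integer values, using .iloc for positions and .loc for date labels.

```python
import pandas as pd

dates = pd.to_datetime(["2022-01-02", "2022-01-05", "2022-01-07",
                        "2022-04-08", "2023-01-10", "2023-01-12"])
ts = pd.Series([10, 20, 30, 40, 50, 60], index=dates)

print(ts.iloc[2])                         # 30 (third value by position)
print(ts.loc["2022-01-02"])               # 10 (exact date)
print(ts.loc["2022-01"].tolist())         # [10, 20, 30] (all of January 2022)
print(ts.loc["2022-02":"2023"].tolist())  # [40, 50, 60] (endpoints need not be in the index)
print(ts.loc[:"2022-01-10"].tolist())     # [10, 20, 30] (everything up to January 10, 2022)
```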
parse_dates in read_csv
When a dataset is read with read_csv, the desired columns can be parsed as timestamps at that stage with the parse_dates parameter, which accepts:
- boolean: if True, try parsing the index.
- list of ints or names: e.g., [1, 2, 3] tries parsing columns 1, 2, and 3 each as a separate date column.
- list of lists: e.g., [[1, 3]] combines columns 1 and 3 and parses them as a single date column.
- dict: e.g., {'foo': [1, 3]} parses columns 1 and 3 as a date and names the result 'foo'.
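The first two forms can be sketched as follows. The CSV text is a hypothetical inline stand-in for a file on disk, and dayfirst=True is an assumption that the dates are written as day.month.year.

```python
import io
import pandas as pd

# Hypothetical CSV content; dates are written as day.month.year
csv_text = """date,value
1.2.2022,15
2.2.2022,14
13.2.2022,11
"""

# List of names: parse the 'date' column as timestamps
parsed = pd.read_csv(io.StringIO(csv_text), parse_dates=["date"], dayfirst=True)
print(parsed["date"].iloc[0])        # 2022-02-01 00:00:00

# Boolean: use 'date' as the index and parse the index
indexed = pd.read_csv(io.StringIO(csv_text), index_col="date",
                      parse_dates=True, dayfirst=True)
print(type(indexed.index).__name__)  # DatetimeIndex
```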
File:
date,value,value2
1.1.2022,15,-5
2.1.2022,14,-3
3.1.2022,11,1
4.1.2022,11,2
5.1.2022,20,1
6.1.2022,16,-2
7.1.2022,11,-1
8.1.2022,13,0
df = pd.read_csv('Datasets/times.txt')
df