Working with Date Using Pandas

ferhatmetin
Data Science Earth
Published in
5 min readJan 12, 2021

--

When working with time series data, the date information may not be in the format we want, or we may want to produce new variables that can make our model better with the time information we have. Now let’s see how we can do this in Pandas.

In this article, we will use the electricity consumption data (January 2015 — June 2020) from Epiaş Şeffaflık Platformu. You can access the data here.

Let’s start by loading the necessary libraries and data.

import pandas as pd # 1.0.3
import numpy as np
data=pd.read_csv("/kaggle/input/electricity-consumption-20152020/GercekZamanliTuketim-01072015-30062020.csv",
encoding= 'unicode_escape')
df=data.copy() # to copy just in case
df.head()
df.info()

As you can see, there are some problems such as the name of the column indicating the amount of consumption, the number indicating the amount being in the format we use. The time information is also kept in a separate column. We can add this column to the date column for now, as we will obtain it ourselves later. Finally, we can change the date variable that appears as a categorical variable to date format and delete the “Saat” column. Now let’s do what we say.

df.tuketim=df.tuketim.str.replace(".","")
df.tuketim=df.tuketim.str.replace(",",".")
df.tuketim=df.tuketim.astype(float)
df["Tarih"]=df["Tarih"]+" "+df["Saat"]
df.Tarih=df.Tarih.apply(pd.to_datetime)
df=df.drop("Saat",axis=1)
df.head()

1-Extracting Week, Month and Year

Accessing to this information is very easy with the dt accessor used with the Pandas series. We can use dt.week to access what week of the year it is, dt.month to access what month is and dt.year to access what year it is.

df["week"]=df.Tarih.dt.week
df["month"]=df.Tarih.dt.month
df["year"]=df.Tarih.dt.year
df.sample(10)

2- Extracting Semester and Quarter

Similarly, we can get information about what quarter it is. With the help of the code below, we can access the information about the six-month period (semester).

df["quarter"]=df.Tarih.dt.quarter
df['semester'] = np.where(df['quarter'].isin([1,2]), 1, 2)
df.sample(7)

3-Extracting Day,Day of Week and Day Names

We can access this information as we did before. Days of the week are named starting at 0. For instance, it will be 0 for Monday and 6 for Sunday.

df["day"]=df.Tarih.dt.day
df['daysofweek'] = df['Tarih'].dt.dayofweek
df["day_name"]=df.Tarih.dt.day_name()
df.sample(5)

4-Extracting Weekend

If we want to get weekend information, we can use the” day_name “ column. Our value will be 1 if the day is saturday or sunday,if not 0.

df['is_weekend'] = np.where(df['day_name'].isin(['Sunday', 'Saturday']), 1,0)

As a second way, we can use “daysofweek” column.In this case, if the day of the week is 5 or 6, that is, if it is greater than 4, it is the weekend.

df["is_weekend"]=df["daysofweek"].apply(lambda x:1 if x>4 else 0)
df.sample(7)

5-date_range() Function

We can create time data at any time interval with date_range(). For example, let’s create hourly data for the date January 1, 2020.We can use the H symbol to express the frequency as hour. Because we set a 24-hour time, we can enter 24 for the periods parameter.

data = pd.Series(pd.date_range('2020-1-1', periods=24, freq='H'))
data = pd.DataFrame(dict(date=data))
data.head(7) #first seven rows

If we want to create daily data for January, we can write this as follows:

data = pd.Series(pd.date_range('2020-1-1', periods=31, freq='D'))
data = pd.DataFrame(dict(date=data))
data.head(7)

Now,let’s create a secondly data.

data = pd.Series(pd.date_range('2020-1-1 8:01:00', periods=10, freq='S')) 
data = pd.DataFrame(dict(date=data))
data

6-Extracting Time,Minute,Second and Date

data['hour'] = data['date'].dt.hour
data['min'] = data['date'].dt.minute
data['sec'] = data['date'].dt.second
data["date_"]=data["date"].dt.date
data['time'] = data['date'].dt.time
data

7- resample() Function

With this function, we can change the frequency of the time series data and resample it. In order to use this function, we need to assign the date data to the index.

df.head()
df.index=df.Tarih
df.drop(columns=["Tarih"],axis=1,inplace=True)
df.head()

Now let’s look at average annual consumption. For this, we need to set the rule parameter as “A”.

df.resample(rule="A").mean()
#df.groupby(df_new.index.year).mean()

To see the average amount of consumption at the beginning of each month, we can write the following code:

df.resample(rule="MS").mean().head(10) #first 10 rows
Figure-Pandas-Frequency-Alias https://pandas.pydata.org/pandas-docs/version/0.15/timeseries.html

8 — shift() Function

Shift index by desired number of periods with an optional time frequency. Let’s make periods parameter 1 without entering another parameter and see what happens.

df.shift(1).head()

As you can see, the first value was down to the bottom line, and the first line received the NaN value.

df.shift(periods=2,freq="D").head()

In the above example, the indexes were changed with the indexes two days later.

9 — rolling() Function

Performs operations with the previous n data according to the specified time window. It can be used to calculate moving average. A window parameter is specified for this. For example, if this parameter is 7, the previous 7 values are averaged. the 7th day is also included in the average.

df.rolling(window=7).mean().head(10)#df.iloc[0:7].mean()

10 — expanding() Function

This function, unlike rolling (), processes all data before the specified period.

df.expanding(min_periods=7).mean().head(10)

REFERENCES

Linkedn : https://www.linkedin.com/in/ferhatmetin/

--

--