I tend to forget how to write certain blocks of code when I haven't written them in a while. Here's a common machine learning preprocessing task that falls into that category.

Imagine you have some event logs that capture an entity ID (user, store, ad, etc.), a timestamp, an event name, and maybe some other details. The data looks something like this:

userid  timestamp            event
789     2019-07-18 01:06:00  login
123     2019-07-19 08:30:00  login
789     2019-07-20 02:39:00  login
789     2019-07-20 08:15:00  login
456     2019-07-20 10:05:00  login
123     2019-07-20 14:40:00  login
123     2019-07-20 18:05:00  login
456     2019-07-21 21:11:00  login
789     2019-07-22 10:05:00  login
123     2019-07-23 09:18:00  login
789     2019-07-23 17:35:00  login
123     2019-07-25 16:49:00  login
789     2019-07-26 12:13:00  login
123     2019-07-27 19:56:00  login

For the sake of simplicity, let's say we want to build a model predicting whether or not a user will log in tomorrow. Our target is y = bool(logins) on the following day; we'll sketch how to construct it at the end of the post.

Three features we think will be informative are the user's previous logins, whether they logged in yesterday, and the number of days since their last login. We'll call these features lifetime_logins, logins_yesterday, and days_since_last_login.

Using pandas, we aggregate by user and date to get each user's daily count of logins.

# read the sample table above (copied to the clipboard)
df = pd.read_clipboard(parse_dates=['timestamp'])
user_logins = (df.set_index('timestamp')
               .groupby(['userid', pd.Grouper(freq='D')])
               .size()
               .rename('logins'))
# userid  timestamp
# 123     2019-07-19    1
#         2019-07-20    2
#         2019-07-23    1
#         2019-07-25    1
#         2019-07-27    1
# 456     2019-07-20    1
#         2019-07-21    1
# 789     2019-07-18    1
#         2019-07-20    2
#         2019-07-22    1
#         2019-07-23    1
#         2019-07-26    1
# Name: logins, dtype: int64

But we're still missing critical information, and this is where the brain fart happens.

Recall the structure of our logs: they omit records for days when a user had no activity. In order to create our features, we need to fill in those time gaps for each user and then roll that information forward.

The goal of this post is to help me remember how to do this in the future.

Filling Time Gaps

First, we need to put each user on a continuous time scale.

# create a continuous DatetimeIndex at a daily level
dates = pd.date_range(df.timestamp.min().date(),
                      df.timestamp.max().date(),
                      freq='1D')

# get unique set of user ids
users = df['userid'].unique()

# create a MultiIndex that is the product (cross-join) of
# users and DatetimeIndexes
idx = pd.MultiIndex.from_product([users, dates], names=['userid', 'timestamp'])

# and reindex our `user_logins` counts by it
user_logins = user_logins.reindex(idx)

# userid  timestamp
# 789     2019-07-18    1.0
#         2019-07-19    NaN
#         2019-07-20    2.0
#         2019-07-21    NaN
#         2019-07-22    1.0
#         2019-07-23    1.0
#         2019-07-24    NaN
#         2019-07-25    NaN
#         2019-07-26    1.0
#         2019-07-27    NaN

This gives us a continuous daily time series for each user. You can see what this looks like for user 789 above.

An important thing to note is that idx will need to be on the same time scale as the current DatetimeIndex in user_logins. Because we aggregated at a daily level using pd.Grouper(freq='D'), the MultiIndex we are using to reindex should also be at a daily level.
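
For example, if the counts had been aggregated hourly instead (a hypothetical variant just to illustrate the point; hourly_logins, hours, and idx_hourly are names I'm making up here), the continuous index would need to be hourly too:

# hypothetical hourly variant: the grouper and the date_range must agree
hourly_logins = (df.set_index('timestamp')
                 .groupby(['userid', pd.Grouper(freq='H')])
                 .size()
                 .rename('logins'))
hours = pd.date_range(df.timestamp.min().floor('H'),
                      df.timestamp.max().ceil('H'),
                      freq='1H')
idx_hourly = pd.MultiIndex.from_product([users, hours],
                                        names=['userid', 'timestamp'])
hourly_logins = hourly_logins.reindex(idx_hourly)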

Creating Features

Now we're free to create our features. We zero-fill the days on which a user did not log in and convert user_logins to a DataFrame, which lets us add the new feature columns (e.g. logins_yesterday).

user_logins = user_logins.fillna(0).to_frame()
# yesterday's logins: shift within each user so values never bleed across users
user_logins['logins_yesterday'] = user_logins.groupby(level='userid')['logins'].shift(1)
#                    logins  logins_yesterday
# userid timestamp
# 789    2019-07-18     1.0               NaN
#        2019-07-19     0.0               1.0
#        2019-07-20     2.0               0.0
#        2019-07-21     0.0               2.0
#        2019-07-22     1.0               0.0
# 123    2019-07-18     0.0               NaN
#        2019-07-19     1.0               0.0
#        2019-07-20     2.0               1.0
#        2019-07-21     0.0               2.0
#        2019-07-22     0.0               0.0
# 456    2019-07-18     0.0               NaN
#        2019-07-19     0.0               0.0
#        2019-07-20     1.0               0.0
#        2019-07-21     1.0               1.0
#        2019-07-22     0.0               1.0
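
A quick sanity check I like to run (optional, but cheap): the first day for every user should have no "yesterday" value, which confirms the shift never crosses user boundaries.

# every user's first day should have a NaN logins_yesterday
assert user_logins.groupby(level='userid')['logins_yesterday'].head(1).isna().all()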

The lifetime_logins and days_since_last_login features need to be computed in a point-in-time fashion to avoid data leakage when training our model: each row should only reflect what was known at the time. We can do this by rolling information forward with shift.

# lifetime_logins: cumulative logins per user, shifted one day so each
# row only reflects activity through yesterday
user_logins['lifetime_logins'] = (user_logins
                                  .groupby(level='userid')
                                  .logins.cumsum()
                                  .groupby(level='userid').shift(1))

# days_since_last_login: after the cumsum, 'logins' holds each user's
# cumulative count, which stays flat on days without a login; counting
# rows within each block of constant cumulative count gives the days
# elapsed since the last login, again shifted one day
user_logins['days_since_last_login'] = (user_logins
                                        .groupby(level='userid')
                                        .cumsum()
                                        .groupby(['userid', 'logins'])
                                        .cumcount()
                                        .groupby(level='userid').shift(1)
                                        .rename('days_since_last_login'))

#                    logins  logins_yesterday  lifetime_logins  days_since_last_login
# userid timestamp
# 789    2019-07-18     1.0               NaN              NaN                    NaN
#        2019-07-19     0.0               1.0              1.0                    0.0
#        2019-07-20     2.0               0.0              1.0                    1.0
#        2019-07-21     0.0               2.0              3.0                    0.0
#        2019-07-22     1.0               0.0              3.0                    1.0
# 123    2019-07-18     0.0               NaN              NaN                    NaN
#        2019-07-19     1.0               0.0              0.0                    0.0
#        2019-07-20     2.0               1.0              1.0                    0.0
#        2019-07-21     0.0               2.0              3.0                    0.0
#        2019-07-22     0.0               0.0              3.0                    1.0
# 456    2019-07-18     0.0               NaN              NaN                    NaN
#        2019-07-19     0.0               0.0              0.0                    0.0
#        2019-07-20     1.0               0.0              0.0                    1.0
#        2019-07-21     1.0               1.0              1.0                    0.0
#        2019-07-22     0.0               1.0              2.0                    0.0
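
If the cumcount trick looks like magic, here is the intermediate arithmetic for user 789 (through 2019-07-22), reconstructed from the table above:

# logins:                          1     0     2     0     1
# cumulative logins:               1     1     3     3     4
# cumcount within each block
# of constant cumulative count:    0     1     0     1     0
# shifted one day:               NaN     0     1     0     1   -> days_since_last_login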

This can also be extended to create rolling features: something like logins_last_n_days where n = [7, 14, 21].

for n in [7, 14, 21]:
    col = 'logins_last_{}_days'.format(n)
    # rolling n-day sum per user, shifted one day so the window only
    # covers days strictly before the current one
    user_logins[col] = (user_logins
                        .groupby(level='userid')['logins']
                        .transform(lambda d: d.rolling(n).sum().shift(1)))
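
For completeness, here's a minimal sketch of constructing the target from the beginning of the post; tomorrow's login indicator is just each user's next-day login count pulled back one day (the column name y is my own choice):

# target: did the user log in tomorrow? (tomorrow's logins pulled back a day)
user_logins['y'] = (user_logins
                    .groupby(level='userid')['logins']
                    .shift(-1)
                    .gt(0))
# each user's final day has no "tomorrow", so drop those rows before training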

Hopefully you've found this post helpful. I know my future self will.