Efficient way to apply multiple filters to pandas DataFrame or Series

I have a scenario where a user wants to apply several filters to a pandas DataFrame or Series object. Essentially, I want to efficiently chain together a set of filtering (comparison) operations that the user specifies at run time. The filters should be additive (that is, each one applied should narrow the results). I’m currently using …
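One possible way to approach this kind of run-time filtering is to build a boolean mask per filter and AND the masks together. The sketch below is only an illustration: the `apply_filters` helper, the `(column, operator, value)` tuple format, and the sample column names are all assumptions, not part of the original question.

```python
import operator
from functools import reduce

import pandas as pd


def apply_filters(df, filters):
    """Return the rows of df that satisfy every (column, op, value) filter.

    `filters` is a hypothetical run-time spec: a list of tuples holding a
    column name, a comparison function from the `operator` module, and a value.
    """
    if not filters:
        return df
    # One boolean Series per filter.
    masks = [op(df[col], value) for col, op, value in filters]
    # AND all masks together so each filter narrows the result.
    combined = reduce(operator.and_, masks)
    return df[combined]


# Made-up sample data for illustration.
df = pd.DataFrame({"col1": [0, 1, 2, 3], "col2": [10, 11, 12, 13]})
filters = [("col1", operator.ge, 1), ("col2", operator.le, 12)]
result = apply_filters(df, filters)
```

Combining the masks first and indexing once avoids creating an intermediate DataFrame for every filter, which is one reason this pattern is often suggested.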

pandas loc vs. iloc vs. at vs. iat?

I recently began branching out from my safe place (R) into Python and am a bit confused by cell localization/selection in pandas. I’ve read the documentation, but I’m struggling to understand the practical implications of the various localization/selection options. Is there a reason why I should ever use .loc or .iloc over at, and …
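As a rough illustration of how the four accessors differ, the sketch below uses a small made-up DataFrame (the column names and index labels are assumptions). In short, `.loc` and `.at` select by label, `.iloc` and `.iat` select by integer position, and `.at`/`.iat` only ever return a single scalar.

```python
import pandas as pd

# Made-up data with string labels so label and position are easy to tell apart.
df = pd.DataFrame(
    {"x": [10, 20, 30], "y": [1.5, 2.5, 3.5]},
    index=["a", "b", "c"],
)

row_by_label = df.loc["b"]            # row whose index label is "b"
row_by_position = df.iloc[1]          # second row, regardless of its label
scalar_by_label = df.at["b", "x"]     # single cell, addressed by labels
scalar_by_position = df.iat[1, 0]     # single cell, addressed by positions

# .loc and .iloc also accept lists and slices; .at and .iat do not.
subset = df.loc[["a", "c"], ["y"]]
```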

Find column whose name contains a specific string

I have a dataframe with column names, and I want to find the one that contains a certain string, but does not exactly match it. I’m searching for ‘spike’ in column names like ‘spike-2’, ‘hey spike’, ‘spiked-in’ (the ‘spike’ part is always continuous). I want the column name to be returned as a string or …
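A minimal sketch of one way to do this, assuming an otherwise empty DataFrame whose column names are taken from the examples in the question plus one hypothetical non-matching name (`other`):

```python
import pandas as pd

# Empty DataFrame; only the column names matter here.
df = pd.DataFrame(columns=["spike-2", "hey spike", "spiked-in", "other"])

# Plain substring test on each column name.
spike_cols = [col for col in df.columns if "spike" in col]

# Equivalent selection using the vectorized string accessor on the Index.
spike_cols_alt = df.columns[df.columns.str.contains("spike", regex=False)].tolist()
```

Both variants give a list of matching names; taking `spike_cols[0]` would return a single name as a string, assuming at least one column matches.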

pandas: multiple conditions while indexing data frame – unexpected behavior

I am filtering rows in a dataframe by the values in two columns. For some reason, the OR operator behaves the way I would expect the AND operator to behave, and vice versa. My test code: import pandas as pd df = pd.DataFrame({'a': range(5), 'b': range(5)}) # let's insert some -1 values df['a'][1] = -1 df['b'][1] = …
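The sketch below does not reconstruct the truncated test code; it uses separate made-up data to show how `&` and `|` combine two row conditions. A frequent source of confusion is that "drop rows where a or b is -1" translates to keeping rows where a is not -1 AND b is not -1.

```python
import pandas as pd

# Made-up data: -1 marks an invalid value.
df = pd.DataFrame({"a": [0, -1, 2, 3, -1], "b": [0, 1, -1, 3, -1]})

# Keep rows where BOTH columns are valid. Each comparison needs its own
# parentheses because & and | bind more tightly than != in Python.
both_valid = df[(df["a"] != -1) & (df["b"] != -1)]

# Keep rows where AT LEAST ONE column is valid.
either_valid = df[(df["a"] != -1) | (df["b"] != -1)]
```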

How can I one hot encode in Python?

I have a machine learning classification problem with 80% categorical variables. Must I use one-hot encoding if I want to use some classifier for the classification? Can I pass the data to a classifier without the encoding? I am trying to do the following for feature selection: I read the train file: num_rows_to_read = …
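For reference, one common way to one-hot encode a categorical column in pandas is `pd.get_dummies`. The sketch below uses made-up data (the `color` and `size` columns are assumptions) and does not address whether a given classifier requires the encoding.

```python
import pandas as pd

# Made-up data: one categorical column and one numeric column.
df = pd.DataFrame({"color": ["red", "green", "red"], "size": [1, 2, 3]})

# Replace "color" with one indicator column per category;
# columns not listed in `columns` are passed through unchanged.
encoded = pd.get_dummies(df, columns=["color"])
```

The indicator columns are named after the original column and the category, for example `color_red`.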

Multiple aggregations of the same column using pandas GroupBy.agg()

Is there a pandas built-in way to apply two different aggregating functions f1, f2 to the same column df["returns"], without having to call agg() multiple times? Example dataframe: import pandas as pd import datetime as dt import numpy as np np.random.seed(0) df = pd.DataFrame({ "date" : [dt.date(2012, x, 1) for x in range(1, 11)], "returns" …
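As a hedged illustration, `agg()` accepts a list of functions, so several aggregations of one column can be computed in a single call. The data below is made up (the `group` column and the values are assumptions), and `"sum"` and `"mean"` stand in for the question's f1 and f2.

```python
import pandas as pd

# Made-up data for illustration.
df = pd.DataFrame(
    {"group": ["a", "a", "b", "b"], "returns": [1.0, 3.0, 2.0, 6.0]}
)

# Pass a list of aggregations: one output column per function.
summary = df.groupby("group")["returns"].agg(["sum", "mean"])

# Named aggregation (pandas 0.25 and later) lets you choose the output names.
named = df.groupby("group").agg(
    total=("returns", "sum"),
    average=("returns", "mean"),
)
```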