How do I select multiple columns ‘a’ and ‘b’ from a Pandas DataFrame and save them into a new DataFrame?
I have a DataFrame with columns ‘a’, ‘b’, and ‘c’, and I want to create a new DataFrame containing only columns ‘a’ and ‘b’.
My unsuccessful attempts:
df1 = df['a':'b']
df1 = df.ix[:, 'a':'b']
What is the correct way to select multiple columns in Pandas and create a new DataFrame with only those columns?
To select multiple columns ‘a’ and ‘b’ from a Pandas DataFrame and create a new DataFrame, you should use double square brackets with a list of column names: df1 = df[['a', 'b']]. This method returns a new DataFrame containing only the specified columns, which is different from single bracket notation that would return a Series for a single column selection.
Contents
- Correct Method for Selecting Multiple Columns
- Alternative Approaches
- Why Your Attempts Failed
- Common Pitfalls and Best Practices
- Practical Examples
- Advanced Column Selection
Correct Method for Selecting Multiple Columns
The primary and most straightforward method to select multiple columns from a Pandas DataFrame is to use double square brackets with a list of column names:
df1 = df[['a', 'b']]
This creates a new DataFrame df1 containing only columns ‘a’ and ‘b’ from the original DataFrame df. The double brackets are essential because they create a list of column names, which pandas expects when selecting multiple columns.
Key Insight: When you use double brackets df[['a', 'b']], you’re passing a list ['a', 'b'] to the DataFrame’s indexing operator. This tells pandas to return a DataFrame with those specific columns rather than a Series.
Let’s break down why this works:
- Inner brackets: Create a list of column names you want to select
- Outer brackets: Are the DataFrame indexing operator that processes the list
- Result: A new DataFrame containing only the specified columns
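The breakdown above can be checked directly. The following is a minimal sketch on a small made-up DataFrame (the values are illustrative, not from the question); it shows that the inner brackets are an ordinary Python list, so passing the list through a variable gives the same result.

```python
import pandas as pd

# Illustrative data with the question's column names 'a', 'b', 'c'
df = pd.DataFrame({"a": [1, 2, 3], "b": [10, 20, 30], "c": [100, 200, 300]})

# The inner brackets are just a plain Python list of column names ...
cols = ["a", "b"]

# ... so indexing with the variable is equivalent to df[['a', 'b']]
df1 = df[cols]
df2 = df[["a", "b"]]

print(type(df1).__name__)   # DataFrame
print(list(df1.columns))    # ['a', 'b']
print(df1.equals(df2))      # True
```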
Alternative Approaches
While the double bracket method is the most common, there are several other ways to select multiple columns in pandas:
Using loc for Label-Based Selection
The loc indexer provides label-based selection and offers more flexibility:
df1 = df.loc[:, ['a', 'b']]
This explicitly selects all rows (:) and only columns ‘a’ and ‘b’. According to the pandas documentation, loc[] is used for label-based selection and is particularly useful when you need more control over the selection process.
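As a small sketch of the extra control loc offers (sample values are made up for illustration), rows and columns can be restricted in a single step:

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 2, 3], "b": [10, 20, 30], "c": [100, 200, 300]})

# All rows, columns 'a' and 'b'
all_rows = df.loc[:, ["a", "b"]]

# Row labels 0 through 1 (loc includes both endpoints), columns 'a' and 'b'
first_two_rows = df.loc[0:1, ["a", "b"]]

print(all_rows.shape)         # (3, 2)
print(first_two_rows.shape)   # (2, 2)
```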
Using iloc for Position-Based Selection
If you know the positions of your columns rather than their names:
df1 = df.iloc[:, [0, 1]] # Select first two columns by position
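A short sketch of positional selection, again on made-up data. It also shows one possible way to look up positions from names with Index.get_loc when the column layout is not known in advance.

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 2, 3], "b": [10, 20, 30], "c": [100, 200, 300]})

# Positional list: first and second column
by_position = df.iloc[:, [0, 1]]

# Positional slice: as with ordinary Python slicing, the end (2) is excluded
by_slice = df.iloc[:, 0:2]

# Derive the positions from the column names instead of hard-coding them
positions = [df.columns.get_loc(name) for name in ["a", "b"]]
by_lookup = df.iloc[:, positions]

print(list(by_position.columns))      # ['a', 'b']
print(by_position.equals(by_slice))   # True
print(positions)                      # [0, 1]
```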
Creating a New DataFrame Explicitly
You can also create a new DataFrame by passing the original DataFrame and specifying the columns to include:
df1 = pd.DataFrame(df, columns=['a', 'b'])
This method is explicit and makes your intent clear - to create a new DataFrame with specific columns.
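One difference worth knowing about, sketched below with a hypothetical column name 'z' that does not exist in the data: the bracket syntax raises a KeyError for a missing column, while the constructor form reindexes and fills the missing column with NaN.

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 2, 3], "b": [10, 20, 30], "c": [100, 200, 300]})

# Bracket selection is strict: an unknown column name raises KeyError
strict_failed = False
try:
    df[["a", "z"]]  # 'z' is a hypothetical, non-existent column
except KeyError:
    strict_failed = True

# The constructor form is lenient: 'z' is created and filled with NaN
lenient = pd.DataFrame(df, columns=["a", "z"])

print(strict_failed)               # True
print(list(lenient.columns))       # ['a', 'z']
print(lenient["z"].isna().all())   # True
```

If a silently created empty column would hide a typo in your code, the bracket syntax is the safer choice.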
Using filter() Method
For more advanced column selection patterns:
df1 = df.filter(['a', 'b'])
Called with a list, filter() selects columns by name. It is most useful when you want to select columns by substring (like=) or by regular expression (regex=).
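A minimal sketch of how filter() behaves with a list of names, using made-up data and a hypothetical missing column 'z'. Unlike the bracket syntax, filter() quietly skips names that are not present.

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 2, 3], "b": [10, 20, 30], "c": [100, 200, 300]})

# Same result as df[['a', 'b']]
picked = df.filter(items=["a", "b"])

# 'z' does not exist; filter() drops it without raising an error
tolerant = df.filter(items=["a", "z"])

print(list(picked.columns))     # ['a', 'b']
print(list(tolerant.columns))   # ['a']
```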
Why Your Attempts Failed
Let’s analyze why your unsuccessful attempts didn’t work:
Attempt 1: df['a':'b']
A slice inside plain square brackets is interpreted as a row slice, not a column selection. Label slicing of this kind operates on the index (the row labels), never on the column names.
- Why it fails: Pandas interprets 'a':'b' as a request to slice rows from index label ‘a’ to index label ‘b’, not columns. With the default integer index there are no such labels, so the expression typically raises a TypeError
- What it actually does: If your DataFrame has rows with index labels ‘a’ and ‘b’, it selects those rows (with all their columns), not the columns
- Correct approach for row slicing: df.loc['a':'b', :] to select rows from ‘a’ to ‘b’ and all columns
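To make the row-slicing behaviour concrete, here is a small sketch. The string row labels are an assumption added for illustration; the question's DataFrame most likely uses the default integer index.

```python
import pandas as pd

# Same columns as in the question, but with string row labels (assumed here)
labelled = pd.DataFrame(
    {"a": [1, 2, 3], "b": [10, 20, 30], "c": [100, 200, 300]},
    index=["a", "b", "c"],
)

# A slice in plain brackets addresses ROWS: rows 'a' through 'b', all columns
rows = labelled["a":"b"]
print(list(rows.index))     # ['a', 'b']
print(list(rows.columns))   # ['a', 'b', 'c']

# With the default integer index there are no row labels 'a' or 'b',
# so the same expression is expected to fail rather than return columns
default = labelled.reset_index(drop=True)
try:
    default["a":"b"]
    slice_failed = False
except (TypeError, KeyError):
    slice_failed = True
print(slice_failed)
```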
Attempt 2: df.ix[:, 'a':'b']
The ix indexer was deprecated in pandas 0.20.0 and removed in pandas 1.0.0, so on any current version this line raises an AttributeError.
- Why it failed: ix was intended to be a hybrid of loc (label-based) and iloc (position-based) selection, but it was inconsistent and error-prone
- Current alternatives: Use loc for label-based selection or iloc for position-based selection
- Correct approach: df.loc[:, ['a', 'b']] or df.iloc[:, [0, 1]]. The label slice df.loc[:, 'a':'b'] also works and includes both endpoints
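The replacements for the ix call can be sketched as follows (sample values are made up). Note the difference in endpoint handling: a label slice with loc includes the end label, while a positional slice with iloc excludes the end position.

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 2, 3], "b": [10, 20, 30], "c": [100, 200, 300]})

# On pandas 1.0 and later the ix attribute no longer exists
print(hasattr(df, "ix"))

# Label-based replacement for df.ix[:, 'a':'b']: both 'a' and 'b' are included
by_label = df.loc[:, "a":"b"]

# Position-based replacement: positions 0 and 1 (the end position 2 is excluded)
by_position = df.iloc[:, 0:2]

print(list(by_label.columns))         # ['a', 'b']
print(by_label.equals(by_position))   # True
```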
Common Pitfalls and Best Practices
Single vs Double Brackets
A common source of confusion is the difference between single and double brackets:
# Single bracket - returns a Series (1D array)
series_a = df['a'] # Returns pandas Series
# Double brackets - returns a DataFrame (2D array)
df_ab = df[['a', 'b']] # Returns pandas DataFrame
As explained in the Stack Overflow discussion, “you must use double brackets if you select two or more columns. With one column name, single pair of brackets returns a Series.”
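Double brackets also work with a single name, which is a convenient way to keep a one-column result two-dimensional. A minimal sketch on made-up data:

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 2, 3], "b": [10, 20, 30], "c": [100, 200, 300]})

series_a = df["a"]     # single brackets: one-dimensional Series
frame_a = df[["a"]]    # double brackets: DataFrame with a single column

print(type(series_a).__name__, series_a.shape)   # Series (3,)
print(type(frame_a).__name__, frame_a.shape)     # DataFrame (3, 1)
```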
Column Name Existence
Always verify that the column names you’re trying to select actually exist:
# Check if columns exist before selection
if 'a' in df.columns and 'b' in df.columns:
    df1 = df[['a', 'b']]
else:
    print("One or both columns don't exist in the DataFrame")
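For longer lists of names, the same check can be written once for any number of columns. The sketch below is one possible approach; the names wanted, missing, and present are illustrative, and 'z' is a hypothetical column that does not exist.

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 2, 3], "b": [10, 20, 30], "c": [100, 200, 300]})

wanted = ["a", "b", "z"]  # 'z' is intentionally absent from df

# Split the requested names into those that exist and those that do not
missing = [name for name in wanted if name not in df.columns]
present = [name for name in wanted if name in df.columns]

if missing:
    print(f"Skipping missing columns: {missing}")

df1 = df[present]
print(list(df1.columns))   # ['a', 'b']
```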
Performance Considerations
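Selecting columns with double brackets is generally efficient, even for large DataFrames. Storing the column list in a variable does not make the selection itself faster, but it keeps repeated selections consistent and easier to maintain: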
# Store column names in a variable for reuse
cols_to_select = ['a', 'b']
df1 = df[cols_to_select]
Practical Examples
Let’s work through a complete example:
import pandas as pd
# Create a sample DataFrame
data = {
    'a': [1, 2, 3, 4, 5],
    'b': [10, 20, 30, 40, 50],
    'c': [100, 200, 300, 400, 500]
}
df = pd.DataFrame(data)
print("Original DataFrame:")
print(df)
print()
# Method 1: Double brackets (most common)
df1 = df[['a', 'b']]
print("New DataFrame with columns 'a' and 'b':")
print(df1)
print()
# Method 2: Using loc
df2 = df.loc[:, ['a', 'b']]
print("Using loc to select columns 'a' and 'b':")
print(df2)
print()
# Method 3: Creating new DataFrame explicitly
df3 = pd.DataFrame(df, columns=['a', 'b'])
print("Creating new DataFrame explicitly:")
print(df3)
Output:
Original DataFrame:
   a   b    c
0  1  10  100
1  2  20  200
2  3  30  300
3  4  40  400
4  5  50  500

New DataFrame with columns 'a' and 'b':
   a   b
0  1  10
1  2  20
2  3  30
3  4  40
4  5  50

Using loc to select columns 'a' and 'b':
   a   b
0  1  10
1  2  20
2  3  30
3  4  40
4  5  50

Creating new DataFrame explicitly:
   a   b
0  1  10
1  2  20
2  3  30
3  4  40
4  5  50
Real-World Example with Employee Data
Let’s use a more realistic example based on the GeeksforGeeks demonstration:
# Define employee data
data = {
    'Name': ['Jai', 'Princi', 'Gaurav', 'Anuj'],
    'Age': [27, 24, 22, 32],
    'Address': ['Delhi', 'Kanpur', 'Allahabad', 'Kannauj'],
    'Qualification': ['Msc', 'MA', 'MCA', 'Phd']
}
# Create DataFrame
df = pd.DataFrame(data)
# Select only Name and Qualification columns
employee_info = df[['Name', 'Qualification']]
print("Employee Information:")
print(employee_info)
Output:
Employee Information:
     Name Qualification
0     Jai           Msc
1  Princi            MA
2  Gaurav           MCA
3    Anuj           Phd
Advanced Column Selection
Selecting Columns Based on Conditions
You can also select columns based on certain conditions:
# Select columns whose names contain specific text
matching_cols = df.loc[:, df.columns.str.contains('Age|Qualification')]
print("Columns containing 'Age' or 'Qualification':")
print(matching_cols)
# Select columns based on data type
numeric_df = df.select_dtypes(include=['number'])
print("\nNumeric columns only:")
print(numeric_df)
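As a self-contained sketch of type-based selection, the snippet below rebuilds the employee data in a separate variable (employees is an illustrative name) and splits the columns into numeric and non-numeric ones using the include and exclude parameters of select_dtypes.

```python
import pandas as pd

employees = pd.DataFrame({
    "Name": ["Jai", "Princi", "Gaurav", "Anuj"],
    "Age": [27, 24, 22, 32],
    "Address": ["Delhi", "Kanpur", "Allahabad", "Kannauj"],
    "Qualification": ["Msc", "MA", "MCA", "Phd"],
})

# Keep only numeric columns
numeric = employees.select_dtypes(include="number")

# Keep everything that is not numeric
non_numeric = employees.select_dtypes(exclude="number")

print(list(numeric.columns))       # ['Age']
print(list(non_numeric.columns))   # ['Name', 'Address', 'Qualification']
```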
Selecting Columns with Regular Expressions
Using the filter() method with regex:
# Select columns that start with 'A'
cols_starting_with_a = df.filter(regex='^A')
print("Columns starting with 'A':")
print(cols_starting_with_a)
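A few more pattern-based selections, sketched on a copy of the employee data held in a separate variable (employees is an illustrative name). The like parameter matches a plain substring, while regex is applied as a regular-expression search against each column name.

```python
import pandas as pd

employees = pd.DataFrame({
    "Name": ["Jai", "Princi", "Gaurav", "Anuj"],
    "Age": [27, 24, 22, 32],
    "Address": ["Delhi", "Kanpur", "Allahabad", "Kannauj"],
    "Qualification": ["Msc", "MA", "MCA", "Phd"],
})

starts_with_a = employees.filter(regex="^A")      # names starting with 'A'
ends_with_e = employees.filter(regex="e$")        # names ending with 'e'
contains_ation = employees.filter(like="ation")   # names containing 'ation'

print(list(starts_with_a.columns))    # ['Age', 'Address']
print(list(ends_with_e.columns))      # ['Name', 'Age']
print(list(contains_ation.columns))   # ['Qualification']
```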
Selecting Non-Adjacent Columns
If you need to select columns that aren’t next to each other:
# Here df is the a/b/c sample DataFrame from the first example, not the employee data
# Select columns 'a' and 'c' (skipping 'b')
ac_df = df[['a', 'c']]
print("Columns 'a' and 'c' only:")
print(ac_df)
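The order of the names in the list also determines the column order of the result, so the same syntax can be used to reorder columns. A minimal sketch on made-up data:

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 2, 3], "b": [10, 20, 30], "c": [100, 200, 300]})

# Non-adjacent columns by name, in the order given
ca_df = df[["c", "a"]]

# Non-adjacent columns by position (first and third)
ac_by_position = df.iloc[:, [0, 2]]

print(list(ca_df.columns))            # ['c', 'a']
print(list(ac_by_position.columns))   # ['a', 'c']
```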
Conclusion
Key Takeaways
- Use double brackets df[['a', 'b']] to select multiple columns and create a new DataFrame
- Single brackets df['a'] return a Series, while double brackets return a DataFrame
- Avoid the removed ix indexer; use loc for label-based selection and iloc for position-based selection
- String slicing like df['a':'b'] doesn’t work for column selection; it’s designed for row indexing
- Multiple alternatives exist, including loc, iloc, filter(), and explicit DataFrame creation
Best Practices
- Always verify column names exist before selection
- Use double brackets to ensure you get a DataFrame, not a Series
- Prefer loc over deprecated methods for label-based selection
- Store column lists in variables when reusing them, for consistency and readability
Recommended Approach
For your specific case of selecting columns ‘a’ and ‘b’ from a DataFrame, the recommended approach is:
df1 = df[['a', 'b']]
This is the most idiomatic, readable, and efficient way to accomplish your goal in pandas. It creates a new DataFrame containing only the columns you specified, leaving the original DataFrame unchanged.
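If you plan to modify the new DataFrame afterwards, adding an explicit .copy() makes its independence from the original unambiguous. The following is a minimal sketch on made-up data, not a requirement of the selection itself.

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 2, 3], "b": [10, 20, 30], "c": [100, 200, 300]})

# Select the columns and take an explicit, independent copy
df1 = df[["a", "b"]].copy()

# Changing the copy does not touch the original DataFrame
df1["a"] = df1["a"] * 100

print(df1["a"].tolist())   # [100, 200, 300]
print(df["a"].tolist())    # [1, 2, 3]
```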
Sources
- Pandas Documentation - Subset Selection
- GeeksforGeeks - Select Multiple Columns in Pandas
- Stack Overflow - Double Brackets Purpose
- Towards Data Science - Selecting Multiple Columns
- Statology - Pandas Select Multiple Columns
- PythonHow - Select Multiple Columns
- Spark by Examples - Pandas Select Multiple Columns