# Mastering Pandas: 3 Key Skills for Data Science Success

## Understanding Key Skills in Pandas

This article is part of my Next-Level Series. If you haven't already, be sure to check out the previous entries: 3 Underappreciated Skills to Make You a Next-Level Data Scientist and 3 Underappreciated Skills to Make You a Next-Level Python Programmer.

Pandas can be quite challenging to grasp, particularly for those who are new to programming. Despite its complexity, mastering this library is essential for anyone aspiring to excel in data science. Educational institutions are beginning to adapt their curricula to prioritize learning Pandas.

For instance, when I first delved into data science four years ago at UC Berkeley, the course utilized a specialized module named datascience, which, although built on Pandas, had significantly different functionality. In contrast, UC San Diego has recently introduced a module aptly named babypandas, aimed at easing students into full-fledged Pandas usage.

Both individuals and educational bodies are recognizing the necessity of a solid understanding of Pandas in today’s data science landscape, and they are adjusting their learning objectives accordingly. If you're reading this, you're probably among those eager to enhance your Pandas skills. Let’s explore three underappreciated abilities that will help you elevate your proficiency in Pandas.

### One-Hot Encoding for Categorical Variables

To appreciate this skill, it's essential to review the types of variables in statistics. Variables can be broadly categorized into two types: quantitative and qualitative (categorical). These can be further divided into:

- **Discrete (Quantitative)**: Numerical values that can be counted exactly, like a city's population.
- **Continuous (Quantitative)**: Measured numerical values that can be infinitely precise, such as height.
- **Ordinal (Categorical)**: Qualitative variables with a defined order, like "spice level" (mild, medium, hot).
- **Nominal (Categorical)**: Qualitative variables without a specific order, such as color.

It’s important to note that while ordinal and nominal variables can be represented as numbers, arithmetic on those numbers does not yield meaningful results. For example, zip codes are nominal: summing or averaging them produces a number that means nothing.

The challenge arises because computational models typically prefer quantitative data, while much of our available data is qualitative. To build predictive models, we need to convert categorical data into a format that machines can process.

Consider a dataset with the columns "Age," "State," and "Income." To use "State" in a model predicting income, we must transform it. Simply assigning numbers to states is misleading for nominal variables, as it implies a false hierarchy.

The solution lies in one-hot encoding, which transforms a single categorical column into multiple binary columns—one for each unique category. For instance, instead of a single "State" column, we create columns like "is_state_California," "is_state_Oregon," etc. A row that originally had "California" would now have a value of 1 in the "is_state_California" column and 0 elsewhere.

Implementing this in Pandas is straightforward, thanks to the get_dummies function, which performs one-hot encoding effortlessly:

pd.get_dummies(my_df, columns=["State"], prefix='is_state')
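To make the call above concrete, here is a minimal sketch on a small made-up dataset. The rows and values are assumptions for illustration only; the column names follow the "Age," "State," and "Income" example. The `dtype=int` argument is added here so the new columns hold 0/1, since recent pandas versions default to True/False.

```python
import pandas as pd

# Hypothetical data; only the column names come from the example in the text.
my_df = pd.DataFrame({
    "Age": [34, 27, 45],
    "State": ["California", "Oregon", "California"],
    "Income": [85000, 62000, 91000],
})

# One binary column per unique state. dtype=int asks for 0/1 values
# instead of the True/False that newer pandas versions return by default.
encoded = pd.get_dummies(my_df, columns=["State"], prefix="is_state", dtype=int)

print(encoded.columns.tolist())
# ['Age', 'Income', 'is_state_California', 'is_state_Oregon']
print(encoded)
```

The original "State" column is dropped and replaced by the new binary columns, while "Age" and "Income" pass through unchanged.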

Now your dataset is primed for machine learning.

### Merging DataFrames Together

As any experienced data scientist knows, raw data is rarely clean and structured. A significant portion of a data scientist's role involves aggregating data from various sources and addressing missing values. One of the most crucial—yet intricate—methods for data combination is merging.

Merging effectively combines data from two DataFrames that share at least one common column. However, the various types of merges (or joins) can be confusing.

Let’s consider an example. Suppose we have a DataFrame called left_df with names and another DataFrame called right_df containing information about college degrees.

When merging, we need to specify:

- The left DataFrame
- The right DataFrame
- The column(s) to merge on
- The type of join: inner, left, right, or outer

An inner join will only include rows with matching values from both DataFrames. For example:

left_df.merge(right_df, on='Name', how='inner')

This will yield results only for individuals present in both DataFrames.
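As a rough sketch, the inner join can be tried on two tiny DataFrames. The names, ages, and degrees below are invented for illustration; only the variable names left_df and right_df and the 'Name' key come from the example above.

```python
import pandas as pd

# Hypothetical data: Ana appears only on the left, Dev only on the right.
left_df = pd.DataFrame({"Name": ["Ana", "Ben", "Cara"], "Age": [29, 41, 35]})
right_df = pd.DataFrame({"Name": ["Ben", "Cara", "Dev"], "Degree": ["BS", "MA", "PhD"]})

# Keep only the names that appear in both DataFrames.
inner = left_df.merge(right_df, on="Name", how="inner")

print(inner["Name"].tolist())  # ['Ben', 'Cara']
print(inner)
```

Ana and Dev are dropped because each is missing from one of the two DataFrames.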

A left join retains every row from the left DataFrame. Where a row has no match in the right DataFrame, the columns that come from the right are filled with nulls:

left_df.merge(right_df, on='Name', how='left')
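Here is a small sketch of what the left join produces, again using invented data. It also shows one way to find the rows that had no match, by checking for nulls in a column that came from the right side.

```python
import pandas as pd

# Hypothetical data: Ana has no entry in right_df.
left_df = pd.DataFrame({"Name": ["Ana", "Ben", "Cara"], "Age": [29, 41, 35]})
right_df = pd.DataFrame({"Name": ["Ben", "Cara", "Dev"], "Degree": ["BS", "MA", "PhD"]})

# Every row of left_df survives; Dev is dropped because he is only on the right.
left_join = left_df.merge(right_df, on="Name", how="left")

# Rows whose "Degree" is null are the ones with no match on the right.
unmatched = left_join[left_join["Degree"].isna()]

print(len(left_join))              # 3
print(unmatched["Name"].tolist())  # ['Ana']
```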

Conversely, a right join keeps every row from the right DataFrame and fills the columns that come from the left with nulls wherever there is no match:

left_df.merge(right_df, on='Name', how='right')
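One way to build intuition for the right join is to notice that it mirrors the left join: swapping the two DataFrames and joining with how='left' gives the same rows, just with the columns in a different order. The sketch below checks this on invented data; it is an illustration, not a claim about how pandas implements the join internally.

```python
import pandas as pd

# Hypothetical data: Dev has no entry in left_df.
left_df = pd.DataFrame({"Name": ["Ana", "Ben", "Cara"], "Age": [29, 41, 35]})
right_df = pd.DataFrame({"Name": ["Ben", "Cara", "Dev"], "Degree": ["BS", "MA", "PhD"]})

right_join = left_df.merge(right_df, on="Name", how="right")
swapped = right_df.merge(left_df, on="Name", how="left")

# Put both results in the same column and row order before comparing.
cols = ["Name", "Age", "Degree"]
a = right_join[cols].sort_values("Name").reset_index(drop=True)
b = swapped[cols].sort_values("Name").reset_index(drop=True)

# Raises an AssertionError if the two results differ.
pd.testing.assert_frame_equal(a, b)
print(a)
```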

An outer join keeps every row from both DataFrames, filling in nulls on whichever side has no match:

left_df.merge(right_df, on='Name', how='outer')
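When the results of an outer join are hard to read, the merge method's `indicator=True` option can help: it adds a `_merge` column recording whether each row came from the left, the right, or both. The data below is invented for illustration.

```python
import pandas as pd

# Hypothetical data: Ana is only on the left, Dev only on the right.
left_df = pd.DataFrame({"Name": ["Ana", "Ben", "Cara"], "Age": [29, 41, 35]})
right_df = pd.DataFrame({"Name": ["Ben", "Cara", "Dev"], "Degree": ["BS", "MA", "PhD"]})

# indicator=True adds a "_merge" column with the values
# "left_only", "right_only", or "both".
outer = left_df.merge(right_df, on="Name", how="outer", indicator=True)

# Map each name to where its row came from.
source = dict(zip(outer["Name"], outer["_merge"].astype(str)))
print(source)
print(outer)
```

Filtering on `_merge` is a quick way to audit which records failed to match before deciding how to handle the nulls.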

Now you’re equipped to merge data effectively.

### Concatenating DataFrames Together

After navigating the complexities of merging, you’ll find that concatenation is much simpler. This process involves stacking two or more DataFrames together, rather than joining them based on a common column.

The most basic example includes two DataFrames with identical columns. For instance:

pd.concat([top_df, bottom_df])

Unlike merging, where we call the method directly on a DataFrame object, we use pd.concat and pass the DataFrames as a list.
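A quick sketch of vertical concatenation, with invented rows. One detail worth seeing in action is that pd.concat keeps each DataFrame's original row labels unless you pass `ignore_index=True`.

```python
import pandas as pd

# Hypothetical data with identical columns.
top_df = pd.DataFrame({"Name": ["Ana", "Ben"], "Age": [29, 41]})
bottom_df = pd.DataFrame({"Name": ["Cara", "Dev"], "Age": [35, 23]})

# Stack the rows; the original row labels are kept, so they repeat.
stacked = pd.concat([top_df, bottom_df])
print(stacked.index.tolist())  # [0, 1, 0, 1]

# ignore_index=True renumbers the rows from 0.
renumbered = pd.concat([top_df, bottom_df], ignore_index=True)
print(renumbered.index.tolist())  # [0, 1, 2, 3]
```

Repeated row labels can cause surprises when selecting rows by label later, so renumbering is often the safer default.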

Concatenation can also accommodate DataFrames with differing structures. However, this may lead to unwanted null values, which we will need to address later. Additionally, concatenation can be performed horizontally by specifying axis=1:

pd.concat([top_df, more_df], axis=1)

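This operation places two DataFrames side by side without requiring a common column. Rows are lined up by their index labels, so labels that appear in only one DataFrame produce nulls in the other's columns.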
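The two less tidy cases described above can be sketched as follows. All data here is invented, and other_df is a hypothetical name introduced only for this illustration; more_df is given a deliberately mismatched index to show where nulls come from.

```python
import pandas as pd

top_df = pd.DataFrame({"Name": ["Ana", "Ben"], "Age": [29, 41]})

# Case 1: stacking DataFrames whose columns differ.
# other_df (hypothetical) has "Degree" but no "Age".
other_df = pd.DataFrame({"Name": ["Cara"], "Degree": ["MA"]})
mixed = pd.concat([top_df, other_df], ignore_index=True)
# Count the nulls that the mismatched columns introduced.
print(mixed.isna().sum().to_dict())  # {'Name': 0, 'Age': 1, 'Degree': 2}

# Case 2: side-by-side concatenation lines rows up by index label.
# more_df uses labels 1 and 2, while top_df uses 0 and 1.
more_df = pd.DataFrame({"Degree": ["BS", "PhD"]}, index=[1, 2])
side = pd.concat([top_df, more_df], axis=1)
print(side)
```

In the second case only the row labeled 1 has values in every column; labels 0 and 2 each exist in just one DataFrame, so they pick up nulls.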

### Bonus Skill: Fill and Replace

When aggregating data, it’s common to encounter null values or inaccuracies. For example, let’s consider a DataFrame named faulty_df with missing entries.

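To manage null values, we might fill the gaps with a representative value such as the average (17 in this example). Note that calling fillna on the whole DataFrame fills the nulls in every column with that same value: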

faulty_df = faulty_df.fillna(17)

A numeric column that contains nulls is typically stored as floats, so once the gaps are filled we can convert it back to integers:

faulty_df['Age'] = faulty_df['Age'].astype('int')
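Here is a slightly more targeted sketch of the same idea, assuming you only want to fill one column and would rather compute the average than hard-code it. The data is invented, and the misspelled state and the use of replace to correct it are assumptions added to illustrate the "replace" half of this section.

```python
import pandas as pd

# Hypothetical data: one missing age and one misspelled state.
faulty_df = pd.DataFrame({
    "Name": ["Ana", "Ben", "Cara"],
    "Age": [16.0, None, 18.0],
    "State": ["California", "Calfornia", "Oregon"],
})

# mean() skips nulls, so this is the average of 16 and 18.
mean_age = faulty_df["Age"].mean()

# Fill only the "Age" column, then round and convert back to integers.
faulty_df["Age"] = faulty_df["Age"].fillna(mean_age).round().astype("int")

# Replace a known bad value with the correct one.
faulty_df["State"] = faulty_df["State"].replace("Calfornia", "California")

print(faulty_df)
```

Filling one column at a time avoids accidentally writing a number like 17 into unrelated columns that also happen to contain nulls.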

And there you have it! A straightforward method to rectify errors in your DataFrames following merging and concatenation.

## Recap and Final Thoughts

As a data scientist, it's vital to be adept at collecting, organizing, cleaning, and restructuring data to meet your analysis needs. By honing the skills outlined above, you’ll be better prepared to tackle your data challenges.

Here’s a quick cheat sheet for reference:

- Categorical data needs transformation for ML models; learn one-hot encoding.
- Data is often scattered; improve your merging skills.
- Similar datasets should be combined; use concatenation effectively.

Wishing you the best of luck on your data manipulation journey!

I’m Murtaza Ali, a PhD student at the University of Washington specializing in human-computer interaction. I enjoy writing about education, programming, and life in general.