


How To Clean The Data Using Jupyter

While most articles focus on deep learning and modeling, as a practicing data scientist you're probably going to spend much more time finding, accessing, and cleaning up data than you will running models against it.


In this post, you'll get a quick, hands-on introduction to using the Python "Pandas" library. Whether you're taking your first steps to becoming a professional data scientist, or just want to save some repetitive work next time you need to clean up a spreadsheet, Pandas is an incredibly powerful tool for easily importing, cleaning, and exporting data.

Pandas envy

It may not be as popular amongst the data science crowd, but as an ex-software developer, Ruby will always hold a special place in my heart. I love the inclusive community, the passion for test driving your code, the logical consistency of everything being an object, and the power and comprehensiveness of Rails for building web applications quickly and efficiently.

But ever since I started teaching data science as well as software engineering, I've found Ruby lacking in one key area. It simply doesn't have a fully fledged data analysis gem that can compare to Python's Pandas library. Usually when I code in Ruby, I appreciate the elegance and economy of expression that the language provides. But after using Pandas for data cleaning, I can honestly say that importing, iterating over, cleaning, and then saving data in Ruby is starting to feel a little verbose.

Assumptions

  • I'm going to assume that you have a professional data science environment set up on your computer. If you don't have Python, Jupyter Notebook, and Pandas installed on your machine, here's one way to get set up for data science.
  • I'm also going to assume that you're comfortable opening up a terminal window and cloning a GitHub repo.

Getting started

Here's a lab I created for an enterprise project. Start off by opening a terminal window somewhere inside your user directory and cloning the repository:

> git clone https://github.com/learn-co-curriculum/ent-ds-del-2-ii-cleaning-company-data

Now start up Jupyter Notebook. If you're using the Anaconda distribution, run the Anaconda Navigator application and click on the Jupyter Notebook tile to start it up.

[Screenshot: Pandas1]

From there, navigate to the directory where you cloned the GitHub repo and you should see an "index.ipynb" file in the directory:

[Screenshot: Pandas2]

Click on the index.ipynb to open the notebook:

[Screenshot: Pandas3]

OK, if you haven't seen a Jupyter Notebook before, it's rendered in a browser, composed of cells, and a really easy way to intersperse code, comments, charts, and tables. To run code, type it into a "code" cell and hit shift-enter. Let's start by writing and running the boilerplate code to import the Pandas library (as per convention, assigning it to the variable "pd"), remembering to hit "shift-enter" to run the cell.

[Screenshot: Pandas4]
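If the screenshot doesn't come through, the cell in question is just the standard import boilerplate, roughly:

> # import the Pandas library under its conventional "pd" alias
> import pandas as pd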

Importing and exploring data

Let's see how easy it is to import the "new_data.csv" data into Pandas:

[Screenshot: Pandas5]
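In case the screenshot isn't visible, the import step is essentially a one-liner (this sketch assumes new_data.csv sits in the same directory as the notebook):

> # read the CSV file into a DataFrame
> df = pd.read_csv('new_data.csv')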

Next up, let's take a quick look at the data, starting with .head() to view the first few rows, then using the .info() method to get some general information relating to the entire data set:

[Screenshot: Pandas6]
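If the screenshot is missing, the two exploratory calls look something like this:

> # show the first five rows of the DataFrame
> df.head()
> # summarize column names, non-null counts, and data types for the whole data set
> df.info()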

OK, so it looks like we have a set of company names, the states they were incorporated in, their number of employees, and what kind of legal entity they use. Let's imagine we wanted to know how many of these companies were of each entity type. Let's start by using the value_counts() method to learn more about the values within the EntityType field:

[Screenshot: Pandas7]
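The cell behind that screenshot is roughly:

> # count how many times each distinct value appears in the EntityType column
> df['EntityType'].value_counts()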

OK, so it looks like we've got C-corps, S-corps and LLCs, but as you can see, due to inconsistent capitalization, it thinks there are six different entity types instead of three. Cool. We've imported a data set and learned something about it. Now let's clean it up.

Cleaning up data

There are lots of ways of making the capitalization consistent for the EntityType – everything from manually cleaning up the data one character at a time to downcasing the entire file to lower case. Let's see how we could use Pandas to make the capitalization more consistent.

Firstly, let's just see one way to iterate over a data frame by writing and running the following code:

> for i in df.index:
>     print(df.at[i, 'EntityType'])

And remember, indentation is meaningful in Python, so make sure to indent the line with the print statement to ensure that it's part of the for-loop.

[Screenshot: Pandas8]

OK, that seems to be working. Next step, let's capitalize all of the records using the following code:

> for i in df.index:
>     df.at[i, 'EntityType'] = df.at[i, 'EntityType'].upper()
> df['EntityType'].value_counts()

[Screenshot: 8A]

That's not bad – we've solved the problem in just a few lines of code. But I wonder if we could have done something even slicker by taking advantage of some of the other methods built into Pandas, rather than just treating it as a dumb data container to iterate over.
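(As an aside, and not something shown in the original screenshots: one of those built-in methods is the vectorized .str accessor, which would let us uppercase the whole column without an explicit loop.)

> # vectorized alternative to the for-loop above (my addition, not part of the original walkthrough)
> df['EntityType'] = df['EntityType'].str.upper()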

Selectors

One approach would be to use Pandas selectors to apply transformations to a subset of the records without having to iterate. Let's reload the data into a new data frame and give it a shot:

> df2 = pd.read_csv('new_data.csv')
> df2.loc[df2["EntityType"] == "llc", "EntityType"] = "LLC"
> df2.loc[df2["EntityType"] == "c corp", "EntityType"] = "C corp"
> df2.loc[df2["EntityType"] == "s corp", "EntityType"] = "S corp"
> df2['EntityType'].value_counts()

[Screenshot: Pandas9]

Exporting data

Finally, once you're happy with the changes you've made, it's only a one-liner to save the data to a new file, with just two extra lines to read the exported data back in and to confirm it's saved all of our changes:

[Screenshot: Pandas10]
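The code in that final screenshot would be along these lines (the output filename here is my assumption; the original may use a different one):

> # save the cleaned data to a new CSV file, without writing the DataFrame's index
> df2.to_csv('cleaned_data.csv', index=False)
> # read the exported file back in and confirm the cleaned entity types were saved
> df3 = pd.read_csv('cleaned_data.csv')
> df3['EntityType'].value_counts()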

Summary

We have only just scratched the surface of what Pandas can do. It really is a Swiss Army knife for exploratory data analysis. The range of methods available can feel overwhelming, but try to get into the habit of using Pandas as your go-to tool for cleaning up spreadsheet data, and over time you can try out additional methods to expand what you can do with Pandas.


Source: https://flatironschool.com/blog/introduction-to-data-cleaning-in-python-using-pandas/
