Let’s face it: data cleanup can often be a headache, and not because of its complexity. More often than not, I cringe at the thought of doing data cleanup because it becomes so monotonous. To say I’ve become a “wiz” at Python would be an understatement. Python is exceptionally lightweight, and you don’t need a full development environment to spin up a quick script for most of your typical data cleansing tasks.
I’ve compiled a list of my top five Python scripts for automating data cleanup.
Converting PDF to CSV
Less of a “data cleanup” and more of a “data prep” script, this one has saved me multiple times when I needed to grab data from a whitepaper, ebook, or other large PDF document. I typically start by using Preview on Mac (because it has a simple way to delete pages from a PDF) and remove all of the pages that do not contain the data I want. This makes cleaning out any extra content that gets converted that much easier.
import tabula
# Grab the file
pdf_filename = input("Enter the full path and filename: ")
# Extract the contents of the PDF; lattice=True replaces the deprecated spreadsheet option,
# and read_pdf returns a list with one dataframe per detected table
frames = tabula.read_pdf(pdf_filename, pages="all", lattice=True, encoding="utf-8")
# Create a CSV file from the contents of each table
for i, frame in enumerate(frames):
    frame.to_csv('pdf_conversion_{}.csv'.format(i), index=False)
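If the extracted tables all share the same column layout (say, one logical table split across pages), you can also stitch them into a single file. A minimal sketch of that variant, assuming matching columns:
import pandas as pd
import tabula
# Grab the file
pdf_filename = input("Enter the full path and filename: ")
# Combine the per-page tables into one dataframe (assumes they share columns)
frames = tabula.read_pdf(pdf_filename, pages="all", lattice=True, encoding="utf-8")
pd.concat(frames, ignore_index=True).to_csv('pdf_conversion.csv', index=False)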
It’s a relatively straightforward way to pull data out quickly so that you can manipulate it further in CSV format before loading it into a database or a tool such as Tableau or Count.
Merge CSV Files
One of the frustrating things about tools like a CRM is that exporting data can be a bit of a pain. Many systems offer a CSV export option, but there is no way to consolidate the data before exporting it. This can leave you with five or more files in a folder that contain the same type of data but different results. This script fixes that problem by taking those files, which must all share a matching format and live in the same folder, and merging them into one.
from time import strftime
import pandas as pd
import glob
# Define the path of the folder containing the CSV files
path = input('Please enter the full folder path: ')
# Make sure there is a trailing slash
if path[:-1] != "/":
path = path + "/"
# Grab the CSV files as a list
csv_files = glob.glob(path + '*.csv')
# Open each CSV file and merge into a single file
merged_file = pd.concat([pd.read_csv(c) for c in csv_files])
# Create the new file (colons are avoided in the timestamp so the filename is valid on any OS)
merged_file.to_csv(path + 'merged_{}.csv'.format(strftime("%m-%d-%yT%H-%M-%S")), index=False)
print('Merge complete.')
The final output will give you a single CSV file containing all the data from the list of CSVs you exported from the source system.
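One variant I sometimes reach for: tagging each row with the file it came from before concatenating, so you can trace any merged row back to its source report. A quick sketch, where the source_file column name is my own addition rather than anything the export produces:
import glob
import pandas as pd
# Grab the CSV files (assumes the path ends with a trailing slash)
path = input('Please enter the full folder path: ')
csv_files = glob.glob(path + '*.csv')
# assign() adds a hypothetical source_file column holding each file's path
merged = pd.concat([pd.read_csv(c).assign(source_file=c) for c in csv_files])
merged.to_csv(path + 'merged_with_sources.csv', index=False)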
Delete Duplicate Rows From CSV Files
If you need to delete duplicate rows of data from a CSV file, this script can help you quickly perform your cleanup. Continuing the previous example of merging exported files, it is possible that your CRM reporting includes the same account across multiple reports. When you combine the data, you introduce duplicates into the dataset that can skew your results in a visualization tool or ML project.
import pandas as pd
# Get the filename
filename = input('Please enter the full folder path and filename: ')
# Define the CSV column name you want to check for duplicates on
duplicate_header = input('Please enter the header name to check for duplicates: ')
# Get the contents of the file
file_contents = pd.read_csv(filename)
# Remove the duplicate rows (inplace=True would return None, so assign the result instead)
deduplicated_data = file_contents.drop_duplicates(subset=[duplicate_header], keep="last")
# Create the new file
deduplicated_data.to_csv('deduplicated_data.csv', index=False)
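If you would rather treat a row as a duplicate only when every column matches, not just one header, drop the subset argument entirely. A minimal sketch of that stricter variant:
import pandas as pd
# Read the file and compare entire rows; keep="first" retains the first occurrence
file_contents = pd.read_csv(input('Please enter the full folder path and filename: '))
file_contents.drop_duplicates(keep="first").to_csv('deduplicated_data.csv', index=False)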
Split a CSV Column
Another frustrating outcome of exporting a report is that it will sometimes contain a single column of data that I need split into two. A great example is a full-name field that I need broken out into separate first-name and last-name columns. I use this script more often than any of the others.
import pandas as pd
# Get the filename and define the columns
filename = input('Please enter the full folder path and filename: ')
col_to_split = input('Please enter the column name you want to split: ')
col_name_one = input('Name of first new column: ')
col_name_two = input('Name of second new column: ')
# Add the CSV data to a dataframe
df = pd.read_csv(filename)
# Split the column on the first delimiter only, so extra commas don't break the two-column assignment
df[[col_name_one, col_name_two]] = df[col_to_split].str.split(",", n=1, expand=True)
# Create the new file
df.to_csv('split_data.csv', index=False)
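For the full-name example specifically, the values are usually separated by a space rather than a comma. Here is a sketch of that variant; the Full Name, First Name, and Last Name headers are hypothetical, and splitting on the first space means everything after the first word lands in the second column:
import pandas as pd
df = pd.read_csv(input('Please enter the full folder path and filename: '))
# Split the hypothetical Full Name column on the first space only
df[['First Name', 'Last Name']] = df['Full Name'].str.split(' ', n=1, expand=True)
df.to_csv('split_data.csv', index=False)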
Merging Different Datasets
Let’s assume you have a list of accounts and their associated orders and want to see order history along with the associated account details. One good way to do this is by merging, or joining, the data into a single CSV file.
import pandas as pd
# Get the filenames and define user inputs
left_filename = input('Please enter the full LEFT folder path and filename: ')
right_filename = input('Please enter the full RIGHT folder path and filename: ')
join_type = input('Specify the join type (outer, inner, left, right): ')
join_column_name = input('What is the column name to join on (e.g., Account_ID): ')
# Read the files into dataframes
df_left = pd.read_csv(left_filename)
df_right = pd.read_csv(right_filename)
# Join the dataframes
joined_data = pd.merge(left=df_left, right=df_right, how=join_type, on=join_column_name)
# Create the new file
joined_data.to_csv('joined_data.csv', index=False)
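To make the join types concrete, here is a tiny, entirely made-up example: a left join keeps every account and attaches any matching orders, while an inner join keeps only accounts that have at least one order.
import pandas as pd
# Hypothetical sample data for illustration only
accounts = pd.DataFrame({'Account_ID': [1, 2, 3], 'Name': ['Acme', 'Globex', 'Initech']})
orders = pd.DataFrame({'Account_ID': [1, 1, 3], 'Order_Total': [250, 100, 75]})
# Left join: all three accounts appear, and Globex gets NaN for Order_Total
print(pd.merge(left=accounts, right=orders, how='left', on='Account_ID'))
# Inner join: only Acme and Initech remain, since Globex has no orders
print(pd.merge(left=accounts, right=orders, how='inner', on='Account_ID'))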
Final Thoughts
These scripts have been invaluable in helping me clean up CSV files that might get dropped into an S3 bucket for a data pipeline or loaded into an ML model for processing. Almost all of the examples lean on Pandas, a Python library that is great for manipulating data, and that alone speaks to its power. Pandas is my go-to library for data manipulation because it packs so many capabilities into a few lines.
Need help automating data cleansing and importing it into your data warehouse or ML model? Our team of data experts can help! Contact us to learn more about our services and offerings.