Google Sheets Data Cleaning
This is the second part of my five-part series. The first part, released on December 15, 2018, covered the collection of data with Google Forms. The form did more than simply collect information. It verified and validated the information provided by presenters, participants, and vendors.
In this issue, we will look at the collected information. We took steps to validate the information on the form. The validation didn’t include spell checking or formatting. Some of the formatting issues we encounter include capitalization, unwanted spaces, and misspellings. This information needs to be reviewed before it appears in the presentation schedule. This corrected information is sent to the presenters for verification.
In these lessons, we will import the form data into another sheet. I don’t like to work with the collected information directly. The imported information will be manipulated so we have consistency. This includes spell checking and letter case verification. The data will form the basis for the schedule. The schedule will import the information from our verification sheet. Every step of the import process keeps a consistent link to the original information.
In the beginning correct the original information. I quickly learned this was a mistake.
The information shown in these lessons is not actual information from a scheduled event. We need to look at email addresses and campus information. To keep things anonymous I have created a Google Sheet that resembles collected information. I will provide a link to the data in this lesson.
Google Sheets Query and Functions
Here are some of the skills you will learn in these lessons. The first step is to pull the information we want to clean. This information will be used to construct the schedule. The information is pulled into another sheet using a function called QUERY. A query requests data from a database table. The sheet in our Google Sheet is a table with data.
The query will request data from select columns in the table. Each column in the table is referred to as a field. The terms query and field come from the world of databases. Databases include one or more tables. We refer to all the information in each row as a record. Records are part of databases.
Cleaning data is a common part of collecting data. Scientists and researchers do this all the time. I use several functions to clean the collected data. Functions are special instructions used to carry out a series of steps.
I use the SUBSTITUTE function to replace unwanted characters. These characters often include periods, dashes, or extra letters. The substitute function looks for content I choose and replaces it with content I also choose. For example, the substitute function will learn for a character like a comma. It will replace the comma with another character of my choosing, like a period.
The TRIM function removes unwanted characters. It does this much like the SUBSTITUTE function. I use the function to trim extra spaces. It doesn’t remove all spaces. It is smart enough to know that two spaces in a row are not common. The extra space is removed. I like the function because it removes spaces from the beginning and end of words. Extra spaces cause formatting issues on published content.
Nested functions are one way to use more than one function at a time. I use the trim function within the substitute function. This removes unwanted spaces and removes unwanted characters.
I am a fan of something called Regular Expressions. Regular expressions use patterns to search for information. Searches are combined with REPLACE or EXTRACT. Replace with Regex, Regular Expressions works much like the substitute function. I prefer Regex whenever possible because it affords better precision.
The substitute function is literal. Here is an example. We want to add a space after a period if the space character is missing before the next sentence. Substitute will search for a period and then the letter “A” for example. What if the period is followed by another letter? This is where Regex really shines. A regular expression can be formatted to search for a period and any letter or number.
The PROPER function is used to reformat text with the proper case. The first letter in a word is changed to uppercase. The rest of the letters are changed to lowercase. That is all we will be using it for in our lesson. This simple function saves me lots of time.
Cell references are a staple of working with spreadsheets. The functions we use to take advantage of cell references. Cell references fall into two categories. Cell references are either relative or absolute. Relative cell references change with reference to the function. Absolute cell references remain fixed to one or more cells. We will use both forms in our lessons.