Regex basics
What is regex? A regex, short for regular expression, is a sequence of characters that forms a search pattern. It can be used to search a string to see whether it contains a certain pattern. As a data science bootcamp student, I wanted to think about how this was going to help me and whether I should bother diving deeper into it. A lot of the people I saw talking about Regex used it for building websites with JavaScript, Node, TypeScript or PHP. Then I realized that I am also looking for specific strings in a DataFrame, so I decided to look into Regex.
My bootcamp class was tossed into Regex exercises with no lessons, so I was very confused and didn’t understand it the first couple of times, with all of the backslashes and special characters. When I see “+” I think of addition, but that’s not how it works in Regex. There are plenty of websites and cheat sheets to check out, but I am a hands-on learner and needed to work through some examples to fully understand.
Let’s start off with letters and words. If you are looking for an exact word, type it without brackets and the pattern will match exactly what you typed. If you want to find a few similar words, you can put the letters that vary inside square brackets and the letters they share outside. For example, if you want the words ‘man’, ‘dan’ and ‘fan’ but not ‘can’, the Regex would be [fdm]an. On its own, ‘an’ would match all four words because it appears in every one of them, but with [fdm] in front the first letter has to be ‘f’, ‘d’ or ‘m’. Because ‘c’ is not in the brackets, ‘can’ is not matched.
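Here is a minimal sketch of that bracket example in Python. The post does not name a language, so using Python's built-in re module is an assumption, and the word list is just the four example words from above.

```python
import re

# The four example words from the paragraph above
words = ["man", "dan", "fan", "can"]

# [fdm] allows exactly one of f, d or m, followed by the literal letters "an"
pattern = re.compile(r"[fdm]an")

# fullmatch requires the whole word to fit the pattern
matches = [w for w in words if pattern.fullmatch(w)]
print(matches)  # ['man', 'dan', 'fan']
```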
You can also match a range of letters by putting a dash between two letters inside the brackets. Working with the same words, [a-h]an gets all of them except ‘man’, because ‘m’ is not in the range a to h.
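The range version can be sketched the same way. Again, Python's re module is an assumption, and the list is the same four example words.

```python
import re

words = ["man", "dan", "fan", "can"]

# [a-h] allows any single lowercase letter from a through h
in_range = [w for w in words if re.fullmatch(r"[a-h]an", w)]
print(in_range)  # ['dan', 'fan', 'can']
```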
Curly brackets are used to find repetitions. For example, if you wanted to find a website address with ‘www’ in it, you can use w{3} and it will match everywhere there are three w’s in a row.
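A small sketch of the curly bracket example, assuming Python's re module. The sample text and both addresses in it are made up for illustration.

```python
import re

# Made-up sample text: one address with three w's, one with only two
text = "Visit www.example.com or ww.typo.com"

# w{3} means exactly three w characters in a row
found = re.findall(r"w{3}", text)
print(found)  # ['www']
```

The address with only two w’s is skipped because there are not three in a row.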
Everything that I went over with letters can be done with numbers as well. Now for what I thought was the most confusing part: the special characters and what they mean. To match any single character, whether it is a letter, number or symbol, you use ‘.’ (in most Regex flavors it matches anything except a line break by default). If you want to actually find a ‘.’, you need to put a backslash in front of it, \. and the same is true for any of the characters with special meanings.
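To see the difference between ‘.’ and \. side by side, here is a short sketch assuming Python's re module. The three file names are hypothetical.

```python
import re

# Hypothetical names that differ only in the character between "file" and "txt"
filenames = ["file.txt", "file-txt", "fileXtxt"]

# An unescaped . stands for any single character, so all three fit
any_char = [f for f in filenames if re.fullmatch(r"file.txt", f)]
print(any_char)  # ['file.txt', 'file-txt', 'fileXtxt']

# \. stands for a literal period only
literal_dot = [f for f in filenames if re.fullmatch(r"file\.txt", f)]
print(literal_dot)  # ['file.txt']
```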
Instead of using curly brackets, you can use ‘*’ or ‘+’ to match repeated characters. The difference is that ‘*’ means zero or more repetitions, so it will also pick up words without the character you specify. ‘+’ means one or more, so the character has to appear at least once.
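A quick way to compare the two, assuming Python's re module. The sample strings are made up so that the only thing changing is how many b’s sit between ‘a’ and ‘c’.

```python
import re

# Made-up samples with zero, one and three b's
samples = ["ac", "abc", "abbbc"]

# b* accepts zero or more b's, so "ac" fits too
star = [s for s in samples if re.fullmatch(r"ab*c", s)]
print(star)  # ['ac', 'abc', 'abbbc']

# b+ needs at least one b, so "ac" is left out
plus = [s for s in samples if re.fullmatch(r"ab+c", s)]
print(plus)  # ['abc', 'abbbc']
```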
Because ‘*’ allows zero repetitions, a pattern that pairs b* with c+ will still grab lines that have no ‘b’ at all, as long as they have at least one ‘c’. The question mark ‘?’ makes a character optional. If we replace the ‘*’ with ‘?’, the pattern will still grab the ‘c’s but at most one ‘b’ in front of them, because ‘?’ means zero or one and doesn’t look for repetition.
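The sketch below shows what each pattern actually grabs. It assumes Python's re module, and the three sample lines are made up for illustration, not taken from the original exercise.

```python
import re

# Made-up sample lines; the last one has no b at all
lines = ["aaaabcc", "aabbbbc", "aacc"]

# b*c+ grabs any run of b's (possibly none) followed by at least one c
star_matches = [re.search(r"b*c+", line).group() for line in lines]
print(star_matches)  # ['bcc', 'bbbbc', 'cc']

# b?c+ grabs at most one b before the c's
optional_matches = [re.search(r"b?c+", line).group() for line in lines]
print(optional_matches)  # ['bcc', 'bc', 'cc']
```

The second line is where the two differ: with ‘*’ all four b’s are included, while with ‘?’ only the one directly before the ‘c’ is.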
The last thing that confused me was when to use parentheses. They are used as a capturing group, which lets you pull out just part of what the pattern matched. For example, you might want just the name of a file without the .pdf or whatever the file type is. I think the best use for this will be things like grabbing only the date and the year and putting them in separate columns.
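Here is a minimal sketch of the file name case, assuming Python's re module. The file name is hypothetical.

```python
import re

# Hypothetical file name
filename = "report_2021.pdf"

# (.+) captures one or more characters; \.pdf must follow but is not captured
match = re.fullmatch(r"(.+)\.pdf", filename)

# group(1) is whatever the first pair of parentheses captured
name = match.group(1)
print(name)  # report_2021
```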
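If you put one pair of parentheses inside another, with the outer pair around the month and year and the inner pair around only the year, the pattern grabs the month and year together but it will also give you just the year. This is because each pair of parentheses is its own capturing group, even when one sits inside another.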
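A sketch of nested groups, assuming Python's re module. The sample text and the exact pattern are assumptions built from the brackets, ranges and curly brackets covered earlier, and may not be the pattern used in the original exercise.

```python
import re

# Made-up sample text with a month and a year
text = "Jan 1987"

# Outer parentheses: letters, a space, then four digits (month and year)
# Inner parentheses: just the four digits (year)
match = re.search(r"([A-Za-z]+ ([0-9]{4}))", text)

month_and_year = match.group(1)
year = match.group(2)
print(month_and_year)  # Jan 1987
print(year)            # 1987
```

Groups are numbered by where their opening parenthesis appears, so the outer group is 1 and the inner group is 2.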
There is more to go over in Regex, but this is a good place to get started and to play around until you get comfortable.