May 14, 2024

Using Regular Expressions Wisely (Part 2)

© Roman Fedin / 123RF.com

The previous part covered the basic concepts of regular expressions. This part goes beyond classic search and replace and introduces some of the extended possibilities.

One feature of regular expressions that intrigued me early on was group indexing. Suppose you receive a CSV file in which each row corresponds to a record and semicolons separate the attributes (Listing 1).

Aspect;Alain;Frankreich;Physik;2022
Clauser;John;Vereinigte Staaten;Physik;2022
Zeilinger;Anton;Österreich;Physik;2022
Agostini;Pierre;Frankreich;Physik;2023
Krausz;Ferenc;Ungarn;Physik;2023
L'Huillier;Anne;Frankreich;Physik;2023
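
To make the idea of group indexing concrete, here is a minimal sketch, assuming Python's “re” module and a simple pattern with one lazy capture group per field. The variable names (“rows”, “pattern”, “full_names”) are illustrative and not taken from the article.

```python
import re

# Two sample rows from the listing above; fields are separated by semicolons.
rows = [
    "Aspect;Alain;Frankreich;Physik;2022",
    "L'Huillier;Anne;Frankreich;Physik;2023",
]

# One capture group per field; groups are numbered from 1, left to right.
pattern = re.compile(r"^(.*?);(.*?);(.*?);(.*?);(.*?)$")

full_names = []
for row in rows:
    match = pattern.match(row)
    if match:
        # group(0) is the whole match; group(1) to group(5) are the fields.
        full_names.append(f"{match.group(2)} {match.group(1)} ({match.group(5)})")

print(full_names)
```

Each group can be picked out by its position, so the fields can be reordered or dropped without splitting the row by hand.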

With the help of groups, I prepare the individual scientists in the list for further processing. The expression in the first line of Listing 2 serves as the regex for this. Figure 1 shows the results when the Python function “findall()” is used to list all matches. The regex is not yet very easy to read, so I refine it a little (Listing 2, line 2).

Figure 1: I use “findall()” to display named groups.

^(.*?);(.*?);(.*?);(.*?);(.*?)$
^(.*?);(.*?);(.*?);(Physik|Chemie|Medizin|Literatur|Wirtschaftswissenschaften|Frieden);(\d{4})$
^(?<Name>.*?);(?<Vorname>.*?);(?<Land>.*?);(Physik|Chemie|Medizin|Literatur|Wirtschaftswissenschaften|Frieden);(\d{4})$
^(\w+(?# Name));(\w+(?# Vorname));(\w+(?# Land));(\w+(?# Fach));(\d+(?# Jahr))$
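
The following sketch shows how the named-group variant might be used from Python. It rests on a few assumptions: Python's “re” module writes named groups as “(?P<Name>...)”, whereas the “(?<Name>...)” spelling is the form used by flavors such as PCRE, Java, and .NET; the group names “Fach” and “Jahr” for groups 4 and 5 are added here purely for illustration; and the variable names are hypothetical.

```python
import re

# The six records from Listing 1 as one multi-line string.
data = """\
Aspect;Alain;Frankreich;Physik;2022
Clauser;John;Vereinigte Staaten;Physik;2022
Zeilinger;Anton;Österreich;Physik;2022
Agostini;Pierre;Frankreich;Physik;2023
Krausz;Ferenc;Ungarn;Physik;2023
L'Huillier;Anne;Frankreich;Physik;2023
"""

# Named groups in Python spelling: (?P<name>...).
# The names Fach and Jahr are illustrative additions for groups 4 and 5.
pattern = re.compile(
    r"^(?P<Name>.*?);(?P<Vorname>.*?);(?P<Land>.*?);"
    r"(?P<Fach>Physik|Chemie|Medizin|Literatur|Wirtschaftswissenschaften|Frieden);"
    r"(?P<Jahr>\d{4})$",
    re.MULTILINE,  # let ^ and $ match at every line, not only at the string ends
)

# findall() returns one tuple per record, with the groups in positional order.
records = pattern.findall(data)
print(records[1])

# finditer() yields match objects, so groups can also be read by name.
laureates_2023 = [
    m.group("Name") for m in pattern.finditer(data) if m.group("Jahr") == "2023"
]
print(laureates_2023)
```

Note that “findall()” returns plain tuples even when the groups are named, so access by name needs “finditer()” or “match()”. The “\w+” variant in the last line of the listing is stricter than it looks: “\w” matches neither spaces nor apostrophes, so rows such as “Vereinigte Staaten” or “L'Huillier” would presumably not match it.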

Now it is clear what to expect in groups 4 and 5. Options like this…

[…]
