Using sed to make specific text lowercase in place

Data science, I'm sorry to say, often involves cleaning up input data into a usable and uniform format. Command line tools like grep, awk and sed provide an arcane power to manipulate text in files of arbitrary size. Mastering these tools can separate data science novices from data scientists with flaming robes (to continue on the arcane theme).

For this tutorial we have a directory of files in which some lines have the form Tags: Tag1, Tag2, ... (zero or more tag labels). Our goal is to convert the tag labels to lowercase (Tag1 -> tag1) but leave the rest of the file unchanged. You can get the git repo with example files with git clone https://github.com/frankcleary/tag-examples.git.

The answer

$ git grep -lz "^Tags:" | xargs -0 sed -i -r "s/(^Tags:)(.+)/\1\L\2/g"

Explanation

The sed command allows for substitution of strings in text. We'll use a sed command to do the text manipulation (lowercasing of tag names) on these files. The basic syntax for a sed substitution command is this:

$ # s/ : do substitution
$ # old/new : replace any occurrences of "old" with "new"
$ # /g : replace all found matches on the line, instead of only the first
$ # filename : the name of the file to search and replace text in
$ sed "s/old/new/g" filename

This command does not modify the file; it writes the result to stdout (prints it to the screen). Our goal is to construct a sed command that lowercases everything after "Tags:", modifies the files in place, and does not change any files that aren't under version control. We'll construct this command in steps.
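To make the stdout-versus-file distinction concrete, here is a small sketch run in a scratch directory. The file name demo.txt and its contents are made up for illustration and are not part of the example repo.

```shell
# Work in a throwaway directory so no real files are touched.
workdir=$(mktemp -d)
printf 'old text\nmore old text\n' > "$workdir/demo.txt"

# Without -i, sed writes the substituted text to stdout...
printed=$(sed "s/old/new/g" "$workdir/demo.txt")
# ...and the file on disk keeps its original contents.
untouched=$(cat "$workdir/demo.txt")

echo "$printed"
echo "$untouched"
```

The first echo shows the substituted text, the second shows that the file still holds the original.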

Developing the command, Step 1: Match the line.

This sed command finds the lines that begin with "Tags:" followed by at least one other character, and replaces the entire line with the string "changed".

$ # the original file:
$ cat tag-example1.txt
Title: Tag example 1
Tags: Tag1, Tag2

Content
$ # -r : use extended regular expressions (so + works without a backslash)
$ # ^Tags:.+ : Search for "Tags:" at the beginning of a line (^)
$ #   followed by one or more other characters (.+).
$ sed -r "s/^Tags:.+/changed/g" tag-example1.txt
Title: Tag example 1
changed

Content
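While developing a pattern it can help to print only the lines it touches. One way to do that, sketched below, is to combine sed's -n option (suppress the default output) with the p flag on the substitution (print lines where a replacement happened). The scratch copy of tag-example1.txt is recreated here only so the snippet runs on its own.

```shell
# Recreate the example file in a scratch directory.
workdir=$(mktemp -d)
printf 'Title: Tag example 1\nTags: Tag1, Tag2\n\nContent\n' > "$workdir/tag-example1.txt"

# -n : do not print every line by default
# /p : print only the lines on which the substitution succeeded
matched=$(sed -n -r "s/^Tags:.+/changed/p" "$workdir/tag-example1.txt")

echo "$matched"
```

Only the single replaced line is printed, which makes it easy to confirm that the pattern matches what you expect and nothing else.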

Developing the command, Step 2: lowercase the line.

We don't want to replace the line with new text; we want to replace it with the old text in lowercase (except for the initial "Tags:" part). In a GNU sed replacement, \0 means "what was matched" (the same as &) and \L means "make lowercase." Combining these, we can lowercase the entire line.

$ sed -r "s/^Tags:.+/\L\0/g" tag-example1.txt 
Title: Tag example 1
tags: tag1, tag2

Content
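As a side note, the same step can be written with &, the more widely supported way to refer to the whole match, and \U is the uppercase counterpart of \L. Both case operators are GNU sed extensions, so this sketch assumes GNU sed.

```shell
line='Tags: Tag1, Tag2'

# \L& : the whole match (&), converted to lowercase
lower_amp=$(printf '%s\n' "$line" | sed -r 's/^Tags:.+/\L&/')
# \U& : the whole match, converted to uppercase
upper=$(printf '%s\n' "$line" | sed -r 's/^Tags:.+/\U&/')

echo "$lower_amp"
echo "$upper"
```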

Developing the command, Step 3: lowercase part of the line.

The problem with the above command is that it lowercases the entire line, including the initial "Tags:" part. To solve this, we can enclose parts of the search pattern in parentheses and refer to the first enclosed part as \1, the second as \2, and so on. To lowercase just the part after "Tags:":

$ sed -r "s/(^Tags:)(.+)/\1\L\2/g" tag-example1.txt 
Title: Tag example 1
Tags: tag1, tag2

Content
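It is worth checking which lines this pattern leaves alone. The sketch below feeds a few made-up lines (not taken from the example repo) through the same command: a normal tag line, a "Tags:" line with nothing after it, a line where "Tags:" is not at the start, and ordinary content.

```shell
result=$(printf '%s\n' \
  'Tags: Tag1, Tag2' \
  'Tags:' \
  'See Tags: Foo' \
  'Content With Caps' |
  sed -r "s/(^Tags:)(.+)/\1\L\2/g")

echo "$result"
```

Only the first line changes. The bare "Tags:" line does not match because .+ needs at least one character, and the third line does not match because ^ anchors the pattern to the start of the line.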

Developing the command, Step 4: Finding the files to change

Now it's time to replace the text in the actual files with the -i flag (-i '' on Mac OS X). This operation could be dangerous if the files are not under version control, so we'll use git grep to find and change only files that are tracked in the git repo.

$ # outputs the file name and the matching line
$ git grep "^Tags:"
tag-example1.txt:Tags: Tag1, Tag2
tag-example2.txt:Tags: Tag1
tag-example3.txt:Tags: Tag1, Tag2, Tag3
$ # outputs just the file names
$ git grep -l "^Tags:"
tag-example1.txt
tag-example2.txt
tag-example3.txt
$ # outputs the file names separated by a null character
$ git grep -lz "^Tags:"
tag-example1.txt^@tag-example2.txt^@tag-example3.txt^@
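The null separator matters once a file name contains whitespace. The sketch below simulates a null-separated list with printf rather than calling git grep -lz, and uses two hypothetical files, so it runs outside a git repo. It compares xargs -0 with plain xargs.

```shell
workdir=$(mktemp -d)
printf 'Tags: Tag1\n' > "$workdir/my notes.txt"   # name contains a space
printf 'Tags: Tag2\n' > "$workdir/plain.txt"

# Null-separated names: each name reaches cat as a single argument.
null_out=$(cd "$workdir" && printf '%s\0' 'my notes.txt' 'plain.txt' | xargs -0 cat)

# Newline-separated names: xargs splits "my notes.txt" into "my" and
# "notes.txt", so cat cannot find that file (errors are discarded here).
newline_out=$(cd "$workdir" && printf '%s\n' 'my notes.txt' 'plain.txt' | xargs cat 2>/dev/null || true)

echo "$null_out"
echo "$newline_out"
```

With -0 both files are read. Without it, only plain.txt is.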

Developing the command, Step 5: The complete command

We can use the xargs tool to tell sed to act on the list of files we found in step 4.

$ # outputs the files to be changed
$ git grep -lz "^Tags:" | xargs -0 echo
tag-example1.txt tag-example2.txt tag-example3.txt
$ # The final answer:
$ git grep -lz "^Tags:" | xargs -0 sed -i -r "s/(^Tags:)(.+)/\1\L\2/g"
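If you want an extra safety net on top of version control, GNU sed accepts a suffix attached to -i and keeps a copy of each original file under that suffix. This is an optional variation, not part of the command above. The sketch assumes GNU sed and works on a scratch copy of one example file.

```shell
workdir=$(mktemp -d)
printf 'Title: Tag example 1\nTags: Tag1, Tag2\n' > "$workdir/tag-example1.txt"

# -i.bak : edit in place, saving the original as tag-example1.txt.bak
sed -i.bak -r "s/(^Tags:)(.+)/\1\L\2/g" "$workdir/tag-example1.txt"

edited=$(cat "$workdir/tag-example1.txt")
backup=$(cat "$workdir/tag-example1.txt.bak")

echo "$edited"
echo "$backup"
```

The edited file has lowercase tags while the .bak copy still holds the original line.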

Developing the command, Step 6: Inspect the results with git diff

We can confirm that we got the correct outcome with git diff:

$ git diff
diff --git a/tag-example1.txt b/tag-example1.txt
index 589bbdf..7d57a7d 100644
--- a/tag-example1.txt
+++ b/tag-example1.txt
@@ -1,4 +1,4 @@
 Title: Tag example 1
-Tags: Tag1, Tag2
+Tags: tag1, tag2

 Content
diff --git a/tag-example2.txt b/tag-example2.txt
index addcd3b..d271212 100644
--- a/tag-example2.txt
+++ b/tag-example2.txt
@@ -1,4 +1,4 @@
 Title: Tag example 2
-Tags: Tag1
+Tags: tag1

 Content
diff --git a/tag-example3.txt b/tag-example3.txt
index c8b10e1..42e0a75 100644
--- a/tag-example3.txt
+++ b/tag-example3.txt
@@ -1,4 +1,4 @@
 Title: Tag example 3
-Tags: Tag1, Tag2, Tag3
+Tags: tag1, tag2, tag3
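For a large number of files, reading the whole diff may be impractical. A rough complementary check, sketched here on a scratch copy of one already-converted file, is to count "Tags:" lines that still contain an uppercase letter after the prefix. A count of 0 suggests nothing was missed.

```shell
workdir=$(mktemp -d)
printf 'Title: Tag example 3\nTags: tag1, tag2, tag3\n\nContent\n' > "$workdir/tag-example3.txt"

# grep -c prints the number of matching lines; it exits non-zero when the
# count is 0, hence the "|| true".
remaining=$(grep -c -E '^Tags:.*[[:upper:]]' "$workdir/tag-example3.txt" || true)

echo "$remaining"
```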

Data Science Bytes
