Data science, I'm sorry to say, often involves cleaning up input data into a usable and uniform format. Command line tools like grep, awk and sed provide an arcane power to manipulate text in files of arbitrary size. Mastering these tools can separate data science novices from data scientists with flaming robes (to continue on the arcane theme).
For the purposes of this tutorial we have a directory of files that have some lines of the form: Tags: Tag1, Tag2, ... (zero or more Tag labels). Our goal is to convert the tag labels to lowercase (Tag1 -> tag1), but leave the rest of the file unchanged. You can get the git repo with example files with git clone https://github.com/frankcleary/tag-examples.git.
The answer
$ git grep -lz "^Tags:" | xargs -0 sed -i -r "s/(^Tags:)(.+)/\1\L\2/g"
Explanation
The sed command allows for substitution of strings in text. We'll use a sed command to do the text manipulation (lowercasing of tag names) on these files. The basic syntax for a sed substitution command is this:
$ # s/ : do substitution
$ # old/new : replace any occurrences of "old" with "new"
$ # /g : replace all found matches on the line, instead of only the first
$ # filename : the name of the file to search and replace text in
$ sed "s/old/new/g" filename
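To see what the /g flag changes, here is a minimal sketch that pipes a single line into sed instead of reading a file. The sample text is made up for illustration; sed reads stdin when no filename is given.

```shell
# Hypothetical one-line input, not one of the tutorial files.
# Without /g only the first match on the line is replaced.
printf 'old old old\n' | sed "s/old/new/"
# prints: new old old

# With /g every match on the line is replaced.
printf 'old old old\n' | sed "s/old/new/g"
# prints: new new new
```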
This command does not modify the file; it writes the result to stdout (prints it to the screen). Our goal is to construct a sed command that lowercases everything after "Tags:", modifies the files in place, and does not change any files that aren't under version control. We'll construct this command in steps.
Developing the command, Step 1: Match the line.
This sed command finds the lines that start with "Tags:" followed by at least one other character, and replaces the entire line with the string "changed".
$ # the original file:
$ cat tag-example1.txt
Title: Tag example 1
Tags: Tag1, Tag2
Content
$ # -r : use extended regular expressions, so + and ( ) work without backslashes
$ # ^Tags:.+ : Search for "Tags:" at the beginning of a line (^)
$ # followed by one or more other characters (.+).
$ sed -r "s/^Tags:.+/changed/g" tag-example1.txt
Title: Tag example 1
changed
Content
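Before replacing anything, it can help to preview which lines the pattern selects. The sketch below uses sed's -n option (suppress normal output) with the p command (print matching lines). The four input lines are invented for illustration, including a bare "Tags:" line and a line with "Tags:" in the middle.

```shell
# Hypothetical input covering three edge cases around the pattern.
# -n : do not print lines by default
# /^Tags:.+/p : print only lines that match the pattern
printf 'Title: Tag example\nTags: Tag1, Tag2\nTags:\nSee Tags: below\n' |
  sed -n -r "/^Tags:.+/p"
# prints: Tags: Tag1, Tag2
```

The bare "Tags:" line is skipped because .+ needs at least one character after the colon, and "See Tags: below" is skipped because of the ^ anchor.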
Developing the command, Step 2: Lowercase the line.
We don't want to replace the line with new text; we want to replace it with the old text in lowercase (except for the initial "Tags:" part). In GNU sed, \0 means "what was matched" and \L means "make lowercase." Combining these we can lowercase the entire line.
$ sed -r "s/^Tags:.+/\L\0/g" tag-example1.txt
Title: Tag example 1
tags: tag1, tag2
Content
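The whole match can also be written as &, which is the standard sed spelling. The sketch below assumes GNU sed, since \L is a GNU extension, and uses a made-up one-line input rather than the example file.

```shell
# & stands for the entire matched text; \L lowercases what follows it (GNU sed).
printf 'Tags: Tag1, Tag2\n' | sed -r "s/^Tags:.+/\L&/"
# prints: tags: tag1, tag2
```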
Developing the command, Step 3: Lowercase part of the line.
The problem with the above command is that it lowercases the entire line, including the initial "Tags:" part. To solve this problem we can enclose parts of the pattern in parentheses and access the first enclosed part as \1, the second as \2 and so on. To lowercase just the part after "Tags:":
$ sed -r "s/(^Tags:)(.+)/\1\L\2/g" tag-example1.txt
Title: Tag example 1
Tags: tag1, tag2
Content
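If GNU sed is not available, the same "keep the prefix, lowercase the rest" idea can be sketched with awk. This is an alternative technique, not the command this tutorial builds: it uses awk's tolower and substr functions, and the input is a stand-in for tag-example1.txt.

```shell
# "Tags:" is 5 characters long, so substr($0, 6) is everything after it.
# Matching lines are rebuilt as prefix + lowercased remainder;
# all other lines are printed unchanged.
printf 'Title: Tag example 1\nTags: Tag1, Tag2\nContent\n' |
  awk '/^Tags:/ { print "Tags:" tolower(substr($0, 6)); next } { print }'
```

Unlike sed -i, this only writes to stdout, so changing files in place would need an extra redirect-and-rename step.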
Developing the command, Step 4: Finding the files to change
Now it's time to replace the text of the actual files with the -i flag (-i '' on Mac OS X). This operation could be dangerous if the files are not under version control, so we'll use git to find and change only files in the git repo.
$ # outputs the file name and the matching line
$ git grep "^Tags:"
tag-example1.txt:Tags: Tag1, Tag2
tag-example2.txt:Tags: Tag1
tag-example3.txt:Tags: Tag1, Tag2, Tag3
$ # outputs just the file names
$ git grep -l "^Tags:"
tag-example1.txt
tag-example2.txt
tag-example3.txt
$ # outputs the file names separated by a null character
$ git grep -lz "^Tags:"
tag-example1.txt^@tag-example2.txt^@tag-example3.txt^@
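To check that git grep really leaves untracked files alone, here is a small sketch in a throwaway repository. The directory and the file names tracked.txt and untracked.txt are made up for this illustration.

```shell
# Work in a temporary repository so nothing real is touched.
repo=$(mktemp -d)
cd "$repo" || exit 1
git init -q

# Both files contain a matching line, but only one is added to git.
printf 'Tags: Tag1\n' > tracked.txt
printf 'Tags: Tag9\n' > untracked.txt
git add tracked.txt

# By default git grep searches tracked files only.
git grep -l "^Tags:"
# prints: tracked.txt
```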
Developing the command, Step 5: The complete command
We can use the xargs tool to tell sed to act on the list of files we found in step 4.
$ # outputs the files to be changed
$ git grep -lz "^Tags:" | xargs -0 echo
tag-example1.txt tag-example2.txt tag-example3.txt
$ # The final answer:
$ git grep -lz "^Tags:" | xargs -0 sed -i -r "s/(^Tags:)(.+)/\1\L\2/g"
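For extra caution, GNU sed can keep a backup copy when -i is given a suffix. The sketch below assumes GNU sed and runs in a temporary directory. The file name "my notes.txt" is hypothetical, and printf '%s\0' stands in for git grep -lz so the example does not need a repository.

```shell
dir=$(mktemp -d)
printf 'Title: Tag example\nTags: Tag1, Tag2\n' > "$dir/my notes.txt"

# -i.bak : edit in place, saving the original as "my notes.txt.bak"
# -0 : split xargs input on null characters, so the space in the name is safe
printf '%s\0' "$dir/my notes.txt" |
  xargs -0 sed -i.bak -r "s/(^Tags:)(.+)/\1\L\2/"

cat "$dir/my notes.txt"      # edited copy: Tags: tag1, tag2
cat "$dir/my notes.txt.bak"  # untouched original: Tags: Tag1, Tag2
```

The null separator is the reason for the -z and -0 pair in the final command: with whitespace-separated names, xargs would split "my notes.txt" into two arguments.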
Developing the command, Step 6: Inspect the results with git diff
We can confirm that we got the correct outcome with git diff:
$ git diff
diff --git a/tag-example1.txt b/tag-example1.txt
index 589bbdf..7d57a7d 100644
--- a/tag-example1.txt
+++ b/tag-example1.txt
@@ -1,4 +1,4 @@
Title: Tag example 1
-Tags: Tag1, Tag2
+Tags: tag1, tag2
Content
diff --git a/tag-example2.txt b/tag-example2.txt
index addcd3b..d271212 100644
--- a/tag-example2.txt
+++ b/tag-example2.txt
@@ -1,4 +1,4 @@
Title: Tag example 2
-Tags: Tag1
+Tags: tag1
Content
diff --git a/tag-example3.txt b/tag-example3.txt
index c8b10e1..42e0a75 100644
--- a/tag-example3.txt
+++ b/tag-example3.txt
@@ -1,4 +1,4 @@
Title: Tag example 3
-Tags: Tag1, Tag2, Tag3
+Tags: tag1, tag2, tag3
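If the diff had shown something unexpected, version control is what makes the in-place edit safe to undo. The sketch below walks through that in a throwaway repository. The file name, commit message and inline user identity are placeholders, and GNU sed is assumed.

```shell
repo=$(mktemp -d)
cd "$repo" || exit 1
git init -q
printf 'Title: Tag example\nTags: Tag1, Tag2\n' > tag-example.txt
git add tag-example.txt
# Identity is passed inline so the sketch does not depend on global git config.
git -c user.name="Example" -c user.email="example@example.com" \
  commit -q -m "Add example file"

# Apply the in-place edit, then summarize what changed.
sed -i -r "s/(^Tags:)(.+)/\1\L\2/" tag-example.txt
git diff --stat

# Discard the edit and restore the committed version.
git checkout -- tag-example.txt
cat tag-example.txt
```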