Text Processing Tools
Overview
Tools for Extracting Text
- File Contents: cat and less
- File Excerpts: head and tail
- Extract by Keyword: grep
- Extract by Column or Field: cut
1. Viewing File Contents
cat command
cat: Dumps one or more files to STDOUT
- The cat command is most useful for viewing short files
- Multiple files are concatenated together
Options:
- -A: Show all characters, including control characters and other non-printing characters
- -s: Squeeze multiple adjacent blank lines into a single blank line
- -b: Number each non-blank line of output
NOTE!:
If you dump the contents of a binary file to a terminal with cat, the terminal may become garbled and unusable.
You can run the reset command to restore the terminal and carry on.
While you type reset, the characters may not be echoed correctly, but the command still works.
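A minimal sketch of the options above, assuming two small sample files in a temporary directory (the file names a.txt and b.txt are made up for illustration):

```shell
# Create sample files in a temporary directory (removed when the shell exits)
tmp=$(mktemp -d)
trap 'rm -rf "$tmp"' EXIT
printf 'one\n\n\n\ntwo\n' > "$tmp/a.txt"
printf 'three\n' > "$tmp/b.txt"

# Multiple files are concatenated in the order given
cat "$tmp/a.txt" "$tmp/b.txt"

# -s squeezes the three blank lines in a.txt into one
cat -s "$tmp/a.txt"

# -b numbers only the non-blank lines
cat -b "$tmp/a.txt"
```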
less command
less: View a file or STDIN one page at a time
- The less command is more useful for viewing larger files
Navigating Text with less
- Space: Moves ahead one full screen
- b: Moves back one full screen
- Enter: Moves ahead one line
- k: Moves back one line
- g: Moves to the top of the file
- G: Moves to the bottom of the file
- /text: Searches for text
- n: Repeats the last search
- N: Repeats the last search, but in the opposite direction
- v: Opens the file in an editor (vi by default)
- q: Quits
2. Viewing File Excerpts
head command
head: Display the first 10 lines of a file
- -n: Specify the number of lines to display
Examples:
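A short sketch, assuming a sample file numbers.txt (a made-up name) that holds the numbers 1 to 20, one per line:

```shell
# Build a 20-line sample file in a temporary directory
tmp=$(mktemp -d)
trap 'rm -rf "$tmp"' EXIT
seq 1 20 > "$tmp/numbers.txt"

# Default: the first 10 lines
head "$tmp/numbers.txt"

# -n: only the first 3 lines
head -n 3 "$tmp/numbers.txt"
```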
tail command
tail: Display the last 10 lines of a file
- -n: Specify the number of lines to display
- -f: Follow subsequent additions to the file
- Continue to display the file in REAL TIME
- Very useful for monitoring log files!
- System Administrators use this feature to keep an eye on the system log.
Examples:
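A short sketch, again assuming a made-up sample file numbers.txt. The -f example is left as a comment because it runs until interrupted, and the log path shown there is only an assumption that varies between systems:

```shell
# Build a 20-line sample file in a temporary directory
tmp=$(mktemp -d)
trap 'rm -rf "$tmp"' EXIT
seq 1 20 > "$tmp/numbers.txt"

# Default: the last 10 lines
tail "$tmp/numbers.txt"

# -n: only the last 3 lines
tail -n 3 "$tmp/numbers.txt"

# -f: keep printing lines as they are appended (stop with Ctrl+C)
# tail -f /var/log/messages
```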
3. Extracting Text by Keyword - grep command
grep: Prints lines of files or STDIN where the pattern is matched
- Patterns can contain regular expression metacharacters that the shell would otherwise interpret, so it is good practice to always quote your regular expressions.
Options:
- -i: Search case-insensitively
- -n: Print line numbers of matches
- -v: Print lines that do not contain the pattern
- -Ax: Include x lines after each match
- -Bx: Include x lines before each match
- -r: Recursively search a directory
- -c: Count the number of lines where the pattern matched
- -l: Only return the names of files that have at least one line containing the pattern
- --color=auto: Highlight the match in color
Examples:
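A sketch of the options above, assuming a small passwd-style sample file (users.txt and its contents are invented for illustration):

```shell
# Sample data in a temporary directory
tmp=$(mktemp -d)
trap 'rm -rf "$tmp"' EXIT
cat > "$tmp/users.txt" <<'EOF'
root:x:0:0:root:/root:/bin/bash
daemon:x:1:1:daemon:/usr/sbin:/usr/sbin/nologin
alice:x:1000:1000:Alice:/home/alice:/bin/bash
Bob:x:1001:1001:Bob:/home/bob:/bin/zsh
EOF

# Lines ending in "bash", with line numbers
grep -n 'bash$' "$tmp/users.txt"

# Case-insensitive match at the start of a line
grep -i '^bob' "$tmp/users.txt"

# Lines that do NOT contain "nologin"
grep -v 'nologin' "$tmp/users.txt"

# The matching line plus one line after it
grep -A1 '^daemon' "$tmp/users.txt"

# Count of matching lines
grep -c 'bash$' "$tmp/users.txt"

# Recursive search, printing only the names of matching files
grep -rl 'alice' "$tmp"
```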
4. Extract by Column or Field - cut command
cut: Display specific columns of a file or STDIN
Options:
- -d: Specify the column delimiter (default is TAB)
- -f: Specify the column to print
- -c: Cut by characters
Examples:
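A short sketch using input piped from printf, so no files are needed. The passwd-style line is sample data, not taken from a real system:

```shell
# Fields 1 and 7 of a colon-delimited line
printf 'root:x:0:0:root:/root:/bin/bash\n' | cut -d: -f1,7

# With no -d, fields are separated by TAB
printf 'a\tb\tc\n' | cut -f2

# Characters 1 through 4 of each line
printf 'root:x:0:0:root:/root:/bin/bash\n' | cut -c1-4
```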
Tools for Analyzing Text
- Text Stats: wc
- Sorting Text: sort
- Comparing Files: diff and patch
- Spell Check: aspell
wc command
wc: Counts the number of lines, words, bytes and/or characters in a file or STDIN.
- On traditional UNIX systems, every character in a text file took up exactly 1 byte.
- However, with the advent of internationalization and larger character sets like Unicode some characters can take up to 4 bytes.
Options:
- -l: Only the line count
- -w: Only the word count
- -c: Only the byte count (equal to the character count only when every character is 1 byte)
- -m: Get an accurate character count
Examples:
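A small sketch, assuming a made-up sample file words.txt. The last two commands feed wc the UTF-8 bytes for "héllo" to show how byte and character counts can differ; the -m result depends on the current locale, so no output is claimed for it here:

```shell
# Sample file in a temporary directory
tmp=$(mktemp -d)
trap 'rm -rf "$tmp"' EXIT
printf 'one two\nthree\n' > "$tmp/words.txt"

# Lines, words and bytes together
wc "$tmp/words.txt"

# Individual counts (reading STDIN suppresses the file name)
wc -l < "$tmp/words.txt"
wc -w < "$tmp/words.txt"

# \303\251 is the two-byte UTF-8 encoding of "é"
printf 'h\303\251llo\n' | wc -c   # bytes
printf 'h\303\251llo\n' | wc -m   # characters, locale-dependent
```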
sort command
sort: Sorts text to STDOUT; the original file is unchanged
Syntax: sort [options] file(s)
Options:
- -r: Perform a reverse (descending) sort
- -n: Perform a numerical sort
- -f: Ignore (fold) the case of characters in strings
- -u: Unique (remove duplicate lines from the output)
- -t: Specify the column delimiter
- -k: Specify the column (key) to sort on
Examples:
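A sketch of the common options, assuming a made-up file users.txt with name:id pairs:

```shell
# Sample data in a temporary directory
tmp=$(mktemp -d)
trap 'rm -rf "$tmp"' EXIT
cat > "$tmp/users.txt" <<'EOF'
alice:1000
root:0
bob:1001
daemon:1
EOF

# Plain sort, comparing whole lines
sort "$tmp/users.txt"

# Reverse order
sort -r "$tmp/users.txt"

# Numeric sort on the second colon-delimited field
sort -t: -k2 -n "$tmp/users.txt"
```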
NOTE!: The argument to the -k option can be two numbers separated by a dot. In this case:
- The number before the dot is the field number
- The number after the dot is the character within that field at which the sort begins
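To illustrate the dot notation, here is a sketch with invented data in which the second field is a letter followed by a digit. Sorting on -k2 compares from the letter, while -k2.2 starts at the digit:

```shell
# Sample data in a temporary directory
tmp=$(mktemp -d)
trap 'rm -rf "$tmp"' EXIT
printf 'a:z1\nb:y3\nc:x2\n' > "$tmp/codes.txt"

# Key starts at field 2, character 1 (the letters x, y, z)
sort -t: -k2 "$tmp/codes.txt"

# Key starts at field 2, character 2 (the digits 1, 2, 3)
sort -t: -k2.2 "$tmp/codes.txt"
```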
Eliminating Duplicate Lines
sort -u: Removes duplicate lines from input
uniq: Removes duplicate adjacent lines from input
- uniq only compares neighbouring lines, so to remove all duplicate lines from a file, the input to uniq must first be sorted.
Options:
- -c: Produce a frequency listing. Each line is prefixed with the number of times it appears in the input
- -d: Print one copy of each line that is repeated in the input
- -u: Output only the lines that are truly unique, occurring only once in the input
- -fn: Avoid comparing the first n fields in each line
- -sn: Avoid comparing the first n characters in each line
Examples:
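A sketch with an invented file fruit.txt that contains repeated, non-adjacent lines, which is why each pipeline sorts first:

```shell
# Sample data in a temporary directory
tmp=$(mktemp -d)
trap 'rm -rf "$tmp"' EXIT
printf 'apple\nbanana\napple\ncherry\nbanana\napple\n' > "$tmp/fruit.txt"

# One copy of every distinct line (two equivalent ways)
sort "$tmp/fruit.txt" | uniq
sort -u "$tmp/fruit.txt"

# Frequency listing, most frequent first
sort "$tmp/fruit.txt" | uniq -c | sort -rn

# Only the lines that are repeated
sort "$tmp/fruit.txt" | uniq -d

# Only the lines that occur exactly once
sort "$tmp/fruit.txt" | uniq -u
```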
diff command
diff: Compare two files for differences.
- Use gvimdiff for a graphical diff (provided by the vim-X11 package).
Examples:
- Suppose a service on station1 is malfunctioning but the same service works on station2.
- Thanks to diff and the use of simple, text-based configuration files, we can easily compare the working and non-working configurations.
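A sketch of that comparison. The two file names follow the ones used later on this page, but their contents are invented for illustration. Note that diff exits with status 1 when the files differ, which is not an error:

```shell
# Two hypothetical configuration files in a temporary directory
tmp=$(mktemp -d)
trap 'rm -rf "$tmp"' EXIT
printf 'Hostname=station1\nPort=8080\nLogLevel=info\n' > "$tmp/file.conf-station1"
printf 'Hostname=station2\nPort=80\nLogLevel=info\n' > "$tmp/file.conf-station2"

# Lines marked "<" come from the first file, ">" from the second
status=0
diff "$tmp/file.conf-station1" "$tmp/file.conf-station2" || status=$?

# 0 = identical, 1 = different, 2 = trouble
echo "diff exit status: $status"
```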
Duplicating File Changes
diff -u: Unified diff (an alternate way of displaying the same information); best for the patch utility.
patch: Duplicate changes in other files (use with care!)
Options:
- -b: Automatically back up changed files.
Examples:
- To use patch, simply store the output of a diff -u in a file;
- And run the following command, which would make file.conf-station1 look like file.conf-station2
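The command itself did not survive on this page. One plausible form, assuming the patch is stored as file.conf.patch and using invented file contents, is sketched here:

```shell
# Two hypothetical configuration files in a temporary directory
tmp=$(mktemp -d)
trap 'rm -rf "$tmp"' EXIT
printf 'Hostname=station1\nPort=8080\nLogLevel=info\n' > "$tmp/file.conf-station1"
printf 'Hostname=station2\nPort=80\nLogLevel=info\n' > "$tmp/file.conf-station2"

# Store the unified diff; exit status 1 only means "files differ"
diff -u "$tmp/file.conf-station1" "$tmp/file.conf-station2" \
    > "$tmp/file.conf.patch" || [ "$?" -eq 1 ]

# Apply it to station1's file; -b keeps the original as file.conf-station1.orig
patch -b "$tmp/file.conf-station1" < "$tmp/file.conf.patch"

# cmp is silent and succeeds when the two files are now identical
cmp "$tmp/file.conf-station1" "$tmp/file.conf-station2"
```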
Beware!:
- Do you actually want all of the changes above to be made?
- It would be advisable to first edit file.conf.patch and remove the two lines describing the Hostname variable, since those should remain different between systems.
- If anything terrible happens, patch -b automatically creates a backup of each file it changes.
- Backups are given the .orig extension.
aspell command
aspell: An interactive spell checker.
- It offers suggestions for corrections via a simple menu-driven interface.
Tools for Manipulating Text
tr command
tr: Translate (alter) characters.
- Only reads data from STDIN.
- Converts characters in one set to corresponding characters in another set.
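A brief sketch. Because tr only reads STDIN, a file has to be supplied through a pipe or a redirect (greeting.txt is a made-up name):

```shell
# Map each lowercase letter to the matching uppercase letter
printf 'hello world\n' | tr 'a-z' 'A-Z'

# To process a file, redirect it into tr
tmp=$(mktemp -d)
trap 'rm -rf "$tmp"' EXIT
printf 'hello world\n' > "$tmp/greeting.txt"
tr 'a-z' 'A-Z' < "$tmp/greeting.txt"
```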
sed command
sed: Stream Editor
- Performs search/replace operations on a stream of data
  - As with grep, it is considered good practice to always quote sed's search/replace string
- By default, sed makes at most one change per line
  - To make multiple changes per line, append g (global) to the end of the search/replace pattern.
- sed searches are case-sensitive
  - To search case-insensitively, append i (case-insensitive) to the pattern.
- sed operates on all the lines of the file.
  - It is possible to restrict sed to specific lines by giving it an address.
- Normally does not alter the source file
  - Use -i.bak to back up and alter the source file
Examples:
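A sketch of each behaviour above, assuming a made-up file pets.txt. The i flag is shown as accepted by GNU sed; other sed implementations may not support it:

```shell
# Sample data in a temporary directory
tmp=$(mktemp -d)
trap 'rm -rf "$tmp"' EXIT
printf 'cat Cat cat\ndog cat\n' > "$tmp/pets.txt"

# Default: only the first match on each line is replaced
sed 's/cat/dog/' "$tmp/pets.txt"

# g: replace every match on each line
sed 's/cat/dog/g' "$tmp/pets.txt"

# gi: every match, ignoring case (GNU sed)
sed 's/cat/dog/gi' "$tmp/pets.txt"

# Address: apply the substitution to line 2 only
sed '2s/cat/dog/' "$tmp/pets.txt"

# -i.bak: edit the file in place, keeping the original as pets.txt.bak
sed -i.bak 's/cat/dog/g' "$tmp/pets.txt"
cat "$tmp/pets.txt"
```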
regex command
regex: Regular Expressions
- For more details see man 7 regex and man grep
/---------------------------------------------------\
| Metacharacter | Meaning                           |
|---------------------------------------------------|
| ^             | Line begins                       |
| $             | Line ends                         |
| [xyz]         | A character that is x, y or z     |
| [^xyz]        | A character that is not x, y or z |
\---------------------------------------------------/
Examples:
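A sketch that exercises each metacharacter in the table with grep, using an invented word list:

```shell
# Sample word list in a temporary directory
tmp=$(mktemp -d)
trap 'rm -rf "$tmp"' EXIT
printf 'cat\nconcat\ncatalog\nbat\nhat\n' > "$tmp/words.txt"

# ^ : lines that begin with "cat"
grep '^cat' "$tmp/words.txt"

# $ : lines that end with "cat"
grep 'cat$' "$tmp/words.txt"

# [bh] : exactly "bat" or "hat"
grep '^[bh]at$' "$tmp/words.txt"

# [^bh] : three-letter "?at" words whose first letter is neither b nor h
grep '^[^bh]at$' "$tmp/words.txt"
```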