Text Editing
Digital Development Team
The University of Auckland Library
Tools, tips, tricks
LIANZA ITSIG webinar series
Summary
• General (large) text files
  – We manage and manipulate text data daily
  – It’s tedious and time consuming
  – Find & Replace is too limited and dangerous
  – We know there must be a better way...
• Tabular data files (e.g. spreadsheets)
  – We work with these all the time, usually in Excel
  – What tools can help us clean messy data?
Topics
• Regular Expressions
• Text Editors
• Operating on lines, not entire files
• Google Refine
Regular Expressions
/^\s+[a-zA-Z0-9](?:\W+)/
• A way to describe a set of strings and capture parts of them
• Originated in old UNIX/POSIX tools
• Now used all over the place
• Test your regexes out on the web:
  – http://gskinner.com/RegExr/
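As a minimal sketch of the two ideas above (describing a set of strings, and capturing parts of them), the following Python uses the standard `re` module. The date format, the sample sentence and the name `reorder_date` are assumptions made for illustration and do not come from the talk.

```python
import re

# Hypothetical pattern: dates written as day/month/year, e.g. "7/3/2012".
# The three parenthesised groups capture the day, month and year.
DATE = re.compile(r"\b(\d{1,2})/(\d{1,2})/(\d{4})\b")

def reorder_date(text):
    """Rewrite every d/m/yyyy date found in text as yyyy-mm-dd."""
    def fix(match):
        day, month, year = match.groups()
        # Zero-pad day and month so the output always has two digits each.
        return f"{year}-{int(month):02d}-{int(day):02d}"
    return DATE.sub(fix, text)

print(reorder_date("Issued 7/3/2012, returned 21/03/2012."))
# prints: Issued 2012-03-07, returned 2012-03-21.
```

A plain Find & Replace cannot do this, because the replacement depends on what each match captured.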
Text Editors & Useful Languages
sed, grep, awk
Text Editors
• Word processors aren’t text editors
• Shop around, compare features
• My favourite: Vim (UNIX, Windows, Mac)
  – Wikipedia comparison of editor features
  – Wikipedia list of regex software
Useful Languages / Interpreters
• Perl
  – An old favourite, great for string manipulation
• Python
  – The cool kids tell me it’s better than Perl
• GREL
  – We’ll get to this later...
Line-by-line processing
while (<STDIN>) {    # read one line at a time into $_
    ....
}
• Large files are large!
  – If they’re big on disk, they’ll be big in memory
• Lines are (usually!) small
  – Read a line
  – Do something with it
  – Output the modified line
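The read / modify / output loop above can be sketched in Python roughly as follows. The clean-up rule in `clean_line` (collapsing runs of spaces and tabs) and the sample text are hypothetical; any per-line edit could take its place.

```python
import io
import re
import sys

def clean_line(line):
    """Hypothetical per-line edit: collapse runs of spaces and tabs into one
    space, strip trailing whitespace, and end the line with a newline."""
    return re.sub(r"[ \t]+", " ", line).rstrip() + "\n"

def process(infile, outfile):
    # Iterating over a file object yields one line at a time,
    # so only the current line has to be held in memory.
    for line in infile:
        outfile.write(clean_line(line))

# Demo on a small in-memory "file"; a real filter would instead call
# process(sys.stdin, sys.stdout).
sample = io.StringIO("title:   Moby   Dick  \nauthor:\tMelville\n")
process(sample, sys.stdout)
```

Because nothing is accumulated between iterations, memory use depends on the longest line and not on the size of the file.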
Google Refine
• Cleans messy tabular data
  – Easy faceting and filtering of columns/values
  – Easy transformation of values
• Google Refine Expression Language (GREL)
  – Extensive use of regular expressions and other standard string manipulation techniques
• Other features
  – Perform web service calls directly, reconcile row IDs
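To give a feel for what faceting and transforming a column mean, here is a rough Python stand-in. It is not GREL and not how Google Refine works internally; the column values and the functions `facet` and `normalise` are invented for illustration.

```python
from collections import Counter

# Hypothetical messy column, as it might look in a spreadsheet export.
publishers = [
    "Penguin",
    "penguin ",
    "PENGUIN",
    "Oxford  University Press",
    "oxford university press",
    "",
]

def facet(values):
    """Count how often each distinct value occurs (a simple text facet)."""
    return Counter(values)

def normalise(value):
    """Hypothetical clean-up: trim, collapse internal whitespace, title-case."""
    return " ".join(value.split()).title()

# Before cleaning, every spelling variant shows up as its own facet entry.
print(facet(publishers).most_common())

# After the transformation, the variants fall into the same entries.
cleaned = [normalise(v) for v in publishers]
print(facet(cleaned).most_common())
```

The facet shows where the mess is, and the transformation then fixes a whole group of values in one step.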
Conclusion
• Our problems are solvable!
  – Regular expressions
  – Decent text editors for general/unformatted text
  – Google Refine for tabular data
• Contact me
  – Please feel free to contact me with questions, corrections or ideas
  – [email protected]
  – Twitter: @kimshepherd
  – Google+: [email protected]