APPLICATIONS OF REGULAR EXPRESSIONS
Ankit G – 014
Gagan – 034
Nikhil R.K- 060
Parashuram - 065
INTRODUCTION
What is a Regular Expression?
• A regular expression (regex) describes a pattern that matches multiple input strings.
• Regular expressions descend from a fundamental concept in computer science: finite automata theory.
• Regular expressions are endemic to Unix.
• Some utilities/programs that use them:
– vi, ed, sed, and emacs
– awk, tcl, perl and Python
– grep, egrep, fgrep
– compilers
• The simplest regular expression is a string of literal characters to match.
• A string matches such a regular expression if it contains that literal as a substring.
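As a small illustration of literal matching, the sketch below uses Python's standard `re` module. The helper name `contains_match` is an assumption made for this example, not something from the slides.

```python
import re

def contains_match(pattern: str, text: str) -> bool:
    """Return True if some substring of text matches pattern."""
    # re.search scans the whole string, so a literal pattern
    # behaves like a substring test.
    return re.search(pattern, text) is not None

print(contains_match("grep", "egrep and fgrep"))
print(contains_match("grep", "sed and awk"))
```

The first call finds "grep" inside "egrep"; the second finds no occurrence.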
Application in Linux
The “egrep” Tool
Copyright © 2007 by Adam Webber
Text File Search
• Unix tool: egrep
• Searches a text file for lines that contain a substring matching a specified pattern
• Echoes all such lines to standard output
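The behaviour described above can be approximated in a few lines of Python. This is only a sketch: the function name `egrep_lines` and the sample data are assumptions, and Python's `re` syntax is close to, but not identical to, the extended regular expressions that egrep accepts.

```python
import re
from typing import Iterable, Iterator

def egrep_lines(pattern: str, lines: Iterable[str]) -> Iterator[str]:
    """Yield every line that contains a substring matching pattern."""
    regex = re.compile(pattern)
    for line in lines:
        if regex.search(line):
            yield line

# Hypothetical input standing in for the lines of a text file.
sample = ["sed edits streams", "grep finds lines", "awk processes fields"]

for line in egrep_lines(r"gr?ep|awk", sample):
    print(line)  # echo matching lines to standard output
```

To search a real file, the same function could be given an open file object, since iterating over a file yields its lines.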
In the Linux operating system:
Regular expressions are used by several different Unix commands, including ed, sed, awk, grep, and, to a more limited extent, vi.
sed also understands addresses. An address is either a particular location in a file or a range of lines to which an editing command should be applied. When sed encounters no address, it performs its operations on every line in the file.
sed, short for stream editor, is a stream-oriented editor created exclusively for executing scripts. All the input fed into it passes through to STDOUT, and the input file itself is not changed.
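The idea of an address-restricted substitution can be sketched in Python as follows. The function `sed_substitute` and its `address` parameter (a 1-based, inclusive line range) are assumptions made for illustration; real sed supports many more address forms. Note also that `re.sub` replaces every match on a line, which corresponds to sed's `s/…/…/g` rather than a plain `s/…/…/`.

```python
import re
from typing import List, Optional, Tuple

def sed_substitute(
    lines: List[str],
    pattern: str,
    replacement: str,
    address: Optional[Tuple[int, int]] = None,
) -> List[str]:
    """Return a new list of lines with pattern replaced inside the address range."""
    regex = re.compile(pattern)
    result = []
    for number, line in enumerate(lines, start=1):
        # No address means the command applies to every line.
        selected = address is None or address[0] <= number <= address[1]
        result.append(regex.sub(replacement, line) if selected else line)
    return result

text = ["cat one", "cat two", "cat three"]
print(sed_substitute(text, "cat", "dog", address=(2, 3)))
print(text)  # the input is left unchanged, as with sed
```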
Oracle's regular expression implementation is an extension of the POSIX (Portable Operating System Interface) standard.
Editing Commands
COMMANDS    ACTION
Insert
i, a        Insert text before, after cursor
I, A        Insert text before beginning, after end of line
o, O        Open new line for text below, above cursor
Editing Commands
COMMANDS    ACTION
Change
r           Replace character
cw          Change word
cc          Change current line
cmotion     Change text between the cursor and the target of motion
C           Change to end of line
R           Type over (overwrite) characters
s           Substitute: delete character and insert new text
S           Substitute: delete current line and insert new text
Application in Search Engine
One use of regular expressions that used to be very common was in web search engines.
Archie, one of the first search engines, used regular expressions exclusively to search through a database of filenames on servers.
Regular expressions were chosen for these early search engines because of both their power and their ease of implementation.
In the case of a search engine, the strings input to the regular expression would be either whole web pages or a pre-computed index of a web page that holds only the most important information from that web page.
A query such as "regular expression" could be translated into the following regular expression:
(Σ* regular Σ* expression Σ*)* ∪ (Σ* expression Σ* regular Σ*)*
Here Σ is the set of all characters in the character encoding used by the search engine.
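A minimal sketch of this translation in Python is shown below. The function name `query_to_regex` is an assumption, `.` with `re.DOTALL` stands in for Σ, and the outer star of the formula is omitted, so unlike the formula this version does not accept the empty string.

```python
import re

def query_to_regex(first: str, second: str) -> re.Pattern:
    """Build a regex accepting any text that contains both words, in either order."""
    a, b = re.escape(first), re.escape(second)
    # ".*" plays the role of Σ*; DOTALL lets "." also match newlines.
    return re.compile(f".*{a}.*{b}.*|.*{b}.*{a}.*", re.DOTALL)

pattern = query_to_regex("regular", "expression")

print(bool(pattern.fullmatch("a regular page about expression syntax")))
print(bool(pattern.fullmatch("expression first, regular second")))
print(bool(pattern.fullmatch("only regular here")))
```

The first two texts contain both words and are accepted; the third contains only one and is rejected.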
Regular expressions are no longer used in the large web search engines because, with the growth of the web, matching them against every page became impossibly slow. They are, however, still used in many smaller-scale search tools, such as the find/replace function of a text editor or command-line tools such as grep.
String matching is also used in web applications.
Regular Expressions in Lexical Analysis
To perform lexical analysis, two components are required: a scanner and a tokenizer.
The purpose of tokenization is to categorize the lexemes found in a string to sort them by meaning.
The process can be considered a sub-task of parsing input.
For example, the C programming language could contain tokens such as numbers, string constants, characters, identifiers (variable names), keywords, or operators.
We can simply define a set of regular expressions, one per token type, each matching the valid set of lexemes that belong to that token type. This is the process of scanning.
This process can be quite complex and may require more than one pass to complete.
Another option is to use a process known as backtracking.
For example, to determine if a lexeme is a valid identifier in C, we could use the following regular expression:
[a-zA-Z_][a-zA-Z_0-9]*
This regular expression says that identifiers must begin with a Latin letter or an underscore and may be followed by any number of letters, underscores, or digits.
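The scanning step can be sketched as a small Python tokenizer built from one regular expression per token type, including the identifier pattern `[A-Za-z_][A-Za-z_0-9]*`. The names `TOKEN_SPEC`, `KEYWORDS`, and `tokenize`, and the deliberately tiny set of token types, are assumptions for illustration; this is far from a complete C lexer.

```python
import re

# One regular expression per token type (a small, hypothetical subset of C).
TOKEN_SPEC = [
    ("NUMBER", r"\d+"),
    ("IDENTIFIER", r"[A-Za-z_][A-Za-z_0-9]*"),
    ("OPERATOR", r"[+\-*/=]"),
    ("SKIP", r"\s+"),
]
MASTER = re.compile("|".join(f"(?P<{name}>{rx})" for name, rx in TOKEN_SPEC))
KEYWORDS = {"int", "return", "if", "else", "while"}

def tokenize(source: str):
    """Split source into (token type, lexeme) pairs."""
    tokens = []
    pos = 0
    while pos < len(source):
        match = MASTER.match(source, pos)
        if match is None:
            raise ValueError(f"unexpected character {source[pos]!r} at {pos}")
        kind, lexeme = match.lastgroup, match.group()
        # Keywords look like identifiers, so reclassify them after matching.
        if kind == "IDENTIFIER" and lexeme in KEYWORDS:
            kind = "KEYWORD"
        if kind != "SKIP":
            tokens.append((kind, lexeme))
        pos = match.end()
    return tokens

print(tokenize("int x = y1 + 42"))
```

Combining the patterns into one alternation with named groups lets a single match both find the next lexeme and report which token type it belongs to.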
CONCLUSION
Both regular expressions and finite-state automata represent regular languages.
The basic regular expression operations are: concatenation, union/disjunction, and Kleene closure.
The regular expression language is a powerful pattern-matching tool.
Any regular expression can be automatically compiled into an NFA, then converted to a DFA, and finally reduced to a unique minimum-state DFA.
An FSA can use any set of symbols for its alphabet, including letters and words.
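To make the link between regular expressions and automata concrete, the sketch below converts a small hand-written NFA into a DFA with the subset construction and checks it against the regular expression `(a|b)*ab`. The NFA, the names `subset_construction` and `dfa_accepts`, and the two-letter alphabet are all assumptions chosen for this example; the NFA has no ε-transitions, and minimization is not shown.

```python
import re
from itertools import product

# Hypothetical NFA for strings over {a, b} that end in "ab".
NFA = {(0, "a"): {0, 1}, (0, "b"): {0}, (1, "b"): {2}}
NFA_START, NFA_ACCEPT, ALPHABET = 0, {2}, "ab"

def subset_construction(nfa, start, accept, alphabet):
    """Build a DFA whose states are sets of NFA states."""
    start_set = frozenset([start])
    dfa, seen, todo = {}, {start_set}, [start_set]
    while todo:
        current = todo.pop()
        for symbol in alphabet:
            # Union of all NFA moves from any state in the current set.
            target = frozenset(
                t for s in current for t in nfa.get((s, symbol), ())
            )
            dfa[(current, symbol)] = target
            if target not in seen:
                seen.add(target)
                todo.append(target)
    accepting = {state for state in seen if state & accept}
    return dfa, start_set, accepting

def dfa_accepts(dfa, start, accepting, text):
    """Run the DFA on text (symbols must come from the alphabet)."""
    state = start
    for symbol in text:
        state = dfa[(state, symbol)]
    return state in accepting

DFA, DFA_START, DFA_ACCEPT = subset_construction(NFA, NFA_START, NFA_ACCEPT, ALPHABET)

# Compare the DFA with the regular expression on all short strings.
regex = re.compile(r"(a|b)*ab")
agree = all(
    dfa_accepts(DFA, DFA_START, DFA_ACCEPT, "".join(w)) == bool(regex.fullmatch("".join(w)))
    for n in range(6)
    for w in product(ALPHABET, repeat=n)
)
print(agree)
```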
THANK YOU