Anti-Virus Product Development
Cliff Penton
Head of Software Development
Sophos Plc
Slides © 1999 Sophos Plc
http://www.sophos.com/
Who are Sophos?
Founded in 1980 as an electronic design partnership
Moved into data security in 1985
In 1989, among the first to respond to computer viruses
Anti-Virus is the main focus of the business
World leading enterprise-wide anti-virus software
Cover more platforms than any other anti-virus vendor
What do we make?
Conventional product development
[Figure: release timeline showing Word 95, Office 97, SR1, SR2, Office 2000]
Anti-virus product development...
[Figure: timeline from November to April with three tracks: product features, OS developments, virus research]
Anti-virus product development...
Presents simultaneous development challenges:
Complexity
Transparency
Quality
Regularity
Anti-virus product development...
User Interface
Virus Detection Engine
Virus Descriptions
Coping with the complexity
Anti-virus product development...
There are many issues, but I will focus on two today...
Multiple operating systems:
DOS/Windows 3.x, Windows 95/98, Windows NT,
OS/2, NetWare, Macintosh, OpenVMS, Unix...
Dealing with multiple languages:
English, French, German, Spanish, Japanese...
Multiple operating systems
The key issues in cross-platform development are:
Endianism
Packing and alignment
Multitasking
Memory management
File I/O
Endianism
Different hardware platforms store the bytes of a
multi-byte number in memory in a different order
Big endian (e.g. SPARC)
Little endian (e.g. Intel)
When exchanging information between platforms, be
aware of endian-related problems
Endianism
01 02 03 04
Big endian: 0x01020304
Little endian: 0x04030201
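One way to avoid depending on the host byte order is to assemble the value from individual bytes. The sketch below is illustrative only; the function names read_be32() and read_le32() are assumptions, not part of any product code.

```c
#include <stdint.h>

/* Interpret four bytes as a 32-bit number stored most significant byte first. */
uint32_t read_be32(const unsigned char *p)
{
    return ((uint32_t)p[0] << 24) | ((uint32_t)p[1] << 16) |
           ((uint32_t)p[2] << 8)  |  (uint32_t)p[3];
}

/* Interpret the same four bytes as stored least significant byte first. */
uint32_t read_le32(const unsigned char *p)
{
    return ((uint32_t)p[3] << 24) | ((uint32_t)p[2] << 16) |
           ((uint32_t)p[1] << 8)  |  (uint32_t)p[0];
}
```

Given the bytes 01 02 03 04, read_be32() yields 0x01020304 and read_le32() yields 0x04030201 on any host, because only shifts on single bytes are used.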
Packing and alignment
Some platforms strictly enforce data alignment when
reading and writing memory
Careless memory references may lead to disaster
(SIGBUS, or GPF)
Usually happens when reading structures from a file
with packing set to single byte
Better to read/write struct elements by assignment
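Reading and writing element by element might look like the sketch below. The record layout (a 7-byte little-endian disk image) and the names record, record_unpack() and record_pack() are hypothetical, chosen only to illustrate the technique.

```c
#include <stdint.h>

/* In-memory form: the compiler may pad and align this however it likes. */
typedef struct {
    uint32_t a;
    uint8_t  b;
    uint16_t c;
} record;

/* Fill *out from a 7-byte disk image (4 + 1 + 2 bytes, little endian).
   Only single bytes are read from buf, so buf may have any alignment. */
void record_unpack(const unsigned char *buf, record *out)
{
    out->a = (uint32_t)buf[0] | ((uint32_t)buf[1] << 8) |
             ((uint32_t)buf[2] << 16) | ((uint32_t)buf[3] << 24);
    out->b = buf[4];
    out->c = (uint16_t)(buf[5] | (buf[6] << 8));
}

/* Write *in to a 7-byte disk image, again one byte at a time. */
void record_pack(const record *in, unsigned char *buf)
{
    buf[0] = (unsigned char)(in->a & 0xFF);
    buf[1] = (unsigned char)((in->a >> 8) & 0xFF);
    buf[2] = (unsigned char)((in->a >> 16) & 0xFF);
    buf[3] = (unsigned char)((in->a >> 24) & 0xFF);
    buf[4] = in->b;
    buf[5] = (unsigned char)(in->c & 0xFF);
    buf[6] = (unsigned char)((in->c >> 8) & 0xFF);
}
```

Because no multi-byte value is ever read through a cast pointer, the same code is safe on platforms that fault on misaligned access, and it also fixes the byte order of the file format.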
Packing and alignment
typedef struct { long a; char b; short c;} x;
How big is this structure?
8 with default packing
7 with 1-byte packing
(Compiling with Visual C++ 6.0)
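The two sizes can be checked with a sketch like the one below. It is an assumption-laden illustration: fixed-width types stand in for long and short (which are 4 and 2 bytes under Visual C++ 6.0) so that the result is the same on platforms where long is 8 bytes, and the type names x_default and x_packed are invented for the example. #pragma pack is a compiler extension, although Visual C++, GCC and Clang all accept this form.

```c
#include <stddef.h>
#include <stdint.h>

/* Default packing: one padding byte follows b so that c is 2-byte aligned. */
typedef struct { int32_t a; char b; int16_t c; } x_default;

#pragma pack(push, 1)   /* 1-byte packing: members are laid out back to back */
typedef struct { int32_t a; char b; int16_t c; } x_packed;
#pragma pack(pop)

/* Helpers that report the layout chosen by the compiler. */
size_t x_default_size(void)     { return sizeof(x_default); }
size_t x_packed_size(void)      { return sizeof(x_packed); }
size_t x_default_c_offset(void) { return offsetof(x_default, c); }
size_t x_packed_c_offset(void)  { return offsetof(x_packed, c); }
```

With default packing c sits at offset 6 and the structure occupies 8 bytes; with 1-byte packing c sits at offset 5, which is exactly the misaligned access that strict platforms reject.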
Multitasking
Different operating systems use different scheduling
schemes
Cooperative/competitive multitasking
Preemptive multitasking
Tight loops and other compute-bound operations need
careful tweaking (for example, yielding at intervals) to
maintain performance on cooperative multitasking systems
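A compute-bound loop can hand control back to the system at intervals. The sketch below assumes a hypothetical yield hook supplied per platform; yield_fn, YIELD_INTERVAL, counting_yield() and checksum_with_yield() are invented names, and the interval is an arbitrary tuning value.

```c
#include <stddef.h>

/* Hypothetical hook: a cooperative system would pass a function that calls
   the OS yield primitive; a preemptive system can pass NULL. */
typedef void (*yield_fn)(void);

enum { YIELD_INTERVAL = 4096 };   /* bytes processed between yields */

/* Stand-in yield used for demonstration: it only counts the calls. */
unsigned long yield_calls = 0;
void counting_yield(void) { yield_calls++; }

/* Sum the bytes of a buffer, yielding every YIELD_INTERVAL bytes. */
unsigned long checksum_with_yield(const unsigned char *data, size_t len,
                                  yield_fn yield)
{
    unsigned long sum = 0;
    size_t i;
    for (i = 0; i < len; i++) {
        sum += data[i];
        if (yield != NULL && (i + 1) % YIELD_INTERVAL == 0)
            yield();
    }
    return sum;
}
```

Yielding too often wastes time in the scheduler and yielding too rarely starves other tasks, which is why the interval usually needs tuning per platform.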
Memory management
Not all operating systems have virtual memory, so we
cannot rely on malloc() and free()
Some require explicit virtual memory management,
such as DOS and NetWare
Need to use an intermediate layer to conditionally
choose between implicit and explicit virtual memory
management
Memory management
Explicit virtual memory management involves:
Allocating a handle to a memory block
Locking the handle to get a pointer to physical memory
Using the memory as usual
Unlocking the handle, releasing physical memory
Deallocating the handle when finished
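The five steps above can be sketched as a small intermediate layer. Everything here is hypothetical (mem_handle, mem_alloc(), mem_lock(), mem_unlock(), mem_free()); this version is the build for platforms with implicit virtual memory, where the handle simply wraps malloc(). A DOS or NetWare build would keep the same interface and replace the function bodies with the operating system's own allocate, lock and unlock calls.

```c
#include <stdlib.h>

typedef struct mem_block {
    void  *ptr;         /* backing storage in this malloc()-based build */
    size_t size;
    int    lock_count;  /* number of outstanding locks */
} mem_block;
typedef mem_block *mem_handle;

/* Step 1: allocate a handle to a memory block. */
mem_handle mem_alloc(size_t size)
{
    mem_handle h = malloc(sizeof *h);
    if (h == NULL)
        return NULL;
    h->ptr = malloc(size);
    if (h->ptr == NULL) {
        free(h);
        return NULL;
    }
    h->size = size;
    h->lock_count = 0;
    return h;
}

/* Step 2: lock the handle to obtain a usable pointer. */
void *mem_lock(mem_handle h)
{
    if (h == NULL)
        return NULL;
    h->lock_count++;
    return h->ptr;
}

/* Step 4: unlock; the pointer must not be used after this. */
void mem_unlock(mem_handle h)
{
    if (h != NULL && h->lock_count > 0)
        h->lock_count--;
}

/* Step 5: deallocate. Returns -1 if the block is still locked. */
int mem_free(mem_handle h)
{
    if (h == NULL || h->lock_count != 0)
        return -1;
    free(h->ptr);
    free(h);
    return 0;
}
```

Callers written against this interface never hold a pointer across an unlock, so the same source compiles for both kinds of platform.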
File I/O
File I/O primitives differ between operating systems
File security considerations need to be taken into
account
Standard library calls may not provide the required
functionality
Multiple languages
Our Windows products ship in five languages:
English, French, German, Spanish, and Japanese
Introduces issues of character encoding:
UNICODE vs. SBCS vs. MBCS
Adds the overhead of translation to the development
process, which can be significant
Internationalisation
Character sets, alphabets and character encoding
Code pages
Dates and times
Generic coding techniques
Adding resources for multiple languages
English language
26 letters plus other characters, fewer than 256 in total
7 bits == ASCII or 8 bits == ANSI
1 character == 1 byte
SBCS or Single Byte Character Set
Very familiar to anyone who has used strxxx()
functions
European languages
Accented characters are part of many languages
à, ô: French
ñ, ¡, ¿: Spanish
ö, ß: German
Characters 0-127 are the same (ASCII)
Characters 128-255 are called extended characters
Still SBCS, but requires code pages...
Code pages
The extended characters of each language are
supported via code pages.
The code pages in DOS and Windows are different!
DOS - English (British) code page 850 (Latin 1)
DOS - English (US) code page 437 (Latin US)
Windows Latin 1 (ANSI) code page 1252
Example code page problem
[Figure: the same text shown under DOS CP 850 and under Windows CP 1252]
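As an illustration of the mismatch, the pound sign is byte 0x9C in CP 850 but byte 0xA3 in CP 1252, so text written under DOS shows the wrong character under Windows unless it is converted. The sketch below maps only a handful of extended characters; a real converter needs all 128 entries, and the names cp_pair and cp850_to_cp1252() are assumptions made for the example.

```c
#include <stddef.h>

typedef struct { unsigned char cp850; unsigned char cp1252; } cp_pair;

/* Partial CP 850 -> CP 1252 table (sample entries only). */
static const cp_pair cp850_sample[] = {
    { 0x81, 0xFC },   /* u with diaeresis */
    { 0x82, 0xE9 },   /* e with acute accent */
    { 0x9C, 0xA3 },   /* pound sign */
    { 0xE1, 0xDF }    /* German sharp s */
};

/* Convert one byte. ASCII passes through unchanged; extended characters
   missing from the sample table come back as '?'. */
unsigned char cp850_to_cp1252(unsigned char c)
{
    size_t i;
    if (c < 0x80)
        return c;
    for (i = 0; i < sizeof cp850_sample / sizeof cp850_sample[0]; i++) {
        if (cp850_sample[i].cp850 == c)
            return cp850_sample[i].cp1252;
    }
    return '?';
}
```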
Far East languages
Now the fun begins…
Chinese has more than 10,000 characters
Japanese has several character types:
Hiragana: phonetic characters
Katakana: phonetic characters, used to spell
words taken from foreign languages
Kanji: characters of Chinese origin
Double byte character encoding
Say hello to DBCS, or Double Byte Character Set, where:
0x00 -> 0x7F is ASCII as usual
0x80 -> 0xFF is a combination of Kana (single-byte),
and Kanji lead-bytes
Used on Win95, WinNT, Mac, NetWare, OS/2
Double byte character encoding
Programming for DBCS
1 character != 1 byte
If the character is double-byte, both bytes of the
character must be dealt with together
0x00 is always NUL, so it is safe to scan a string for '\0'
Trail byte values can be confused with other characters
(e.g. \) if not handled properly
Never scan with pointer arithmetic (i.e. ptr++)
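The trail-byte problem can be shown with Shift-JIS (code page 932), where some Kanji have 0x5C, the backslash value, as their second byte. The sketch below is a portable illustration with invented names (sjis_is_lead_byte(), sjis_find_last()); on Win32 the usual route is to let the system decide with IsDBCSLeadByte() and CharNext() rather than hard-coding ranges.

```c
#include <stddef.h>

/* Shift-JIS lead bytes fall in 0x81-0x9F and 0xE0-0xFC. */
int sjis_is_lead_byte(unsigned char c)
{
    return (c >= 0x81 && c <= 0x9F) || (c >= 0xE0 && c <= 0xFC);
}

/* Find the last occurrence of the single-byte character ch in s, never
   matching the trail byte of a double-byte character.
   Returns NULL if ch does not occur. */
const unsigned char *sjis_find_last(const unsigned char *s, unsigned char ch)
{
    const unsigned char *last = NULL;
    while (*s != '\0') {
        if (sjis_is_lead_byte(*s) && s[1] != '\0') {
            s += 2;            /* step over lead byte and trail byte together */
        } else {
            if (*s == ch)
                last = s;
            s += 1;
        }
    }
    return last;
}
```

For the bytes 'a', '\\', 0x95, 0x5C, 'b' a naive strrchr() search for the backslash lands on the trail byte at index 3, whereas sjis_find_last() returns the real separator at index 1.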
UNICODE
Instead of using 1 byte per character, Unicode uses 2
bytes per character
65536 possible characters in one character set
No need for code pages
UNICODE
Word breaking
Sentences in Japanese do not have spaces between
words.
Sentences can be broken at any Japanese character.
Break sentences on spaces and lead bytes.
Dates and times
Date and time representations are not universal
UK: 22/05/98
USA: 05/22/98
Japan: 10Y 05M 22D
Either use an OS call (e.g. GetDateFormat() on
Windows), or
Embed a date format string in a language-dependent
resource
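The second option can be sketched with the standard strftime() function. The table below stands in for a language-dependent resource; date_lang, date_formats and format_date() are hypothetical names, and only the UK and USA orderings are shown because an era-based year needs more than a format string.

```c
#include <stddef.h>
#include <time.h>

typedef enum { DATE_UK, DATE_US, DATE_LANG_COUNT } date_lang;

/* One format string per language, as it might be stored in a resource. */
static const char *const date_formats[DATE_LANG_COUNT] = {
    "%d/%m/%y",   /* UK:  day/month/year */
    "%m/%d/%y"    /* USA: month/day/year */
};

/* Format *when for the given language. Returns the number of characters
   written, or 0 if the buffer is too small (as strftime() does). */
size_t format_date(date_lang lang, const struct tm *when,
                   char *out, size_t out_size)
{
    return strftime(out, out_size, date_formats[lang], when);
}
```

For 22 May 1998 this gives 22/05/98 with DATE_UK and 05/22/98 with DATE_US, and the code itself contains no knowledge of the ordering.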
Generic coding techniques
Use the libraries available, e.g. for Win32
_tcsinc() maps to
_strinc() for SBCS
_mbsinc() for MBCS
_wcsinc() for UNICODE
Use TCHAR not char
Always enclose literal text with the _T() macro
Formatting messages
Never concatenate two strings to form a sentence
Take care when using printf(), as language
variations may dictate reordering of insertion objects
Win32 can use FormatMessage()
NetWare can use NWprintf() etc.
External resources
Avoid hard-coding text into application source code
Win32, Mac and OS/2 use resource files
NetWare uses message databases
DOS, VMS, etc. have to store strings in separate
modules, which are linked individually or loaded at run
time
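For platforms without resource files, one possible arrangement is a table of strings indexed by message ID, with one table per language. The sketch below keeps all tables in one file for brevity; in practice each language's table would live in its own module, linked or loaded separately. The identifiers msg_id, ui_lang and get_message(), and the sample messages, are hypothetical.

```c
#include <stddef.h>

typedef enum { MSG_FILE_NOT_FOUND, MSG_READ_ERROR, MSG_COUNT } msg_id;
typedef enum { UI_ENGLISH, UI_FRENCH, UI_GERMAN, UI_LANG_COUNT } ui_lang;

/* One row per language, one column per message ID. */
static const char *const messages[UI_LANG_COUNT][MSG_COUNT] = {
    { "File not found",       "Read error" },
    { "Fichier introuvable",  "Erreur de lecture" },
    { "Datei nicht gefunden", "Lesefehler" }
};

/* Look up a message; returns an empty string for an unknown language or ID. */
const char *get_message(ui_lang lang, msg_id id)
{
    if ((unsigned)lang >= UI_LANG_COUNT || (unsigned)id >= MSG_COUNT)
        return "";
    return messages[lang][id];
}
```

Application code then refers to messages only by ID, so adding a language means adding a table rather than editing program logic.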
Delivering multiple languages
Multiple language resources linked into executable --
good for small programs with limited text
Multiple executables -- e.g. SWEEP for DOS
Multiple resource-only DLLs -- extremely flexible
solution if OS supports DLLs
Multiple text-only message files for text-only operating
systems