Document Data Mining
Design Review
November 18, 2010
Team Members: Dallas Stinger, Wenlong Huang, Aaron Phillips
Advisor: Gregory Donohoe, Ph.D.
The Problem
• State Board collects meeting minutes and other documents recording decisions made
• Board members want to retrieve text from old documents that relate to current issues
  – May not recall when issue was discussed
  – May not know exact keywords to search for
The Existing Solution
• Currently, all files exist on a large, unorganized shared network drive.
• Finding information recorded in documents requires knowing when it was recorded, and in which document.
Multiple File Types
• System limited to the major file types
  – Word documents (.doc, .docx)
  – PDF files (.pdf)
  – Excel (.xls, .xlsx)
  – Text (.txt)
• Lacking
  – WordPerfect (.wpd)
  – PDF files that were scanned in
  – OpenOffice document types
Multi-User Access
Web Based
• Pros:
  – Information searchable anywhere
  – Only one index required
  – Indexing runs on a regular basis without interruption
• Cons:
  – File permissions
Individual User Application
• Pros:
  – Can be programmed to learn user behavior
  – Apply more emphasis to files he/she used before (looks at search history to aid in new searches)
• Cons:
  – Software package installed on each user's machine
Search Collection of Documents Efficiently
• Real Time Searching
  – Pros:
    • Easy
    • No initial overhead
  – Cons:
    • Time consuming (> 100,000 words)
    • Unable to find non-exact search results
• Reverse Indexing
  – Pros:
    • Fast and efficient
    • Able to find useful information without exact search text known
  – Cons:
    • Large initial overhead (pre-analyze all documents)
    • Keep index file up to date
    • Storage space necessary
Results displayed in less than a second
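The reverse-indexing trade-off above can be illustrated with a minimal sketch, written in Python for brevity rather than in the project's C#. The function names and sample documents are hypothetical and are not taken from the team's code. Building the index reads every document once (the initial overhead); after that, a lookup is a single dictionary access.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Lowercase the text and split it into alphanumeric words."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(documents):
    """Map each word to {document name: [positions of that word]}.

    This is the up-front cost: every document is read once.
    """
    index = defaultdict(dict)
    for name, text in documents.items():
        for position, word in enumerate(tokenize(text)):
            index[word].setdefault(name, []).append(position)
    return index


def lookup(index, word):
    """Return the sorted names of the documents that contain the word."""
    return sorted(index.get(word.lower(), {}))


# Hypothetical sample data, for illustration only.
documents = {
    "minutes_2008.txt": "The board approved the budget",
    "minutes_2009.txt": "Budget review postponed by the board",
}
index = build_index(documents)
print(lookup(index, "budget"))
```

Storing word positions, not just document names, costs extra space but makes it possible to reason later about how far apart the matched words are.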
Find Useful Information Without Exact String Specification (A: Stemming)
• Create our own
  – Pros:
    • Pay attention to details that may be lacking in existing algorithms (aglet vs. readable)
    • More efficient
    • Define special cases
  – Cons:
    • Requires a lot of time
• Use existing algorithm
  – Pros:
    • Readily available
    • Spend more time on other important details
  – Cons:
    • Special cases incorrect
    • Some root words are truncated
Porter Stemming Algorithm
• Large set of rule-based steps, built around English word endings, that reduce a word to its root
• Extensively used in programs
• Outdated: Results not always correct
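To give a feel for what one of those steps looks like, here is a sketch of step 1a only (plural endings) from Porter's published rule list, written in Python for brevity. It is not the full algorithm and not the team's implementation; the function name is hypothetical.

```python
def porter_step_1a(word):
    """Apply only step 1a of the Porter stemmer (plural endings).

    Rules are tried in order and the first matching suffix wins:
    SSES -> SS, IES -> I, SS -> SS, S -> (removed).
    """
    if word.endswith("sses"):
        return word[:-2]  # caresses -> caress
    if word.endswith("ies"):
        return word[:-2]  # ponies -> poni
    if word.endswith("ss"):
        return word       # caress -> caress
    if word.endswith("s"):
        return word[:-1]  # cats -> cat
    return word


for sample in ["caresses", "ponies", "caress", "cats"]:
    print(sample, "->", porter_step_1a(sample))
```

The "ponies" to "poni" case shows why stems are not always dictionary words: the stemmer only needs related forms to collapse to the same string, which is also why some results look truncated.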
Find Useful Information Without Exact String Specification (B: Thesaurus)
• Own Model
  – Pros:
    • Fine tune thesaurus to have only relevant terms (terms that exist inside our index file)
  – Cons:
    • Very time consuming and complex
• Using pre-built Thesaurus
  – Pros:
    • Quick and easy to use
    • Very extensive
  – Cons:
    • Has irrelevant search term results
    • Unnecessary terms for State Board
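One possible middle ground, sketched here in Python under assumptions of our own, is to start from a pre-built thesaurus and drop every synonym that never appears in the index. The function name, the thesaurus entries, and the indexed vocabulary below are all hypothetical.

```python
def trim_thesaurus(thesaurus, indexed_words):
    """Keep only the synonyms that actually occur in the index.

    Words left with no usable synonyms are dropped entirely.
    """
    trimmed = {}
    for word, synonyms in thesaurus.items():
        kept = [s for s in synonyms if s in indexed_words]
        if kept:
            trimmed[word] = kept
    return trimmed


# Hypothetical data: a tiny general-purpose thesaurus and the set of
# words found in the document index.
thesaurus = {
    "budget": ["funds", "allocation", "purse"],
    "meeting": ["session", "assembly", "rendezvous"],
    "vote": ["ballot"],
}
indexed_words = {"budget", "funds", "allocation", "session", "meeting"}
print(trim_thesaurus(thesaurus, indexed_words))
```

A synonym that never occurs in any indexed document cannot produce a match, so removing it shrinks the search without losing results.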
Searching
• User types in search criteria
  – Determine whether they want Narrow Search or Broad Search results
    • May retrieve too many results in Broad Search
• Search algorithm converts each typed word into a list of possible stems and synonyms
• Tries all possible permutations of those words to find the closest match to the search
• Calculate standard deviation of the distance between all of the words
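The expansion and spacing steps above could be sketched as follows, in Python for brevity. This is one reading of the slide, not the team's algorithm: the helper names and lookup tables are hypothetical, word reordering is left out, and "distance between all of the words" is assumed to mean the gaps between consecutive matched positions.

```python
from itertools import product
from statistics import pstdev


def expand(word, stems, synonyms):
    """Return the word, its stem, and its synonyms, without duplicates."""
    candidates = [word, stems.get(word, word)] + synonyms.get(word, [])
    return list(dict.fromkeys(candidates))  # dedupe, keep order


def candidate_queries(words, stems, synonyms):
    """Build every query that picks one candidate per typed word."""
    return list(product(*(expand(w, stems, synonyms) for w in words)))


def spread(positions):
    """Standard deviation of the gaps between consecutive matches.

    Assumption: evenly spaced matches score 0.0, uneven ones score higher.
    """
    ordered = sorted(positions)
    gaps = [b - a for a, b in zip(ordered, ordered[1:])]
    return pstdev(gaps) if gaps else 0.0


# Hypothetical lookup tables, for illustration only.
stems = {"meetings": "meet"}
synonyms = {"budget": ["funds"]}
for query in candidate_queries(["budget", "meetings"], stems, synonyms):
    print(query)
print(spread([4, 5, 6]), spread([2, 10, 11]))
```

The number of candidate queries is the product of the candidate counts for each word, so every extra search term or synonym multiplies the work.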
Searching (cont.)
• Each file is ranked based on the number of matches it contains
  – Exact matches rank highest
  – Reordering of exact match is ranked next
  – Stems, synonyms, partial matches, and large spacing between searched words rank lowest
• All rank values found inside a file are summed
• Highest ranked files considered most relevant
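The ranking rules above can be sketched in a few lines of Python. The three match kinds follow the slide, but the numeric weights are placeholders chosen for illustration, and the function and file names are hypothetical.

```python
# Hypothetical weights: only their ordering (exact > reordered > loose)
# comes from the slide.
WEIGHTS = {"exact": 10, "reordered": 5, "loose": 1}


def score_file(matches):
    """Sum the rank value of every match found inside one file."""
    return sum(WEIGHTS[kind] for kind in matches)


def rank_files(matches_by_file):
    """Return file names ordered from most to least relevant."""
    return sorted(
        matches_by_file,
        key=lambda name: score_file(matches_by_file[name]),
        reverse=True,
    )


# Hypothetical match lists; "loose" covers stems, synonyms, partial
# matches, and widely spaced words.
matches_by_file = {
    "c.txt": ["loose"],
    "a.doc": ["exact", "loose"],
    "b.pdf": ["reordered", "loose"],
}
print(rank_files(matches_by_file))
```

Because the values are summed, a file with many weak matches can outrank a file with one exact match, so the gap between weights controls how much exactness matters.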
Unit Testing
DocumentTest:
/// Returns the document location
public void getFileLocationTest()
{
    convertPDF converpdf = new convertPDF("D:\\Class\\test.pdf");
    string expected = "D:\\Class\\test.pdf";
    string actual = converpdf.getFileLocation();
    // Assert.AreEqual takes the expected value first, then the actual value.
    Assert.AreEqual(expected, actual);
}
Unit Testing (cont.)
/// Creates a word count, in alphabetical order, for all words located inside the PDF
public void createDictionaryTest()
{
    convertPDF converpdf = new convertPDF("D:\\Class\\test.pdf");
    string toDictionary = "this is test code code code";
    converpdf.createDictionary(toDictionary);
    int actual;
    // "code" appears three times in the input string.
    converpdf.WordCounts.TryGetValue("code", out actual);
    Assert.AreEqual(3, actual);
}
End of Semester Status
• Goals:
  – Working, tested prototype
  – Documentation for future teams
• Plenty of areas open for extension or improvement
Future Possibilities: File Types
• Currently supported file types
  – Microsoft Word
  – Microsoft Excel
  – PDF
• No optical character recognition
• Our system will allow for easy extension
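One way such extensibility could be structured is a registry that maps file extensions to text extractors, so that supporting a new type means adding one function. The sketch below is in Python for brevity and is an assumption on our part, not a description of the team's C# design. All names are hypothetical, and only a plain-text extractor is shown.

```python
import os

# Maps a lowercase file extension to the function that extracts its text.
EXTRACTORS = {}


def register(*extensions):
    """Decorator that registers an extractor for one or more extensions."""
    def decorator(func):
        for ext in extensions:
            EXTRACTORS[ext.lower()] = func
        return func
    return decorator


@register(".txt")
def extract_plain_text(path):
    """Read a plain-text file as UTF-8."""
    with open(path, encoding="utf-8") as handle:
        return handle.read()


def extract_text(path):
    """Dispatch to the extractor registered for the file's extension."""
    ext = os.path.splitext(path)[1].lower()
    if ext not in EXTRACTORS:
        raise ValueError(f"unsupported file type: {ext or path}")
    return EXTRACTORS[ext](path)


print(sorted(EXTRACTORS))
```

Under this layout, an extractor for scanned PDFs or WordPerfect files would be one more registered function, and the indexing code that calls `extract_text` would not change.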
Future Possibilities: Indexing
• We have a relatively simple indexing scheme
• More complex indexing would lead to decreased search time
• Our indexing scheme is very general
  – Could be specific to the State Board
  – Could lead to more relevant results
Future Possibilities: Searching
• Search time increases quickly as search terms are added
• Thesaurus is broad
  – Large number of synonyms can slow search
  – Could be trimmed to fit domain
• Porter stemming algorithm could be replaced
Future Possibilities: Correlation
• Related documents should be correlated
  – By date?
  – Using a tagging system?
Future Possibilities: Decision Database
• A client need that is not addressed by our software
• Many board decisions have been passed, with varying lifetimes
• A database could track all board decisions and lifespan
• Possible connection to our search engine?
Future Possibilities: Web-Based Interface
• Software will be installed on each user’s computer
• GUI could be web based, with access restricted to State Board employees
• Users could search from home or while on the road, not just in the office
• Indexing would be simplified