You are here

Fast Approximate String Searching for Wikipedia, P2P and Biological Sequences

Ela Hunt

Approximate string searching on natural language text uses database indexing only to a limited extent and does not work for short words. In biological string searching, indexing avenues are being actively investigated. In both contexts, suffix trees and n-grams are the main index types used.

I will present a new development in indexing for natural language, with application to web documents and tested in both client-server and P2P scenarios, and possibly extensible to biological string searching. I will first discuss the concept of the deletion neighbourhood and outline some complexity issues related to this idea. Then, I will move to the application areas where the concept proved to be beneficial. I will summarise the results of various performance tests with Wikipedia, Moby Dick, and natural language dictionaries, and then move on to a P2P DHT-based scenario which was the subject of another test. Finally, I will discuss possible extensions of this work, and the forthcoming tests with biological sequences.

Date and time: 
Thursday, 1 May, 2008 - 16:00
60 minutes