Information Extraction from a Single Document Using a Genetic Algorithm
Computer Science & Engg / Anna University
Last modified: April 27, 2006
In this paper we focus on Information Extraction (IE), one of the most prominent techniques currently used in Text Mining. Text Mining techniques are dedicated to automatically extracting information from unstructured textual data.
Information Extraction (IE) systems analyze unrestricted text in order to extract information about pre-specified types of events, entities or relationships. Information Extraction is not Information Retrieval: Information Retrieval returns sets of relevant documents, which we must then analyze ourselves, whereas Information Extraction pulls facts out of documents, and we analyze the facts.
In this paper we address the problem of extracting the necessary metadata from documents. The document may be a business card, a business letter, a resume, etc. The extracted information is organized into meaningful groups according to its content, so that it can be accessed more easily for further processing.
This paper mainly focuses on resumes. The document is converted to a list of words with attributes. The word list is then converted to a sequence of tokens representing a linearization of the text blocks found in the document; the tokens represent text line types and text block separators. The token sequence is then parsed using a parser. The most probable parse is used to assign metadata types to text blocks. Detected metadata fields can then be used to route the document or store it correctly in a document database.
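The linearization step can be illustrated with a minimal Python sketch. The line types (EMAIL, HEADING, TEXT), the SEP token and the rules that assign them are assumptions made for illustration only; the paper does not specify the actual token inventory.

```python
import re

BLOCK_SEP = "SEP"  # hypothetical token marking a boundary between text blocks


def classify_line(line):
    """Assign a coarse line type; these categories are illustrative only."""
    text = line.strip()
    if re.fullmatch(r"[^@\s]+@[^@\s]+\.[A-Za-z]{2,}", text):
        return "EMAIL"
    if text.endswith(":") or text.isupper():
        return "HEADING"
    return "TEXT"


def linearize(document):
    """Turn raw text into a flat sequence of line-type and separator tokens."""
    tokens = []
    for line in document.splitlines():
        if not line.strip():
            # Collapse runs of blank lines into a single block separator.
            if tokens and tokens[-1] != BLOCK_SEP:
                tokens.append(BLOCK_SEP)
        else:
            tokens.append(classify_line(line))
    # A trailing separator carries no information, so drop it.
    if tokens and tokens[-1] == BLOCK_SEP:
        tokens.pop()
    return tokens


sample = "John Smith\njohn@example.com\n\n\nEducation:\nB.E. Computer Science, 2005\n"
print(linearize(sample))
```

A parser would then operate on this short token sequence rather than on the raw text, which is what makes assigning metadata types to whole blocks tractable.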
The system follows four steps:
1. Dictionary creation: The dictionary contains the required keywords, which are the fields present in a resume. It also contains the separators.
2. Line labeling using tokenization and regular expression matching: This step reduces the document to labeled lines. Tokenization splits the text into lines and words using separators such as blank space, newline, comma, colon and full stop. Regular expression matching then matches the keywords against the text lines.
3. Text recognition using syntax analysis and word sense disambiguation: This step determines whether words have the same meaning or different meanings.
4. Parsing: The in-house Fips parser automatically filters out non-significant words and extracts the relevant information.
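Steps 1 and 2 above can be sketched roughly as follows. The dictionary entries, the field labels and the OTHER fallback are hypothetical examples, not the dictionary used by the system; only the list of separators comes from the description above.

```python
import re

# Hypothetical dictionary: field label -> keywords that signal it in a resume.
DICTIONARY = {
    "EDUCATION": ["education", "qualification", "degree"],
    "EXPERIENCE": ["experience", "employment"],
    "CONTACT": ["email", "phone", "address"],
}

# Separators named in step 2: blank space, newline, comma, colon, full stop.
SEPARATORS = re.compile(r"[ \n,:.]+")


def build_patterns(dictionary):
    """Compile one case-insensitive, whole-word pattern per field label."""
    patterns = {}
    for label, keywords in dictionary.items():
        alternation = "|".join(re.escape(k) for k in keywords)
        patterns[label] = re.compile(rf"\b(?:{alternation})\b", re.IGNORECASE)
    return patterns


def label_lines(text, patterns):
    """Reduce the document to (label, words) pairs, one per non-empty line."""
    labeled = []
    for line in text.split("\n"):
        words = [w for w in SEPARATORS.split(line) if w]
        if not words:
            continue
        label = "OTHER"  # assumed fallback when no keyword matches
        for name, pattern in patterns.items():
            if pattern.search(line):
                label = name
                break
        labeled.append((label, words))
    return labeled


resume = "Education: B.E., Anna University\nWork Experience\nPhone: 555 0100\nChennai"
for label, words in label_lines(resume, build_patterns(DICTIONARY)):
    print(label, words)
```

In this sketch the first matching dictionary entry wins; a real system would need a policy for lines that match several fields.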
We use the Fips parser. Its advantage is that all alternatives are considered in parallel; it does not use a grammar of context-free rules but proceeds as a licensing parser. Combining Information Extraction technology with genetic algorithms can produce a new integrated model for text mining.
A genetic algorithm (GA) is used to identify combinations of terms that optimize an objective function, which is the cornerstone of the process. Genetic algorithms perform global searches by exploring multiple solutions in parallel. GAs can also cope with noisy and missing data. The benefit of using a genetic algorithm in this project is that it may provide high-level knowledge discovery.
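To make the idea concrete, here is a minimal Python sketch of a genetic algorithm searching over combinations of terms. The term weights, the additive objective function and all parameter values are assumptions chosen for illustration; the paper does not define its objective function here.

```python
import random

# Hypothetical per-term scores standing in for a real objective function.
TERM_SCORES = {
    "education": 3.0, "experience": 2.5, "skills": 2.0,
    "the": -1.0, "and": -1.0, "of": -0.5,
}
TERMS = list(TERM_SCORES)


def fitness(individual):
    """Score a term combination, encoded as one 0/1 bit per term."""
    return sum(TERM_SCORES[t] for t, bit in zip(TERMS, individual) if bit)


def evolve(pop_size=20, generations=30, mutation_rate=0.05, seed=0):
    """Evolve a population of term combinations and return the fittest one."""
    rng = random.Random(seed)
    population = [[rng.randint(0, 1) for _ in TERMS] for _ in range(pop_size)]
    for _ in range(generations):
        population.sort(key=fitness, reverse=True)
        # Elitism: carry the two best individuals over unchanged.
        next_gen = [ind[:] for ind in population[:2]]
        while len(next_gen) < pop_size:
            # Tournament selection of two parents.
            a = max(rng.sample(population, 3), key=fitness)
            b = max(rng.sample(population, 3), key=fitness)
            # One-point crossover followed by bit-flip mutation.
            cut = rng.randrange(1, len(TERMS))
            child = a[:cut] + b[cut:]
            child = [1 - g if rng.random() < mutation_rate else g for g in child]
            next_gen.append(child)
        population = next_gen
    return max(population, key=fitness)


best = evolve()
print([t for t, bit in zip(TERMS, best) if bit], fitness(best))
```

The population is what gives the search its parallel character: many candidate term combinations are evaluated in each generation, and elitism ensures the best score found never decreases.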
Current systems extract information using format guides and are restricted to a fixed number of formats. This paper does not use any format guide; the scope of the project is to read any type of resume and store the content of the document in the database. Unlike previous approaches, our model does not rely on external resources or conceptual descriptions. Instead, it performs the discovery using only information from the original corpus of text documents.
We are implementing this system using MFC (Microsoft Foundation Classes) for the front end and DB2 for the back end.