What is Quantxt Theia?
Quantxt Theia is a managed service for text and data extraction.
Theia can extract data from documents with a mix of forms, tables, and plain text.
Key Concepts
Theia splits input documents into Text Units, employs Vocabularies via Extractors for processing each text unit, and returns a list of Fields.
The most common Text Unit is a page. For text documents that are not organized in pages, the options are lines, sentences, or processing the document as a whole.
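As a rough local illustration of what splitting a document into Text Units means, the sketch below splits plain text by page, line, sentence, or as a whole. It is not Theia's implementation: the helper name `split_text_units`, the unit names, and the use of the form-feed character as a page separator are all assumptions made for this example.

```python
import re

def split_text_units(text, unit="page"):
    """Split raw text into text units (hypothetical helper, not part of the Theia SDK)."""
    if unit == "whole":
        parts = [text]
    elif unit == "page":
        # Assumption: pages are separated by form-feed characters.
        parts = text.split("\f")
    elif unit == "line":
        # Note: str.splitlines() also treats a form feed as a line boundary.
        parts = text.splitlines()
    elif unit == "sentence":
        # Naive rule: a sentence ends with ., ! or ? followed by whitespace.
        parts = re.split(r"(?<=[.!?])\s+", text)
    else:
        raise ValueError(f"unknown unit: {unit}")
    # Drop units that contain only whitespace.
    return [p.strip() for p in parts if p.strip()]

sample = "Fund overview. Sector data follows.\fIndustrials 12.5\nFinancials 8.1"
print(split_text_units(sample, "page"))
print(split_text_units(sample, "line"))
print(split_text_units(sample, "sentence"))
```

Each returned string would then be searched independently, which is why the choice of unit affects which phrase and value pairs can be found together.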
A Vocabulary is a list of phrases used for searching for fields within the text units.
An Extractor employs a Vocabulary and a regular expression to find and extract Fields.
A Field has a name and may have zero, one, or multiple values. A typical form document has fields with one or no value, while a field extracted from a table column or a table row has multiple values.
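To make the relationship between these concepts concrete, here is a minimal local sketch that mimics the flow on a single text unit: search for each vocabulary phrase, apply a validator regex to the text that follows the phrase, and collect the captured value as a field. The function `extract_fields` and the sample text are hypothetical. The real service performs the search server-side and may behave differently, for example in how it treats phrases that have no valid value.

```python
import re

def extract_fields(text_unit, vocabulary, validator):
    """Toy extractor: find each phrase, then validate the text right after it."""
    fields = {}
    lowered = text_unit.lower()
    for phrase in vocabulary:
        start = lowered.find(phrase.lower())
        if start == -1:
            continue  # phrase does not occur in this text unit
        # The validator is applied to whatever follows the matched phrase.
        remainder = text_unit[start + len(phrase):]
        match = validator.search(remainder)
        if match:
            # The single capturing group holds the field value.
            fields[phrase] = [match.group(1)]
    return fields

page = (
    "Sector Allocation (%)\n"
    "Industrials   12.5\n"
    "Financials   8.1\n"
    "Utilities   n/a\n"
)
vocabulary = ["Industrials", "Financials", "Utilities"]
validator = re.compile(r"^ +(\d[\d.,]+\d)")

fields = extract_fields(page, vocabulary, validator)
print(fields)
```

In this sketch "Utilities" produces no field because the text after it ("n/a") does not satisfy the validator.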
Installation
Add the Quantxt Theia API client library to your project, either as a Maven dependency (Java) or with pip (Python):
<dependency>
<groupId>com.quantxt.sdk</groupId>
<artifactId>qtcurate</artifactId>
<version>2.6.1</version>
</dependency>
pip install qtcurate
Give Me The Code
The following code extracts Industrials, Financials and Utilities from the Sector Allocation (%) table in this document:
import com.quantxt.sdk.client.QT;
import com.quantxt.sdk.extraction.Model;
import com.quantxt.sdk.document.Document;
import com.quantxt.sdk.model.Extractor;
import com.quantxt.sdk.result.Field;
import com.quantxt.sdk.result.Result;
import com.quantxt.sdk.vocabulary.Vocabulary;
import com.quantxt.sdk.vocabulary.VocabularyEntry;
import java.io.File;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;
public class Test {
    public static void main(String[] args) {
        QT.init("Your_api_key");
        File file = new File("path_to_sample.pdf");
        // 1- Upload the sample document for processing
        List<Document> documents = new ArrayList<>();
        documents.add(Document.creator().source(file).create());
        // 2- Create vocabulary
        List<VocabularyEntry> entries = new ArrayList<>();
        entries.add(new VocabularyEntry("Industrials"));
        entries.add(new VocabularyEntry("Financials"));
        entries.add(new VocabularyEntry("Utilities"));
        Vocabulary vocabulary = Vocabulary.creator()
            .name("Allocation (%)")
            .entries(entries)
            .create();
        // 3- Create Extractor - Regex must have 1 capturing group
        Extractor extractor = new Extractor()
            .setVocabulary(vocabulary)
            .setValidator(Pattern.compile("^ +(\\d[\\d\\.\\,]+\\d)"));
        // 4- Create model and run
        Model model = Model.creator("My parser job")
            .addExtractor(extractor)
            .withDocuments(documents)
            .create();
        // 5- Wait to finish
        Model.fetcher(model.getId()).blockUntilFinish();
        // 6- Export results
        for (Result result : Result.reader(model.getId()).read()) {
            for (Field field : result.getFields()) {
                System.out.println(field.getStr() + " -> " + field.getFieldValues()[0].getStr());
            }
        }
        // 7- Clean up
        Model.deleter(model.getId()).delete();
        Vocabulary.deleter(vocabulary.getId()).delete();
    }
}
import sys
from qtcurate.extractor import Extractor, Mode
from qtcurate.vocabulary import Vocabulary
from qtcurate.model import Model
from qtcurate.qt import Qt
from qtcurate.document import Document
from qtcurate.result import Result
from qtcurate.result import Field, FieldValue
API_KEY = "Your_api_key"
DOCUMENT = "path_to_sample.pdf"
Qt.init(API_KEY)
# 1- Upload the sample document for processing
list_of_documents = []
document = Document()
doc = document.create(DOCUMENT)
list_of_documents.append(doc)
# 2- Create vocabulary
vocabulary = Vocabulary()
vocabulary.add_entry("Industrials")
vocabulary.add_entry("Financials")
vocabulary.add_entry("Utilities")
vocabulary.name("Allocations (%)").create()
# 3- Create Extractor - Regex must have 1 capturing group
extractor = Extractor()
extractor.set_vocabulary(vocabulary.get_id())
extractor.set_validator("^ +(\\d[\\d\\.\\,]+\\d)")
# 4- Run
model = Model()
model.set_description("My parser job")
model.add_extractor(extractor)
model.with_documents(list_of_documents)
model.create()
# 5- Wait to finish
model.wait_for_completion()
# 6- Print results
result = Result(model.get_id())
for field in result.read():
    print(f"{field.get_str()} {field.get_values()[0].get_str()}")
# 7- Clean up
vocabulary.delete(vocabulary.get_id())
model.delete(model.get_id())
Building Extraction Models
The Quantxt Theia API offers operations for building extraction models via Vocabularies and Extractors, and for retrieving output via Result operations. Models can be created, fetched, listed, and deleted.
Creating a new model
Extractor my_sample_extractor = ....;
List documents = .....;
Model model = Model.creator("My parser job")
.addExtractor(my_sample_extractor)
.withDocuments(documents)
.create();
extractor = Extractor()
extractor.set_ ...
documents = []
document = Document()
documents.append(document)  # ...
model = Model()
model.set_description("My parser job")
model.add_extractor(extractor)
model.with_documents(documents)
model.create()
Fetching an existing model
Model model = Model.fetcher(model_id).fetch();
model = Model()
model.fetch(model_id)
Listing all existing models
List modelList = Model.reader().read();
model = Model()
list_models = model.read()
Deleting an existing model
boolean deleted = Model.deleter(model_id).delete();
model = Model()
model.delete(model_id)
Once a model is created, it can be used to process documents in production:
import com.quantxt.sdk.client.QT;
import com.quantxt.sdk.document.Document;
import com.quantxt.sdk.extraction.job.Job;
import com.quantxt.sdk.extraction.model.Model;
import com.quantxt.sdk.result.Result;
import java.io.File;
import java.util.ArrayList;
public class Test {
    public static void main(String[] args) {
        String API_KEY = "...";
        File DOCUMENT = new File("...");
        String MODEL_ID = "...";
        QT.init(API_KEY);
        Document document = Document.creator().source(DOCUMENT).create();
        ArrayList<Document> documents = new ArrayList<>();
        documents.add(document);
        Model model = Model.fetcher(MODEL_ID).fetch();
        Job job = Job.creator("my sample job")
            .withModel(model)
            .withDocuments(documents)
            .create();
        Job.fetcher(job.getId()).blockUntilFinish();
        ArrayList results = Result.reader(job.getId()).read();
    }
}
from qtcurate.model import Model
from qtcurate.job import Job
from qtcurate.qt import Qt
from qtcurate.document import Document
from qtcurate.result import Result
API_KEY = ...
DOCUMENT = ...
MODEL_ID = ...
Qt.init(API_KEY)
documents = []
document = Document()
document = document.create(DOCUMENT)
documents.append(document)
job = Job()
model = Model()
model = model.fetch(MODEL_ID)
job.set_description("my sample job").with_model(model.id).with_documents(documents).create()
job.wait_for_completion()
result = Result(job.get_id())
Vocabulary Operations
Vocabularies can be created, fetched, updated and deleted.
Creating a new Vocabulary
List vocabularyEntries = new ArrayList<>();
vocabularyEntries.add(new VocabularyEntry("Apple Inc."));
vocabularyEntries.add(new VocabularyEntry("Alphabet Inc."));
Vocabulary vocabulary = Vocabulary.creator()
.name("Companies")
.entries(vocabularyEntries)
.create();
vocabulary = Vocabulary()
vocabulary.add_entry("Apple Inc.")
vocabulary.add_entry("Alphabet Inc.")
vocabulary.name("Companies").create()
Fetching an existing Vocabulary
Vocabulary vocabulary = Vocabulary.fetcher(vocabulary_id).fetch();
vocabulary = Vocabulary()
vocabulary.fetch(vocabulary_id)
Updating an existing Vocabulary
Vocabulary vocabulary_updated = Vocabulary.updater(vocabulary_id)
.name("Companies changed")
.addEntry(new VocabularyEntry("Tesla"))
.update();
vocabulary = Vocabulary()
vocabulary.name("Companies changed")
vocabulary.add_entry("Tesla")
vocabulary_updated = vocabulary.update(vocabulary_id)
Deleting an existing Vocabulary
boolean deleted = Vocabulary.deleter(vocabulary_id).delete();
vocabulary.delete(vocabulary_id)
Extractor Operations
An extractor is coupled with a vocabulary and passed to a model operation for text extraction.
Extractor extractors = new Extractor();
// Set vocabulary
extractors.setVocabulary(vocabulary);
// Ignore `of` when searching for vocabulary entries
List<String> stopwords = new ArrayList<>();
stopwords.add("of");
extractors.setStopwordList(stopwords);
// Convert extracted text into doubles - Only works for Excel export
extractors.setDataType(DOUBLE);
// Unordered search for vocabulary entries, 'Price of Product' will match on 'Product Price'
extractors.setMode(UNORDERED);
// Looking for monies right after the vocab matches: 'Product Price: $1,234'
// Validator regex must have one capturing group
extractors.setValidator(Pattern.compile("^: +(\\$[\\d,]+)"));
extractor = Extractor()
# Set vocabulary
extractor.set_vocabulary(vocabulary)
# Ignore 'of' when searching for vocabulary entries
stopwords = []
stopwords.append("of")
extractor.set_stop_word_list(stopwords)
# Convert extracted text into doubles - Only works for Excel export
extractor.set_data_type(DataType.DOUBLE)
# Unordered search for vocabulary entries, 'Price of Product' will match on 'Product Price'
extractor.set_mode(Mode.UNORDERED)
# Looking for monies right after the vocab matches: 'Product Price: $1,234'
# Validator regex must have one capturing group
extractor.set_validator("^: +(\\$[\\d,]+)")
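The effect of the stop word list and the UNORDERED mode can be approximated locally. The sketch below is only an illustration of the matching idea described in the comments above (case-insensitive, punctuation ignored, stop words dropped, word order optionally ignored). The helpers `tokens` and `phrase_matches` are hypothetical and are not part of the SDK, and the engine's actual search is likely more sophisticated.

```python
import re

def tokens(text, stopwords=()):
    """Lowercase the text, drop punctuation, and remove stop words."""
    words = re.findall(r"[a-z0-9]+", text.lower())
    stop = {w.lower() for w in stopwords}
    return [w for w in words if w not in stop]

def phrase_matches(phrase, candidate, stopwords=(), unordered=False):
    """Approximate ordered (SIMPLE-like) or UNORDERED phrase matching."""
    a = tokens(phrase, stopwords)
    b = tokens(candidate, stopwords)
    if unordered:
        # Same words in any order.
        return sorted(a) == sorted(b)
    # Same words in the same order.
    return a == b

# With 'of' as a stop word, the unordered comparison treats these as equal.
print(phrase_matches("Price of Product", "Product Price", ["of"], unordered=True))
# The ordered comparison does not.
print(phrase_matches("Price of Product", "Product Price", ["of"], unordered=False))
# Case and punctuation differences are ignored in both modes.
print(phrase_matches("Product Price", "product-price"))
```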
Result Operations
Results can be fetched once a model is run:
List<Result> results = Result.reader(model.getId()).read();
// Access extracted fields and vocabulary ids used to extract them
for (Result r : results) {
    for (Field f : r.getFields()) {
        System.out.println(f.getVocabId() + " " + f.getStr());
    }
}
result = Result(model.get_id())
# Access extracted fields and vocabulary ids used to extract them
for field in result.read():
    values = field.get_values()
    if values:
        print(f"{field.get_id()} {values[0].get_str()}")
Data Types
This section covers the data types used by Quantxt Theia for analyzing documents.
Document
Input files are converted into Document objects with the following properties before analysis:
id Unique ID for a Document object
fileName Original filename
contentType Content type of the document detected automatically by the engine
date The timestamp the document was created
VocabularyEntry
A VocabularyEntry has a phrase and an optional category, and is used to search input documents for fields. The category is used to tag and group the matched phrases:
str Search phrase to find fields
category Category or normalized name given to the phrase
Vocabulary
A Vocabulary holds a list of VocabularyEntries used for searching the documents. Vocabulary has the following properties:
id ID of the vocabulary, assigned by the engine when the vocabulary is created
name Name of the vocabulary set by the user
entries List of VocabularyEntry items
Extractor
An extractor employs a vocabulary to scan the text for fields. The extractor is also in charge of finding and validating the field values. It scans the text for all entries in the vocabulary using full-text search techniques. Users can set search modes, stop words, and synonyms on the extractor. Extractor has the following properties:
vocabulary Name of the vocabulary set by the user
type The type of the extracted field value. Default is STRING. Other possible values are LONG, DOUBLE and DATETIME; if one of these is set, the engine makes a best effort to convert the extracted values into that type
mode Search mode used for scanning input content to find vocabulary phrases (that is, the VocabularyEntry.str values). Default mode is SIMPLE:
- SIMPLE Case insensitive and ignores punctuation when finding matches
- UNORDERED SIMPLE plus ignores the order of words in multi-word phrases
- STEM SIMPLE plus allows matching on minor variations of words; Building will match on build, built and builds.
- UNORDERED_STEM UNORDERED and STEM
- FUZZY_UNORDERED_STEM UNORDERED_STEM plus allows fuzzy matching on words; Building will match on builidng.
validator A regular expression used to find and validate FieldValues. It must have one capturing group
patternBetweenMultipleValues Allowed gap between values in a multi-value field such as a table row or a table column. By default Theia finds up to one value per field. Setting this to something like ^\s+$ makes Theia find multiple values for a table row or column where the boundary between cells is whitespace or blank lines.
stopwordList List of words to ignore in searching for phrases in vocabulary
synonymList List of synonyms used in searching for phrases in vocabulary
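The role of patternBetweenMultipleValues can be pictured with a small local sketch. Here a value pattern is matched repeatedly in the text after a phrase, and a further value is accepted only while the text between two consecutive values matches the gap pattern. The function `extract_values` and the sample row are hypothetical, and the engine's real algorithm is not documented here, so treat this purely as an illustration of the idea.

```python
import re

def extract_values(remainder, value_pattern, gap_pattern=None):
    """Collect values; continue past the first one only while gaps are allowed."""
    values = []
    prev_end = None
    for match in value_pattern.finditer(remainder):
        if prev_end is not None:
            if gap_pattern is None:
                break  # default behaviour in this sketch: at most one value
            gap = remainder[prev_end:match.start()]
            if not gap_pattern.search(gap):
                break  # text between the two values is not an allowed gap
        values.append(match.group(1))
        prev_end = match.end()
    return values

# Text following a matched phrase in a whitespace-separated table row.
row = "  12.5   13.1   14.0 (est.) 15.2"
value_pattern = re.compile(r"(\d[\d.,]+\d)")
gap_pattern = re.compile(r"^\s+$")

print(extract_values(row, value_pattern))               # single-value default
print(extract_values(row, value_pattern, gap_pattern))  # multi-value row
```

In the sample row the fourth number is not collected because "(est.)" sits between it and the previous value, so the gap is not pure whitespace.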
Model
Models are in charge of extracting data from documents. A model has the following properties:
id Unique id for the model set by the engine once the task is submitted
description Description of the model
extractors List of extractors to be used in data analysis
documents List of documents to be processed
numWorkers Maximum number of threads used during the extraction. Default is 8.
Result
Once a model is run on documents, it returns an array of Result. Theia creates one Result object for every Text Unit that had any extracted data. Result has the following properties:
id id of the model that produced this result
documentName Same as Document.fileName
unitNumber Text unit number
creationTime Timestamp of creation of the results
fields Fields that were extracted for the associated text unit
Field
Fields are the outcome of Extractors. If an extractor finds a field with one or more valid values, it creates one Field object. Field has the following properties:
str The match found by the extractor
vocabName Name of the vocabulary that had the match
vocabId Id of the vocabulary that had the match
category Category of the match from the vocabulary
type Type of the match. Can be STRING, LONG, DOUBLE or DATETIME
fieldValues The field values found and validated by the Extractor's validator
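The relationship between Result, Field, and field values can be sketched with plain Python dataclasses. The classes below are local stand-ins that mirror only a few of the properties listed above. They are not the SDK's classes, and the names `FieldSketch`, `ResultSketch`, and `to_rows` are invented for this example. The sketch shows one way to flatten results into one row per extracted value, which is convenient for table-like output.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class FieldSketch:
    text: str         # stands in for Field.str
    vocab_name: str   # stands in for Field.vocabName
    values: List[str] = field(default_factory=list)  # stands in for Field.fieldValues

@dataclass
class ResultSketch:
    document_name: str  # stands in for Result.documentName
    unit_number: int    # stands in for Result.unitNumber
    fields: List[FieldSketch] = field(default_factory=list)

def to_rows(results):
    """Flatten results into (document, unit, field, value) tuples, one per value."""
    rows = []
    for result in results:
        for f in result.fields:
            for value in f.values:
                rows.append((result.document_name, result.unit_number, f.text, value))
    return rows

results = [
    ResultSketch("sample.pdf", 1, [
        FieldSketch("Industrials", "Allocation (%)", ["12.5", "13.1"]),
        FieldSketch("Financials", "Allocation (%)", ["8.1"]),
    ]),
]
rows = to_rows(results)
for row in rows:
    print(row)
```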