What is Quantxt Theia?

Quantxt Theia is a managed service for text and data extraction.
Theia can extract data from documents with a mix of forms, tables, and plain text.

Key Concepts


Theia splits input documents into Text Units, applies Vocabularies through Extractors to each Text Unit, and returns a list of Fields.

The most common Text Unit is a page. For text documents that are not organized into pages, the options are lines, sentences, or the document as a whole.

A Vocabulary is a list of phrases used for searching for fields within the text units.

An Extractor employs a Vocabulary and a regular expression to find and extract Fields.

A Field has a name and may have zero, one, or multiple values. A typical form document has fields with one or no value, while a field extracted from a table column or a table row has multiple values.
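
The flow described above can be sketched with the Python standard library alone. This is only a conceptual model, not the qtcurate API or Theia's implementation: the function name `extract_fields`, the dictionary layout, and the choice of one line per Text Unit are assumptions made for illustration.

```python
import re

def extract_fields(text, phrases, validator):
    """Toy model of the flow: text units -> vocabulary search -> validator -> fields."""
    pattern = re.compile(validator)
    fields = []
    # Assumption: each line is one text unit (a page is the more common choice).
    for unit_number, unit in enumerate(text.splitlines(), start=1):
        lowered = unit.lower()
        for phrase in phrases:
            start = lowered.find(phrase.lower())
            if start < 0:
                continue
            # The validator is applied to the text right after the phrase,
            # and its single capturing group becomes the value.
            match = pattern.search(unit[start + len(phrase):])
            # Assumption: a phrase without a valid value is kept with an
            # empty list so the zero-value case stays visible.
            values = [match.group(1)] if match else []
            fields.append({"unit": unit_number, "name": phrase, "values": values})
    return fields

sample = "Sector Allocation (%)\nIndustrials   12.50\nFinancials   8.25\nUtilities   n/a"
fields = extract_fields(sample, ["Industrials", "Financials", "Utilities"], r"^ +(\d[\d.,]+\d)")
for field in fields:
    print(field)
```

In this sketch the two numeric rows yield one value each, and the row ending in n/a yields an empty list because the validator does not match.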

Installation


Add the Quantxt Theia API client library to your project, using Maven for Java or pip for Python:


  <dependency>
    <groupId>com.quantxt.sdk</groupId>
    <artifactId>qtcurate</artifactId>
    <version>2.6.1</version>
  </dependency>

                

  pip install qtcurate

                

Give Me The Code


The following code extracts Industrials, Financials, and Utilities from the Sector Allocation (%) table in this document:


  import com.quantxt.sdk.client.QT;
  import com.quantxt.sdk.extraction.Model;
  import com.quantxt.sdk.document.Document;
  import com.quantxt.sdk.model.Extractor;
  import com.quantxt.sdk.result.Field;
  import com.quantxt.sdk.result.Result;
  import com.quantxt.sdk.vocabulary.Vocabulary;
  import com.quantxt.sdk.vocabulary.VocabularyEntry;
  import java.io.File;
  import java.util.ArrayList;
  import java.util.List;
  import java.util.regex.Pattern;

  public class Test {

    public static void main(String[] args) {
      QT.init("Your_api_key");
      File file = new File("path_to_sample.pdf");

      // 1- Upload the sample document for processing
      List<Document> documents = new ArrayList<>();
      documents.add(Document.creator().source(file).create());

      // 2- Create vocabulary
      List<VocabularyEntry> entries = new ArrayList<>();
      entries.add(new VocabularyEntry("Industrials"));
      entries.add(new VocabularyEntry("Financials"));
      entries.add(new VocabularyEntry("Utilities"));

      Vocabulary vocabulary = Vocabulary.creator()
              .name("Allocation (%)")
              .entries(entries)
              .create();

      // 3- Create Extractor - Regex must have 1 capturing group
      Extractor extractor = new Extractor()
              .setVocabulary(vocabulary)
              .setValidator(Pattern.compile("^ +(\\d[\\d\\.\\,]+\\d)"));

      // 4- Create model and run
      Model model = Model.creator("My parser job")
              .addExtractor(extractor)
              .withDocuments(documents)
              .create();

      // 5- Wait to finish
      Model.fetcher(model.getId()).blockUntilFinish();

      // 6- Export results
      for (Result result : Result.reader(model.getId()).read()){
          for (Field field : result.getFields()) {
              System.out.println(field.getStr() + " -> " + field.getFieldValues()[0].getStr());
          }
      }

      // 7- Clean up
      Model.deleter(model.getId()).delete();
      Vocabulary.deleter(vocabulary.getId()).delete();
    }
  }


              

  import sys

  from qtcurate.extractor import Extractor, Mode
  from qtcurate.vocabulary import Vocabulary
  from qtcurate.model import Model
  from qtcurate.qt import Qt
  from qtcurate.document import Document
  from qtcurate.result import Result
  from qtcurate.result import Field, FieldValue


  API_KEY = "Your_api_key"
  DOCUMENT = "path_to_sample.pdf"

  Qt.init(API_KEY)

  # 1- Upload the sample document for processing
  list_of_documents = []
  document = Document()
  doc = document.create(DOCUMENT)
  list_of_documents.append(doc)

  # 2- Create vocabulary
  vocabulary = Vocabulary()
  vocabulary.add_entry("Industrials")
  vocabulary.add_entry("Financials")
  vocabulary.add_entry("Utilities")
  vocabulary.name("Allocations (%)").create()

  # 3- Create Extractor - Regex must have 1 capturing group
  extractor = Extractor()
  extractor.set_vocabulary(vocabulary.get_id())
  extractor.set_validator("^ +(\\d[\\d\\.\\,]+\\d)")

  # 4- Run
  model = Model()
  model.set_description("My parser job")
  model.add_extractor(extractor)
  model.with_documents(list_of_documents)
  model.create()

  # 5- Wait to finish
  model.wait_for_completion()

  # 6- Print results
  result = Result(model.get_id())
  for field in result.read():
    print(f"{field.get_str()} {field.get_values()[0].get_str()}")

  # 7- Clean up
  vocabulary.delete(vocabulary.get_id())
  model.delete(model.get_id())


              

Building Extraction Models


Quantxt Theia API offers operations for building extraction models via Vocabularies and Extractors and retrieving output via Result operations. Models can be created, fetched, listed and deleted.

Creating a new model


  Extractor my_sample_extractor = ....;
  List<Document> documents = .....;
  Model model = Model.creator("My parser job")
          .addExtractor(my_sample_extractor)
          .withDocuments(documents)
          .create();


                

  extractor = Extractor()
  extractor.set_ ...
  documents = []
  document = Document()
  documents.append(document)  # ...
  model = Model()
  model.set_description("My parser job")
  model.add_extractor(extractor)
  model.with_documents(documents)
  model.create()


                

Fetching an existing model


  Model model = Model.fetcher(model_id).fetch();


                

  model = Model()
  model.fetch(model_id)


                

Listing all existing models


  List<Model> modelList = Model.reader().read();

                

  model = Model()
  list_models = model.read()


                

Deleting an existing model


  boolean deleted = Model.deleter(model_id).delete();


                

  model = Model()
  model.delete(model_id)


                

Once a model is created, it can be used to process documents in production:


  import com.quantxt.sdk.client.QT;
  import com.quantxt.sdk.document.Document;
  import com.quantxt.sdk.extraction.job.Job;
  import com.quantxt.sdk.extraction.model.Model;
  import com.quantxt.sdk.result.Result;

  import java.io.File;
  import java.util.ArrayList;

  public class Test {
      public static void main(String [] args){
          String API_KEY  = "...";
          File DOCUMENT = new File("...");
          String MODEL_ID = "...";

          QT.init(API_KEY);

          Document document = Document.creator().source(DOCUMENT).create();

          ArrayList<Document> documents = new ArrayList<>();
          documents.add(document);

          Model model = Model.fetcher(MODEL_ID).fetch();
          Job job = Job.creator("my sample job")
                  .withModel(model)
                  .withDocuments(documents)
                  .create();
          Job.fetcher(job.getId()).blockUntilFinish();

          ArrayList<Result> results = Result.reader(job.getId()).read();
      }
  }


                

  from qtcurate.model import Model
  from qtcurate.job import Job
  from qtcurate.qt import Qt
  from qtcurate.document import Document
  from qtcurate.result import Result

  API_KEY  = ...
  DOCUMENT = ...
  MODEL_ID = ...

  Qt.init(API_KEY)

  documents = []
  document = Document()
  document = document.create(DOCUMENT)
  documents.append(document)

  job = Job()
  model = Model()
  model = model.fetch(MODEL_ID)
  job.set_description("my sample job").with_model(model.id).with_documents(documents).create()
  job.wait_for_completion()

  result = Result(job.get_id())


                

Vocabulary Operations

Vocabularies can be created, fetched, updated and deleted.

Creating a new Vocabulary


  List<VocabularyEntry> vocabularyEntries = new ArrayList<>();
  vocabularyEntries.add(new VocabularyEntry("Apple Inc."));
  vocabularyEntries.add(new VocabularyEntry("Alphabet Inc."));

  Vocabulary vocabulary = Vocabulary.creator()
            .name("Companies")
            .entries(vocabularyEntries)
            .create();


                

  vocabulary = Vocabulary()
  vocabulary.add_entry("Apple Inc.")
  vocabulary.add_entry("Alphabet Inc.")
  vocabulary.name("Companies").create()


                

Fetching an existing Vocabulary


  Vocabulary vocabulary = Vocabulary.fetcher(vocabulary_id).fetch();

                

  vocabulary = Vocabulary()
  vocabulary.fetch(vocabulary_id)


                

Updating an existing Vocabulary


  Vocabulary vocabulary_updated = Vocabulary.updater(vocabulary_id)
        .name("Companies changed")
        .addEntry(new VocabularyEntry("Tesla"))
        .update();


                

  vocabulary = Vocabulary()
  vocabulary.name("Companies changed")
  vocabulary.add_entry("Tesla")
  vocabulary_updated = vocabulary.update(vocabulary_id)


                

Deleting an existing Vocabulary


  boolean deleted = Vocabulary.deleter(vocabulary_id).delete();

                

  vocabulary = Vocabulary()
  vocabulary.delete(vocabulary_id)


                

Extractor Operations

An extractor is coupled with a vocabulary and passed to a model operation for text extraction.


  Extractor extractors = new Extractor();
  
  // Set vocabulary
  extractors.setVocabulary(vocabulary);

  // Ignore `of` when searching for vocabulary entries
  List<String> stopwords = new ArrayList<>();
  stopwords.add("of");
  extractors.setStopwordList(stopwords);

  // Convert extracted text into doubles - Only works for Excel export
  extractors.setDataType(DOUBLE);

  // Unordered search for vocabulary entries, 'Price of Product' will match on 'Product Price'
  extractors.setMode(UNORDERED);

  // Looking for monies right after the vocab matches: 'Product Price: $1,234'
  // Validator regex must have one capturing group
  extractors.setValidator(Pattern.compile("^: +(\\$[\\d,]+)"));


                

  extractor = Extractor()

  # Set vocabulary
  extractor.set_vocabulary(vocabulary)

  # Ignore 'of' when searching for vocabulary entries
  stopwords = []
  stopwords.append("of")
  extractor.set_stop_word_list(stopwords)

  # Convert extracted text into doubles - Only works for Excel export
  extractor.set_data_type(DataType.DOUBLE)

  # Unordered search for vocabulary entries, 'Price of Product' will match on 'Product Price'
  extractor.set_mode(Mode.UNORDERED)

  # Looking for monies right after the vocab matches: 'Product Price: $1,234'
  # Validator regex must have one capturing group
  extractor.set_validator(r"^: +(\$[\d,]+)")

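
The effect of the stop word list, the UNORDERED mode, and the validator in the example above can be approximated with plain Python. This is a rough sketch under stated assumptions, not how Theia's search engine works internally: the helper names `tokens` and `unordered_match` are hypothetical, and matching is reduced to comparing sorted word lists.

```python
import re
import string

def tokens(text, stopwords):
    """Lowercase, strip punctuation, and drop stop words."""
    cleaned = text.lower().translate(str.maketrans("", "", string.punctuation))
    return [word for word in cleaned.split() if word not in stopwords]

def unordered_match(phrase, candidate, stopwords=()):
    """True when both strings hold the same words, in any order."""
    return sorted(tokens(phrase, stopwords)) == sorted(tokens(candidate, stopwords))

# With 'of' as a stop word, the two phrases reduce to the same words.
print(unordered_match("Price of Product", "Product Price", stopwords={"of"}))  # True
# Without the stop word, 'of' is an extra word and the match fails.
print(unordered_match("Price of Product", "Product Price"))  # False

# The validator is applied to the text that follows the matched phrase;
# its single capturing group is the extracted value.
validator = re.compile(r"^: +(\$[\d,]+)")
line = "Product Price: $1,234"
rest = line[len("Product Price"):]
match = validator.search(rest)
print(match.group(1))  # $1,234
```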

                

Result Operations

Results can be fetched once a model is run:


  List<Result> results = Result.reader(model.getId()).read();
  // Access extracted fields and vocabulary ids used to extract them
  for (Result r : results){
    for (Field f : r.getFields()){
        System.out.println(f.getVocabId() + " " + f.getStr());
    }
  }

                

  result = Result(model.get_id())
  # Access extracted fields and vocabulary ids used to extract them
  for i in result.read():
    field = Field(i)
    if field.get_values() != "":
        field_value = FieldValues(field.get_values())
        print(f"{field.get_id()} {field_value.get_str()[0]}")
        

                

Data Types


This section covers the data types used by Quantxt Theia for analyzing documents.

Document

Input files are converted into Document objects with the following properties before analysis:

id Unique ID for a Document object

fileName Original filename

contentType Content type of the document detected automatically by the engine

date The timestamp the document was created

VocabularyEntry

A VocabularyEntry has a phrase and an optional category. The phrase is used to search input documents for fields, and the category is used to tag and group the matched phrases:

str Search phrase to find fields

category Category or normalized name given to the matched phrase

Vocabulary

A Vocabulary holds a list of VocabularyEntries used for searching the documents. Vocabulary has the following properties:

id ID of the vocabulary, assigned by the engine when the vocabulary is created

name Name of the vocabulary set by the user

entries List of VocabularyEntry items

Extractor

An Extractor employs a vocabulary to scan the text for fields and is also in charge of finding and validating the field values. It scans the text for all entries in the vocabulary using full-text search techniques. Users can set the search mode, stop words, and synonyms on the extractor. An Extractor has the following properties:

vocabulary Name of the vocabulary set by the user

type The type of the extracted field value. Default is STRING. Other possible values are LONG, DOUBLE, and DATETIME; if one is set, the engine makes a best effort to convert the extracted values into that type

mode Search mode used for scanning input content to find vocabulary phrases (that is, the VocabularyEntry str values). Default mode is SIMPLE:

  • SIMPLE Case-insensitive matching that ignores punctuation
  • UNORDERED SIMPLE, plus word order is ignored in multi-word phrases
  • STEM SIMPLE, plus matching on minor variations of words;
    Building will match on build, built and builds.
  • UNORDERED_STEM UNORDERED and STEM combined
  • FUZZY_UNORDERED_STEM UNORDERED_STEM, plus fuzzy matching on words;
    Building will match on builidng.

validator A regular expression used to find and validate FieldValues. It must have one capturing group

patternBetweenMultipleValues Allowed gap between values in a multi-value field such as a table row or a table column.

By default, Theia finds up to one value per field. Setting this to something like ^\s+$ makes Theia find multiple values for a table row or column where the boundary between cells is whitespace or blank lines.

stopwordList List of words to ignore in searching for phrases in vocabulary

synonymList List of synonyms used in searching for phrases in vocabulary
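
The role of patternBetweenMultipleValues in the property list above can be illustrated with a small standard-library sketch. It is an approximation under assumptions, not Theia's algorithm: the function name `collect_values` is hypothetical, and the value pattern used here is unanchored so that it can be applied repeatedly along a row.

```python
import re

def collect_values(text_after_phrase, validator, gap=None):
    """Collect validated values; without a gap pattern, stop after the first one."""
    value_re = re.compile(validator)
    gap_re = re.compile(gap) if gap else None
    values = []
    position = 0
    for match in value_re.finditer(text_after_phrase):
        between = text_after_phrase[position:match.start()]
        if values:
            # Further values are accepted only when a gap pattern is set
            # and the text between neighbouring values matches it.
            if gap_re is None or not gap_re.search(between):
                break
        values.append(match.group(1))
        position = match.end()
    return values

row = "  10.5   20.0   30.25"
value_pattern = r"(\d[\d.,]*\d)"

print(collect_values(row, value_pattern))                 # ['10.5']
print(collect_values(row, value_pattern, gap=r"^\s+$"))   # ['10.5', '20.0', '30.25']
# Non-whitespace text between two cells ends the multi-value run.
print(collect_values("  10.5 (est.) 20.0", value_pattern, gap=r"^\s+$"))  # ['10.5']
```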

Model

Models are in charge of extracting data from documents. A model has the following properties:

id Unique id for the model set by the engine once the task is submitted

description Description of the model

extractors List of extractors to be used in data analysis

documents List of documents to be processed

numWorkers Maximum number of threads used during the extraction. Default is 8.

Result

Once a model is run on documents, it returns an array of Result objects. Theia creates one Result object for every Text Unit that had any extracted data. Result has the following properties:

id ID of the model that produced this result

documentName Same as Document.fileName

unitNumber Text unit number

creationTime Timestamp of creation of the results

fields Fields that were extracted for the associated text unit

Field

Fields are the output of Extractors. When an extractor finds a field with one or more valid values, it creates one Field object. Field has the following properties:

str The match found by the extractor

vocabName Name of the vocabulary that had the match

vocabId ID of the vocabulary that had the match

category Category of the match from the vocabulary

type Type of the match. Can be STRING, LONG, DOUBLE or DATETIME

fieldValues The field values found and validated by the Extractor's validator
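
The type property described for Extractor and Field implies a best-effort conversion of the extracted text. The sketch below shows one way such a conversion could behave. It is purely illustrative: the function name `convert`, the handling of `,` and `$`, the date format `%Y-%m-%d`, and the fallback to the original string are all assumptions, not documented Theia behavior.

```python
from datetime import datetime

def convert(raw, data_type):
    """Best-effort conversion; fall back to the original string on failure."""
    text = raw.strip()
    try:
        if data_type == "LONG":
            # Assumption: commas are thousands separators.
            return int(text.replace(",", ""))
        if data_type == "DOUBLE":
            # Assumption: a leading currency sign can be dropped.
            return float(text.replace(",", "").lstrip("$"))
        if data_type == "DATETIME":
            # Assumption: ISO-style dates only.
            return datetime.strptime(text, "%Y-%m-%d")
    except ValueError:
        pass
    # STRING, unknown types, and failed conversions keep the raw text.
    return raw

print(convert("1,234", "LONG"))        # 1234
print(convert("$1,234.50", "DOUBLE"))  # 1234.5
print(convert("n/a", "DOUBLE"))        # n/a
```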