Smart Vision v7.2-Performance Test Report

(Impact of PDF Search step)

 

Contents

  1. Test Environment – Infrastructure setup
  2. Performance Testing - Software Used
  3. FinOps Components & Autoscaling configurations
    1. Autoscaling Configuration
    2. Smart Extract Container Configuration
    3. DU Project Configuration
    4. Document Selected for Processing
  4. Performance Test – Summary
  5. Detailed Test Results
    1. Clones Workflow Execution Time
    2. Execution time of each step in DU
    3. Queue time of each step in DU
    4. No of messages in each Queue of RabbitMQ used by DU
  6. Performance Test – Infrastructure Resource Usage
    1. Document Understanding Stack
    2. Clones Stack
    3. Database & Message Broker
  7. Performance Test Execution results
  8. Performance Test Observations

Test Environment – Infrastructure setup

The performance test was conducted on the SmartOps Kubernetes-based infrastructure hosted in Azure Cloud.

The test environment was deployed with only the Smart Extract pipeline and its related containers. The Vespa pipeline and its related containers were excluded from these test executions.

The table below summarizes the hardware infrastructure and node counts used for the Smart Extract infrastructure.

 

| Kubernetes Nodes | Hardware Infrastructure | Node Count (Min) | Node Count (Max) |
|---|---|---|---|
| Application Node Pool. Components: DU (Smart-extract, smart-extract-predict, rest, pipeline, scheduler, invoice-split, page-split, invoice-image-ocr), Clones Engine, pwf-invoice-extraction-scheduler, pwf-invoice-extraction-listener, pwf-invoice-extraction-api, ie-ui | Azure D8sv3. CPU: 8 vCPU cores. RAM: 32 GB | 4 | 6 |
| Persistent Pool. Components: Mongo, MinIO, RabbitMQ, Elastic Search, Kibana | Azure D4sv3. CPU: 4 vCPU cores. RAM: 16 GB | 3 | 6 |
| MySQL | Azure Managed MYSQL. General Purpose, 2 Core(s), 8GB RAM, 100 GB (Auto Grow Enabled) | NA | NA |

Picture 8

Performance Testing - Software Used

The following tools were used for performance testing.

| Tool | Version | Description |
|---|---|---|
| JMeter | 5.1.1 | Implementing performance testing. |
| Prometheus | | Capture resource utilization on the server side. |
| Grafana | | Dashboard to view resource utilization. |
| Microsoft Excel | | Analyzing test results & reports. |
| FinOps | 7.2 | SmartOps FinOps application. |

FinOps Components & Autoscaling configurations

The table below lists the Docker containers of the components associated with FinOps across Document Understanding (DU), Clones, PWF, Database and Message Broker.

It also identifies the components selected for autoscaling and the criteria defined for scaling them. To keep non-scalable components available and avoid component failures during document processing, 2 instances of each component are available in the Kubernetes cluster by default.

| Stack Name | Container Name | Autoscaling Enabled | Scaling Criteria |
|---|---|---|---|
| Document Understanding | du-rest | Y | Based on CPU Usage |
| | du-scheduler | N | |
| | du-pipeline | Y | Based on CPU Usage |
| | du-invoice-split | N | |
| | du-tilt-correct | N | |
| | du-smart-extract | N | |
| | du-invoice-image-ocr | N | |
| Clones | clones-sense-queue | N | |
| | clones-engine | Y | Based on CPU Usage |
| PWF | pwf-Invoice-extraction-listener-du | N | |
| | pwf-invoice-extraction-listener | N | |
| | pwf-invoice-extraction-api | N | |
| | pwf-invoice-extraction-scheduler | N | |

The following are the replicas configured for the Database and Message Broker in the Kubernetes cluster. MySQL is deployed as a managed service in Azure.

| Stack Name | Container Name | No of instances |
|---|---|---|
| Database | mongo | 3 |
| | minio | 2 |
| | elasticsearch | 3 |
| | mysql | Azure Managed Service |
| Message Broker | rabbitmq | 3 |

Autoscaling Configuration

The following are the CPU threshold limits and the minimum and maximum replicas configured for each component identified for autoscaling in the Kubernetes cluster for FinOps.

| Container name | CPU Threshold | Min replicas | Max replicas |
|---|---|---|---|
| Clones-engine | 80% | 2 | 4 |
| du-core-nlp | 80% | 2 | 4 |
| du-pipeline | 80% | 2 | 4 |
| du-rest | 80% | 2 | 4 |
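How a CPU threshold and replica limits of this kind turn into a replica count can be sketched as follows. This is a minimal illustration that assumes the cluster uses the standard Kubernetes Horizontal Pod Autoscaler rule (desired replicas = ceil(current replicas x observed utilization / target utilization), clamped to the configured minimum and maximum). The report does not state which autoscaler or algorithm is in use. The function name and the sample utilization readings are hypothetical, and the autoscaler's tolerance band is ignored.

```python
import math

# Autoscaling limits taken from the configuration table above.
CPU_TARGET_PERCENT = 80
MIN_REPLICAS = 2
MAX_REPLICAS = 4


def desired_replicas(current_replicas: int, observed_cpu_percent: float) -> int:
    """Replica count under the standard HPA rule (assumed, not confirmed by the report)."""
    raw = math.ceil(current_replicas * observed_cpu_percent / CPU_TARGET_PERCENT)
    # Clamp the result to the configured minimum and maximum replicas.
    return max(MIN_REPLICAS, min(MAX_REPLICAS, raw))


# Hypothetical utilization readings, for illustration only.
for cpu in (40, 80, 120, 200):
    print(f"2 replicas at {cpu}% CPU -> {desired_replicas(2, cpu)} replicas")
```

Under these assumptions a component never drops below 2 replicas, even when idle, and never grows beyond 4, however high the CPU usage climbs.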

Smart Extract Container Configuration

Based on the optimal invoice processing observed in previous tests, the Smart Extract Engine and Invoice Image OCR pod replicas are set to 3. All tests were conducted with this configuration.

| Smart Extract Engine | Invoice Image OCR |
|---|---|
| 3 | 3 |

DU Project Configuration

| DU Project setting | Value |
|---|---|
| Engine Pipeline | Smart Extract |
| Language Classifier | Disabled |
| Feedback Learning (HITL) | Disabled |
| Preprocessors Enabled | Barcode Page Split, Skip Dup Validation |

Document Selected for Processing

The file selected for processing has the parameters below.

The test is executed using batches of 51 invoice samples. Each invoice has 2 pages: the first is the invoice scan page and the second is the invoice page.

Sample invoices from different vendors are used, covering invoices of Low, Medium and High complexity.

The test is executed 3 times to confirm that the results are consistent.

 

| Parameter | Value |
|---|---|
| Main File Type | Zip |
| Zip file contents (Combined, Individual) | 1 PDF with combined Invoices |
| No of Invoices combined in the zip file | 51 (102 pages) |
| Pages per Invoice document | 2 (1 scan and 1 invoice page) |
| Invoice Batch | Batch consists of 51 invoice samples covering 3 complexity variations: Low, Medium and High |
| Number of fields extracted per invoice | 22 |
| Insight Fields | 14 |
| Line Items | 8 |
| Infrastructure Variation - Document processing Engine | Smart Extract Exclusive |
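The batch parameters above determine the counts that recur in the detailed results, such as 102 pages and 51 invoices per batch. The short sketch below only restates that arithmetic. The variable names are illustrative, and the even split across complexity levels is the one stated later in the report (17 invoices per level).

```python
# Batch parameters copied from the table above.
INVOICES_PER_BATCH = 51
PAGES_PER_INVOICE = 2
COMPLEXITY_LEVELS = ("Low", "Medium", "High")
INSIGHT_FIELDS = 14
LINE_ITEM_FIELDS = 8

total_pages = INVOICES_PER_BATCH * PAGES_PER_INVOICE
fields_per_invoice = INSIGHT_FIELDS + LINE_ITEM_FIELDS
# Even split across the three complexity levels, as stated later in the report.
invoices_per_level, remainder = divmod(INVOICES_PER_BATCH, len(COMPLEXITY_LEVELS))

print(f"Pages per batch: {total_pages}")
print(f"Fields per invoice: {fields_per_invoice}")
print(f"Invoices per complexity level: {invoices_per_level} (remainder {remainder})")
```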

Performance Test – Summary

This section compares the results before and after the PDF search code changes. With the PDF search changes, the total time taken in DU to process the 51 invoices used in the test increases by about 3 minutes.

| Before PDF search (Minutes) | After PDF search (Minutes) | Impact |
|---|---|---|
| 19.57 | 22.53 | About 3 minutes added (for 51 invoices) after the PDF search code changes. |

Without PDF Search step:

| Test Execution | DU Time Taken (Minutes): First file processed | DU Time Taken (Minutes): Last file processed |
|---|---|---|
| Test 1 | 3.48 | 19.46 |
| Test 2 | 3.38 | 20.12 |
| Test 3 | 3.46 | 19.13 |

With PDF Search step:

| Test Execution | DU Time Taken (Minutes): First file processed | DU Time Taken (Minutes): Last file processed |
|---|---|---|
| Test 1 | 3.00 | 22.27 |
| Test 2 | 3.35 | 23.23 |
| Test 3 | 3.08 | 22.10 |
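The summary figures of 19.57 and 22.53 minutes match the averages of the "Last file processed" times across the three test executions. The sketch below reproduces that calculation. It treats the reported values as decimal minutes, which is an assumption that the report's own averages suggest but do not state.

```python
from statistics import mean

# "Last file processed" DU times in minutes, copied from the tables above.
without_pdf_search = [19.46, 20.12, 19.13]
with_pdf_search = [22.27, 23.23, 22.10]

avg_without = mean(without_pdf_search)
avg_with = mean(with_pdf_search)
delta = avg_with - avg_without

print(f"Average without PDF search: {avg_without:.2f} min")
print(f"Average with PDF search:    {avg_with:.2f} min")
print(f"Increase: {delta:.2f} min ({delta / avg_without:.1%})")
```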

 

The following graphs show the time-based execution of the different DU steps while processing the 51-invoice sample in one of the test executions.

The blue bars represent the execution duration of each DU step. They also help identify which DU steps execute in parallel.

Without PDF Search step

 

Picture 18

With PDF Search step

Picture 26

Detailed Test Results

Clones Workflow Execution Time

This covers the time taken by the Clones engine to execute the following workflows as part of document processing.

Invoice_PWF_PushToDUSched is a scheduled workflow that polls the FTP folder for the invoice batch zip file and passes it to the corresponding Document Understanding project for data extraction.

Invoice_PWF_CheckRoleCondition validates the role-based permissions needed to process invoice documents.

The tables below summarize the average time taken to execute the different Clones workflows.

Without PDF Search step

 

| Invoice Batch | Executions | CheckRoleCondition (Seconds) | Invoice PWF Installation (Seconds) | PushToDUSched (Seconds) |
|---|---|---|---|---|
| Sample -1 | Test1 | 7.13 | 2.11 | 9.59 |
| | Test2 | 7.26 | 2.14 | 12.31 |
| | Test3 | 7.19 | 2.23 | 10.35 |

 

With PDF Search step

 

| Invoice Batch | Executions | CheckRoleCondition (Seconds) | Invoice PWF Installation (Seconds) | PushToDUSched (Seconds) |
|---|---|---|---|---|
| Sample -1 | Test1 | 14.43 | 1.65 | 8.55 |
| | Test2 | 12.37 | 1.68 | 40.54 |
| | Test3 | 10.77 | 1.57 | 11.27 |
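One way to compare the two sets of Clones workflow timings is to look at the mean and the median per workflow. The sketch below does this with the values from the tables above. The median is included because the 40.54 second PushToDUSched reading in Test2 pulls the mean up noticeably. Whether that reading is an outlier is not stated in the report.

```python
from statistics import mean, median

# Clones workflow times in seconds (Test1, Test2, Test3), copied from the tables above.
without_pdf_search = {
    "CheckRoleCondition": [7.13, 7.26, 7.19],
    "Invoice PWF Installation": [2.11, 2.14, 2.23],
    "PushToDUSched": [9.59, 12.31, 10.35],
}
with_pdf_search = {
    "CheckRoleCondition": [14.43, 12.37, 10.77],
    "Invoice PWF Installation": [1.65, 1.68, 1.57],
    "PushToDUSched": [8.55, 40.54, 11.27],
}

for workflow in without_pdf_search:
    before = without_pdf_search[workflow]
    after = with_pdf_search[workflow]
    print(
        f"{workflow}: mean {mean(before):.2f} -> {mean(after):.2f} s, "
        f"median {median(before):.2f} -> {median(after):.2f} s"
    )
```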

 

Execution time of each step in DU

This section reports the time taken to execute the different document extraction steps in DU as part of invoice extraction. Processing an invoice document in the Document Understanding application involves several steps. PDF search, recently introduced in 7.2, is one of them.

The results shown are from one of the 3 test executions.

Without PDF Search step

| DU Steps | Execution Time AVG (Seconds) | Execution Time MIN (Seconds) | Execution Time MAX (Seconds) | Occurrence per batch | Extended Time per batch |
|---|---|---|---|---|---|
| INITIALIZE | 1.56 | 0.37 | 4.23 | 155 | 241.99 |
| VALIDATE | 0.87 | 0.16 | 2.64 | 155 | 134.54 |
| ZIP_EXTRACT | 1.24 | 1.24 | 1.24 | 1 | 1.24 |
| INV_DOC_SPLIT | 426.20 | 426.20 | 426.20 | 1 | 426.20 |
| INV_PAGE_SPLIT | 10.27 | 2.62 | 19.41 | 51 | 523.79 |
| INV_IMAGE_OCR | 19.91 | 4.01 | 59.86 | 102 | 2030.92 |
| SMART_EXTRACT_PREPROCESSOR | 4.80 | 1.88 | 9.68 | 102 | 489.76 |
| SMART_EXTRACT | 5.43 | 3.79 | 8.96 | 102 | 554.34 |
| SMART_EXTRACT_POSTPROCESSOR | 2.96 | 0.60 | 8.38 | 102 | 302.16 |
| SMART_EXTRACT_FOI | 14.27 | 1.03 | 38.61 | 102 | 1455.29 |
| TRIGGER_PARALLEL_VESPA_SE | 5.86 | 1.80 | 18.19 | 102 | 597.85 |
| INV_RESULT_AGGREGATE | 0.44 | 0.20 | 1.78 | 51 | 22.57 |
| FINALIZE | 1.49 | 0.33 | 7.48 | 359 | 534.47 |

 

With PDF Search step

| DU Steps | Execution Time AVG (Seconds) | Execution Time MIN (Seconds) | Execution Time MAX (Seconds) | Occurrence per batch | Extended Time per batch |
|---|---|---|---|---|---|
| INITIALIZE | 1.85 | 0.38 | 5.51 | 155 | 286.64 |
| VALIDATE | 1.01 | 0.20 | 3.62 | 155 | 156.47 |
| ZIP_EXTRACT | 1.21 | 1.21 | 1.21 | 1 | 1.21 |
| INV_DOC_SPLIT | 536.35 | 536.35 | 536.35 | 1 | 536.35 |
| INV_PAGE_SPLIT | 3.14 | 0.88 | 7.88 | 51 | 159.92 |
| INV_IMAGE_OCR | 23.63 | 3.72 | 97.73 | 102 | 2409.75 |
| SMART_EXTRACT_PREPROCESSOR | 5.59 | 2.05 | 13.32 | 102 | 570.07 |
| SMART_EXTRACT | 9.43 | 7.35 | 14.62 | 102 | 962.17 |
| SMART_EXTRACT_POSTPROCESSOR | 3.32 | 0.56 | 17.62 | 102 | 338.45 |
| SMART_EXTRACT_FOI | 16.29 | 1.19 | 46.76 | 102 | 1661.33 |
| TRIGGER_PARALLEL_VESPA_SE | 7.14 | 2.05 | 21.64 | 102 | 728.52 |
| INV_RESULT_AGGREGATE | 0.62 | 0.23 | 3.12 | 51 | 31.42 |
| FINALIZE | 2.00 | 0.38 | 10.71 | 359 | 719.68 |
| PDF_SEARCH | 9.45 | 1.30 | 44.58 | 102 | 964.09 |
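The two execution-time tables can be compared step by step on their extended time per batch. The sketch below is one way to do that with the reported values. It assumes that extended time is in seconds and is approximately AVG x occurrences, which the figures support but the report does not define. The function name is illustrative.

```python
# (AVG seconds, occurrences per batch, extended time) per DU step, from the tables above.
without_pdf_search = {
    "INITIALIZE": (1.56, 155, 241.99),
    "VALIDATE": (0.87, 155, 134.54),
    "ZIP_EXTRACT": (1.24, 1, 1.24),
    "INV_DOC_SPLIT": (426.20, 1, 426.20),
    "INV_PAGE_SPLIT": (10.27, 51, 523.79),
    "INV_IMAGE_OCR": (19.91, 102, 2030.92),
    "SMART_EXTRACT_PREPROCESSOR": (4.80, 102, 489.76),
    "SMART_EXTRACT": (5.43, 102, 554.34),
    "SMART_EXTRACT_POSTPROCESSOR": (2.96, 102, 302.16),
    "SMART_EXTRACT_FOI": (14.27, 102, 1455.29),
    "TRIGGER_PARALLEL_VESPA_SE": (5.86, 102, 597.85),
    "INV_RESULT_AGGREGATE": (0.44, 51, 22.57),
    "FINALIZE": (1.49, 359, 534.47),
}
with_pdf_search = {
    "INITIALIZE": (1.85, 155, 286.64),
    "VALIDATE": (1.01, 155, 156.47),
    "ZIP_EXTRACT": (1.21, 1, 1.21),
    "INV_DOC_SPLIT": (536.35, 1, 536.35),
    "INV_PAGE_SPLIT": (3.14, 51, 159.92),
    "INV_IMAGE_OCR": (23.63, 102, 2409.75),
    "SMART_EXTRACT_PREPROCESSOR": (5.59, 102, 570.07),
    "SMART_EXTRACT": (9.43, 102, 962.17),
    "SMART_EXTRACT_POSTPROCESSOR": (3.32, 102, 338.45),
    "SMART_EXTRACT_FOI": (16.29, 102, 1661.33),
    "TRIGGER_PARALLEL_VESPA_SE": (7.14, 102, 728.52),
    "INV_RESULT_AGGREGATE": (0.62, 51, 31.42),
    "FINALIZE": (2.00, 359, 719.68),
    "PDF_SEARCH": (9.45, 102, 964.09),
}


def extended_delta(before, after):
    """Change in extended time per step; a step absent before counts from zero."""
    return {
        step: round(ext_after - before.get(step, (0.0, 0, 0.0))[2], 2)
        for step, (_, _, ext_after) in after.items()
    }


deltas = extended_delta(without_pdf_search, with_pdf_search)
for step, delta in sorted(deltas.items(), key=lambda item: item[1], reverse=True):
    print(f"{step:<30}{delta:>10.2f} s")

# Check the assumption that extended time is roughly AVG x occurrences.
worst_gap = max(abs(avg * n - ext) / ext for avg, n, ext in with_pdf_search.values())
print(f"Largest gap between AVG x occurrences and extended time: {worst_gap:.2%}")
```

Extended time adds up work across all occurrences of a step, including occurrences that run in parallel, so these differences are not directly comparable to the roughly 3 minute wall-clock increase reported in the summary.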

Queue time of each step in DU

Each step of document extraction in DU has an associated queue phase. The tables below summarize the time spent in the queue for each document extraction step in the Document Understanding application.

Without PDF Search step

| DU Steps | Queue Time AVG (Seconds) | Queue Time MIN (Seconds) | Queue Time MAX (Seconds) | Occurrence per batch | Extended Time per batch |
|---|---|---|---|---|---|
| INITIALIZE | 2.80 | 0.57 | 8.00 | 155 | 434.32 |
| VALIDATE | 3.00 | 0.68 | 10.13 | 155 | 465.61 |
| ZIP_EXTRACT | 0.71 | 0.71 | 0.71 | 1 | 0.71 |
| INV_DOC_SPLIT | 0.69 | 0.69 | 0.69 | 1 | 0.69 |
| INV_PAGE_SPLIT | 2.54 | 0.65 | 5.91 | 51 | 129.36 |
| INV_IMAGE_OCR | 204.00 | 0.76 | 673.37 | 102 | 20808.29 |
| SMART_EXTRACT_PREPROCESSOR | 117.42 | 0.91 | 761.61 | 102 | 11977.33 |
| SMART_EXTRACT | 1.74 | 0.59 | 5.87 | 102 | 177.69 |
| SMART_EXTRACT_POSTPROCESSOR | 74.14 | 0.82 | 147.11 | 102 | 7562.56 |
| SMART_EXTRACT_FOI | 140.59 | 0.85 | 813.30 | 102 | 14340.43 |
| TRIGGER_PARALLEL_VESPA_SE | 2.19 | 0.68 | 9.13 | 102 | 223.05 |
| INV_RESULT_AGGREGATE | 1.52 | 0.61 | 10.75 | 51 | 77.45 |
| FINALIZE | 1.57 | 0.53 | 9.51 | 359 | 563.87 |

 

With PDF Search step

| DU Steps | Queue Time AVG (Seconds) | Queue Time MIN (Seconds) | Queue Time MAX (Seconds) | Occurrence per batch | Extended Time per batch |
|---|---|---|---|---|---|
| INITIALIZE | 3.52 | 0.65 | 12.63 | 155 | 545.65 |
| VALIDATE | 3.71 | 0.82 | 14.36 | 155 | 574.29 |
| ZIP_EXTRACT | 0.71 | 0.71 | 0.71 | 1 | 0.71 |
| INV_DOC_SPLIT | 0.80 | 0.80 | 0.80 | 1 | 0.80 |
| INV_PAGE_SPLIT | 623.66 | 99.51 | 1171.29 | 51 | 31806.70 |
| INV_IMAGE_OCR | 200.58 | 0.98 | 725.64 | 102 | 20459.26 |
| SMART_EXTRACT_PREPROCESSOR | 144.56 | 0.86 | 863.67 | 102 | 14745.54 |
| SMART_EXTRACT | 3.67 | 0.68 | 12.24 | 102 | 374.48 |
| SMART_EXTRACT_POSTPROCESSOR | 83.37 | 1.08 | 164.34 | 102 | 8503.77 |
| SMART_EXTRACT_FOI | 137.72 | 0.86 | 872.63 | 102 | 14047.80 |
| TRIGGER_PARALLEL_VESPA_SE | 2.83 | 0.82 | 9.69 | 102 | 288.91 |
| INV_RESULT_AGGREGATE | 1.96 | 0.73 | 8.15 | 51 | 99.99 |
| FINALIZE | 1.94 | 0.66 | 10.22 | 359 | 697.04 |
| PDF_SEARCH | 6.06 | 0.68 | 51.26 | 102 | 617.96 |
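A per-step comparison of the average queue times in the two tables above shows where waiting time changed most. The sketch below ranks the steps common to both runs by the change in average queue time. It is illustrative only: the function name is hypothetical, and the report does not say how the queue times were measured.

```python
# Average queue time in seconds per DU step, copied from the tables above.
queue_avg_without = {
    "INITIALIZE": 2.80, "VALIDATE": 3.00, "ZIP_EXTRACT": 0.71,
    "INV_DOC_SPLIT": 0.69, "INV_PAGE_SPLIT": 2.54, "INV_IMAGE_OCR": 204.00,
    "SMART_EXTRACT_PREPROCESSOR": 117.42, "SMART_EXTRACT": 1.74,
    "SMART_EXTRACT_POSTPROCESSOR": 74.14, "SMART_EXTRACT_FOI": 140.59,
    "TRIGGER_PARALLEL_VESPA_SE": 2.19, "INV_RESULT_AGGREGATE": 1.52,
    "FINALIZE": 1.57,
}
queue_avg_with = {
    "INITIALIZE": 3.52, "VALIDATE": 3.71, "ZIP_EXTRACT": 0.71,
    "INV_DOC_SPLIT": 0.80, "INV_PAGE_SPLIT": 623.66, "INV_IMAGE_OCR": 200.58,
    "SMART_EXTRACT_PREPROCESSOR": 144.56, "SMART_EXTRACT": 3.67,
    "SMART_EXTRACT_POSTPROCESSOR": 83.37, "SMART_EXTRACT_FOI": 137.72,
    "TRIGGER_PARALLEL_VESPA_SE": 2.83, "INV_RESULT_AGGREGATE": 1.96,
    "FINALIZE": 1.94, "PDF_SEARCH": 6.06,
}


def largest_queue_increases(before, after, top=3):
    """Steps present in both runs, ranked by increase in average queue time."""
    changes = {
        step: round(after[step] - before[step], 2)
        for step in before
        if step in after
    }
    return sorted(changes.items(), key=lambda item: item[1], reverse=True)[:top]


top_increases = largest_queue_increases(queue_avg_without, queue_avg_with)
for step, change in top_increases:
    print(f"{step}: +{change:.2f} s average queue time")
```

PDF_SEARCH is left out of the ranking because it has no counterpart in the run without the PDF search step.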

 

No of messages in each Queue of RabbitMQ used by DU

This covers metrics on the messages in the different queues used by DU during the various document processing steps.

It helps identify the queues in which de-queuing is slow, and therefore the components that depend on those queues and have a slow message exchange rate.

The graphs below show the count of messages in the queue used by each DU step during test execution.

Without PDF Search step

Picture 9

With PDF Search step

Picture 10

Performance Test – Infrastructure Resource Usage

Below are the metrics captured for the maximum CPU cores and memory used while executing the performance test with the sample batch of 51 invoices from different vendors, with 17 invoices each of Low, Medium and High complexity.

To ensure high availability, 2 pods of each component are deployed in Kubernetes. The CPU cores and memory shown below are the maximum usage across both pods of each core component of Document Understanding, Clones, Database and Message Broker when using the Smart Extract document processing engine.

Document Understanding Stack

Without PDF Search step

Picture 2

With PDF Search step

Picture 11

 

Clones Stack

Without PDF Search step

Picture 1

With PDF Search step

 

Picture 19

Database & Message Broker

Without PDF Search step

Picture 21

With PDF Search step

Picture 20

Performance Test Execution results

Raw data from Document Understanding for each test execution is available at the following SharePoint location: Test Run Data

Performance Test Observations