Training a custom classification model with Word or Excel Documents

Are you looking to train a custom classification model for Word, Excel, or PowerPoint documents? Currently, the Document Intelligence Studio doesn’t support this feature. However, in this blog, I’ll demonstrate how you can achieve it using the REST API.

Requirements

  • Azure ‘Document Intelligence’ or ‘Azure AI services multi-service account’ resource
  • Azure ‘Storage account’
  • A minimum of 10 documents (5 per category, minimum of 2 categories)
  • Postman

Word and Excel

At the time of writing, the Document Intelligence Studio can only classify the following file types: .jfif, .pjpeg, .jpeg, .pjp, .jpg, .pdf, .png, .tif, and .tiff.

In API versions 2023-10-31-preview, 2024-02-29-preview, and later, you can train a custom classification model on Microsoft Office documents (.docx, .xlsx, and .pptx), but only by using the REST API.

Upload files

In the Azure portal, navigate to your Storage account.

Create a new private container and upload your training documents. Upload the files to separate folders per category by using the advanced options.

Authorization and variables

Navigate to your Document Intelligence or multi-service resource in the Azure portal. Copy the values of both KEY1 and Endpoint for later use.

Open Postman and create a new blank collection. On the Authorization tab, set the following values:

  • Authorization type: API Key
  • Key: Ocp-Apim-Subscription-Key
  • Value: The Authorization key (KEY1) value from the Document Intelligence resource
  • Add to: Header

On the Variables tab, add two new variables:

  • endpointUrl: Add the Document Intelligence endpoint value
  • version: Add the version of the API you would like to use. I am using version 2024-02-29-preview for this example.
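
If you would rather script the calls than click through Postman, the collection settings above can be mirrored in a few lines of Python. This is only a sketch: the constant and function names (ENDPOINT_URL, API_KEY, API_VERSION, build_headers, build_url) are my own, and the values are placeholders you need to fill in.

```python
# Hypothetical stand-ins for the Postman collection settings.
ENDPOINT_URL = "https://<your-resource>.cognitiveservices.azure.com/"  # Endpoint value
API_KEY = "<KEY1>"                                                      # KEY1 value
API_VERSION = "2024-02-29-preview"                                      # {{version}}


def build_headers(api_key):
    """Same effect as the API Key authorization with 'Add to: Header'."""
    return {
        "Ocp-Apim-Subscription-Key": api_key,
        "Content-Type": "application/json",
    }


def build_url(endpoint_url, path, version):
    """Join the endpoint, the documentintelligence path and the api-version."""
    # The endpoint copied from the portal may end with a slash, so strip it first.
    base = endpoint_url.rstrip("/")
    return f"{base}/documentintelligence/{path.lstrip('/')}?api-version={version}"


if __name__ == "__main__":
    print(build_url(ENDPOINT_URL, "documentModels/prebuilt-layout:analyze", API_VERSION))
```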

Analyzing files

Before we can start training a custom classification model, we need to analyze all the training documents using the prebuilt-layout model. This will generate the .ocr.json files needed to train the model. This cannot be done using the Document Intelligence Studio.

It is done in 2 steps.

  1. Analyze the document
  2. Get the analyzed results

Step 1

Because we uploaded the files to a private container, the files are not publicly available. For the Document Intelligence resource to access a file, we need to generate a SAS URL. Navigate to your files in your Storage account. Select the first document and open the Generate SAS panel using the menu under the three dots (…). Leave all the settings on default and select Generate SAS token and URL.

Copy the Blob SAS URL for later use.

Create a new POST request in Postman using the following URL.

				
					{{endpointUrl}}/documentintelligence/documentModels/prebuilt-layout:analyze?api-version={{version}}
				
			

In the body of the request, post the following JSON as raw data and modify it with your data.

				
					{
 "urlSource": "<Blob SAS URL of your document>"
}

				
			

When you send this request, you will get a 202 Accepted response. In the response headers you will find a GUID named ‘apim-request-id’. Copy this GUID for use in step 2.
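
For readers who prefer a script over Postman, here is a rough Python equivalent of step 1 using only the standard library. The function names and parameters are hypothetical; the URL, the request body and the ‘apim-request-id’ header are the ones described above.

```python
import json
import urllib.request


def build_analyze_request(endpoint_url, version, key, blob_sas_url):
    """Build the POST request for the prebuilt-layout analysis (nothing is sent yet)."""
    url = (
        f"{endpoint_url.rstrip('/')}/documentintelligence/"
        f"documentModels/prebuilt-layout:analyze?api-version={version}"
    )
    body = json.dumps({"urlSource": blob_sas_url}).encode("utf-8")
    headers = {
        "Ocp-Apim-Subscription-Key": key,
        "Content-Type": "application/json",
    }
    return urllib.request.Request(url, data=body, headers=headers, method="POST")


def submit_analysis(request):
    """Send the request and return the GUID from the 'apim-request-id' header."""
    with urllib.request.urlopen(request) as response:
        # A 202 Accepted response carries the GUID in its headers, not in the body.
        return response.headers["apim-request-id"]


if __name__ == "__main__":
    # Placeholder values: replace them with your own endpoint, key and Blob SAS URL.
    req = build_analyze_request(
        "https://<your-resource>.cognitiveservices.azure.com/",
        "2024-02-29-preview",
        "<KEY1>",
        "<Blob SAS URL of your document>",
    )
    print(req.get_method(), req.full_url)
```

Calling submit_analysis(req) with real values should return the GUID you need for step 2.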

Step 2

Create a new GET request in Postman using the following URL. Make sure to replace the <apim request id> placeholder with the GUID you copied in the last step. The response of this second request contains the analyzed content of the file. The analysis runs asynchronously, so if the status in the response is not yet succeeded, wait a moment and send the request again. You can save the content using the … Save response to file option in Postman.

				
					{{endpointUrl}}/documentintelligence/documentModels/prebuilt-layout/analyzeResults/<apim request id>?api-version={{version}}
				
			

Save the files with the same name as the original, but append the following extension to the original file name: .ocr.json. For example: If you analyzed a file named packinglist.docx, name the file packinglist.docx.ocr.json.

Repeat the steps above for each of your training files (all 10 in this example).
When you are done, upload all the .ocr.json files to the same folder as the original files in the storage account.
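
Repeating step 2 by hand for every file gets tedious, so below is a small Python sketch that polls the result URL and writes the response under the .ocr.json name. The helper names are made up for this post, and I am assuming the response body has a top-level status field that ends up as succeeded or failed.

```python
import json
import time
import urllib.request
from pathlib import Path


def ocr_json_name(original_name):
    """packinglist.docx -> packinglist.docx.ocr.json"""
    return original_name + ".ocr.json"


def result_url(endpoint_url, version, request_id):
    """URL of the analyze result for the GUID returned in step 1."""
    return (
        f"{endpoint_url.rstrip('/')}/documentintelligence/documentModels/"
        f"prebuilt-layout/analyzeResults/{request_id}?api-version={version}"
    )


def fetch_result(url, key, poll_seconds=2.0, max_attempts=30):
    """GET the result until the analysis has finished (assumed 'status' field)."""
    request = urllib.request.Request(url, headers={"Ocp-Apim-Subscription-Key": key})
    for _ in range(max_attempts):
        with urllib.request.urlopen(request) as response:
            result = json.load(response)
        if result.get("status") in ("succeeded", "failed"):
            return result
        time.sleep(poll_seconds)  # still running, try again shortly
    raise TimeoutError("analysis did not finish in time")


def save_result(result, original_name, folder="."):
    """Write the full response next to the original file name plus .ocr.json."""
    target = Path(folder) / ocr_json_name(original_name)
    target.write_text(json.dumps(result), encoding="utf-8")
    return target
```

A typical run would call fetch_result(result_url(...), key) and then save_result(result, "packinglist.docx") for each training document before uploading the files to the storage account.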

Create the Custom model

Create a new POST request in Postman using the following URL.

				
					{{endpointUrl}}/documentintelligence/documentClassifiers:build?api-version={{version}}
				
			

In the body of the request, post the following JSON as raw data and modify it to your needs.

				
					{
  "classifierId": "<Name of the model>",
  "description": "<Description of the model>",
  "docTypes": {
    "<categoryName>": {
      "azureBlobSource": {
        "containerUrl": "<SAS url of the blob container>",
        "prefix": "<foldername>/"
      }
    },
    "<categoryName>": {
      "azureBlobSource": {
        "containerUrl": "<SAS url of the blob container>",
        "prefix": "<foldername>/"
      }
    }
  }
}

				
			

The containerUrl value contains a SAS token for the private container we created at the start of this blog. When you generate the SAS token for the container, make sure that you add the List permission alongside Read.
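
With more than two categories the JSON body becomes repetitive, so here is a hedged Python sketch that generates it from a mapping of category name to folder name. The function names and the folders_by_category parameter are hypothetical; the body layout and the URL follow the request shown above.

```python
import json
import urllib.request


def build_classifier_body(classifier_id, description, container_sas_url, folders_by_category):
    """Create the docTypes section from {category name: folder name}."""
    if len(folders_by_category) < 2:
        raise ValueError("at least two categories are required")
    doc_types = {}
    for category, folder in folders_by_category.items():
        doc_types[category] = {
            "azureBlobSource": {
                "containerUrl": container_sas_url,
                # The prefix is the folder name followed by a single slash.
                "prefix": folder.strip("/") + "/",
            }
        }
    return {
        "classifierId": classifier_id,
        "description": description,
        "docTypes": doc_types,
    }


def build_classifier_request(endpoint_url, version, key, body):
    """Build the POST request for documentClassifiers:build (nothing is sent yet)."""
    url = (
        f"{endpoint_url.rstrip('/')}/documentintelligence/"
        f"documentClassifiers:build?api-version={version}"
    )
    headers = {
        "Ocp-Apim-Subscription-Key": key,
        "Content-Type": "application/json",
    }
    data = json.dumps(body).encode("utf-8")
    return urllib.request.Request(url, data=data, headers=headers, method="POST")
```

To actually start the training, pass the request to urllib.request.urlopen, just as you would press Send in Postman.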

Model status

Create a new GET request in Postman using the following URL.

				
					{{endpointUrl}}/documentintelligence/documentClassifiers?api-version={{version}}
				
			

If your model creation was successful, it will be listed in the body. If it failed, you can check the error message using the Document Intelligence Studio portal. Navigate to the custom classification models, open a project and open the models tab. You can click on the name of the failed model to see the error.
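
The same check can be scripted. The sketch below requests the list and pulls out the classifier ids. I am assuming the listing is returned as a JSON object with a value array whose items carry a classifierId field; verify this against your own response. The function names are hypothetical.

```python
import json
import urllib.request


def list_classifiers(endpoint_url, version, key):
    """GET the list of custom classifiers for this resource."""
    url = (
        f"{endpoint_url.rstrip('/')}/documentintelligence/"
        f"documentClassifiers?api-version={version}"
    )
    request = urllib.request.Request(url, headers={"Ocp-Apim-Subscription-Key": key})
    with urllib.request.urlopen(request) as response:
        return json.load(response)


def classifier_ids(listing):
    """Extract the ids from the listing (assumed shape: {"value": [{"classifierId": ...}]})."""
    return [item["classifierId"] for item in listing.get("value", [])]


def is_listed(listing, classifier_id):
    """True when the model shows up in the listing, meaning creation succeeded."""
    return classifier_id in classifier_ids(listing)
```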

Testing

Testing the model is done in 2 steps just like when we analyzed the documents.

Step 1

Create a new POST request in Postman using the following URL. Make sure to replace the <classifierId> placeholder with your model id.

				
					{{endpointUrl}}/documentintelligence/documentClassifiers/<classifierId>:analyze?api-version={{version}}
				
			

In the body of the request, post the following JSON as raw data and modify it with your data.

				
					{
 "urlSource": "<Blob SAS URL of your document>"
}

				
			

Get the ‘apim-request-id’ from the headers of the response and use it in step 2.

Step 2

Create a new GET request in Postman using the following URL. Make sure to replace the <classifierId> placeholder with your model id and the <apim request id> placeholder with the GUID you copied in the last step.

				
					{{endpointUrl}}/documentintelligence/documentClassifiers/<classifierId>/analyzeResults/<apim request id>?api-version={{version}}
				
			

When you look at the JSON result, you will see the category the model has predicted in the docType element.
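
If you saved the response to a file, a few lines of Python can pull the prediction out of it. This sketch assumes the docType element sits inside analyzeResult.documents together with a confidence value. The sample dictionary is invented for illustration and is not real service output.

```python
def predicted_doc_types(result):
    """Return (docType, confidence) pairs from a classifier analyze result."""
    # Assumed layout: {"analyzeResult": {"documents": [{"docType": ..., "confidence": ...}]}}
    documents = result.get("analyzeResult", {}).get("documents", [])
    return [(doc.get("docType"), doc.get("confidence")) for doc in documents]


if __name__ == "__main__":
    # Invented example data, trimmed to the fields used above.
    sample = {
        "status": "succeeded",
        "analyzeResult": {
            "documents": [{"docType": "packinglist", "confidence": 0.97}]
        },
    }
    for doc_type, confidence in predicted_doc_types(sample):
        print(doc_type, confidence)
```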
