Are you looking to train a custom classification model for Word, Excel, or PowerPoint documents? Currently, the Document Intelligence Studio doesn’t support this feature. However, in this blog, I’ll demonstrate how you can achieve it using the REST API.
Blogs
This blog is part of a series of blogs about the capabilities of the Azure Document Intelligence resource.
Requirements
- Azure ‘Document Intelligence’ or ‘Azure AI services multi-service account’ resource
- Azure ‘Storage account’
- A minimum of 10 documents (5 per category, with at least 2 categories)
- Postman
Word, Excel, and PowerPoint
At the time of writing, the Document Intelligence Studio can only classify the following file types: .jfif, .pjpeg, .jpeg, .pjp, .jpg, .pdf, .png, .tif, .tiff
With API versions 2023-10-31-preview, 2024-02-29-preview, and later you can train a custom classification model on Microsoft Office documents (.docx, .xlsx, and .pptx), but only using the REST API.
Upload files
In the Azure portal navigate to your Storage account.
Create a new private container and upload your training documents. Upload the files to separate folders per category by using the advanced options.
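As an illustration, a container with two hypothetical categories could look like the layout below. The container and folder names are just examples; the folder names will later become the category prefixes when building the classifier.

trainingdata/
    packinglists/
        packinglist1.docx
        packinglist2.docx
        ...
    invoices/
        invoice1.docx
        invoice2.docx
        ...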
Authorization and variables
Navigate to your Document Intelligence or multi-service resource in the Azure portal. Copy the values of both KEY1 and Endpoint for later use.
Open Postman and create a new blank collection. On the Authorization tab, set the following values:
- Authorization type: API Key
- Key: Ocp-Apim-Subscription-Key
- Value: The Authorization key (KEY1) value from the Document Intelligence resource
- Add to: Header
On the Variables tab of the collection, add the following variables:
- endpointUrl: the Document Intelligence endpoint value
- version: the API version you would like to use. I am using version 2024-02-29-preview for this example.
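Outside Postman, these settings boil down to a base URL, an API version, and a single request header. A minimal Python sketch with placeholder values for your own endpoint and key:

import requests

# Placeholder values: replace with your own Document Intelligence endpoint and KEY1
endpoint_url = "https://<your-resource>.cognitiveservices.azure.com"
key = "<KEY1>"
api_version = "2024-02-29-preview"

# Every request in the steps below carries the key in this header
headers = {"Ocp-Apim-Subscription-Key": key}

The Python sketches in the remaining steps assume these same three values.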
Analyzing files
Before we can start training a custom classification model, we need to analyze all the training documents using the prebuilt-layout model. This will generate the .ocr.json files needed to train the model. This cannot be done using the Document Intelligence Studio.
It is done in two steps:
- Analyze the document
- Get the analyzed results
Step 1
Because we uploaded the files to a private container, the files are not publicly available. In order for the Document Intelligence resource to access the file, we need to generate a SAS URL. Navigate to your files in your Storage account, select the first document, and open the Generate SAS panel using the menu under the three dots (…). Leave all the settings on default and select Generate SAS token and URL.
Copy the Blob SAS URL for later use.
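If you prefer to generate the SAS URL in code instead of the portal, a sketch using the azure-storage-blob package could look like this; the account name, account key, container, and blob path are placeholders:

from datetime import datetime, timedelta, timezone
from azure.storage.blob import BlobSasPermissions, generate_blob_sas

account_name = "<storage-account>"            # placeholder
account_key = "<storage-account-key>"         # placeholder
container_name = "<container>"                # placeholder
blob_name = "packinglists/packinglist.docx"   # hypothetical blob path

# Read-only SAS token, valid for one hour
sas_token = generate_blob_sas(
    account_name=account_name,
    container_name=container_name,
    blob_name=blob_name,
    account_key=account_key,
    permission=BlobSasPermissions(read=True),
    expiry=datetime.now(timezone.utc) + timedelta(hours=1),
)
blob_sas_url = f"https://{account_name}.blob.core.windows.net/{container_name}/{blob_name}?{sas_token}"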
Create a new POST request in Postman using the following URL.
{{endpointUrl}}/documentintelligence/documentModels/prebuilt-layout:analyze?api-version={{version}}
In the body of the request, post the following JSON as raw data. Replace the empty urlSource value with the Blob SAS URL you copied in the previous step.
{
  "urlSource": ""
}
When you send this request, you will get a 202 Accepted response. In the response headers you will find a GUID named ‘apim-request-id’. Copy this GUID; you will need it in step 2.
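For reference, the same request as a minimal Python sketch; the SAS URL is the one you copied earlier, and the endpoint and key placeholders are the values described above:

import requests

endpoint_url = "https://<your-resource>.cognitiveservices.azure.com"  # placeholder
key = "<KEY1>"                                                        # placeholder
api_version = "2024-02-29-preview"
blob_sas_url = "<Blob SAS URL of the training document>"              # placeholder

url = (f"{endpoint_url}/documentintelligence/documentModels/"
       f"prebuilt-layout:analyze?api-version={api_version}")
response = requests.post(
    url,
    headers={"Ocp-Apim-Subscription-Key": key},
    json={"urlSource": blob_sas_url},
)
response.raise_for_status()  # expect 202 Accepted

# The GUID needed in step 2 is returned in the response headers
result_id = response.headers["apim-request-id"]
print(result_id)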
Step 2
Create a new GET request in Postman using the following URL. Make sure to replace the <apim request id> tag with the GUID you copied in the last step. The response of this second request contains the analyzed content of the file. You can save this content using the … Save response to file option in Postman.
{{endpointUrl}}/documentintelligence/documentModels/prebuilt-layout/analyzeResults/<apim request id>?api-version={{version}}
Save the files with the same name as the original, but append the following extension to the original file name: .ocr.json. For example: If you analyzed a file named packinglist.docx, name the file packinglist.docx.ocr.json.
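The same step as a Python sketch; it polls the result URL until the analysis is finished and then saves the response next to the original file name. The result id and file name are placeholders:

import time
import requests

endpoint_url = "https://<your-resource>.cognitiveservices.azure.com"  # placeholder
key = "<KEY1>"                                                        # placeholder
api_version = "2024-02-29-preview"
result_id = "<apim-request-id from step 1>"                           # placeholder
original_name = "packinglist.docx"                                    # hypothetical file name

url = (f"{endpoint_url}/documentintelligence/documentModels/prebuilt-layout/"
       f"analyzeResults/{result_id}?api-version={api_version}")
headers = {"Ocp-Apim-Subscription-Key": key}

# Poll until the analysis is no longer queued or running
while True:
    response = requests.get(url, headers=headers)
    response.raise_for_status()
    if response.json().get("status") not in ("notStarted", "running"):
        break
    time.sleep(2)

# Save the full response as <original file name>.ocr.json
with open(f"{original_name}.ocr.json", "w", encoding="utf-8") as f:
    f.write(response.text)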
You need to analyze all 10 files using the steps above.
When you are done, upload all the .ocr.json files to the same folders as the original files in the storage account.
Create the Custom model
Create a new POST request in Postman using the following URL.
{{endpointUrl}}/documentintelligence/documentClassifiers:build?api-version={{version}}
In the body of the request, post the following JSON as raw data and modify it to your needs. The empty keys under docTypes are placeholders for your category names.
{
  "classifierId": "",
  "description": "",
  "docTypes": {
    "": {
      "azureBlobSource": {
        "containerUrl": "",
        "prefix": "/"
      }
    },
    "": {
      "azureBlobSource": {
        "containerUrl": "",
        "prefix": "/"
      }
    }
  }
}
The containerUrl value is the URL of the private container we created at the start of this blog, including a SAS token. When you generate the SAS token for the container, make sure that you add the List permission.
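The same build request as a Python sketch. The classifier id, description, category names, and prefixes below are made-up examples; the prefixes must match the folder names you used in the container, and the container SAS URL is a placeholder:

import requests

endpoint_url = "https://<your-resource>.cognitiveservices.azure.com"  # placeholder
key = "<KEY1>"                                                        # placeholder
api_version = "2024-02-29-preview"
container_sas_url = "<container SAS URL with List permission>"        # placeholder

body = {
    "classifierId": "office-document-classifier",           # hypothetical name
    "description": "Classifies Word/Excel/PowerPoint docs",  # hypothetical description
    "docTypes": {
        "packinglists": {  # hypothetical category, matches a folder in the container
            "azureBlobSource": {"containerUrl": container_sas_url, "prefix": "packinglists/"}
        },
        "invoices": {      # hypothetical category
            "azureBlobSource": {"containerUrl": container_sas_url, "prefix": "invoices/"}
        },
    },
}

url = f"{endpoint_url}/documentintelligence/documentClassifiers:build?api-version={api_version}"
response = requests.post(url, headers={"Ocp-Apim-Subscription-Key": key}, json=body)
response.raise_for_status()  # the build runs asynchronously after the request is accepted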
Model status
Create a new GET request in Postman using the following URL.
{{endpointUrl}}/documentintelligence/documentClassifiers?api-version={{version}}
If your model creation was successful, it will be listed in the body. If it failed, you can check the error message using the Document Intelligence Studio. Navigate to the custom classification models, open a project, and open the Models tab. Click the name of the failed model to see the error.
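The same check as a Python sketch, assuming the response wraps the classifiers in a value array, as Azure list operations typically do:

import requests

endpoint_url = "https://<your-resource>.cognitiveservices.azure.com"  # placeholder
key = "<KEY1>"                                                        # placeholder
api_version = "2024-02-29-preview"

url = f"{endpoint_url}/documentintelligence/documentClassifiers?api-version={api_version}"
response = requests.get(url, headers={"Ocp-Apim-Subscription-Key": key})
response.raise_for_status()

# Print the id of every classifier found on the resource
for classifier in response.json().get("value", []):
    print(classifier.get("classifierId"))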
Testing
Testing the model is done in two steps, just like when we analyzed the documents.
Step 1
Create a new POST request in Postman using the following URL. Make sure to change the <classifierId> tag to your model id.
{{endpointUrl}}/documentintelligence/documentClassifiers/<classifierId>:analyze?api-version={{version}}
In the body of the request, post the following JSON as raw data. Set urlSource to the Blob SAS URL of the document you want to classify.
{
  "urlSource": ""
}
Get the ‘apim-request-id’ from the response headers and use it in step 2.
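As a Python sketch, with the classifier id and the test document's SAS URL as placeholders:

import requests

endpoint_url = "https://<your-resource>.cognitiveservices.azure.com"  # placeholder
key = "<KEY1>"                                                        # placeholder
api_version = "2024-02-29-preview"
classifier_id = "office-document-classifier"            # hypothetical, use your own model id
test_doc_sas_url = "<Blob SAS URL of a test document>"  # placeholder

url = (f"{endpoint_url}/documentintelligence/documentClassifiers/"
       f"{classifier_id}:analyze?api-version={api_version}")
response = requests.post(
    url,
    headers={"Ocp-Apim-Subscription-Key": key},
    json={"urlSource": test_doc_sas_url},
)
response.raise_for_status()
result_id = response.headers["apim-request-id"]
print(result_id)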
Step 2
Create a new GET request in Postman using the following URL. Make sure to change the <classifierId> tag to your model id and replace the <apim request id> tag with the GUID you copied in the last step.
{{endpointUrl}}/documentintelligence/documentClassifiers/<classifierId>/analyzeResults/<apim request id>?api-version={{version}}
When you look at the JSON result, you will see the category the model has predicted in the docType element.
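And the matching retrieval step as a Python sketch. The exact path to the docType (analyzeResult → documents) is my assumption about the response shape, so inspect the raw JSON if your result looks different:

import requests

endpoint_url = "https://<your-resource>.cognitiveservices.azure.com"  # placeholder
key = "<KEY1>"                                                        # placeholder
api_version = "2024-02-29-preview"
classifier_id = "office-document-classifier"   # hypothetical, use your own model id
result_id = "<apim-request-id from step 1>"    # placeholder

url = (f"{endpoint_url}/documentintelligence/documentClassifiers/{classifier_id}/"
       f"analyzeResults/{result_id}?api-version={api_version}")
response = requests.get(url, headers={"Ocp-Apim-Subscription-Key": key})
response.raise_for_status()
result = response.json()

# Assumed response shape: each classified document carries the predicted category in docType
for document in result.get("analyzeResult", {}).get("documents", []):
    print(document.get("docType"))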