ParseResponse serialization duplicates keys #64

@DemirTonchev

Serializing a ParseResponse returned by the parse method produces both class and class_ fields when dumping to a file:

response.model_dump_json(indent=2)

produces JSON like this:

{
  "chunks": [...], 
  "class_": "full",  # <<<<<<<
  "identifier": "full",
  "markdown": "<string>",
  "pages": [0],
  "class": "full"  # <<<<<<<
}

Furthermore, downloading the blob from S3 using the job's output_url and then trying to load the model like this:

import json
from io import BytesIO
import httpx

async def download_blob(presigned_url: str):
    async with httpx.AsyncClient() as client:
        async with client.stream("GET", presigned_url) as response:
            response.raise_for_status()
            buffer = BytesIO()
            async for chunk in response.aiter_bytes():
                buffer.write(chunk)
            return buffer


# Top-level await: run this inside an async context (e.g. a notebook or an async function).
buf = await download_blob(response.output_url)
parsed = ParseResponse.model_validate_json(buf.getvalue())

Fails with:

ValidationError: 1 validation error for ParseResponse
splits.0.class
  Field required [type=missing, input_value={'class_': 'full', 'ident...a94-9681-e3c8860228dd']}, input_type=dict]

Using ParseResponse.model_validate_json(buf.getvalue(), by_name=True) succeeds, but this is not mentioned in the documentation.
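
Until the serialization is fixed, one possible workaround is to normalize the keys before validation. The sketch below uses only the standard library. The helper name normalize_class_keys is hypothetical and not part of the SDK, and it assumes that "class" is the key the model expects (as the validation error above suggests) and that "class_" always carries the same value.

```python
import json
from typing import Any


def normalize_class_keys(value: Any) -> Any:
    """Recursively rename "class_" keys to "class" in decoded JSON data.

    Hypothetical helper, not part of the SDK. If an object contains both
    keys, the value stored under "class" is kept.
    """
    if isinstance(value, dict):
        normalized = {}
        for key, item in value.items():
            if key == "class_":
                # Only fill in "class" if it has not been set already;
                # a later real "class" key overwrites this value.
                normalized.setdefault("class", normalize_class_keys(item))
            else:
                normalized[key] = normalize_class_keys(item)
        return normalized
    if isinstance(value, list):
        return [normalize_class_keys(item) for item in value]
    return value


# Example payload shaped like the one in the error message above.
raw = '{"splits": [{"class_": "full", "identifier": "full", "pages": [0]}]}'
cleaned = normalize_class_keys(json.loads(raw))
print(json.dumps(cleaned))
```

The cleaned dictionary could then be passed to ParseResponse.model_validate(cleaned) instead of calling model_validate_json on the raw bytes, assuming the model accepts the "class" key as the error message indicates.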

Expected behavior:

model_dump_json() should produce only 'class' or 'class_', not both
JSON from S3 should deserialize correctly
