Encoding issue when trying to run command `rasa data split nlu`

Description

I am building a Polish-language bot with Rasa Open Source. I encountered a bug when running the CLI command `rasa data split nlu`.

The generated split files do not encode non-ASCII UTF-8 characters properly. The data is written to YAML as Unicode escape sequences (\u0142 instead of ł) and is later read incorrectly by the importer. The error occurs only when the character appears inside an entity dict.

Polish-specific characters such as ą, ę, ł, ć, ź, ... are represented correctly when they appear outside the entity value dict. Inside the JSON-encoded dict, however, they are written as Unicode escape sequences.
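
The symptom matches the default behavior of Python's standard `json.dumps`, which escapes every non-ASCII character unless told otherwise. The snippet below is only an illustration of that behavior, assuming the entity dict is serialized with the standard `json` module; the entity name and value are made-up sample data, and this is not Rasa's actual code.

```python
import json

# Made-up sample entity dict containing Polish characters.
entity = {"entity": "city", "value": "Łódź"}

# By default, ensure_ascii=True, so non-ASCII characters are
# written as \uXXXX escape sequences.
escaped = json.dumps(entity)
print(escaped)  # {"entity": "city", "value": "\u0141\u00f3d\u017a"}

# The escaped form still decodes back to the original dict when parsed
# as JSON; the trouble starts when the escaped text is treated as a
# literal string by whatever reads it next.
assert json.loads(escaped) == entity
```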

I managed to fix this issue by properly encoding the entity dict while saving it with TrainingDataWriter.
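
The patch itself is not shown here. As a rough sketch of what such a fix could look like, assuming the entity dict is serialized with the standard `json` module, passing `ensure_ascii=False` keeps the characters readable. The helper `format_entity_annotation` below is hypothetical and is not part of Rasa's `TrainingDataWriter` API.

```python
import json


def format_entity_annotation(text: str, attributes: dict) -> str:
    """Hypothetical helper (not Rasa's API): render an annotated entity
    as [text]{...} with the attribute dict kept as readable UTF-8."""
    # ensure_ascii=False stops json.dumps from emitting \uXXXX escapes.
    return f"[{text}]" + json.dumps(attributes, ensure_ascii=False)


annotation = format_entity_annotation(
    "Łodzi", {"entity": "city", "value": "Łódź"}
)
print(annotation)  # [Łodzi]{"entity": "city", "value": "Łódź"}
```

When the resulting string is written to disk, the file also needs to be opened with `encoding="utf-8"` so the characters survive on platforms whose default encoding is not UTF-8.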

A similar issue is referenced in GitHub issue #7541 (RasaHQ/rasa), which was marked as stale for an earlier version of Rasa.

Created July 10, 2024 at 12:14 PM
Updated July 10, 2024 at 2:23 PM