Hasky: Dataset Upload Tool
Hasky is an information-retrieval and question-answering system builder that uses self-supervised training techniques to significantly reduce the need for text annotation.
Give clear instructions on formatting within the dataset
When the user must upload a dataset to train a model, account for users who do not know what a dataset is or what it entails.
Provide clear instructions on what the dataset should include and how it should be structured.
For CSV
If the dataset is an Excel spreadsheet, tell the user what the model expects: two columns or three? Which column should contain the data, and which the labels? Does the column order matter?
If possible, show examples of an ideal dataset, either as pictures or as a downloadable sample.
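Formatting instructions are easier to follow when the platform also checks the upload against them and reports problems in plain language. The sketch below is one possible way to do that for a CSV file. The two-column `text`/`label` layout and the function name `validate_csv` are assumptions made for illustration, not Hasky's actual format.

```python
import csv
import io

# Hypothetical schema: these column names are an assumption for illustration.
EXPECTED_COLUMNS = ["text", "label"]


def validate_csv(content, expected=EXPECTED_COLUMNS):
    """Return a list of readable problems; an empty list means no problem was found."""
    reader = csv.reader(io.StringIO(content))
    header = next(reader, None)
    if header is None:
        return ["The file is empty."]

    problems = []
    # Compare the header case-insensitively and ignore stray whitespace.
    header = [name.strip().lower() for name in header]
    if header != expected:
        problems.append(f"Expected columns {expected}, found {header}.")

    # Data rows start on line 2 of the file, after the header.
    for row_number, row in enumerate(reader, start=2):
        if len(row) != len(expected):
            problems.append(
                f"Row {row_number} has {len(row)} values, expected {len(expected)}."
            )
        elif any(not cell.strip() for cell in row):
            problems.append(f"Row {row_number} has an empty cell.")
    return problems
```

Returning every problem at once, rather than stopping at the first, lets the user fix the file in a single pass.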
Set clear expectations on accepted file formats for uploads
Before the user tries to upload a file, state clearly which file formats are accepted: CSV, JSON, and even Google Sheets links.
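The same list of accepted formats can drive a check that runs as soon as the user picks a file or pastes a link, so an unsupported upload is rejected immediately with a helpful message. This is a minimal sketch. The function name `describe_upload` and the rule used to recognise a Google Sheets link (host `docs.google.com`, path starting with `/spreadsheets/`) are assumptions for illustration.

```python
from pathlib import Path
from urllib.parse import urlparse

# Assumed list of accepted file extensions, mirroring the formats named above.
ACCEPTED_EXTENSIONS = {".csv", ".json"}


def describe_upload(source):
    """Return 'csv', 'json', or 'google_sheet' for an accepted source, else None."""
    parsed = urlparse(source)
    if parsed.scheme in ("http", "https"):
        # Assumption: only Google Sheets links are accepted as URLs.
        if parsed.netloc == "docs.google.com" and parsed.path.startswith("/spreadsheets/"):
            return "google_sheet"
        return None

    # Otherwise treat the source as a file name and check its extension.
    suffix = Path(source).suffix.lower()
    if suffix in ACCEPTED_EXTENSIONS:
        return suffix.lstrip(".")
    return None
```

When the function returns `None`, the interface can repeat the list of accepted formats instead of showing a generic error.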
Show preview of uploaded datasets
Once the user has uploaded a dataset, make sure they can preview it before moving on to the next step. Users often work with multiple datasets, and letting them double-check their uploaded files reduces careless mistakes and accidental uploads of the wrong file.
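A preview does not need to load the whole file. Reading the header and the first few rows is usually enough for the user to confirm they picked the right dataset. The sketch below shows one way to do this for CSV content. The function names and the default of five rows are illustrative assumptions.

```python
import csv
import io
from itertools import islice


def preview_rows(content, limit=5):
    """Return the header and at most `limit` data rows from CSV text."""
    reader = csv.reader(io.StringIO(content))
    header = next(reader, [])
    rows = list(islice(reader, limit))
    return header, rows


def format_preview(header, rows):
    """Render the preview as a plain-text table with aligned columns."""
    table = [header] + rows
    columns = range(len(header))
    # Each column is as wide as its longest cell; missing cells count as empty.
    widths = [max(len(r[i]) if i < len(r) else 0 for r in table) for i in columns]
    lines = []
    for r in table:
        cells = [(r[i] if i < len(r) else "").ljust(widths[i]) for i in columns]
        lines.append(" | ".join(cells).rstrip())
    return "\n".join(lines)
```

For example, `format_preview(*preview_rows(text, limit=10))` produces a small table that can be shown next to the file name before the user confirms the upload.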
For the smoothest user experience when training NLP models, build a spreadsheet or table feature that lets users create and edit datasets on the platform itself, that is, some form of built-in spreadsheet capability.
Without this tool, users must download a file to make any change, however minor, and then upload it again. Letting users edit the dataset directly on the platform removes these extra steps.