Text Tasks
Train models for text classification, text regression, token classification, sequence-to-sequence, extractive question answering, and sentence embeddings.
Text Classification
Quick Start
aitraining text-classification \
--model bert-base-uncased \
--data-path ./reviews.csv \
--text-column text \
--target-column label \
--project-name sentiment-model
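Before launching a run, it can help to confirm that the CSV actually contains the columns named in `--text-column` and `--target-column`. The following is a minimal sketch using only Python's standard library; the helper name `summarize_labels` and the sample rows are hypothetical and not part of the CLI.

```python
import csv
import io
from collections import Counter


def summarize_labels(csv_file, text_column="text", target_column="label"):
    """Count labels in a CSV, failing early if an expected column is absent."""
    reader = csv.DictReader(csv_file)
    header = reader.fieldnames or []
    missing = [name for name in (text_column, target_column) if name not in header]
    if missing:
        raise ValueError(f"missing columns: {missing}")
    counts = Counter()
    for row in reader:
        text = row[text_column] or ""
        if text.strip():  # ignore rows with empty text
            counts[row[target_column]] += 1
    return counts


# Hypothetical sample standing in for ./reviews.csv
sample = io.StringIO(
    "text,label\n"
    '"Great product, works well",positive\n'
    "Broke after a day,negative\n"
    "Love it,positive\n"
)
print(summarize_labels(sample))
```

For a real file, pass an open handle instead, for example `open("reviews.csv", newline="", encoding="utf-8")`. A heavily skewed label count is worth noticing before training starts.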
Parameters
| Parameter | Description | Default |
|---|---|---|
| --model | Base model | bert-base-uncased |
| --data-path | Path to data (CSV, JSON, HF dataset) | None (required) |
| --project-name | Output directory | project-name |
| --text-column | Column with text | text |
| --target-column | Column with labels | target |
| --epochs | Training epochs | 3 |
| --batch-size | Batch size | 8 |
| --lr | Learning rate | 5e-5 |
| --max-seq-length | Maximum sequence length | 128 |
| --warmup-ratio | Warmup proportion | 0.1 |
| --weight-decay | Weight decay | 0.0 |
| --early-stopping-patience | Early stopping patience | 5 |
| --early-stopping-threshold | Early stopping threshold | 0.01 |
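The table does not spell out how the patience and threshold interact. One plausible reading, similar in spirit to the early stopping callback in Hugging Face Transformers, is that training stops once the evaluation metric has failed to improve by more than the threshold for `patience` consecutive evaluations. The sketch below illustrates that reading only; the function `stopping_index`, the lower-is-better metric, and the loss values are assumptions, not the tool's confirmed behavior.

```python
def stopping_index(eval_losses, patience=5, threshold=0.01):
    """Return the index of the evaluation at which training would stop, or None.

    Assumes a lower-is-better metric (such as evaluation loss) and that an
    improvement only counts when it exceeds `threshold`.
    """
    best = None
    stale = 0  # consecutive evaluations without a sufficient improvement
    for index, loss in enumerate(eval_losses):
        if best is None or (best - loss) > threshold:
            best = loss
            stale = 0
        else:
            stale += 1
            if stale >= patience:
                return index
    return None


# Made-up loss curve: one large improvement, then a long plateau.
losses = [1.0, 0.8, 0.799, 0.798, 0.797, 0.796, 0.795, 0.5]
print(stopping_index(losses))
```

Under these assumptions the run stops during the plateau and never reaches the final evaluation. Raising the patience or lowering the threshold makes stopping less eager.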
Example: Sentiment Analysis
aitraining text-classification \
--model distilbert-base-uncased \
--data-path ./sentiment.csv \
--text-column review \
--target-column sentiment \
--project-name sentiment \
--epochs 5 \
--batch-size 16
Text Regression
For predicting continuous values from text.
Quick Start
aitraining text-regression \
--model bert-base-uncased \
--data-path ./scores.csv \
--text-column text \
--target-column score \
--project-name score-predictor
Example: Rating Prediction
aitraining text-regression \
--model microsoft/deberta-v3-base \
--data-path ./reviews.csv \
--text-column review_text \
--target-column rating \
--project-name rating-predictor \
--epochs 10
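Regression needs a numeric target, so a stray string or empty cell in the target column is worth catching before training. This is a small sketch with the standard library; `check_numeric_targets` and the sample data are hypothetical, and the tool itself may handle bad values differently.

```python
import csv
import io
import math


def check_numeric_targets(csv_file, target_column="score"):
    """Return (values, bad_rows), where bad_rows are 1-based data row numbers."""
    values, bad_rows = [], []
    for row_number, row in enumerate(csv.DictReader(csv_file), start=1):
        raw = row.get(target_column)
        try:
            value = float(raw)
        except (TypeError, ValueError):
            bad_rows.append(row_number)  # missing, empty, or non-numeric cell
            continue
        if math.isfinite(value):
            values.append(value)
        else:
            bad_rows.append(row_number)  # nan or inf
    return values, bad_rows


# Hypothetical sample standing in for ./scores.csv
sample = io.StringIO("text,score\nGood,4.5\nBad,one\nOkay,3\nEmpty,\n")
values, bad_rows = check_numeric_targets(sample)
print(values, bad_rows)
```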
Token Classification (NER)
For named entity recognition and similar tasks.
Quick Start
aitraining token-classification \
--model bert-base-cased \
--data-path ./ner_data.json \
--tokens-column tokens \
--tags-column ner_tags \
--project-name ner-model
Your data should have tokenized text and corresponding tags:
{
"tokens": ["John", "lives", "in", "New", "York"],
"ner_tags": ["B-PER", "O", "O", "B-LOC", "I-LOC"]
}
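Each example must have exactly one tag per token. The sketch below checks that, and also checks that every `I-` tag continues an entity of the same type, assuming the BIO scheme used in the example above. The helper `validate_example` is hypothetical, and the CLI may accept other tagging schemes.

```python
def validate_example(tokens, tags):
    """Return a list of problems found in one token-classification example."""
    problems = []
    if len(tokens) != len(tags):
        problems.append(f"length mismatch: {len(tokens)} tokens vs {len(tags)} tags")
    previous = "O"
    for index, tag in enumerate(tags):
        if tag.startswith("I-"):
            entity = tag[2:]
            # An inside tag must follow a B- or I- tag of the same entity type.
            if previous not in (f"B-{entity}", f"I-{entity}"):
                problems.append(f"tag {index} ({tag}) does not continue a {entity} entity")
        previous = tag
    return problems


example = {
    "tokens": ["John", "lives", "in", "New", "York"],
    "ner_tags": ["B-PER", "O", "O", "B-LOC", "I-LOC"],
}
print(validate_example(example["tokens"], example["ner_tags"]))
```

An empty list means the example passed both checks.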
Parameters
| Parameter | Description | Default |
|---|---|---|
| --tokens-column | Column with token lists | tokens |
| --tags-column | Column with tag lists | tags |
| --max-seq-length | Maximum sequence length | 128 |
Example: Custom NER
aitraining token-classification \
--model bert-base-cased \
--data-path ./custom_entities.json \
--tokens-column words \
--tags-column labels \
--project-name custom-ner \
--epochs 5 \
--batch-size 16
Sequence-to-Sequence
For translation, summarization, and similar tasks.
Quick Start
aitraining seq2seq \
--model t5-small \
--data-path ./translations.csv \
--text-column source \
--target-column target \
--project-name translator
Parameters
| Parameter | Description | Default |
|---|---|---|
| --model | Base model | google/flan-t5-base |
| --text-column | Source text column | text |
| --target-column | Target text column | target |
| --max-seq-length | Max source sequence length | 128 |
| --max-target-length | Max target sequence length | 128 |
| --batch-size | Batch size | 2 |
| --epochs | Training epochs | 3 |
| --lr | Learning rate | 5e-5 |
Example: Summarization
aitraining seq2seq \
--model facebook/bart-base \
--data-path ./articles.csv \
--text-column article \
--target-column summary \
--project-name summarizer \
--epochs 3 \
--max-seq-length 1024 \
--max-target-length 128
Extractive Question Answering
For question answering from context.
Quick Start
aitraining extractive-qa \
--model bert-base-uncased \
--data-path ./squad_format.json \
--project-name qa-model
Parameters
| Parameter | Description | Default |
|---|---|---|
| --text-column | Context column | context |
| --question-column | Question column | question |
| --answer-column | Answers column | answers |
| --max-seq-length | Max sequence length | 128 |
| --max-doc-stride | Document stride for chunking | 128 |
SQuAD-style format:
{
"context": "Paris is the capital of France.",
"question": "What is the capital of France?",
"answers": {
"text": ["Paris"],
"answer_start": [0]
}
}
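In SQuAD-style data, `answer_start` is a character offset into `context`, so each answer text should be recoverable by slicing the context at that offset. A minimal consistency check, with the hypothetical helper `check_answers`, could look like this:

```python
def check_answers(example):
    """Return a list of problems with the answer offsets of one QA example."""
    context = example["context"]
    texts = example["answers"]["text"]
    starts = example["answers"]["answer_start"]
    problems = []
    if len(texts) != len(starts):
        problems.append(f"{len(texts)} answer texts but {len(starts)} start offsets")
    for text, start in zip(texts, starts):
        found = context[start:start + len(text)]
        if found != text:
            problems.append(f"offset {start} yields {found!r}, expected {text!r}")
    return problems


example = {
    "context": "Paris is the capital of France.",
    "question": "What is the capital of France?",
    "answers": {"text": ["Paris"], "answer_start": [0]},
}
print(check_answers(example))
```

Offsets that drift after whitespace cleanup or re-encoding are a common source of silently wrong labels, so a check like this is cheap insurance.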
Sentence Transformers
For training sentence embeddings.
Quick Start
aitraining sentence-transformers \
--model sentence-transformers/all-MiniLM-L6-v2 \
--data-path ./pairs.csv \
--project-name embeddings
Parameters
| Parameter | Description | Default |
|---|---|---|
| --trainer | Training mode | pair_score |
| --sentence1-column | First sentence column | sentence1 |
| --sentence2-column | Second sentence column | sentence2 |
| --target-column | Score/label column | None |
| --max-seq-length | Max sequence length | 128 |
| --batch-size | Batch size | 8 |
| --epochs | Training epochs | 3 |
| --lr | Learning rate | 3e-5 |
Sentence pairs with similarity scores:
sentence1,sentence2,score
"The cat sits.","The feline rests.",0.9
"I love pizza","The sky is blue",0.1
Common Options
All text tasks share these options:
| Option | Description | Default |
|---|---|---|
| --push-to-hub | Upload to Hugging Face Hub | False |
| --username | HF username (required if pushing) | None |
| --token | HF token (required if pushing) | None |
| --log | Logging: wandb, tensorboard, none | wandb |
When using --push-to-hub, the repository is created as private by default at {username}/{project-name}.
Next Steps