
Text Tasks

Train models for text classification, text regression, token classification, sequence-to-sequence tasks, extractive question answering, and sentence embeddings.

Text Classification

Quick Start

aitraining text-classification \
  --model bert-base-uncased \
  --data-path ./reviews.csv \
  --text-column text \
  --target-column label \
  --project-name sentiment-model
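As a sketch of what `./reviews.csv` might look like, the file can be written with Python's `csv` module (the review texts below are made up; the header names match the `--text-column` and `--target-column` values used in this quick start):

```python
import csv

# Hypothetical example rows; real training data would have many more.
rows = [
    {"text": "Great product, would buy again.", "label": "positive"},
    {"text": "Arrived broken and support was slow.", "label": "negative"},
]

# The header row must match the --text-column and --target-column flags.
with open("reviews.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.DictWriter(f, fieldnames=["text", "label"])
    writer.writeheader()
    writer.writerows(rows)
```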

Parameters

| Parameter | Description | Default |
| --- | --- | --- |
| `--model` | Base model | `bert-base-uncased` |
| `--data-path` | Path to data (CSV, JSON, HF dataset) | None (required) |
| `--project-name` | Output directory | `project-name` |
| `--text-column` | Column with text | `text` |
| `--target-column` | Column with labels | `target` |
| `--epochs` | Training epochs | 3 |
| `--batch-size` | Batch size | 8 |
| `--lr` | Learning rate | 5e-5 |
| `--max-seq-length` | Maximum sequence length | 128 |
| `--warmup-ratio` | Warmup proportion | 0.1 |
| `--weight-decay` | Weight decay | 0.0 |
| `--early-stopping-patience` | Early stopping patience | 5 |
| `--early-stopping-threshold` | Early stopping threshold | 0.01 |

Example: Sentiment Analysis

aitraining text-classification \
  --model distilbert-base-uncased \
  --data-path ./sentiment.csv \
  --text-column review \
  --target-column sentiment \
  --project-name sentiment \
  --epochs 5 \
  --batch-size 16

Text Regression

For predicting continuous values from text.

Quick Start

aitraining text-regression \
  --model bert-base-uncased \
  --data-path ./scores.csv \
  --text-column text \
  --target-column score \
  --project-name score-predictor

Example: Rating Prediction

aitraining text-regression \
  --model microsoft/deberta-v3-base \
  --data-path ./reviews.csv \
  --text-column review_text \
  --target-column rating \
  --project-name rating-predictor \
  --epochs 10

Token Classification (NER)

For named entity recognition and similar tasks.

Quick Start

aitraining token-classification \
  --model bert-base-cased \
  --data-path ./ner_data.json \
  --tokens-column tokens \
  --tags-column ner_tags \
  --project-name ner-model

Data Format

Your data should have tokenized text and corresponding tags:
{
  "tokens": ["John", "lives", "in", "New", "York"],
  "ner_tags": ["B-PER", "O", "O", "B-LOC", "I-LOC"]
}
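The tokens and tags must stay aligned one-to-one. A minimal validation sketch (the `check_record` helper is hypothetical, not part of the CLI) that catches misaligned records before training:

```python
# One record in the format shown above.
record = {
    "tokens": ["John", "lives", "in", "New", "York"],
    "ner_tags": ["B-PER", "O", "O", "B-LOC", "I-LOC"],
}

def check_record(rec):
    # Every token needs exactly one tag; a length mismatch means the
    # record is misaligned and will break label assignment.
    if len(rec["tokens"]) != len(rec["ner_tags"]):
        raise ValueError("tokens and ner_tags must be the same length")
    return True

check_record(record)
```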

Parameters

| Parameter | Description | Default |
| --- | --- | --- |
| `--tokens-column` | Column with token lists | `tokens` |
| `--tags-column` | Column with tag lists | `tags` |
| `--max-seq-length` | Maximum sequence length | 128 |

Example: Custom NER

aitraining token-classification \
  --model bert-base-cased \
  --data-path ./custom_entities.json \
  --tokens-column words \
  --tags-column labels \
  --project-name custom-ner \
  --epochs 5 \
  --batch-size 16

Sequence-to-Sequence

For translation, summarization, and similar tasks.

Quick Start

aitraining seq2seq \
  --model t5-small \
  --data-path ./translations.csv \
  --text-column source \
  --target-column target \
  --project-name translator

Parameters

| Parameter | Description | Default |
| --- | --- | --- |
| `--model` | Base model | `google/flan-t5-base` |
| `--text-column` | Source text column | `text` |
| `--target-column` | Target text column | `target` |
| `--max-seq-length` | Max source sequence length | 128 |
| `--max-target-length` | Max target sequence length | 128 |
| `--batch-size` | Batch size | 2 |
| `--epochs` | Training epochs | 3 |
| `--lr` | Learning rate | 5e-5 |

Example: Summarization

aitraining seq2seq \
  --model facebook/bart-base \
  --data-path ./articles.csv \
  --text-column article \
  --target-column summary \
  --project-name summarizer \
  --epochs 3 \
  --max-seq-length 1024 \
  --max-target-length 128

Extractive QA

For question answering from context.

Quick Start

aitraining extractive-qa \
  --model bert-base-uncased \
  --data-path ./squad_format.json \
  --project-name qa-model

Parameters

| Parameter | Description | Default |
| --- | --- | --- |
| `--text-column` | Context column | `context` |
| `--question-column` | Question column | `question` |
| `--answer-column` | Answers column | `answers` |
| `--max-seq-length` | Max sequence length | 128 |
| `--max-doc-stride` | Document stride for chunking | 128 |

Data Format

SQuAD-style format:
{
  "context": "Paris is the capital of France.",
  "question": "What is the capital of France?",
  "answers": {
    "text": ["Paris"],
    "answer_start": [0]
  }
}
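In this format, each `answer_start` is a character offset into `context` that must point at the exact answer text. A small sanity-check sketch (the `spans_match` helper is hypothetical, written here for illustration):

```python
example = {
    "context": "Paris is the capital of France.",
    "question": "What is the capital of France?",
    "answers": {"text": ["Paris"], "answer_start": [0]},
}

def spans_match(ex):
    # Each answer_start offset must slice out exactly the answer text.
    ctx = ex["context"]
    return all(
        ctx[start:start + len(ans)] == ans
        for ans, start in zip(ex["answers"]["text"],
                              ex["answers"]["answer_start"])
    )
```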

Sentence Transformers

For training sentence embeddings.

Quick Start

aitraining sentence-transformers \
  --model sentence-transformers/all-MiniLM-L6-v2 \
  --data-path ./pairs.csv \
  --project-name embeddings

Parameters

| Parameter | Description | Default |
| --- | --- | --- |
| `--trainer` | Training mode | `pair_score` |
| `--sentence1-column` | First sentence column | `sentence1` |
| `--sentence2-column` | Second sentence column | `sentence2` |
| `--target-column` | Score/label column | None |
| `--max-seq-length` | Max sequence length | 128 |
| `--batch-size` | Batch size | 8 |
| `--epochs` | Training epochs | 3 |
| `--lr` | Learning rate | 3e-5 |

Data Format

Sentence pairs with similarity scores:
sentence1,sentence2,score
"The cat sits.","The feline rests.",0.9
"I love pizza","The sky is blue",0.1
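Quoting mistakes are easy to make when writing CSV by hand; as a sketch, generating the pairs file with Python's `csv` module handles quoting automatically (the example pairs match the snippet above):

```python
import csv

pairs = [
    ("The cat sits.", "The feline rests.", 0.9),
    ("I love pizza", "The sky is blue", 0.1),
]

with open("pairs.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.writer(f)  # default QUOTE_MINIMAL quotes only when needed
    writer.writerow(["sentence1", "sentence2", "score"])
    writer.writerows(pairs)
```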

Common Options

All text tasks share these options:
| Option | Description | Default |
| --- | --- | --- |
| `--push-to-hub` | Upload to Hugging Face Hub | False |
| `--username` | HF username (required if pushing) | None |
| `--token` | HF token (required if pushing) | None |
| `--log` | Logging: `wandb`, `tensorboard`, `none` | `wandb` |
When using `--push-to-hub`, the repository is created as private by default at `{username}/{project-name}`.

Next Steps