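Text Tasks

Train models for text classification, text regression, token classification, sequence-to-sequence, extractive question answering, and sentence embedding tasks.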

Text Classification

Quick Start

aitraining text-classification \
  --model bert-base-uncased \
  --data-path ./reviews.csv \
  --text-column text \
  --target-column label \
  --project-name sentiment-model
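
The command above reads a CSV with one text column and one label column. As a minimal sketch of preparing such a file with Python's standard library: only the column names `text` and `label` come from the command; the rows, the label values, and the helper names are hypothetical.

```python
import csv

# Hypothetical example rows; only the column names match the command above.
rows = [
    {"text": "Great product, works as advertised.", "label": "positive"},
    {"text": "Stopped working after two days.", "label": "negative"},
    {"text": "Does the job, nothing special.", "label": "neutral"},
]


def write_classification_csv(path, rows):
    """Write rows to a CSV file with 'text' and 'label' columns."""
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=["text", "label"])
        writer.writeheader()
        writer.writerows(rows)


def read_classification_csv(path):
    """Read the CSV back as a list of dicts, one per row."""
    with open(path, newline="", encoding="utf-8") as f:
        return list(csv.DictReader(f))


if __name__ == "__main__":
    write_classification_csv("reviews.csv", rows)
```

The `csv` module handles quoting for you, which matters when review text contains commas or quotation marks.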

Parameters

| Parameter | Description | Default |
| --- | --- | --- |
| --model | Base model | bert-base-uncased |
| --data-path | Path to data (CSV, JSON, HF dataset) | None (required) |
| --project-name | Output directory | project-name |
| --text-column | Column with text | text |
| --target-column | Column with labels | target |
| --epochs | Training epochs | 3 |
| --batch-size | Batch size | 8 |
| --lr | Learning rate | 5e-5 |
| --max-seq-length | Maximum sequence length | 128 |
| --warmup-ratio | Warmup proportion | 0.1 |
| --weight-decay | Weight decay | 0.0 |
| --early-stopping-patience | Early stopping patience | 5 |
| --early-stopping-threshold | Early stopping threshold | 0.01 |
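
The table does not say how `--early-stopping-patience` and `--early-stopping-threshold` interact. The sketch below shows one common convention, loosely modeled on early-stopping callbacks in training libraries: training stops once the evaluation loss has failed to improve by more than the threshold for `patience` consecutive evaluations. Whether this tool behaves exactly this way is an assumption, and `should_stop` is a hypothetical helper.

```python
def should_stop(eval_losses, patience=5, threshold=0.01):
    """Return the index of the evaluation at which training would stop.

    Returns None if the stopping condition is never met.
    Assumes a lower loss is better.
    """
    best = float("inf")
    stale = 0  # consecutive evaluations without a large enough improvement
    for i, loss in enumerate(eval_losses):
        if best - loss > threshold:
            best = loss
            stale = 0
        else:
            stale += 1
            if stale >= patience:
                return i
    return None


# The loss plateaus after the second evaluation, so the counter starts at index 2.
losses = [1.0, 0.8, 0.799, 0.798, 0.797, 0.796, 0.795]
print(should_stop(losses, patience=5, threshold=0.01))
```

Under this convention, a larger threshold makes stopping more aggressive, while a larger patience tolerates longer plateaus.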

Example: Sentiment Analysis

aitraining text-classification \
  --model distilbert-base-uncased \
  --data-path ./sentiment.csv \
  --text-column review \
  --target-column sentiment \
  --project-name sentiment \
  --epochs 5 \
  --batch-size 16

Text Regression

For predicting continuous values from text.

Quick Start

aitraining text-regression \
  --model bert-base-uncased \
  --data-path ./scores.csv \
  --text-column text \
  --target-column score \
  --project-name score-predictor

Example: Rating Prediction

aitraining text-regression \
  --model microsoft/deberta-v3-base \
  --data-path ./reviews.csv \
  --text-column review_text \
  --target-column rating \
  --project-name rating-predictor \
  --epochs 10

Token Classification (NER)

For named entity recognition and similar tasks.

Quick Start

aitraining token-classification \
  --model bert-base-cased \
  --data-path ./ner_data.json \
  --tokens-column tokens \
  --tags-column ner_tags \
  --project-name ner-model

Data Format

Your data should contain tokenized text and the corresponding tags, one tag per token:
{
  "tokens": ["John", "lives", "in", "New", "York"],
  "ner_tags": ["B-PER", "O", "O", "B-LOC", "I-LOC"]
}
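
Because each token needs exactly one tag, it can help to check records before training. The following is a minimal sketch, not part of the tool: `validate_ner_record` is a hypothetical helper, and the BIO consistency check assumes your tags follow the `B-`/`I-`/`O` scheme used in the example above.

```python
import json

VALID_PREFIXES = ("B-", "I-")


def validate_ner_record(record, tokens_key="tokens", tags_key="ner_tags"):
    """Return a list of problems found in one record (empty if none)."""
    tokens = record.get(tokens_key)
    tags = record.get(tags_key)
    if not isinstance(tokens, list) or not isinstance(tags, list):
        return [f"'{tokens_key}' and '{tags_key}' must both be lists"]

    problems = []
    if len(tokens) != len(tags):
        problems.append(f"{len(tokens)} tokens but {len(tags)} tags")

    previous = "O"
    for i, tag in enumerate(tags):
        if tag != "O" and not tag.startswith(VALID_PREFIXES):
            problems.append(f"tag {i} ({tag!r}) is not O, B-*, or I-*")
        # In the BIO scheme, an I- tag continues an entity of the same type.
        if tag.startswith("I-") and previous[2:] != tag[2:]:
            problems.append(f"tag {i} ({tag!r}) does not continue the previous entity")
        previous = tag
    return problems


record = json.loads(
    '{"tokens": ["John", "lives", "in", "New", "York"],'
    ' "ner_tags": ["B-PER", "O", "O", "B-LOC", "I-LOC"]}'
)
print(validate_ner_record(record))  # an empty list means no problems were found
```

If your columns are named differently (for example `words` and `labels`), pass them as `tokens_key` and `tags_key`.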

Parameters

| Parameter | Description | Default |
| --- | --- | --- |
| --tokens-column | Column with token lists | tokens |
| --tags-column | Column with tag lists | tags |
| --max-seq-length | Maximum sequence length | 128 |

Example: Custom NER

aitraining token-classification \
  --model bert-base-cased \
  --data-path ./custom_entities.json \
  --tokens-column words \
  --tags-column labels \
  --project-name custom-ner \
  --epochs 5 \
  --batch-size 16

Sequence-to-Sequence

For translation, summarization, and similar tasks.

Quick Start

aitraining seq2seq \
  --model t5-small \
  --data-path ./translations.csv \
  --text-column source \
  --target-column target \
  --project-name translator

Parameters

| Parameter | Description | Default |
| --- | --- | --- |
| --model | Base model | google/flan-t5-base |
| --text-column | Source text column | text |
| --target-column | Target text column | target |
| --max-seq-length | Max source sequence length | 128 |
| --max-target-length | Max target sequence length | 128 |
| --batch-size | Batch size | 2 |
| --epochs | Training epochs | 3 |
| --lr | Learning rate | 5e-5 |

Example: Summarization

aitraining seq2seq \
  --model facebook/bart-base \
  --data-path ./articles.csv \
  --text-column article \
  --target-column summary \
  --project-name summarizer \
  --epochs 3 \
  --max-seq-length 1024 \
  --max-target-length 128

Extractive QA

For answering questions from a given context.

Quick Start

aitraining extractive-qa \
  --model bert-base-uncased \
  --data-path ./squad_format.json \
  --project-name qa-model

Parameters

| Parameter | Description | Default |
| --- | --- | --- |
| --text-column | Context column | context |
| --question-column | Question column | question |
| --answer-column | Answers column | answers |
| --max-seq-length | Max sequence length | 128 |
| --max-doc-stride | Document stride for chunking | 128 |

Data Format

SQuAD-style format:
{
  "context": "Paris is the capital of France.",
  "question": "What is the capital of France?",
  "answers": {
    "text": ["Paris"],
    "answer_start": [0]
  }
}
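
In SQuAD-style data, `answer_start` is the character offset of the answer within `context`, so a misaligned offset points at the wrong span. A minimal sketch of a pre-training sanity check follows; `check_squad_record` is a hypothetical helper, not part of the tool.

```python
def check_squad_record(record):
    """Return True if every answer text appears in the context at its answer_start."""
    context = record["context"]
    texts = record["answers"]["text"]
    starts = record["answers"]["answer_start"]
    if len(texts) != len(starts):
        return False
    # Slice the context at each offset and compare with the stated answer text.
    return all(context[s:s + len(t)] == t for t, s in zip(texts, starts))


record = {
    "context": "Paris is the capital of France.",
    "question": "What is the capital of France?",
    "answers": {"text": ["Paris"], "answer_start": [0]},
}
print(check_squad_record(record))
```

Offsets often drift when the context is cleaned or re-encoded after annotation, so running a check like this over the whole dataset can catch problems early.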

Sentence Transformers

For training sentence embeddings.

Quick Start

aitraining sentence-transformers \
  --model sentence-transformers/all-MiniLM-L6-v2 \
  --data-path ./pairs.csv \
  --project-name embeddings

Parameters

| Parameter | Description | Default |
| --- | --- | --- |
| --trainer | Training mode | pair_score |
| --sentence1-column | First sentence column | sentence1 |
| --sentence2-column | Second sentence column | sentence2 |
| --target-column | Score/label column | None |
| --max-seq-length | Max sequence length | 128 |
| --batch-size | Batch size | 8 |
| --epochs | Training epochs | 3 |
| --lr | Learning rate | 3e-5 |

Data Format

Sentence pairs with similarity scores:
sentence1,sentence2,score
"The cat sits.","The feline rests.",0.9
"I love pizza","The sky is blue",0.1

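A stray or missing quotation mark in a pair file can silently shift or corrupt fields, so it is worth parsing the file once before training. The sketch below uses Python's `csv` module and assumes the three column names `sentence1`, `sentence2`, and `score`; `load_pairs` is a hypothetical helper, not part of the tool.

```python
import csv
import io

SAMPLE = """sentence1,sentence2,score
"The cat sits.","The feline rests.",0.9
"I love pizza","The sky is blue",0.1
"""


def load_pairs(csv_text):
    """Parse sentence-pair CSV text into (sentence1, sentence2, score) tuples."""
    reader = csv.DictReader(io.StringIO(csv_text))
    pairs = []
    for row in reader:
        # DictReader stores extra fields under a None key and fills
        # missing fields with None values.
        if None in row or any(value is None for value in row.values()):
            raise ValueError(f"line {reader.line_num}: expected exactly 3 fields")
        # float() raises ValueError if the score is not numeric.
        pairs.append((row["sentence1"], row["sentence2"], float(row["score"])))
    return pairs


for s1, s2, score in load_pairs(SAMPLE):
    print(f"{score:.1f}  {s1!r} vs {s2!r}")
```

To check a file on disk, read it with `open(path, newline="", encoding="utf-8").read()` and pass the text to `load_pairs`.
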
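Common Options

All text tasks share these options:

| Option | Description | Default |
| --- | --- | --- |
| --push-to-hub | Upload to Hugging Face Hub | False |
| --username | HF username (required if pushing) | None |
| --token | HF token (required if pushing) | None |
| --log | Logging: wandb, tensorboard, none | wandb |

When you use --push-to-hub, the repository is created as private by default at {username}/{project-name}.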
