Text Tasks
Train models for text classification, regression, and token classification.
Text Classification
Quick Start
aitraining text-classification \
--model bert-base-uncased \
--data-path ./reviews.csv \
--text-column text \
--target-column label \
--project-name sentiment-model
Parameters
| Parameter | Description | Default |
|---|---|---|
| --model | Base model | bert-base-uncased |
| --data-path | Path to data (CSV, JSON, HF dataset) | None (required) |
| --project-name | Output directory | project-name |
| --text-column | Column with text | text |
| --target-column | Column with labels | target |
| --epochs | Training epochs | 3 |
| --batch-size | Batch size | 8 |
| --lr | Learning rate | 5e-5 |
| --max-seq-length | Maximum sequence length | 128 |
| --warmup-ratio | Warmup proportion | 0.1 |
| --weight-decay | Weight decay | 0.0 |
| --early-stopping-patience | Early stopping patience | 5 |
| --early-stopping-threshold | Early stopping threshold | 0.01 |
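To make the warmup proportion concrete, the following sketch estimates how many optimizer steps a run takes and how many of them fall in the warmup phase. It assumes a single device, no gradient accumulation, and that the warmup proportion is applied to the total number of optimizer steps; none of this is confirmed for aitraining itself, and the helper name `warmup_steps` is hypothetical.

```python
import math


def warmup_steps(num_examples: int, batch_size: int = 8, epochs: int = 3,
                 warmup_ratio: float = 0.1) -> tuple[int, int]:
    """Estimate (total_steps, warmup_steps) for one training run.

    Assumption: one optimizer step per batch on a single device.
    """
    steps_per_epoch = math.ceil(num_examples / batch_size)
    total_steps = steps_per_epoch * epochs
    # Rounded estimate; a real trainer may round up or down instead.
    return total_steps, round(total_steps * warmup_ratio)


total, warmup = warmup_steps(num_examples=10_000)
print(f"total steps: {total}, warmup steps: {warmup}")
```

With the defaults from the table and 10,000 training rows, this gives 3,750 steps, of which about 375 are warmup.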
Example: Sentiment Analysis
aitraining text-classification \
--model distilbert-base-uncased \
--data-path ./sentiment.csv \
--text-column review \
--target-column sentiment \
--project-name sentiment \
--epochs 5 \
--batch-size 16
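Before training a classifier, it can help to look at how the labels are distributed. The sketch below counts the values in the label column of a CSV; the sample rows and the helper name `label_counts` are illustrative and not part of aitraining.

```python
import csv
import io
from collections import Counter


def label_counts(csv_text: str, target_column: str = "sentiment") -> Counter:
    """Count how often each label appears in the target column."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return Counter(row[target_column] for row in reader)


# Made-up sample using the column names from the example above.
sample = (
    "review,sentiment\n"
    "Loved it,positive\n"
    "Hated it,negative\n"
    "So good,positive\n"
)
print(label_counts(sample))
```

A heavily skewed count is a hint that accuracy alone may be a misleading metric for the resulting model.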
Text Regression
For predicting continuous values from text.
Quick Start
aitraining text-regression \
--model bert-base-uncased \
--data-path ./scores.csv \
--text-column text \
--target-column score \
--project-name score-predictor
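Because regression targets must be continuous values, a quick check that every entry in the target column parses as a finite number can catch problems before training starts. This is a minimal sketch with made-up sample rows; `non_numeric_rows` is a hypothetical helper, not an aitraining function.

```python
import csv
import io
import math


def non_numeric_rows(csv_text: str, target_column: str = "score") -> list[int]:
    """Return 1-based data-row numbers whose target is not a finite float."""
    bad = []
    reader = csv.DictReader(io.StringIO(csv_text))
    for row_number, row in enumerate(reader, start=1):
        try:
            value = float(row[target_column])
        except (KeyError, TypeError, ValueError):
            # Missing column, missing field, or non-numeric text.
            bad.append(row_number)
            continue
        if not math.isfinite(value):
            bad.append(row_number)
    return bad


sample = "text,score\nGreat product,4.5\nTerrible,n/a\nOkay,3\n"
print(non_numeric_rows(sample))
```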
Example: Rating Prediction
aitraining text-regression \
--model microsoft/deberta-v3-base \
--data-path ./reviews.csv \
--text-column review_text \
--target-column rating \
--project-name rating-predictor \
--epochs 10
Token Classification (NER)
For named entity recognition and similar tasks.
Quick Start
aitraining token-classification \
--model bert-base-cased \
--data-path ./ner_data.json \
--tokens-column tokens \
--tags-column ner_tags \
--project-name ner-model
Data Format
Your data should contain tokenized text and the corresponding tags, with one tag per token:
{
  "tokens": ["John", "lives", "in", "New", "York"],
  "ner_tags": ["B-PER", "O", "O", "B-LOC", "I-LOC"]
}
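A record in this format is only usable if the token and tag lists line up. The sketch below checks that both lists have the same length and, assuming the B-/I-/O tagging scheme used in the sample, that every `I-` tag continues an entity of the same type. The helper `check_ner_record` is hypothetical, and aitraining may accept other tagging schemes.

```python
def check_ner_record(record: dict, tokens_column: str = "tokens",
                     tags_column: str = "ner_tags") -> list[str]:
    """Return a list of human-readable problems found in one NER record."""
    tokens = record.get(tokens_column)
    tags = record.get(tags_column)
    if not isinstance(tokens, list) or not isinstance(tags, list):
        return ["tokens and tags must both be lists"]

    problems = []
    if len(tokens) != len(tags):
        problems.append(f"{len(tokens)} tokens but {len(tags)} tags")

    # Assumed BIO scheme: I-X must directly follow B-X or I-X.
    previous = "O"
    for index, tag in enumerate(tags):
        if not isinstance(tag, str):
            problems.append(f"tag {index} is not a string")
            previous = "O"
            continue
        if tag.startswith("I-") and previous[2:] != tag[2:]:
            problems.append(f"tag {index} ({tag}) does not continue an entity")
        previous = tag
    return problems


record = {
    "tokens": ["John", "lives", "in", "New", "York"],
    "ner_tags": ["B-PER", "O", "O", "B-LOC", "I-LOC"],
}
print(check_ner_record(record))
```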
Parameters
| Parameter | Description | Default |
|---|---|---|
| --tokens-column | Column with token lists | tokens |
| --tags-column | Column with tag lists | tags |
| --max-seq-length | Maximum sequence length | 128 |
Example: Custom NER
aitraining token-classification \
--model bert-base-cased \
--data-path ./custom_entities.json \
--tokens-column words \
--tags-column labels \
--project-name custom-ner \
--epochs 5 \
--batch-size 16
Sequence-to-Sequence
For translation, summarization, and similar tasks.
Quick Start
aitraining seq2seq \
--model t5-small \
--data-path ./translations.csv \
--text-column source \
--target-column target \
--project-name translator
Parameters
| Parameter | Description | Default |
|---|---|---|
| --model | Base model | google/flan-t5-base |
| --text-column | Source text column | text |
| --target-column | Target text column | target |
| --max-seq-length | Max source sequence length | 128 |
| --max-target-length | Max target sequence length | 128 |
| --batch-size | Batch size | 2 |
| --epochs | Training epochs | 3 |
| --lr | Learning rate | 5e-5 |
Example: Summarization
aitraining seq2seq \
--model facebook/bart-base \
--data-path ./articles.csv \
--text-column article \
--target-column summary \
--project-name summarizer \
--epochs 3 \
--max-seq-length 1024 \
--max-target-length 128
Extractive QA
For answering questions from a given context.
Quick Start
aitraining extractive-qa \
--model bert-base-uncased \
--data-path ./squad_format.json \
--project-name qa-model
Parameters
| Parameter | Description | Default |
|---|---|---|
| --text-column | Context column | context |
| --question-column | Question column | question |
| --answer-column | Answers column | answers |
| --max-seq-length | Max sequence length | 128 |
| --max-doc-stride | Document stride for chunking | 128 |
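Data Format
SQuAD-style format, where each answer_start is the character offset of the answer text within the context:
{
  "context": "Paris is the capital of France.",
  "question": "What is the capital of France?",
  "answers": {
    "text": ["Paris"],
    "answer_start": [0]
  }
}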
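In SQuAD-style data, each `answer_start` is a character offset into `context`, so a wrong offset silently points the model at the wrong span. The sketch below verifies that every answer text actually appears at its stated offset; `misaligned_answers` is a hypothetical helper written for illustration.

```python
def misaligned_answers(example: dict) -> list[str]:
    """Return answer texts that do not appear at their stated offset."""
    context = example["context"]
    answers = example["answers"]
    bad = []
    for text, start in zip(answers["text"], answers["answer_start"]):
        # Slice the context at the claimed offset and compare.
        if context[start:start + len(text)] != text:
            bad.append(text)
    return bad


example = {
    "context": "Paris is the capital of France.",
    "question": "What is the capital of France?",
    "answers": {"text": ["Paris"], "answer_start": [0]},
}
print(misaligned_answers(example))
```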
Sentence Transformers
For training sentence embeddings.
Quick Start
aitraining sentence-transformers \
--model sentence-transformers/all-MiniLM-L6-v2 \
--data-path ./pairs.csv \
--project-name embeddings
Parameters
| Parameter | Description | Default |
|---|---|---|
| --trainer | Training mode | pair_score |
| --sentence1-column | First sentence column | sentence1 |
| --sentence2-column | Second sentence column | sentence2 |
| --target-column | Score/label column | None |
| --max-seq-length | Max sequence length | 128 |
| --batch-size | Batch size | 8 |
| --epochs | Training epochs | 3 |
| --lr | Learning rate | 3e-5 |
Data Format
Sentence pairs with similarity scores:
sentence1,sentence2,score
"The cat sits.","The feline rests.",0.9
"I love pizza","The sky is blue",0.1
Common Options
All text tasks share these options:
| Option | Description | Default |
|---|---|---|
| --push-to-hub | Upload to Hugging Face Hub | False |
| --username | HF username (required if pushing) | None |
| --token | HF token (required if pushing) | None |
| --log | Logging: wandb, tensorboard, none | wandb |
When --push-to-hub is used, the repository is created as private by default at {username}/{project-name}.
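The relationship between these options can be sketched as a small function: pushing requires both a username and a token, and the repository name is derived from the username and project name. This only mirrors the rules stated above; `hub_repo_id` is a hypothetical helper, not how aitraining implements the check.

```python
from typing import Optional


def hub_repo_id(project_name: str, push_to_hub: bool = False,
                username: Optional[str] = None,
                token: Optional[str] = None) -> Optional[str]:
    """Return the target repository id, or None when not pushing."""
    if not push_to_hub:
        return None
    missing = [flag for flag, value in (("--username", username),
                                        ("--token", token)) if not value]
    if missing:
        raise ValueError("required when pushing: " + ", ".join(missing))
    return f"{username}/{project_name}"


# "alice" and the token value are placeholders.
print(hub_repo_id("sentiment-model", push_to_hub=True,
                  username="alice", token="hf_placeholder"))
```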
Next Steps