utils.quantization
Utilities for quantization, including quantization-aware training (QAT) and post-training quantization (PTQ), built on torchao.
Functions
| Name | Description |
|---|---|
| convert_qat_model | Convert a QAT model with fake-quantized layers back to the original model. |
| get_quantization_config | Build a post-training quantization config. |
| prepare_model_for_qat | Prepare a model for QAT by swapping its linear layers with fake-quantized linear layers. |
| quantize_model | Quantize a model. |
| save_quantized_model | Save a quantized model, handling MXTensor serialization. |
convert_qat_model
utils.quantization.convert_qat_model(model, quantize_embedding=False)
Convert a QAT model with fake-quantized layers back to the original model.
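The conversion can be pictured as a module swap in the reverse direction of QAT preparation. The sketch below is hypothetical and does not use torchao: Linear, FakeQuantLinear, Model, and convert_qat_model_sketch are stand-ins invented for illustration, and embedding handling (the quantize_embedding flag) is omitted. It assumes, as the description says, that each fake-quantized layer is replaced by a plain layer that keeps the trained weights.

```python
from dataclasses import dataclass, field


@dataclass
class Linear:
    """Hypothetical stand-in for a plain linear layer."""
    weight: list


@dataclass
class FakeQuantLinear:
    """Hypothetical stand-in for a QAT layer: it keeps full-precision weights
    and would fake-quantize them in its forward pass."""
    weight: list
    group_size: int


@dataclass
class Model:
    """Hypothetical container mapping layer names to layers."""
    layers: dict = field(default_factory=dict)


def convert_qat_model_sketch(model):
    """Replace each FakeQuantLinear with a plain Linear holding the trained weights."""
    for name, layer in list(model.layers.items()):
        if isinstance(layer, FakeQuantLinear):
            model.layers[name] = Linear(weight=layer.weight)
    return model


model = Model(layers={
    "q_proj": FakeQuantLinear(weight=[0.5, -1.0], group_size=2),
    "norm": "rmsnorm",  # other layers are left untouched
})
convert_qat_model_sketch(model)
print(model.layers["q_proj"])  # Linear(weight=[0.5, -1.0])
```

Only the layer type changes; the weights learned during QAT carry over as they are.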
get_quantization_config
utils.quantization.get_quantization_config(
    weight_dtype,
    activation_dtype=None,
    group_size=None,
)
Build a post-training quantization config.
Parameters
| Name | Type | Description | Default |
|---|---|---|---|
| weight_dtype | TorchAOQuantDType | The dtype to use for weight quantization. | required |
| activation_dtype | TorchAOQuantDType \| None | The dtype to use for activation quantization. | None |
| group_size | int \| None | The group size to use for weight quantization. | None |
Returns
| Type | Description |
|---|---|
| AOBaseConfig | The post-training quantization config. |
Raises
| Type | Description |
|---|---|
| ValueError | If the activation dtype is not specified and the weight dtype is not int8 or int4, or if the group size is not specified for int8 or int4 weight-only quantization. |
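The two documented error conditions amount to a small decision rule. The plain-Python sketch below mirrors only those conditions; QuantDType and check_ptq_args are hypothetical names (QuantDType stands in for TorchAOQuantDType, whose actual members may differ), and because the documentation does not say which activation/weight combinations are accepted, the sketch accepts any pair once activation_dtype is given.

```python
from enum import Enum


class QuantDType(Enum):
    """Hypothetical stand-in for TorchAOQuantDType; its members here are illustrative."""
    INT4 = "int4"
    INT8 = "int8"
    FLOAT8 = "float8"


def check_ptq_args(weight_dtype, activation_dtype=None, group_size=None):
    """Hypothetical helper mirroring the documented ValueError conditions.

    Returns a short label for the kind of config that would be built.
    """
    if activation_dtype is None:
        # Without an activation dtype, only int8/int4 weight-only quantization is allowed.
        if weight_dtype not in (QuantDType.INT8, QuantDType.INT4):
            raise ValueError(
                f"weight-only quantization requires int8 or int4 weights, got {weight_dtype.value}"
            )
        # Weight-only int8/int4 quantization also needs an explicit group size.
        if group_size is None:
            raise ValueError("group_size is required for int8/int4 weight-only quantization")
        return f"{weight_dtype.value} weight-only, group_size={group_size}"
    # Which activation/weight pairs are valid is not documented; accept any here.
    return f"{activation_dtype.value} activations, {weight_dtype.value} weights"
```

For example, check_ptq_args(QuantDType.INT8) raises ValueError because no group_size was given, while check_ptq_args(QuantDType.INT4, group_size=32) returns "int4 weight-only, group_size=32".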
prepare_model_for_qat
utils.quantization.prepare_model_for_qat(
    model,
    weight_dtype,
    group_size=None,
    activation_dtype=None,
    quantize_embedding=False,
)
Prepare a model for QAT by swapping its linear layers with fake-quantized linear layers and, optionally, its embedding weights with fake-quantized embedding weights.
Parameters
| Name | Type | Description | Default |
|---|---|---|---|
| model | | The model to quantize. | required |
| weight_dtype | TorchAOQuantDType | The dtype to use for weight quantization. | required |
| group_size | int \| None | The group size to use for weight quantization. | None |
| activation_dtype | TorchAOQuantDType \| None | The dtype to use for activation quantization. | None |
| quantize_embedding | bool | Whether to quantize the model’s embedding weights. | False |
Raises
| Type | Description |
|---|---|
| ValueError | If the activation/weight dtype combination is invalid. |
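Fake quantization, which the swapped-in layers apply during QAT, rounds weights onto a low-bit integer grid and immediately maps them back to floating point, so training sees the quantization error while the stored weights stay in full precision. The sketch below is a minimal plain-Python illustration, assuming symmetric signed quantization with one scale per group of group_size consecutive weights; fake_quantize is a hypothetical helper, and torchao's actual schemes (for example asymmetric or per-channel variants) may differ. It also ignores gradients; QAT implementations typically pass gradients through the rounding with a straight-through estimator.

```python
def fake_quantize(weights, group_size, bits=4):
    """Quantize, then immediately dequantize, `weights` one group at a time.

    Hypothetical helper. Assumes symmetric signed quantization with one scale
    per group of `group_size` consecutive values. The result is still a list
    of floats, but each value lies on its group's integer grid.
    """
    if group_size <= 0 or len(weights) % group_size != 0:
        raise ValueError("group_size must be positive and divide len(weights)")
    qmin, qmax = -(2 ** (bits - 1)), 2 ** (bits - 1) - 1  # -8..7 for bits=4
    out = []
    for start in range(0, len(weights), group_size):
        group = weights[start:start + group_size]
        # Scale so the largest magnitude in the group maps to qmax;
        # fall back to 1.0 for an all-zero group to avoid dividing by zero.
        scale = max(abs(w) for w in group) / qmax or 1.0
        for w in group:
            q = min(max(round(w / scale), qmin), qmax)  # quantize and clamp
            out.append(q * scale)  # dequantize back to float
    return out


w = [1.75, -0.75, 0.25, 0.0, 3.5, 1.0, -3.5, 0.3]
print(fake_quantize(w, group_size=4))  # [1.75, -0.75, 0.25, 0.0, 3.5, 1.0, -3.5, 0.5]
```

In the second group the scale is 0.5, so 0.3 snaps to the nearest grid point, 0.5; a smaller group_size gives each group its own scale and usually a smaller rounding error.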
quantize_model
utils.quantization.quantize_model(
    model,
    weight_dtype,
    group_size=None,
    activation_dtype=None,
    quantize_embedding=None,
)
Quantize a model.
Parameters
| Name | Type | Description | Default |
|---|---|---|---|
| model | | The model to quantize. | required |
| weight_dtype | TorchAOQuantDType | The dtype to use for weight quantization. | required |
| group_size | int \| None | The group size to use for weight quantization. | None |
| activation_dtype | TorchAOQuantDType \| None | The dtype to use for activation quantization. | None |
| quantize_embedding | bool \| None | Whether to quantize the model’s embedding weights. | None |
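Weight-only post-training quantization typically stores low-bit integer codes plus per-group scales and dequantizes them when the layer is used. The sketch below illustrates this for a single output row of a linear layer in plain Python, assuming symmetric int8 quantization with one scale per group; quantize_weights and quantized_dot are hypothetical helpers, not part of this module or of torchao.

```python
def quantize_weights(weights, group_size, bits=8):
    """Hypothetical helper: symmetric group-wise quantization of a weight row.

    Returns the integer codes (one per weight) and one float scale per group.
    """
    if group_size <= 0 or len(weights) % group_size != 0:
        raise ValueError("group_size must be positive and divide len(weights)")
    qmax = 2 ** (bits - 1) - 1  # 127 for int8
    codes, scales = [], []
    for start in range(0, len(weights), group_size):
        group = weights[start:start + group_size]
        scale = max(abs(w) for w in group) / qmax or 1.0
        scales.append(scale)
        codes.extend(max(-qmax - 1, min(qmax, round(w / scale))) for w in group)
    return codes, scales


def quantized_dot(codes, scales, group_size, x):
    """Hypothetical helper: one output of a weight-only quantized linear layer.

    Each integer code is dequantized with its group's scale on the fly.
    """
    return sum(code * scales[i // group_size] * xi
               for i, (code, xi) in enumerate(zip(codes, x)))


row = [1.0, -0.5, 0.25, 2.0]
codes, scales = quantize_weights(row, group_size=2)
x = [1.0, 2.0, 3.0, 4.0]
approx = quantized_dot(codes, scales, group_size=2, x=x)
exact = sum(w * xi for w, xi in zip(row, x))  # 8.75
```

The difference from the fake quantization used in QAT is what gets stored: here the integer codes and scales replace the full-precision weights, and approx differs from exact only by rounding error.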
save_quantized_model
utils.quantization.save_quantized_model(model, save_dir, **kwargs)
Save a quantized model, handling MXTensor serialization.
MXTensor does not have a valid storage pointer, which causes save_pretrained to crash both in remove_tied_weights_from_state_dict (via id_tensor_storage) and during safetensors serialization. Transformers >=5.5 removed the safe_serialization parameter entirely, so safetensors serialization can no longer be turned off through that argument. For MX-quantized models we therefore save the config and generation_config via the save_pretrained machinery and the weights via torch.save.
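The split-save workaround can be sketched as follows. This is a minimal illustration, not the module's implementation: save_mx_model_sketch and its weights_name default of "pytorch_model.bin" are hypothetical, and the non-MX path and the handling of **kwargs are not modeled. It relies only on PretrainedConfig.save_pretrained, GenerationConfig.save_pretrained, and torch.save; the save_weights parameter lets the sketch be exercised without torch.

```python
import os


def save_mx_model_sketch(model, save_dir, save_weights=None,
                         weights_name="pytorch_model.bin"):
    """Save config files with their own save_pretrained and weights separately.

    `model` is assumed to expose `.config`, an optional `.generation_config`,
    and `.state_dict()`, as transformers PreTrainedModel objects do.
    `weights_name` is a hypothetical file name chosen for this sketch.
    """
    os.makedirs(save_dir, exist_ok=True)
    # Config objects serialize to JSON and never touch tensor storage.
    model.config.save_pretrained(save_dir)
    if getattr(model, "generation_config", None) is not None:
        model.generation_config.save_pretrained(save_dir)
    if save_weights is None:
        import torch  # deferred so the sketch can be exercised without torch
        save_weights = torch.save
    # Writing the state dict directly bypasses model.save_pretrained, and with it
    # the tied-weight storage checks and safetensors serialization.
    path = os.path.join(save_dir, weights_name)
    save_weights(model.state_dict(), path)
    return path
```

Loading such a checkpoint would then mean rebuilding the model from the saved config and restoring the state dict with torch.load, rather than relying on from_pretrained to find safetensors files.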