utils.quantization

utils.quantization

Utilities for quantization including QAT and PTQ using torchao.

Functions

Name	Description
convert_qat_model	This function converts a QAT model which has fake quantized layers back to the original model.
get_quantization_config	This function is used to build a post-training quantization config.
prepare_model_for_qat	This function is used to prepare a model for QAT by swapping the model’s linear
quantize_model	This function is used to quantize a model.
save_quantized_model	Save a quantized model, handling MXTensor serialization.

convert_qat_model

utils.quantization.convert_qat_model(model, quantize_embedding=False)

This function converts a QAT model which has fake quantized layers back to the original model.

get_quantization_config

utils.quantization.get_quantization_config(
    weight_dtype,
    activation_dtype=None,
    group_size=None,
)

This function is used to build a post-training quantization config.

Parameters

Name	Type	Description	Default
weight_dtype	TorchAOQuantDType	The dtype to use for weight quantization.	required
activation_dtype	TorchAOQuantDType \| None	The dtype to use for activation quantization.	`None`
group_size	int \| None	The group size to use for weight quantization.	`None`

Returns

Name	Type	Description
	AOBaseConfig	The post-training quantization config.

Raises

Name	Type	Description
	ValueError	If the activation dtype is not specified and the weight dtype is not int8 or int4, or if the group size is not specified for int8 or int4 weight only quantization.

prepare_model_for_qat

utils.quantization.prepare_model_for_qat(
    model,
    weight_dtype,
    group_size=None,
    activation_dtype=None,
    quantize_embedding=False,
)

This function is used to prepare a model for QAT by swapping the model’s linear layers with fake quantized linear layers, and optionally the embedding weights with fake quantized embedding weights.

Parameters

Name	Type	Description	Default
model		The model to quantize.	required
weight_dtype	TorchAOQuantDType	The dtype to use for weight quantization.	required
group_size	int \| None	The group size to use for weight quantization.	`None`
activation_dtype	TorchAOQuantDType \| None	The dtype to use for activation quantization.	`None`
quantize_embedding	bool	Whether to quantize the model’s embedding weights.	`False`

Raises

Name	Type	Description
	ValueError	If the activation/weight dtype combination is invalid.

quantize_model

utils.quantization.quantize_model(
    model,
    weight_dtype,
    group_size=None,
    activation_dtype=None,
    quantize_embedding=None,
)

This function is used to quantize a model.

Parameters

Name	Type	Description	Default
model		The model to quantize.	required
weight_dtype	TorchAOQuantDType	The dtype to use for weight quantization.	required
group_size	int \| None	The group size to use for weight quantization.	`None`
activation_dtype	TorchAOQuantDType \| None	The dtype to use for activation quantization.	`None`
quantize_embedding	bool \| None	Whether to quantize the model’s embedding weights.	`None`

save_quantized_model

utils.quantization.save_quantized_model(model, save_dir, **kwargs)

Save a quantized model, handling MXTensor serialization.

MXTensor does not have a valid storage pointer, which causes save_pretrained to crash (both in remove_tied_weights_from_state_dict via id_tensor_storage, and in safetensors serialization). Transformers >=5.5 removed the safe_serialization parameter entirely.

For MX-quantized models we save the config/generation_config via save_pretrained machinery and the weights via torch.save.