Performance Analysis of Two-stage Uniform Quantization of Laplacian Data

XVII International Conference on Systems, Automatic Control and Measurements, SAUM 2024 (pp. 47-50)

АУТОР(И) / AUTHOR(S): Sofija Perić, Aleksandra Jovanović, Zoran Perić, Jelena Nikolić

DOI: 10.46793/SAUM24.047P

САЖЕТАК / ABSTRACT:

Residual quantization decomposes the quantization process into smaller sub-quantizations, achieving a very small quantization error at significantly reduced computational cost. In this paper, we explore the application of the residual concept in two-stage uniform quantization (TSUQ). TSUQ employs two uniform quantizers: one in the first stage to quantize the input data using R1 bits, and one in the second stage to quantize the residual error of the previous stage using R2 bits. We analyze the TSUQ of Laplacian data, covering the following aspects: i) we determine the signal-to-quantization-noise ratio (SQNR) for different (R1, R2) pairs, ii) we estimate the gain in SQNR achieved by quantizing the residual error, and iii) we examine the robustness of SQNR to changes in the variance of the data being quantized. The conducted analysis can be useful in evaluating the performance of a particular TSUQ. Moreover, it can be useful in designing a TSUQ by enabling the determination of R1 and R2 that provide the highest SQNR.
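To make the two-stage scheme concrete, the following is a minimal Python sketch of TSUQ applied to unit-variance Laplacian data, using a symmetric mid-rise uniform quantizer in each stage. The helper names (uniform_quantize, tsuq), the support limits, and the example (R1, R2) pair are illustrative assumptions for this sketch, not the design choices made in the paper.

```python
import numpy as np

def uniform_quantize(x, bits, x_max):
    """Symmetric mid-rise uniform quantizer with 2**bits levels on [-x_max, x_max]."""
    step = 2.0 * x_max / (2 ** bits)
    # clip to the granular (support) region, then map each sample to its cell midpoint
    xc = np.clip(x, -x_max, x_max - 1e-12)
    return (np.floor(xc / step) + 0.5) * step

def tsuq(x, r1, r2, x_max1):
    """Two-stage uniform quantization: r1 bits for x, r2 bits for the first-stage residual."""
    y1 = uniform_quantize(x, r1, x_max1)             # first-stage reconstruction
    step1 = 2.0 * x_max1 / (2 ** r1)
    residual = x - y1                                 # first-stage quantization error
    y2 = uniform_quantize(residual, r2, step1 / 2)   # second stage covers one first-stage cell
    return y1 + y2                                    # overall reconstruction

rng = np.random.default_rng(0)
# unit-variance Laplacian source (variance of Laplace(0, b) is 2*b**2)
x = rng.laplace(loc=0.0, scale=1.0 / np.sqrt(2.0), size=1_000_000)

r1, r2 = 4, 3          # example (R1, R2) pair -- an assumption, not the paper's choice
x_max1 = 3.0           # assumed first-stage support limit for unit-variance input
xq = tsuq(x, r1, r2, x_max1)
sqnr_db = 10.0 * np.log10(np.var(x) / np.mean((x - xq) ** 2))
print(f"R1={r1}, R2={r2}: SQNR ~ {sqnr_db:.2f} dB")
```

Sweeping (R1, R2) pairs at a fixed total bit budget R1 + R2, and repeating the measurement with a rescaled input variance, would mimic the kind of performance and robustness analysis described in the abstract.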

КЉУЧНЕ РЕЧИ / KEYWORDS:

residual quantization, uniform quantization, variance mismatch, performance robustness, signal-to-quantization-noise ratio

ПРОЈЕКАТ/ ACKNOWLEDGEMENT:

The research presented in this paper was funded by the Ministry of Science, Technological Development and Innovation of the Republic of Serbia [grant number 451-03-66/2024-03/200102].
