CLOS network: A parameter-efficient alternative to linear layers in neural networks


Orgil Jargalsaikhan
Magvan-Erdene Gantulga
Nomin-Erdene Zorigtbaatar
Battsetseg Erdenebat

Keywords

Linear layer, CLOS network, Transformers, parameter reduction

Abstract

In this paper, we propose using the CLOS network architecture to replace traditional linear layers in deep learning models, including transformers. The CLOS network, a topology commonly used in communication and switching systems, is adapted to neural networks to reduce parameter counts while maintaining model performance. Our experiments show that the CLOS network matches the accuracy and loss of a conventional linear layer while using fewer parameters. However, this efficiency comes at the cost of increased processing time, with the CLOS layers running 1.5x to 3x slower. Despite this trade-off, the CLOS network can be an effective alternative for parameter reduction in a variety of architectures, including large models such as transformers.
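To make the idea concrete, the sketch below shows one way a Clos-style topology could stand in for a dense linear layer: three block-diagonal stages with a channel shuffle between them, so every input coordinate can still reach every output. The paper's exact construction is not given on this page, so the class name, group count, and shuffle scheme here are illustrative assumptions, not the authors' implementation.

import torch
import torch.nn as nn


class ClosLinear(nn.Module):
    """Illustrative Clos-style replacement for a square nn.Linear(dim, dim).

    Each of the three stages applies an independent small weight matrix per
    group (a block-diagonal map); a channel shuffle between stages routes
    activations across groups, mimicking the three-stage Clos fabric.
    This is a sketch under assumed design choices, not the paper's code.
    """

    def __init__(self, dim: int, groups: int):
        super().__init__()
        assert dim % groups == 0, "dim must be divisible by groups"
        self.groups = groups
        self.block = dim // groups
        # 3 * groups * block^2 = 3 * dim * block parameters,
        # versus dim^2 for a dense linear layer.
        self.stages = nn.ParameterList(
            [nn.Parameter(torch.randn(groups, self.block, self.block) / self.block ** 0.5)
             for _ in range(3)]
        )

    def _shuffle(self, x: torch.Tensor) -> torch.Tensor:
        # Interleave groups so the next stage mixes outputs of different groups.
        b = x.shape[0]
        return x.view(b, self.groups, self.block).transpose(1, 2).reshape(b, -1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b = x.shape[0]
        for i, w in enumerate(self.stages):
            xg = x.view(b, self.groups, self.block)   # split channels into groups
            xg = torch.einsum("bgi,gio->bgo", xg, w)  # per-group (block x block) matmul
            x = xg.reshape(b, -1)
            if i < 2:
                x = self._shuffle(x)                  # route between stages
        return x

With dim = 512 and groups = 16, this sketch uses 3 * 512 * 32 = 49,152 weights against 262,144 for a dense 512 x 512 layer, which is consistent with the parameter-for-time trade-off the abstract reports.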

