Research Article | Peer-Reviewed

Demonstrating the Usage of Content Reliability Metrics on Comparing the Contentual Performance of Two LLMs

Received: 28 October 2025     Accepted: 19 November 2025     Published: 11 December 2025
Abstract

Using generative AI in education requires that the generated answers remain reliable in content when the same question, or a slightly changed version of it, is asked again. Assessing the content reliability of generated answers is therefore an important issue, and it calls for appropriate contentual reliability metrics. In this paper we propose two new contentual reliability metrics for evaluating LLMs: content consistency and contentual robustness. They enable the assessment of content reliability related performance, which cannot be evaluated by standard metrics such as accuracy and faithfulness, yet is necessary due to the intrinsically random nature of generative AI. We demonstrate the usage of these content reliability metrics by assessing the contentual performance of two LLMs: one without and one with a reasoning model. The experiments run under identical software and hardware settings, and the source of information is restricted to a single PDF that serves as ground truth. The experiments are performed on two locally executed models, which ensures their reproducibility. Two question types are used: "easy questions" and "complicated questions" requiring multi-step reasoning. The results show that the MS phi-4-reasoning-plus model answers complex questions not only with higher accuracy but also with improved content consistency and contentual robustness. The mean accuracy increases from 0.36 to 0.84 when the MS phi-4-reasoning-plus model is used instead of the MS phi-4 model, which corresponds to a 133% improvement. Interestingly, the experimental results do not show any difference in the standard deviation of accuracy between the two models. The mean content consistency and the mean contentual robustness increase from 0.55 to 0.92 and from 0.74 to 0.93, respectively, when the MS phi-4-reasoning-plus model is used instead of the MS phi-4 model, corresponding to 67% and 26% improvements. The standard deviation of both content reliability metrics drops significantly when the MS phi-4 model is replaced by the reasoning model. Evaluating the effect of incorporating a reasoning model into an LLM on contentual performance gives insight into the operational reliability of LLMs at the content level. These results support the hypothesis that not only the accuracy but also the operational reliability of the LLM at the content level is significantly improved by incorporating the reasoning model. The experiment design, the results and their evaluation successfully demonstrate the usage of the newly proposed content reliability metrics for assessing and comparing the contentual performance of LLMs.
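The relative improvements quoted above follow directly from the reported means; as a brief check, using only the figures stated in this abstract:

\[
\frac{0.84-0.36}{0.36}\approx 1.33\;(133\%), \qquad
\frac{0.92-0.55}{0.55}\approx 0.67\;(67\%), \qquad
\frac{0.93-0.74}{0.74}\approx 0.26\;(26\%),
\]

for accuracy, content consistency and contentual robustness, respectively.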

Published in Science Innovation (Volume 13, Issue 6)
DOI 10.11648/j.si.20251306.15
Page(s) 162-169
Creative Commons

This is an Open Access article, distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution and reproduction in any medium or format, provided the original work is properly cited.

Copyright

Copyright © The Author(s), 2025. Published by Science Publishing Group

Keywords

Performance Evaluation, LLM, Contentual Reliability Metrics, Contentual Performance

Cite This Article
  • APA Style

    Miklos, J., & Saffer, Z. (2025). Demonstrating the Usage of Content Reliability Metrics on Comparing the Contentual Performance of Two LLMs. Science Innovation, 13(6), 162-169. https://doi.org/10.11648/j.si.20251306.15


    ACS Style

    Miklos, J.; Saffer, Z. Demonstrating the Usage of Content Reliability Metrics on Comparing the Contentual Performance of Two LLMs. Sci. Innov. 2025, 13(6), 162-169. doi: 10.11648/j.si.20251306.15


    AMA Style

    Miklos J, Saffer Z. Demonstrating the Usage of Content Reliability Metrics on Comparing the Contentual Performance of Two LLMs. Sci Innov. 2025;13(6):162-169. doi: 10.11648/j.si.20251306.15


  • @article{10.11648/j.si.20251306.15,
      author = {Johannes Miklos and Zsolt Saffer},
      title = {Demonstrating the Usage of Content Reliability Metrics on Comparing the Contentual Performance of Two LLMs},
      journal = {Science Innovation},
      volume = {13},
      number = {6},
      pages = {162-169},
      doi = {10.11648/j.si.20251306.15},
      url = {https://doi.org/10.11648/j.si.20251306.15},
      eprint = {https://article.sciencepublishinggroup.com/pdf/10.11648.j.si.20251306.15},
      abstract = {Using generative AI in education requires that the generated answers remain reliable in content when the same question, or a slightly changed version of it, is asked again. Assessing the content reliability of generated answers is therefore an important issue, and it calls for appropriate contentual reliability metrics. In this paper we propose two new contentual reliability metrics for evaluating LLMs: content consistency and contentual robustness. They enable the assessment of content reliability related performance, which cannot be evaluated by standard metrics such as accuracy and faithfulness, yet is necessary due to the intrinsically random nature of generative AI. We demonstrate the usage of these content reliability metrics by assessing the contentual performance of two LLMs: one without and one with a reasoning model. The experiments run under identical software and hardware settings, and the source of information is restricted to a single PDF that serves as ground truth. The experiments are performed on two locally executed models, which ensures their reproducibility. Two question types are used: "easy questions" and "complicated questions" requiring multi-step reasoning. The results show that the MS phi-4-reasoning-plus model answers complex questions not only with higher accuracy but also with improved content consistency and contentual robustness. The mean accuracy increases from 0.36 to 0.84 when the MS phi-4-reasoning-plus model is used instead of the MS phi-4 model, which corresponds to a 133% improvement. Interestingly, the experimental results do not show any difference in the standard deviation of accuracy between the two models. The mean content consistency and the mean contentual robustness increase from 0.55 to 0.92 and from 0.74 to 0.93, respectively, when the MS phi-4-reasoning-plus model is used instead of the MS phi-4 model, corresponding to 67% and 26% improvements. The standard deviation of both content reliability metrics drops significantly when the MS phi-4 model is replaced by the reasoning model. Evaluating the effect of incorporating a reasoning model into an LLM on contentual performance gives insight into the operational reliability of LLMs at the content level. These results support the hypothesis that not only the accuracy but also the operational reliability of the LLM at the content level is significantly improved by incorporating the reasoning model. The experiment design, the results and their evaluation successfully demonstrate the usage of the newly proposed content reliability metrics for assessing and comparing the contentual performance of LLMs.},
     year = {2025}
    }
    


  • TY  - JOUR
    T1  - Demonstrating the Usage of Content Reliability Metrics on Comparing the Contentual Performance of Two LLMs
    AU  - Johannes Miklos
    AU  - Zsolt Saffer
    Y1  - 2025/12/11
    PY  - 2025
    N1  - https://doi.org/10.11648/j.si.20251306.15
    DO  - 10.11648/j.si.20251306.15
    T2  - Science Innovation
    JF  - Science Innovation
    JO  - Science Innovation
    SP  - 162
    EP  - 169
    PB  - Science Publishing Group
    SN  - 2328-787X
    UR  - https://doi.org/10.11648/j.si.20251306.15
    AB  - Using generative AI in education requires that the generated answers remain reliable in content when the same question, or a slightly changed version of it, is asked again. Assessing the content reliability of generated answers is therefore an important issue, and it calls for appropriate contentual reliability metrics. In this paper we propose two new contentual reliability metrics for evaluating LLMs: content consistency and contentual robustness. They enable the assessment of content reliability related performance, which cannot be evaluated by standard metrics such as accuracy and faithfulness, yet is necessary due to the intrinsically random nature of generative AI. We demonstrate the usage of these content reliability metrics by assessing the contentual performance of two LLMs: one without and one with a reasoning model. The experiments run under identical software and hardware settings, and the source of information is restricted to a single PDF that serves as ground truth. The experiments are performed on two locally executed models, which ensures their reproducibility. Two question types are used: "easy questions" and "complicated questions" requiring multi-step reasoning. The results show that the MS phi-4-reasoning-plus model answers complex questions not only with higher accuracy but also with improved content consistency and contentual robustness. The mean accuracy increases from 0.36 to 0.84 when the MS phi-4-reasoning-plus model is used instead of the MS phi-4 model, which corresponds to a 133% improvement. Interestingly, the experimental results do not show any difference in the standard deviation of accuracy between the two models. The mean content consistency and the mean contentual robustness increase from 0.55 to 0.92 and from 0.74 to 0.93, respectively, when the MS phi-4-reasoning-plus model is used instead of the MS phi-4 model, corresponding to 67% and 26% improvements. The standard deviation of both content reliability metrics drops significantly when the MS phi-4 model is replaced by the reasoning model. Evaluating the effect of incorporating a reasoning model into an LLM on contentual performance gives insight into the operational reliability of LLMs at the content level. These results support the hypothesis that not only the accuracy but also the operational reliability of the LLM at the content level is significantly improved by incorporating the reasoning model. The experiment design, the results and their evaluation successfully demonstrate the usage of the newly proposed content reliability metrics for assessing and comparing the contentual performance of LLMs.
    VL  - 13
    IS  - 6
    ER  - 


Author Information
  • Johannes Miklos: Department of Information Systems Engineering and Management, The Distance-Learning University of Applied Sciences (FERNFH), Wiener Neustadt, Austria

  • Zsolt Saffer: Department of Information Systems Engineering and Management, The Distance-Learning University of Applied Sciences (FERNFH), Wiener Neustadt, Austria; Institute of Statistics and Mathematical Methods in Economics, Vienna University of Technology, Vienna, Austria
