This thesis examines the reliability of large language models (LLMs) in software development tools. It reveals significant biases in the datasets used to train these models, which inflate reported performance metrics. The research also assesses the quality of code generated by ChatGPT, identifying a range of quality issues and examining its capacity for self-repair. Additionally, the study uncovers security flaws in Visual Studio Code extensions that use LLMs, highlighting the risk of credential-related data leakage. Overall, the thesis offers insight into the challenges of applying LLMs in software development and provides recommendations for improving their reliability, with the aim of enhancing developer productivity and software quality.