
An Accurate and Trustworthy Chatbot for Data Interactions

Project Info


Team Name


ASTRA


Team Members


Mehraab, Haroon, Aseem, Fateen, Saugat, Niure, Mahathir

Project Description


Why are we doing this?


Addressing the challenges faced by government agencies in extracting insights from vast datasets is crucial for enhancing operational efficiency, accountability, and public trust. As these agencies manage extensive and diverse datasets, the ability to derive accurate and actionable insights becomes paramount. The current reliance on advanced AI and large language models (LLMs) presents significant hurdles, particularly concerning accuracy and trustworthiness. In a governmental context, where even a 90% accuracy rate is insufficient, the risks associated with misinformation can lead to misguided policies, misallocation of resources, and erosion of public confidence.

The importance of developing solutions that prioritize accuracy over advanced reasoning capabilities cannot be overstated. Government decisions often have far-reaching implications, affecting citizens' lives and the allocation of taxpayer funds. Therefore, tools that facilitate conversational data interrogation across multiple datasets can empower government employees to make informed decisions based on reliable data. Implementing trust scoring and vetting mechanisms will further enhance the credibility of AI responses, ensuring that users can confidently rely on the insights generated.

Moreover, the need for grounded, scope-limited responses is essential to prevent hallucinations and irrelevant information that can cloud judgment. By creating a transferable framework that works across various departments—such as HR, finance, and operations—agencies can foster a culture of data-driven decision-making. Additionally, providing suggested question scaffolding can guide users toward productive queries, maximizing the utility of the data at their disposal.

Ultimately, the development of these solutions aligns with ethical AI practices, emphasizing privacy, bias prevention, and transparency in algorithmic decision-making. By addressing these challenges, government agencies can enhance their operational effectiveness, improve service delivery, and build a foundation of trust with the public they serve.

System Architecture Overview


The architecture is designed to integrate structured data processing with advanced language model reasoning, ensuring accurate and context-aware responses to user queries. The workflow begins with three raw datasets, which first undergo rigorous data cleaning and preprocessing to address missing values, standardize formats, and ensure consistency across all fields. Once prepared, the datasets are ingested into a PostgreSQL database through an automated ETL (Extract, Transform, Load) pipeline.
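As a rough illustration of this ingestion step, the sketch below cleans one raw file with pandas and loads it into PostgreSQL via SQLAlchemy. The file name, table name, and connection string are placeholders, not the project's actual configuration.

```python
import pandas as pd
from sqlalchemy import create_engine

def load_dataset(csv_path: str, table_name: str, db_url: str) -> int:
    """Clean one raw dataset and load it into a PostgreSQL table."""
    df = pd.read_csv(csv_path)

    # Standardise column names so generated SQL is predictable.
    df.columns = [c.strip().lower().replace(" ", "_") for c in df.columns]

    # Coerce obvious date columns and drop fully empty rows.
    for col in df.columns:
        if "date" in col:
            df[col] = pd.to_datetime(df[col], errors="coerce")
    df = df.dropna(how="all")

    # Replace the table each run so the pipeline is repeatable.
    engine = create_engine(db_url)  # e.g. "postgresql+psycopg2://user:pass@localhost/astra"
    df.to_sql(table_name, engine, if_exists="replace", index=False)
    return len(df)
```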

This step establishes a reliable and structured knowledge base that serves as the foundation for subsequent query execution. When a user submits a natural language query, the request is initially processed by a Large Language Model (LLM) functioning as the query translator. This model interprets the intent of the user’s input and generates a valid SQL query that can extract precise results from the PostgreSQL database. In parallel, the same query is transformed into a high-dimensional vector embedding.
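A minimal sketch of this translation step is shown below, assuming an OpenAI-style chat model as the query translator and an embedding model for the vector representation; the provider, model names, and schema hint are illustrative assumptions rather than the team's actual setup.

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Assumed schema hint; the real database schema is not published in this write-up.
SCHEMA_HINT = (
    "Table employee_leave(employee_id INT, leave_type TEXT, "
    "start_date DATE, end_date DATE, days_taken NUMERIC)"
)

def translate_to_sql(question: str) -> str:
    """Ask the LLM for a single PostgreSQL statement grounded in the known schema."""
    resp = client.chat.completions.create(
        model="gpt-4o-mini",  # assumed model name
        messages=[
            {"role": "system",
             "content": "You write PostgreSQL queries. Use only this schema:\n"
                        + SCHEMA_HINT
                        + "\nReturn exactly one SQL statement and nothing else."},
            {"role": "user", "content": question},
        ],
    )
    return resp.choices[0].message.content.strip()

def embed_query(question: str) -> list[float]:
    """Embed the same question for the Pinecone similarity search."""
    resp = client.embeddings.create(model="text-embedding-3-small", input=question)
    return resp.data[0].embedding
```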

This embedding is sent to Pinecone, where a vector similarity search is performed to retrieve semantically relevant references and contextual information. This process leverages Retrieval-Augmented Generation (RAG) to enhance the factual database results with richer descriptive knowledge. At the same time, the query is routed to a secondary LLM that acts as the synthesis and reasoning layer.
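The retrieval step might look roughly like the sketch below, which queries a Pinecone index with the question embedding; the index name and the "text" metadata field are assumptions for illustration.

```python
import os
from pinecone import Pinecone

pc = Pinecone(api_key=os.environ["PINECONE_API_KEY"])
index = pc.Index("astra-context")  # assumed index name

def retrieve_context(query_embedding: list[float], top_k: int = 5) -> list[str]:
    """Return the stored text snippets most similar to the user's question."""
    result = index.query(vector=query_embedding, top_k=top_k, include_metadata=True)
    # Assumes each vector was upserted with a "text" field in its metadata.
    return [match.metadata["text"] for match in result.matches]
```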

This model consolidates and cross-validates three distinct sources of input: the factual output returned by the SQL query against PostgreSQL, the contextual references retrieved from Pinecone's vector index, and the original natural language query submitted by the user. By comparing and integrating these sources, the secondary LLM generates a final response that is both factually grounded in structured data and contextually enriched with semantic knowledge. This layered orchestration ensures the system delivers responses that are accurate, comprehensive, and aligned with the user's intent.
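A hedged sketch of this synthesis layer follows: a second LLM receives the SQL rows, the retrieved snippets, and the original question, and is instructed to answer only from that evidence. The model name and prompt wording are assumptions, not the team's actual configuration.

```python
from openai import OpenAI

client = OpenAI()

def synthesise_answer(question: str, sql_rows: list[dict], context_snippets: list[str]) -> str:
    """Cross-check the structured result against retrieved context and draft the reply."""
    evidence = (
        "SQL result from PostgreSQL:\n"
        f"{sql_rows}\n\n"
        "Context retrieved from Pinecone:\n"
        + "\n".join(f"- {s}" for s in context_snippets)
    )
    resp = client.chat.completions.create(
        model="gpt-4o-mini",  # assumed model name
        messages=[
            {"role": "system",
             "content": "Answer using ONLY the evidence provided. "
                        "If the evidence is insufficient, say so instead of guessing."},
            {"role": "user", "content": f"{evidence}\n\nQuestion: {question}"},
        ],
    )
    return resp.choices[0].message.content
```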

What makes us stand out?

Our system stands out due to its innovative hybrid search strategy, which effectively combines traditional SQL database querying with advanced technologies like large language models (LLMs) and retrieval-augmented generation (RAG). This unique integration allows us to convert natural language queries into precise SQL commands, enabling users to interact with the system intuitively. Once a query is generated, it is supported by RAG, which enhances the factual accuracy of the results by providing contextual references.

Additionally, we conduct semantic searches within the schema of our PostgreSQL database, ensuring that the information retrieved is not only relevant but also enriched with meaningful context. This hybrid approach leverages the strengths of both traditional database querying and modern AI capabilities, allowing for a seamless flow of information. By utilizing LLMs to interpret user intent and RAG to augment the data retrieval process, our system delivers comprehensive insights that are grounded in reliable data.
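One plausible way to implement this schema-level semantic search (a sketch only, with assumed table names, descriptions, and index name) is to embed each table's description once and store it in Pinecone, so an incoming question can be matched to the most relevant tables before SQL is generated.

```python
import os
from openai import OpenAI
from pinecone import Pinecone

client = OpenAI()
index = Pinecone(api_key=os.environ["PINECONE_API_KEY"]).Index("astra-schema")  # assumed index

# Illustrative table descriptions; the project's real schema is not published here.
SCHEMA_DESCRIPTIONS = {
    "employee_leave": "Leave requests per employee: leave type, start/end dates, days taken.",
    "foi_statistics": "FOI request counts, costs and processing times per agency.",
    "hr_records": "Core HR data: roles, departments, start dates, employment status.",
}

def index_schema() -> None:
    """Embed each table description and store it with the table name as its ID."""
    for table, description in SCHEMA_DESCRIPTIONS.items():
        emb = client.embeddings.create(model="text-embedding-3-small", input=description)
        index.upsert(vectors=[{
            "id": table,
            "values": emb.data[0].embedding,
            "metadata": {"table": table, "description": description},
        }])
```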

Ultimately, this powerful combination empowers users to make informed decisions based on accurate and contextually relevant information, fostering a more efficient and effective data interrogation experience.


#llm #postgre #kaggle #data #rag #sql #python #data cleaning #database #pinecone

Evidence of Work

Video

Homepage

Team DataSets

Freedom of information statistics

Description of Use Similar to the other datasets, we created a Python pipeline to validate and clean this data. The difference is that this data isn't a time series; it is structured in a multi-sheet format, with both quantitative (counts, costs, processing times) and qualitative (agency comments) components, allowing analysis of FOI activity at both whole-of-government and agency-specific levels. This meant building a custom Python pipeline compared to the others.
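A minimal sketch of such a multi-sheet pipeline is shown below, assuming the statistics arrive as an Excel workbook; the sheet handling and column keywords are illustrative assumptions.

```python
import pandas as pd

def clean_foi_workbook(path: str) -> dict[str, pd.DataFrame]:
    """Read every sheet of the workbook, tidy column names, and coerce numeric fields."""
    sheets = pd.read_excel(path, sheet_name=None)  # dict of sheet name -> DataFrame
    cleaned = {}
    for name, df in sheets.items():
        df.columns = [str(c).strip().lower().replace(" ", "_") for c in df.columns]
        for col in df.columns:
            # Quantitative fields (counts, costs, processing times) become numeric;
            # qualitative fields such as agency comments stay as free text.
            if any(key in col for key in ("count", "cost", "days", "time")):
                df[col] = pd.to_numeric(df[col], errors="coerce")
        cleaned[name] = df.dropna(how="all")
    return cleaned
```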

Data Set

Employee Leave tracking data

Description of Use We worked with one file containing employee leave data. For this file, it was essential to carefully handle missing values. For example, if a date column had gaps, we couldn’t simply delete those rows; instead, we had to identify and apply suitable methods to fill them in. In addition, we ensured that all numerical and string columns were stored in the correct data types to avoid inconsistencies. Finally, we formatted the data so it was fully compatible with SQL queries and LLM processing, allowing for a smoother and more efficient pipeline.
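The sketch below illustrates this kind of cleaning on assumed column names (start_date, end_date, leave_type, days_taken); it is an example of the approach, not the exact pipeline.

```python
import pandas as pd

def clean_leave_data(path: str) -> pd.DataFrame:
    """Fill date gaps and enforce consistent dtypes on the leave file."""
    df = pd.read_csv(path)

    # Parse dates, then fill gaps instead of deleting rows: a missing end_date
    # is treated here as single-day leave equal to the start_date.
    df["start_date"] = pd.to_datetime(df["start_date"], errors="coerce")
    df["end_date"] = pd.to_datetime(df["end_date"], errors="coerce")
    df["end_date"] = df["end_date"].fillna(df["start_date"])

    # Consistent dtypes keep SQL ingestion and LLM prompting predictable.
    df["employee_id"] = df["employee_id"].astype("Int64")
    df["leave_type"] = df["leave_type"].astype("string").str.strip()
    df["days_taken"] = pd.to_numeric(df["days_taken"], errors="coerce").fillna(0)
    return df
```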

Data Set

Human Resources Data set.

Description of Use We analyzed four files that contained various forms of employee data. For each file we had to make sure that missing values were properly filled in: if a date column had missing values, we couldn't simply remove those rows; we had to find an appropriate way to fill them. Moreover, we had to ensure that numerical and string columns used the appropriate data types. Lastly, we had to make sure the data was formatted for SQL querying and LLM processing to ensure a smoother pipeline.
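As an illustration only, the sketch below processes four assumed CSV files and merges them on an assumed employee_id key into a single SQL-ready table; the real file names and join key are not listed in this write-up.

```python
import pandas as pd

# Assumed file names and join key for illustration.
HR_FILES = ["employees.csv", "departments.csv", "salaries.csv", "performance.csv"]

def build_hr_table() -> pd.DataFrame:
    """Clean each HR file and merge them into one SQL-ready table."""
    frames = []
    for path in HR_FILES:
        df = pd.read_csv(path)
        df.columns = [c.strip().lower().replace(" ", "_") for c in df.columns]
        for col in df.columns:
            if "date" in col:
                # Fill missing dates within the file rather than dropping rows.
                df[col] = pd.to_datetime(df[col], errors="coerce").ffill()
        frames.append(df)
    merged = frames[0]
    for df in frames[1:]:
        merged = merged.merge(df, on="employee_id", how="left")
    return merged
```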

Data Set

Challenge Entries

An Accurate and Trustworthy Chatbot for Data Interactions

How can government agencies deploy conversational and analytical AI to interrogate complex datasets while maintaining the high degree of accuracy and auditability standards required for official decision-making?
