CrocFishAI
Project Info

Team Name
Hot Dry Noodle
Team Members
Kevin, Allen, Cheng, Shaobin
Project Description
Introduction
To help NT government agencies deploy reliable conversational and analytical AI, we are excited to introduce our AI app prototype, CrocFishAI. It is designed to improve answer accuracy and mitigate the "hallucination" problem. Let us introduce the app step by step.
1. Multi-Model Verification and Trust Scoring Mechanisms
To verify and improve the accuracy of large model responses, we designed three steps for our AI:
multi-model verification → trust scoring → human review (user review/expert review).
(1) Multi-Model Review Mechanism
Our system simultaneously sends the same question to five local large models (Gemini, Grok, DeepSeek, Claude, LLaMA), and a sixth "referee model" is deployed to check whether the results are consistent:
- If consistent: the answer is considered more reliable and will be output directly.
- If inconsistent: the system triggers a risk alert and escalates to expert human review.
The essence of the multi-model mechanism is not simply to raise measured accuracy, but to sharply reduce the probability that an incorrect answer is output directly. For example, if each model alone is 90% accurate and their errors are roughly independent, the probability that all five models are wrong at the same time is at most 0.1^5 = 0.00001 (0.001%), and the probability that all five agree on the same wrong answer is lower still, so a unanimous answer is very close to fully reliable.
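Here is a minimal sketch of this consensus check, assuming a hypothetical `ask(model, question)` wrapper around the real model APIs; exact string matching stands in for the referee model's semantic comparison:

```python
from collections import Counter

MODELS = ["gemini", "grok", "deepseek", "claude", "llama"]

def ask(model: str, question: str) -> str:
    """Hypothetical wrapper around each model's real API; not implemented here."""
    raise NotImplementedError

def multi_model_answer(question: str) -> dict:
    # Ask all five models the same question (sequential here for clarity).
    answers = [ask(m, question) for m in MODELS]
    # Exact string matching stands in for the sixth referee model's check.
    tally = Counter(answers)
    top_answer, votes = tally.most_common(1)[0]
    if votes == len(MODELS):
        # Consistent: with independent 90%-accurate models, P(all five wrong) <= 0.1**5.
        return {"status": "consistent", "answer": top_answer}
    # Inconsistent: risk alert, escalate to expert human review.
    return {"status": "risk_alert", "answers": dict(tally)}
```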
(2) Trust Scoring
Users can provide immediate feedback on each answer, so the system keeps improving the user experience through use itself (a minimal sketch follows the list below):
- Satisfied: the trust score of the answer increases, and for similar future questions, this answer is prioritized.
- Not satisfied: the system requires the user to explain the issue, providing a basis for further optimization.
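A minimal sketch of how the trust score could be stored and updated, using a simple in-memory dictionary as a stand-in for our real datastore:

```python
# Trust scores keyed by (question, answer); an in-memory stand-in for the real store.
trust_scores: dict[tuple[str, str], int] = {}

def record_feedback(question: str, answer: str, satisfied: bool, reason: str = "") -> None:
    key = (question, answer)
    if satisfied:
        # Satisfied: raise the score so this answer is prioritized for similar questions.
        trust_scores[key] = trust_scores.get(key, 0) + 1
    else:
        # Not satisfied: the user must explain the issue, giving a basis for optimization.
        if not reason:
            raise ValueError("Please explain why the answer was not satisfactory.")
        trust_scores[key] = trust_scores.get(key, 0) - 1
```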
(3) Human Review
When too many users select "Not satisfied," or when the multi-model answers diverge, our system triggers expert human review. Our experts use RAG (Retrieval-Augmented Generation) to correct the answer and write the result into the knowledge base/correction library for future use. This avoids retraining the large model; instead, answer quality improves continuously through knowledge updates (a minimal sketch of the correction library follows). After the sketch, our test member Alice walks through some examples of how it works.
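A minimal sketch of the correction-library lookup, assuming a hypothetical `corrections` store keyed by normalized question text; the real system would use semantic retrieval over the knowledge base rather than exact matching:

```python
# Expert-verified answers keyed by normalized question text; a stand-in for the
# real knowledge base / correction library.
corrections: dict[str, str] = {}

def normalize(question: str) -> str:
    return " ".join(question.lower().split())

def record_expert_correction(question: str, corrected_answer: str) -> None:
    # Expert review writes the corrected answer back for future questions.
    corrections[normalize(question)] = corrected_answer

def answer_with_corrections(question: str, model_answer) -> str:
    # Check the correction library first: no retraining, just knowledge updates.
    key = normalize(question)
    if key in corrections:
        return corrections[key]
    return model_answer(question)
```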
Case 1: Multi-Model Review Mechanism (Trust Scoring)
Process: User question → multi-model consistent answers → user satisfaction feedback
User question: What is the Department of Justice’s budget for 2024-25?
AI Answer: The Department of Justice’s budget for 2024-25 is 117,519 thousand AUD.
Source: 2024-25-pbs-program-expense-line-items.csv, Attorney-General’s Department, 2024-25 column, Row 4.
User feedback: Alice marked the answer as satisfactory → no human review triggered.
Case 2: Multi-Model Inconsistency → Expert Review (RAG in practice)
Process: User question → multi-model inconsistent answers → expert review triggered → expert correction → correct answer served automatically next time
User question: What is the Department of Justice’s budget for 2026-27?
AI Answer: The Department of Justice’s budget for 2026-27 is approximately 88,591 thousand AUD.
Source: 2024-25-pbs-program-expense-line-items.csv, Row 4, Column 2026-27.
Risk alert: 1 model produced a different result → escalated to expert review.
Expert conclusion: The correct answer is 88,591 thousand AUD.
Case 3: User Real-Time Feedback Mechanism
Process: User question → multi-model consistent answers → user feedback on satisfaction → if “Not satisfied” → expert intervention
User question: What is the budget execution date for the Office of Parliamentary Counsel?
AI Answer: The budget execution date is 01/07/2024.
Source: 2024-25-pbs-program-expense-line-items.csv, Row 5, Execution Date column.
User feedback: Alice selected "Not satisfied." Reason: in China the date format is Year-Month-Day, while the AI gave Day-Month-Year.
2. Grounded, Scope-Limited Responses
Our main goal is to avoid large-model hallucinations and ensure all answers are based on verified data, without fabricating responses to out-of-scope or irrelevant questions (a minimal sketch follows the application example below).
Training and Implementation
- Structured Data Processing: Integrate Excel/CSV directly for precise queries.
- Unstructured Data Transformation: Convert PDFs/Word into structured tables for storage.
- Answer Boundary Rules: Follow “If I know, I answer; if I don’t, I say so.”
Application Example
Dialogue 1 (Answer available): DOJ budget 2024-25 → 117,519 thousand AUD.
Dialogue 2 (Uncertain): DOJ budget 2029-30 → “I don’t know, data only up to 2027-28.”
Dialogue 3 (Not present): Australian Space Agency budget 2024-25 → “No answer available.”
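A minimal sketch of the boundary rule over the budget CSV; the "Department" column name is an assumption for illustration, and out-of-scope queries are refused rather than answered with a guess:

```python
import csv

def budget_lookup(csv_path: str, department: str, fiscal_year: str) -> str:
    """Answer strictly from the CSV; say "I don't know" rather than fabricate."""
    with open(csv_path, newline="") as f:
        rows = list(csv.DictReader(f))
    if not rows or fiscal_year not in rows[0]:
        # Boundary rule: the requested year is outside the data we hold.
        return "I don't know, the data does not cover that fiscal year."
    for row in rows:
        if row.get("Department") == department:  # "Department" is an assumed column name
            return f"{department} budget for {fiscal_year}: {row[fiscal_year]} thousand AUD"
    # Boundary rule: the department does not appear in the data at all.
    return "No answer available."
```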
3. Suggested Question Scaffolding to Guide Users
Goal: Our AI will guide users to ask clearer, more effective questions. It is designed especially to help people with limited digital awareness.
Example: User asks: “What is this year’s government budget?”
AI guides: Which fiscal year do you want to know? Do you want the data for all departments or just one? Do you want to exclude some departments?
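A minimal sketch of this scaffolding, assuming a hypothetical rule that a budget question must name a fiscal year and a department scope before it is answered:

```python
FISCAL_YEARS = ("2024-25", "2025-26", "2026-27", "2027-28")

def scaffold(question: str) -> list[str]:
    """Return clarifying questions when a budget query is under-specified."""
    followups = []
    if not any(year in question for year in FISCAL_YEARS):
        followups.append("Which fiscal year do you want to know?")
    if "department" not in question.lower():
        followups.append("Do you want the data for all departments or just one?")
    return followups  # an empty list means the question is specific enough to answer

# An under-specified question triggers both follow-ups:
print(scaffold("What is this year's government budget?"))
```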
4. Audit Trails Showing How Conclusions Were Reached
Our chatbot will provide audit trails so every answer can be traced back to its data sources, making it easy to check any mistake the AI makes.
Example: Row 2, 2025-26 budget → 7,101 thousand AUD.
Source: 2024-25-pbs-program-expense-line-items.csv, Row 2, Column 2025-26.
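A minimal sketch of how each answer could carry its provenance, using a simple dataclass; the field names are illustrative:

```python
from dataclasses import dataclass

@dataclass
class AuditedAnswer:
    answer: str       # what the user sees
    source_file: str  # the file the figure came from
    row: int          # row within the file
    column: str       # column within the file

    def citation(self) -> str:
        return f"Source: {self.source_file}, Row {self.row}, Column {self.column}."

# The example above, carried as an auditable record:
a = AuditedAnswer("7,101 thousand AUD", "2024-25-pbs-program-expense-line-items.csv", 2, "2025-26")
print(a.answer, "|", a.citation())
```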
5. Reducing Manual Effort: High-Frequency Question Automation
Our app automatically identifies and analyzes frequently asked questions, reducing manual effort and supporting users who lack digital awareness (see the sketch after the list below).
Example (Top 7-Day Questions):
- DOJ budget 2024-25? (3 times)
- DOJ budget 2025-26? (2 times)
- Attorney-General’s Department budget 2024-25? (2 times)
- Which department has the highest growth rate? (2 times)
- Which projects will decrease over the next 4 years? (2 times)
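A minimal sketch of the frequency analysis, assuming a hypothetical `query_log` of (timestamp, question) pairs:

```python
from collections import Counter
from datetime import datetime, timedelta

def top_questions(query_log: list[tuple[datetime, str]], days: int = 7, n: int = 10):
    """Return the n most frequently asked questions over the last `days` days."""
    cutoff = datetime.now() - timedelta(days=days)
    recent = (question for ts, question in query_log if ts >= cutoff)
    return Counter(recent).most_common(n)
```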
Why We Stand Out
We solve the two hardest problems in government AI adoption—accuracy and usability—through five reinforcing mechanisms:
- No hallucinations. Every answer is grounded in structured data. If the data isn’t there, we say “I don’t know.” That guarantees honesty and reliability.
- Multi-model verification. Five local models cross-check each response, with a sixth referee model enforcing consistency—drastically reducing error risk and pushing the reliability of unanimous answers close to 100%.
- User-first experience. Real-time feedback lets users rate answers; “Not satisfied” automatically escalates to human experts. Trust and satisfaction are built into the loop.
- Question scaffolding. Guided prompts and templates help users—especially those with low digital awareness—ask precise, answerable questions from the start.
- Efficiency through automation. The system surfaces the past week’s Top 10 questions, cutting repetitive work for experts and helping new users find answers faster.