Verification & Enrichment

Valter provides two complementary features for ensuring the accuracy of legal content. Verification catches hallucinated legal references before they reach users. Enrichment adds structured legal analysis (IRAC) and knowledge graph context to decisions.

Endpoint: POST /v1/verify

LLMs frequently hallucinate legal citations — inventing sumula numbers, misspelling minister names, or fabricating process numbers. The LegalVerifier (core/verifier.py) validates references found in the text against known datasets and computes a hallucination risk score.

The verifier checks four categories of legal references, each toggleable via request parameters:

**Sumulas.** Validates sumula numbers against local STJ and STF reference data using SumulaValidator. The validation confirms:

  • The sumula number exists
  • The correct court is attributed (STJ vs STF)
  • Whether the sumula is still vigente (in force)
  • The associated legal area

**Ministers.** Validates minister names against a known list using MinistroValidator. Returns:

  • valid: whether the name matches a known minister
  • confidence: exact, partial, or none
  • is_aposentado: whether the minister is retired
  • suggestion: corrected name when a partial match is found
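
The return shape described above can be sketched as follows. This is a minimal illustration assuming a dictionary of known ministers and difflib-based fuzzy matching; the minister entries and the 0.8 cutoff are assumptions, not the project's actual data, thresholds, or matching logic.

```python
import difflib

# Illustrative data only -- the real validator loads a full minister list.
KNOWN_MINISTROS = {
    "nancy andrighi": {"aposentado": False},
    "humberto martins": {"aposentado": False},
}

def validate_ministro(name: str) -> dict:
    """Return a result dict mirroring the fields documented above."""
    key = name.strip().lower()
    if key in KNOWN_MINISTROS:
        return {"valid": True, "confidence": "exact",
                "is_aposentado": KNOWN_MINISTROS[key]["aposentado"],
                "suggestion": None}
    # Fuzzy match: a near-miss spelling yields a "partial" with a suggestion.
    close = difflib.get_close_matches(key, list(KNOWN_MINISTROS), n=1, cutoff=0.8)
    if close:
        return {"valid": False, "confidence": "partial",
                "is_aposentado": KNOWN_MINISTROS[close[0]]["aposentado"],
                "suggestion": close[0].title()}
    return {"valid": False, "confidence": "none",
            "is_aposentado": None, "suggestion": None}

print(validate_ministro("Nanci Andrighi")["suggestion"])  # -> Nancy Andrighi
```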

**Process numbers.** Validates the CNJ process number format using a regex pattern:

NNNNNNN-NN.NNNN.N.NN.NNNN

```python
# From core/verifier.py
import re

PROCESSO_REGEX = re.compile(r"\b(\d{7}-\d{2}\.\d{4}\.\d\.\d{2}\.\d{4})\b")
```

This validates format only — it does not confirm the process exists in external systems.
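
The format check can be exercised directly with the pattern shown above; `is_valid_cnj_format` is an illustrative helper name, not part of the project's API.

```python
import re

# Same pattern as shown above (core/verifier.py).
PROCESSO_REGEX = re.compile(r"\b(\d{7}-\d{2}\.\d{4}\.\d\.\d{2}\.\d{4})\b")

def is_valid_cnj_format(text: str) -> list[str]:
    """Return all well-formed CNJ process numbers found in the text."""
    return PROCESSO_REGEX.findall(text)

print(is_valid_cnj_format("Processo 0001234-56.2023.8.26.0100 em tramite"))
# -> ['0001234-56.2023.8.26.0100']
# A number with the wrong segment widths is simply not matched:
print(is_valid_cnj_format("Processo 1234-56.2023.8.26.0100"))  # -> []
```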

**Legislation.** Extracts and classifies legislation mentions using regex patterns. It recognizes both explicit references (e.g., “Lei 8.078/1990”) and common aliases:

| Alias | Resolves To |
|-------|-------------|
| CDC | Lei 8.078/1990 |
| CC | Lei 10.406/2002 |
| CPC | Lei 13.105/2015 |
| CLT | Decreto-Lei 5.452/1943 |
| CP | Decreto-Lei 2.848/1940 |
| CTN | Lei 5.172/1966 |
| CF | Constituicao Federal 1988 |
| ECA | Lei 8.069/1990 |

Article references (e.g., “Art. 186”) are also extracted and linked to their parent law.
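
Alias resolution and article extraction can be sketched as a dictionary lookup plus a regex pass. The table data mirrors the documented mappings, but `resolve_alias` and `ARTICLE_RE` are illustrative names, not the actual API of core/verifier.py.

```python
import re

# Mirrors the alias table documented above.
LEGISLATION_ALIASES = {
    "CDC": "Lei 8.078/1990",
    "CC": "Lei 10.406/2002",
    "CPC": "Lei 13.105/2015",
    "CLT": "Decreto-Lei 5.452/1943",
    "CP": "Decreto-Lei 2.848/1940",
    "CTN": "Lei 5.172/1966",
    "CF": "Constituicao Federal 1988",
    "ECA": "Lei 8.069/1990",
}

# Captures the article number after "Art." / "artigo".
ARTICLE_RE = re.compile(r"\bart\.?\s*(\d+)", re.IGNORECASE)

def resolve_alias(mention: str) -> str:
    """Map a shorthand like 'CDC' to its full legislation reference."""
    return LEGISLATION_ALIASES.get(mention.upper(), mention)

print(resolve_alias("CDC"))                   # -> Lei 8.078/1990
print(ARTICLE_RE.findall("Art. 186 do CC"))   # -> ['186']
```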

The verifier computes an overall HallucinationMetrics object:

```python
# From core/verifier.py
from dataclasses import dataclass

@dataclass
class HallucinationMetrics:
    risk_level: str    # "low", "medium", "high"
    risk_score: float  # 0-100
    total_citations: int
    valid_count: int
    invalid_count: int
    unverified_count: int
    details: dict
```

The risk score aggregates validation results: a text with many invalid references produces a higher score, signaling that the content may contain hallucinated citations.
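
One plausible aggregation consistent with the HallucinationMetrics fields is shown below; the weights and level thresholds are illustrative assumptions, and the actual formula in core/verifier.py may differ.

```python
def risk_score(valid: int, invalid: int, unverified: int) -> tuple[float, str]:
    """Aggregate citation validation counts into a 0-100 score and a level."""
    total = valid + invalid + unverified
    if total == 0:
        return 0.0, "low"
    # Invalid citations weigh fully; unverified ones count half (assumption).
    score = 100.0 * (invalid + 0.5 * unverified) / total
    level = "low" if score < 25 else "medium" if score < 60 else "high"
    return round(score, 1), level

print(risk_score(valid=8, invalid=2, unverified=0))  # -> (20.0, 'low')
```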

Verification relies on golden datasets stored in the data/reference/ directory. The project also includes 810,225 STJ metadata records for broader process validation.

Endpoint: POST /v1/enrich

The DocumentEnricher (core/enricher.py) performs two operations on a legal document: heuristic IRAC classification and knowledge graph context loading.

IRAC is a standard framework for analyzing legal decisions:

| Component | What It Identifies | Example Patterns |
|-----------|--------------------|------------------|
| Issue | The legal question being decided | "questao", "controversia", "discute-se", "cinge-se" |
| Rule | The legal norm or principle applied | "artigo", "lei", "sumula", "nos termos de" |
| Application | How the rule was applied to the facts | "no caso", "in casu", "verifica-se", "configurado" |
| Conclusion | The court's decision | "portanto", "ante o exposto", "da provimento", "nega" |

The classification is heuristic and regex-based — it does not depend on an LLM. Each IRAC section is identified by scanning the document text (ementa, tese, razoes_decidir) against compiled regex patterns:

```python
# From core/enricher.py
IRAC_PATTERNS = {
    IRACSection.ISSUE: [
        r"\b(?:questao|problema|controversia|debate|discussao|tese|cerne)\b",
        r"\b(?:discute-se|indaga-se|pergunta-se|cinge-se)\b",
    ],
    IRACSection.RULE: [
        r"\b(?:art\.?|artigo|lei|sumula|codigo|dispositivo|norma)\b",
        r"\b(?:preve|estabelece|dispoe|prescreve|determina)\b",
    ],
    # ...
}
```
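
A minimal classifier over such patterns can be sketched as below, using a subset of the documented patterns; the `classify` helper and the two-member enum are illustrative, not the enricher's real interface.

```python
import re
from enum import Enum

class IRACSection(Enum):
    ISSUE = "issue"
    RULE = "rule"

# Subset of the patterns shown above, pre-compiled.
IRAC_PATTERNS = {
    IRACSection.ISSUE: [
        re.compile(r"\b(?:questao|controversia|discute-se|cinge-se)\b"),
    ],
    IRACSection.RULE: [
        re.compile(r"\b(?:art\.?|artigo|lei|sumula)\b"),
    ],
}

def classify(sentence: str) -> list[IRACSection]:
    """Return every IRAC section whose patterns match the sentence."""
    text = sentence.lower()
    return [section for section, patterns in IRAC_PATTERNS.items()
            if any(p.search(text) for p in patterns)]

# Matches both ISSUE ("controversia") and RULE ("artigo", "lei").
print(classify("A controversia cinge-se ao artigo 14 da lei."))
```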

After IRAC classification, the enricher loads graph context by running five parallel queries against Neo4j:

| Entity Type | Method | What It Returns |
|-------------|--------|-----------------|
| Criterios | get_criterios() | Legal criteria connected to the decision |
| Dispositivos | get_dispositivos() | Legal statutes cited |
| Precedentes | get_precedentes() | Precedent decisions cited |
| Legislacao | get_legislacao() | Legislation edges with relationship metadata |
| Related Decisions | get_related_decisions() | Decisions connected via shared criteria |

The enrichment result includes a kg_available flag indicating whether the Neo4j graph contained data for this decision.
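
The five parallel lookups can be sketched with asyncio.gather. The method names mirror the table above, but the GraphStore stub and the kg_available heuristic (any non-empty result) are assumptions about how the enricher wires this together.

```python
import asyncio

class GraphStore:
    """Stub standing in for the real Neo4j-backed store."""
    async def get_criterios(self, decision_id): return []
    async def get_dispositivos(self, decision_id): return []
    async def get_precedentes(self, decision_id): return []
    async def get_legislacao(self, decision_id): return []
    async def get_related_decisions(self, decision_id): return []

async def load_graph_context(store: GraphStore, decision_id: str) -> dict:
    """Run the five entity queries concurrently and collect the results."""
    criterios, dispositivos, precedentes, legislacao, related = (
        await asyncio.gather(
            store.get_criterios(decision_id),
            store.get_dispositivos(decision_id),
            store.get_precedentes(decision_id),
            store.get_legislacao(decision_id),
            store.get_related_decisions(decision_id),
        )
    )
    context = {"criterios": criterios, "dispositivos": dispositivos,
               "precedentes": precedentes, "legislacao": legislacao,
               "related_decisions": related}
    # Mirrors the kg_available flag: True if any query returned data.
    context["kg_available"] = any(context.values())
    return context

print(asyncio.run(load_graph_context(GraphStore(), "dec-1"))["kg_available"])
```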

```python
# From core/enricher.py
from dataclasses import dataclass, field

@dataclass
class EnrichmentResult:
    document_id: str
    irac: IRACAnalysis | None = None
    features: DocumentFeatures | None = None
    criterios: list[Criterio] = field(default_factory=list)
    dispositivos: list[DispositivoLegal] = field(default_factory=list)
    precedentes: list[Precedente] = field(default_factory=list)
    legislacao: list[DecisaoLegislacaoEdge] = field(default_factory=list)
    related_decisions: list[RelatedDecision] = field(default_factory=list)
    kg_available: bool = False
```

If a FeaturesStore is configured, the enricher also loads the 21 AI-extracted features for the document.

Endpoint: POST /v1/factual/extract

The FactualExtractor (core/factual_extractor.py) uses a Groq LLM to extract two independent structured representations from legal text:

The first is a factual digest: a set of 10-15 factual bullets plus a condensed narrative (2-3 sentences), optimized for semantic search:

  • Each bullet cites the source excerpt from the original text when identifiable
  • Uncertain or contested facts are marked with uncertainty: true
  • The digest_text is designed to be dense and comparable across cases

The second is the central legal thesis, the core legal argument extracted from the document:

  • Central thesis description (1-3 sentences)
  • Legal basis: statutes cited (e.g., “CDC art. 14”, “CC/2002 art. 186”)
  • Precedents cited in the text (e.g., “REsp 1.234.567/SP”, “Sumula 297/STJ”)

These dense representations produce more discriminative vectors than embedding entire decisions, which suffer from topic averaging across long documents. The input is capped at 4,000 characters to align with LLM context limits.
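
The 4,000-character cap can be applied with a simple truncation helper. The sentence-boundary heuristic below is an assumption for illustration, not necessarily what core/factual_extractor.py does.

```python
MAX_INPUT_CHARS = 4_000  # the cap mentioned above

def cap_input(text: str, limit: int = MAX_INPUT_CHARS) -> str:
    """Truncate extractor input, preferring to cut at the last sentence
    boundary before the limit when one exists."""
    if len(text) <= limit:
        return text
    cut = text[:limit]
    last_period = cut.rfind(". ")
    return cut[: last_period + 1] if last_period > 0 else cut

long_text = "O autor alegou dano moral. " * 500
print(len(cap_input(long_text)) <= MAX_INPUT_CHARS)  # -> True
```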

A temporal validity check, integrated into the verifier pipeline, determines whether referenced legal norms are still in effect. This catches references to revoked or superseded legislation, a common source of inaccuracy in AI-generated legal content.
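
The temporal check can be sketched against a small status table. The NORM_STATUS entries and field names here are illustrative assumptions (Lei 5.869/1973 is the former CPC, repealed by Lei 13.105/2015), not the project's actual reference schema.

```python
# Illustrative vigencia lookup -- not the real data/reference/ schema.
NORM_STATUS = {
    "Lei 5.869/1973": {"revoked_by": "Lei 13.105/2015"},
    "Lei 13.105/2015": {"revoked_by": None},
}

def check_temporal_validity(norm: str) -> dict:
    """Classify a cited norm as vigente, revoked, or unverified."""
    info = NORM_STATUS.get(norm)
    if info is None:
        return {"norm": norm, "status": "unverified"}
    if info["revoked_by"]:
        return {"norm": norm, "status": "revoked",
                "superseded_by": info["revoked_by"]}
    return {"norm": norm, "status": "vigente"}

print(check_temporal_validity("Lei 5.869/1973")["status"])  # -> revoked
```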