Methodology Limitations, RocSite Discovery

Why this page exists. A trustworthy audit service has to be specific about what its audit can and cannot tell you. This page is that specificity. Read it before you submit, and read it again when you receive a rating.

What an automated rating CAN tell you

An automated RocSite Discovery rating tells you, deterministically and reproducibly, whether a submission satisfies the methodological criteria encoded in our published protocol at the version stamped on the rating.

Concretely, an automated rating can answer:

Does the submission meet our pre-registration discipline? The engine reports pass or fail for each of the 8 OSF rules: filed before analysis, hypothesis stated explicitly, cohort definition, outcome variables defined, pre-specified analysis plan, falsification criteria, immutability, and deviations flagged.
Do the falsification gates pass? When a gate is implementable for a submission's domain (e.g., care-process leakage detection on a clinical AI claim), the engine reports pass / fail / not applicable for that gate.
Is the cited prior literature consistent with the new claim? The engine cross-references claimed effect sizes against meta-analyses and primary literature where the engine has access to a comparable corpus.
Does the submission survive the same protocol applied to RocSite's own work? We hold our own published findings to the same gates we hold yours to. A rating reports your submission against that consistent bar.
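The rule-by-rule and gate-by-gate reporting described above can be pictured with a minimal sketch. The rule identifiers, the RatingRecord type, and the breakdown function are hypothetical names chosen for illustration, not the engine's actual schema. Treating a rule that is missing from the record as failed is likewise an assumption of this sketch.

```python
from dataclasses import dataclass
from enum import Enum


class Outcome(Enum):
    """Tri-state result for a falsification gate."""
    PASS = "pass"
    FAIL = "fail"
    NOT_APPLICABLE = "not applicable"


# The eight pre-registration rules named on this page (identifiers are hypothetical).
PREREGISTRATION_RULES = (
    "filed_before_analysis",
    "hypothesis_stated_explicitly",
    "cohort_definition",
    "outcome_variables_defined",
    "pre_specified_analysis_plan",
    "falsification_criteria",
    "immutability",
    "deviations_flagged",
)


@dataclass(frozen=True)
class RatingRecord:
    protocol_version: str
    rules: dict[str, bool]     # rule name -> True (pass) / False (fail)
    gates: dict[str, Outcome]  # gate name -> pass / fail / not applicable


def breakdown(record: RatingRecord) -> list[str]:
    """Render one line per rule, then one line per gate."""
    lines = []
    for rule in PREREGISTRATION_RULES:
        # Assumption: a rule absent from the record counts as failed.
        passed = record.rules.get(rule, False)
        lines.append(f"rule {rule}: {'pass' if passed else 'fail'}")
    for gate, outcome in sorted(record.gates.items()):
        lines.append(f"gate {gate}: {outcome.value}")
    return lines


record = RatingRecord(
    protocol_version="example-version",
    rules={rule: True for rule in PREREGISTRATION_RULES} | {"deviations_flagged": False},
    gates={"care_process_leakage": Outcome.NOT_APPLICABLE},
)
for line in breakdown(record):
    print(line)
```

Keeping "not applicable" as its own outcome, rather than folding it into pass or fail, mirrors the three-way gate report described above.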

What an automated rating CANNOT tell you

Equally important. The engine produces methodological signal, not scientific truth.

Whether the underlying science is correct. A submission can pass every gate and still be wrong about the world. Strong methodology is necessary, not sufficient.
Whether the conclusions generalize. The engine evaluates internal validity against the protocol. External validity (does this hold in a different population, a different hospital, a different decade?) is a separate question we do not answer.
Whether the data was collected ethically. We do not assess IRB approval, consent procedures, data-use agreements, or patient-protection regimes. That is the role of IRBs, regulators, and your institution.
Whether the work is novel. The engine does not search the global literature for prior art. A submission can pass our protocol and still duplicate published work.
Whether peer reviewers will agree. Peer reviewers bring domain judgment, taste, and disciplinary context that our protocol does not encode.
Whether the work is fraudulent. We do not detect data fabrication, plagiarism, or image manipulation. Our gates are methodological, not forensic.

Known limitations of the current engine

Be specific. This is what we can and cannot do today, named.

Domain coverage. The engine's gate library is most mature for clinical AI on ICU outcome data (MIMIC-IV, eICU-CRD). Gates for financial models, legal AI, and applied research are in active development and are presently audit-by-engagement only; they do not run automatically on submissions in those domains.
Care-process leakage. The gate is implemented for clinical AI claims; the analogous gate for non-clinical domains is conceptual, not yet automated.
Adversarial debate. The Advocate / Adversary / Arbiter RocStars layer is implemented for findings the engine itself produces. Submissions to the public registry currently get rule-by-rule feedback without the full debate trace; the debate layer is on the roadmap for paid tiers.
Subgroup fairness. Where a submission's cohort definition lacks the demographic stratification needed to evaluate subgroup effects, the engine flags this as "scope: not assessable" rather than passing or failing the gate. The engine evaluates stratification on age, sex, and recorded race/ethnicity where the cohort includes those fields; subgroup gates for social deprivation index (SDI), payer, and primary language are documented in the protocol but not yet implemented. We do not invent stratification.
Pre-registration timestamp. We accept the OSF or comparable timestamp at face value. We do not independently audit the registry's chain of custody. Disputes about an OSF registry entry's timestamp or chain of custody are handled by OSF directly. We can refer you to the appropriate OSF support channel.
PDF parsing. OCR'd or image-only PDFs may produce degraded ratings because the engine cannot extract structured methodology from rendered images at the same fidelity as text PDFs. We flag this when detected.
Language. The engine reads English. Non-English submissions are translated programmatically before review; translated submissions are eligible for the Exploratory tier only and require independent human re-review before any promotion to Confirmed.
Validator input scope = model input scope. When the engine evaluates an external clinical AI (e.g., a sepsis classifier under AI Governor), the validator necessarily runs on the same record-level inputs the AI received. If those inputs are insufficient to establish ground truth (for instance, if they lack the documented infection source needed to confirm Sepsis-3), the validator operates on a SIRS-like proxy rather than the full Sepsis-3 criteria. This is disclosed in every per-case reasoning note. Two consequences are worth flagging in audit and procurement contexts: (1) verdicts labeled "Cannot be verified" are not the same as "Model wrong" and should not be counted in a false-positive rate; (2) a clean validator pass means "consistent with the available record," not "biologically confirmed." Stricter ground-truth assessment requires the chart, the cultures, and the clinician, not the validator alone.
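A small worked sketch shows why unverifiable verdicts should stay out of a false-positive rate. The verdict labels and the summarize_positive_calls function are hypothetical. The "rate" is taken loosely as the share of the model's positive calls that the validator rejected. Dropping unverifiable cases from both numerator and denominator and reporting them separately is one reasonable reading of the guidance above, not a documented formula.

```python
from __future__ import annotations

from collections import Counter

# Hypothetical verdict labels for cases where the model made a positive call.
AGREES = "agrees"
MODEL_WRONG = "model_wrong"
CANNOT_BE_VERIFIED = "cannot_be_verified"


def summarize_positive_calls(verdicts: list[str]) -> dict[str, float | int | None]:
    """Compute a false-positive rate over verifiable cases only."""
    counts = Counter(verdicts)
    verifiable = counts[AGREES] + counts[MODEL_WRONG]
    # Unverifiable cases are excluded from the rate and reported on their own.
    rate = counts[MODEL_WRONG] / verifiable if verifiable else None
    return {
        "false_positive_rate": rate,
        "verifiable": verifiable,
        "cannot_be_verified": counts[CANNOT_BE_VERIFIED],
    }


verdicts = [AGREES] * 6 + [MODEL_WRONG] * 2 + [CANNOT_BE_VERIFIED] * 2
summary = summarize_positive_calls(verdicts)
print(summary)
```

In this toy sample, lumping the two unverifiable cases in with the two rejected ones would report 4 of 10 instead of 2 of 8, overstating the model's error rate.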

AI Governor splits "Cannot be verified" into two sub-categories, scope_gap_kept and text_override, and records them alongside a third value, primary, in the verdict_provenance field that appears in every audit row:
- scope_gap_kept: the record is genuinely insufficient (SIRS-proxy gap, missing infection source, insufficient inputs). The validator deliberately declined to render a definitive verdict, and the audit row records why.
- text_override: the model's structured verdict was inconclusive, but its prose reasoning contained an explicit verdict literal ("DISAGREES", "AGREES", "PARTIAL"), and the system used the prose verdict. This is the override path; it does not apply when scope-gap markers are present.
- primary: the model's structured verdict was definitive (not inconclusive) and was used as-is.
Each verdict additionally carries a methodology_audit field naming which clinical criteria sets the reasoning referenced (Sepsis-2 / Sepsis-3 / CMS SEP-1 / SOFA / qSOFA / CHA₂DS₂-VASc / NIHSS / Fleischner 2017 / LI-RADS / BI-RADS / Lung-RADS / TI-RADS / Ottawa Ankle / ASIA) and surfacing clinical-sounding claims (e.g. "patient appears stable", "discharge home") that aren't grounded in any of them. A grounding signal of ungrounded means the verdict was reached without referencing a named criteria set and should be treated as exploratory regardless of the verdict itself.
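The provenance rules above can be read as a small decision procedure. The sketch below is one way to express it, not AI Governor's implementation. The resolve function, the Resolved type, the INCONCLUSIVE label, and the representation of scope-gap markers as a list of strings are all assumptions. The page does not say what happens when the structured verdict is inconclusive and neither scope-gap markers nor a prose literal are present, so the sketch leaves provenance unset in that case.

```python
import re
from typing import NamedTuple, Optional

VERDICT_LITERALS = ("DISAGREES", "AGREES", "PARTIAL")
INCONCLUSIVE = "INCONCLUSIVE"  # hypothetical label for an inconclusive structured verdict

# Word boundaries keep "AGREES" from matching inside "DISAGREES".
_LITERAL_RE = re.compile(r"\b(" + "|".join(VERDICT_LITERALS) + r")\b")


class Resolved(NamedTuple):
    verdict: str
    verdict_provenance: Optional[str]


def resolve(structured_verdict: str, prose: str, scope_gap_markers: list[str]) -> Resolved:
    """Pick the final verdict and record how it was reached."""
    if structured_verdict != INCONCLUSIVE:
        # Definitive structured verdict: used as-is.
        return Resolved(structured_verdict, "primary")
    if scope_gap_markers:
        # The record is insufficient; the override path does not apply.
        return Resolved("Cannot be verified", "scope_gap_kept")
    match = _LITERAL_RE.search(prose)
    if match:
        # Inconclusive structured verdict, explicit verdict literal in the prose.
        return Resolved(match.group(1), "text_override")
    # Not specified on this page; provenance is left unset in this sketch.
    return Resolved("Cannot be verified", None)


print(resolve("AGREES", "", []))
print(resolve(INCONCLUSIVE, "Reasoning: the validator DISAGREES.", []))
print(resolve(INCONCLUSIVE, "AGREES, but no infection source.", ["missing_infection_source"]))
```

Checking scope-gap markers before scanning the prose is what keeps a stray "AGREES" in the reasoning from overriding a genuine scope gap.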

Why automated review is valuable despite these limitations

Naming the limits makes the value sharper, not softer.

Consistency. The same input produces the same rating, every time. There is no "the reviewer was tired today."
Reproducibility. A rating record names its protocol version. Anyone with the protocol can, in principle, re-derive the rating from the submission. Ratings that cannot be reproduced are bugs we want to know about.
No reviewer bias. The engine has no prior relationship with the submitter, no career stake in the submitter's institution, no conflict from prior peer review of the same author's work.
No institutional conflict. A rating can be unfavorable to a paper from a major center without political consequence, because no human author of the rating exists.
Accessibility. A pre-registration check is free. The protocol is public. The methodology is published. Anyone can audit the auditor.
Speed and scale. Automated review runs in days, not months. We can process more submissions than any panel of human reviewers, applying the same standard to each.
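The reproducibility claim above amounts to a check anyone could run: look up the protocol version named in the rating record, re-run it on the submission, and compare. The sketch below uses a toy stand-in protocol with a single rule. The PROTOCOLS registry, the record layout, and the function names are hypothetical and do not describe the engine's real interface.

```python
from typing import Callable

# Hypothetical: a protocol is a deterministic function from submission text to rule results.
Protocol = Callable[[str], dict[str, str]]


def toy_protocol_v1(submission_text: str) -> dict[str, str]:
    """Stand-in protocol with one toy rule, for illustration only."""
    text = submission_text.lower()
    return {"hypothesis_stated_explicitly": "pass" if "hypothesis:" in text else "fail"}


# Hypothetical registry keyed by the version string stamped on a rating.
PROTOCOLS: dict[str, Protocol] = {"toy-1": toy_protocol_v1}


def is_reproducible(submission_text: str, rating_record: dict) -> bool:
    """Re-derive the rating under the stamped protocol version and compare."""
    protocol = PROTOCOLS[rating_record["protocol_version"]]
    return protocol(submission_text) == rating_record["rules"]


submission = "Hypothesis: intervention X reduces outcome Y."
record = {"protocol_version": "toy-1", "rules": toy_protocol_v1(submission)}
altered = {"protocol_version": "toy-1", "rules": {"hypothesis_stated_explicitly": "fail"}}

print(is_reproducible(submission, record))
print(is_reproducible(submission, altered))
```

A mismatch like the second case is the kind of discrepancy the page describes as a bug worth reporting.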

How to use a rating well

Read the rule-by-rule breakdown, not just the headline number. The breakdown is where the actionable information lives.
Treat "Exploratory" as a flag, not a verdict. An Exploratory rating means at least one gate is unmet or untestable, often because the underlying study design is single-dataset or pre-replication. This is normal at early stages of a research program.
If a gate flagged a problem, fix the problem. Then resubmit. The engine is faster than peer review; the iteration loop is the point.
Read this page again before you cite a rating in a high-stakes context. Citing a methodology audit as if it were peer review or regulatory clearance is a category error we cannot prevent on your behalf.

How to flag a defect in this protocol

If you believe the engine evaluated your submission against a faulty gate, file the bug through the right-of-reply process described in our Terms of Service. Methodology critiques of RocSite Discovery itself are welcomed and published. A live audit service that cannot accept criticism of its own audit is not a real audit service.