How We Grade Ourselves, Blind — Read the Scripture

The Objection First

You Graded Your Own Homework

That is the fair objection, so let me state it before a skeptic does. I publish the studies on this site, and the grades printed at the bottom of those studies were produced inside my own process. No external journal reviewed them. No committee signed off. If you stopped reading here, you would be right to discount every grade on the site.

I work in detection engineering, where a test you wrote yourself, against your own assumptions, has a name: a blind spot generator. The answer in my trade is not to promise you tested carefully. It is to change who runs the test and what they are allowed to see. That is what we did here. This page shows the machinery — not so you will trust the grades, but so you can decide for yourself how much they are worth.

The Short Version

Grades are produced by a fresh AI instance that did not write the piece and is prevented from reading anything except a fixed rubric and the one page being graded — no prior grades, no project notes, no other pages. It runs three to five times so the spread is visible, and each grade is tied to the exact version of the file it scored. The grade measures the quality of the reasoning, not the truth of the conclusion — and it has real limits, stated below, that no amount of process removes. One published type is exempt: Marginalia — labeled opinion — carries no grade at all. It is held to an honesty checklist instead, and Studies never cite it as evidence.

What the Grade Measures

Each Study is scored against a fixed rubric on five dimensions, zero to four points each, twenty in total. Graders also map scores to fixed letter bands (18–20 is A+, 16–17 is A, 14–15 is B+, and so on), and the permanent log retains those letters — but as of July 2026 the pages themselves display only the numeric score; letter glyphs were retired from page display. The grader must cite specific text from the page for every score — a number without a citation doesn't count.

Claim precision

Are the claims stated exactly, or do they swell past what the argument supports? "Virtually the only figure to use this word" is precise; "more than any other figure in all of Scripture" is a flag.

Evidence grounding

Is every factual claim traceable to a cited source or the text itself — and do the citations actually say what the piece says they say?

Epistemic honesty

Does the piece state its bias, show the conflicting evidence, and name what it does not know — or does it harmonize the difficulties away?

Internal consistency

Does the conclusion follow from the body? Does any section contradict another? Does the piece break a rule it states for itself?

Scope appropriateness

Does the piece claim only what its evidence covers — or does a study of two cases quietly become a claim about all of Scripture?

Notice what is not on the list: whether the conclusion is true. A grade is not an endorsement of the theology. A piece arguing a position I hold and a piece arguing one I don't would be scored the same way — on the discipline of the argument, not its destination.

What "Blind" Means Here

The grading is done by Claude, the same AI I collaborate with to build and critique this site — but not the same instance. The instance that helped write a piece never grades it. A fresh instance is started with no memory of the writing, no stake in the outcome, and no idea what grade anyone hoped for.

"Pretend you don't know who wrote this" is the weakest form of blinding — style leaks authorship, and a willing grader finds reasons to be kind. Claude's own account, asked directly, is blunter: told to be impartial about its own work, it would try and partly fail, because fluency feels like correctness from inside. Real blinding therefore restricts what the grader is allowed to read, not what it is told to pretend. Our grader reads exactly two things: the rubric and the one page being graded. It is barred from the project's notes, its session records, its page inventory, every other page on the site, and — most importantly — any previously recorded grade. It cannot anchor on the old number because it cannot see the old number.

Three to five runs, not one. Independent fresh instances grade the same page. If the runs cluster within one letter band, the grade is stable. If they swing wider, the instability is itself the finding and gets reported, not averaged away.
Every grade is tied to a file version. A grade is valid only for the exact text the grader saw. If the page is edited afterward, the grade is stale until re-run. This rule exists because it caught a real failure — see the next section.
Every run is logged. The scores, the spread, the grader's cited reasons, and the file version live in a permanent log — published in full in the project's public repository, every run included. The number printed on a page is reconciled to the blind result, never the other way around.

“A grade you cannot reproduce is a mood with a number.”

The rule the whole system serves

What It Has Caught

A process earns trust by catching things, against the author's interest. Here is some of what this one caught — kept on the record deliberately, because the catches are the credential.

Page	Recorded grade	Blind result	What was actually wrong
Relenting	A+ (19/20)	15 / 16 / 17	A whole section had been disabled in the page code during an edit, while the title, table of contents, and a cross-reference still pointed to it. All three blind runs found the inconsistency independently. Whether the A+ was earned by an earlier version or was simply generous, it no longer described the file as it stood.
The Samaritan Woman	A+ (19/20)	14 / 14 / 16	Not a stale grade — an over-grade with a real factual error inside it: Matthew 27:54 was miscited as a Gentile responding to the risen Christ; it is the centurion's confession at the crucifixion. Every blind run flagged it. The self-review had missed it entirely.
Faith in Action	A (16/20)	12 / 15 / 16	Same disabled-section defect as Relenting (a "fifth portrait" the text still promised but never delivered), surfacing as high variance — the runs disagreed because the page disagreed with itself.

All three pages were fixed and re-graded blind. The corrections moved the scores in the predicted direction — Relenting to a unanimous 19, the variance collapsing — which is the strongest evidence the process measures something real rather than generating numbers. The errors above stayed in the public record. A process that only logs its successes is marketing.

We also tested the tester

In June 2026 we ran a controlled comparison: the same nine hard questions answered by three configurations of the AI — the "warm" instance that co-authored the site, a fresh instance that read the site's content cold, and a fresh instance that read nothing. Three separate blind judges scored the anonymized answers without knowing which was which.

The result reshaped our process. Reading the site's research made answers measurably better — but warmth did not. The co-author instance was just as accurate, yet scored lowest of the three on independence of citation: it leaned almost entirely on the site's own pages as evidence and slid into "our reading" framing. The site now operates under a standing rule born from that finding: on any contested claim, the site's own pages count as one voice, never neutral authority, and at least one independent external source is required.

What the Grade Cannot Do

Here is the honest ceiling, and it does not move.

The grader is the same kind of mind as the author. A fresh instance removes motivated reasoning, memory of the writing, and anchoring on prior grades. It does not remove blind spots shared across the same AI model — if the author-instance and the grader-instance share a wrong assumption, the grade will sail right past it. The only true blind is a different model or an informed human, and this site tracks that check per piece rather than claiming it uniformly: several of the most contested pieces now carry a logged cross-model read on their own page, and where that has not yet happened, the piece's own record says so instead of implying otherwise. A read from an informed human outside this process is still the one check no cross-model run replaces — see the invitation on How This Project Works.

A clean grade does not make a conclusion true. The rubric scores coherence, grounding, honesty, and scope. A piece can earn all four and still be wrong — coherent-from-inside is not the same thing as true, which is an argument one of our own articles makes at length. The grade tells you the argument is disciplined. Whether it is right is between you, the text, and the sources.

The numbers run high, and we know it. Most published pieces on this site grade A or A+. Some of that is survivorship — weak drafts get rejected or revised before publishing, and the rejects are kept in a folder, not deleted. Some of it is a measured calibration offset: cold blind grades consistently come in above the author-side estimate for the same piece, a gap we have now observed repeatedly and treat as a known property of the instrument. Read a cold A+ as the optimistic ceiling, the warm estimate as the conservative floor — and read neither as settling a contested claim.

The objection that survives

After everything above, a determined skeptic can still say: "An AI system, however blinded, graded work it helped produce, against a rubric its collaborators wrote." Correct — and we have not solved it; we have only disclosed it, bounded it, and logged where the outside check is still owed. If that places a discount on every grade here, the discount is fair. The grades are offered as evidence of process, not proof of quality.

Why Show Any of This?

Because the site's whole method is disclosed framing rather than claimed neutrality. Every Study states its bias before the argument begins; it would be strange if the grading system got to keep its workings private. The machinery on this page is the same standard, applied to ourselves: say what was done, show what it caught, name what it cannot do, and let the reader weigh it.

If you find an error the process missed, that is not an embarrassment to be managed — it is a contribution. The corrections that came from outside this process are logged with the same care as the grades. Read the work, check the sources, and hold us to it.

Lee Sadler Driven by data and curiosity. Studies informed by NRSVue, NKJV, NIV, and ESV; Blue Letter Bible; Strong's Concordance; biblical commentaries; and collaboration with Claude. · Page reviewed June 2026. · Email comments & questions