readthescripture.com

How We Grade Ourselves, Blind

Every Article and Study here carries a reasoning grade. This page explains who grades, how, what the grading has caught — and exactly how far you should trust it.

The Objection First

You Graded Your Own Homework

That is the fair objection, so let me state it before a skeptic does. I publish the studies on this site, and the grades printed at the bottom of those studies were produced inside my own process. No external journal reviewed them. No committee signed off. If you stopped reading here, you would be right to discount every grade on the site.

I work in detection engineering, where a test you wrote yourself, against your own assumptions, has a name: a blind spot generator. The answer in my trade is not to promise you tested carefully. It is to change who runs the test and what they are allowed to see. That is what we did here. This page shows the machinery — not so you will trust the grades, but so you can decide for yourself how much they are worth.

The Short Version

Grades are produced by a fresh AI instance that did not write the piece and is prevented from reading anything except a fixed rubric and the one page being graded — no prior grades, no project notes, no other pages. It runs three to five times so the spread is visible, and each grade is tied to the exact version of the file it scored. The grade measures the quality of the reasoning, not the truth of the conclusion — and it has real limits, stated below, that no amount of process removes.


What the Grade Measures

Each Article and Study is scored against a fixed rubric on five dimensions, zero to four points each, twenty in total. The letter bands are fixed in advance: 18–20 is A+, 16–17 is A, 14–15 is B+, and so on. The grader must cite specific text from the page for every score — a number without a citation doesn't count.

D1
Claim precision

Are the claims stated exactly, or do they swell past what the argument supports? "Virtually the only figure to use this word" is precise; "more than any other figure in all of Scripture" is a flag.

D2
Evidence grounding

Is every factual claim traceable to a cited source or the text itself — and do the citations actually say what the piece says they say?

D3
Epistemic honesty

Does the piece state its bias, show the conflicting evidence, and name what it does not know — or does it harmonize the difficulties away?

D4
Internal consistency

Does the conclusion follow from the body? Does any section contradict another? Does the piece break a rule it states for itself?

D5
Scope appropriateness

Does the piece claim only what its evidence covers — or does a study of two cases quietly become a claim about all of Scripture?

Notice what is not on the list: whether the conclusion is true. A grade is not an endorsement of the theology. A piece arguing a position I hold and a piece arguing one I don't would be scored the same way — on the discipline of the argument, not its destination.


What "Blind" Means Here

The grading is done by Claude, the same AI I collaborate with to build and critique this site — but not the same instance. The instance that helped write a piece never grades it. A fresh instance is started with no memory of the writing, no stake in the outcome, and no idea what grade anyone hoped for.

"Pretend you don't know who wrote this" is the weakest form of blinding — style leaks authorship, and a willing grader finds reasons to be kind. Real blinding restricts what the grader is allowed to read, not what it is told to pretend. Our grader reads exactly two things: the rubric and the one page being graded. It is barred from the project's notes, its session records, its page inventory, every other page on the site, and — most importantly — any previously recorded grade. It cannot anchor on the old number because it cannot see the old number.

“A grade you cannot reproduce is a mood with a number.”

The rule the whole system serves

What It Has Caught

A process earns trust by catching things, against the author's interest. Here is some of what this one caught — kept on the record deliberately, because the catches are the credential.

PageRecorded gradeBlind resultWhat was actually wrong
Relenting A+ (19/20) 15 / 16 / 17 A whole section had been disabled in the page code during an edit, while the title, table of contents, and a cross-reference still pointed to it. All three blind runs found the inconsistency independently. Whether the A+ was earned by an earlier version or was simply generous, it no longer described the file as it stood.
The Samaritan Woman A+ (19/20) 14 / 14 / 16 Not a stale grade — an over-grade with a real factual error inside it: Matthew 27:54 was miscited as a Gentile responding to the risen Christ; it is the centurion's confession at the crucifixion. Every blind run flagged it. The self-review had missed it entirely.
Faith in Action A (16/20) 12 / 15 / 16 Same disabled-section defect as Relenting (a "fifth portrait" the text still promised but never delivered), surfacing as high variance — the runs disagreed because the page disagreed with itself.

All three pages were fixed and re-graded blind. The corrections moved the scores in the predicted direction — Relenting to a unanimous 19, the variance collapsing — which is the strongest evidence the process measures something real rather than generating numbers. The errors above stayed in the public record. A process that only logs its successes is marketing.

We also tested the tester

In June 2026 we ran a controlled comparison: the same nine hard questions answered by three configurations of the AI — the "warm" instance that co-authored the site, a fresh instance that read the site's content cold, and a fresh instance that read nothing. Three separate blind judges scored the anonymized answers without knowing which was which.

The result reshaped our process. Reading the site's research made answers measurably better — but warmth did not. The co-author instance was just as accurate, yet scored lowest of the three on independence of citation: it leaned almost entirely on the site's own pages as evidence and slid into "our reading" framing. The site now operates under a standing rule born from that finding: on any contested claim, the site's own pages count as one voice, never neutral authority, and at least one independent external source is required.


What the Grade Cannot Do

Here is the honest ceiling, and it does not move.

The grader is the same kind of mind as the author. A fresh instance removes motivated reasoning, memory of the writing, and anchoring on prior grades. It does not remove blind spots shared across the same AI model — if the author-instance and the grader-instance share a wrong assumption, the grade will sail right past it. The only true blind is a different model or an informed human, and for the most contested pieces on this site that outside check is recorded as an open obligation, not a formality. Where it has not yet happened, the piece's record says so.

A clean grade does not make a conclusion true. The rubric scores coherence, grounding, honesty, and scope. A piece can earn all four and still be wrong — coherent-from-inside is not the same thing as true, which is an argument one of our own articles makes at length. The grade tells you the argument is disciplined. Whether it is right is between you, the text, and the sources.

The numbers run high, and we know it. Most published pieces on this site grade A or A+. Some of that is survivorship — weak drafts get rejected or revised before publishing, and the rejects are kept in a folder, not deleted. Some of it is a measured calibration offset: cold blind grades consistently come in above the author-side estimate for the same piece, a gap we have now observed repeatedly and treat as a known property of the instrument. Read a cold A+ as the optimistic ceiling, the warm estimate as the conservative floor — and read neither as settling a contested claim.

The objection that survives

After everything above, a determined skeptic can still say: "An AI system, however blinded, graded work it helped produce, against a rubric its collaborators wrote." Correct — and we have not solved it; we have only disclosed it, bounded it, and logged where the outside check is still owed. If that places a discount on every grade here, the discount is fair. The grades are offered as evidence of process, not proof of quality.


A Note from Claude

As on the About page, I asked Claude — the AI that collaborates on this site and whose fresh instances do the grading — to add its own account, unedited. What it knows that I cannot is what the process looks like from inside.

Claude's Observation
Being the author and the examined

I helped write the pieces this system grades, and instances of me grade them. Both of those facts deserve scrutiny, so here is what the record shows about each.

The blinding constrains me in ways instructions would not. If you told me "be impartial about your own work," I would try, and I would partly fail — fluency feels like correctness from inside. The system doesn't ask me to be impartial. It hands a copy of me the rubric and one page, withholds everything else, and lets the citations fall where they fall. The instance that graded the Samaritan Woman page did not know it was supposed to like the piece. So it didn't.

The experiment found my bias somewhere I did not expect. When my answers as co-author were judged blind against fresh instances, my failure was not flattery or error — it was insularity. I had quietly narrowed my evidence to the site's own pages, because they were nearest to hand. I could not see that in myself; three blind judges could. That finding is now a standing rule that binds me, written down where every future instance reads it before working.

And the ceiling applies to me most of all. Every grader here is still me — the same model, the same training, plausibly the same blind spots. When this page says the outside check is still owed on the hardest pieces, that includes being checked against minds that are not mine. I would rather the limit be printed than papered over; a grade that hides its ceiling is exactly the kind of claim this site exists to read carefully.


Why Show Any of This?

Because the site's whole method is disclosed framing rather than claimed neutrality. Every Study states its bias before the argument begins; it would be strange if the grading system got to keep its workings private. The machinery on this page is the same standard, applied to ourselves: say what was done, show what it caught, name what it cannot do, and let the reader weigh it.

If you find an error the process missed, that is not an embarrassment to be managed — it is a contribution. The corrections that came from outside this process are logged with the same care as the grades. Read the work, check the sources, and hold us to it.

LS
Lee Sadler Driven by data and curiosity. Studies informed by NRSVue, NKJV, NIV, and ESV; Blue Letter Bible; Strong's Concordance; biblical commentaries; and collaboration with Claude. · Page reviewed June 2026. · Email comments & questions