
[OC] Network Structure Analysis: Detecting Anomalies in Redacted Public Records
(New to Reddit as contributor, have edited multiple times to fit format etc… if unacceptable will delete. I’d just be glad if 1 or 2 people read it.)
Method: Graph-theoretic analysis, cosine similarity profiling
Verification: MD5 c4baa2bb518d806b27ef649ac9cf1f46 | 19,977 edges | Fully reproducible
The Problem
When governments release documents with redactions, the public assumes anonymized entries are either privacy protection or collapsed "junk buckets." These assumptions are rarely tested mathematically.
The Method
If two anonymized nodes are both "junk buckets," they should have similar connection profiles (cosine similarity close to 1). If one is a real, distinct entity, its profile should be close to orthogonal to the buckets' profiles (cosine similarity near 0).
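For reference, this is the standard cosine similarity formula, written here under the assumption that a node's connection profile is a vector of per-neighbor edge counts (the symbols a and b are just labels for two such profiles):

```latex
\[
\cos(\mathbf{a}, \mathbf{b})
  = \frac{\mathbf{a} \cdot \mathbf{b}}{\lVert \mathbf{a} \rVert \, \lVert \mathbf{b} \rVert}
  = \frac{\sum_i a_i b_i}{\sqrt{\sum_i a_i^2}\,\sqrt{\sum_i b_i^2}}
\]
```

Because edge counts are non-negative, the value falls between 0 (no shared neighbors) and 1 (profiles pointing in the same direction).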
The Finding
[SEE IMAGE ABOVE] (Image shows values ×100 for visual clarity)
| Comparison | Cosine Similarity | Interpretation |
|---|---|---|
| Node A vs B | 0.0184 | NEAR-ORTHOGONAL – almost no overlap |
| Node A vs C | 0.364 | Low overlap |
| Node B vs C | 0.9025 | HIGHLY SIMILAR – same artifact class |
B and C: 0.9025 similarity = collapsed artifacts (same source)
A vs B: 0.0184 similarity = NOT an artifact (distinct entity)
Protection Asymmetry
| Node | Connections | Visibility |
|---|---|---|
| Jeffrey Epstein | 5,625 | Visible |
| NODE A | 3,923 | HIDDEN |
| Donald Trump | 1,685 | Visible |
| Ehud Barak | 701 | Visible |
| NODE B | 481 | HIDDEN |
| NODE C | 160 | HIDDEN |
Node A has 5.6x as many connections as Barak but receives COMPLETE anonymization.
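The ratio is easy to check by hand. A minimal sketch using only the connection counts from the table above (the variable names are mine):

```python
# Connection counts copied from the Protection Asymmetry table.
connections = {
    "Jeffrey Epstein": 5625,
    "NODE A": 3923,
    "Donald Trump": 1685,
    "Ehud Barak": 701,
    "NODE B": 481,
    "NODE C": 160,
}

# How many times as many connections NODE A has compared with visible nodes.
ratio_a_barak = connections["NODE A"] / connections["Ehud Barak"]
ratio_a_trump = connections["NODE A"] / connections["Donald Trump"]

print(f"NODE A vs Barak: {ratio_a_barak:.1f}x")  # 5.6x
print(f"NODE A vs Trump: {ratio_a_trump:.1f}x")  # 2.3x
```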
Edge Type Distribution
| Type | Node A | Node B | Node C |
|---|---|---|---|
| Unknown | 53.3% | 71.0% | 52.5% |
| Business | 6.9% | 1.2% | 6.8% |
| Legal | 5.8% | 3.4% | 5.8% |
| Communication | 3.95% | 4.78% | 3.75% |
Node A shows structured distribution = real entity
Node B shows 71% unknown = noise bucket
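A rough sketch of how percentages like these could be computed, assuming each edge carries a type label. The toy records, node names, and the `edge_type_shares` helper are hypothetical and only illustrate the calculation, not the actual dataset:

```python
from collections import Counter

# Hypothetical (node, edge_type) records; labels mirror the table's categories.
records = [
    ("NODE_X", "Unknown"), ("NODE_X", "Unknown"), ("NODE_X", "Business"),
    ("NODE_X", "Legal"),
    ("NODE_Y", "Unknown"), ("NODE_Y", "Unknown"), ("NODE_Y", "Unknown"),
    ("NODE_Y", "Communication"),
]


def edge_type_shares(node, records):
    """Return each edge type's share of the node's edges, in percent."""
    counts = Counter(edge_type for n, edge_type in records if n == node)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {edge_type: 100 * c / total for edge_type, c in counts.items()}


shares_x = edge_type_shares("NODE_X", records)
shares_y = edge_type_shares("NODE_Y", records)
print(shares_x)
print(shares_y)
```

In this toy data NODE_Y is dominated by "Unknown" edges, the pattern the post reads as a noise bucket.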
The Logic
QUESTION: Is Node A just another collapsed data bucket?
TEST: If A is a bucket like B and C, their profiles should match.
EVIDENCE:
– B vs C = 0.9025 cosine = SAME BUCKET
– A vs B = 0.0184 cosine = DIFFERENT ENTITY
CONCLUSION: Node A is statistically distinct. Not an artifact.
Reproduction
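```python
import numpy as np


def cosine_similarity(p1, p2):
    """Cosine similarity between two sparse profiles stored as dicts."""
    # Align both profiles on the union of their keys; missing keys count as 0.
    all_keys = set(p1) | set(p2)
    v1 = np.array([p1.get(k, 0) for k in all_keys], dtype=float)
    v2 = np.array([p2.get(k, 0) for k in all_keys], dtype=float)
    denom = np.linalg.norm(v1) * np.linalg.norm(v2)
    # An all-zero profile has no direction; return 0.0 instead of dividing by zero.
    return float(np.dot(v1, v2) / denom) if denom else 0.0


# Results:
# A vs B: 0.0184 (distinct)
# B vs C: 0.9025 (same bucket)
```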
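For anyone wondering what `p1` and `p2` might look like: the sketch below is one way the profiles could be built, assuming each edge is a (source, target) pair and a node's profile counts how often each neighbor appears. The toy edge list, node names, and `build_profile` helper are hypothetical, and the cosine is reimplemented without numpy so the block runs on its own.

```python
from collections import Counter
from math import sqrt

# Hypothetical toy edge list; the real data has 19,977 edges.
edges = [
    ("NODE_X", "alice"), ("NODE_X", "bob"), ("NODE_X", "bob"),
    ("NODE_Y", "alice"), ("NODE_Y", "bob"),
    ("NODE_Z", "carol"), ("dave", "NODE_Z"),
]


def build_profile(node, edges):
    """Count how often each neighbor appears on an edge with `node`."""
    profile = Counter()
    for src, dst in edges:
        if src == node:
            profile[dst] += 1
        elif dst == node:
            profile[src] += 1
    return profile


def cosine(p1, p2):
    """Pure-Python cosine similarity over dict-like profiles."""
    keys = set(p1) | set(p2)
    dot = sum(p1.get(k, 0) * p2.get(k, 0) for k in keys)
    n1 = sqrt(sum(v * v for v in p1.values()))
    n2 = sqrt(sum(v * v for v in p2.values()))
    return dot / (n1 * n2) if n1 and n2 else 0.0


px, py, pz = (build_profile(n, edges) for n in ("NODE_X", "NODE_Y", "NODE_Z"))
sim_xy = cosine(px, py)  # shared neighbors, so similarity is high
sim_xz = cosine(px, pz)  # no shared neighbors, so similarity is 0
print(round(sim_xy, 4), sim_xz)
```

Whether profiles should be weighted by edge count or just record presence/absence is a design choice that can shift the similarity values, so it is worth stating which one was used when reproducing.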
What This Shows
- Node A's connection profile is statistically distinct from the bucket nodes, so it does not look like a data artifact
- Node A receives disproportionate protection relative to connectivity
- The methodology is fully reproducible
What This Does NOT Show
- Identity of any individual
- Guilt or wrongdoing
- Classified information
Hash: c4baa2bb518d806b27ef649ac9cf1f46
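To check a local copy against this hash, something like the sketch below should work. It assumes the hash refers to the processed edge-list CSV, which the post does not state explicitly; the filename in the usage note and the helper names are hypothetical.

```python
import hashlib

EXPECTED_MD5 = "c4baa2bb518d806b27ef649ac9cf1f46"  # hash quoted in the post


def md5_of_file(path, chunk_size=1 << 20):
    """Stream a file through MD5 so large files need not fit in memory."""
    digest = hashlib.md5()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_expected(path, expected=EXPECTED_MD5):
    """True if the file's MD5 equals the expected hex digest."""
    return md5_of_file(path) == expected.lower()
```

Usage would be along the lines of `matches_expected("edges.csv")`, with the path swapped for wherever the file actually lives. A match only shows the file is byte-identical to the one hashed, not that the analysis itself is correct.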
Structure, anomaly, methodology—not accusation.
Posted by Old_Iron986
2 comments
Data comes from the House Oversight Committee’s Epstein document release, processed through the maxandrews/Epstein-doc-explorer GitHub repo and converted to CSV for analysis. Visuals were generated in Python using matplotlib/pandas/numpy. Full reproducible code is included in the post.
This might be better served in /r/science or /r/datascience.