Hacker News | zone411's submissions
1.Show HN: LLM Sycophancy Benchmark: Opposite-Narrator Contradictions (github.com/lechmazur)
3 points by zone411 8 days ago | past | discuss
2.Show HN: LLM Round‑Trip Translation Benchmark (github.com/lechmazur)
6 points by zone411 6 months ago | past
3.Show HN: LLM Creative Story‑Writing Benchmark V3 (github.com/lechmazur)
8 points by zone411 6 months ago | past
4.Show HN: Mapping LLM Style and Range in Flash Fiction (github.com/lechmazur)
7 points by zone411 6 months ago | past
5.Pact: Head-to-head negotiation benchmark for LLMs (github.com/lechmazur)
6 points by zone411 6 months ago | past
6.Show HN: Bazaar – a new LLM benchmark for economic reasoning under uncertainty (github.com/lechmazur)
8 points by zone411 7 months ago | past | 1 comment
7.AI Comes Up with Physics Experiments. But They Work (quantamagazine.org)
4 points by zone411 7 months ago | past
8.Emergent Price-Fixing by LLM Auction Agents (github.com/lechmazur)
7 points by zone411 8 months ago | past
9.Public Goods Game Benchmark: Contribute and Punish, a Multi-Agent Benchmark (github.com/lechmazur)
7 points by zone411 12 months ago | past
10.Elimination Game: Multi-Agent LLM Social Reasoning, Strategy, and Deception (github.com/lechmazur)
5 points by zone411 on Feb 25, 2025 | past
11.SWE-Lancer: a benchmark of freelance software engineering tasks from Upwork (arxiv.org)
111 points by zone411 on Feb 18, 2025 | past | 74 comments
12.LLM Hallucination Benchmark: R1, o1, o3-mini, Gemini 2.0 Flash Think Exp 01-21 (github.com/lechmazur)
17 points by zone411 on Feb 10, 2025 | past | 3 comments
13.Multi-Agent Step Race Benchmark: LLM Collaboration and Deception Under Pressure (github.com/lechmazur)
7 points by zone411 on Jan 22, 2025 | past | 1 comment
14.Show HN: LLM Thematic Generalization Benchmark (github.com/lechmazur)
6 points by zone411 on Jan 14, 2025 | past
15.Show HN: LLM Creative Story-Writing Benchmark (github.com/lechmazur)
5 points by zone411 on Jan 6, 2025 | past
16.Show HN: LLM Divergent Thinking Creativity Benchmark (github.com/lechmazur)
8 points by zone411 on Dec 30, 2024 | past
17.Show HN: LLM Deceptiveness and Gullibility Benchmark (github.com/lechmazur)
7 points by zone411 on Oct 22, 2024 | past | 1 comment
18.LLM Confabulation (Hallucination) Leaderboard (github.com/lechmazur)
6 points by zone411 on Oct 10, 2024 | past
19.O1-preview and o1-mini results on NYT Connections (twitter.com/lechmazur)
2 points by zone411 on Sept 13, 2024 | past | 1 comment
20.Grok is an AI modeled after the Hitchhiker’s Guide to the Galaxy (twitter.com/xai)
213 points by zone411 on Nov 5, 2023 | past | 226 comments
21.Can you beat a stochastic parrot? ParrotChess.com (parrotchess.com)
3 points by zone411 on Sept 22, 2023 | past | 4 comments
22.Generative AI while browsing in Chrome (labs.google.com)
3 points by zone411 on Aug 15, 2023 | past
23.Statement on AI Risk (safe.ai)
341 points by zone411 on May 30, 2023 | past | 921 comments
24.Google tells staff it plans to limit publishing AI research (businessinsider.com)
63 points by zone411 on May 5, 2023 | past | 28 comments
25.4th Gen Intel Xeon Scalable Sapphire Rapids Leaps Forward (servethehome.com)
2 points by zone411 on Jan 10, 2023 | past | 1 comment
26.Fast and Furious Movie Titles by 'Claude' from Anthropic AI (twitter.com/jayelmnop)
2 points by zone411 on Jan 9, 2023 | past
27.SatelliteXplorer (esri.com)
2 points by zone411 on Dec 30, 2022 | past
28.SBF Arrested by Bahamian Authorities (twitter.com/tier10k)
1308 points by zone411 on Dec 12, 2022 | past | 812 comments
29.Large Language Models Can Self-Improve (openreview.net)
3 points by zone411 on Oct 2, 2022 | past | 1 comment
30.America Reached One Million Covid Deaths (nytimes.com)
5 points by zone411 on May 14, 2022 | past