appvoid


https://ko-fi.com/appvoid

AI & ML interests

Working on small SOTA models

Recent Activity

replied to their post about 18 hours ago

Many of the large synthetic datasets on Hugging Face look surprisingly template-like. That may be one of the main reasons open models are not as good as other models. We need more people creating smaller, human-curated datasets instead of lazily sending millions of requests to large models to do the work for us.


Organizations

replied to their post about 18 hours ago

I think we agree more than it seems. I am not saying manual review should replace diversity metrics.

Skeleton counts and entropy are useful, but they only catch what the parser measures. A batch can have many distinct skeletons and still repeat the same reasoning, tone, difficulty, or task patterns.

My point is to catch that early, while the batch is still small enough to change the prompts or generation strategy. After 100k rows, deduplication cannot recover the missing diversity.

The safest rail is both: batch-level diversity metrics and periodic human review.
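A minimal sketch of what "both" could look like on a small batch, assuming a very crude notion of skeleton (function words kept, everything else collapsed into a slot). The `FUNCTION_WORDS` list and the names `skeleton`, `skeleton_stats`, and `sample_for_review` are hypothetical and only here for illustration; a real parser would be more careful, and as said above it will still only catch what it measures.

```python
import math
import random
import re
from collections import Counter

# Hypothetical, deliberately tiny function-word list (an assumption for this sketch).
FUNCTION_WORDS = {
    "the", "a", "an", "of", "to", "in", "is", "are", "what", "how", "why",
    "and", "or", "for", "on", "with", "does", "do", "can", "you", "me",
}

def skeleton(text: str) -> str:
    """Keep function words, collapse every run of other tokens into one '_' slot."""
    tokens = re.findall(r"[a-z']+|\d+", text.lower())
    out = []
    for tok in tokens:
        slot = tok if tok in FUNCTION_WORDS else "_"
        if slot == "_" and out and out[-1] == "_":
            continue  # merge consecutive content tokens into a single slot
        out.append(slot)
    return " ".join(out)

def skeleton_stats(batch):
    """Return (skeleton counts, Shannon entropy in bits) for a batch of rows."""
    counts = Counter(skeleton(row) for row in batch)
    total = sum(counts.values())
    entropy = -sum((c / total) * math.log2(c / total) for c in counts.values())
    return counts, entropy

def sample_for_review(batch, k, seed=0):
    """Draw a reproducible random sample of rows for a human to read."""
    rng = random.Random(seed)
    return rng.sample(list(batch), min(k, len(batch)))

batch = [
    "What is the capital of France?",
    "What is the capital of Peru?",
    "Explain how rain forms.",
]
counts, entropy = skeleton_stats(batch)
print(counts.most_common(3), round(entropy, 3))
print(sample_for_review(batch, k=2))
```

The metric part is cheap enough to run on every batch of 100 or 1,000 rows. The sampled rows are for the human read, which is where repeated reasoning, tone, or difficulty shows up even when the skeleton counts look healthy.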

liked a model about 21 hours ago

tiiuae/Falcon-H1-Tiny-90M-Instruct

Text Generation • 91.1M • Updated Jan 15 • 1.99k • 45

New activity in nineninesix/diamond-1.0 1 day ago

Really cool

🤗 1

#1 opened 1 day ago by appvoid

liked a model 1 day ago

nineninesix/diamond-1.0

Audio-to-Audio • Updated 2 days ago • 43

replied to their post 1 day ago

To clarify: we obviously still need to generate with the best models we have, but the batches should be evaluated manually, whether that is 100 or 1,000 examples, even if it is only a quick read through them. Never EVER let an AI generate without rails. I was too confident in frontier models' outputs, only to be disappointed later.

posted an update 1 day ago

Post

4 replies

updated a collection 2 days ago

cool datasets

Collection

some interesting datasets to use for language modeling • 10 items • Updated 2 days ago • 1

replied to their post 6 days ago

I think it is the second thing. A small model that enters a strange context may produce a relatively flat distribution: it has no strong idea of what should come next, but it still has to pick something. Anti-repetition behavior learned from data consumes capacity, so the less capacity a model has, the more likely it is to suffer from this bias.

That is my simplified view of the self-reinforcing degeneration mechanism, which you can read about here: https://arxiv.org/abs/2109.08705
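A rough way to look at both symptoms, as a sketch rather than the mechanism itself: the entropy of a next-token distribution tells you how flat it is, and the fraction of repeated n-grams in a generation tells you how much it loops. The function names and toy inputs below are assumptions made up for this example, not taken from any library or from the linked paper.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def entropy_bits(probs):
    """Shannon entropy in bits; higher means a flatter distribution."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

def repeated_ngram_fraction(tokens, n=3):
    """Share of n-grams that duplicate an earlier n-gram (0.0 means no repeats)."""
    ngrams = [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
    if not ngrams:
        return 0.0
    return 1.0 - len(set(ngrams)) / len(ngrams)

# Toy logits: a confident model vs. one with no strong idea of what comes next.
confident = entropy_bits(softmax([10.0, 0.0, 0.0, 0.0]))
flat = entropy_bits(softmax([0.0, 0.0, 0.0, 0.0]))
print(round(confident, 4), round(flat, 4))

# Toy generations: a looping one vs. a normal one.
print(repeated_ngram_fraction("a b c a b c a b c".split()))
print(repeated_ngram_fraction("the cat sat on the mat".split()))
```

Neither number explains why a small model loops, but tracking them over a set of generations is a cheap way to see whether a change in data actually reduced the looping.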

posted an update 6 days ago

Post

145

Small reasoning models are overrated: these little ones doom-loop a lot by default. Good data will always be the moat when training or fine-tuning small models, and the latest SOTA models like fable 5 and gpt 5.6 are making this increasingly easy to do.

2 replies

liked a dataset 7 days ago

argilla-warehouse/proofread-assistant

Viewer • Updated Oct 16, 2024 • 501k • 579 • 1

New activity in LiquidAI/LFM2.5-350M 11 days ago

Are you planning on the 700m one?

👀 1

#4 opened 4 months ago by
