Blogroll
Raspberry Pi's latest computer is an answer to the RAM pricing crisis
Raspberry Pi has already hiked prices on its single board computers in December and this February in response to AI-related RAM price hikes, but now it's trying something new: a configuration built with the memory crisis in mind. The company has introduced a Raspberry Pi 4 model with 3GB of RAM for just under $84. No, it's not an April Fool's joke.
Spotify tanked my productivity, but this music app helped me regain focus
Music can be an incredibly powerful focus tool for many people, and it can be a good catalyst to improve productivity. With the advent of work and study playlists, apps like Spotify, YouTube, and Apple Music are flooded with focus music that aims to strip away distractions.
Move over, Game of Thrones—HBO has finally found your replacement
HBO achieved enormous success with Game of Thrones, a fantasy series based on the books of George R.R. Martin. Between its high viewership and significant cultural impact, Game of Thrones became a once-in-a-generation show. While GoT's spin-offs have been successful, they have failed to reach the unfathomable heights of their predecessor. HBO is clearly looking for its next Game of Thrones. Could that next landmark show be Harry Potter?
2027 VW Atlas gains ChatGPT and horsepower, but rivals win on maintenance costs
Volkswagen pulled the silk off the 2027 Atlas in the Big Apple during the 2026 New York International Auto Show, and it’s clear the brand has upscaled its three-row SUV, especially the interior. Among the highlights is an AI-powered voice assistant with ChatGPT that responds to driver prompts, whether that means adjusting the climate control or finding a nearby coffee shop or restaurant.
Apple boots vibe coding app Anything from App Store
Apple brought the ban hammer down on an AI-powered iOS app.
The Information reported that Apple pulled an app called "Anything" from the App Store. For the unfamiliar, Anything is/was an app based around using "vibe coding," or the act of using natural language AI prompts to generate apps, often by people with no formal coding experience.
Apple has been either pulling vibe coding apps or blocking them from releasing updates since March, according to The Information, with other apps like Vibecode and Replit becoming victims.
In case you're wondering why Apple might take a hard line against vibe coding apps, it's not just based on vibes. The company told MacRumors that while there isn't a precise rule against vibe coding, these apps do violate App Store Guideline 2.5.2, which states:
Apps should be self-contained in their bundles, and may not read or write data outside the designated container area, nor may they download, install, or execute code which introduces or changes features or functionality of the app, including other apps. Educational apps designed to teach, develop, or allow students to test executable code may, in limited circumstances, download code provided that such code is not used for other purposes. Such apps must make the source code provided by the app completely viewable and editable by the user.
So, while there isn't exactly a rule against vibe coding apps, that guideline, as currently written, would make it pretty hard for any of them to exist on the App Store.
App developers have also reported delays in app store approvals this year, with some blaming vibe coding apps for creating a bottleneck. On Apple's end, fewer vibe-coded apps means fewer submissions to review.
But that also means you might need to learn how to code for real if you want to make an iOS app, so not everyone is a winner here.
Instagram reportedly deletes Bellesa sex toy shop account for using the word clitoris
The sex toy shop Bellesa Boutique said today that Instagram "permanently deleted" its account for using the word "clitoris."
Bellesa Boutique offers sex toys for any gender, from vibrators to cuffs. (Bellesa also has a sister site, hosting pornography marketed towards women.)
"Bellesa Boutique (@bellesaco) was just banned from Instagram," the shop posted from a new account, @bellesacensored, on March 31. The original Bellesa Instagram account had 700,000 followers and hosted over a decade's worth of content, the caption stated.
In a statement posted to X, the company provided this explanation: "Our violation? Using the word 'clitoris.'"
In email screenshots Bellesa shared with Mashable, Meta stated that the account was disabled for "violating Meta's Community Standards due to sexually explicit language in organic content." (Organic means that Bellesa shared the content on its account rather than in an advertisement.)
The email goes on:
Examples of content that is not allowed include sexually explicit language that uses explicit or graphic detail about:
Genitals
States of sexual arousal (e.g., wetness or erection)
Sexual encounters
This language is lifted from Meta's Community Standards concerning Adult Sexual Solicitation and Sexually Explicit Language, which state that the above language isn't allowed, though the restriction does not cover "content shared in a humorous, satirical context or as sexual cursing."
Cofounder and CEO of Bellesa, Michelle Shnaidman, told Mashable that there wasn't any opportunity to appeal or review specific content before deletion, nor was the company given any warning. The Bellesa team was locked out of the account entirely on Saturday morning, and they were given a notice through the app.
The email also states that the company reviewed the case and determined the account violated its guidelines and can't be re-enabled.
"For over a decade, hundreds of thousands of people came to the @bellesaco community to learn about and celebrate their own bodies — a safe, shame-free space to discuss sexual wellness and pleasure. Instagram deleted it for 'sexually explicit language,' meaning discussing women's bodies in a health context is treated as inherently unacceptable," Shnaidman told Mashable over email.
The account deletion happened days after Meta was found guilty (along with YouTube) of negligent platform design that resulted in the harm of a young person's mental health. (Meta has said it will appeal the verdict.)
"Four days after losing a $375M lawsuit in court, Meta needed to look tough," the @bellesacensored post continues. "Instead of fixing what got them sued, they banned a women's sexual health community."
Shnaidman stated similarly that Bellesa wasn't the problem Meta was sued over. "But we're easier to ban than the content that actually got them into court," she said.
This isn't the first time Meta deleted content from a sex toy shop. In 2023, Meta reportedly rejected ads from another sex toy shop, Unbound, until it marketed to men. That same year, Meta seemed to reject period care ads for being "adult" or "political." (Meanwhile, explicit AI girlfriends were OK to advertise on Meta in 2024.) And for years, sex workers as well as LGBTQ content creators have told Mashable that they, too, have been banned or shadowbanned from Instagram.
In a 2025 study on the suppression of sexual and reproductive health on major platforms like Meta, the Center for Intimacy Justice found that of the groups studied, 63 percent had organic content removed from Meta platforms, and 84 percent of businesses and 76 percent of nonprofits had ads rejected by Meta.
The nonprofit Repro Uncensored, which monitors and tracks censorship, documented a wave of increased censorship in Nov. and Dec. 2025, its executive director, Martha Dimitratou, told Mashable.
Even in the last few days, they've seen a new wave of accounts taken down by Meta, including LGBTQ accounts and even accounts for nightclubs. Dimitratou couldn't pinpoint exactly why this is happening right now, but it could be a mix of AI content moderation, people reporting these accounts, or a big political or legal event — like the Meta trial.
Bellesa's Facebook account remains up, along with some Reels, though it has around 40,000 followers compared to Instagram's 700,000.
Mashable has reached out to Meta for comment.
"The ability to discuss sexual health online is how an entire generation of women learned what endometriosis is, what a cervical exam involves, that their experiences are normal," Shnaidman said. "Take that away and you're not protecting anyone — you're pushing these convos back into the dark."
Samsung DeX changed how I buy phones: USB ports and processors matter way more than you think
I use a Samsung Galaxy phone plugged into a monitor with DeX mode as my primary computer. This means when I shop for a new phone, I'm also shopping for a PC. The details I pay attention to are different from most. If you're interested in this same setup, here's what to look out for.
Your Google smart speakers can now understand more commands
Smart homes are supposed to make life easier, but you might know that’s not always the case if you’ve yelled at a speaker that misunderstood a simple command. Following last month’s major update to Gemini for Home, Google is rolling out Google Home v4.2, which focuses on fixing these small but frustrating experiences and adds some improvements to smart home controls.
10 tools every homelabber should try at least once
Are you looking for fun (or unique) pieces of software to expand your homelab with? I’ve been on the hunt for new software lately, and found 10 tools that everyone should try at least once. In no particular order, here are tools that have (or will) change how I run my homelab.
How to watch the Artemis II launch, the first trip to the Moon in 53 years
NASA is poised to return to the Moon over 53 years after Apollo 17, and this time you don't need a TV to follow along. The Artemis II mission is scheduled to launch from the Kennedy Space Center on April 1st as soon as 6:24PM ET—here's how to watch as astronauts make history.
AirTags are the best Home Assistant accessory you've overlooked—here's 5 ways I'm using them
You've likely been sitting on automation potential right in your pocket or attached to your keys. While Apple AirTags started out as a way to find lost luggage or misplaced wallets, they've quietly evolved into the coolest and most versatile Home Assistant accessory out there right now. They report precise locations, and that data is easy to put to work in a smart home. Location tracking doesn't have to be as simple as we make it. Using just a cheap tag, you can add more features to your home without much effort.
You can now change your Gmail username. Here’s how to do it.
Google first unveiled Gmail to the public on April 1, 2004. Now, 22 years later, Google is finally letting some Gmail users change their account's username while retaining everything else in their account.
The ability for Gmail users to change their username was first teased by Google late last year. And now, as of Tuesday, every Gmail user in the U.S. can officially change their username (the part that comes before the "@gmail.com") to whatever they want, as long as the new username is available.
Aside from the username, everything else with the account remains the same. All emails and files associated with the old username will continue to exist in the account for the new username.
What happens to your old Gmail address? Google says it will retain that username for the user so that emails sent to the old address continue to arrive in the new username's inbox.
So, are you ready to change that Gmail username you created while you were still in high school? Here's how to do it.
How to change your Gmail username
If you're a Gmail user in the U.S., the option to change your Gmail username while retaining the same account is now open to you.
To change your username, simply go to your Settings while signed into your Google account. Next, go to Personal info, followed by Email, and then Google Account email.
Eligible accounts will then see a button labeled "Change Google Account email" on this page. Tap that button and then pick a new username.
Please note, Gmail users can only change their username once every 12 months. So, once you pick a new username, you're stuck with it for at least a year. But that might sound like a pretty short timeframe if you were one of the unfortunate users stuck with your previous Gmail username for 22 years.
This Subaru SUV hits 60 mph in under 5 seconds—and seats seven
Subaru just pulled the wraps off its newest SUV, and it’s a pretty big deal for the brand. The all-electric 2027 Getaway is its first three-row model—and also its most powerful yet.
The best free Lego deals this week: How to claim a cute Easter Bunny and Star Wars set for free
It's not easy to come by something for nothing in 2026, let alone something good. But Lego offers free deals fairly regularly, with purchase, of course. This week's free deals include something perfect for Easter and a fun build for Star Wars fans. Both of these free offerings expire on April 5, so hop to it.
Best holiday offering
Lego Cute Easter Bunny: $0 at Lego (regularly $4.99; save $4.99 with $40+ purchase)
Why we like it
Lego named this build "Cute Easter Bunny" and we agree on its cuteness. It's a mini-build designed for those ages 6 and up. It includes 66 pieces to build the bunny, carrot, and three colorful Easter eggs. Plus, the heart-shaped nose could not be cuter. The Lego Cute Easter Bunny build comes free with online purchase of $40 or more at Lego.
Best Star Wars offering
Lego Star Wars Kamino Training Facility: $0 at Lego (regularly $29.99; save $29.99 with $160+ purchase)
Why we like it
Even the best need training. The Lego Star Wars Kamino Training Facility is free with purchase of $160 or more online at Lego through April 5. It's a 190-piece build that includes three Clone Cadets from Star Wars: Attack of the Clones. There's also the KE-8 Enforcer floating patrol vehicle with its cockpit that opens. The Star Wars build is about 10 inches tall and 3.5 inches wide.
5 PC-building facts that sound like complete nonsense
Building PCs is hardly anything new, but it's definitely an enthusiast thing, and that can create a lot of myths and misconceptions. You've probably heard myths, such as that liquid coolers are dangerous and can flood your entire PC, or that SSDs are less reliable than HDDs.
8 new shows and movies streaming on HBO Max in April
I don’t know about you guys, but I’ll be spending a lot of time on HBO Max in April after seeing all the new shows and movies streaming this month.
Kia's compact EV3 electric SUV comes to the US with 320 miles of range
Kia has introduced the US version of its EV3 crossover, and it's poised to deliver strong range and charging capabilities for a small electric SUV.
20+ lingering Amazon Spring Sale deals I'm sending to the group chat
I've been writing for Mashable for four years, and in that time, I've tested hundreds of products — including tons of vibrators. While my niche is sexual health and wellness, my actual everyday life consists of wrangling two unruly beagles, working from literally wherever, and trying to get my hair done in under 10 minutes.
So, when my editors asked me to round up the best Amazon Big Spring Sale deals that are miraculously still live today, I realized my version of "essential tech" looks a little different than the gadgets Mashable's Tech Editor, Timothy Beck Werth, already covered.
From my favorite steam mop to the blowout kit I use every single day, here are the lingering deals I'm telling my friends to buy before they disappear for good.
Shop a steam mop that'll keep your floors spotless sans chemicals
Bissell PowerFresh Lift-Off Pet Steam Mop: $148.93 at Amazon (regularly $159.99; save $11.06)
As you know, I have two beagles. And even though I love them to bits, they can be messy, unruly, and demanding — especially if I'm not giving them enough attention (which usually leads to an on-purpose accident to get it). That's why I love this thing.
The Bissell PowerFresh Lift-Off Pet Steam Mop is a literal lifesaver when they've gotten up in the middle of the night and left me a surprise glued to the floor by morning. Of course, I take them out, and they have puppy pads in the house, but sometimes real accidents happen.
This steam mop is unlike other wet mops I've tried (and hated because they stink even after I clean them); it's super easy to use, cleanup is quick and easy (I just throw the washable mop pad in the washer), and it doesn't use harsh chemicals or leave a nasty, wet dog smell in my apartment like most wet mops do. It relies on steam and heat to get the job done, and it works! It also features a removable Lift-Off pod and comes with 13 attachments, so you can clean other surfaces like baseboards and the shower, too.
Get the Bissell PowerFresh Lift-Off Pet Steam Mop for $148.93, down from $159.99 (save $11.06) at Amazon.
More floor care deals
Shark Steam & Scrub Mop — $119.99 $159.99 (save $40)
Roborock Q7 M5 — $149.99 $239.99 (save $90)
Shark Pet IX141 — $149 $299.99 (save $150.99)
Roborock Q7 M5+ — $249.99 $329.99 (save $80)
Roborock Q10 S5+ — $279.99 $549.99 (save $270)
Mova P10 Pro Ultra — $399 $499 (save $100)
Roborock Qrevo Series — $399.99 $649.99 (save $250)
Eufy E28 with portable carpet cleaner — $649.99 $999.99 (save $350)
Bissell Little Green Pet Deluxe Portable Carpet Cleaner: $139.99 at Amazon (regularly $159.99; save $20)
Bissell is one of my favorite brands for pet parents because it's affordable and effective. Whenever I need to spot-clean something, I use the Bissell Little Green Pet Deluxe Portable Carpet Cleaner. It's loud, but it works, and this deluxe version comes with a three-inch Tough Stain Tool, a Stain Trapper Tool, and trial-size formulas to get you started.
Get the Bissell Little Green Pet Deluxe Portable Carpet Cleaner for $139.99, down from $159.99 (save $20) at Amazon.
Give your WFW (work from wherever) setup an upgrade with a new-ish MacBook Pro
MacBook Pro (M5, 24GB RAM, 1TB SSD): $1,799 at Amazon (regularly $1,899; save $100)
Whenever I'm not on my iMac, I work on my 14-inch MacBook Pro (M4, 16GB RAM, 512GB SSD). The M4 is a 2024 laptop, but it still works just fine even after putting it through hell day after day (I'm sure my log time is in the hundreds of thousands). It's not on sale right now, but the 2025 model is. You can still get $100 off the 14.2-inch MacBook Pro (M5, 24GB RAM, 1TB SSD) at Amazon. (It's not a shocking deal, but $100 is $100, right?)
Get the MacBook Pro (M5, 24GB RAM, 1TB SSD) for $1,799, down from $1,899 (save $100) at Amazon.
More Apple deals
Apple AirPods 4 — $114.95 $129 (save $14.05)
Apple AirPods 4 (with ANC) — $155.99 $179 (save $23.01)
Apple iPad, 11-inch (A16, WiFi, 128GB) — $299 $349 (save $50)
Apple AirPods Max (2nd Gen) — $529 $549 (save $20)
Apple iPad Air, 11-inch (M4, WiFi, 256GB) — $649.99 $699 (save $50)
Apple MacBook Air, 15-inch (M4, 16GB RAM, 256GB SSD) — $999.97 $1,199 (save $199.03)
Apple MacBook Air, 13-inch (M5, 16GB RAM, 512GB SSD) — $1,049 $1,099 (save $50)
Apple MacBook Air, 13-inch (M5, 24GB RAM, 1TB SSD) — $1,449.99 $1,499 (save $49.01)
Apple MacBook Air, 15-inch (M5, 24GB RAM, 1TB SSD) — $1,649 $1,699 (save $50)
Drybar "You Had Me at Blowout" Kit: regularly $155 at Amazon (save $44.56)
If there's one personal item I have to take with me anytime I'm staying overnight, it'd be the Drybar "You Had Me at Blowout" Kit. I got this gift set for Christmas last year, and I literally use it every single day.
This little device is super handy for anyone who hates doing their hair or just doesn't want to spend hours in the bathroom getting ready (my sister takes forever to put her makeup on). With this tool, you can style your hair in minutes and be on your way to wherever way before your friends realize that you're not actually on the train yet.
It comes with the Double Shot blow dryer brush, plus Detox Dry Shampoo, Prep Rally Detangler, Triple Sec texturizer, and Final Call. It claims to give you extra volume that lasts for eight hours, and honestly, it delivers.
Get the Drybar "You Had Me at Blowout" Kit for $116.25, down from $155 (save $38.75) at Amazon.
More beauty tech deals
CHI Lava Ceramic 1-inch Flat Iron Hair Straightener — $59.98 $107.99 (save $48.01)
BaByliss Pro Porcelain Ceramic Carrera Hair Dryer — $80.49 $114.99 (save $34.50)
Dreame Pocket Hair Dryer — $89.99 $129.99 (save $40)
Shark FlexStyle Air Styling & Drying System — $219.99 $379.99 (save $160)
Dyson Supersonic Nural — $449.99 $549.99 (save $100)
Dyson Airwrap i.d. — $541.52 $649.99 (save $108.47)
philosophy hope in a jar glow water cream face moisturizer: $38.98 (regularly $52; save $13.02)
OK, so hydration is super important, and so is staying moisturized because glowing skin is happy skin. I cannot recommend this moisturizer enough. I put this on my face every time I get out of the shower. (I also love, love, love philosophy's face mask.) The formula includes hyaluronic acid and pineapple extract, which provide up to 72 hours of hydration. It's super lightweight, gives you a gorgeous natural glow, and works as a great primer if you wear foundation (I don't, and it still makes my skin dewy).
But here's the real tea: This stuff is never (I repeat, never) on sale, so you should definitely grab it while it is!
Get philosophy's hope in a jar glow water cream face moisturizer for $38.98, down from $52 (save $13.02) at Amazon.
Keep your stress levels low with my all-time favorite wand vibrator
LELO Smart Wand 2 (Medium): $138.58 at Amazon (regularly $169; save $30.42)
I couldn't write a deals roundup without including a vibrator — sexual wellness is an essential part of your overall routine. I have the large version of the LELO Smart Wand 2, and it was my very first favorite wand vibrator. (Until I met the VIM by Fun Factory, which was, unfortunately, discontinued.) Right now, the medium version is on sale, and it's the perfect size if you want something a bit more manageable to hold (or pack in a weekend bag).
It has a super-chic design, is 100% waterproof for the bath or shower, and offers 10 different vibration patterns. The charge also lasts practically forever, so you don't have to worry about it dying right when you need to release some tension.
Get the LELO Smart Wand 2 (Medium) for $138.58, down from $169 (save $30.42) at Amazon.
More sexual wellness deals
Tracy's Dog OG Dual-vibe — $29.99 $39.99 (save $10)
plusOne dual rabbit vibrator — $33.97 $39.99 (save $6.02)
Tracy's Dog Bumpa 3-in-1 anal vibrator — $44.99 $49.99 (save $5)
Tracy's Dog Passion Kit — $45.99 $49.99 (save $4)
ADeLe: Predicting and explaining AI performance across tasks
- AI benchmarks report performance on specific tasks but provide limited insight into underlying capabilities; ADeLe evaluates models by scoring both tasks and models across 18 core abilities, enabling direct comparison between task demands and model capabilities.
- Using these ability scores, the method predicts performance on new tasks with ~88% accuracy, including for models such as GPT-4o and Llama-3.1.
- It builds ability profiles and identifies where models are likely to succeed or fail, highlighting strengths and limitations across tasks.
- By linking outcomes to task demands, ADeLe explains differences in performance, showing how it changes as task complexity increases.
AI benchmarks report how large language models (LLMs) perform on specific tasks but provide little insight into the underlying capabilities that drive that performance. They do not explain failures or reliably predict outcomes on new tasks. To address this, Microsoft researchers in collaboration with Princeton University and Universitat Politècnica de València introduce ADeLe (AI Evaluation with Demand Levels), a method that characterizes both models and tasks using a broad set of capabilities, such as reasoning and domain knowledge, so performance on new tasks can be predicted and linked to specific strengths and weaknesses in a model.
In a paper published in Nature, “General Scales Unlock AI Evaluation with Explanatory and Predictive Power,” the team describes how ADeLe moves beyond aggregate benchmark scores. Rather than treating evaluation as a collection of isolated tests, it represents both benchmarks and LLMs using the same set of capability scores. These scores can then be used to estimate how a model will perform on tasks it has not encountered before. The research was supported by Microsoft’s Accelerating Foundation Models Research (AFMR) grant program.
ADeLe-based evaluation
ADeLe scores tasks across 18 core abilities, such as attention, reasoning, and domain knowledge, and assigns each task a value from 0 to 5 based on how much it requires each ability. For example, a basic arithmetic problem might score low on quantitative reasoning, but an Olympiad-level proof would score much higher.
Evaluating a model across many such tasks produces an ability profile—a structured view of where the model performs and where it breaks down. Comparing this profile to the demands of a new task makes it possible to identify the specific gaps that lead to failure. The process is illustrated in Figure 1.
Figure 1. Top: (1) Model performance on the ADeLe benchmark and (2) the resulting ability profiles, showing each model’s strengths and limitations across core abilities. Bottom: (1) Application of 18 scoring criteria to each task and (2) the resulting task profiles, showing the abilities each task requires.
Evaluating ADeLe
Using ADeLe, the team evaluated a range of AI benchmarks and model behaviors to understand what current evaluations capture and what they miss. The results show that many widely used benchmarks provide an incomplete and sometimes misleading picture of model capabilities and that a more structured approach can clarify those gaps and help predict how models will behave in new settings.
ADeLe shows that many benchmarks do not isolate the abilities they are intended to measure or only cover a limited range of difficulty levels. For example, a test designed to evaluate logical reasoning may also depend heavily on specialized knowledge or metacognition. Others focus on a narrow range of difficulty, omitting both simpler and more complex cases. By scoring tasks based on the abilities they require, ADeLe makes these mismatches visible and provides a way to diagnose existing benchmarks and design better ones.
Applying this framework to 15 LLMs, the team constructed ability profiles using 0–5 scores for each of 18 abilities. For each ability, the team measured how performance changes with task difficulty and used the difficulty level at which the model has a 50% chance of success as its ability score. Figure 2 illustrates these results as radial plots that show where the model performs well and where it breaks down.
Figure 2. Ability profiles for 15 LLMs across 18 abilities. Left: OpenAI models. Middle: Llama models. Right: DeepSeek-R1 distilled models.
This analysis shows that models differ in their strengths and weaknesses across abilities. Newer models generally outperform older ones, but not consistently across all abilities. Performance on knowledge-heavy tasks depends strongly on model size and training, while reasoning-oriented models show clear gains on tasks requiring logic, learning, abstraction, and social inference. These patterns typically require multiple, separate analyses across different benchmarks and can still produce conflicting conclusions when task demands are not carefully controlled. ADeLe surfaces them within a single framework.
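To make the 50%-success ability score described above more concrete, here is a minimal Python sketch. The post does not say how the team estimates the success curve, so the linear interpolation between per-level success rates used here is an assumption, and the function names and sample data are hypothetical.

```python
from collections import defaultdict


def success_rate_by_level(results):
    """Group (demand_level, succeeded) pairs and return the success rate per level."""
    totals = defaultdict(int)
    wins = defaultdict(int)
    for level, succeeded in results:
        totals[level] += 1
        wins[level] += int(succeeded)
    return {level: wins[level] / totals[level] for level in sorted(totals)}


def ability_score(results, threshold=0.5):
    """Estimate the demand level at which the success rate drops to `threshold`.

    Assumption: linear interpolation between observed levels. The real
    method may fit a smoother curve instead.
    """
    rates = success_rate_by_level(results)
    levels = list(rates)
    if rates[levels[0]] < threshold:
        # Already below the threshold on the easiest tasks observed.
        return float(levels[0])
    for lo, hi in zip(levels, levels[1:]):
        p_lo, p_hi = rates[lo], rates[hi]
        if p_lo >= threshold > p_hi:
            # First crossing of the threshold between two adjacent levels.
            return lo + (p_lo - threshold) / (p_lo - p_hi) * (hi - lo)
    # Success rate never falls below the threshold in the observed range.
    return float(levels[-1])


# Hypothetical results for one ability: ten tasks at each demand level 0-5.
sample = (
    [(0, True)] * 10
    + [(1, True)] * 10
    + [(2, True)] * 9 + [(2, False)] * 1
    + [(3, True)] * 6 + [(3, False)] * 4
    + [(4, True)] * 2 + [(4, False)] * 8
    + [(5, False)] * 10
)
print(round(ability_score(sample), 2))
```

With this made-up data, the success rate falls from 60% at level 3 to 20% at level 4, so the estimated score lands between those two levels. Repeating the calculation for each of the 18 abilities would give one model's profile.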
ADeLe also enables prediction. By comparing a model’s ability profile to the demands of a task, it can forecast whether the model will succeed, even on tasks that are unfamiliar. In experiments, this approach achieved approximately 88% accuracy for models like GPT-4o and Llama-3.1-405B, outperforming traditional methods. This makes it possible to both explain and anticipate potential failures before deployment, improving the reliability and predictability of AI model assessment.
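The idea of comparing an ability profile to a task's demands can be sketched in a few lines of Python. This is an illustration only: the post does not describe the actual predictor behind the reported accuracy, so the simple rule of predicting success only when every ability meets the demand is an assumption, and the ability names and scores below are hypothetical.

```python
def ability_gaps(profile, demands):
    """Return the abilities where a task's demand exceeds the model's score.

    Both arguments map ability names to values on the 0-5 scale.
    Abilities missing from the profile are treated as 0 (an assumption).
    """
    return {
        ability: demand - profile.get(ability, 0.0)
        for ability, demand in demands.items()
        if demand > profile.get(ability, 0.0)
    }


def predict_success(profile, demands):
    """Predict success only if no demand exceeds the matching ability score."""
    return not ability_gaps(profile, demands)


# Hypothetical ability profile for one model, covering three of the 18 abilities.
model_profile = {"reasoning": 3.2, "domain_knowledge": 4.1, "attention": 2.5}

# Hypothetical demand profiles for two tasks.
easy_task = {"reasoning": 2, "domain_knowledge": 3, "attention": 1}
hard_task = {"reasoning": 4, "domain_knowledge": 3, "attention": 2}

print(predict_success(model_profile, easy_task))
print(predict_success(model_profile, hard_task))
print(ability_gaps(model_profile, hard_task))
```

The gap dictionary is what makes a prediction explainable: for the harder task it names reasoning as the ability where demand outruns the model, rather than only reporting a likely failure.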
Whether AI systems can truly reason is a central debate in the field. Some studies report strong reasoning performance, while others show they break down at scale. These results reflect differences in task difficulty. ADeLe shows that benchmarks labeled as measuring “reasoning” vary in what they require, from basic problem-solving to tasks that combine the need for advanced logic, abstraction, and domain knowledge. The same model can score above 90% on lower-demand tests and below 15% on more demanding ones, reflecting differences in task requirements rather than a change in capability.
Reasoning-oriented models like OpenAI’s o1 and GPT-5 show measurable gains over standard models—not only in logic and mathematics but also with interpreting user intent. However, performance declines as task demands increase. AI systems can reason, but only up to a point, and ADeLe identifies where that point is for each model.
Looking ahead
ADeLe is designed to evolve alongside advances in AI and can be extended to multimodal and embodied AI systems. It also has the potential to serve as a standardized framework for AI research, policymaking, and security auditing.
More broadly, it advances a more systematic approach to AI evaluation—one that explains system behavior and predicts performance. This work builds on earlier efforts, including Microsoft research on applying psychometrics to AI evaluation and recent work on Societal AI, emphasizing the importance of AI evaluation.
As general-purpose AI systems continue to outpace existing evaluation methods, approaches like ADeLe offer a path toward more rigorous and transparent assessment in real-world use. The research team is working to expand this effort through a broader community. Additional experiments, benchmark annotations, and resources are available on GitHub.
The post ADeLe: Predicting and explaining AI performance across tasks appeared first on Microsoft Research.


