Testing a Design System: Visual, Unit, and Accessibility Tests
A developer's guide to testing design systems with visual regression, unit tests, and accessibility checks — covering tools, patterns, and real code examples.
Why Testing a Design System Is Different from Testing an App
Honestly, most teams ship a component library and then wonder why it silently breaks three months later. The answer is almost always the same: no automated tests, or tests written at the wrong layer.
Testing an application and testing a design system are genuinely different problems. In an app, you test behavior — does the user flow work? In a design system, you're testing *contracts*. Does this button still render with an 8px gap on all sides? Does the color token resolve correctly in dark mode? Does the aria-label propagate when passed as a prop?
You'll need three distinct test layers to cover it properly: unit tests for logic and prop contracts, visual regression tests for pixel-level stability, and accessibility tests for WCAG compliance. Drop any one of these and you've got a gap you'll regret.
Unit Tests: Testing Component Contracts and Prop Logic
Unit tests on UI components shouldn't try to assert on CSS. That's a losing battle. Instead, focus on what you can actually describe as a contract: given these props, the rendered output has this structure and these attributes.
Here's a concrete example using Vitest and Testing Library for a Button component. The test isn't checking colors or spacing — it's verifying that the component passes down the right ARIA attributes and renders the correct element type based on the as prop.
// Assumes @testing-library/jest-dom matchers (toBeInTheDocument, toBeDisabled)
// are registered in your Vitest setup file.
import { render, screen } from '@testing-library/react';
import userEvent from '@testing-library/user-event';
import { describe, it, expect, vi } from 'vitest';
import { Button } from '../components/Button';

describe('Button', () => {
  it('renders as an anchor when `as="a"` is passed', () => {
    render(<Button as="a" href="/docs">Read docs</Button>);
    expect(screen.getByRole('link', { name: 'Read docs' })).toBeInTheDocument();
  });

  it('calls onClick and does not submit the form when type="button"', async () => {
    const handler = vi.fn();
    const onSubmit = vi.fn();
    // Wrap in a form so the "does not submit" half of the contract is actually asserted.
    render(
      <form onSubmit={onSubmit}>
        <Button type="button" onClick={handler}>Click me</Button>
      </form>
    );
    await userEvent.click(screen.getByRole('button', { name: 'Click me' }));
    expect(handler).toHaveBeenCalledOnce();
    expect(onSubmit).not.toHaveBeenCalled();
  });

  it('is disabled and aria-disabled when disabled prop is set', () => {
    render(<Button disabled>Save</Button>);
    const btn = screen.getByRole('button', { name: 'Save' });
    expect(btn).toBeDisabled();
    expect(btn).toHaveAttribute('aria-disabled', 'true');
  });
});

Keep unit tests fast and isolated. No network calls, no real DOM measurements. If a test needs a 200ms setTimeout to pass, you're probably testing the wrong thing at this layer.
Visual Regression Testing: Catching Pixel Drift Before Production
Visual regression is where most design systems actually fail quietly. You update Tailwind v4.0.2 to v4.1.0, a utility changes its computed value by 1px, and suddenly your Card component's inner padding is 11px instead of 12px. Nobody notices until a designer opens Figma next to the live site.
Playwright with toHaveScreenshot() is a solid choice for local and CI visual snapshots. Chromatic (built on Storybook) is the more turnkey option if you're already using Storybook for component documentation. The tradeoff is cost vs. setup time — Chromatic charges by snapshot, Playwright snapshots are free but need more infra to store baselines.
The key discipline is keeping your visual test stories deterministic. No animations, no random data, no Date.now() calls. If your component has a transition, disable it in the test environment, either by emulating prefers-reduced-motion: reduce (which only helps if your CSS actually honors that media query) or with a CSS override like *, *::before, *::after { transition: none !important; animation: none !important; }.
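As a rough sketch of that setup, a Playwright config along these lines pins the viewport, locale, and timezone and turns off animations for every screenshot assertion. The specific values are assumptions chosen for illustration, and a zero-pixel tolerance may be too strict for some rendering environments.

```typescript
// playwright.config.ts (a minimal sketch; all values are illustrative assumptions)
import { defineConfig } from "@playwright/test";

export default defineConfig({
  use: {
    // Fixed viewport, locale, and timezone so layout and formatted dates do not vary by machine.
    viewport: { width: 1280, height: 800 },
    locale: "en-US",
    timezoneId: "UTC",
    // Emulates prefers-reduced-motion: reduce; only effective if component CSS respects it.
    contextOptions: { reducedMotion: "reduce" },
  },
  expect: {
    toHaveScreenshot: {
      // Stops CSS animations and transitions before the screenshot is taken.
      animations: "disabled",
      // Tolerance for differing pixels; raise it if anti-aliasing causes noise.
      maxDiffPixels: 0,
    },
  },
});
```

Random data and Date.now() calls still have to be removed or mocked in the stories themselves; config alone won't make those deterministic.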
Run visual tests on every PR, not just main branch merges. The earlier you catch a drift, the cheaper it is to fix. A 2px regression found in PR review takes 30 seconds to address. Found three weeks later in production, it takes a release cycle.
Accessibility Testing: Automated and Manual
Automated accessibility tools catch only a fraction of WCAG issues; roughly 30% is a commonly quoted figure. That's not nothing, but it means you can't skip manual checks. Think of axe-core (via @axe-core/react or vitest-axe) as your first filter, not your full coverage.
Here's the pattern we use to add axe checks to existing Vitest + Testing Library tests. Move the expect.extend call into a shared setup file and there's no extra setup per test file:

import { render } from '@testing-library/react';
import { axe } from 'vitest-axe';
// vitest-axe ships its matchers under a separate entry point.
import * as matchers from 'vitest-axe/matchers';
import { it, expect } from 'vitest';
import { Modal } from '../components/Modal';

// Registers toHaveNoViolations on Vitest's expect.
expect.extend(matchers);

it('Modal has no axe violations when open', async () => {
  const { container } = render(
    <Modal isOpen title="Confirm action">
      <p>Are you sure you want to delete this item?</p>
    </Modal>
  );
  const results = await axe(container);
  expect(results).toHaveNoViolations();
});

Beyond automated checks, you'll want a manual keyboard navigation pass for every interactive component. Tab order, focus rings (never outline: none without a replacement), and Escape key behavior on modals and dropdowns are things axe won't catch. If you're building to WCAG 2.2 AA standards, check out the full WCAG accessibility guide for the complete criterion list.
One thing that trips up teams: contrast ratios in glassmorphic or semi-transparent components. A token like rgba(255,255,255,0.15) has no fixed contrast ratio of its own, because the effective color depends on what's rendered behind it, and contrast checkers often can't resolve that. Test these components against both light and dark backgrounds, not just the default.
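To make the backdrop dependence concrete, here is a minimal sketch that composites a semi-transparent surface over two backdrops and computes the WCAG contrast of white text on the result. The backdrop colors, the white text, and all function names are assumptions for illustration; real components may stack several layers or add blur, which this ignores.

```typescript
type Rgb = { r: number; g: number; b: number };
type Rgba = Rgb & { a: number };

// Source-over compositing of a semi-transparent color onto an opaque backdrop.
function composite(fg: Rgba, bg: Rgb): Rgb {
  const mix = (f: number, b: number): number => Math.round(f * fg.a + b * (1 - fg.a));
  return { r: mix(fg.r, bg.r), g: mix(fg.g, bg.g), b: mix(fg.b, bg.b) };
}

// WCAG 2.x relative luminance of an opaque sRGB color.
function luminance({ r, g, b }: Rgb): number {
  const lin = (v: number): number => {
    const c = v / 255;
    return c <= 0.03928 ? c / 12.92 : Math.pow((c + 0.055) / 1.055, 2.4);
  };
  return 0.2126 * lin(r) + 0.7152 * lin(g) + 0.0722 * lin(b);
}

// WCAG contrast ratio between two opaque colors (always >= 1).
function contrast(a: Rgb, b: Rgb): number {
  const la = luminance(a);
  const lb = luminance(b);
  return (Math.max(la, lb) + 0.05) / (Math.min(la, lb) + 0.05);
}

const GLASS: Rgba = { r: 255, g: 255, b: 255, a: 0.15 }; // rgba(255,255,255,0.15)
const WHITE_TEXT: Rgb = { r: 255, g: 255, b: 255 };      // assumed text color
const LIGHT_BACKDROP: Rgb = { r: 255, g: 255, b: 255 };  // assumed light page background
const DARK_BACKDROP: Rgb = { r: 17, g: 24, b: 39 };      // assumed dark page background

// Same token, very different results depending on the backdrop.
console.log("on light:", contrast(WHITE_TEXT, composite(GLASS, LIGHT_BACKDROP)));
console.log("on dark:", contrast(WHITE_TEXT, composite(GLASS, DARK_BACKDROP)));
```

The same surface token passes comfortably on the dark backdrop and fails completely on the light one, which is why a single default-background check isn't enough.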
Testing Color Tokens and Spacing Systems
Color tokens and spacing scales are the foundation of a design system, and they're surprisingly easy to test if you treat them as data. Your token file is the source of truth, so write a test that imports it and asserts invariants.
What kind of invariants? That every semantic token (color-primary, color-error, etc.) resolves to a value that meets a 4.5:1 contrast ratio against your surface tokens. That spacing steps follow a consistent scale — if your base unit is 4px, then space-3 should be 12px, not 11px. That no two distinct tokens share the same hex value (which usually indicates a naming collision in your color system design).
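The contrast invariant can be sketched in plain TypeScript using the WCAG 2.x relative luminance formula. The token names and hex values below are hypothetical stand-ins for your real token file, and the helper names are made up for this example.

```typescript
// Hypothetical token map; in a real test, import your actual token file instead.
const tokens: Record<string, string> = {
  "surface-default": "#ffffff",
  "color-primary": "#1d4ed8",
  "color-error": "#b91c1c",
  "color-muted": "#9ca3af",
};

// Convert one 8-bit sRGB channel to linear light (WCAG 2.x formula).
function linearize(channel: number): number {
  const c = channel / 255;
  return c <= 0.03928 ? c / 12.92 : Math.pow((c + 0.055) / 1.055, 2.4);
}

// Relative luminance of a 6-digit hex color such as "#1d4ed8".
function relativeLuminance(hex: string): number {
  const n = parseInt(hex.slice(1), 16);
  const r = (n >> 16) & 0xff;
  const g = (n >> 8) & 0xff;
  const b = n & 0xff;
  return 0.2126 * linearize(r) + 0.7152 * linearize(g) + 0.0722 * linearize(b);
}

// WCAG contrast ratio, from 1 (identical) to 21 (black on white).
function contrastRatio(a: string, b: string): number {
  const la = relativeLuminance(a);
  const lb = relativeLuminance(b);
  return (Math.max(la, lb) + 0.05) / (Math.min(la, lb) + 0.05);
}

// Names of tokens that fall below the minimum ratio against the given surface token.
function failingTokens(surface: string, names: string[], min = 4.5): string[] {
  return names.filter((name) => contrastRatio(tokens[name], tokens[surface]) < min);
}

const failures = failingTokens("surface-default", ["color-primary", "color-error", "color-muted"]);
console.log(failures); // ["color-muted"]
```

In a Vitest suite the last two lines would become an assertion that the failure list is empty, so a failing token is reported by name.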
These token validation tests run in milliseconds and catch the class of bugs that normally require a designer to notice. Write them once, run them on every commit.
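The spacing and duplicate-value invariants are just as small. This sketch assumes a hypothetical naming scheme where space-N means N times a 4px base unit, and a flat map of color tokens; adapt both to however your tokens are actually structured.

```typescript
// Hypothetical spacing scale; space-3 is deliberately off by one pixel.
const BASE_UNIT_PX = 4;
const spacing: Record<string, number> = {
  "space-1": 4,
  "space-2": 8,
  "space-3": 11,
  "space-4": 16,
};

// Tokens whose value is not (step number) * base unit.
function offScaleTokens(scale: Record<string, number>, base: number): string[] {
  return Object.entries(scale)
    .filter(([name, px]) => px !== Number(name.split("-")[1]) * base)
    .map(([name]) => name);
}

// Hypothetical color tokens; two distinct names resolve to the same value.
const colors: Record<string, string> = {
  "color-primary": "#2563EB",
  "color-info": "#2563eb",
  "color-error": "#dc2626",
};

// Groups of token names that share one hex value (compared case-insensitively).
function duplicateValues(map: Record<string, string>): string[][] {
  const byValue = new Map<string, string[]>();
  for (const [name, hex] of Object.entries(map)) {
    const key = hex.toLowerCase();
    byValue.set(key, [...(byValue.get(key) ?? []), name]);
  }
  return Array.from(byValue.values()).filter((names) => names.length > 1);
}

console.log(offScaleTokens(spacing, BASE_UNIT_PX)); // ["space-3"]
console.log(duplicateValues(colors)); // [["color-primary", "color-info"]]
```

If your system intentionally aliases semantic tokens to shared primitives, run the duplicate check on the primitive palette only, or keep an explicit allowlist of intended aliases.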
Integration Tests: Composition and Theming
Individual component tests don't catch composition bugs. A Button inside a Card inside a Modal can have z-index, overflow, or stacking context issues that only appear at the composed level. Write at least a few integration tests that render common compositions.
Theming is the other thing to integration-test explicitly. If your system supports a theme toggle between light and dark mode, test that switching the theme actually updates CSS custom properties on the right element. Don't assume it works just because the toggle component renders — verify the cascade.
import { render, screen } from '@testing-library/react';
import userEvent from '@testing-library/user-event';
import { it, expect } from 'vitest';
import { ThemeProvider } from '../ThemeProvider';
import { ThemeToggle } from '../components/ThemeToggle';

it('applies dark class to document root when toggled', async () => {
  render(
    <ThemeProvider defaultTheme="light">
      <ThemeToggle />
    </ThemeProvider>
  );
  // starts light
  expect(document.documentElement.classList.contains('dark')).toBe(false);
  await userEvent.click(screen.getByRole('switch', { name: /toggle theme/i }));
  expect(document.documentElement.classList.contains('dark')).toBe(true);
});

Does your icon system render the right SVG when you pass name="chevron-right"? Write a test. Does your icon system swap correctly between filled and outlined variants? Write a test. These take five minutes each and save hours of debugging.
Setting Up a Testing Pipeline for Your Component Library
Here's how a sensible CI pipeline looks for a design system monorepo. Unit and axe tests run on every push — they're fast (under 30s typically). Visual regression tests run on every PR against a stored baseline, with human review required to accept changes. A11y manual checklists are attached to any PR that touches interactive components.
For the visual test baseline, store snapshots in git using Git LFS or in a dedicated artifact store. Don't regenerate baselines automatically on CI failures — that defeats the point. A failed visual test means something *changed*, and you need a human to decide whether that change was intentional.
One more thing: run your tests against the built output, not just the source. If you're shipping a compiled package with Rollup or tsup, the built .js and .d.ts files are what consumers actually import. A test that only runs against src/ won't catch a build step that silently drops a prop type or mangles a class name.
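One cheap check against built output is asserting that the compiled entry point still exposes every name you promise consumers. The sketch below uses a stand-in object so it runs on its own; the export list and the dist path mentioned in the comment are assumptions, not a description of any particular build setup.

```typescript
// Names the package promises to export; hypothetical list for illustration.
const EXPECTED_EXPORTS = ["Button", "Modal", "ThemeProvider"] as const;

// Returns the expected names that are absent (or undefined) in a module namespace.
function missingExports(mod: Record<string, unknown>, expected: readonly string[]): string[] {
  return expected.filter((name) => !(name in mod) || mod[name] === undefined);
}

// Stand-in for a build that silently dropped an export. In a real test you would
// load the built entry point instead, e.g. `await import("../dist/index.js")`
// (that path is an assumption about your build layout).
const fakeBuild = { Button: () => null, Modal: () => null };

console.log(missingExports(fakeBuild, EXPECTED_EXPORTS)); // ["ThemeProvider"]
```

This only covers runtime exports. Dropped prop types in the .d.ts output need a separate type-level check, such as compiling a small consumer file against the built declarations.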
Common Mistakes and How to Avoid Them
Testing implementation details is the most common mistake. If your test imports internal functions, checks for specific class names like bg-blue-500, or reaches into component state with instance(), it'll break every time you refactor — even when behavior is identical. Test what users and consumers of your system actually see and interact with.
The second big one: writing tests after the fact in a rush before a release. You'll skip edge cases, miss states (loading, error, empty), and end up with tests that only cover the happy path. If you can't do full TDD, at least write tests immediately after you finish a component while the edge cases are still fresh.
And what about snapshot testing with toMatchSnapshot()? Use it sparingly. Inline snapshots for small outputs are fine. But a 200-line JSX snapshot is just noise — developers approve snapshot diffs without reading them. Visual regression tools do the same job better for UI output.
FAQ
Should I use Chromatic or Playwright for visual regression testing?
Depends on your setup. Chromatic is faster to adopt if you already use Storybook, since it handles diffing and review UI out of the box. Playwright's toHaveScreenshot() is free and works without Storybook, but you'll manage baseline storage yourself. For open-source libraries, Playwright + GitHub Actions artifact storage is usually the practical choice.
Can I test CSS custom properties and design tokens in jsdom?
jsdom (the DOM environment commonly used with Vitest and Jest) has limited CSS support and doesn't compute custom properties the way a real browser does. For token resolution tests, use Playwright or a headless browser. For structural tests (does the component apply the right CSS variable names as inline styles or data attributes?), jsdom is fine.
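A structural check of that kind doesn't need a browser at all. As a minimal sketch, the helper below pulls every var(--name) reference out of a style string and flags names that aren't in the token set. The token names are hypothetical, and the style string is a literal here; in a test it would come from whatever attribute your component writes.

```typescript
// Hypothetical set of custom properties your token build is known to define.
const definedTokens = new Set(["--color-primary", "--space-3", "--radius-md"]);

// Collect every custom property referenced through var(...) in a style string.
function referencedVars(style: string): string[] {
  const names: string[] = [];
  const pattern = /var\(\s*(--[\w-]+)/g;
  let match: RegExpExecArray | null;
  while ((match = pattern.exec(style)) !== null) {
    names.push(match[1]);
  }
  return names;
}

// References that do not correspond to any defined token (typos, stale names).
function unknownVars(style: string): string[] {
  return referencedVars(style).filter((name) => !definedTokens.has(name));
}

const style = "padding: var(--space-3); color: var(--color-primry, #000)";
console.log(unknownVars(style)); // ["--color-primry"]
```

This verifies naming only. Whether --space-3 actually resolves to 12px still has to be checked in a real browser.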
Can axe-core catch contrast issues in semi-transparent components?
Not reliably. axe-core checks contrast based on computed color values, but it can't always resolve what's visually behind a semi-transparent element like rgba(255,255,255,0.15). You need to verify contrast manually or with a browser devtools contrast checker for glassmorphic and layered components.
Should my design system target WCAG AA or AAA?
AA is the standard target and the one most legal requirements reference. AAA is worth pursuing for text contrast (7:1 ratio instead of 4.5:1) if your system is used in government, healthcare, or accessibility-sensitive contexts. Build to AA first, then add AAA where it's achievable without sacrificing design quality.
How do I keep visual regression tests from being flaky?
Block external font requests in your Playwright config and serve fonts locally in tests. Disable all transitions and animations with a global CSS rule in your test setup: *, *::before, *::after { transition: none !important; animation-duration: 0s !important; }. Also set a fixed viewport size; 1280x800 is a common baseline.
Can visual regression tests replace unit tests?
No. Visual tests tell you the output changed, but not why. Unit tests verify prop contracts, ARIA attribute correctness, and behavior logic that isn't visible in a screenshot. You need both layers because they catch different categories of bugs.