Is web scraping legal? The short answer: scraping publicly available, non-personal data is broadly legal in most jurisdictions — but the details decide everything. What you scrape (product prices vs. personal profiles), how you scrape it (public pages vs. behind a login), where you operate (the US’s CFAA logic vs. Europe’s GDPR), and what you do with the data (price monitoring vs. AI training) each move you between “defensible business practice” and “regulatory action.” This hub walks through the landmark cases, the 2025-2026 AI-era lawsuits, every major regulator’s position, and a practical framework — drawing on the legal sections we’ve fact-checked across 25+ country and platform guides.
One thing up front: this is general information, not legal advice. Before scaling a commercial scraping pipeline, consult counsel in your own jurisdiction and in those of your targets.
Key Facts
- Scraping public data is generally not “hacking” in the US. The Ninth Circuit’s 2022 ruling in hiQ Labs v. LinkedIn held that scraping publicly accessible pages does not violate the Computer Fraud and Abuse Act (CFAA) — there’s no “unauthorized access” where no credentials are required. But hiQ still lost on breach of contract: it had agreed to LinkedIn’s terms, and the case ended in a consent judgment.
- “Public” does not mean “free to take” for personal data. Europe’s regulators have made this explicit: the Netherlands’ AP fined Clearview AI €30.5 million, Italy’s Garante fined it €20 million, the UK’s ICO £7.5 million, and France’s CNIL fined data broker KASPR €240,000 — all for scraping personal data that was technically public.
- Logged-out vs. logged-in is the line that keeps winning. In Meta v. Bright Data (Jan 2024), a US court declined to block the scraping of public, logged-out Facebook and Instagram pages — while making clear that scraping behind a login, in breach of terms you’ve accepted, is a different matter.
- The AI era opened a new front: contract and licensing claims. Reddit sued Anthropic (2025; remanded to California state court in March 2026) and, separately, Perplexity together with three scraping vendors, including Oxylabs (Oct 2025, S.D.N.Y.). Bartz v. Anthropic (June 2025) found training on lawfully obtained books “exceedingly transformative” fair use, while carving out pirated sources. The legality of collection and the legality of training are now separate questions.
- Proxies themselves are legal. A proxy is standard network infrastructure — the same technology behind corporate gateways and CDNs. Legality depends on what you do through it, not on the tool.
- The defensible lane, everywhere: public, read-only collection of non-personal data (prices, listings, rankings, availability), at respectful request rates, honoring robots.txt, without bypassing logins or technical barriers, with no personal data in the pipeline.
The Short Answer, by Question
| Question | Short answer |
|---|---|
| Is scraping public product/price data legal? | Generally yes — the most defensible category everywhere |
| Is scraping personal data (names, profiles, contacts) legal? | High risk — GDPR regulators fine for it even when data is public |
| Is scraping behind a login legal? | Risky — you’ve accepted terms; contract claims survive even where CFAA fails |
| Is violating robots.txt illegal? | Not by itself: robots.txt is not a statute, but courts and regulators treat ignoring it as evidence of bad faith |
| Is using proxies legal? | Yes — the activity, not the tool, determines legality |
| Is scraping for AI training legal? | Unsettled — fair-use wins for lawfully obtained data, but contract/licensing suits are multiplying |
United States: the CFAA, hiQ, and the Contract Layer
The foundational US question was whether scraping public pages is “unauthorized access” under the Computer Fraud and Abuse Act — the anti-hacking statute. The Ninth Circuit answered in hiQ Labs v. LinkedIn (2022, applying the Supreme Court’s Van Buren logic): no. Where a page is open to anyone without credentials, there is no authorization gate to breach, so public-data scraping is not a CFAA violation.
But hiQ’s full story is the real lesson. After winning the CFAA battle, hiQ lost the war: in November 2022 the court found it had breached LinkedIn’s User Agreement (which it had accepted by creating accounts), and the case ended in a consent judgment — hiQ paid and agreed to a permanent injunction. The CFAA protects you from hacking claims on public pages; it does not protect you from contracts you’ve agreed to.
Meta v. Bright Data (January 2024) sharpened the line further: the court declined to find Bright Data in breach for scraping public, logged-out Facebook and Instagram pages, reasoning that the platforms’ terms govern account holders acting as such, not logged-out visitors collecting public content. The practical rule that emerged: logged-out public scraping is defensible; logged-in scraping against accepted terms is not.
Older trespass-to-chattels cases (eBay v. Bidder’s Edge, 2000) still matter at extreme volumes: if your crawl burdens a site’s servers, you re-open that door. And state-level privacy laws (California’s CCPA/CPRA, Virginia’s VCDPA, and others) add personal-data obligations on top.
The AI Era: 2025-2026 Lawsuits That Changed the Map
AI training created a second wave of scraping litigation — less about “was access authorized” and more about contracts, licensing, and copyright:
- Bartz v. Anthropic (June 2025, Judge Alsup): training an LLM on lawfully obtained books is “exceedingly transformative” fair use — but the ruling explicitly carved out pirated source copies. How you obtained the data matters as much as what you did with it.
- Kadrey v. Meta (June 2025, Judge Chhabria): plaintiffs lost on the record they brought, but the opinion warned this was not broad permission for AI training — a narrow win, not a doctrine.
- Reddit v. Anthropic (filed June 2025): notable because Reddit sued over terms-of-service and unjust-enrichment theories, not copyright. In late March 2026 a federal judge remanded the case to California state court, holding Reddit’s claims are not preempted by the Copyright Act — confirming that contract-based scraping claims have independent life.
- Reddit v. Perplexity, SerpApi, Oxylabs & AWMProxy (Oct 2025, S.D.N.Y.): the first case to name scraping infrastructure vendors as co-defendants alongside the AI company. As of spring 2026 it sits at the motion-to-dismiss stage. For proxy buyers, the takeaway is about supply chains: who collects your data, and how, is now part of your legal exposure.
- NYT v. OpenAI (consolidated, S.D.N.Y.): in January 2026 the court ordered OpenAI to produce a 20-million-conversation log sample in discovery; the case is testing whether model “regurgitation” of articles undermines fair use. No final ruling yet — but it’s the case most likely to define AI-training boundaries.
- The licensing economy is the flip side: Reddit licenses its data to Google (~$60M/yr) and OpenAI; Cloudflare and Stack Overflow launched pay-per-crawl; the RSL protocol (2025) gives publishers a standard way to price machine access. “Free to scrape” and “licensed to train” are diverging tracks.
Europe: GDPR and the Strictest Regulators
In the EU (and the UK), the question is rarely “was access authorized” — it’s “did you process personal data without a lawful basis?” Under the GDPR, scraping any data that identifies a person (names, photos, profiles, contact details) is “processing,” and “it was public” is not a lawful basis by itself. The enforcement record is unambiguous:
- Netherlands (AP) — the hardest line in Europe. Its May 2024 guidance calls scraping personal data “almost always a violation of the GDPR,” and it fined Clearview AI €30.5 million for scraping facial images, with directors warned of personal liability.
- Italy (Garante): fined Clearview €20 million and, in a separate case, fined a website owner €60,000 for scraping data to build an online telephone directory. Its stated principle is that “if it’s public, I can take it” is false.
- France (CNIL) — fined contact-data broker KASPR €240,000 (decision Dec 2024) for scraping LinkedIn contact details of users who had limited visibility. Its June 2025 guidance allows scraping publicly accessible personal data under legitimate interest — but only with safeguards (exclusion lists, minimization, transparency).
- UK (ICO) — fined Clearview £7.5 million; the UK GDPR plus the Data (Use and Access) Act 2025 keep the framework aligned with the EU’s.
- EU-wide: the EDPB’s coordinated 2026 enforcement on transparency raises scrutiny further, and the Database Directive adds a separate sui generis right against extracting substantial parts of protected databases — a claim that exists in Europe but not the US.
The crucial nuance: none of this targets non-personal data. Prices, product listings, availability, rankings, and reviews-without-author-data are outside the GDPR’s scope. European price-intelligence and SEO pipelines run legally every day — the line is personal data.
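To make the "reviews-without-author-data" idea concrete, here is a minimal sketch of dropping author-identifying fields from a scraped record before it is stored. The field names in `PERSONAL_FIELDS` and the record layout are hypothetical, chosen only for illustration; a real schema needs its own review, and free-text fields such as review bodies can still contain personal data that a key-based filter will not catch.

```python
from typing import Any, Dict, Iterable, List

# Hypothetical field names that identify a person. This list is an
# assumption for illustration, not a legal definition of personal data.
PERSONAL_FIELDS = frozenset(
    {"author_name", "author_url", "avatar_url", "email", "phone"}
)


def strip_personal_fields(record: Dict[str, Any]) -> Dict[str, Any]:
    """Return a copy of the record without any key listed in PERSONAL_FIELDS."""
    return {key: value for key, value in record.items() if key not in PERSONAL_FIELDS}


def scrub(records: Iterable[Dict[str, Any]]) -> List[Dict[str, Any]]:
    """Apply strip_personal_fields to every record before it enters storage."""
    return [strip_personal_fields(record) for record in records]


# Example with a made-up review record.
raw_review = {
    "product_id": "sku-123",
    "rating": 4,
    "review_text": "Works as described.",
    "author_name": "Jane Example",
    "author_url": "https://shop.example/users/jane",
}
clean_review = strip_personal_fields(raw_review)
print(sorted(clean_review))  # ['product_id', 'rating', 'review_text']
```

Filtering at ingestion, rather than at query time, is one way to keep personal fields out of the pipeline altogether; whether that is sufficient for a given dataset is a question for counsel.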
Rest of the World, at a Glance
| Jurisdiction | Framework / regulator | Scraping stance |
|---|---|---|
| Australia | Privacy Act 1988 + 2024 reform (OAIC) | OAIC found Clearview breached the Act by scraping faces; statutory privacy tort added 2024. Public product data: defensible. |
| Japan | APPI (PPC) | Direction is favorable: a 2026 APPI amendment bill (before the Diet) would add a consent exemption for statistical/AI processing of public data. |
| Brazil | LGPD (ANPD) | Active enforcement (Clearview among targets); 2026-27 priorities put scraping at the AI-and-rights intersection. |
| India | DPDP Act 2023 (phasing in to 2027) | Personal-data rules tightening; ANI v. OpenAI tests AI training on news content. |
| China | PIPL + DSL + CSL | Strict on personal data and cross-border transfer; commercial scraping needs careful localization review. |
| Türkiye | KVKK + 2024 amendments | GDPR-style regime; fines revalued annually for inflation. |
| Netherlands / Italy / France / UK | See Europe section | The four most active enforcers on record. |
Every country guide in our blog library — from Germany to Japan — carries a fact-checked local legal section if you need depth on a specific market.
What’s Defensible vs. What’s Risky
Generally defensible
- Public, read-only collection of non-personal data: prices, product specs, stock, rankings, search results, flight fares, real-estate listings
- Logged-out access only — nothing behind authentication
- Human-paced request rates that don’t burden the target’s infrastructure
- Honoring robots.txt and published crawl policies
- Competitive intelligence, market research, SEO monitoring, ad verification, academic research on aggregates
Risky to indefensible
- Scraping personal data — names, photos, emails, phone numbers, profiles — especially at scale, especially in Europe
- Scraping behind a login, against terms you accepted (the hiQ contract lesson)
- Bypassing technical barriers: CAPTCHAs presented as access controls, IP blocks aimed at you specifically, paywalls
- Republishing scraped content wholesale (copyright), or extracting substantial parts of protected databases (EU database right)
- Volume that degrades the target site (trespass-to-chattels exposure)
- Training AI on content you obtained from pirated or terms-breaching sources (Bartz’s carve-out)
A Compliance Checklist for Scraping Teams
1. Scope the data. If a field can identify a person, treat it as regulated: either drop it, or establish a GDPR lawful basis with documented safeguards (CNIL’s 2025 legitimate-interest sheet is the best template).
2. Stay logged out. Public pages only. The moment you authenticate, you’ve accepted a contract — and contract claims are the ones that win.
3. Respect the site’s signals. robots.txt, rate limits, crawl-delay. Not always legally binding, but the first thing a court or regulator looks at.
4. Engineer for restraint. Human-paced cadence, caching to avoid re-fetching, no hammering. Volume is what turns a scraping dispute into a trespass claim.
5. Mind the supply chain. Reddit v. Perplexity named the scraping vendors, not just the AI buyer. Know how your data providers collect, and prefer infrastructure with ethically sourced, consent-based IP pools — audit-defensibility now extends to your proxy layer.
6. Separate collection from use. Lawful collection doesn’t license every use: republishing, database extraction, and AI training each carry their own analysis.
7. Get counsel before scale. A pipeline that’s fine at 1,000 requests/day may need review at 10 million.
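Item 4 of the checklist (paced requests, caching to avoid re-fetching) can be sketched as a thin wrapper around whatever HTTP client a team already uses. This is an illustration under stated assumptions, not a complete crawler: `PoliteFetcher` and its parameters are hypothetical names introduced here, the fetch function is injected rather than tied to a specific library, the cache is a plain in-memory dict with no expiry, and the 2-second interval is an arbitrary placeholder, not a recommended value.

```python
import time
from typing import Callable, Dict, Optional


class PoliteFetcher:
    """Wrap a fetch function with a minimum delay between requests and a cache.

    Hypothetical helper for illustration; the clock and sleep functions are
    injectable so the pacing logic can be exercised without real waiting.
    """

    def __init__(
        self,
        fetch: Callable[[str], str],
        min_interval_s: float,
        clock: Callable[[], float] = time.monotonic,
        sleep: Callable[[float], None] = time.sleep,
    ) -> None:
        self._fetch = fetch
        self._min_interval_s = min_interval_s
        self._clock = clock
        self._sleep = sleep
        self._cache: Dict[str, str] = {}
        self._last_request_at: Optional[float] = None

    def get(self, url: str) -> str:
        # Serve repeat requests from the cache so the target is not re-fetched.
        if url in self._cache:
            return self._cache[url]
        # Wait until at least min_interval_s has passed since the last request.
        if self._last_request_at is not None:
            elapsed = self._clock() - self._last_request_at
            remaining = self._min_interval_s - elapsed
            if remaining > 0:
                self._sleep(remaining)
        body = self._fetch(url)
        self._last_request_at = self._clock()
        self._cache[url] = body
        return body


def stub_fetch(url: str) -> str:
    """Stand-in for a real HTTP call so the sketch runs without network access."""
    return f"<html>placeholder for {url}</html>"


fetcher = PoliteFetcher(stub_fetch, min_interval_s=2.0)
page = fetcher.get("https://shop.example/products/1")
same_page = fetcher.get("https://shop.example/products/1")  # cache hit, no wait
```

Injecting the fetch function keeps the pacing and caching policy in one place regardless of the client underneath. A production version would likely add cache expiry and per-host intervals, which are left out here.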
Are Proxies Legal?
Yes. Proxies are standard networking infrastructure — the same intermediary technology behind corporate gateways, CDNs, and privacy tools. Businesses use residential proxies to see localized prices, verify ads actually display in target markets, track search rankings by city, and test websites from other countries — all routine, lawful operations. What the law evaluates is the activity: scraping personal data through a proxy is exactly as regulated as doing it directly, and collecting public prices through a proxy is exactly as defensible. The one provider-side question that matters is sourcing: a network built on consented, ethically sourced residential IPs keeps your collection layer clean — which, post-Reddit v. Perplexity, your lawyers will eventually ask about.
FAQ
Is web scraping legal?
Scraping publicly available, non-personal data is broadly legal in most jurisdictions — the Ninth Circuit’s hiQ v. LinkedIn (2022) confirmed that scraping public pages isn’t “unauthorized access” under the US CFAA. The risk lives in the details: personal data (GDPR fines up to €30.5M in the Clearview cases), scraping behind logins against accepted terms (how hiQ ultimately lost), bypassing technical barriers, and republishing copyrighted content. Public product, price, and SERP data collected respectfully is the defensible lane.
Is web scraping legal in the US?
Generally yes for public data: hiQ v. LinkedIn established that public-page scraping doesn’t violate the CFAA, and Meta v. Bright Data (2024) declined to block logged-out scraping of public Facebook/Instagram pages. But contract claims survive — hiQ lost on breach of LinkedIn’s terms it had accepted — and extreme volumes risk trespass claims. State privacy laws (CCPA/CPRA) regulate personal data on top.
Is web scraping legal under GDPR in Europe?
Non-personal data (prices, listings, rankings) is outside the GDPR entirely. Personal data is where Europe is strict: the Dutch AP calls scraping personal data “almost always a violation” and fined Clearview €30.5M; Italy’s Garante fined it €20M; France’s CNIL fined KASPR €240K — all for scraping public personal data. CNIL’s 2025 guidance allows legitimate-interest scraping of public personal data only with documented safeguards.
Can I legally scrape Amazon, Google, or other big platforms?
Collecting public product and search data logged-out, at respectful rates, is the same defensible category as elsewhere — it’s how the price-intelligence and SEO industries operate. The risks are platform-specific: don’t log in (terms), don’t take personal data (reviews with author identities), don’t overload endpoints, and know each platform’s enforcement posture. Our per-platform guides (Amazon, Google Maps, LinkedIn, Reddit) cover the specific cases and rules.
Is scraping data for AI training legal?
Unsettled and splitting into two questions. Collection: same rules as all scraping. Training: Bartz v. Anthropic (2025) found training on lawfully obtained books transformative fair use, but carved out pirated sources; NYT v. OpenAI is testing whether output “regurgitation” defeats fair use. On the contract side, Reddit’s suit against Anthropic was remanded to California state court after a judge held its claims are not preempted by the Copyright Act, and its suit against Perplexity and three scraping vendors is at the motion-to-dismiss stage. Licensed data and clean supply chains are becoming the safe harbor.
Are proxies legal to use?
Yes — proxies are standard network infrastructure, legal essentially everywhere. The activity conducted through the proxy is what’s regulated: lawful scraping stays lawful through a proxy, unlawful collection stays unlawful. Choose providers with ethically sourced, consent-based residential pools; after Reddit v. Perplexity named scraping vendors as defendants, the provenance of your collection infrastructure is part of your compliance story.
Does violating robots.txt make scraping illegal?
robots.txt isn’t a statute, and ignoring it isn’t automatically illegal. But courts and regulators treat it as a signal of good or bad faith, several rulings cite it when weighing trespass and contract claims, and the CNIL’s safeguards expect you to honor opt-outs. Practically: honoring robots.txt costs little and materially strengthens your defensibility — it belongs in every compliant pipeline.
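For a sense of how little it costs, here is a minimal sketch of a robots.txt check using Python's standard-library `urllib.robotparser`. The robots.txt content, the `example-bot` user agent, and the `shop.example` URLs are all made up for illustration, and the file is parsed from a string so the sketch runs without network access. Note that `Crawl-delay` is a widely used but non-standard directive, so not every site publishes it.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, inlined so the example needs no network.
ROBOTS_TXT = """\
User-agent: *
Disallow: /account/
Crawl-delay: 5
"""


def build_parser(robots_text: str) -> RobotFileParser:
    """Parse robots.txt text that has already been downloaded."""
    parser = RobotFileParser()
    parser.parse(robots_text.splitlines())
    return parser


def is_allowed(parser: RobotFileParser, user_agent: str, url: str) -> bool:
    """Return True if the parsed rules permit this user agent to fetch the URL."""
    return parser.can_fetch(user_agent, url)


parser = build_parser(ROBOTS_TXT)
print(is_allowed(parser, "example-bot", "https://shop.example/products/1"))     # True
print(is_allowed(parser, "example-bot", "https://shop.example/account/orders")) # False
print(parser.crawl_delay("example-bot"))                                        # 5
```

In a live pipeline the same parser can load the file directly with `set_url(...)` followed by `read()`; checking each URL before it is queued, and logging the result, gives you a record of good faith if the crawl is ever questioned.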
What happened in hiQ v. LinkedIn, in one paragraph?
hiQ scraped public LinkedIn profiles for HR analytics; LinkedIn sent a cease-and-desist; hiQ sued and won the headline issue — the Ninth Circuit held (2022) that scraping public pages isn’t a CFAA violation. Then LinkedIn won the war: the court found hiQ breached the User Agreement it had accepted, and the case ended in a consent judgment with hiQ paying and accepting a permanent injunction. The double lesson: public scraping isn’t hacking, but contracts you accept are enforceable.
