Tired of Web Scraping? Make The AI Do It

Tired of Web Scraping? Make The AI Do It

[James Turk] has a novel approach to the problem of scraping web content in a structured way without needing to write the kind of page-specific code web scrapers usually have to deal with. How? Just enlist the help of a natural language AI. Scrapeghost relies on OpenAI’s GPT API to parse a web page’s content, pull out and classify any salient bits, and format it in a useful way.

What makes Scrapeghost different is how data gets organized. For example, when instantiating scrapeghost one defines the data one wishes to extract. For example:

from scrapeghost import SchemaScraper
scrape_legislators = SchemaScraper(
schema=
"name": "string",
"url": "url",
"district": "string",
"party": "string",
"photo_url": "url",
"offices": ["name": "string", "address": "string", "phone": "string"],

)

The kicker is that this format is entirely up to you! The GPT models are very, very good at processing natural language, and scrapeghost uses GPT to process the scraped data and find (using the example above) whatever looks like a name, district, party, photo, and office address and format it exactly as requested.

It’s an experimental tool and you’ll need an API key from OpenAI to use it, but it has useful features and is certainly a novel approach. There’s a tutorial and even a command-line interface, so check it out.

Next Post

Blinks Are Useful In VR, But Triggering Blinks Is Tricky

Sun Apr 9 , 2023
In VR, a blink can be a window of opportunity to improve the user’s experience. We’ll explain how in a moment, but blinks are tough to capitalize on because they are unpredictable and don’t last very long. That’s why researchers spent time figuring out how to induce eye blinks on […]
Blinks Are Useful In VR, But Triggering Blinks Is Tricky

You May Like

About

muryou-erogazou.net provide by The top global media Technology, Gadget, Website, SEO, Internet Marketing,Digital marketing.