LLM Extract Does Not Do Whole Page?
@Caleb I'm trying to extract structured data from a website, but not all of the data is being scraped. Only the first entries at the top of the page are captured. Any suggestions?
Hey, could you share your request url/schema so we can replicate?
My guess is that it has to do with the page loading on scroll
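If lazy loading is the cause, one thing to try is asking the scraper to scroll before it extracts. Here is a minimal sketch, assuming the scrape request accepts an `actions` list with `scroll` and `wait` steps (please check the Firecrawl docs for the exact action names and limits). The helper name `build_scroll_actions` and the scroll and pause counts are hypothetical, chosen only for illustration.

```python
def build_scroll_actions(scrolls: int = 5, pause_ms: int = 1500) -> list:
    """Alternate scroll-down and wait steps so lazily loaded entries can render."""
    actions = []
    for _ in range(scrolls):
        # Assumed action shapes; verify them against the Firecrawl documentation.
        actions.append({"type": "scroll", "direction": "down"})
        actions.append({"type": "wait", "milliseconds": pause_ms})
    return actions


# Request options in the same dict style used elsewhere in this thread.
params = {
    "formats": ["extract"],
    "extract": {"prompt": "Extract every listing on the page."},
    "actions": build_scroll_actions(scrolls=3),
}

print(len(params["actions"]))  # one scroll step plus one wait step per scroll
```

These options would be passed as the second argument to `app.scrape_url(...)`. Whether scrolling is enough depends on how the site loads its results, so treat this as something to experiment with.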
Thank you very much for helping! Just a caveat: I am not a coder or developer, so it's likely I am missing something in the request. Here you go:
data = app.scrape_url(
    "https://www.zillow.com/las-vegas-nv/sold/?searchQueryState=%7B%22pagination%22%3A%7B%7D%2C%22isMapVisible%22%3Atrue%2C%22mapBounds%22%3A%7B%22west%22%3A-116.15555895361328%2C%22east%22%3A-114.42177904638672%2C%22south%22%3A35.71392015428329%2C%22north%22%3A36.7166111978914%7D%2C%22regionSelection%22%3A%5B%7B%22regionId%22%3A18959%2C%22regionType%22%3A6%7D%5D%2C%22filterState%22%3A%7B%22sort%22%3A%7B%22value%22%3A%22globalrelevanceex%22%7D%2C%22ah%22%3A%7B%22value%22%3Atrue%7D%2C%22rs%22%3A%7B%22value%22%3Atrue%7D%2C%22fsba%22%3A%7B%22value%22%3Afalse%7D%2C%22fsbo%22%3A%7B%22value%22%3Afalse%7D%2C%22nc%22%3A%7B%22value%22%3Afalse%7D%2C%22cmsn%22%3A%7B%22value%22%3Afalse%7D%2C%22auc%22%3A%7B%22value%22%3Afalse%7D%2C%22fore%22%3A%7B%22value%22%3Afalse%7D%7D%2C%22isListVisible%22%3Atrue%2C%22mapZoom%22%3A11%2C%22usersSearchTerm%22%3A%22Las%20Vegas%20NV%22%7D",
    {
        "formats": ["extract"],
        "extract": {
            "prompt": "Extract the address, date, price, beds, baths, and sqft of all homes sold in Las Vegas, Nevada in the last 6 months.",
        },
    },
)
Where is your extraction schema? That's the most important parameter to pass, because it tells the model exactly what format to return the data in.
Caleb! Great to hear from you. In my ignorance, I put the schema in the prompt. The extraction produced the desired result as far as structuring the output correctly. However, it stopped after 10 extractions when there were over 500 more to do. Should I explicitly state the extraction schema?
Yes, explicitly state the extraction schema!
Caleb, understood, and thank you for your time. I declared the schema and used it to scrape data from a simpler website. Here is the updated code I used:
from pydantic import BaseModel

# `app` is the Firecrawl client created earlier (not shown here).

# Fields to extract for a listing.
class ExtractSchema(BaseModel):
    Address: str
    Location: str
    Price: int
    Beds: int
    Baths: float
    SqFt: int
    Px_SqFt: int
    Time_On_Redfin: str

data = app.scrape_url(
    "https://www.redfin.com/zipcode/89134/filter/sort=lo-days",
    {
        # Extract the listings
        "formats": ["extract"],
        "extract": {
            "schema": ExtractSchema.model_json_schema(),
            "prompt": "Extract all the listings from the redfin website",
        },
    },
)
print(data["extract"])
It does a wonderful job of correctly extracting data into the defined schema. But it only pulls up the first listing, and there are over 30 on that page. Can you please teach me what changes I need to make to capture data from all the listings?
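One likely cause is that `ExtractSchema` describes a single listing, so the model returns a single object. A common fix is to wrap the per-listing schema in an outer object that holds an array of listings. The sketch below builds that wrapper as plain JSON Schema using only the standard library. The field name `listings` and the helper names are hypothetical, and the reworded prompt is only a suggestion. With Pydantic, the equivalent would be a second model with a `listings: list[ExtractSchema]` field.

```python
import json

# JSON Schema for ONE listing, mirroring the fields of ExtractSchema in the thread.
LISTING_PROPERTIES = {
    "Address": {"type": "string"},
    "Location": {"type": "string"},
    "Price": {"type": "integer"},
    "Beds": {"type": "integer"},
    "Baths": {"type": "number"},
    "SqFt": {"type": "integer"},
    "Px_SqFt": {"type": "integer"},
    "Time_On_Redfin": {"type": "string"},
}
LISTING_SCHEMA = {
    "type": "object",
    "properties": LISTING_PROPERTIES,
    "required": list(LISTING_PROPERTIES),
}


def build_listings_schema(item_schema: dict) -> dict:
    """Wrap a single-item schema in an object with a `listings` array (hypothetical name)."""
    return {
        "type": "object",
        "properties": {"listings": {"type": "array", "items": item_schema}},
        "required": ["listings"],
    }


def build_scrape_params(schema: dict, prompt: str) -> dict:
    """Build request options in the same dict shape used in the thread."""
    return {"formats": ["extract"], "extract": {"schema": schema, "prompt": prompt}}


params = build_scrape_params(
    build_listings_schema(LISTING_SCHEMA),
    "Extract every listing on the page, not only the first one.",
)
print(json.dumps(params, indent=2))
```

With the client from the thread, `params` would be passed as the second argument to `app.scrape_url(...)`, and the results would then be expected under `data["extract"]["listings"]` rather than directly under `data["extract"]`. If some listings are still missing after this change, the page may be loading them on scroll, as guessed earlier in the thread.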