How to send a POST request (I'm doing reverse engineering)

I'm conducting reverse engineering and have discovered a link that retrieves all the data I need using the POST method. I've copied the request as cURL to analyze the parameters required for making the correct request. I've modified the parameters to make the request using the POST method. I've successfully tested this using httpx, but now I want to implement it using the Crawlee framework. How can I change the method used by the HTTP client to retrieve the data, and how can I pass the modified parameters I've prepared? Additionally, if anyone has experience, I'd appreciate any insights on handling POST requests within this framework. Thanks
Hall
Hall•8mo ago
This post has been pushed to the community knowledgebase. Any replies in this thread will be synced to the community site.
helpful-purple
helpful-purple•8mo ago
Hey @frankman, here's an example in the documentation: https://crawlee.dev/python/docs/examples/fill-and-submit-web-form
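A minimal sketch of what that looks like for a POST request with a body, written without the Apify Actor wrapper and assuming a Crawlee for Python version in which the Request payload is actually forwarded to the HTTP client (the issue and PR linked further down this thread track exactly that). The URL, body, and header below are placeholders, not values from the original question:

```python
import asyncio

from crawlee import Request
from crawlee.http_crawler import HttpCrawler, HttpCrawlingContext


async def main() -> None:
    crawler = HttpCrawler()

    # The HTTP method and body are set on the Request itself; replace the
    # placeholder payload with the parameters extracted from the copied cURL.
    request = Request.from_url(
        'https://httpbin.org/post',
        method='POST',
        payload=b'gridFilterType=0&pageIndex=1',
        headers={'Content-Type': 'application/x-www-form-urlencoded'},
    )

    @crawler.router.default_handler
    async def handler(context: HttpCrawlingContext) -> None:
        context.log.info(f'Processing {context.request.url}')
        # Push the raw response body so you can inspect what the server returned.
        await context.push_data({'body': context.http_response.read().decode()})

    await crawler.run([request])


if __name__ == '__main__':
    asyncio.run(main())
```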
conscious-sapphire
conscious-sapphireOP•7mo ago
Hi, the above answer doesn't work for me. I found this open issue, and maybe it is related, because I'm trying to do a POST request and I'm not getting any data: https://github.com/apify/crawlee-python/issues/560 Here is how I'm adding the request:
from apify import Actor
from crawlee import Request
from crawlee.beautifulsoup_crawler import BeautifulSoupCrawler, BeautifulSoupCrawlingContext


async def main() -> None:
    async with Actor:
        crawler = BeautifulSoupCrawler()

        url = "https://www.MY_URL.com?gridFilterType=0&homeAwayFilterType=0&sortBy=0&nearbyGridRadius=50&venueIdFilterType=0&eventViewType=0&opponentCategoryId=0&pageIndex=1&method=GetFilteredEvents&categoryId=4555genreId=undefined&eventCountryType=0&fromPrice=undefined&toPrice=undefined"

        initial_req = Request.from_url(
            method="POST",
            url=str(url),
        )

        @crawler.router.default_handler
        async def default_handler(context: BeautifulSoupCrawlingContext) -> None:
            context.log.info(f"Processing {context.request.url}")
            await context.push_data(context.request.model_dump_json())

        # Run the crawler
        await crawler.run([initial_req])
Here is the response when I save the JSON:
{
    "url": "MY_URL",
    "unique_key": "MY_URL",
    "method": "POST",
    "headers": {},
    "query_params": {},
    "payload": null,
    "data": {},
    "user_data": {
        "__crawlee": {
            "state": 3
        }
    },
    "retry_count": 0,
    "no_retry": false,
    "loaded_url": "MY_URL",
    "handled_at": null,
    "id": "iEYRVLtHdfdR7s6",
    "json_": null,
    "order_no": null
}
conscious-sapphire
conscious-sapphireOP•7mo ago
The issue is also related to this PR: https://github.com/apify/crawlee-python/pull/542 I'm adding this URL to follow the issue. I'm interested in helping because I use Crawlee and Apify a lot.
helpful-purple
helpful-purple•7mo ago
Hey @frankman, yes, I created issue 560 🙂 About your URL: I don't see any payload in it. That is, you pass all the parameters as query parameters in the link, not in the body of the POST request. Are you sure you are creating it correctly? Are you doing the same thing with HTTPX? If you look at how the site sees it using httpbin.org/post, you'll get this response format:
{
    "args": {
        "categoryId": "4555genreId=undefined",
        "eventCountryType": "0",
        "eventViewType": "0",
        "fromPrice": "undefined",
        "gridFilterType": "0",
        "homeAwayFilterType": "0",
        "method": "GetFilteredEvents",
        "nearbyGridRadius": "50",
        "opponentCategoryId": "0",
        "pageIndex": "1",
        "sortBy": "0",
        "toPrice": "undefined",
        "venueIdFilterType": "0"
    },
    "data": "",
    "files": {},
    "form": {},
    "headers": {
        "Accept": "*/*",
        "Accept-Encoding": "gzip, deflate, br",
        "Content-Length": "0",
        "Host": "httpbin.org",
        "User-Agent": "python-httpx/0.27.2",
        "X-Amzn-Trace-Id": "Root=1-67100e24-37616e605f9cf31e5538556b"
    },
    "json": null,
    "origin": "91.240.96.149",
    "url": "https://httpbin.org/post?gridFilterType=0&homeAwayFilterType=0&sortBy=0&nearbyGridRadius=50&venueIdFilterType=0&eventViewType=0&opponentCategoryId=0&pageIndex=1&method=GetFilteredEvents&categoryId=4555genreId%3Dundefined&eventCountryType=0&fromPrice=undefined&toPrice=undefined"
}
This is completely consistent with your example: all parameters end up in args. You'll also see an error in your URL 🙂 you forgot the & before the genreId parameter. The correct URL should be:
url = "https://www.MY_URL.com?gridFilterType=0&homeAwayFilterType=0&sortBy=0&nearbyGridRadius=50&venueIdFilterType=0&eventViewType=0&opponentCategoryId=0&pageIndex=1&method=GetFilteredEvents&categoryId=4555&genreId=undefined&eventCountryType=0&fromPrice=undefined&toPrice=undefined"
url = "https://www.MY_URL.com?gridFilterType=0&homeAwayFilterType=0&sortBy=0&nearbyGridRadius=50&venueIdFilterType=0&eventViewType=0&opponentCategoryId=0&pageIndex=1&method=GetFilteredEvents&categoryId=4555&genreId=undefined&eventCountryType=0&fromPrice=undefined&toPrice=undefined"
conscious-sapphire
conscious-sapphireOP•7mo ago
Sorry, I had removed the domain name and some parameters before posting, so that mistake is mine and you analyzed based on it. I will put the original link so you can check it again.
async def main() -> None:
    async with Actor:
        crawler = BeautifulSoupCrawler()

        url = "https://www.viagogo.com/Concert-Tickets/Pop-Rock/Dance-Pop/Shakira-Tickets?gridFilterType=0&homeAwayFilterType=0&sortBy=0&nearbyGridRadius=50&venueIdFilterType=0&eventViewType=0&opponentCategoryId=0&pageIndex=1&method=GetFilteredEvents&categoryId=4555&radiusFrom=80467&radiusTo=null&from=1970-01-01T00%3A00%3A00.000Z&to=9999-12-30T23%3A00%3A00.000Z&lat=39.044&lon=-77.488&genreId=undefined&eventCountryType=0&fromPrice=undefined&toPrice=undefined"

        initial_req = Request.from_url(
            method="POST",
            url=str(url),
        )

        @crawler.router.default_handler
        async def default_handler(context: BeautifulSoupCrawlingContext) -> None:
            context.log.info(f"Processing {context.request.url}")
            await context.push_data(context.request.model_dump_json())

        await crawler.run([initial_req])
Continues 🧵 The output was:
{
    "url": "https://www.viagogo.com/Concert-Tickets/Pop-Rock/Dance-Pop/Shakira-Tickets?gridFilterType=0&homeAwayFilterType=0&sortBy=0&nearbyGridRadius=50&venueIdFilterType=0&eventViewType=0&opponentCategoryId=0&pageIndex=1&method=GetFilteredEvents&categoryId=4555&radiusFrom=80467&radiusTo=null&from=1970-01-01T00%3A00%3A00.000Z&to=9999-12-30T23%3A00%3A00.000Z&lat=39.044&lon=-77.488&genreId=undefined&eventCountryType=0&fromPrice=undefined&toPrice=undefined",
    "unique_key": "https://www.viagogo.com/concert-tickets/pop-rock/dance-pop/shakira-tickets?categoryid=4555&eventcountrytype=0&eventviewtype=0&from=1970-01-01t00%3a00%3a00.000z&fromprice=undefined&genreid=undefined&gridfiltertype=0&homeawayfiltertype=0&lat=39.044&lon=-77.488&method=getfilteredevents&nearbygridradius=50&opponentcategoryid=0&pageindex=1&radiusfrom=80467&radiusto=null&sortby=0&to=9999-12-30t23%3a00%3a00.000z&toprice=undefined&venueidfiltertype=0",
    "method": "POST",
    "headers": {},
    "query_params": {},
    "payload": null,
    "data": {},
    "user_data": {
        "__crawlee": {
            "state": 3
        }
    },
    "retry_count": 0,
    "no_retry": false,
    "loaded_url": "https://www.viagogo.com/Concert-Tickets/Pop-Rock/Dance-Pop/Shakira-Tickets?gridFilterType=0&homeAwayFilterType=0&sortBy=0&nearbyGridRadius=50&venueIdFilterType=0&eventViewType=0&opponentCategoryId=0&pageIndex=1&method=GetFilteredEvents&categoryId=4555&radiusFrom=80467&radiusTo=null&from=1970-01-01T00%3A00%3A00.000Z&to=9999-12-30T23%3A00%3A00.000Z&lat=39.044&lon=-77.488&genreId=undefined&eventCountryType=0&fromPrice=undefined&toPrice=undefined",
    "handled_at": null,
    "id": "iEYRVLtHdfdR7s6",
    "json_": null,
    "order_no": null
}
If I do the same but only with httpx:
import httpx

resp = httpx.post(url)
print(resp.json())

> output:

{'items': [{'eventId': 153433356,
            'name': 'Shakira',
            'url': 'https://www.viagogo.com/Concert-Tickets/Pop-Rock/Dance-Pop/Shakira-Tickets/E-153433356',
            'dayOfWeek': 'Wed',
            ...
}
helpful-purple
helpful-purple•7mo ago
Hi. All the code works correctly; the problem is in what you are doing with it. 1. context.request.model_dump_json() outputs the Request metadata, which does not include the server response. As a result, you are comparing the request metadata from Crawlee with the server response from HTTPX. 2. I don't really understand why you need BeautifulSoupCrawler when working with JSON. It would be more appropriate to use ParselCrawler or HttpCrawler together with a convenient library for working with JSON. Here is sample code that does what you expect:
async def main() -> None:
    async with Actor:
        crawler = BeautifulSoupCrawler()

        url = "https://www.viagogo.com/Concert-Tickets/Pop-Rock/Dance-Pop/Shakira-Tickets?gridFilterType=0&homeAwayFilterType=0&sortBy=0&nearbyGridRadius=50&venueIdFilterType=0&eventViewType=0&opponentCategoryId=0&pageIndex=1&method=GetFilteredEvents&categoryId=4555&radiusFrom=80467&radiusTo=null&from=1970-01-01T00%3A00%3A00.000Z&to=9999-12-30T23%3A00%3A00.000Z&lat=39.044&lon=-77.488&genreId=undefined&eventCountryType=0&fromPrice=undefined&toPrice=undefined"

        initial_req = Request.from_url(
            method="POST",
            url=str(url),
        )

        @crawler.router.default_handler
        async def default_handler(context: BeautifulSoupCrawlingContext) -> None:
            context.log.info(f"Processing {context.request.url}")

            await context.push_data(context.soup.find("p").text)

        await crawler.run([initial_req])
conscious-sapphire
conscious-sapphireOP•7mo ago
You're right, Mantisus. Now I'm using HttpCrawler() and I'm getting the data I want. This code does what I want:
from apify import Actor
from crawlee import Request
from crawlee.http_crawler import HttpCrawler, HttpCrawlingContext
import json


async def main() -> None:
    async with Actor:
        crawler = HttpCrawler()

        url = "https://www.viagogo.com/Concert-Tickets/Pop-Rock/Dance-Pop/Shakira-Tickets?gridFilterType=0&homeAwayFilterType=0&sortBy=0&nearbyGridRadius=50&venueIdFilterType=0&eventViewType=0&opponentCategoryId=0&pageIndex=1&method=GetFilteredEvents&categoryId=4555&radiusFrom=80467&radiusTo=null&from=1970-01-01T00%3A00%3A00.000Z&to=9999-12-30T23%3A00%3A00.000Z&lat=39.044&lon=-77.488&genreId=undefined&eventCountryType=0&fromPrice=undefined&toPrice=undefined"

        initial_req = Request.from_url(
            method="POST",
            url=str(url),
        )

        @crawler.router.default_handler
        async def default_handler(context: HttpCrawlingContext) -> None:
            context.log.info(f"Processing {context.request.url}")
            json_response = context.http_response.read()  # raw response body (bytes), like resp.content in httpx
            json_resp_parsed = json.loads(json_response)  # equivalent to resp.json() after resp = httpx.post(url)
            await context.push_data(json_resp_parsed)

        await crawler.run([initial_req])
Thanks Mantisus
