How to send a POST request (I'm doing reverse engineering)

I'm conducting reverse engineering and have discovered a link that retrieves all the data I need using the POST method. I've copied the request as cURL to analyze the parameters required for making the correct request. I've modified the parameters to make the request using the POST method. I've successfully tested this using httpx, but now I want to implement it using the Crawlee framework. How can I change the method used by the HTTP client to retrieve the data, and how can I pass the modified parameters I've prepared? Additionally, if anyone has experience, I'd appreciate any insights on handling POST requests within this framework. Thanks
Hall
Hall•8mo ago
This post has been pushed to the community knowledgebase. Any replies in this thread will be synced to the community site.
helpful-purple
helpful-purple•8mo ago
Hey @frankman, here's an example in the documentation: https://crawlee.dev/python/docs/examples/fill-and-submit-web-form
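A minimal sketch of what that looks like for a POST request with a body, written without the Apify Actor wrapper and assuming a Crawlee for Python version in which the Request payload is actually forwarded to the HTTP client (the issue and PR linked further down this thread track exactly that). The URL, body, and header below are placeholders, not values from the original question:

```python
import asyncio

from crawlee import Request
from crawlee.http_crawler import HttpCrawler, HttpCrawlingContext


async def main() -> None:
    crawler = HttpCrawler()

    # The HTTP method and body are set on the Request itself; replace the
    # placeholder payload with the parameters extracted from the copied cURL.
    request = Request.from_url(
        'https://httpbin.org/post',
        method='POST',
        payload=b'gridFilterType=0&pageIndex=1',
        headers={'Content-Type': 'application/x-www-form-urlencoded'},
    )

    @crawler.router.default_handler
    async def handler(context: HttpCrawlingContext) -> None:
        context.log.info(f'Processing {context.request.url}')
        # Push the raw response body so you can inspect what the server returned.
        await context.push_data({'body': context.http_response.read().decode()})

    await crawler.run([request])


if __name__ == '__main__':
    asyncio.run(main())
```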
conscious-sapphire
conscious-sapphireOP•7mo ago
Hi, the above answer doesn't work for me. I found this open issue, and maybe it is related, because I'm trying to do a POST request and I'm not getting any data: https://github.com/apify/crawlee-python/issues/560 Here is how I'm adding the request:
from apify import Actor
from crawlee import Request
from crawlee.beautifulsoup_crawler import BeautifulSoupCrawler, BeautifulSoupCrawlingContext


async def main() -> None:
    async with Actor:
        crawler = BeautifulSoupCrawler()

        url = "https://www.MY_URL.com?gridFilterType=0&homeAwayFilterType=0&sortBy=0&nearbyGridRadius=50&venueIdFilterType=0&eventViewType=0&opponentCategoryId=0&pageIndex=1&method=GetFilteredEvents&categoryId=4555genreId=undefined&eventCountryType=0&fromPrice=undefined&toPrice=undefined"

        initial_req = Request.from_url(
            method="POST",
            url=str(url),
        )

        @crawler.router.default_handler
        async def default_handler(context: BeautifulSoupCrawlingContext) -> None:
            context.log.info(f"Processing {context.request.url}")
            await context.push_data(context.request.model_dump_json())

        # Run the crawler
        await crawler.run([initial_req])
Here is the response when I save the JSON:
{
    "url": "MY_URL",
    "unique_key": "MY_URL",
    "method": "POST",
    "headers": {},
    "query_params": {},
    "payload": null,
    "data": {},
    "user_data": {
        "__crawlee": {
            "state": 3
        }
    },
    "retry_count": 0,
    "no_retry": false,
    "loaded_url": "MY_URL",
    "handled_at": null,
    "id": "iEYRVLtHdfdR7s6",
    "json_": null,
    "order_no": null
}
conscious-sapphire
conscious-sapphireOP•7mo ago
The issue is also related to this PR: https://github.com/apify/crawlee-python/pull/542 I'm adding this URL to follow the issue. I'm interested in helping because I use Crawlee and Apify a lot.
helpful-purple
helpful-purple•7mo ago
Hey @frankman, yes, I created issue 560 🙂 About your URL: I don't see any payload in it. That is, you pass all the parameters as query parameters in the link, not in the body of the POST request. Are you sure you are creating it correctly? Are you doing the same thing with HTTPX? If you look at how the site sees it using httpbin.org/post, you'll get this response format:
{
    "args": {
        "categoryId": "4555genreId=undefined",
        "eventCountryType": "0",
        "eventViewType": "0",
        "fromPrice": "undefined",
        "gridFilterType": "0",
        "homeAwayFilterType": "0",
        "method": "GetFilteredEvents",
        "nearbyGridRadius": "50",
        "opponentCategoryId": "0",
        "pageIndex": "1",
        "sortBy": "0",
        "toPrice": "undefined",
        "venueIdFilterType": "0"
    },
    "data": "",
    "files": {},
    "form": {},
    "headers": {
        "Accept": "*/*",
        "Accept-Encoding": "gzip, deflate, br",
        "Content-Length": "0",
        "Host": "httpbin.org",
        "User-Agent": "python-httpx/0.27.2",
        "X-Amzn-Trace-Id": "Root=1-67100e24-37616e605f9cf31e5538556b"
    },
    "json": null,
    "origin": "91.240.96.149",
    "url": "https://httpbin.org/post?gridFilterType=0&homeAwayFilterType=0&sortBy=0&nearbyGridRadius=50&venueIdFilterType=0&eventViewType=0&opponentCategoryId=0&pageIndex=1&method=GetFilteredEvents&categoryId=4555genreId%3Dundefined&eventCountryType=0&fromPrice=undefined&toPrice=undefined"
}
This is completely consistent with your example: all parameters end up in args. You'll also see an error in your URL 🙂 you forgot the & before the genreId parameter. The correct URL should be:
url = "https://www.MY_URL.com?gridFilterType=0&homeAwayFilterType=0&sortBy=0&nearbyGridRadius=50&venueIdFilterType=0&eventViewType=0&opponentCategoryId=0&pageIndex=1&method=GetFilteredEvents&categoryId=4555&genreId=undefined&eventCountryType=0&fromPrice=undefined&toPrice=undefined"
url = "https://www.MY_URL.com?gridFilterType=0&homeAwayFilterType=0&sortBy=0&nearbyGridRadius=50&venueIdFilterType=0&eventViewType=0&opponentCategoryId=0&pageIndex=1&method=GetFilteredEvents&categoryId=4555&genreId=undefined&eventCountryType=0&fromPrice=undefined&toPrice=undefined"
conscious-sapphire
conscious-sapphireOP•7mo ago
Sorry, I had removed the domain name and some parameters before posting, so that mistake is mine and you analyzed based on it. I will put the original link so you can check it again.
async def main() -> None:
    async with Actor:
        crawler = BeautifulSoupCrawler()

        url = "https://www.viagogo.com/Concert-Tickets/Pop-Rock/Dance-Pop/Shakira-Tickets?gridFilterType=0&homeAwayFilterType=0&sortBy=0&nearbyGridRadius=50&venueIdFilterType=0&eventViewType=0&opponentCategoryId=0&pageIndex=1&method=GetFilteredEvents&categoryId=4555&radiusFrom=80467&radiusTo=null&from=1970-01-01T00%3A00%3A00.000Z&to=9999-12-30T23%3A00%3A00.000Z&lat=39.044&lon=-77.488&genreId=undefined&eventCountryType=0&fromPrice=undefined&toPrice=undefined"

        initial_req = Request.from_url(
            method="POST",
            url=str(url),
        )

        @crawler.router.default_handler
        async def default_handler(context: BeautifulSoupCrawlingContext) -> None:
            context.log.info(f"Processing {context.request.url}")
            await context.push_data(context.request.model_dump_json())

        await crawler.run([initial_req])
Continues 🧵 The output was:
{
    "url": "https://www.viagogo.com/Concert-Tickets/Pop-Rock/Dance-Pop/Shakira-Tickets?gridFilterType=0&homeAwayFilterType=0&sortBy=0&nearbyGridRadius=50&venueIdFilterType=0&eventViewType=0&opponentCategoryId=0&pageIndex=1&method=GetFilteredEvents&categoryId=4555&radiusFrom=80467&radiusTo=null&from=1970-01-01T00%3A00%3A00.000Z&to=9999-12-30T23%3A00%3A00.000Z&lat=39.044&lon=-77.488&genreId=undefined&eventCountryType=0&fromPrice=undefined&toPrice=undefined",
    "unique_key": "https://www.viagogo.com/concert-tickets/pop-rock/dance-pop/shakira-tickets?categoryid=4555&eventcountrytype=0&eventviewtype=0&from=1970-01-01t00%3a00%3a00.000z&fromprice=undefined&genreid=undefined&gridfiltertype=0&homeawayfiltertype=0&lat=39.044&lon=-77.488&method=getfilteredevents&nearbygridradius=50&opponentcategoryid=0&pageindex=1&radiusfrom=80467&radiusto=null&sortby=0&to=9999-12-30t23%3a00%3a00.000z&toprice=undefined&venueidfiltertype=0",
    "method": "POST",
    "headers": {},
    "query_params": {},
    "payload": null,
    "data": {},
    "user_data": {
        "__crawlee": {
            "state": 3
        }
    },
    "retry_count": 0,
    "no_retry": false,
    "loaded_url": "https://www.viagogo.com/Concert-Tickets/Pop-Rock/Dance-Pop/Shakira-Tickets?gridFilterType=0&homeAwayFilterType=0&sortBy=0&nearbyGridRadius=50&venueIdFilterType=0&eventViewType=0&opponentCategoryId=0&pageIndex=1&method=GetFilteredEvents&categoryId=4555&radiusFrom=80467&radiusTo=null&from=1970-01-01T00%3A00%3A00.000Z&to=9999-12-30T23%3A00%3A00.000Z&lat=39.044&lon=-77.488&genreId=undefined&eventCountryType=0&fromPrice=undefined&toPrice=undefined",
    "handled_at": null,
    "id": "iEYRVLtHdfdR7s6",
    "json_": null,
    "order_no": null
}
If I do the same but only with httpx:
import httpx

resp = httpx.post(url)
print(resp.json())

> output:

{'items': [{'eventId': 153433356,
            'name': 'Shakira',
            'url': 'https://www.viagogo.com/Concert-Tickets/Pop-Rock/Dance-Pop/Shakira-Tickets/E-153433356',
            'dayOfWeek': 'Wed',
            ...
}
helpful-purple
helpful-purple•7mo ago
Hi. All the code works correctly; the problem is in what you are doing with it. 1. context.request.model_dump_json() outputs the Request metadata, which does not include the server response. As a result, you are comparing the request metadata from Crawlee with the server response from HTTPX. 2. I don't really understand why you need BeautifulSoupCrawler when working with JSON. It would be more appropriate to use ParselCrawler or HttpCrawler together with a convenient library for working with JSON. Here is sample code that does what you expect:
async def main() -> None:
    async with Actor:
        crawler = BeautifulSoupCrawler()

        url = "https://www.viagogo.com/Concert-Tickets/Pop-Rock/Dance-Pop/Shakira-Tickets?gridFilterType=0&homeAwayFilterType=0&sortBy=0&nearbyGridRadius=50&venueIdFilterType=0&eventViewType=0&opponentCategoryId=0&pageIndex=1&method=GetFilteredEvents&categoryId=4555&radiusFrom=80467&radiusTo=null&from=1970-01-01T00%3A00%3A00.000Z&to=9999-12-30T23%3A00%3A00.000Z&lat=39.044&lon=-77.488&genreId=undefined&eventCountryType=0&fromPrice=undefined&toPrice=undefined"

        initial_req = Request.from_url(
            method="POST",
            url=str(url),
        )

        @crawler.router.default_handler
        async def default_handler(context: BeautifulSoupCrawlingContext) -> None:
            context.log.info(f"Processing {context.request.url}")

            await context.push_data(context.soup.find("p").text)

        await crawler.run([initial_req])
conscious-sapphire
conscious-sapphireOP•7mo ago
You're right, Mantisus. Now I'm using HttpCrawler() and I'm getting the data I want. This code does what I want:
from apify import Actor
from crawlee import Request
from crawlee.http_crawler import HttpCrawler, HttpCrawlingContext
import json


async def main() -> None:
    async with Actor:
        crawler = HttpCrawler()

        url = "https://www.viagogo.com/Concert-Tickets/Pop-Rock/Dance-Pop/Shakira-Tickets?gridFilterType=0&homeAwayFilterType=0&sortBy=0&nearbyGridRadius=50&venueIdFilterType=0&eventViewType=0&opponentCategoryId=0&pageIndex=1&method=GetFilteredEvents&categoryId=4555&radiusFrom=80467&radiusTo=null&from=1970-01-01T00%3A00%3A00.000Z&to=9999-12-30T23%3A00%3A00.000Z&lat=39.044&lon=-77.488&genreId=undefined&eventCountryType=0&fromPrice=undefined&toPrice=undefined"

        initial_req = Request.from_url(
            method="POST",
            url=str(url),
        )

        @crawler.router.default_handler
        async def default_handler(context: HttpCrawlingContext) -> None:
            context.log.info(f"Processing {context.request.url}")
            json_response = context.http_response.read()  # raw response body (bytes), like resp.content in httpx
            json_resp_parsed = json.loads(json_response)  # equivalent to resp.json() after resp = httpx.post(url)
            await context.push_data(json_resp_parsed)

        await crawler.run([initial_req])
Thanks Mantisus
