Google Maps scraper


A command-line Google Maps scraper built using the scrapemate web crawling framework.

You can use this repository either as is, or you can use its code as a base and customize it to your needs.

Features

  • Extracts many data points from Google Maps
  • Exports the data to CSV, JSON or PostgreSQL
  • Performance: about 55 URLs per minute (-depth 1 -c 8)
  • Extendable: you can write your own exporter (see the sketch after this list)
  • Dockerized for easy runs on multiple platforms
  • Scalable across multiple machines
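
As a starting point for a custom exporter, here is a minimal sketch in Go. The Result type and the Writer interface below are illustrative assumptions standing in for scrapemate's actual types; check the scrapemate source for the exact signatures before implementing one.

package exporters

import (
	"context"
	"encoding/json"
	"os"
)

// Result stands in for the scraped entry; the real scraper passes
// scrapemate result types, so treat this as an illustrative placeholder.
type Result struct {
	Data any
}

// Writer mirrors the shape of a result writer: it drains results from a
// channel until the channel closes or the context is cancelled.
type Writer interface {
	Run(ctx context.Context, in <-chan Result) error
}

// NDJSONWriter is a toy exporter that writes each result as one JSON
// line to stdout.
type NDJSONWriter struct{}

func (w NDJSONWriter) Run(ctx context.Context, in <-chan Result) error {
	enc := json.NewEncoder(os.Stdout)
	for {
		select {
		case <-ctx.Done():
			return ctx.Err()
		case r, ok := <-in:
			if !ok {
				return nil
			}
			if err := enc.Encode(r.Data); err != nil {
				return err
			}
		}
	}
}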

Extracted Data Points

link
title
category
address
open_hours
website
phone
plus_code
review_count
review_rating
reviews_per_rating
latitude
longitude
cid
status
descriptions
reviews_link
thumbnail
timezone
price_range
data_id
images
reservations
order_online
menu
owner
complete_address
about
user_reviews

Quickstart

Using Docker:
touch results.csv && docker run -v $PWD/example-queries.txt:/example-queries -v $PWD/results.csv:/results.csv gosom/google-maps-scraper -depth 1 -input /example-queries -results /results.csv -exit-on-inactivity 3m

The file results.csv will contain the parsed results.
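
The example-queries.txt file contains one search query per line; if you supply your own file, use the same plain-text format, for example (the queries below are illustrative):

coffee shops in nicosia
restaurants in limassol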

On your host

(tested only on Ubuntu 22.04)

git clone https://github.com/gosom/google-maps-scraper.git
cd google-maps-scraper
go mod download
go build
./google-maps-scraper -input example-queries.txt -results restaurants-in-cyprus.csv -exit-on-inactivity 3m

Be a little patient: on the first run it downloads the required libraries.

Results are written to the results file you specified as they arrive.

Command line options

Run ./google-maps-scraper -h to see the available command line options:

  -c int
        sets the concurrency. By default it is set to half the number of CPUs (default NUM_CPU)
  -cache string
        sets the cache directory (no effect at the moment) (default "cache")
  -debug
        use this to perform a headful crawl (it opens a browser window) [only when running without Docker]
  -depth int
        how far you allow the scraper to scroll in the search results. Experiment with this value (default 10)
  -dsn string
        use this if you want to use a database provider
  -exit-on-inactivity duration
        program exits after this duration of inactivity (example value: '5m')
  -input string
        the path to the file where the queries are stored (one query per line). By default it reads from stdin (default "stdin")
  -json
        use this to produce a JSON file instead of CSV (not available when using a database)
  -lang string
        the language code to use for Google (the hl URL parameter). For example, use de for German or el for Greek (default "en")
  -produce
        produce seed jobs only (only valid with -dsn)
  -results string
        the path to the file where the results will be written (default "stdout")

Using a Database Provider (PostgreSQL)

For running on your local machine:

docker-compose -f docker-compose.dev.yaml up -d

The above starts a PostgreSQL container and creates the required tables.

To access the database:

psql -h localhost -U postgres -d postgres

The password is postgres.

Then from your host run:

go run main.go -dsn "postgres://postgres:postgres@localhost:5432/postgres" -produce -input example-queries.txt --lang el

(configure your queries and the desired language)

This will populate the gmaps_jobs table.
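
To quickly verify that the seed jobs were created, you can count the rows in gmaps_jobs. A minimal Go sketch follows; the github.com/lib/pq driver and the sslmode=disable parameter are assumptions for a local setup, not requirements of this project.

package main

import (
	"database/sql"
	"fmt"
	"log"

	_ "github.com/lib/pq" // assumed driver choice; any PostgreSQL driver works
)

func main() {
	dsn := "postgres://postgres:postgres@localhost:5432/postgres?sslmode=disable"
	db, err := sql.Open("postgres", dsn)
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()

	// Count the seed jobs produced by the -produce run above.
	var n int
	if err := db.QueryRow("SELECT COUNT(*) FROM gmaps_jobs").Scan(&n); err != nil {
		log.Fatal(err)
	}
	fmt.Printf("gmaps_jobs contains %d seed jobs\n", n)
}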

You may then run the scraper using:

go run main.go -c 2 -depth 1 -dsn "postgres://postgres:postgres@localhost:5432/postgres"

If you have a database server and several machines, you can start multiple instances of the scraper as above.

Kubernetes

You may run the scraper in a Kubernetes cluster; this makes it easier to scale.

Assuming you have a Kubernetes cluster and a database that is accessible from the cluster:

  1. First populate the database as shown above
  2. Create a deployment file, scraper.deployment:
apiVersion: apps/v1
kind: Deployment
metadata:
  name: google-maps-scraper
spec:
  selector:
    matchLabels:
      app: google-maps-scraper
  replicas: {NUM_OF_REPLICAS}
  template:
    metadata:
      labels:
        app: google-maps-scraper
    spec:
      containers:
      - name: google-maps-scraper
        image: gosom/google-maps-scraper:v0.9.3
        imagePullPolicy: IfNotPresent
        args: ["-c", "1", "-depth", "10", "-dsn", "postgres://{DBUSER}:{DBPASSWD}@{DBHOST}:{DBPORT}/{DBNAME}", "-lang", "{LANGUAGE_CODE}"]

Please replace the placeholder values and the command args accordingly.
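
Then deploy it with the standard kubectl command:

kubectl apply -f scraper.deployment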

Note: because the application starts a headless browser, it requires CPU and memory. Use an appropriately sized Kubernetes cluster.

Performance

Expected speed with a concurrency of 8 and depth 1 is 55 jobs per minute. Each search counts as 1 job, plus 1 job per result it contains.

Based on the above: if we have 1000 keywords to search and each contains 16 results, that gives 1000 + 1000 * 16 = 17000 jobs.

We expect this to take about 17000/55 ≈ 309 minutes, roughly 5 hours.

If you want to scrape many keywords, it's better to use the database provider in combination with Kubernetes for convenience, and to start multiple scrapers on more than one machine.


License

This code is licensed under the MIT License.

Contributing

Please open an issue or submit a pull request.

Notes

Please use this scraper responsibly.
