How To Scrape Website Data With the Google API

Posted by Garrett Mac on 04/04/4990
In Scraping, APIs
Tags blog, apis, google

Website data can be scraped in many different ways. In this blog post I'm going to show you how to do it with the Google API, pulling data from one specific website.

Step 1:

This is the link to Google's Custom Search tool you'll be using (you manage it at https://cse.google.com/cse). Select the “Add” button.

Step 2:

Enter the website you’d like to pull data from in the Sites to search box.

For example www.example.com

This should autofill the Name of the search engine box.

Don't ask me why it's called Name of the search engine. It really should be named something like “Campaign”, because you're pulling data from a website, which may or may not itself be a search engine (meaning you can do this with Google Images too).

Then select Create.

Step 3:

For this step you need a Google API key, which you can get at https://console.developers.google.com/. It should start with AIza*****. Next, find the third option in the list of options, titled Modify your search engine, and select it by pressing the Control Panel button.

Here you can get what's called a CX key under the “Search Engine ID” button.

It looks like this: 001508811111111110:od11111111uhtc

The CX has information such as the search URL embedded in it.
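Before pasting these two values into a request, a quick sanity check can catch copy-paste mistakes. This is only a rough sketch: the function names are made up for illustration, and the checks assume nothing more than what the examples above show (a key that starts with AIza, and a CX made of two parts separated by a colon). Treat them as loose guesses, not as official formats.

```javascript
// Hypothetical helpers; the checks are loose guesses based on the examples above.
function looksLikeApiKey(key) {
    return typeof key === "string" && key.indexOf("AIza") === 0;
}

function looksLikeCx(cx) {
    // Expect "<part>:<part>" with no whitespace, like the sample CX.
    return typeof cx === "string" && /^[^\s:]+:[^\s:]+$/.test(cx);
}

console.log(looksLikeCx("001508811111111110:od11111111uhtc")); // true
```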

This is the link if you want to do an on-site search:

https://cse.google.com/cse?oe=utf8&ie=utf8&source=uds&q=<SOME_QUERY>&start=0&cx=<YOUR_CX>#gsc.tab=0&gsc.q=<SOME_QUERY>&gsc.page=1
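If you'd rather not edit that link by hand, here is a small sketch that fills in the two placeholders. The helper name buildOnsiteSearchUrl is my own invention, and the only thing it assumes is the URL layout shown above; the query is URL-encoded so spaces and special characters survive.

```javascript
// Hypothetical helper: fills in the on-site search URL shown above.
function buildOnsiteSearchUrl(query, cx) {
    var q = encodeURIComponent(query);
    return "https://cse.google.com/cse?oe=utf8&ie=utf8&source=uds" +
        "&q=" + q +
        "&start=0" +
        "&cx=" + encodeURIComponent(cx) +
        "#gsc.tab=0&gsc.q=" + q + "&gsc.page=1";
}

// "0015:odc" is a stand-in, use your own CX here.
console.log(buildOnsiteSearchUrl("red shoes", "0015:odc"));
```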

For a raw response that you can view in the browser, the URL should look something like this:

https://www.googleapis.com/customsearch/v1element?key=AIza*********&alt=json&callback=angular.callbacks._0&cx=0015**********:odc*******&prettyPrint=true&q=<QUERY>&searchType=image
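A URL like that is easy to get wrong when typed by hand, since every parameter has to be separated by an & and the query has to be encoded. Here is a minimal sketch that builds it from a plain object instead. The helper name buildRawSearchUrl is hypothetical, and the key and CX values are placeholders, not working credentials.

```javascript
// Hypothetical helper: joins a params object into the raw API URL shown above.
function buildRawSearchUrl(params) {
    var pairs = Object.keys(params).map(function(name) {
        return encodeURIComponent(name) + "=" + encodeURIComponent(params[name]);
    });
    return "https://www.googleapis.com/customsearch/v1element?" + pairs.join("&");
}

var rawUrl = buildRawSearchUrl({
    key: "AIza***",        // placeholder, use your own key
    alt: "json",
    cx: "0015***:odc***",  // placeholder, use your own CX
    prettyPrint: "true",
    q: "some query",
    searchType: "image"
});
console.log(rawUrl);
```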

This is an AngularJS (Angular 1) service I wrote a while back for my specific use case. You can tinker with it to make it your own.

My Controller


var _items = [{"Name": "Item Name"}];
googleScrap.queryEachInArray(_items)
    .then(function(data) { console.log('data: ', data); })
    .catch(function(err) { console.log('err: ', err); });

// or

googleScrap.query("some string")
    .then(function(data) { console.log('data: ', data); })
    .catch(function(err) { console.log('err: ', err); });

My Service

app.service('googleScrap', function($http, $q, $timeout, ROOT, _) {
    var self = this;
    this.queryEachInArray = function(myArray) {
        var items = []
        var d = $q.defer();


        function singleValueFunction(value) {
            // Return the promise so the chain below waits for each query to finish.
            return self.query(value.Name).then(function(data) {
                console.log("data: ", data);
                var images = _(data.results).map('richSnippet.cseImage.src').sort().value();
                value.images = _.without(images, '', undefined);
                items.push(value);
            });
        }
        if (!Array.isArray(myArray)) {
            d.reject("please pass array");
            return d.promise; // stop here; reduce() below would throw on a non-array
        }

        // Run the queries one after another, then resolve with the collected items.
        myArray.reduce(function(p, val) {
            return p.then(function() {
                return singleValueFunction(val);
            });
        }, $q.when()).then(function() {
            d.resolve(items);
        }, function(err) {
            console.log("err: ", err);
            d.reject(err); // pass the failure on so the caller's .catch() runs
        });
        return d.promise;
    }
    this.query = function(query) {
        var url = "https://www.googleapis.com/customsearch/v1element?key=AIza***";

        /*  --------------
            manage here "https://cse.google.com/cse"
            --------------
            "imgSize":"small",
            "fileType":"jpg",
            "cx": "0055555555555555:odca4444444",
        */
        var params = {
            "alt": "json",
            "searchType": "image",
            "cx": "0055555555555555:odca4444444",
            "prettyPrint": "true",
            // 'JSON_CALLBACK' is the JSONP placeholder used by AngularJS before 1.6
            "callback": "JSON_CALLBACK",
            "q": query
        };
        // $http already returns a promise, so no wrapper is needed; using
        // .then() instead of .success() also lets failures reach the caller.
        return $http({
            url: url,
            method: 'JSONP',
            params: params
        }).then(function(response) {
            return response.data;
        });
    }
});
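If you don't want lodash just for the image-extraction step, that part can be sketched in plain JavaScript. The helper name extractImageUrls and the sample object are made up for illustration; the only thing assumed about the response is the shape the service reads from, namely results[].richSnippet.cseImage.src.

```javascript
// Hypothetical helper: collects image URLs from a response shaped like
// { results: [ { richSnippet: { cseImage: { src: "..." } } }, ... ] }.
function extractImageUrls(data) {
    var results = (data && data.results) || [];
    return results
        .map(function(result) {
            var snippet = result.richSnippet || {};
            var image = snippet.cseImage || {};
            return image.src;
        })
        .filter(function(src) {
            // Drop results that have no image, or an empty one.
            return typeof src === "string" && src !== "";
        })
        .sort();
}

// Made-up sample data, not a real API response.
var sample = {
    results: [
        { richSnippet: { cseImage: { src: "https://www.example.com/b.jpg" } } },
        { richSnippet: {} },
        { richSnippet: { cseImage: { src: "https://www.example.com/a.jpg" } } }
    ]
};
console.log(extractImageUrls(sample));
```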

I'm sure parts of that promise code could be written better, but I figured I'd share (also notice I'm using lodash as a dependency).
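For what it's worth, here is one way the "run each query one after another" part could look, as a sketch with plain promises outside of Angular. The names runSequentially and fakeQuery are hypothetical, and fakeQuery is only a stand-in for googleScrap.query so the example runs on its own.

```javascript
// Hypothetical helper: runs `task` on each item one after another and
// resolves with the collected results, in order.
function runSequentially(items, task) {
    if (!Array.isArray(items)) {
        return Promise.reject(new Error("please pass array"));
    }
    var results = [];
    return items.reduce(function(chain, item) {
        return chain
            .then(function() { return task(item); })
            .then(function(result) { results.push(result); });
    }, Promise.resolve()).then(function() {
        return results;
    });
}

// Stand-in for googleScrap.query: resolves after a short delay.
function fakeQuery(name) {
    return new Promise(function(resolve) {
        setTimeout(function() { resolve(name.toUpperCase()); }, 10);
    });
}

runSequentially(["a", "b"], fakeQuery).then(function(out) {
    console.log(out);
});
```

Inside an AngularJS app you'd probably swap Promise for $q so the digest cycle picks up the results.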

Hope this helped get you started on your website scraping!
