Scraping website data can be done many different ways. In this blog post I'm going to show you how to do it using the Google API, and how to scrape a specific website's data.
Step 1:
This is the link to Google's Custom Search tool you'll be using. Select the "Add" button.
Step 2:
Enter the website you’d like to pull data from in the Sites to search box.
For example www.example.com
This should autofill the Name of the search engine box.
Don't ask me why it's called "Name of the search engine"; it really should be named something like "Campaign", because you're pulling data off a website, and some sites are themselves search engines (meaning you can do this with Google Images too).
Then select Create.
Step 3:
For this step you need a Google API key, which you can get at https://console.developers.google.com/. It should start with AIza*****
Now find the third option in the list of options, titled Modify your search engine, and select it by pressing the Control Panel button.
Here you can get what's called a CX key, under the "Search Engine ID" button.
It looks like this: 001508811111111110:od11111111uhtc
The CX has information like the search URL embedded into it.
This is the link if you want to do an on-site search:
https://cse.google.com/cse?oe=utf8&ie=utf8&source=uds&q=<SOME_QUERY>&start=0&cx=<YOUR_CX>#gsc.tab=0&gsc.q=<SOME_QUERY>&gsc.page=1
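If you'd rather build that on-site search URL in code than paste it together by hand, here's a minimal sketch (the query and CX values are placeholders you'd substitute yourself):

```javascript
// Build the hosted (on-site) search URL from a query and a CX key.
// Both values are URL-encoded; the arguments shown are placeholders.
function buildCseUrl(query, cx) {
  var q = encodeURIComponent(query);
  return "https://cse.google.com/cse?oe=utf8&ie=utf8&source=uds" +
    "&q=" + q + "&start=0&cx=" + encodeURIComponent(cx) +
    "#gsc.tab=0&gsc.q=" + q + "&gsc.page=1";
}
```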
For a raw response that you can view in the browser, it should look something like this:
https://www.googleapis.com/customsearch/v1element?key=AIza*********&alt=json&callback=angular.callbacks._0&cx=0015**********:odc*******&prettyPrint=true&q=<QUERY>&searchType=image
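Those query-string parameters are easier to manage programmatically than as one long string. Here's a sketch that assembles the same kind of request URL with `URLSearchParams` (the key and cx arguments are placeholders, not real credentials):

```javascript
// Assemble the raw Custom Search request URL from its parameters.
// The key and cx values passed in are placeholders -- use your own.
function buildApiUrl(key, cx, query) {
  var params = new URLSearchParams({
    key: key,
    alt: "json",
    cx: cx,
    prettyPrint: "true",
    q: query,
    searchType: "image"
  });
  return "https://www.googleapis.com/customsearch/v1element?" + params.toString();
}
```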
This is an AngularJS 1 service I wrote a while back for my specific use case. You can tinker with it to make it your own.
My Controller
var _items = [{"Name": "Item Name"}];
googleScrap.queryEachInArray(_items)
  .then(function(data) { console.log('data: ', data); })
  .catch(function(err) { console.log('err: ', err); });
or
googleScrap.query("some string")
  .then(function(data) { console.log('data: ', data); })
  .catch(function(err) { console.log('err: ', err); });
My Service
app.service('googleScrap', function($http, $q, $timeout, ROOT, _) {
  var self = this;

  this.queryEachInArray = function(myArray) {
    var items = [];
    var d = $q.defer();

    // Query a single item and attach any image URLs found in the results.
    // Returning the promise lets the reduce chain below wait for each query.
    function singleValueFunction(value) {
      return self.query(value.Name).then(function(data) {
        var images = _(data.results).map('richSnippet.cseImage.src').sort().value();
        value.images = _.without(images, '', undefined);
        items.push(value);
      });
    }

    if (!Array.isArray(myArray)) {
      d.reject("please pass an array");
      return d.promise;
    }

    // Chain the queries sequentially, one item at a time.
    myArray.reduce(function(p, val) {
      return p.then(function() {
        return singleValueFunction(val);
      });
    }, $q.when()).then(function() {
      d.resolve(items);
    }, function(err) {
      d.reject(err);
    });

    return d.promise;
  };

  this.query = function(query) {
    var url = "https://www.googleapis.com/customsearch/v1element?key=AIza***";
    /* --------------
       manage here "https://cse.google.com/cse"
       --------------
       "imgSize": "small",
       "fileType": "jpg",
       "cx": "0055555555555555:odca4444444",
    */
    var params = {
      "alt": "json",
      "searchType": "image",
      "cx": "0055555555555555:odca4444444",
      "prettyPrint": "true",
      "callback": "JSON_CALLBACK",
      "q": query
    };
    // Return the $http promise directly so errors propagate to the caller.
    return $http({
      url: url,
      method: 'JSONP',
      params: params
    }).then(function(response) {
      return response.data;
    });
  };
});
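The sequential reduce-over-promises pattern in queryEachInArray isn't Angular-specific, by the way. Here's roughly the same idea with native Promises; queryFn stands in for the query call and is a parameter you supply, not part of the original service:

```javascript
// Run queryFn for each item in order, collecting the results.
// Each query starts only after the previous one has finished.
function queryEachSequentially(items, queryFn) {
  var results = [];
  return items.reduce(function(chain, item) {
    return chain.then(function() {
      return queryFn(item).then(function(data) {
        results.push({ name: item.Name, data: data });
      });
    });
  }, Promise.resolve()).then(function() {
    return results;
  });
}
```

Because the accumulator starts as a resolved promise and each step returns the next query's promise, the requests never overlap.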
I'm sure parts of that promise chain could be written better, but I figured I'd share (also notice I'm using lodash as a dependency).
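If you'd rather not pull in lodash just for that one chain, a vanilla-JS equivalent of the image extraction is short (the sample result shape below mirrors the richSnippet.cseImage.src path the service reads):

```javascript
// Vanilla-JS take on the lodash chain in the service: pull each result's
// richSnippet.cseImage.src, drop empty or missing values, and sort.
function extractImageUrls(results) {
  return results
    .map(function(r) {
      return r.richSnippet && r.richSnippet.cseImage && r.richSnippet.cseImage.src;
    })
    .filter(function(src) { return src !== undefined && src !== ''; })
    .sort();
}
```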
Hope this helped get you started on your website scraping!