Author Archives: Team RemotePanda

About Team RemotePanda

RemotePanda is a personalized platform for companies to hire remote talent and get quality work delivered from our city, Pune. All the talent associated with us comes from our close network. When we connect them with you, we manage their quality, their growth, the legalities, and the delivery of work. The idea is to make remote work successful for you.
Build-Operate-Transfer (BOT) Model


 

BOT is the next big thing in offshoring

 

Companies looking to enter international markets are embracing the build-operate-transfer (BOT) model. The BOT model is a form of integrated partnership that combines the design and construction responsibilities of design-build acquisitions with operations and maintenance. It lets you build an offshore team in the ‘build’ phase and put it through a preliminary test during the ‘operate’ phase. Once the team has adjusted to your company’s processes, tools, and methodologies, you can take full control of the team in the ‘transfer’ phase. The BOT model lets you try before you buy: you don’t need to invest much in your offshore unit until you are sure of its worth.

 

Let’s look at the three stages of the BOT model in detail:


 

  • Build: This is the initial phase. It covers everything from defining the qualification requirements for the workforce and setting up the recruitment procedure to approving the reporting process and creating the infrastructure and the core team. This is also the phase where trust and rapport between customer and contractor are formed and the project gets started.

 

  • Operate: This phase is where project management happens. It includes expanding the team as per the customer’s requirements while developing products or projects. Processes are also implemented to bring the team to the desired level of business maturity and technical ability. The ‘operate’ phase also covers assigning lead programmers and team leaders to the unit, and the applicable costs are adjusted on an annual basis.

 

  • Transfer: This is the final phase, where the outsourcing partner transfers project ownership to the client. It is only possible once the client is ready to take control of the project or the contract has ended. This phase includes a transfer of assets and a handover process.

 

How does RemotePanda help you with build-operate-transfer software outsourcing?

 

We help you build a team of top talent in India, from sourcing candidates to interviews, due diligence, and hiring. We then place those employees in an office of your choosing and manage their day-to-day operations. Later, once you feel the team is ready and on par with your SOPs, we transfer control over to you. You now have access to global talent, an office in a different part of the world, easy access to a booming economy and, most of all, your own employees.

 

The BOT model enables you to set up an offshore team in the ‘build’ phase and then take them for a trial run during the ‘operate’ phase. Once you see that the team has well adapted to your company’s processes, methodologies, and tools, you can decide to take full control of the team in the ‘transfer’ phase. This model gives you the choice of trying before buying, meaning you don’t need to make any long term investment in your offshore unit until you are sure of its value.

 

How does the BOT model at RemotePanda work?

 


 

 

The build-operate-transfer model at RemotePanda follows a set of procedures.

 

  • First, we connect with you over a phone call to understand your requirements and pain points. Once the requirements are clear, we start building a team for you.

 

  • This includes screening candidates with due diligence, interviewing them, and hiring the best fit.

 

  • Next, we take responsibility for the agile project management of the team. We also provide technical and business coaching and take care of daily operations.

 

  • Finally, you can transfer the candidates to your payroll. We manage all the legalities and continue to provide the support you need.

 

In short, RemotePanda will set up a dedicated team for your project with the number of people you need. The team is provided with infrastructure, which includes physical space, machines, high-speed internet, an Amazon cloud workspace, secure access, and training. A virtual CTO is also available on an hourly to full-time basis.

 

If requested, RemotePanda will continue to assist with administrative management, accounting, recruitment, or any other services related to your offshore team even after the transfer of ownership.

 

Reasons to choose the BOT model

 


 

 

  • Cost efficiency

Running a business is all about saving money, spending it, and making a profit, and the BOT model often helps cut costs. Since the project team is owned, managed, and operated by an offshore group, it can save as much as 60% of the standard cost. The funds saved can be used to develop the company’s employees and recruit more skilled people.

 

  • No risk in building a team

Every business carries risk, since every country faces issues at some point or another. The BOT model reduces the risk of operating in a different country by diversifying the investment. Since the outsourcing partner already knows the conditions of the country it operates in, it is prepared for problems that could arise while the team is being set up.

 

  • Ability to scale rapidly

Using the build-operate-transfer model, organizations can scale their operations quickly through a wide array of services, which in turn completes the business model. 

 

  • Faster time to market

Having resources in different locations and time zones helps reduce time to market, since the process cycle runs almost 24/7. Because development is nearly uninterrupted, little of the day is wasted.

 

  • Access to booming technologies

Having an experienced team allows companies to introduce the latest technologies. And having access to these technologies can be an added advantage that often engages and retains members in the organization.

 

Conclusion

A BOT model helps companies get the full value out of their outsourcing partners. At the same time, a proper plan keeps business operations and knowledge in-house, as if the offshore team were part of the same company.

Similarly, RemotePanda assists you in such a way that you can focus on developing and enhancing your core business. We will supervise the offshore operations and development, and get things done before the transfer of ownership.

Do you want to build your team in India with RemotePanda?
Politics in Freelance Teams


 

One of the most despised things in a co-located office is the politics, which never seems to cease. People turn to remote work for various reasons, one of them being less politics and more focus on the work, but does this really hold true? Is a freelance team completely free of politics and bias compared to an in-house team? Freelancers are human, after all; they are driven by their own passions, goals, and egos, a mix of positives and negatives.

In this whole charade of politics, what employees don’t understand is that they are disrupting the amazing culture that has been built over the years through tremendous hard work and fun activities. Such a blow to the company culture affects not only the management but also the profitability of the company. We usually think that politics only happens when employees are co-located and that being physically distant might be the trick to avoid it, but let’s see how remote teams feed the fire that can give rise to this devastating phenomenon.

 

1. Cross-Cultural Differences:

 

When I was interviewing CXOs and developers for our #MakeRemoteWork survey report, I met a developer in the US. She had tremendous experience of working remotely and had been through this roller coaster before.

 


 

She told me that one of the reasons it happens is that a team has members from different cultures and backgrounds. People who share similar values get along really well, while those who don’t feel a bit ostracized. There is still bias around race, religion, country, and what not.

These are the things that can break an extremely strong team. What employees need to realize is that they are here to work in synergy and harmony, not to judge people on personal traits and discriminate.

 

2. Less socialization:

 

Slack is the virtual office for all the remote workers. You see less of their faces and more of their text messages. Remote workers are generally so involved in their work that they barely talk about their personal lives on the communication channels and talk more about work.


 

In this process, nobody gets to know what the person on the other side is like. We tend to form a perception of a person based on their work and not on who they are. This is usually one of the reasons remote employees feel less engaged in activities other than work, which leads to differences of opinion within the team. So what’s really important here is COMMUNICATION. On a positive note, remote team members are less emotionally invested than their in-house counterparts, so their efforts to influence internal politics are pretty low.

 

3. Ego Clashes:

 

Shoutouts and all are cool, but it sucks when a kissass employee gets it over a kickass employee.

 


 

It’s true that rewarding an employee not only boosts that employee’s morale but also sets a standard for the others. But when you’re working with a remote team, you should take the entire team into consideration and not just a single member. In the end, everybody depends on everybody else and wants to be appreciated equally. A team is like a Bad Boys movie: we ride together, we die together, bad boys for life.


Now that we know what gives rise to politics in freelance teams, let’s go through some tips to defuse these politics.

 

1. Company culture:

 

It’s the responsibility of the management to enlighten their employees about the culture that has been established in the company.

 

 


 

You shouldn’t let employees learn about the company through water cooler talk and gossip. This won’t do anyone any good; it will only lead to further misunderstanding of how the company works. So it’s the responsibility of the management or the leader to show employees the ropes.

 

2. Communication:

 

Organizations that embrace freelancers have advanced technology to make communication within the team easy and seamless. It’s imperative for team members to use these communication tools not just for work, but to stay in constant touch and build a better team as well as a better company culture.

 

3. Focus on work:

 

It’s no surprise that we are at an organization to work in unison toward personal as well as organizational goals, so one of the best things you can do to stay away from all this politics is to focus on the task at hand.

 


 

At the end of the day, everyone wants to feel satisfied and appreciated for the work they have done, and that will only happen when the focus is less on politics and more on the work.

 

4. Build Trust:

Trust is one of the most important factors behind every team and organization’s success.


 

The organization should plan certain team building exercises which would revolve around making long term or short term strategies and also creating a stronger bond during the entire process. This is a win-win situation for the team as well as the organization.

 

Conclusion

 

Politics within a freelance team is quite different from what you’ll encounter in the office. There’s no backstabbing or the usual office drama, although certain problems persist, like taking credit for somebody else’s work. It eventually comes down to how well a manager can encourage healthy communication within the team and how well the team can foster a feeling of trust and respect.

How to Create a Github Repository from the Command Line


 

Git is a great version control system and GitHub is a superb hosting service for git-based repositories.

GitHub provides a nice web interface for creating (blank) repositories at the start of a project. But why visit github.com just to create a blank repository? Here’s a simple bash function that makes this simple task even simpler.

 

git-create() {
  # Repository name: first argument, else prompt, else current directory name
  repo_name=$1
  dir_name=$(basename "$(pwd)")
  if [ "$repo_name" = "" ]; then
    echo -n "Repo name [$dir_name]?: "
    read -r repo_name
  fi
  if [ "$repo_name" = "" ]; then
    repo_name=$dir_name
  fi
  # Credentials are read from the git configuration
  username=$(git config user.name)
  if [ "$username" = "" ]; then
    echo "Could not find username, run 'git config --global user.name <username>'"
    return 1
  fi
  token=$(git config user.token)
  if [ "$token" = "" ]; then
    echo "Could not find token, run 'git config --global user.token <token>'"
    return 1
  fi
  echo -n "Creating Github repository '$repo_name'..."
  curl -u "$username:$token" https://api.github.com/user/repos -d "{\"name\":\"$repo_name\"}" > /dev/null 2>&1
  echo " Done."
  echo -n "Adding remote..."
  git remote add origin "git@github.com:$username/$repo_name.git"
  echo " Done."
}

 

The script is based on curl and the GitHub API.
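
For readers who would rather see the API call itself than the curl flags, here is a rough Ruby sketch that builds the same request without sending it. It assumes what the script above does: basic authentication with a username and token, and a JSON body holding the repository name, posted to https://api.github.com/user/repos. The helper name build_create_repo_request and the sample values are made up for illustration.

```ruby
require 'json'
require 'net/http'
require 'uri'

# Hypothetical helper: builds (but does not send) the request that the
# curl call in git-create makes.
def build_create_repo_request(username, token, repo_name)
  uri = URI('https://api.github.com/user/repos')
  request = Net::HTTP::Post.new(uri)
  request.basic_auth(username, token)           # counterpart of curl -u user:token
  request['Content-Type'] = 'application/json'
  request.body = JSON.generate(name: repo_name) # counterpart of curl -d '{"name":"..."}'
  request
end

request = build_create_repo_request('octocat', 'example-token', 'demo-repo')
puts request.method # prints POST
puts request.path   # prints /user/repos
puts request.body   # prints {"name":"demo-repo"}
```

Sending it would be a matter of Net::HTTP.start('api.github.com', 443, use_ssl: true) { |http| http.request(request) }, which is left out here so the sketch runs without network access or a real token.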

 

Add this function to your ~/.bash_profile and reload it. Done 🙂
Use git-create to create a repository.

Be sure to configure your GitHub username and access token in the global git config file.

Hint:

git config --global user.name <username>

git config --global user.token <access_token>

 


CSRF and RAILS protect from forgery


 

Cross-site request forgery, also known as a one-click attack or session riding and abbreviated as CSRF or XSRF, is a type of malicious exploit of a website whereby unauthorized commands are transmitted from a user that the website trusts. Unlike Cross-Site Scripting (XSS), which exploits the trust a user has in a particular site, CSRF exploits the trust that a site has in a user’s browser. Let’s take a look at how a CSRF attack unfolds.

 


 

 

  • Step 1: The victim connects to a secure bank website and logs into their account.
  • Step 2: A cookie containing the victim’s session id is set in the victim’s browser.
  • Step 3: The victim is tricked into visiting a malicious page.
  • Step 4: The victim receives an HTML page containing a malicious hidden form.
  • Step 5: A web request is sent from the victim’s browser, carrying the cookie set in Step 2.
  • Step 6: The bank server completes the web request.

 

Now that we know what CSRF is, let’s see how Rails helps prevent it.
Rails follows the MVC architecture, and controller actions are protected from Cross-Site Request Forgery (CSRF) attacks by including a token in the rendered HTML of your application. This token is stored as a random string in the session, to which an attacker does not have access. When a request reaches your application, Rails compares the received token with the token in the session. Only HTML and JavaScript requests are checked, so this will not protect your XML API (presumably you’ll have a different authentication scheme there anyway). Also, GET requests are not protected, as these should be idempotent. The requests are validated using the following piece of code:

 

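def verified_request?
  # A request passes if protection is off, it is a GET/HEAD request, or it
  # carries the session's token as a parameter or in the X-CSRF-Token header.
  !protect_against_forgery? || request.get? || request.head? ||
    form_authenticity_token == params[request_forgery_protection_token] ||
    form_authenticity_token == request.headers['X-CSRF-Token']
end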
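
To see the same check outside of Rails, here is a minimal plain-Ruby sketch. It is not the Rails implementation: the session is a plain hash, the method signatures are invented for the example, and the byte-by-byte comparison is an addition that the snippet above does not have.

```ruby
require 'securerandom'

# A plain hash stands in for the server-side session (an assumption of this sketch).
def form_authenticity_token(session)
  session[:_csrf_token] ||= SecureRandom.base64(32)
end

# Compares two strings without stopping at the first differing byte.
def secure_compare(a, b)
  return false unless a.bytesize == b.bytesize

  diff = 0
  a.bytes.zip(b.bytes) { |x, y| diff |= x ^ y }
  diff.zero?
end

# GET and HEAD pass through; anything else must carry the session's token
# either as a form parameter or in the X-CSRF-Token header.
def verified_request?(session, http_method, params: {}, headers: {})
  return true if %w[GET HEAD].include?(http_method)

  expected = form_authenticity_token(session)
  candidates = [params['authenticity_token'], headers['X-CSRF-Token']].compact
  candidates.any? { |given| secure_compare(given, expected) }
end

session = {}
token = form_authenticity_token(session)
puts verified_request?(session, 'GET')                                                # prints true
puts verified_request?(session, 'POST', params: { 'authenticity_token' => token })    # prints true
puts verified_request?(session, 'POST', params: { 'authenticity_token' => 'forged' }) # prints false
```

The forged request fails because the attacker’s page can make the browser send the session cookie but has no way to read the token stored in the session.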

 

This check is enabled with the protect_from_forgery method, which performs the check and handles unverified requests if the token doesn’t match. It also adds an authenticity_token parameter to all forms that are automatically generated by Rails. It is recommended that this method is added in your ApplicationController; later on, you can skip it in other controllers if it is not required.

With all this in mind, let’s take a look at the Rails source code.

 

# Application code
class ApplicationController < ActionController::Base
  protect_from_forgery
end

# Rails internals
def protect_from_forgery(options = {})
  self.request_forgery_protection_token ||= :authenticity_token
  prepend_before_action :verify_authenticity_token, options
end

def verify_authenticity_token
  unless verified_request?
    logger.warn "Can't verify CSRF token authenticity" if logger
    handle_unverified_request
  end
end

def handle_unverified_request
  reset_session
end

 

From the code, we can see that CSRF protection resets the session and lets the request through when CSRF token verification fails.
This in itself is a CSRF vulnerability, since it allows anyone to log users out by directing their browser to a page that requires CSRF protection.

In a Rails 4 application, the ApplicationController now passes a parameter to protect_from_forgery.

 

# Application code
class ApplicationController < ActionController::Base
  # Prevent CSRF attacks by raising an exception.
  # For APIs, you may want to use :null_session instead.
  protect_from_forgery with: :exception
end

# Rails internals
def protect_from_forgery(options = {})
  self.forgery_protection_strategy = protection_method_class(options[:with] || :null_session)
  self.request_forgery_protection_token ||= :authenticity_token
  prepend_before_action :verify_authenticity_token, options
end

def verify_authenticity_token
  unless verified_request?
    logger.warn "Can't verify CSRF token authenticity" if logger
    handle_unverified_request
  end
end

def handle_unverified_request
  forgery_protection_strategy.new(self).handle_unverified_request
end

 

This raises an exception when an unverified request is encountered. The same behavior can be achieved with Rails 3 by overriding the default handle_unverified_request method.
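
The difference between the two behaviours can be shown with a toy model in plain Ruby. The class names below (ExceptionStrategy, ResetSessionStrategy, ToyController) are invented for this sketch and are not the Rails classes; only the shape of the delegation, strategy.new(self).handle_unverified_request, follows the code above.

```ruby
class InvalidAuthenticityToken < StandardError; end

# Strategy that refuses the request outright (the behaviour asked for by with: :exception).
class ExceptionStrategy
  def initialize(controller)
    @controller = controller
  end

  def handle_unverified_request
    raise InvalidAuthenticityToken, "Can't verify CSRF token authenticity"
  end
end

# Strategy that wipes the session and lets the request continue.
class ResetSessionStrategy
  def initialize(controller)
    @controller = controller
  end

  def handle_unverified_request
    @controller.session.clear
  end
end

# Toy controller: holds a session hash and the strategy class to use.
ToyController = Struct.new(:session, :strategy) do
  def handle_unverified_request
    strategy.new(self).handle_unverified_request
  end
end

lenient = ToyController.new({ user_id: 1 }, ResetSessionStrategy)
lenient.handle_unverified_request
puts lenient.session.empty? # prints true

strict = ToyController.new({ user_id: 1 }, ExceptionStrategy)
begin
  strict.handle_unverified_request
rescue InvalidAuthenticityToken => e
  puts e.message # prints Can't verify CSRF token authenticity
end
puts strict.session.empty? # prints false
```

With the resetting strategy, a forged request silently logs the user out; with the raising strategy, the request is rejected and the session is left untouched.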

 

Conclusion

 

In the scenario above, the banking server failed to verify the validity of the web request and hence executed it without the victim’s knowledge.

 


Clean Validations with Custom Contexts


 

Active Record validations are well-known and widely used in Rails.

 

class User < ApplicationRecord
  
validates :name, presence: { message: "must be given please" }

end

 

This runs the validation on save, both when creating a new record and when updating an existing one.

The on option controls when the validation runs. It is commonly used with a value of create or update.

 

class User < ApplicationRecord
  belongs_to :club, optional: true 
  validates :name, presence: { message: "must be given please" }, on: :create
  validates :club, presence: { message: "must be given please" }, on: :update  
end

 

This allows creating users without associating them with a Club but enforces the presence of Club on subsequent updates. This pattern is commonly used to allow users to signup with bare minimum form fields and then forcing them to update their profiles with more information on subsequent visits.

The value of the on option is not limited to create and update; we can define our own custom contexts. For example, in a multistep form, we can have validations for each step. The on option makes this really easy to do:

 

class User < ApplicationRecord
  validate :basic_info, on: :basic_info
  validate :education_details, on: :education_details
  validate :professional_info, on: :professional_info

  private
  def basic_info
    # Validation for basic info, first_name, last_name, email
  end

  def education_details
    # Validation for education_details
  end

  def professional_info
    # Validation for professional_info
  end
end

 

In the controller

 

class UsersController < ApplicationController
  ...

  def update_basic_info
    @user.assign_attributes(basic_info_params)
    @user.save(context: :basic_info)
  end

  def update_education_details
    @user.assign_attributes(education_details_params)
    @user.save(context: :education_details)
  end

  def update_professional_info
    @user.assign_attributes(professional_info_params)
    @user.save(context: :professional_info)
  end

  private
  def basic_info_params
    # strong params
  end

  def education_details_params
    # strong params
  end

  def professional_info_params
    # strong params
  end
end

 

With Rails 5 adding support for multiple contexts, we can combine several contexts in one call:

 

@user.save(context: [:basic_info, :professional_info])
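
To make the mechanics of a context concrete without pulling in Rails, here is a toy plain-Ruby model. It is not how Active Record is implemented: ToyUser, its CHECKS table, and the company attribute are all invented for the sketch. It only illustrates the idea that each check is tagged with a context and runs only when that context is requested.

```ruby
class ToyUser
  attr_accessor :first_name, :company
  attr_reader :errors

  # context name => checks that run only when that context is requested
  CHECKS = {
    basic_info: [->(user) { 'first_name must be given' if user.first_name.to_s.empty? }],
    professional_info: [->(user) { 'company must be given' if user.company.to_s.empty? }]
  }.freeze

  def initialize
    @errors = []
  end

  # Accepts one context or several, mirroring the idea of combined contexts.
  def valid?(*contexts)
    checks = contexts.flatten.flat_map { |context| CHECKS.fetch(context) }
    @errors = checks.map { |check| check.call(self) }.compact
    @errors.empty?
  end
end

user = ToyUser.new
user.first_name = 'Ada'
puts user.valid?(:basic_info)                     # prints true
puts user.valid?(:basic_info, :professional_info) # prints false
puts user.errors.inspect                          # prints ["company must be given"]
```

The same record is valid for one step of the form and invalid for another, which is exactly what makes contexts a good fit for multistep forms.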

 

This seems pretty neat, so let’s go a step further and do this with update_attributes. In the current implementation of Rails, update_attributes does not support validation contexts. We can get around this by defining our own custom method:

 

class ApplicationRecord < ActiveRecord::Base
  self.abstract_class = true

  # Assigns the attributes and saves inside a transaction, validating
  # only against the given context(s).
  def update_attributes_with_context(attributes, *contexts)
    with_transaction_returning_status do
      assign_attributes(attributes)
      save(context: contexts)
    end
  end
end

 

In the controller

 

@user.update_attributes_with_context({first_name: 'fname'}, :basic_info)

 

Lastly, we can use with_options to group multiple validations within a context

 

with_options on: :member do |member_user|
  member_user.validates :club_name, presence: true
  member_user.validates :membership_id, presence: true
end


d3.js Appealing Visualisations


 

With the ever-increasing amount of data, in terms of both quantity and quality, we need a precise and accurate way to represent it that aids comprehension and facilitates decision making. That’s where d3.js comes to the rescue.

d3 stands for Data-Driven Documents, i.e. web pages that interact with data. The data can be as simple as an array of integers or far more complex.

 

Why choose d3.js?

 

  • it works seamlessly with existing web technologies
  • can manipulate any part of the document object model
  • it is as flexible as the client side web technology stack (HTML, CSS, SVG)
  • takes advantage of built in functionality that the browser has, simplifying the developer’s job, especially for mouse interaction.

 

What d3.js is not?

 

  • it is not a graphics library
  • it is not a data processing library.
  • it doesn’t have pre-built visualizations

 

D3.js is a tool that makes the connection between data and graphics easy. It sits right between the two, the perfect place for a library meant for data visualization.

 

D3.js is a JavaScript library for manipulating documents based on data. D3 helps you bring data to life using HTML, SVG, and CSS. D3’s emphasis on web standards gives you the full capabilities of modern browsers without tying yourself to a proprietary framework, combining powerful visualization components and a data-driven approach to DOM manipulation. ~ d3js.org

 

Show me some code…

Simple bar chart

 

<!DOCTYPE html>
<meta charset="utf-8">
<style>
.chart div {
  font: 10px sans-serif;
  background-color: blue;
  text-align: right;
  padding: 3px;
  margin: 1px;
  color: white;
}
</style>
<div class="chart"></div>
<script src="http://d3js.org/d3.v3.min.js"></script>
<script>
// Sample data to plot
var data_points = [3, 5, 23, 45, 67, 98, 150, 220];
// Linear scale mapping the data's [min, max] onto bar widths of 5 to 420 pixels
var plot_scale = d3.scale.linear()
  .domain(d3.extent(data_points))
  .range([5, 420]);
// Bind the data to divs inside .chart and create one div per data point
d3.select(".chart")
  .selectAll("div")
  .data(data_points)
  .enter()
  .append("div")
  .style("width", function(data_point) { return plot_scale(data_point) + "px"; })
  .text(function(data_point) { return data_point; });
</script>

 

Well… that’s only HTML, and it doesn’t look very nice or professional. We need more power.

 


Temporary Files in Ruby


 

When working with Ruby on Rails applications, there is often a need to create temporary files, for example in file upload services, when generating or processing CSV data, or when uploading data to external services like Amazon.
A very common solution is to create a regular file object and delete it later. Now imagine a scenario where you created a large data file (say, 2GB) for temporary use and forgot to delete it.

 

The Solution… Ruby Tempfile Class

 

Tempfile is a Ruby utility class for managing temporary files. The file is generated with a unique name each time and is deleted automatically when the Tempfile object is garbage collected. This saves you the trouble of having to remove it explicitly.
Since explicitly deleting temporary files is still a good idea, you can do that on a Tempfile object with Tempfile#unlink.
All actions on a File object are also valid on a Tempfile object, so there is no loss of functionality.

Creating and using a temporary file with the Tempfile class (after require 'tempfile'):

 

> tempfile = Tempfile.new(['temp', 'text'])
=> #<Tempfile:/tmp/temp20141109-10110-2pjq04text>
> tempfile.write('sample tempfile.')
=> 16
> tempfile.rewind
=> 0
> tempfile.read
=> "sample tempfile."
> tempfile.close
=> nil
> tempfile.unlink
=> #<Tempfile:>

 

For complete documentation of the class, refer to the Ruby Tempfile documentation.
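
As a complement to the session above, the standard library also offers Tempfile.create, whose block form closes and removes the file as soon as the block ends, so nothing depends on garbage collection or on remembering to call unlink. The sketch below is one possible way to wrap it; the helper name with_scratch_csv and the sample rows are made up for the example.

```ruby
require 'tempfile'

# Hypothetical helper: writes the rows to a scratch file, hands the file to
# the caller's block, and lets Tempfile.create remove it afterwards.
def with_scratch_csv(rows)
  Tempfile.create(['emails', '.csv']) do |file|
    rows.each { |row| file.puts(row) }
    file.flush
    file.rewind
    yield file
  end
end

path = nil
line_count = with_scratch_csv(%w[a@example.com b@example.com]) do |file|
  path = file.path
  file.each_line.count
end

puts line_count        # prints 2
puts File.exist?(path) # prints false
```

The block form fits the 2GB scenario described earlier: the file cannot outlive the code that needed it.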


gocsv.go Simple CSV parsing with GO


 

Recently I was working with CSV files in Ruby. Parsing CSV files in Ruby code is easy, thanks to Ruby/csv.
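
For comparison, this is roughly what the Ruby side looks like with the csv library. The sample data is made up, and the ';' delimiter and '#' comment lines are chosen to match the settings used in the Go reader further down.

```ruby
require 'csv'

# Made-up sample: semicolon-separated fields with one comment line.
sample = <<~DATA
  name;email
  # this line is a comment
  Ada;ada@example.com
DATA

# col_sep sets the field delimiter; skip_lines drops lines matching the pattern.
rows = CSV.parse(sample, col_sep: ';', skip_lines: /\A#/)
rows.each_with_index do |row, index|
  puts "Record #{index + 1}: #{row.inspect}"
end
```

Each record comes back as an array of strings, which is the same shape the Go program prints.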

Let’s try it with golang.

Go seems to be a pretty power-packed language for developers. Go, also commonly referred to as golang, is a programming language initially developed at Google in 2007 by Robert Griesemer, Rob Pike, and Ken Thompson. It is a statically typed language with a syntax loosely derived from that of C, adding garbage collection, type safety, some dynamic-typing capabilities, additional built-in types such as variable-length arrays and key-value maps, and a large standard library. And, coming from a giant like Google, Go has built-in support for concurrency with goroutines, channels, and select.

Let’s get to work now.

 

// Simple CSV reader
package main

import (
	"encoding/csv" // Package csv reads and writes comma-separated values (CSV) files.
	"fmt"          // Package fmt implements formatted I/O with functions analogous to C's printf and scanf.
	"io"           // Package io provides basic interfaces to I/O primitives.
	"os"           // Package os provides a platform-independent interface to operating system functionality.
)

// Ref http://golang.org/pkg/ for more on packages and links to documentation
func main() {
	// Check for command-line argument filename.
	// Ignore additional arguments.
	if len(os.Args) < 2 {
		fmt.Printf("Error: Source file name is required\n")
		fmt.Println("Usage:", os.Args[0], "<filename> \n")
		return
	}
	file, err := os.Open(os.Args[1])
	if err != nil {
		fmt.Println("Error:", err)
		return
	}
	// deferred call to Close() at the end of current function
	defer file.Close()
	// get a new csv reader for reading the file
	reader := csv.NewReader(file)
	// Configure reader options Ref http://golang.org/src/pkg/encoding/csv/reader.go?s=#L81
	reader.Comma = ';'          // field delimiter
	reader.Comment = '#'        // comment character
	reader.FieldsPerRecord = -1 // number of fields per record; negative allows a variable number
	reader.TrimLeadingSpace = true
	lineCount := 1
	for {
		// read just one record, but we could ReadAll() as well
		record, err := reader.Read()
		// end-of-file is reported through err
		if err == io.EOF {
			break
		} else if err != nil {
			// Report the bad record and move on; the reader has already
			// consumed it, so no extra Read() is needed here.
			fmt.Println("Error:", err)
			lineCount += 1
			continue
		}
		// record is a slice of strings Ref http://golang.org/src/pkg/encoding/csv/reader.go?s=#L134
		fmt.Printf("Record %d: %s\n", lineCount, record)
		lineCount += 1
	}
}

 

Here is a sample CSV file for tests.

Output:

 

 


 


How to upload a large CSV efficiently using rails!


 

Active Record is an abstraction layer that facilitates the creation, deletion, and use of ORM objects whose data requires persistent storage in a database. It saves us from having to think too much about SQL-level queries and makes it very easy to work with data. It gives us a super easy interface that lets us do “almost” anything we can do with bare SQL statements. Apart from the basic CRUD operations, Active Record lets us do more complicated database work, like picking a group of records based on criteria, ordering them, joining tables, performing mathematical operations, etc.

The Active Record pattern is liked by most for the reasons mentioned above. But relying on Active Record alone may not be enough when your application scales.

For example, Active Record has historically had no built-in support for bulk insertion (Rails 6 later added insert_all). Of course, gems can do that, but I personally do not like pulling in a gem for the single purpose of bulk inserts. A gem like activerecord-import would work, but I wanted to write code specifically for bulk-importing CSV, as my requirements did not call for the gem’s other features.

Databases like MySQL and Postgres provide native commands to import CSV directly into database tables. Postgres has the “COPY” command for this. But using COPY to read a file on the database server requires superuser access (or an equivalent privileged role), and I did not want to use a superuser for CSV import.

We can import CSV records one at a time using Active Record, but in such cases I prefer raw SQL statements. A bulk INSERT statement is much faster than doing the same task through Active Record. Let me share an example where raw SQL brought the CSV import time down from several hours to a few seconds.

I am using Active Record with Rails in my application.

In my application, users are allowed to import large CSVs from the UI. We require these CSVs to be imported in the foreground, so background jobs are out of the question. The CSV contains only one column: “email”.

Initially, users uploaded CSVs of no more than 6k rows. But last week, a user tried uploading a file containing 842k records (size: 16MB) and received a timeout. Imagine this from the user’s perspective: a timeout with no explanation leaves the user thoroughly confused.

 

The problem – 

 

There were broadly two problems.

I was loading all the data from the CSV into memory, and then iterating over it and creating database records one by one using Active Record. This made the system fall apart as RAM utilisation climbed very high.

require 'csv'

# Reads the entire file into memory before inserting anything
data = CSV.read('emails.csv', headers: true)
data.each do |row|
  Email.create(email: row['email'])
end

As a first improvement, I fixed one of my biggest mistakes: loading all the CSV data into memory.

 

The Solution – 

 

The solution is to load one record at a time, or to read in chunks. Obviously, loading a single CSV record into memory and saving it to the database one by one is very inefficient for the database, because it sends one INSERT query per CSV row. That means 842k queries to the database. Just for kicks, I tried it, and it still took an hour.

# Streams the file one row at a time, but still issues one INSERT per row
CSV.foreach('emails.csv', headers: true) do |row|
  Email.create(email: row['email'])
end

Active Record did not give us a bulk import. This is one of the reasons why the Active Record pattern doesn’t scale very well. So this time we wrote raw SQL queries for the bulk import. A single bulk insert of all 842k records would have finished in no more than 2–3 sec, but it would require building the whole query, with all 842k records, in memory. Hence, to keep memory consumption in check, we decided to insert in batches of 5k: for every 5k emails, we built one bulk INSERT query. As we expected, importing all 842k records this way took 23 sec. (Perhaps a 10k batch would bring it down further; this is something I am yet to try.) Since this time was acceptable to users on the UI, we did not increase the batch size.

# For every batch of 5k records in the 'emails' array below
emails = ['a@b.com', 'c@d.com']
conn = ActiveRecord::Base.connection
# Quote each value so that a stray apostrophe cannot break (or inject into) the query
email_string = emails.map { |email| "(#{conn.quote(email)})" }.join(',')
# 'emails' is the table name
query = "INSERT INTO emails (email) VALUES #{email_string}"
conn.execute(query)
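
The batching described above can be sketched end to end as follows. This is a minimal sketch, not the application's actual code: the helper names `quote` and `each_bulk_insert` are made up for illustration, and `quote` is only a stand-in for `ActiveRecord::Base.connection.quote` so that the example runs without a database. It assumes the CSV has a header row with an `email` column.

```ruby
require 'csv'

BATCH_SIZE = 5_000

# Hypothetical stand-in for ActiveRecord::Base.connection.quote: wraps the
# value in single quotes and doubles any embedded single quote.
def quote(value)
  escaped = value.to_s.gsub("'", "''")
  "'#{escaped}'"
end

# Streams the CSV row by row, groups the rows into batches, and yields one
# multi-row INSERT statement per batch. Only one batch is held in memory.
def each_bulk_insert(csv_path, batch_size: BATCH_SIZE)
  CSV.foreach(csv_path, headers: true).each_slice(batch_size) do |rows|
    values = rows.map { |row| "(#{quote(row['email'])})" }.join(',')
    yield "INSERT INTO emails (email) VALUES #{values}"
  end
end
```

In the Rails application, each yielded statement would be passed to `ActiveRecord::Base.connection.execute`, and the real adapter's `quote` should replace the stand-in, since escaping rules differ between databases.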

Using this method we could import all the emails in the CSV in 23 seconds, but without any validations (such as checks for duplicate or blank emails). To avoid duplicate emails, I imported the emails into a temporary table and then used raw database queries to copy the unique, non-empty records into the actual table.
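
The copy step from the temporary table might look something like the sketch below. The table name `emails_staging` and the helper `copy_unique_emails_sql` are hypothetical, and the exact query used in the application is not shown in this post; this is just one plain-SQL way to keep distinct, non-blank addresses that are not already in the target table.

```ruby
# Builds the statement that copies distinct, non-blank emails from a staging
# table into the real table, skipping addresses that are already present.
# Both table names are placeholders chosen for this example.
def copy_unique_emails_sql(staging: 'emails_staging', target: 'emails')
  <<~SQL
    INSERT INTO #{target} (email)
    SELECT DISTINCT s.email
    FROM #{staging} s
    WHERE s.email IS NOT NULL
      AND TRIM(s.email) <> ''
      AND NOT EXISTS (SELECT 1 FROM #{target} t WHERE t.email = s.email)
  SQL
end
```

The resulting string would be run once, after all batches have been loaded into the staging table, so the deduplication happens inside the database in a single query.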

IMPORTANT: Raw SQL for bulk imports and bulk updates skips Active Record callbacks, so use it carefully and only where the callbacks do not need to run instantly; that work can then be moved to the background. Database SQL queries are fast and efficient. In my personal opinion, we should leverage them wherever we can.

I came across a nice blog post that benchmarks various methods of uploading CSV using Rails. It compares 4 different methods of importing CSV into the database. The first, a basic Active Record method, takes 210 sec to import 100k records. Importing with SQL validations (that post uses the activerecord-import gem for this) brings the time down to 4 sec for the same number of records. Notice that this figure includes validations; importing the file without validations will take less than 4 sec.

 

Conclusion

 

Found this blog interesting? Don’t forget to leave your comments and let us know your suggestions.


Cost-effective and Variable IP address Google Search Crawler


 

The task was to create a highly scalable and cost-effective Google search crawler. The challenges were:

1. Scalability: perform the maximum number of Google searches in the minimum time.

2. IP blocking: Google temporarily blocks an IP address that sends too many queries.

3. Minimal cost.

I will walk step by step through how my solution evolved into a scalable, variable-IP-address crawler with infrastructure costs as low as $0.021 for 5000 searches.

 

Problem

 

Initially, it seemed straightforward. I quickly wrote a Ruby script to perform Google searches for different queries sequentially and used Nokogiri to parse the HTML responses. This worked well as long as the number of searches stayed below approximately 500.

Once the number of Google search queries increased, searching sequentially no longer scaled. Sequential search can be sped up by running the searches in parallel using queue-processing software such as Sidekiq. But I was facing another major problem here.

Google was blocking my IP address after approx. 500 queries.

It was impossible to scale the application with the approach I was following.

Solution

 

This was a Rails application, and it ran on AWS.

To tackle the challenge of getting blocked by Google, I used AWS Elastic IPs. I started running the Google searches in parallel Sidekiq jobs on a single AWS instance, and as soon as Google started blocking the instance’s IP address, I would:

1. Allocate a new elastic IP in my AWS account.

2. Disassociate the current elastic IP from the instance.

3. Associate newly allocated IP address with the instance.

4. Release (deallocate) the previous Elastic IP address.
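
The four steps above can be sketched as a single method. This is an illustrative sketch, not the code from the application: the method name `rotate_elastic_ip` and its parameters are made up here, and the `ec2` argument is assumed to behave like an `Aws::EC2::Client` from the aws-sdk-ec2 gem, whose `allocate_address`, `disassociate_address`, `associate_address`, and `release_address` calls it uses.

```ruby
# Rotates the Elastic IP of one instance, following the four steps above.
# `ec2` is expected to respond like Aws::EC2::Client (aws-sdk-ec2 gem).
def rotate_elastic_ip(ec2, instance_id:, old_allocation_id:, old_association_id:)
  # 1. Allocate a new Elastic IP in the account.
  new_address = ec2.allocate_address(domain: 'vpc')

  # 2. Disassociate the current Elastic IP from the instance.
  ec2.disassociate_address(association_id: old_association_id)

  # 3. Associate the newly allocated address with the instance.
  association = ec2.associate_address(
    instance_id: instance_id,
    allocation_id: new_address.allocation_id
  )

  # 4. Release the previous Elastic IP so it is no longer held by the account.
  ec2.release_address(allocation_id: old_allocation_id)

  # Return the IDs needed for the next rotation.
  {
    allocation_id: new_address.allocation_id,
    association_id: association.association_id,
    public_ip: new_address.public_ip
  }
end
```

Between steps 2 and 3 the instance has no Elastic IP attached, which is why every running job has to pause during a rotation.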

This solved the problem of getting blocked by Google. But one problem persisted: all the Sidekiq jobs would stop and wait while the new Elastic IP address was allocated and associated. There was also still a limit on the number of Sidekiq jobs that could run on one instance, set by its resources.

The most cost-effective, scalable solution I found for these problems is described below. Let’s say there is a certain number of Google searches to be done. I followed these steps:

1. Divide all the search queries into groups of 500 each (because Google blocks an IP after approx. 500 queries).

2. For each group of search queries, create a rake task to run the Google searches for them.

3. Dockerise the application and push the image to Docker Hub.

4. From an AWS instance, start creating micro instances, each of which runs the Google searches for one group of 500 queries. So, I would spawn a t2.micro instance for each group of search queries and pass the queries to it in a user-data script that runs immediately after the instance is launched.

5. Each AWS instance was spawned using HashiCorp Terraform from a prebuilt Amazon Machine Image (AMI), which I created using HashiCorp Packer.

6. A user-data script is a shell script that runs tasks immediately after an AWS instance is launched. In my user-data script, I created a docker-compose file and ran the Docker build with it.

7. Once the Docker container of my application was up and running, the next task in the user-data script was to run the rake task for the Google search of all the search queries passed to the script.

8. After the Google searches for all the search queries were complete, I called an API on the main instance to destroy the current instance.

This is how I spawned an AWS instance for each group of search queries. Each instance was spawned in parallel from a Sidekiq job and, once its searches finished, called the main instance, which destroyed it using terraform destroy.
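
Steps 1 and 4 can be sketched roughly as follows. This is a simplified illustration under stated assumptions, not the original setup: the image name `example/crawler:latest`, the rake task `crawler:search`, and the path `/tmp/queries.txt` are all placeholders, and the script uses a plain `docker run` instead of the docker-compose file described above to keep the example short.

```ruby
GROUP_SIZE = 500 # Google blocked an IP after roughly this many queries

# Splits the full query list into groups that one instance can handle.
def query_groups(queries, size: GROUP_SIZE)
  queries.each_slice(size).to_a
end

# Renders a user-data shell script for one group of queries.
# The image name, file path, and rake task name are placeholders.
def user_data_script(queries, image: 'example/crawler:latest')
  <<~SCRIPT
    #!/bin/bash
    cat > /tmp/queries.txt <<'QUERIES'
    #{queries.join("\n")}
    QUERIES
    docker run --rm -v /tmp/queries.txt:/app/queries.txt #{image} bundle exec rake crawler:search
  SCRIPT
end
```

Each rendered script would then be handed to whatever launches the instance (Terraform, in the setup described above) as that instance's user data.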

 

Cost

 

Each t2.micro instance ran for about 10 min to complete 500 Google searches. A t2.micro instance costs $0.013 per hour, which makes the cost of running 500 Google searches about $0.0021 per instance.

So, if there are 5000 Google searches to be done, 10 instances will be spawned, and the total cost of these searches will be about $0.021.
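
The arithmetic behind these figures can be checked with a few lines of Ruby. The constants come from the numbers quoted above; the method name `crawl_cost` is just for this example. Unrounded, one instance costs $0.013 × 10 / 60 ≈ $0.00217, so the figures in the text are rounded down slightly.

```ruby
HOURLY_RATE = 0.013         # USD per hour for one t2.micro, as quoted above
MINUTES_PER_INSTANCE = 10   # runtime needed for one group of queries
QUERIES_PER_INSTANCE = 500  # group size per instance

# Estimated infrastructure cost in USD for a given number of searches.
def crawl_cost(total_queries)
  instances = (total_queries.to_f / QUERIES_PER_INSTANCE).ceil
  per_instance = HOURLY_RATE * MINUTES_PER_INSTANCE / 60.0
  instances * per_instance
end

puts format('%.4f', crawl_cost(5000)) # prints 0.0217
```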

 
